An introduction to the furl library

furl


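URL handling comes up constantly in everyday projects. Python's standard library provides urllib and urlparse (urllib.parse in Python 3), but they are not very convenient to use. This post introduces furl, a small and handy URL manipulation library.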

Python's standard urllib and urlparse modules provide a number of URL related functions, but using these functions to perform common URL operations proves tedious. Furl makes parsing and manipulating URLs easy.

Installation

pip install furl

Official documentation


Anatomy of a URL


[scheme:]//[user[:password]@]host[:port][/path][?query][#fragment]

Example: https://monkeyjerry:jerry@1eq066.coding-pages.com:8080/search;foo=1;bar=2?q=python&monkey=4#test

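Each component explained:

Component | Description | Default | Example
scheme | Protocol used to reach the server and fetch the resource | - | https
user | Username used to access the resource | none (anonymous) | monkeyjerry
password | The user's password, separated from the username by : | E-mail | jerry
host | Hostname or IP address of the server hosting the resource | - | 1eq066.coding-pages.com
port | Port the server listens on; each scheme has its own default (HTTP uses 80) | depends on the scheme | 8080
path | Path of the resource on the server; its form depends on the server and the scheme | - | /search
params | Input parameters used by some schemes, given as key-value pairs; multiple params are separated by ; and multiple values within one param by , | - | foo=1;bar=2
query | No universal format; HTTP mostly uses & to separate multiple queries. A ? separates the query from the rest of the URL | - | q=python&monkey=4
fragment | Name of a piece or part of the resource; it is not sent to the server when the object is requested and is used only by the client. A # separates the fragment from the rest of the URL | - | test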

Because params are part of the path, we treat them as belonging to the path.
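
For comparison, the standard library can already split a URL into these components. The sketch below runs urllib.parse.urlsplit on the example URL from this section; the `components` dictionary is only an illustrative name. Note that urlsplit leaves the ;params inside the path, which matches the convention adopted here.

```python
from urllib.parse import urlsplit

url = ("https://monkeyjerry:jerry@1eq066.coding-pages.com:8080"
       "/search;foo=1;bar=2?q=python&monkey=4#test")
parts = urlsplit(url)

# Collect each component under the name used in the table above.
components = {
    "scheme": parts.scheme,
    "user": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,
    "path": parts.path,  # the ;params stay attached to the path
    "query": parts.query,
    "fragment": parts.fragment,
}
for name, value in components.items():
    print(f"{name:9} {value}")
```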

URL encoding and decoding

Using furl


Extracting the scheme, username, password, host, port, path, query, and fragment

from furl import furl

url = "https://monkeyjerry:jerry@1eq066.coding-pages.com/search/index.html;foo=1;bar=2?q=python&monkey=4#test"
f = furl(url)
print(type(f))
print(f.url, type(f.url), f.url==url)
print(f.scheme, f.username, f.password, f.host, f.port, f.path, f.query, f.fragment, sep='\n')
print(type(f.scheme), type(f.username), type(f.password), type(f.host), type(f.port), type(f.path), type(f.query), type(f.fragment))
-------------------------------------
<class 'furl.furl.furl'>
https://monkeyjerry:jerry@1eq066.coding-pages.com/search/index.html;foo=1;bar=2?q=python&monkey=4#test <class 'str'> True
https
monkeyjerry
jerry
1eq066.coding-pages.com
443 # default port inferred from the scheme; None if it cannot be inferred
/search/index.html;foo=1;bar=2
q=python&monkey=4
test
<class 'str'> <class 'str'> <class 'str'> <class 'str'> <class 'int'> <class 'furl.furl.Path'> <class 'furl.furl.Query'> <class 'furl.furl.Fragment'>

Note: path, query, and fragment are returned as Path, Query, and Fragment objects, not as str.

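netloc and origin

  • netloc (network location) returns a string made up of the username, password, host, and port (the port is omitted when it is None or the scheme's default port)
  • origin returns a string made up of the scheme, host, and port (the port is omitted when it is None or the scheme's default port)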
url = "https://monkey:jerry@google.com"
url1 = "https://monkey:jerry@google.com:99"
f = furl(url)
f1 = furl(url1)
print(f.netloc)
print(f1.origin)
--------------
monkey:jerry@google.com
https://google.com:99
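
The port rule described above can be sketched in plain Python. This illustrates the rule only and is not furl's actual implementation; DEFAULT_PORTS, port_suffix, build_netloc, and build_origin are hypothetical names, and the port table is a deliberately small assumption.

```python
# Hypothetical, deliberately small table of default ports (an assumption).
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def port_suffix(scheme, port):
    """Return ':port', or '' when the port is None or the scheme default."""
    if port is None or port == DEFAULT_PORTS.get(scheme):
        return ""
    return f":{port}"


def build_netloc(scheme, host, port=None, username=None, password=None):
    """Compose username, password, host, and port into one string."""
    userinfo = ""
    if username is not None:
        userinfo = username
        if password is not None:
            userinfo += f":{password}"
        userinfo += "@"
    return f"{userinfo}{host}{port_suffix(scheme, port)}"


def build_origin(scheme, host, port=None):
    """Compose scheme, host, and port into one string (no credentials)."""
    return f"{scheme}://{host}{port_suffix(scheme, port)}"


print(build_netloc("https", "google.com", 443, "monkey", "jerry"))
print(build_origin("https", "google.com", 99))
```

The first call drops the port because 443 is the https default; the second keeps :99 because it is not.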

path

  • path.segments returns a list whose elements are percent-decoded strings (if the URL was encoded)
url = "http://www.google.com/a/large%20ish/path"
url1 = "http://www.google.com/a/large%20ish/path/"  # note the difference between url and url1: the trailing slash
f = furl(url)
f1 = furl(url1)
print(str(f.path))  # not decoded
print(f.path.segments, f1.path.segments)  # decoded
f.path.segments = ['o', 'hi', 'there', 'with some encoding', '^`<>[]"#/?', '']  # note the final element is an empty string
print(f.path)  # percent-encoded
-------------------------
/a/large%20ish/path
['a', 'large ish', 'path'] ['a', 'large ish', 'path', '']
/o/hi/there/with%20some%20encoding/%5E%60%3C%3E%5B%5D%22%23%2F%3F/
  • path.isdir and path.isfile
    • A path that ends with / is treated as a directory; otherwise it is treated as a file
# building on the example above
f = furl('http://www.google.com/a/directory/')
print(f.path.isdir, f.path.isfile)
f = furl('http://www.google.com/a/file')
print(f.path.isdir, f.path.isfile)
------------------------------
True False
False True
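  • path.isabsolute
    • Whether the path is absolute (that is, whether it starts with /); returns True or False
    • If a netloc is present, the path must be absolute
    f = furl('/url/path')
    print(f.path.isabsolute)
    f.path.isabsolute = False  # with no netloc, isabsolute can be set to False directly
    print(f.url)
    f.host = 'blaps.ru'  # a netloc is now present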
    f.scheme = 'https'
    print(f.url)
    print(f.path.isabsolute)
    # f.path.isabsolute = False # AttributeError: Path.isabsolute is True and read-only for URLs with a netloc (a username, password, host, and/or port). A URL path must start with a '/' to separate itself from a netloc.
    
  • path.normalize() normalizes the path and returns the Path object
f = furl('http://www.google.com////a/./b/lolsup/../c/')
p = f.path.normalize()
print(p)  # the returned Path object
print(f.url)
-----------------------------
/a/b/c/
http://www.google.com/a/b/c/
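  • The / operator
    • Similar to path joining with pathlib.Path, covered in an earlier post
    • Supported forms: furl Path object / furl Path object, furl Path object / string, or string / furl Path object
    • string / string is not supported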
    f = furl('path')
    f1 = furl('path1')
    f.path /= 'with'
    f.path = f.path / f1.path
    f.path = f.path / 'more' / 'path segments/'
    print(f.path)
    ------------
    path/with/path1/more/path%20segments/
    
  • path.asdict(): returns a dictionary describing the path
    f = furl('http://www.google.com/some/enc%20oding')
    print(f.path.asdict())
    -----------------------
    {'encoded': '/some/enc%20oding', 'isdir': False, 'isfile': True, 'segments': ['some', 'enc oding'], 'isabsolute': True}
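
To make the percent-encoding of segments concrete, here is a minimal standard-library sketch of the same idea. It is not how furl is implemented; decode_segments and encode_segments are hypothetical helpers, and the exact set of characters furl leaves unescaped may differ from quote(..., safe="").

```python
from urllib.parse import quote, unquote


def decode_segments(path):
    """Split an encoded, absolute path on '/' and percent-decode each segment."""
    return [unquote(segment) for segment in path.split("/")[1:]]


def encode_segments(segments):
    """Percent-encode each segment and join them into an absolute path."""
    # safe="" also escapes '/' inside a segment so it cannot split the path.
    return "/" + "/".join(quote(segment, safe="") for segment in segments)


print(decode_segments("/a/large%20ish/path"))
print(decode_segments("/a/large%20ish/path/"))  # trailing '' marks a directory
print(encode_segments(["o", "hi", "there", "with some encoding", ""]))
```

The trailing empty string plays the same role as in path.segments: it is what produces the final / of a directory path.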
    

query

  • query.params: returns furl's omdict1D object, with keys and values percent-decoded
  • f.args: the same object as query.params
    f = furl('http://www.google.com/?one%20piece=1&two=2&two=22')  # percent-encoded
    print(f.query, type(f.query), str(f.query))  # not decoded
    print(f.query.params, type(f.query.params))  # decoded
    print(f.args, type(f.args), f.args is f.query.params)  # decoded
    # ------------ dictionary operations -----------------#
    print(f.query.params['two'])
    print(f.query.params.getlist('two'))
    f.args['three'] = 3
    f.args.addlist('four', [4, 44])
    print(f.args)
    f.args.popvalue('one piece')
    print(f.args)
    f.args.popvalue('four', 44)
    print(f.args)
    f.add(args={'params': ['a', 'b']})
    print(f.args)
    ----------------------------------------
    one+piece=1&two=2&two=22 <class 'furl.furl.Query'> one+piece=1&two=2&two=22
    {'one piece': '1', 'two': '2', 'two': '22'} <class 'furl.omdict1D.omdict1D'>
    {'one piece': '1', 'two': '2', 'two': '22'} <class 'furl.omdict1D.omdict1D'> True
    2
    ['2', '22']
    {'one piece': '1', 'two': '2', 'two': '22', 'three': 3, 'four': 4, 'four': 44}
    {'two': '2', 'two': '22', 'three': 3, 'four': 4, 'four': 44}
    {'two': '2', 'two': '22', 'three': 3, 'four': 4}
    {'two': '2', 'two': '22', 'three': 3, 'four': 4, 'params': 'a', 'params': 'b'}
    

    f.query.params returns an omdict1D (ordered multivalue dictionary), which supports many of the usual Python dict operations

  • Building the ?param= and ?param forms
    f = furl('http://sprop.su')
    f.args['param'] = ''
    print(f.url)
    f = furl('http://sprop.su')
    f.args['param'] = None
    print(f.url)
    ---------------------
    http://sprop.su?param=
    http://sprop.su?param
    
  • query.asdict(): returns a dictionary describing the query
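
For comparison, an ordered multivalue query can also be handled with the standard library as a list of (key, value) pairs. The sketch below is only that comparison, not furl's implementation; the variable names are illustrative.

```python
from urllib.parse import parse_qsl, urlencode

raw_query = "one%20piece=1&two=2&two=22"

# parse_qsl keeps the original order and repeated keys as (key, value) pairs.
pairs = parse_qsl(raw_query, keep_blank_values=True)
print(pairs)

# All values stored under one key, similar in spirit to getlist('two').
two_values = [value for key, value in pairs if key == "two"]
print(two_values)

# Append a parameter, then serialize; spaces are encoded as '+'.
pairs.append(("three", "3"))
encoded = urlencode(pairs)
print(encoded)
```

Every lookup, removal, or update has to be written by hand against the list, which is the tedium that f.args hides.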

fragment

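  • The fragment is separated from the rest of the URL by #
  • A fragment can consist of a path and a query, separated by an optional ?
  • When f.fragment.separator is False, the fragment's path and query are not separated by ?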
    f = furl('http://www.google.com/#/fragment/path?with=params')
    print(f.fragment)
    print(f.query)
    print(f.fragment.separator, f.fragment.path, f.fragment.query)
    f = furl('http://www.google.com/')
    f.fragment.path = '/path'
    f.fragment.args = {'a': 'dict', 'of': 'args'}
    print(f.fragment.separator)
    print(f.fragment)
    print(f.url)
    f.fragment.separator = False
    print(f.fragment)
    print(f.url)
    ------------------------------------------------
    /fragment/path?with=params
    
    True /fragment/path with=params
    True
    /path?a=dict&of=args
    http://www.google.com/#/path?a=dict&of=args
    /patha=dict&of=args
    http://www.google.com/#/patha=dict&of=args
    

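    Because the fragment's path and query are Path and Query objects, the path and query operations described above apply to them as well

  • fragment.asdict(): returns a dictionary describing the fragment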
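
The standard library treats the fragment as one opaque string, so the path/query split inside it has to be done by hand. A minimal sketch using urlsplit and str.partition; the variable names are illustrative.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.google.com/#/fragment/path?with=params")

# Everything after '#' belongs to the fragment, including the '?'.
print(repr(parts.query))
print(parts.fragment)

# Split the fragment itself into a path part and a query part.
fragment_path, separator, fragment_query = parts.fragment.partition("?")
print(fragment_path, fragment_query)
```

This is why f.query is empty for this URL: the ? appears after the #, so it belongs to the fragment rather than to the URL's query.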

encoding

Method chaining

  • furl's add(), set(), and remove() return the furl object, so calls can be chained
    url = 'http://www.google.com/#fragment'
    print(furl(url).add(args={'example': 'arg'}).set(port=99).remove(fragment=True).url)
    
    url = "ftp://www.google.com/"
    f = furl(url)
    print(f.url)
    f.add(path='/search', fragment_path='frag/path', fragment_args={'frag': 'arg'})
    print(f.url)
    f.set(scheme='https', host='secure.google.com', port=99, path='a/path/', args={'some': 'args'}, fragment='great job')
    print(f.url)
    f.remove(args=['some'], path='path/', fragment=True, port=True)
    print(f.url)
    ----------------------------------------------
    ftp://www.google.com/
    ftp://www.google.com/search#frag/path?frag=arg
    https://secure.google.com:99/a/path/?some=args#great%20job
    https://secure.google.com/a/
    

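    For more on the parameters and usage of add(), set(), and remove(), see the source code or the official documentation

  • furl.join()
    • Joins the URL with a relative or absolute URL string
    • Modifies the furl object in place and returns it, so further calls can be chained (in the example below, join('replaced') resolves against the result of the previous join)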
    f = furl('http://www.google.com')
    print(f.join('new/path').url)
    print(f.join('replaced').url)
    print(f.join('../parent').url)
    print(f.join('path?query=yes#fragment').url)
    print(f.join('unknown://www.yahoo.com/new/url/').url)
    ----------------------------------------------
    http://www.google.com/new/path
    http://www.google.com/new/replaced
    http://www.google.com/parent
    http://www.google.com/path?query=yes#fragment
    unknown://www.yahoo.com/new/url/
    

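Simple URL joining


If the joining your project needs is not complicated, there is no need to pull in a third-party library; urljoin from urllib.parse is enough.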

from urllib import parse

url = 'https://baidu.com'
query_str = 's?query=monkeyjerry'
print(url + query_str)  # plain string concatenation does not produce the intended URL

url = parse.urljoin(url, query_str)
print(url)
--------------
https://baidu.coms?query=monkeyjerry
https://baidu.com/s?query=monkeyjerry
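
One detail worth knowing when relying on urljoin is that the result depends on whether the base ends with a slash and whether the reference starts with one. A short sketch; the /api/v1 and users paths are made up for illustration.

```python
from urllib.parse import urljoin

# Without a trailing slash, the last segment of the base is replaced.
replaced = urljoin('https://baidu.com/api/v1', 'users')
print(replaced)

# With a trailing slash, the reference is appended under the base path.
appended = urljoin('https://baidu.com/api/v1/', 'users')
print(appended)

# A reference starting with '/' replaces the whole base path.
rooted = urljoin('https://baidu.com/api/v1/', '/users')
print(rooted)
```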
