Introduction to the furl library

furl

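URL handling comes up constantly in everyday projects. Python's standard library does provide urllib (and, in Python 2, urlparse), but these modules are not very convenient to use. This post introduces furl, a small and handy URL-processing library.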

Python's standard urllib and urlparse modules provide a number of URL related functions, but using these functions to perform common URL operations proves tedious. Furl makes parsing and manipulating URLs easy.

Installation

pip install furl

Official documentation

Anatomy of a URL

[scheme:]//[user[:password]@]host[:port][/path][?query][#fragment]

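Example: https://monkeyjerry:jerry@1eq066.coding-pages.com:8080/search;foo=1;bar=2?q=python&monkey=4#test

What each part means:

Component	Description	Default	Example
scheme	Protocol used to access the server and fetch the resource	None	https
user	Username used to access the resource	None (anonymous)	monkeyjerry
password	The user's password, separated from the username by `:`	E-mail address	jerry
host	Hostname or IP address of the server hosting the resource	None	1eq066.coding-pages.com
port	Port the server listens on; each scheme has its own default port (HTTP uses 80)	Scheme-dependent	8080
path	Path of the resource on the server; its meaning depends on the server and the scheme	None	/search
params	Input parameters used by some schemes, given as key-value pairs. Multiple params are separated by `;`, and multiple values within one param by `,`	None	foo=1;bar=2
query	Has no universal format; in HTTP, multiple query items are usually separated by `&`. A `?` separates the query from the rest of the URL	None	q=python&monkey=4
fragment	Name of a piece or part of the resource. The fragment is not sent to the server when the resource is requested; it is used only by the client. A `#` separates the fragment from the rest of the URL	None	test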

Because params are part of the path, this post treats them as belonging to path.
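To see where each component from the table sits in a real URL, the example URL can be split with the standard library. This is only a sketch using urllib.parse, not furl's API: urlsplit keeps params inside path (the convention this post follows), while urlparse separates the params of the last path segment.

```python
from urllib.parse import urlparse, urlsplit

url = ("https://monkeyjerry:jerry@1eq066.coding-pages.com:8080"
       "/search;foo=1;bar=2?q=python&monkey=4#test")

# urlsplit leaves ";foo=1;bar=2" inside the path.
parts = urlsplit(url)
print(parts.scheme, parts.username, parts.password, parts.hostname, parts.port)
print(parts.path, parts.query, parts.fragment)

# urlparse splits the params of the last path segment into their own field.
legacy = urlparse(url)
print(legacy.path, legacy.params)
```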

URL encoding and decoding

Chinaz (站长之家) URL encode/decode tool

Using furl

Extracting the scheme, username, password, host, port, path, query, and fragment
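Percent-encoding can also be tried locally instead of in an online tool. The sketch below uses the standard library's quote and unquote functions; the sample strings are the ones that appear in the furl examples later in this post.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# Path-style encoding: a space becomes %20.
encoded_path = quote("large ish")
# Query-style encoding: a space becomes +.
encoded_query = quote_plus("one piece")
# safe="" forces "/" to be encoded as well.
encoded_reserved = quote('^`<>[]"#/?', safe="")

print(encoded_path, encoded_query, encoded_reserved)
print(unquote(encoded_path), unquote_plus(encoded_query))
```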

from furl import furl

url = "https://monkeyjerry:jerry@1eq066.coding-pages.com/search/index.html;foo=1;bar=2?q=python&monkey=4#test"
f = furl(url)
print(type(f))
print(f.url, type(f.url), f.url==url)
print(f.scheme, f.username, f.password, f.host, f.port, f.path, f.query, f.fragment, sep='\n')
print(type(f.scheme), type(f.username), type(f.password), type(f.host), type(f.port), type(f.path), type(f.query), type(f.fragment))
-------------------------------------
<class 'furl.furl.furl'>
https://monkeyjerry:jerry@1eq066.coding-pages.com/search/index.html;foo=1;bar=2?q=python&monkey=4#test <class 'str'> True
https
monkeyjerry
jerry
1eq066.coding-pages.com
443 # the default port is inferred from the scheme; None if it cannot be inferred
/search/index.html;foo=1;bar=2
q=python&monkey=4
test
<class 'str'> <class 'str'> <class 'str'> <class 'str'> <class 'int'> <class 'furl.furl.Path'> <class 'furl.furl.Query'> <class 'furl.furl.Fragment'>

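Note: path, query, and fragment are not returned as str; they are Path, Query, and Fragment objects.

netloc and origin

netloc (network location) returns a string made up of the username, password, host, and port (the port is omitted if it is None or the scheme's default port).
origin returns a string made up of the scheme, host, and port (the port is omitted if it is None or the scheme's default port).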

url = "https://monkey:jerry@google.com"
url1 = "https://monkey:jerry@google.com:99"
f = furl(url)
f1 = furl(url1)
print(f.netloc)
print(f1.origin)
--------------
monkey:jerry@google.com
https://google.com:99
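The port rule for origin can be sketched with the standard library. The helper origin_of and the DEFAULT_PORTS table below are hypothetical names made up for this illustration, not part of furl, and furl keeps its own list of default ports.

```python
from urllib.parse import urlsplit

# Assumed default-port table for this sketch only.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def origin_of(url):
    """Build scheme://host[:port], omitting a missing or default port."""
    parts = urlsplit(url)
    origin = f"{parts.scheme}://{parts.hostname}"
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(parts.scheme):
        origin += f":{parts.port}"
    return origin


print(origin_of("https://monkey:jerry@google.com"))
print(origin_of("https://monkey:jerry@google.com:99"))
print(origin_of("https://monkey:jerry@google.com:443"))
```

Note that urlsplit(...).netloc, unlike furl's netloc as described above, returns the raw text and so keeps an explicitly written default port.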

path

path.segments returns a list whose elements are percent-decoded strings (if the URL was encoded).

url = "http://www.google.com/a/large%20ish/path"
url1 = "http://www.google.com/a/large%20ish/path/"  # note the trailing slash that url lacks
f = furl(url)
f1 = furl(url1)
print(str(f.path))  # stays percent-encoded
print(f.path.segments, f1.path.segments)  # segments are decoded
f.path.segments = ['o', 'hi', 'there', 'with some encoding', '^`<>[]"#/?', '']  # note the empty string as the last element
print(f.path)  # percent-encoded on output
-------------------------
/a/large%20ish/path
['a', 'large ish', 'path'] ['a', 'large ish', 'path', '']
/o/hi/there/with%20some%20encoding/%5E%60%3C%3E%5B%5D%22%23%2F%3F/

path.isdir and path.isfile
- If the path ends with /, it is treated as a directory; otherwise it is treated as a file.

# Continuing from the example above
f = furl('http://www.google.com/a/directory/')
print(f.path.isdir, f.path.isfile)
f = furl('http://www.google.com/a/file')
print(f.path.isdir, f.path.isfile)
------------------------------
True False
False True

path.isabsolute

Whether the path is absolute, i.e. whether it starts with /; the value is True or False.
If a netloc is present, the path must be absolute.

f = furl('/url/path')
print(f.path.isabsolute)
f.path.isabsolute = False  # without a netloc, isabsolute can be set to False directly
print(f.url)
f.host = 'blaps.ru'  # a netloc now exists
f.scheme = 'https'
print(f.url)
print(f.path.isabsolute)
# f.path.isabsolute = False # AttributeError: Path.isabsolute is True and read-only for URLs with a netloc (a username, password, host, and/or port). A URL path must start with a '/' to separate itself from a netloc.

path.normalize(): normalizes the path in place and returns the Path object

f = furl('http://www.google.com////a/./b/lolsup/../c/')
p = f.path.normalize()
print(p)  # normalize() returns the Path object
print(f.url)
-----------------------------
/a/b/c/
http://www.google.com/a/b/c/
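Roughly the same cleanup can be reproduced with posixpath.normpath from the standard library. This is a sketch of the idea only: normalize_path is a hypothetical helper, and furl's own implementation may differ in edge cases. normpath drops a trailing slash, so the helper puts it back.

```python
import posixpath


def normalize_path(path):
    """Collapse repeated slashes and resolve '.' and '..' segments."""
    normalized = posixpath.normpath(path)
    # normpath strips the trailing slash that marks a directory; restore it.
    if path.endswith("/") and not normalized.endswith("/"):
        normalized += "/"
    return normalized


print(normalize_path("////a/./b/lolsup/../c/"))
```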

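The / operator

This works like path joining with pathlib.Path, covered in an earlier post.
The forms used below are Path / Path and Path / string, where Path is furl's Path object.
string / string does not work.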

f = furl('path')
f1 = furl('path1')
f.path /= 'with'
f.path = f.path / f1.path
f.path = f.path / 'more' / 'path segments/'
print(f.path)
------------
path/with/path1/more/path%20segments/
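For readers who have not seen the pathlib version, the same joining pattern looks like this with PurePosixPath. It is shown only as an analogy: unlike furl's Path, pathlib does not percent-encode segments and does not keep a trailing slash.

```python
from pathlib import PurePosixPath

p = PurePosixPath("path")
p /= "with"                       # Path /= string
p = p / PurePosixPath("path1")    # Path / Path
p = p / "more" / "path segments"  # Path / string, chained
print(p)

# Two plain strings cannot be joined with "/".
left, right = "path", "with"
str_div_failed = False
try:
    left / right
except TypeError:
    str_div_failed = True
print(str_div_failed)
```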

path.asdict(): returns a dict describing the path

f = furl('http://www.google.com/some/enc%20oding')
print(f.path.asdict())
-----------------------
{'encoded': '/some/enc%20oding', 'isdir': False, 'isfile': True, 'segments': ['some', 'enc oding'], 'isabsolute': True}

query

query.params: returns furl's omdict1D object, with keys and values percent-decoded

f.args: the same object as query.params

f = furl('http://www.google.com/?one%20piece=1&two=2&two=22')  # percent-encoded input
print(f.query, type(f.query), str(f.query))  # not decoded
print(f.query.params, type(f.query.params))  # decoded
print(f.args, type(f.args), f.args is f.query.params)  # decoded
# ------------ dict-style operations -----------------#
print(f.query.params['two'])
print(f.query.params.getlist('two'))
f.args['three'] = 3
f.args.addlist('four', [4, 44])
print(f.args)
f.args.popvalue('one piece')
print(f.args)
f.args.popvalue('four', 44)
print(f.args)
f.add(args={'params': ['a', 'b']})
print(f.args)
----------------------------------------
one+piece=1&two=2&two=22 <class 'furl.furl.Query'> one+piece=1&two=2&two=22
{'one piece': '1', 'two': '2', 'two': '22'} <class 'furl.omdict1D.omdict1D'>
{'one piece': '1', 'two': '2', 'two': '22'} <class 'furl.omdict1D.omdict1D'> True
2
['2', '22']
{'one piece': '1', 'two': '2', 'two': '22', 'three': 3, 'four': 4, 'four': 44}
{'two': '2', 'two': '22', 'three': 3, 'four': 4, 'four': 44}
{'two': '2', 'two': '22', 'three': 3, 'four': 4}
{'two': '2', 'two': '22', 'three': 3, 'four': 4, 'params': 'a', 'params': 'b'}

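f.query.params returns an omdict1D (ordered multivalue dictionary), which supports the usual Python dict operations plus multi-value methods such as getlist, addlist, and popvalue.

Building ?param= and ?param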
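The reason an ordered multi-value dictionary is needed becomes clearer when the same query string is parsed with the standard library. In this sketch (urllib.parse only, not furl), parse_qsl keeps order and repeated keys as a list of pairs, while parse_qs groups the values per key.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

query = "one%20piece=1&two=2&two=22"

pairs = parse_qsl(query)   # ordered (key, value) pairs, repeats preserved
grouped = parse_qs(query)  # dict mapping each key to a list of values
print(pairs)
print(grouped)

# Appending a pair and re-encoding gives back a query string.
pairs.append(("three", "3"))
rebuilt = urlencode(pairs)
print(rebuilt)
```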

f = furl('http://sprop.su')
f.args['param'] = ''
print(f.url)
f = furl('http://sprop.su')
f.args['param'] = None
print(f.url)
---------------------
http://sprop.su?param=
http://sprop.su?param
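The distinction between ?param= and ?param is one that the standard library does not preserve, which is part of what makes furl's None handling useful. A small sketch with urllib.parse shows both forms collapsing to the same parsed value.

```python
from urllib.parse import parse_qsl, urlencode

with_equals = parse_qsl("param=", keep_blank_values=True)
without_equals = parse_qsl("param", keep_blank_values=True)
dropped = parse_qsl("param")  # blank values are discarded by default

print(with_equals, without_equals, dropped)

# Re-encoding always writes the "=" form, so "?param" cannot be round-tripped.
print(urlencode(without_equals))
```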

query.asdict(): returns a dict describing the query

fragment

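The fragment is separated from the rest of the URL by #.
A fragment can itself be split into a path and a query, separated by an optional ?.

When f.fragment.separator is False, the fragment's path and query are not separated by ?.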

f = furl('http://www.google.com/#/fragment/path?with=params')
print(f.fragment)
print(f.query)
print(f.fragment.separator, f.fragment.path, f.fragment.query)
f = furl('http://www.google.com/')
f.fragment.path = '/path'
f.fragment.args = {'a': 'dict', 'of': 'args'}
print(f.fragment.separator)
print(f.fragment)
print(f.url)
f.fragment.separator = False
print(f.fragment)
print(f.url)
------------------------------------------------
/fragment/path?with=params

True /fragment/path with=params
True
/path?a=dict&of=args
http://www.google.com/#/path?a=dict&of=args
/patha=dict&of=args
http://www.google.com/#/patha=dict&of=args
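The idea that a fragment carries its own path and query can be sketched without furl. The standard library treats the fragment as one opaque string, so the split on ? has to be done by hand; the variable names here are made up for illustration.

```python
from urllib.parse import parse_qsl, urlsplit

parts = urlsplit("http://www.google.com/#/fragment/path?with=params")

# Everything after "#" is the fragment; the URL's real query is empty.
print(repr(parts.query), parts.fragment)

# Split the fragment into its own path and query at the first "?".
frag_path, separator, frag_query = parts.fragment.partition("?")
frag_args = parse_qsl(frag_query)
print(frag_path, frag_args)
```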

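Because the fragment's path and query are Path and Query objects, all of the path and query operations described above apply to them as well.

fragment.asdict(): returns a dict describing the fragment

encoding

See the official API documentation.

Method chaining

furl's add(), set(), and remove() each return the furl object itself, so calls can be chained.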

url = 'http://www.google.com/#fragment'
print(furl(url).add(args={'example': 'arg'}).set(port=99).remove(fragment=True).url)

url = "ftp://www.google.com/"
f = furl(url)
print(f.url)
f.add(path='/search', fragment_path='frag/path', fragment_args={'frag': 'arg'})
print(f.url)
f.set(scheme='https', host='secure.google.com', port=99, path='a/path/', args={'some': 'args'}, fragment='great job')
print(f.url)
f.remove(args=['some'], path='path/', fragment=True, port=True)
print(f.url)
----------------------------------------------
http://www.google.com:99/?example=arg
ftp://www.google.com/
ftp://www.google.com/search#frag/path?frag=arg
https://secure.google.com:99/a/path/?some=args#great%20job
https://secure.google.com/a/

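For more on the parameters and usage of add(), set(), and remove(), see the source code or the official documentation.

furl.join()

Joins the current URL with a relative or absolute URL string.
join() modifies the furl object in place and returns it, so it can also be chained. This is why each call in the example builds on the result of the previous one.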

f = furl('http://www.google.com')
print(f.join('new/path').url)
print(f.join('replaced').url)
print(f.join('../parent').url)
print(f.join('path?query=yes#fragment').url)
print(f.join('unknown://www.yahoo.com/new/url/').url)
----------------------------------------------
http://www.google.com/new/path
http://www.google.com/new/replaced
http://www.google.com/parent
http://www.google.com/path?query=yes#fragment
unknown://www.yahoo.com/new/url/
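The sequence of results above matches what you get by applying the standard library's urljoin repeatedly, each time to the previous result. The loop below is a sketch of that reading, assuming join resolves references the way urljoin does; it does not use furl itself.

```python
from urllib.parse import urljoin

references = [
    "new/path",
    "replaced",
    "../parent",
    "path?query=yes#fragment",
    "unknown://www.yahoo.com/new/url/",
]

current = "http://www.google.com"
history = []
for ref in references:
    # Each reference is resolved against the result of the previous step.
    current = urljoin(current, ref)
    history.append(current)
    print(current)
```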

Simple URL joining

If the joining a project needs is not complicated, there is no need to pull in a third-party library; urljoin from urllib.parse is enough.

from urllib import parse

url = 'https://baidu.com'
query_str = 's?query=monkeyjerry'
print(url + query_str)  # plain string concatenation does not produce the intended URL

url = parse.urljoin(url, query_str)
print(url)
--------------
https://baidu.coms?query=monkeyjerry
https://baidu.com/s?query=monkeyjerry
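One thing to watch with urljoin is the trailing slash on the base URL, since it decides whether the last segment is replaced or extended. The paths in this sketch are made up for illustration.

```python
from urllib.parse import urljoin

# No trailing slash: the last segment "b" is replaced.
replaced = urljoin("https://baidu.com/a/b", "c")
# Trailing slash: "c" is appended under "b/".
appended = urljoin("https://baidu.com/a/b/", "c")
# A leading slash on the reference restarts from the site root.
from_root = urljoin("https://baidu.com/a/b/", "/c")

print(replaced, appended, from_root, sep="\n")
```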

References

furl official API documentation

小猴子jerry

用python库furl来处理url (Handling URLs with the Python library furl)