url
Introduction to the furl library
furl
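URL handling comes up constantly in everyday projects. Python's standard library provides urllib and urlparse (urllib.parse in Python 3), but they are not very convenient to use. This post introduces furl, a small and handy URL manipulation library.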
Python's standard urllib and urlparse modules provide a number of URL related functions, but using these functions to perform common URL operations proves tedious. Furl makes parsing and manipulating URLs easy.
Installation
pip install furl
Official documentation
Anatomy of a URL
[scheme:]//[user[:password]@]host[:port][/path][?query][#fragment]
Example: https://monkeyjerry:jerry@1eq066.coding-pages.com:8080/search;foo=1;bar=2?q=python&monkey=4#test
What each part means:
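Component | Description | Default | Example |
---|---|---|---|
scheme | Protocol used to reach the server and fetch the resource | none | https |
user | Username used to access the resource | none (anonymous) | monkeyjerry |
password | The user's password, separated from the username by : | | jerry |
host | Hostname or IP address of the server hosting the resource | none | 1eq066.coding-pages.com |
port | Port the server listens on; each scheme has its own default port (HTTP uses 80) | depends on the scheme | 8080 |
path | Path of the resource on the server; its form depends on the server and the scheme | none | /search |
params | Input parameters used by some schemes, given as key-value pairs. Multiple params are separated by ; and multiple values within one param by , | none | foo=1;bar=2 |
query | Has no common format; in HTTP, multiple queries are usually separated by &. A ? separates the query from the rest of the URL | none | q=python&monkey=4 |
fragment | Name of a piece or part of the resource. It is not sent to the server when the object is referenced; the client uses it internally. A # separates the fragment from the rest of the URL | none | test |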
Because params are part of the path, we treat them as belonging to the path.
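To make the table concrete, here is a minimal sketch that splits the example URL with the standard library's urlparse, so it runs without furl installed. Note one difference from the convention above: urlparse reports the ;params of the last path segment in a separate field instead of keeping them inside the path. The `components` dict is only an illustrative name.

```python
from urllib.parse import urlparse

url = ("https://monkeyjerry:jerry@1eq066.coding-pages.com:8080"
       "/search;foo=1;bar=2?q=python&monkey=4#test")
parts = urlparse(url)

# Map each row of the table to the attribute urlparse exposes for it.
components = {
    "scheme": parts.scheme,
    "user": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "params": parts.params,  # split off from the last path segment
    "query": parts.query,
    "fragment": parts.fragment,
}

for name, value in components.items():
    print(f"{name:<9}{value}")
```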
URL encoding and decoding
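As a quick reminder of what percent-encoding looks like, the sketch below uses quote and unquote from urllib.parse. The sample strings are taken from the furl examples later in this post; the variable names are only illustrative.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# In a path segment a space is encoded as %20.
encoded_segment = quote("large ish", safe="")
decoded_segment = unquote("large%20ish")

# With safe="" even "/" is escaped, so reserved characters survive inside one segment.
encoded_reserved = quote('^`<>[]"#/?', safe="")

# In a query string a space is conventionally encoded as "+".
encoded_key = quote_plus("one piece")
decoded_key = unquote_plus("one+piece")

print(encoded_segment, decoded_segment)
print(encoded_reserved)
print(encoded_key, decoded_key)
```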
Using furl
Extracting Scheme, Username, Password, Host, Port, Path, Query, Fragment
from furl import furl
url = "https://monkeyjerry:jerry@1eq066.coding-pages.com/search/index.html;foo=1;bar=2?q=python&monkey=4#test"
f = furl(url)
print(type(f))
print(f.url, type(f.url), f.url==url)
print(f.scheme, f.username, f.password, f.host, f.port, f.path, f.query, f.fragment, sep='\n')
print(type(f.scheme), type(f.username), type(f.password), type(f.host), type(f.port), type(f.path), type(f.query), type(f.fragment))
-------------------------------------
<class 'furl.furl.furl'>
https://monkeyjerry:jerry@1eq066.coding-pages.com/search/index.html;foo=1;bar=2?q=python&monkey=4#test <class 'str'> True
https
monkeyjerry
jerry
1eq066.coding-pages.com
443 # default port inferred from the scheme; None if it cannot be inferred
/search/index.html;foo=1;bar=2
q=python&monkey=4
test
<class 'str'> <class 'str'> <class 'str'> <class 'str'> <class 'int'> <class 'furl.furl.Path'> <class 'furl.furl.Query'> <class 'furl.furl.Fragment'>
Note: path, query, and fragment are returned as objects (Path, Query, Fragment), not as str.
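netloc and origin
- netloc (Network Location) is the string made up of username, password, host, and port (the port is omitted when it is None or the scheme's default port)
- origin is the string made up of scheme, host, and port (the port is omitted when it is None or the scheme's default port)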
url = "https://monkey:jerry@google.com"
url1 = "https://monkey:jerry@google.com:99"
f = furl(url)
f1 = furl(url1)
print(f.netloc)
print(f1.origin)
--------------
monkey:jerry@google.com
https://google.com:99
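The port rule described above can be sketched with the standard library alone. The `origin` helper and the `DEFAULT_PORTS` table below are hypothetical names made up for this illustration; they are not furl's implementation, and the table only lists a few schemes.

```python
from urllib.parse import urlsplit

# Assumed default ports, for illustration only.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def origin(url: str) -> str:
    """Return scheme://host, adding :port only when it is not the default."""
    parts = urlsplit(url)
    result = f"{parts.scheme}://{parts.hostname}"
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(parts.scheme):
        result += f":{parts.port}"
    return result


print(urlsplit("https://monkey:jerry@google.com:99").netloc)
print(origin("https://monkey:jerry@google.com:99"))
print(origin("https://monkey:jerry@google.com:443"))
```

Unlike furl's netloc, the netloc attribute returned by urlsplit is the raw text between // and the path, so an explicitly written default port stays in it.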
path
- path.segments returns a list whose elements are percent-decoded strings (if the URL was encoded)
url = "http://www.google.com/a/large%20ish/path"
url1 = "http://www.google.com/a/large%20ish/path/" # note the trailing slash that url lacks
f = furl(url)
f1 = furl(url1)
print(str(f.path)) # stays percent-encoded
print(f.path.segments, f1.path.segments) # decoded
f.path.segments = ['o', 'hi', 'there', 'with some encoding', '^`<>[]"#/?', ''] # note the empty string as the last element
print(f.path) # percent-encoded again
-------------------------
/a/large%20ish/path
['a', 'large ish', 'path'] ['a', 'large ish', 'path', '']
/o/hi/there/with%20some%20encoding/%5E%60%3C%3E%5B%5D%22%23%2F%3F/
- path.isdir and path.isfile
  - If the path ends with /, it is treated as a directory (dir); otherwise it is a file
# Continuing from the example above
f = furl('http://www.google.com/a/directory/')
print(f.path.isdir, f.path.isfile)
f = furl('http://www.google.com/a/file')
print(f.path.isdir, f.path.isfile)
------------------------------
True False
False True
- path.isabsolute
  - Whether the path is absolute (that is, whether it starts with /); returns True or False
  - If a netloc is present, the path must be absolute
f = furl('/url/path')
print(f.path.isabsolute)
f.path.isabsolute = False # with no netloc, isabsolute can be set to False directly
print(f.url)
f.host = 'blaps.ru' # a netloc now exists
f.scheme = 'https'
print(f.url)
print(f.path.isabsolute)
# f.path.isabsolute = False # AttributeError: Path.isabsolute is True and read-only for URLs with a netloc (a username, password, host, and/or port). A URL path must start with a '/' to separate itself from a netloc.
- path.normalize() normalizes the path in place and returns the Path object
f = furl('http://www.google.com////a/./b/lolsup/../c/')
p = f.path.normalize()
print(p) # the returned Path object
print(f.url)
-----------------------------
/a/b/c/
http://www.google.com/a/b/c/
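To show what this kind of normalization involves, here is a rough standard-library approximation built on posixpath.normpath. The `normalize_url_path` helper is a hypothetical name and is not how furl implements normalize(); it only handles the simple absolute-path case shown above.

```python
import posixpath


def normalize_url_path(path: str) -> str:
    """Collapse repeated slashes, '.' and '..', keeping a trailing slash."""
    normalized = posixpath.normpath(path)
    # normpath drops a trailing slash, which matters for URLs (dir vs. file).
    if path.endswith("/") and not normalized.endswith("/"):
        normalized += "/"
    return normalized


print(normalize_url_path("////a/./b/lolsup/../c/"))
print(normalize_url_path("/a/b/../file"))
```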
- The / operator
  - Works like the path joining in pathlib.Path introduced in an earlier post
  - Supported forms: Path object / Path object, Path object / string, or string / Path object
  - string / string does not work
f = furl('path')
f1 = furl('path1')
f.path /= 'with'
f.path = f.path / f1.path
f.path = f.path / 'more' / 'path segments/'
print(f.path)
------------
path/with/path1/more/path%20segments/
- path.asdict(): returns a dict of information about the path
f = furl('http://www.google.com/some/enc%20oding')
print(f.path.asdict())
-----------------------
{'encoded': '/some/enc%20oding', 'isdir': False, 'isfile': True, 'segments': ['some', 'enc oding'], 'isabsolute': True}
query
- query.params: returns furl's omdict1D object; keys and values are decoded
- f.args: the same object as query.params
f = furl('http://www.google.com/?one%20piece=1&two=2&two=22') # encoded
print(f.query, type(f.query), str(f.query)) # not decoded
print(f.query.params, type(f.query.params)) # decoded
print(f.args, type(f.args), f.args is f.query.params) # decoded
# ------------ dict operations ------------ #
print(f.query.params['two'])
print(f.query.params.getlist('two'))
f.args['three'] = 3
f.args.addlist('four', [4, 44])
print(f.args)
f.args.popvalue('one piece')
print(f.args)
f.args.popvalue('four', 44)
print(f.args)
f.add(args={'params': ['a', 'b']})
print(f.args)
----------------------------------------
one+piece=1&two=2&two=22 <class 'furl.furl.Query'> one+piece=1&two=2&two=22
{'one piece': '1', 'two': '2', 'two': '22'} <class 'furl.omdict1D.omdict1D'>
{'one piece': '1', 'two': '2', 'two': '22'} <class 'furl.omdict1D.omdict1D'> True
2
['2', '22']
{'one piece': '1', 'two': '2', 'two': '22', 'three': 3, 'four': 4, 'four': 44}
{'two': '2', 'two': '22', 'three': 3, 'four': 4, 'four': 44}
{'two': '2', 'two': '22', 'three': 3, 'four': 4}
{'two': '2', 'two': '22', 'three': 3, 'four': 4, 'params': 'a', 'params': 'b'}
f.query.params returns an omdict1D (ordered multivalue dictionary), which keeps insertion order, allows several values per key, and supports the usual Python dict operations.
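For comparison, the standard library represents the same multivalue query either as a list of pairs (parse_qsl) or as a dict of lists (parse_qs). The sketch below parses the query string from the example above, adds and removes a key, and re-encodes it; the variable names are only illustrative.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

query = "one%20piece=1&two=2&two=22"

pairs = parse_qsl(query)   # ordered list of (key, value), duplicates kept
grouped = parse_qs(query)  # dict mapping each key to a list of values

# Add a key and drop another by rebuilding the pair list.
pairs.append(("three", "3"))
pairs = [(key, value) for key, value in pairs if key != "one piece"]
rebuilt = urlencode(pairs)

print(grouped)
print(rebuilt)
```

Every edit means rebuilding a list by hand, which is the kind of bookkeeping that omdict1D methods such as addlist() and popvalue() take care of.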
- Building the ?param= and ?param forms
f = furl('http://sprop.su')
f.args['param'] = ''
print(f.url)
f = furl('http://sprop.su')
f.args['param'] = None
print(f.url)
---------------------
http://sprop.su?param=
http://sprop.su?param
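The distinction matters because the standard library parser cannot express it. As the sketch below shows, parse_qsl returns the same pair for both forms when keep_blank_values is set, and drops both when it is not.

```python
from urllib.parse import parse_qsl

with_equals = parse_qsl("param=", keep_blank_values=True)
without_equals = parse_qsl("param", keep_blank_values=True)
dropped = parse_qsl("param=")  # blank values are discarded by default

print(with_equals, without_equals, dropped)
```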
- query.asdict(): returns a dict of information about the query
fragment
- The fragment is separated from the rest of the URL by #
- A fragment can itself be split into a path and a query, separated by an (optional) ?
- When f.fragment.separator is False, the fragment's path and query are not separated by ?
f = furl('http://www.google.com/#/fragment/path?with=params')
print(f.fragment)
print(f.query)
print(f.fragment.separator, f.fragment.path, f.fragment.query)
f = furl('http://www.google.com/')
f.fragment.path = '/path'
f.fragment.args = {'a': 'dict', 'of': 'args'}
print(f.fragment.separator)
print(f.fragment)
print(f.url)
f.fragment.separator = False
print(f.fragment)
print(f.url)
------------------------------------------------
/fragment/path?with=params

True /fragment/path with=params
True
/path?a=dict&of=args
http://www.google.com/#/path?a=dict&of=args
/patha=dict&of=args
http://www.google.com/#/patha=dict&of=args
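Because the fragment's path and query are Path and Query objects, all the path and query operations described above apply to them as well.
- fragment.asdict(): returns a dict of information about the fragment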
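The first fragment example prints an empty f.query because everything after # belongs to the fragment, including the ?. A minimal standard-library sketch of the same split is below; `fragment_path`, `fragment_query`, and `fragment_args` are illustrative names only.

```python
from urllib.parse import parse_qsl, urlsplit

parts = urlsplit("http://www.google.com/#/fragment/path?with=params")

# The '?' comes after '#', so it is part of the fragment, not the URL query.
fragment_path, separator, fragment_query = parts.fragment.partition("?")
fragment_args = dict(parse_qsl(fragment_query))

print(repr(parts.query))
print(fragment_path, fragment_args)
```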
encoding
- See the official API documentation
Method chaining
- furl's add(), set(), and remove() return the furl object, so calls can be chained
url = 'http://www.google.com/#fragment'
print(furl(url).add(args={'example': 'arg'}).set(port=99).remove(fragment=True).url)
url = "ftp://www.google.com/"
f = furl(url)
print(f.url)
f.add(path='/search', fragment_path='frag/path', fragment_args={'frag': 'arg'})
print(f.url)
f.set(scheme='https', host='secure.google.com', port=99, path='a/path/', args={'some': 'args'}, fragment='great job')
print(f.url)
f.remove(args=['some'], path='path/', fragment=True, port=True)
print(f.url)
----------------------------------------------
ftp://www.google.com/
ftp://www.google.com/search#frag/path?frag=arg
https://secure.google.com:99/a/path/?some=args#great%20job
https://secure.google.com/a/
For more on the parameters and usage of add(), set(), and remove(), see the source code or the official documentation.
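For a sense of what the one-line chain saves, here is a sketch of the same three changes (add a query argument, set the port, drop the fragment) using only urllib.parse. It assumes the URL has no username or password, since the netloc is rebuilt by hand.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

parts = urlsplit("http://www.google.com/#fragment")

# SplitResult is a namedtuple, so _replace() returns a modified copy.
updated = parts._replace(
    netloc=f"{parts.hostname}:99",        # set the port (no userinfo assumed)
    query=urlencode({"example": "arg"}),  # add a query argument
    fragment="",                          # remove the fragment
)
result = urlunsplit(updated)
print(result)
```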
- furl.join()
  - Joins the URL with a relative or absolute URL string
  - Modifies the furl object in place and returns it, so calls can be chained; this is why each call below builds on the result of the previous one
f = furl('http://www.google.com')
print(f.join('new/path').url)
print(f.join('replaced').url)
print(f.join('../parent').url)
print(f.join('path?query=yes#fragment').url)
print(f.join('unknown://www.yahoo.com/new/url/').url)
----------------------------------------------
http://www.google.com/new/path
http://www.google.com/new/replaced
http://www.google.com/parent
http://www.google.com/path?query=yes#fragment
unknown://www.yahoo.com/new/url/
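The sequence of results above follows the usual relative-reference resolution rules. As a sketch, the same chain can be reproduced with urljoin from the standard library by feeding each result back in as the next base; `current` and `results` are illustrative names.

```python
from urllib.parse import urljoin

targets = [
    "new/path",
    "replaced",
    "../parent",
    "path?query=yes#fragment",
    "unknown://www.yahoo.com/new/url/",
]

current = "http://www.google.com"
results = []
for target in targets:
    # Each join starts from the previous result, like repeated join() calls.
    current = urljoin(current, target)
    results.append(current)

print("\n".join(results))
```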
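Simple URL joining
If the joining your project needs is not complicated, there is no need to pull in a third-party library; urljoin from urllib.parse is enough.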
from urllib import parse
url = 'https://baidu.com'
query_str = 's?query=monkeyjerry'
print(url + query_str) # plain string concatenation does not produce the URL we want
url = parse.urljoin(url, query_str)
print(url)
--------------
https://baidu.coms?query=monkeyjerry
https://baidu.com/s?query=monkeyjerry
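One thing to watch with urljoin is the trailing slash on the base URL: without it, the last path segment is treated as a file and replaced. The paths in this short sketch are made up for illustration.

```python
from urllib.parse import urljoin

# No trailing slash: "b" is replaced by "c".
no_slash = urljoin("https://baidu.com/a/b", "c")

# Trailing slash: "c" is appended under "b/".
with_slash = urljoin("https://baidu.com/a/b/", "c")

# A leading "/" on the second argument discards the base path entirely.
rooted = urljoin("https://baidu.com/a/b/", "/c")

print(no_slash, with_slash, rooted, sep="\n")
```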