Using the re module in Python
Flag constants (engine options): re.S, re.M, re.I
Methods: compile
Single match: match, search, fullmatch; whole-text search: findall, finditer
Match and replace: sub, subn
Splitting strings: re.split
Groups: group, groups, groupdict

An earlier article introduced regular expressions; this one covers how to use them in Python through the re module.

Flag constants

Ignore case: re.I; single-line (dot-all) mode: re.S; multi-line mode: re.M
Enable several options at once with | (bitwise OR), e.g. re.S|re.M
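
As a small sketch of combining options (the sample text and variable names are made up for illustration), OR-ing re.I with re.M gives a different result from re.I alone:

```python
import re

text = 'Apple\nbanana\nAVOCADO'  # made-up sample text

# re.I ignores case; re.M lets ^ match at the start of every line
both = re.findall(r'^a\w+', text, re.I | re.M)

# With re.I only, ^ matches at the very start of the string and nowhere else
only_ignorecase = re.findall(r'^a\w+', text, re.I)

print(both)             # ['Apple', 'AVOCADO']
print(only_ignorecase)  # ['Apple']
```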

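Compiling

re.compile(pattern, flags=0)

pattern is the regular expression; flags are the options
Returns a regular expression object (regex)
Compiling a regular expression improves efficiency: the compiled result is saved, so the same pattern does not need to be compiled again the next time it is used
The other functions in re, such as match, also call the compile step internally (see the source) for speed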

import re

pattern = r'^a\w+'  # raw string, so \w is not treated as a string escape
regex = re.compile(pattern, re.M)
print(regex, type(regex), sep='\n')
----------------------------------------
re.compile('^a\\w+', re.MULTILINE)
<class '_sre.SRE_Pattern'>

The other functions in re are thin wrappers around the compiled pattern; for example, the source of match is:

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)
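
A minimal sketch of what this wrapping means in practice (the sample text is made up): the module-level function and the method on a compiled object give the same result, and the compiled object can be kept around and reused.

```python
import re

text = 'apple pie'  # made-up sample text

# Module-level call: the pattern is compiled internally on the way in
m1 = re.match(r'a\w+', text)

# Compile once, then call the method on the pattern object
regex = re.compile(r'a\w+')
m2 = regex.match(text)

print(m1.group(), m2.group())  # apple apple
print(regex.pattern)           # the original pattern string is kept on the object
```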

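Single match

match, search, fullmatch
Each returns at most one match

match

re.match(pattern, string, flags=0)
- Matches from the beginning of the string
- Returns a match object if the pattern matches at the start, otherwise None
- In the match object, span is the matched region as a half-open interval [start, end), and match is the text the regular expression matched
- Always matches from the start of the string, whether or not re.S or re.M is set
regex.match(string[, pos[, endpos]])
- regex is the regular expression object returned by compile
- A start and an end position can be given; the start defaults to index 0
- Matches from the given position: returns a match object if the pattern matches there, otherwise None
- Multi-line mode takes effect: with re.M, ^ matches at pos when pos is the start of a line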

import re

s = 'big\nbottle\nbag\napple'
for i, j in enumerate(s, 1):
    print((i - 1, j), end='\n' if i % 10 == 0 else ' ')

# match function
match_result = re.match('a', s)
print(1, match_result, type(match_result))  # matches from the start; the pattern does not fit, so None

match_result = re.match('a', s, re.M)
print(2, match_result, type(match_result))  # re.match starts at index 0 regardless of re.S or re.M

match_result = re.match('b', s)
print(3, match_result, type(match_result))  # returns a match object; only one match is made


# Compile first, then use the regex object's match method, which accepts a start position
regex = re.compile('^a', re.M)  # even in multi-line mode
regex_match_result = regex.match(s)  # no start position given: matching starts at index 0
print(4, regex_match_result, type(regex_match_result))

regex_match_result = regex.match(s, 15)  # the regex object is reused without recompiling; with a start position, multi-line mode takes effect
print(5, regex_match_result, type(regex_match_result))

regex = re.compile(r'b\w+')  # no flags set
regex_match_result = regex.match(s, 11)  # matches from the given position; returns the result if the pattern fits
print(6, regex_match_result, type(regex_match_result))
----------------------------------------------------------------------------
(0, 'b') (1, 'i') (2, 'g') (3, '\n') (4, 'b') (5, 'o') (6, 't') (7, 't') (8, 'l') (9, 'e')
(10, '\n') (11, 'b') (12, 'a') (13, 'g') (14, '\n') (15, 'a') (16, 'p') (17, 'p') (18, 'l') (19, 'e')
1 None <class 'NoneType'>
2 None <class 'NoneType'>
3 <_sre.SRE_Match object; span=(0, 1), match='b'> <class '_sre.SRE_Match'> # span is the matched region as a half-open interval [start, end); match is the text the regular expression matched
4 None <class 'NoneType'>
5 <_sre.SRE_Match object; span=(15, 16), match='a'> <class '_sre.SRE_Match'>
6 <_sre.SRE_Match object; span=(11, 14), match='bag'> <class '_sre.SRE_Match'>
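
The example above only uses pos. As a hedged sketch of endpos, and of how pos differs from slicing the string first (the positions 4 and 7 are chosen just for illustration):

```python
import re

s = 'big\nbottle\nbag\napple'
regex = re.compile(r'b\w+')

# pos=4 starts at 'bottle'; endpos=7 means only s[:7] is visible to the pattern
limited = regex.match(s, 4, 7)
print(limited.group(), limited.span())  # bot (4, 7)

# Slicing gives the same text, but the span is relative to the slice
sliced = regex.match(s[4:7])
print(sliced.group(), sliced.span())    # bot (0, 3)
```

With pos and endpos the reported span still refers to the original string, which is convenient when the indexes are used later.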

search

re.search(pattern, string, flags=0)
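- Searches forward from the start of the string until the first match is found
- Returns a match object as soon as a match is found (only one match); returns None if nothing is found
regex.search(string[, pos[, endpos]])
- Again, regex is the regular expression object returned by compile
- A start and an end position can be given
- Searches from the given start position and returns a match object as soon as a match is found (only one match); returns None if nothing is found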

import re

s = 'big\nbottle\nbag\napple'
for i, j in enumerate(s, 1):
    print((i - 1, j), end='\n' if i % 10 == 0 else ' ')

# search function
search_result = re.search(r'a\w+', s)
print(1, search_result, type(search_result))  # searches from the start and returns a match object once found

search_result = re.search(r'^a\w+', s, re.M)  # multi-line mode takes effect
print(2, search_result, type(search_result))


# Compile first, then use the regex object's search method, which accepts a start position
regex = re.compile(r'a\w+', re.S)
regex_search_result = regex.search(s, 15)  # a start position can be given
print(3, regex_search_result, type(regex_search_result))

regex = re.compile(r'^a\w+', re.M)  # a compiled regex can be reused as long as the pattern stays the same
regex_search_result = regex.search(s)  # multi-line mode takes effect
print(4, regex_search_result, type(regex_search_result))
----------------------------------------------------------------------------
(0, 'b') (1, 'i') (2, 'g') (3, '\n') (4, 'b') (5, 'o') (6, 't') (7, 't') (8, 'l') (9, 'e')
(10, '\n') (11, 'b') (12, 'a') (13, 'g') (14, '\n') (15, 'a') (16, 'p') (17, 'p') (18, 'l') (19, 'e')
1 <_sre.SRE_Match object; span=(12, 14), match='ag'> <class '_sre.SRE_Match'>
2 <_sre.SRE_Match object; span=(15, 20), match='apple'> <class '_sre.SRE_Match'>
3 <_sre.SRE_Match object; span=(15, 20), match='apple'> <class '_sre.SRE_Match'>
4 <_sre.SRE_Match object; span=(15, 20), match='apple'> <class '_sre.SRE_Match'>

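Difference between match and search

match only matches the pattern at the start of the string
search finds the pattern anywhere in the string, so it is used more often in real projects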

import re

s = 'apple\norange'
print(re.match('apple', s))
print(re.match('orange', s)) # None
print(re.search('orange', s))
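
A related sketch, reusing the same string: re.M does not make match look at later lines, while search combined with ^ and re.M does find a pattern at the start of a line.

```python
import re

s = 'apple\norange'

no_match = re.match('orange', s, re.M)   # match only ever tries index 0
no_anchor = re.search('^orange', s)      # without re.M, ^ is only the string start
found = re.search('^orange', s, re.M)    # with re.M, ^ also matches after a newline

print(no_match, no_anchor)  # None None
print(found.span())         # (6, 12)
```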

fullmatch

As the name suggests, the entire string has to match the regular expression

re.fullmatch(pattern, string, flags=0)
regex.fullmatch(string[, pos[, endpos]])
The re.S and re.M flags affect the result

import re

s = 'big\nbottle\nboy\napple'

for i, j in enumerate(s, 1):
    print((i - 1, j), end='\n' if i % 10 == 0 else ' ')

pattern = 'big'
result = re.fullmatch(pattern, s)
print(1, result, type(result))  # full match required; the whole string does not match, so None

pattern = '.+'
result = re.fullmatch(pattern, s, re.S)  # re.S extends . so that it also matches newlines
print(2, result, type(result))  # returns a match object

pattern = r'a\w+'
regex = re.compile(pattern)
result = regex.fullmatch(s, 15)  # a start position can be given
print(3, result, type(result))

pattern = r'^a\w+'
regex = re.compile(pattern, re.M)
result = regex.fullmatch(s, 15)  # multi-line mode takes effect
print(4, result, type(result))
----------------------------------------
(0, 'b') (1, 'i') (2, 'g') (3, '\n') (4, 'b') (5, 'o') (6, 't') (7, 't') (8, 'l') (9, 'e')
(10, '\n') (11, 'b') (12, 'o') (13, 'y') (14, '\n') (15, 'a') (16, 'p') (17, 'p') (18, 'l') (19, 'e')
1 None <class 'NoneType'>
2 <_sre.SRE_Match object; span=(0, 20), match='big\nbottle\nboy\napple'> <class '_sre.SRE_Match'>
3 <_sre.SRE_Match object; span=(15, 20), match='apple'> <class '_sre.SRE_Match'>
4 <_sre.SRE_Match object; span=(15, 20), match='apple'> <class '_sre.SRE_Match'>
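
To make the contrast with match concrete, here is a small sketch (the input strings are made up). It also shows one case where fullmatch is stricter than match with a trailing $, because $ also matches just before a final newline:

```python
import re

prefix = re.match(r'\d+', '123abc')      # only the beginning has to fit
whole = re.fullmatch(r'\d+', '123abc')   # the whole string has to fit
print(prefix.group(), whole)             # 123 None

# '$' accepts a trailing newline; fullmatch does not
with_dollar = re.match(r'\d+$', '123\n')
strict = re.fullmatch(r'\d+', '123\n')
print(with_dollar is not None, strict is None)  # True True
```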

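Whole-text search with multiple matches

findall, finditer
Scan the whole string from left to right and return every match
Recommended

findall

re.findall(pattern, string, flags=0)
regex.findall(string[, pos[, endpos]])
- Returns a list of all matches; if pattern contains groups, the list holds the group contents instead of the whole matches

finditer

re.finditer(pattern, string, flags=0)
regex.finditer(string[, pos[, endpos]])
- Returns an iterator
- Each iteration yields a match object (a for loop produces the match objects one by one)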

import re

s = 'big\nbottle\nboy\napple'

for i, j in enumerate(s, 1):
    print((i - 1, j), end='\n' if i % 10 == 0 else ' ')

result = re.findall(r'b\w', s)
print(1, result, type(result))  # returns a list of all matches, or an empty list [] if nothing matches

regex = re.compile(r'b\w')
result = regex.findall(s, 4)
print(2, result, type(result))

result = re.finditer(r'b\w+', s)
print(result, type(result))  # returns an iterator

for match_result in result:  # the iterator yields match objects one by one
    print(match_result, match_result.start(), match_result.end(),  # inside parentheses a line may break after a comma without a trailing backslash
          s[match_result.start():match_result.end()])  # the span of the match object and the matched text

regex = re.compile(r'^b\w', re.M | re.S)  # only multi-line mode affects ^
result = regex.finditer(s)

for match_result in result:
    print(match_result, match_result.start(), match_result.end(),
          s[match_result.start():match_result.end()])
----------------------------------------------------------------------------
(0, 'b') (1, 'i') (2, 'g') (3, '\n') (4, 'b') (5, 'o') (6, 't') (7, 't') (8, 'l') (9, 'e')
(10, '\n') (11, 'b') (12, 'o') (13, 'y') (14, '\n') (15, 'a') (16, 'p') (17, 'p') (18, 'l') (19, 'e')
1 ['bi', 'bo', 'bo'] <class 'list'>
2 ['bo', 'bo'] <class 'list'>
<callable_iterator object at 0x000001583B5AAA90> <class 'callable_iterator'>
<_sre.SRE_Match object; span=(0, 3), match='big'> 0 3 big
<_sre.SRE_Match object; span=(4, 10), match='bottle'> 4 10 bottle
<_sre.SRE_Match object; span=(11, 14), match='boy'> 11 14 boy
<_sre.SRE_Match object; span=(0, 2), match='bi'> 0 2 bi
<_sre.SRE_Match object; span=(4, 6), match='bo'> 4 6 bo
<_sre.SRE_Match object; span=(11, 13), match='bo'> 11 13 bo
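
One detail worth a sketch: what findall returns depends on how many groups the pattern contains. The variable names here are illustrative only.

```python
import re

s = 'big\nbottle\nboy\napple'

no_group = re.findall(r'b\w+', s)        # no group: the whole matches
one_group = re.findall(r'b(\w+)', s)     # one group: only the group contents
two_groups = re.findall(r'(b)(\w+)', s)  # several groups: a list of tuples

print(no_group)    # ['big', 'bottle', 'boy']
print(one_group)   # ['ig', 'ottle', 'oy']
print(two_groups)  # [('b', 'ig'), ('b', 'ottle'), ('b', 'oy')]

# finditer yields match objects, so the whole match and the groups are both available
pairs = [(m.group(), m.group(1)) for m in re.finditer(r'b(\w+)', s)]
print(pairs)       # [('big', 'ig'), ('bottle', 'ottle'), ('boy', 'oy')]
```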

Getting the list of connected Android devices

import subprocess
import re

def get_device_uuid():
    stdout, stderr = subprocess.Popen("adb devices", shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                                      text=True).communicate()

    pattern = r'([a-zA-Z0-9-]*)\s*device$'  # same character set, with the literal '-' moved to the end of the class
    uuid_lst = re.findall(pattern, stdout, re.M)
    return uuid_lst

print(get_device_uuid())
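
The function above needs adb and a connected device to run. To try the parsing step on its own, here is a sketch that works on a sample string. The sample output and the helper name parse_device_serials are assumptions for illustration, and the pattern is a slightly stricter variant (anchored with ^, using + instead of *), not the one from the original function.

```python
import re

# Hypothetical sample of what `adb devices` might print; real output can differ
SAMPLE_OUTPUT = (
    'List of devices attached\n'
    'emulator-5554\tdevice\n'
    '0123456789ABCDEF\toffline\n'
    '\n'
)

def parse_device_serials(adb_output):
    """Return the serials of all lines whose state is exactly 'device'."""
    # re.M lets ^ and $ anchor at every line; with one group, findall returns only the group
    return re.findall(r'^([A-Za-z0-9-]+)\s+device$', adb_output, re.M)

serials = parse_device_serials(SAMPLE_OUTPUT)
print(serials)  # ['emulator-5554']
```

The header line and the offline device are skipped because neither line ends in the word device preceded by whitespace.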

Matching summary

match, search and fullmatch return a match object (or None)
findall returns a list of strings; finditer returns an iterator of match objects

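Match and replace

sub, subn
Match string against pattern, then replace each match with replacement

sub

re.sub(pattern, replacement, string, count=0, flags=0)
regex.sub(replacement, string, count=0)
- replacement can be a string, bytes, or a function
- count is the maximum number of replacements; the default 0 replaces every match
- Returns a new object with the replacements applied
- Supports group references in the replacement
re.subn(pattern, replacement, string, count=0, flags=0)
regex.subn(replacement, string, count=0)
- Returns a tuple containing the new object and the number of replacements made
- Supports group references in the replacement
- Used less often; usually a pattern is simply replaced throughout the text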

import re

s = 'big\nbottle\nboy\napple'

for i, j in enumerate(s, 1):
    print((i - 1, j), end='\n' if i % 10 == 0 else ' ')

# str.replace does not support pattern-based replacement
repl_s = s.replace('big', 'small')
print(s, repl_s, sep='\n')

print('~' * 30)
# Use re.sub or re.subn for pattern-based replacement
pattern = r'b\w+'
repl_s = re.sub(pattern, 'small', s)  # returns a new string; the original string is unchanged
print(s, repl_s, sep='\n')

print('~' * 30)
# The sub method of a compiled regex object works the same way
regex = re.compile(pattern)
repl_s = regex.sub("small", s)
print(s, repl_s, sep='\n')

# subn
print('~' * 30)
pattern = r'b\w+'
repl_s = re.subn(pattern, 'small', s, count=2)  # returns a tuple of the new string and the number of replacements; count limits them, the default replaces all
print(s, repl_s, sep='\n')  # the original string is unchanged

print('~' * 30)
pattern = r'b\w+'
regex = re.compile(pattern)
repl_s = regex.subn('small', s)
print(s, repl_s, sep='\n')

# sub and subn support group references in the replacement
pattern = r'(b\w+)'  # a group ()
repl_s = re.sub(pattern, r'small<-----\1', s)  # \1 refers to group 1; the r'' prefix is required
print(s, repl_s, sep='\n')
------------------------------------------------------------------------------
(0, 'b') (1, 'i') (2, 'g') (3, '\n') (4, 'b') (5, 'o') (6, 't') (7, 't') (8, 'l') (9, 'e')
(10, '\n') (11, 'b') (12, 'o') (13, 'y') (14, '\n') (15, 'a') (16, 'p') (17, 'p') (18, 'l') (19, 'e')
big
bottle
boy
apple
small
bottle
boy
apple
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
big
bottle
boy
apple
small
small
small
apple
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
big
bottle
boy
apple
small
small
small
apple
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
big
bottle
boy
apple
('small\nsmall\nboy\napple', 2)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
big
bottle
boy
apple
('small\nsmall\nsmall\napple', 3)
big
bottle
boy
apple
small<-----big
small<-----bottle
small<-----boy
apple
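
The list above mentions that replacement can also be a function, which the example does not show. A minimal sketch, where the callback name shout is made up: the function receives each match object and returns the replacement text. The second part shows the \g<name> form for referring to named groups in a replacement string.

```python
import re

s = 'big\nbottle\nboy\napple'

def shout(match):
    """Replacement callback: takes the match object, returns the new text."""
    return match.group().upper()

upper = re.sub(r'b\w+', shout, s)
print(upper.split('\n'))    # ['BIG', 'BOTTLE', 'BOY', 'apple']

# Named groups are referenced in the replacement string with \g<name>
swapped = re.sub(r'(?P<head>b)(?P<tail>\w+)', r'\g<tail>\g<head>', s)
print(swapped.split('\n'))  # ['igb', 'ottleb', 'oyb', 'apple']
```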

Splitting strings

re.split

Splits on a pattern (a regular expression match)
Returns a list

re.split compared with the string method split

import re

s = 'a\r\tb \nc\re\t'
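print(s.split())  # drops the whitespace at both ends
print(re.split(r'\s+', s))  # whitespace at the ends is not dropped, which leaves an empty string

s = '''
os.path.abspath(path)
normpath(join(os.getcwd(), path)).
'''
# Split out all the words
# str.split cannot do this
print(s.split())
# Split on a pattern with re.split
# Direct approach: list the separator characters
print(re.split(r'[\s().,]+', s))
# Inverse approach: split on everything that is not a word character
print(re.split(r'\W+', s))
print(re.split(r'[^0-9A-Z_a-z-]+', s))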
---------------------------------------------
['a', 'b', 'c', 'e']
['a', 'b', 'c', 'e', '']
['os.path.abspath(path)', 'normpath(join(os.getcwd(),', 'path)).']
['', 'os', 'path', 'abspath', 'path', 'normpath', 'join', 'os', 'getcwd', 'path', '']
['', 'os', 'path', 'abspath', 'path', 'normpath', 'join', 'os', 'getcwd', 'path', '']
['', 'os', 'path', 'abspath', 'path', 'normpath', 'join', 'os', 'getcwd', 'path', '']
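
The empty strings at both ends of the results above are usually unwanted. The sketch below filters them out and also shows two other re.split features, maxsplit and a capturing group in the pattern (the input strings are shortened versions of the ones above):

```python
import re

s = '\nos.path.abspath(path)\n'

# Empty strings appear when the text starts or ends with a separator; filter them out
words = [w for w in re.split(r'\W+', s) if w]
print(words)       # ['os', 'path', 'abspath', 'path']

# maxsplit limits the number of splits; the remainder stays in the last item
first_rest = re.split(r'\.', 'os.path.abspath', maxsplit=1)
print(first_rest)  # ['os', 'path.abspath']

# A capturing group in the pattern keeps the separators in the result
with_seps = re.split(r'(\.)', 'os.path')
print(with_seps)   # ['os', '.', 'path']
```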

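Groups

Groups: data captured by the parts of pattern enclosed in parentheses () is stored in groups
Only match objects have the group methods, so groups are available from match, search, fullmatch and finditer
If pattern contains groups and there is a match, the captured groups are stored in the match object

group(N)

N from 1 upward selects the corresponding group; group numbers start at 1
0 returns the entire matched string, i.e. match
N defaults to 0, so group() == group(0)

groups()

Returns all groups
The return value is a tuple

Named groups

(?P<name>pattern1)
With named groups, a group can be retrieved with group('name')

groupdict()

Returns all named groups
The return value is a dict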

import re

s = 'bottle\nbig\nbag\napple'
result = re.match(r'(?P<head>b)(?P<tail>\w+)', s)  # named groups
print(result, type(result))  # match object
print(result.groups())  # only match objects have the group methods; returns a tuple of all groups
print(result.group(1), result.group(2))  # group numbers start at 1; result.group(3) raises IndexError: no such group
print(result.group(), result.group(0))  # group() is group(0), i.e. the whole match
print(result.groupdict())  # dict of the named groups
print(result.groupdict()['head'])
print(result.group('head'))  # retrieve a group with group('name')
-----------------------------------------------------------------------------
<_sre.SRE_Match object; span=(0, 6), match='bottle'> <class '_sre.SRE_Match'>
('b', 'ottle')
b ottle
bottle bottle
{'head': 'b', 'tail': 'ottle'}
b
b
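
The example above uses match, so only the first line is captured. As a sketch of groups combined with finditer (the pattern and variable names are illustrative), every line can be captured in turn. The last part shows that groupdict() only contains named groups, while groups() contains all of them.

```python
import re

s = 'bottle\nbig\nbag\napple'
regex = re.compile(r'^(?P<head>\w)(?P<tail>\w+)$', re.M)

# Each match object carries its own groups
for m in regex.finditer(s):
    print(m.group(0), m.groups(), m.groupdict(), m.span('tail'))

heads = [m.group('head') for m in regex.finditer(s)]
print(heads)  # ['b', 'b', 'b', 'a']

# Mixing an unnamed and a named group
mixed = re.match(r'(\w)(?P<rest>\w+)', s)
print(mixed.groups(), mixed.groupdict())  # ('b', 'ottle') {'rest': 'ottle'}
```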
