An Introduction to Regular Expressions in Python

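A regular expression (regex) is a syntax, or small language, for searching, extracting, and manipulating specific string patterns within a larger text. Regular expressions are widely used in projects that involve text validation, NLP, and text mining.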

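1. Introduction to regular expressions

Nearly every programming language implements regular expressions. In Python they are provided by the standard library module "re", which is widely used in natural language processing, in web applications that validate string input (such as email addresses), and in most data science projects that involve text mining.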

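2. What is a regular expression pattern, and how do you compile one?

A regular expression pattern is a special notation for describing generic text, numbers, or symbols, so that it can be used to extract the text that conforms to the pattern.

A basic example is "\s+". Here "\s" matches any single whitespace character, and the trailing "+" makes the pattern match one or more consecutive whitespace characters. This includes the tab character "\t".

A list of common regular expressions appears later in this article. First, here is how to use a basic one.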

import re
regex = re.compile(r'\s+')  # raw string keeps the backslash intact

The code above imports the "re" package and compiles a regular expression pattern that matches one or more whitespace characters.
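
As a minimal sketch of what the compiled object offers, the snippet below runs the same pattern over a made-up sample string (the `sample` value is an assumption for illustration, not part of the course text). It also shows why the pattern is written as a raw string: the backslash reaches the regex engine unchanged.

```python
import re

# Raw strings (r'...') pass backslashes to the regex engine unchanged.
regex = re.compile(r'\s+')

# Hypothetical sample mixing spaces, a tab and newlines.
sample = "a \t b\n\nc"

# The compiled object remembers its source pattern.
print(regex.pattern)

# Each run of consecutive whitespace is reported as one match.
runs = regex.findall(sample)
print(runs)
```

Spaces, tabs, and newlines are all covered by "\s", so the two whitespace runs come back as two items.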

3. How do you split a string with a regular expression?

Consider the following piece of text.

text = """101 COM    Computers
205 MAT   Mathematics
189 ENG   English"""

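The three course entries follow the format "[course number] [course code] [course name]", and the spacing between the words is uneven.

How can these three entries be broken into individual numbers and words?

There are two ways:

Use re.split().
Call the split() method of the compiled regex object, that is, regex.split().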

# split the text around 1 or more space characters
re.split(r'\s+', text)
# or
regex.split(text)
#> ['101', 'COM', 'Computers', '205', 'MAT', 'Mathematics', '189', 'ENG', 'English']

Both approaches work, so which one should you use? If the same pattern is used many times, prefer regex.split() on the precompiled object.
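
If you would rather keep one record per course than a single flat list, one possible approach is to split line by line. The sketch below assumes the same three-line course text; the `records` name and the use of `maxsplit` are illustrative choices, not something the original example requires.

```python
import re

text = """101 COM    Computers
205 MAT   Mathematics
189 ENG   English"""

regex = re.compile(r'\s+')

# Split each line separately; maxsplit=2 stops after the first two gaps,
# so a multi-word course name would stay in one piece.
records = [regex.split(line, maxsplit=2) for line in text.splitlines()]
print(records)
```

Each inner list then holds the number, code, and name of one course.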

4. Matching with findall, search, and match

Suppose you want to extract all the course numbers from the text above, that is, the numbers 101, 205, and 189. How can this be done?

4.1 re.findall()

# find all numbers within the text
print(text)
regex_num = re.compile(r'\d+')
regex_num.findall(text)
#> 101 COM    Computers
#> 205 MAT   Mathematics
#> 189 ENG   English
#> ['101', '205', '189']

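In the code above, the special sequence "\d" matches any single digit; more such sequences are covered later in this article. Adding a "+" to it requires at least one digit. Similarly, "*" matches zero or more digits, so with "*" the digits become optional.

Finally, the findall() method extracts every run of one or more digits from the text and returns them as a list.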
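
The difference between "+" and "*" is easiest to see when the whole string must match. The following sketch uses fullmatch() on two hypothetical identifiers (the "ID" prefix is an assumption made up for this illustration).

```python
import re

# Hypothetical identifiers: a prefix followed by digits.
plus = re.compile(r'ID\d+')   # at least one digit required
star = re.compile(r'ID\d*')   # digits are optional

for candidate in ['ID42', 'ID']:
    # fullmatch() succeeds only if the entire string fits the pattern.
    print(candidate,
          plus.fullmatch(candidate) is not None,
          star.fullmatch(candidate) is not None)
```

The bare "ID" is rejected by the "+" version and accepted by the "*" version.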

4.2 re.search() vs re.match()

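As the name suggests, regex.search() searches the given text for a match. Unlike findall(), which returns the matched portions of the text as a list, regex.search() returns a match object that holds the start and end positions of the first occurrence of the pattern.

Likewise, regex.match() returns a match object. The difference is that it only succeeds when the pattern matches at the very beginning of the text. Both methods return None when there is no match.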

# define the text
text2 = """COM    Computers
205 MAT   Mathematics 189"""

# compile the regex and search the pattern
regex_num = re.compile(r'\d+')
s = regex_num.search(text2)

print('Starting Position: ', s.start())
print('Ending Position: ', s.end())
print(text2[s.start():s.end()])
#> Starting Position:  17
#> Ending Position:  20
#> 205

Alternatively, the match object's group() method gives the same output.

print(s.group())
#> 205

# match() only looks at the start of text2, which begins with "COM"
m = regex_num.match(text2)
print(m)
#> None
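
Because both search() and match() can return None, calling group() on the result without a check raises an AttributeError. A minimal sketch of a guarded lookup follows; the helper name `first_number` is hypothetical.

```python
import re

regex_num = re.compile(r'\d+')

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    m = regex_num.search(text)
    return m.group() if m else None

print(first_number('COM 205 MAT'))
print(first_number('no digits here'))

# match() succeeds only when the digits sit at the very start.
print(regex_num.match('205 MAT') is not None)
print(regex_num.match('COM 205') is not None)
```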

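5. How do you substitute text with a regular expression?

To replace text, use regex.sub().

Consider the following modified version of the course text, in which an extra "\t" has been added after each course code.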

# define the text
text = """101   COM \t  Computers
205   MAT \t  Mathematics
189   ENG  \t  English"""  
print(text)
#> 101   COM    Computers
#> 205   MAT     Mathematics
#> 189   ENG     English

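How can all the extra whitespace and line breaks be removed so that every word sits on a single line? Simply use regex.sub() to replace "\s+" with a single space " ".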

# replace one or more whitespace characters with a single space
regex = re.compile(r'\s+')
print(regex.sub(' ', text))
# or
print(re.sub(r'\s+', ' ', text))
#> 101 COM Computers 205 MAT Mathematics 189 ENG English
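
Two related options are worth knowing: the `count` argument limits how many replacements are made, and subn() also reports how many were made. The sketch below uses a single hypothetical line of the course text.

```python
import re

text = "101   COM \t  Computers"

# count=1 limits the substitution to the first whitespace run.
first_only = re.sub(r'\s+', ' ', text, count=1)
print(repr(first_only))

# subn() returns the new string together with the number of replacements.
cleaned, n = re.subn(r'\s+', ' ', text)
print(cleaned, n)
```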

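Suppose you only want to remove the extra spaces within each line while keeping every course entry on its own line. A negative lookahead, "(?!\n)", can check each whitespace character and keep the newline out of the match.

# collapse runs of whitespace, but never consume a newline
regex = re.compile(r'(?:(?!\n)\s)+')
print(regex.sub(' ', text))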
#> 101 COM Computers
#> 205 MAT Mathematics
#> 189 ENG English
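
Another way to sketch the same idea, without a lookahead, is a negated character class. "[^\S\n]" reads as "not a non-whitespace character and not a newline", which leaves every whitespace character except the newline. The sample text here is a shortened, hypothetical variant with trailing spaces on the first line.

```python
import re

text = "101   COM \t  Computers  \n205   MAT \t  Mathematics"

# [^\S\n] = whitespace characters other than the newline
same_line_ws = re.compile(r'[^\S\n]+')
result = same_line_ws.sub(' ', text)
print(result)
```

The line break survives even though whitespace sits directly in front of it.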

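6. Regular expression groups

Groups are a very useful feature of regular expressions: they let you extract the parts of a match you want as individual items.

Suppose you want to extract the course number, code, and name as separate items. Matching each one with its own expression would look like this.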

text = """101   COM   Computers
205   MAT   Mathematics
189   ENG    English"""

# 1. extract all course numbers
re.findall('[0-9]+', text)

# 2. extract all course codes
re.findall('[A-Z]{3}', text)

# 3. extract all course names
re.findall('[A-Za-z]{4,}', text)

#> ['101', '205', '189']
#> ['COM', 'MAT', 'ENG']
#> ['Computers', 'Mathematics', 'English']

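Let's walk through the code. It uses three separate regular expressions, one each for the course number, the course code, and the course name.

Writing one expression per item quickly becomes tedious when there are many items to match. Groups solve this by letting you write them all in a single pattern.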

# define the course text pattern groups and extract
course_pattern = r'([0-9]+)\s*([A-Z]{3})\s*([A-Za-z]{4,})'
re.findall(course_pattern, text)
#> [('101', 'COM', 'Computers'), ('205', 'MAT', 'Mathematics'), ('189', 'ENG', 'English')]

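Note that the expressions for the course number ([0-9]+), the course code ([A-Z]{3}), and the course name ([A-Za-z]{4,}) are each placed inside parentheses (). With groups in the pattern, findall() returns one tuple per match, holding one item per group.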
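
Groups can also be given names with the "(?P<name>...)" syntax, which makes the extracted items easier to read than positional tuples. The sketch below is one possible variant of the course pattern; the group names `number`, `code`, and `title` are illustrative assumptions.

```python
import re

text = """101   COM   Computers
205   MAT   Mathematics
189   ENG    English"""

# (?P<name>...) labels each group; the names here are illustrative.
course_pattern = re.compile(
    r'(?P<number>[0-9]+)\s*(?P<code>[A-Z]{3})\s*(?P<title>[A-Za-z]{4,})'
)

# finditer() yields match objects; groupdict() maps group names to text.
courses = [m.groupdict() for m in course_pattern.finditer(text)]
for course in courses:
    print(course['number'], course['code'], course['title'])
```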

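7. What is greedy matching?

By default, regular expressions are greedy. This means a quantifier keeps consuming as much text as it can, even after it has already found enough to satisfy the pattern, and gives characters back only when that is needed for the rest of the pattern to match.

Consider an example of matching HTML, where the goal is to retrieve the HTML tags.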

text = "<body>Regex Greedy Matching Example </body>"
re.findall('<.*>', text)
#> ['<body>Regex Greedy Matching Example </body>']

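Instead of stopping at the first ">", the pattern extracted the whole string. This is the default greedy, or "take it all", behavior of regular expressions.

Lazy matching, on the other hand, matches as few characters as possible. You can request it by adding a "?" after the quantifier.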

re.findall('<.*?>', text)
#> ['<body>', '</body>']

If you only want to retrieve the first match, use search() instead.

re.search('<.*?>', text).group()
#> '<body>'
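
A common alternative to the lazy quantifier is a negated character class that simply refuses to cross a ">". The sketch below compares the three variants on the same string. It is only an illustration of quantifier behavior, not a recommendation for parsing real HTML.

```python
import re

text = "<body>Regex Greedy Matching Example </body>"

greedy = re.findall(r'<.*>', text)       # longest possible match
lazy = re.findall(r'<.*?>', text)        # shortest possible match
negated = re.findall(r'<[^>]*>', text)   # anything except '>' between the brackets

print(greedy)
print(lazy)
print(negated)
```

The lazy and negated-class versions return the same two tags for this input.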

8. Common regular expressions

Now that you know how to use the re module, let's look at some commonly used patterns.

Basic expressions

.             One character except new line
\.            A period. \ escapes a special character.
\d            One digit
\D            One non-digit
\w            One word character including digits
\W            One non-word character
\s            One whitespace
\S            One non-whitespace
\b            Word boundary
\n            Newline
\t            Tab

Modifiers

$             End of string
^             Start of string
ab|cd         Matches ab or cd
[ab-d]        One character of: a, b, c, d
[^ab-d]       One character except: a, b, c, d
()            Items within parentheses are retrieved
(a(bc))       Items within the sub-parentheses are retrieved

Repetition (quantifiers)

[ab]{2}       Exactly 2 continuous occurrences of a or b
[ab]{2,5}     2 to 5 continuous occurrences of a or b
[ab]{2,}      2 or more continuous occurrences of a or b
+             One or more
*             Zero or more
?             0 or 1
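
To make a few of the table entries concrete, here is a short sketch that exercises anchors, alternation, character sets, nested groups, and a counted quantifier. The `code_format` pattern describes a made-up ID format and is purely an assumption for illustration.

```python
import re

# ^ and $ anchor the pattern to the start and end of the string.
code_format = re.compile(r'^(COM|MAT|ENG)[0-9]{3}$')  # hypothetical ID format

valid = code_format.search('COM101') is not None
invalid = code_format.search('xCOM101') is not None   # extra leading character
print(valid, invalid)

in_set = re.findall(r'[ab-d]', 'abcdef')       # characters a, b, c, d
not_in_set = re.findall(r'[^ab-d]', 'abcdef')  # everything else
nested = re.findall(r'(a(bc))', 'abc abc')     # outer and inner group per match
repeated = re.findall(r'[ab]{2,}', 'ab aab b') # runs of at least two a/b
print(in_set, not_in_set, nested, repeated)
```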

9. Regular expression examples

9.1. Any character (except newline)

text = 'ziiai.com'
print(re.findall('.', text))  # .   Any character except for a new line
print(re.findall('...', text))
#> ['z', 'i', 'i', 'a', 'i', '.', 'c', 'o', 'm']
#> ['zii', 'ai.', 'com']

9.2. A period (".")

text = 'ziiai.com'
print(re.findall(r'\.', text))  # matches a period
print(re.findall(r'[^\.]', text))  # matches anything but a period
#> ['.']
#> ['z', 'i', 'i', 'a', 'i', 'c', 'o', 'm']

9.3. Any digit

text = '01, Jan 2015'
print(re.findall(r'\d+', text))  # \d  Any digit. The + mandates at least 1 digit.
#> ['01', '2015']

9.4. Anything but a digit

text = '01, Jan 2015'
print(re.findall(r'\D+', text))  # \D  Anything but a digit
#> [', Jan ']

9.5. Any word character (including digits)

text = '01, Jan 2015'
print(re.findall(r'\w+', text))  # \w  Any word character (letter, digit, or underscore)
#> ['01', 'Jan', '2015']

9.6. Anything but a word character

text = '01, Jan 2015'
print(re.findall(r'\W+', text))  # \W  Anything but a word character
#> [', ', ' ']

9.7. Character sets

text = '01, Jan 2015'
print(re.findall('[a-zA-Z]+', text))  # [] Matches any character inside
#> ['Jan']

9.8. Number of consecutive occurrences

text = '01, Jan 2015'
print(re.findall(r'\d{4}', text))  # {n} Matches exactly n repetitions.
print(re.findall(r'\d{2,4}', text))  # {m,n} Matches m to n repetitions.
#> ['2015']
#> ['01', '2015']

9.9. One or more occurrences

print(re.findall(r'Co+l', 'So Cooool'))  # Match for 1 or more occurrences
#> ['Cooool']

9.10. Zero or more occurrences

print(re.findall(r'Pi*lani', 'Pilani'))
#> ['Pilani']

9.11. Zero or one occurrence

print(re.findall(r'colou?r', 'color'))
#> ['color']

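9.12. Matching word boundaries

"\b" is commonly used to detect and match the beginning or end of a word. That is, a position where one side is a word character and the other side is a non-word character (such as whitespace) or the start or end of the string.

For example, "\btoy" matches the "toy" in "toy cat" but not the one in "tolstoy". To match the "toy" in "tolstoy", use "toy\b" instead.

Similarly, "\B" matches any position that is not a word boundary.

So what regular expression matches only the standalone word "toy" in "play toy broke toys"? Put a boundary on both sides, as follows.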

re.findall(r'\btoy\b', 'play toy broke toys')  # match toy with boundary on both sides
#> ['toy']
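
The other boundary cases described in this section can be checked the same way. The sketch below uses short made-up strings ("antoynet" is just a nonsense word that contains "toy" in the middle).

```python
import re

# \b before the word: only a word-initial "toy" qualifies.
leading = re.findall(r'\btoy', 'toy cat tolstoy')

# \b after the word: both occurrences end a word.
trailing = re.findall(r'toy\b', 'toy cat tolstoy')

# \B on both sides: only a "toy" buried inside a longer word qualifies.
inner = re.findall(r'\Btoy\B', 'antoynet toy')

print(leading, trailing, inner)
```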

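10. Summary

This article has tried to introduce the use of regular expressions in Python as simply and clearly as possible. Hopefully you found it useful; consider bookmarking it for future reference.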

