RegEx Module

Python has a built-in package called re, which can be used to work with Regular Expressions. Import the re module:

import re

RegEx in Python

Once you have imported the re module, you can start using regular expressions. For example, search the string to see if it starts with “The” and ends with “Spain”:

import re
txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)
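
The example above stores the result in x but never inspects it. As a minimal sketch of how the result could be checked: re.search returns a Match object when the pattern matches and None otherwise, so a plain truth test is enough.

```python
import re

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)

# re.search returns a Match object on success and None on failure
if x:
    print("Match found:", x.group())  # x.group() is the matched text
else:
    print("No match")
```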

RegEx Functions

The re module offers a set of functions that allow us to search a string for a match:

| Function | Description |
| --- | --- |
| [findall](https://www.w3schools.com/python/python_regex.asp#findall) | Returns a list containing all matches |
| [search](https://www.w3schools.com/python/python_regex.asp#search) | Returns a Match object if there is a match anywhere in the string |
| [split](https://www.w3schools.com/python/python_regex.asp#split) | Returns a list where the string has been split at each match |
| [sub](https://www.w3schools.com/python/python_regex.asp#sub) | Replaces one or many matches with a string |
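
To make the four functions concrete, here is a small sketch that runs each of them on the same sample string. The variable names are only illustrative.

```python
import re

txt = "The rain in Spain"

# findall: every non-overlapping match, as a list of strings
all_ai = re.findall("ai", txt)        # ['ai', 'ai']

# search: the first match anywhere in the string (or None)
first_space = re.search(r"\s", txt)   # first whitespace is at index 3

# split: cut the string at every match
words = re.split(r"\s", txt)          # ['The', 'rain', 'in', 'Spain']

# sub: replace every match with another string
dashed = re.sub(r"\s", "-", txt)      # 'The-rain-in-Spain'

print(all_ai, first_space.start(), words, dashed)
```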

Metacharacters

Metacharacters are characters with a special meaning:

| Character | Description | Example |
| --- | --- | --- |
| `[]` | A set of characters | `"[a-m]"` |
| `\` | Signals a special sequence (can also be used to escape special characters) | `"\d"` |
| `.` | Any character (except the newline character) | `"he..o"` |
| `^` | Starts with | `"^hello"` |
| `$` | Ends with | `"planet$"` |
| `*` | Zero or more occurrences (n characters) | `"he.*o"` |
| `+` | One or more occurrences (1 to n characters) | `"he.+o"` |
| `?` | Zero or one occurrence (0 to 1 character) | `"he.?o"` |
| `{}` | Exactly the specified number of occurrences | `"he.{2}o"` |
| `\|` | Either or | `"falls\|stays"` |
| `()` | Capture and group | |

Example:
import re
txt = "The rain in Spain"
# Find all lowercase characters alphabetically between "a" and "m":
x = re.findall("[a-m]", txt)
print(x)
 
>>> Output:
['h', 'e', 'a', 'i', 'i', 'a', 'i']

## Example: a simple pattern made of a list of (non-contiguous) characters
txt1 = 'Python 3.13 was released on October \7, /2024.'
result = re.findall('[aze]', txt1) # the characters a, z, e
print(result)
 
>>> Output:
['a', 'e', 'e', 'a', 'e', 'e']

Note that * must be placed directly after the character it is meant to repeat:

## Example: applied to one specific character, 'n'
result = re.findall('.on*', txt1) # 'n' may repeat any number of times, or not appear at all
print(result)
 
>>> Output:
['hon', ' on', 'to']

→ In other words, any number of “n” characters is accepted.

## Example: applied to any character (.)
result = re.findall('o.*e', txt1) # 'o' followed by any run of characters, then 'e'
                                  # '*' is greedy, so the match extends to the farthest 'e'!
print(result)
 
>>> Output:
['on 3.13 was released on Octobe']

→ In other words, any number of arbitrary characters (.) is accepted.
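
The metacharacters {}, | and () from the table have no worked example yet. The sketch below tries them on a made-up sample sentence (the sentence is an assumption for illustration, not part of the original notes). Note that when a pattern contains one capturing group, findall returns the group contents rather than the whole match.

```python
import re

# Hypothetical sample text, chosen so that each pattern has something to find
txt = "hello planet, the rain falls in Spain"

# {}: exactly two characters of any kind between "he" and "o"
exact = re.findall("he.{2}o", txt)        # ['hello']

# |: match either alternative
either = re.findall("falls|stays", txt)   # ['falls']

# (): capture the word characters that come before "ain"
groups = re.findall(r"(\w+)ain", txt)     # ['r', 'Sp']

print(exact, either, groups)
```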

Flags

You can add flags to the pattern when using regular expressions.

| Flag | Shorthand | Description |
| --- | --- | --- |
| re.ASCII | re.A | Makes sequences such as \w, \b, \d and \s match ASCII characters only |
| re.DEBUG | | Displays debug information about the compiled pattern |
| re.DOTALL | re.S | Makes the . character match all characters (including the newline character) |
| re.IGNORECASE | re.I | Case-insensitive matching |
| re.MULTILINE | re.M | Makes ^ and $ match at the beginning and end of each line, not only of the whole string |
| re.NOFLAG | | Specifies that no flag is set for this pattern |
| re.UNICODE | re.U | Unicode-aware matching. This is the default in Python 3. In Python 2, use this flag to get Unicode matching |
| re.VERBOSE | re.X | Allows whitespace and comments inside patterns, which makes the pattern more readable |
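
As a short sketch of how flags change matching, the example below passes a few of them as the third argument to findall and search. The two-line sample string is an assumption chosen to show the effect of re.MULTILINE and re.DOTALL.

```python
import re

txt = "The rain\nin Spain"  # two lines, separated by a newline

# re.IGNORECASE: "spain" also matches "Spain"
ignore_case = re.findall("spain", txt, re.IGNORECASE)            # ['Spain']

# re.MULTILINE: ^ matches at the start of every line, not only of the string
first_default = re.findall(r"^\w+", txt)                         # ['The']
first_multiline = re.findall(r"^\w+", txt, re.MULTILINE)         # ['The', 'in']

# re.DOTALL: . also matches the newline character
dot_default = re.search("rain.in", txt)                          # None
dot_all = re.search("rain.in", txt, re.DOTALL)                   # matches across the newline

# Flags can be combined with the | operator
combined = re.findall("^in spain$", txt, re.IGNORECASE | re.MULTILINE)  # ['in Spain']

print(ignore_case, first_default, first_multiline)
print(dot_default, repr(dot_all.group()), combined)
```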

Special Sequences

A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

| Character | Description | Example |
| --- | --- | --- |
| `\A` | Returns a match if the specified characters are at the beginning of the string | `"\AThe"` |
| `\b` | Returns a match where the specified characters are at the beginning or at the end of a word (the “r” prefix makes sure that the string is treated as a “raw string”) | `r"\bain"`, `r"ain\b"` |
| `\B` | Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word (the “r” prefix makes sure that the string is treated as a “raw string”) | `r"\Bain"`, `r"ain\B"` |
| `\d` | Returns a match for any digit (numbers from 0-9) | `"\d"` |
| `\D` | Returns a match for any character that is NOT a digit | `"\D"` |
| `\s` | Returns a match for any white space character | `"\s"` |
| `\S` | Returns a match for any character that is NOT a white space character | `"\S"` |
| `\w` | Returns a match for any word character (characters from a to Z, digits from 0-9, and the underscore _ character) | `"\w"` |
| `\W` | Returns a match for any character that is NOT a word character | `"\W"` |
| `\Z` | Returns a match if the specified characters are at the end of the string | `"Spain\Z"` |

Example:
import re
txt = "The rain in Spain"
# Check if "ain" is present at the end of a WORD:
x = re.findall(r"ain\b", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")
 
>>> Output: 
['ain', 'ain']
Yes, there is at least one match!

import re
txt = "The rain in Spain"
# Check if "ain" is present at the beginning of a WORD:
x = re.findall(r"\bain", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")
 
>>> Output:
[]
No match
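
The two examples above cover \b only. As a rough sketch, the remaining sequences can be tried the same way. The string order_txt is a made-up sample (an assumption, chosen because it contains digits and an underscore).

```python
import re

txt = "The rain in Spain"
order_txt = "Order 66 shipped to Spain_01"  # hypothetical sample with digits

digits = re.findall(r"\d", order_txt)       # ['6', '6', '0', '1']
numbers = re.findall(r"\d+", order_txt)     # ['66', '01']
spaces = re.findall(r"\s", order_txt)       # four single spaces
words = re.findall(r"\w+", order_txt)       # ['Order', '66', 'shipped', 'to', 'Spain_01']

# \A and \Z anchor to the start and end of the whole string
at_start = re.findall(r"\AThe", txt)        # ['The']
at_end = re.findall(r"Spain\Z", txt)        # ['Spain']

# \B: "ain" NOT at the beginning of a word, then NOT at the end of a word
inside = re.findall(r"\Bain", txt)          # ['ain', 'ain']
not_end = re.findall(r"ain\B", txt)         # []

print(digits, numbers, len(spaces), words)
print(at_start, at_end, inside, not_end)
```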

Sets

A set is a set of characters inside a pair of square brackets [] with a special meaning:

| Set | Description |
| --- | --- |
| `[arn]` | Returns a match where one of the specified characters (a, r, or n) is present |
| `[a-n]` | Returns a match for any lower case character, alphabetically between a and n |
| `[^arn]` | Returns a match for any character EXCEPT a, r, and n |
| `[0123]` | Returns a match where any of the specified digits (0, 1, 2, or 3) are present |
| `[0-9]` | Returns a match for any digit between 0 and 9 |
| `[0-5][0-9]` | Returns a match for any two-digit number from 00 to 59 |
| `[a-zA-Z]` | Returns a match for any character alphabetically between a and z, lower case OR upper case |
| `[+]` | In sets, `+`, `*`, `.`, `\|`, `()`, `$`, `{}` have no special meaning, so `[+]` means: return a match for any + character in the string |
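
A quick sketch of the sets above in use. The sample string is an assumption, picked so that it contains letters, a time-like pair of two-digit numbers, and a literal + sign.

```python
import re

txt = "Spain 08:45 +1"  # hypothetical sample text

any_of = re.findall("[arn]", txt)           # ['a', 'n']
in_range = re.findall("[a-n]", txt)         # ['a', 'i', 'n']
negated = re.findall("[^arn]", "rain")      # ['i']
two_digits = re.findall("[0-5][0-9]", txt)  # ['08', '45']
letters = re.findall("[a-zA-Z]", txt)       # ['S', 'p', 'a', 'i', 'n']
plus_sign = re.findall("[+]", txt)          # ['+'], no escaping needed inside a set

print(any_of, in_range, negated, two_digits, letters, plus_sign)
```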