Link: https://www.w3schools.com/python/python_regex.asp

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. RegEx can be used to check if a string contains the specified search pattern.

RegEx Module

Python has a built-in module called re, which can be used to work with regular expressions. Import the re module:

import re

RegEx in Python

Once you have imported the re module, you can start using regular expressions.

Example: search the string to see if it starts with “The” and ends with “Spain”:

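import re
txt = "The rain in Spain"
# ^ anchors the start, .* matches anything in between, $ anchors the end
x = re.search(r"^The.*Spain$", txt)
print(x is not None)  # True: txt starts with "The" and ends with "Spain"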

RegEx Functions

The re module offers a set of functions that allows us to search a string for a match:

Function	Description
[`findall`](https://www.w3schools.com/python/python_regex.asp#findall)	Returns a list containing all matches
[`search`](https://www.w3schools.com/python/python_regex.asp#search)	Returns a Match object for the first match anywhere in the string, or None if there is no match
[`split`](https://www.w3schools.com/python/python_regex.asp#split)	Returns a list where the string has been split at each match
[`sub`](https://www.w3schools.com/python/python_regex.asp#sub)	Replaces one or more matches with a string
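
The four functions in the table can be compared side by side on the sample string used throughout these notes. This is a minimal sketch; the variable names and the replacement character "9" are only illustrative choices.

```python
import re

txt = "The rain in Spain"

# findall: every non-overlapping match, returned as a list of strings
all_ai = re.findall(r"ai", txt)

# search: a Match object for the first match, or None if nothing matches
first_space = re.search(r"\s", txt)
first_space_pos = first_space.start() if first_space else None

# split: cut the string at each match of the pattern
words = re.split(r"\s", txt)

# sub: replace each match with the given string
replaced = re.sub(r"\s", "9", txt)

print(all_ai)           # ['ai', 'ai']
print(first_space_pos)  # 3
print(words)            # ['The', 'rain', 'in', 'Spain']
print(replaced)         # The9rain9in9Spain
```

Note that search returns None when there is no match, so the result should be checked before calling methods such as start() on it.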

Metacharacters

Metacharacters are characters with a special meaning:

Character	Description	Example
`[]`	A set of characters	`"[a-m]"`
`\`	Signals a special sequence (can also be used to escape special characters)	`"\d"`
`.`	Any character (except newline character)	`"he..o"`
`^`	Starts with	`"^hello"`
`$`	Ends with	`"planet$"`
`*`	Zero or more occurrences (0 to n characters)	`"he.*o"`
`+`	One or more occurrences (1 to n characters)	`"he.+o"`
`?`	Zero or one occurrence (0 or 1 character)	`"he.?o"`
`{}`	Exactly the specified number of occurrences	`"he.{2}o"`
`|`	Either or	`"falls|stays"`
`()`	Capture and group
Example:

import re
txt = "The rain in Spain"
# Find all lower case characters alphabetically between "a" and "m":
x = re.findall("[a-m]", txt)
print(x)
 
>>> Output:
['h', 'e', 'a', 'i', 'i', 'a', 'i']

## Example: a simple pattern made of a list of (non-contiguous) characters
txt1 = 'Python 3.13 was released on October \7, /2024.'
result = re.findall('[aze]', txt1) # the characters a, z, e
print(result)
 
>>> Output:
['a', 'e', 'e', 'a', 'e', 'e']
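
The remaining metacharacters from the table (`^`, `$`, `|`, `{}`, `()`) can be tried the same way. The sketch below reuses the sample string from the first example; the patterns themselves are illustrative and not taken from the original notes.

```python
import re

txt = "The rain in Spain"

# ^ and $ anchor the pattern to the start and end of the string
starts_with_the = bool(re.search(r"^The", txt))
ends_with_rain = bool(re.search(r"rain$", txt))

# | matches either alternative
either = re.findall(r"rain|Spain", txt)

# {2} requires exactly two occurrences of the preceding token (here: any character)
a_plus_two = re.findall(r"a.{2}", txt)

# () captures the parts of the match inside the parentheses
m = re.search(r"(\w+) in (\w+)", txt)
captured = m.groups() if m else None

print(starts_with_the)  # True
print(ends_with_rain)   # False
print(either)           # ['rain', 'Spain']
print(a_plus_two)       # ['ain', 'ain']
print(captured)         # ('rain', 'Spain')
```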

Note that * must be placed directly after the character it should repeat:

## Example: applied to one specific character, 'n'
result = re.findall('.on*', txt1) # 'n' may repeat any number of times or not appear at all
print(result)
 
>>> Output:
['hon', ' on', 'to']

→ In other words, any number of “n” characters is accepted, including none.

## Example: applied to any character (.)
result = re.findall('o.*e', txt1) # 'o' followed by an arbitrary run of characters, then 'e'
                                  # because '*' is greedy, the match extends to the farthest 'e'!
print(result)
 
>>> Output:
['on 3.13 was released on Octobe']

→ In other words, any number of arbitrary characters (.) is accepted.
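
The greedy behaviour described above can be contrasted with the non-greedy form `*?`, which stops at the nearest possible end instead of the farthest. This is a minimal sketch; the sample sentence is an assumed clean version of txt1 without the stray `\` and `/` characters, and the non-greedy quantifier is an addition that the notes do not cover.

```python
import re

# Assumed clean variant of txt1 (hypothetical, for illustration only)
sentence = "Python 3.13 was released on October 7, 2024."

# Greedy: .* grabs as much as possible, so the match runs to the last 'e'
greedy = re.findall(r"o.*e", sentence)

# Non-greedy: .*? grabs as little as possible, so each match ends at the next 'e'
lazy = re.findall(r"o.*?e", sentence)

print(greedy)  # ['on 3.13 was released on Octobe']
print(lazy)    # ['on 3.13 was re', 'on Octobe']
```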

Flags

You can add flags to the pattern when using regular expressions.

Flag	Shorthand	Description
`re.ASCII`	`re.A`	Makes `\w`, `\W`, `\b`, `\B`, `\d`, `\D`, `\s`, and `\S` match ASCII characters only
`re.DEBUG`		Displays debug information about the compiled expression
`re.DOTALL`	`re.S`	Makes the . character match all characters (including the newline character)
`re.IGNORECASE`	`re.I`	Case-insensitive matching
`re.MULTILINE`	`re.M`	Makes `^` and `$` match at the beginning and end of each line, not only at the beginning and end of the string
`re.NOFLAG`		Specifies that no flag is set for this pattern (available from Python 3.11)
`re.UNICODE`	`re.U`	Enables Unicode matching. This is the default for string patterns in Python 3, so the flag is redundant there. In Python 2, use this flag to enable Unicode matching
`re.VERBOSE`	`re.X`	Allows whitespace and comments inside patterns, which makes the pattern more readable
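
Flags are passed as an extra argument to the re functions, and several flags can be combined with `|`. The sketch below tries four of them on a two-line variant of the sample string; the string and variable names are illustrative assumptions.

```python
import re

# Two-line variant of the sample string (hypothetical, for illustration only)
txt = "The rain\nin Spain"

# re.IGNORECASE: "the" also matches "The"
ignore_case = re.findall(r"the", txt, re.IGNORECASE)

# re.MULTILINE: ^ matches at the start of every line, not only the string
first_word_only = re.findall(r"^\w+", txt)
first_word_per_line = re.findall(r"^\w+", txt, re.MULTILINE)

# re.DOTALL: . also matches the newline character
dot_default = re.search(r"rain.in", txt)
dot_all = re.search(r"rain.in", txt, re.DOTALL)

# re.VERBOSE: whitespace and comments inside the pattern are ignored
pattern = re.compile(r"""
    ^The     # starts with The
    .*       # anything in between (including the newline, thanks to DOTALL)
    Spain$   # ends with Spain
""", re.VERBOSE | re.DOTALL)
verbose_match = pattern.search(txt)

print(ignore_case)           # ['The']
print(first_word_only)       # ['The']
print(first_word_per_line)   # ['The', 'in']
print(dot_default)           # None
print(dot_all is not None)   # True
print(verbose_match is not None)  # True
```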

Special Sequences

A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

Character	Description	Example
`\A`	Returns a match if the specified characters are at the beginning of the string	`"\AThe"`
`\b`	Returns a match where the specified characters are at the beginning or at the end of a word (the “r” at the beginning makes sure that the string is treated as a “raw string”)	`r"\bain"` `r"ain\b"`
`\B`	Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word (the “r” at the beginning makes sure that the string is treated as a “raw string”)	`r"\Bain"` `r"ain\B"`
`\d`	Returns a match where the string contains digits (numbers from 0-9)	`"\d"`
`\D`	Returns a match where the string DOES NOT contain digits	`"\D"`
`\s`	Returns a match where the string contains a white space character	`"\s"`
`\S`	Returns a match where the string DOES NOT contain a white space character	`"\S"`
`\w`	Returns a match where the string contains any word characters (letters from a to z and A to Z, digits from 0-9, and the underscore _ character; for string patterns, other Unicode letters and digits also count unless `re.ASCII` is set)	`"\w"`
`\W`	Returns a match where the string DOES NOT contain any word characters	`"\W"`
`\Z`	Returns a match if the specified characters are at the end of the string	`"Spain\Z"`
Example:

import re
txt = "The rain in Spain"
# Check if "ain" is present at the end of a WORD:
x = re.findall(r"ain\b", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")
 
>>> Output: 
['ain', 'ain']
Yes, there is at least one match!

import re
txt = "The rain in Spain"
# Check if "ain" is present at the beginning of a WORD:
x = re.findall(r"\bain", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")
 
>>> Output:
[]
No match
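
The other special sequences from the table can be tried in the same way. The following is a minimal sketch; the sample string and variable names are hypothetical and chosen only so that digits, words, and whitespace all appear.

```python
import re

# Hypothetical sample string containing letters, digits, and spaces
txt = "Room 101 opens at 9 AM"

# \d matches a digit; \d+ matches a run of digits
digits = re.findall(r"\d+", txt)

# \D matches anything that is not a digit; here each one is removed
only_digits = re.sub(r"\D", "", txt)

# \w+ matches a run of word characters; \s matches a whitespace character
words = re.findall(r"\w+", txt)
space_count = len(re.findall(r"\s", txt))

# \A anchors at the very start of the string, \Z at the very end
starts_with_room = bool(re.search(r"\ARoom", txt))
ends_with_am = bool(re.search(r"AM\Z", txt))

print(digits)            # ['101', '9']
print(only_digits)       # 1019
print(words)             # ['Room', '101', 'opens', 'at', '9', 'AM']
print(space_count)       # 5
print(starts_with_room)  # True
print(ends_with_am)      # True
```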

Sets

A set is a set of characters inside a pair of square brackets [] with a special meaning:

Set	Description
[arn]	Returns a match where one of the specified characters (`a`, `r`, or `n`) is present
[a-n]	Returns a match for any lower case character, alphabetically between `a` and `n`
[^arn]	Returns a match for any character EXCEPT `a`, `r`, and `n`
[0123]	Returns a match where any of the specified digits (`0`, `1`, `2`, or `3`) are present
[0-9]	Returns a match for any digit between `0` and `9`
[0-5][0-9]	Returns a match for any two-digit number from `00` to `59`
[a-zA-Z]	Returns a match for any character alphabetically between `a` and `z`, lower case OR upper case
[+]	In sets, `+`, `*`, `.`, `|`, `()`, `$`, `{}` have no special meaning, so `[+]` means: return a match for any `+` character in the string
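
A few rows of the table can be checked directly with findall. This is a minimal sketch; apart from "The rain in Spain", the sample strings are hypothetical and chosen only to illustrate each set.

```python
import re

txt = "The rain in Spain"

# [arn]: any one of the listed characters
any_of = re.findall(r"[arn]", txt)

# [^arn]: any character EXCEPT the listed ones (tried on a single word)
none_of = re.findall(r"[^arn]", "rain")

# [0-5][0-9]: two-digit numbers from 00 to 59
minutes = re.findall(r"[0-5][0-9]", "8 times before 11:45 AM")

# [a-zA-Z]: any letter, lower case OR upper case
letters = re.findall(r"[a-zA-Z]", "8 AM")

# [+]: inside a set, + is a literal plus sign, not a quantifier
plus_signs = re.findall(r"[+]", "1 + 1 = 2")

print(any_of)      # ['r', 'a', 'n', 'n', 'a', 'n']
print(none_of)     # ['i']
print(minutes)     # ['11', '45']
print(letters)     # ['A', 'M']
print(plus_signs)  # ['+']
```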

An Hoai Thai's Notes
