2.1. String
2.1.1. String find: Find The Index of a Substring in a Python String
If you want to find the index of a substring in a string, use the find() method. It returns the index of the first occurrence of the substring, or -1 if the substring is not found.
sentence = "Today is Saturday"
# Find the index of the first occurrence of the substring
sentence.find("day")
2
# No match is found, so find() returns -1
sentence.find("nice")
-1
You can also provide the starting and stopping positions of the search as optional second and third arguments:
# Start searching for the substring at index 3
sentence.find("day", 3)
14
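The example above passes only a starting position. As a minimal sketch of the stopping position, the snippet below uses a separate, made-up string (`phrase` is an illustration, not part of the original example) and also shows `str.rfind()`, which searches from the right.

```python
phrase = "day after day after day"

first = phrase.find("day")            # search the whole string
second = phrase.find("day", 1)        # skip index 0, so the next match is found
bounded = phrase.find("day", 1, 12)   # look only inside phrase[1:12]
last = phrase.rfind("day")            # scan from the right instead

print(first, second, bounded, last)   # 0 10 -1 20
```

The stopping position is exclusive, so the match at index 10 (which ends at index 12) falls outside `phrase[1:12]` and find() returns -1.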
2.1.2. re.sub: Replace One String with Another String Using Regular Expression
If you want to replace one string with another or reorder parts of a string, use re.sub. re.sub allows you to use a regular expression to specify the pattern of the string you want to replace.
In the code below, I replace 3/7/2021 with Sunday, and then rewrite 3/7/2021 as 2021-3-7 by referring to the captured groups with \1, \2, and \3.
import re
text = "Today is 3/7/2021"
match_pattern = r"(\d+)/(\d+)/(\d+)"
re.sub(match_pattern, "Sunday", text)
'Today is Sunday'
re.sub(match_pattern, r"\3-\1-\2", text)
'Today is 2021-3-7'
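re.sub also accepts a function as the replacement, which helps when backreferences alone are not enough. The sketch below zero-pads the month and day. The helper name `to_iso` is made up for illustration, and it assumes the date is written as month/day/year.

```python
import re

text = "Today is 3/7/2021"
match_pattern = r"(\d+)/(\d+)/(\d+)"

def to_iso(match):
    """Rebuild a month/day/year match as zero-padded year-month-day."""
    # Assumption: the three groups are month, day, year, in that order
    month, day, year = match.groups()
    return f"{year}-{int(month):02d}-{int(day):02d}"

# re.sub calls to_iso once per match and inserts its return value
iso_text = re.sub(match_pattern, to_iso, text)
print(iso_text)  # Today is 2021-03-07
```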
2.1.3. difflib.SequenceMatcher: Detect The “Almost Similar” Articles
When analyzing articles, you may find some that are almost similar but not 100% identical, perhaps because of small grammatical differences or a change in two or three words (as with cross-posting). How can we detect the “almost similar” articles and drop one of them? That is when difflib.SequenceMatcher comes in handy. Its ratio() method returns a similarity score between 0 and 1, where 1 means the two strings are identical.
from difflib import SequenceMatcher
text1 = 'I am Khuyen'
text2 = 'I am Khuen'
print(SequenceMatcher(a=text1, b=text2).ratio())
0.9523809523809523
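To actually drop one of two near-identical articles, you could compare each article against the ones already kept. This is only a rough sketch: the function name `drop_near_duplicates` and the 0.9 threshold are assumptions chosen for illustration, not values from difflib.

```python
from difflib import SequenceMatcher

def drop_near_duplicates(articles, threshold=0.9):
    """Keep an article only if it is not too similar to one already kept."""
    kept = []
    for article in articles:
        is_duplicate = any(
            SequenceMatcher(a=article, b=other).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(article)
    return kept

articles = ["I am Khuyen", "I am Khuen", "Python is fun"]
unique_articles = drop_near_duplicates(articles)
print(unique_articles)  # ['I am Khuyen', 'Python is fun']
```

Every article is compared with every kept article, so the cost grows quadratically with the number of articles. That is fine for a small collection but may be slow for a large one.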
2.1.4. difflib.get_close_matches: Get a List of the Best Matches for a Certain Word
If you want to get a list of the best matches for a certain word, use difflib.get_close_matches.
from difflib import get_close_matches
tools = ['pencil', 'pen', 'eraser', 'ink']
get_close_matches('pencel', tools)
['pencil', 'pen']
To get only closer matches, increase the value of the cutoff argument (default 0.6).
get_close_matches('pencel', tools, cutoff=0.8)
['pencil']
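get_close_matches also takes an n argument (default 3) that caps how many matches are returned. One possible use is a small "did you mean" helper. The function name `suggest` and the word list below are made up for illustration.

```python
from difflib import get_close_matches

def suggest(word, candidates):
    """Return the single best match, or None when nothing is close enough."""
    # n=1 keeps only the highest-scoring candidate above the default cutoff
    matches = get_close_matches(word, candidates, n=1)
    return matches[0] if matches else None

known_words = ['pencil', 'pen', 'ink']
print(suggest('pencel', known_words))  # pencil
print(suggest('laptop', known_words))  # None
```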