Overview

  • Preprocessing is a critical step that transforms and cleans raw text so that it becomes easier to understand and more efficient to work with.
  • These preprocessing techniques reduce the complexity of the language data, increase the efficiency of computational processes, and often enhance the performance of NLP models. However, they need to be applied judiciously, keeping in mind the requirements of the specific NLP task.
  • Below, we will look at a few of the ways raw text is preprocessed in NLP.

Stemming

  • Stemming is a fundamental and heuristic process in Natural Language Processing (NLP) utilized to reduce words to their root or base form. This is accomplished by removing the ends of words, which often include suffixes. For instance, the stem of the words “jumps,” “jumping,” and “jumped” is “jump.” Stemming aids in reducing the corpus of words that a model needs to process, thereby enhancing computational efficiency.
  • However, stemming can be a rudimentary method, as it merely truncates the ends of words using simple rules without understanding the context. This can lead to erroneous outputs where the stemmed word is not a valid word or has a different meaning. Note that the Porter Stemming algorithm does not maintain a lookup table of the actual stems of each word; it applies algorithmic rules to generate stems.

The Porter Stemming Algorithm

  • The Porter Stemming algorithm, developed by Martin Porter in 1980, is one of the most widely used stemming algorithms. It processes words through a series of steps, each containing transformation rules that specify conditions under which a suffix should be removed or replaced. This iterative approach ensures that words are reduced to their root form in a systematic manner.

Example Implementation Using NLTK

  • To demonstrate the application of the Porter Stemming algorithm, we use the Natural Language Toolkit (NLTK) in Python. Below is an example of how to implement Porter’s Stemming:
# Import necessary modules
import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Instantiate a PorterStemmer object
stemmer = PorterStemmer()

# Examples of stemming individual words
print(stemmer.stem("cat"))     # -> cat
print(stemmer.stem("cats"))    # -> cat
print(stemmer.stem("walking")) # -> walk
print(stemmer.stem("walked"))  # -> walk
print(stemmer.stem("achieve")) # -> achiev
print(stemmer.stem("am"))      # -> am
print(stemmer.stem("is"))      # -> is
print(stemmer.stem("are"))     # -> are

# Stemming all words in a text
text = "The cats are sleeping. What are the dogs doing?"
tokens = word_tokenize(text)
tokens_stemmed = [stemmer.stem(token) for token in tokens]
print(tokens_stemmed) # ['the', 'cat', 'are', 'sleep', '.', 'what', 'are', 'the', 'dog', 'do', '?']
  • This implementation showcases how the Porter Stemming algorithm trims suffixes to achieve a common base form. The results illustrate the effectiveness of the algorithm in normalizing text for NLP applications.

Considerations and Alternatives

  • While Porter’s Stemming algorithm is efficient and widely used, it is important to recognize its limitations in terms of context and accuracy. For more advanced linguistic processing, lemmatization may be preferred as it considers the context and returns valid lemmas of words. Additionally, for languages other than English, the SnowballStemmer can be employed to perform stemming.
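  • As a quick illustration, below is a minimal sketch of NLTK's SnowballStemmer applied to a language other than English; the example word and its expected stem are illustrative assumptions rather than output taken from the original text.
from nltk.stem import SnowballStemmer

# Snowball provides stemmers for several languages besides English
print(SnowballStemmer.languages)

# Stem a Spanish word with the Spanish Snowball stemmer
spanish_stemmer = SnowballStemmer("spanish")
print(spanish_stemmer.stem("caminando")) # expected -> camin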

Conclusion

  • Stemming is a critical preprocessing step in NLP that reduces the complexity of text data by converting words to their base forms. The Porter Stemming algorithm, with its rule-based approach, is a popular choice for this task. However, practitioners should be aware of its limitations and consider alternative methods when appropriate to ensure the accuracy and relevance of their NLP models.
  • For further reading and practical examples, refer to nlplanet.org.

Lemmatization

  • Lemmatization, similar to stemming, is used to reduce a word to its base form, but it considers the morphological analysis of the words. It returns the lemma of the word, which is the dictionary form or the base form. The process involves understanding the context and part of speech of the word, making it more complex and accurate than stemming.
  • For example, the word “better” has “good” as its lemma. Lemmatization would correctly identify this, while stemming would not. However, the complexity of lemmatization can also make it computationally more expensive than stemming.
  • Let’s see lemmatization in action in the code snippet below (source: nlplanet.org).
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# We’ll use the WordNetLemmatizer which leverages WordNet to find existing lemmas. Then, we create an instance of the WordNetLemmatizer class and use the lemmatize method.

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("achieve")) # -> achieve
achieve

# The lemmatizer is able to reduce the word “achieve” to its lemma “achieve”, differently from stemmers which reduce it to the non-existing word “achiev”.
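  • Because the WordNetLemmatizer treats words as nouns by default, passing the part of speech can change the result. The sketch below continues the example above and uses WordNet's part-of-speech code "a" for adjectives; it illustrates the “better” -> “good” case mentioned earlier.
# Without a part of speech, "better" is treated as a noun and left unchanged
print(lemmatizer.lemmatize("better")) # -> better

# Telling the lemmatizer that "better" is an adjective returns its lemma "good"
print(lemmatizer.lemmatize("better", pos="a")) # -> good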

Stopwords

  • Stopwords in NLP are the words that are filtered out before processing a text. These are usually the most common words in a language like “is”, “in”, “an”, “the”, “and”, etc. These words do not carry significant meaning and are usually removed from texts to help reduce the dataset size and improve computational efficiency.
  • However, in some NLP tasks, like sentiment analysis, stopwords can carry significant meaning, and removing them could potentially affect the performance of the model. Therefore, the removal of stopwords should be carefully considered based on the task at hand.
  • Let’s see stopwords in action in the code snippet below (source: nlplanet.org).
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords


# Then, we retrieve the stopwords for the English language with stopwords.words("english"). There are 179 stopwords in total, which are words (note that they are all in lowercase) that are very common in different types of English texts.

english_stopwords = stopwords.words('english')
print(f"There are {len(english_stopwords)} stopwords in English")
print(english_stopwords[:10])

# Output:
# There are 179 stopwords in English
# ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
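  • As a minimal sketch (not part of the original tutorial), the stopword list above can be used to filter a tokenized sentence:
from nltk.tokenize import word_tokenize

text = "The cats are sleeping in the house"
tokens = word_tokenize(text)

# Keep only the tokens that are not stopwords (the comparison is done in lowercase)
filtered_tokens = [token for token in tokens if token.lower() not in english_stopwords]
print(filtered_tokens) # -> ['cats', 'sleeping', 'house']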

Tokenization

  • Tokenization is the process of breaking up text into smaller pieces called tokens. These tokens could be sentences or words. This is often the first step in NLP preprocessing.
  • For example, sentence tokenization breaks a paragraph into individual sentences. Word tokenization breaks a sentence into individual words.
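  • Below is a short sketch using NLTK's sent_tokenize and word_tokenize functions (reusing the example sentence from the stemming section) to illustrate both kinds of tokenization:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

text = "The cats are sleeping. What are the dogs doing?"

# Sentence tokenization: split the text into sentences
print(sent_tokenize(text)) # -> ['The cats are sleeping.', 'What are the dogs doing?']

# Word tokenization: split the text into words and punctuation tokens
print(word_tokenize(text)) # -> ['The', 'cats', 'are', 'sleeping', '.', 'What', 'are', 'the', 'dogs', 'doing', '?']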

Lowercasing

  • Lowercasing is a common preprocessing step where all the text is converted to lower case. This helps to avoid having multiple copies of the same words. For example, “House”, “house”, and “HOUSE” will be considered as different words unless they are all converted into the same case, preferably lower case.
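  • A minimal sketch of lowercasing with plain Python:
words = ["House", "house", "HOUSE"]

# Convert every word to lowercase so all three variants collapse into one form
lowercased = [word.lower() for word in words]
print(lowercased) # -> ['house', 'house', 'house']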

Punctuation Removal

  • Punctuation can provide grammatical context to a sentence, which supports our understanding. However, for a vectorizer that counts words rather than modeling their context, punctuation does not add value, so special characters are typically removed.
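  • A minimal sketch of punctuation removal using Python's standard library (the example sentence is illustrative):
import string

text = "Hello, world! Is this... punctuation?"

# Remove every character listed in string.punctuation
no_punct = text.translate(str.maketrans("", "", string.punctuation))
print(no_punct) # -> Hello world Is this punctuation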

Spell Check and Correction

  • Typos and spelling mistakes are common in text data. Spell check and correction can be used to correct these errors. This step can help in reducing multiple copies of words. For example, “speling” and “spelling” will be considered as two different words unless corrected.
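  • As a hedged sketch, spell correction can be performed with a third-party library such as pyspellchecker; the original text does not prescribe a specific library, so this is just one option.
# Assumes: pip install pyspellchecker
from spellchecker import SpellChecker

spell = SpellChecker()

# Suggest the most likely correction for a misspelled word
print(spell.correction("speling"))  # expected -> spelling
print(spell.correction("spelling")) # expected -> spelling (already correct)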

Noise Removal

  • Noise removal is about removing characters, digits, and pieces of text that can interfere with your text analysis. It can be performed in various ways, including the removal of text file headers and footers, HTML or XML markup, etc.
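  • A minimal sketch of noise removal with regular expressions (the patterns and the example string are illustrative assumptions):
import re

raw = "<p>Order #123 shipped!</p> Visit https://example.com"

clean = re.sub(r"<[^>]+>", " ", raw)       # strip HTML tags
clean = re.sub(r"http\S+", " ", clean)     # strip URLs
clean = re.sub(r"\d+", " ", clean)         # strip digits
clean = re.sub(r"\s+", " ", clean).strip() # collapse extra whitespace
print(clean) # -> Order # shipped! Visit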

Text Normalization

  • Text normalization includes converting all text to the same case (usually lowercase), removing punctuation, converting numbers to their word equivalents, and so on.
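  • A minimal sketch of a simple normalization pipeline covering casing, punctuation, and whitespace; converting numbers to their word equivalents would require an additional step (e.g. a dedicated library) and is omitted here.
import string

def normalize(text):
    text = text.lower()                                               # same case for every word
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    return " ".join(text.split())                                     # collapse extra whitespace

print(normalize("The HOUSE costs   $100,000!")) # -> the house costs 100000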

Part-of-Speech (POS) Tagging

  • Part-of-speech tagging involves identifying the part of speech (noun, verb, adjective, etc.) of each word in a sentence. This can be important for understanding the sentence structure and can be especially useful in tasks like named entity recognition, question answering, etc.
  • We will look at this in action in the code snippet below (source: nlplanet.org).
# The NLTK library provides an easy-to-use pos_tag function that takes a text as input and returns the part-of-speech of each token in the text.

text = word_tokenize("They refuse to go")
print(nltk.pos_tag(text))

text = word_tokenize("We need the refuse permit")
print(nltk.pos_tag(text))
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('go', 'VB')]
[('We', 'PRP'), ('need', 'VBP'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]

# PRP are personal pronouns, NN are nouns, VBP are present tense verbs, VB are base-form verbs, DT are determiners, and so on. Refer to the Penn Treebank tagset for the complete list of parts-of-speech that can be returned by the pos_tag function. In the previous example, the word “refuse” is correctly tagged as a verb or a noun depending on the context.

# The pos_tag function assigns parts-of-speech to words leveraging their context (i.e. the sentences they are in), applying rules learned over tagged corpora.