Part 1: Natural Language Processing Foundations, Concepts, and Preprocessing

This 5-part series covers concepts, analysis, and machine learning models for Natural Language Processing. It grew out of a set of lectures I gave at Emory University and is intended as a practical introduction to NLP with Python, covering fundamental NLP terminology, tools, and data processing. In this note, I will cover the following concepts:

Formal Concepts and Definitions

  • 1. Corpus
  • 2. Documents
  • 3. Tokens

General Text Preprocessing Techniques

  • 1. Tokenization: Word, Sentence, Character
  • 2. Stemming and Lemmatization
  • 3. Stop words and how to deal with them
  • 4. Punctuation and how to deal with it

Introducing NLTK for Text Preprocessing

  • 1. Combining preprocessing steps
  • 2. Implementing preprocessing steps

Dataset: Amazon Product Reviews

The dataset used in this series is a collection of Amazon product reviews. It is available at: Amazon Product Reviews Github

To begin, download the dataset from GitHub, save it in your working directory, and use the following code to read it in:

import pandas as pd
pd.__version__
'1.4.4'
review_data = pd.read_csv('./amazon_product_reviews.csv')
review_data.head()
review rating sentiment
0 All of my kids have cried non-stop when I trie... 5 1
1 We wanted to get something to keep track of ou... 5 1
2 My daughter had her 1st baby over a year ago. ... 5 1
3 One of baby's first and favorite books, and it... 4 1
4 Very cute interactive book! My son loves this ... 5 1

Now that we have our dataset, let's take a quick look at what we're working with. The dataset consists of three columns, described below:

  • review: Text review on a baby product
  • rating: Star rating from 1 to 5 (reviews with a neutral rating of 3 have been removed)
  • sentiment: Dummy variable for sentiment: 1 = positive, -1 = negative
print(review_data.rating.value_counts(normalize=True))
print(review_data.sentiment.value_counts(normalize=True)) 
5    0.381953
1    0.286083
2    0.213107
4    0.118857

Name: rating, dtype: float64
1    0.50081
-1    0.49919
Name: sentiment, dtype: float64

Overall, we have a roughly 50-50 split of positive and negative reviews. Now let's get into the natural language processing concepts.

1. Corpus

In the field of Natural Language Processing, the term 'corpus' refers to a body of text, which is a collection of text data utilized for analytical or machine learning purposes. In our specific case, the corpus consists of 53,072 reviews.

corpus = review_data.review.tolist()
  
print("The size of our corpus is:", len(corpus), "\n")
print("The first Review in our corpus is:\n", corpus[0])
The size of our corpus is: 53072
    
The first Review in our corpus is:
All of my kids have cried non-stop when I tried to ween them off their pacifier, until I found Thumbuddy To Love's Binky Fairy Puppet.  
It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from it.This is a must buy book, 
and a great gift for expecting parents!!  You will save them soo many headaches.Thanks for this book!  You all rock!!

2. Documents

Documents are the individual entities that together make up a corpus. In our case, every review in our dataset is a document; that is, we have 53,072 documents in our corpus of Amazon product reviews. Here is a sample document.

sample_document = corpus[6]
sample_document
'Try this out for a spring project !Easy ,fun and affordable wall decals ...Fine quality and brightens up any room.. 5+ **********'

3. Tokens

A token is a meaningful entity that makes up a document, similar to how words make up sentences. The choice of tokens can vary depending on the task and corpus being analyzed; they can encompass sentences, phrases, words, or even individual characters. In many NLP applications, words serve as the fundamental tokens.
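To make this concrete, here is a quick illustration on a made-up sentence (not from the dataset) showing three possible token granularities. The whitespace split is deliberately naive; proper word tokenizers are covered below.

sample = "This book is great!"

sentence_tokens = [sample]     # the whole sentence as a single token
word_tokens = sample.split()   # naive whitespace split into word-level tokens
char_tokens = list(sample)     # character-level tokens

print(word_tokens)      # -> ['This', 'book', 'is', 'great!']
print(char_tokens[:4])  # -> ['T', 'h', 'i', 's']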

General Text Preprocessing Techniques

Preprocessing refers to a set of operations applied to the body of text (the corpus) to enable meaningful extraction of features and insights from the data. Just as numerical data needs transformations, text data needs to be preprocessed: to determine the size of the vocabulary, correct misspellings, remove text that carries no useful features for the task at hand, and so forth. One of the primary ways of preprocessing text is to separate it into meaningful units through tokenization.
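As a quick illustration of why vocabulary size matters, here is a rough, naive estimate of the raw corpus vocabulary before any preprocessing. This is only a sketch based on a whitespace split, not part of the pipeline we build later.

# Naive estimate: unique lowercased, whitespace-separated strings in the raw corpus
raw_vocab = set()
for document in corpus:
    raw_vocab.update(str(document).lower().split())
print("Approximate raw vocabulary size:", len(raw_vocab))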

1. Tokenization

Tokenization is the process of dividing documents into tokens, and it is an integral part of effective preprocessing. Let's explore it in practice.

Sentence Tokenization

Now that we have a few formal definitions, let's implement some tokenization steps. We begin with sentence tokenization using nltk.

nltk is a wonderful package that provides tokenization tools that perform very well. Let's use the sent_tokenize method to tokenize one of our documents:

import nltk
nltk.download('punkt')  # sent_tokenize relies on the pre-trained Punkt models
from nltk.tokenize import sent_tokenize

print('Original Review:')
review_data.review[11]
Original Review:
"I'm a new mom and I was looking for something to record my baby's daily activities and patterns and it is perfect! It has a lot of space to write 
down anything extra beyond the diaper changes, feedings and sleeping. I like that it has an area to write down medications. It is a great reminder 
for dr appointments and to track your baby's patterns. I am extremely happy with this book and will be ordering another one for when I run out of 
pages in 3 months."
print('Sentence Tokenized:')
sent_tokenize(review_data.review[11])
Sentence Tokenized:
["I'm a new mom and I was looking for something to record my baby's daily activities and patterns and it is perfect!",
 'It has a lot of space to write down anything extra beyond the diaper changes, feedings and sleeping.',
 'I like that it has an area to write down medications.',
 "It is a great reminder for dr appointments and to track your baby's patterns.",
 'I am extremely happy with this book and will be ordering another one for when I run out of pages in 3 months.']

Notice that the resulting output is a list of sentences. sent_tokenize uses heuristic rules about the English language to separate sentences; in this case, it looks for periods and other sentence-ending punctuation such as the exclamation mark.
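To see those heuristics at work, consider a made-up example (not from the dataset). The pre-trained English model knows common abbreviations, so it should typically not split after 'Dr.', while it still splits on the sentence-ending period and exclamation mark.

tricky = "I took my baby to see Dr. Smith today. The visit went great!"
sent_tokenize(tricky)
# Typically: ['I took my baby to see Dr. Smith today.', 'The visit went great!']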

Punkt Sentence Tokenizer

Another sentence tokenizer is the PunktSentenceTokenizer which performs similarly and may be more effective in some instances.

from nltk.tokenize import PunktSentenceTokenizer

punkt_sent_tokenizer = PunktSentenceTokenizer()
punkt_sent_tokenizer.tokenize(review_data.review[11])
["I'm a new mom and I was looking for something to record my baby's daily activities and patterns and it is perfect!",
 'It has a lot of space to write down anything extra beyond the diaper changes, feedings and sleeping.',
 'I like that it has an area to write down medications.',
 "It is a great reminder for dr appointments and to track your baby's patterns.",
 'I am extremely happy with this book and will be ordering another one for when I run out of pages in 3 months.']

In this instance both tokenizers split the sentences similarly. Now let's look at word tokenizers.
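One reason PunktSentenceTokenizer can be more effective is that it is an unsupervised model that can be trained on your own text, letting it learn domain-specific abbreviations and sentence starters. Below is a minimal sketch; training on the first 1,000 reviews is an arbitrary choice for illustration.

# Train Punkt on a slice of our own corpus (slice size chosen arbitrarily)
train_text = " ".join(map(str, corpus[:1000]))  # str() guards against any non-string entries
domain_tokenizer = PunktSentenceTokenizer(train_text)
domain_tokenizer.tokenize(review_data.review[11])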

Word Tokenization with nltk

We can further tokenize our documents into even more granular tokens. Let's look at examples of word tokenizers in word_tokenize and wordpunct_tokenize methods from nltk.

from nltk.tokenize import word_tokenize, wordpunct_tokenize

print("Word Tokenizer: ", word_tokenize(corpus[0])[:10])
print("Word Punct Tokenizer: ", wordpunct_tokenize(corpus[0])[:10])
Word Tokenizer:  ['All', 'of', 'my', 'kids', 'have', 'cried', 'non-stop', 'when', 'I', 'tried']
Word Punct Tokenizer:  ['All', 'of', 'my', 'kids', 'have', 'cried', 'non', '-', 'stop', 'when']

We can observe that the result is a list of individual words extracted from the document. The output above displays the first 10 tokens produced by each tokenizer. It is worth noting that the word_tokenize function does not separate the word 'non-stop' into distinct tokens, whereas the wordpunct_tokenize function splits it into separate tokens.

2. Stemming

Stemming is the process of reducing words to their root form. It is powerful because it lets us shrink the vocabulary of a document or corpus significantly while retaining as much of the context as possible. In the example below, we apply two stemmers to sets of variants that share the same root.

Let's look at two of the stemmers available in nltk:

from nltk.stem import PorterStemmer, SnowballStemmer

porter_stem = PorterStemmer()
snowb_stem = SnowballStemmer('english') # SnowballStemmer must be initialized with a language
porter_stem.stem("loved"), porter_stem.stem("lovely"), porter_stem.stem("loveness")
('love', 'love', 'love')
snowb_stem.stem('amazing'), snowb_stem.stem('amazed'), snowb_stem.stem('amazes'), snowb_stem.stem('amazingly')
('amaz', 'amaz', 'amaz', 'amaz')
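Since the point of stemming is to shrink the vocabulary, here is a small sketch measuring that effect on our first document; it reuses porter_stem and word_tokenize from above.

words = word_tokenize(corpus[0])
print("Unique tokens before stemming:", len(set(word.lower() for word in words)))
print("Unique tokens after stemming: ", len(set(porter_stem.stem(word) for word in words)))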

3. Lemmatization

Lemmatization serves a similar purpose to stemming; however, it uses vocabulary and morphological analysis to return the dictionary base form (the lemma) of a word.

nltk.download('wordnet')  # WordNetLemmatizer relies on the WordNet corpus
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

lemmatizer.lemmatize("wolves"), lemmatizer.lemmatize("saying"), lemmatizer.lemmatize("is", pos='v')
('wolf', 'saying', 'be')
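To see the difference in practice, compare the two approaches on the same words (illustrative examples, reusing porter_stem from above); note that the lemmatizer benefits from a part-of-speech hint.

print(porter_stem.stem("wolves"), "vs", lemmatizer.lemmatize("wolves"))           # wolv vs wolf
print(porter_stem.stem("better"), "vs", lemmatizer.lemmatize("better", pos='a'))  # better vs good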

4. Stop Words

Stop words are words that are part of the grammatical structure of the language but do not carry much semantic meaning. These are often frequent words like 'is' and 'the' whose presence or absence does not change the meaning of the text. In NLP, we often deal with stop words by removing them, or by carefully selecting which ones to retain (more on that after the example below).

nltk.download('stopwords')  # the stop word lists ship as a separate nltk corpus
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sentence = "the restaurant at the city served amazing food"
print("With Stop words:", sentence, "\n" )

without_stopwords = [word for word in word_tokenize(sentence) if word not in stopwords.words('english')]
print('Without Stop words:', ' '.join(without_stopwords))
With Stop words: the restaurant at the city served amazing food 

Without Stop words: restaurant city served amazing food
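Because our downstream task is sentiment analysis, removing every stop word can throw away signal: negations like 'not' and 'no' are on nltk's English stop word list but clearly flip sentiment. Here is a small sketch of keeping them; which words to retain is a judgment call.

negations = {"not", "no", "nor"}  # illustrative choice of stop words to keep
custom_stopwords = set(stopwords.words('english')) - negations

negated = "the food was not amazing"
print([word for word in word_tokenize(negated) if word not in custom_stopwords])
# -> ['food', 'not', 'amazing']  (the negation survives)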

Combining Preprocessing Steps

We have discussed concepts and techniques to preprocess data for feature extraction. The next step is to put them together into a preprocessing function that can be run against the dataset. Below is an example of one such function.

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()

def cleaningText(text):
    """
    Text Cleaning:
        - Remove Numbers
        - Remove Punctuation
        - Lowercase and Tokenize Text
        - Remove Stopwords
        - Stem Text
    """
    text = re.sub("[0-9]+", " ", text)     # Remove numbers
    text = re.sub("[^a-zA-Z]", " ", text)  # Remove punctuation and any other non-letter characters
    text = text.lower()                    # Lowercase so capitalized stop words like 'The' are matched
    text = [porter_stemmer.stem(word) for word in word_tokenize(text)
            if word not in stopwords.words('english')]
    return " ".join(text)
cleaningText("The restaurantant has really amazing service and great food")
'restaurant realli amaz servic great food'
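To run this over the full dataset, one option is pandas' apply; the column name cleaned_review below is just an illustrative choice, and on roughly 53,000 reviews this can take a while, since stop word removal and stemming run per token.

# Hypothetical column name; adjust to fit your own pipeline
review_data['cleaned_review'] = review_data.review.apply(cleaningText)
review_data[['review', 'cleaned_review']].head()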

Now that we have learned the basics, we can use these preprocessing steps to clean the data and then extract features from it. In Part 2, we build on these concepts to explore feature extraction techniques.