Part 1: Natural Language Processing Foundations, Concepts, and Preprocessing
This 5-part series covers concepts, analysis, and Machine Learning Models for Natural Language Processing. It was part of a set of lectures I gave at Emory University. It is intended to provide a practical introduction to NLP with Python, covering fundamental NLP terminology, tools, and data processing. In this note, I will cover the following concepts:
Formal Concepts and Definitions
- 1. Corpus
- 2. Documents
- 3. Tokens
General Text Preprocessing Techniques
- 1. Tokenization: Word, Sentence, Character
- 2. Stemming and Lemmatization
- 3. Stop words and how to deal with them
- 4. Punctuation and how to deal with them
Introducing NLTK for Text Preprocessing
- 1. Combining preprocessing steps
- 2. Implementing preprocessing steps
Dataset: Amazon Product Reviews
The dataset used in this series contains Amazon product reviews. It is available at: Amazon Product Reviews Github
To begin, download the dataset from GitHub, save it in your working directory, and use the following code to read it in:
import pandas as pd
pd.__version__
'1.4.4'
review_data = pd.read_csv('./amazon_product_reviews.csv')
review_data.head()
| | review | rating | sentiment |
|---|---|---|---|
| 0 | All of my kids have cried non-stop when I trie... | 5 | 1 |
| 1 | We wanted to get something to keep track of ou... | 5 | 1 |
| 2 | My daughter had her 1st baby over a year ago. ... | 5 | 1 |
| 3 | One of baby's first and favorite books, and it... | 4 | 1 |
| 4 | Very cute interactive book! My son loves this ... | 5 | 1 |
Now that we have our dataset, let's take a quick look at what we're working with. The dataset consists of three columns, described below:
- review: Text of a review for a baby product
- rating: Star rating from 1 to 5, with neutral 3-star reviews removed
- sentiment: Dummy variable for sentiment: 1 for positive, -1 for negative
print(review_data.rating.value_counts(normalize=True))
print(review_data.sentiment.value_counts(normalize=True))
5    0.381953
1    0.286083
2    0.213107
4    0.118857
Name: rating, dtype: float64
 1    0.50081
-1    0.49919
Name: sentiment, dtype: float64
Overall, we have roughly a 50-50 split of positive and negative reviews. Now let's get into the natural language processing concepts.
1. Corpus
In the field of Natural Language Processing, the term 'corpus' refers to a body of text, which is a collection of text data utilized for analytical or machine learning purposes. In our specific case, the corpus consists of 53,072 reviews.
corpus = review_data.review.tolist()
print("The size of our corpus is:", len(corpus), "\n")
print("The first Review in our corpus is:\n", corpus[0])
The size of our corpus is: 53072

The first Review in our corpus is:
All of my kids have cried non-stop when I tried to ween them off their pacifier, until I found Thumbuddy To Love's Binky Fairy Puppet. It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from it.This is a must buy book, and a great gift for expecting parents!! You will save them soo many headaches.Thanks for this book! You all rock!!
2. Documents
Documents are individual entities that together make up a corpus. In our case, every review in our dataset is a document; that is, we have 53,072 documents in our corpus of Amazon product reviews. Here is a sample document.
sample_document = corpus[6]
sample_document
'Try this out for a spring project !Easy ,fun and affordable wall decals ...Fine quality and brightens up any room.. 5+ **********'
3. Tokens
A token is a meaningful entity that makes up a document, similar to how words make up sentences. The choice of tokens can vary depending on the task and corpus being analyzed; they can encompass sentences, phrases, words, or even individual characters. In many NLP applications, words serve as the fundamental tokens.
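As a quick illustration with a made-up sentence (not one of our reviews), the same text can be broken into word-level or character-level tokens:
sample = "Baby loves this book"
sample.split()   # word-level tokens: ['Baby', 'loves', 'this', 'book']
list(sample)     # character-level tokens: ['B', 'a', 'b', 'y', ' ', 'l', ...]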
General Text Preprocessing Techniques
Preprocessing refers to a set of operations applied to the body of text (corpus) to enable meaningful extraction of features and insights from the data. Similar to transformations on numerical data, text data needs to be preprocessed to determine things like the size of the vocabulary, correct misspellings, remove text that carries no useful features for the task at hand, and so forth. One of the primary ways of preprocessing text is to separate it into meaningful entities through tokenization.
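As a quick preview of one of these questions, here is a rough sketch that estimates the vocabulary size of a sample of our corpus. It uses word_tokenize, which is introduced properly below, and assumes the punkt tokenizer data has been downloaded via nltk.download('punkt'):
from nltk.tokenize import word_tokenize

vocab = set()
for doc in corpus[:1000]:                                    # a small sample keeps this quick
    vocab.update(word.lower() for word in word_tokenize(doc))
print("Approximate vocabulary size of the sample:", len(vocab))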
1. Tokenization
Tokenization is the process of dividing documents into tokens and is an integral part of effective preprocessing.
Sentence Tokenization
Now that we have a few formal definitions, let's implement some tokenization steps. We begin with sentence tokenization using nltk, a widely used package that provides tokenization tools that perform very well. Let's use the sent_tokenize method to tokenize one of our reviews.
from nltk.tokenize import sent_tokenize
print('Original Review:')
review_data.review[11]
Original Review:

"I'm a new mom and I was looking for something to record my baby's daily activities and patterns and it is perfect! It has a lot of space to write down anything extra beyond the diaper changes, feedings and sleeping. I like that it has an area to write down medications. It is a great reminder for dr appointments and to track your baby's patterns. I am extremely happy with this book and will be ordering another one for when I run out of pages in 3 months."
print('Sentence Tokenized:')
sent_tokenize(review_data.review[11])
Sentence Tokenized:
["I'm a new mom and I was looking for something to record my baby's daily activities and patterns and it is perfect!",
 'It has a lot of space to write down anything extra beyond the diaper changes, feedings and sleeping.',
 'I like that it has an area to write down medications.',
 "It is a great reminder for dr appointments and to track your baby's patterns.",
 'I am extremely happy with this book and will be ordering another one for when I run out of pages in 3 months.']
Notice that the resulting output is a list of sentences. The sent_tokenize method uses heuristic rules about the English language to separate sentences; in this case, it looks for periods and other sentence-ending punctuation, such as '!', that mark the boundary between complete sentences.
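For instance, common abbreviations are usually not treated as sentence boundaries. The made-up example below illustrates this; the exact behavior depends on the punkt models shipped with your nltk version.
example = "I met Dr. Smith yesterday. He recommended this book!"
sent_tokenize(example)
# the period after 'Dr.' should not start a new sentence,
# so this typically yields two sentences rather than three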
Punkt Sentence Tokenizer
Another sentence tokenizer is the PunktSentenceTokenizer which performs similarly and may be more effective in some instances.
from nltk.tokenize import PunktSentenceTokenizer
punkt_sent_tokenizer = PunktSentenceTokenizer()
punkt_sent_tokenizer.tokenize(review_data.review[11])
["I'm a new mom and I was looking for something to record my baby's daily activities and patterns and it is perfect!", 'It has a lot of space to write down anything extra beyond the diaper changes, feedings and sleeping.', 'I like that it has an area to write down medications.', "It is a great reminder for dr appointments and to track your baby's patterns.", 'I am extremely happy with this book and will be ordering another one for when I run out of pages in 3 months.']
In this instance, both tokenizers split the sentences identically. Now let's look at word tokenizers.
Word Tokenization with nltk
We can further tokenize our documents into even more granular tokens. Let's look at examples of word tokenizers in word_tokenize and wordpunct_tokenize methods from nltk.
from nltk.tokenize import word_tokenize, wordpunct_tokenize
print("Word Tokenizer: ", word_tokenize(corpus[0])[:10])
print("Word Punct Tokenizer: ", wordpunct_tokenize(corpus[0])[:10])
Word Tokenizer:  ['All', 'of', 'my', 'kids', 'have', 'cried', 'non-stop', 'when', 'I', 'tried']

Word Punct Tokenizer:  ['All', 'of', 'my', 'kids', 'have', 'cried', 'non', '-', 'stop', 'when']
We can observe that the result is a list of individual words extracted from the document. The output above displays the first 10 tokens generated by each tokenizer. It is worth noting that word_tokenize does not separate the word 'non-stop' into distinct tokens, whereas wordpunct_tokenize splits it at the hyphen.
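Another place the two tokenizers differ is contractions: word_tokenize follows Penn Treebank conventions, while wordpunct_tokenize splits on every punctuation boundary. A made-up example (exact behavior can vary slightly between nltk versions):
contraction = "I don't like it"
word_tokenize(contraction)        # typically: ['I', 'do', "n't", 'like', 'it']
wordpunct_tokenize(contraction)   # typically: ['I', 'don', "'", 't', 'like', 'it']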
2. Stemming
Stemming is the process of reducing words to their root form. This is powerful because it can significantly reduce the number of distinct tokens in a document or corpus while retaining as much of the context as possible. In the example below, we apply two stemmers to sets of word variants that share the same root.
Let's look at some examples of stemmers in nltk.
from nltk.stem import PorterStemmer, SnowballStemmer
porter_stem = PorterStemmer()
snowb_stem = SnowballStemmer('english') # SnowballStemmer must be initialized with a language
porter_stem.stem("loved"), porter_stem.stem("lovely"), porter_stem.stem("loveness")
('love', 'love', 'love')
snowb_stem.stem('amazing'), snowb_stem.stem('amazed'), snowb_stem.stem('amazes'), snowb_stem.stem('amazingly')
('amaz', 'amaz', 'amaz', 'amaz')
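Porter and Snowball (sometimes called Porter2) implement slightly different rule sets, so they occasionally disagree on a word. It is easy to compare them side by side on any tokens you care about; the words below are arbitrary examples:
for word in ["generously", "fairly", "cried"]:
    print(word, "->", porter_stem.stem(word), "|", snowb_stem.stem(word))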
3. Lemmatization
Lemmatization, like stemming, reduces words to a base form; however, it uses a vocabulary and morphological analysis to return the dictionary form (lemma) of the word.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("wolves"), lemmatizer.lemmatize("saying"), lemmatizer.lemmatize("is", pos='v')
('wolf', 'saying', 'be')
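One thing to keep in mind is the pos argument: WordNetLemmatizer treats every word as a noun unless told otherwise, so supplying the correct part of speech often changes the result. A small made-up example:
lemmatizer.lemmatize("running"), lemmatizer.lemmatize("running", pos='v')
# typically ('running', 'run'): as a noun the word is left unchanged, as a verb it maps to 'run'
lemmatizer.lemmatize("better", pos='a')
# typically 'good': adjectives are mapped to their lemma via WordNet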
4. Stop Words
Stop words are words that are part of the grammatical structure of the language but do not carry much semantic meaning. These are often frequent words like 'is' and 'the' whose presence or absence leaves the meaning of the text largely unchanged. In NLP, we often deal with stop words by removing them or by carefully selecting which ones to retain.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
sentence = "the restaurant at the city served amazing food"
print("With Stop words:", sentence, "\n" )
without_stopwords = [word for word in word_tokenize(sentence) if word not in stopwords.words('english')]
print('Without Stop words:', ' '.join(without_stopwords))
With Stop words: the restaurant at the city served amazing food

Without Stop words: restaurant city served amazing food
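If you are curious what nltk actually considers a stop word, you can inspect the English list directly; its exact size and contents depend on the version of nltk_data you have downloaded.
english_stopwords = stopwords.words('english')
print(len(english_stopwords))    # number of English stop words in your nltk_data
print(english_stopwords[:10])    # typically starts with ['i', 'me', 'my', 'myself', 'we', ...]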
Combining Preprocessing Steps
We have discussed concepts and techniques to preprocess data for feature extraction. The next step is to put them together into a preprocessing function that can be run against the dataset. Below is an example of one such function.
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()
def cleaningText(text):
    """
    Text cleaning:
    - Remove punctuation
    - Remove numbers
    - Tokenize text
    - Stem tokens
    - Remove stop words
    """
    text = re.sub("[^a-zA-Z]", " ", text)  # Keep letters only (drops punctuation and digits)
    text = re.sub("[0-9]+", "", text)      # Remove any remaining numbers
    text = [porter_stemmer.stem(word.lower())
            for word in word_tokenize(text)
            if word not in stopwords.words('english')]  # Stem each token and drop stop words
    return " ".join(text)
cleaningText("The restaurantant has really amazing service and great food")
'the restaurant realli amaz servic great food'
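Here is a minimal sketch of applying the function to the whole corpus with pandas; cleaned_review is just an illustrative column name, and running it over all 53,072 reviews can take a while because the stop word list is rebuilt for every word.
review_data['cleaned_review'] = review_data['review'].apply(cleaningText)
review_data[['review', 'cleaned_review']].head()
A simple speed-up is to build the stop word list once, e.g. stop_words = set(stopwords.words('english')), and test membership against that set inside cleaningText.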
Now that we have covered the basics, we can use these preprocessing steps to clean the data and then extract features from it. In Part 2, we build on these concepts to explore feature extraction techniques.