Part 2: Natural Language Processing Features Extraction
This note builds on Part 1 to discuss feature extraction techniques for Natural Language Processing. Broadly, this section deals with moving from raw text to a numerical representation on which analyses and models can be built.
Specifically, we cover the following concepts:
Preprocessing Amazon Product Reviews
- Text cleaning: removing punctuation and numbers
- Tokenization, stemming, and stopword removal
Techniques to Develop Features
- 1. N-grams
- 2. Bag of Words
- 3. TF-IDF: Term Frequency - Inverse Document Frequency
- 4. CountVectorizer and TfidfVectorizer
- 4.1. Features to Matrix
- 5. Pickling Vectorizer and Features Dataframe
Preprocessing Amazon Product Reviews
Picking up from Part 1, we implement a text cleaning function that removes punctuation and numbers, tokenizes the text, stems each token, and drops stopwords. Below is the full implementation.
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# The 'punkt' and 'stopwords' resources may need a one-time nltk.download()
porter_stemmer = PorterStemmer()

def cleaningText(text):
    """
    Text Cleaning:
    - Remove Punctuation
    - Remove Numbers
    - Tokenize Text
    - Stem Text
    - Remove Stopwords
    """
    text = re.sub("[^a-zA-Z]", " ", text)  # Remove punctuation (this also strips numbers)
    text = re.sub("[0-9]+", "", text)      # Remove numbers
    # Note: the stopword check runs before lowercasing, so capitalized
    # stopwords such as "All" or "I" are kept (visible in the output below)
    text = [ porter_stemmer.stem(word.lower()) for word in word_tokenize(text)
             if word not in stopwords.words('english') ]
    return " ".join(text)
import pandas as pd

review_data = pd.read_csv('amazon_baby_review.csv')

# Run the preprocessing step on every review
review_data['clean_review'] = review_data.review.apply(lambda x: cleaningText(str(x)))
review_data[['review', 'clean_review']].head()
|   | review | clean_review |
|---|---|---|
| 0 | All of my kids have cried non-stop when I trie... | all kid cri non stop i tri ween pacifi i found... |
| 1 | We wanted to get something to keep track of ou... | we want get someth keep track child mileston c... |
| 2 | My daughter had her 1st baby over a year ago. ... | my daughter st babi year ago she receiv fill f... |
| 3 | One of baby's first and favorite books, and it... | one babi first favorit book washabl i gave les... |
| 4 | Very cute interactive book! My son loves this ... | veri cute interact book my son love book the b... |
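As a quick sanity check, we can also run the function on a single sentence (the output in the comment is illustrative). Notice that, as in the table above, capitalized stopwords such as "This" slip past the filter because the stopword check happens before lowercasing.

# Illustrative single-sentence check of cleaningText
cleaningText("This product is great, my 2 kids loved it!")
# expected output (approximately): 'thi product great kid love'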
1. N-grams
An n-gram is a contiguous sequence of n tokens taken from a body of text. When we used word_tokenize, we were effectively performing 1-gram (unigram) tokenization. Alternatively, we can choose a larger n so that tokens that are more meaningful together, e.g. "New York" or "Thank You", are kept as a single unit. Let's see an example.
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
document = "New York is truly an amazing city to live in"
[ ' '.join(gram) for gram in ngrams(word_tokenize(document), 2) ]
['New York', 'York is', 'is truly', 'truly an', 'an amazing', 'amazing city', 'city to', 'to live', 'live in']
N-grams have the advantage of capturing sentiments like "not bad" or "very good" as a single token, which can be a more effective feature for analysis and modeling than the individual words.
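For instance, bigrams keep a negation attached to the word it modifies; a small illustrative example:

# Bigrams keep "not bad" together as a single token
[ ' '.join(gram) for gram in ngrams(word_tokenize("the movie was not bad"), 2) ]
# ['the movie', 'movie was', 'was not', 'not bad']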
2. Bag of Words
The bag of words approach generates features by collecting all the tokens in the corpus and placing them in a "bag", thereby creating the vocabulary of the corpus. Each document is then encoded as a vector over this vocabulary, recording the count (or, for short documents, effectively the presence or absence) of each vocabulary word, thus creating the features.
Now, let's illustrate this process with a simple corpus consisting of 5 short documents below:
corpus = [ "the restaurant had great food",
"i love python programming",
"i prefer R to python",
"computers are fun to use",
"i did not like the movie"]
from sklearn.feature_extraction.text import CountVectorizer
bows_counter = CountVectorizer( analyzer='word',          # Word-level vectorizer
                                lowercase=True,           # Lowercase the text
                                ngram_range=(1, 1),       # Use unigrams (1-grams) only
                                tokenizer=word_tokenize,  # Use this tokenizer
                                stop_words='english')     # Remove English stopwords
bows_counter.fit(corpus)
features = bows_counter.transform(corpus).toarray()
The code above implements a count vectorizer that tokenizes words at the 1-gram level, removes stopwords, and builds a matrix of token counts (here, because each token appears at most once per document, the result is effectively a one-hot encoding). We can inspect the result by converting the features into a dataframe.
features_df = pd.DataFrame(features, columns=bows_counter.get_feature_names_out())
features_df
|   | computers | did | food | fun | great | like | love | movie | prefer | programming | python | r | restaurant | use |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
Notice that the dataframe has 1-gram tokens as columns and an encoding that shows whether each document contains the token. This set of features can help us model the sentiment of the text.
Another thing to notice is that the matrix can be quite sparse, depending on the size of the vocabulary and the relative frequency of its terms. It may therefore be useful to limit the n-gram range and to filter features with frequency thresholds, as sketched below.
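As a rough sketch (the parameter values below are arbitrary, chosen only for illustration), CountVectorizer exposes min_df, max_df, and max_features for exactly this kind of pruning:

# Illustrative pruning: keep tokens that appear in at least 2 documents,
# drop tokens that appear in more than 90% of documents,
# and cap the vocabulary at 5000 features
pruned_counter = CountVectorizer(ngram_range=(1, 2),
                                 min_df=2,
                                 max_df=0.9,
                                 max_features=5000,
                                 stop_words='english')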
3. Term Frequency - Inverse Document Frequency a.k.a. TF-IDF
TF-IDF is a commonly used weighting technique that assigns each word a weight reflecting its importance to a document. The underlying idea is that if a word appears frequently across all documents, it is unlikely to carry significant information about any specific document. Conversely, words that appear frequently in one or a few documents but rarely across the corpus are considered to carry specific importance and are assigned higher weights.
The mathematical expression of tf-idf (in one of the many forms) is:
$$ \text{tfidf}_{t,d} = \text{tf}_{t,d} \times \log\left(\frac{\text{total documents}}{\text{documents containing term } t}\right) $$
That is, we multiply the number of times a word appears in a document by the logarithm of the total number of documents divided by the number of documents that contain the word.
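As a minimal sketch of the raw formula (the helper raw_tfidf below is ours, not part of any library; scikit-learn's TfidfVectorizer, used next, applies a smoothed idf and L2 normalization, so its numbers will differ slightly):

import math

def raw_tfidf(term, document, documents):
    tf = document.split().count(term)                        # times the term appears in the document
    df = sum(1 for doc in documents if term in doc.split())  # documents containing the term
    return tf * math.log(len(documents) / df)

raw_tfidf("python", "i love python programming", corpus)
# 1 * log(5 / 2) ≈ 0.916, since "python" appears in 2 of the 5 documents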
Intuitively, a word that appears in every document has an inverse document frequency of log(1) = 0, so its weight is zero. Conversely, a word with high frequency within a specific document and low frequency across the corpus receives a higher weight. Let's see an example using our small corpus above.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer( analyzer='word',          # Word-level vectorizer
                                    lowercase=True,           # Lowercase the text
                                    tokenizer=word_tokenize)  # Use this tokenizer
tfidf_vectorizer.fit(corpus)
tfidf_features = tfidf_vectorizer.transform(corpus).toarray()
tfidf_df = pd.DataFrame(tfidf_features, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df
|   | are | computers | did | food | fun | great | had | i | like | love | movie | not | prefer | programming | python | r | restaurant | the | to | use |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.000000 | 0.000000 | 0.463693 | 0.000000 | 0.463693 | 0.463693 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.463693 | 0.374105 | 0.000000 | 0.000000 |
| 1 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.380406 | 0.000000 | 0.568014 | 0.000000 | 0.000000 | 0.000000 | 0.568014 | 0.458270 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.345822 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.516374 | 0.000000 | 0.416607 | 0.516374 | 0.000000 | 0.000000 | 0.416607 | 0.000000 |
| 3 | 0.463693 | 0.463693 | 0.000000 | 0.000000 | 0.463693 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.374105 | 0.463693 |
| 4 | 0.000000 | 0.000000 | 0.442832 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.296570 | 0.442832 | 0.000000 | 0.442832 | 0.442832 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.357274 | 0.000000 | 0.000000 |
Notice that we now have weights computed. Because the corpus is small, the differences in the weights are subtle, but they are visible. Also note that scikit-learn's TfidfVectorizer uses a smoothed idf and L2-normalizes each row by default, which is why the values differ slightly from the raw formula above. We will implement this for our review data.
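If you want to peek at the learned weights, the fitted vectorizer exposes its idf values and vocabulary; for example (using the default smoothed idf, log((1 + N) / (1 + df)) + 1):

# idf of "food", which appears in 1 of the 5 documents
tfidf_vectorizer.idf_[tfidf_vectorizer.vocabulary_['food']]
# log((1 + 5) / (1 + 1)) + 1 ≈ 2.099 (natural log)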
4. CountVectorizer and TfidfVectorizer
We can use a fitted vectorizer on text outside of the training data. It creates a vector aligned with the fitted vocabulary (the column names above): each entry receives a count or TF-IDF value if the corresponding word is present in the text and zero otherwise.
bows_counter.transform(['python programming is great']).toarray()
array([[0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]])
tfidf_vectorizer.transform(['python programming is great']).toarray()
array([[0. , 0. , 0. , 0. , 0. , 0.61418897, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.61418897, 0.49552379, 0. , 0. , 0. , 0. , 0. ]])
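Words that were not seen during fitting are simply ignored, so text made up entirely of out-of-vocabulary words maps to an all-zero vector:

# Out-of-vocabulary words are ignored, yielding an all-zero vector
bows_counter.transform(['completely unseen words']).toarray()
# array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])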
4.1. Features to Matrix
The final step is to apply the TF-IDF vectorizer to the dataset to obtain the features and convert them into a matrix that can be ingested by a model for training. The example below demonstrates this using the Amazon product reviews dataset.
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
review_tfidf_vectorizer = TfidfVectorizer( # max_features=1000,      # Optionally keep only the top 1000 features
                                           analyzer='word',          # Word-level vectorizer
                                           lowercase=True,           # Lowercase the text
                                           min_df=5,                 # Use tokens that appear in at least 5 documents
                                           ngram_range=(1, 2),       # Create 1-grams and 2-grams
                                           tokenizer=word_tokenize,  # Use this tokenizer
                                           stop_words='english',     # Remove English stopwords
                                           sublinear_tf=True, smooth_idf=True, use_idf=True)  # Sublinear tf scaling and smoothed idf
review_tfidf_vectorizer.fit(review_data.clean_review)
features_df = pd.DataFrame( review_tfidf_vectorizer.transform(review_data.clean_review).toarray(),
columns=review_tfidf_vectorizer.get_feature_names_out())
features_df.head()
|   | aa | aa aaa | aa batteri | aaa | aaa batteri | aacut | aap | ab | aback | abandon | ... | zo | zoli | zoli bot | zoli cup | zone | zoo | zoom | zoom featur | zoom pan | zooper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Now that we have the feature set, we can move on to developing a model to predict sentiment.
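One caveat: calling .toarray() densifies the matrix, which can be very memory-hungry for a large review corpus. A minimal sketch of the alternative (the sparse_features name below is just for illustration), assuming the model will be trained with scikit-learn, whose estimators accept sparse input directly:

# Keep the TF-IDF features as a SciPy sparse matrix instead of a dense dataframe
sparse_features = review_tfidf_vectorizer.transform(review_data.clean_review)
print(sparse_features.shape)  # (number of reviews, vocabulary size)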
5. Pickling Vectorizer and Features Dataframe
Working with NLP often means working with sparse datasets and vectorizer classes that take a long time to run. Pickling these objects and datasets provides checkpoints that save time and compute. To complete this note, we pickle the features dataframe and the review_tfidf_vectorizer we developed, along with the sentiment labels.
import pickle

# Pickle the vectorizer
with open('review_tfidf_vectorizer.pk', 'wb') as vectorizer_object:
    pickle.dump(review_tfidf_vectorizer, vectorizer_object)

# Pickle the features dataframe
with open('features.pk', 'wb') as feature_object:
    pickle.dump(features_df, feature_object)

# Pickle the sentiment labels
with open('sentiment.pk', 'wb') as sentiment_object:
    pickle.dump(review_data['sentiment'], sentiment_object)
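To reload these checkpoints later (for example, in Part 3), the same objects can be read back with pickle.load; a minimal sketch, loading the labels into a sentiment variable of our choosing:

# Reload the pickled vectorizer, features, and sentiment labels
with open('review_tfidf_vectorizer.pk', 'rb') as f:
    review_tfidf_vectorizer = pickle.load(f)
with open('features.pk', 'rb') as f:
    features_df = pickle.load(f)
with open('sentiment.pk', 'rb') as f:
    sentiment = pickle.load(f)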
Now that the features are defined and saved, we can move on to the sentiment analysis model. In Part 3, we build on Part 2 to develop two models for sentiment classification.