Part 4: Modern Feature Engineering - Distributed Representation for Text Modelling
So far we have looked at traditional techniques that use word frequency and weighting to derive features from text and then build classification models on those features. In this notebook, we explore distributed representation as a way of generating features from words and walk through the implementation of the following techniques:
Word2Vec Implementation with Gensim:
- 1. Continuous Bag of Words Model
- 2. Skip-Gram Model
- 3. Gensim Vocabulary Object
Word Vectors to Feature Matrix
- 1. Averaging Word Vectors
- 2. Building a vectorizer for new text
Dataset: Restaurant Reviews
Let's begin by importing the necessary packages.
import pandas as pd
review_data = pd.read_csv('restaurant_reviews.tsv', sep='\t')
review_data.head()
| | Review | Liked |
---|---|---|
0 | Wow... Loved this place. | 1 |
1 | Crust is not good. | 0 |
2 | Not tasty and the texture was just nasty. | 0 |
3 | Stopped by during the late May bank holiday of... | 1 |
4 | The selection on the menu was great and so wer... | 1 |
We will need to clean the reviews up so that each one becomes a sequence of words from which we can build our word2vec representation. Before we go into word2vec, let's use Keras to tokenize the reviews.
from keras.preprocessing.text import text_to_word_sequence
corpus_tokens = [text_to_word_sequence(review) for review in review_data.Review.values]
corpus_tokens[:3]
[['wow', 'loved', 'this', 'place'], ['crust', 'is', 'not', 'good'], ['not', 'tasty', 'and', 'the', 'texture', 'was', 'just', 'nasty']]
Notice that in the above example, we have created a list of documents in which every document is represented as a list of tokens. This is the input format we will need for our word2vec implementation.
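As a side note, text_to_word_sequence comes from Keras's older preprocessing utilities and may not be available in more recent Keras releases. If the import fails, a minimal pure-Python fallback along the lines below (lowercasing and keeping runs of word characters, with the helper name simple_tokenize purely illustrative) yields a comparable list of token lists:
import re
def simple_tokenize(text):
    # Lowercase the text and keep runs of letters, digits and apostrophes as tokens
    return re.findall(r"[a-z0-9']+", text.lower())
# Assumed fallback, producing the same nested-list structure as above:
# corpus_tokens = [simple_tokenize(review) for review in review_data.Review.values]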
What is Word2Vec?
So far we have only described distributed representation in general terms. The simplified intuition is that words with similar meanings tend to appear within similar contexts. More broadly, distributed representation provides a way to assign vectors to words based on their context, such that words used in similar contexts end up close together in the same vector space.
For more information, see Stanford's lecture: https://www.youtube.com/watch?v=ERibwqs9p38
There are two main ways of computing word vectors:
1. CBOW - Continuous Bag of Words
Suppose we have some text: "generalized linear models have link functions that enable flexibility beyond that of ordinary least squares", and suppose we want to predict the word vector for the word "models". The continuous bag of words model uses the context words "generalized linear ___ have link" to predict the target word. Since the corpus may contain other texts with similar contexts, we can train a CBOW model to provide the most probable word for the blank in the text above.
If you are interested in understanding the mechanics of training word2vec, it may be useful to visit the Stanford link above for more technical details of the training algorithms available for this estimation.
In Python, I use the gensim package to compute the word vectors for the review corpus using the CBOW model.
Before we implement the CBOW word2vec model, we need to determine the following:
- 1. vector size: The size of the vector learned for every word in the corpus
- 2. window size: The number of context words on either side of the target word used in computing the vector
In this example, let's use a window of 5 and a vector of size 10
from gensim.models import Word2Vec
vector_size = 10
window_size = 5
cbow_model = Word2Vec(sentences=corpus_tokens,
                      vector_size=vector_size,  # Size of each word vector
                      window=window_size,       # Context window size
                      sg=0,                     # sg=0 selects the CBOW architecture
                      min_count=2,              # Ignore words appearing fewer than 2 times
                      sample=.0000001)          # Downsampling threshold for frequent words
The model has been trained. We can now look up the 10-dimensional vector learned for any word in the vocabulary, for example 'good':
cbow_model.wv['good']
array([ 0.07817571, -0.09510187, -0.00205531, 0.03469197, -0.00938972, 0.08381772, 0.09010784, 0.06536506, -0.00711621, 0.07710405], dtype=float32)
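Before moving on to similarity, note that the fill-in-the-blank intuition behind CBOW can be probed directly: gensim exposes a predict_output_word method that returns the most probable centre words for a supplied list of context words (it requires a model trained with negative sampling, which is gensim's default, and ignores context words outside the vocabulary). A rough sketch, with a context chosen purely for illustration:
# Probe the CBOW model: which centre words does it consider most probable
# for this (illustrative) context from the restaurant-review domain?
cbow_model.predict_output_word(['the', 'food', 'was', 'really'], topn=5)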
1.2. CBOW Similarity
One of the advantages of using a CBOW model is that we can then compute word similarity within the corpus. Cosine similarity between word vectors is used for this calculation.
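To make that concrete, the short sketch below computes the cosine similarity between two word vectors directly with numpy and compares it with gensim's built-in wv.similarity helper; the word pair 'good'/'great' is just an example.
import numpy as np
def cosine_sim(u, v):
    # Cosine of the angle between two vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
manual = cosine_sim(cbow_model.wv['good'], cbow_model.wv['great'])
built_in = cbow_model.wv.similarity('good', 'great')
print(manual, built_in)  # the two values should match up to floating-point precision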
Using the gensim package, we can call the most_similar method for any word as shown in the example below:
cbow_model.wv.most_similar('amazing')
[('few', 0.8727703094482422), ('seated', 0.8448812961578369), ('seen', 0.7667219638824463), ('full', 0.7428421974182129), ('said', 0.7405533790588379), ('brought', 0.7369356155395508), ('enjoy', 0.7215623259544373), ('heat', 0.718512773513794), ('stuffed', 0.6971375346183777), ('potatoes', 0.6860308647155762)]
Based on our dataset and the tuning parameters of our model, the word 'amazing' comes out as similar to words such as 'seated', 'few', 'enjoy' and 'stuffed', some of which are the kinds of words that appear in generally positive reviews.
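Keep in mind that word2vec training is stochastic, so the exact neighbour list above will vary from run to run. If repeatable results matter, gensim lets you pass a seed (and, for full determinism, train with a single worker); a sketch of such a call, with the variable name chosen for illustration:
# Reproducible (but single-threaded) variant of the CBOW model above
cbow_model_repro = Word2Vec(sentences=corpus_tokens,
                            vector_size=10,
                            window=5,
                            sg=0,
                            min_count=2,
                            seed=42,      # fix the random seed
                            workers=1)    # single worker for deterministic training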
2. Skip-Gram Model
The Skip-Gram model is trained to perform the reverse function of the CBOW model. That is, while the CBOW model predicts the target word given its context words, the Skip-Gram model predicts the context words given the target word.
In terms of the implementation in gensim, the only change is to activate the skip-gram argument by setting sg to 1. Let's see the implementation.
skipgram_model = Word2Vec(sentences=corpus_tokens,
                          vector_size=20,   # Size of each word vector
                          window=5,         # Context window size
                          min_count=2,      # Ignore words appearing fewer than 2 times
                          sg=1,             # sg=1 selects the Skip-Gram architecture
                          sample=.0000001)  # Downsampling threshold for frequent words
Just like with CBOW, we can generate a word vector for each word in the corpus
skipgram_model.wv['good']
array([-0.04121573, 0.04649187, -0.00098335, -0.00983143, 0.02302161, -0.02047365, 0.01371725, 0.03470667, 0.03032966, -0.03756103, 0.04690514, 0.0233656 , 0.01983496, -0.03122055, 0.0423056 , -0.01075502, 0.04413366, -0.02680901, -0.04064848, 0.03411919], dtype=float32)
2.2. Skip-Gram Word Similarity
We can also retrieve similar words from the skip-gram model.
skipgram_model.wv.similar_by_word('amazing')
[('serve', 0.6504205465316772), ('incredible', 0.6315867900848389), ('your', 0.6247786283493042), ('flavorful', 0.6199944019317627), ('few', 0.5851094126701355), ("friend's", 0.5779340267181396), ('full', 0.5542045831680298), ('potatoes', 0.5514717698097229), ('said', 0.5444715023040771), ('overall', 0.5155755877494812)]
Unsurprisingly, words like 'flavorful' and 'incredible' appear very similar to 'amazing'.
3. Gensim Vocabulary
Suppose we have new text input containing words that our models have not seen; the models cannot return vectors for those words. It is therefore important to know how to access the model's vocabulary so that you can provide an alternative for new words when leveraging the models for feature extraction.
In gensim, the vocabulary is accessible through the wv object: index_to_key returns the list of all known words, and key_to_index maps each word to its position in that list.
vocab = skipgram_model.wv.index_to_key
vocab[20:30]
['with', 'had', 'great', 'that', 'be', 'so', 'were', 'are', 'but', 'have']
len(vocab)
897
Saving this vocabulary list will be important in initializing vectors for new text, particularly for words not contained in the word2vec model.
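A minimal sketch of what that could look like: write the vocabulary to disk and fall back to a zero vector whenever a word is missing from the model (the file name and helper below are illustrative, not part of the original workflow).
import numpy as np
# Persist the vocabulary so new text can be checked against it later
with open('skipgram_vocab.txt', 'w') as f:
    f.write('\n'.join(vocab))
def vector_or_zero(word, model, vocabulary):
    # Return the learned vector for known words, a zero vector otherwise
    if word in vocabulary:
        return model.wv[word]
    return np.zeros(model.wv.vector_size, dtype='float32')
vector_or_zero('unseenword', skipgram_model, set(vocab))[:5]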
4. Convert Word Vectors to Features
We have seen earlier how to implement bag-of-words and TF-IDF feature extraction. With those techniques, every word contributes a single value as a feature. With word vectors, every word has a vector of features, so we need a way to summarize them such that every document is represented by one vector covering all the tokens in the document. One simple approach is to sum the word vectors and divide by the count of words/tokens, i.e. average them.
Let's see the implementation.
import numpy as np
def avg_word_vectors(words, model, vocabulary, feature_size):
    # Average the word vectors of all in-vocabulary tokens in one document
    feature_vector = np.zeros((feature_size,), dtype='float64')
    word_count = 0.
    for word in words:
        if word in vocabulary:  # skip words the model has never seen
            word_count += 1
            feature_vector = np.add(feature_vector, model.wv[word])
    if word_count:  # guard against division by zero for all-unknown documents
        feature_vector = np.divide(feature_vector, word_count)
    return feature_vector
Testing the function on a sample of text
test = ["This", 'is', 'delicious', 'food']
avg_word_vectors(test, skipgram_model, vocab, 20 )
array([-0.12273251, 0.05990233, 0.20953931, 0.14712462, -0.02474864, -0.04072365, 0.12475592, 0.41853103, -0.23380686, 0.15436821, 0.1927391 , -0.16070034, 0.25334843, -0.07903712, 0.13786161, 0.13087943, 0.2813911 , -0.03589961, -0.17655849, -0.32781931])
Averaging across the full dataset
Now let's average the vectors across the full corpus and convert them into a dataframe whose columns serve as the predictors/features.
def avg_word_vectorizer(corpus, model, feature_size):
    # Apply avg_word_vectors to every document and stack the results into a matrix
    vocabulary = set(model.wv.index_to_key)
    features = [avg_word_vectors(text, model, vocabulary, feature_size) for text in corpus]
    return np.array(features)
skipgram_features = avg_word_vectorizer(corpus_tokens, skipgram_model, 20)
Converting Vectors into a Feature Dataframe
skipgram_df = pd.DataFrame(skipgram_features)
skipgram_df.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.072557 | 0.020669 | 0.109571 | 0.091545 | -0.024006 | -0.045733 | 0.069171 | 0.243920 | -0.119203 | 0.092028 | 0.113550 | -0.087990 | 0.161464 | -0.045470 | 0.094618 | 0.106789 | 0.164789 | -0.029297 | -0.082386 | -0.213857 |
1 | -0.090617 | 0.046973 | 0.119375 | 0.079829 | -0.015860 | -0.041904 | 0.079698 | 0.252796 | -0.122075 | 0.073598 | 0.129392 | -0.091764 | 0.140735 | -0.055646 | 0.079972 | 0.079374 | 0.165859 | -0.036726 | -0.102102 | -0.185127 |
2 | -0.013918 | 0.003828 | 0.023774 | 0.017910 | -0.001977 | -0.009082 | 0.016131 | 0.046615 | -0.022055 | 0.021035 | 0.018877 | -0.024454 | 0.025174 | -0.003205 | 0.010999 | 0.012328 | 0.032873 | -0.012591 | -0.021173 | -0.043988 |
3 | -0.026085 | 0.011909 | 0.030719 | 0.022893 | -0.005852 | -0.007413 | 0.024078 | 0.063204 | -0.031559 | 0.025878 | 0.027619 | -0.030772 | 0.040444 | -0.009891 | 0.021081 | 0.019070 | 0.037710 | -0.010358 | -0.023222 | -0.056440 |
4 | -0.016082 | 0.008462 | 0.022884 | 0.021306 | -0.002130 | -0.006800 | 0.016417 | 0.051119 | -0.023960 | 0.019188 | 0.022142 | -0.019820 | 0.029589 | -0.009846 | 0.013566 | 0.020006 | 0.025518 | -0.008707 | -0.020258 | -0.037383 |
Conclusion
What we have done is reduce all of our text to a dataframe with 20 features per document, representing the average of the word vectors for the words in that document. We can use this matrix for some of the classification tasks that we did before. In the final part, we will cover how to use deep learning to classify the sentiments.
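As a quick illustration of how this feature matrix could plug into one of the earlier classification workflows, the sketch below fits a scikit-learn logistic regression on the averaged skip-gram features against the Liked labels; the train/test split and choice of classifier are assumptions for illustration, not part of the original notebook.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Split the averaged word-vector features and the sentiment labels
X_train, X_test, y_train, y_test = train_test_split(
    skipgram_features, review_data.Liked.values, test_size=0.2, random_state=42)
# Fit a simple baseline classifier on the 20-dimensional document vectors
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))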