Part 5: Sentiment Classification with Deep Learning using Keras

In this final notebook, we implement sentiment analysis using modern deep learning techniques. These models often require longer training times, but with the right architecture and features they can produce better classification results.

This notebook covers the following:

Dense Layer Neural Network for Sentiment Classification

  • 1. Implementation of Dense Layer in Keras
  • 2. Activation and Optimization metric set up
  • 3. Model Metric Specification

Recurrent Neural Networks for Sentiment Classification

  • 1. Embedding layers with Keras
  • 2. Recurrent Neural Network Architecture
  • 3. Keras Model Implementation and Parameter Setting

0. Data Preparation

Before we implement any of the above, we need to perform the following operations on the review dataset.

  • 1. Text Processing - Cleaning and removing punctuation and numeric values
  • 2. Build a Word2Vec model - Using word vectors for feature generation
  • 3. Averaging word vectors across each review/text

Let's begin

import re
import numpy as np
import pandas as pd 
 
from keras.preprocessing.text import text_to_word_sequence
from gensim.models import Word2Vec

0.1. Preprocessing Text

The function below implements basic preprocessing steps to clean the text before we build the word vectors.

def processText(text):
    """ Clean a raw review: keep letters only, then lowercase and re-join the tokens """
    text = re.sub('[^a-zA-Z]', ' ', text)      # replace punctuation, digits, and other symbols with spaces
    words = text_to_word_sequence(text)        # lowercase and split into word tokens
    return " ".join(words)

review_data = pd.read_csv('restaurant_reviews.tsv', delimiter='\t')
review_data['clean_review'] = review_data.Review.apply(lambda x: processText(str(x)))
review_data.head()
[Output: first rows of review_data showing the Review and Liked columns alongside the new clean_review column]

0.2. Building a Word2Vec Model: CBOW

The code below trains a CBOW Word2Vec model with a vector size of 500 and a window size of 150.

vector_size = 500
window_size = 150

corpus = [text_to_word_sequence(review) for review in review_data.clean_review.values]
cbow_model = Word2Vec(sentences=corpus, vector_size=vector_size, window=window_size,
                      sg=0, min_count=2, sample=.000001)   # sg=0 selects the CBOW architecture
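
Before averaging, it can be useful to sanity-check the trained vectors. Below is a minimal sketch, assuming the token 'food' occurs at least twice in the corpus and therefore survives the min_count filter (any other frequent token works just as well):

print(cbow_model.wv['food'].shape)                   # expect (500,): one vector per vocabulary word
print(cbow_model.wv.most_similar('food', topn=5))    # words closest to 'food' in the learned vector space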

0.3. Averaging Word Vectors

The last step is to average the word vectors for each review so that every review is summarized by a single fixed-length feature vector.

def avg_words_vectors(words, model, vocabulary, num_features):
    """ Average the Word2Vec vectors of all in-vocabulary words in a single review """
    feature_vector = np.zeros((num_features,), dtype='float64')
    word_count = 0
    
    for word in words:
        if word in vocabulary:
            word_count += 1
            feature_vector = np.add(feature_vector, model.wv.get_vector(word))
    
    if word_count:
        feature_vector = np.divide(feature_vector, word_count)
    
    return feature_vector

def word_vectorizer(corpus, model, num_features):
    """ Build one averaged word-vector feature row per review """
    vocabulary = set(model.wv.index_to_key)   # set membership checks are much faster than a list
    
    features = [avg_words_vectors(sentence, model, vocabulary, num_features) for sentence in corpus]
    return np.array(features)

text_features = word_vectorizer(corpus=corpus, model=cbow_model, num_features=500)
text_features.shape
(1000, 500)

1. Classification with Dense Neural Network

In this example, we will implement a simple fully connected neural network for classification. The network has three Dense (fully connected) hidden layers with dropout layers to reduce overfitting, and a two-unit softmax output layer for the binary sentiment label.

Initial parameters for the model:

  • 1. batch_size = 150
  • 2. Training epochs = 50
  • 3. Input size / vector size = 500

from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization
from keras.utils import to_categorical


# Dense Layer Architecture
fully_connected_nn = Sequential()
fully_connected_nn.add(Dense(200, activation='relu', input_shape=(vector_size, )))
fully_connected_nn.add(Dropout(.5))
fully_connected_nn.add(Dense(200, activation='relu'))
fully_connected_nn.add(Dropout(.5))
fully_connected_nn.add(Dense(200, activation='relu'))
fully_connected_nn.add(Dropout(.5))
fully_connected_nn.add(Dense(2, activation='softmax'))
fully_connected_nn.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                Output Shape              Param #   
=================================================================
dense (Dense)               (None, 200)               100200    
                                                                
dropout (Dropout)           (None, 200)               0         
                                                                
dense_1 (Dense)             (None, 200)               40200     
                                                                
dropout_1 (Dropout)         (None, 200)               0         
                                                                
dense_2 (Dense)             (None, 200)               40200     
                                                                
dropout_2 (Dropout)         (None, 200)               0         
                                                                
dense_3 (Dense)             (None, 2)                 402       
                                                                
=================================================================
Total params: 181,002
Trainable params: 181,002
Non-trainable params: 0
_________________________________________________________________
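
As a quick check on the summary above, the parameter counts follow directly from the layer sizes: the first Dense layer has 500 × 200 weights + 200 biases = 100,200 parameters, each of the next two hidden layers has 200 × 200 + 200 = 40,200, and the output layer has 200 × 2 + 2 = 402, giving the reported total of 181,002.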

The next step is to compile the model before training it on the dataset.

fully_connected_nn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

1.1. Training/Fitting the Model

To train the model, we pass the features and targets, a validation split (the fraction of data held out for validation), the number of training epochs, and the batch size.

target = to_categorical(review_data.Liked)

fully_connected_nn.fit(text_features, target , validation_split=.3, shuffle=True, epochs=50, batch_size=150, verbose=1)
Epoch 31/50
5/5 [==============================] - 0s 18ms/step - loss: 0.3265 - accuracy: 0.8686 - val_loss: 0.6063 - val_accuracy: 0.7633
Epoch 32/50
5/5 [==============================] - 0s 21ms/step - loss: 0.3042 - accuracy: 0.8771 - val_loss: 0.7175 - val_accuracy: 0.7067
Epoch 33/50
5/5 [==============================] - 0s 18ms/step - loss: 0.2987 - accuracy: 0.8643 - val_loss: 0.6350 - val_accuracy: 0.7567
Epoch 34/50
5/5 [==============================] - 0s 21ms/step - loss: 0.2803 - accuracy: 0.8843 - val_loss: 0.6719 - val_accuracy: 0.7167
Epoch 35/50
5/5 [==============================] - 0s 23ms/step - loss: 0.2785 - accuracy: 0.8914 - val_loss: 0.7174 - val_accuracy: 0.7200
Epoch 36/50
5/5 [==============================] - 0s 23ms/step - loss: 0.2864 - accuracy: 0.8771 - val_loss: 0.5400 - val_accuracy: 0.7833
Epoch 37/50
5/5 [==============================] - 0s 20ms/step - loss: 0.3090 - accuracy: 0.8657 - val_loss: 0.6064 - val_accuracy: 0.7600
Epoch 38/50
5/5 [==============================] - 0s 20ms/step - loss: 0.2826 - accuracy: 0.8843 - val_loss: 0.6501 - val_accuracy: 0.7533
Epoch 39/50
5/5 [==============================] - 0s 20ms/step - loss: 0.2749 - accuracy: 0.8943 - val_loss: 0.6804 - val_accuracy: 0.7333
Epoch 40/50
5/5 [==============================] - 0s 21ms/step - loss: 0.2699 - accuracy: 0.8943 - val_loss: 0.5571 - val_accuracy: 0.7800
Epoch 41/50
5/5 [==============================] - 0s 23ms/step - loss: 0.2895 - accuracy: 0.8829 - val_loss: 0.5829 - val_accuracy: 0.7833
Epoch 42/50
5/5 [==============================] - 0s 21ms/step - loss: 0.2884 - accuracy: 0.8729 - val_loss: 0.6943 - val_accuracy: 0.7400
Epoch 43/50
5/5 [==============================] - 0s 24ms/step - loss: 0.2527 - accuracy: 0.9043 - val_loss: 0.8703 - val_accuracy: 0.7000
Epoch 44/50
5/5 [==============================] - 0s 19ms/step - loss: 0.2572 - accuracy: 0.8957 - val_loss: 0.7495 - val_accuracy: 0.7067
Epoch 45/50
5/5 [==============================] - 0s 19ms/step - loss: 0.2776 - accuracy: 0.8786 - val_loss: 0.6321 - val_accuracy: 0.7533
Epoch 46/50
5/5 [==============================] - 0s 20ms/step - loss: 0.2858 - accuracy: 0.8843 - val_loss: 0.5431 - val_accuracy: 0.7833
Epoch 47/50
5/5 [==============================] - 0s 20ms/step - loss: 0.2991 - accuracy: 0.8657 - val_loss: 0.6092 - val_accuracy: 0.7633
Epoch 48/50
5/5 [==============================] - 0s 20ms/step - loss: 0.2524 - accuracy: 0.8986 - val_loss: 0.8246 - val_accuracy: 0.7033
Epoch 49/50
5/5 [==============================] - 0s 21ms/step - loss: 0.2712 - accuracy: 0.8914 - val_loss: 0.8334 - val_accuracy: 0.6967
Epoch 50/50
5/5 [==============================] - 0s 22ms/step - loss: 0.2577 - accuracy: 0.8957 - val_loss: 0.6313 - val_accuracy: 0.7633

1.2. Model Performance and Improvements

The trained model reaches roughly 89% accuracy on the training set and about 76% on the validation set, which is solid performance for a simple dense model.

Notice that the validation loss trends upward while the training loss keeps decreasing, which suggests overfitting. To improve performance, we could add more regularization or change the architecture; one option is sketched below. For our purposes, this is sufficient, and we move on to RNNs.
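
As a sketch of one such improvement (an assumed variant, not the model trained above), we could add L2 weight penalties, use the BatchNormalization layer we imported earlier, and stop training once the validation loss stops improving:

from keras import regularizers
from keras.callbacks import EarlyStopping

# Hypothetical regularized variant of the dense model above
regularized_nn = Sequential()
regularized_nn.add(Dense(200, activation='relu', input_shape=(vector_size,),
                         kernel_regularizer=regularizers.l2(1e-3)))
regularized_nn.add(BatchNormalization())
regularized_nn.add(Dropout(.5))
regularized_nn.add(Dense(200, activation='relu',
                         kernel_regularizer=regularizers.l2(1e-3)))
regularized_nn.add(Dropout(.5))
regularized_nn.add(Dense(2, activation='softmax'))
regularized_nn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Stop when val_loss has not improved for 5 epochs and restore the best weights seen
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
regularized_nn.fit(text_features, target, validation_split=.3, shuffle=True,
                   epochs=50, batch_size=150, callbacks=[early_stop], verbose=0)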

2. Embedding Layers and Recurrent Neural Networks

Word embeddings work a little differently from the averaged word-vector features we used in the dense network. Instead of precomputed vectors, every word in the corpus is assigned an integer index, each review is represented as the sequence of indices of the words it contains, and the embedding layer then learns a dense vector for each index during training.

Let's demonstrate this in practice. We first count the frequency of each word and use that counter to determine the number of unique words in the corpus.

from collections import Counter

token_counter = Counter([token for review in corpus for token in review])

We can use the token counter to create a dictionary that maps every word to a unique integer index.

vocab_map = { item[0]:index+1 for index,item in enumerate(dict(token_counter).items()) } # Index all the words starting at 1
max_index = np.max(list(vocab_map.values()))   

When scoring new text, we may encounter words that are not in the vocabulary map. In the code below we add two special entries to cover those cases: a padding token mapped to index 0, and a not-found token mapped to the index just past the last word.

vocab_map['PAD_INDEX'] = 0
vocab_map['NOT_FOUND_INDEX'] = max_index + 1
vocab_size = len(vocab_map)

We also need the length of the longest review. The following list comprehension returns the maximum review length.

max_len = np.max([len(review) for review in corpus])
max_len
32

2.1. Padding Text Sequences

We need to pad the text sequences to make sure that all reviews have the same input length. To do this, we use the Keras pad_sequences utility.

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Map each review to its sequence of vocabulary indices, then left-pad every sequence to max_len
indexed_reviews = [[vocab_map[token] for token in review] for review in corpus]
input_features = pad_sequences(indexed_reviews, max_len)

input_features
array([[   0,    0,    0, ...,    2,    3,    4],
        [   0,    0,    0, ...,    6,    7,    8],
        [   0,    0,    0, ...,   13,   14,   15],
        ...,
        [   0,    0,    0, ...,    7,   76,   77],
        [   0,    0,    0, ...,  516,  512,   63],
        [   0,    0,    0, ..., 1323,   11,  528]], dtype=int32)

2.2. Building a Recurrent Neural Network

The RNN will be a relatively simple network with an Embedding layer, a single LSTM layer, and dropout for regularization. See the architecture below.

from keras.models import Sequential
from keras.layers import Dense, Embedding, Dropout
from keras.layers import LSTM

EMBEDDING_DIM = 128
LSTM_DIM = 64

rnn_model = Sequential()
rnn_model.add(Embedding(input_dim=vocab_size, output_dim=EMBEDDING_DIM, input_length=max_len))
rnn_model.add(Dropout(.2))
rnn_model.add(LSTM(LSTM_DIM, dropout=.2, recurrent_dropout=.2))
rnn_model.add(Dense(2, activation='softmax'))

# categorical_crossentropy matches the one-hot target and the two-unit softmax output
rnn_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

2.3. Fitting the RNN Classifier

In fitting our RNN we will use the following hyperparameters:

  • 1. batch_size = 100
  • 2. Epochs = 20
rnn_model.fit(input_features, target, epochs=20, batch_size=100, shuffle=True, validation_split=.3, verbose=1)
Epoch 1/20
7/7 [==============================] - 5s 307ms/step - loss: 0.6892 - accuracy: 0.5429 - val_loss: 0.7084 - val_accuracy: 0.3600
Epoch 2/20
7/7 [==============================] - 1s 191ms/step - loss: 0.6766 - accuracy: 0.5600 - val_loss: 0.7546 - val_accuracy: 0.3600
Epoch 3/20
7/7 [==============================] - 1s 118ms/step - loss: 0.6650 - accuracy: 0.5614 - val_loss: 0.7268 - val_accuracy: 0.3567
Epoch 4/20
7/7 [==============================] - 1s 115ms/step - loss: 0.6424 - accuracy: 0.5900 - val_loss: 0.7420 - val_accuracy: 0.3833
Epoch 5/20
7/7 [==============================] - 1s 114ms/step - loss: 0.5954 - accuracy: 0.6814 - val_loss: 0.7123 - val_accuracy: 0.4467
Epoch 6/20
7/7 [==============================] - 1s 113ms/step - loss: 0.5170 - accuracy: 0.7843 - val_loss: 0.7047 - val_accuracy: 0.5100
Epoch 7/20
7/7 [==============================] - 1s 114ms/step - loss: 0.4079 - accuracy: 0.8686 - val_loss: 0.6066 - val_accuracy: 0.7033
Epoch 8/20
7/7 [==============================] - 1s 112ms/step - loss: 0.3092 - accuracy: 0.9071 - val_loss: 0.6246 - val_accuracy: 0.7200
Epoch 9/20
7/7 [==============================] - 1s 110ms/step - loss: 0.2387 - accuracy: 0.9371 - val_loss: 0.6720 - val_accuracy: 0.7133
Epoch 10/20
7/7 [==============================] - 1s 120ms/step - loss: 0.1791 - accuracy: 0.9529 - val_loss: 0.6168 - val_accuracy: 0.7367
Epoch 11/20
7/7 [==============================] - 1s 114ms/step - loss: 0.1392 - accuracy: 0.9686 - val_loss: 0.8643 - val_accuracy: 0.6867
Epoch 12/20
7/7 [==============================] - 1s 116ms/step - loss: 0.1215 - accuracy: 0.9757 - val_loss: 0.6693 - val_accuracy: 0.7300
Epoch 13/20
7/7 [==============================] - 1s 108ms/step - loss: 0.0876 - accuracy: 0.9786 - val_loss: 0.6404 - val_accuracy: 0.7300
Epoch 14/20
7/7 [==============================] - 1s 114ms/step - loss: 0.0720 - accuracy: 0.9943 - val_loss: 0.6354 - val_accuracy: 0.7167
Epoch 15/20
7/7 [==============================] - 1s 155ms/step - loss: 0.0574 - accuracy: 0.9900 - val_loss: 0.7458 - val_accuracy: 0.7367
Epoch 16/20
7/7 [==============================] - 1s 210ms/step - loss: 0.0464 - accuracy: 0.9900 - val_loss: 0.8075 - val_accuracy: 0.7200
Epoch 17/20
7/7 [==============================] - 1s 199ms/step - loss: 0.0388 - accuracy: 0.9929 - val_loss: 0.7612 - val_accuracy: 0.7500
Epoch 18/20
7/7 [==============================] - 1s 206ms/step - loss: 0.0343 - accuracy: 0.9943 - val_loss: 0.9616 - val_accuracy: 0.7233
Epoch 19/20
7/7 [==============================] - 1s 115ms/step - loss: 0.0279 - accuracy: 0.9971 - val_loss: 0.7260 - val_accuracy: 0.7467
Epoch 20/20
7/7 [==============================] - 1s 109ms/step - loss: 0.0284 - accuracy: 0.9986 - val_loss: 0.8207 - val_accuracy: 0.7533

We see that the training accuracy climbs to roughly 99% over time, while the validation loss first decreases and then increases, indicating overfitting. It is also important to note that the dataset is rather small, only about 1,000 reviews, and neural networks can easily overfit data of that size. One way to rein this in is sketched below.
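
As a hedged sketch (an assumed variant, not the run shown above), we could shrink the embedding and LSTM dimensions so there are fewer parameters available to memorize the data, and reuse early stopping so training halts when the validation loss stops improving:

from keras.callbacks import EarlyStopping

# Hypothetical smaller RNN: fewer parameters are harder to overfit on ~1,000 reviews
small_rnn = Sequential()
small_rnn.add(Embedding(input_dim=vocab_size, output_dim=32, input_length=max_len))
small_rnn.add(Dropout(.3))
small_rnn.add(LSTM(32, dropout=.3, recurrent_dropout=.3))
small_rnn.add(Dense(2, activation='softmax'))
small_rnn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

small_rnn.fit(input_features, target, epochs=20, batch_size=100, shuffle=True,
              validation_split=.3, verbose=0,
              callbacks=[EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)])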

2.4. Predicting Sentiment of New Reviews

We have built our model and now want to apply it to new reviews to see how it performs. The new input must be preprocessed in the same way as the training data. Let's see this in action below.

new_text = "the food was great!"

# Fall back to NOT_FOUND_INDEX for any word that is not in the vocabulary map
new_sequence = [[vocab_map.get(token, vocab_map['NOT_FOUND_INDEX'])
                 for token in text_to_word_sequence(new_text)]]
padded_text_input = pad_sequences(new_sequence, max_len)
padded_text_input
array([[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
        0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
        0,   0,  11, 124,  13,  31]], dtype=int32)
rnn_model.predict(padded_text_input)
1/1 [==============================] - 1s 1s/step
array([[0.00378104, 0.996219  ]], dtype=float32)

Notice that our model has predicted roughly a 99.6% probability that the review is positive and about a 0.4% probability that it is negative. Not bad for a test prediction.
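
To make this reusable, we can wrap the cleaning, index lookup (falling back to NOT_FOUND_INDEX for unseen words), padding, and class selection into one small helper. This is a convenience sketch of our own, not part of the original pipeline, so the function name is hypothetical:

def predict_sentiment(text, model=rnn_model):
    """ Return (label, probability) for a raw review string """
    tokens = text_to_word_sequence(processText(text))
    indices = [vocab_map.get(token, vocab_map['NOT_FOUND_INDEX']) for token in tokens]
    padded = pad_sequences([indices], max_len)
    probs = model.predict(padded, verbose=0)[0]            # [P(negative), P(positive)]
    label = 'positive' if np.argmax(probs) == 1 else 'negative'
    return label, float(np.max(probs))

predict_sentiment("the food was great!")
# given the prediction shown above, this should come out around ('positive', 0.99)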

Conclusion

In this series, we covered a range of NLP techniques, from traditional feature extraction methods like TF-IDF and Bag of Words to modern approaches like Word2Vec and embedding layers. We have learned how to use word vectors and word embeddings to build classification and sentiment analysis models. Try using the above model architectures and techniques to build a model for the Amazon product reviews data and compare the results; a starter sketch follows below. Deep learning techniques offer plenty of opportunities to improve model performance with additional layers, regularizers, and hyperparameter changes!
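
As a starting point for that exercise, here is a hedged sketch of reusing the same pipeline on another dataset. The file name and column names below are placeholders; adjust them to match the actual Amazon reviews file you use.

# Hypothetical file and column names -- replace with your actual dataset
amazon_data = pd.read_csv('amazon_reviews.csv')
amazon_data['clean_review'] = amazon_data['review_text'].apply(lambda x: processText(str(x)))

amazon_corpus = [text_to_word_sequence(review) for review in amazon_data.clean_review.values]
amazon_w2v = Word2Vec(sentences=amazon_corpus, vector_size=vector_size, window=window_size,
                      sg=0, min_count=2, sample=.000001)
amazon_features = word_vectorizer(corpus=amazon_corpus, model=amazon_w2v, num_features=vector_size)
amazon_target = to_categorical(amazon_data['rating_label'])   # hypothetical binary label column

# ...then reuse either the dense architecture or the RNN architecture defined above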