Part 3: Sentiment Classification with Naive Bayes and SVM - Linear Classifier

In Part 2, we developed feature extraction techniques like Bag of Words, N-grams, and TF-IDF to create a feature list from the Amazon product review dataset. Building on that, we now move on to building a sentiment classifier using Naive Bayes and Support Vector Machines. Specifically, we cover:

Introducing the Restaurant Review Dataset

  • 1. Preprocessing Restaurant Reviews
  • 2. Train and Test Split

Classification Pipeline and Modeling

  • 1. Naive Bayes Classifier
  • 2. Classifying New Text
  • 3. Support Vector Machine - Linear Classifier

Dataset: Restaurant Reviews

If you have been following the notebooks, you will have noticed that the Amazon Product Reviews dataset is computationally expensive and sometimes prohibitive to work with; running out of memory is common. To mitigate this, I introduce a lighter dataset called restaurant_reviews. It is available here: Restaurant Review Data

import pandas as pd 

review_data = pd.read_csv('restaurant_reviews.tsv', sep='\t')
review_data.head()
Review Liked
0 Wow... Loved this place. 1
1 Crust is not good. 0
2 Not tasty and the texture was just nasty. 0
3 Stopped by during the late May bank holiday of... 1
4 The selection on the menu was great and so wer... 1
len(review_data)
1000
len(review_data), review_data.Liked.value_counts()
(1000,
 1    500
 0    500
 Name: Liked, dtype: int64)

Preprocessing: Cleaning and Stemming Reviews

The code below cleans the text and stems each review to reduce words to their roots. The outcome is shorter sentences that retain strong indicators of sentiment.

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()

def cleaningText(text):
    """
    Text cleaning:
        - Remove punctuation and numbers
        - Tokenize the text
        - Remove stopwords (the check is case-sensitive, so capitalized words such as 'Not' are kept)
        - Lowercase and stem the remaining tokens
    """ 
    text = re.sub("[^a-zA-Z]", " ", text) # Keep letters only (removes punctuation and digits)
    text = re.sub("[0-9]+", "", text) # Remove numbers (already handled by the substitution above)
    text = [ porter_stemmer.stem(word.lower()) for word in word_tokenize(text) if word not in stopwords.words('english') ]
    return " ".join(text)
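
Before applying the function to the whole column, a quick sanity check on a single review; the result should match row 0 of the table that follows.

cleaningText("Wow... Loved this place.")   # expected: 'wow love place' (compare row 0 below)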


review_data['clean_review'] = review_data.Review.apply(lambda x: cleaningText( str(x) ))
review_data[[ 'Review', 'clean_review' ]].head()
Review clean_review
0 Wow... Loved this place. wow love place
1 Crust is not good. crust good
2 Not tasty and the texture was just nasty. not tasti textur nasti
3 Stopped by during the late May bank holiday of... stop late may bank holiday rick steve recommen...
4 The selection on the menu was great and so wer... the select menu great price

Preprocessing: Features to Matrix

The code below converts the clean_review column into unigram and bigram TF-IDF features using TfidfVectorizer.

from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer( #max_features = 1000,         # Return the top 1000 features
                                    analyzer='word',              # Word-level vectorizer
                                    lowercase=True,               # Lowercase the text
                                    min_df = 5,                   # Use tokens that appear at least 5 times
                                    ngram_range=(1, 2),           # Use unigrams and bigrams
                                    tokenizer= word_tokenize,     # Use this tokenizer
                                    stop_words = 'english',       # Remove English stopwords 
                                    sublinear_tf=1, smooth_idf=1, use_idf=1) # TF-IDF weighting options

tfidf_vectorizer.fit(review_data.clean_review)
features = pd.DataFrame( tfidf_vectorizer.transform(review_data.clean_review).toarray(), 
                         columns=tfidf_vectorizer.get_feature_names_out())
features.head()
absolut alway amaz ambianc anoth anoth minut anytim anytim soon area arriv ... wast watch way went wine wonder worst worth wrong year
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
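
As a sanity check, we can look at how many unigram and bigram features survived the min_df threshold. The row count is the 1,000 reviews; the exact column count depends on the fitted vocabulary, so the output is not shown here.

features.shape                                   # (1000, number of retained n-gram features)
len(tfidf_vectorizer.get_feature_names_out())    # same count, taken directly from the vectorizer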

Train and Test Split

The code below implements a 70-30 percent train-test split, stratified on the Liked label.

import numpy as np
from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split( features, review_data.Liked, test_size=.30, stratify=review_data.Liked, random_state=42)
y_train.value_counts(), y_test.value_counts()
(1    350
 0    350
 Name: Liked, dtype: int64,
 1    150
 0    150
 Name: Liked, dtype: int64)

1. Naive Bayes Classification

The Naive Bayes classification algorithm is generally an effective technique for classifying text. Derived from Bayes' theorem, it evaluates the probability of a sentiment being positive or negative given the presence of the words contained in the review.

Mathematically, it all begins with the Bayes rule of conditional probability:

$$P(A|B) = \frac {P(A)P(B|A)} {P(B)} $$

Translating this to the Naive Bayes algorithm, we want to predict the sentiment of some text given the presence of words in that text. Mathematically, it can be expressed as:

$$ P(positive \ sentiment | w_1, w_2, w_3,...) = \frac {P(positive \ sentiment) P(w_1, w_2, w_3,...| positive \ sentiment)} {P(w_1, w_2, w_3,...)} $$

where $w_i$ is a word.

The most important assumption of Naive Bayes (and where it gets its name) is conditional independence, which stipulates that the words $w_i$ are independent of each other once conditioned on the same class. This property is very useful because it lets us rewrite the probability as:

$$ P(positive \ sentiment | w_1, w_2, w_3,...) \propto P(positive \ sentiment)\, P(w_1|positive \ sentiment)\, P(w_2|positive \ sentiment)\, P(w_3|positive \ sentiment) \cdots $$

The above expansion lets us write a general formula for the class probability:

$$ P(sentiment | w_1, w_2,..., w_n) = \frac{1}{Z}\, P(sentiment) \prod_i P(w_i|sentiment) $$

where $Z$ is the normalizer, i.e. the probability of observing the words, $P(w_1, w_2,..., w_n)$.
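
To make this concrete, here is a small worked example with hypothetical probabilities (made up for illustration, not estimated from our data) for a two-word review "great food", assuming equal priors:

$$ P(positive | great, food) \propto P(positive)\,P(great|positive)\,P(food|positive) = 0.5 \times 0.10 \times 0.05 = 0.0025 $$

$$ P(negative | great, food) \propto P(negative)\,P(great|negative)\,P(food|negative) = 0.5 \times 0.01 \times 0.05 = 0.00025 $$

Normalizing by $Z = 0.0025 + 0.00025$ gives roughly $0.91$ for the positive class, so this review would be labeled positive.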

Enough with the math and theory; let's see this in action with Python.

Naive Bayes Model Implementation in Python

The code below initializes a Multinomial Naive Bayes model with the Laplace smoothing parameter (alpha) set to 0.3.

from sklearn.naive_bayes import MultinomialNB
    
naive_bayes_model = MultinomialNB(alpha=.3, fit_prior=True)

Model Training

naive_bayes_model.fit(x_train, y_train)
MultinomialNB(alpha=0.3, class_prior=None, fit_prior=True)

Train and Test Assessment

After training the model, we can compute the training and test accuracy. This tells us how well the model generalizes to reviews it has not seen before.

from sklearn.metrics import accuracy_score

print("Training Accuracy:", round( accuracy_score(naive_bayes_model.predict(x_train), y_train ), 2) )
print("Test Accuracy:", round( accuracy_score(naive_bayes_model.predict(x_test), y_test ), 2))
Training Accuracy: 0.85
Test Accuracy: 0.74

The training and test accuracies are reasonable for a baseline model, though the gap between them (0.85 versus 0.74) suggests some overfitting to the training data.
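
Accuracy alone can hide class-specific errors. As an optional check, scikit-learn's built-in metrics give per-class precision and recall and a confusion matrix; a minimal sketch (the exact numbers will depend on your run, so the output is not shown):

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall and F1 on the held-out test set
print(classification_report(y_test, naive_bayes_model.predict(x_test)))

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, naive_bayes_model.predict(x_test)))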

Predicting New Reviews

To run the model on reviews outside of the train and test sets, we must apply the same preprocessing and vectorization steps. Below is an example of the implementation.

pos_text = tfidf_vectorizer.transform( [cleaningText("The restaurant had great food")] ).toarray()
neg_text = tfidf_vectorizer.transform( [cleaningText("The restaurant had terrible food")] ).toarray()

naive_bayes_model.predict(pos_text), naive_bayes_model.predict(neg_text)
(array([1]), array([0]))

As we can see from these simple examples, the model performs decently, predicting 1 for the positive review and 0 for the negative one.
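
MultinomialNB also exposes class probabilities through predict_proba, which is useful when a confidence score is needed rather than a hard label. A minimal sketch on the same two texts (the exact probabilities depend on the fitted model, so the output is not shown):

# Columns follow naive_bayes_model.classes_, i.e. [0, 1]
naive_bayes_model.predict_proba(pos_text), naive_bayes_model.predict_proba(neg_text)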

Sentiment Classification with Support Vector Machine

Support Vector Machines are a family of classification algorithms that classify observations by determining the hyperplane that separates the classes in question. At their core they are linear classifiers, but they can be extended with a variety of kernel functions to separate two or more classes by finding the hyperplane that maximizes the margin between observations of different classes.

SVMs turn out to be very effective on sparse data because they are linear. Given that our feature matrix is very sparse, let's use an SVM to determine the sentiment.

Notice that the SVM (scikit-learn's SVC) has the following tuning parameters (a grid-search sketch for tuning them appears after the SVM results below):

  • 1. Kernel: Specifies the kernel function used when determining the decision boundary
  • 2. Gamma: Kernel coefficient for the non-linear kernels (RBF, polynomial, sigmoid); it is ignored when kernel='linear'
  • 3. C Parameter: Balances model complexity (correct classification of training points) against a smooth decision boundary

Below is the implementation in Python:

from sklearn.svm import SVC

svm_linear =  SVC( C=1,                # Setting C at its default value
                   kernel='linear',    # Using a linear kernel
                   gamma=100,          # Gamma has no effect with a linear kernel
                   probability=True,   # Enable probability estimates via predict_proba
                   random_state= 42)

Fitting the model

Much like we did with Naive Bayes, we fit the SVM model using the fit() method.

svm_linear.fit(x_train, y_train)
SVC(C=1, gamma=100, kernel='linear', probability=True, random_state=42)
print("Training Accuracy:", round( accuracy_score(svm_linear.predict(x_train), y_train), 2) )
print("Test Accuracy:", round( accuracy_score(svm_linear.predict(x_test), y_test ), 2) )
Training Accuracy: 0.87
Test Accuracy: 0.75
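
As mentioned in the parameter list above, the kernel, gamma, and C need not be fixed by hand. A minimal cross-validated grid-search sketch (the parameter grid here is only an illustrative choice, and the best values and scores will depend on the data split):

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Candidate values to try; this grid is illustrative, not exhaustive
param_grid = { 'C': [0.1, 1, 10],
               'kernel': ['linear', 'rbf'],
               'gamma': ['scale', 0.1, 1] }

grid_search = GridSearchCV( SVC(probability=True, random_state=42),
                            param_grid, cv=5, scoring='accuracy' )
grid_search.fit(x_train, y_train)
grid_search.best_params_, grid_search.best_score_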

Predicting with SVM Model

We can now test our model with data outside of the training and testing set.

pos_text = tfidf_vectorizer.transform( [cleaningText("The food was okay")] ).toarray()
neg_text = tfidf_vectorizer.transform( [cleaningText("I did not like the food")] ).toarray()

svm_linear.predict(pos_text), svm_linear.predict_proba(pos_text), svm_linear.predict(neg_text), svm_linear.predict_proba(neg_text)
(array([0]),
 array([[0.69524805, 0.30475195]]),
 array([0]),
 array([[0.79130427, 0.20869573]]))

We see that our model does a reasonably good job of predicting sentiment and the associated probabilities. For example, the clearly negative review is classified as negative with higher confidence than the neutral one. Notice that the neutral review ("The food was okay") falls into the negative class because our target variable is binary.