Part 3: Sentiment Classification with Naive Bayes and SVM - Linear Classifier
In Part 2, we developed feature extraction techniques such as Bag of Words, N-grams, and TF-IDF to create a feature list from the Amazon product review dataset. Building on that, we now construct a sentiment classifier using Naive Bayes and Support Vector Machines. Specifically, we cover:
Introducing Restaurant Review Dataset
- 1. Preprocessing Restaurant Reviews
- 2. Train and Test Split
Classification Pipeline and Modeling
- 1. Naive Bayes Classifier
- 2. Classifying New Text
- 3. Support Vector Machine - Linear Classifier
Dataset: Restaurant Reviews
If you have been following the notebooks, you will have noticed that the Amazon Product Reviews dataset is computationally expensive and sometimes prohibitive to work with; running out of memory is common. To mitigate this, I introduce a lighter dataset called restaurant_reviews. It is available here: Restaurant Review Data
import pandas as pd
review_data = pd.read_csv('restaurant_reviews.tsv', sep='\t')
review_data.head()
| | Review | Liked |
|---|---|---|
0 | Wow... Loved this place. | 1 |
1 | Crust is not good. | 0 |
2 | Not tasty and the texture was just nasty. | 0 |
3 | Stopped by during the late May bank holiday of... | 1 |
4 | The selection on the menu was great and so wer... | 1 |
len(review_data)
1000
len(review_data), review_data.Liked.value_counts()
(1000,
 1    500
 0    500
 Name: Liked, dtype: int64)
Preprocessing: Cleaning and Stemming Reviews
The code below cleans the text and stems each word to its root form in every review. The outcome is shorter sentences that retain a strong indication of sentiment.
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()
def cleaningText(text):
    """
    Text Cleaning:
    - Remove Punctuation
    - Remove Numbers
    - Tokenize Text
    - Stem Text
    - Remove Stopwords
    """
    text = re.sub("[^a-zA-Z]", " ", text)  # Keep letters only (also removes punctuation and digits)
    text = re.sub("[0-9]+", "", text)      # Remove numbers (already handled above, kept as a safeguard)
    # Tokenize, drop English stopwords, then stem the lowercased tokens.
    # Note: the stopword check runs on the original casing, so capitalized stopwords (e.g. "The") slip through.
    text = [ porter_stemmer.stem(word.lower()) for word in word_tokenize(text) if word not in stopwords.words('english') ]
    return " ".join(text)
review_data['clean_review'] = review_data.Review.apply(lambda x: cleaningText( str(x) ))
review_data[[ 'Review', 'clean_review' ]].head()
| | Review | clean_review |
|---|---|---|
0 | Wow... Loved this place. | wow love place |
1 | Crust is not good. | crust good |
2 | Not tasty and the texture was just nasty. | not tasti textur nasti |
3 | Stopped by during the late May bank holiday of... | stop late may bank holiday rick steve recommen... |
4 | The selection on the menu was great and so wer... | the select menu great price |
Preprocessing: Features to Matrix
The code below converts the clean_review column into 1-2 n-gram (unigram and bigram) TF-IDF features using TfidfVectorizer.
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer( #max_features = 1000,   # Return only the top 1000 features (disabled here)
                    analyzer='word',          # Word-level vectorizer
                    lowercase=True,           # Lowercase the text
                    min_df = 5,               # Keep tokens that appear in at least 5 reviews
                    ngram_range=(1, 2),       # Create unigrams and bigrams (1-2 n-grams)
                    tokenizer= word_tokenize, # Use NLTK's word_tokenize as the tokenizer
                    stop_words = 'english',   # Remove English stopwords
                    sublinear_tf=True, smooth_idf=True, use_idf=True)  # Additional TF-IDF options
tfidf_vectorizer.fit(review_data.clean_review)
features = pd.DataFrame( tfidf_vectorizer.transform(review_data.clean_review).toarray(),
columns=tfidf_vectorizer.get_feature_names_out())
features.head()
| | absolut | alway | amaz | ambianc | anoth | anoth minut | anytim | anytim soon | area | arriv | ... | wast | watch | way | went | wine | wonder | worst | worth | wrong | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
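A note on the design choice above: the .toarray() call builds a dense DataFrame, which is handy for inspection but memory-hungry as the vocabulary grows (the very problem noted with the Amazon dataset). As an optional aside, the sketch below keeps the sparse matrix that TfidfVectorizer returns; the rest of this walkthrough sticks with the dense features DataFrame.
# Optional alternative: keep the scipy sparse matrix instead of a dense DataFrame.
# TfidfVectorizer.transform() returns a CSR sparse matrix, and train_test_split,
# MultinomialNB and SVC used later in this notebook all accept it directly.
sparse_features = tfidf_vectorizer.transform(review_data.clean_review)
print(type(sparse_features), sparse_features.shape)   # (number of reviews, number of n-gram features)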
Train and Test Split
The code below implements a 70/30 train-to-test split, stratified on the Liked label so both splits stay balanced.
import numpy as np
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split( features, review_data.Liked, test_size=.30, stratify=review_data.Liked, random_state=42)
y_train.value_counts(), y_test.value_counts()
(1    350
 0    350
 Name: Liked, dtype: int64,
 1    150
 0    150
 Name: Liked, dtype: int64)
1. Naive Bayes Classification
The Naive Bayes classification algorithm is generally an effective technique for classifying text. Derived from Bayes' theorem, it evaluates the probability of a review's sentiment being positive or negative given the presence of the words it contains.
Mathematically, it all begins with the Bayes rule of conditional probability:
$$P(A|B) = \frac {P(A)P(B|A)} {P(B)} $$
Translating it to the Naive Bayes Algorithm, we want to predict the sentiment of some text given the presence of words in the text. Mathematically, it can be expressed as:
$$ P(positive \ sentiment | w_1, w_2, w_3, ...) = \frac {P(positive \ sentiment) P(w_1, w_2, w_3, ... | positive \ sentiment)} {P(w_1, w_2, w_3, ...)} $$
where $w_i$ is a word
The most important assumption of Naive Bayes (and where it gets its name) is conditional independence, which stipulates that the words $w_i$ are independent of one another once we condition on the same class. This property is very useful because we can then expand the probability as:
$$ P(positive \ sentiment | w_1, w_2, w_3, ...) \propto P(positive \ sentiment) \, P(w_1 | positive \ sentiment) \, P(w_2 | positive \ sentiment) \, P(w_3 | positive \ sentiment) \cdots $$
The above expansion can help us formulate a general formula for the probability class as follows:
$$ P(sentiment | w_1, w_2, ...) = \frac{1}{Z} \, P(sentiment) \prod_i P(w_i | sentiment) $$
where $Z = P(w_1, w_2, ...)$ is the normalizer, i.e. the probability of observing that combination of words; it is the same for every class, so it does not change which sentiment scores highest.
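To make the decision rule concrete, here is a tiny hand-computed sketch in plain Python; the word likelihoods are made-up values for illustration, not estimates from the restaurant data:
# A toy, hand-computed illustration of the Naive Bayes decision rule above.
p_pos, p_neg = 0.5, 0.5                        # class priors P(sentiment)
p_word_pos = {"great": 0.08, "food": 0.05}     # made-up P(w | positive)
p_word_neg = {"great": 0.01, "food": 0.04}     # made-up P(w | negative)
review = ["great", "food"]
score_pos, score_neg = p_pos, p_neg
for w in review:
    score_pos *= p_word_pos[w]                 # P(positive) * prod P(w | positive)
    score_neg *= p_word_neg[w]                 # P(negative) * prod P(w | negative)
z = score_pos + score_neg                      # the normalizer Z
print("P(positive | review) =", round(score_pos / z, 2))   # ~0.91
print("P(negative | review) =", round(score_neg / z, 2))   # ~0.09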
Enough with the math and theory; let's see this in action on the restaurant reviews.
Naive Bayes Model Implementation in Python
The code below initializes a multinomial Naive Bayes model with the Laplace/Lidstone smoothing parameter (alpha) set to 0.3.
from sklearn.naive_bayes import MultinomialNB
naive_bayes_model = MultinomialNB(alpha=.3, fit_prior=True)
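For context, and paraphrasing the scikit-learn documentation rather than the notebook itself, alpha enters MultinomialNB's smoothed word-likelihood estimate roughly as:
$$ \hat{P}(w_i | sentiment) = \frac{N_i + \alpha}{N + \alpha \, n} $$
where $N_i$ is how often word $w_i$ appears in the training reviews of that sentiment, $N$ is the total (weighted) word count for that sentiment, and $n$ is the vocabulary size. Setting $\alpha > 0$ prevents an unseen word from zeroing out the whole product of probabilities.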
Model Training
naive_bayes_model.fit(x_train, y_train)
MultinomialNB(alpha=0.3, class_prior=None, fit_prior=True)
Train and Test Assessment
After training the model, we can compute the training and test accuracy. This tells us how well (or poorly) the model generalizes to reviews it has not seen.
from sklearn.metrics import accuracy_score
print("Training Accuracy:", round( accuracy_score(naive_bayes_model.predict(x_train), y_train ), 2) )
print("Test Accuracy:", round( accuracy_score(naive_bayes_model.predict(x_test), y_test ), 2))
Training Accuracy: 0.85
Test Accuracy: 0.74
The training accuracy (0.85) and test accuracy (0.74) are not bad for a basic model: roughly three out of four unseen reviews are classified correctly, though the gap between the two numbers hints at some overfitting.
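Accuracy alone does not show which class the model gets wrong. As an optional check that goes beyond the original walkthrough, scikit-learn's confusion_matrix and classification_report break the test performance down per class:
from sklearn.metrics import classification_report, confusion_matrix
# Per-class breakdown of the held-out test performance
nb_test_pred = naive_bayes_model.predict(x_test)
print(confusion_matrix(y_test, nb_test_pred))   # rows = actual class, columns = predicted class
print(classification_report(y_test, nb_test_pred, target_names=['negative (0)', 'positive (1)']))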
Predicting New Reviews
To run the model on reviews outside the train and test sets, we must apply the same preprocessing and vectorization steps. Below is an example.
pos_text = tfidf_vectorizer.transform( [cleaningText("The restaurant had great food")] ).toarray()
neg_text = tfidf_vectorizer.transform( [cleaningText("The restaurant had terrible food")] ).toarray()
naive_bayes_model.predict(pos_text), naive_bayes_model.predict(neg_text)
(array([1]), array([0]))
As the simple examples above show, the model behaves as expected, predicting 1 for the positive review and 0 for the negative one.
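To avoid repeating the clean-then-vectorize steps for every new review, one could wrap them in a small helper. This is just a convenience sketch; predict_sentiment is a hypothetical function, not part of the original pipeline:
def predict_sentiment(text, model=naive_bayes_model):
    """Hypothetical helper: clean, vectorize and classify one raw review."""
    vector = tfidf_vectorizer.transform([ cleaningText(text) ]).toarray()
    return "positive" if model.predict(vector)[0] == 1 else "negative"

predict_sentiment("The restaurant had great food"), predict_sentiment("The restaurant had terrible food")
# ('positive', 'negative')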
Sentiment Classification with Support Vector Machine
Support Vector Machines (SVMs) are a family of classification algorithms that classify by finding the hyperplane separating the classes in question. In their basic form they are linear classifiers that choose the hyperplane maximizing the margin, i.e. the distance between the boundary and the closest observations of each class; kernel functions extend them to a variety of non-linear boundaries.
Because they are linear in this form, SVMs turn out to be very effective with sparse data. Given that our feature matrix is very sparse, let's use an SVM to determine the sentiment.
Notice that SVM has the following tuning parameters (the objective they come from is sketched right after this list):
- 1. Kernel: Specifies the kernel function used to shape the decision boundary
- 2. Gamma: The kernel coefficient for rbf, poly and sigmoid kernels; it has no effect when the kernel is linear
- 3. C Parameter: The balance between classifying training points correctly (model complexity) and keeping a smooth, wide-margin boundary
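For reference, here is the textbook soft-margin objective in which C appears; this is a standard formulation rather than something taken from the notebook:
$$ \min_{w, b, \xi} \ \frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 $$
A large C punishes misclassified training points heavily (a tighter, more complex boundary), while a small C tolerates a few mistakes in exchange for a wider, smoother margin.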
Below is the implementation in Python:
from sklearn.svm import SVC
svm_linear = SVC( C=1,               # Setting C at its default value
                  kernel='linear',   # Using a linear kernel
                  gamma=100,         # Ignored by the linear kernel, so it has no effect here
                  probability=True,  # Enable predict_proba()
                  random_state=42)
Fitting the model
Much like we did with Naive Bayes, we fit the SVM model using the fit() method.
svm_linear.fit(x_train, y_train)
SVC(C=1, gamma=100, kernel='linear', probability=True, random_state=42)
print("Training Accuracy:", round( accuracy_score(svm_linear.predict(x_train), y_train), 2) )
print("Test Accuracy:", round( accuracy_score(svm_linear.predict(x_test), y_test ), 2) )
Training Accuracy: 0.87
Test Accuracy: 0.75
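We left C at its default value above; if we wanted to tune it, a small cross-validated grid search is one way to go. This is an optional sketch, not part of the original walkthrough:
from sklearn.model_selection import GridSearchCV
# Search a few C values with 5-fold cross-validation on the training split only
param_grid = { 'C': [0.1, 1, 10] }
svm_search = GridSearchCV(SVC(kernel='linear', random_state=42), param_grid, cv=5, scoring='accuracy')
svm_search.fit(x_train, y_train)
print("Best C:", svm_search.best_params_)
print("Best cross-validated accuracy:", round(svm_search.best_score_, 2))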
Predicting with SVM Model
We can now test our model on data outside of the training and test sets.
pos_text = tfidf_vectorizer.transform( [cleaningText("The food was okay")] ).toarray()
neg_text = tfidf_vectorizer.transform( [cleaningText("I did not like the food")] ).toarray()
svm_linear.predict(pos_text), svm_linear.predict_proba(pos_text), svm_linear.predict(neg_text), svm_linear.predict_proba(neg_text)
(array([0]), array([[0.69524805, 0.30475195]]), array([0]), array([[0.79130427, 0.20869573]]))
We see that our model does a reasonably good job of predicting sentiment along with the associated probabilities. For example, the clearly negative review is predicted negative with higher confidence (about 0.79) than the neutral one (about 0.70). Notice that the neutral review falls under the negative class because our target variable is binary; there is no neutral label to assign.
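Because the kernel is linear, the fitted SVC also exposes the hyperplane weights, one per tf-idf n-gram, which gives a rough peek at what drives the predictions. This is an exploratory sketch assuming the same svm_linear and tfidf_vectorizer fitted above on the dense feature matrix:
import numpy as np
# For a linear kernel, coef_ holds one weight per feature; positive weights
# push the decision toward class 1 (Liked), negative weights toward class 0.
feature_names = tfidf_vectorizer.get_feature_names_out()
weights = svm_linear.coef_.ravel()
top_positive = np.argsort(weights)[-10:][::-1]   # ten most positive weights
top_negative = np.argsort(weights)[:10]          # ten most negative weights
print("N-grams pulling toward positive:", feature_names[top_positive])
print("N-grams pulling toward negative:", feature_names[top_negative])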