
Multinomial Naïve Bayes Classifier for Sentiment Analysis of Internet Movie Database

    https://doi.org/10.1142/S2196888823500100

    Abstract

    Sentiment analysis (SA), also known as opinion mining, is a natural language processing (NLP) technique used to determine the sentiment or emotional tone behind a piece of text. It involves analyzing the text to identify whether it expresses a positive, negative, or neutral sentiment. SA can be applied to various types of text data such as social media posts, customer reviews, news articles, and more. This experiment is based on the Internet Movie Database (IMDB) dataset, which comprises movie reviews and the positive or negative labels related to them. Our research objective is to identify the model with the best accuracy and the most generality. Text preprocessing is the first and most critical phase in an NLP system since it significantly impacts the overall accuracy of the classification algorithms. The experiment implements unsupervised sentiment classification algorithms, including Valence Aware Dictionary and sEntiment Reasoner (VADER) and TextBlob. We also examine supervised sentiment classification methods, namely Naïve Bayes (Bernoulli NB and Multinomial NB). The Term Frequency-Inverse Document Frequency (TFIDF) model is used for feature selection and extraction. In our experiments, the combination of Multinomial NB and TFIDF achieves the highest accuracy, 87.64%, on both classification reports.

    1. Introduction

    The exponential growth of digital information has resulted in the rapid development of a new field of research: sentiment analysis (SA). In the modern science of artificial intelligence (AI), SA is a crucial technique for gleaning emotional information from massive volumes of data. SA is also known as opinion mining.1 SA2,3 has been refined over the years using several machine learning and dictionary-based algorithms to improve its accuracy. With the advent of deep learning algorithms4 in SA, prior knowledge also plays an important role in conveying the polarity of views well.

    SA can be applied in various types of text data such as social media posts, customer reviews, news articles, and more. It is widely used in a range of applications, including: (1) Brand monitoring: Companies use SA to monitor online conversations about their brand and products, helping them understand customer opinions and sentiment towards their offerings.5,6 (2) Customer feedback analysis: SA enables businesses to analyze customer feedback from surveys, reviews, and social media to gain insights into customer satisfaction and identify areas for improvement.7,8 (3) Market research: SA can be employed to analyze public sentiment towards specific products, services, or market trends, providing valuable insights for market research and decision-making.9,10 (4) Social media analysis: By analyzing sentiment in social media posts, companies and organizations can gauge public opinion on various topics, track trends, and understand the impact of their social media campaigns.11,12 (5) Reputation management: SA helps organizations monitor their online reputation by identifying negative sentiment or potential PR issues, allowing them to address problems promptly and maintain a positive image. (6) Political analysis: SA is used to gauge public opinion towards political candidates, parties, or specific policies, assisting in election campaigns and policy development.13,14

    In the industrial context, businesses primarily use SA to collect and assess client feedback. The fields of natural language processing (NLP) and SA are inextricably linked.15 The majority of the material on the internet comes in the form of natural language, which machines cannot understand owing to its complexity and inter-word semantics.16 NLP analyzes these natural texts to produce representations that a machine can understand for SA. The Internet Movie Database (IMDB) dataset comprises 50,000 labeled reviews, equally split between 25,000 train and 25,000 test reviews. Reviews are labeled as positive or negative, and the task is to predict the sentiment of an unseen review. The sentiment of a movie review is usually associated with a numeric rating, which can be used for classification problems and as a reference instrument for movie preference.17,18

    This study is an expanded version of the article presented at the ACIIDS 2022 conference.19 In that article, only the Linear Model and Naïve Bayes were proposed for sentiment classification. The following are the significant contributions of this work: (1) The study is based on the IMDB dataset, which comprises movie reviews and the positive or negative labels that relate to them. (2) The goal of our study is to find the model with the highest accuracy and the greatest generality. (3) The following classifiers are used in this work: Unsupervised Learning (VADER and TextBlob) and Supervised Learning (Naïve Bayes, including Bernoulli NB and Multinomial NB). Different techniques, including the Count Vectorizer, the Term Frequency-Inverse Document Frequency (TFIDF) Vectorizer, the minimum–maximum number of words, and max features, are implemented. We observe that the Multinomial Naïve Bayes model performs well compared to other methods. The rest of the paper is structured in the following manner. Section 2 introduces the related work. Section 3 describes the research process. Section 4 discusses the experiment's findings. Section 5 summarizes the conclusions and future research endeavors.

    2. Related Work

    2.1. Sentiment analysis (SA)

    Researchers have been working on various recommendation algorithms based on text data supplied by internet users over the last couple of years. Zirn et al.20 developed a fully automated system for fine-grained SA at the sub-sentence level, incorporating several sentiment lexicons, neighborhood links, and discourse linkages. Appel et al.21 established a hybrid strategy based on ambiguity management, semantic rules, and a sentiment lexicon using Twitter sentiment and movie review datasets. The authors compared the performance of their suggested hybrid system to that of conventional supervised algorithms such as Naïve Bayes (NB) and Maximum Entropy (ME). The suggested approach outperforms supervised methods in terms of precision and accuracy. Similarly, Pang et al.22 used unigram, bigram, and combined unigram–bigram models for SA of IMDB movie review data. They classified the data into two classes using NB, ME, and SVM classifiers.23 SA has improved over the years using several dictionary-based and machine learning algorithms to improve accuracy. Prior knowledge has also played a significant role in properly conveying the polarity of views with the emergence of deep learning algorithms in SA.

    2.2. Naïve Bayes classifier

    The NB classifier is a simple but effective probabilistic algorithm used for classification tasks, including SA. It is based on Bayes’ theorem, which describes the probability of an event given prior knowledge or evidence.24 The “Naïve” assumption in NB is that features are conditionally independent, meaning that the presence or absence of one feature does not affect the presence or absence of any other feature. NB is a well-known classification technique in data mining.25 The NB classification model computes a class’s posterior probability from its prior probability and the distribution of words in the text. The model is based on bag-of-words (BOWs) feature extraction, which does not consider the location of a word in the text. It uses Bayes’ theorem to predict the probability that a given feature set belongs to a particular label.26

    $$P(\text{label} \mid \text{features}) = \frac{P(\text{label})\, P(\text{features} \mid \text{label})}{P(\text{features})} \qquad (1)$$
    where $P(\text{label} \mid \text{features})$ is the posterior probability that a feature set is associated with a given label, $P(\text{label})$ is the prior probability of the label, $P(\text{features} \mid \text{label})$ is the likelihood of observing the feature set given the label, and $P(\text{features})$ is the prior probability of this feature collection. However, this classification technique involves a basic assumption, namely that the words in a (review, category) pair appear independently of other terms.
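    To make Eq. (1) concrete, here is a minimal pure-Python sketch that applies Bayes’ theorem under the independence assumption; all priors and per-word likelihoods below are invented for illustration and are not taken from the paper’s experiments.

```python
# Priors: fraction of training reviews in each class (invented values).
priors = {"pos": 0.5, "neg": 0.5}

# Naive independence: P(features|label) is the product of per-word
# likelihoods. These per-word probabilities are illustrative only.
likelihoods = {
    "pos": {"wonderful": 0.03, "boring": 0.001},
    "neg": {"wonderful": 0.002, "boring": 0.02},
}

features = ["wonderful", "boring"]

# Unnormalized posterior P(label) * P(features|label) for each label.
scores = {}
for label in priors:
    score = priors[label]
    for word in features:
        score *= likelihoods[label][word]
    scores[label] = score

# Normalize by P(features), the sum of the joint probabilities.
evidence = sum(scores.values())
posteriors = {label: s / evidence for label, s in scores.items()}
print(posteriors)  # e.g. {'pos': ..., 'neg': ...}
```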

    When dealing with discrete values such as word counts, NB classification is a natural choice, and we expect it to show the best accuracy. A complement-based variant counts the occurrences of each word in the complement of a class; the classification result is then the class with the lowest sum of per-word weights for the words in the message.

    Bernoulli NB is a variant of the NB classifier that is specifically designed for binary feature data, where each feature takes one of two values (typically 0 or 1).27,28 It is commonly used in text classification tasks, including SA. The Bernoulli NB classifier assumes that each feature is conditionally independent of the others, given the class label, and follows a Bernoulli distribution.29 It calculates the probability of a document belonging to a particular class (e.g. positive or negative sentiment) based on the presence or absence of binary features. The Bernoulli formula is very similar to the multinomial formula, except that the input is a set of Boolean values (whether or not the word occurs in the message) rather than a set of frequencies. Consequently, the algorithm explicitly penalizes the non-occurrence of a feature (a vocabulary word that is absent from the message), whereas the multinomial approach uses its smoothing parameter for values that are not present.30,31
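    A minimal sketch of this Bernoulli variant in scikit-learn, assuming a tiny toy corpus rather than the paper’s exact preprocessing; CountVectorizer(binary=True) produces the presence/absence features that the Bernoulli event model expects.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy corpus; labels 1 = positive, 0 = negative (illustrative only).
docs = ["a wonderful little production", "slower than a soap opera",
        "terrifically written and performed", "meaningless thriller spots"]
labels = [1, 0, 1, 0]

# binary=True yields presence/absence (0/1) features, matching the
# Bernoulli event model described above.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

clf = BernoulliNB()  # BernoulliNB also binarizes its input by default
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["a wonderful thriller"])))
```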

    Lee et al.32 developed a novel approach for determining feature weights in Naïve Bayesian learning by averaging the Kullback–Leibler measure across feature values. Additionally, they proposed a novel weight-assignment paradigm for classification learning, dubbed the value weighting approach: rather than weighting each attribute, they assign a different weight to each value. In Ref. 33, the authors present Weighting Attributes to Alleviate NB’s Independence Assumption (WANBIA), which optimizes the negative conditional log-likelihood or the mean squared error objective functions. Their experiments conduct rigorous analyses and demonstrate that WANBIA is a viable alternative to state-of-the-art classifiers such as Random Forest and Logistic Regression.31,34

    Multinomial NB is another variant of the NB classifier that is commonly used in text classification tasks, including SA. It is specifically suited for features that represent word frequencies or counts, which are typically obtained using techniques like the bag-of-words model or TF-IDF.35,36 The Multinomial NB classifier assumes that the features are conditionally independent of each other, given the class label, and follows a multinomial distribution. It calculates the probability of a document belonging to a particular class based on the frequencies or counts of features (words) in the document.
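    The following sketch illustrates the multinomial event model on a toy corpus (not the paper’s data): it estimates the smoothed per-class word probabilities by hand and then fits sklearn’s MultinomialNB, which performs the same estimation internally.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus; labels 1 = positive, 0 = negative (illustrative only).
docs = ["great great film", "awful film", "great acting", "awful boring plot"]
labels = np.array([1, 0, 1, 0])

vec = CountVectorizer()
X = vec.fit_transform(docs).toarray()

# Per-class word probabilities P(word | class) estimated from counts
# with add-one (Laplace) smoothing, i.e. the multinomial likelihood.
for c in (0, 1):
    counts = X[labels == c].sum(axis=0) + 1
    theta = counts / counts.sum()
    print(c, dict(zip(vec.get_feature_names_out(), theta.round(3))))

# sklearn's MultinomialNB uses the same estimates (alpha=1.0 by default).
clf = MultinomialNB(alpha=1.0).fit(X, labels)
print(clf.predict(vec.transform(["great film"]).toarray()))
```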

    2.3. Performance evaluation metrics

    Our experiments used conventional performance indicators to evaluate our model’s performance on the IMDB dataset, notably the accuracy and F1 scores and their related per-class support. Precision and recall are defined in Formulas (2) and (3).37 Moreover, accuracy and F1 are defined in Formulas (4) and (5).38,39

    $$\text{Precision} = \frac{TP}{TP + FP}, \qquad (2)$$
    $$\text{Recall} = \frac{TP}{TP + FN}, \qquad (3)$$
    $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad (4)$$
    $$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad (5)$$
    where True Positive (TP) is the number of reviews sorted correctly into the appropriate sentiment classes, False Positive (FP) is the number of reviews assigned to a sentiment class to which they do not belong, False Negative (FN) is the number of reviews labeled as not belonging to a sentiment class to which they actually belong, and True Negative (TN) is the number of reviews correctly identified as not belonging to a sentiment class.40
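    As a sanity check on Formulas (2)–(5), the sketch below computes the four metrics by hand from a confusion matrix over illustrative labels and confirms that they match sklearn’s built-in implementations.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 1, 0, 0, 1, 0, 1, 0]  # illustrative ground-truth labels
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]  # illustrative predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                          # Eq. (2)
recall = tp / (tp + fn)                             # Eq. (3)
accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq. (4)
f1 = 2 * precision * recall / (precision + recall)  # Eq. (5)

# The manual values agree with sklearn's built-in metrics.
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
assert abs(accuracy - accuracy_score(y_true, y_pred)) < 1e-12
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
print(precision, recall, accuracy, f1)
```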

    3. Methodology

    3.1. Research workflow

    This section describes our research workflow, as shown in Fig. 1. First, the IMDB dataset is gathered and imported. Pre-processing is applied to the data to eliminate noise and tidy it up before further processing. Next, word embedding is a technique for expressing words in a low-dimensional space, most typically as real-valued vectors; it places words with similar meanings and semantics close together, while words with less comparable meanings and semantics are placed farther apart. The experiment implements the TFIDF model for feature selection and extraction: the TFIDF vectorizer converts text documents to a matrix of TFIDF features.
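    A minimal sketch of this conversion step, using sklearn’s TfidfVectorizer on a few toy documents; the corpus shown is illustrative, not the IMDB data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["a wonderful little production",
        "the movie is slower than a soap opera",
        "a terrifically written and performed piece"]

# Convert raw documents to a sparse matrix of TF-IDF features.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

print(X.shape)                        # (n_documents, n_vocabulary_terms)
print(tfidf.get_feature_names_out())  # learned vocabulary
print(X.toarray().round(2))           # TF-IDF weight of each term per document
```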

    Fig. 1. Research workflow.

    The following methods are used to resolve the classification problem: Unsupervised Learning (VADER and TextBlob) and Supervised Learning (NB, including Bernoulli NB and Multinomial NB). Finally, the classification procedure is carried out, SA yields a positive or negative result, and the methods used are analyzed.

    3.2. Internet movie database (IMDB) dataset

    There are 25,000 training reviews, 25,000 test reviews, and 50,000 unlabeled reviews in the IMDB dataset, as seen in Fig. 2.41 The IMDB dataset is a binary sentiment classification dataset consisting of movie reviews extracted from the IMDB.42 The training documents are highly polarized, with a 1:1 ratio of negative to positive labeled documents. Because the classes in the data are evenly distributed, we can rely on all of the performance indicators presented earlier. Accuracy would not yield much meaningful information if the data were extremely unbalanced (for instance, practically all positive or all negative reviews), because it is mechanically easier to correctly predict the dominant class given its numerical abundance. The vectors for the documents are constructed using all of the documents in the collection, including train, test, and unlabeled data. Table 1 contains several samples drawn from the IMDB dataset.
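    One way to load the dataset is sketched below, assuming the standard aclImdb/ directory layout (train/pos, train/neg, test/pos, test/neg); the path is hypothetical and depends on where the archive was unpacked.

```python
from sklearn.datasets import load_files

# Assumes the standard aclImdb folder layout; the path is hypothetical.
# categories=["pos", "neg"] skips the unlabeled "unsup" folder in train/.
train = load_files("aclImdb/train", categories=["pos", "neg"], encoding="utf-8")
test = load_files("aclImdb/test", categories=["pos", "neg"], encoding="utf-8")

print(len(train.data), len(test.data))  # expected: 25000 25000
print(train.target_names)               # ['neg', 'pos']
```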

    Fig. 2. Class distribution of IMDB dataset.

    Table 1. Examples of movie reviews for each class.

    No. | Review | Sentiment
    1 | A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well-chosen- Michael Sheen not only “has got all the polari” but he has all the voices down pat too! You can genuinely see the seamless editing guided by the references to Williams’ diary entries; not only is it well worth the watching, but it is a terrifically written and performed piece. A masterful production about one of the great masters of comedy and his life. The realism comes home with the little things: the fantasy of the guard, which, rather than using the traditional ‘dream’ techniques, remains solid then disappears. It plays on our knowledge and senses, particularly with the scenes concerning Orton and Halliwell, and the sets (particularly of their flat with Halliwell’s murals decorating every surface) are well done. | Positive
    2 | There’s a family where a little boy (Jake) thinks there’s a zombie in his closet and his parents are fighting all the time. This movie is slower than a soap opera and suddenly, Jake decides to become Rambo and kill the zombie. OK, first of all when you’re going to make a film you must decide if it’s a thriller or a drama! As a drama, the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet, which ruins all the film! I expected to see a BOOGEYMAN similar movie, but instead, I watched a drama with some meaningless thriller spots. 3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: ignore them. | Negative

    4. Experiment and Result

    4.1. Result of unsupervised sentiment classification

    Unsupervised learning refers to a family of machine learning techniques that use data without ground-truth labels. Although the IMDB dataset contains truth labels, we produce predictions with two unsupervised SA algorithms, VADER and TextBlob, to facilitate a later comparison with supervised NLP algorithms such as NB. Because these two methods are lexicon-based rather than learning-based, we calculate predictions directly on the test subset. Unsupervised sentiment classification, also known as unsupervised SA, refers to the task of automatically assigning sentiment labels (such as positive, negative, or neutral) to text data without the use of pre-labeled training data. Instead, it relies on other techniques, such as lexicons, linguistic patterns, or clustering algorithms, to determine sentiment polarity.

    Valence Aware Dictionary and sEntiment Reasoner (VADER) is a popular rule/lexicon-based sentiment analyzer.43 Lexicon approaches use a pre-defined dictionary of words and phrases that are rated for polarity and intensity. VADER’s lexicon contains about 7,500 sentiment features, each scored between extremely negative (-4) and extremely positive (+4) by a panel of human experts. Words not included in the lexicon are assigned a neutral sentiment of 0. VADER also makes use of the grammatical structure of documents (e.g. punctuation, capitalization, contrasting conjunctions). To calculate the compound sentiment score of a given document, VADER sums the sentiment of all features in that document and normalizes the result to [-1, 1].44
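    A minimal VADER sketch using NLTK’s bundled implementation; the review text is invented, and the ±0.05 compound-score cut-off is the convention recommended by VADER’s authors, not a threshold tuned in this study.

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the lexicon
sia = SentimentIntensityAnalyzer()

review = "The film is terrifically written, but the pacing is a bit slow."
scores = sia.polarity_scores(review)
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}

# Common convention: compound >= 0.05 -> positive, <= -0.05 -> negative.
label = "positive" if scores["compound"] >= 0.05 else "negative"
print(label)
```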

    The performance of each classifier is described in Table 2. Out of the box, the results of VADER’s sentiment classification on these data are encouraging.45,46 It accurately categorizes the sentiment conveyed in 70.72% of the test reviews. However, VADER’s performance differs significantly between positive and negative reviews: it correctly classifies 86.73% of all positive reviews, but only 41.95% of negative reviews. Adjusting the classifier threshold with a precision-recall curve and/or an F-score would be a good idea if one were particularly interested in either positive or negative reviews. We skip this additional step to keep things as simple as possible, since the point of this exercise is to improve overall accuracy.

    Table 2. Performance result of unsupervised sentiment classification.

    Items         VADER     TextBlob
    Accuracy      0.7072    0.6852
    Sensitivity   0.8673    0.9356
    Specificity   0.5489    0.4375
    Precision     0.6553    0.6219
    F1            0.7465    0.7472
    Roc_Auc       0.7081    0.6866

    TextBlob is an NLP package that provides a variety of tools.47,48 Some of its capabilities include part-of-speech tagging, SA, and noun phrase extraction. Like VADER, TextBlob treats individual documents as strings; hence, to compute a sentiment vector, we must loop over each document individually. TextBlob().sentiment returns a namedtuple whose first element is the polarity (sentiment) score, which ranges from -1.0 to 1.0, and whose second element is the subjectivity score, which ranges from 0.0 to 1.0. Because TextBlob was designed to process raw text, we base our predictions on the unprocessed versions of the original reviews. Like VADER, TextBlob performs well even without any tuning; its accuracy is just slightly lower than that of VADER. Even more so than with VADER, we observe significant performance disparities between classes: while the model accurately categorizes approximately 94% of positive reviews, it only successfully categorizes 29% of negative reviews.
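    A minimal TextBlob sketch on an invented review; the zero-polarity cut-off used to binarize the score is our illustrative choice, not a tuned threshold.

```python
from textblob import TextBlob

review = "This movie is slower than a soap opera, but the acting is decent."
blob = TextBlob(review)

# .sentiment is a namedtuple: polarity in [-1.0, 1.0],
# subjectivity in [0.0, 1.0].
print(blob.sentiment.polarity, blob.sentiment.subjectivity)

# Map polarity to a binary sentiment label (threshold is our choice).
label = "positive" if blob.sentiment.polarity >= 0 else "negative"
print(label)
```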

    4.2. Result of supervised sentiment classification

    Supervised learning algorithms are machine learning algorithms that learn from labeled training data to make predictions or classify new, unseen instances. In the context of SA, supervised learning algorithms can be trained on labeled data where each text example is associated with a sentiment label (e.g. positive, negative, or neutral).

    NB is a class of straightforward supervised learning algorithms that relies on Bayes’ theorem and makes the “Naïve” assumption that features are independent of one another regardless of the value of the outcome variable. Despite the fact that this assumption is commonly violated in real-world applications, NB algorithms can be effective classifiers (particularly in document categorization and spam filtering) while being poor estimators. This contradiction has a simple explanation: the dependencies within a class often cancel each other out. NB classifiers are particularly fast and scalable in comparison to more sophisticated classifiers. To analyze text (words or word combinations), this NB classifier model employs a BOWs technique. However, unlike more advanced context-dependent algorithms like BERT, BOW only considers how frequently a specific word appears in a document rather than how the words themselves are arranged. Our experiment employs sklearn’s CountVectorizer, which counts the occurrences of each token in each document. After CountVectorizer automatically removes punctuation marks, we use NLTK’s regex tokenizer and English stopwords to get rid of any remaining non-alphanumeric characters and uninformative words. Since sklearn’s NB API does not prescribe a single variant, our experiments started by comparing how well each algorithm performed with our model. In addition to n-grams, our work also varies the vectorizer and other hyperparameters.
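    A sketch of this BOW pipeline is shown below; it reuses the train/test objects from the loading sketch in Sec. 3.2, and the regex pattern, n-gram range, and max_features value are illustrative assumptions rather than the paper’s exact configuration.

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

nltk.download("stopwords")

# Regex tokenizer keeps alphanumeric word tokens; the exact pattern is
# an assumption, not the paper's verbatim configuration.
tokenizer = RegexpTokenizer(r"[a-zA-Z0-9]+")
stop_words = stopwords.words("english")

vectorizer = CountVectorizer(
    tokenizer=tokenizer.tokenize,
    stop_words=stop_words,
    ngram_range=(1, 2),   # unigrams + bigrams, one of the settings tried
    max_features=50000,   # illustrative cap on vocabulary size
)

# train/test are the objects produced by load_files in Sec. 3.2.
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

clf = MultinomialNB().fit(X_train, train.target)
print(clf.score(X_test, test.target))  # test-set accuracy
```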

    TF-IDF is a statistic that reflects the relevance of a word in a document. It is defined by the number of times a word appears in a document relative to the number of documents that contain the word; the higher the TF-IDF value, the more significant the word is in the document. The accuracy improves by approximately 0.4 percentage points when utilizing sklearn’s TfidfVectorizer, and this is the final iteration of our NB model. Notably, based on Table 3, the performance of NB is significantly superior to that of TextBlob or VADER in Table 2. It correctly classifies 87.64% of reviews, which is 16.92 percentage points greater than the performance of the VADER model. In addition, the performance of the NB model is satisfactory for both classes, properly categorizing 86% of positive reviews.
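    A sketch of the final TFIDF-plus-Multinomial-NB model, again assuming the train/test objects from the loading sketch in Sec. 3.2; the hyperparameter values are illustrative, not the paper’s tuned settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Final model sketch: TF-IDF features feeding Multinomial NB.
# Hyperparameter values are illustrative, not the paper's exact grid.
model = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2),
                    max_features=50000),
    MultinomialNB(alpha=1.0),
)

# train/test from the load_files sketch in Sec. 3.2.
model.fit(train.data, train.target)
print(classification_report(test.target, model.predict(test.data)))
```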

    Table 3. Performance result of supervised sentiment classification.

    Metrics       BernoulliNB   MultinomialNB   MultinomialNB + N-gram   MultinomialNB + TFIDF
    Accuracy      0.8556        0.8608          0.8720                   0.8764
    Sensitivity   0.8222        0.8399          0.8560                   0.8600
    Specificity   0.8886        0.8815          0.8878                   0.8926
    Precision     0.8795        0.8751          0.8830                   0.8879
    F1            0.8499        0.8571          0.8693                   0.8737
    Roc_Auc       0.8554        0.8607          0.8719                   0.8763

    Figures 3 and 4 show the positive and negative words using word clouds.49 As seen in Fig. 3, the word cloud for the positive text includes fresh air, show, perform, origin, idea, drop, write, complete, etc. Our research used the function WordCloud(width=1000, height=500, max_words=500, min_font_size=5) and renders the positive_words with interpolation=‘bilinear’. Some examples of negative text generated by the word cloud are shown in Fig. 4; the negative text includes thriller, divorce, fight, parent, movie, descent, drama, kill, must, etc. Our experiment likewise uses max_words=500 and renders the negative_words with interpolation=‘bilinear’. A minimal rendering sketch follows the figures below.

    Fig. 3. Word cloud positive text.

    Fig. 4. Word cloud negative text.
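    The promised rendering sketch uses the parameters quoted above; positive_words stands in for the concatenated positive-review text built upstream, so the short string shown here is only a placeholder.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Placeholder for the concatenated positive-review text built upstream.
positive_words = "fresh air show perform origin idea drop write complete"

# Parameters quoted in the text above.
wc = WordCloud(width=1000, height=500, max_words=500, min_font_size=5)
wc.generate(positive_words)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```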

    The NB algorithm offers the following benefits: (1) The NB approach is efficient and can significantly cut down the time needed to complete a task. (2) NB is an excellent tool for problems that require the prediction of multiple classes. (3) NB has the potential to outperform other models while using less training data, provided the feature-independence assumption holds. (4) NB is better suited to categorical input variables than to numerical input variables.

    5. Conclusions

    The study is based on the IMDB dataset, which comprises movie reviews and the positive or negative labels that relate to them. We conducted a series of experiments to find the model with the highest accuracy and the greatest generality. The following classifiers were used in this work: Unsupervised Learning (VADER and TextBlob) and Supervised Learning (NB, including Bernoulli NB and Multinomial NB). Different techniques, including the Count Vectorizer, the TFIDF Vectorizer, the minimum–maximum number of words, and max features, were implemented. Based on our experiment results, we can conclude the following: (1) The combination of Multinomial NB and the TFIDF matrix achieves the highest accuracy, 87.64%, on both classification reports. (2) The performance of NB is significantly superior to that of TextBlob or VADER (Table 2). It correctly classifies 87.64% of reviews, which is 16.92 percentage points greater than the performance of the VADER model. In addition, the performance of the NB model is satisfactory for both classes, properly categorizing 86% of positive reviews.

    We will explore other sentiment classification algorithms in future research to improve our performance results. We also want to combine SA with Shapley Additive Explanations (SHAP) for explainable artificial intelligence (XAI).

    Acknowledgments

    This paper is supported by the Ministry of Science and Technology, Taiwan under Grant Nos. MOST-110-2927-I-324-50, MOST-110-2221-E-324-010, and MOST-111-2221-E-324-020.

    References