Abstract:
Social media platforms have become dominant arenas for people to express their opinions and thoughts about products, services, individuals, and public policy in the form of posts. These posts are high in volume, unstructured or semi-structured, and typically full of colloquial language, so accurate sentiment analysis models for social media data are required. Text representation is a key determinant of the accuracy and computational cost of machine learning models for such sentiment analysis. Existing text representation techniques do not consider relationships between words, ignore word characteristics such as sentiment orientation, and suffer from high feature dimensionality. The high dimensionality is attributed to the brute-force approach of generating representation vectors from the entire input text. This research aimed to develop and evaluate a sentiment lexicon-augmented text representation model for social media sentiment analysis. Three public datasets of online reviews (Amazon product reviews, Yelp restaurant reviews, and IMDB movie reviews) were used. Pre-processing involved cleaning the reviews, tokenization, lemmatization, and part-of-speech (POS) tagging. Text representation was done using bag of words, N-grams, hybrid representations, word embeddings, and the proposed sentiment lexicon-augmented approaches. Term Frequency-Inverse Document Frequency (TF-IDF) and Binary Occurrences were used as term weighting algorithms. The resulting text representation vectors were used as input to four supervised machine learning base classifiers (Decision Tree, K-Nearest Neighbor, Naïve Bayes, and Support Vector Machines) and to a deep learning Convolutional Neural Network (CNN). Experimental results showed that the sentiment lexicon-enhanced approaches performed better than the baseline approaches, with F-measure scores ranging between 84.68% and 90.15%.
Ablation studies on the N-grams showed that N=3 performed better than the other values (N=1, 2, 4, or n), with an F-measure score of 88.73%. Among the base classifiers, Support Vector Machines performed better than Naïve Bayes, K-Nearest Neighbor, and Decision Tree. Using the sentiment lexicon-augmented word embeddings and CNN, the results showed that Bidirectional Encoder Representations from Transformers (BERT) outperformed the Global Vectors (GloVe) and Word2Vec embeddings, with an F-measure of 88.63%. The results of binary sentiment classification before and after sentiment lexicon enhancement showed a reduction in the dimensionality of the resulting feature vectors and an improvement in the performance of the sentiment classification models, both in the base machine learning classifiers and in the deep learning CNN. This study demonstrated that the sentiment lexicon-based approach, combined with conventional text representation algorithms, can be used to enhance sentiment analysis models for social media data.
Keywords: Sentiment Analysis, Machine Learning, Text Representation, Social Media, Word Embedding
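
The abstract does not spell out how the sentiment lexicon is combined with TF-IDF, so the following is only a rough sketch of one possible reading: keeping only tokens that appear in a sentiment lexicon before weighting, which shrinks the feature space. The lexicon `TOY_LEXICON`, the helper names `tokenize` and `tfidf_vectors`, the smoothed IDF formula, and the three sample reviews are all assumptions made for illustration, not the study's actual resources or method.

```python
import math
from collections import Counter

# Hypothetical toy lexicon (an assumption for illustration only);
# a real experiment would load a published sentiment lexicon instead.
TOY_LEXICON = {"good", "great", "love", "bad", "terrible", "boring"}


def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [tok for tok in text.lower().split() if tok.isalpha()]


def tfidf_vectors(docs, vocabulary=None):
    """Return (sorted vocabulary, TF-IDF vectors) for a list of documents.

    If `vocabulary` is given, tokens outside it are dropped first, so the
    resulting vectors only have dimensions for lexicon words.
    """
    tokenized = [tokenize(doc) for doc in docs]
    if vocabulary is not None:
        tokenized = [[t for t in toks if t in vocabulary] for toks in tokenized]

    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(t for toks in tokenized for t in set(toks))

    vectors = []
    for toks in tokenized:
        term_freq = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors.append([
            (term_freq[t] / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)  # smoothed IDF
            for t in vocab
        ])
    return vocab, vectors


# Made-up sample reviews, not taken from the Amazon, Yelp, or IMDB datasets.
reviews = [
    "the movie was great and i love the cast",
    "the plot was boring and the acting was bad",
    "great food but terrible service",
]

full_vocab, full_vecs = tfidf_vectors(reviews)
lex_vocab, lex_vecs = tfidf_vectors(reviews, vocabulary=TOY_LEXICON)

print("dimensions without lexicon filter:", len(full_vocab))
print("dimensions with lexicon filter:", len(lex_vocab))
print("lexicon-filtered vocabulary:", lex_vocab)
```

On this toy input the lexicon-filtered vectors have fewer dimensions than the unfiltered ones, which is the kind of reduction the abstract reports; the vectors from either call could then be passed to any of the classifiers named above.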