TF-IDF vs. Word2Vec Vectorization Techniques for Twitter Sentiment Analysis
In this blog, we examine different approaches to sentiment analysis on real tweets. Prior to creating models with different vectorizing methodologies and comparing results, we split the data into train and test sets and then clean each part. After comparing the models with different vectorizations, it was determined that the Logistic Regression (LR) model with TF-IDF, with an F1 score of 0.81 and high precision and recall scores, delivered better performance than the other models.
Introduction
Sentiment Analysis
The amount of digital interactions and transactions continues to grow exponentially each year, and sentiment analysis provides a way to understand the attitudes, opinions, and emotions that underlie online text. It is particularly useful in the world of social media, where it can provide an overview of public opinion on specific topics. The applications for data gleaned from sentiment analyses on social media platforms are incredibly diverse, ranging from micro-level applications, such as enhancing products and marketing efforts, to broader public concerns, such as informing political policy and predicting economic performance.[i]
Social media, including Twitter, serves as a trove of expressive data related to emotions, opinions, and shared views about users' daily lives.[ii] Much of this content is used for decision-making, which is where sentiment analysis fits in. These analyses can indicate how positive, negative, or neutral a message is by automating the process of mining attitudes, opinions, views, and emotions within the text.[iii]
The purpose of this blog is to compare different approaches to sentiment analysis on real tweets. The dataset is taken from Kaggle and contains real tweets, each provided with a unique identifier and labeled with the corresponding sentiment: zero indicates a negative sentiment and one indicates a positive one.
Preparation and Cleaning — Training and Testing Dataset
Before creating the main models with different vectorizing methodologies and comparing the results, the data first had to be cleaned. The dataset comprised two parts: training and testing. After cleaning, these two sets were merged back into a whole dataset for homogeneity.
In the process of cleaning the data, all unnecessary columns, such as ItemID, were removed. All upper-case letters were changed to lower case, and all unnecessary signs and numbers were removed. This cleaning left tweets that contained only words and their associated sentiment.
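A minimal cleaning function along these lines might look as follows; the exact regular expressions are an assumption, not the study's code:

```python
import re

def clean_tweet(text):
    """Lowercase a tweet and strip everything except letters and spaces."""
    text = text.lower()                         # normalize case
    text = re.sub(r"http\S+|@\w+", " ", text)   # drop URLs and @mentions
    text = re.sub(r"[^a-z\s]", " ", text)       # remove signs and numbers
    return re.sub(r"\s+", " ", text).strip()    # collapse extra whitespace

print(clean_tweet("OMG!! @user this is sooo GOOD :) 100%"))
# -> "omg this is sooo good"
```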
During the aggregation of the data, the sentiment text was assigned to X, which served as the feature set, and the labels to y, the values to be predicted. The features were then separated into train and test datasets.
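A sketch of this step, assuming a pandas DataFrame df with the Kaggle columns Sentiment and SentimentText (the column names and split ratio are assumptions):

```python
from sklearn.model_selection import train_test_split

# X: cleaned tweet text (features); y: sentiment labels (0 = negative, 1 = positive)
X = df["SentimentText"].apply(clean_tweet)
y = df["Sentiment"]

# Hold out a portion of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```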
Vectorization
The dataset was then vectorized using two methods: TF-IDF vectorization and Word2Vec mean vectorization.
TF-IDF, or term frequency-inverse document frequency, is a numerical statistic that defines how important a term is to a document in a collection (corpus).[iv] It is primarily used for stop-word filtering in text summarization and categorization applications. The TF-IDF value increases proportionally with how frequently a word appears in the document, but it is discounted by the word's frequency in the corpus, which offsets the fact that some words are simply more common than others.[v]
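For reference, one common formulation of the weight (the exact variant used in this study is an assumption) is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is the number of times term t appears in document d, N is the number of documents in the corpus, and df(t) is the number of documents containing t. A word that appears often in one tweet but rarely elsewhere thus receives a high weight.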
Word2Vec builds word embeddings by maximizing the likelihood that words are predicted from their context, or vice versa.[vi] It is a relatively new approach to text classification that converts words and phrases into a vector representation, bringing new semantic features that help in text classification.[vii] The vectors are numbers that represent the meaning of the word. In sum, it is a statistical method for efficiently learning standalone word embeddings from a text corpus.
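As an illustration, training such embeddings with gensim might look like this (gensim 4.x API; the hyperparameters shown are assumptions, not the study's settings):

```python
from gensim.models import Word2Vec

# Each tweet becomes a list of tokens, e.g. after clean_tweet(...).split()
sentences = [tweet.split() for tweet in X_train]

# Train a small skip-gram model on the training tweets
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=1)

# Assuming "good" survives the min_count cutoff:
vec = w2v.wv["good"]                          # 100-dimensional vector
print(w2v.wv.most_similar("good", topn=3))    # semantically close words
```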
Word Vectors
As natural language processing has become more sophisticated, word vectors are able to provide machines with much more information about words than was available to previous analyses. Whereas traditional NLP approaches were incapable of capturing syntactic and semantic relationships across collections of words, word vectors represent words as “multidimensional continuous floating-point numbers” (Ahire, 2018): a set of real-valued numbers in which each point captures a dimension of the word’s meaning. In this scheme, semantically similar words have similar vectors.[viii]
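To make "similar words have similar vectors" concrete, cosine similarity is the usual yardstick for comparing two word vectors; a short NumPy sketch using the hypothetical w2v model trained above:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Near-synonyms should score higher than unrelated words
# (actual values depend on the training data):
print(cosine_similarity(w2v.wv["good"], w2v.wv["great"]))
print(cosine_similarity(w2v.wv["good"], w2v.wv["table"]))
```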
Models
Two models were used as classification algorithms to predict the sentiments of the tweets: Logistic Regression (LR) and Random Forest (RF). Notably, the grid search technique was used to tune hyperparameters. This will be discussed further in the next section.
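A sketch of what such a grid search could look like with scikit-learn; the parameter grid is illustrative only, since the blog does not list the values actually searched:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Illustrative candidate values; X_train_vec stands for a vectorized
# training set (e.g., the TF-IDF or mean-Word2Vec features shown later).
param_grid = {"C": [0.01, 0.1, 1, 10]}

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,           # 5-fold cross-validation
    scoring="f1",   # optimize the F1 score, as reported in the results
)
grid.fit(X_train_vec, y_train)
print(grid.best_params_)
```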
Logistic Regression is routinely used when studying outcomes that are represented by binary variables. It is a technique borrowed by machine learning from the field of statistics and is used for predictive analysis. It describes data and explains the relationship between one dependent binary variable and one or more independent variables.
Random Forest constructs many decision trees that classify a new instance by majority vote. Each decision tree node uses a number of attributes randomly selected from the whole set of original attributes. In sum, it is a meta-estimator that averages these decision tree classifiers, each fit on a random subset of attributes, to improve predictive accuracy.
Results
The Logistic Regression of TF-IDF
In this part, we will discuss the results based on TF-IDF vectorization and the LR model. The TF-IDF vectorizer was configured with an n-gram range of up to two, so that bigrams are treated as additional features. To illustrate: the descriptor “very bad” is a 2-gram, an attribute assigned separately from the individual words “very” and “bad.”
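A minimal sketch of this setup with scikit-learn, assuming the train/test split from earlier (solver settings and other defaults are assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Unigrams and bigrams, so "very bad" becomes a feature of its own
tfidf_lr = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
tfidf_lr.fit(X_train, y_train)
y_pred = tfidf_lr.predict(X_test)
```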
When applied to the LR model, the test set results were:
Recall: [0.76, 0.8]
Precision: [0.7, 0.81]
F1 score: [0.75, 0.81]
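The two numbers in each list presumably correspond to the negative (0) and positive (1) classes; the blog does not state this explicitly, but it matches the per-class output of scikit-learn's metrics. A sketch of how such scores are produced:

```python
from sklearn.metrics import precision_recall_fscore_support

# Per-class scores; the two entries in each array correspond to
# class 0 (negative) and class 1 (positive)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, labels=[0, 1]
)
print(recall, precision, f1)
```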
The Logistic Regression of Word2Vec
This section will discuss the Word2Vec vectorization and the logistic regression model, covering Word2Vec’s chosen parameters and results. Converting a word to a vector (an array of numbers) is simply a way to represent words as input for any natural language processing task. For this study, the Word2Vec mean methodology was selected, representing each tweet by the mean of its word vectors. Within the logistic regression, the same parameters were used in order to adequately compare the methodologies.
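A minimal sketch of the mean-vector step, assuming the w2v model trained earlier; tweets whose words are all out of vocabulary fall back to a zero vector (a common choice, though not stated in the original):

```python
import numpy as np

def mean_vector(tokens, w2v, dim=100):
    """Represent a tweet as the mean of its word vectors."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X_train_w2v = np.vstack([mean_vector(t.split(), w2v) for t in X_train])
X_test_w2v = np.vstack([mean_vector(t.split(), w2v) for t in X_test])
```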
The results were as follows:
Recall: [0.71, 0.7]
Precision: [0.65, 0.76]
F1 score: [0.68, 0.73]
The Random Forest for TF-IDF
The TF-IDF parameters for the random forest model were kept the same as those used for logistic regression. The random forest classifier was defined with a maximum depth of 13 and the number of estimators set at 500.
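A sketch of this configuration, reusing the unigram-and-bigram TF-IDF features (the random seed is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Parameters as stated above: depth capped at 13, 500 trees
rf_tfidf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    RandomForestClassifier(max_depth=13, n_estimators=500, random_state=42),
)
rf_tfidf.fit(X_train, y_train)
y_pred_rf = rf_tfidf.predict(X_test)
```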
Results of this data set were:
Recall: [0.72, 0.76]
Precision: [0.7, 0.78]
F1 score: [0.71, 0.77]
The Random Forest for Word2Vec
In this model, the random forest parameters were the same as those used previously, applied to the Word2Vec mean features.
The results of this data set include:
Recall: [0.63, 0.76]
Precision: [0.67, 0.72]
F1 score: [0.65, 0.74]
Conclusion
In the comparison of these models with different vectorizations, the LR model with TF-IDF delivered the best performance, with an F1 score of 0.81 and high precision and recall scores. Despite the evidence that LR with TF-IDF performed best, Word2Vec brings additional tools to the table and can still be very helpful in many analyses. For further improvement, the use of NLP deep learning techniques to achieve better results should be explored.
___________________________________________________________________
[i] Bannister, K. (2018). Understanding sentiment analysis: What it is & why it’s used. Brandwatch. Retrieved April 17, 2019, from https://www.brandwatch.com/blog/understanding-sentiment-analysis/
[ii] Kharde, V.A., & Sonawane, S.S. (2016). Sentiment analysis of Twitter data: A survey of techniques. International Journal of Computer Applications, 139 (11), 5–15.
[iii] Pak, A., & Paroubek, P. (2010). Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), 1320–1326.
[iv] Christian, H., Agus, M.P., & Suhartono, D. (2016). Single document automatic text summarization using term frequency-inverse document frequency. ComTech, 7(4), 285–294.
[v] Christian, H., Agus, M.P., & Suhartono, D. (2016). Single document automatic text summarization using term frequency-inverse document frequency. ComTech, 7(4), 285–294.
[vi] Ling, W., Dyer, C., Black, A., & Trancoso, I. (2015). Two/too simple adaptations of Word2Vec for syntax problems. Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, 1299–1304.
[vii] Lilleberg, J., Zhu, Y., & Zhang, Y. (2015). Support vector machines and Word2vec for text classification with semantic features. 2015 IEEE 14th International Conference on Cognitive Informatics & Cognitive Computing.
[viii] Ahire, J.B. (2018, March 12). Introduction to word vectors. Medium. Retrieved April 8, 2019, from https://medium.com/@jayeshbahire/introduction-to-word-vectors-ea1d4e4b84bf