Sentiment Analysis of Twitter: Exploring Machine Learning and Data Mining Techniques

“Sentiment analysis – Twitter”;

Nowadays social network sites make the entire world a small village, where users can share their views, feelings, experiences, advice through those sites so that others can get help from these [1]. Twitter is one of the best social network sites where users can communicate with each other or share their opinions in the form of short blogs. Large number of text posts exist on twitter which increases every day. The rapid growth in data makes the existing databases unable to handle large amounts of data in a short time. Also, these databases are designed to process structured data but there are limitations when dealing with such a vast amount of data.

Sentiment analysis is one of the most widely interesting research topics in academic as well as commercial field. The expression sentiment refers to the feelings or thought of the person across some particular domains. Hence, it is also defined as opinion mining. The huge amount of Tweets jotted down daily makes Twitter data one of the most essential data volumes; therefore, this data has different aims, such as business, industrial or social aims according to the data requirement and needed processing. Actually, the amount of data, which is massive, grows rapidly per second and this is called big data [2].

The main goal of analysing of sentiments is to evaluate these thoughts, determining its polarity using NLP, statistics, or machine learning Algorithms to extract, identify, or otherwise characterize the sentiment content of a text.

Technically, data mining has been generally considered as the process of finding correlations or patterns among dozens of fields in large relational databases. Nowadays it is not necessary to turn to large structured databases to have large data volumes to analyze [3]. The traditional method of turning data into knowledge relies on the manual analysis and interpretation carried out by expert analysts that become intimately familiar with the data and serve as an interface between the data and the users. Unfortunately, this manual approach is slow, expensive and highly subjective, and it has become impracticable due to the dramatically huge volumes of data on the Web [4].

In this project, the handling of Tweets was investigated and different processing techniques were applied. Furthermore, to familiarize ourselves with different aspects of machine learning and data science, different data mining techniques have been applied to make analyze the performance of these techniques. I.e. acquiring data, preprocessing, use different classifiers, measure and interpret their performance. And finally discuss our observations.

1.Problem Statement

The task of automatic classification of sentiments on Tweets, as one of mining tasks, has become an interesting area for research. For this task to be accomplished, Tweets are processed and transformed from the full text version to a document vector by mapping each document into a compact form of its content. This reduces complexity and makes processing much easier.

A general statement of the problem can be formulated as follows:

“Given a set of Tweets in a particular domain and in a particular language, it’s required that we indicate the appropriate category (class) of each of them depending on a predefined dataset”

The process of indicating the appropriate class (classification process) for Tweets faces some challenges, mainly the handling of such a large amount of textual data. This large amount of textual data, leads to several problems. It makes the work more difficult. As the the size of the feature vector increases, the more noise appears, and the more errors occur. As a result, this will increase the run time, and raise the overfitting problem.

So, preprocessing, cleaning, and feature engineering is to be done to increase the efficiency and classification performance and to decrease the computational costs, storage cost, and ease of interpretation and modeling.

2.Proposed Approach

A possible solution of the general problem “sentiment analysis of Tweets” may be subdivided into two different stages: learning and classification as shown in Fig. 1.

•Learning Stage:

Learning stage is divided into two phases: the first phase is the pre-processing phases. In this phase, each document is processed and the set of words that describe the document are extracted. While the second phase is the feature vector generation phase in which a weighted feature vector for each Tweet is generated.

•Classification stage

This is the stage in which classification models are built for the training data and classification algorithms are subsequently applied on the test data.

Figure ‎1. General block diagram for the task of Sentiment analysis of Tweets

2.Related work

Sentiment Analysis is the area of web data mining which deals with evaluating the feelings or thoughts of people across some particular domains; to determine its polarity, and to what degree [5].

In the past few years, the field of detecting and evaluating the feelings of users or customers in many domains has been prevalent. This is due to the recent growth of data which exists on the World Wide Web. Especially those that deal with sentiment analysis such as describing people’s point of view, experiences, and thoughts.

Aurangzeb Khan et al. [6] and Rajni Jindal et al. [7] gives a detailed and deep review on machine learning approaches, documents representation techniques, and the datasets and techniques used for Text-Documents Classification. The review clearly indicates that the most common method for representing a document in text classification is the Vector-Space-Model (VSM), which represents each document as a vector consisting of an array of words.

Walaa Medhat et al. [5] discussed different types of sentiment analysis in their applications, and gave explanation and categorization of some Sentiment analysis algorithms and their originating references.

Yang et al. [8] introduced the common sentiment analysis techniques such as, Support Vector Machine method, Naive Bayes method, Maximum Entropy method and Artificial Neural Network method and performance assessment and difficulties.

Pang et al. [9] were the first to apply Machine Learning for sentiment mining on movie reviews corpus, many classification algorithms were used, whereas unigram and bag of words are utilized for obtaining features. The ratio of accuracy differs according to what they applied for example it was 82.9%, 78.5% when applying Support Vector Machines and Naive Bayes classifiers respectively. Wang et al. [10], used training dataset which contain 17000 Tweets to come up with a real time Twitter Sentiment Analysis System regarding to U.S. voting Presidential Election Cycle in 2012.

Domingos et al. [11] saw that if features are highly dependent on other variables, Naive Bayes works relatively well. This may be counterintuitive because Naive Bayes uses features that are independent of each other. Zhen Niu et al. [12] used a new model where there were well thought out approaches used for feature selection, weight computation and classification. This was based on the Bayesian algorithm in which the weights of the classifier are modified by using both representative feature and unique feature. 'Representative feature’ deals with the information that represents a class and ‘ Unique feature’ deals with the information that aids in distinguishing classes from each other. Probabilities were calculated using this method on each classification and this helped improve the Bayesian algorithm.

Barbosa et al. [13] created a 2-step automatic sentiment analysis method for classifying tweets. A noisy training set was used to reduce the effort required for labelling in developing classifiers. They initially classified tweets into subjective and objective tweets. Next, subjective tweets were classified as positive and negative tweets. Celikyilmaz et al. [14] also created a pronunciation based word clustering method that cold normalize noisy tweets. In pronunciation based word clustering, words with a similar pronunciation are grouped together and labeled as common tokens. In addition to this, they also assigned similar tokens for numbers, html links, user identifiers, and target organization names for normalization. Finally, after normalization, they used probabilistic models to identify polarity lexicons for features. Classification using the BoosTexter classifier with these polarity lexicons as features was performed and a reduced error rate was obtained.

Mahmoud Nabil et al. [15] introduced ASTD, an Arabic social sentiment analysis dataset gathered from Twitter. It consists of about 10,000 tweets which are classified as objective, subjective positive, subjective negative, and subjective mixed.they presented the properties and the statistics of the dataset, and run experiments using standard partitioning of the dataset.

Bhavitha et. al. in [16] compared different classifiers for sentiment analysis – has been done on Twitter posts related to electronic products such as mobiles and laptops is performed using a Machine Learning approach, and they utilized various classifiers such as Naive Bayes, Support Vector Machine (SVM), Maximum Entropy, and Ensemble for text classification and compares their accuracy, precision, and recall. A dataset of 1200 twitter posts equally divided into positive and negative are used as training (1000) and test (200) sets. Results concluded that the Naive Bayes classifier has the highest precision, but lacks accuracy and recall.

Ankita Rane [17] worked on a dataset comprising of tweets for 6 major US Airlines and performed a multi-class sentiment analysis. This approach starts off with pre-processing techniques used to clean the tweets and then representing these tweets as vectors using a deep learning concept (Doc2vec) to do a phrase-level analysis. They applied different classification techniques, and uses the precision, Recall, and F-Measure to evaluate the performance of the classifiers.

3.Proposed model for Sentiment Analysis of Tweets

For Tweets to be classified, each Tweet is represented by a set of words that expresses its global meaning. In traditional approaches, it is represented by a group of words describing its contents, and the classification process goes through two different two stages: learning and classification stages as described in the following subsections. Generally, tweets can be in any language. In our work we are dealing with those that are in Arabic and English languages only.

1.The learning Stage

The first issue that needs to be addressed in text classification is how to represent texts, not only to ease the manipulation of them by machines saving the processing time and the used memory, but also to retain as much information as needed without any losses.

In order to represent the text documents; the Vector Space Model (VSM) is considered the most commonly used text representation technique, in which each document is represented as a vector, Bag of Words (BoW), i.e., each document is represented by the set of words it contains and their frequency regardless their order [7] [18] [19].

In our approach, we use a method in which we apply the he Pre-Processing phase, on both the training and the testing text documents to build the feature vector for mining tasks; as explained in (3.1.1.). After the we perform the phase of generating a weighted feature vector for each of the documents, either training or testing ones as explained in (3.1.2).

1.The pre-processing Phase

The Pre-Processing phase aims in preprocess the input Tweets; extracting the BoW that represents them. Fig. 3-2 shows the detailed tasks of the Pre-Processing phase.

Figure 3. Detailed tasks of the Pre-Processing phase

As shown in Fig. 3, the Pre-Processing phase consists of the following tasks:

•Natural Language Processing (NLP) Parser; which is the responsible of processing text to detect sentences, tokens, by separating the words for the analysis.

•Stopping words removal; in which a removal of the stopping words, for example, if the textual data is in English language the stopping words are such as (a, an, in, at,…, etc.), also the auxiliary verbs, adverbs, etc., the same as if we have a textual data in any other language such as Arabic Language for which the stopping words are such as (فى, لها, ولم, امس,…, etc.). And this is done by searching for words in pre-existing Stopping words list.

•Data Cleaning and Normalization; normalize the words to remove special arabic characters which may cause ambiguity in the classification process such as (“إ” will be “ا”), remove words that contains symbolic and special arabic characters; that accepts only the words that contain Arabic characters and numbers and words with any combination of them. Also, for english data, non English words words that contains symbolic characters are being Removed; that accepts only the words that contain English characters and numbers and words with any combination of them, furthermore it converts all the words to be lowercase words.

•Stemming; it is a task in which each of the extracted words are replaced by its morphological root by applying any of the stemming algorithms. It is fundamental to avoid the redundancy of extracting the different equivalent morphological forms in which a word can be presented. In this project, for Arabic data, we used ISRI Arabic stemmer from the nltk python library to perform the word stemming process. SRI Arabic stemmer based on algorithm: Arabic Stemming without a root dictionary. Information Science Research Institute. University of Nevada, Las Vegas, USA. A few minor modifications have been made to ISRI basic algorithm.

While for English Data, we used Porter English stemmer from the nltk python library to perform the word stemming process.

At the end of the Pre-Processing Phase, we obtain a set of stemmed bag of words which represents the original feature vector that will be used for generation of the used feature vector.

2.The generation a weighted feature vector phase

The task of generating a weighted feature vector, aims in assigning Weights to each word in the feature vector to give an indication of the importance of them.one of the most straightforward and useful techniques to weight words is the Term Frequency; the limitations of using this technique as it doesn’t take the length of the documents into account [18]. However, [20] [21], introduces the use of the term TFIDF as a weighting technique is a straightforward solution to this problem. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

It is composed by two terms; the first computes the normalized Term Frequency (TF), the second term is the Inverse Document Frequency (IDF).

TF: This measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear much more times in long documents than shorter ones, as follows [22]:

TF(t) = Number of time term (t) appears in a tweetTotal number of terms in the whole dataset

( ‎eq. 1 )

IDF: This measures how important a term is. While computing TF, all terms are considered equally important, as follows [22]:

IDF(t) = log(total number of tweets)Number of tweets with term(t) in it

( ‎eq. 2 )

Finally, the TFIDF measure: is calculated by the multiplication of TF and IDF, as follows [22]:

TFIDF(t) =(TF(t)(IDF(t))

( ‎eq. 3 )

This means that larger weights are assigned to terms that appear relatively rarely throughout the corpus, but very frequently in individual documents.

2.The Classification Stage

1.Building a classification model and applying a classification algorithm

In this task, the data is processed and prepared in a suitable format. The data will be splitted into training and testing sets, in 5 folds .Then for the training data, the produced will be supplied to the set of built in data mining algorithms in the python sklearn library. This algorithms are being used in building the classification model for different classifiers (Naive-Bayes, Decision Trees, SVM,…etc. While For the test data, the classification algorithm corresponding to the builded model is being applied, and the classification results are being obtained.

4.Experimental Results and discussion

In this section, we discuss our experimental setup and the results for evaluating the performance of our proposed approach on Five different dataset, two in arabic language; the ASTD dataset[15] and the RES-dataset [23] . And three in English language which are Sentiment140 dataset [24], The Twitter US Airline Sentiment dataset [25], and the Uber Ride Reviews Dataset [26].

1.Dataset used

The Restaurant Reviews dataset (RES): is For the restaurants domain two sources were scrapped for reviews: the first is Qaym4 from which 8.6K Arabic reviews were obtained, and the second is TripAdvisor from which 2.6K reviews were collected. Both datasets cover 4.5K restaurants and have reviews written by over 3K users

Arabic Sentiment Tweets Dataset (ASTD):contains over 10k Arabic sentiment tweets classified into four classes subjective positive, subjective negative, subjective mixed, and objective. Two sets of baseline sentiment analysis experiments are supported with the dataset.

The Stanford Twitter sentiment corpus (sentiment140): It contains 1.6 million tweets automatically labelled as positive or negative based on emotions. For example, a tweet is labelled as positive if it contains :), :-), : ), :D, or =) and is labelled as negative if it contains :(, :-(, or : (. ie. The tweets have been annotated (0 = negative, 2 = neutral, 4 = positive) and they can be used to detect sentiment. Although automatic sentiment annotation of tweets using emoticons is fast, its accuracy is arguable because emoticons might not reflect the actual sentiment of tweets.

The Twitter US Airline Sentiment dataset: This data originally came from Crowdflower Data for Everyone library. It was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service"). Furthermore, this data we're providing on Kaggle is a slightly reformatted version of the original source.

The Uber Ride Reviews Dataset: It contains 1345 uber ride reviews, collected in the period between ( 2014-2017).these data were which are scraped from two websites,

•https://www.consumeraffairs.com/travel/uber.html and

•https://www.sitejabber.com/reviews/www.uber.com#reviews.

2.Evaluation criteria

All documents for training and testing Passes through the stages in section 3, Experimental results reported in this section are based on precision, recall and F1 measures.The F1 measure is the harmonic mean of precision and recall as follows [27]:

F1(Recall, Precision) = 2 Recall PrecisionRecall + Precision

( ‎eq. 4 )

In the above formula, precision and recall are two standard measures widely used in text categorization literature to evaluate the algorithm’s effectiveness on a given category where

Precision = True PositiveTrue Positive + False Positive

( ‎eq. 5 )

Recall = True PositiveTrue Positive + False Negative

( ‎eq. 6 )

While the accuracy is simply defined by the following equation.

Accuracy % = Total Number of correctly classifed dataTotal number of the whole dataset100

( ‎eq. 7 )

3.Experimental results and Discussion

This section analyses and discusses the performance of different classifiers used in the project.

Table 1 – Results based on various datasets run against several classifiers.

Classification

Algorithm

Performance Metrics

Datasets

Arabic Dataset

English Dataset

RES

ASTD

US Airline Tweets

Uber Ride Reviews

Stanford Tweets

Decision Tree

Accuracy

precision

0.73

0.56

0.65

0.73

0.69

recall

0.73

0.58

0.65

0.73

0.69

f1-score

0.73

0.57

0.65

0.73

0.69

Multinomial Naive Bayes

Accuracy

precision

0.62

0.6

0.73

0.84

0.78

recall

0.79

0.67

0.70

0.80

0.78

f1-score

0.69

0.55

0.63

0.72

0.78

Bernoulli Naive Bayes

Accuracy

precision

0.72

0.59

0.77

0.84

0.77

recall

0.77

0.66

0.76

0.84

0.77

f1-score

0.74

0.61

0.77

0.84

0.77

Logistic Regression

Accuracy

precision

0.72

0.64

0.74

0.64

0.79

recall

0.79

0.68

0.74

0.80

0.79

f1-score

0.70

0.56

0.71

0.79

K-Nearest Neighbors

Accuracy

–

precision

0.74

0.48

0.68

0.82

–

recall

0.74

0.66

0.69

0.84

–

f1-score

0.74

0.54

0.68

0.81

–

Perceptron

Accuracy

precision

0.76

0.59

0.72

0.78

0.73

recall

0.79

0.61

0.73

0.80

0.73

f1-score

0.77

0.60

0.72

0.79

0.73

Multilayer Perceptron

Accuracy

–

precision

0.78

0.59

0.73

0.80

–

recall

0.82

0.61

0.74

0.83

–

f1-score

0.78

0.60

0.73

0.80

–

Figure 4. Accuracies of various classifiers.

Figure 5. Precision of various classifiers.

Figure 6. Recall of various classifiers.

Figure 7. F1-Score of various classifiers.

To test the proposed model; Experiments were implemented on Different datasets, two Arabic language and three in English language, as in 4.1., using seven Different Classification algorithms for analysing their Performance. we use 5-folds for the evaluation and data were splitted into 70% training and 30% testing randomly. Table 1 shows the classification accuracies, Precision, Recall, and F- Measure for applying the seven classifiers on the five different datasets. It could be noticed from figures (4:7) that, for the RES dataset, the Multilayer Perceptron classification algorithm (81 %) is much better than other classifiers. While for the ASTD dataset, the Logistic Regression classification algorithm (68 %) is much better than other classifiers.

In addition, it could be noticed that, the accuracy of the Bernoulli Naive Bayes classifier when it is being applied on the US Airline Tweets dataset (76%), and on Uber Ride Reviews dataset (84%) is much better than the other classifiers. Furthermore, it could be perceived that, the accuracy of Logistic Regression classifier (79%) after being applied on the Stanford Tweets dataset (79%) is much better than the other classifiers.

Moreover, we denoted that from the results obtained when applying the kNN classifier the following:

•Firstly, the kNN classifier assigns equal weights to each attribute in the feature vector. This may cause confusion when there are many irrelevant attributes in the data.

•Second, the traditional kNN classifier is based on measuring the distance between the documents using different distance measures, such as Euclidean distance, Manhattan distance, Chebyshev distance, and Minkowski distance, ignoring the semantic relations between them. Which may lead to misclassification of some the tested data.

It could be remarked that, due to the limitations of the granted processing power of the Jupyter Server, we take a subset of the Stanford Tweets dataset. We take the first 50000 tweets on each class. Although, we still facing the same problem of the continuously dead kernel and the memory overflow when applying both the kNN and the Multilayer Perceptron classification Algorithms.

furthermore, when comparing our obtained results with the results from [17], we find the superiority of our proposed model with much better results in all the used classifiers.

5.Conclusions and Future Work

1.Conclusions

This project addresses the task of performing sentiment analysis over textual datasets. To achieve this goal, a classification model it being introduced; in which each tweet passes through two stages containing different steps. Additionally, TFIDF is being used to generate the used feature classification tasks.

After building the feature vector, different classifiers are being used to test our model, and performance of our model is being reported using different metrics, Accuracy, Precision, F-Measure, and recall. Experimental results through the entire thesis indicate that:

1.For the RES dataset, the best results obtained when applying the Multilayer Perceptron algorithm (81%) for the accuracy. Also in terms of getting high True positive correctly detection rates (TPR), the best results obtained also for the same classifier with recall of (82%)

2.For the ASTD dataset, the best results obtained when applying the Logistic Regression algorithm (68%) for the accuracy. Also in terms of getting high True positive correctly detection rates (TPR), the best results obtained also for the same classifier with recall of (68%).

3.For the US Airline Tweets dataset, the best results obtained when applying the Bernoulli Naive Bayes algorithm (76%) for the accuracy. But in terms of getting high True positive correctly detection rates (TPR), the best results obtained when applying the Logistic Regression classifier with recall of (74%).

4.For the Uber Ride Reviews dataset, the best results obtained when applying both the Bernoulli Naive Bayes algorithm and the kNN classification Algorithm (84%) for the accuracy.Also in terms of getting high True positive correctly detection rates (TPR), the best results obtained also for the same classifier with recall of (84%).

5.For the Stanford Tweets dataset, we faced a limitations in the available processing jupyter server, even when we work on a subset of 100K tweets, we couldn’t obtain results for both the kNN and the Multilayer Perceptron classification algorithms. But of best results obtained when applying the Logistic Regression (79%) for the accuracy. Also in terms of getting high True positive correctly detection rates (TPR), the best results obtained also for the same classifier with recall of (79%).

2.Future work

There are some recommendations that either belongs to the same direction of the current work to be carried out, or new area of work belongs to the web text document classification in general, which includes the following:

6. Utilizing the WordNet ontology for building the extracted feature vector to improve the classification accuracy.

7.Extend and Evaluate the proposed system against other datasets in other languages .

8.Apply dimensionality reduction techniques to minimize the size of the used feature vector without affecting the overall system performance.

9.Extend the proposed work to handle tweets that uses franco-arabic texts.

10.Improving the existing techniques and methodologies to be used in classifying the web text documents.

11. Implementation of the web text document classification system using parallel platforms, such as Apache Spark, to handle the huge amount of data in less time without affecting the overall system performance.

12.Combining the decisions of two or more classification algorithms to improve the resultant accuracies.

Performing clustering tasks over the set of feature vectors to help application to give accurate results and search quickly.

Essay: Sentiment Analysis of Twitter: Exploring Machine Learning and Data Mining Techniques

Essay details and download:

Text preview of this essay:

About this essay:

Essay details and download:

Text preview of this essay:

About this essay:

Essay Categories: