Social media has emerged as one of the most powerful forms of communication today. The ability to voice an opinion about anything, from anywhere, has coined and popularized the term ‘Global Citizen’. Many social platforms give companies and organizations an avenue to build a brand by engaging directly with their customers. If you don't like a product you bought or a movie you just watched, you can log in to Twitter or Facebook and tell the world. These platforms also provide a view into what is currently trending or causing a stir. As a result, there is a huge repository of opinionated data that, if harnessed strategically, can help solve problems ranging from which product to invest in to complex business deals.
To get a good understanding of this fast-growing area, I have developed a tool that gathers data from Twitter and turns it into useful information. I focused on Twitter for this project because it is a top-tier social platform with one of the most active user bases. Twitter has hundreds of millions of active users, and they tweet about almost everything. Twitter has become a place for people to share ideas, build networks, collaborate, meet new people, or even start nationally trending discussions. Every other day a new hashtag is popularized by Twitter users, either to present a collective front or simply to support a cause.
Suppose your company has just released a highly anticipated new product. There is an obvious and legitimate curiosity about what is currently being said about the product on the internet. This is where the tool comes in handy: it fetches real-time tweets from Twitter, applies classification algorithms, and projects the sentiments onto charts and graphs that show whether the buzz created by the marketing team matches how consumers actually perceive the product. Beyond product marketing and perception, the tool can also be used to gain insight into the perception of any topic, such as a national event, sports, a celebrity, a political issue, or a natural disaster.
1.2 Opinion Mining
Opinion mining (or sentiment analysis) is the practice of extracting simple human sentiments, such as positive or negative polarity and objectivity or subjectivity, from text written in natural language. People perceive things and express feelings differently from one another, which is why opinion mining is such an intriguing research area today. It has developed into one of the most active research areas in data mining. Because of its wide range of influence, it is no longer confined to computer science and has spread into marketing and the social sciences.
1.3 Levels of Opinion Mining
¬ Document Level
In this type of opinion mining, a single sentiment is extracted from the entire document and taken as the opinion of the document as a whole. It is assumed that the document is written from the perspective of a single author with a specific opinion in mind.
¬ Sentence Level
In this type of opinion mining, it is first determined whether the sentence is subjective or objective. If it is subjective, it is then determined whether the sentence is positive, negative, or neutral. The most common assumption at this level is that the entire sentence conveys only a single opinion.
¬ Feature Level
In this type, the text is analyzed to extract the features of the subject of concern. After the features are extracted, a summarized opinion is constructed based on those features.
1.4 Techniques for Opinion Mining
Opinion mining (also known as sentiment analysis) focuses mainly on two things: first, identifying whether the text is subjective or objective, and second, determining the polarity of the subjective text. Opinion words contribute most to the subjectivity of a text, and together they convey its emotional context. Objective words contribute to its factual content. There are three common approaches for determining the sentiment of a text:
¬ Lexical Analysis
This is essentially a dictionary-based approach. The text under analysis is split into tokens, and each token is matched against the dictionary. A positive match increments the total score of the text, and a negative match decrements it. For example, the word ‘beautiful’ increments the total score because it is a positive match. It is a basic but very efficient technique. One major drawback is that its accuracy and efficiency degrade drastically as the size of the dictionary grows.
¬ Machine Learning Analysis
This approach consists of data collection, pre-processing of the data, training the model, classifying the text, and plotting the results. Its effectiveness depends on the accuracy of the text classifier. This approach overcomes the limitation of lexical analysis because its performance does not degrade as the size of the dictionary grows.
¬ Hybrid Analysis
This approach combines the accuracy of machine learning with the speed of lexical analysis. One common hybrid approach is pSenti, which uses sentiment words from a public source for feature detection. The weights of these features contribute to the sentiment of the text.
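As an illustration, the dictionary approach described under Lexical Analysis can be sketched in a few lines of Python. This is only a minimal sketch and not the code used in this project, which was written in Scala. The two word lists and the function names are assumptions made for the example; a real system would load a published sentiment lexicon.

```python
import re

# Hypothetical toy lexicon (an assumption for this example only).
POSITIVE_WORDS = {"beautiful", "good", "great", "love"}
NEGATIVE_WORDS = {"ugly", "bad", "terrible", "hate"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_score(text):
    """Add 1 for every positive match and subtract 1 for every negative match."""
    score = 0
    for token in tokenize(text):
        if token in POSITIVE_WORDS:
            score += 1
        elif token in NEGATIVE_WORDS:
            score -= 1
    return score
```

A positive total suggests a positive text, a negative total a negative one, and a total of zero a neutral or purely factual one.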
1.5 Why is Opinion Mining Important?
There is an ample amount of diverse data available on social media these days, but this data is only useful when handled strategically. To gain actionable insights, it is important to drill beyond the basic level. Most social platforms give access to general audience metrics such as the number of users, total posts, and likes, but the key factor is audience participation: how users or the public perceive things. To engage directly with the end user, it is important to understand the positive, negative, or neutral sentiment behind social engagement. As humans, we respond to emotions far better than to numbers or statistics. Our emotions drive the decisions we make. With a better understanding and interpretation of these emotions, decisions about improving campaign success, determining market strategy, generating buzz, and improving product placement become a whole lot easier.
2.1 Twitter Search API
The Search API allows polling of tweets that have occurred recently, and access is restricted by Twitter's rate limit. For an individual user, up to 3,200 tweets can be accessed, irrespective of the query criteria. This API can be thought of as searching for a keyword directly in Twitter's search feature. For a specific keyword, up to 5,000 tweets can be polled at once. However, the API limits the number of requests that can be made in a given time frame. The current Twitter request rate is 180 requests in a 15-minute window.
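The rate limit quoted above works out to one request every 5 seconds on average (900 seconds divided by 180 requests). The following Python sketch shows one way a client could keep track of this budget. It is an illustration only: the class and method names are hypothetical, no Twitter library is involved, and the caller supplies the current time.

```python
from collections import deque

WINDOW_SECONDS = 15 * 60   # 15-minute window quoted in the text
MAX_REQUESTS = 180         # request budget per window quoted in the text


class RequestBudget:
    """Tracks request timestamps and reports how long to wait before the next call."""

    def __init__(self, max_requests=MAX_REQUESTS, window=WINDOW_SECONDS):
        self.max_requests = max_requests
        self.window = window
        self.sent = deque()  # timestamps of requests inside the current window

    def wait_time(self, now):
        """Return the seconds to wait at time `now` (0.0 if a request may be sent)."""
        # Drop timestamps that have left the sliding window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) < self.max_requests:
            return 0.0
        return self.window - (now - self.sent[0])

    def record(self, now):
        """Register a request sent at time `now`."""
        self.sent.append(now)
```

A polling loop would call `wait_time` before each request, sleep for the returned duration, and then call `record`.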
I chose the Search API over the Streaming API because the Streaming API only delivers a sample of the tweets that are currently occurring. This is because Twitter does not currently have the infrastructure to provide all real-time tweets via the Streaming API. The Search API is commonly used to monitor events and situations that need real-time analysis.
In order to stream tweets, an application first needs to be registered with Twitter to obtain the necessary authentication keys.
Figure 1: Creating Twitter Application
Figure 2: Twitter Authentication
2.2 Apache Spark Streaming
Previously, the Apache Spark framework was mostly used for processing data in batch mode. With Spark Streaming, data is computed and analyzed in real time as it arrives. The Spark Streaming library enables processing data on the fly and creating valuable insights for data-driven decision-making.
Spark Streaming is an extension of the core Spark API that provides a unified framework for all processing needs. It divides the live data stream into small batches at pre-defined intervals. Each of these batches is represented as a Resilient Distributed Dataset (RDD).
These RDDs can then be analyzed using different techniques. The results of the analysis are also generated in batches. These batches are usually stored in databases or file systems for further analytics or for generating reports and dashboards. Data sources for Spark Streaming include platforms such as Kafka, Flume, Twitter, and Kinesis.
Figure 3: Spark Streaming Structure 
Spark Streaming comes with various operations that are useful for data stream processing, such as flatMap, filter, count, reduce, and join. Along with these basic operations, it also provides window-based operations such as countByWindow, window, and reduceByWindow, as well as the stateful operation updateStateByKey. The Spark Streaming library currently supports the Scala, Java, and Python programming languages.
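To make the idea of micro-batches and window operations more concrete, the following plain Python sketch mimics what happens conceptually. It does not use the Spark API. The function names are hypothetical, and the window length and slide are expressed as a number of batches rather than as durations, which is a simplification.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, item) pairs into consecutive fixed-length batches.

    Batch i holds the items whose timestamp falls in
    [i * batch_interval, (i + 1) * batch_interval).
    """
    buckets = defaultdict(list)
    for timestamp, item in events:
        buckets[int(timestamp // batch_interval)].append(item)
    last = max(buckets) if buckets else -1
    # Intervals without any event still produce an (empty) batch.
    return [buckets.get(i, []) for i in range(last + 1)]


def count_by_window(batches, window_length, slide):
    """Count the items in a sliding window measured in number of batches."""
    counts = []
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_length)
        counts.append(sum(len(batch) for batch in batches[start:end]))
    return counts
```

For example, four tweets arriving at seconds 0.5, 1.2, 1.9, and 4.1 with a 2-second batch interval fall into three batches, the middle one being empty.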
2.2.1 Apache Spark vs Hadoop
Ever since the boom in Big Data, Hadoop and Spark have been the primary Big Data frameworks. However, a few aspects of Apache Spark make it a wise choice for complex data analytics.
¬ Real Time Analytics
Real-time analysis is one of the most recent trends in the world of big data. Companies are investing in setting up real-time frameworks, and social media is becoming the modern tool for customer engagement. Hadoop relies on batch processing and stores the data on disk. Spark, on the other hand, can handle real-time data streams, which means it can process the data as it is fetched from the source. This is one of the many reasons why Spark is commonly used these days for machine learning projects and applications. Spark also provides its own machine-learning library, MLlib, whereas Hadoop has to rely on an external library.
Spark can process data up to 100 times faster than MapReduce thanks to its in-memory processing capabilities. Resilient Distributed Datasets (RDDs) enable Spark to keep intermediate results in memory and use the disk only when needed, reducing the number of input and output operations.
Figure 4: MapReduce vs Apache Spark 
2.3 React+D3 Library
These two libraries provide excellent, flexible, and reusable visualizations across the application. Using them together also makes it possible to create responsive charts without writing custom code for every chart.
2.4 Scala Build Tool
Scala Build Tool (SBT) is used for compiling, running, testing, and packaging a project. SBT creates a jar file containing all the dependencies required to run the project. It also provides an interactive shell and a plugin architecture.
Figure 5: SBT Dependencies
2.5 IntelliJ IDEA
I used IntelliJ IDEA for the development of this project because it provides excellent support for Scala. It has a better understanding of code structure and flow and therefore provides better syntax recommendations than the popular Eclipse IDE.
Figure 6: Scala Build in IntelliJ IDEA
2.6 System Workflow
Figure 7: System Diagram
OPINION MINING USING SPARK
3.1 Streaming Data from Twitter
To set up a Twitter stream, we first need to create an access token for our Twitter account. Twitter imposes various regulations and rate limits on its API for legal reasons concerning privacy. All users are therefore required to provide authentication details to query the API. After verification, the user is provided with credential keys to access the API. I first registered my application at https://apps.twitter.com. After registration, my authentication credentials were created, and I stored them in a twitter.txt file. I filtered the data stream so that it contains only tweets with the specific keywords required.
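The keyword filtering step can be pictured with the following Python sketch. It is only an illustration of the idea, not the project's Scala code: the function names are hypothetical, tweets are represented as plain strings, and a simple case-insensitive substring match is assumed.

```python
def matches_keywords(tweet_text, keywords):
    """Return True if the tweet contains at least one keyword (case-insensitive)."""
    lowered = tweet_text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def filter_stream(tweets, keywords):
    """Lazily yield only the tweets that mention one of the keywords."""
    for tweet in tweets:
        if matches_keywords(tweet, keywords):
            yield tweet
```

Because `filter_stream` is a generator, it can be applied to an unbounded stream of incoming tweets without holding them all in memory.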
3.2 Pre-Processing of the Data
Twitter restricts tweets to 140 characters or fewer, so the data collected from Twitter is full of missing information and inconsistencies. Therefore, we need to process the data before moving on to analytics and mining. We need to get rid of any unwanted data that would hinder the machine-learning algorithm in labeling the tweets. Tokenization was performed to split each tweet into tokens, and part-of-speech tagging was used to attach additional information to them. A part-of-speech (POS) tagger takes text as input and assigns a tag such as noun, adjective, verb, or preposition to each word in the text. I used the Stanford POS tagger to accomplish this task.
It becomes easier to analyze the text once we have extracted additional information such as the number of positive or negative words, hashtags, and special characters. These features serve as the classification parameters when we apply the machine-learning algorithm to the processed data. Words like ‘the’, ‘at’, and ‘is’ are usually treated as stop words in natural language processing. They add little or no meaning to the text when fed to the classifier and occur about equally often in the positive and negative sets. Removing them allows more specific data to be passed to the classifier. Since there is no single definitive list of stop words, I used the default list of the Natural Language Toolkit (NLTK) to filter them out in Scala.
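The cleaning steps described above can be sketched as follows in Python. This is a simplified illustration and not the project's implementation: the stop-word list is a small assumed sample rather than the full default list, the regular expression is one possible tokenizer, POS tagging is left out, and the function name is hypothetical.

```python
import re

# Small illustrative stop-word list (an assumption for this example).
STOP_WORDS = {"the", "at", "is", "a", "an", "of", "and", "to", "in"}

# Matches hashtags, mentions, URLs, and ordinary words, in that order.
TOKEN_PATTERN = re.compile(r"#\w+|@\w+|https?://\S+|\w+(?:'\w+)?")


def preprocess(tweet):
    """Tokenize a tweet, drop noise and stop words, and extract simple features."""
    tokens = TOKEN_PATTERN.findall(tweet.lower())
    hashtags = [t for t in tokens if t.startswith("#")]
    words = [
        t for t in tokens
        if not t.startswith(("#", "@", "http")) and t not in STOP_WORDS
    ]
    return {"words": words, "hashtags": hashtags}
```

The returned dictionary keeps the remaining words for the classifier and the hashtags as a separate feature.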
3.3 Sentiment Analysis
There are many good natural language processing toolkits available, such as Deeplearning4j, Apache OpenNLP, and the Natural Language Toolkit, with features like pipeline construction, lexical analysis, and saving or loading models. I chose Stanford's CoreNLP because of the high quality of its analytics library and its support for many popular programming languages.
After tokenization and feature extraction, the next step is to evaluate the score of each individual token and construct the sentiment tree for the entire text. A tree structure is used to aggregate the overall sentiment.
The final score is calculated by dividing the sum of the scores in the tree by the total number of words.
The final score is then mapped to the respective weighted sentiment: 0 for neutral, 1 for positive, and -1 for negative.
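The averaging and mapping steps can be illustrated with a short Python sketch. It is a simplification made under stated assumptions: the sentiment tree is replaced by a flat list of per-token scores, the cut-off `threshold` around zero is an assumed value that the text does not specify, and the function names are hypothetical.

```python
def average_score(token_scores):
    """Average the per-token scores; an empty text counts as 0.0."""
    if not token_scores:
        return 0.0
    return sum(token_scores) / len(token_scores)


def to_label(score, threshold=0.1):
    """Map an average score to 1 (positive), -1 (negative), or 0 (neutral)."""
    # `threshold` is an assumed cut-off, not a value taken from the project.
    if score > threshold:
        return 1
    if score < -threshold:
        return -1
    return 0
```

A text whose positive and negative tokens roughly cancel out ends up close to zero and is therefore labeled neutral.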
3.4 Visualization of the result
To visualize the results of the analysis, I used the React and D3 libraries together. I created objects inside React components. These objects hold the data to be displayed along with the information used to customize the display. The weighted sentiments from Spark were sent to the visualization scripts in React to display the result.
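Before the sentiments can be drawn as a bar chart, they have to be aggregated into counts per label. The Python sketch below shows one possible shape for such data. The JSON field names `sentiment` and `count` and the function name are purely hypothetical, since the exact format exchanged between Spark and the React scripts is not described here.

```python
import json
from collections import Counter

LABEL_NAMES = {1: "positive", 0: "neutral", -1: "negative"}


def chart_payload(labels):
    """Count tweets per sentiment label and serialize the counts as JSON."""
    counts = Counter(labels)
    rows = [
        {"sentiment": name, "count": counts.get(value, 0)}
        for value, name in LABEL_NAMES.items()
    ]
    return json.dumps(rows)
```

Each row of the resulting list corresponds to one bar of the chart.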
4.1 Project Execution
Spark needs all the code and dependencies to be packaged along with the application. This project is launched on Spark using an uber jar file. Scala Build Tool (SBT) was used to package all the code and dependencies into this file. The jar file contains all the dependencies needed at run time.
Once the jar file is created, the bin/spark-submit script is used to launch the application.
Figure 8: Project Build
4.2 Submitting the Application
This project can be thought of as an analysis engine that processes real-time data to give good insight into the perception of a topic of interest. There are two ways to invoke the engine:
4.2.1 Via Command Line
The analysis engine can be invoked with a simple command:
spark-submit --class "com.pp.sentiment.TweetStream" --master local[*] ./target/scala-2.10/Spark-Sentiment-assembly-1.0.jar <keywords for sentiment analysis>
This spark-submit command runs the bundled code and dependencies that we built using SBT.
4.2.2 Using the Web Application
The user can enter the required keyword on the main page of the deployed web application, and the Spark engine will start running the analysis. It is advisable to let the analysis run for a good stretch of time so that it collects enough data for a better understanding. The size of the data strongly influences the analysis, so the results are more informative when a good set of data has been collected.
Figure 9: Main Page
The user can input the keyword of interest. This application is based on an n-gram model, which means it can take a sequence of multiple words to stream the data, provided the keywords are properly separated by spaces.
Here we are running the analysis on two space-separated words.
Figure 10: Bar Graph
Figure 11: Bar Graph
Figure 12: Time Series Bar Chart
Opinion mining is one of the most challenging and intriguing areas of Natural Language Processing. With the recent boom in social media, social platforms have become the primary tool for customer engagement, and opinion mining lets companies gain insight into how their products or services are perceived by their active or potential consumer base.
Working on this project helped me gain a better understanding of data analytics and the tools and technologies that go with it. Exploring Spark, an immensely powerful tool for real-time analysis, was a good learning experience, as was turning the results of the analytics into an informative visualization. I also learned Scala, which is both an object-oriented and a functional language.
During the course of this project, I learned to tackle problems and explore technologies that were new to me. As a newcomer to Natural Language Processing and Machine Learning, I gained a lot of hands-on experience in converting raw, unstructured data into useful information that can help in making business decisions. Working on a real-time project and learning at each stage from start to finish was definitely a great experience.
Apart from text, tweets also contain images. Sharing images is more personal and lets people express their sentiment more freely, for example by sharing a meme to convey their emotions about a topic in a fun way. Therefore, future work on this project includes extending its scope by incorporating image processing.
Slide Share. Sentiment Analysis of Twitter Data. [Online]. Available:
https://www.slideshare.net/sumit786raj/sentiment-analysis-of-twitter-data. Accessed September 2016.
Meaning Cloud. An Introduction to Sentiment Analysis (Opinion Mining). [Online]. Available:
mining-in-meaningcloud. Accessed October 2016.
B. Pang and L. Lee, “Opinion Mining and Sentiment Analysis”, Foundations and Trends in Information Retrieval, Vol. 2, 1-135, 2008.
B. Liu, “Sentiment Analysis and Opinion Mining”, Morgan & Claypool, 2012.
H. Thakkar and D. Patel, “Approaches for sentiment analysis on twitter: A state-of-art study”, arXiv preprint arXiv:1512.01043, Dec 2015.
A. Mudinas, D. Zhang and M. Levene, “Combining Lexicon and Learning based Approaches for Concept Level Sentiment Analysis”, Proceedings of the First International Workshop on Issues of Sentiment Discovery and Opinion Mining, Beijing, China, 2012.
Our Social Times-Social Media for Business. Is sentiment analysis useful? [Online]. Available:
http://oursocialtimes.com/is-sentiment-analysis-useful/. Accessed March 2017.
Digital Marketing Blog. Social Enablement: The Importance of Sentiment Analysis. [Online]. Available:
https://blogs.adobe.com/digitalmarketing/social-media/social-enablement-importance sentiment-analysis/. Accessed March 2017.
Bright Planet. Twitter Firehose vs Twitter API: What's the difference and why should you care? [Online]. Available:
https://brightplanet.com/2013/06/twitter-firehose-vs-twitter-api-whats-the-difference-and-why-should-you-care/. Accessed October 2016.
InfoQ. Big Data Processing with Apache Spark - Part 3: Spark Streaming. [Online]. Available:
https://www.infoq.com/articles/apache-spark-streaming/. Accessed November 2016.
A Developer Diary. How to integrate React and D3 - The right way. [Online]. Available:
http://www.adeveloperdiary.com/react-js/integrate-react-and-d3/. Accessed March 2017.
ReactJS News. Playing with React and D3. [Online]. Available:
https://reactjsnews.com/playing-with-react-and-d3. Accessed March 2017.
Quick SBT tutorial. [Online]. Available:
http://grosdim.blogspot.com/2013/01/quick-sbt-tutorial.html. Accessed January 2017.
Ampcamp tutorials. [Online]. Available:
Accessed October 2016.
LinkedIn. Apache Spark: The game changer. [Online]. Available:
https://www.linkedin.com/pulse/apache-spark-game-changer-mohan-krishna-mannava. Accessed October 2016.