1. BUSINESS UNDERSTANDING:
1.1. BUSINESS PROBLEM:
Social media has emerged as a primary source of data about public opinion and sentiment on a wide range of subjects, because people now spend a lot of time each day on platforms such as Facebook and Twitter sharing their thoughts and opinions. As a result, the voice of the customer has become more influential than before: whatever people say on microblogs is heard by their friends and followers. These sites can be used for brand endorsements, spreading awareness of social issues, promoting public figures, political campaigns during elections, events, award shows, etc. Large organizations can use the data collected from these microblogging sites to analyse public feedback, improve their products and services, and enhance their marketing strategies. Thus, there is huge potential for businesses to find and analyse interesting patterns in the vast amount of internet-based social media data.
Before defining the business problem, it is important to define the business. Our target businesses are companies that wish to understand how they are perceived by their stakeholders, so that they can improve their business and reach more potential customers. For this assignment, the business we are interested in is Apple. Our aim is to use sentiment analysis to process the opinions collected from social media and to forecast trends and risks. We chose this method because, on analysing various case studies of large and medium-sized companies, we found that Sentiment Analysis, by measuring customers' perception of different goods, services and commercials, has helped them improve their products, operations and departmental communications.
Sentiment Analysis is defined as a set of methods that automatically perceive, measure, report and utilize the attitudes, opinions and emotions of people.
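To make the definition concrete, the following is a minimal sketch of lexicon-based scoring in base R. The word lists, the helper name score_sentiment and the example tweets are illustrative assumptions and are not taken from the project's data or from any real sentiment dictionary.

```r
# Toy lexicon-based sentiment scorer (base R only).
# The word lists are illustrative assumptions, not a real dictionary.
positive_words <- c("love", "great", "amazing", "good")
negative_words <- c("hate", "bad", "terrible", "expensive")

score_sentiment <- function(text) {
  # Lowercase the text, then split on anything that is not a letter
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  # Score = number of positive words minus number of negative words
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

# Made-up example tweets
example_tweets <- c(
  "I love the new camera, it is amazing",
  "Terrible battery and far too expensive",
  "Just picked up my phone today"
)

scores <- vapply(example_tweets, score_sentiment, integer(1), USE.NAMES = FALSE)
labels <- ifelse(scores > 0, "positive",
                 ifelse(scores < 0, "negative", "neutral"))

print(data.frame(text = example_tweets, score = scores, label = labels))
```

A real analysis would replace the toy word lists with an established lexicon or a trained classifier; the counting logic only illustrates the idea of measuring opinion automatically.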
Thus, our business problem statement is “Sentiment Analysis on iPhone X using Twitter Data to understand users' views around the new launch of iPhone X based on Tweets containing #iPhone X, @apple, etc.” This can be broken down into sub-level business problems like –
This analysis will help the company understand people's feelings toward its brand, business, new product launch and directors. It can be used to inform key stakeholders about specific positive or negative discussions and issues that can affect the brand. Based on this information, the company can take action to enhance its customers' experience and its perceived brand value.
1.2. DATA MINING PROBLEM:
The theoretical framework used for this analysis is based on the theories of Social Media Mining (SMM) and Sentiment Analysis. Big Data, Data Mining, Predictive Analysis and Machine Learning techniques have been adopted to analyse and classify the tweets.
Figure – Framework for the Data Mining Problem
The data analysis part of the problem comprises 2 phases:
1. Data Conditioning phase – to transform the noisy raw social media data into high-quality data that enables the computation of predictor variables.
2. Predictive Analysis phase – to create and evaluate a predictor model that will enable accurate prediction of outcomes based on a new set of observations that were not included in the original datasets.
The Data Conditioning phase consists of 2 steps:
1. Collection of Twitter data
2. Data pre-processing
The Predictive Analysis phase consists of 2 steps:
1. Model evaluation and data analysis
2. Correlation and Predictive Analysis
The figure below summarizes the Data Mining Problem and the Analysis Phases.
Figure – Data Analysis Phases
2.1. R PROGRAMMING LANGUAGE
R is a programming language for statistical computing and graphics. It is an independent, open source, highly extensible and widely used language for statistical modelling, classification, clustering, time-series analysis, graphical techniques, etc.
Tools and Packages used in R for this project:
1. twitteR – Provides an interface to the Twitter web API
2. ROAuth – Provides an interface to the OAuth specification and allows users to authenticate with a server via OAuth
3. tm – A framework for text mining applications in R
4. tmap – Offers a flexible, layer-based and easy approach to creating thematic map visualizations
5. wordcloud – Provides support for creating visually appealing word clouds in text mining
6. SnowballC – An R interface to the C libstemmer library, which implements a word stemming algorithm that merges words with a common root
7. syuzhet – A sentiment extraction tool for NLP with four inbuilt dictionaries
8. tidyverse – A collection of R packages for data science that share a common design
9. stringr – A set of functions that make string operations reliable, simple and easy to use
10. tidytext – A text mining package for word processing and sentiment analysis
11. rtweet – A package designed to collect and organize Twitter data via Twitter's API
12. dplyr – The next iteration of plyr, focused on tools for working with data frames
13. devtools – A collection of package development tools
14. igraph – A package for network analysis and graph visualization
15. ggraph – An implementation of the grammar of graphics for graphs and networks
16. RColorBrewer – Offers colour palettes for shading maps according to a variable
17. sentiment – An R package that contains tools for sentiment analysis, including a Bayesian classifier for positive, negative and emotion classification
18. RCurl – Network (HTTP/FTP/…) client interfaces for R
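As a hedged sketch, the availability of these packages could be checked before the analysis starts. The selection of names below is a subset of the list above; whether each one can still be installed from CRAN is an assumption and is not verified here.

```r
# Subset of the packages listed above (CRAN availability is an assumption)
pkgs <- c("twitteR", "ROAuth", "tm", "wordcloud", "SnowballC", "syuzhet",
          "tidytext", "dplyr", "stringr", "RColorBrewer", "RCurl")

# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise,
# without attaching it to the search path
loaded <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(loaded)

# Packages that still need to be installed
missing_pkgs <- names(loaded)[!loaded]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install (needs network access):
  # install.packages(missing_pkgs)
}
```

Once installed, each package is attached in the usual way, for example library(twitteR).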
2.2. CREATING A TWITTER APPLICATION
To implement text analysis, the first step is to collect data. For this, a Twitter application has to be created using the Twitter API. This application allows us to perform the analysis by linking the R console to Twitter.
The steps for creating a Twitter application are:
1. Log in to your Twitter account at https://dev.twitter.com/
2. Navigate to My Applications and create a new application.
3. A screenshot of app creation is attached below.
4. The Twitter application will have the following tokens and keys, all of which are required:
• Consumer API key
• Consumer API secret key
• Access token
• Access token secret
The next step is to authenticate the Twitter credentials with R:
Once we have the app keys from the Twitter API settings, we can use them to authenticate the connection from RStudio. The ROAuth and twitteR libraries are used for authentication, along with the setup_twitter_oauth() function.
A screenshot of the authorization code is attached below.
Figure – Authentication code
On executing the code, we are redirected in the web browser to a Twitter API page, which asks us to authorize the app and generates a unique security code.
Figure – Authorize Twitter App
Figure – Unique security code to be entered on the console
On entering the unique code in the console, the message ‘true’ is displayed, indicating that the handshake is complete. We can now extract tweets from the Twitter timeline.
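Since the authentication code itself is only available as a screenshot, the following is a minimal sketch of what such a call might look like with twitteR. The environment variable names are hypothetical, and supplying all four values directly is assumed to replace the interactive browser step described above.

```r
# Read the four credentials from environment variables.
# The variable names are hypothetical; never hard-code real keys in a script.
creds <- Sys.getenv(c("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
                      "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET"))

have_creds   <- all(nzchar(creds))
have_twitteR <- requireNamespace("twitteR", quietly = TRUE)

if (have_creds && have_twitteR) {
  # Registers the OAuth credentials for the current R session
  twitteR::setup_twitter_oauth(
    consumer_key    = creds[["TWITTER_CONSUMER_KEY"]],
    consumer_secret = creds[["TWITTER_CONSUMER_SECRET"]],
    access_token    = creds[["TWITTER_ACCESS_TOKEN"]],
    access_secret   = creds[["TWITTER_ACCESS_SECRET"]]
  )
} else {
  message("Skipping authentication: credentials or twitteR not available.")
}
```

Keeping the keys outside the script means the code can be shared without exposing the application's secrets.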
2.3. DATA COLLECTION REPORT - EXTRACTING AND SAVING TWEETS
After authorization is complete, we can extract the most recent tweets associated with any hashtag. For this assignment, we extracted 5000 tweets using #iPhone X. These will be used as the training dataset. The following code snippet was used for this.
Figure – Extraction of iPhone tweets and data frame creation
The function searchTwitter() is used to download tweets from the Twitter timeline. These tweets are then stored in a data frame (df), which is then written to a .csv file.
A screenshot of the .csv file is attached below.
Figure – Raw data extracted and stored as .csv file.
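The extraction step described above can be sketched roughly as follows. The helper fetch_tweets, the output file name and the RUN_TWITTER_EXAMPLE switch are hypothetical; the sketch assumes the twitteR package is installed and that authentication has already succeeded.

```r
# Hypothetical helper: search Twitter, convert the result to a data frame,
# and save it as a .csv file. Assumes an authenticated twitteR session.
fetch_tweets <- function(query, n, csv_path) {
  tweets <- twitteR::searchTwitter(query, n = n)  # list of status objects
  df <- twitteR::twListToDF(tweets)               # one row per tweet
  write.csv(df, file = csv_path)
  df
}

# Only contact the live API when explicitly enabled
# (RUN_TWITTER_EXAMPLE is a made-up switch for this sketch)
if (identical(Sys.getenv("RUN_TWITTER_EXAMPLE"), "true")) {
  df <- fetch_tweets("#iPhone X", n = 5000, csv_path = "iphone_tweets.csv")
  print(dim(df))
}
```

The number of tweets actually returned may be lower than requested, depending on what the search API makes available at the time of the call.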
For testing data,
2.4. DATA DESCRIPTION REPORT
The above image gives an overview of the tweets extracted from the Twitter API. The dataset has 17 columns and 5000 rows.
The dataset contains descriptive information about the tweets, such as:
• text – The actual UTF-8 text of the status update
• favorited – Indicates whether this tweet has been liked by the authenticating user
• favoriteCount – Indicates approximately how many times this tweet has been liked by Twitter users
• replyToSN – Screen name of the user this tweet is in reply to
• created – The date and time the tweet was created
• truncated – Indicates whether the text was truncated (true) or not (false). For example, a retweet that exceeds the original text length of 140 characters is truncated and ends in an ellipsis (…)
• replyToSID – ID of the status this tweet is in reply to
• id – Integer representing the unique identifier of the tweet
• replyToUID – ID of the user this tweet is in reply to
• statusSource – Source user agent for the tweet
• screenName – Screen name of the user who posted the tweet
• retweetCount – Number of times this tweet has been retweeted
• isRetweet – Indicates whether the tweet is an explicit retweet. Quoted and modified tweets that do not use Twitter's Retweet functionality are not included
• retweeted – Indicates whether this tweet has been retweeted by the authenticating user
• longitude – Longitude of the tweet's geographic location
• latitude – Latitude of the tweet's geographic location
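A quick way to inspect such a dataset in R is sketched below on a two-row toy data frame. The values are made up for illustration, and the column names are assumed to follow those produced by twitteR.

```r
# Toy data frame with the 16 fields described above (values are made up)
toy <- data.frame(
  text          = c("Loving the new phone!", "RT @apple: now available"),
  favorited     = c(FALSE, FALSE),
  favoriteCount = c(3, 0),
  replyToSN     = c(NA_character_, NA_character_),
  created       = as.POSIXct(c("2017-11-03 10:00:00", "2017-11-03 10:05:00"),
                             tz = "UTC"),
  truncated     = c(FALSE, TRUE),
  replyToSID    = c(NA_character_, NA_character_),
  id            = c("1001", "1002"),
  replyToUID    = c(NA_character_, NA_character_),
  statusSource  = c("Twitter for iPhone", "Twitter Web Client"),
  screenName    = c("user_a", "user_b"),
  retweetCount  = c(0, 12),
  isRetweet     = c(FALSE, TRUE),
  retweeted     = c(FALSE, FALSE),
  longitude     = c(NA_real_, NA_real_),
  latitude      = c(NA_real_, NA_real_),
  stringsAsFactors = FALSE
)

print(dim(toy))                   # rows and columns
str(toy)                          # type of each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)         # missing values per column
retweet_share <- mean(toy$isRetweet)
print(retweet_share)              # proportion of explicit retweets
```

On the real dataset, the same calls would be applied to the data frame read back with read.csv().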
2.5. DATA TRANSFORMATION:
The data extracted from the Twitter API is in raw form. After analysing the raw dataset, we can eliminate the attributes that are not necessary for solving our business problem. We use only text, id and created for the analysis. The text also contains emoticons, extra whitespace, punctuation, stop words, URLs, etc., which have to be removed during pre-processing to obtain a clean dataset.
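The column selection and text clean-up described above might look roughly like the following. This is a base R sketch with regular expressions, not necessarily the tm-based code used in the project; the toy tweets, the short stop-word list and the helper clean_tweet are illustrative assumptions.

```r
# Toy raw data (made-up tweets); real data would come from the saved .csv file
raw <- data.frame(
  id      = c("1", "2"),
  created = c("2017-11-03 10:00:00", "2017-11-03 10:05:00"),
  text    = c("RT @apple: The new #iPhoneX is AMAZING!!! https://t.co/abc123",
              "  I  don't like the price of the phone :( "),
  retweetCount = c(10, 0),
  stringsAsFactors = FALSE
)

# Keep only the attributes needed for the analysis
tweets <- raw[, c("id", "created", "text")]

# Very small illustrative stop-word list (a real one would be much longer)
stop_words <- c("the", "is", "of", "i", "a", "an", "and")

clean_tweet <- function(x) {
  x <- tolower(x)
  x <- gsub("http\\S+", " ", x, perl = TRUE)   # remove URLs
  x <- gsub("\\brt\\b", " ", x, perl = TRUE)   # remove the retweet marker
  x <- gsub("@\\w+", " ", x, perl = TRUE)      # remove @mentions
  x <- gsub("[^a-z' ]", " ", x)                # remove punctuation, digits, emoticons
  words <- strsplit(x, " +")[[1]]              # split on runs of spaces
  words <- words[nzchar(words) & !(words %in% stop_words)]
  paste(words, collapse = " ")
}

tweets$clean_text <- vapply(tweets$text, clean_tweet, character(1),
                            USE.NAMES = FALSE)
print(tweets$clean_text)
```

The order of the steps matters: URLs and mentions are removed before punctuation, because stripping punctuation first would break them into ordinary-looking words.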
2.6. DATA CLEANING:
2.7. DATA PRE-PROCESSING: