Detecting patterns in open and private data has, in recent years, become an area of interest to researchers around the globe and across different sectors. For instance, the medical sector makes use of pattern detection to diagnose patients or analyse a disease. Pattern detection goes far beyond this example. Today’s social media platforms offer researchers a new opportunity to explore and examine the behaviour of their users with regard to their social activities over the internet. Furthermore, platforms such as Facebook, Twitter, Qzone and many others have contributed to what is known today as the field of Data Science: they provide open environments in which users express their opinions and share their ideas, and they give data scientists around the globe the opportunity to use that data to detect patterns in user behaviour in association with different factors.
Twitter is one of many highly active platforms that provide similar services. However, by constraining the length of a tweet to 140 characters, Twitter has made communication easier and has differentiated itself from its competitors, which in turn has given it a high number of monthly active users: according to [1], as of the fourth quarter of 2017, Twitter had an average of 330M monthly active users. As a result, Twitter has been able to offer a massive amount of data to developers around the world, who have successfully implemented artificial intelligence models that make use of Twitter data. These range from gathering basic statistics about Twitter, its users and their behaviour to mining Twitter data and discovering the underlying patterns in a given number of tweets [2]. Detecting trending topics is another subject of interest on Twitter that can benefit both the private and public sectors. Because trending topics reflect what is most discussed on Twitter, they can be used to help detect and identify public concerns, especially in countries where the number of Twitter users is large enough to serve as a measure of public concern. In addition, trending topic detection can be extended to cover many other aspects of user behaviour, such as applying sentiment analysis to identify user satisfaction with the trending topics, or identifying the external factors that contribute to a topic trending.
In this report, we discuss the use of Twitter data to detect Twitter’s trending topics. We define a trending topic as a topic that is discussed by the relatively highest number of active users within a given time interval. As discussed above, the length constraint applied to Twitter text has popularised Twitter over other platforms for many users; however, it has also made Twitter text much harder to process than text from other platforms. Hashtags, abbreviations, misspelled words and many other irregular entities can be found in Twitter text.
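The definition of a trending topic given above can be made concrete with a minimal sketch. It assumes each tweet has already been reduced to a (user, timestamp, topic) triple, which is a simplification: assigning a topic to a tweet is the hard part that the rest of this report discusses. The function name `trending_topics`, its parameters and the sample data are hypothetical and serve only as illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def trending_topics(tweets, window_start, window_length, top_n=3):
    """Rank topics by the number of distinct users who discussed them
    inside the half-open interval [window_start, window_start + window_length)."""
    window_end = window_start + window_length
    users_per_topic = defaultdict(set)
    for user, timestamp, topic in tweets:
        if window_start <= timestamp < window_end:
            # A set counts each user once, however many tweets they post.
            users_per_topic[topic].add(user)
    # Sort by user count (descending), then by topic name for stable ties.
    ranked = sorted(users_per_topic.items(),
                    key=lambda item: (-len(item[1]), item[0]))
    return [(topic, len(users)) for topic, users in ranked[:top_n]]


# Hypothetical sample data, invented for this illustration only.
SAMPLE_TWEETS = [
    ("alice", datetime(2018, 1, 1, 12, 0), "christmas"),
    ("bob", datetime(2018, 1, 1, 12, 5), "christmas"),
    ("alice", datetime(2018, 1, 1, 12, 10), "christmas"),
    ("carol", datetime(2018, 1, 1, 12, 20), "football"),
    ("dave", datetime(2018, 1, 1, 14, 0), "football"),  # outside the window
]

if __name__ == "__main__":
    print(trending_topics(SAMPLE_TWEETS,
                          datetime(2018, 1, 1, 12, 0),
                          timedelta(hours=1)))
```

Counting distinct users rather than tweets follows the wording of the definition, so a single very active user cannot make a topic trend on their own.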
There have been a number of attempts, mostly from the Far East, to design and implement an AI model capable of identifying the trending topics in Twitter. In the next subsection, Topic Research, we discuss some of the work done in this regard in terms of its design and evaluation, and its advantages and disadvantages.
2.1. Topic Research
In this section, we discuss the topic research following the structure of the data flow. We start with data acquisition, then look closely at data preprocessing and its most fundamental aspects. After that, we discuss the design, implementation, and pros and cons of each model. Finally, we look at the model evaluation process in terms of its methodology and outcome.
2.1.1. Data Acquisition
As mentioned above, Twitter provides an API that can be used to extract Twitter data. However, in order to connect an application to the Twitter API, an external package is needed, and a different package exists for each programming language. Java developers, for example, use Twitter4J, a Java package developed by Yusuke Yamamoto that provides all the key functionality needed.
2.1.2. Data Preprocessing
To the best of our knowledge, the majority of the work published on trending topic identification has followed similar approaches to preprocessing the Twitter data stream in order to prepare the data for processing. After removing the common unneeded entities, such as reserved words, URLs, etc., <the names of the researchers/a number of researchers> [3] decided to take advantage of the hashtag entity to help identify the topic of the tweet. A prime example given by [author(s)][3] to justify the use of hashtags compares two sentences that have nothing in common but a single hashtag and one common term. “I will receive lots of gifts during this festival #Christmas” and “#Christmas is the most welcomed festival for kids in U.S.A.” are both about the Christmas festival; however, if we remove the hashtag from both sentences, we can no longer identify the similarity between their content as efficiently as if we had kept it. We might still draw a connection between the two based on the term festival; however, given the many possibilities for which festival is meant, it is clear why <authors names> decided to keep the hashtag entity.
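The preprocessing step described above can be illustrated with a minimal sketch. It is not the implementation used in [3]: the regular expressions, the contents of `RESERVED_WORDS` and the function names are assumptions made for this example. The sketch strips URLs and reserved words, keeps hashtags, and then compares the two example tweets by their shared hashtags.

```python
import re

# Assumed patterns and reserved words; the cited work does not specify them.
URL_PATTERN = re.compile(r"https?://\S+")
HASHTAG_PATTERN = re.compile(r"#(\w+)")
TOKEN_PATTERN = re.compile(r"#?\w+")
RESERVED_WORDS = {"rt", "via"}


def preprocess(tweet):
    """Return (tokens, hashtags) for a tweet, both lower-cased.

    URLs and reserved words are removed; hashtags are kept in the token
    list (with their '#') and also returned separately (without it)."""
    text = URL_PATTERN.sub(" ", tweet)
    hashtags = [tag.lower() for tag in HASHTAG_PATTERN.findall(text)]
    tokens = [token.lower() for token in TOKEN_PATTERN.findall(text)]
    tokens = [token for token in tokens if token not in RESERVED_WORDS]
    return tokens, hashtags


def shared_hashtags(first_tweet, second_tweet):
    """Return the set of hashtags that two tweets have in common."""
    _, first_tags = preprocess(first_tweet)
    _, second_tags = preprocess(second_tweet)
    return set(first_tags) & set(second_tags)


FIRST = "I will receive lots of gifts during this festival #Christmas"
SECOND = "#Christmas is the most welcomed festival for kids in U.S.A."

if __name__ == "__main__":
    print(shared_hashtags(FIRST, SECOND))
```

With the hashtag kept, the two example tweets are linked by an unambiguous shared label; without it, only the more ambiguous term festival connects them.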
Apart from <the authors above>, other researchers [4][5][6] preferred to keep the important terms in the text whilst removing the other entities. Such terms are hard to define, as their importance depends on the research definition of Hot Topics; however, for illustration purposes, we define important terms as the terms within the tweet text that contribute the most to the meaning of the tweet. To understand the importance of identifying the important terms, let us consider the two sentences given in the example by <authors above>. Each tweet consists of ten terms; however, if we closely examine the text, it is easy to distinguish between the common terms that contribute only to the construction of the sentence, such as will, is, the, etc., and the distinct ones that carry its meaning, such as festival, Christmas and U.S.A. The first and most common unneeded terms are stop words, which account for approximately 20% – 30% of the total word count [7]; removing them therefore removes approximately 20% – 30% of the data to be processed. According to [8], stopword removal has also proven beneficial in increasing the accuracy of the model by up to 1.43% and 1.68%.
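As a minimal sketch of stopword removal applied to the two example tweets, the code below filters each tweet against a small hand-picked stopword list. The list `STOP_WORDS` and the function name are assumptions made for this illustration; the cited works may use different, much larger lists.

```python
# A tiny, hand-picked stopword list assumed for this illustration only.
STOP_WORDS = {"i", "will", "of", "during", "this",
              "is", "the", "most", "for", "in"}


def remove_stop_words(tweet):
    """Split a tweet on whitespace and drop terms found in STOP_WORDS.

    Matching ignores case and surrounding punctuation, but the kept
    terms are returned exactly as they appeared in the tweet."""
    terms = tweet.split()
    return [term for term in terms
            if term.lower().strip(".,!?") not in STOP_WORDS]


FIRST = "I will receive lots of gifts during this festival #Christmas"
SECOND = "#Christmas is the most welcomed festival for kids in U.S.A."

if __name__ == "__main__":
    for tweet in (FIRST, SECOND):
        kept = remove_stop_words(tweet)
        print(len(tweet.split()), "terms ->", len(kept), "kept:", kept)
```

Because the stopword list here was chosen to match these two sentences, the share of terms removed in this toy example is not representative of the 20% – 30% reported in [7] for larger corpora.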