Essay: Data mining and Twitter


INTRODUCTION

Data analysis for Twitter using clustering and classification draws on a range of data mining and web mining techniques and algorithms. Data mining is also called knowledge discovery in databases (KDD). It is commonly defined as the process of discovering useful patterns or knowledge from data sources, e.g., databases, texts, images, the Web, etc. The patterns must be valid, potentially useful, and understandable. Data mining is a multi-disciplinary field involving machine learning, statistics, databases, artificial intelligence, information retrieval, and visualization. There are many data mining tasks. Some of the common ones are supervised learning (or classification), unsupervised learning (or clustering), association rule mining, and sequential pattern mining.

A data mining application usually starts with an understanding of the application domain by data analysts (data miners), who then identify suitable data sources and the target data. With the data, data mining can be performed, which is usually carried out in three main steps:

– Pre-processing: The raw data is usually not suitable for mining for various reasons. It may need to be cleaned to remove noise or abnormalities. The data may also be too large and/or involve many irrelevant attributes, which calls for data reduction through sampling and attribute or feature selection. Details about data pre-processing can be found in any standard data mining textbook.

– Data mining: The processed data is then fed to a data mining algorithm which will produce patterns or knowledge.

– Post-processing: In many applications, not all discovered patterns are useful. This step identifies the useful ones for the application. Various evaluation and visualization techniques are used to make the decision (a minimal end-to-end sketch of these three steps follows this list). The whole process (also called the data mining process) is almost always iterative. It usually takes many rounds to achieve a satisfactory final result, which is then incorporated into real-world operational tasks. Traditional data mining uses structured data stored in relational tables, spreadsheets, or flat files in tabular form.
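To make the three steps above concrete, the following Python sketch chains pre-processing (cleaning and feature selection), the mining step itself, and a simple post-processing evaluation using scikit-learn. It is a minimal illustration only; the file name raw_data.csv and the column name label are hypothetical placeholders, not anything prescribed by the text.

# Minimal sketch of the pre-processing -> mining -> post-processing pipeline.
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Pre-processing: load raw tabular data and drop rows with missing values (noise removal).
data = pd.read_csv("raw_data.csv")            # hypothetical input file
data = data.dropna()
X, y = data.drop(columns=["label"]), data["label"]

# Feature selection and scaling, followed by the mining algorithm itself.
model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),  # keep the 10 most relevant attributes (assumed to exist)
    ("mine", DecisionTreeClassifier()),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model.fit(X_train, y_train)

# Post-processing: evaluate the discovered model before deploying it operationally.
print(classification_report(y_test, model.predict(X_test)))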

2.2 Support Vector Machines

The foundations of SVMs were developed by Vladimir Vapnik, and SVMs are gaining popularity due to many attractive features and promising empirical performance. The formulation embodies the Structural Risk Minimization (SRM) principle. SVMs were developed to solve the classification problem, but recently they have been extended to the domain of regression problems (for prediction of continuous variables). SVMs can be applied to regression problems by the introduction of an alternative loss function that is modified to include a distance measure. The term SVM refers to both classification and regression methods, and the terms Support Vector Classification (SVC) and Support Vector Regression (SVR) may be used for more precise specification.

An SVM is a supervised learning algorithm that creates learning functions from a set of labeled training data. It has a sound theoretical foundation and requires a relatively small number of samples for training; experiments have shown that it is relatively insensitive to the dimensionality of the samples. Initially, the algorithm addresses the general problem of learning to discriminate between members of two classes represented as n-dimensional vectors. The learned function can be a classification function (the output is binary) or a general regression function.

Support vector machines (SVM) are another type of learning system with many desirable qualities that make it one of the most popular algorithms. It not only has a solid theoretical foundation, but also performs classification more accurately than most other algorithms in many applications, especially those involving very high dimensional data. For instance, it has been shown by several researchers that SVM is perhaps the most accurate algorithm for text classification. It is also widely used in Web page classification and bioinformatics applications.

In general, SVM is a linear learning system that builds two-class classifiers. Let the set of training examples D be
D = {(x1, y1), (x2, y2), ..., (xn, yn)},
where xi = (xi1, xi2, ..., xir) is an r-dimensional input vector in a real-valued space X ⊆ ℝ^r, yi is its class label (output value), and yi ∈ {1, -1}. 1 denotes the positive class and -1 denotes the negative class. Note that we use slightly different notations in this section. We use y instead of c to represent a class because y is commonly used to represent a class in the SVM literature. Similarly, each data instance is called an input vector and denoted by a bold face letter. In the following, we use bold face letters for all vectors.
To build a classifier, SVM finds a linear function of the form
f(x) = ⟨w · x⟩ + b                                   (1)
so that an input vector xi is assigned to the positive class if f(xi) ≥ 0, and to the negative class otherwise, i.e.,
yi = 1 if ⟨w · xi⟩ + b ≥ 0, and yi = -1 if ⟨w · xi⟩ + b < 0.          (2)
Hence, f(x) is a real-valued function f: X ⊆ ℝ^r → ℝ. w = (w1, w2, ..., wr) ∈ ℝ^r is called the weight vector, b ∈ ℝ is called the bias, and ⟨w · x⟩ is the dot product of w and x (or Euclidean inner product). Without using vector notation, Equation (1) can be written as:
f(x1, x2, ..., xr) = w1x1 + w2x2 + ... + wrxr + b,
where xi is the variable representing the ith coordinate of the vector x. For convenience, we will use the vector notation from now on. In essence, SVM finds a hyperplane
⟨w · x⟩ + b = 0                                      (3)
that separates positive and negative training examples. This hyperplane is called the decision boundary or decision surface. Geometrically, the hyperplane ⟨w · x⟩ + b = 0 divides the input space into two half-spaces: one half for positive examples and the other half for negative examples. Recall that a hyperplane is commonly called a line in a 2-dimensional space and a plane in a 3-dimensional space.
Fig. 2.1(A) shows an example in a 2-dimensional space. Positive instances (also called positive data points or simply positive points) are represented with small filled rectangles, and negative examples are represented with small empty circles. The thick line in the middle is the decision boundary hyperplane (a line in this case), which separates positive (above the line) and negative (below the line) data points. Equation (1), which is also called the decision rule of the SVM classifier, is used to make classification decisions on test instances.

Figure 2.1: (A) A linearly separable data set and (B) possible decision boundaries

Figure 2.1(A) raises two interesting questions:
1. There are an infinite number of lines that can separate the positive and negative data points as illustrated by Fig. 2.1(B). Which line should we choose?
2. A hyperplane classifier is only applicable if the positive and negative data can be linearly separated. How can we deal with nonlinear separations or data sets that require nonlinear decision boundaries?
The SVM framework provides good answers to both questions. Briefly, for question 1, SVM chooses the hyperplane that maximizes the margin (the gap) between positive and negative data points, which will be defined formally shortly. For question 2, SVM uses kernel functions. Before we dive into the details, we want to stress that SVM requires numeric data and only builds two-class classifiers. At the end of the section, we will discuss how these limitations may be addressed.
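As a concrete, hedged illustration of the decision rule in Equation (1) and of the kernel option mentioned for question 2, the short scikit-learn sketch below fits a linear SVM to a small invented two-class data set and recovers w and b; it is only a sketch, and the toy points are not from the text.

# Sketch: a linear SVM realizing the decision rule f(x) = <w . x> + b (Equations 1-3).
import numpy as np
from sklearn.svm import SVC

# Invented, linearly separable toy data: positive class (+1) and negative class (-1).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],     # positive points
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])    # negative points
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0)    # kernel="rbf" would give a nonlinear boundary instead
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)

# Classification decision: the sign of <w . x> + b, as in Equation (2).
x_new = np.array([2.2, 2.8])
print("f(x_new) =", float(np.dot(w, x_new) + b))
print("predicted class:", clf.predict([x_new])[0])

Maximizing the margin (question 1) is handled internally by the solver; switching the kernel parameter is the practical counterpart of the kernel-function answer to question 2.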

2.3 Cluster Analysis
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups.

Cluster analysis is an important human activity. Early in childhood, we learn how to distinguish between cats and dogs, or between animals and plants, by continuously improving subconscious clustering schemes. By automated clustering, we can identify dense and sparse regions in object space and, therefore, discover overall distribution patterns and interesting correlations among data attributes. Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing. In business, clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations. Clustering may also help in the identification of areas of similar land use in an earth observation database and in the identification of groups of houses in a city according to house type, value, and geographic location, as well as the identification of groups of automobile insurance policy holders with a high average claim cost. It can also be used to help classify documents on the Web for information discovery.
Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection, where outliers (values that are "far away" from any cluster) may be more interesting than common cases. Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce. For example, exceptional cases in credit card transactions, such as very expensive and frequent purchases, may be of interest as possible fraudulent activity. As a data mining function, cluster analysis can be used as a stand-alone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. Alternatively, it may serve as a preprocessing step for other algorithms, such as characterization, attribute subset selection, and classification, which would then operate on the detected clusters and the selected attributes or features.
Data clustering is under vigorous development. Contributing areas of research include data mining, statistics, machine learning, spatial database technology, biology, and marketing. Owing to the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research.
As a branch of statistics, cluster analysis has been extensively studied for many years, focusing mainly on distance-based cluster analysis. Cluster analysis tools based on k-means, k-medoids and several other methods have also been built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS. In machine learning, clustering is an example of unsupervised learning. Unlike classification, clustering and unsupervised learning do not rely on predefined classes and class-labeled training examples. For this reason, clustering is a form of learning by observation, rather than learning by examples. In data mining, efforts have focused on finding methods for efficient and effective cluster analysis in large databases. Active themes of research focus on the scalability of clustering methods, the effectiveness of methods for clustering complex shapes and types of data, high-dimensional clustering techniques, and methods for clustering mixed numerical and categorical data in large databases.
2.3.1 Fuzzy c-Means

Fuzzy logic creates intermediate classifications, rather than binary, or "crisp", ones. So far, the clustering techniques we have discussed are referred to as hard or crisp clustering, which means that each data object is assigned to only one cluster. For fuzzy clustering, this restriction is relaxed, and the object can belong to all of the clusters with a certain degree of membership. This is particularly useful when the boundaries between clusters are ambiguous and not well separated.
Moreover, the memberships may help the users discover more sophisticated relationships between a given object and the disclosed clusters. Fuzzy c-Means (FCM) can be regarded as a generalization of ISODATA and was realized by Bezdek (1981). FCM attempts to find a partition, represented as c fuzzy clusters, for a set of data objects x_j ∈ ℝ^d, j = 1, ..., N, while minimizing the cost function

J(U, M) = Σ_{i=1..c} Σ_{j=1..N} (u_ij)^m D_ij^2,

where
– U = [u_ij]_{c×N} is the fuzzy partition matrix and u_ij ∈ [0, 1] is the membership coefficient of the jth object in the ith cluster, which satisfies the following two constraints:

Σ_{i=1..c} u_ij = 1, for all j,

which assures the same overall weight for every data point (when fuzzy clustering meets this constraint, it is also called probabilistic clustering), and

0 < Σ_{j=1..N} u_ij < N, for all i,

which assures that there are no empty clusters;
– M = [m_1, ..., m_c] is the cluster prototype (mean or center) matrix;
– m ∈ [1, ∞) is the fuzzification parameter, and a larger m favors fuzzier clusters; m is usually set to 2;
– D_ij = D(x_j, m_i) is the distance between x_j and m_i, and with FCM,

D_ij = ||x_j − m_i||_2,

i.e., the Euclidean or L2 norm distance function is used.
The cost function above can be optimized with an iterative procedure that leads to the standard FCM algorithm. The membership and prototype matrix update equations are obtained through alternating optimization.
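The alternating-optimization updates just mentioned can be sketched in a few lines of NumPy. The snippet below is a minimal illustration of the standard FCM updates (prototype update followed by membership update), not a production implementation, and the sample points are invented.

# Minimal fuzzy c-means sketch: alternating prototype / membership updates.
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, eps=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((c, N))
    U /= U.sum(axis=0, keepdims=True)          # each column sums to 1 (first constraint above)
    M = None
    for _ in range(n_iter):
        Um = U ** m
        M = (Um @ X) / Um.sum(axis=1, keepdims=True)                     # prototype (center) update
        D2 = ((X[None, :, :] - M[:, None, :]) ** 2).sum(axis=2) + 1e-12  # squared L2 distances D_ij^2
        # Membership update: u_ij = 1 / sum_k (D_ij / D_kj)^(2/(m-1)), written with squared distances.
        ratios = (D2[:, None, :] / D2[None, :, :]) ** (1.0 / (m - 1))
        U_new = 1.0 / ratios.sum(axis=1)
        if np.max(np.abs(U_new - U)) < eps:
            U = U_new
            break
        U = U_new
    return U, M

# Invented example: two loose groups of 2-D points.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [3.0, 3.1], [3.2, 2.9], [2.9, 3.3]])
U, M = fuzzy_c_means(X, c=2)
print("cluster centers:\n", M)
print("memberships (rounded):\n", U.round(2))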
2.4 DECISION TREES
A particularly efficient method of producing classifiers from data is to generate a decision tree. The decision-tree representation is the most widely used logic method. There are many decision-tree induction algorithms described primarily in the machine learning and applied statistics literature. They are supervised learning methods that construct decision trees from a set of input-output samples. Decision-tree induction is an efficient nonparametric method for classification and regression. A decision tree is a hierarchical model for supervised learning in which the local region is identified in a sequence of recursive splits through decision nodes with test functions. A decision tree is also a nonparametric model in the sense that we do not assume any parametric form for the class densities.
A typical decision tree learning system adopts a top-down strategy that searches for a solution in a part of the search space. It guarantees that a simple, but not necessarily the simplest, tree will be found. A decision tree consists of nodes where attributes are tested. In a univariate tree, for each internal node, the test uses only one of the attributes. The outgoing branches of a node correspond to all the possible outcomes of the test at the node. A simple decision tree for classification of samples with two input attributes X and Y is given in Figure 2.2. All samples with feature values X > 1 and Y = B belong to Class2, while the samples with values X < 1 belong to Class1, whatever the value of feature Y. The samples at a non-leaf node in the tree structure are thus partitioned along the branches, and each child node gets its corresponding subset of samples. Decision trees that use univariate splits have a simple representational form, making it relatively easy for the user to understand the inferred model; at the same time, they represent a restriction on the expressiveness of the model. In general, any restriction on a particular tree representation can significantly restrict the functional form and thus the approximation power of the model. A well-known tree-growing algorithm for generating decision trees based on univariate splits is Quinlan's ID3, with an extended version called C4.5. Greedy search methods, which involve growing and pruning decision-tree structures, are typically employed in these algorithms to explore the exponential space of possible models.

Figure 2.2: A simple decision tree with the tests on attributes X and Y.
The ID3 algorithm starts with all the training samples at the root node of the tree. An attribute is selected to partition these samples. For each value of the attribute a branch is created, and the corresponding subset of samples that have the attribute value specified by the branch is moved to the newly created child node. The algorithm is applied recursively to each child node until all samples at a node are of one class. Every path to a leaf in the decision tree represents a classification rule. Note that the critical decision in such a top-down decision tree generation algorithm is the choice of an attribute at a node. Attribute selection in the ID3 and C4.5 algorithms is based on minimizing an information entropy measure applied to the examples at a node. The approach based on information entropy aims at minimizing the number of tests needed to classify a sample in a database. The attribute-selection part of ID3 is based on the assumption that the complexity of the decision tree is strongly related to the amount of information conveyed by the value of the given attribute. The information-based heuristic selects the attribute providing the highest information gain, that is, the attribute that minimizes the information needed in the resulting subtree to classify the sample. An extension of ID3 is the C4.5 algorithm, which extends the domain of classification from categorical attributes to numeric ones. The measure favors attributes that result in partitioning the data into subsets with low class entropy, that is, subsets in which the majority of examples belong to a single class. The algorithm basically chooses the attribute that provides the maximum degree of discrimination between classes locally.
2.4.1 Decision Tree Algorithm
The pseudo-code for decision tree model construction is sketched in Sect. 2.4.2. It takes as input a training dataset D and two parameters η and π, where η is the leaf size and π the leaf purity threshold. Different split points are evaluated for each attribute in D. Numeric decisions are of the form Xj ≤ v for some value v in the value range of attribute Xj, and categorical decisions are of the form Xj ∈ V for some subset of values in the domain of Xj. The best split point is chosen to partition the data into two subsets, DY and DN, where DY corresponds to all points x ∈ D that satisfy the split decision, and DN corresponds to all points that do not satisfy it. The decision tree method is then called recursively on DY and DN. A number of stopping conditions can be used to stop the recursive partitioning process. The simplest condition is based on the size of the partition D. If the number of points n in D drops below the user-specified size threshold η, then we stop the partitioning process and make D a leaf. This condition prevents over-fitting the model to the training set by avoiding modeling very small subsets of the data. Size alone is not sufficient, because if the partition is already pure then it does not make sense to split it further. Thus, the recursive partitioning is also terminated if the purity of D is above the purity threshold π.

2.4.2 Decision Tree Algorithm (Pseudo-code)
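The original pseudo-code listing is not reproduced in this preview. As a stand-in, the following Python sketch follows the description in Sect. 2.4.1: it evaluates binary split points of the form Xj ≤ v, chooses the split by information gain, and stops when a partition has fewer than η points or purity at least π. It is an illustrative sketch (numeric attributes only), not the original listing.

# Recursive decision-tree construction as described in Sect. 2.4.1.
# eta = minimum leaf size, pi = leaf purity threshold.
import numpy as np
from collections import Counter

def entropy(y):
    counts = np.bincount(y)
    p = counts[counts > 0] / len(y)
    return -(p * np.log2(p)).sum()

def purity(y):
    return Counter(y).most_common(1)[0][1] / len(y)

def best_split(X, y):
    """Return (gain, attribute j, value v) of the best binary split X[:, j] <= v."""
    best = (0.0, None, None)
    for j in range(X.shape[1]):
        for v in np.unique(X[:, j])[:-1]:          # candidate numeric split points
            left = X[:, j] <= v
            gain = entropy(y) - (left.mean() * entropy(y[left])
                                 + (1 - left.mean()) * entropy(y[~left]))
            if gain > best[0]:
                best = (gain, j, v)
    return best

def build_tree(X, y, eta=2, pi=0.95):
    if len(y) < eta or purity(y) >= pi:
        return {"leaf": True, "label": Counter(y).most_common(1)[0][0]}
    gain, j, v = best_split(X, y)
    if j is None:                                  # no useful split: make a leaf
        return {"leaf": True, "label": Counter(y).most_common(1)[0][0]}
    left = X[:, j] <= v                            # DY / DN partition
    return {"leaf": False, "attr": j, "value": v,
            "yes": build_tree(X[left], y[left], eta, pi),
            "no": build_tree(X[~left], y[~left], eta, pi)}

def predict(tree, x):
    while not tree["leaf"]:
        tree = tree["yes"] if x[tree["attr"]] <= tree["value"] else tree["no"]
    return tree["label"]

# Invented toy data: two numeric attributes, two classes.
X = np.array([[0.5, 1.0], [0.7, 2.0], [1.5, 0.5], [2.0, 1.5], [2.5, 0.2], [0.2, 1.8]])
y = np.array([0, 0, 1, 1, 1, 0])
tree = build_tree(X, y, eta=2, pi=1.0)
print(predict(tree, np.array([2.2, 1.0])))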

2.5 J48 Classification
The J48 algorithm in the Weka software is a popular machine learning algorithm based upon J. R. Quinlan's C4.5 algorithm. All data to be examined will be of the categorical type and therefore continuous data will not be examined at this stage. The algorithm will, however, leave room for adaptation to include this capability. The algorithm will be tested against C4.5 for verification purposes.
J48 implements Quinlan's C4.5 algorithm for generating a pruned or unpruned C4.5 decision tree. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by J48 can be used for classification. Using information entropy, J48 builds decision trees from labeled training data. It exploits the fact that each attribute of the data can be used to make a decision by splitting the data into smaller subsets. J48 examines the normalized information gain (difference in entropy) that results from choosing an attribute to split the data. The attribute with the highest normalized information gain is used to make the decision, and the algorithm then recurses on the smaller subsets. If all instances in a subset belong to the same class, the splitting procedure stops and a leaf node choosing that class is created in the decision tree. It can also happen that none of the features gives any information gain; in that case J48 creates a decision node higher up in the tree using the expected class value. J48 can handle both discrete and continuous attributes, missing attribute values, and attributes of the training data with differing costs. After creating the tree, J48 provides a further option for pruning it.
In Weka, the implementation of a particular learning algorithm is encapsulated in a class, and it may depend on other classes for some of its functionality. The J48 class builds a C4.5 decision tree. Each time the Java virtual machine executes J48, it creates an instance of this class by allocating memory for building and storing a decision tree classifier. The algorithm, the classifier it builds, and a procedure for outputting the classifier are all part of that instantiation of the J48 class. Larger programs are usually split into more than one class. The J48 class does not actually contain any code for building a decision tree; it includes references to instances of other classes that do most of the work. When there are many classes, as in the Weka software, they can become difficult to comprehend and navigate.
2.5.1 J48 decision tree classifier
J48 is an extension of ID3. The additional features of J48 include accounting for missing values, decision tree pruning, continuous attribute value ranges, derivation of rules, etc. In the WEKA data mining tool, J48 is an open source Java implementation of the C4.5 algorithm. The WEKA tool provides a number of options associated with tree pruning. In cases of potential overfitting, pruning can be used to make the model more precise. In other algorithms the classification is performed recursively till every single leaf is pure, that is, until the classification of the data is as perfect as possible. The algorithm generates rules from which the particular identity of the data can be determined. The objective is to progressively generalize a decision tree until it reaches an equilibrium of flexibility and accuracy.

Basic Steps in the Algorithm:
(i) If the instances belong to the same class, the tree represents a leaf, and the leaf is returned labeled with that class.
(ii) Otherwise, the potential information given by a test on each attribute is calculated, and then the gain in information that would result from a test on the attribute is calculated (a minimal sketch of this computation follows the list).
(iii) The best attribute is then found on the basis of the present selection criterion, and that attribute is selected for branching.
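A minimal sketch of the entropy and information-gain computation used in step (ii) is given below. Note that J48/C4.5 additionally normalizes this gain (the gain ratio); the sketch shows plain information gain, and the tiny attribute/class lists are invented purely for illustration.

# Entropy and information gain for one categorical attribute (step ii above).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Gain obtained by splitting `labels` on the values of one categorical attribute."""
    n = len(labels)
    remainder = 0.0
    for value in set(attribute_values):
        subset = [lab for val, lab in zip(attribute_values, labels) if val == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

# Invented example: how much does "outlook" tell us about the class?
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
label = ["no", "no", "yes", "yes", "no", "yes"]
print(round(information_gain(outlook, label), 3))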

The J48 decision tree classifier follows a simple algorithm. In order to classify a new item, it first needs to create a decision tree based on the attribute values of the available training data. So, whenever it encounters a set of items (training set), it identifies the attribute that discriminates the various instances most clearly. This feature, which is able to tell us most about the data instances so that we can classify them best, is said to have the highest information gain. Now, among the possible values of this feature, if there is any value for which there is no ambiguity, that is, for which the data instances falling within its category have the same value for the target variable, then we terminate that branch and assign to it the target value that we have obtained.

For the other cases, we then look for another attribute that gives us the highest information gain. Hence we continue in this manner until we either get a clear decision of what combination of attributes gives us a particular target value, or we run out of attributes. In the event that we run out of attributes, or if we cannot get an unambiguous result from the available information, we assign this branch a target value that the majority of the items under this branch possess.

Now that we have the decision tree, we follow the order of attribute selection as we have obtained for the tree. By checking all the respective attributes and their values against those seen in the decision tree model, we can assign or predict the target value of this new instance. The above description will be clearer and easier to understand with the help of an example. Hence, let us see an example of J48 decision tree classification.
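The worked example originally referred to here is not included in this preview. As a small stand-in, the sketch below trains an entropy-based decision tree on an invented toy data set with scikit-learn; it is not Weka's J48 itself, only an analogous C4.5-style tree used for illustration.

# Toy decision tree classification in the spirit of J48/C4.5
# (scikit-learn's entropy-criterion tree, not Weka's implementation).
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training data: [outlook, windy] encoded as integers.
# outlook: 0 = sunny, 1 = overcast, 2 = rain; windy: 0 = false, 1 = true
X = [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [1, 1]]
y = ["no", "no", "yes", "yes", "no", "yes"]   # class: play outside or not

clf = DecisionTreeClassifier(criterion="entropy")   # entropy-based splits, as in C4.5
clf.fit(X, y)

print(export_text(clf, feature_names=["outlook", "windy"]))   # the learned tree
print(clf.predict([[2, 0]]))                                  # a new instance: rain, not windy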

2.6 Web Mining
Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content, and usage data. Although Web mining uses many data mining techniques, it is not purely an application of traditional data mining techniques, due to the heterogeneity and the semi-structured or unstructured nature of Web data. Many new mining tasks and algorithms have been invented in the past decade. Based on the primary kinds of data used in the mining process, Web mining tasks are commonly divided into three types, which are described later in this section.
The Web is an ever-growing body of hypertext and multimedia documents. As of 2008, Google had discovered 1 trillion Web pages. The Internet Archive, which makes regular copies of many publicly available Web pages and media files, was three petabytes in size as of March 2009. Several billion pages are added to that number each day. As the information offered on the Web grows daily, obtaining that information becomes more and more tedious. The main difficulty lies in the semi-structured or unstructured Web content that is not easy to regulate and where enforcing a structure or standards is difficult. A set of Web pages lacks a unifying structure and shows far more authoring styles and content variation than is seen in traditional print document collections. This level of complexity makes an "off-the-shelf" database-management and information-retrieval solution very complex and almost impossible to use. New methods and tools are necessary. Web mining may be defined as the use of data-mining techniques to automatically discover and extract information from Web documents and services. It refers to the overall process of discovery, not just to the application of standard data-mining tools. Some authors suggest decomposing the Web-mining task into four subtasks:

1. Resource Finding. This is the process of retrieving data, either online or offline, from the multimedia sources on the Web, such as news articles, forums, blogs, and the text content of HTML documents obtained by removing the HTML tags.
2. Information Selection and Preprocessing. This is the process by which the different kinds of original data retrieved in the previous subtask are transformed. These transformations could be either a kind of preprocessing, such as removing stop words and stemming, or a preprocessing aimed at obtaining the desired representation, such as finding phrases in the training corpus and representing the text in first-order logic form.
3. Generalization. Generalization is the process of automatically discovering general patterns within individual Web sites as well as across multiple sites. Different general-purpose machine-learning techniques, data-mining techniques, and specific Web-oriented methods are used.
4. Analysis. This is the task in which validation and/or interpretation of the mined patterns is performed.

Web mining tasks can be categorized into three types: Web structure mining, Web content mining and Web usage mining.
1. Web content mining: Web content mining extracts or mines useful information or knowledge from Web page contents. For example, we can automatically classify and cluster Web pages according to their topics. These tasks are similar to those in traditional data mining. However, we can also discover patterns in Web pages to extract useful data such as descriptions of products, postings of forums, etc., for many purposes. Furthermore, we can mine customer reviews and forum postings to discover consumer opinions. These are not traditional data mining tasks. Web-content mining uses Web-page content as the data source for the mining process. This could include text, images, videos, or any other type of content on Web pages. Web-structure mining focuses on the link structure of Web pages. Web-usage mining does not use data from the Web itself but takes as input data recorded from the interaction of users using the Internet.
The most common use of Web-content mining is in the process of searching. There are many different solutions that take as input Web-page text or images with the intent of helping users find information that is of interest to them. For example, crawlers are currently used by search engines to extract Web content into the indices that allow immediate feedback from searches. The same crawlers can be altered in such a way that rather than seeking to download all reachable content on the Internet, they can be focused on a particular topic or area of interest.
To create a focused crawler, a classifier is usually trained on a number of documents selected by the user to inform the crawler as to the type of content to search for. The crawler will then identify pages of interest as it finds them and follow any links on that page. If those links lead to pages that are classified as not being of interest to the user, then the links on that page will not be used further by the crawler. Web-content mining can also be seen directly in the search process. All major search engines currently use a list-like structure to display search results. The list is ordered by a ranking algorithm behind the scenes. An alternative view of search results that has been attempted is to provide the users with clusters of Web pages as results rather than individual Web pages. Often a hierarchical clustering that will give multiple topic levels is performed.
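A focused crawler of the kind just described can be sketched in a few dozen lines. The snippet below is a simplified illustration that assumes a small set of user-labeled example documents and a hypothetical seed URL, and it uses the requests, BeautifulSoup, and scikit-learn libraries; a real crawler would also need politeness delays, robots.txt handling, and more robust error handling.

# Simplified focused-crawler sketch: follow links only from pages that the
# trained classifier judges relevant to the user's topic of interest.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical user-selected training documents (1 = relevant, 0 = not relevant).
train_docs = ["data mining clustering classification tweets",
              "support vector machines for text classification",
              "recipe for chocolate cake and frosting",
              "football league results and transfer news"]
train_labels = [1, 1, 0, 0]
classifier = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(train_docs, train_labels)

seeds = ["https://example.org/start"]               # hypothetical seed URL
frontier, seen, max_pages = deque(seeds), set(seeds), 20

while frontier and len(seen) <= max_pages:
    url = frontier.popleft()
    try:
        html = requests.get(url, timeout=5).text
    except requests.RequestException:
        continue
    soup = BeautifulSoup(html, "html.parser")
    if classifier.predict([soup.get_text(separator=" ")])[0] != 1:
        continue                                    # off-topic page: do not follow its links
    print("relevant:", url)
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if link.startswith("http") and link not in seen:
            seen.add(link)
            frontier.append(link)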
2. Web structure mining: Web structure mining discovers useful knowledge from hyperlinks (or links for short), which represent the structure of the Web. For example, from the links, we can discover important Web pages, which is a key technology used in search engines. We can also discover communities of users who share common interests. Traditional data mining does not perform such tasks because there is usually no link structure in a relational table.
Web-structure mining considers the relationships between Web pages. Most Web pages include one or more hyperlinks. These hyperlinks are assumed in structure mining to provide an endorsement by the linking page of the page linked to. This assumption underlies PageRank and HITS, which will be explained later in this section. Web-structure mining is mainly used in the information retrieval (IR) process. PageRank may have directly contributed to the early success of Google. Certainly the analysis of the structure of the Internet and the interlinking of pages currently contributes to the ranking of documents in most major search engines.

Web-structure mining is also used to aid Web-content mining processes. Often, classification tasks will consider features from the content of the Web page and may also consider the structure of the Web pages. One of the more common features in Web-mining tasks taken from structure mining is the use of anchor text. Anchor text refers to the text displayed to users on an HTML hyperlink. Oftentimes the anchor text provides summary keywords not found on the original Web page. The anchor text is often as brief as search-engine queries. Additionally, if links are endorsements of Web pages, then the anchor text offers keyword-specific endorsements.
3. Web usage mining: Web usage mining refers to the discovery of user access patterns from Web usage logs, which record every click made by each user. Web usage mining applies many data mining algorithms. One of the key issues in Web usage mining is the pre-processing of clickstream data in usage logs in order to produce the right data for mining.
Web-usage mining refers to the mining of information about the interaction of users with Web sites. This information may come from server logs, logs recorded by the client's browser, registration form information, and so on. Many usage questions exist, such as the following: How does the link structure of the Web site differ from how users may prefer to traverse the pages? Where are the inefficiencies in the e-commerce process of a Web site? What segments exist in our customer base? There are some key terms in Web-usage mining that require defining. A "visitor" to a Web site may refer to a person or program that retrieves a Web page from a server. A "session" refers to all page views that took place during a single visit to a Web site. Sessions are often defined by comparing page views and determining the maximum allowable time between page views before a new session is defined. Thirty minutes is a standard setting. Web-usage mining data often requires a number of preprocessing steps before meaningful data mining can be performed. For example, server logs often include a number of computer visitors that could be search-engine crawlers, or any other computer program that may visit Web sites. Sometimes these "robots" identify themselves to the server by passing a parameter called "user agent" to the server that uniquely identifies them as robots. Some Web page requests do not make it to the Web server for recording; instead, a request may be filled by a cache used to reduce latency.
Servers record information at a granularity level that is often not useful for mining. For a single Web-page view, a server may record the browser's request for the HTML page, a number of requests for images included on that page, the Cascading Style Sheets (CSS) of the page, and perhaps some JavaScript libraries used by that Web page. Often there will need to be a process to combine all of these requests into a single record. Some logging solutions sidestep this issue by using JavaScript embedded into the Web page to make a single request per page view to a logging server. However, this approach has the distinct disadvantage of not recording data for users that have disabled JavaScript in their browser.
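The preprocessing just described (filtering robot requests, keeping only page-view requests, and cutting sessions at a thirty-minute inactivity gap) can be sketched as follows. The log records are invented and already parsed; a real server log would of course need to be parsed first.

# Sketch of Web-usage-log preprocessing: robot filtering, page-view filtering,
# and sessionization with a 30-minute inactivity timeout.
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)

# Invented, already-parsed log records.
log = [
    {"visitor": "u1", "time": datetime(2019, 8, 1, 10, 0), "url": "/index.html", "agent": "Mozilla/5.0"},
    {"visitor": "u1", "time": datetime(2019, 8, 1, 10, 0, 2), "url": "/style.css", "agent": "Mozilla/5.0"},
    {"visitor": "u1", "time": datetime(2019, 8, 1, 10, 10), "url": "/notes.html", "agent": "Mozilla/5.0"},
    {"visitor": "u1", "time": datetime(2019, 8, 1, 11, 30), "url": "/exam.html", "agent": "Mozilla/5.0"},
    {"visitor": "bot1", "time": datetime(2019, 8, 1, 10, 5), "url": "/index.html", "agent": "ExampleBot/1.0"},
]

def is_page_view(record):
    # Keep only HTML page requests; drop images, CSS, scripts, and so on.
    return record["url"].endswith((".html", "/"))

def sessions(records):
    views = [r for r in records if "bot" not in r["agent"].lower() and is_page_view(r)]
    views.sort(key=lambda r: (r["visitor"], r["time"]))
    result, current, prev = [], [], None
    for r in views:
        new_session = (prev is None or r["visitor"] != prev["visitor"]
                       or r["time"] - prev["time"] > SESSION_GAP)
        if new_session and current:
            result.append(current)
            current = []
        current.append(r["url"])
        prev = r
    if current:
        result.append(current)
    return result

for s in sessions(log):
    print(s)     # two sessions for u1: the 10:00 visit and the 11:30 visit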

Web-usage mining takes advantage of many of the data-mining approaches available. Classification may be used to identify characteristics unique to users that make large purchases. Clustering may be used to segment the Web-user population. For example, one may identify three types of behavior occurring on a university class Web site. These three behavior patterns could be described as users cramming for a test, users working on projects, and users consistently downloading lecture notes from home for study. Association mining may identify two or more pages often viewed together during the same session, but that are not directly linked on a Web site. Sequence analysis may offer opportunities to predict user navigation patterns and therefore allow for within-site targeted advertisements.
2.6.1 Information Retrieval and Web Search
Web search needs no introduction. Due to its convenience and the richness of information on the Web, searching the Web is increasingly becoming the dominant information seeking method. People make fewer and fewer trips to libraries, but more and more searches on the Web. In fact, without effective search engines and rich Web contents, writing this book would have been much harder. Web search has its root in information retrieval (or IR for short), a field of study that helps the user find needed information from a large collection of text documents. Traditional IR assumes that the basic information unit is a document, and a large collection of documents is available to form the text database. On the Web, the documents are Web pages.
Retrieving information simply means finding a set of documents that is relevant to the user query. A ranking of the set of documents is usually also performed according to their relevance scores to the query. The most commonly used query format is a list of keywords, which are also called terms. IR is different from data retrieval in databases using SQL queries because the data in databases are highly structured and stored in relational tables, while information in text is unstructured. There is no structured query language like SQL for text retrieval.
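As a small, hedged illustration of keyword retrieval with relevance ranking, the sketch below indexes an invented toy document collection with TF-IDF and ranks it against a keyword query by cosine similarity; real IR systems and search engines use far more elaborate indexing and ranking.

# Toy keyword retrieval: TF-IDF document index + cosine-similarity ranking.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "web mining discovers knowledge from hyperlink structure and page content",
    "support vector machines are used for text classification",
    "clustering groups similar objects without labeled training data",
    "twitter data can be mined for traffic event detection",
]

vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(documents)          # the "document index"

query = "web mining"
scores = cosine_similarity(vectorizer.transform([query]), index)[0]

# Present documents ranked by their relevance score, as an IR system would.
for rank, i in enumerate(scores.argsort()[::-1], start=1):
    print(f"{rank}. score={scores[i]:.3f}  {documents[i]}")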

It is safe to say that Web search is the single most important application of IR. To a great extent, Web search also helped IR. Indeed, the tremendous success of search engines has pushed IR to the center stage. Search is, however, not simply a straightforward application of traditional IR models. It uses some IR results, but it also has its unique techniques and presents many new problems for IR research.

First of all, efficiency is a paramount issue for Web search, but is only secondary in traditional IR systems mainly due to the fact that document collections in most IR systems are not very large. However, the number of pages on the Web is huge. For example, Google indexed more than 8 billion pages when this book was written. Web users also demand very fast responses. No matter how effective an algorithm is, if the retrieval cannot be done efficiently, few people will use it.

Web pages are also quite different from conventional text documents used in traditional IR systems. First, Web pages have hyperlinks and anchor texts, which do not exist in traditional documents (except citations in research publications). Hyperlinks are extremely important for search and play a central role in search ranking algorithms as we will see in the next chapter. Anchor texts associated with hyperlinks too are crucial because a piece of anchor text is often a more accurate description of the page that its hyperlink points to. Second, Web pages are semi-structured.

A Web page is not simply a few paragraphs of text like in a traditional document. A Web page has different fields, e.g., title, metadata, body, etc. The information contained in certain fields (e.g., the title field) is more important than in others. Furthermore, the content in a page is typically organized and presented in several structured blocks (of rectangular shapes). Some blocks are important and some are not (e.g., advertisements, privacy policy, copyright notices, etc.). Effectively detecting the main content block(s) of a Web page is useful to Web search because terms appearing in such blocks are more important.
Finally, spamming is a major issue on the Web, but not a concern for traditional IR. This is so because the rank position of a page returned by a search engine is extremely important. If a page is relevant to a query but is ranked very low (e.g., below the top 30), then the user is unlikely to look at the page. If the page sells a product, then this is bad for the business. In order to improve the ranking of some target pages, "illegitimate" means, called spamming, are often used to boost their rank positions. Detecting and fighting Web spam is a critical issue as it can push low quality (even irrelevant) pages to the top of the search rank, which harms the quality of the search results and the user's search experience.

2.6.2 Basic Concepts of Information Retrieval
Information retrieval (IR) is the study of helping users to find information that matches their information needs. Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information. Historically, IR is about document retrieval, emphasizing the document as the basic unit. Fig. 2.3 gives a general architecture of an IR system. In Fig. 2.3, the user with an information need issues a query (user query) to the retrieval system through the query operations module. The retrieval module uses the document index to retrieve those documents that contain some query terms (such documents are likely to be relevant to the query), computes relevance scores for them, and then ranks the retrieved documents according to the scores. The ranked documents are then presented to the user. The document collection is also called the text database, which is indexed by the indexer for efficient retrieval.

Figure 2.3: A general IR system architecture

A user query represents the user's information need and takes one of the following forms:

1. Keyword queries: The user expresses his/her information need with a list of (at least one) keywords (or terms), aiming to find documents that contain some (at least one) or all of the query terms. The terms in the list are assumed to be connected with a "soft" version of the logical AND. For example, if one is interested in finding information about Web mining, one may issue the query "Web mining" to an IR or search engine system. "Web mining" is treated as "Web AND mining". The retrieval system then finds those likely relevant documents and ranks them suitably to present to the user. Note that a retrieved document does not have to contain all the terms in the query. In some IR systems, the ordering of the words is also significant and will affect the retrieval results.
2. Boolean queries: The user can use the Boolean operators AND, OR, and NOT to construct complex queries. Thus, such queries consist of terms and Boolean operators. For example, "data OR Web" is a Boolean query, which requests documents that contain the word "data" or the word "Web". A page is returned for a Boolean query if the query is logically true in the page (i.e., exact match). Although one can write complex Boolean queries using the three operators, users seldom write such queries. Search engines usually support a restricted version of Boolean queries (a small inverted-index sketch of Boolean matching follows this list).
3. Phrase queries: Such a query consists of a sequence of words that makes up a phrase. Each returned document must contain at least one instance of the phrase. In a search engine, a phrase query is normally enclosed in double quotes. For example, one can issue the following phrase query (including the double quotes), "Web mining techniques and applications", to find documents that contain the exact phrase.
4. Proximity queries: The proximity query is a relaxed version of the phrase query and can be a combination of terms and phrases. Proximity queries seek the query terms within close proximity to each other. The closeness is used as a factor in ranking the returned documents or pages. For example, a document that contains all query terms close together is considered more relevant than a page in which the query terms are far apart. Some systems allow the user to specify the maximum allowed distance between the query terms. Most search engines consider both term proximity and term ordering in retrieval.
5. Full document queries: When the query is a full document, the user wants to find other documents that are similar to the query document. Some search engines (e.g., Google) allow the user to issue such a query by providing the URL of a query page. Additionally, in the returned results of a search engine, each snippet may have a link called "more like this" or "similar pages." When the user clicks on the link, a set of pages similar to the page in the snippet is returned.
6. Natural language questions: This is the most complex case, and also the ideal case. The user expresses his/her information need as a natural language question. The system then finds the answer. However, such queries are still hard to handle due to the difficulty of natural language understanding. Nevertheless, this is an active research area, called question answering. Some search systems are starting to provide question answering services for some specific types of questions, e.g., definition questions, which ask for definitions of technical terms. Definition questions are usually easier to answer because there are strong linguistic patterns indicating definition sentences, e.g., "defined as", "refers to", etc. Definitions can usually be extracted offline.
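The inverted-index sketch below illustrates the exact Boolean matching described for queries such as "data OR Web" and "data AND Web"; it is a toy illustration, not how any particular search engine implements Boolean retrieval.

# Toy inverted index with Boolean OR / AND matching.
from collections import defaultdict

documents = {
    0: "data mining discovers patterns in data",
    1: "web mining analyses hyperlink structure",
    2: "twitter streams carry short messages",
}

# Build the inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def boolean_or(*terms):
    return set().union(*(index[t] for t in terms))

def boolean_and(*terms):
    return set.intersection(*(index[t] for t in terms))

print(boolean_or("data", "web"))    # documents containing "data" or "web" -> {0, 1}
print(boolean_and("data", "web"))   # documents containing both words      -> set()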

2.7 LITERATURE REVIEW
Traffic issues affect the mobility of many people and the dynamics of big urban centers. The study of traffic is of paramount importance for the improvement of facilities and the generation of efficient routes. Social media has thus emerged as a significant source of data which can be leveraged to improve crisis response. Twitter is a popular platform which has been used in recent crises. However, it presents new challenges: the data is noisy and uncurated, and it has high volume and high velocity.

With the fast development of social media, Twitter has become one of the most widely adopted platforms for people to post short, instant messages. Because of such widespread adoption of Twitter, events like breaking news and the release of popular videos can easily capture people's attention and spread quickly on Twitter. Consequently, the popularity and significance of an event can be roughly measured by the volume of tweets covering it.
In addition, the relevant tweets also reflect the public's opinions and reactions to events. It is therefore very important to identify and analyze the events on Twitter. A number of event detection systems work on social text streams. Existing studies consider global events [1] and local events [2]. Moreover, several studies of emerging event detection [3] focus on global events, and some studies focus on identifying specific events. However, this task cannot be achieved by classifying each message in real time on the platform; the classes cannot be predefined because new events constantly appear in the social stream, and labeling tweets for training is not feasible as a result of the huge amount of messages posted. Moreover, event detection from social network analysis is a more challenging problem than event detection from conventional media like blogs, emails, etc., where texts are well formatted [5]. In fact, SUMs (status update messages) are unstructured and irregular texts; they contain informal or abbreviated words, misspellings, or grammatical errors [4]. Due to their nature, they are usually very brief, thus becoming an incomplete source of information [5]. Furthermore, SUMs contain an enormous amount of useless or meaningless information [6], which has to be filtered.

In [17], Eleonora D'Andrea, Pietro Ducange, Beatrice Lazzerini, and Francesco Marcelloni present a real-time monitoring system for traffic event detection from Twitter stream analysis. The system fetches tweets from Twitter according to several search criteria, processes the tweets by applying text mining methods, and finally performs the classification of the tweets. The aim is to assign the appropriate class label to each tweet, as related to a traffic event or not. The traffic detection system was employed for real-time monitoring of several regions of the Italian road network, allowing detection of traffic events almost in real time, often before online traffic news web sites. They used a support vector machine as the classification model and achieved an accuracy of 95.75% on the binary classification problem, i.e., traffic versus non-traffic tweets. They were also able to distinguish whether traffic was caused by an external event or not by solving a multi-class classification problem, obtaining an accuracy of 88.89% for the 3-class problem in which the traffic-due-to-external-event class was also considered.
In this paper, the authors focus on a particular small-scale event, i.e., road traffic, and they aim to detect and analyze traffic events by processing users' SUMs belonging to a certain area and written in the Italian language. To this aim, they propose a system able to fetch, elaborate, and classify SUMs as related to a road traffic event or not. To the best of their knowledge, few papers have been proposed for traffic detection using Twitter stream analysis; with respect to their work, all of them focus on languages other than Italian, employ different input features and/or feature selection algorithms, and consider only binary classification. In addition, a few works employ machine learning algorithms [18], [19], while the others rely on NLP techniques only. The proposed system can handle both binary and multi-class classification problems. For binary classification, traffic-related tweets are distinguished from tweets not related to traffic. For multi-class classification, the traffic-related class is split into two classes, namely traffic congestion or crash, and traffic due to an external event. Here, an external event refers to a scheduled event (e.g., a football match, a concert) or an unexpected event (e.g., a flash mob, a political demonstration, a fire). In this way the system aims to support traffic and city administrations in managing scheduled or unexpected events in the city.
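The binary traffic / non-traffic classification step of such a system can be sketched with a standard bag-of-words SVM, as below. The labeled tweets are invented English stand-ins; the real system in [17] works on Italian tweets with its own feature extraction and selection, so this is only an illustrative analogue.

# Illustrative traffic vs. non-traffic tweet classifier (TF-IDF + linear SVM).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tweets = [
    "huge traffic jam on the ring road after the crash",        # traffic
    "accident blocking two lanes near the stadium exit",        # traffic
    "queues for kilometres due to roadworks on the motorway",   # traffic
    "what a great concert last night, amazing atmosphere",      # non-traffic
    "new pizza place opened downtown, highly recommended",      # non-traffic
    "reading a good book on machine learning this weekend",     # non-traffic
]
labels = ["traffic", "traffic", "traffic", "no-traffic", "no-traffic", "no-traffic"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(tweets, labels)

print(model.predict(["two cars collided, long queues on the bypass"]))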
Social media presents an interesting opportunity to harness data and predict real-world outcomes by building models that aggregate the opinions of the collective population and give insights into their behavior. Sitaram et al. demonstrated how social media content such as chatter from Twitter can be used to predict real-world outcomes, forecasting box-office revenues for movies [20]. The study shows how a simple model built from the rate at which tweets are created about particular topics can outperform market-based predictors. Extracted sentiments can further be used to improve the forecasting power of social media.
To examine the real-time interaction of events such as earthquakes in Twitter, Sakaki et al. in [21] present a real-time event detection approach that uses Twitter messages associated with time and geographic location information to detect event occurrences. This work is concerned with quickly detecting specific types of events (i.e., earthquakes, typhoons and traffic jams) in order to issue a timely warning for the areas that are about to be affected by these disasters. The authors manually define a set of keywords relevant to the types of events they want to detect, such as {"earthquake", "shaking"} in the earthquake situation and {"typhoon"} in the typhoon situation.
For each message, a Support Vector Machine (SVM) is used to classify whether it is about an event or not. Three groups of features for each message are used, including statistical features (i.e., the number of words and the position of the query word within the message), keyword features (i.e., the words in a message), and word context features (i.e., the words before and after the query word). Each Twitter user is regarded as a sensor for detecting a target event. Each message is associated with a time and a location (i.e., a latitude and longitude). An event is identified if there are enough messages classified as being about an event occurring in a short time period. The correspondence between event detection from Twitter and object detection in a ubiquitous environment is presented in Figure 2.4. However, this approach needs a manually defined set of keywords for each event. Also, it requires labeled data to train classifiers for every event type.

Figure 2.4: The correspondence between sensory data detection and Twitter processing
Ozdikis et al., in 2012, in "Semantic expansion of hashtags for enhanced event detection in Twitter", propose an event detection method for Twitter based on the clustering of hashtags (the "#" symbol used to mark keywords or topics in Twitter) that applies a semantic expansion to message vectors. For each hashtag, the three most similar hashtags are extracted using cosine similarity. A tweet vector with a single hashtag is expanded with the three similar hashtags and then used in the clustering process. However, using messages with a single hashtag can suffer from the problem of ignoring all messages that do not contain a hashtag. Also, they do not implement any credibility filter in order to decide whether a tweet is about an event or not.
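A simplified version of this idea is sketched below: each hashtag is represented by the TF-IDF vector of the tweets that mention it, and a hashtag is expanded with its three most cosine-similar hashtags. The exact vector representation used by Ozdikis et al. may well differ, and the tweets here are invented.

# Sketch: semantic expansion of a hashtag via cosine similarity of hashtag
# "profiles" (the concatenated text of the tweets mentioning the hashtag).
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tweets = [
    "heavy rain and flooding downtown #storm",
    "wind warning issued tonight #storm #weather",
    "umbrella destroyed again #weather",
    "flooded streets near the river #flood",
    "river levels rising fast after the rain #flood",
    "great goal in the derby #football",
]

# Build one text profile per hashtag.
profiles = defaultdict(list)
for t in tweets:
    words = t.split()
    text = " ".join(w for w in words if not w.startswith("#"))
    for tag in (w for w in words if w.startswith("#")):
        profiles[tag].append(text)

hashtags = sorted(profiles)
matrix = TfidfVectorizer().fit_transform([" ".join(profiles[h]) for h in hashtags])
similarity = cosine_similarity(matrix)

def expand(tag, k=3):
    i = hashtags.index(tag)
    ranked = similarity[i].argsort()[::-1]          # most similar hashtags first
    return [hashtags[j] for j in ranked if j != i][:k]

print("#storm ->", expand("#storm"))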
Long et al. (2011) adapted a traditional clustering approach by integrating some features specific to the characteristics of microblog data. These features are based on "topical words", which are more popular than others with respect to an event. Topical words are extracted from daily messages on the basis of word frequency, word occurrence in hashtags, and word entropy. A (top-down) hierarchical divisive clustering approach is employed on a co-occurrence graph (connecting messages in which topical words co-occur) to divide topical words into event clusters. To track changes among events at different times, a maximum-weighted bipartite graph matching is employed to create event chains, with a variation of the Jaccard coefficient as the similarity measure between clusters. Finally, cosine similarity augmented with the time interval between messages is used to find the top-k most relevant posts that summarize an event. These event summaries are then linked to event chain clusters and plotted on the time line. For event detection, the authors found that top-down divisive clustering outperforms both k-means and traditional hierarchical clustering algorithms.
Weng and Lee (2011) proposed an event detection approach based on clustering of discrete wavelet signals built from individual words generated by Twitter. In contrast with Fourier transforms, which have been proposed for event detection from traditional media, wavelet transformations are localized in both the time and frequency domains and hence are able to identify the time and the duration of a bursty event within the signal. Wavelets convert the signals from the time domain to the time-scale domain, where the scale can be considered as the inverse of frequency. Signal construction is based on a time-dependent variant of document frequency-inverse document frequency (DF-IDF), where DF counts the number of tweets (documents) containing a specific word, while IDF accommodates word frequency up to the current time step. A sliding window is then applied to capture the change over time using the H-measure (normalized wavelet entropy). Trivial words are filtered out on the basis of (a threshold set on) signal cross-correlation, which measures similarity between two signals as a function of a time lag. The remaining words are then clustered to form events with a modularity-based graph partitioning technique, which splits the graph into subgraphs, each corresponding to an event. Finally, significant events are detected on the basis of the number of words and the cross-correlation among the words related to an event.
Similarly, Cordeiro (2012) proposed a continuous wavelet transformation based on hashtag occurrences combined with topic model inference using latent Dirichlet allocation (LDA) (Blei et al. 2003). Instead of individual words, hashtags are used for building wavelet signals. An abrupt increase in the number of occurrences of a given hashtag is considered a good indicator of an event that is happening at a given time. Therefore, all hashtags are retrieved from tweets and then grouped in intervals of 5 minutes. Hashtag signals are constructed over time by counting the hashtag mentions in each interval, grouping them into separate time series (one for each hashtag), and concatenating all tweets that mention the hashtag during each time series. Adaptive filters are then used to remove noisy hashtag signals before applying the continuous wavelet transformation and obtaining a time-frequency representation of the signal. Next, wavelet peak and local maxima detection techniques are used to detect peaks and changes in the hashtag signal. Finally, when an event is detected within a given time interval, LDA is applied to all tweets related to the hashtag in each corresponding time series to extract a set of latent topics, which provide an improved summary of the event description.

Li et al. (2012) present Twevent in "Twevent: segment-based event detection from tweets". It is a state-of-the-art system for detecting events from the tweet stream. The authors use the notion of tweet segments instead of unigrams to detect and describe events. Given Twitter messages, Twevent first segments each individual message into a sequence of consecutive phrases by using Microsoft Web N-Gram. Bursty segments are then identified by modeling the frequency of each segment; the user frequency of the tweet segments is used to identify the event-related bursty segments. Then, a clustering algorithm is applied to group event-related segments into candidate events. Wikipedia is utilized to approximately evaluate important and unusual aspects of a candidate event. The system architecture of Twevent is shown in Figure 2.5. As a result, the events detected by Twevent are heavily influenced by Microsoft Web N-Gram and Wikipedia, which could potentially distort the perception of events by Twitter users and also give less importance to recent events that are not yet reported on Wikipedia.

Figure 2.5: Segment-based event detection system architecture

Abel et al. in [24] introduce Twitcident, a framework and web-based system for filtering, searching, and analyzing information about real-world incidents or crises. Given an incident, the system automatically collects and filters relevant information from Twitter. When a new message is posted, it searches for related tweets, which are semantically enriched in order to allow for effective filtering. Users may also make use of a faceted search interface to delve deeper into these tweets. However, this work focuses on how to enrich the semantics of Twitter messages to improve incident profiling and filtering, rather than on detecting sub-events and users' opinions of each event.

Zhu et al. built a predictive model for the retweeting decision of a user [25]. They identified the factors affecting the retweet decision. The features can be classified into three categories: contextual influence, network influence, and time influence, from which a set of features is derived. A Monte Carlo simulation was also performed to find how information propagates in the Twitter network. Even though the information on social media is important for spreading awareness, the credibility of the information might be a problem.
