INTRODUCTION

Data analysis for Twitter using clustering and classification involves various techniques and algorithms from data mining and web mining. Data mining is also called knowledge discovery in databases (KDD). It is commonly defined as the process of discovering useful patterns or knowledge from data sources, e.g., databases, texts, images, the Web, etc. The patterns must be valid, potentially useful, and understandable. Data mining is a multi-disciplinary field involving machine learning, statistics, databases, artificial intelligence, information retrieval, and visualization. There are many data mining tasks. Some of the common ones are supervised learning (or classification), unsupervised learning (or clustering), association rule mining, and sequential pattern mining.

A data mining application usually starts with an understanding of the application domain by data analysts (data miners), who then identify suitable data sources and the target data. With the data, data mining can be performed, which is usually carried out in three main steps:

– Pre-processing: The raw data is usually not suitable for mining for various reasons. It may need to be cleaned to remove noise or abnormalities. The data may also be too large and/or involve many irrelevant attributes, which calls for data reduction through sampling and attribute or feature selection. Details about data pre-processing can be found in any standard data mining textbook.

– Data mining: The processed data is then fed to a data mining algorithm which will produce patterns or knowledge.

– Post-processing: In many applications, not all discovered patterns are useful. This step identifies the useful ones for applications. Various evaluation and visualization techniques are used to make the decision. The whole process (also called the data mining process) is almost always iterative. It usually takes many rounds to achieve a final satisfactory result, which is then incorporated into real-world operational tasks. Traditional data mining uses structured data stored in relational tables, spreadsheets, or flat files in tabular form.

2.2 Support Vector Machines

The foundations of SVMs were developed by Vladimir Vapnik, and SVMs are gaining popularity due to many attractive features and promising empirical performance. The formulation embodies the Structural Risk Minimization (SRM) principle. SVMs were developed to solve the classification problem, but recently they have been extended to the domain of regression problems (for prediction of continuous variables). SVMs can be applied to regression problems by the introduction of an alternative loss function that is modified to include a distance measure. The term SVM refers to both classification and regression methods; the terms Support Vector Classification (SVC) and Support Vector Regression (SVR) may be used for more precise specification.

An SVM is a supervised learning algorithm that creates learning functions from a set of labeled training data. It has a sound theoretical foundation and requires a relatively small number of samples for training; experiments have shown that it is insensitive to the dimensionality of the samples. Initially, the algorithm addresses the general problem of learning to discriminate between members of two classes represented as n-dimensional vectors. The function can be a classification function (the output is binary) or a general regression function.

Support vector machines (SVM) are another type of learning system with many desirable qualities that make it one of the most popular algorithms. It not only has a solid theoretical foundation, but also performs classification more accurately than most other algorithms in many applications, especially applications involving very high dimensional data. For instance, several researchers have shown that SVM is perhaps the most accurate algorithm for text classification. It is also widely used in Web page classification and bioinformatics applications.

In general, SVM is a linear learning system that builds two-class classifiers. Let the set of training examples D be

D = {(x1, y1), (x2, y2), …, (xn, yn)},

where xi = (xi1, xi2, …, xir) is an r-dimensional input vector in a real-valued space X ⊆ ℝ^r, yi is its class label (output value), and yi ∈ {1, −1}. 1 denotes the positive class and −1 denotes the negative class. Note that we use slightly different notations in this section. We use y instead of c to represent a class because y is commonly used to represent a class in the SVM literature. Similarly, each data instance is called an input vector and denoted by a bold face letter. In the following, we use bold face letters for all vectors.

To build a classifier, SVM finds a linear function of the form

f(x) = ⟨w · x⟩ + b    (1)

so that an input vector xi is assigned to the positive class if f(xi) ≥ 0, and to the negative class otherwise, i.e.,

yi = 1 if ⟨w · xi⟩ + b ≥ 0, and yi = −1 if ⟨w · xi⟩ + b < 0.    (2)

Hence, f(x) is a real-valued function f: X ⊆ ℝ^r → ℝ. w = (w1, w2, …, wr) ∈ ℝ^r is called the weight vector. b ∈ ℝ is called the bias. ⟨w · x⟩ is the dot product of w and x (or Euclidean inner product). Without using vector notation, Equation (1) can be written as:

f(x1, x2, …, xr) = w1x1 + w2x2 + … + wrxr + b,

where xi is the variable representing the ith coordinate of the vector x. For convenience, we will use the vector notation from now on. In essence, SVM finds a hyperplane

⟨w · x⟩ + b = 0    (3)

that separates positive and negative training examples. This hyperplane is called the decision boundary or decision surface. Geometrically, the hyperplane ⟨w · x⟩ + b = 0 divides the input space into two half spaces: one half for positive examples and the other half for negative examples. Recall that a hyperplane is commonly called a line in a 2-dimensional space and a plane in a 3-dimensional space.

Fig. 2.1(A) shows an example in a 2-dimensional space. Positive instances (also called positive data points or simply positive points) are represented with small filled rectangles, and negative examples are represented with small empty circles. The thick line in the middle is the decision boundary hyperplane (a line in this case), which separates positive (above the line) and negative (below the line) data points. Equation (1), which is also called the decision rule of the SVM classifier, is used to make classification decisions on test instances.

Figure 2.1: (A) A linearly separable data set and (B) possible decision boundaries

Figure 2.1(A) raises two interesting questions:

1. There are an infinite number of lines that can separate the positive and negative data points as illustrated by Fig. 2.1(B). Which line should we choose?

2. A hyperplane classifier is only applicable if the positive and negative data can be linearly separated. How can we deal with nonlinear separations or data sets that require nonlinear decision boundaries?

The SVM framework provides good answers to both questions. Briefly, for question 1, SVM chooses the hyperplane that maximizes the margin (the gap) between positive and negative data points, which will be defined formally shortly. For question 2, SVM uses kernel functions. Before we dive into the details, we want to stress that SVM requires numeric data and only builds two-class classifiers. At the end of the section, we will discuss how these limitations may be addressed.
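As a concrete illustration, the linear decision rule in Equations (1) and (2) can be sketched in a few lines of Python. The weight vector w and bias b below are hypothetical values chosen for illustration, not learned from data; a real SVM would obtain them by solving the margin-maximization problem.

```python
# Linear SVM decision rule f(x) = <w . x> + b (Equations (1) and (2)).
# The weight vector w and bias b are hypothetical illustration values;
# in practice they are learned from the training data.

def dot(w, x):
    """Euclidean inner product <w . x>."""
    return sum(wi * xi for wi, xi in zip(w, x))

def classify(w, b, x):
    """Return +1 if f(x) = <w . x> + b >= 0, else -1 (Equation (2))."""
    return 1 if dot(w, x) + b >= 0 else -1

w = [2.0, -1.0]   # hypothetical weight vector
b = -1.0          # hypothetical bias

print(classify(w, b, [3.0, 1.0]))   # f = 2*3 - 1*1 - 1 = 4 >= 0 -> +1
print(classify(w, b, [0.0, 2.0]))   # f = 0 - 2 - 1 = -3 < 0  -> -1
```

Note that the classifier itself is just this inexpensive dot product; all the work of an SVM lies in choosing w and b so that the margin between the two classes is maximized.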

2.3 Cluster Analysis

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: First partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups.

Cluster analysis is an important human activity. Early in childhood, we learn how to distinguish between cats and dogs, or between animals and plants, by continuously improving subconscious clustering schemes. By automated clustering, we can identify dense and sparse regions in object space and, therefore, discover overall distribution patterns and interesting correlations among data attributes.

Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing. In business, clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations. Clustering may also help in the identification of areas of similar land use in an earth observation database and in the identification of groups of houses in a city according to house type, value, and geographic location, as well as the identification of groups of automobile insurance policy holders with a high average claim cost. It can also be used to help classify documents on the Web for information discovery.

Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection, where outliers (values that are "far away" from any cluster) may be more interesting than common cases. Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce. For example, exceptional cases in credit card transactions, such as very expensive and frequent purchases, may be of interest as possible fraudulent activity. As a data mining function, cluster analysis can be used as a stand-alone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. Alternatively, it may serve as a preprocessing step for other algorithms, such as characterization, attribute subset selection, and classification, which would then operate on the detected clusters and the selected attributes or features.

Data clustering is under vigorous development. Contributing areas of research include data mining, statistics, machine learning, spatial database technology, biology, and marketing. Owing to the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research.

As a branch of statistics, cluster analysis has been extensively studied for many years, focusing mainly on distance-based cluster analysis. Cluster analysis tools based on k-means, k-medoids and several other methods have also been built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS. In machine learning, clustering is an example of unsupervised learning. Unlike classification, clustering and unsupervised learning do not rely on predefined classes and class-labeled training examples. For this reason, clustering is a form of learning by observation, rather than learning by examples. In data mining, efforts have focused on finding methods for efficient and effective cluster analysis in large databases. Active themes of research focus on the scalability of clustering methods, the effectiveness of methods for clustering complex shapes and types of data, high-dimensional clustering techniques, and methods for clustering mixed numerical and categorical data in large databases.
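As a minimal sketch of the distance-based clustering mentioned above, the following Python snippet implements the classic k-means alternation (assign each point to its nearest center, then move each center to the mean of its points) on one-dimensional toy data. The data and initial centers are hypothetical; production implementations such as those in the statistical packages above add initialization strategies and convergence checks.

```python
# A minimal k-means sketch for 1-D data (a distance-based clustering method).
# The data and initial centers are hypothetical illustration values.

def kmeans(points, centers, iters=10):
    """Alternate assignment and center-update steps for a fixed number of rounds."""
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Update step: move each center to the mean of its cluster
        # (keep the old center if a cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]   # two obvious groups
print(kmeans(data, [0.0, 5.0]))         # centers converge near 1.0 and 9.0
```

The same two-step alternation generalizes directly to higher dimensions by replacing the absolute difference with a Euclidean distance.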

2.3.1 Fuzzy c-means

Fuzzy logic creates intermediate classifications, rather than binary, or "crisp" ones. So far, the clustering techniques we have discussed are referred to as hard or crisp clustering, which means that each data object is assigned to only one cluster. For fuzzy clustering, this restriction is relaxed, and the object can belong to all of the clusters with a certain degree of membership. This is particularly useful when the boundaries between clusters are ambiguous and not well separated.

Moreover, the memberships may help the users discover more sophisticated relationships between a given object and the disclosed clusters. Fuzzy c-means (FCM) can be regarded as a generalization of ISODATA and was realized by Bezdek (1981). FCM attempts to find a partition, represented as c fuzzy clusters, for a set of data objects x_j ∈ ℝ^d, j = 1, …, N, while minimizing the cost function

J(U, M) = Σ_{i=1..c} Σ_{j=1..N} (u_ij)^m D_ij,    (Eq. 1)

where

– U = [u_ij]_{c×N} is the fuzzy partition matrix and u_ij ∈ [0, 1] is the membership coefficient of the jth object in the ith cluster, satisfying the following two constraints:

Σ_{i=1..c} u_ij = 1, for all j,

which assures the same overall weight for every data point (when fuzzy clustering meets this constraint, it is also called probabilistic clustering), and

0 < Σ_{j=1..N} u_ij < N, for all i,

which assures no empty clusters;

– M = [m_1, …, m_c] is the cluster prototype (mean or center) matrix;

– m ∈ [1, ∞) is the fuzzification parameter; a larger m favors fuzzier clusters. m is usually set to 2;

– D_ij = D(x_j, m_i) is the distance between x_j and m_i; with FCM, D_ij = ||x_j − m_i||², i.e., the Euclidean or L2 norm distance function is used.

The criterion function in Eq. 1 can be optimized with an iterative procedure that leads to the standard FCM algorithm. The membership and prototype matrix update equations are obtained through alternating optimization.
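A sketch of that alternating optimization for m = 2 is given below, using the standard FCM membership and prototype updates implied by the cost function: u_ij = 1 / Σ_k (d_ij / d_kj)^(2/(m−1)) with d the Euclidean distance, and m_i = Σ_j (u_ij)^m x_j / Σ_j (u_ij)^m. The one-dimensional data and initial centers are hypothetical, and a small epsilon guards against division by zero when a point coincides with a center.

```python
# Alternating-optimization sketch of fuzzy c-means (FCM) with m = 2.
# 1-D data for brevity; the data and initial centers are hypothetical.

def fcm(points, centers, m=2.0, iters=20, eps=1e-9):
    U = []
    for _ in range(iters):
        # Membership update: u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1)),
        # where d_ij is the Euclidean distance between x_j and center i.
        U = []
        for ci in centers:
            row = []
            for x in points:
                dij = abs(x - ci) + eps
                s = sum((dij / (abs(x - ck) + eps)) ** (2.0 / (m - 1.0))
                        for ck in centers)
                row.append(1.0 / s)
            U.append(row)
        # Prototype update: m_i = sum_j u_ij^m x_j / sum_j u_ij^m.
        centers = [sum(u ** m * x for u, x in zip(row, points)) /
                   sum(u ** m for u in row) for row in U]
    return centers, U

data = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
centers, U = fcm(data, [2.0, 8.0])
print(centers)   # the two prototypes settle near the two groups
```

By construction, the memberships of each object across the c clusters sum to 1, which is exactly the probabilistic-clustering constraint stated above.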

2.4 DECISION TREES

A particularly efficient method of producing classifiers from data is to generate a decision tree. The decision-tree representation is the most widely used logic method. There are a large number of decision-tree induction algorithms, described primarily in the machine learning and applied statistics literature. They are supervised learning methods that construct decision trees from a set of input-output samples. Decision tree learning is an efficient nonparametric method for classification and regression. A decision tree is a hierarchical model for supervised learning in which a local region is identified through a sequence of recursive splits at decision nodes with test functions. A decision tree is also a nonparametric model in the sense that we do not assume any parametric form for the class densities.

A typical decision tree learning system adopts a top-down strategy that searches for a solution in a part of the search space. It guarantees that a simple, but not necessarily the simplest, tree will be found. A decision tree consists of nodes where attributes are tested. In a univariate tree, the test at each internal node uses only one of the attributes. The outgoing branches of a node correspond to all the possible outcomes of the test at the node. A simple decision tree for classification of samples with two input attributes X and Y is given in Figure 2.2. All samples with feature values X > 1 and Y = B belong to Class2, while the samples with values X < 1 belong to Class1, whatever the value of feature Y. The samples at a non-leaf node in the tree structure are thus partitioned along the branches, and each child node gets its corresponding subset of samples. Decision trees that use univariate splits have a simple representational form, making it relatively easy for the user to understand the inferred model; at the same time, they represent a restriction on the expressiveness of the model. In general, any restriction on a particular tree representation can significantly restrict the functional form and thus the approximation power of the model. A well-known tree-growing algorithm for generating decision trees based on univariate splits is Quinlan's ID3, with an extended version called C4.5. Greedy search methods, which involve growing and pruning decision-tree structures, are typically employed in these algorithms to explore the exponential space of possible models.

Figure 2.2: A simple decision tree with the tests on attributes X and Y.

The ID3 algorithm starts with all the training samples at the root node of the tree. An attribute is selected to partition these samples. For each value of the attribute a branch is created, and the corresponding subset of samples that have the attribute value specified by the branch is moved to the newly created child node. The algorithm is applied recursively to each child node until all samples at a node are of one class. Every path to a leaf in the decision tree represents a classification rule. Note that the critical decision in such a top-down decision tree generation algorithm is the choice of an attribute at a node. Attribute selection in the ID3 and C4.5 algorithms is based on minimizing an information entropy measure applied to the examples at a node. The approach based on information entropy aims to minimize the number of tests needed to classify a sample in a database. The attribute-selection part of ID3 is based on the assumption that the complexity of the decision tree is strongly related to the amount of information conveyed by the value of the given attribute. The information-based heuristic selects the attribute providing the highest information gain, that is, the attribute that minimizes the information needed in the resulting subtree to classify the sample. An extension of ID3 is the C4.5 algorithm, which extends the domain of classification from categorical attributes to numeric ones. The measure favors attributes that partition the data into subsets with low class entropy, that is, where the majority of examples in each subset belong to a single class. The algorithm basically chooses the attribute that provides the maximum degree of discrimination between classes locally.
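The entropy-based attribute selection just described can be made concrete with a short computation of information gain, Gain(A) = H(labels) − Σ_v (|S_v|/|S|) H(S_v). The four-row dataset below is a hypothetical toy example, not taken from the text.

```python
# Information gain as used for attribute selection in ID3/C4.5:
# the attribute with the highest gain is chosen for splitting.
# The toy dataset is hypothetical.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H = -sum p * log2(p) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Gain = H(labels) minus the weighted entropy of the child subsets."""
    total = entropy(labels)
    remainder = 0.0
    for v in set(r[attr_index] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr_index] == v]
        remainder += len(subset) / len(labels) * entropy(subset)
    return total - remainder

# Hypothetical samples with two categorical attributes.
rows   = [("gt1", "A"), ("gt1", "B"), ("lt1", "A"), ("lt1", "B")]
labels = ["Class2", "Class2", "Class1", "Class1"]

print(info_gain(rows, labels, 0))   # attribute 0 separates perfectly -> 1.0
print(info_gain(rows, labels, 1))   # attribute 1 is uninformative    -> 0.0
```

Here attribute 0 yields pure child subsets (gain 1 bit), while attribute 1 leaves the class distribution unchanged (gain 0), so ID3 would split on attribute 0.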

2.4.1 Decision Tree Algorithm

The pseudo-code for decision tree model construction is shown in the algorithm below. It takes as input a training dataset D and two parameters η and π, where η is the leaf size and π the leaf purity threshold. Different split points are evaluated for each attribute in D. Numeric decisions are of the form Xj ≤ v for some value v in the value range of attribute Xj, and categorical decisions are of the form Xj ∈ V for some subset of values V in the domain of Xj. The best split point is chosen to partition the data into two subsets, DY and DN, where DY corresponds to all points x ∈ D that satisfy the split decision, and DN corresponds to all points that do not satisfy it. The decision tree method is then called recursively on DY and DN. A number of stopping conditions can be used to stop the recursive partitioning process. The simplest condition is based on the size of the partition D. If the number of points n in D drops below the user-specified size threshold η, then we stop the partitioning process and make D a leaf. This condition prevents over-fitting the model to the training set by avoiding modeling very small subsets of the data. Size alone is not sufficient, because if the partition is already pure then it does not make sense to split it further. Thus, the recursive partitioning is also terminated if the purity of D is above the purity threshold π.
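A minimal sketch of this recursive partitioning is given below, with the leaf size threshold eta (η) and purity threshold pi (π) as stopping conditions. The split-scoring rule and the single numeric attribute are simplifications chosen for illustration; a full implementation would score splits by information gain and handle categorical decisions Xj ∈ V as well.

```python
# Recursive partitioning sketch with leaf size threshold eta and
# leaf purity threshold pi. One numeric attribute, Xj <= v splits;
# the scoring rule is a simplification chosen for illustration.

from collections import Counter

def purity(labels):
    """Fraction of points carrying the majority label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def build_tree(points, labels, eta=2, pi=0.95):
    # Stopping conditions: partition too small, or already pure enough.
    if len(points) <= eta or purity(labels) >= pi:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Evaluate candidate split points X <= v; keep the one whose two
    # subsets D_Y and D_N contain the most majority-class points.
    best = None
    for v in sorted(set(points))[:-1]:
        yes = [l for p, l in zip(points, labels) if p <= v]
        no  = [l for p, l in zip(points, labels) if p > v]
        score = purity(yes) * len(yes) + purity(no) * len(no)
        if best is None or score > best[0]:
            best = (score, v)
    v = best[1]
    return {"split": v,
            "yes": build_tree([p for p in points if p <= v],
                              [l for p, l in zip(points, labels) if p <= v],
                              eta, pi),
            "no":  build_tree([p for p in points if p > v],
                              [l for p, l in zip(points, labels) if p > v],
                              eta, pi)}

tree = build_tree([1.0, 2.0, 8.0, 9.0], ["a", "a", "b", "b"])
print(tree)   # one split at 2.0 with a pure leaf on each side
```

Both stopping conditions appear in the first line of `build_tree`: the size test guards against over-fitting tiny partitions, and the purity test avoids splitting a node that is already (nearly) pure.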

2.4.2 Algorithm: Decision Tree Construction

2.5 J48 Classification

The J48 algorithm of the Weka software is a popular machine learning algorithm based upon J.R. Quinlan's C4.5 algorithm. All data to be examined will be of the categorical type, and therefore continuous data will not be examined at this stage. The algorithm will, however, leave room for adaptation to include this capability. The algorithm will be tested against C4.5 for verification purposes.

J48 implements Quinlan's C4.5 algorithm for generating a pruned or unpruned C4.5 decision tree. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by J48 can be used for classification. Using information entropy, J48 builds decision trees from labeled training data. It exploits the fact that each attribute of the data can be used to make a decision by splitting the data into smaller subsets. J48 examines the normalized information gain (difference in entropy) that results from choosing an attribute to split the data. The attribute with the highest normalized information gain is used to make the decision, and the algorithm then recurses on the smaller subsets. The splitting procedure stops when all instances in a subset belong to the same class; a leaf node is then created that tells the tree to choose that class. It sometimes happens that none of the features gives any information gain; in that case, J48 creates a decision node higher up in the tree using the expected class value. J48 can handle both discrete and continuous attributes, missing attribute values, and attributes of training data with differing costs. After creating a tree, J48 provides a further option for pruning it.

In Weka, the implementation of a particular learning algorithm is encapsulated in a class, and it may depend on other classes for some of its functionality. The J48 class builds a C4.5 decision tree. Each time the Java virtual machine executes J48, it creates an instance of this class by allocating memory for building and storing a decision tree classifier. The algorithm, the classifier it builds, and a procedure for outputting the classifier are all part of that instantiation of the J48 class. Larger programs are usually split into more than one class. The J48 class does not actually contain any code for building a decision tree; it includes references to instances of other classes that do most of the work. When there are many classes, as in the Weka software, they can become difficult to comprehend and navigate.

2.5.1 J48 decision tree classifier

J48 is an extension of ID3. Additional features of J48 include accounting for missing values, decision tree pruning, continuous attribute value ranges, derivation of rules, etc. In the WEKA data mining tool, J48 is an open-source Java implementation of the C4.5 algorithm. The WEKA tool provides a number of options associated with tree pruning. In the case of potential over-fitting, pruning can be used as a tool for improving precision. In other algorithms the classification is performed recursively until every single leaf is pure, that is, until the classification of the data is as perfect as possible. J48 generates rules from which the particular identity of the data can be determined. The objective is to progressively generalize the decision tree until it reaches an equilibrium of flexibility and accuracy.

Basic Steps in the Algorithm:

(i) If the instances belong to the same class, the tree represents a leaf, and the leaf is returned labeled with that class.

(ii) The potential information given by a test on each attribute is calculated. Then the gain in information that would result from a test on the attribute is calculated.

(iii) The best attribute is then found on the basis of the present selection criterion, and that attribute is selected for branching.

The J48 decision tree classifier follows a simple algorithm. In order to classify a new item, it first needs to create a decision tree based on the attribute values of the available training data. So, whenever it encounters a set of items (a training set), it identifies the attribute that discriminates the various instances most clearly. This feature, which tells us the most about the data instances so that we can classify them best, is said to have the highest information gain. Now, among the possible values of this feature, if there is any value for which there is no ambiguity, that is, for which the data instances falling within its category all have the same value for the target variable, then we terminate that branch and assign to it the target value that we have obtained.
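The classification step described above, following branches until an unambiguous (terminated) branch is reached, amounts to a simple tree walk. The tree structure and attribute names below are hypothetical illustration values, not output of the actual J48 implementation.

```python
# Walking a built decision tree to classify a new item: follow the branch
# matching each attribute value until a leaf label is reached.
# The tree and attribute names are hypothetical.

tree = {
    "attr": "outlook",
    "branches": {
        "sunny":    {"attr": "humidity",
                     "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",                  # unambiguous branch -> leaf
        "rain":     {"attr": "windy",
                     "branches": {"true": "no", "false": "yes"}},
    },
}

def classify(node, item):
    """Descend the tree until a leaf label (a plain string) is reached."""
    while not isinstance(node, str):
        node = node["branches"][item[node["attr"]]]
    return node

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # yes
print(classify(tree, {"outlook": "overcast"}))                     # yes
```

The "overcast" branch illustrates the termination rule: all training instances under it share one target value, so the branch was closed with that value as a leaf.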
