Abstract
Outlier detection is a crucial task in data mining that aims to identify outliers in a given data set. An outlier is a data object that appears inconsistent with the remaining data. Outliers arise from improper measurements, data-entry errors, or data originating from sources different from the rest of the data. Outlier detection is the technique that discovers such objects in a given data set. Several outlier detection techniques have been introduced, but they require input parameters from the user, such as a distance threshold or a density threshold, so the user needs prior knowledge of the data.
The proposed work focuses on partitioning a given data set into a number of clusters and then detecting the outliers in each cluster using a pruning technique. This work aims at noise removal, which in turn improves computational time and cluster quality.
Index Terms—Cluster-Based, Noise Removal, Outlier Detection.
I. INTRODUCTION
An outlier is a data object that appears inconsistent with the remaining data, and outlier detection is the technique that discovers such objects in a given data set. Outliers are generated by data-entry errors, improper measurements, or data arriving from sources different from the rest of the data [1].
Outlier detection is a crucial task in data mining that aims to identify outliers in a given data set. In many data-analysis tasks a large number of variables are recorded or sampled, and outlier detection is the first step towards obtaining a coherent analysis in many data-mining applications. Outlier detection is used in many fields such as data cleansing, environment monitoring, detection of criminal activity in e-commerce, clinical trials, and network intrusion detection.
Before detecting outliers, two considerations must be made. First, an outlier must be defined, i.e., which data are considered outliers in the given data set; second, a method must be designed to compute the defined outliers efficiently.
The statistical community [3, 4] was the first to study the outlier problem. It assumes that the given data set is generated by some fixed distribution; if an object deviates from this distribution, it is declared an outlier. However, it is usually impossible to determine the distribution followed by a high-dimensional data set. To overcome this drawback, the data-management community introduced model-free approaches such as distance-based outliers [5-7] and density-based outliers [8]. These algorithms make no assumptions about the data set, but they have drawbacks of their own: the distance-based technique requires a user-supplied distance threshold d, and the density-based technique incurs a high computational cost. Hence cluster-based outlier detection comes into the picture; its advantage is that it works with data sets consisting of many clusters with different densities.
The cluster-based outlier detection method works in two phases. In the first phase the data set is clustered using the Unsupervised Extreme Learning Machine (US-ELM) [2]. US-ELM deals with unlabeled data and performs clustering efficiently; it can be used for multi-cluster clustering of unlabeled data. In the second phase the defined outliers are detected from each cluster.
The proposed system extends ELM to the Unsupervised Extreme Learning Machine, which deals only with unlabeled data and handles the clustering task efficiently. The system works in two phases: in the first phase, k clusters are generated from the input data set using US-ELM, and in the second phase, the outliers in each cluster are detected using a pruning technique. The system's final output is the set of outliers.
II. REVIEW OF LITERATURE
Huang et al. [1] introduced the Extreme Learning Machine (ELM) for training Single-hidden-Layer Feedforward Networks (SLFNs). The biases and input parameters of the SLFN are randomly generated, and ELM computes only the output weights between the hidden layer and the output layer. ELM solves a regularized least-squares problem, which is faster than solving the quadratic programming problem in the Support Vector Machine (SVM). However, ELM only works with labeled data.
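To make the training scheme concrete, here is a minimal ELM sketch in Python; the function name, the sigmoid activation, and the ridge regularizer `reg` are our illustrative choices, not prescribed by [1]:

```python
import numpy as np

def elm_train(X, T, n_hidden=100, reg=1e-3):
    """Basic ELM: random hidden layer, ridge solve for the output weights."""
    d = X.shape[1]
    W = np.random.uniform(-1, 1, (d, n_hidden))   # random input weights (never trained)
    b = np.random.uniform(-1, 1, n_hidden)        # random biases (never trained)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))        # sigmoid hidden-layer outputs
    # regularized least squares: beta = (H^T H + reg*I)^-1 H^T T
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ T)
    return W, b, beta
```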
D. Liu [2] extended ELM to the Semi-Supervised Extreme Learning Machine (SS-ELM), in which the manifold regularization framework was imported into the ELM model to deal with both labeled and unlabeled data. ELM and SS-ELM work effectively when the number of training patterns is larger than the number of hidden neurons; SS-ELM cannot achieve this when the data are insufficient compared to the number of hidden neurons.
J. Zhang [3] proposed a co-training technique to train ELMs in SS-ELM. The labeled training set grows progressively by transferring a small set of the most confidently judged unlabeled data to the labeled set at each iteration, and the ELMs are retrained on the pseudo-labeled set. Since the algorithm has to retrain the ELMs repeatedly, it increases the computational cost.
The statistical community [4, 5] was the first to study the outlier problem and proposed model-based outliers. It is assumed that the data set follows some known distribution, or at least that statistical estimates of the unknown distribution parameters are available. An outlier is then a data object that deviates from the assumed distribution. These model-based approaches degrade in performance on high-dimensional and arbitrary data sets, since there is no way to have prior knowledge of the distribution followed by such data.
K. Li [6] proposed model-free outlier methods to overcome the drawbacks of model-based outliers. Distance-based outliers and density-based outliers are two such model-free methods. However, both approaches require input parameters to declare an object an outlier, e.g., a distance threshold, the number of nearest neighbors, or a density threshold.
Knorr and Ng [7-9] proposed the Nested-Loop (NL) algorithm to compute distance-based outliers. The algorithm partitions the buffer into two halves, a first array and a second array, copies the data set into both arrays, and computes the distance between each pair of objects. A neighbor count is maintained for the objects in the first array, and counting stops for an object as soon as its neighbor count reaches the threshold. The drawback of this algorithm is its high computation time: the nested-loop algorithm typically requires O(N^2) distance computations, where N is the number of objects in the data set.
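As an illustration of the nested-loop idea, the following Python sketch flags a point as an outlier when it has fewer than `min_neighbors` points within distance `D`, with the early termination described above; the parameter names are ours, and this simplified version uses a single in-memory array rather than the two-array buffer scheme:

```python
import numpy as np

def nested_loop_outliers(data, D, min_neighbors):
    """Naive nested-loop distance-based outlier detection, O(N^2) worst case."""
    n = len(data)
    outliers = []
    for i in range(n):
        count = 0
        for j in range(n):
            if i == j:
                continue
            if np.linalg.norm(data[i] - data[j]) <= D:
                count += 1
                if count >= min_neighbors:
                    break  # early termination: point i cannot be an outlier
        if count < min_neighbors:
            outliers.append(i)
    return outliers
```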
Bay et al. [11] proposed an improved version of the nested-loop algorithm. Their technique efficiently reduces the search space by randomizing the data set before outlier detection. The algorithm works well when the data set is in random order, but performs poorly on sorted data, and also when objects are dependent on each other, since the algorithm may then have to traverse the complete data set to find the dependent objects.
Angiulli et al. [12] proposed a method called Detecting Outliers Pushing objects into an Index (DOLPHIN), which works with disk-resident data sets. It is simple to implement and can work with any data type. Its I/O cost is that of sequentially reading the input data set file twice. Its performance is linear in the data set size, since it performs similarity search without pre-indexing the whole data set. Other researchers have further improved the efficiency of such computations by adopting spatial indexes, e.g., R-trees and M-trees, but these methods are sensitive to the number of dimensions.
III. SYSTEM ARCHITECTURE
A. System Overview
Fig. 1 gives a detailed view of the working of the system.
The system works in two phases: in the first phase, k clusters are formed from the input data set using US-ELM, whereas in the second phase the outliers in each cluster are detected. Finally, the system gives the set of outliers as output.
The US-ELM algorithm is used for clustering in order to form good-quality clusters. The clusters are given as input to the outlier detection block, where a pruning technique is used to find the outliers in each cluster.
Fig. 1. Block Diagram of the system

B. US-ELM Algorithm [13]

Input: Training data $X = \{x_i\}_{i=1}^{N}$, $X \in \mathbb{R}^{N \times d}$.
Output: The label vector of cluster indices $y \in \mathbb{N}^{N \times 1}$.
Steps:
Step 1: Construct the graph Laplacian $L$ from $X$.
Step 2: Construct an ELM network of $n_h$ mapping neurons and calculate the output matrix $H \in \mathbb{R}^{N \times n_h}$, generated by applying the sigmoid function to each pattern for each hidden neuron.
Step 3:
1. If $n_h \le N$: find the generalized eigenvectors $v_2, v_3, \ldots, v_{n_0+1}$ of $A = I_{n_h} + \lambda H^T L H$ corresponding to the second through the $(n_0+1)$-th smallest eigenvalues. Let $\beta = \{\tilde{v}_2, \tilde{v}_3, \ldots, \tilde{v}_{n_0+1}\}$ be the matrix whose columns are the normalized eigenvectors $\tilde{v}_i = v_i / \|H v_i\|$, $i = 2, 3, \ldots, n_0+1$.
2. Else ($n_h > N$): find the generalized eigenvectors $u_2, u_3, \ldots, u_{n_0+1}$ of $A = I_N + \lambda L H H^T$ corresponding to the second through the $(n_0+1)$-th smallest eigenvalues. Let $\beta = H^T \{\tilde{u}_2, \tilde{u}_3, \ldots, \tilde{u}_{n_0+1}\}$, where $\tilde{u}_i = u_i / \|H H^T u_i\|$, $i = 2, 3, \ldots, n_0+1$ are the normalized eigenvectors.
Step 4: Calculate the embedding matrix $E = H\beta$.
Step 5: Treat each row of $E$ as a point and cluster the $N$ patterns into $K$ clusters using the k-means algorithm. Let $y$ be the label vector of cluster indices for all patterns in the data set.
Step 6: Return the label vector $y$.
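The following is a compact Python sketch of the algorithm above for the $n_h \le N$ branch, assuming a symmetrized kNN connectivity graph for the Laplacian and a generalized eigenproblem against $H^T H$; the graph construction, the small stabilizing ridge, and all parameter defaults are illustrative assumptions rather than part of [13]:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph

def us_elm_cluster(X, n_hidden=200, n_embed=3, K=3, lam=0.1, knn=10):
    """Minimal US-ELM sketch (n_h <= N branch): embed the data, then run k-means."""
    N, d = X.shape
    # Step 1: graph Laplacian L from a symmetrized kNN adjacency graph (assumed construction)
    W = kneighbors_graph(X, knn, mode='connectivity', include_self=False).toarray()
    W = np.maximum(W, W.T)                      # symmetrize
    L = np.diag(W.sum(axis=1)) - W
    # Step 2: random hidden layer with sigmoid outputs, H is N x n_h
    A_in = np.random.uniform(-1, 1, (d, n_hidden))
    b = np.random.uniform(-1, 1, n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ A_in + b)))
    # Step 3: generalized eigenvectors of (I + lam * H^T L H) v = gamma * H^T H v
    A = np.eye(n_hidden) + lam * H.T @ L @ H
    B = H.T @ H + 1e-8 * np.eye(n_hidden)       # small ridge for numerical stability
    _, vecs = eigh(A, B)                        # eigenvalues in ascending order
    V = vecs[:, 1:n_embed + 1]                  # skip the smallest, take the next n_embed
    V = V / np.linalg.norm(H @ V, axis=0)       # normalize columns: v_i / ||H v_i||
    # Step 4: embedding matrix E = H * beta
    E = H @ V
    # Step 5: k-means on the rows of E yields the label vector y
    return KMeans(n_clusters=K, n_init=10).fit_predict(E)
```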
C. Outlier Detection

In this section, cluster-based outliers are first defined, and then the algorithm to compute the outliers from each cluster of the given data set is presented.
1) Defining Cluster-Based (CB) Outliers: Let $X$ be a data set of $N$ points, each of $d$ dimensions; a point $x$ is denoted as $x = \langle x[1], x[2], x[3], \ldots, x[d] \rangle$. The distance between two points $x_1$ and $x_2$ is the Euclidean distance:

$$dis(x_1, x_2) = \sqrt{\sum_{i=1}^{d} (x_1[i] - x_2[i])^2} \quad (1)$$

The US-ELM algorithm generates $m$ clusters $C_1, C_2, C_3, \ldots, C_m$ for the given data set $X$. The centroid $C_i.center$ of each cluster $C_i$ is calculated, per dimension $j$, as:

$$C_i.center[j] = \frac{\sum_{x \in C_i} x[j]}{|C_i|} \quad (2)$$
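Equations (1) and (2) translate directly into numpy; this small snippet (names ours) is reused in the sketches below:

```python
import numpy as np

def dis(x1, x2):
    """Euclidean distance between two points, equation (1)."""
    return np.sqrt(np.sum((x1 - x2) ** 2))

def centroid(cluster):
    """Cluster centroid, equation (2): the per-dimension mean over the cluster."""
    return cluster.sum(axis=0) / len(cluster)
```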
It is observed that normal points lie close to the centroid, whereas an outlier point $x$ lies far from the centroid and has few points close to it. Based on this observation, the weight of a point is defined as follows.
Weight of a point: Given an integer $k$, for a point $x$ in cluster $C$, let $nn_k(x)$ denote the set of the $k$-nearest neighbors of $x$ in $C$. Then the weight of $x$ is:

$$w(x) = \frac{k \cdot dis(x, C.center)}{\sum_{q \in nn_k(x)} dis(q, C.center)} \quad (3)$$
Final set of outliers detected using the CB technique: For a data set $X$ and two integers $k$ and $n$, let $R_{CB}$ be a subset of $X$ with $n$ points such that for every $x \in R_{CB}$ there is no point $q \in X \setminus R_{CB}$ with $w(q) > w(x)$. Then $R_{CB}$ is the result set of CB outlier detection.
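Under these definitions, a brute-force baseline simply weights every point and keeps the top $n$; this sketch (names ours) reuses the `dis` and `centroid` helpers above and omits the pruning introduced next:

```python
import numpy as np

def cb_outliers(clusters, k, n):
    """Weight every point via equation (3), then return the n largest-weight points."""
    scored = []
    for C in clusters:                            # C: (m_i x d) array of cluster points
        center = centroid(C)
        for x in C:
            d_all = np.array([dis(x, q) for q in C])
            nn_idx = np.argsort(d_all)[1:k + 1]   # k nearest neighbors, excluding x itself
            denom = sum(dis(C[j], center) for j in nn_idx)
            scored.append((k * dis(x, center) / denom, x))   # equation (3)
    scored.sort(key=lambda t: t[0], reverse=True)
    return [x for _, x in scored[:n]]
```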
2) Algorithm for CB Outlier Detection: According to the CB outlier definition above, to determine whether a point $x$ in cluster $C$ is an outlier, we need to search for the $k$-nearest neighbors (kNNs) of $x$ in $C$. To make this search efficient, a method is designed to prune the search space.
Suppose the points in cluster $C$ have been sorted in ascending order of their distances from the cluster centroid. For a point $x$ in $C$, we scan through the points to search for the kNNs of $x$. Let $nn_k^{temp}(x)$ denote the set of the $k$ points nearest to $x$ among the points scanned so far, and let $kdis^{temp}(x)$ denote the maximum distance from a point in $nn_k^{temp}(x)$ to $x$. The pruning technique relies on the following theorems.
Theorem 1: For a point $q$ in front of $x$, if $dis(q, C.center) < dis(x, C.center) - kdis^{temp}(x)$, then $q$ and the points in front of $q$ cannot be among the kNNs of $x$.

Theorem 2: For a point $q$ behind $x$, if $dis(q, C.center) > dis(x, C.center) + kdis^{temp}(x)$, then $q$ and the points behind $q$ cannot be among the kNNs of $x$.
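Theorems 1 and 2 turn the sorted order into stopping rules for the kNN scan. The sketch below (function and variable names are ours) expands outward from $x$ in both directions and stops a direction as soon as the corresponding theorem applies:

```python
import heapq
import numpy as np

def pruned_knn(C, d_center, order, i, k):
    """kNNs of the point at sorted position i, pruned via Theorems 1 and 2.

    `order` sorts the cluster points ascending by distance to the centroid,
    with those distances precomputed in `d_center`.
    """
    x = C[order[i]]
    dx = d_center[order[i]]
    heap = []                                   # max-heap of (-dist, point index)
    lo, hi = i - 1, i + 1
    while lo >= 0 or hi < len(order):
        kdist = -heap[0][0] if len(heap) == k else np.inf   # current kdis_temp(x)
        # Theorem 1: everything at or before position `lo` can be pruned
        if lo >= 0 and d_center[order[lo]] < dx - kdist:
            lo = -1
        # Theorem 2: everything at or after position `hi` can be pruned
        if hi < len(order) and d_center[order[hi]] > dx + kdist:
            hi = len(order)
        for j in (lo, hi):
            if 0 <= j < len(order):
                d = np.linalg.norm(x - C[order[j]])
                if len(heap) < k:
                    heapq.heappush(heap, (-d, order[j]))
                elif d < -heap[0][0]:
                    heapq.heapreplace(heap, (-d, order[j]))
        lo, hi = lo - 1, hi + 1
    return [idx for _, idx in heap]
```

An outer loop would call this for each point and plug the resulting neighbor set into equation (3) to obtain its weight.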
D. Mathematical Model

The system $S$ accepts numeric data and detects outliers using the cluster-based approach. The proposed system $S$ is defined as $S = \{I, F, O\}$, where:

$I = \{I_1, I_2, I_3, I_4\}$, the set of inputs.
$I_1 = \{X_i\}_{i=1}^{N}$, $X \in \mathbb{R}^{N \times d}$, the training data; $X$ consists of $N$ unlabeled training patterns, $X_i \in \mathbb{R}^d$.
$I_2$ = trade-off parameter.
$I_3$ = number of clusters.
$I_4$ = random variable.

$O = \{O_1, O_2, O_3, O_4, O_5, O_6, O_7\}$, the set of outputs.
$O_1$ = graph Laplacian.
$O_2$ = parameters of the hidden mapping functions.
$O_3$ = output matrix of the hidden neurons.
$O_4$ = eigenvectors.
$O_5$ = embedding matrix.
$O_6$ = input data partitioned into $K$ clusters.
$O_7$ = outliers from each cluster.

$F = \{F_1, F_2, F_3, F_4, F_5, F_6, F_7\}$, the set of functions.
$F_1$: constructs the graph Laplacian from the given input; $F_1(I_1) \to O_1$.
$F_2$: randomly generates the parameters of the hidden mapping functions by a continuous uniform distribution on the interval $(-1, 1)$; $F_2(I_4) \to O_2$.
$F_3$: initiates an ELM network of $n_h$ neurons and calculates the output matrix of the hidden neurons; $F_3(O_2) \to O_3$.
$F_4$: calculates the eigenvalues and eigenvectors; $F_4(O_1, O_3, I_2) \to O_4$.
$F_5$: calculates the embedding matrix; $F_5(O_3, O_4) \to O_5$.
$F_6$: forms clusters of the input data using the k-means algorithm; $F_6(O_5, I_3) \to O_6$.
$F_7$: finds the outliers in each cluster using the pruning technique; $F_7(O_6) \to O_7$.
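Operationally, these functional dependencies chain the sketches above end to end; a schematic composition (parameter values illustrative):

```python
def system_S(X, lam, K, k=10, n=20):
    """Schematic composition of F1..F7: cluster with US-ELM, then detect CB outliers."""
    y = us_elm_cluster(X, K=K, lam=lam)           # F1..F6: Laplacian, H, eigenvectors, embedding, k-means
    clusters = [X[y == c] for c in range(K)]      # O6: the input data split into K clusters
    return cb_outliers(clusters, k=k, n=n)        # F7: the outliers from each cluster (O7)
```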
Table I shows the functional dependency among the different functions used.
IV. SYSTEM ANALYSIS

V. CONCLUSION
The conclusion and future work are explained here.
TABLE I
FUNCTIONAL DEPENDENCY

     F1  F2  F3  F4  F5  F6  F7
F1    1   0   0   0   0   0   0
F2    0   1   0   0   0   0   0
F3    0   1   1   0   0   0   0
F4    1   0   1   1   0   0   0
F5    0   0   1   1   1   0   0
F6    0   0   0   0   1   1   0
F7    0   0   0   0   0   1   1
VI. ACKNOWLEDGMENT
References
[1] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: A new learning scheme of feedforward neural networks," in Proc. Int. Joint Conf. Neural Netw., vol. 2, 2004, pp. 985-990.
[2] L. Li, D. Liu, and J. Ouyang, "A new regularization classification method based on extreme learning machine in network data," J. Inf. Comput. Sci., vol. 9, no. 12, pp. 3351-3363, 2012.
[3] K. Li, J. Zhang, H. Xu, S. Luo, and H. Li, "A semi-supervised extreme learning machine method based on co-training," J. Comput. Inf. Syst., vol. 9, no. 1, pp. 207-214, 2013.
[4] V. Barnett and T. Lewis, Outliers in Statistical Data. New York: Wiley, 1994.
[5] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection. New York: Wiley, 2005.
[6] Z. He, X. Xu, and S. Deng, "Discovering cluster-based local outliers," Pattern Recognit. Lett., vol. 24, no. 9, pp. 1641-1650, 2003.
[7] E. M. Knorr and R. T. Ng, "Algorithms for mining distance-based outliers in large datasets," in Proc. Int. Conf. Very Large Data Bases, 1998, pp. 392-403.
[8] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient algorithms for mining outliers from large data sets," ACM SIGMOD Rec., vol. 29, no. 2, pp. 427-438, 2000.
[9] F. Angiulli and C. Pizzuti, "Outlier mining in large high-dimensional data sets," IEEE Trans. Knowl. Data Eng., vol. 17, no. 2, pp. 203-215, 2005.
[10] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying density-based local outliers," ACM SIGMOD Rec., vol. 29, no. 2, pp. 93-104, 2000.
[11] S. D. Bay and M. Schwabacher, "Mining distance-based outliers in near linear time with randomization and a simple pruning rule," in Proc. 9th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2003, pp. 29-38.
[12] F. Angiulli and F. Fassetti, "Very efficient mining of distance-based outliers," in Proc. 16th ACM Conf. Information and Knowledge Management, 2007, pp. 791-800.
[13] G. Huang, S. Song, J. N. D. Gupta, and C. Wu, "Semi-supervised and unsupervised extreme learning machines," IEEE Trans. Cybern., vol. 44, no. 12, pp. 2405-2417, 2014.
[14] D. M. Hawkins, Identification of Outliers. New York: Springer, 1980.
[15] C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity. Mineola, NY, USA: Courier Dover Publications, 1998.