I. INTRODUCTION

An outlier is a data object that appears inconsistent with the remaining data, and outlier detection is the technique that discovers such objects in a given data set. Outliers arise from data entry errors, improper measurements, or data arriving from sources other than the rest of the data [1].

Outlier detection is a crucial task in data mining that aims to detect outliers in a given data set. In many data analysis tasks a large number of variables are recorded or sampled, and outlier detection is the first step towards obtaining a coherent analysis in many data-mining applications. Outlier detection is used in many fields such as data cleansing, environment monitoring, detection of criminal activity in e-commerce, clinical trials, and network intrusion detection.

Before detecting outliers, two considerations must be addressed: the outlier must first be defined, i.e., which data are considered outliers in the given data set, and a method must then be designed to compute the defined outliers efficiently.

The statistical community [3, 4] was the first to study the outlier problem. It assumes that the given data set is generated by some fixed distribution; if an object deviates from this distribution, it is declared an outlier. However, it is practically impossible to identify the distribution followed by a high-dimensional data set. To overcome this drawback, model-free approaches such as distance-based outliers [5-7] and density-based outliers [8] were introduced by the data management community. These algorithms make no assumptions about the data set, but they have drawbacks of their own: the distance-based technique requires a suitable distance parameter d, and the density-based technique incurs high computational cost. Cluster-based outlier detection therefore comes into the picture; its advantage is that it works with data sets consisting of many clusters with different densities.

The cluster-based outlier detection method works in two phases. In the first phase the data set is clustered using the Unsupervised Extreme Learning Machine (US-ELM) [2]. US-ELM deals with unlabeled data, performs clustering efficiently, and can be used for multi-cluster clustering of unlabeled data. In the second phase the defined outliers are detected from each cluster.

The proposed system extends ELM to the Unsupervised Extreme Learning Machine. It deals only with unlabeled data and also handles the clustering task efficiently. The proposed system works in two phases: in the first phase, k clusters are generated from the input data set using US-ELM, and in the second phase the outliers in each cluster are detected using a pruning technique. The system's final output is the set of outliers.

II. REVIEW OF LITERATURE

Huang et al. [1] introduced the Extreme Learning Machine (ELM) for training single-hidden-layer feedforward networks (SLFNs). The biases and input parameters of the SLFN are randomly generated, and ELM computes only the output weights between the hidden layer and the output layer. ELM solves a regularized least-squares problem, which is faster than solving the quadratic programming problem in the Support Vector Machine (SVM). However, ELM only works with labeled data.
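To make the contrast with SVM training concrete, the following is a minimal sketch of basic ELM training in Python, assuming a sigmoid hidden layer; it is an illustration, not the implementation of [1], and the names elm_train, elm_predict and the parameter reg are ours.

import numpy as np

def elm_train(X, T, n_hidden=100, reg=1e3, rng=None):
    """Basic ELM training: random hidden layer, closed-form output weights.
    X: (N, d) inputs; T: (N, m) targets, e.g. one-hot class labels.
    Solves min_B ||H B - T||^2 + (1/reg) ||B||^2 by regularized
    least squares instead of SVM-style quadratic programming."""
    rng = np.random.default_rng(rng)
    W = rng.uniform(-1.0, 1.0, (X.shape[1], n_hidden))   # random input weights
    b = rng.uniform(-1.0, 1.0, n_hidden)                 # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))               # sigmoid hidden outputs
    B = np.linalg.solve(H.T @ H + np.eye(n_hidden) / reg, H.T @ T)
    return W, b, B

def elm_predict(X, W, b, B):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ B                                         # class = argmax per row

Only the single linear solve for B involves learning; the hidden layer is never trained, which is what makes ELM fast.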
D. Liu [2] extended ELM to the Semi-Supervised Extreme Learning Machine (SS-ELM), in which the manifold regularization framework is imported into the ELM model to deal with both labeled and unlabeled data. ELM and SS-ELM work effectively when the number of training patterns is larger than the number of hidden neurons, but SS-ELM cannot achieve this when the data are insufficient compared to the number of hidden neurons.

J. Zhang [3] proposed a co-training technique to train ELMs in the SS-ELM setting. The labeled training set grows progressively by transferring a small set of the most confidently judged unlabeled data to the labeled set at each iteration, and the ELMs are retrained on the pseudo-labeled set. Since the algorithm has to retrain the ELMs repeatedly, this affects the computational cost.

The statistical community [4, 5] was the first to study the outlier problem and proposed model-based outliers. It is assumed that the data set follows some known distribution, or at least that statistical estimates of unknown distribution parameters are available; an outlier is then a data object that deviates from the assumed distribution. These model-based approaches degrade in performance on high-dimensional and arbitrary data sets, since there is no way to have prior knowledge of the distribution followed by such data.
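As a concrete instance of the model-based view, the following illustrative sketch assumes a univariate Gaussian and flags points more than three standard deviations from the fitted mean; it is a textbook example rather than a method from [4, 5].

import numpy as np

def gaussian_outliers(x, z=3.0):
    """Model-based outliers under an assumed Gaussian distribution:
    flag values more than z standard deviations from the sample mean."""
    mu, sigma = x.mean(), x.std()
    return np.abs(x - mu) > z * sigma

x = np.concatenate([np.random.normal(0.0, 1.0, 1000), [8.5, -9.0]])
print(np.where(gaussian_outliers(x))[0])   # indices of the flagged values

The rule works only as long as the Gaussian assumption holds, which is exactly the limitation described above for high-dimensional or arbitrary data.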
K. Li [6] proposed model-free outlier methods to overcome the drawbacks of model-based outliers. Distance-based outliers and density-based outliers are the two model-free approaches, but both require input parameters to declare an object an outlier, e.g., a distance threshold, the number of nearest neighbors, or a density threshold.

Knorr and Ng [7-9] proposed the Nested-Loop (NL) algorithm to compute distance-based outliers. The buffer is partitioned into two halves, a first array and a second array; the data set is copied into both arrays and the distance between each pair of objects is computed. A neighbor count is maintained for the objects in the first array, and counting the neighbors of an object stops as soon as its count reaches the threshold. The drawback of this algorithm is its high computation time: the nested-loop algorithm typically requires O(N^2) distance computations, where N is the number of objects in the data set.
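The following is a simplified in-memory sketch of the nested-loop idea; it omits the two-array buffering used for disk-resident data, and the parameters D (distance threshold) and M (neighbor-count threshold) follow the usual distance-based outlier formulation and are assumptions of this illustration.

import numpy as np

def nested_loop_outliers(X, D, M):
    """In-memory sketch of nested-loop distance-based outlier detection.
    A point is an outlier if fewer than M other points lie within
    distance D of it. Counting stops early once M neighbours are found,
    but the worst case remains O(N^2) distance computations."""
    N, outliers = len(X), []
    for i in range(N):
        count = 0
        for j in range(N):
            if i != j and np.linalg.norm(X[i] - X[j]) <= D:
                count += 1
                if count >= M:
                    break              # enough neighbours: not an outlier
        if count < M:
            outliers.append(i)
    return outliers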
Bay et al. [11] proposed an improved version of the nested-loop algorithm. The technique efficiently reduces the search space by randomizing the data set before outlier detection. The algorithm works well when the data set is in random order, but its performance is poor on sorted data sets, and also when the data objects are dependent on each other, since the algorithm may have to traverse the complete data set to find dependent objects.

Angiulli et al. [12] proposed Detecting Outliers Pushing objects into an Index (DOLPHIN), which works with disk-resident data sets. It is simple to implement and can work with any data type. Its I/O cost is that of sequentially reading the input data set file twice. Its running time is linear in the data set size, since it performs similarity search without pre-indexing the whole data set. This method was further improved by other researchers through efficient computations with spatial indexes, e.g., R-trees and M-trees, but such methods are sensitive to the dimensionality.

III. SYSTEM ARCHITECTURE

A. System Overview

Fig. 1 gives a detailed idea of the working of the system. The system works in two phases: in the first phase, k clusters are formed from the input data set using US-ELM, whereas in the second phase the outliers in each cluster are detected, and finally the system outputs the set of outliers. US-ELM is used for clustering in order to form good-quality clusters. The clusters are given as input to the outlier detection block, where a pruning technique is used to find the outliers in each cluster.

Fig. 1. Block diagram of the system

B. US-ELM Algorithm [13]

Input: training data $\{X_i\}_{i=1}^{N}$, $X \in \mathbb{R}^{N \times d}$.
Output: the label vector of cluster indices $y \in \mathbb{N}^{N \times 1}$.

Step 1: Construct the graph Laplacian $L$ from $X$.
Step 2: Construct an ELM network of $n_h$ mapping neurons and calculate the output matrix $H \in \mathbb{R}^{N \times n_h}$, obtained by applying the sigmoid function to each pattern for each hidden neuron.
Step 3:
1. If $n_h \le N$: find the generalized eigenvectors $v_2, v_3, \ldots, v_{n_0+1}$ of
$(I_{n_h} + \lambda H^T L H)\, v = \gamma\, H^T H\, v$
corresponding to the second through the $(n_0+1)$-th smallest eigenvalues. Let $\beta = [\tilde{v}_2, \tilde{v}_3, \ldots, \tilde{v}_{n_0+1}]$ be the matrix whose columns are the normalized eigenvectors $\tilde{v}_i = v_i / \lVert H v_i \rVert$, $i = 2, 3, \ldots, n_0+1$.
2. Else ($n_h > N$): find the generalized eigenvectors $u_2, u_3, \ldots, u_{n_0+1}$ of
$(I_N + \lambda L H H^T)\, u = \gamma\, H H^T\, u$
corresponding to the second through the $(n_0+1)$-th smallest eigenvalues. Let $\beta = H^T [\tilde{u}_2, \tilde{u}_3, \ldots, \tilde{u}_{n_0+1}]$, where $\tilde{u}_i = u_i / \lVert H H^T u_i \rVert$, $i = 2, 3, \ldots, n_0+1$, are the normalized eigenvectors.
Step 4: Calculate the embedding matrix $E = H\beta$.
Step 5: Treat each row of $E$ as a point and cluster the N patterns into K clusters using the k-means algorithm. Let y be the label vector of cluster indices for all the patterns in the data set.
Step 6: Return the label vector y.
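A compact sketch of the $n_h \le N$ branch of the above algorithm is given below, assuming a simple kNN connectivity graph for the Laplacian, scipy's generalized symmetric eigensolver, and scikit-learn's k-means; the parameter names (lam for the trade-off $\lambda$, n0 for the embedding dimension) are illustrative, and a small jitter term is added for numerical stability.

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph

def us_elm_cluster(X, n_clusters, n_hidden=200, n0=None, lam=0.1,
                   n_neighbors=10, rng=None):
    """US-ELM clustering sketch (n_hidden <= N branch of the algorithm)."""
    N, d = X.shape
    n0 = n0 or n_clusters
    rng = np.random.default_rng(rng)

    # Step 1: graph Laplacian L = Deg - W from a kNN similarity graph.
    W = kneighbors_graph(X, n_neighbors, mode='connectivity',
                         include_self=False)
    W = 0.5 * (W + W.T)                        # symmetrize the graph
    L = np.diag(np.asarray(W.sum(axis=1)).ravel()) - W.toarray()

    # Step 2: random sigmoid hidden layer, output matrix H (N x n_hidden).
    A = rng.uniform(-1, 1, size=(d, n_hidden))
    b = rng.uniform(-1, 1, size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))

    # Step 3: generalized eigenproblem
    #   (I + lam * H^T L H) v = gamma * (H^T H) v,
    # keeping the eigenvectors of the 2nd..(n0+1)-th smallest eigenvalues.
    lhs = np.eye(n_hidden) + lam * H.T @ L @ H
    rhs = H.T @ H + 1e-8 * np.eye(n_hidden)    # jitter for stability
    vals, vecs = eigh(lhs, rhs)
    V = vecs[:, 1:n0 + 1]
    V = V / np.linalg.norm(H @ V, axis=0)      # normalize: v_i / ||H v_i||

    # Steps 4-5: embedding E = H * beta, then k-means on the rows of E.
    E = H @ V
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(E)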
C. Outlier Detection

In this section the cluster-based outliers are first defined, and then the algorithm for computing the outliers from each cluster of the given data set is described.

1) Defining Cluster-Based (CB) Outliers: Let X be a data set of N points, each of d dimensions; a point x is denoted as $x = \langle x[1], x[2], \ldots, x[d] \rangle$. The distance between two points $x_1$ and $x_2$ is the Euclidean distance

$dis(x_1, x_2) = \sqrt{\sum_{i=1}^{d} (x_1[i] - x_2[i])^2}$   (1)

US-ELM generates m clusters $C_1, C_2, \ldots, C_m$ from the data set X. The centroid $C_i.center$ of each cluster $C_i$ is calculated coordinate-wise as

$C_i.center[j] = \frac{\sum_{x \in C_i} x[j]}{|C_i|}$   (2)

Informally, a point whose distance to its k nearest neighbors within its own cluster is large compared to that of the other members of the cluster is treated as a CB outlier.

2) Algorithm for CB Outlier Detection: According to the above definition of CB outliers, to determine whether a point x from cluster C is an outlier we need to search for the k-nearest neighbors (kNNs) of x within C. To make this search efficient, the method is designed to prune the search space; a minimal sketch of this pruning search is given at the end of this section.

Suppose the points of cluster C have been sorted in ascending order of their distances from the cluster centroid. For a point x in C, we scan through the points to search for the kNNs of x. Let $nn_k^{temp}(x)$ denote the set of k points nearest to x among the scanned points, and let $kdis^{temp}(x)$ denote the maximum distance from x to the points of $nn_k^{temp}(x)$. The pruning technique rests on the following theorems.

Theorem 1: For a point q in front of x, if $dis(q, C.center) < dis(x, C.center) - kdis^{temp}(x)$, then the points in front of q, and q itself, cannot be among the kNNs of x.

Theorem 2: For a point q at the back of x, if $dis(q, C.center) > dis(x, C.center) + kdis^{temp}(x)$, then the points at the back of q, and q itself, cannot be among the kNNs of x.

Both theorems follow from the triangle inequality: $dis(x, q) \ge |dis(x, C.center) - dis(q, C.center)|$, so whenever this lower bound exceeds $kdis^{temp}(x)$, the point q cannot improve the current kNN set.

D. Mathematical Model

The system S accepts numeric data and detects outliers using the cluster-based approach. The proposed system S is defined as:

S = {I, F, O}

where I = {I1, I2, I3, I4} is the set of inputs:
I1 = $\{X_i\}_{i=1}^{N}$, $X \in \mathbb{R}^{N \times d}$, the training data; X consists of N unlabeled training patterns, $X_i \in \mathbb{R}^d$.
I2 = trade-off parameter.
I3 = number of clusters.
I4 = random variable.

O = {O1, O2, O3, O4, O5, O6, O7} is the set of outputs:
O1 = graph Laplacian.
O2 = parameters of the hidden mapping functions.
O3 = output matrix of the hidden neurons.
O4 = eigenvectors.
O5 = embedding matrix.
O6 = input data partitioned into K clusters.
O7 = outliers from each cluster.

F = {F1, F2, F3, F4, F5, F6, F7} is the set of functions:
F1 = construct the graph Laplacian from the given input: F1(I1) → O1.
F2 = randomly generate the parameters of the hidden mapping functions from the continuous uniform distribution on the interval (-1, 1): F2(I4) → O2.
F3 = initialize an ELM network of n_h neurons and calculate the output matrix of the hidden neurons: F3(O2) → O3.
F4 = calculate the eigenvalues and eigenvectors: F4(O1, O3, I2) → O4.
F5 = calculate the embedding matrix: F5(O3, O4) → O5.
F6 = form clusters of the input data using the k-means algorithm: F6(O5, I3) → O6.
F7 = find the outliers in each cluster using the pruning technique: F7(O6) → O7.

Table I shows the functional dependency among the different functions used.

TABLE I
FUNCTIONAL DEPENDENCY

     F1  F2  F3  F4  F5  F6  F7
F1    1   0   0   0   0   0   0
F2    0   1   0   0   0   0   0
F3    0   1   1   0   0   0   0
F4    1   0   1   1   0   0   0
F5    0   0   1   1   1   0   0
F6    0   0   0   0   1   1   0
F7    0   0   0   0   0   1   1
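Below is the minimal sketch of the pruning search referred to in Section III-C. It assumes that CB outliers are reported as the top_n points with the largest within-cluster kNN distance; that reporting rule and the heap-based bookkeeping are assumptions of this sketch rather than details fixed by the text.

import heapq
import numpy as np

def cb_outliers(C, k, top_n):
    """Return indices (into cluster C) of the top_n points with the largest
    distance to their k-th nearest neighbour inside C. For each query x the
    scan walks outward through the points sorted by centroid distance and
    stops as soon as Theorems 1-2 prune both sides, i.e. when
    |dis(q, center) - dis(x, center)| exceeds kdis_temp(x).
    Assumes the cluster holds more than k points."""
    center = C.mean(axis=0)
    dc = np.linalg.norm(C - center, axis=1)    # distances to the centroid
    order = np.argsort(dc)                     # scan order of Section III-C
    kdist = np.empty(len(C))
    for pos in range(len(C)):
        i = order[pos]
        heap = []                              # negated distances: max-heap
        lo, hi = pos - 1, pos + 1
        while lo >= 0 or hi < len(C):
            kd = -heap[0] if len(heap) == k else np.inf
            gap_lo = dc[i] - dc[order[lo]] if lo >= 0 else np.inf
            gap_hi = dc[order[hi]] - dc[i] if hi < len(C) else np.inf
            if min(gap_lo, gap_hi) > kd:
                break                          # both sides pruned (Thm 1-2)
            if gap_lo <= gap_hi:
                j, lo = order[lo], lo - 1      # take next point in front
            else:
                j, hi = order[hi], hi + 1      # take next point at the back
            d = np.linalg.norm(C[i] - C[j])
            if len(heap) < k:
                heapq.heappush(heap, -d)
            elif d < -heap[0]:
                heapq.heapreplace(heap, -d)    # tighten kdis_temp(x)
        kdist[i] = -heap[0]
    return np.argsort(kdist)[::-1][:top_n]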
IV. SYSTEM ANALYSIS

A. Performance Measures

1) Cluster Quality: Cluster quality is one measure of the system; it reflects how correctly the classes/labels are predicted for each data point in the given data set. To compute the cluster accuracy, both the actual class label and the predicted class label of each point are considered.
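One common way to compute this measure is to map cluster indices to class labels by an optimal one-to-one assignment and then count matches; the following sketch uses the Hungarian algorithm from scipy and is an assumed evaluation recipe, not one prescribed above.

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Clustering accuracy: find the best one-to-one mapping between
    cluster indices and class labels (Hungarian algorithm), then return
    the fraction of points whose mapped label matches the true label."""
    clusters, classes = np.unique(y_pred), np.unique(y_true)
    count = np.zeros((len(clusters), len(classes)), dtype=int)
    for ci, c in enumerate(clusters):
        for li, l in enumerate(classes):
            count[ci, li] = np.sum((y_pred == c) & (y_true == l))
    rows, cols = linear_sum_assignment(-count)     # maximize matched points
    return count[rows, cols].sum() / len(y_true)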
2) Computational Time: Computational time is the time required to perform both phases, i.e., clustering using US-ELM and outlier detection using the pruning technique. The running time of the system is measured against the k in the kNN search, the dimensionality (d), and the size (N) of the data set.
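A trivial timing harness along these lines might look as follows; detect is a placeholder for the assumed two-phase pipeline callable (US-ELM clustering followed by CB outlier search), and the same loop can be repeated over values of d and k.

import time
import numpy as np

def time_pipeline(detect, sizes=(1000, 2000, 4000), d=8, k=10):
    """Measure the wall-clock time of a two-phase outlier-detection
    pipeline `detect` against the data set size N."""
    for n in sizes:
        X = np.random.rand(n, d)
        t0 = time.perf_counter()
        detect(X, k)
        dt = time.perf_counter() - t0
        print(f"N={n:6d}  d={d}  k={k}  time={dt:.3f}s")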