Outlier Detection Using a Cluster-Based Approach

Ms. Mayuri A. Bhangare
Department of Computer Engineering
K.K.W.I.E.E.R., Nashik

Prof. J. R. Mankar
Department of Computer Engineering
K.K.W.I.E.E.R., Nashik

Abstract—Outlier detection is a crucial task in data mining that aims to detect outliers in a given data set. An outlier is a data point that appears inconsistent with the remaining data. Outliers arise from improper measurements, data entry errors, or data arriving from sources different from the rest of the data. Outlier detection is the technique that discovers such data in a given data set. Several outlier detection techniques have been introduced, but they require input parameters from the user, such as a distance threshold or a density threshold, so the user needs prior knowledge of the data. The proposed work focuses on partitioning the given data set into a number of clusters and then detecting outliers in each cluster using a pruning technique. This work aims at noise removal, which improves both computational time and cluster quality.

Index Terms—Cluster-Based, Noise Removal, Outlier Detection.

I. INTRODUCTION

An outlier is a data point that appears inconsistent with the remaining data, and outlier detection is the technique that discovers such data in a given data set. Outliers are generated by data entry errors, improper measurements, or data arriving from sources different from the rest of the data [1].

Outlier detection is a crucial task in data mining that aims to detect outliers in a given data set. In many data analysis tasks a large number of variables are recorded or sampled, and outlier detection is the first step towards obtaining a coherent analysis in many data-mining applications. Outlier detection is used in many fields such as data cleansing, environment monitoring, detection of criminal activity in e-commerce, clinical trials, and network intrusion detection.

Before detecting outliers, two considerations must be made. First, the outlier must be defined, i.e., which data are considered outliers in the given data set; second, a method must be designed to compute the defined outliers efficiently.

The statistical community [3, 4] was the first to study the outlier problem. It assumes that the given data set is generated by some fixed distribution; if an object deviates from this distribution, it is declared an outlier. However, it is often impossible to find the distribution followed by a high-dimensional data set. To overcome this drawback, model-free approaches such as distance-based outliers [5-7] and density-based outliers [8] were introduced by the data management community. These algorithms make no assumptions about the data set, but they have drawbacks: the distance-based technique requires a user-supplied distance threshold d, and the density-based technique incurs high computational cost. Cluster-based outlier detection addresses these issues, with the advantage that it works on data sets consisting of many clusters with different densities.

The cluster-based outlier detection method works in two phases. In the first phase, the data set is clustered using the Unsupervised Extreme Learning Machine (US-ELM) [2], which deals with unlabeled data, performs clustering efficiently, and can be used for multi-cluster clustering of unlabeled data. In the second phase, the defined outliers are detected in each cluster.

The proposed system extends ELM to the Unsupervised Extreme Learning Machine, which deals only with unlabeled data and handles the clustering task efficiently. The system works in two phases: in the first phase, k clusters are generated from the input data set using US-ELM, and in the second phase, the outliers in each cluster are detected using a pruning technique. The final output of the system is the set of outliers.

II. REVIEW OF LITERATURE

Huang et al. [1] introduced the Extreme Learning Machine (ELM), used for training Single-hidden-Layer Feedforward Networks (SLFNs). The biases and input-layer parameters of the SLFN are randomly generated, and ELM solves only for the output weights between the hidden layer and the output layer. ELM solves a regularized least-squares problem, which is faster than the quadratic programming problem in the Support Vector Machine (SVM), but ELM works only with labeled data.
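The ELM training scheme described above reduces to a single regularized least-squares solve over randomly generated hidden features (a minimal sketch; the sigmoid activation and the ridge parameter C are common choices here, not details fixed by the text):

```python
import numpy as np

def elm_train(X, T, n_hidden=40, C=1e6, seed=0):
    """Train an SLFN with ELM: random input weights and biases,
    output weights obtained by regularized least squares."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, (X.shape[1], n_hidden))  # random input weights
    b = rng.uniform(-1, 1, n_hidden)                # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))          # hidden-layer output matrix
    # Output weights from regularized least squares (no iterative training)
    beta = np.linalg.solve(H.T @ H + np.eye(n_hidden) / C, H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

Because only `beta` is solved for, training is a single linear-algebra step, which is the source of ELM's speed advantage over SVM training.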

Liu et al. [2] extended ELM to the Semi-Supervised Extreme Learning Machine (SS-ELM), in which the manifold regularization framework was imported into the ELM model to deal with both labeled and unlabeled data. ELM and SS-ELM work effectively when the number of patterns is larger than the number of hidden neurons, but SS-ELM fails to achieve this when the data are insufficient compared with the number of hidden neurons.

Zhang et al. [3] proposed a co-training technique to train ELMs in the semi-supervised setting. The labeled training set grows progressively by transferring a small set of the most confidently judged unlabeled data to the labeled set at each iteration, and ELMs are retrained repeatedly on the pseudo-labeled set. Since the algorithm has to retrain ELMs repeatedly, the computational cost increases.

The statistical community [4, 5] was the first to study the outlier problem, proposing model-based outliers. It assumed that the data set follows some known distribution, or that at least statistical estimates of the unknown distribution parameters are available; an outlier is then a data point that deviates from the assumed distribution. These model-based approaches degrade in performance on high-dimensional or arbitrary data sets, since there is no way to have prior knowledge of the distribution followed by such data.

The work in [6] pursued model-free outlier methods to overcome the drawbacks of model-based outliers. Distance-based outliers and density-based outliers are two model-free methods, but both require input parameters to declare an object an outlier, e.g., a distance threshold, the number of nearest neighbors, or a density threshold.

Knorr and Ng [7-9] proposed the Nested-Loop (NL) algorithm to compute distance-based outliers. The algorithm partitions the buffer into two halves, a first array and a second array, copies the data set into both arrays, and computes the distance between each pair of objects. A neighbor count is maintained for the objects in the first array, and counting stops for an object as soon as its neighbor count reaches the threshold D. The drawback of this algorithm is its high computation time: the nested-loop algorithm typically requires O(N^2) distance computations, where N is the number of objects in the data set.
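The counting idea behind the NL algorithm can be sketched without the buffer management (a simplified in-memory sketch; the names `r` for the distance threshold and `D` for the neighbor-count threshold are chosen here for illustration):

```python
import numpy as np

def nl_outliers(X, r, D):
    """Distance-based outliers, nested-loop style: a point is an outlier
    if it has fewer than D neighbors within radius r. Counting for a
    point stops early once its neighbor count reaches D."""
    N = len(X)
    outliers = []
    for i in range(N):
        count = 0
        for j in range(N):
            if j != i and np.linalg.norm(X[i] - X[j]) <= r:
                count += 1
                if count >= D:      # early exit: X[i] cannot be an outlier
                    break
        if count < D:
            outliers.append(i)
    return outliers
```

The early exit helps for dense points, but the worst case still performs O(N^2) distance computations, which motivates the randomized and index-based improvements discussed next.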

Bay and Schwabacher [11] proposed an improved version of the nested-loop algorithm. The technique efficiently reduces the search space by randomizing the data set before outlier detection. The algorithm works well when the data set is in random order, but performance is poor for sorted data, and also when the data are dependent on each other, since the algorithm may have to traverse the complete data set to find dependent objects.

Angiulli and Fassetti [12] proposed Detecting Outliers Pushing objects into an Index (DOLPHIN), which works with disk-resident data sets. It is simple to implement and can work with any data type. Its I/O cost is that of sequentially reading the input data set file twice. Its performance is linear in time with respect to the data set size, since it performs similarity search without pre-indexing the whole data set. This method has been further improved by other researchers through spatial indexing, e.g., R-Trees and M-Trees, but those methods are sensitive to dimensionality.

III. SYSTEM ARCHITECTURE

A. System Overview

Fig. 1 gives a detailed idea of the working of the system. The system works in two phases: in the first phase, k clusters of the input data set are formed using US-ELM, whereas in the second phase the outliers in each cluster are detected; finally, the system gives the set of outliers as output.

For clustering, the US-ELM algorithm is used to form good-quality clusters. The clusters are given as input to the outlier detection block, where a pruning technique is used to find the outliers in each cluster.

Fig. 1. Block Diagram of the system

B. US-ELM Algorithm [13]

Input: Training data X = {xi}, i = 1 to N, X ∈ R^(N×d).
Output: The label vector of cluster indices y ∈ N^(N×1).
Steps:
Step 1: Construct the graph Laplacian L from X.
Step 2: Construct an ELM network of nh mapping neurons and calculate the hidden-layer output matrix H ∈ R^(N×nh), obtained by applying the sigmoid function for each pattern at each hidden neuron.
Step 3:
1. If nh ≤ N:
Find the generalized eigenvectors v2, v3, ..., v(n0+1) of A = I(nh) + λH^T L H corresponding to the second through the (n0+1)-th smallest eigenvalues, where λ is the trade-off parameter.
Let β = [ṽ2, ṽ3, ..., ṽ(n0+1)] be the matrix whose columns are the normalized eigenvectors ṽi = vi / ||H vi||, i = 2, 3, ..., n0+1.
2. Else (nh > N):
Find the generalized eigenvectors u2, u3, ..., u(n0+1) of A = I(N) + λL H H^T corresponding to the second through the (n0+1)-th smallest eigenvalues.
Let β = H^T [ũ2, ũ3, ..., ũ(n0+1)], where ũi = ui / ||H H^T ui||, i = 2, 3, ..., n0+1 are the normalized eigenvectors.
Step 4: Calculate the embedding matrix E = Hβ.
Step 5: Treat each row of E as a point and cluster the N patterns into K clusters using the k-means algorithm. Let y be the label vector of cluster indices for all patterns in the data set.
Step 6: Return the label vector y.
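As a concrete illustration, the nh ≤ N branch of the algorithm above can be sketched with NumPy/SciPy (a minimal sketch under stated assumptions: the Gaussian-similarity graph construction, the trade-off parameter `lam`, and the use of SciPy's generalized eigensolver and `kmeans2` are implementation choices, not details prescribed here):

```python
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

def us_elm_cluster(X, n_hidden=30, n_embed=2, K=2, lam=0.1, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Step 1: graph Laplacian from a Gaussian-similarity graph (assumed construction)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    L = np.diag(W.sum(1)) - W
    # Step 2: random ELM mapping; sigmoid hidden-layer output H (N x nh)
    A_in = rng.uniform(-1, 1, (X.shape[1], n_hidden))
    b = rng.uniform(-1, 1, n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ A_in + b)))
    # Step 3 (nh <= N branch): generalized eigenproblem
    # (I + lam * H^T L H) v = gamma * (H^T H) v
    A = np.eye(n_hidden) + lam * H.T @ L @ H
    B = H.T @ H + 1e-8 * np.eye(n_hidden)
    vals, vecs = eigh(A, B)
    V = vecs[:, 1:n_embed + 1]                # 2nd .. (n0+1)-th smallest
    V = V / np.linalg.norm(H @ V, axis=0)     # normalize so ||H v_i|| = 1
    # Step 4: embedding matrix E = H * beta
    E = H @ V
    # Step 5: k-means on the rows of E
    _, y = kmeans2(E, K, seed=seed, minit='++')
    return E, y
```

Each row of the embedding `E` is the low-dimensional representation of one pattern, so the final k-means runs in `n_embed` dimensions regardless of the original dimensionality d.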

C. Outlier Detection

In this section, cluster-based outliers are first defined, and then the algorithm to compute the outliers in each cluster of the given data set is presented.

1) Defining Cluster-Based (CB) Outliers: Let X be a data set of N points, each of d dimensions; a point x is denoted as x = <x[1], x[2], x[3], ..., x[d]>. The distance between two points x1 and x2 is calculated by the Euclidean distance formula:

dis(x1, x2) = sqrt( Σ(i=1 to d) (x1[i] − x2[i])^2 )    (1)

The m clusters C1, C2, C3, ..., Cm are generated by the US-ELM algorithm for the given data set X. The centroid Ci.center of each cluster Ci is calculated, for each dimension j, as:

Ci.center[j] = ( Σ(x ∈ Ci) x[j] ) / |Ci|    (2)

It is observed that normal points lie close to the centroid, whereas an outlier point x lies far from the centroid and has few points close to it. Based on this observation, the weight of a point is defined as follows.

Weight of a point: Given an integer k, for a point x in cluster C, let nnk(x) denote the set of the k nearest neighbors of x in C. Then the weight of x is:

w(x) = ( k · dis(x, C.center) ) / ( Σ(q ∈ nnk(x)) dis(q, C.center) )    (3)

Final set of outliers detected using the CB technique: For a data set X and two given integers k and n, let RCB be a subset of X with n points. If, for every x ∈ RCB, there is no point q ∈ X − RCB such that w(q) > w(x), then RCB is the result set of CB outlier detection.
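Definitions (1)-(3) and the result set can be computed directly (a minimal brute-force sketch; the function and variable names are illustrative, and the pruning refinement comes later):

```python
import numpy as np

def cb_weights(C, k):
    """Weight w(x) for every point x in cluster C (an (m, d) array):
    Eq. (3), the centroid distance of x times k divided by the summed
    centroid distances of its k nearest neighbors in C."""
    center = C.mean(axis=0)                        # Eq. (2): cluster centroid
    d_center = np.linalg.norm(C - center, axis=1)  # Eq. (1), distance to centroid
    pair = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)
    np.fill_diagonal(pair, np.inf)                 # exclude x from its own kNNs
    knn = np.argsort(pair, axis=1)[:, :k]          # indices of k nearest neighbors
    return k * d_center / d_center[knn].sum(axis=1)

def cb_outliers(clusters, k, n):
    """Result set RCB: the n points with the largest weights overall."""
    pts = np.vstack(clusters)
    w = np.concatenate([cb_weights(C, k) for C in clusters])
    return pts[np.argsort(-w)[:n]]
```

A point far from its cluster centroid whose neighbors are all near the centroid gets a weight well above 1, while interior points stay near 1, so ranking by weight surfaces the cluster-based outliers.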

2) Algorithm for CB Outlier Detection: According to the CB outlier definition above, to determine whether a point x in cluster C is an outlier, we need to search for the k nearest neighbors (kNNs) of x in C. To make this search efficient, the method prunes the search space.

Suppose the points in a cluster C have been sorted in ascending order of their distances from the cluster centroid. For a point x in C, we scan through the points to search for the kNNs of x. Let nnk_temp(x) denote the set of the k points nearest to x among the scanned points, and let kdis_temp(x) denote the maximum distance from x to a point of nnk_temp(x). The pruning technique relies on the following theorems.

Theorem 1: For a point q in front of x, if dis(q, C.center) < dis(x, C.center) − kdis_temp(x), then q and the points in front of q cannot be kNNs of x.

Theorem 2: For a point q behind x, if dis(q, C.center) > dis(x, C.center) + kdis_temp(x), then q and the points behind q cannot be kNNs of x.
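Both theorems are instances of one bound: a candidate q can be skipped once |dis(q, C.center) − dis(x, C.center)| exceeds kdis_temp(x), since by the triangle inequality dis(x, q) is at least that difference. The pruned search can be sketched as follows (a minimal sketch; the names and the outward scan from x in order of centroid-distance difference are implementation choices):

```python
import numpy as np

def pruned_knn(C, k):
    """Exact k-th nearest-neighbor distance for each point of cluster C
    ((m, d) array), pruning candidates via Theorems 1 and 2."""
    center = C.mean(axis=0)
    d = np.linalg.norm(C - center, axis=1)
    order = np.argsort(d)            # points sorted by distance to centroid
    P, dc = C[order], d[order]
    m = len(P)
    kdis = np.empty(m)
    for i in range(m):
        best = []                    # current kNN distances, ascending
        kd = np.inf                  # kdis_temp(x)
        # scan candidates outward from x in centroid-distance order
        for j in sorted(range(m), key=lambda j: abs(dc[j] - dc[i])):
            if j == i:
                continue
            # Theorems 1 & 2: once |dc[q] - dc[x]| > kdis_temp(x), this
            # candidate and every later one in the scan are prunable.
            if abs(dc[j] - dc[i]) > kd:
                break
            best = sorted(best + [np.linalg.norm(P[i] - P[j])])[:k]
            if len(best) == k:
                kd = best[-1]
        kdis[i] = best[-1]
    return order, kdis
```

Because kdis_temp(x) only shrinks as better neighbors are found, every candidate skipped at the break also exceeds the final bound, so the result equals the brute-force kNN distance while examining far fewer pairs on centroid-sorted data.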

D. Mathematical Model

The system S accepts numeric data and detects outliers using the cluster-based approach.

The proposed system S is defined as S = {I, F, O}, where:

I = {I1, I2, I3, I4} is the set of inputs:
I1 = training data X = {xi}, i = 1 to N, X ∈ R^(N×d); X consists of N unlabeled training patterns, xi ∈ R^d.
I2 = trade-off parameter.
I3 = number of clusters.
I4 = random variable.

O = {O1, O2, O3, O4, O5, O6, O7} is the set of outputs:
O1 = graph Laplacian.
O2 = parameters of the hidden mapping functions.
O3 = output matrix of the hidden neurons.
O4 = eigenvectors.
O5 = embedding matrix.
O6 = input data partitioned into K clusters.
O7 = outliers from each cluster.

F = {F1, F2, F3, F4, F5, F6, F7} is the set of functions:
F1 = constructs the graph Laplacian from the given input: F1(I1) → O1.
F2 = randomly generates the parameters of the hidden mapping functions by the continuous uniform distribution on the interval (−1, 1): F2(I4) → O2.
F3 = initiates an ELM network of nh neurons and calculates the output matrix of the hidden neurons: F3(O2) → O3.
F4 = calculates the eigenvalues and eigenvectors: F4(O1, O3, I2) → O4.
F5 = calculates the embedding matrix: F5(O3, O4) → O5.
F6 = forms clusters of the input data using the k-means algorithm: F6(O5, I3) → O6.
F7 = finds the outliers in each cluster using the pruning technique: F7(O6) → O7.

Table 1 shows functional dependency among the different

functions used.

IV. SYSTEM ANALYSIS

V. CONCLUSION

The conclusion and future work are explained here.

TABLE I

FUNCTIONAL DEPENDENCY

F1 F2 F3 F4 F5 F6 F7

F1 1 0 0 0 0 0 0

F2 0 1 0 0 0 0 0

F3 0 1 1 0 0 0 0

F4 1 0 1 1 0 0 0

F5 0 0 1 1 1 0 0

F6 0 0 0 0 1 1 0

F7 0 0 0 0 0 1 1

VI. ACKNOWLEDGMENT

I would like to express my gratitude to my guide Prof. J. R. Mankar, Associate Professor, Computer Engineering, K.K.W.I.E.E.R., Nashik, for giving me moral support, valuable guidance, and encouragement in making this dissertation. A special thanks to Prof. Dr. K. N. Nandurkar, Principal, and Prof. Dr. S. S. Sane, Head of the Department of Computer Engineering, K.K.W.I.E.E.R., Nashik, for their kind support and suggestions; this work would not have been possible without them. I would like to extend my sincere thanks to all the faculty members of the Department of Computer Engineering for their help. I am also thankful to my colleagues, who encouraged and willingly helped me with their abilities, and to all those who helped me directly or indirectly.

REFERENCES

[1] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: A new learning scheme of feedforward neural networks," in Proc. Int. Joint Conf. Neural Netw., vol. 2, 2004, pp. 985-990.
[2] L. Li, D. Liu, and J. Ouyang, "A new regularization classification method based on extreme learning machine in network data," J. Inf. Comput. Sci., vol. 9, no. 12, pp. 3351-3363, 2012.
[3] K. Li, J. Zhang, H. Xu, S. Luo, and H. Li, "A semi-supervised extreme learning machine method based on co-training," J. Comput. Inf. Syst., vol. 9, no. 1, pp. 207-214, 2013.
[4] V. Barnett and T. Lewis, Outliers in Statistical Data. New York: Wiley, 1994.
[5] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection. New York: Wiley, 2005.
[6] Z. He, X. Xu, and S. Deng, "Discovering cluster-based local outliers," Pattern Recog. Lett., vol. 24, no. 9, pp. 1641-1650, 2003.
[7] E. M. Knorr and R. T. Ng, "Algorithms for mining distance-based outliers in large datasets," in Proc. Int. Conf. Very Large Data Bases, 1998, pp. 392-403.
[8] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient algorithms for mining outliers from large data sets," ACM SIGMOD Rec., vol. 29, no. 2, pp. 427-438, 2000.
[9] F. Angiulli and C. Pizzuti, "Outlier mining in large high-dimensional data sets," IEEE Trans. Knowl. Data Eng., vol. 17, no. 2, pp. 203-215, 2005.
[10] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying density-based local outliers," ACM SIGMOD Rec., vol. 29, no. 2, pp. 93-104, 2000.
[11] S. D. Bay and M. Schwabacher, "Mining distance-based outliers in near linear time with randomization and a simple pruning rule," in Proc. Ninth ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2003, pp. 29-38.
[12] F. Angiulli and F. Fassetti, "Very efficient mining of distance-based outliers," in Proc. Sixteenth ACM Conf. Information and Knowledge Management, 2007, pp. 791-800.
[13] G. Huang, S. Song, J. N. D. Gupta, and C. Wu, "Semi-supervised and unsupervised extreme learning machines," IEEE Trans. Cybern., vol. 44, no. 12, pp. 2405-2417, 2014.
[14] D. M. Hawkins, Identification of Outliers. New York: Springer, 1980.
[15] C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity. Mineola, NY, USA: Courier Dover Publications, 1998.
