Outlier Detection Using a Cluster-Based Approach
Ms. Mayuri A. Bhangare
Department of Computer Engineering
Abstract—Outlier detection is a crucial task in data mining that aims to detect outliers in a given data set. An outlier is a data point that appears inconsistent with the remaining data. Outliers arise from improper measurements, data entry errors, or data arriving from different sources than the rest of the data. Outlier detection is the technique that discovers such data in a given data set. Several outlier detection techniques have been introduced, but they require input parameters from the user, such as a distance threshold or a density threshold, so the user needs prior knowledge of the data. The proposed work focuses on partitioning the given data set into a number of clusters and then detecting the outliers in each cluster using a pruning technique. This work aims at noise removal, which improves computational time and the quality of the clusters.
Index Terms—Cluster-Based, Noise Removal, Outlier Detection.
I. INTRODUCTION
An outlier is a data point that appears inconsistent with the remaining data, and outlier detection is the technique that discovers such data in a given data set. Outliers are generated by data entry errors, improper measurements, or data arriving from different sources than the rest of the data.
Outlier detection is a crucial task in data mining that aims to detect outliers in a given data set. In many data analysis tasks a large number of variables are recorded or sampled, and outlier detection is the first step towards obtaining a coherent analysis in many data-mining applications. Outlier detection is used in many fields such as data cleansing, environment monitoring, detection of criminal activities in e-commerce, clinical trials, and network intrusion detection.
Two considerations must be made for outlier detection. First, an outlier must be defined, i.e., which data are considered outliers in the given data set; second, a method must be designed to compute the defined outliers efficiently.
The statistical community [4, 5] was the first to study the outlier problem. It assumes that the given data set is generated by some fixed distribution; if an object deviates from this distribution, it is declared an outlier. However, it is impossible to find the distribution followed by a high-dimensional data set. To overcome this drawback, model-free approaches such as distance-based outliers [7, 8] and density-based outliers [10] were introduced by the data management community. These algorithms make no assumptions about the data set, but they have drawbacks of their own: the distance-based technique requires a distance threshold d as input, and the density-based technique has a high computational cost. Hence cluster-based outlier detection comes into the picture, with the advantage that it works with data sets consisting of many clusters with different densities.
The cluster-based outlier detection method works in two phases. In the first phase the data set is clustered using the Unsupervised Extreme Learning Machine (US-ELM) [13]. US-ELM deals with unlabeled data, performs clustering efficiently, and can be used for multi-cluster clustering of unlabeled data. In the second phase the defined outliers are detected in each cluster.
The proposed system extends ELM to the Unsupervised Extreme Learning Machine, which deals with unlabeled data and handles the clustering task efficiently. The system works in two phases: in the first phase, k clusters are generated from the input data set using US-ELM, and in the second phase the outliers in each cluster are detected using a pruning technique. The system's final output is the set of outliers, as sketched below.
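As a minimal illustration of this two-phase design, the following sketch shows how the phases compose. It assumes the illustrative helpers us_elm_cluster and cb_outliers developed in Section III (hypothetical names, not a published API), and a placeholder input file:

```python
# Minimal sketch of the two-phase pipeline; us_elm_cluster and cb_outliers
# are the illustrative helpers sketched in Section III, not a published API.
import numpy as np

X = np.loadtxt("data.csv", delimiter=",")     # numeric, unlabeled data set (illustrative file)
labels = us_elm_cluster(X, n_clusters=5)      # phase 1: US-ELM clustering
outliers = cb_outliers(X, labels, k=5, n=10)  # phase 2: pruned CB outlier detection
```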
II. REVIEW OF LITERATURE
Huang et al. [1] introduced the Extreme Learning Machine (ELM) for training Single-Layer Feedforward Networks (SLFNs). The biases and parameters of the SLFN are randomly generated, and ELM updates only the output weights between the hidden layer and the output layer. ELM solves a regularized least-squares problem, which is faster than solving the quadratic programming problem in the Support Vector Machine (SVM). But ELM only works with labeled data.
D. Liu extended ELM to the Semi-Supervised Extreme Learning Machine (SS-ELM), where the manifold regularization framework was imported into the ELM model to deal with both labeled and unlabeled data. ELM and SS-ELM work effectively when the number of patterns is larger than the number of hidden neurons; SS-ELM cannot achieve this when the data are insufficient compared to the number of hidden neurons.
J. Zhang [3] proposed a co-training technique to train ELMs in SS-ELM. The labeled training set grows progressively by transferring a small set of the most confidently judged unlabeled data to the labeled set at each iteration, and ELMs are retrained on the pseudo-labeled set. Since the algorithm has to retrain ELMs repeatedly, it increases the computational cost.
The statistical community [4, 5] was the first to study the outlier problem and proposed model-based outliers. It assumed that the data set follows some distribution, or at least that statistical estimates of the unknown distribution parameters are available. An outlier is then a point that deviates from the assumed distribution of the data set. These model-based approaches degrade in performance on high-dimensional and arbitrary data sets, since there is no way to have prior knowledge of the distribution followed by such data sets.
K. Li proposed model-free outlier methods to overcome the drawbacks of model-based outliers. Distance-based outliers and density-based outliers are two such model-free methods. However, both approaches require input parameters to declare an object an outlier, e.g., a distance threshold, the number of nearest neighbors, or a density threshold.
Knorr and Ng [7] proposed the Nested-Loop (NL) algorithm to compute distance-based outliers. The buffer is partitioned into two halves, a first array and a second array. The algorithm copies the data set into both arrays and computes the distance between each pair of objects, maintaining a neighbor count for the objects in the first array. It stops counting the neighbors of an object as soon as the count reaches the threshold. The drawback of this algorithm is its high computation time: the nested loop typically requires O(N^2) distance computations, where N is the number of objects in the data set.
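As an illustration, here is a minimal Python sketch of the nested-loop idea, assuming the common DB-outlier definition in which a point is an outlier when fewer than M other points lie within distance D of it (nl_outliers, D, and M are illustrative names):

```python
import numpy as np

def nl_outliers(X, D, M):
    """Naive nested-loop DB-outlier detection: a point is an outlier if
    fewer than M other points lie within distance D of it."""
    outliers = []
    for i, x in enumerate(X):
        count = 0
        for j, q in enumerate(X):
            if i != j and np.linalg.norm(x - q) <= D:
                count += 1
                if count >= M:          # early stop: x has enough neighbors
                    break
        if count < M:
            outliers.append(i)          # index of a detected outlier
    return outliers                     # O(N^2) distance computations worst case
```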
Bay et al. [11] proposed an improved version of the Nested-Loop algorithm. The technique efficiently reduces the search space by randomizing the data set before outlier detection. This algorithm works well when the data set is in random order, but its performance is poor on sorted data, and also when the data points are dependent on each other, since the algorithm may then have to traverse the complete data set to find the outliers.
Angiulli et al. [12] proposed Detecting Outliers Pushing objects into an Index (DOLPHIN), which works with disk-resident data sets. It is simple to implement and can work with any data type. Its I/O cost amounts to sequentially reading the input data set file twice. Its performance is linear in time with respect to the data set size, since it performs similarity search without pre-indexing the whole data set. Other researchers further improved the efficiency of these computations by adopting spatial indexes such as R-Trees and M-Trees, but such methods are sensitive to the dimensionality of the data.
III. SYSTEM ARCHITECTURE
A. System Overview
Fig. 1 gives a detailed idea of the working of the system. The system works in two phases: in the first phase, k clusters are formed from the input data set using US-ELM, and in the second phase the outliers in each cluster are detected; finally, the system gives the set of outliers as output.
Here the US-ELM algorithm is used for clustering in order to form good-quality clusters. The clusters are given as input to the outlier detection block, where a pruning technique is used to find the outliers in each cluster.
Fig. 1. Block Diagram of the system
B. US-ELM Algorithm [13]
Input: training data $X = \{x_i\}_{i=1}^{N}$, $X \in \mathbb{R}^{N \times d}$.
Output: the label vector of cluster indexes $y \in \mathbb{N}^{N \times 1}$.
Step 1: Construct the graph Laplacian $L$ from $X$.
Step 2: Construct an ELM network of $n_h$ mapping neurons and calculate the hidden-layer output matrix $H \in \mathbb{R}^{N \times n_h}$ by applying the sigmoid function to each pattern at each hidden neuron.
Step 3:
1. If $n_h \le N$: find the generalized eigenvectors $v_2, v_3, \ldots, v_{n_0+1}$ of $(I_{n_h} + \lambda H^T L H)\, v = \gamma\, H^T H\, v$ corresponding to the second through the $(n_0+1)$-th smallest eigenvalues. Let $\beta = [\tilde{v}_2, \tilde{v}_3, \ldots, \tilde{v}_{n_0+1}]$ be the matrix whose columns are the normalized eigenvectors $\tilde{v}_i = v_i / \|H v_i\|$, $i = 2, 3, \ldots, n_0+1$.
2. Else ($n_h > N$): find the generalized eigenvectors $u_2, u_3, \ldots, u_{n_0+1}$ of $(I_N + \lambda L)\, u = \gamma\, H H^T u$ corresponding to the second through the $(n_0+1)$-th smallest eigenvalues. Let $\beta = H^T [\tilde{u}_2, \tilde{u}_3, \ldots, \tilde{u}_{n_0+1}]$, where $\tilde{u}_i = u_i / \|H H^T u_i\|$, $i = 2, 3, \ldots, n_0+1$, are the normalized eigenvectors.
Step 4: Calculate the embedding matrix $E = H\beta$.
Step 5: Treat each row of $E$ as a point and cluster the $N$ patterns into $K$ clusters using the k-means algorithm; let $y$ be the label vector of cluster indexes for all patterns in the data set.
Step 6: Return the label vector $y$.
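A hedged Python sketch of the $n_h \le N$ branch of this algorithm follows, assuming NumPy, SciPy, and scikit-learn are available; the kNN affinity graph and the choices of $n_h$, $n_0$, and the trade-off parameter $\lambda$ are illustrative, not the paper's exact settings:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import KMeans

def us_elm_cluster(X, n_clusters, nh=200, n0=3, lam=0.1, knn=10, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape

    # Step 1: graph Laplacian L from a kNN affinity graph (illustrative choice)
    W = kneighbors_graph(X, knn, mode='connectivity', include_self=False)
    W = 0.5 * (W + W.T).toarray()             # symmetrize the adjacency
    L = np.diag(W.sum(axis=1)) - W

    # Step 2: random ELM feature map with sigmoid activation
    A = rng.uniform(-1.0, 1.0, size=(d, nh))  # random input weights
    b = rng.uniform(-1.0, 1.0, size=nh)       # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))    # N x nh hidden-layer output

    # Step 3 (nh <= N case): (I + lam * H^T L H) v = gamma * H^T H v
    Amat = np.eye(nh) + lam * H.T @ L @ H
    Bmat = H.T @ H + 1e-8 * np.eye(nh)        # small ridge for stability
    vals, vecs = eigh(Amat, Bmat)             # eigenvalues in ascending order
    V = vecs[:, 1:n0 + 1]                     # 2nd .. (n0+1)-th smallest
    V = V / np.linalg.norm(H @ V, axis=0)     # normalize so ||H v_i|| = 1

    # Step 4: embedding matrix E = H @ beta
    E = H @ V

    # Step 5: k-means on the rows of E yields the cluster labels y
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(E)
```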
C. Outlier Detection
In this section first the Cluster Based Outliers are defined
and then the algorithm to compute outlier from each cluster
of given data set.
1) Defining Cluster-Based (CB) Outliers: Let X be a data set of N points, each of d dimensions; a point x is denoted as $x = \langle x[1], x[2], \ldots, x[d] \rangle$. The distance between two points $x_1$ and $x_2$ is the Euclidean distance
$dis(x_1, x_2) = \sqrt{\sum_{i=1}^{d} (x_1[i] - x_2[i])^2}$ (1)
The US-ELM algorithm generates m clusters $C_1, C_2, C_3, \ldots, C_m$ for the given data set X. The centroid $C_i.center$ of each cluster $C_i$ is calculated as the mean of its points, $C_i.center = \frac{1}{|C_i|} \sum_{x \in C_i} x$. It is observed that normal points lie close to the centroid, whereas an outlier point x lies far from the centroid and has few points close to it. Based on this observation, the weight of a point is defined as follows.
Weight of a point: Given an integer k, for a point x in cluster C, let $nn_k(x)$ denote the set of the k nearest neighbors of x in C. Then the weight of x is
$w(x) = dis(x, C.center) \cdot \frac{1}{k} \sum_{p \in nn_k(x)} dis(p, x)$ (2)
Final set of outliers detected by the CB technique: For a data set X and two integers k and n, let $R_{CB}$ be a subset of X with n points. If for every $x \in R_{CB}$ there is no point $q \in X \setminus R_{CB}$ such that $w(q) > w(x)$, then $R_{CB}$ is the result set of CB outlier detection.
2) Algorithm for CB Outlier Detection: According to the CB outlier definition above, to determine whether a point x in cluster C is an outlier we need to search for the k nearest neighbors (kNNs) of x in C. To make this search efficient, the method prunes the search space as follows.
Suppose the points in cluster C have been sorted in ascending order of their distances from the cluster centroid. For a point x in C, we scan through the points to find the kNNs of x. Let $nn_k^{temp}(x)$ denote the set of the k points nearest to x among the points scanned so far, and let $kdis^{temp}(x)$ denote the maximum distance from x to a point in $nn_k^{temp}(x)$. The pruning technique rests on the following theorems.
Theorem 1: For a point q in front of x, if $dis(q, C.center) < dis(x, C.center) - kdis^{temp}(x)$, then q and the points in front of q cannot be among the kNNs of x.
Theorem 2: For a point q at the back of x, if $dis(q, C.center) > dis(x, C.center) + kdis^{temp}(x)$, then q and the points at the back of q cannot be among the kNNs of x.
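A hedged sketch of the pruned kNN search and the final CB outlier selection follows, assuming the weight formula as reconstructed in (2); all function names are illustrative:

```python
import heapq
import numpy as np

def knn_mean_dist(C, dc, i, k):
    """Mean distance from C[i] to its k nearest neighbors in cluster C.
    C is sorted by dc, the distance of each point to the cluster centroid,
    so Theorems 1 and 2 can prune the scan on both sides of position i."""
    heap = []                        # max-heap (negated) of the k best distances
    kdist = np.inf                   # current kdis_temp(x)
    lo, hi = i - 1, i + 1
    while lo >= 0 or hi < len(C):
        if lo >= 0 and dc[i] - dc[lo] > kdist:
            lo = -1                  # Theorem 1: prune q and everything before it
        if hi < len(C) and dc[hi] - dc[i] > kdist:
            hi = len(C)              # Theorem 2: prune q and everything after it
        if lo >= 0 and (hi >= len(C) or dc[i] - dc[lo] <= dc[hi] - dc[i]):
            j, lo = lo, lo - 1       # next candidate on the nearer side
        elif hi < len(C):
            j, hi = hi, hi + 1
        else:
            break                    # both sides exhausted or pruned
        d = np.linalg.norm(C[i] - C[j])
        if len(heap) < k:
            heapq.heappush(heap, -d)
        elif d < -heap[0]:
            heapq.heapreplace(heap, -d)
        if len(heap) == k:
            kdist = -heap[0]         # tighten the pruning bound
    return -sum(heap) / max(len(heap), 1)   # guard for singleton clusters

def cb_outliers(X, labels, k=5, n=10):
    """Return the indexes of the n points with the largest CB weights."""
    weights = np.empty(len(X))
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        C = X[idx]
        dc = np.linalg.norm(C - C.mean(axis=0), axis=1)
        order = np.argsort(dc)       # ascending distance to the centroid
        C, dc, idx = C[order], dc[order], idx[order]
        for i in range(len(C)):
            # weight per equation (2): centroid distance times mean kNN distance
            weights[idx[i]] = dc[i] * knn_mean_dist(C, dc, i, k)
    return np.argsort(weights)[-n:][::-1]    # top-n weights = CB outlier set
```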
D. Mathematical Model
The system S accepts numeric data and detects outliers using the cluster-based approach. The proposed system is defined as S = {I, F, O}, where:
I = {I1, I2, I3, I4} is the set of inputs.
I1 = training data $X = \{x_i\}_{i=1}^{N}$, $X \in \mathbb{R}^{N \times d}$, consisting of N unlabeled training patterns $x_i \in \mathbb{R}^d$.
I2 = trade-off parameter.
I3 = number of clusters.
I4 = random variable.
O = {O1, O2, O3, O4, O5, O6, O7} is the set of outputs.
O1 = graph Laplacian.
O2 = parameters of the hidden mapping functions.
O3 = output matrix of the hidden neurons.
O4 = eigenvectors.
O5 = embedding matrix.
O6 = input data partitioned into K clusters.
O7 = outliers from each cluster.
F = {F1, F2, F3, F4, F5, F6, F7} is the set of functions.
F1 constructs the graph Laplacian from the input data: F1(I1) → O1.
F2 randomly generates the parameters of the hidden mapping functions from the continuous uniform distribution on the interval (-1, 1): F2(I4) → O2.
F3 initiates an ELM network of n_h neurons and calculates the output matrix of the hidden neurons: F3(O2) → O3.
F4 calculates the eigenvalues and eigenvectors: F4(O1, O3, I2) → O4.
F5 calculates the embedding matrix: F5(O3, O4) → O5.
F6 forms the clusters of the input data using k-means: F6(O5, I3) → O6.
F7 finds the outliers in each cluster using the pruning technique: F7(O6) → O7.
Table 1 shows the functional dependency among the different functions.

Table 1: Functional dependency among the functions
     F1  F2  F3  F4  F5  F6  F7
F1    1   0   0   0   0   0   0
F2    0   1   0   0   0   0   0
F3    0   1   1   0   0   0   0
F4    1   0   1   1   0   0   0
F5    0   0   1   1   1   0   0
F6    0   0   0   0   1   1   0
F7    0   0   0   0   0   1   1

IV. SYSTEM ANALYSIS
The conclusion and future work are explained here.
ACKNOWLEDGMENT
I would like to express my gratitude to my guide Prof. J. R. Mankar, Associate Professor, Computer Engineering, K.K.W.I.E.E.R., Nashik, for giving me moral support, valuable guidance, and encouragement in making this dissertation. Special thanks to Prof. Dr. K. N. Nandurkar, Principal, and Prof. Dr. S. S. Sane, Head of the Department of Computer Engineering, K.K.W.I.E.E.R., Nashik, for their kind support and suggestions; this work would not have been possible without them. I would like to extend my sincere thanks to all the faculty members of the Department of Computer Engineering for their help. I am also thankful to my colleagues, who encouraged and willingly helped me with their abilities. Lastly, I am thankful to all those who helped me directly or indirectly.
REFERENCES
[1] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: A new learning scheme of feedforward neural networks," in Proc. Int. Joint Conf. Neural Netw., vol. 2, 2004, pp. 985-990.
[2] L. Li, D. Liu, and J. Ouyang, "A new regularization classification method based on extreme learning machine in network data," J. Inf. Comput. Sci., vol. 9, no. 12, pp. 3351-3363, 2012.
[3] K. Li, J. Zhang, H. Xu, S. Luo, and H. Li, "A semi-supervised extreme learning machine method based on co-training," J. Comput. Inf. Syst., vol. 9, no. 1, pp. 207-214, 2013.
[4] V. Barnett and T. Lewis, Outliers in Statistical Data. New York: Wiley, 1994.
[5] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection. New York: Wiley, 2005.
[6] Z. He, X. Xu, and S. Deng, "Discovering cluster-based local outliers," Pattern Recog. Lett., vol. 24, no. 9, pp. 1641-1650, 2003.
[7] E. M. Knorr and R. T. Ng, "Algorithms for mining distance-based outliers in large datasets," in Proc. Int. Conf. Very Large Data Bases, 1998, pp. 392-403.
[8] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient algorithms for mining outliers from large data sets," ACM SIGMOD Rec., vol. 29, no. 2, pp. 427-438, 2000.
[9] F. Angiulli and C. Pizzuti, "Outlier mining in large high-dimensional data sets," IEEE Trans. Knowl. Data Eng., vol. 17, no. 2, pp. 203-215, 2005.
[10] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying density-based local outliers," ACM SIGMOD Rec., vol. 29, no. 2, pp. 93-104, 2000.
[11] S. D. Bay and M. Schwabacher, "Mining distance-based outliers in near linear time with randomization and a simple pruning rule," in Proc. Ninth ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2003, pp. 29-38.
[12] F. Angiulli and F. Fassetti, "Very efficient mining of distance-based outliers," in Proc. Sixteenth ACM Conf. Information and Knowledge Management, 2007, pp. 791-800.
[13] G. Huang, S. Song, J. N. D. Gupta, and C. Wu, "Semi-supervised and unsupervised extreme learning machines," IEEE Trans. Cybern., vol. 44, no. 12, 2014.
[14] D. M. Hawkins, Identification of Outliers. New York: Springer, 1980.
[15] C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity. Mineola, NY, USA: Courier Dover Publications.