Essay details:

  • Subject area(s): Engineering
  • Published on: 7th September 2019
  • Number of pages: 2

Outlier Detection using Cluster-based Approach

Ms. Mayuri A. Bhangare
Department of Computer Engineering
K.K.W.I.E.E.R, [email protected]

Department of Computer Engineering
K.K.W.I.E.E.R, [email protected]

Abstract: Outlier detection is a crucial task in data mining that aims to detect the outliers in a given data set. An outlier is a data object that appears inconsistent with the remaining data. Outliers are generated by improper measurements, data entry errors, or data arriving from a different source than the rest of the data. Several outlier detection techniques have been introduced that require an input parameter from the user, such as a distance threshold or a density threshold, so the user needs prior knowledge of the data. The proposed work partitions the given data set into a number of clusters and then detects the outliers in each cluster using a pruning technique. This work aims at noise removal, which affects both the computational time and the quality of the clusters.

Index Terms: Cluster-Based, Noise Removal, Outlier Detection.

I. INTRODUCTION

An outlier is a data object that appears inconsistent with the remaining data, and outlier detection is the technique that discovers such objects in a given data set. Outliers are generated by data entry errors, improper measurements, or data arriving from a different source than the rest of the data [1].

Outlier detection is a crucial task in data mining. In many data analysis tasks a large number of variables are recorded or sampled, and outlier detection is the first step towards obtaining a coherent analysis in many data-mining applications.
The technique of outlier detection is used in many fields, such as data cleansing, environmental monitoring, detection of criminal activity in e-commerce, clinical trials, and network intrusion detection.

Outlier detection requires two decisions. First, the outlier must be defined, i.e., which data in the given data set are considered outliers. Second, a method must be designed to compute the defined outliers efficiently.

The statistical community [3, 4] was the first to study the problem of outliers. Statistical methods assume that the given data set is generated by some fixed distribution; an object that deviates from this distribution is declared an outlier. However, for high-dimensional data it is practically impossible to find the distribution that the data set follows. To overcome this drawback, the data management community introduced model-free approaches such as distance-based outliers [5-7] and density-based outliers [8]. These algorithms make no assumption about the data set, but they have drawbacks of their own: distance-based outlier detection requires a user-supplied distance d, and density-based outlier detection has a high computational cost. Cluster-based outlier detection therefore comes into the picture; its advantage is that it works with data sets that consist of many clusters with different densities.

Cluster-based outlier detection works in two phases. In the first phase the data set is clustered using the Unsupervised Extreme Learning Machine (US-ELM) [2]. US-ELM deals with unlabeled data, performs clustering efficiently, and can be used for multicluster clustering. In the second phase the defined outliers are detected in each cluster.

The proposed system extends ELM to the Unsupervised Extreme Learning Machine. It deals only with unlabeled data and handles the clustering task efficiently.
The proposed system works in two phases: in the first phase, k clusters are generated from the input data set using US-ELM, and in the second phase the outliers in each cluster are detected using a pruning technique. The final output of the system is the set of outliers.

II. REVIEW OF LITERATURE

Huang et al. [1] introduced the Extreme Learning Machine (ELM) for training Single Layer Feed Forward Networks (SLFNs). The biases and parameters of the SLFN are randomly generated, and ELM updates only the output weights between the hidden layer and the output layer. ELM solves a regularized least-squares problem, which is faster than solving the quadratic programming problem in a Support Vector Machine (SVM). However, ELM works only with labeled data.

D. Liu [2] extended ELM to the Semi-Supervised Extreme Learning Machine (SS-ELM), in which the manifold regularization framework was imported into the ELM model to deal with both labeled and unlabeled data. ELM and SS-ELM work effectively when the number of patterns is larger than the number of neurons. SS-ELM does not achieve this, however, when the data are insufficient compared to the number of hidden neurons.

J. Zhang [3] proposed a co-training technique to train the ELMs in SS-ELM. The labeled training set grows progressively by transferring a small set of the most confidently judged unlabeled data to the labeled set at each iteration, and the ELMs are retrained repeatedly on the pseudo-labeled set. Because the algorithm has to retrain the ELMs repeatedly, its computational cost increases.

The statistical community [4, 5] was the first to study the problem of outliers and proposed model-based outliers. These approaches assume that the data set follows some distribution, or at least that statistical estimates of the unknown distribution parameters are available. An outlier is a data object that deviates from the assumed distribution of the data set. The performance of model-based approaches degrades on high-dimensional and arbitrary data sets, since there is little chance of having prior knowledge of the distribution that such data sets follow.

K. Li [6] proposed model-free outlier methods to overcome the drawback of model-based outliers. Distance-based outliers and density-based outliers are two model-free methods. However, both approaches require input parameters to declare an object an outlier, e.g., a distance threshold, the number of nearest neighbors of an object, or a density threshold.

Knorr and Ng [7-9] proposed the Nested-Loop (NL) algorithm to compute distance-based outliers. In this algorithm the buffer is partitioned into two halves, the first array and the second array. The algorithm copies the data set into both arrays and computes the distance between each pair of objects. A count of neighbors is maintained for the objects in the first array, and counting for an object stops as soon as its neighbor count reaches D. The drawback of this algorithm is its high computation time: the nested-loop algorithm typically requires $O(N^2)$ distance computations, where N is the number of objects in the data set.

Bay et al. [11] proposed an improved version of the Nested-Loop algorithm. The technique efficiently reduces the search space by randomizing the data set before outlier detection. It works well when the data are in random order, but it performs poorly on sorted data sets and when the data objects depend on one another, since the algorithm may have to traverse the complete data set to find the dependent objects.

Angiulli et al. [12] proposed Detecting Outliers Pushing objects into an Index (DOLPHIN), a method that works with disk-resident data sets. It is simple to implement and can work with any data type. Its I/O cost is that of reading the input data set file sequentially two times. Its running time is linear in the data set size, since it performs similarity search without pre-indexing the whole data set. Other researchers improved the efficiency of this method further by adopting spatial indexing, e.g., R-Trees and M-Trees, but these methods are sensitive to the number of dimensions.

III. SYSTEM ARCHITECTURE

A. System Overview

Fig. 1 gives a detailed view of the working of the system. The system works in two phases: in the first phase, k clusters of the input data set are formed using US-ELM, and in the second phase the outliers in each cluster are detected. Finally, the system gives the set of outliers as output.

The US-ELM algorithm is used for clustering in order to form good-quality clusters. The clusters are given as input to the outlier detection block, where a pruning technique is used to find the outliers in each cluster.

Fig. 1. Block diagram of the system

B. US-ELM Algorithm [13]

Input: Training data $\{x_i\}_{i=1}^{N}$, $X \in \mathbb{R}^{N \times d}$.
Output: The label vector of cluster indices $y \in \mathbb{N}^{N \times 1}$.

Steps:

Step 1: Construct the graph Laplacian $L$ from $X$.

Step 2: Construct an ELM network of $n_h$ mapping neurons and calculate the output matrix of the hidden neurons, $H \in \mathbb{R}^{N \times n_h}$, generated using the sigmoid function for each pattern and each hidden neuron.

Step 3:
1. If $n_h \le N$:
Find the generalized eigenvectors $v_2, v_3, \ldots, v_{n_o+1}$ of $A = I_{n_h} + \lambda H^T L H$ that correspond to the second through the $(n_o+1)$-th smallest eigenvalues.
Let $\beta = [\bar{v}_2, \bar{v}_3, \ldots, \bar{v}_{n_o+1}]$ be the matrix whose columns are the normalized eigenvectors $\bar{v}_i = v_i / \|H v_i\|$, $i = 2, 3, \ldots, n_o+1$.
2. Else, i.e., $n_h > N$:
Find the generalized eigenvectors $u_2, u_3, \ldots, u_{n_o+1}$ of $A = I_{N} + \lambda L H H^T$ that correspond to the second through the $(n_o+1)$-th smallest eigenvalues.
Let $\beta = H^T [\bar{u}_2, \bar{u}_3, \ldots, \bar{u}_{n_o+1}]$, where $\bar{u}_i = u_i / \|H H^T u_i\|$, $i = 2, 3, \ldots, n_o+1$, are the normalized eigenvectors.

Step 4: Calculate the embedding matrix $E = H\beta$.

Step 5: For clustering, treat each row of $E$ as a point and cluster the N patterns into K clusters using the K-means clustering algorithm. Let y be the label vector consisting of the cluster index of every pattern in the data set.

Step 6: Return the label vector y.

C. Outlier Detection

In this section the cluster-based outliers are first defined, and then the algorithm that computes the outliers in each cluster of the given data set is described.

1) Defining Cluster-Based (CB) Outliers: Let X be a data set of N points, each of d dimensions. A point x is denoted $x = \langle x[1], x[2], x[3], \ldots, x[d] \rangle$. The distance between two points $x_1$ and $x_2$ is the Euclidean distance,

\[ dis(x_1, x_2) = \sqrt{\sum_{i=1}^{d} \left( x_1[i] - x_2[i] \right)^2} \tag{1} \]

The US-ELM algorithm generates m clusters $C_1, C_2, C_3, \ldots, C_m$ for the given data set X. The centroid $C_i.center$ of each cluster $C_i$ is calculated component by component, for $j = 1, \ldots, d$, as

\[ C_i.center[j] = \frac{\sum_{x \in C_i} x[j]}{|C_i|} \tag{2} \]

2) Algorithm for CB Outlier Detection: According to the CB outlier definition given above, determining whether a point x in cluster C is an outlier requires a search for the k-nearest neighbors (kNNs) of x in C. To make this search efficient, the method is designed to prune the search space.

Suppose the points in cluster C have been sorted in ascending order of their distance from the cluster centroid. For a point x in C, the points are scanned to search for the kNNs of x. Let $nn^{temp}_k(x)$ denote the set of k points nearest to x among the points scanned so far, and let $kdis_{temp}(x)$ denote the maximum distance from the points in $nn^{temp}_k(x)$ to x. The pruning technique relies on the following theorems.

Theorem 1: For a point q in front of x, if $dis(q, C.center) < dis(x, C.center) - kdis_{temp}(x)$, then q and the points in front of q cannot be kNNs of x.

Theorem 2: For a point q at the back of x, if $dis(q, C.center) > dis(x, C.center) + kdis_{temp}(x)$, then q and the points at the back of q cannot be kNNs of x.

Both theorems follow from the triangle inequality, since $dis(q, x) \ge |dis(x, C.center) - dis(q, C.center)|$.

D. Mathematical Model

The system S accepts numeric data and detects outliers using the cluster-based approach. The proposed system S is defined as S = {I, F, O}, where:

I = {I1, I2, I3, I4} is the set of inputs.
I1 = Training data $\{x_i\}_{i=1}^{N}$, $X \in \mathbb{R}^{N \times d}$. X consists of N unlabeled training patterns, $x_i \in \mathbb{R}^{d}$.
I2 = Trade-off parameter.
I3 = Number of clusters.
I4 = Random variable.

O = {O1, O2, O3, O4, O5, O6, O7} is the set of outputs.
O1 = Graph Laplacian.
O2 = Parameters of the hidden mapping functions.
O3 = Output matrix of the hidden neurons.
O4 = Eigenvectors.
O5 = Embedding matrix.
O6 = Input data partitioned into K clusters.
O7 = Outliers from each cluster.

F = {F1, F2, F3, F4, F5, F6, F7} is the set of functions.
F1 = Constructs the graph Laplacian from the given input. F1(I1) → O1
F2 = Randomly generates the parameters of the hidden mapping functions from a continuous uniform distribution on the interval (-1, 1). F2(I4) → O2
F3 = Initializes an ELM network of $n_h$ neurons and calculates the output matrix of the hidden neurons. F3(O2) → O3
F4 = Calculates the eigenvalues and eigenvectors. F4(O1, O3, I2) → O4
F5 = Calculates the embedding matrix. F5(O3, O4) → O5
F6 = Forms clusters of the input data using the K-means algorithm. F6(O5, I3) → O6
F7 = Finds the outliers in each cluster using the pruning technique. F7(O6) → O7

Table I shows the functional dependency among the functions used.

TABLE I
FUNCTIONAL DEPENDENCY

      F1   F2   F3   F4   F5   F6   F7
F1    1    0    0    0    0    0    0
F2    0    1    0    0    0    0    0
F3    0    1    1    0    0    0    0
F4    1    0    1    1    0    0    0
F5    0    0    1    1    1    0    0
F6    0    0    0    0    1    1    0
F7    0    0    0    0    0    1    1

IV. SYSTEM ANALYSIS

A. Performance Measures

1) Cluster Quality: Cluster quality is one of the measures of this system; it indicates how correctly the class labels are predicted for each data point in the given data set. To compute the cluster accuracy, both the actual class label and the predicted class label of each point are considered.
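As an illustration of how actual and predicted labels can be compared, the sketch below maps each cluster to the majority class among its members and reports the fraction of points that match. This majority-vote mapping (usually called purity) is an assumption, since the text does not say how cluster indices are matched to class labels; the function name `cluster_accuracy` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_accuracy(true_labels, cluster_ids):
    """Fraction of points whose true class is the majority class of their cluster.

    Assumption: each cluster is mapped to its most frequent true class
    (purity). The source does not specify the mapping it uses.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("label sequences must not be empty")

    # Group the true labels of the points by the cluster they were assigned to.
    members = defaultdict(list)
    for true_label, cluster_id in zip(true_labels, cluster_ids):
        members[cluster_id].append(true_label)

    # In each cluster, the points carrying the majority class count as correct.
    correct = sum(Counter(labels).most_common(1)[0][1]
                  for labels in members.values())
    return correct / len(true_labels)


# Hypothetical toy data: six points, two true classes, two clusters.
true_labels = ["a", "a", "a", "b", "b", "b"]
cluster_ids = [0, 0, 1, 1, 1, 1]
accuracy = cluster_accuracy(true_labels, cluster_ids)
print(f"cluster accuracy: {accuracy:.3f}")
```

Because the mapping is many-to-one, this score never decreases when the number of clusters grows, so it is most meaningful when K equals the number of true classes.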

2) Computational Time: Computational time is the time required to perform both phases, i.e., clustering using US-ELM and outlier detection using the pruning technique. The computational time of the system will be measured against k (the number of nearest neighbors in the kNN search), the dimensionality d, and the size N of the data set.
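The cost of the second phase depends largely on how many distance computations the kNN search performs, so a minimal sketch of the pruned search from Section III-C is given here, with a counter for those computations. It is an illustration under stated assumptions, not the authors' implementation: the names `dis`, `centroid` and `pruned_knn` are hypothetical, the order in which points in front of and behind x are scanned (closest in centroid distance first) is an assumption, and the synthetic Gaussian cluster stands in for one cluster produced by US-ELM.

```python
import heapq
import math
import random


def dis(a, b):
    """Euclidean distance between two points of equal dimension (Eq. 1)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def centroid(points):
    """Component-wise mean of the points of one cluster (Eq. 2)."""
    return tuple(sum(p[j] for p in points) / len(points)
                 for j in range(len(points[0])))


def pruned_knn(points, k):
    """Return ([(x, sorted kNN distances of x)], number of distance computations)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    center = centroid(points)
    # Sort the cluster in ascending order of distance from the centroid.
    ranked = sorted(points, key=lambda p: dis(p, center))
    radius = [dis(p, center) for p in ranked]
    n = len(ranked)
    results, computed = [], 0
    for pos, x in enumerate(ranked):
        heap = []  # negated distances of the temporary kNNs (max-heap)
        front, back = pos - 1, pos + 1
        while front >= 0 or back < n:
            # kdis_temp(x) is only defined once k candidates have been seen.
            kdis_temp = -heap[0] if len(heap) == k else math.inf
            # Theorem 1: this point and everything in front of it is pruned.
            if front >= 0 and radius[front] < radius[pos] - kdis_temp:
                front = -1
            # Theorem 2: this point and everything behind it is pruned.
            if back < n and radius[back] > radius[pos] + kdis_temp:
                back = n
            # Assumed scan order: take the side closer to x in centroid distance.
            take_front = front >= 0 and (
                back >= n
                or radius[pos] - radius[front] <= radius[back] - radius[pos])
            if take_front:
                q, front = ranked[front], front - 1
            elif back < n:
                q, back = ranked[back], back + 1
            else:
                break
            d = dis(x, q)
            computed += 1
            if len(heap) < k:
                heapq.heappush(heap, -d)
            elif d < -heap[0]:
                heapq.heapreplace(heap, -d)
        results.append((x, sorted(-h for h in heap)))
    return results, computed


# Hypothetical stand-in for one US-ELM cluster: 200 Gaussian points in 2-D.
rng = random.Random(0)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
k = 5
neighbours, computed = pruned_knn(cluster, k)
print(computed, "distance computations with pruning;",
      len(cluster) * (len(cluster) - 1), "without")
```

Repeating the run while varying k, d and N, and recording either the counter or wall-clock time (for example with `time.perf_counter`), would give the kind of measurement this subsection describes.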
