Motivation:

Outlier detection is one of the most interesting areas of data mining. It has many applications, such as intrusion detection, medical anomaly detection, and sensor anomaly detection. Detecting outliers is challenging in newer data types such as data streams, spatio-temporal data, and time-series data, and effective and efficient methods are needed to tackle these challenges. Identifying and analyzing outliers in a given time series is important in many applications, because peaks are useful topological features of a time series. In power distribution data, peaks indicate sudden high demand. In server CPU utilization data, peaks indicate a sharp increase in workload. In network data, peaks correspond to bursts in traffic. In financial data, peaks indicate an abrupt rise in price or volume.

Outlier detection has long been used to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behavior, fraudulent behavior, instrument error, and similar causes. In this paper, we propose a method to identify such errors and remove their contaminating effect on the data set, thereby purifying the data for processing. Earlier outlier detection methods were arbitrary, but principled and systematic techniques are now used, drawn from the full gamut of computer science and statistics. We also present a survey of contemporary techniques for outlier detection, identify their respective motivations, and distinguish their advantages and disadvantages in a comparative review.

Outlier or anomaly detection is a general challenge for computer science, and it raises many difficulties that are hard to solve. In large systems, outlier detection is very important and affects the system as a whole. There are some effective algorithms that detect anomalies without causing any change in the system, but there is still a need for more cost-effective and faster algorithms to solve this problem. I therefore think that developing a system that can detect anomalies or outliers effectively and correctly, in a short period of time, will be very helpful for the field of data mining. I find this topic not only challenging but also extremely interesting and useful to do my research on.

Research Proposal:

Introduction:

Outlier detection is one of the most interesting areas in the context of Data Mining/Knowledge discovery. Outlier detection is also referred to as anomaly detection, event detection, novelty detection, deviant discovery, fault detection, intrusion detection, or misuse detection [GGAH14].

Moreover, a subtle difference between the definitions of outlier and anomaly is mentioned in [Agg13b, p. 4]:

“outlier refers to a data point, which could either be considered an abnormality or noise, whereas an anomaly refers to a special kind of outlier, which is of interest to an analyst.”

Figure 1: The spectrum from normal data to outliers [Agg13b]

Here, we will use the terms outlier and anomaly interchangeably. Some well-established definitions of outliers are:

“An outlying observation or ‘outlier’ is one that appears to deviate markedly from other members of the sample in which it occurs.” [Gru69]

“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.” [Haw80]

“an observation (or a set of observations) which appears to be inconsistent with the remainder of that set of data” [BL94]

These seemingly vague definitions cover a broad spectrum of outliers, which provides the opportunity to define outliers differently in various application domains. As a result, outlier detection is the process of effectively detecting outliers based on the particular definition adopted. It is therefore highly unlikely that a single general-purpose outlier detection technique can be found.

Several books provide an extensive overview of this field. [HKP11, Ch. 12] and [Agg15, Ch. 9-10] provide a broad overview of outlier detection, but the most comprehensive book on outlier detection is [Agg13b]. There are also several excellent surveys in the literature, such as [HA04, CBK09, KKZ09]. Some surveys focus on a particular domain: [ZMH10, MSME15] cover outlier detection methods for wireless sensor networks.

Figure 2: Taxonomy of outlier detection in WSN [ZMH10]

[CBK12] covers topics related to discrete sequences. [SG14] presents the research issues of outlier detection for data streams. For temporal/time-series data, [Fu11, EA12, GGAH14] provide a detailed overview of the topic.

Figure 3: Taxonomy of outlier detection in temporal data [GGAH14]

Moreover, [Gam10, Ch. 11] and [Agg13b, Ch. 8] provide an overview of outlier detection for time-series data streams.

In general, outlier detection techniques can be categorized into several groups: (i) statistical methods; (ii) nearest neighbor methods; (iii) classification methods; (iv) clustering methods; (v) information theoretic methods; and (vi) spectral decomposition methods [CBK09, ZMH10]. On the other hand, [KKZ09] categorizes outlier detection techniques into: (i) statistical tests; (ii) depth-based methods; (iii) deviation-based methods; (iv) distance-based methods; (v) density-based methods; and (vi) high-dimensional methods. Each method has its strengths and weaknesses, and choosing a method largely depends on the application domain. An anomaly detection problem has been identified as having four main aspects [CBK09]. First, the nature of the data, such as univariate vs. multivariate or discrete vs. continuous. Second, depending on the availability of data labels, an anomaly detection problem can be treated using a supervised, semi-supervised, or unsupervised method. Third, anomalies are divided into three types: point, contextual, and collective. Recently, a new type of anomaly called the contextual collective anomaly has been proposed in [JZXL14]. Finally, the output of an anomaly detection method is produced as scores or labels.

Recently, the research direction of outlier detection has been moving towards “Outlier Ensembles”, following the influential paper of the same title by Charu Aggarwal [Agg13c]. Moreover, [ZCS14] extends the research issues for outlier ensembles with a focus on unsupervised methods, and [MMA14] emphasizes using techniques from both supervised and unsupervised approaches to leverage the idea of outlier ensembles.

Literature Review:

Data Stream vs. Time-series:

Data streams have brought a new kind of setting to computing: processing a stream of data as opposed to static, multiple-access data. Data streams are temporally ordered, fast changing, and potentially infinite. Wireless sensor network traffic, telecommunications, online transactions in the financial market or retail industry, web click streams, video surveillance, and weather or environment monitoring are some sources of data streams. As these kinds of data cannot be stored in any kind of data repository, effective and efficient management and online analysis of data streams bring new challenges.

Knowledge discovery from data streams is a broad topic covered in several books, such as [Agg07, Gam10], [LRU14, Ch. 4], and [Agg15, Ch. 12]. As sensor data is one of the sources of data streams, an extensive analysis from this perspective can be found in [GGO+08, Agg13a].

In many application domains, a data stream includes a temporal attribute, where each data point has either an implicit or an explicit timestamp. Real-time sensor data, medical data, and mechanical system diagnosis data are such examples. These are also examples of time-series data. Traditionally, it is assumed that time-series data can be stored easily and that established offline analysis and mining methods can be applied. In a streaming setting, however, the focus shifts towards online data mining, and this requirement makes offline algorithms infeasible.

In [Agg13b, p. 260], it is noted that the problems of outlier detection in streaming time-series data and in multidimensional data streams are very different. The former requires the analysis of each series as a unit, whereas the latter requires the analysis of each multidimensional point as a unit.

Outlier detection in a time-series can be divided into two categories: values at specific time

stamps are classified as outliers because of sudden changes (contextual anomalies), or entire

time-series or large subsequences within a time series are classified as outliers because of their

unusual shapes (collective anomalies) [Agg13b, p. 227].

Accordingly, we use the terms time-series data stream and streaming time-series data interchangeably.

Time-series Data Stream:

We are motivated by three research issues raised in the context of data streams [SG14]:

“Research Issue 2- A data point has to be compared with the other data points with same temporal context (occurred within the time period which is semantically related to the timestamp of the data point).”

“Research Issue 6- An outlier detection technique for data streams should not assume any kind of fixed data distribution.”

“Research Issue 14- An outlier detection technique for multiple data streams should be able to compare data points with the same or different schemas in order to detect outliers.”

Change Detection in Data Stream:

Another important task in the processing of time-series data streams is change detection. For temporal data, the task of change detection is closely related to anomaly detection, but the two are different:

“It should be emphasized that change analysis and outlier detection (in temporal data) are very closely related areas, but not necessarily identical” [Agg13b, p. 25].

Figure 4: Different types of change [GŽB+14]

The following modes of change have been identified in the literature: concept drift (gradual change) and concept shift (abrupt change). [Gam10, Ch. 3] and [Agg07, Ch. 5] each devote a separate chapter to change detection for data streams. Detecting concept drift is more difficult than detecting concept shift. [SG09, GŽB+14] provide an extensive overview of detecting concept change. In contrast with anomaly detection, concept drift detection compares two distributions, rather than comparing a given data point against a model prediction. Here, a sliding window of the most recent examples is usually maintained, which is then compared against the learned hypothesis or performance indicators, or even just a previous time window. Much of the difference between the algorithms below lies in the way the sliding windows of recent examples are maintained and in the types of statistical tests performed (except for CVFDT), though some algorithms, notably the ADWIN family, allow different statistical tests to be used. In particular, statistical tests range from a comparison of the means of old and new data, to order statistics [KBDG04], sequential hypothesis testing [MvdBW07], velocity density estimation [Agg03], the density test method [SWJR07], and Kullback-Leibler (KL) divergence [DKVY06]. Many of the results specifically address multidimensional data. Different tests are suitable for different situations; [DKP11] compares the applicability of several of the above-mentioned tests.
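As an illustration of the simplest test mentioned above, a comparison of the means of old and new data, the following minimal sketch compares a reference window with a recent window and flags a change when the difference of means exceeds a Hoeffding-style threshold. It is not ADWIN or any of the cited algorithms: the two fixed windows, the function names, the confidence parameter `delta`, and the assumption that values lie in a known range `value_range` are all choices made for this example.

```python
import math

def hoeffding_epsilon(n_old, n_new, delta, value_range=1.0):
    """Threshold on |mean_old - mean_new| from Hoeffding's inequality.

    Assumes every value lies in an interval of width value_range
    (an assumption of this sketch, not of the cited methods).
    """
    m = 1.0 / (1.0 / n_old + 1.0 / n_new)
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * m))

def mean_shift_detected(reference, recent, delta=0.01, value_range=1.0):
    """Return True if the two windows differ in mean beyond the threshold."""
    eps = hoeffding_epsilon(len(reference), len(recent), delta, value_range)
    mean_old = sum(reference) / len(reference)
    mean_new = sum(recent) / len(recent)
    return abs(mean_old - mean_new) > eps

# Usage: two windows of 50 points each, values assumed to lie in [0, 1].
changed = mean_shift_detected([0.2] * 50, [0.8] * 50)
unchanged = mean_shift_detected([0.5] * 50, [0.5] * 50)
```

A smaller `delta` makes the test more conservative, which lowers false alarms at the price of a longer detection delay.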

The following is a sample of algorithms for detecting concept drift. Publicly available implementations exist for some of them: in particular, the MOA software environment for online learning from evolving data streams (http://moa.cms.waikato.ac.nz/) incorporates the ADWIN family of algorithms mentioned below.

1. CUSUM/PH test: Probably the oldest algorithm for change detection, CUSUM maintains a cumulative sum of (adjusted) examples seen so far: $g_0 = 0$ and $g_t = \max(0, g_{t-1} + (r_t - v))$ in its simplest form (assuming only positive change). Whenever the cumulative sum $g_t$ exceeds a given threshold, a change is detected. A similar idea with a different cumulative variable is used in the Page-Hinkley (PH) test.

2. CVFDT: The CVFDT algorithm [HSD01] is an early algorithm that proposed an incremental approach for building and maintaining a decision tree (Hoeffding tree) in the face of changes or concept drift occurring in a data stream environment. The algorithm does not need an external classifier: it checks the incoming data against the decision tree it is maintaining, and when that tree no longer adequately describes the data, it switches to an alternative tree. A number of implementations are available.

3. ADWIN: A common theme among change detection algorithms is maintaining a sliding window of new or relevant data. Bifet and Gavalda [BG07] proposed an adaptive windowing scheme called ADWIN; a second version, ADWIN2, is now available, as well as a version with a Kalman filter. In ADWIN, the detection of change is based on statistical methods, in particular on the use of the Hoeffding bound. An implementation of ADWIN is available at http://adaptive-mining.sourceforge.net/?page_id=20; ADWIN and k-ADWIN are incorporated into MOA (http://moa.cms.waikato.ac.nz/).

4. OnePassSampler: Recently, a faster algorithm named OnePassSampler has been proposed [SPK13]. This algorithm does not perform the extensive within-window comparisons of ADWIN; instead, it uses a sequential hypothesis testing strategy. The statistical test involves computing sample means and using the Bernstein bound to estimate the error. It appears to perform well in terms of false positive and true positive rates; however, its detection delay is higher.
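To make the CUSUM recursion in item 1 concrete, the following is a minimal sketch of the one-sided (positive change) form. Two things are assumptions of this example rather than statements from the cited literature: the interpretation of `v` as an allowance subtracted from each observation, and resetting the sum to zero after each alarm.

```python
def cusum_positive(values, v, threshold):
    """One-sided CUSUM: return the indices at which an alarm is raised.

    values    : observed sequence r_1, r_2, ...
    v         : allowance subtracted from each observation (assumed meaning)
    threshold : alarm level for the cumulative sum g_t
    """
    g = 0.0
    alarms = []
    for t, r in enumerate(values):
        # g_t = max(0, g_{t-1} + (r_t - v))
        g = max(0.0, g + (r - v))
        if g > threshold:
            alarms.append(t)
            g = 0.0  # restart after an alarm (a choice made for this sketch)
    return alarms

# Usage: a level shift from 0 to 5 starting at index 4.
alarms = cusum_positive([0, 0, 0, 0, 5, 5, 5], v=1, threshold=6)
```

With these inputs the sum reaches 4 at index 4 and 8 at index 5, so the single alarm is raised at index 5, one step after the shift begins.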

Proposed Research Methodology:

Contextual (Point) Anomaly Detection Framework:

Input: A univariate time-series data stream $X = \{x_1, x_2, x_3, \ldots, x_{t-1}, x_t, \ldots\}$, where each measurement has an explicit or implicit timestamp associated with it.

Output: Decide whether $x_{t+1}$ is an anomaly (based on the definition of anomaly for the specific domain).

Assumptions: i) No ground truth is available, which makes supervised techniques less applicable.

ii) Near real-time anomaly detection is needed, which makes offline methods infeasible. That is, detection for $x_{t+1}$ must be completed before the arrival of $x_{t+2}$.

iii) We consider domains where the data arrival rate is within a certain limit, which makes the second assumption fairly relaxed.

Contextual anomaly detection methods for the aforementioned setting are typically deviation based [Agg13b, p. 229]. However, we are interested in using a non-parametric statistical method within a sliding window for online anomaly detection. Moreover, we are interested in using an external change detection mechanism to detect concept drift (gradual change), so that we can adapt to changes in the underlying data distribution when detecting anomalies.

Unified techniques for change point and outlier detection are presented in [TY06, KS09, SZLH13], while the use of a change detection mechanism for outlier detection is presented in [BPŽ+09, PBŽ+10]. The primary motivation of that work, however, was not anomaly detection but better model prediction in the presence of concept drift (further review needed). We are interested in adapting the general framework for model prediction in [PBŽ+10] with slight modification:

Input: $X = \{x_1, x_2, x_3, \ldots, x_{t-1}, x_t, \ldots\}$.

1) Use ADWIN2 [BG07] to detect the concept change point $c$ (issues: replacing outliers and normalization).

2) Learn the model $F(x)$ from $X_{new} = \{x_c, \ldots, x_t\}$.

3) Different values of the confidence parameter $\delta$ may be used for ensembles.
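The control flow of the first two steps can be sketched as a loop that scores each incoming point, asks a change detector for a change point, and refits the model on the data since that point. This is only an illustration under stated assumptions: the change detector, model, and anomaly test are passed in as callables, and the `toy_*` functions below are hypothetical stand-ins (a crude mean-shift check and a median/standard-deviation model), not ADWIN2 and not the model of [PBŽ+10].

```python
import statistics

def run_adaptive_detector(stream, detect_change, fit_model, is_anomaly, min_train=20):
    """Score each point against a model fitted on data since the last change point."""
    window, model, flagged = [], None, []
    for t, x in enumerate(stream):
        # Score x_t with the model learned from earlier points only.
        if model is not None and is_anomaly(model, x):
            flagged.append(t)
        window.append(x)
        c = detect_change(window)      # index of the change point, or None
        if c is not None:
            window = window[c:]        # keep X_new = {x_c, ..., x_t}
            model = None               # the old model no longer applies
        if len(window) >= min_train:
            model = fit_model(window)  # refit on every step (simple, not efficient)
    return flagged

def toy_detect_change(window, k=10, jump=20.0):
    """Hypothetical stand-in for ADWIN2: mean of last k points vs. the rest."""
    if len(window) < 2 * k:
        return None
    old, new = window[:-k], window[-k:]
    if abs(statistics.fmean(new) - statistics.fmean(old)) > jump:
        return len(window) - k
    return None

def toy_fit(data):
    """Hypothetical model: (median, standard deviation) of the current window."""
    return statistics.median(data), statistics.pstdev(data) or 1.0

def toy_is_anomaly(model, x, z_max=4.0):
    center, spread = model
    return abs(x - center) / spread > z_max

# Usage: a spike at index 30, then a level shift to about 100 from index 41.
stream = [0.0, 1.0] * 15 + [50.0] + [0.0, 1.0] * 5 + [100.0, 101.0] * 15
flagged = run_adaptive_detector(stream, toy_detect_change, toy_fit, toy_is_anomaly)
```

In this toy run the spike is flagged, the first points of the new level are flagged until the change is detected, and after the model is refitted on the new level nothing further is flagged. Because flagged points still enter the window, they inflate the fitted spread, which is one way to see why the issue of replacing outliers is noted in step 1.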

That is, our outlier detection framework will be:

1) Remove obvious outliers from $X_{new}$ using a Z-value test (or another suitable method) to make the next model more robust [Agg13b, p. 125].

2) Apply a non-parametric statistical method such as Kernel Density Estimation (KDE) [Sil86] to detect anomalies.

3) Use only the first window of data as the training set to model normal behavior with respect to the context (within the window).

4) The general KDE algorithm has $O(n^2)$ computational complexity, but once the model is learned, the computational cost of outlier detection for each item is very low. A more efficient method may still be needed.
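The four steps above can be sketched with the standard library alone. This is a minimal illustration, not a definitive implementation of the proposed framework: the Gaussian kernel, the rule-of-thumb bandwidth $1.06\,\sigma\,n^{-1/5}$, the Z-value cut-off of 3, and the choice of the density threshold as a low quantile of the training densities are all assumptions made for the example.

```python
import math
import statistics

def remove_obvious_outliers(window, z_max=3.0):
    """Step 1: drop points whose Z-value exceeds z_max (assumed cut-off)."""
    mu = statistics.fmean(window)
    sigma = statistics.pstdev(window)
    if sigma == 0:
        return list(window)
    return [x for x in window if abs(x - mu) / sigma <= z_max]

def normal_reference_bandwidth(data):
    """Rule-of-thumb bandwidth 1.06 * sigma * n^(-1/5) for a Gaussian kernel."""
    h = 1.06 * statistics.pstdev(data) * len(data) ** (-0.2)
    return h if h > 0 else 1.0  # arbitrary fallback for constant data (assumption)

def kde_density(x, data, h):
    """Step 2: Gaussian kernel density estimate at x, O(n) per query."""
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

def fit_kde_detector(first_window, z_max=3.0, quantile=0.05):
    """Step 3: learn normal behavior from the first window only."""
    clean = remove_obvious_outliers(first_window, z_max)
    h = normal_reference_bandwidth(clean)
    # Step 4: scoring every training point against all others is the O(n^2) part.
    train_densities = sorted(kde_density(x, clean, h) for x in clean)
    threshold = train_densities[int(quantile * (len(train_densities) - 1))]
    return clean, h, threshold

def is_anomaly(x, detector):
    """Flag x when its estimated density falls below the learned threshold."""
    clean, h, threshold = detector
    return kde_density(x, clean, h) < threshold

# Usage: train on a window of values between 10.0 and 10.9, then score new points.
window = [10.0 + 0.1 * (i % 10) for i in range(100)]
detector = fit_kde_detector(window)
far_point_flagged = is_anomaly(20.0, detector)
central_point_flagged = is_anomaly(10.45, detector)
```

Each query still sums over all training points, so a long first window makes per-item scoring linear in the window size; this is where a more efficient density estimator would help.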

We see the following issues and questions for research.

The research questions are:

• How can the system be made more accurate?

• How can the system be made more efficient?

• How does the system differ from other algorithms?

• How should changes that appear over time be handled?

The sub-questions are:

• Would the system be user friendly?

• Would the system be cost effective?

• Would the system be able to find exactly correct results?

References:

[Agg03] Charu C Aggarwal. A framework for diagnosing changes in evolving data streams. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 575-586. ACM, 2003.

[Agg07] Charu C Aggarwal. Data streams: models and algorithms, volume 31. Springer, 2007.

[Agg13a] Charu C Aggarwal. Managing and mining sensor data. Springer Science & Business Media, 2013.

[Agg13b] Charu C Aggarwal. Outlier analysis. Springer Science & Business Media, 2013.

[Agg13c] Charu C Aggarwal. Outlier ensembles: position paper. ACM SIGKDD Explorations Newsletter, 14(2):49-58, 2013.

[Agg15] Charu C Aggarwal. An introduction to data mining. In Data Mining, pages 1-26. Springer, 2015.

[BG07] Albert Bifet and Ricard Gavalda. Learning from time-changing data with adaptive

windowing. In SDM, volume 7, page 2007. SIAM, 2007.

[BL94] Vic Barnett and Toby Lewis. Outliers in statistical data, volume 3. Wiley New

York, 1994.

[BPŽ+09] Jorn Bakker, Mykola Pechenizkiy, I Žliobaitė, Andriy Ivannikov, and Tommi Karkkainen. Handling outliers and concept drift in online mass flow prediction in cfb boilers. In Proceedings of the Third International Workshop on Knowledge Discovery from Sensor Data, pages 13-22. ACM, 2009.

[CBK09] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A

survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.

[CBK12] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection for discrete sequences: A survey. Knowledge and Data Engineering, IEEE Transactions on, 24(5):823-839, 2012.

[DKP11] Tamraparni Dasu, Shankar Krishnan, and Gina Maria Pomann. Robustness of change detection algorithms. In Advances in Intelligent Data Analysis X, pages 125-137. Springer, 2011.

[DKVY06] Tamraparni Dasu, Shankar Krishnan, Suresh Venkatasubramanian, and Ke Yi. An information-theoretic approach to detecting changes in multi-dimensional data streams. In Proc. Symp. on the Interface of Statistics, Computing Science, and Applications, 2006.

[EA12] Philippe Esling and Carlos Agon. Time-series data mining. ACM Computing

Surveys (CSUR), 45(1):12, 2012.

[Fu11] Tak-chung Fu. A review on time series data mining. Engineering Applications of Artificial Intelligence, 24(1):164-181, 2011.

[Gam10] João Gama. Knowledge Discovery from Data Streams. Chapman and Hall / CRC Data Mining and Knowledge Discovery Series. CRC Press, 2010.

[GGAH14] Manish Gupta, Jing Gao, Charu Aggarwal, and Jiawei Han. Outlier detection for temporal data. Synthesis Lectures on Data Mining and Knowledge Discovery, 5(1):1-129, 2014.

[GGO+08] Auroop R Ganguly, Joao Gama, Olufemi A Omitaomu, Mohamed Gaber, and

Ranga Raju Vatsavai. Knowledge discovery from sensor data. CRC Press, 2008.

[Gru69] Frank E Grubbs. Procedures for detecting outlying observations in samples. Technometrics, 11(1):1-21, 1969.

[GŽB+14] João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4):44, 2014.

[HA04] Victoria J Hodge and Jim Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85-126, 2004.

[Haw80] Douglas M Hawkins. Identification of outliers, volume 11. Springer, 1980.

[HKP11] Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition, 2011.

[HSD01] Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 97-106. ACM, 2001.

[JZXL14] Yexi Jiang, Chunqiu Zeng, Jian Xu, and Tao Li. Real time contextual collective

anomaly detection over multiple data streams. 2014.

[KBDG04] Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30, pages 180-191, 2004.

[KKZ09] Hans-Peter Kriegel, Peer Kroger, and Arthur Zimek. Outlier detection techniques. In Tutorial at the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2009.

[KS09] Yoshinobu Kawahara and Masashi Sugiyama. Change-point detection in time-series data by direct density-ratio estimation. In SDM, volume 9, pages 389-400. SIAM, 2009.

[LRU14] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2014.

[MMA14] Barbora Micenková, Brian McWilliams, and Ira Assent. Learning outlier ensembles: The best of both worlds - supervised and unsupervised. 2014.

[MSME15] Dylan McDonald, Stewart Sanchez, Sanjay Madria, and Fikret Ercal. A survey of methods for finding outliers in wireless sensor networks. Journal of Network and Systems Management, 23(1):163-182, 2015.

[MvdBW07] S Muthukrishnan, Eric van den Berg, and Yihua Wu. Sequential change detection on data streams. In Data Mining Workshops, 2007. ICDM Workshops 2007. Seventh IEEE International Conference on, pages 551-550. IEEE, 2007.

[PBŽ+10] Mykola Pechenizkiy, Jorn Bakker, I Žliobaitė, Andriy Ivannikov, and Tommi Karkkainen. Online mass flow prediction in cfb boilers with explicit detection of sudden concept drift. ACM SIGKDD Explorations Newsletter, 11(2):109-116, 2010.

[SG09] Raquel Sebastiao and Joao Gama. A study on change detection methods. In 4th Portuguese Conf. on Artificial Intelligence, Lisbon, 2009.

[SG14] Shiblee Sadik and Le Gruenwald. Research issues in outlier detection for data streams. ACM SIGKDD Explorations Newsletter, 15(1):33-40, 2014.

[Sil86] Bernard W Silverman. Density estimation for statistics and data analysis, volume

26. CRC press, 1986.

[SPK13] Sripirakas Sakthithasan, Russel Pears, and Yun Sing Koh. One pass concept change detection for data streams. In Advances in Knowledge Discovery and Data Mining, pages 461-472. Springer, 2013.

[SWJR07] Xiuyao Song, Mingxi Wu, Christopher Jermaine, and Sanjay Ranka. Statistical change detection for multi-dimensional data. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 667-676. ACM, 2007.

[SZLH13] Wei-xing Su, Yun-long Zhu, Fang Liu, and Kun-yuan Hu. On-line outlier and change point detection for time series. Journal of Central South University, 20:114-122, 2013.

[TY06] Jun-ichi Takeuchi and Kenji Yamanishi. A unifying framework for detecting outliers and change points from time series. Knowledge and Data Engineering, IEEE Transactions on, 18(4):482-492, 2006.

[ZCS14] Arthur Zimek, Ricardo JGB Campello, and Jorg Sander. Ensembles for unsupervised outlier detection: challenges and research questions a position paper. ACM SIGKDD Explorations Newsletter, 15(1):11-22, 2014.

[ZMH10] Yang Zhang, Nirvana Meratnia, and Paul Havinga. Outlier detection techniques for wireless sensor networks: A survey. Communications Surveys & Tutorials, IEEE, 12(2):159-170, 2010.
