Clustering of High Dimensional Data Streams by Implementing
HPStream Method
Abstract:
Clustering is an important task in mining evolving data streams because a data stream produces a continuous and potentially unbounded sequence of data points [1]. Such streams collect data from many different devices. Streaming data is also naturally high dimensional [1]. High dimensional data sets are frequently very large and may contain outliers, which makes such streaming data an important issue in the data mining process. High dimensional data is also harder to handle in classification, clustering and similarity search. Recently, DBSTREAM, single-scan and subspace methods have been used to find projected clusters in high dimensional data sets. These methods are difficult to generalize to high dimensional data streams because of the huge volume of data generated automatically by the simple transactions of daily life.
This paper describes a high dimensional data stream clustering technique known as HPStream. The technique combines a fading cluster structure with projection-based clustering. It is continuously updatable, it scales in both the number of dimensions and the volume of the data stream, and it produces higher quality clusters than previous data stream techniques.
1. Introduction:
Data streams have gained importance in recent years because of advances in hardware technology. These advances have made it easy to store and record numerous transactions. Many companies now amass an enormous quantity of data, typically generated continuously as a sequence of events and coming from distinct locations. Telephone call logs, bank card transactions, sensor network data and network event logs are just some examples of data streams. The presence of data streams in a number of practical domains has generated a variety of research in this area [8, 10]. One of the important problems recently studied in the data stream domain is clustering [7]. The clustering problem is particularly interesting for data streams because of its applications to data summarization and outlier detection.
The clustering problem is defined as follows: for a given set of data points, partition them into one or more groups of similar data points, where the notion of similarity is defined by a distance function. A lot of research work has been devoted to scalable cluster analysis in recent years [2, 6]. In the data stream domain, the clustering problem requires a technique that can continuously determine the dominant clusters in the data without being dominated by the previous history of the stream.
The high-dimensional case presents a special challenge to clustering algorithms even in the traditional domain of static data sets. This is due to the sparsity of the data in the high-dimensional case. In high-dimensional space, all pairs of points tend to be nearly equidistant from each other. As a result, it is often unrealistic to define distance-based clusters in a meaningful way. Some recent works on high-dimensional data use projected clustering techniques that can determine clusters for a specific subset of dimensions [3, 6]. In those techniques, the clusters are defined so that each cluster is specific to a particular group of dimensions. This alleviates the sparsity problem in high-dimensional space to some extent. Even though a cluster may not be meaningfully defined on all the dimensions because of the sparsity of the data, some subset of the dimensions can always be found on which a particular subset of points forms high quality and meaningful clusters. Of course, those subsets of dimensions may vary from one cluster to another. Such clusters are called projected clusters [2].
The idea of a projected cluster is formally defined as follows. Assume that k is the number of clusters to be discovered. In addition, the algorithm takes as input the dimensionality l of the subspace in which each cluster is reported. The output of the algorithm is twofold:
A (k + 1)-way partition $\{C_1, \ldots, C_k, O\}$ of the data, such that the points in each partition element except the last form a cluster, while the points in the last partition element are the outliers, which by definition do not cluster well.
A possibly different set $E_i$ of dimensions for each cluster $C_i$, $1 \le i \le k$, such that the points in $C_i$ cluster well in the subspace defined by those vectors. (The vectors for the outlier set $O$ can be assumed to be the empty set.) For each cluster $C_i$, the cardinality of the corresponding set $E_i$ is equal to the user-defined parameter l.
In the context of a data stream, the problem of finding projected clusters becomes even more challenging. This is because the additional problem of finding the relevant set of dimensions for each cluster makes the problem considerably more computationally intensive in the data stream environment. While the problem of clustering has recently been studied in the data stream environment [3, 11], these methods address full dimensional clustering. This paper addresses the significantly harder problem of clustering high-dimensional data streams by exploring projected clustering methods. Current projected clustering methods such as those discussed in [2] cannot be easily generalized to the data stream problem because they typically require multiple passes over the data. Moreover, the algorithms in [2] are too computationally intensive for the data stream problem. Further, data streams evolve quickly over time [4, 5], so it is important to design techniques that adjust effectively to the progression of the stream.
2. Background:
Density based clustering algorithms are well suited to data mining applications. These methods use a local criterion and define clusters as the regions of the data space with higher density than the regions of noise points or marginal points. The data points can be distributed arbitrarily within these high density regions, so the clusters may have arbitrary size and shape. A common way to find the areas of high density is to identify grid cells of high density by partitioning each dimension into non-overlapping partitions or grids.
The earliest density based clustering method is DBSCAN [4]. It is based on the notion of a density region. A point is called a 'core object' if, within a given radius ($\varepsilon$), the neighborhood of this point contains at least a minimum number (MinPts) of objects. A core object is the starting point of a cluster, so a cluster can be built around it. Density based clustering algorithms that use the DBSCAN notion can find clusters of arbitrary size and shape. Fig. 1(a) [2] shows clusters constructed with the density notion when the number of objects is greater than 10, and Fig. 1(b) shows a case in which no cluster is constructed.
Fig 1: Clusters by density notion, panels (a) and (b) [2]
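The core object test described above can be sketched in a few lines of Python. This is only an illustration, not code from DBSCAN itself: the function name `is_core_object` and the sample points are hypothetical, Euclidean distance is assumed, and the point itself is counted as part of its own neighborhood.

```python
import math

def is_core_object(point, points, eps, min_pts):
    """Return True if at least min_pts points (the point itself included)
    lie within distance eps of `point`. Euclidean distance is assumed."""
    neighbours = sum(1 for q in points if math.dist(point, q) <= eps)
    return neighbours >= min_pts

# Hypothetical sample: four points packed together and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]

print(is_core_object((0.0, 0.0), points, eps=0.2, min_pts=4))
print(is_core_object((5.0, 5.0), points, eps=0.2, min_pts=4))
```

The dense point qualifies as a core object, while the isolated point does not, so no cluster would be grown from it.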
Density based approaches are commonly used to discover clusters in high dimensional data. These approaches search for the probable subspaces of high density and then for the clusters hidden in those subspaces. A subspace is defined as dense if it contains at least a threshold number of data points within a given radius. The SUBCLU [7] algorithm is the first subspace clustering extension of DBSCAN for clustering high dimensional data.
Cluster partitions on evolving stream data are often computed over time periods. Here the problem of clustering a data stream is considered in the window model, in which the weight of each data point decreases exponentially with time t via a fading function $f(t) = 2^{-\lambda t}$ [3], where $\lambda > 0$. The exponentially fading function is widely used in temporal applications where it is desirable to gradually discount the history of past behavior. The higher the value of $\lambda$, the lower the importance of past data compared with more recent data. The overall weight of the data stream is a constant
$W = v \left( \sum_{t=0}^{t_c} 2^{-\lambda t} \right) = \frac{v}{1 - 2^{-\lambda}}$, where $t_c$ ($t_c \to \infty$) is the current time, and v denotes the speed of the stream, i.e., the number of points that arrive in one unit of time.
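As a quick numerical check of the constant total weight, the following Python sketch sums the faded weights of a stream and compares the result with the closed form of the geometric series. The function names and the sample values (1000 points per time unit, a decay factor of 0.25) are assumptions chosen for illustration.

```python
def fading_weight(age, lam):
    """Weight of a point that arrived `age` time units ago: 2^(-lam * age)."""
    return 2.0 ** (-lam * age)

def total_stream_weight(v, lam, t_c):
    """Sum of the faded weights when v points arrive per time unit up to t_c."""
    return v * sum(fading_weight(t, lam) for t in range(t_c + 1))

def limiting_weight(v, lam):
    """Closed form of the geometric series as t_c grows without bound."""
    return v / (1.0 - 2.0 ** (-lam))

v, lam = 1000, 0.25
for t_c in (10, 100, 400):
    print(t_c, total_stream_weight(v, lam, t_c), limiting_weight(v, lam))
```

The finite sum approaches the closed form quickly, which is why the overall weight can be treated as a constant once the stream has been running for a while.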
2.1 High-dimensional Data
The intersection area becomes an intersection volume when there are more than two dimensions. Most clustering algorithms have trouble with high dimensional data; this is the curse of dimensionality. As the number of dimensions in a data stream increases, distance measures become increasingly meaningless. Additional dimensions spread out the points until, in very high dimensions, they are almost equidistant from each other. Figure 2 [3] illustrates how additional dimensions spread out the points in a sample dataset. The dataset consists of 20 points randomly placed between 0 and 2 in each of three dimensions. Figure 2(a) [3] shows the data projected onto one axis. The points are close together, with about half of them in a one unit sized bin. Figure 2(b) [3] shows the same data stretched into the second dimension. Adding another dimension spreads the points out along another axis, pulling them further apart. Now only about a quarter of the points fall into a unit sized bin. Figure 2(c) [3] adds a third dimension, which spreads the data further apart. A one unit sized bin now holds only about one eighth of the points.
Fig 2: The curse of dimensionality, panels (a), (b) and (c). Data in one dimension is quite tightly packed. Adding a dimension stretches the points across that dimension, pushing them further apart. Additional dimensions spread the data even further, making high dimensional data extremely sparse. [3]
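The effect shown in Figure 2 can be reproduced with a short Python simulation. This is a sketch under stated assumptions: points are drawn uniformly between 0 and 2, the unit bin is taken to be [0, 1) on every axis, and 20000 points are used instead of 20 so that the measured fractions are stable. The function name is hypothetical.

```python
import random

def fraction_in_unit_bin(points, dims):
    """Fraction of points whose first `dims` coordinates all fall in [0, 1)."""
    inside = sum(1 for p in points if all(0.0 <= x < 1.0 for x in p[:dims]))
    return inside / len(points)

rng = random.Random(0)  # fixed seed so that repeated runs agree
points = [tuple(rng.uniform(0.0, 2.0) for _ in range(3)) for _ in range(20000)]

for d in (1, 2, 3):
    print(d, round(fraction_in_unit_bin(points, d), 3))
```

The printed fractions should come out close to one half, one quarter and one eighth, matching the description of the three panels.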
2.2 Application:
High dimensional clustering is especially effective in domains where one can expect to find relationships across a variety of perspectives. Some areas where high dimensional clustering has great potential are information integration systems, text mining, bioinformatics and image recognition.
• Information Integration Systems: Query optimization becomes a complex problem because the data is not centralized. The decentralization of data poses a difficult challenge for information integration systems, mainly in determining the best subset of sources to use for a given user query. An exhaustive search over all the sources would be naive and costly. Subspace clustering has been applied in the context of query optimization for Bibfinder, an information integration system developed at ASU [7].
• Web Text Mining: A fundamental problem with organizing web sources is that web pages are not machine readable, meaning their contents convey semantic meaning only to a human user. In addition, semantic heterogeneity is a major challenge: a keyword in one domain may hold a different meaning in another domain, which makes information sharing and interoperability between heterogeneous systems difficult. In order to automate the process, the key concepts of a domain must be learned. Subspace clustering has major strengths that help it learn these concepts.
• DNA Microarray Analysis: DNA microarrays are an exciting new technology with the potential to increase our understanding of complex cellular mechanisms. Microarray datasets provide information on the expression levels of thousands of genes under hundreds of conditions. For example, a lymphoma dataset can be interpreted as 100 cancer profiles with 4000 features, where each feature is the expression level of a specific gene. Patterns in the data reveal information about genes whose products function together in pathways that perform complex functions in the organism. The study of these pathways and their relationships to one another can be used to build a complete model of the cell and its functions, bridging the gap between genetic maps and living organisms with the help of high dimensional data clustering methods.
• Image recognition: Suppose there are n images, each with a resolution of m pixels by k pixels. Each pixel within an image is defined as one variable, so that each of the n images resides in an m x k dimensional space. A training set of images is then used to recognize new faces. High dimensional data clustering methods can be used to represent the training and new images with fewer dimensions, depending on the application and the images.
3. HPStream Method:
Micro-cluster-based data stream clustering algorithms use the density inside each micro-cluster (MC) as a form of weight (e.g., the number of points assigned to the MC). Re-clustering uses only the distances between the MCs and their weights. MCs that are close to each other are merged into a single cluster based on the MC centers and their weights. This is true even if a density-based algorithm like DBSTREAM [4] is used for re-clustering. The density in the region between MCs is not available because it is not retained during the online stage.
This paper implements a high-dimensional projected stream clustering method that continuously refines the set of projected dimensions and data points as the stream progresses. The method is called HPStream, for High-dimensional Projected Stream clustering. The set of dimensions associated with each cluster is updated in such a way that the points and dimensions of each cluster can evolve effectively over time. To achieve this goal, a condensed representation of the statistics of the points in each cluster is used. These condensed representations are chosen so that they can be updated efficiently in a fast data stream. At the same time, enough information is stored that the important measures of a cluster in a given projection can be computed quickly. The fading cluster structure also performs the updates in such a way that older data is temporally discounted. This ensures that in an evolving data stream, the past history is gradually discounted from the computation. HPStream thus brings projected clustering and the fading cluster structure to data streams.
Algorithms:
Algorithm for clustering High Dimensional Data Streams
Algorithm 1: HPStream (Data Stream Point: X, Cluster Structures: FCS,
Dimensionality Vector Sets: BS, Dimensionality: l);
begin
{Assume that FCS contains the relevant cluster structures denoted by FCS = {FC(C1, t) . . . FC(Cr, t) . . . }}
{Assume that BS contains the relevant cluster dimensions denoted by BS = {B(C1) . . . B(Cr) . . . }}
Receive the next data point X at current time t from stream DS;
BS = ComputeDimensions(FCS, l, X);
for r = 1 to |FCS| do
  dist(X, FC(Cr, t)) = ProjectedDist(FC(Cr, t), B(Cr), X);
index = argmin_r {dist(X, FC(Cr, t))};
s = FindLimitingRadius(FC(Cindex, t), B(Cindex));
if dist(X, FC(Cindex, t)) > s
then set index = |FCS| + 1 and add new fading cluster structure C|FCS|+1 with a solitary data point to FCS;
else add X to FC(Cindex, t);
Remove those clusters from FCS which have zero dimensions assigned to them;
if |FCS| > k
then delete the least recently added cluster in FCS;
end;
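One way to read the assignment step of Algorithm 1 is sketched below in Python. The sketch makes several assumptions that are not stated in the text: each cluster is summarized by a centroid, its projected dimensions are given as a list of indices, the projected distance is the root mean squared difference over those dimensions, and the limiting radii are supplied by the caller. All function names are hypothetical.

```python
import math

def projected_distance(x, centroid, dims):
    """Root mean squared difference between x and the centroid over the
    projected dimensions only (an assumed distance, chosen for illustration)."""
    if not dims:
        return math.inf
    sq = sum((x[j] - centroid[j]) ** 2 for j in dims)
    return math.sqrt(sq / len(dims))

def assign_point(x, centroids, dim_sets, limiting_radii):
    """Return the index of the cluster that absorbs x, or len(centroids)
    when x lies beyond the limiting radius and should start a new cluster."""
    if not centroids:
        return 0
    dists = [projected_distance(x, c, d) for c, d in zip(centroids, dim_sets)]
    index = min(range(len(dists)), key=dists.__getitem__)
    if dists[index] > limiting_radii[index]:
        return len(centroids)
    return index

# Hypothetical state: two clusters, each defined on two of three dimensions.
centroids = [(0.0, 0.0, 0.0), (10.0, 10.0, 10.0)]
dim_sets = [[0, 1], [1, 2]]
limiting_radii = [1.0, 1.0]

print(assign_point((0.5, 0.5, 9.0), centroids, dim_sets, limiting_radii))
print(assign_point((5.0, 5.0, 5.0), centroids, dim_sets, limiting_radii))
```

The first point is close to cluster 0 on that cluster's own dimensions even though its third coordinate is far away, which is the behavior projected clustering is meant to give. The second point is outside both limiting radii, so the returned index signals that a new cluster is needed.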
Algorithm for Computing The Projected Dimensions
Algorithm 2: ComputeDimensions(Faded Cluster Structures: FCS, NumberofDimensions: l, Incoming Point: X);
begin
Create |FCS| (tentative) fading cluster structures by adding X to each of the existing clusters;
Compute the |FCS| * d radii of each of the |FCS| (tentative) clusters along each of the d dimensions;
Pick the |FCS| * l dimensions with the least radii;
Create a bit vector B(Cr) for each cluster Cr reflecting its projected dimensions;
end;
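The selection step of Algorithm 2 can be sketched in Python as follows. The sketch assumes that the radii of the tentative clusters have already been computed and are passed in as a list of lists; the function name and the sample values are hypothetical.

```python
def compute_dimensions(radii, l):
    """radii[r][j] is the radius of tentative cluster r along dimension j.
    Select the len(radii) * l (cluster, dimension) pairs with the smallest
    radii and return one bit vector per cluster."""
    n_clusters = len(radii)
    d = len(radii[0]) if radii else 0
    pairs = sorted((radii[r][j], r, j) for r in range(n_clusters) for j in range(d))
    bits = [[0] * d for _ in range(n_clusters)]
    for _, r, j in pairs[: n_clusters * l]:
        bits[r][j] = 1
    return bits

# Hypothetical radii for two clusters in three dimensions.
print(compute_dimensions([[0.1, 0.9, 0.2], [0.5, 0.05, 0.8]], l=1))
print(compute_dimensions([[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]], l=1))
```

Because the pairs are ranked across all clusters together, l is only an average: in the second call both selected dimensions go to the first cluster and the second cluster receives none, which is the situation that the removal of clusters with zero dimensions in Algorithm 1 handles.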
3.1 The Fading Cluster Structure:
This section introduces the concept of a fading cluster structure, which is able to adjust for the recency of the clusters in a flexible way. The data stream consists of a set of multi-dimensional records $X_1 \ldots X_k$ arriving at time stamps $T_1 \ldots T_k$. Each data point $X_i$ is a multi-dimensional record containing d dimensions, denoted by $X_i = (x_i^1 \ldots x_i^d)$. Each data point is assumed to have a weight defined by a function f(t) of the time t.
The fading cluster structure is a data structure designed to capture key statistical characteristics of the clusters generated during the course of a data stream [Algorithm 1]. Its aim is to capture enough of the underlying statistics that key characteristics of the underlying clusters can be computed.
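A minimal Python sketch of such a structure is given below. The text does not list the stored statistics, so the sketch assumes a common choice: the faded sum of the coordinates, the faded sum of their squares, and the faded point count, all decayed by an exponential factor $2^{-\lambda \Delta t}$ before a new point is added. The class name, attribute names and sample values are hypothetical.

```python
import math

class FadingCluster:
    """Assumed fading cluster statistics: faded per-dimension sums, faded
    per-dimension sums of squares, and the faded number of points."""

    def __init__(self, d, lam):
        self.lam = lam
        self.fc1 = [0.0] * d      # faded sum of coordinates
        self.fc2 = [0.0] * d      # faded sum of squared coordinates
        self.weight = 0.0         # faded number of points
        self.last_update = 0.0    # time of the most recent update

    def _decay(self, t):
        """Discount all statistics for the time elapsed since the last update."""
        factor = 2.0 ** (-self.lam * (t - self.last_update))
        self.fc1 = [v * factor for v in self.fc1]
        self.fc2 = [v * factor for v in self.fc2]
        self.weight *= factor
        self.last_update = t

    def add(self, x, t):
        """Decay the old statistics to time t, then absorb point x."""
        self._decay(t)
        for j, xj in enumerate(x):
            self.fc1[j] += xj
            self.fc2[j] += xj * xj
        self.weight += 1.0

    def centroid(self):
        return [s / self.weight for s in self.fc1]

    def radius(self, j):
        """Weighted standard deviation along dimension j."""
        mean = self.fc1[j] / self.weight
        var = self.fc2[j] / self.weight - mean * mean
        return math.sqrt(max(var, 0.0))

fc = FadingCluster(d=2, lam=1.0)
fc.add((2.0, 0.0), t=0.0)
fc.add((4.0, 0.0), t=1.0)
print(fc.weight, fc.centroid(), fc.radius(0))
```

With a decay factor of 1, the first point counts for half as much as the second by the time the second arrives, so the centroid sits closer to the newer point. Only a fixed number of values per dimension is stored, which is what makes the update cheap on a fast stream.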
3.2 The High Dimensional Projected Clustering:
In this discussion, individual clusters are maintained in an online fashion. High dimensional clustering utilizes an iterative approach which continuously determines new cluster structures while re-defining the set of dimensions included in each cluster [Algorithm 2].
First, a normalization process is run because different dimensions have different ranges of values. This is needed because the clustering algorithm picks the dimensions that are specific to each cluster by comparing the radii along different dimensions. Different dimensions may refer to different scales of reference, such as age, salary or other attributes, which have vastly different ranges and variances. It is therefore not possible to compare the dimensions in a meaningful way using the original data. The aim of the normalization is to equalize the standard deviation along each dimension.
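The normalization can be illustrated with a small Python sketch that divides every dimension by its standard deviation. The sketch assumes that a sample of points is available up front, which may not hold for a live stream; the function name and the age and salary values are hypothetical.

```python
import statistics

def normalise_by_std(points):
    """Divide each dimension by its population standard deviation so that
    every dimension ends up with unit standard deviation. Constant
    dimensions (zero deviation) are left unchanged."""
    d = len(points[0])
    stds = [statistics.pstdev([p[j] for p in points]) for j in range(d)]
    return [
        tuple(p[j] / stds[j] if stds[j] > 0 else p[j] for j in range(d))
        for p in points
    ]

# Hypothetical (age, salary) records on very different scales.
sample = [(20.0, 20000.0), (30.0, 40000.0), (40.0, 60000.0)]
scaled = normalise_by_std(sample)
print(scaled)
```

After scaling, the radii along the age and salary dimensions are directly comparable, so the dimension selection is not dominated by whichever attribute happens to have the largest raw values.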
4. Scalability Results:
This section presents and analyzes results on the clustering quality (accuracy) and the efficiency of the compared algorithms. Cluster purity is taken as the measure of clustering quality.
Accuracy comparison: The clustering quality of the HPStream algorithm was evaluated in comparison with the DBSTREAM and CluStream algorithms using a real data set, Forest CoverType.
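Cluster purity is not defined in the text, so the following Python sketch assumes one common definition: for every cluster, take the share of points that belong to its dominant class, then average these shares over the clusters. The function name and the sample labels are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, class_labels):
    """Average, over clusters, of the fraction of points that carry the
    dominant class label of their cluster (an assumed definition)."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, class_labels):
        members[cid].append(label)
    per_cluster = [
        Counter(labels).most_common(1)[0][1] / len(labels)
        for labels in members.values()
    ]
    return sum(per_cluster) / len(per_cluster)

# Hypothetical assignment of five points to two clusters.
print(cluster_purity([0, 0, 0, 1, 1], ["a", "a", "b", "b", "b"]))
```

In this example the first cluster is two thirds pure and the second is fully pure, so the average lies between the two values.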
Figure 3: Quality comparison (Forest CoverType data)
Besides an average projected dimensionality of l = 75, the experiments used a series of different values of l, i.e., (25, 50, 75, 100), to test the clustering quality; Figure 3 shows the result. Overall, l = 75 leads to the best cluster purity, while a too small l of 25 or a too large l of 100 produces very poor clustering quality. In addition, the cluster purity for l = 50 or l = 75 is very similar to that for l = 70, which suggests that as long as the value chosen for l is in the range from 50 to 75, HPStream gives very good clustering quality.
The experiments in Fig. 3 on the sensitivity to the average projected dimensionality l demonstrate that as long as the value of l does not deviate too much from the true average projected dimensionality, HPStream gives high clustering quality. HPStream always generated a similar clustering solution when the value of l was in the range from 50 to 75.
Figure 4: Clustering quality vs. outlier threshold
Sensitivity Analysis: An important parameter of HPStream is the decay factor. It controls the importance of historical data to the current clusters. In the previous experiments it was set to 0.25, which is a moderate setting. However, the quality of HPStream is still higher than that of DBSTREAM. It can be seen that if this parameter ranges from 0.125 to 1, the clustering quality is quite good and stable, and always above 95%. Another important parameter is the outlier threshold. Figure 4 shows the clustering quality of HPStream when the threshold varies from 0.2 to 1. If the threshold is between 0.2 and 0.6, the clustering quality is very good. However, if it is set to a relatively high value such as 1, the quality deteriorates greatly, because many points corresponding to potential clusters are pruned. HPStream gives better clusters than DBSTREAM.
5. Conclusion:
In recent years, the management and processing of high dimensional data streams has become a subject of active research in numerous fields of computer science, such as database systems and data mining. A lot of research work has been carried out in this field to develop an efficient clustering algorithm for high dimensional data streams. High dimensional data are frequently large and may contain outliers. HPStream was implemented by incorporating a fading cluster structure and the projection based clustering methodology to work effectively with high dimensional data streams.