Abstract. Current trends in databases favor dynamic approaches, since most applications update their data regularly. The mining process must therefore adapt as well, so that new trends in the data are found as they emerge, and standard data mining algorithms have to be revised to handle dynamic data. To adopt changes we can follow either of two approaches: update the mining results with or without considering the previous state. In this paper, we focus on these two strategies and implement them using fuzzy clustering in order to compare them.

Keywords: data mining strategy; changing data; fuzzy clustering; cluster validity.

1 Introduction

In many data mining applications, such as business intelligence, image processing, and scientific and online applications, a frequently applied technique is cluster analysis, the process of grouping a set of data points into subsets of similar points [1]. In recent years these applications have increasingly generated dynamic data, i.e. data that changes over time. For example, in business intelligence customers are grouped by their buying behavior in order to develop business strategies that improve customer relationship management. However, customer behavior changes over time, and the strategies have to be altered accordingly. In such applications the clustering process therefore has to change as well in order to find new trends.

Traditional data mining algorithms are not suitable for finding new trends in such data because they take static inputs. Consider partitioning clustering algorithms (K-means and K-medoids), where the number of clusters (k) has to be decided in advance [2], [3], [4]. The appropriate value of k depends on characteristics of the data set (such as its size) and has to be updated as the data changes. Hence, we need a new approach for mining in a dynamic environment. Many clustering methods have been proposed in this direction to operate on dynamic data.

The aim of a dynamic model is to find changes in the data and adjust the input parameters accordingly. Soft computing methods are suitable for finding changes in uncertain and vague data. In this field, Crespo and Weber introduced data mining strategies for a changing environment [5]. According to them, a user who wants to adopt the changes can apply either of two approaches: perform the complete data mining task from scratch, or update the present system as new data arrives. In this paper, we evaluate these two strategies on changing data using fuzzy clustering. The paper is structured as follows. Section 2 discusses related work, section 3 presents the proposed dynamic clustering methods, section 4 presents the results, and section 5 concludes.

2 Related Work

In the literature, the focus of dynamic clustering algorithms is to identify changes in the data and adopt them by updating the clustering parameters. Changes in data are uncertain and incomplete, hence soft computing approaches (fuzzy sets, rough sets, evolutionary computing, and neural networks) are suitable for clustering such data. Crespo and Weber proposed a methodology for dynamic data using fuzzy clustering and rough k-means clustering algorithms [5], [6], [7]. Motivated by their methodology, we evaluate the proposed strategies using fuzzy clustering.

2.1 Fuzzy Clustering

Unlike hard clustering, fuzzy clustering can assign a data point to more than one cluster, using degrees of membership as defined in fuzzy set theory [8], [9]. For every data point, the algorithm calculates a membership value for each cluster; when a crisp assignment is needed, the point is assigned to the cluster with the highest membership value. The basic fuzzy clustering algorithm is Fuzzy C-Means (FCM), which minimizes the objective function (J) given in equation (1).

$$J = \sum_{i=1}^{n} \sum_{k=1}^{c} \mu_{ik}^{m} \, \lVert p_i - v_k \rVert^{2} \qquad (1)$$

Where

n: number of data objects; c: number of clusters; $\mu_{ik}$: fuzzy membership value of data point i in cluster k;

m: fuzziness factor (>1); $p_i$: data point; $v_k$: center of the kth cluster.

The center of the kth cluster is calculated using equation (2) as,

$$v_k = \frac{\sum_{i=1}^{n} \mu_{ik}^{m} \, p_i}{\sum_{i=1}^{n} \mu_{ik}^{m}} \qquad (2)$$

The fuzzy membership can be calculated using equation (3) as,

$$\mu_{ik} = \frac{1}{\sum_{l=1}^{c} \left( \frac{\lVert p_i - v_k \rVert}{\lVert p_i - v_l \rVert} \right)^{2/(m-1)}} \qquad (3)$$
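The alternating use of equations (2) and (3) can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this paper (which was written in R). The function names, the optional `init_centers` argument, the stopping tolerance, and the handling of a point that coincides with a center are all assumptions made for illustration.

```python
import math
import random


def dist(p, q):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def update_memberships(points, centers, m=2.0):
    """Equation (3): membership of every point in every cluster."""
    exponent = 2.0 / (m - 1.0)
    u = []
    for p in points:
        d = [dist(p, v) for v in centers]
        hits = [1.0 if dk == 0.0 else 0.0 for dk in d]
        if sum(hits) > 0:
            # Assumed convention: a point lying on a center belongs fully to it.
            row = [h / sum(hits) for h in hits]
        else:
            row = [1.0 / sum((dk / dl) ** exponent for dl in d) for dk in d]
        u.append(row)
    return u


def update_centers(points, u, m=2.0):
    """Equation (2): membership-weighted mean of the points."""
    n, c, dim = len(points), len(u[0]), len(points[0])
    centers = []
    for k in range(c):
        w = [u[i][k] ** m for i in range(n)]
        total = sum(w)
        centers.append([sum(w[i] * points[i][j] for i in range(n)) / total
                        for j in range(dim)])
    return centers


def fcm(points, c, m=2.0, init_centers=None, max_iter=100, tol=1e-6, seed=0):
    """Alternate equations (3) and (2) until the centers stop moving."""
    if init_centers is None:
        # Assumed initialization: pick c distinct data points at random.
        init_centers = random.Random(seed).sample(list(points), c)
    centers = [list(v) for v in init_centers]
    for _ in range(max_iter):
        u = update_memberships(points, centers, m)
        new_centers = update_centers(points, u, m)
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, update_memberships(points, centers, m)
```

Each row of the returned membership matrix sums to 1, and taking the index of the largest entry in a row gives the crisp cluster assignment described above.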

2.2 Silhouette index

The silhouette index evaluates the internal quality of a clustering result and is used to find the optimal number of clusters [10]. For each data point (i), the silhouette width s(i) is defined as,

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \qquad (4)$$

where a(i) is the average dissimilarity between data point (i) and all other data points in the same cluster, and b(i) is the minimum, over all other clusters, of the average dissimilarity between i and the points of that cluster. A data point with positive s(i) is considered correctly clustered, while a negative value indicates that the point is probably assigned to the wrong cluster.
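As an illustration of equation (4), the following sketch computes s(i) for every point from crisp cluster labels. It assumes Euclidean distance as the dissimilarity and assigns s(i) = 0 to a point that is alone in its cluster; both choices, and the function names, are assumptions rather than part of the method described here.

```python
import math


def dist(p, q):
    """Euclidean distance, used here as the dissimilarity (an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_widths(points, labels):
    """Return s(i) from equation (4) for every point."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            # Assumed convention: singleton clusters get s(i) = 0.
            widths.append(0.0)
            continue
        # a(i): average dissimilarity to the other points of the same cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest average dissimilarity to any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in idxs) / len(idxs)
            for other, idxs in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths
```

For the one-dimensional points 0, 1, 10, 11 with labels [0, 0, 1, 1], the first point has a = 1 and b = 10.5, so s = 9.5 / 10.5. With the labels [0, 1, 0, 1] the same point has a = 10 and b = 6, giving s = -0.4, which signals a wrong assignment.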

2.3 Data mining strategies on the changing data

The process of mining knowledge from a data warehouse is a cycle of extracting data, setting the input parameters of the mining algorithm, and executing the algorithm. As new data is added to the data warehouse, this cycle is repeated to keep the results accurate. Three approaches for a data mining system on changing databases are described in [5]:

1. Ignore changes in the data and keep applying the initial parameters.

2. For each batch of new incoming data, repeat the entire cycle, ignoring the previous state.

3. For each batch of new incoming data, identify the need for change and update the existing clusters using the new data.

The first strategy does not adopt the changes, so the data mining system never needs updating and the computational cost is low. However, it does not give correct results once the data changes. The second strategy adopts the changes and gives accurate results, but to do so it repeats the entire cycle for every batch and performs mining on the entire data set, which requires more computation. The third strategy also adopts the changes in the data, but it does not restart the data mining process from scratch: it identifies the need for an update based on the new data and performs the update with respect to the previous system. It is computationally cheap and identifies changes in the environment based on the previous state of the system.

3 Dynamic clustering algorithms

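Here we consider the second and third data mining strategies described in section 2.3 to adopt changes in the data, and frame a fuzzy clustering method for each approach. In the remainder of the paper, these two methods are referred to as the first strategy (complete re-clustering) and the second strategy (incremental update), respectively.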

3.1 Dynamic clustering using first strategy

We propose a method that clusters dynamic data in two phases: 1. Find the optimal number of clusters on the given data set using the silhouette width. 2. Execute fuzzy c-means clustering. For each cycle of new incoming data, combine the new data with the existing data and repeat the two phases. The algorithm steps are given below:

Phase 1: Find the right number of clusters on the initial data set Dinitial of size n.

Step 1: Repeat for each number of clusters c = 2 to (n/2 - 1):

Cluster the data into c clusters and calculate the silhouette width (S) using equation (4);

Step 2: Find the c which has the maximum average cluster silhouette width;

Phase 2: Execute fuzzy c-means with the selected c on the data set (D).

Repeat (phases 1 & 2) for every batch of new incoming data, combining new and old data as,

D = Dinitial + Dnew
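The two phases can be sketched as follows. This is an illustrative outline only: `cluster_fn` is a hypothetical callable that takes the data and a cluster count c and returns one crisp label per point (for example, fuzzy c-means followed by assignment to the highest membership), and Euclidean distance is assumed as the dissimilarity.

```python
import math


def dist(p, q):
    """Euclidean distance, assumed as the dissimilarity measure."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_silhouette(points, labels):
    """Mean of s(i) from equation (4) over all points."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: s(i) is taken as 0 (assumption)
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in idxs) / len(idxs)
            for other, idxs in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def choose_num_clusters(points, cluster_fn):
    """Phase 1: try c = 2 .. n/2 - 1 and keep the c with the best score."""
    n = len(points)
    best_c, best_score = None, -2.0
    for c in range(2, n // 2):  # upper bound n // 2 is exclusive
        score = average_silhouette(points, cluster_fn(points, c))
        if score > best_score:
            best_c, best_score = c, score
    return best_c, best_score


def recluster(d_old, d_new, cluster_fn):
    """One cycle: D = D_old + D_new, then repeat phases 1 and 2 from scratch."""
    data = list(d_old) + list(d_new)
    c, score = choose_num_clusters(data, cluster_fn)
    labels = cluster_fn(data, c)  # phase 2 with the selected c
    return data, c, labels, score
```

Because nothing from the previous cycle is passed into `recluster` apart from the raw data, the result of each cycle is independent of earlier cluster structures. Note that the loop is empty for fewer than six points, in which case `best_c` stays `None`.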

3.2 Dynamic clustering using second strategy

In this section, we present a method that finds soft changes in the data and updates the cluster structure, as proposed in [5]. The algorithm has two phases: 1. Initial clustering 2. Iteration. The iteration phase is executed for each batch of new incoming data and can perform any of three actions (create a new cluster, move cluster centers, or delete a cluster) in response to each change.

Initial clustering:

Step 1: Calculate the average cluster silhouette width for each c.

Step 2: Find the number of clusters (c) which gives the maximum average silhouette width;

Step 3: Execute fuzzy c-means clustering with c on the initial data set Dinitial.

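Iteration: repeat for every batch of new incoming data

Step 1: Let Dnew contain m new data points, added to the n existing points. (Here m denotes the number of new points, not the fuzziness factor of equation (1).)

Step 2: Identify the new points which represent changes in the cluster structure;

To find the new objects which do not fit the existing cluster centers, apply two properties. For each new data point i and the current cluster centers,

Property 1: If all membership values of the point are near 1/c, the point cannot be classified correctly.

$$\left| \mu_{ik} - \frac{1}{c} \right| \le \alpha, \quad \forall\, i \in \{n+1, n+2, \ldots, n+m\}, \; \forall\, k \in \{1, 2, \ldots, c\} \qquad (5)$$

where $\alpha$ is a given threshold.

Property 2: The distance between data point (i) and every cluster center is more than half the minimum distance between any two cluster centers.

$$d_{ik} > \frac{1}{2} \min_{j \neq l} \{ d(v_j, v_l) \}, \quad \forall\, k \in \{1, 2, \ldots, c\} \qquad (6)$$

where $d_{ik}$ is the distance between data point i and cluster center $v_k$.

If a data point satisfies both properties, it represents a need for change.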
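The two properties can be checked for a single new point as in the sketch below. The threshold name `alpha`, the function names, and the use of Euclidean distance are assumptions for illustration; memberships are computed with respect to the existing centers using equation (3), and the point is assumed not to coincide with any center.

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def point_memberships(point, centers, m=2.0):
    """Equation (3) for one point; assumes the point is not on a center."""
    exponent = 2.0 / (m - 1.0)
    d = [dist(point, v) for v in centers]
    return [1.0 / sum((dk / dl) ** exponent for dl in d) for dk in d]


def needs_change(point, centers, alpha, m=2.0):
    """True if the point satisfies both Property 1 and Property 2."""
    c = len(centers)
    u = point_memberships(point, centers, m)

    # Property 1: every membership lies within alpha of 1/c.
    prop1 = all(abs(uk - 1.0 / c) <= alpha for uk in u)

    # Property 2: the point is farther from every center than half the
    # smallest distance between any two centers.
    min_sep = min(
        dist(centers[j], centers[l])
        for j in range(c)
        for l in range(j + 1, c)
    )
    prop2 = all(dist(point, v) > 0.5 * min_sep for v in centers)

    return prop1 and prop2
```

With centers at (0, 0) and (10, 0), the point (5, 20) is equidistant from both centers and far from each, so it is flagged. The point (5, 0) has equal memberships too, but its distance of 5 to each center does not exceed half the center separation, so it is not flagged.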

Step 3: Identify structural changes.

Case 1: Create new cluster.

Property: If the share of new objects that require changes reaches the threshold ($\beta$), then create a new cluster. Otherwise, go to the next case.

$$\frac{1}{m} \sum_{i=n+1}^{n+m} IC(p_i) \ge \beta, \quad \text{with a parameter } \beta,\; 0 \le \beta \le 1 \qquad (7)$$

where $IC(p_i) = 1$ if new data point $p_i$ satisfies both properties of Step 2 and 0 otherwise.

Case 2: Move cluster centers.

Combine the new objects with the old data and perform fuzzy c-means clustering with the same number of clusters. D = Dinitial + Dnew
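The choice between the two cases reduces to comparing a proportion with a threshold. The sketch below assumes that each new point has already been flagged as requiring change or not; the function name, the string return values, and the threshold name `beta` are hypothetical, and the value 0.2 in the usage note is chosen only for illustration.

```python
def structural_action(change_flags, beta):
    """Decide between creating a cluster and moving the centers.

    change_flags: one boolean per new data point, True if the point
                  was found to require a change in the cluster structure.
    beta:         threshold in [0, 1] on the share of flagged points.
    """
    m = len(change_flags)
    if m == 0:
        return "none"  # no new data in this cycle
    share = sum(1 for flag in change_flags if flag) / m
    # Case 1: enough new points do not fit the current structure.
    if share >= beta:
        return "create"
    # Case 2: keep the number of clusters and re-run fuzzy c-means.
    return "move"
```

For example, if 2 of 20 new points are flagged, the share is 0.1; with an assumed `beta` of 0.2 the function returns `"move"`, while 8 flagged points out of 20 would return `"create"`.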

4 Results

We implemented the proposed methods in R and applied them to dynamic customer segmentation. In customer relationship management, tracking customer behavior is important because this behavior changes over time. We use the Wholesale customers data set from the UCI repository to show the effectiveness of these methods. It covers 440 wholesale customers, with two channels and three regions. To simulate the dynamic behavior of customers, in each cycle we added a randomly generated subset and executed the methods to track customer behavior.

4.1 Results of first strategy

As defined in section 3.1, we executed the dynamic clustering method in three cycles; the results of the first phase are given in Table 1. After finding the right number of clusters, we executed fuzzy c-means.

Table 1. Results of the first phase to decide the right number of clusters

| Cycle | Data size | Optimal no. of clusters | Max. avg. silhouette width |
|---|---|---|---|
| cycle-1 | 20 | 2 | 0.676344 |
| cycle-2 | 40 | 2 | 0.545914 |
| cycle-3 | 60 | 3 | 0.5362692 |

4.2 Results of second strategy

As defined in section 3.2, in cycle-1 the initial clustering was executed on 20 objects. Iterations start from the next cycle. In cycle-2, we added 20 new objects and identified two objects that satisfy both properties. This indicates a need for change in the cluster structure; we then applied the threshold condition of Step 3, which resulted in moving the cluster centers. In cycle-3, we added 20 objects and applied properties 1 and 2 and the threshold condition to the new data; the number of data objects requiring changes indicated that a new cluster should be created. The results of the cycles are given in Table 2.

Table 2. Results of three cycles

| Cycle | Cluster | No. of objects | Changes | Avg. silhouette width |
|---|---|---|---|---|
| Cycle-1 (initial data, 20 objects) | 1 | 15 | – | 0.6773940 |
| | 2 | 5 | – | 0.6731938 |
| | Overall | | | 0.676344 |
| Cycle-2 (20 new data added) | 1 | 23 | Move | 0.7436776 |
| | 2 | 17 | Move | 0.1786387 |
| | Overall | | | 0.545914 |
| Cycle-3 (20 new data added) | 1 | 35 | Move | 0.7833169 |
| | 2 | 15 | Move | 0.3769178 |
| | 3 | 20 | Create | -0.0893708 |
| | Overall | | | 0.5362692 |

4.3 Evaluation

From the results of both methods, we observe that the first method does not reveal internal changes in the cluster structure. In every cycle the old results are replaced with new results, so customer behavior cannot be tracked because previous results are not maintained. The second method, in contrast, shows the movement of objects between clusters, changes in cluster centers, and the arrival of new groups, which indicate changes in customers' buying behavior over time. Therefore, applications that need to track internal changes in the data with respect to the previous cluster structure can adopt the second method.

5 Conclusion

We considered the problem of clustering dynamic data sets, since most applications generate data that changes over time, and discussed the merits and demerits of the available strategies in a changing environment. We proposed and executed two dynamic clustering algorithms based on fuzzy clustering to evaluate the two strategies on wholesale customer data. From the results, we find that the first method is simple but does not reveal changes in the behavior of the data, whereas the second method shows the changing behavior of the data, so that the clusters can be modified accordingly. Most present applications need to adopt a dynamic model, and these methods can be extended further by considering noise, complexity, and the size of the data to obtain more accurate results.
