Al al-Bayt University

Computer Science Department

Prince Hussein Bin Abdullah College for Information Technology

An Intelligent Two-Level Clustering Hierarchy Using Fuzzy Clustering by Local Approximation of Membership

INDEX:

1- Abstract

2- Code

Chapter One

3- Clustering

3.1 Clustering

3.2 Clustering algorithms

3.3 Hierarchical clustering

3.4 Hierarchical algorithms

3.5 Fuzzy clustering

3.6 K-means algorithm

3.7 C-means algorithm

Chapter Two

4- FLAME

4.1 FLAME algorithm

4.2 Local approximation of membership

4.3 Hierarchical fuzzy clustering

4.4 Hierarchical fuzzy clustering with c-means

4.5 Time complexity of FLAME

4.6 Hierarchical FLAME algorithm

4.7 K-nearest neighbors algorithm

5, 5.1 Inference

5.2 Difference between c-means and FLAME

1- Abstract

Clustering has been applied in a wide variety of fields such as biology, medicine, and economics. Existing clustering approaches, mainly developed in computer science, have been adapted to microarray data analysis; however, some structures in such data may not be correctly captured by current clustering methods. We therefore approached the problem from a new starting point and developed a clustering algorithm designed to capture dataset-specific structures at the beginning of the process. Using FLAME (fuzzy clustering by local approximation of membership), we can obtain good performance, high speed, and greater effectiveness in a software system. In this paper we show how our approach controls and manages the network and improves its performance.

Keywords: FLAME, clustering, fuzzy clustering, local approximation

Introduction

Known communication environments can be quite varied. For example, within a single device or entity there may be communications between different functionalities or nodes within that device. Alternatively, two or more devices may be connected together, with communications between two or more of them. Two devices may also communicate with each other via one or more other devices, as in a packet data network, a mobile communications network, or the like. The connections can take any form: for example, one node may be a control node to which each of the other nodes is connected, or the nodes may be connected in a chain or in any other suitable arrangement. One possible way to interconnect the nodes is an ad hoc mesh network (1). The sensors can be of any suitable kind and on any scale; for example, such networks can operate on the scale of a city or a country, or the sensor network may be on a much smaller scale.

"Method and apparatus for sending a packet from a source node to a destination node in the same broadcast domain", Intel Corporation.

Objectives of clustering:

- Discover structures and patterns in high-dimensional data.

- Group data with similar patterns together.

- This reduces the complexity and facilitates interpretation.

Clustering has applications in various areas such as taxonomy, medicine, geology, business, engineering systems, and image processing (2).

Many clustering algorithms in various contexts have been proposed (3). These algorithms are mostly heuristic in nature and aim at generating the minimum number of clusters such that any node in any cluster is at most d hops away from the cluster head. Most of these algorithms have a time complexity of O(n) , where n is the total number of nodes. Many of them also demand time synchronization among the nodes, which makes them suitable only for networks with a small number of sensors.

The clustering problem is to partition a data set into clusters so that the elements within a cluster are closer to each other than to elements in different clusters. A general classification of clustering approaches is presented in (6). One large group of clustering algorithms is fuzzy clustering (5). These algorithms partition the observations into clusters in which membership is probabilistic rather than absolute, so a data element can be a member of more than one cluster.

One fuzzy clustering method is the Fuzzy clustering by Local Approximation of MEmberships (FLAME) algorithm (4), on which our attention is focused. The three basic steps of the FLAME algorithm are:

Definition of the neighborhood of each object and construction of a neighborhood graph connecting the objects. In this step the K-Nearest Neighbors (KNN) approach is used, by which the objects are classified into three sets:

1. The set of objects with density higher than that of all their neighbors: Cluster Supporting Objects (CSOs);

2. The set of objects with density lower than that of all their neighbors and lower than a defined threshold: cluster outliers;

3. The set of the remaining objects.

Iterative converging process for local/neighborhood approximation of fuzzy memberships. First, the fuzzy memberships are initialized:

1. Each CSO is assigned fixed and full membership to itself, to represent one cluster;

2. All outliers are assigned fixed and full membership to the outlier group;

3. The rest are assigned equal memberships to all clusters and the outlier group.

Second, the fuzzy memberships are updated via the local/neighborhood approximation of fuzzy memberships, in which the fuzzy membership of each object is updated as a linear combination of the fuzzy memberships of its nearest neighbors. At the end of this process, each object is assigned either to one of the clusters established around the CSOs or to the outlier group, based on its approximate memberships.

Related work

Recently, fuzzy clustering approaches have been taken into consideration because of their capability to assign one gene to more than one cluster (fuzzy assignment), which may allow capturing genes involved in multiple transcriptional programs and biological processes. Fuzzy C-means (FCM), also named fuzzy K-means, is a fuzzy extension of K-means clustering and bases its fuzzy assignment essentially on the relative distance between one object and all cluster centroids. Many variants of FCM have been proposed in past years, including a heuristic variant that incorporates principal component analysis (PCA) and hierarchical clustering, and Fuzzy J-Means, which applies variable neighborhood searching to avoid the cluster solution being trapped in local minima (7). A FuzzySOM approach was also developed to improve FCM by arraying the cluster centroids into a regular grid (8). All these fuzzy C-means-derived clustering approaches suffer from the same basic limitation as K-means, i.e. using pairwise similarity between objects and cluster centroids for membership assignment, thereby lacking the ability to capture non-linear relationships (12). Another family of fuzzy clustering approaches is based on the Gaussian Mixture Model (GMM) (9, 10, 11), where the dataset is assumed to be generated by a mixture of Gaussian distributions with certain probabilities, and an objective function is calculated from the mixture of Gaussians as the likelihood of the dataset being generated by such a model. The objective function is then maximized to solve the model and give a set of probabilistic assignments. A possible problem with this approach, as highlighted by Yeung and colleagues, is that real expression data do not always satisfy the basic Gaussian mixture assumption, even after various transformations aimed at improving the normality of the data distributions (11).

The aim of this paper is to propose a conceptually novel clustering algorithm combining simplicity with good performance and robustness. The algorithm approaches fuzzy data clustering from a novel perspective. It is mainly based on two general assumptions: (a) clusters should be identified in the relatively dense part of the dataset; (b) neighboring objects with similar features (expression profiles) must have similar cluster memberships, so that the membership of one object is constrained by the memberships of its neighbors. Therefore, the membership of each single object (gene or sample) is not determined with respect to all other objects in the dataset or to some cluster centroids, but is determined with respect to its neighboring objects only. This approach brings the notable advantage of capturing non-linear relationships, in a way similar to a nonlinear data dimensionality reduction approach called Locally Linear Embedding (LLE), originally developed for mapping multi-dimensional objects (data points) into a lower-dimensional space for their representation. The idea behind LLE is that, in a dataset, most nonlinear relationships can be effectively captured by subdividing the general network of relationships across all objects into locally linear relationships between neighbor objects. As an important consequence, information about one object can be correctly approximated by information obtained from its nearest neighbors. So for each object, LLE uses the original dataset to define its nearest neighbors and to assign a set of weights specifying how much each neighbor contributes to the reconstruction of the features (coordinates) of the object. After this, the dataset can be represented in a lower dimensional space, where each object is mapped according to the lower dimensional representation of its nearest neighbors and the weights assigned to those neighbors.
In this way the local structure of the original dataset (the neighbors of each object and their proximity) is preserved also in a lower dimensional space, such as the 2D or 3D views commonly used for displaying data. We therefore envisaged a fuzzy clustering approach based on neighborhood approximation, to capture non-linear relationships in multidimensional data and to provide a substantial improvement in the visualization and analysis of microarray data. The novel clustering method, FLAME, integrates the two above-mentioned key properties: (a) fuzzy membership assignment (one-to-many gene-to-cluster relationship); (b) definition of membership assignment by local approximation, where the membership assignment of a gene depends on the membership assignments of its neighbor genes (genes showing similar behavior).

4- Background

FLAME (fuzzy clustering by local approximation of membership) divides the objects into three groups: cluster supporting objects, cluster outliers, and the rest; these groups are used for levels one and two. In addition, our approach has another basic function, the election of the cluster head: in the second level, each cluster head manages the nodes that belong to its cluster.

The figure shows the main steps of our approach.

5- Conclusion

The FLAME algorithm has intrinsic advantages, such as the ability to capture non-linear relationships and non-globular clusters, the automated definition of the number of clusters, and the identification of cluster outliers, i.e. genes that are not assigned to any cluster. As a result, clusters are more internally homogeneous and more diverse from each other, and provide a better partitioning of biological functions. The clustering algorithm can be easily extended to other applications.

References:

Jain AK, Dubes RC: Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall; 1988.

Kaufman L, Rousseeuw PJ: Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley; 1990.

(3) Baker DJ, Ephremides A: The architectural organization of a mobile radio network via a distributed algorithm. IEEE Transactions on Communications 1981, 29(11):1694-1701.

(4) Fu L, Medico E: FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinformatics 2007, 8:3.

(5) Sampat R, Sonawani S: A survey of fuzzy clustering techniques for intrusion detection system. International Journal of Engineering Research & Technology (IJERT) 2014, 3(1):2188-2192.

Santhi MVBT, Sai Leela VRNSSV, Anitha PU, Nagamalleswari D: Enhancing K-Means

(7) Belacel N, Cuperlovic-Culf M, Laflamme M, Ouellette R: Fuzzy J-Means and VNS methods for clustering genes from microarray data. Bioinformatics 2004, 20:1690-1701. doi:10.1093/bioinformatics/bth142

(8) Pascual-Marqui RD, Pascual-Montano AD, Kochi K, Carazo JM: Smoothly distributed fuzzy c-means: a new self-organizing map. Pattern Recognition 2001, 34:2395-2402. doi:10.1016/S0031-3203(00)00167-9

(9) Hand D, Mannila H, Smyth P: Principles of Data Mining. Cambridge, MA: The MIT Press; 2001.

(10) Qu Y, Xu S: Supervised cluster analysis for microarray data based on multivariate Gaussian mixture. Bioinformatics 2004, 20(12):1905-1913. doi:10.1093/bioinformatics/bth177

(11) Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL: Model-based clustering and data transformations for gene expression data. Bioinformatics 2001, 17(10):977-987. doi:10.1093/bioinformatics/17.10.977

(12) Chen YD, Bittner ML, Dougherty ER: Issues associated with microarray data analysis and integration. Nature Genetics 1999, (Suppl 22):213-215.

2- Code:

The following patch to the FLAME C implementation removes the unimplemented Spearman distance functions, fixes an indexing bug in the membership initialization, normalizes memberships after each update, relaxes the cluster-assignment threshold check, and frees temporary memory:

	Flame_CovarianceDist,
-	Flame_Manhattan,
-	Flame_SpearmanDist
+	Flame_Manhattan
};

float Flame_Euclidean( float *x, float *y, int m )
@@ -191,12 +190,6 @@ float Flame_Manhattan( float *x, float *y, int m )
	for(i=0; i<m; i++) d += fabs( x[i] - y[i] );
	return d;
}
-float Flame_Spearman( float *x, float *y, int m )
-{
-	float d = 0;
-	int i;
-	return d;
-}
float Flame_CosineDist( float *x, float *y, int m )
{
	return 1-Flame_Cosine( x, y, m );
@@ -221,10 +214,6 @@ float Flame_CovarianceDist( float *x, float *y, int m )
{
	return 1-Flame_CovarianceDist( x, y, m );
}
-float Flame_SpearmanDist( float *x, float *y, int m )
-{
-	return 1-Flame_SpearmanDist( x, y, m );
-}
Flame* Flame_New()
{
@@ -344,7 +333,7 @@ void Flame_DefineSupports( Flame *self, int knn, float thd )
	 * But in this definition, the weights are only dependent on
	 * the ranking of distances of the neighbors, so it is more
	 * robust against distance transformations. */
-	sum = k*(k+1)/2.0;
+	sum = 0.5*k*(k+1.0);
	for(j=0; j<k; j++) self->weights[i][j] = (k-j) / sum;
	sum = 0.0;
@@ -400,18 +389,18 @@ void Flame_LocalApproximation( Flame *self, int steps, float epsilon )
	memset( fuzzyships[i], 0, (m+1)*sizeof(float) );
	if( obtypes[i] == OBT_SUPPORT ){
		/* Full membership to the cluster represented by itself. */
-		fuzzyships[i][k] = 1;
-		fuzzyships2[i][k] = 1;
+		fuzzyships[i][k] = 1.0;
+		fuzzyships2[i][k] = 1.0;
		k ++;
	}else if( obtypes[i] == OBT_OUTLIER ){
		/* Full membership to the outlier group. */
-		fuzzyships[i][m] = 1;
-		fuzzyships2[i][m] = 1;
+		fuzzyships[i][m] = 1.0;
+		fuzzyships2[i][m] = 1.0;
	}else{
		/* Equal memberships to all clusters and the outlier group.
		 * Random initialization does not change the results. */
		for(j=0; j<=m; j++)
-			fuzzyships[i][m] = fuzzyships2[i][m] = 1.0/(m+1);
+			fuzzyships[i][j] = fuzzyships2[i][j] = 1.0/(m+1);
	}
}
for(t=0; t<steps; t++){
@@ -422,6 +411,7 @@ void Flame_LocalApproximation( Flame *self, int steps, float epsilon )
	float *wt = self->weights[i];
	float *fuzzy = fuzzyships[i];
	float **fuzzy2 = fuzzyships2;
+	double sum = 0.0;
	if( self->obtypes[i] != OBT_NORMAL ) continue;
	if( even ){
		fuzzy = fuzzyships2[i];
@@ -433,7 +423,9 @@ void Flame_LocalApproximation( Flame *self, int steps, float epsilon )
		fuzzy[j] = 0.0;
		for(k=0; k<knn; k++) fuzzy[j] += wt[k] * fuzzy2[ ids[k] ][j];
		dev += (fuzzy[j] - fuzzy2[i][j]) * (fuzzy[j] - fuzzy2[i][j]);
+		sum += fuzzy[j];
	}
+	for(j=0; j<=m; j++) fuzzy[j] = fuzzy[j] / sum;
}
even = ! even;
if( dev < epsilon ) break;
@@ -452,6 +444,8 @@ void Flame_LocalApproximation( Flame *self, int steps, float epsilon )
		dev += (fuzzy[j] - fuzzy2[i][j]) * (fuzzy[j] - fuzzy2[i][j]);
	}
}
+for(i=0; i<n; i++) free( fuzzyships2[i] );
+free( fuzzyships2 );
}

void IntArray_Push( IntArray *self, int value )
@@ -490,7 +484,7 @@ void Flame_MakeClusters( Flame *self, float thd )
	free( self->clusters );
}
self->clusters = (IntArray*) calloc( C, sizeof(IntArray) );
-if( thd <= EPSILON || thd >= 1.0-EPSILON ){
+if( thd < 0 || thd > 1.0 ){
	/* Assign each object to the cluster
	 * in which it has the highest membership. */
	for(i=0; i<N; i++){
@@ -534,5 +528,6 @@ void Flame_MakeClusters( Flame *self, float thd )
	C ++;
	for(i=C; i<self->cso_count+1; i++) memset( self->clusters+i, 0, sizeof(IntArray) );
	self->count = C;
+	free( vals );
}

Chapter One

3- Clustering

3.1 Clustering:

A cluster, in general, is a group of items configured to act as one. In computing, for example, a group of servers called cluster nodes can appear as a single virtual server: the cluster directs users to the appropriate node, depending on availability or on balancing the load among the servers.

Clustering is one of the most important techniques in computing and in the business world, as it solves many problems.

This can be illustrated graphically:

In this picture one can easily identify the four clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are "close" according to a given distance (in this case, geometric distance). This is called distance-based clustering.

There is another kind of clustering, in which two or more objects are put in the same cluster because they fit a common descriptive concept; that is, objects are grouped according to their fit to descriptive concepts, not according to simple similarity.

The Aim of Clustering

The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how does one decide what constitutes a good clustering? It can be shown that there is no absolute "best" criterion that is independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering suits their needs.

For instance, we might be interested in finding representatives for homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes), or in finding unusual data objects (outlier detection).

Possible Applications

Clustering algorithms can be applied in many fields, for instance:

- Marketing: finding groups of customers with similar behavior, given a large database of customer data containing their properties and past buying records;

- Biology: classification of plants and animals given their features;

- Libraries: book ordering;

- Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds;

- City planning: identifying groups of houses according to their house type, value and geographical location;

- Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;

- WWW: document classification; clustering weblog data to discover groups of similar access patterns.

There are many benefits of clustering, including the following:

1. simple handling of noisy data and outliers;

2. the ability to deal with data having various types of variables, such as continuous variables that require standardized data, binary variables, nominal variables (a more generalized representation of binary variables), ordinal variables (where the order of the data is the most significant criterion), and mixed variables (that is, an amalgamation of all of the above).

3.2 Clustering Algorithms

Classification

Clustering algorithms may be classified as listed below:

Exclusive Clustering

Overlapping Clustering

Hierarchical Clustering

Probabilistic Clustering

In the first case, data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster it cannot be included in another cluster. A simple example of this is shown in the figure below, where the separation of points is achieved by a straight line on a bi-dimensional plane.

On the contrary, the second type, overlapping clustering, uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership. In this case, data will be associated with an appropriate membership value.

Instead, a hierarchical clustering algorithm is based on the union of the two nearest clusters. The beginning condition is realized by setting every datum as a cluster. After a few iterations it reaches the final clusters wanted.

Finally, the last kind of clustering uses a completely probabilistic approach.

In this tutorial we consider four of the most used clustering algorithms:

1- K-means

2- Fuzzy C-means

3- Hierarchical clustering

4- Mixture of Gaussians

A cluster is usually described either as a grouping of similar data points around a center (called a centroid) or as a prototype data instance nearest to the centroid. Likewise, a cluster can be described either with or without a well-defined boundary. Clusters with well-defined boundaries are called crisp clusters, whereas those without such a feature are called fuzzy clusters. The present paper deals with fuzzy clustering only. Clustering is unsupervised learning on unlabeled data, and this property separates it from classification, where class prediction is performed on unlabeled data after supervised learning on pre-labeled data. Because the training in clustering algorithms is unsupervised, they can safely be used on a data set without much prior knowledge of it.

3.3 Hierarchical Clustering

In data mining and statistics, hierarchical clustering (also known as hierarchical cluster analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters.

Methods for hierarchical clustering generally fall into two types:

Agglomerative: a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Divisive: a "top-down" approach: all observations begin in one cluster, and splits are performed recursively as one moves down the hierarchy.

In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.

In the general case, the complexity of agglomerative clustering is O(n^2 log n), which makes it too slow for large data sets. Divisive clustering with an exhaustive search is O(2^n), which is even worse. However, for some special cases, optimal efficient agglomerative methods (of complexity O(n^2)) are known: SLINK for single-linkage and CLINK for complete-linkage clustering.

Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances.

Agglomerative clustering example

For example, suppose this data is to be clustered, and the Euclidean distance is the distance metric.

Cutting the tree at a given height will give a partitioning clustering at a selected precision. In this example, cutting after the second row of the dendrogram will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters.

The hierarchical clustering dendrogram would be like this:

Hierarchical clustering dendrogram of the Iris dataset (using R)

This approach builds the hierarchy from the individual elements by progressively merging clusters. In our example, we see six elements {a} {b} {c} {d} {e} and {f}. The initial step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance.

Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row j-th column is the distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters. A simple agglomerative clustering algorithm is described in the single-linkage clustering page; it can easily be adapted to different types of linkage (see below).

Suppose we have merged the two closest elements b and c; we now have the clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. To do that, we need the distance between {a} and {b, c}, and therefore must define the distance between two clusters. Usually the distance between two clusters A and B is one of the following:

The maximum distance between elements of each cluster (also called complete-linkage clustering):

max { d(x, y) : x ∈ A, y ∈ B }

The minimum distance between elements of each cluster (also called single-linkage clustering):

min { d(x, y) : x ∈ A, y ∈ B }

The mean distance between elements of each cluster (also called average linkage clustering, used e.g. in UPGMA):

(1 / (|A| · |B|)) Σ_{x ∈ A} Σ_{y ∈ B} d(x, y)

The sum of all intra-cluster variance.

The decrease in variance for the cluster being merged (Ward's method)

The probability that candidate clusters spawn from the same distribution function (V-linkage).

Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).

Divisive clustering

The basic principle of divisive clustering was published as the DIANA (Divisive Analysis Clustering) algorithm. Initially, all data is in the same cluster, and the largest cluster is split until every object is separate. Because there exist O(2^n) ways of splitting each cluster, heuristics are needed. DIANA chooses the object with the maximum average dissimilarity and then moves all objects to this cluster that are more similar to the new cluster than to the remainder. An obvious alternate choice is k-means clustering with k = 2 , but any other clustering algorithm can be used that always produces at least two clusters.

Commercial

MATLAB includes hierarchical cluster analysis.

SAS includes hierarchical cluster analysis in PROC CLUSTER.

Mathematica includes a Hierarchical Clustering Package.

NCSS (statistical software) includes hierarchical cluster analysis.

SPSS includes hierarchical cluster analysis.

Qlucore Omics Explorer includes hierarchical cluster analysis.

Stata includes hierarchical cluster analysis.

Hierarchical clustering algorithm

How They Work

Given a set of N items to be clustered, and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:

Start by assigning each item to a cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.

Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster less.

Compute distances (similarities) between the new cluster and each of the old clusters.

Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

Step 3 can be done in different ways, which is what distinguishes single-linkage from complete-linkage and average-linkage clustering.

In single-linkage clustering (also called the connectedness or minimum method), we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.

In complete-linkage clustering (also called the diameter or maximum method), we consider the distance between one cluster and another cluster to be equal to the greatest distance from any member of one cluster to any member of the other cluster.

In average-linkage clustering, we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.

A variation on average-link clustering is the UCLUS method of R. D'Andrade (1978) which uses the median distance, which is much more outlier-proof than the average distance.

This kind of hierarchical clustering is called agglomerative because it merges clusters iteratively. There is also divisive hierarchical clustering, which does the reverse by starting with all objects in one cluster and subdividing them into smaller pieces. Divisive methods are not generally available and have rarely been applied.

Of course there is no point in having all the N items grouped in a single cluster but, once you have got the complete hierarchical tree, if you want k clusters you just have to cut the k-1 longest links.

Example:

Let's now see a simple example: a hierarchical clustering of distances in kilometers between some Italian cities. The method used is single-linkage.

Input distance matrix (L = 0 for all the clusters):

BA FI MI NA RM TO

BA 0 662 877 255 412 996

FI 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1.

Then we compute the distance from this new compound object to all other objects. In single link clustering the rule is that the distance from the compound object to another object is equal to the shortest distance from any member of the cluster to the outside object. So the distance from "MI/TO" to RM is chosen to be 564, which is the distance from MI to RM, and so on.

After merging MI with TO we obtain the following matrix:

BA FI MI/TO NA RM

BA 0 662 877 255 412

FI 662 0 295 468 268

MI/TO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

Proceeding in the same way, NA and RM are merged at level 219 (m = 2), and BA is merged with NA/RM at level 255 (m = 3). Then:

min d(i,j) = d(BA/NA/RM, FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM

L(BA/FI/NA/RM) = 268

m = 4

BA/FI/NA/RM MI/TO

BA/FI/NA/RM 0 295

MI/TO 295 0

Finally, we merge the last two clusters at level 295.

The process is summarized by the following hierarchical tree:

3.5 Fuzzy Clustering

Fuzzy clustering (also referred to as soft clustering) is a form of clustering in which each data point can belong to more than one cluster or partition. It is a process of categorizing elements, such as usage clicks or usage sessions, into groups where each element can belong to several groups with different degrees of membership: associated with each data point are membership grades which indicate the degree to which it belongs to the different clusters.

In non-fuzzy clustering (also known as hard clustering), data is divided into distinct clusters, where each data point can only belong to exactly one cluster. In fuzzy clustering, data points can potentially belong to multiple clusters.

Recently, fuzzy clustering approaches have been taken in consideration because of their capability to assign one gene to more than one cluster (fuzzy assignment), which may allow capturing genes involved in multiple transcriptional programs and biological processes. Fuzzy C-means (FCM), also named Fuzzy K-means, is a fuzzy extension of K-means clustering and bases its fuzzy assignment essentially on the relative distance between one object and all cluster centroids . Many variants of FCM have been proposed in the past years, including a heuristic variant that incorporates principle component analysis (PCA) and hierarchical clustering , and Fuzzy J-Means, that applies variable neighborhood searching to avoid cluster solution being trapped in local minima . A FuzzySOM approach was also developed, to improve FCM by arraying the cluster centroids into a regular grid . All these fuzzy C-means-derived clustering approaches suffer from the same basic limitation of K-means, i.e. using pairwise similarity between objects and cluster centroids for membership assignment, thereby lacking the ability to capture non-linear relationships . Another family of fuzzy clustering approaches is based on Gaussian Mixture Model (GMM) , where the dataset is assumed to be generated by a mixture of Gaussian distributions with certain probability, and an objective function is calculated based on the mixture Gaussians as the likelihood of the dataset being generated by such model. Then the objective function is maximized to solve the model and give a set of probabilistic assignment.

A possible problem with this approach, as highlighted by Yeung and colleagues, is that real expression data do not always satisfy the basic Gaussian Mixture assumption, even after various transformations aimed at improving the normality of the data distributions.

Our algorithm approaches fuzzy data clustering from a novel perspective. It is mainly based on two general assumptions: (a) clusters should be identified in the relatively dense part of the dataset; (b) neighboring objects with similar features (expression profiles) must have similar cluster memberships, so that the membership of one object is constrained by the memberships of its neighbors. Therefore, the membership of each single object (gene or sample) is not determined with respect to all other objects in the dataset or to some cluster centroids, but is determined with respect to its neighboring objects only. This approach brings the notable advantage of capturing non-linear relationships, in a way similar to a nonlinear data dimensionality reduction approach called Locally Linear Embedding (LLE), originally developed for mapping multi-dimensional objects (data points) into a lower-dimensional space for their representation. The idea behind LLE is that, in a dataset, most nonlinear relationships can be effectively captured by subdividing the general network of relationships across all objects into locally linear relationships between neighbor objects. As an important consequence, information about one object can be correctly approximated by information obtained from its nearest neighbors. For each object, LLE therefore uses the original dataset to define its nearest neighbors and to assign a set of weights specifying how much each neighbor contributes to the reconstruction of the features (coordinates) of the object. After this, the dataset can be represented in a lower-dimensional space, where each object is mapped according to the lower-dimensional representation of its nearest neighbors and the weights assigned to them. In this way the local structure of the original dataset (the neighbors of each object and their proximity) is preserved also in a lower-dimensional space, such as the 2D or 3D views commonly used for data display.
We therefore envisaged a fuzzy clustering approach based on neighborhood approximation, to capture non-linear relationships in multidimensional data and to provide a substantial improvement in the visualization and analysis of microarray data. The novel clustering method, FLAME, integrates the two above-mentioned key properties: (a) fuzzy membership assignment (one-to-many gene-to-cluster relationship); (b) definition of membership assignment by local approximation, where the membership assignment of a gene depends on the membership assignments of its neighbor genes (genes showing similar behavior).

Several fuzzy clustering algorithms have been proposed by various researchers. These include fuzzy ISODATA, fuzzy C-means, the fuzzy K-nearest neighbor algorithm, potential-based clustering, the FLAME algorithm and others. Recently, further fuzzy clustering algorithms have been proposed. For example, Fu and Medico developed a clustering algorithm to capture dataset-specific structures at the beginning of the DNA microarray analysis process, which is known as FLAME (Fuzzy clustering by Local Approximation of MEmbership).

Hierarchical structure is used not only in non-fuzzy clustering; it is used in fuzzy clustering too.

First, we discuss fuzzy K-means and fuzzy C-means, for comparison with FLAME:

3.6 K-means algorithm

Given a set of observations (x_1, x_2, ..., x_n), k-means clustering partitions the set into k clusters such that the within-cluster sum of squares (WCSS) is minimized.

Formally, it seeks \arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \| x - \mu_i \|^2, where \mu_i is the mean of the points in S_i. This results in a partitioning of the input space into Voronoi cells. This algorithm is advantageous over hierarchical clustering as it is computationally faster with a large number of variables. It also forms tighter clusters than hierarchical clustering, especially if the clusters are globular. However, the fixed number of clusters can make it difficult to predict what k should be, it does not work well with non-globular clusters, and different initial partitions can result in different final clusters.
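As an illustration, here is a minimal Lloyd-style k-means in Python; this is only a sketch of the standard procedure (the data and k are made up for the example, and it is not the implementation used in GEDAS):

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and centroid update
    to (locally) minimize the within-cluster sum of squares (WCSS).
    Assumes no cluster becomes empty (true for this toy data)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Voronoi cell).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    wcss = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
    return labels, centroids, wcss

# Two well-separated globular groups are recovered correctly.
X = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
labels, centroids, wcss = kmeans(X, k=2)
```

Note that a different random initialization could, on less well-separated data, converge to a different local minimum, which is exactly the sensitivity to the initial partition mentioned above.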

3.7 Fuzzy C-means algorithm

This algorithm belongs to the family of fuzzy-logic-based clustering algorithms and was introduced in 1984 by Bezdek. It attempts to partition a finite collection of n elements X = {x_1, x_2, ..., x_n} into a collection of K clusters C = {c_1, c_2, ..., c_K} by associating each gene with all clusters via a real-valued vector of membership indexes. We introduce a partition matrix U = [u_{ki}], u_{ki} \in [0,1], k = 1, ..., K, i = 1, ..., n, where each element u_{ki} gives the degree to which element x_i belongs to cluster c_k. Similarly to the K-means algorithm, it aims to minimize the objective function

J_m = \sum_{k=1}^{K} \sum_{i=1}^{n} u_{ki}^m \| x_i - c_k \|^2,

where m > 1 controls the degree of fuzziness.
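A minimal fuzzy C-means sketch follows, using the standard alternating updates of the partition matrix and the centroids; the dataset, K and fuzzifier m here are illustrative choices only:

```python
import numpy as np

def fuzzy_c_means(X, K, m=2.0, n_iter=50, seed=0, eps=1e-9):
    """Minimal fuzzy C-means: alternately update the partition matrix U
    (K x n, columns summing to 1) and the cluster centroids."""
    rng = np.random.default_rng(seed)
    U = rng.random((K, len(X)))
    U /= U.sum(axis=0)                      # normalize memberships per object
    for _ in range(n_iter):
        Um = U ** m
        C = (Um @ X) / Um.sum(axis=1, keepdims=True)          # centroid update
        d = np.linalg.norm(X[None, :, :] - C[:, None, :], axis=2) + eps
        U = d ** (-2.0 / (m - 1))           # standard FCM membership update
        U /= U.sum(axis=0)
    return U, C

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
U, C = fuzzy_c_means(X, K=2)
```

Note how every membership depends only on the distances to the centroids: this is the limitation, discussed above, that prevents FCM-style methods from capturing non-linear cluster shapes.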

Although FCM is an effective clustering technique, the resulting membership values do not always correspond well to the degree of belonging of the data, and it may be inaccurate in noisy environments.

FLAME (Fuzzy clustering by Local Approximation of MEmbership) works by defining the neighborhood of each object and identifying cluster supporting objects. The fuzzy membership vector of each object is assigned by approximating the memberships of its neighboring objects through an iterative converging process. Its performance was found to be better than that of the fuzzy C-means and fuzzy K-means algorithms and of fuzzy self-organizing maps (SOM). Ma and Chan proposed an Incremental Fuzzy Mining (IFM) technique to tackle the complexities of high-dimensional noisy data, as encountered in genetic engineering. The basic philosophy of that technique is to mine gene functions by transforming quantitative gene expression values into linguistic terms, and using a fuzzy measure to explore any interesting patterns existing between the linguistic gene expression levels. Those patterns can make accurate gene function predictions, such that each gene can be allowed to belong to more than one functional class with different degrees of membership.

Chapter two

FLAME

4.1 The FLAME algorithm

Data clustering by FLAME proceeds in three main steps, illustrated in Figure one. The first is the extraction of local structure information and the identification of cluster supporting objects (CSOs). In this step, the distance/proximity between each object and its k-nearest neighbors is used to calculate object density. Objects with the highest density among their neighbors are identified as CSOs and serve as prototypes for the clusters, based on the fact that many other objects show similar behavior. Some outliers, whose behavior is rare within the dataset, are also identified in this step. The second step is the assignment of fuzzy membership by local approximation. The initial number of clusters is defined by the number of CSOs. At the start, every object is assigned equal membership to all clusters, with the exception of CSOs and outlier objects: each CSO is assigned full membership to itself as a cluster, and all outlier objects are assigned full membership to the outlier group. Then, an iterative process is performed to approximate the fuzzy memberships of the objects that are neither CSOs nor outliers, for which the membership is fixed. At each iteration, the fuzzy membership of each object is updated by a linear combination of the memberships of its nearest neighbors, weighted by their proximity. In this way the fixed, full memberships of CSOs and outliers exert an influence on the memberships of their neighbors, which later propagates through the neighborhood network in the subsequent iterations, so that the final membership of each object (except CSOs and initial outliers) is the result of a balanced, direct and indirect, influence of the memberships of all other objects.
To facilitate comprehension of the process of membership "propagation", a Flash animation is available as an additional movie. The last step is the construction of clusters from the fuzzy memberships, which can be done in two ways: (i) by assigning each object to the cluster in which it has the highest membership degree (one-to-one object-cluster relationship), or (ii) by applying a threshold on the memberships and assigning each object to the one or more clusters in which it has a membership degree higher than the threshold (one-to-many object-cluster relationship). In the validation analysis presented here, we used the one-to-one membership method.

Before assessing the clustering performance of FLAME, we preliminarily estimated its computational efficiency by analyzing its time complexity. As a theoretical time complexity estimation of the membership approximation procedure is very difficult, we performed an empirical study of the time complexity of FLAME compared with other algorithms. The empirical results show that for data matrices with many columns, FLAME has a significant computational advantage over the other methods, except K-means. However, in our implementation no sophisticated techniques were implemented for K-means to search for the global minimum, while FLAME always guarantees a global minimum. Taking this into account, K-means may not have much of a computational advantage over FLAME.

FLAME implementation in the GEDAS software

The whole FLAME algorithm has been implemented as a part of Gene Expression Data Analysis Studio (GEDAS), a C++ program with graphical user interface currently running on Linux and Microsoft Windows. Two user modes are provided, Simple Mode, which is enough for most usages, and Advanced Mode, which enables tuning of all parameters to optimize the clustering. The key parameter to tune during FLAME optimization is the KNN number, because it affects the number of clusters in different ways. First, KNN determines the smoothness of density estimation (the number of peaks in the density distribution), which in turn limits the maximum number of CSOs. Second, KNN determines the range covered by one CSO: the larger the KNN, the larger the CSO range, with fewer CSOs. In the end, in the neighborhood approximation step, KNN determines the range of membership influence of each object: the larger the KNN, the fuzzier the memberships of the genes. Four other clustering algorithms, K-means, hierarchical clustering, Fuzzy C-Means (FCM) and Fuzzy SOM (FSOM) are implemented in GEDAS. Multiple cluster validation metrics based on Figures Of Merit (FOM ) have also been implemented in the software, for selection of the best-performing clustering algorithm and parameters in a given dataset. More details about GEDAS and its use are provided in a manual, which is available together with the software.

Comparative analysis of FLAME performances: expression partitioning

To assess the performance of FLAME and compare it with the other above-mentioned algorithms, we used GEDAS to cluster four different datasets: (i) the Reduced Peripheral Blood Monocytes (RPBM) dataset, (ii) the yeast cell cycle (YCC) expression dataset, (iii) the hypoxia response (HR) dataset, and (iv) the mouse tissues (MT) dataset. Further details on data processing and clustering are provided in the Methods section.

The clustering performance was initially assessed using three different Figures of Merit (FOM): 1-Norm FOM, 2-Norm FOM and Range FOM (a short description of FOMs and their properties is provided in Methods). We noticed that FOM analysis cannot be applied to FLAME in the standard way, because there is no parameter in FLAME to directly fix the number of clusters: the cluster number is indirectly determined by the chosen number of K-nearest neighbors (KNN). Moreover, for the same KNN number, when one experimental condition is left out during the analysis, the number of clusters generated by FLAME may change. Therefore, when applying FOM to FLAME, we use the median number of clusters generated by a given KNN during the leave-one-out analysis as the representative cluster number. The FOM analysis could not be performed on the MT dataset, because of its high sample diversity.

1-Norm FOM produced results very similar (in the sense of relative performance between algorithms) to the more widely used 2-Norm FOM.

2-Norm FOM analysis (Fig. 2) indicated that no clustering algorithm was the best in all datasets, with FLAME, hierarchical clustering and FSOM being the best in, respectively, the RPBM, HR and RYCC data. Conversely, Range FOM highlighted a better performance for hierarchical clustering in all datasets, with FLAME being the second best.

To validate clustering performance also on large datasets with reasonable computing time, we defined another validation index, named Partitioning Index, which does not require leave-one-out analysis and is defined as the ratio between the overall within-cluster variability and the overall between-cluster distance. According to this metric, a good data clustering results in low variability within each cluster and high distance between the various clusters. To calculate the overall within-cluster variability, the variability within each cluster is determined as the average distance between each pair of genes in the cluster, and then averaged for all clusters. The between-cluster distance is obtained by averaging all pairwise distances between clusters. In turn, each single between-cluster distance is calculated by averaging the distance between each pair of genes from the two clusters.
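A sketch of how such a Partitioning Index could be computed follows the definition above directly; the helper names and toy data are illustrative, not taken from GEDAS:

```python
import numpy as np
from itertools import combinations

def avg_within(A):
    """Average distance between each pair of distinct objects in one cluster."""
    d = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)
    iu = np.triu_indices(len(A), k=1)      # upper triangle: each pair once
    return d[iu].mean()

def avg_between(A, B):
    """Average distance over all pairs of objects taken from two clusters."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2).mean()

def partitioning_index(X, labels):
    """Overall within-cluster variability divided by overall
    between-cluster distance; lower values indicate a better partition."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    within = np.mean([avg_within(c) for c in clusters])
    between = np.mean([avg_between(a, b) for a, b in combinations(clusters, 2)])
    return within / between

X = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
good = partitioning_index(X, np.array([0, 0, 1, 1]))   # compact, well separated
bad = partitioning_index(X, np.array([0, 1, 0, 1]))    # mixed-up clusters
assert good < bad
```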

Figure 2

Clustering validation and comparison by 2-Norm FOM. a, 2-Norm FOM on the reduced peripheral blood monocyte dataset. b, 2-Norm FOM on the reduced hypoxia response dataset. c, 2-Norm FOM on the reduced yeast cell cycle dataset.

Interestingly, according to the Partition Index analysis, FLAME emerged as the best algorithm in three out of four datasets (Fig.3). A possible explanation for the different results obtained with the Partition Index analysis is that FLAME may generate non-globular clusters with more heterogeneous size distribution. Indeed, FOM is calculated by averaging the deviations in the left-out condition not cluster by cluster, but by averaging over the whole dataset. Therefore, large clusters with high internal variability have a higher weight in FOM calculation than small, compact clusters. We verified that a modified FOM calculation, where deviations are averaged at the cluster level, gives better values for FLAME (data not shown).

Figure 3

Clustering validation and comparison by Partition Index. a, Partition Index on the reduced peripheral blood monocyte dataset. b, Partition Index on the hypoxia response dataset. c, Partition Index on the yeast cell cycle dataset. d, Partition Index on the mouse tissue dataset.

Comparative analysis of FLAME performances on function partitioning

As a consequence of partitioning of genes according to their expression, a good clustering algorithm should also generate clusters of functional significance, i.e. of genes that share both similar expression profiles and similar functional roles. Particular caution should however be taken when using gene clustering for functional analysis, as the assumption that genes sharing the same expression profile have a similar function does not always hold true and requires extensive statistical validation. To assess whether FLAME is better than other algorithms at partitioning genes into functionally homogeneous groups, we used Gene Ontology (GO) annotation for a comparative assessment on three datasets (functional annotation analysis is not feasible for the RPBM dataset, as explained in Methods). For the GO-based comparison, the first thing we investigated is how the GO terms are spread among the expression clusters. The rationale is that a good clustering algorithm should highlight which gene functional classes (GO terms) display a precise pattern of transcriptional regulation in a given dataset. For such classes, the algorithm should generate few expression clusters annotated with the respective GO term and many clusters without annotation to that term. A high spreading of all GO terms across the various clusters is an index of poor performance. We therefore calculated, for each GO term, the percentage of clusters with at least one annotation to that term, and defined a global Annotation Spreading Index as the median of such percentages across all GO terms. As shown in Figure 4, FLAME has a substantially lower Annotation Spreading Index in two out of three datasets.
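The Annotation Spreading Index can be sketched in a few lines; the GO terms and clusters below are a made-up toy example:

```python
import numpy as np

def annotation_spreading_index(cluster_terms):
    """cluster_terms: list of sets, each holding the GO terms annotated to
    at least one gene of that cluster. Returns the median, over all GO
    terms, of the percentage of clusters containing the term
    (lower = better function partitioning)."""
    all_terms = set().union(*cluster_terms)
    n = len(cluster_terms)
    pct = [100.0 * sum(t in c for c in cluster_terms) / n for t in all_terms]
    return float(np.median(pct))

# Hypothetical example: term 'GO:a' is confined to one of three clusters,
# while 'GO:b' is spread over all of them.
clusters = [{'GO:a', 'GO:b'}, {'GO:b'}, {'GO:b'}]
idx = annotation_spreading_index(clusters)
```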

Figure 4

Clustering validation and comparison by Annotation Spreading Index.a, Spreading Index on the hypoxia response dataset. b, Spreading Index on the yeast cell cycle dataset. c, Spreading Index on the mouse tissue dataset.

A second metric to assess function partitioning is based on the principle that a good clustering method should generate clusters with asymmetric distribution of functional classes, in which specific groups of functions are enriched in specific clusters. To evaluate this, we calculated a vector composed of the number of occurrences of each of the represented GO terms across the entire gene set. This vector was called the Average Annotation Profile. A similar vector was then calculated for each expression cluster, the Cluster Annotation Profile (for annotation profile matrices). We then calculated the correlation between the annotation profile of each cluster and the average annotation profile of the entire dataset. The median of the correlations between the annotation profile of each cluster and the average annotation profile finally yielded an index called Correlation with Average Annotation (CAVA). A high CAVA value indicates that the various functions are represented in the various clusters in a similar way, and therefore indicates poor function partitioning. As shown in Figure 5 , the annotation profiles of clusters generated by FLAME display the lowest correlation to the average annotation profile in two out of three datasets. Hypergeometric distribution analysis indicated that the enrichment of GO terms in clusters generated by FLAME reached statistical significance (not shown).
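A minimal CAVA computation, assuming annotation profiles are given as count matrices (the counts below are hypothetical):

```python
import numpy as np

def cava(profiles):
    """profiles: matrix (clusters x GO terms) of annotation counts.
    CAVA = median Pearson correlation between each Cluster Annotation
    Profile and the Average Annotation Profile of the whole dataset.
    High CAVA means clusters mirror the average, i.e. poor partitioning."""
    avg = profiles.sum(axis=0)             # overall annotation profile
    corrs = [np.corrcoef(row, avg)[0, 1] for row in profiles]
    return float(np.median(corrs))

# Hypothetical counts: each cluster enriched in a different term (sharp,
# low CAVA) versus clusters that all mirror the average profile (flat).
sharp = np.array([[9., 1., 0.], [0., 1., 9.]])
flat = np.array([[4., 1., 4.], [5., 1., 5.]])
```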

Figure 5

Clustering validation and comparison by Correlation to Average Annotation Profile (CAVA). a, CAVA on the hypoxia response dataset. b, CAVA on the yeast cell cycle dataset. c, CAVA on the mouse tissue dataset.

In both types of analysis, we noticed that some GO terms maintain a wide distribution across all clusters and do not display particular expression patterns. This is in line with the fact that not all gene functional categories are expected to be coordinately regulated at the transcriptional level in a given set of experimental conditions.

To provide a quantitative readout of the comparative analysis between the various algorithms, we defined a way to rank the algorithms in each validation analysis based on the area below the index line plots. The algorithm giving the smallest area below the index line plot was assigned a rank of 1 (the best), and the others obtained progressively higher values (lower ranks). The results of this ranking procedure, illustrated in Table 1, show that no single clustering algorithm always has the best performance in all datasets and with all validation metrics. However, FLAME proved the best in many cases and, more importantly, its "performance profile" across the various datasets and validation metrics is profoundly different from those of the other algorithms. This indicates that FLAME can be a truly alternative clustering strategy, while, as an example, FSOM and FCM, which are tightly related clustering algorithms, display an overlapping performance profile.

Discussion

We present here a new algorithm for clustering microarray data, FLAME, that exploits a typical feature of "real-life" biological clusters, like sheep herds and fish shoals: the behavior of one element is dictated by the behavior of its neighbors. In other fuzzy clustering algorithms, like Fuzzy C-means or Fuzzy K-means, the fuzzy memberships of data points are directly determined by their similarity to a series of calculated cluster prototypes (or centroids). Conversely, FLAME uses pairwise similarity measures only to define the neighbors of each gene and how close each gene is to its nearest neighbors, and then approximates the fuzzy memberships of each object from its neighbors' memberships. In this approach, the cluster "prototypes", that we named Cluster Supporting Objects (CSOs, See Methods section), are defined as individual genes having a particularly high number of neighbors. The behavior of such genes would therefore be an "archetypal" behavior, shared with many other genes, and therefore likely to correctly represent the data structure. After defining the CSOs, the membership approximation propagates like a wave from the CSOs to other far objects through a network formed by the neighborhood relationships. In this way FLAME, essentially, performs the clustering using not the expression data, but the local information extracted from them, which allows reliable capturing of both linear and non-linear relationships.

Table 1

Ranking of each clustering algorithm across all comparative validation cases.

In some sense, FLAME is also a kind of self-organization method. However, this self-organization process is quite distinct from the one of Self-Organizing Maps (SOM) and fuzzy SOM, which is based directly on the expression measurements. SOM also defines a neighborhood, but this neighborhood is defined only for neurons (i.e. cluster prototypes), and set in advance to constrain the cluster orientations independently from the dataset. In FLAME, instead, the neighborhood relationships are calculated for all objects, and are used to constrain the fuzzy memberships with no external inputs on the cluster number and size.

In principle, the possible applications of FLAME are not limited to gene expression datasets. In particular, the assumption that neighboring objects should have similar fuzzy memberships is well described as a mathematical cost function (Local/Neighborhood Approximation Error). Minimization of this cost function renders FLAME theoretically very valuable, because the Local Approximation Error could possibly be used in combination with other clustering constraints to get new and more powerful clustering algorithms. FLAME can be applied to any dataset including category datasets if a neighborhood can be defined for each object. In fact, a set of neighborhood relationships among the objects is the minimum requirement of FLAME, since a rough similarity between neighboring objects can be estimated as the fraction of their common neighbors.

Methods

Extraction of Local Structure Information and CSO Identification

In this step, local structure information is extracted and cluster supporting objects are identified. To do this, the similarities between each pair of objects are calculated, and the nearest neighbors are identified. The similarity measures between each object and its nearest neighbors are used to estimate the density around that object and to calculate a set of weights for the Local Approximation of fuzzy memberships in the next step. The set of densities forms a rough estimation of the distribution of the dataset, and they are used in this step to identify CSOs and possible cluster outliers. Different distance and density metrics have been implemented in our software; here we describe the default ones.

The k-nearest neighbors (KNN) of each gene are defined as the k genes with the highest similarity according to a given similarity measure. The weights w_{xy}, defining how much each neighbor y contributes to the approximation of the fuzzy membership of x, satisfy \sum_{y \in KNN(x)} w_{xy} = 1 and are calculated from the similarities s_{xy} between that gene and its nearest neighbors. The only requirement for a definition of the weights is that neighbors with higher similarities must have higher weights. The simplest one we use is

w_{xy} = s_{xy} / \sum_{z \in KNN(x)} s_{xz},

and distance measures are transformed into similarity measures before applying this definition. For correlation measures, an additional transformation is applied to highlight the relative proximities.

The density of each gene is calculated as one over the average distance to its k-nearest neighbors. Subsequently, the set of CSOs (X_CSO) is defined as the set of objects with Local Maximum Density (LMAXD), i.e., with a density higher than that of all objects in their neighborhood. The higher k is, the fewer CSOs will be identified and, as a consequence, the fewer clusters will be generated.

To define possible cluster outliers, a density threshold can be applied, so that objects with a density below the threshold are defined as possible outliers (genes with "atypical" behavior). This enables starting the clustering process from the entire dataset or after just minimal filtering. A definition similar to LMAXD can also be applied to define outliers, namely objects with Local Minimum Density (LMIND). In our validation, we used LMIND plus a density threshold defined as the mean minus two times the standard deviation of the densities.
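This first step can be sketched as follows; the inverse-distance similarity transform and the toy data are illustrative choices for the sketch, not necessarily the defaults used in GEDAS:

```python
import numpy as np

def flame_step1(X, k):
    """Step one of FLAME (sketch): k-nearest neighbors, normalized weights,
    densities, CSOs (local density maxima) and outliers (local density
    minima below the threshold mean - 2*std)."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # k nearest neighbors of each object (column 0 is the object itself).
    knn = np.argsort(dist, axis=1)[:, 1:k + 1]
    knn_d = np.take_along_axis(dist, knn, axis=1)
    # Density: one over the average distance to the k nearest neighbors.
    density = 1.0 / (knn_d.mean(axis=1) + 1e-12)
    # Weights: distances turned into similarities, then normalized so that
    # each object's weights over its neighbors sum to 1.
    sim = 1.0 / (knn_d + 1e-12)
    w = sim / sim.sum(axis=1, keepdims=True)
    thresh = density.mean() - 2.0 * density.std()
    csos, outliers = [], []
    for i in range(n):
        nb = density[knn[i]]
        if density[i] > nb.max():                      # local density maximum
            csos.append(i)
        elif density[i] < nb.min() and density[i] < thresh:  # local minimum
            outliers.append(i)
    return knn, w, density, csos, outliers

# Two small clumps: the densest object of each clump becomes its CSO.
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
knn, w, density, csos, outliers = flame_step1(X, k=2)
```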

4.2 Local Approximation of Fuzzy Membership

In fuzzy clustering, each object x is associated with a membership vector p(x), in which each element p_i(x) indicates the membership degree of x in cluster i:

p(x) = (p_1(x), p_2(x), ..., p_M(x)),

where:

0 \le p_i(x) \le 1; \quad \sum_{i=1}^{M} p_i(x) = 1;

and M = |X_CSO| + 1.

Note that |...| denotes the number of elements in a set. Each element of the membership vector takes a value between 0 and 1, indicating the degree to which an object belongs to a cluster or is an outlier (the last element stands for the outlier group).

In FLAME, such a membership vector is assigned to each object through an iterative process of local approximation. More precisely, the membership vector of one object is approximated by a combination of its nearest neighbors' memberships:

p(x) \approx \sum_{y \in KNN(x)} w_{xy} p(y),

where the sum is over x's nearest neighbors, and the weights w_{xy}, with \sum_{y \in KNN(x)} w_{xy} = 1, are calculated from the original dataset as described before.

The iteration proceeds to minimize the overall difference between the membership vectors and their approximations, described as the Local (Neighborhood) Approximation Error:

E(\{p\}) = \sum_{x \in X} \left\| p(x) - \sum_{y \in KNN(x)} w_{xy} p(y) \right\|^2    (1)

where each term is the difference between the membership vector p(x) and its linear approximation by its neighbors, \sum_{y \in KNN(x)} w_{xy} p(y).

In FLAME, Eq. (1) is minimized to calculate the set of membership vectors under some constraints (in addition to the natural constraints on fuzzy membership vectors) derived in the first step, namely fixing the membership vectors of CSOs and outliers to avoid the trivial solution where all p(x) are the same.

Each CSO represents a cluster and is assigned a unique membership vector, in which only the element with index corresponding to its own cluster is 1 and all others are 0. All cluster outliers are assigned the same membership vector, in which the last element is 1 and the others are 0. All other objects (for convenience referred to as X' = X \ X_CSO \ X_Outlier) have their membership vectors initialized to be identical, with all elements in each vector taking the same value, 1/M. This means that at the beginning they are uncertain about which clusters they belong to. In fact, random initialization does not change the final result, but slightly increases the computational time.

Now we can fix the memberships of the CSOs X_CSO and the outliers X_Outlier as a set of constraints and minimize Eq. (1). To get a simpler algorithm, we excluded X_CSO and X_Outlier from the sum in Eq. (1), so we have

E'(\{p\}) = \sum_{x \in X'} \left\| p(x) - \sum_{y \in KNN(x)} w_{xy} p(y) \right\|^2    (2)

It can be proved in a heuristic way that E'({p}) can be minimized by the iterative procedure defined as

p_{t+1}(x) = \sum_{y \in KNN(x)} w_{xy} p_t(y)  for x \in X',

starting from p_0(x) satisfying \sum_{i=1}^{M} p_{0,i}(x) = 1. In this way, the fuzzy membership of one object in approximation cycle t+1 is updated by a linear combination of the fuzzy memberships of its neighbors in cycle t. As in the step identifying CSOs and outliers, a new neighborhood can be defined, or one of the neighborhoods defined in previous steps can simply be reused. The combination weight w_{xy} is defined by the relative proximity of y to x with respect to the other neighbors of x: the closer y is to x, the bigger w_{xy} is. The type of neighborhood and the weights w_{xy} affect the fuzziness of the clustering. For t \to \infty, p_t(x) converges to p* with E'({p*}) = 0, and in each step p_t(x) satisfies \sum_{i=1}^{M} p_{t,i}(x) = 1. The set of outliers can be enlarged after the Neighborhood Approximation of Fuzzy Membership, due to the fact that some other objects will end up with memberships similar to those of the outliers.
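The iterative procedure can be sketched in a few lines; the tiny neighborhood graph and weights below are hypothetical, chosen only to show memberships propagating outward from two fixed CSOs:

```python
import numpy as np

def approximate_memberships(knn, w, csos, outliers, n, n_iter=200):
    """Step two of FLAME (sketch): fix CSO and outlier memberships,
    initialize all other objects uniformly, then iterate
        p_{t+1}(x) = sum over y in KNN(x) of w_xy * p_t(y)
    for the non-fixed objects until the memberships converge."""
    M = len(csos) + 1                   # one cluster per CSO + outlier group
    p = np.full((n, M), 1.0 / M)        # uniform start: undecided objects
    for c, i in enumerate(csos):        # each CSO fully owns its own cluster
        p[i] = 0.0
        p[i, c] = 1.0
    for i in outliers:                  # outliers fully in the outlier group
        p[i] = 0.0
        p[i, -1] = 1.0
    fixed = set(csos) | set(outliers)
    free = [i for i in range(n) if i not in fixed]
    for _ in range(n_iter):
        new = p.copy()
        for i in free:
            # Linear combination of the neighbors' current memberships.
            new[i] = w[i] @ p[knn[i]]
        p = new
    return p

# Toy chain 0-1-2-3 where objects 0 and 3 are CSOs (hypothetical KNN/weights).
knn = np.array([[1, 2], [0, 2], [1, 3], [2, 1]])
w = np.array([[0.7, 0.3], [0.5, 0.5], [0.5, 0.5], [0.7, 0.3]])
p = approximate_memberships(knn, w, csos=[0, 3], outliers=[], n=4)
```

Because the weights of each object sum to one, every p_t(x) remains a valid membership vector throughout the iteration, as stated above.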

Cluster Construction

When a set of fuzzy memberships is calculated, clusters can be defined based on a one-to-one gene-cluster assignment. Alternatively, one object can be assigned to more than one cluster if it has a reasonably high membership score for multiple clusters. Also, some objects may not be assigned to any clusters if they don't have one dominant membership percentage. The objects not assigned to any cluster are regarded as outliers. In this way more objects can be screened out from clusters.
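Both construction modes can be sketched as follows; the membership values are made up, and the last column represents the outlier group:

```python
import numpy as np

def construct_clusters(p, threshold=None):
    """Step three of FLAME (sketch). With threshold=None, one-to-one
    assignment: each object joins the cluster of its highest membership.
    With a threshold, one-to-many: an object joins every cluster whose
    membership exceeds the threshold (possibly none, making it an outlier)."""
    if threshold is None:
        best = p.argmax(axis=1)
        return [np.flatnonzero(best == c) for c in range(p.shape[1])]
    return [np.flatnonzero(p[:, c] > threshold) for c in range(p.shape[1])]

# Hypothetical memberships for 3 objects over 2 clusters + outlier group.
p = np.array([[0.80, 0.10, 0.10],
              [0.45, 0.45, 0.10],    # ambiguous object shared by two clusters
              [0.10, 0.80, 0.10]])
one_to_one = construct_clusters(p)
one_to_many = construct_clusters(p, threshold=0.4)
```

With the threshold, the ambiguous object appears in both clusters, which is exactly the one-to-many object-cluster relationship described above.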

Figures of Merit

The use of Figures of Merit (FOMs) has been proposed by Yeung and colleagues to characterize the predictive power of different clustering algorithms. FOM is estimated by removing one experiment at a time from the dataset, clustering genes based on the remaining data, and then measuring the within-cluster similarity of the expression values in the left-out experiment. The principle is that correctly co-clustered genes should retain a similar expression level also in the left-out sample. The assumption (and limit) of this approach is that most samples have correlated gene expression profiles. The most commonly used FOM, referred to as "2-Norm FOM", measures the within-cluster similarity as the root mean square deviation from the cluster mean in the left-out condition. Then, an aggregated FOM is obtained by summing the FOMs of all left-out experiments and used to compare the performance of different clustering algorithms (the lower the FOM, the better the predictive power of a clustering algorithm). Other types of FOM measure the within-cluster similarity in different ways. Of these, "1-Norm FOM" and "Range FOM" have also been used in this work. 1-Norm FOM measures the within-cluster similarity in the left-out experiment as the average of the Manhattan distances between the expression levels of genes and the mean expression level in the clusters. Range FOM is the average difference between the maximum and minimum expression levels in the clusters in the left-out experiment. While 1-Norm and 2-Norm FOM measure the compactness of clusters, Range FOM measures the diameter of clusters regardless of the distribution of expression values in a cluster. Different FOMs may favor different clustering algorithms depending on the clustering criteria they employ. Moreover, FOM may not be applicable to datasets where most of the experimental conditions display highly divergent gene expression profiles.
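The 2-Norm FOM for a single left-out condition can be sketched as follows; the values are a toy example, and in practice the labels would come from re-clustering the remaining experiments:

```python
import numpy as np

def two_norm_fom(left_out, labels):
    """2-Norm FOM for one left-out condition: root mean square deviation of
    each gene's left-out value from its cluster mean in that condition.
    left_out: expression values of all genes in the left-out experiment;
    labels: cluster labels obtained from the remaining experiments."""
    n = len(left_out)
    sq = 0.0
    for c in np.unique(labels):
        vals = left_out[labels == c]
        sq += ((vals - vals.mean()) ** 2).sum()
    return np.sqrt(sq / n)

# Correctly co-clustered genes stay similar in the left-out condition,
# giving a lower FOM than a shuffled clustering of the same genes.
left_out = np.array([1.0, 1.1, 5.0, 5.1])
good = two_norm_fom(left_out, np.array([0, 0, 1, 1]))
bad = two_norm_fom(left_out, np.array([0, 1, 0, 1]))
```

The aggregated FOM would then sum this quantity over all left-out experiments, as described above.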

Datasets and analysis parameters

RPBM

It is a reduced version of a Peripheral Blood Monocytes dataset originally used by Hartuv et al. to test their clustering algorithm. The dataset consists of 139 hybridizations (performed with 139 different oligonucleotide probes) on an array of 2329 spotted cDNAs derived from 18 genes. The rationale is that spotted cDNAs derived from the same gene should display a similar profile of hybridization to the 139 probes and therefore be clustered together. This dataset was later reduced to contain only 235 cDNAs by Gesu et al. [5], to reduce the computational time for applying FOM analysis. No validation based on gene functional annotation can be performed on this dataset, since it does not reflect gene expression.

YCC & RYCC

The yeast cell cycle (YCC) data is part of the studies by Spellman et al. The complete dataset contains about 6178 genes and 76 experimental conditions. The reduced yeast cell cycle (RYCC) dataset is a subset of the original YCC dataset, selected by Yeung et al. for FOM analysis and composed of 698 genes and 72 experimental conditions. To facilitate functional annotation analysis, we processed the complete data to obtain a subset of 5529 genes, annotated to a total of 64 biological process GO terms at level 4.

HR & RHR

The hypoxia response (HR) dataset is part of the recent work conducted by Chi et al. to investigate cell type specificity and prognostic significance of gene expression programs in response to hypoxia in human cancers. The dataset was downloaded from the Stanford Microarray Database with the default filtering parameters provided by the web interface. This initial dataset includes 11708 genes and 57 experimental conditions. The genes in the dataset were then annotated to GO terms, and only genes with annotation to biological process GO terms were selected for further analysis, resulting in a subset of 6613 genes. Since some genes have null values in most of the experimental conditions, genes with more than 80% null values were filtered out to facilitate the clustering analysis. This final dataset includes 6029 genes. Finally, the GO terms were also mapped to level 4, resulting in 149 GO terms. A subset (RHR, reduced hypoxia response dataset) of the final dataset was created for FOM analysis by selecting the top 1000 genes with the highest expression variation.

MT

The mouse tissue (MT) data is the result of the work of Zhang et al. on 55 mouse tissues, including 21622 confidently detected transcripts. We analyzed a subset of 6831 transcripts annotated with the 230 Gene Ontology (GO) terms defined by Zhang and colleagues as "super GO" terms. FOM-based analysis was not performed on this dataset because it contains individual tissue samples from many different organs with highly diverse expression profiles, which makes the basic assumption of FOM invalid for this dataset.

4.3 Hierarchical fuzzy clustering

Given a set of elements X, we have applied a mixed approach to build a fuzzy hierarchical structure. The process starts by building a fuzzy partition of X using fuzzy c-means. This results in a set of fuzzy membership functions μ_i, each built around a centroid v_i.

This fuzzy partition bootstraps the process. Then, an iterative process is applied to build the hierarchical clustering following a bottom-up strategy. At each step of the process, we start with a fuzzy partition of X represented by a set of membership functions μ_i. This set of membership functions is partitioned using a partitive clustering method for fuzzy sets. The partitive clustering method returns a new fuzzy partition P' that is used as the starting point of the next step. This is represented in the following algorithm:

Algorithm BottomUp (X: data) returns fuzzyHierarchy

begin

P := fuzzy partition (X);

while not the root(P) do

P':= fuzzy partition of P;

link P' and P;

set P := P';

end while

return(P);

end

4.4 Hierarchical fuzzy clustering with fuzzy c-means

In this algorithm, we use the fuzzy c-means algorithm to build the initial fuzzy partition. This fuzzy partition is obtained by applying the fuzzy c-means algorithm to X. In this case, the algorithm is applied with a large number of clusters (i.e., c is large). This choice of c is meant to produce a large number of leaves in the fuzzy hierarchy.

In the iterative process, we use a fuzzy c-means based clustering method. Given a set of fuzzy sets, we build a fuzzy partition following the scheme of the fuzzy c-means described in Algorithm

The difference lies in the way the distance ||x_k − v_i|| is computed. Note that here, x_k and v_i represent fuzzy sets.

More specifically, x_k stands for the k-th fuzzy set to be partitioned and v_i is one of the fuzzy sets in the new partition. Accordingly, ||x_k − v_i|| is a distance between fuzzy sets. We use here the distance defined in Equation 3. Following the standard approach in fuzzy c-means, the fuzzy membership of a fuzzy set with centroid v is defined considering all other centroids v_i. In our case, the membership of the fuzzy set with centroid x_k is computed for all x taking into account all other centroids x_j as follows:

μ_{x_k}(x) = ( Σ_{j=1}^{c} ( d(x_k, x)² / d(x_j, x)² )^{1/(m−1)} )^{−1}

Similarly, the membership of the fuzzy set with centroid v_k is computed for all x taking into account all other centroids v_j as follows:

μ_{v_k}(x) = ( Σ_{j=1}^{c'} ( d(v_k, x)² / d(v_j, x)² )^{1/(m−1)} )^{−1}

Note that here, the x_j are the centroids of the fuzzy sets being clustered and the v_j are the centroids of the clusters we are constructing with the fuzzy c-means. Similarly, c is the number of centroids x_j and c' is the number of centroids v_j.
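The membership computation can be sketched numerically. This is a generic fuzzy c-means membership update over crisp points and centroids (our simplification; the text above applies the same formula to fuzzy-set centroids), with hypothetical names.

```python
import numpy as np

def fcm_memberships(points, centroids, m=2.0, eps=1e-12):
    """Fuzzy c-means membership: the membership of point x in cluster k is
    ( sum_j ( d(v_k, x) / d(v_j, x) )^(2/(m-1)) )^-1."""
    # pairwise distances: d[i, k] = distance from point i to centroid k
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    d = np.maximum(d, eps)                       # avoid division by zero
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)               # rows sum to 1
```

As the formula requires, each membership vector sums to one over the c' clusters.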

Then, the distance between a fuzzy set with centroid x_k and another with centroid v_i is computed using the equation below, taking into account that the distance is restricted to the set A = { λ v_i + (1 − λ) x_k : λ ∈ [0, 1] }.

Here, we apply the same expression, using as x_k the centroids of the clusters being clustered and u_ik the corresponding membership values. Although we could at this point use the fuzzy extension to aggregate the fuzzy clusters (taking as x_k the fuzzy cluster itself instead of its centroid), we avoid that operation because it would increase the fuzziness of the whole result, and at the same time it would imply that v_i is a fuzzy centroid instead of a crisp one.

Note that this approach leads to different membership values than the one described in Section I-A. In particular, in the new approach it is possible for the membership of x in a cluster μ to be smaller than the membership of x in a subcluster μ_j of μ.

This situation was not possible in our previous approach. In that case, μ_j(x) = μ(x) · μ'_j(x) and, therefore, μ_j(x) ≤ μ(x) for every μ_j that is a subcluster of μ.

4.5 Hierarchical k-means clustering

The algorithm divides the dataset recursively into clusters. The k-means algorithm is used with k set to two in order to divide the dataset into two subsets. Then each of the two subsets is again divided into two subsets with k set to two. The recursion terminates when the dataset is divided into single data points or a stop criterion is reached.


Hierarchical k-means has O(n) run time. Such a run time is possible because both the k-means algorithm and all operations concerning trees are possible in O(n). Traversing a tree is always done via depth- or breadth-first search.

Evaluation studies with 65,000 data points in a 23-dimensional descriptor space, using a predefined stop criterion, showed computation times of less than 20 seconds for both clustering and visualization.
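The recursive bisection described above can be sketched as follows. This is our own minimal illustration (a plain Lloyd's k-means with k = 2 plus recursion and a leaf-size stop criterion), not the evaluated implementation.

```python
import numpy as np

def kmeans2(X, iters=20, seed=0):
    """Plain Lloyd's k-means with k = 2; returns a boolean split mask."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), 2, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        mask = d[:, 0] > d[:, 1]          # True -> closer to center 1
        for k in (0, 1):
            pts = X[mask == bool(k)]
            if len(pts):
                centers[k] = pts.mean(axis=0)
    return mask

def hierarchical_kmeans(X, idx=None, min_size=2, depth=0, max_depth=10):
    """Recursively bisect the dataset; stop at small leaves
    (the stop criterion) or at a maximum depth."""
    if idx is None:
        idx = np.arange(len(X))
    if len(idx) <= min_size or depth >= max_depth:
        return {"leaf": idx.tolist()}
    mask = kmeans2(X[idx])
    left, right = idx[~mask], idx[mask]
    if len(left) == 0 or len(right) == 0:  # degenerate split: stop here
        return {"leaf": idx.tolist()}
    return {"left": hierarchical_kmeans(X, left, min_size, depth + 1, max_depth),
            "right": hierarchical_kmeans(X, right, min_size, depth + 1, max_depth)}
```

Cutting the resulting tree at a chosen depth yields a flat clustering of any desired granularity.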

Flow chart for hierarchical k-means clustering

In the previous chapter we detailed fuzzy clustering by local approximation of membership (FLAME); here we review some information about it.

Fuzzy clustering by Local Approximation of MEmberships (FLAME) is a data clustering algorithm that defines clusters in the dense parts of a dataset and performs cluster assignment solely based on the neighborhood relationships among objects. The key feature of this algorithm is that the neighborhood relationships among neighboring objects in the feature space are used to constrain the memberships of neighboring objects in the fuzzy membership space.

Description of the FLAME algorithm

The FLAME algorithm is mainly divided into three steps:

1. Extraction of the structure information from the dataset:

a. Construct a neighborhood graph to connect each object to its K-Nearest Neighbors (KNN);

b. Estimate a density for each object based on its proximities to its KNN;

c. Classify the objects into three types:

- Cluster Supporting Object (CSO): an object with density higher than all its neighbors;

- Cluster outlier: an object with density lower than all its neighbors, and lower than a predefined threshold;

- The rest.

2. Local/neighborhood approximation of fuzzy memberships:

Initialization of fuzzy memberships:

- Each CSO is assigned fixed and full membership to itself, to represent one cluster;

- All outliers are assigned fixed and full membership to the outlier group;

- The rest are assigned equal memberships to all clusters and the outlier group.

Then the fuzzy memberships of all objects of the third type are updated by a converging iterative procedure called Local/Neighborhood Approximation of Fuzzy Memberships, in which the fuzzy membership of each object is updated as a linear combination of the fuzzy memberships of its nearest neighbors.

3. Cluster construction from the fuzzy memberships, in two possible ways:

- One-to-one object-cluster assignment: assign each object to the cluster in which it has the highest membership;

- One-to-multiple object-cluster assignment: assign each object to every cluster in which it has a membership higher than a threshold.
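The first step above (structure extraction) can be sketched as follows. This is a simplified illustration assuming Euclidean data, a density defined as the inverse of the mean KNN distance, and a hypothetical default outlier threshold (the mean density); it is not the reference implementation.

```python
import numpy as np

def flame_object_types(X, k=3, outlier_thresh=None):
    """Step 1 of FLAME: build the KNN graph, estimate each object's
    density, then label objects as CSO (density above all neighbors),
    outlier (density below all neighbors and below a threshold), or rest."""
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    knn = np.argsort(d, axis=1)[:, 1:k + 1]        # skip self (distance 0)
    density = 1.0 / np.array([d[i, knn[i]].mean() for i in range(n)])
    if outlier_thresh is None:
        outlier_thresh = density.mean()            # assumed default
    types = []
    for i in range(n):
        nb = density[knn[i]]
        if density[i] > nb.max():
            types.append("CSO")
        elif density[i] < nb.min() and density[i] < outlier_thresh:
            types.append("outlier")
        else:
            types.append("rest")
    return knn, density, types
```

The dense center of a cloud becomes a CSO, while an isolated point becomes an outlier.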

Local approximation of membership

In fuzzy clustering, each object x is associated with a membership vector p(x), in which each element p_i(x) indicates the membership degree of x in cluster i:

x : p(x) = (p_1(x), p_2(x), ..., p_M(x)),

where:

0 ≤ p_i(x) ≤ 1;   Σ_{i=1}^{M} p_i(x) = 1;

and M = |X_CSO| + 1.

Note that |...| denotes the number of elements in a set. Each element of the membership vector takes a value between 0 and 1, indicating the degree to which the object belongs to a cluster or to the outlier group (the last element stands for the outliers).

In FLAME, such a membership vector is assigned to each object through an iterative process of local approximation. More precisely, the membership vector of an object is approximated by a combination of its nearest neighbors' memberships, namely p(x) ≈ Σ_{y∈KNN(x)} w_xy p(y), where the sum is over x's nearest neighbors, and the weights w_xy, with Σ_{y∈KNN(x)} w_xy = 1, are calculated from the original dataset as described before.
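One plausible weighting scheme can be sketched as follows. The inverse-distance choice here is our own assumption for illustration (not necessarily the paper's exact formula); only the normalization Σ_y w_xy = 1 is required by the text above.

```python
import numpy as np

def knn_weights(X, knn):
    """Weights w_xy for the local approximation: here proportional to the
    inverse distance from x to neighbor y (an assumed scheme), normalized
    so that each row sums to 1 as the approximation requires."""
    w = np.zeros(knn.shape)
    for i in range(len(X)):
        d = np.linalg.norm(X[knn[i]] - X[i], axis=1)
        inv = 1.0 / np.maximum(d, 1e-12)   # closer neighbors weigh more
        w[i] = inv / inv.sum()
    return w
```

Any scheme giving higher weight to closer neighbors and summing to one per object fits the formulation.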

The iteration proceeds to minimize the overall difference between membership vectors and their approximations, described as the Local (Neighborhood) Approximation Error,

E({p}) = Σ_{x∈X} || p(x) − Σ_{y∈KNN(x)} w_xy p(y) ||²   (1)

where each term is the difference between the membership vector p(x) and its linear approximation by the neighbors, Σ_{y∈KNN(x)} w_xy p(y).

In FLAME, Eq. (1) is minimized to calculate the set of membership vectors under some constraints (in addition to the natural constraints on fuzzy membership vectors) derived in the first step; that is, the membership vectors of CSOs and outliers are fixed, to avoid the trivial solution where all p(x) are the same.

Each CSO represents a cluster and is assigned a unique membership vector, in which only the element whose index corresponds to its own cluster is 1 and all others are 0. All cluster outliers are assigned the same membership vector, in which the last element is 1 and the others 0. All other objects (for convenience referred to as X' = X \ X_CSO \ X_Outlier) have their membership vectors initialized to the same value, with all elements in each vector equal to 1/M. This means that in the beginning they are uncertain which clusters they belong to. In fact, random initialization does not change the final result, but it slightly increases the computational time.

Now we can fix the memberships of the CSOs X_CSO and the outliers X_Outlier as a set of constraints and minimize Eq. (1). To get a simpler algorithm, we exclude the fixed objects from the sum in Eq. (1), so we have

E'({p}) = Σ_{x∈X'} || p(x) − Σ_{y∈KNN(x)} w_xy p(y) ||²   (2)

It can be proved in a heuristic way that E'({p}) can be minimized by the iterative procedure defined below; the proof is:

1. Proof for the heuristic optimization procedure. Here we derive the heuristic iterative procedure that finds the global minimum of the local (neighborhood) approximation error E'({p}). The optimal {p} that minimize E'({p}) satisfy, for all x ∈ X':

0 = ∂( (1/2) E'({p}) − λ(x) ( Σ_{k=1}^{M} p_k(x) − 1 ) ) / ∂p_k(x)

= p_k(x) − Σ_{y∈G_x} w_xy p_k(y) − Σ_{z: x∈G_z} w_zx ( p_k(z) − Σ_{u∈G_z} w_zu p_k(u) ) − λ(x)   (1)

where λ(x) is the Lagrange multiplier for the constraint Σ_{k=1}^{M} p_k(x) = 1 (M is the cluster number), and Σ_{z: x∈G_z} denotes a sum over the data points that have x as one of their nearest neighbors. So far we have not considered the constraints 0 ≤ p_k(x) ≤ 1, but we will see that these constraints are automatically satisfied by the heuristic optimization procedure. Since Σ_k p_k(x) = 1, summing Eq. (1) over k gives λ(x) = 0. Now, if we denote ε_k(x) = p_k(x) − Σ_{y∈G_x} w_xy p_k(y), we have

ε_k(x) − Σ_{z: x∈G_z} w_zx ε_k(z) = 0,   k = 1, ..., M−1   (2)

Σ_{k=1}^{M} ε_k(x) = 0,   for all x ∈ X'   (3)

Now note that the coefficient matrix of the above linear equations has a non-zero determinant, because this matrix can be rearranged so that the diagonal elements are 1 and no two rows or columns are correlated. So ε_k(x) = 0, and we have

p_k(x) − Σ_{y∈G_x} w_xy p_k(y) = 0,   k = 1, ..., M−1   (4)

Σ_{k=1}^{M} p_k(x) = 1   (5)

By a similar argument, the above equations have a unique solution. These linear equations can be solved by an iterative procedure defined by Eq. (3) in the Methods section of the paper, or by standard techniques for solving linear equations.
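The converging iterative procedure for the membership approximation can be sketched as follows. This is a simplified Jacobi-style iteration under our own conventions (the knn, weights, and fixed arguments are hypothetical data structures), not the paper's implementation.

```python
import numpy as np

def approximate_memberships(knn, weights, fixed, M, iters=200):
    """Iteratively update each free object's membership vector as the
    weighted average of its KNN's memberships: p(x) <- sum_y w_xy p(y).
    Membership vectors of CSOs and outliers (in `fixed`, a dict mapping
    object index -> fixed membership vector) stay constant."""
    n = len(knn)
    p = np.full((n, M), 1.0 / M)          # uniform start for free objects
    for i, vec in fixed.items():
        p[i] = vec
    for _ in range(iters):
        q = p.copy()
        for i in range(n):
            if i in fixed:
                continue
            q[i] = weights[i] @ p[knn[i]]  # weighted neighbor average
        p = q
    return p
```

Because the weights of each object sum to one, the iteration preserves the constraint that every membership vector sums to one.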

4.5 Time Complexity of FLAME

Suppose the number of genes under clustering is N, the number of experimental conditions is M, the number of CSO (the same as the number of clusters) is C, and all the steps use the same number K of K-Nearest Neighbors. In the first step of FLAME to define nearest neighbors, for M << N and K << N, the time complexity is

O(M · N · (N + K log N)) ≈ O(N²)

which is the time many other clustering algorithms also have to spend in the initial stage of clustering. The time spent in the second step is negligible compared with the other two steps. In the third step, the membership approximation requires solving a linear equation, which in general has O((N · C)³) computational complexity in theory. But since the coefficient matrix of that equation is actually extremely sparse for large N, it is solved efficiently in our implementation by an iterative procedure. The theoretical time complexity analysis of this iterative procedure is very difficult.

Now that we have explained all of the above, we turn to hierarchical FLAME clustering.

4.6 Hierarchical flame algorithm

We incorporate a simple, natural hierarchical FLAME algorithm into our pipeline in order to accelerate the generation of centroids. We briefly describe the algorithm below: in hierarchical FLAME we pick some k to be the branching factor, which defines the number of clusters at each level of the clustering hierarchy. We then cluster the set of points into k clusters using a standard FLAME algorithm. Finally, we recursively cluster each sub-cluster until we reach some small fixed number of points.

The hierarchical FLAME method is used to increase the speed and performance of clustering, but it raises some questions. This algorithm isn't guaranteed to generate the same quality of clusters as a traditional FLAME algorithm: how can we know that we haven't sacrificed good centroids for speed? Later we show through experimentation on classification that this does not seem to be the case.

What happens to clusters that contain few data points? If there are not enough data points in a given cluster, then we end up generating degenerate centroids that will never be activated. While degenerate centroids do not directly hurt classification performance, they do waste CPU cycles as well as sample patches. Therefore we would like to maintain a relatively uniform number of data points per centroid. Fortunately, with hierarchical clustering we can easily merge clusters with few points into neighboring clusters, by simply cutting off the tree at the appropriate depth. Thus, singleton clusters are not an issue unless very large patches are used (which suffer from the curse of dimensionality) or not enough data exists.

Table: CIFAR-10 full dataset. Performance of our hierarchical FLAME approach compared to other published results.

Architecture        Accuracy %   Speed / s
1-layer FLAME [2]   78.3         475
3-layer FLAME [2]   82.0         3061
Conv. DBN [8]       80.49        >200000
1-layer HFLAME      77.8         49
3-layer HFLAME      82.89        221

The table above shows how our results compare with relevant published results, mostly those of k-means networks. All of the tests were run on GHC 5208 cluster laptops, using NumPy/SciPy. The convolutional deep belief network results were not recorded by me, but were run on fast desktop GPUs. This table highlights the success of FLAME networks when compared to standard DBNs. Our own results, highlighted by the ?, are competitive with the other FLAME results, which shows that it is possible to get major speedups with hierarchical FLAME.

We performed n-way cross validation in order to tune and test parameters, and the final results used everything except the test set to compute accuracy. This is in line with the results from [2] and [8].

4.7 k-nearest algorithm

Finally, we explain the k-nearest neighbors algorithm. It is the basis of our approach, and we explain it in this last part to present the matter more clearly than in the previous parts.

K-nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). K-nearest neighbors has been used in statistical estimation and pattern recognition since the early 1970s as a non-parametric technique.

K-NN (for short) is a non-parametric approach used for regression and classification. In both cases, the input consists of the k closest training examples in the feature space; the output depends on whether k-NN is used for regression or classification.

Classification: assign each new item to a class based on the characteristics it shares with already classified examples.

Regression: a technique for analyzing data to describe the relationship between two or more variables; given data whose form is known, it determines the function that best fits the given data.

K-NEAREST NEIGHBORS ALGORITHM:

These predictive techniques aim at prediction by comparing records similar to the record whose value is to be predicted, estimating the unknown value of that record on the basis of the information in those similar records.

The k-nearest neighbor (KNN) algorithm can be summarized in the following steps:

1. Select the number of nearest neighbors, k.

2. Calculate the distance between the query record and each training record:

d(x_i, x_j) = sqrt( Σ_{t=1}^{T} (x_ti − x_tj)² )

3. Sort the distances from smallest to largest, then select the nearest neighbors on the basis of the k smallest distances.

4. Gather the classes of the nearest neighbors.

5. Use the majority class (or the average) of the nearest neighbors as the prediction for the query record.

6. Continue evaluating the objective function over the remaining records.

7. Calculate the RMSE value (the root mean squared prediction error) for each value of k; if it is less than the previous one, stop; otherwise go to step 1.
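Steps 1 to 5 above can be sketched in a few lines. This is a generic illustration with our own function name, not a specific library implementation.

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, query, k=3):
    """KNN classification: compute Euclidean distances (step 2), sort and
    take the k nearest neighbors (step 3), gather their classes (step 4),
    and return the majority class (step 5)."""
    d = np.sqrt(((train_X - query) ** 2).sum(axis=1))  # step 2
    nearest = np.argsort(d)[:k]                        # step 3
    votes = Counter(train_y[i] for i in nearest)       # step 4
    return votes.most_common(1)[0][0]                  # step 5
```

For ranking neighbors, the square root could be skipped, since it does not change the ordering of the distances.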

Applied Example:

Applied example: we have data from survey questionnaires objectively testing two characteristics (acid durability and power in removing light stains) to classify a tissue as good or bad. We take four training samples, as shown in the following table:

X1 = acid durability (sec)   X2 = power (kg/m²)   Y = classification
7                            7                    Bad
7                            4                    Bad
3                            4                    Good
1                            4                    Good

Now the factory produces a new tissue, which in a laboratory test gives x1 = 3, x2 = 7. To determine the category of the new tissue, we apply the KNN algorithm as follows:

1. Define the number of nearest neighbors; suppose k = 3.

2. Calculate the distance between the query and all training samples. Instead of the true distance, we calculate the squared distance, which is faster to compute because it avoids the square root, as shown in the following table:

X1 = acid durability (sec)   X2 = power (kg/m²)   Squared distance to the query (3, 7)
7                            7                    (7−3)² + (7−7)² = 16
7                            4                    (7−3)² + (4−7)² = 25
3                            4                    (3−3)² + (4−7)² = 9
1                            4                    (1−3)² + (4−7)² = 13

3. Sort the distances from smallest to largest and rank the neighbors; a sample is included among the nearest neighbors if its rank is at most k, as shown in the following table:

X1 = acid durability (sec)   X2 = power (kg/m²)   Squared distance to (3, 7)   Rank   Included in the 3 nearest neighbors?
7                            7                    (7−3)² + (7−7)² = 16         3      yes
7                            4                    (7−3)² + (4−7)² = 25         4      no
3                            4                    (3−3)² + (4−7)² = 9          1      yes
1                            4                    (1−3)² + (4−7)² = 13         2      yes

4. Collect the categories y of the nearest neighbors. Note that the second row is excluded, because its rank is greater than k = 3, as shown in the following table:

X1 = acid durability (sec)   X2 = power (kg/m²)   Squared distance to (3, 7)   Rank   Included?   Y = category
7                            7                    (7−3)² + (7−7)² = 16         3      yes         Bad
7                            4                    (7−3)² + (4−7)² = 25         4      no          —
3                            4                    (3−3)² + (4−7)² = 9          1      yes         Good
1                            4                    (1−3)² + (4−7)² = 13         2      yes         Good

5. Take the majority class of the nearest neighbors as the predicted value for the query. We have 2 good and 1 bad; since 1 < 2, we conclude that the new tissue with X1 = 3 and X2 = 7 falls into the good category.
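The worked example above can be checked with a short script; the data and the k = 3 majority vote follow the tables, while the variable names are ours.

```python
from collections import Counter

# training samples: (acid durability X1, power X2) -> class
train = [((7, 7), "bad"), ((7, 4), "bad"), ((3, 4), "good"), ((1, 4), "good")]
query, k = (3, 7), 3

# squared Euclidean distances (the square root is not needed for ranking)
dist = sorted((sum((a - b) ** 2 for a, b in zip(x, query)), label)
              for x, label in train)
votes = Counter(label for _, label in dist[:k])
print(dist)                        # squared distances 9, 13, 16, 25
print(votes.most_common(1)[0][0])  # majority class of the 3 nearest
```

The three nearest neighbors (squared distances 9, 13, 16) vote good, good, bad, so the predicted class is "good", matching the conclusion above.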

Regression :

Regression is a way of graphically depicting the relationship between variables; it is used to estimate the value of one variable when another variable is known.

The method of stepwise regression

In many areas of application of the linear model, the researcher wants to identify the most important independent variables, those that improve the predictive capacity of the model, and in return to remove the independent variables whose addition contributes nothing significant to the model's predictive capacity. The method of stepwise regression can therefore be applied; it is based on introducing the variables one after another according to the proportion of the overall variance in the dependent variable that each one explains.

The importance of stepwise regression can be seen in the following:

1. It mitigates the problem of the linear relationship (multicollinearity) between the independent variables.

2. It orders the variables by their importance in explaining the dependent variable.

Steps in applying the stepwise regression method:

Suppose (X1, X2, ..., Xk) is a set of explanatory variables. The stepwise regression method can be applied by entering the variables one after another. At stage r, the null hypothesis H0 is formulated as follows:

H0: ρ(X_r | X*_1, X*_2, ..., X*_{r−1}) = 0

The null hypothesis above states that adding the variable X_r to the variables X*_1, X*_2, ..., X*_{r−1} does not improve the model's predictive ability. It is tested with the "partial F" statistic, calculated from the regression sums of squares:

F = ( SSE(X*_1, ..., X*_{r−1}) − SSE(X*_1, ..., X*_{r−1}, X_r) ) / ( SSE(X*_1, ..., X*_{r−1}, X_r) / (n − r − 1) )

Statistically, the above test follows the F distribution with numerator degrees of freedom r − (r − 1) = 1 and denominator degrees of freedom n − (r + 1) = n − r − 1.

We note the following when using this approach:

1. The criterion R²(X*_1, X*_2, ..., X*_r) is used to determine the best variable to add (the one yielding the highest coefficient of determination).

2. After selecting the best variable, its addition must be tested for significance; if the statistical tests prove the addition significant in improving the predictive capacity of the model, the variable is kept as one of the important variables in explaining the behavior of the dependent variable.
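The partial F test driving the stepwise procedure can be sketched as follows. This is a generic illustration assuming ordinary least squares with an intercept, 1 numerator degree of freedom, and n − r − 1 denominator degrees of freedom; sse and partial_f are our own helper names.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def partial_f(X_kept, x_new, y):
    """Partial F statistic for adding x_new to a model that already
    contains the columns of X_kept (1 and n - r - 1 degrees of freedom)."""
    n = len(y)
    X_full = np.column_stack([X_kept, x_new])
    r = X_full.shape[1]
    sse_red, sse_full = sse(X_kept, y), sse(X_full, y)
    return (sse_red - sse_full) / (sse_full / (n - r - 1))
```

A large partial F value rejects H0, so the candidate variable is kept; a small one means the variable adds nothing significant.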

FLAME WITH KNN

To achieve our aim, the FLAME algorithm combined with K-nearest neighbors (KNN) is presented.

FLAME algorithm with KNN:

1. Extract the structure inherent in the tag space by constructing a neighborhood graph in which each tag x_j is connected to its KNN.

2. Calculate the density ρ of each tag from its distance d to its nearest neighbors, using the formula ρ = 1/d; the density of each tag thus depends on its distances to its KNN.

3. Assign each tag:

a. If the tag has a density ρ higher than all its neighboring tags, assign it to the cluster supporting objects

C_CSO = {C_1, ..., C_{c−1}}

b. If the tag has a density lower than all its neighbors, and lower than a threshold θ (e.g., an arbitrary constant), assign it to the cluster outliers C_O = {C_r}.

4. Assign the tags of the cluster supporting objects C_CSO full membership to their respective clusters μ_CSO = {μ_1, ..., μ_{c−1}}.

5. Assign the tags of the cluster outliers full membership to their individual cluster μ_O.

6. Assign each remaining tag membership to all clusters C (including the outlier cluster C_O).

7. Repeat the membership assignment until the approximation error converges to zero, updating the membership of each remaining tag as a linear combination of the memberships of its nearest neighbors (assigning higher weights to closer objects x_j).

Condensed for data reduction

Condensed nearest neighbor (CNN), also known as the Hart algorithm, is an algorithm designed to reduce the dataset for k-NN classification; that is its main purpose. It chooses and specifies a set of prototypes U from the training data such that 1NN with U can classify the examples almost as accurately as 1NN with the whole dataset.

Calculation of the border ratio.

Three types of points: prototypes, class-outliers, and absorbed points.

Given a training set X, CNN works iteratively:

Scan all elements of X, looking for an element x whose nearest prototype from U has a different label than x.

Remove x from X and add it to U

Repeat the scan until no more prototypes are added to U.

Use U instead of X for classification. The examples that are not prototypes are called "absorbed" points.
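The iterative scan described above can be sketched as follows; a minimal illustration of Hart's procedure, seeded (as one common choice) with the first training example.

```python
import numpy as np

def condensed_nn(X, y):
    """Hart's CNN: grow a prototype set U; any point whose nearest
    prototype disagrees with its own label is moved into U; repeat
    the scan until a full pass adds nothing."""
    U = [0]                                   # seed with the first example
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in U:
                continue
            d = np.linalg.norm(X[U] - X[i], axis=1)
            nearest = U[int(np.argmin(d))]
            if y[nearest] != y[i]:            # misclassified by U
                U.append(i)
                changed = True
    return U
```

On well-separated classes, only a few prototypes are needed and 1NN with U still classifies every training example correctly; the remaining examples are the "absorbed" points.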

It is efficient to scan the training examples in order of decreasing border ratio.[22] The border ratio of a training example x is defined as

a(x) = ||x'-y||/||x-y||

where ||x-y|| is the distance to the closest example y having a different color than x, and ||x'-y|| is the distance from y to its closest example x' with the same label as x.

The border ratio lies in the interval [0,1] because ||x'−y|| never exceeds ||x−y||. This ordering gives preference to the borders of the classes for inclusion in the set of prototypes U. A point with a different label than x is called external to x. The calculation of the border ratio is illustrated by the figure on the right. The data points are labeled by colors: the initial point is x and its label is red. The external points are blue and green. The external point closest to x is y. The red point closest to y is x'. The border ratio a(x) = ||x'−y|| / ||x−y|| is the attribute of the initial point x.

Below is an illustration of CNN in a series of figures. There are three classes (red, green and blue). Fig. 1: initially there are 60 points in each class. Fig. 1 shows the 1NN classification map: each pixel is classified by 1NN using all the data. Fig. 2 shows the 5NN classification map. White areas correspond to the unclassified regions, where 5NN voting is tied (for example, if there are two green, two red and one blue points among 5 nearest neighbors). Fig. 4 shows the reduced data set. The crosses are the class-outliers selected by the (3,2)NN rule (all the three nearest neighbors of these instances belong to other classes); the squares are the prototypes, and the empty circles are the absorbed points. The left bottom corner shows the numbers of the class-outliers, prototypes and absorbed points for all three classes. The number of prototypes varies from 15% to 20% for different classes in this example. Fig. 5 shows that the 1NN classification map with the prototypes is very similar to that with the initial data set. The figures were produced using the Mirkes applet.

Fig 1

Fig 2

5

5.1 Inference:

Now that we have explained everything related to our thesis and our approach, we present conclusions and some examples.

A simple illustration on a 2-dimensional testing dataset.

These are the steps of FLAME:

Fuzzy clustering by Local Approximation of MEmberships (FLAME) defines clusters in the dense parts of a dataset and performs cluster assignment based on the neighborhood relationships among objects. FLAME constructs a k-nearest-neighbors graph to identify the cluster centers and outliers. Proteins with the highest local density are called Cluster Supporting Objects (CSOs), and proteins with a local density lower than a threshold are called outliers. Each CSO is assigned full membership to itself, to represent a cluster center. Outliers are assigned full membership to the outlier group. Fuzzy memberships are then assigned to the remaining proteins, with varying degrees of membership to the cluster supporting objects. There is no need to specify a predefined number of clusters: FLAME automatically determines the numbers of clusters and outliers. FLAME requires the number of k-nearest neighbors and the threshold value for outliers as initial parameters.

5.2 In addition, I will show the difference between FLAME and c-means:

The experiments were conducted on an Intel Pentium 4 processor with 2 GB of RAM. The alignment scoring matrix for the given datasets was obtained by the Smith-Waterman algorithm, and the similarity scores were then normalized. Distance matrices of the protein sequences were calculated from the similarity scores. After completing these processes, the clustering algorithms were initialized and run on the datasets with the distance matrices computed above. We calculated validity indices for the clustering algorithms on four datasets. Figure 1 shows the silhouette index on the four datasets, Figure 2 shows the partition index, and the tables below show the differences across the datasets.

TABLE 1. SILHOUETTE INDEX OF ALGORITHMS ON FOUR DATASETS

Algorithm   Dataset 1   Dataset 2   Dataset 3   Dataset 4
FCM         0.4837      0.4568      0.4488      0.4231
FLAME       0.5097      0.4798      0.4781      0.4797

Fig 1 Clustering validation and comparison by silhouette index

TABLE 2. PARTITION INDEX OF ALGORITHMS ON FOUR DATASETS

Algorithm   Dataset 1   Dataset 2   Dataset 3   Dataset 4
FCM         0.3087      0.3022      0.3532      0.3279
FLAME       0.2595      0.2813      0.2931      0.2754

Figure 2. Clustering validation and comparison by partition index

TABLE 3. EXECUTION TIME OF ALGORITHMS ON FOUR DATASETS

Algorithm   Dataset 1 (sec)   Dataset 2 (sec)   Dataset 3 (sec)   Dataset 4 (sec)
FCM         74.7738           86.6731           80.0534           87.2015
FLAME       72.1925           80.9234           75.7239           84.8062

Figure 3. Execution time of algorithms on four datasets

According to both validity index analyses, FLAME is the best algorithm on the four datasets. Figure 3 shows the execution time of the clustering methods on the four datasets. The execution time of FLAME is lower than that of fuzzy c-means clustering. From the results, it is inferred that FLAME performs better in terms of both validity indices and execution time. Many problems in bioinformatics are related to gene or protein sequence analysis, and clustering protein sequences is an important problem in bioinformatics, since clustering is used to identify the relationships between proteins. In this paper, we compared and evaluated the performance of two clustering algorithms, fuzzy c-means and FLAME. The experimental results show that FLAME clustering performs better than fuzzy c-means clustering in terms of validity indices and execution time.

And this is a flowchart comparing our approach with c-means:


After all of these comparisons, we conclude that FLAME (fuzzy clustering by local approximation of membership) is the best of the methods examined. The next page shows the final flowchart for FLAME, which concludes this study.

References:

1- R. Sibson (1973). "SLINK: an optimally efficient algorithm for the single-link cluster method". The Computer Journal. British Computer Society. 16 (1): 30–34. doi:10.1093/comjnl/16.1.30.

2- D. Defays (1977). "An efficient algorithm for a complete link method". The Computer Journal. British Computer Society. 20 (4): 364–366. doi:10.1093/comjnl/20.4.364.

3- Steinbach, M., Karypis, G., & Kumar, V. (2000, August). A comparison of document clustering techniques. In KDD workshop on text mining (Vol. 400, No. 1, pp. 525-526).

4-Bezdek JC. Pattern Recognition With Fuzzy Objective Function Algorithms. Plenum Press, New York, USA; 1981.

5-Gasch AP, Eisen MB. Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biol. 2002;3:RESEARCH0059. doi: 10.1186/gb-2002-3-11-research0059.

6-Pascual-Marqui RD, Pascual-Montano AD, Kochi K, Carazo JM. Smoothly distributed fuzzy c-means: a new self-organizing map. Pattern Recognition. 2001.

7- Chen YD, Bittner ML, Dougherty ER. Issues associated with microarray data analysis and integration. Nature Genetics. 1999;(Suppl 22):213–215.

8- Yeung KY, Fraley C, Murua A, Raftery E, Ruzzo WL. Model-based clustering and data transformations for gene expression data. Bioinformatics. 2001;17:977–987. doi: 10.1093/bioinformatics/17.10.977.

9- Roweis ST, Saul LK. Nonlinear dimensionality reduction by locally linear embedding. Science. 2000;290:2323–2326. doi: 10.1126/science.290.5500.2323.

10- Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA. 1999;96:2907–2912.

11-Belacel N, Cuperlovic-Culf M, Laflamme M, Ouellette R. Fuzzy J-Means and VNS methods for clustering genes from microarray data. Bioinformatics. 2004;

12-Yeung KY, Haynor DR, Ruzzo WL. Validating clustering for gene expression data.Bioinformatics. 2001

13-Di Gesu V, Giancarlo R, Lo Bosco G, Raimondi A, Scaturro D. GenClust: A Genetic Algorithm for Clustering Gene Expression Data. BMC Bioinformatics. 2005

14- Zhang W, Morris QD, Chang R, Shai O, Bakowski MA, Mitsakakis N, Mohammad N, Robinson MD, Zirngibl R, Somogyi E, Laurin N, Eftekharpour E, Sat E, Grigull J, Pan Q, Peng WT, Krogan N, Greenblatt J, Fehlings M, van der Kooy D, Aubin J, Bruneau BG, Rossant J, Blencowe BJ, Frey BJ, Hughes TR. The functional landscape of mouse gene expression. J Biol. 2004.

15- Yanai I, Korbel JO, Boue S, McWeeney SK, Bork P, Lercher MJ. Similar gene expression profiles do not imply similar tissue functions. Trends Genet. 2006;22:132–138.

16-The Gene Ontology Consortium http://www.geneontology.org/.

17-Yeung KY, Haynor DR, Ruzzo WL. Technical Report UW-CSE-00-01-01. Department of Computer Science and Engineering, University of Washington; 2000.

18-Hartuv E, Schmitt AO, Lange J, Meier-Evert S, Lehrach H, Shamir R. An algorithm for clustering cDNA fingerprints. Genomics. 2000

19-Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998

20-Stanford Microarray Database http://genome-www5.stanford.edu/.

21-Zhang W, Morris QD, Chang R, Shai O, Bakowski MA, Mitsakakis N, Mohammad N, Robinson MD, Zirngibl R, Somogyi E, Laurin N, Eftekharpour E, Sat E, Grigull J, Pan Q, Peng WT, Krogan N, Greenblatt J, Fehlings M, van der Kooy D, Aubin J, Bruneau BG, Rossant J, Blencowe BJ, Frey BJ, Hughes TR. The functional landscape of mouse gene expression. J Biol. 2004;

22-An Empirical Analysis of Flame and Fuzzy C-Means Clustering for Protein Sequences

23-Hand D, Mannila H, Smyth P: Principles of Data Mining. Cambridge, MA: The MIT Press; 2001.

24- Qu Y, Xu S. Supervised cluster analysis for microarray data based on multivariate Gaussian mixture. Bioinformatics. 2004;20(12):1905–1913. doi: 10.1093/bioinformatics/bth177.

25- Yeung KY, Fraley C, Murua A, Raftery E, Ruzzo WL. Model-based clustering and data transformations for gene expression data. Bioinformatics. 2001;17(10):977–987. doi: 10.1093/bioinformatics/17.10.977.

26- Chen YD, Bittner ML, Dougherty ER. Issues associated with microarray data analysis and integration. Nature Genetics. 1999;(Suppl 22):213–215.

27- Wikipedia.
