ABSTRACT

The problem of the ever-increasing rate of data has not yet been completely solved, although a number of techniques have evolved to find optimized solutions from training sets under the mechanism of association rule mining in data mining. Techniques such as K-nearest neighbor, the Genetic Algorithm, Ant Colony Optimization and the MORA algorithm are used to overcome problems of data mining such as the pruning pass over the transaction database, the quality of the rule set and negative rule generation. The various algorithms differ in execution time, minimum support, confidence and efficiency, and each possesses different advantages: K-NN is highly robust to noisy data and effective for large training sets; the Genetic Algorithm is independent of the error surface and can solve problems possessing multiple solutions, though it is unable to solve variant problems; ACO produces a number of alternative pheromone representations for constructive search problems. A comparative study of these algorithms shows their different behavior with respect to these parameters.

Keywords - Data mining, support, confidence, distance weight optimization, negative association rule mining.

ASSOCIATION RULE MINING

Association rule mining is best understood through association rules themselves. These are conditional statements that uncover relationships between data that seem unrelated in a relational database or other information repository. Association rules are usually required to satisfy both a minimum support and a minimum confidence specified by the user. Association rule generation is basically split into two separate steps:

First, a minimum support threshold is applied to find all item-sets that occur frequently in the database.

Second, a minimum confidence constraint is applied to form rules from those frequent item-sets.
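The two-step split above can be sketched in a few lines of Python. The transaction database and the threshold values below are made-up illustrative examples, not data from the paper.

```python
# Sketch of the two-step association rule generation described above.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]
min_support, min_confidence = 0.5, 0.6  # example thresholds

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 1: find all item-sets whose support meets the minimum support threshold.
items = set().union(*transactions)
frequent = [set(c)
            for r in range(1, len(items) + 1)
            for c in combinations(sorted(items), r)
            if support(set(c)) >= min_support]

# Step 2: form rules A -> B from frequent item-sets, keeping those whose
# confidence = support(A ∪ B) / support(A) meets the minimum confidence.
rules = []
for fs in frequent:
    for r in range(1, len(fs)):
        for a in combinations(sorted(fs), r):
            antecedent = set(a)
            conf = support(fs) / support(antecedent)
            if conf >= min_confidence:
                rules.append((antecedent, fs - antecedent, conf))
```

With this toy database, every single item and every pair of items is frequent, and each pair yields two rules with confidence 2/3.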

In data mining, association rules are important because they are useful in predicting and analyzing customer behavior. They play an important role in product clustering, store layout, basket data analysis and catalog design. The problem is usually decomposed into two sub-problems. The first is to find those item-sets whose occurrence exceeds a predefined threshold in the database; these item-sets are called frequent or large item-sets. The second is to generate association rules from those large item-sets subject to a minimum-confidence constraint. Item-sets that are expected to be large or frequent, before their support is confirmed against the support threshold, are called candidate item-sets. In many cases, the algorithms generate an extremely large number of association rules, often thousands or even millions. It is nearly impossible for end users to comprehend or validate such a large number of complex association rules, which limits the usefulness of the data mining results. Several strategies have been proposed to reduce the number of association rules, such as generating only non-redundant rules, generating only "interesting" rules, or generating only those rules satisfying certain other criteria such as leverage, coverage, lift or strength.

DESCRIPTION OF APPROACHES

(a) KNN : The basic algorithm for classification is KNN. It can also be used for estimation and prediction. K-nearest neighbor is based on the principle of instance-based learning: unclassified records are classified by comparing them with the most similar records in the training data set, which is stored until the classification process is complete. Since the k-nearest neighbor algorithm assigns a record's class on the basis of similarity, data analysts define a distance function, also known as a distance metric, to measure similarity. A distance function is a real-valued mathematical function d that, for any points x, y and z in the space, satisfies -

I) d(x,y) ≥ 0, and d(x,y) = 0 only if x = y {Non-negativity property}

II) d(x,y) = d(y,x) {Symmetry property}

III) d(x,z) ≤ d(x,y) + d(y,z) {Triangle inequality property}

The distance d can never be zero unless the two points coincide. The symmetry property shows that the distance between two points is the same whether measured from x to y or from y to x. The triangle inequality shows that the distance between two points can never be reduced by routing through a third point. The most familiar and widely used function for determining the distance between points in n-dimensional space is the Euclidean distance, given by -

d(x,y) = √( ∑_(i=1)^n (x_i - y_i)^2 ), where n is the number of dimensions.

KNN is a lazy algorithm, as it uses the complete training data set during the testing phase. Each training set is comprised of a set of vectors, with a class label associated with each vector. In the simplest case, the class labels are + (positive class) or - (negative class). The idea in k-nearest neighbor methods is to choose k, which decides how many neighbors, ranked by the distance function, influence the classification. With small k (e.g., k = 1), the algorithm simply returns the target value of the nearest observation, a process that may lead to overfitting: memorizing the training data set at the expense of generalizability. A small value of k also means that noise has a higher influence on the result. On the other hand, a value of k that is not too small tends to smooth out idiosyncratic behavior learned from the training set. However, taking this too far and choosing a very large value of k causes behavior of local interest to be overlooked; it is also computationally expensive, and it defeats the basic philosophy behind KNN, which is that nearby points are likely to share a class. The data analyst is responsible for balancing these considerations when choosing the value of k. A simple heuristic is to set k = √n, where n is the number of records in the training set.
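As an illustration of the description above, a minimal KNN classifier with Euclidean distance and a majority vote among the k closest records can be sketched as follows; the training vectors, labels and query point are invented example values.

```python
# Minimal k-nearest-neighbor classifier: Euclidean distance plus
# a majority vote over the k closest training records.
import math
from collections import Counter

def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2)"""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(training, query, k):
    """training: list of (vector, label) pairs.
    Returns the majority label among the k records nearest to `query`."""
    nearest = sorted(training, key=lambda rec: euclidean(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Example: two "+" records near the query, three "-" records far away.
training = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"),
            ((4.0, 4.0), "-"), ((4.2, 3.9), "-"), ((3.8, 4.1), "-")]
print(knn_classify(training, (1.1, 0.9), k=3))  # prints "+"
```

Note that the classifier does no work at "training" time; all distance computations happen at query time, which is why KNN is called lazy.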

(b) GENETIC ALGORITHM : Genetic algorithms (GA) are derived from Darwin's principle of natural selection, which states that only the fittest survive, and are adaptive in nature. A GA maintains a population of potential solutions to the candidate problem, considered as individuals, creatures or phenotypes. GAs belong to the larger class of evolutionary algorithms (EA), which produce optimized solutions by taking inspiration from natural evolutionary mechanisms such as mutation, selection, inheritance and crossover. Chromosomes, or genotypes, are the properties of each candidate solution that can be altered or mutated. Traditionally, candidate solutions are encoded as binary strings of fixed length, i.e., strings of only 1's and 0's. Evolution starts from randomly generated individuals and proceeds as an iterative process; in each generation, the fitness of every individual in the population is evaluated. The genetic algorithm terminates either when a maximum number of generations has been produced or when a satisfactory fitness level has been reached. Fitness is the value of the objective function being optimized. Candidate solutions can also be represented with variable length, but this makes crossover complex; with fixed-length representations the parts of the candidate solutions are easily aligned, which makes the genetic representation convenient and crossover simpler. After selection of individuals with high fitness values, evolution takes place through three genetic operators - reproduction, mutation and crossover.

Let two individuals with candidate solutions of length 20 and a crossover point after bit 5 be -

X1= (01011|101101101000101) X2= (10010|011100101010001)

I) Reproduction makes no change to the candidate solutions of the parent population and passes the same candidate solutions to the offspring population. The two resulting offspring are -

X’1= (01011|101101101000101) X’2= (10010|011100101010001)

II) Crossover interchanges the bits of the parents' candidate solutions after the crossover point and passes the changed candidate solutions to the offspring population. The two resulting offspring are -

X’1= (01011|011100101010001) X’2= (10010|101101101000101)

III) Mutation inverts bits of the parent's candidate solution and passes the changed candidate solution to the offspring population (here every bit is inverted for illustration; in practice each bit is usually flipped with a small probability). The two resulting offspring are -

X’1= (10100|010010010111010) X’2= (01101|100011010101110)
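The operator examples above can be reproduced in code. The sketch below implements one-point crossover and whole-string inversion exactly as in the example, plus the more common per-bit mutation for comparison.

```python
# Sketch of the GA operators on fixed-length binary strings.
import random

def crossover(p1, p2, point):
    """One-point crossover: swap the tails of the two parents after `point`."""
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def invert(parent):
    """Bitwise inversion of the whole string, as in the example above."""
    return "".join("1" if b == "0" else "0" for b in parent)

def mutate(parent, rate=0.05, rng=random):
    """Typical GA mutation: flip each bit independently with a small probability."""
    return "".join(("1" if b == "0" else "0") if rng.random() < rate else b
                   for b in parent)

x1 = "01011101101101000101"  # X1 from the example
x2 = "10010011100101010001"  # X2 from the example
c1, c2 = crossover(x1, x2, 5)
print(c1)          # 01011011100101010001
print(c2)          # 10010101101101000101
print(invert(x1))  # 10100010010010111010
```

Reproduction needs no code of its own: the offspring are simply copies of the parents.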

In genetic programming, unlike evolutionary programming, tree-like representations are explored. Genetic algorithms have various drawbacks, such as complexity of the search operation, because the search space grows exponentially when the number of elements exposed to mutation is large, and ineffectiveness in problem solving when the only criterion for decision making is the fitness measure.

(c) ANT COLONY OPTIMIZATION : The Ant Colony Optimization algorithm, abbreviated ACO, is a probabilistic technique based on a multi-agent system that simulates the natural behavior of ants through mechanisms of cooperation and adaptation. The algorithm was first proposed by Marco Dorigo in 1992; it reduces a computational problem to finding good paths through a graph, much as ants seek a path between their colony and food. The basic idea behind ACO comprises three points -

I) For every problem, a candidate solution is associated with each path that is followed by ants.

II) The amount of pheromone deposited on each path followed by an ant is proportional to the quality of the corresponding candidate solution for the target problem.

III) When an ant has to choose between two or more paths, the path(s) possessing a larger amount of pheromone have a greater probability of being chosen by the ant.

In ACO, the appetency of solutions is inversely proportional to the difference in importance of negative and redundant paths, and the concentration is proportional to the number of ants whose appetency is greater than α, where α is defined as m/10 and m is the number of ants in a colony. All ants with appetency greater than α deposit additional pheromone. ACO involves a number of parameters that need to be set appropriately:

α, which weighs the relative influence of the pheromone;

β, which weighs the heuristic values in the construction of an ant's solution and usually takes a value between 2 and 5;

ρ, the evaporation rate parameter, with 0 ≤ ρ ≤ 1, which regulates the rate of decay of the pheromone trails;

σ, the local pheromone (history) coefficient, which controls how much history contributes to a component's probability of selection and is typically set to 0.1;

h, a problem-dependent heuristic function that measures the quality of items that can be added to the current partial solution.
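A minimal sketch of the pheromone-driven path choice described above, assuming the standard ACO selection rule in which a path's probability is proportional to τ^α · η^β (pheromone weighted by α, heuristic desirability by β); the pheromone trails, heuristic values and parameter settings are invented examples.

```python
# Sketch of pheromone-proportional path selection and evaporation in ACO.
import random

def choose_path(pheromone, heuristic, alpha=1.0, beta=2.0, rng=random):
    """Pick a path index with probability proportional to tau^alpha * eta^beta
    (roulette-wheel selection)."""
    weights = [t ** alpha * h ** beta for t, h in zip(pheromone, heuristic)]
    r = rng.random() * sum(weights)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return len(weights) - 1

def evaporate(pheromone, rho=0.1):
    """Evaporation: each trail decays by the factor (1 - rho), 0 <= rho <= 1."""
    return [(1 - rho) * t for t in pheromone]

tau = [1.0, 3.0, 0.5]  # pheromone on three alternative paths
eta = [0.8, 0.5, 0.9]  # heuristic desirability, e.g. inverse path length
path = choose_path(tau, eta)  # paths with more pheromone are chosen more often
```

After each iteration, `evaporate` is applied to all trails before the successful ants deposit new pheromone, so that poor paths gradually lose influence.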

(d) MORA ALGORITHM : MORA stands for Movement-based Routing Algorithm. It is completely distributed, since nodes need to communicate only with direct neighbors in their transmission range, and it utilizes a specific metric that exploits not only the position but also the direction of movement of hosts in order to address this critical problem. The metric used in MORA is a linear combination of the number of hops, arbitrarily weighted, and a target functional that each node can calculate independently. In a position-based routing algorithm, each node decides which neighbor to forward the message to based only on its own location, the locations of its neighboring nodes, and the destination. The idea is to create a functional, computable independently by each node, that depends on how far the node is from the line connecting source and destination, 'sd', and on the direction in which the node is moving. The target functional should reach its absolute maximum when the node is moving on 'sd', and it should decrease as the distance from 'sd' increases. Moreover, the more a node moves towards 'sd', the higher its value should be; i.e., for a fixed distance from 'sd', the functional should have a maximum when the node is moving perpendicularly towards 'sd'. Another degree of freedom of the metric employed in MORA is the weight assigned to each node, which can be used to represent traffic conditions, application constraints, etc. The goal of the weighting function is to obtain a fair distribution of the available resources over the whole network. A DFS (Depth First Search) could appear when the decision among direct neighbors is taken by minimizing a distance function.
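The exact form of the target functional is not reproduced here, so the sketch below only illustrates one of its ingredients: the perpendicular distance of a node from the line 'sd', computed with the standard cross-product formula. All coordinates are made-up example values.

```python
# Sketch: perpendicular distance of a node from the source-destination line 'sd',
# one ingredient of the MORA target functional described above.
import math

def distance_from_sd(node, s, d):
    """Distance of `node` from the line through source s and destination d,
    via |SD x SN| / |SD| (2D cross-product magnitude)."""
    sdx, sdy = d[0] - s[0], d[1] - s[1]
    snx, sny = node[0] - s[0], node[1] - s[1]
    return abs(sdx * sny - sdy * snx) / math.hypot(sdx, sdy)

s, d = (0.0, 0.0), (10.0, 0.0)
print(distance_from_sd((5.0, 3.0), s, d))  # 3.0, the node is 3 units off 'sd'
```

The full functional would combine this distance with the node's direction of movement, peaking for nodes on 'sd' and decaying as the distance grows.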

The MORA algorithm is used in various situations, such as -

Simplifying complicated conditional expressions

Making the process of discovering items from training sets easier, with appealing formulas to manipulate them

Producing Gröbner bases (GBs) in non-commutative settings

Finding small bases for ideals in a non-commutative algebra

PROPOSED WORK

CONCLUSION

