ESSAY SAUCE

Essay:

Essay details:

• Subject area(s): Engineering
• Published on: 7th September 2019
• File format: Text
• Number of pages: 2


ABSTRACT

The problem of rapidly growing data volumes has not yet been completely solved, although a number of techniques have evolved for finding optimized solutions from training sets through association rule mining. Techniques such as K-nearest neighbor (KNN), genetic algorithms (GA), Ant Colony Optimization (ACO) and the MORA algorithm are used to overcome data mining problems such as repeated pruning passes over the transaction database, the superiority of the rule set, and negative rule generation. These algorithms differ in execution time, minimum support, confidence and efficiency, and each has its own advantages: KNN is robust to noisy data and effective when the training set is large; a genetic algorithm is independent of the error surface and can solve problems that have multiple solutions, but it cannot solve variant problems; ACO produces a number of alternative pheromone representations for constructive search problems. A comparative study of these algorithms shows how their behavior differs across these parameters.

Keywords - Data mining, support, confidence, distance weight optimization, negative association rule mining.

ASSOCIATION RULE MINING

DESCRIPTION OF APPROACHES

(a) KNN : KNN is a basic algorithm for classification. It can also be used for estimation and prediction. K-nearest neighbor is based on the principle of instance-based learning, a technique that classifies an unclassified record by comparing it with the most similar records in the training data set, which is kept in storage until the classification is complete. Because the k-nearest neighbor algorithm assigns a classification on the basis of similarity, data analysts define a distance function, also known as a distance metric, to measure similarity. A distance function is a real-valued function 'd' such that, for any coordinates x, y and z in the space -

I) d(x,y) ≥ 0, and d(x,y) = 0 if and only if x = y    {Non-negativity property}

II) d(x,y) = d(y,x)    {Symmetry (commutative) property}

III) d(x,z) ≤ d(x,y) + d(y,z)  {Triangle inequality property}

The distance 'd' is zero only when the two coordinates coincide. The symmetry (commutative) property shows that the distance between two points is the same whether it is measured from x to y or from y to x. The triangle inequality shows that the distance between two points can never be reduced by routing through a third point. The most familiar and widely used function for measuring the distance between points in n-dimensional space is the Euclidean distance, given by -

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}       where i indexes the n dimensions.
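The Euclidean distance and the three properties of a distance function can be checked numerically. The following is a minimal sketch in Python; the function name `euclidean_distance` and the sample points are illustrative assumptions and are not part of the original text.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two points given as equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical sample points in 2-dimensional space.
x, y, z = (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)

d_xy = euclidean_distance(x, y)
d_yz = euclidean_distance(y, z)
d_xz = euclidean_distance(x, z)

# I) non-negativity, with zero distance only for identical points
non_negative = d_xy >= 0 and euclidean_distance(x, x) == 0
# II) symmetry
symmetric = d_xy == euclidean_distance(y, x)
# III) triangle inequality
triangle = d_xz <= d_xy + d_yz

print(d_xy, non_negative, symmetric, triangle)
```

For these points the distance from x to y is 5.0 and all three checks hold.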

KNN is a lazy algorithm: it builds no model in advance and instead uses the complete training data set during the testing phase. Every training set is made up of vectors, each with an associated class label. In the simplest case, the class labels are + (positive class) or - (negative class). The idea in k-nearest neighbor methods is to choose k, the number of neighbors, ranked by the distance function, that are allowed to influence the classification. With a small k (e.g., k = 1), the algorithm simply returns the target value of the nearest observation, which may lead to overfitting: the algorithm tends to memorize the training data set at the expense of generalizability. A small value of k also means that noise has a greater influence on the result. Choosing a value of k that is not too small tends to smooth out idiosyncratic behavior learned from the training set. However, if k is too large, locally interesting behavior is overlooked. The data analyst needs to balance these considerations when choosing k. A large value also makes the computation expensive and defeats the basic philosophy behind KNN, namely that points that are near one another are likely to have similar densities or classes. A simple approach is to set k = √n, where n is the number of records in the training set.
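The effect of k on noise sensitivity can be illustrated with a small majority-vote classifier. This is a sketch only: the function names, the training records and the query point are hypothetical, and the last training record is deliberately a noisy "-" point placed near the "+" cluster.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training, query, k):
    """Return the majority class label among the k training records nearest to query.

    training is a list of (vector, label) pairs; nothing is learned in advance,
    so the whole training set is scanned for every query (lazy learning).
    """
    nearest = sorted(training, key=lambda record: euclidean(record[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set with + and - class labels.
training = [
    ((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((2.0, 1.0), "+"),
    ((6.0, 6.0), "-"), ((7.0, 7.0), "-"), ((6.5, 8.0), "-"),
    ((3.0, 3.0), "-"),  # noisy record close to the + cluster
]
query = (2.8, 2.8)

label_k1 = knn_classify(training, query, k=1)
label_k3 = knn_classify(training, query, k=3)
print(label_k1, label_k3)
```

With k = 1 the query takes the label of the single noisy neighbor ("-"), while with k = 3 the two nearby "+" records outvote it.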

(b) GENETIC ALGORITHM : Genetic algorithms (GA) are adaptive methods derived from Darwin's principle in natural genetics that only the fittest survive. A GA maintains a population of potential solutions to the problem, called individuals, creatures or phenotypes. GAs belong to the larger class of evolutionary algorithms (EA), which produce optimized solutions by taking inspiration from natural evolution: mutation, selection, inheritance and crossover. Each candidate solution has a set of properties, its chromosomes or genotype, that can be altered or mutated. Traditionally, candidate solutions are encoded as fixed-length binary strings of 1's and 0's, although other encodings are possible. Evolution starts from randomly generated individuals and proceeds iteratively; in each generation, the fitness of every individual in the population is evaluated. The fitness is the value of the objective function of the optimization problem being solved. The algorithm terminates when either a maximum number of generations has been produced or a satisfactory fitness level has been reached for the population. Candidate solutions can also be represented with variable length, but this makes crossover complex; in a fixed-length representation the parts of the candidate solutions are easily aligned, which makes crossover simpler. After individuals with high fitness values have been selected, evolution proceeds through three genetic operators - reproduction, crossover and mutation.

Let two individuals with candidate solutions of length 20 and a crossover point after bit 5 be -

X1= (01011|101101101000101)                              X2= (10010|011100101010001)

I) Reproduction makes no change to the candidate solutions of the parent population; the offspring population inherits them unchanged. The two resulting offspring are -

X’1= (01011|101101101000101)                              X’2= (10010|011100101010001)

II) Crossover exchanges the bits of the parents' candidate solutions after the crossover point, and the offspring population inherits the recombined solutions. The two resulting offspring are -

X’1= (01011|011100101010001)                             X’2= (10010|101101101000101)

III) Mutation inverts bits of the parent's candidate solution, and the offspring population inherits the changed solution. In this example every bit is inverted; in practice each bit is usually flipped only with a small probability. The two resulting offspring are -

X’1= (10100|010010010111010)                             X’2= (01101|100011010101110)
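The three operators can be sketched on the two 20-bit parents X1 and X2. The function names are illustrative assumptions; `mutate_all` inverts every bit, as in the example above, rather than flipping bits at random.

```python
def reproduce(a, b):
    """Reproduction: offspring are unchanged copies of the parents."""
    return a, b

def crossover(a, b, point):
    """Single-point crossover: keep each parent's head, exchange the tails."""
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate_all(bits):
    """Mutation as used in the example: invert every bit of the string."""
    return "".join("1" if bit == "0" else "0" for bit in bits)

x1 = "01011" + "101101101000101"
x2 = "10010" + "011100101010001"

r1, r2 = reproduce(x1, x2)
c1, c2 = crossover(x1, x2, point=5)
m1, m2 = mutate_all(x1), mutate_all(x2)

for name, value in [("reproduction", (r1, r2)), ("crossover", (c1, c2)), ("mutation", (m1, m2))]:
    print(name, value[0][:5] + "|" + value[0][5:], value[1][:5] + "|" + value[1][5:])
```

With the crossover point after bit 5, each offspring keeps the first five bits of one parent and takes the remaining fifteen bits from the other.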

Genetic programming, unlike evolutionary programming, explores tree-like representations. Genetic algorithms have several drawbacks: the search becomes complex because the search space grows exponentially when the number of elements exposed to mutation is large, and they are ineffective on problems where the fitness measure is the only criterion for decision making.

(c) ANT COLONY OPTIMIZATION : Ant Colony Optimization (ACO) is a probabilistic, agent-based technique that simulates the natural behavior of ants through cooperation and adaptation. The algorithm was first proposed by Marco Dorigo in 1992. It reduces a computational problem to that of finding good paths through a graph, much as ants seek a path between their colony and a food source. The basic idea behind ACO rests on three points -

I) For every problem, a candidate solution is associated with each path that is followed by ants.

II) The amount of pheromone deposited on each path followed by an ant is proportional to the quality of the corresponding candidate solution for the target problem.

III)  When an ant has to choose between two or more paths, the path(s) with a larger amount of pheromone have a greater probability of being chosen by the ant.

In ACO, the appetency of a solution is inversely proportional to the difference in importance between negative and redundant paths, and the concentration is proportional to the number of ants whose appetency is greater than a threshold α, where α can be defined as m/10 and m is the number of ants in the colony. Every ant whose appetency is greater than α deposits additional pheromone. ACO involves a number of parameters that need to be set appropriately: α, which weighs the relative influence of the pheromone; β, which weighs the influence of the heuristic values in the construction of the ants' solutions and usually takes a value between 2 and 5; ρ, the evaporation rate, where 0 ≤ ρ ≤ 1, which regulates how quickly pheromone trails decay; σ, the local pheromone (history) coefficient, which controls how much history contributes to a component's probability of selection and is set to 0.1; and h, a problem-dependent heuristic function that measures the quality of items that can be added to the current partial solution.
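The roles of α, β and ρ can be made concrete with the selection and update rules of the classical Ant System, which are assumed here because the text does not state a formula: the probability of choosing a path is proportional to pheromone^α × heuristic^β, and each trail decays by a factor (1 - ρ) before new pheromone is deposited. The function names and sample values are hypothetical.

```python
def selection_probabilities(pheromone, heuristic, alpha=1.0, beta=2.0):
    """Probability of choosing each candidate path.

    Each path is weighted by pheromone**alpha * heuristic**beta and the weights
    are normalised, so paths with more pheromone are more likely to be chosen.
    """
    weights = [(t ** alpha) * (h ** beta) for t, h in zip(pheromone, heuristic)]
    total = sum(weights)
    return [w / total for w in weights]

def evaporate_and_deposit(pheromone, deposits, rho=0.5):
    """Decay every trail by the evaporation rate rho, then add the new deposits."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie between 0 and 1")
    return [(1.0 - rho) * t + d for t, d in zip(pheromone, deposits)]

# Hypothetical example: three paths with equal heuristic quality,
# the second carrying twice as much pheromone as the others.
pheromone = [1.0, 2.0, 1.0]
heuristic = [1.0, 1.0, 1.0]

probabilities = selection_probabilities(pheromone, heuristic, alpha=1.0, beta=2.0)
updated = evaporate_and_deposit(pheromone, deposits=[0.0, 1.0, 0.0], rho=0.5)
print(probabilities, updated)
```

In this example the second path is chosen with probability 0.5 against 0.25 for each of the others, and after one update only the reinforced path keeps its pheromone level.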

(d) MORA ALGORITHM : MORA (Movement-Based Routing Algorithm) is completely distributed, since nodes need to communicate only with direct neighbors within their transmission range. It uses a specific metric that exploits not only the position but also the direction of movement of mobile hosts. The metric is a linear combination of the number of hops, arbitrarily weighted, and a target functional that each node can calculate independently. In a position-based routing algorithm, each node decides which neighbor to forward the message to based only on its own location, the locations of its neighboring nodes, and the location of the destination. The idea is to create a functional that depends on how far the node is from the line connecting source and destination, 'sd', and on the direction in which the node is moving. The target functional should reach its absolute maximum when the node is moving on 'sd', and it should decrease as the distance from 'sd' increases. Moreover, the more a node moves towards 'sd', the higher its value should be; that is, for a fixed distance from 'sd', the functional should have a maximum if the node is moving perpendicularly to 'sd'.
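The exact form of the target functional is not given here, so the following sketch does not attempt to reproduce it. It only computes the two geometric quantities the description relies on, namely a node's distance from the line 'sd' and how fast the node is moving towards that line. The function name, the 2-D coordinates and the sample values are assumptions made for illustration.

```python
import math

def line_geometry(node, velocity, s, d):
    """Return (distance from line sd, speed of approach towards sd) for one node.

    node, s and d are (x, y) positions; velocity is the node's (vx, vy).
    The approach speed is positive when the node moves towards the line.
    """
    ux, uy = d[0] - s[0], d[1] - s[1]
    length = math.hypot(ux, uy)
    if length == 0:
        raise ValueError("source and destination must be distinct points")
    # Unit normal of the line through s and d.
    nx, ny = -uy / length, ux / length
    signed_distance = (node[0] - s[0]) * nx + (node[1] - s[1]) * ny
    normal_speed = velocity[0] * nx + velocity[1] * ny
    if signed_distance == 0:
        # Already on the line: any motion along the normal leaves it.
        return 0.0, -abs(normal_speed)
    return abs(signed_distance), -math.copysign(1.0, signed_distance) * normal_speed

# Hypothetical scenario: source at the origin, destination 10 units along the x-axis,
# and a node 3 units above the line.
s, d, node = (0.0, 0.0), (10.0, 0.0), (5.0, 3.0)

towards = line_geometry(node, (0.0, -1.0), s, d)   # moving perpendicularly towards sd
parallel = line_geometry(node, (1.0, 0.0), s, d)   # moving parallel to sd
away = line_geometry(node, (0.0, 1.0), s, d)       # moving away from sd
print(towards, parallel, away)
```

All three nodes are at distance 3 from 'sd'; the approach speed is greatest for the node moving perpendicularly towards the line, zero for the node moving parallel to it, and negative for the node moving away, which matches the ordering the functional is meant to reward.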

PROPOSED WORK

CONCLUSION

