The discovery of association rules is a very important task in the process of data mining. Association rules are an important class of regularities within data and have been studied extensively by the data mining community. The general objective is to find frequent co-occurrences of items within a set of transactions; the co-occurrences found are called associations. The idea of discovering such rules derives from market basket analysis, where the goal is to mine patterns describing the customers' purchase behavior [15]. Today, mining this type of rule is a very important discovery method in the KDD process [31]. A simple association rule could look as follows: Cheese → Beer [support = 0.1, confidence = 0.8]. Put simply, this rule expresses a relationship between beer and cheese. The support measure states that beer and cheese appeared together in 10% of all recorded transactions. The confidence measure describes the chance that there is beer in a transaction provided that there is also cheese: in this case, 80% of all transactions involving cheese also involved beer. We can thereby assume that people who buy cheese are also likely to buy beer in the same transaction. Such information can help retail companies discover cross-sale opportunities and guide category management. In addition, it enables companies to make recommendations, which can be especially useful for online retail shops.

Association rule mining is user-centric because its objective is the elicitation of interesting rules from which knowledge can be derived [16]. Interesting rules are novel, externally significant, unexpected, nontrivial, and actionable. An association mining system supports this process by filtering and presenting the rules for further interpretation by the user.

A lot of interest in association rule mining was ignited by the publications [47] and [46] in 1993/94. In those papers, an algorithm for mining association rules in large databases was described.

3.1 BASICS

We state the problem of mining association rules as follows: I = {i1, i2, …, im} is a set of items and T = {t1, t2, …, tn} is a set of transactions, each of which contains items of the itemset I. Thus, each transaction ti is a set of items such that ti ⊆ I. An association rule is an implication of the form X → Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. X (or Y) is a set of items, called an itemset [15]. An example of a simple association rule would be {bread} → {butter}. This rule says that if bread was in a transaction, butter was in most cases in that transaction too; in other words, people who buy bread often buy butter as well. Such a rule is based on observations of customer behavior and results from the data stored in transaction databases. In a rule of the form X → Y, X is called the antecedent and Y the consequent: the presence of the antecedent implies the presence of the consequent. The antecedent, also called the 'left-hand side' of a rule, can consist either of a single item or of a whole set of items, and the same applies to the consequent, also called the 'right-hand side'. The most complex task of the whole association rule mining process is the generation of frequent itemsets. Many different combinations of items have to be explored, which can be a very computation-intensive task, especially in large databases. As most business databases are very large, there is a strong need for efficient algorithms that can extract itemsets in a reasonable amount of time.

Often, a compromise has to be made between completeness (discovering all itemsets) and computation time. Generally, only those itemsets that fulfill a certain support requirement are taken into consideration. Support and confidence are the two most important quality measures for evaluating the interestingness of a rule.

Support: The support of the rule X → Y is the percentage of transactions in T that contain X ∪ Y. It determines how frequently the rule applies to the transaction set T. The support of a rule is given by the formula supp(X → Y) = (X ∪ Y)/n, where (X ∪ Y) denotes the number of transactions that contain all the items of the rule and n is the total number of transactions. The support is a useful measure to determine whether a set of items occurs frequently in a database or not. Rules covering only a few transactions might not be valuable to the business. The formula above computes the relative support, but there is also an absolute support: it works similarly but simply counts the number of transactions in which the tested itemset occurs, without dividing by the total number of transactions.
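The support measure can be sketched in Python; this is a minimal illustration, and the toy transaction data below is invented for the example:

```python
def support(itemset, transactions):
    """Relative support: fraction of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def absolute_support(itemset, transactions):
    """Absolute support: the raw occurrence count, without dividing by n."""
    return sum(1 for t in transactions if itemset <= t)

# Toy transaction database (invented for illustration)
transactions = [
    {"cheese", "beer"}, {"cheese", "beer", "bread"}, {"bread", "butter"},
    {"beer"}, {"cheese", "wine"}, {"bread"}, {"beer", "bread"},
    {"cheese", "beer", "wine"}, {"butter"}, {"milk"},
]

# {cheese, beer} appears in 3 of the 10 transactions
print(support({"cheese", "beer"}, transactions))
```

The subset test `itemset <= t` checks that every item of the rule occurs in the transaction, mirroring the counting in the formula above.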

Confidence: The confidence of a rule describes the percentage of transactions containing X which also contain Y: conf(X → Y) = supp(X ∪ Y)/supp(X). This is a very important measure to determine whether a rule is interesting or not. It looks at all transactions which contain the item or itemset defined by the antecedent of the rule and then computes the percentage of those transactions that also include all the items contained in the consequent.
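The confidence measure can be sketched the same way; again a minimal illustration over invented toy data:

```python
def confidence(antecedent, consequent, transactions):
    """conf(X -> Y): of the transactions containing X, the fraction that also contain Y."""
    containing_x = [t for t in transactions if antecedent <= t]
    if not containing_x:
        return 0.0  # antecedent never occurs; the rule has no supporting evidence
    return sum(1 for t in containing_x if consequent <= t) / len(containing_x)

# Toy data (invented for illustration)
transactions = [
    {"cheese", "beer"}, {"cheese", "beer", "wine"},
    {"cheese"}, {"beer"}, {"cheese", "beer"},
]

# cheese occurs in 4 transactions, 3 of which also contain beer -> 0.75
print(confidence({"cheese"}, {"beer"}, transactions))
```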

3.1.1 The Process

The process of mining association rules consists of two main parts. First, we have to identify all the itemsets contained in the data that are adequate for mining association rules. These combinations have to show at least a certain frequency to be worth mining and are thus called frequent itemsets. The second step will generate rules out of the discovered frequent itemsets.

1. Mining Frequent Patterns

Mining frequent patterns from a given dataset is not a trivial task. All sets of items that occur at least as frequently as a user-specified minimum support have to be identified at this step. An important issue is computation time, because in large databases there can be a huge number of possible itemsets, all of which need to be evaluated. Different algorithms attempt to make the discovery of frequent patterns efficient.
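A deliberately naive version of this step can be sketched as follows; it enumerates every candidate itemset, which is exactly the exponential cost that efficient algorithms such as Apriori avoid by pruning:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Naive miner: test every candidate itemset against the minimum support.
    Exponential in the number of distinct items; shown only to illustrate the task."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            cand = frozenset(candidate)
            count = sum(1 for t in transactions if cand <= t)
            if count / n >= min_support:
                frequent[cand] = count / n
    return frequent

# Toy data (invented for illustration)
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
result = frequent_itemsets(transactions, min_support=0.5)
```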

2. Discovering Association Rules

After all patterns meeting the minimum support requirement have been generated, rules can be derived from them. To do so, a minimum confidence has to be defined. The task is to generate all possible rules from the frequent itemsets and then compare their confidence values with the minimum confidence (which is again defined by the user). All rules that meet this requirement are regarded as interesting. Frequent itemsets that do not yield any interesting rules do not have to be considered any further. In the end, all discovered rules can be presented to the user together with their support and confidence values.
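The rule-generation step can be sketched like this; it assumes the frequent itemsets and their supports from step 1 are given as a dictionary (the example values are invented):

```python
from itertools import combinations

def generate_rules(frequent, min_confidence):
    """Split every frequent itemset into antecedent -> consequent and keep rules
    whose confidence, supp(itemset) / supp(antecedent), meets the threshold.
    `frequent` maps frozensets to their support values (output of step 1)."""
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                antecedent = frozenset(antecedent)
                consequent = itemset - antecedent
                # every subset of a frequent itemset is itself frequent,
                # so its support is available in `frequent`
                conf = supp / frequent[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, consequent, supp, conf))
    return rules

# Invented supports for illustration
frequent = {
    frozenset({"a"}): 0.75,
    frozenset({"b"}): 0.75,
    frozenset({"a", "b"}): 0.5,
}
rules = generate_rules(frequent, min_confidence=0.6)
```

The lookup `frequent[antecedent]` relies on the downward-closure property: since every subset of a frequent itemset is frequent, its support was recorded in step 1.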

3.1.2 Research

The process of mining association rules consists of two parts: first, discovering frequent itemsets in the data; second, deducing inferences from these itemsets. The first step is by far the more complex part, and thus the majority of related research has focused on itemset discovery. Given a set E of distinct items, 2^|E| possible combinations have to be explored. Because |E| is often large, naive exploration techniques are frequently intractable [16].
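The size of this search space can be checked directly; a minimal sketch:

```python
def candidate_count(num_items):
    """Every non-empty subset of E is a candidate itemset: 2**|E| - 1 of them."""
    return 2 ** num_items - 1

# Already at a few dozen items the space is far beyond exhaustive enumeration:
# candidate_count(10) is 1023, candidate_count(30) is over a billion.
for n in (10, 20, 30):
    print(n, candidate_count(n))
```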

Research is focusing on the following topics:

• Restricting the exploration by developing and applying interest measures and pruning strategies.

• Reducing the I/O cost by exploiting hardware advances that allow large datasets to become memory resident, or by techniques such as intelligent sampling.

• Creating useful data structures to make the analysis more tractable.

• Producing condensed conclusion sets which allow the whole set to be inferred from a reduced set of inferences, lowering storage requirements and simplifying user interpretation.

A variety of algorithms for performing the association rule mining task have already been developed, most of which focus on finding all relevant inferences in a data set. In addition, increasing attention is given to algorithms that try to improve computing time and user interpretation.

3.2 BINARY ASSOCIATION RULES

By the term binary association rules, we refer to the classical association rules in market basket analysis. Here, a product is either in a transaction or not, so only boolean values (true or false, represented by 1 and 0) are possible. Every item in a transaction can thus be defined as a binary attribute with domain {0, 1}. The formal model is defined in [AgIS93] as follows: 'Let I = i1, i2, …, im be a set of binary attributes, called items. Let T be a database of transactions. Each transaction t is represented as a binary vector, with t[k] = 1 if t bought the item ik, and t[k] = 0 otherwise. There is one tuple in the database for each transaction. Let X be a set of some items in I. We say that a transaction t satisfies X if for all items ik ∈ X, t[k] = 1.'
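The binary-vector model and the 'satisfies' predicate can be sketched in Python; the item names and the index mapping below are invented for illustration:

```python
def satisfies(t, X, item_index):
    """A transaction t (binary vector) satisfies itemset X iff t[k] == 1
    for every item in X; `item_index` maps each item to its vector position."""
    return all(t[item_index[i]] == 1 for i in X)

# Invented item universe I = {i1, ..., im} and one transaction
items = ["bread", "butter", "beer", "cheese"]
item_index = {item: k for k, item in enumerate(items)}

t = [1, 1, 0, 0]  # transaction containing bread and butter
```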

An association rule is, as already stated in chapter 3.1, an implication of the form X → Y, where X and Y are sets of items contained in I and Y is not present in X. We say that a rule is satisfied in T with the confidence factor 0 ≤ c ≤ 1 if at least c% of the transactions in T that support X also support Y. The notation X → Y | c can be used to express that a rule has a confidence factor of c. In [47], the problem of rule mining is divided into two subproblems:

• It is necessary to identify all combinations of items that have a transaction support above a certain threshold, called minsupport. Those sets of items that show sufficient support are called large or frequent itemsets, and those not meeting the threshold small itemsets. Syntactic constraints can also be taken into consideration, for example if we are only interested in rules that contain a certain item in the antecedent or the consequent.

• After identifying the itemsets that satisfy the minsupport threshold, the rules derived from them have to be tested against the confidence factor c. Only the previously defined large itemsets have to be considered at this stage. The confidence is computed by dividing the support of the whole itemset by the support of the antecedent.

Once the first problem, finding all relevant large itemsets, has been solved, the second part is rather straightforward. To discover large itemsets, the Apriori algorithm was developed; it is the first and nowadays best-known algorithm for mining association rules. Apriori and other algorithms will be explored in greater detail.
