Essay: Data mining


Various journals and articles related to data mining were studied and analyzed. Some compared existing algorithms, while others tried to modify them to improve performance. Table 2.0 shows some research works carried out by renowned scholars in the field of frequent itemset mining.
As frequent itemset mining is central to the mining of association rules, various techniques have been proposed for generating frequent itemsets. For the purpose of this thesis, we take a look at the following techniques:
Horizontal layout / candidate set generation techniques
• Apriori algorithm
• AprioriTID algorithm
Projected database / non-candidate set generation techniques
• FP-growth algorithm
• PrePost+ algorithm
Apart from the above-mentioned algorithms, there are more than a dozen others used to mine frequent itemsets. The algorithms vary mainly in how the candidate itemsets are generated and how the supports of the candidate itemsets are counted. Regardless of these variations, they all share the same general scheme for mining frequent itemsets:
• In the first pass, the support of each individual item is counted and the large ones are determined.
• In each subsequent pass, the large itemsets determined in the previous pass are used to generate new itemsets called candidate itemsets.
• The support of each candidate itemset is counted and large ones are determined.
• This process continues until no new large itemsets are found.
2.2 Horizontal layout/ Candidate set generation techniques
In general, this describes a database in which each row represents a transaction, identified by a transaction identifier (TID) and followed by a set of items. An example of a horizontal layout dataset is shown in table 2.1.
2.2.1 Apriori Algorithm
The Apriori algorithm is used for frequent itemset mining and association rule learning. It uses a level-wise search, in which k-itemsets (an itemset that contains k items is known as a k-itemset) are used to explore (k+1)-itemsets, to mine frequent itemsets from a transactional database for Boolean association rules. In this algorithm, frequent subsets are extended one item at a time, a step known as candidate generation, and groups of candidates are then tested against the data. To count candidate itemsets efficiently, Apriori uses breadth-first search and a hash tree structure.
The Apriori algorithm developed by [9] is also the best-known algorithm for association rule mining. The search for association rules is guided by two parameters: support and confidence. Apriori returns an association rule if its support and confidence values are above user-defined thresholds. The output is ordered by confidence; if several rules have the same confidence, they are ordered by support. Thus Apriori favors more confident rules and characterizes them as more interesting. However, it also has certain disadvantages: it only captures the presence or absence of items in transactional databases, it requires a large number of database scans, the minimum support threshold used is uniform, and the number of candidate itemsets produced is large.
The working of the Apriori algorithm depends largely on the Apriori property, which states that "all nonempty subsets of a frequent itemset must be frequent" [2]. Its contrapositive is the anti-monotone property: if an itemset cannot pass the minimum support test, none of its supersets can pass it either [2, 3]. Therefore, if an itemset is infrequent, all of its supersets are also infrequent. This property is used to prune the infrequent candidate itemsets. In the beginning, the set of frequent 1-itemsets is found; the set of itemsets containing one item that satisfy the support threshold is denoted by L1. In each subsequent pass, we begin with a seed set of itemsets found to be large in the previous pass. This seed set is used to generate new potentially large itemsets, called candidate itemsets, and the actual support of these candidates is counted during the pass over the data. At the end of the pass, we determine which of the candidate itemsets are actually large (frequent), and they become the seed for the next pass. Therefore, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. This approach, first introduced by [2] in the Apriori algorithm, is used by many algorithms for frequent pattern generation. The basic steps to mine the frequent elements are as follows [3]:
• Generate and test: First, find the frequent 1-itemsets L1 by scanning the database and removing from C1 all elements that do not satisfy the minimum support criterion.
• Join step: To obtain the next level's candidates Ck, join the previous frequent itemsets with themselves, i.e. Lk-1 * Lk-1, the self-join (Cartesian product) of Lk-1. This step generates new candidate k-itemsets by joining Lk-1, found in the previous iteration, with itself. Let Ck denote the candidate k-itemsets and Lk the frequent k-itemsets.
• Prune step: Ck is a superset of Lk, so members of Ck may or may not be frequent, but all frequent k-itemsets are included in Ck; thus Ck is pruned to find the frequent k-itemsets with the help of the Apriori property. That is, this step eliminates some of the candidate k-itemsets. A scan of the database to determine the count of each candidate in Ck results in Lk (all candidates having a count no less than the minimum support count are frequent by definition, and therefore belong to Lk). Ck, however, can be huge, and so this could involve heavy computation. To shrink the size of Ck, the Apriori property is used as follows: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, the candidate cannot be frequent either and can be removed from Ck. Steps 2 and 3 are repeated until no new candidate set is generated (a code sketch of this loop is given below).
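To make the generate, join, and prune steps concrete, here is a minimal Python sketch of the level-wise loop (an illustration only, not the implementation from the cited papers; the function name, data layout, and the toy transactions at the end are assumptions made for the example):

from itertools import combinations

def apriori(transactions, min_sup):
    # Pass 1: count individual items and keep the large ones (L1).
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: n for s, n in counts.items() if n >= min_sup}
    frequent = dict(L)
    k = 2
    while L:
        # Join step: self-join L(k-1) to form candidate k-itemsets,
        # pruning any candidate with an infrequent (k-1)-subset.
        prev = list(L)
        Ck = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                c = prev[i] | prev[j]
                if len(c) == k and all(frozenset(s) in L
                                       for s in combinations(c, k - 1)):
                    Ck.add(c)
        # Scan the database once to count the surviving candidates.
        counts = {c: 0 for c in Ck}
        for t in transactions:
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        L = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(L)
        k += 1
    return frequent

# Toy run on made-up data: items are ints, min_sup = 2 transactions.
print(apriori([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], 2))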
The Apriori algorithm is thus stated:
Apriori Algorithm: Find frequent itemsets using an iterative level-wise approach based on candidate generation.
Input:
• D, a database of transactions;
• min_sup, the minimum support count threshold.
Output: L, frequent itemsets in D.
Method:
(1) L1 = find_frequent_1-itemsets(D);
(2) for (k = 2; Lk-1 ≠ Ø; k++) {
(3) Ck = apriori_gen(Lk-1);
(4) for each transaction t ∈ D { // scan D for counts
(5) Ct = subset(Ck, t); // get the subsets of t that are candidates
(6) for each candidate c ∈ Ct
(7) c.count++;
(8) }
(9) Lk = {c ∈ Ck | c.count ≥ min_sup}
(10) }
(11) return L = ∪k Lk;
procedure apriori_gen(Lk-1: frequent (k-1)-itemsets)
(1) for each itemset l1 ∈ Lk-1
(2) for each itemset l2 ∈ Lk-1
(3) if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ … ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then {
(4) c = l1 ⋈ l2; // join step: generate candidates
(5) if has_infrequent_subset(c, Lk-1) then
(6) delete c; // prune step: remove unfruitful candidate
(7) else add c to Ck;
(8) }
(9) return Ck;

procedure has_infrequent_subset(c: candidate k-itemset; Lk-1: frequent (k-1)-itemsets); // use prior knowledge
(1) for each (k-1)-subset s of c
(2) if s ∉ Lk-1 then
(3) return TRUE;
(4) return FALSE;

The apriori_gen function takes as argument Lk-1, the set of all large (k-1)-itemsets, and returns a superset of the set of all large k-itemsets. Let us look at a concrete working example of the Apriori algorithm. The horizontal layout dataset is shown in table 2.2.

Table 2.2: Horizontal dataset D for the Apriori algorithm

Figure 2.2: Working example of the Apriori algorithm

2.2.2 AprioriTID

Just like Apriori, the AprioriTID algorithm uses the apriori-gen function to determine the candidate itemsets. The difference between the two algorithms is that in AprioriTID the database is not referred to for counting support after the first pass; instead, a set of candidate itemsets is used for this purpose for k > 1. When a transaction does not contain any candidate k-itemset, the set of candidate itemsets will have no entry for that transaction, so the number of entries in this set can be smaller than the number of transactions in the database. Moreover, as k increases, each entry becomes smaller than the corresponding transaction, because the number of candidates contained in the transaction keeps decreasing. Apriori performs better than AprioriTID only in the initial passes; over later passes, AprioriTID clearly outperforms Apriori.
The AprioriTID algorithm also uses the apriori-gen function to determine the candidate itemsets before the pass begins. The interesting feature of this algorithm is that the database D is not used for counting support after the first pass; rather, the set Ćk is used for this purpose. Each member of the set Ćk is of the form <TID, {Xk}>, where each Xk is a potentially large k-itemset present in the transaction with identifier TID. For k = 1, Ć1 corresponds to the database D, although conceptually each item i is replaced by the itemset {i}. For k > 1, Ćk is generated by the algorithm (step 10). The member of Ćk corresponding to transaction t is <t.TID, {c ∈ Ck | c contained in t}>. If a transaction does not contain any candidate k-itemset, then Ćk will not have an entry for it; thus the number of entries in Ćk may be smaller than the number of transactions in the database, especially for large values of k. In addition, for large values of k, each entry may be smaller than the corresponding transaction because very few candidates may be contained in the transaction. However, for small values of k, each entry may be larger than the corresponding transaction because an entry in Ćk includes all candidate k-itemsets contained in the transaction.
1) L1 = {large 1-itemsets};
2) Ć1 = database D;
3) for (k = 2; Lk-1 ≠ Ø; k++) do begin
4) Ck = apriori-gen(Lk-1); // new candidates
5) Ćk = Ø;
6) forall entries t ∈ Ćk-1 do begin
7) // determine candidate itemsets in Ck contained in the transaction with identifier t.TID
Ct = {c ∈ Ck | (c - c[k]) ∈ t.set-of-itemsets ∧ (c - c[k-1]) ∈ t.set-of-itemsets};
8) forall candidates c ∈ Ct do
9) c.count++;
10) if (Ct ≠ Ø) then Ćk += <t.TID, Ct>;
11) end
12) Lk = {c ∈ Ck | c.count ≥ minsup};
13) end
14) Answer = ∪k Lk;
Figure 2.4: Working example of the AprioriTID algorithm
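The per-transaction candidate sets described above can be sketched in Python as follows (again an illustration under assumptions, not the original implementation; C_bar plays the role of the paper's Ćk, and the containment test uses the two (k-1)-subsets from line 7 of the pseudocode):

from itertools import combinations

def apriori_tid(transactions, min_sup):
    # C_bar (the paper's Ć1) corresponds to the database: one entry per
    # transaction, each item conceptually replaced by the 1-itemset {item}.
    C_bar = [(tid, {frozenset([i]) for i in t})
             for tid, t in enumerate(transactions)]
    counts = {}
    for _, itemsets in C_bar:
        for s in itemsets:
            counts[s] = counts.get(s, 0) + 1
    L = {s for s, n in counts.items() if n >= min_sup}
    answer = set(L)
    k = 2
    while L and C_bar:
        # apriori-gen: join + prune, exactly as in plain Apriori.
        prev = list(L)
        Ck = {a | b for a in prev for b in prev
              if len(a | b) == k and all(frozenset(s) in L
                                         for s in combinations(a | b, k - 1))}
        counts = {c: 0 for c in Ck}
        next_C_bar = []
        for tid, itemsets in C_bar:
            # t contains c iff the two (k-1)-subsets obtained by dropping
            # c's largest and second-largest items both occur in t's entry.
            Ct = set()
            for c in Ck:
                items = sorted(c)
                if (c - {items[-1]}) in itemsets and (c - {items[-2]}) in itemsets:
                    Ct.add(c)
                    counts[c] += 1
            if Ct:
                next_C_bar.append((tid, Ct))  # Ćk keeps only non-empty entries
        C_bar = next_C_bar
        L = {c for c, n in counts.items() if n >= min_sup}
        answer |= L
        k += 1
    return answer

print(apriori_tid([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], 2))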
2.3 Projected database / non-candidate set generation techniques
The concept of the projected database was proposed and applied to mine frequent itemsets efficiently, because the early approaches could mine frequent itemsets but used a large amount of memory. This type of database uses a divide-and-conquer strategy to mine itemsets, and therefore counts support more efficiently than Apriori-based algorithms. Tree-projection-based approaches use a tree structure to store and mine the itemsets. In the projected layout, each row contains a record id followed, in a separate column, by the record itself.
Tree projection is defined as a lexicographic tree whose nodes contain the frequent itemsets [14]. The lexicographic tree usually stores the frequent itemsets in ascending order of support for better mining. Tree projection algorithms are based on two kinds of ordering: breadth-first and depth-first. In breadth-first order, the nodes of the lexicographic tree are constructed level by level [11]. To compute the frequencies of the nodes (corresponding frequent itemsets) at level k, the tree projection algorithm maintains matrices at the nodes of level k-2, and one database scan is required for counting support [5]. Every transaction is projected down the tree node by node, and the reduced projected set of transactions is used to evaluate frequency.
In depth-first order, the database is projected along the lexicographic tree and is also required to fit into main memory [13]. The advantage is that the projected database becomes smaller along each branch of the lexicographic tree, whereas the breadth-first approach needs to project the database from scratch at each level. The obvious disadvantage of depth-first order is that it needs to keep the database and the projected databases in memory. The breadth-first method also meets a memory bottleneck when the number of frequent items is large and the matrix is too large to fit in memory [5].
In many cases the Apriori candidate generate-and-test method significantly reduces the size of the candidate sets, leading to good performance gains. However, it can suffer from two nontrivial costs: it may need to generate a huge number of candidate sets (for example, if there are 10^4 frequent 1-itemsets, the Apriori algorithm will need to generate more than 10^7 candidate 2-itemsets), and it may need to repeatedly scan the database and check a large set of candidates by pattern matching.
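As a quick check of that arithmetic: self-joining 10^4 frequent 1-itemsets produces one candidate for every pair of items, i.e. 10^4 × (10^4 - 1) / 2 ≈ 5 × 10^7 candidate 2-itemsets, which is indeed more than 10^7.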
2.3.1 FP-Growth
The solution to the above-mentioned issues with candidate set generation is its opposite, i.e. the use of non-candidate-generation algorithms. One example is FP-growth, which mines the complete set of frequent itemsets without candidate generation. This method adopts a divide-and-conquer strategy: first, it compresses the database representing frequent items into a frequent-pattern tree, or FP-tree, which retains the itemset association information. It then divides the compressed database into a set of conditional databases, each associated with one frequent item or pattern fragment, and mines each such database separately.
It uses a prefix tree representation of the given database of transactions (called the FP-tree), which saves a considerable amount of memory for storing the transactions. An FP-tree is a prefix tree for transactions: every node represents one item, and each path represents a set of transactions containing that item. All nodes referring to the same item are linked together in a list, so that all transactions containing the item can easily be found and counted. Large databases are thus compressed into a compact FP-tree structure that stores the necessary information about the frequent itemsets in the database.
Frequent-pattern tree: Design and construction
Let us take a look at how the compact FP-tree structure is constructed.
Let I = {a1, a2, …, am} be a set of items, and let DB = ⟨T1, T2, …, Tn⟩ be a transaction database, where Ti (i ∈ [1…n]) is a transaction containing a set of items in I. The support (or occurrence frequency) of a pattern A, where A is a set of items, is the number of transactions containing A in DB. A pattern A is frequent if A's support is no less than a predefined minimum support threshold ξ. Given a transaction database DB and a minimum support threshold ξ, the problem of finding the complete set of frequent patterns is called the frequent-pattern mining problem.
Frequent-pattern tree
To design a compact data structure for efficient frequent-pattern mining, let’s first examine an example.
Table 2.3 Transactional database DB for FP-tree construction
Example 1. Let the transaction database DB be the first two columns of Table 2.3, and let the minimum support threshold be 3 (i.e., ξ = 3).
A compact data structure can be designed based on the following observations:
1. Since only the frequent items play a role in frequent-pattern mining, it is necessary to perform one scan of the transaction database DB to identify the set of frequent items (with the frequency counts obtained as a by-product).
2. If the set of frequent items of each transaction can be stored in some compact structure, it may be possible to avoid repeatedly scanning the original transaction database.
3. If multiple transactions share a set of frequent items, it may be possible to merge the shared sets with the number of occurrences registered as count. It is easy to check whether two sets are identical if the frequent items in all of the transactions are listed according to a fixed order.
4. If two transactions share a common prefix, according to some sorted order of frequent items, the shared parts can be merged using one prefix structure as long as the count is registered properly. If the frequent items are sorted in their frequency descending order, there are better chances that more prefix strings can be shared.
With the above observations, the frequent-pattern tree can be constructed as follows:
Figure 2.5: FP-tree
First, a scan of DB derives a list of frequent items, (f:4), (c:4), (a:3), (b:3), (m:3), (p:3) (the number after ":" indicates the support), in which items are ordered in descending order of frequency. This ordering is important, since each path of the tree will follow it. For convenience of later discussion, the frequent items in each transaction are listed in this order in the rightmost column of Table 2.3.
Second, the root of the tree is created and labeled "null". The FP-tree is then constructed as follows by scanning the transaction database DB a second time.
1. The scan of the first transaction leads to the construction of the first branch of the tree:
(f:1),(c:1),(a:1),(m:1),(p:1). Notice that the frequent items in the transaction are listed according to the order in the list of frequent items.
2. For the second transaction, since its (ordered) frequent item list f, c, a, b, m shares a common prefix f, c, a with the existing path f, c, a, m, p, the count of each node along the prefix is incremented by 1, and one new node (b:1) is created and linked as a child of (a:2) and another new node (m:1) is created and linked as the child of (b:1).
3. For the third transaction, since its frequent item list f, b shares only the node f with the f-prefix subtree, f's count is incremented by 1, and a new node (b:1) is created and linked as a child of (f:3).
4. The scan of the fourth transaction leads to the construction of the second branch of the tree,
(c:1), (b:1), (p:1).
5. For the last transaction, since its frequent item list f, c, a, m, p is identical to the first one, the path is shared with the count of each node along the path incremented by 1.
Based on this example, a frequent-pattern tree can be designed as follows.
Definition (FP-tree). A frequent-pattern tree (or FP-tree in short) is a tree structure defined below.
1. It consists of one root labeled as “null”, a set of item-prefix subtrees as the children of the root, and a frequent-item-header table.
2. Each node in the item-prefix subtree consists of three fields: item-name, count, and node-link, where item-name registers which item this node represents, count registers the number of transactions represented by the portion of the path reaching this node, and node-link links to the next node in the FP-tree carrying the same item-name, or null if there is none.
3. Each entry in the frequent-item-header table consists of two fields, (1) item-name and (2) head of node-link (a pointer pointing to the first node in the FP-tree carrying the item-name).
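The construction just described can be sketched in a few lines of Python (a minimal illustration of the definition above; the class and function names are mine, and the five ordered transactions at the end are the ones listed in the rightmost column of Table 2.3):

class FPNode:
    # One FP-tree node: item-name, count, parent, children, node-link.
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}      # item-name -> child FPNode
        self.node_link = None   # next node carrying the same item-name

def build_fp_tree(ordered_transactions):
    # Second scan: insert each transaction's ordered frequent-item list,
    # sharing prefixes and bumping counts exactly as in steps 1-5 above.
    root = FPNode(None, None)   # the root labeled "null"
    header = {}                 # frequent-item-header table: item -> head of node-link
    for trans in ordered_transactions:
        node = root
        for item in trans:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                # thread the new node onto the item's node-link chain
                child.node_link = header.get(item)
                header[item] = child
            child.count += 1
            node = child
    return root, header

# The five ordered frequent-item lists derived above from Table 2.3:
db = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
root, header = build_fp_tree(db)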
Based on the above definition, we have the following algorithm for mining the frequent patterns using FP-tree:
Algorithm (FP-growth: Mining frequent patterns with FP-tree by pattern fragment growth).
Input: A database DB, represented by FP-tree constructed according to Algorithm 1, and a minimum support threshold ξ.
Output: The complete set of frequent patterns.
Method: call FP-growth(FP-tree, null).
Procedure FP-growth(Tree, α)
{
(1) if Tree contains a single prefix path // mining single prefix-path FP-tree
(2) then {
(3) let P be the single prefix-path part of Tree;
(4) let Q be the multipath part with the top branching node replaced by a null root;
(5) for each combination (denoted as β) of the nodes in path P do
(6) generate pattern β ∪ α with support = minimum support of nodes in β;
(7) let freq_pattern_set(P) be the set of patterns so generated; }
(8) else let Q be Tree;
(9) for each item ai in Q do { // mining multipath FP-tree
(10) generate pattern β = ai ∪ α with support = ai.support;
(11) construct β's conditional pattern-base and then β's conditional FP-tree Treeβ;
(12) if Treeβ ≠ Ø
(13) then call FP-growth(Treeβ, β);
(14) let freq_pattern_set(Q) be the set of patterns so generated; }
(15) return (freq_pattern_set(P) ∪ freq_pattern_set(Q) ∪ (freq_pattern_set(P) × freq_pattern_set(Q)))
}
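Below is a compact Python sketch of the mining recursion, reusing build_fp_tree from the previous sketch (an illustration under assumptions: it treats every tree as a multipath tree, i.e. it omits the single prefix-path shortcut of steps (1)-(7), which affects only efficiency, not the output):

def prefix_path(node):
    # Walk from node's parent up to (but excluding) the null root.
    path = []
    node = node.parent
    while node.item is not None:
        path.append(node.item)
        node = node.parent
    return path

def fp_growth(header, alpha, min_sup, results):
    for item, head in header.items():
        # ai's support = sum of counts along its node-link chain (step 10)
        support = 0
        cond_base = []          # beta's conditional pattern-base (step 11)
        n = head
        while n is not None:
            support += n.count
            path = prefix_path(n)
            if path:
                cond_base.append((path, n.count))
            n = n.node_link
        if support < min_sup:
            continue
        beta = alpha | {item}
        results[frozenset(beta)] = support
        # Keep only items frequent within the conditional pattern-base,
        # then build beta's conditional FP-tree and recurse (steps 11-13).
        item_counts = {}
        for path, cnt in cond_base:
            for i in path:
                item_counts[i] = item_counts.get(i, 0) + cnt
        cond_db = []
        for path, cnt in cond_base:
            kept = [i for i in path if item_counts[i] >= min_sup]
            kept.sort(key=lambda i: (-item_counts[i], i))  # any fixed order works
            if kept:
                cond_db.extend([kept] * cnt)               # expand by count
        if cond_db:
            _, cond_header = build_fp_tree(cond_db)
            fp_growth(cond_header, beta, min_sup, results)

# Mine the example tree from the previous sketch with ξ = 3:
results = {}
fp_growth(header, set(), 3, results)
print(results)   # each frequent itemset with its support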
2.3.2 PrePost+
In recent years, an algorithm called PrePost (Deng et al., 2012) was proposed for mining frequent itemsets. The high efficiency of PrePost is achieved by: (1) employing a novel data structure named N-list to represent itemsets; and (2) adopting single path property of N-list to directly discover frequent itemsets without generating candidate itemsets. The experiments in Deng et al. (2012) show that PrePost runs faster than some state-of-the-art mining algorithms including FP-growth.
Although PrePost adopts the single path property of N-lists to prune the search space, it still suffers from too many candidates, because it employs an Apriori-like approach to mine frequent itemsets. Later, a new algorithm called PrePost+ was proposed which effectively avoids this problem. PrePost+ employs N-lists to represent itemsets and directly discovers frequent itemsets in a set-enumeration search tree. To avoid repetitive search, it also adopts Children-Parent Equivalence pruning, which greatly reduces the search space.
PrePost adopts a prefix tree structure called the PPC-tree to store the database. Each node in a PPC-tree is assigned a Pre-Post code by traversing the tree in pre-order and post-order. Based on the PPC-tree with Pre-Post codes, each frequent item can be represented by an N-list: the list of PP-codes, each consisting of the pre-order code, post-order code, and count of a node registering that frequent item. As stated earlier, PrePost adopts an Apriori-like approach to find frequent itemsets: it obtains the N-lists of candidate itemsets of length (k+1) by intersecting the N-lists of frequent itemsets of length k, and thus discovers the frequent itemsets of length (k+1). However, PrePost can directly mine frequent itemsets without generating candidates in most cases.
The framework of PrePost+ is the same as that of PrePost, and consists of: (1) construct the PPC-tree and identify all frequent 1-itemsets; (2) based on the PPC-tree, construct the N-list of each frequent 1-itemset; (3) scan the PPC-tree to find all frequent 2-itemsets; (4) mine all frequent k(>2)-itemsets. The main difference between PrePost+ and PrePost is that PrePost+ adopts superset equivalence as its pruning strategy, while PrePost adopts the single path property of N-lists. We now take a look at the framework of PrePost+:
1) PPC-tree construction
We define the PPC-tree as follows.
Definition 1. PPC-tree is a tree structure:
1) It consists of one root labeled as “null”, and a set of item prefix subtrees as the children of the root.
2) Each node in the item prefix subtree consists of five fields: item-name, count, children-list, pre-order, and post-order. Item-name registers which item the node represents. Count registers the number of transactions represented by the portion of the path reaching the node. Children-list registers all children of the node. Pre-order is the pre-order rank of the node, and post-order is its post-order rank.
According to Definition 1, the PPC-tree looks like an FP-tree [1, 15] (note that the FP-tree is called a PC-tree in [15]). However, there are three important differences between them.
First, the FP-tree has a node-link field in each node and a header table structure to maintain the connection of nodes whose item-names are equal, whereas the PPC-tree has no such structures. The PPC-tree is thus a simpler prefix tree.
Second, each node in the PPC-tree has pre-order and post-order fields, while nodes in the FP-tree do not. The pre-order of a node is determined by a pre-order traversal of the tree: a node N is visited and assigned its pre-order rank before all its children are traversed recursively from left to right. In other words, the pre-order records the time at which node N is accessed during the pre-order traversal. In the same way, the post-order of a node is determined by a post-order traversal, in which a node N is visited and assigned its post-order rank after all its children have been traversed recursively from left to right.
Third, after an FP-tree is built, it is used for frequent itemset mining throughout the whole FP-growth algorithm, which is a recursive and complex process. The PPC-tree, by contrast, is only used for generating the Pre-Post code of each node. As we will see, once the Pre-Post codes of the frequent items have been collected, the PPC-tree has finished its entire task and can be deleted.
Based on Definition 1, we have the following PPC-tree construction algorithm.
To get a better understanding of the concept and the construction algorithm of PPC-tree, let’s examine the following example.
Table 2.4 Transactional database DB for PPC-tree construction
Example 1. Let the transaction database DB be represented by the information in the left two columns of Table 2.4, with ξ = 0.4. The set of frequent 1-itemsets is F1 = {a, b, c, e, f}. Figure 2.7 shows the PPC-tree resulting from Example 1 after the construction algorithm below has executed. The node labeled (4, 8) means that its pre-order is 4, its post-order is 8, its item-name is b, and its count is 4. Note that the PPC-tree is constructed using the rightmost column of Table 2.4: the second column and the last column are equivalent for mining frequent itemsets under the given minimum support, since in the rightmost column all infrequent items are eliminated and the frequent items are listed in support-descending order. This ensures that DB can be efficiently represented by a compressed tree structure.
Figure 2.7: PPC-tree
Thus the algorithm for the PPC tree construction is given as:
Input: A transaction database DB and a minimum support ξ.
Output: A PPC-tree and F1 (the set of frequent 1-itemsets).
Method: Construct-PPC-tree (DB,ξ)
1: [Frequent 1-itemsets generation]
2: According to ξ, scan DB once to find F1, the set of frequent 1-itemsets (frequent items), and their supports.
3: Sort F1 in support descending order as L1, which is the list of ordered frequent items. Note that, if the supports of some frequent items are equal, the orders can be assigned arbitrarily.
4: [PPC-tree construction]
5: Create the root of a PPC-tree, Tr, and label it as “null”.
6: for each transaction Trans in DB do
7: Select the frequent items in Trans and sort them according to the order of L1. Let the sorted frequent-item list in Trans be [p|P], where p is the first element and P is the remaining list.
8: Call insert tree ([p|P],Tr).
9: end for
10: [Pre-Post code generation]
11: Scan PPC-tree to generate the pre-order and the post-order of each node.
13: [Function insert tree ([p|P],Tr)]
14: if Tr has a child N such that N.item-name = p.item-name then
15: increase N’s count by 1;
16: else
17: create a new node N, with its count initialized to 1, and add it to Tr’s children-list;
18: if P is nonempty then
19: call insert tree(P, N) recursively.
20: end if
21: end if
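A minimal Python sketch of Definition 1 and the construction algorithm follows (illustrative only; the class name, field names, and the choice to start both ranks at 0 are assumptions, since the rank origin in the paper's figure may differ):

class PPCNode:
    # The five fields of Definition 1.
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = []      # children-list, left-to-right insertion order
        self.pre_order = None
        self.post_order = None

def build_ppc_tree(ordered_transactions):
    # Insert transactions exactly like insert_tree([p|P], Tr) above.
    root = PPCNode(None)        # the root labeled "null"
    for trans in ordered_transactions:
        node = root
        for item in trans:
            child = next((c for c in node.children if c.item == item), None)
            if child is None:
                child = PPCNode(item)
                node.children.append(child)
            child.count += 1
            node = child
    # Pre-Post code generation: pre-order ranks assigned on the way down,
    # post-order ranks on the way back up (steps 10-11).
    counters = {"pre": 0, "post": 0}
    def assign(node):
        node.pre_order = counters["pre"]; counters["pre"] += 1
        for c in node.children:
            assign(c)
        node.post_order = counters["post"]; counters["post"] += 1
    assign(root)
    return root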
2) N-List: Definition and Properties
Given a PPC-tree, the N-list of a frequent item is a sequence of all the PP-codes of nodes registering the item in the PPC-tree. The PP-codes are arranged in an ascending order of their pre-order values.
Each PP-code in the N-list is denoted by (x, y):z, where x is its pre-order, y its post-order, and z its count. The N-list of a frequent itemset is denoted by <(x1, y1):z1>, <(x2, y2):z2>, …, <(xn, yn):zn>, where x1 < x2 < … < xn.
For example, the N-list of f consists of two PP-codes, <(2,0):1> and <(7,3):1>. The figure below shows the N-lists of all the items in the PPC-tree:
Figure 2.8: N-lists of frequent itemsets
From the above example, we can derive the following properties of the N-list.
Property 1. Given any two different nodes N1 and N2 that represent the same item (N1.item-name = N2.item-name), if N1.pre-order < N2.pre-order, then N1.post-order < N2.post-order.
When N1.pre-order < N2.pre-order, N1 is traversed earlier than N2 during the pre-order traversal. Since N1 cannot be an ancestor of N2 (they both register the same item), N1 must lie on a branch of the PPC-tree to the left of N2. During the post-order traversal, that left branch is also traversed before N2, so N1.post-order < N2.post-order.
Property 2. Given the N-list of item i, denoted by <(x1, y1):z1>, <(x2, y2):z2>, …, <(xn, yn):zn>, the support of item i is z1 + z2 + … + zn. This follows from the definition of the PP-code: each PP-code corresponds to a node in the PPC-tree whose count registers the number of transactions containing item i, so the sum of the counts of all nodes registering item i is i's support count. For example, from figure 2.7 (the PPC-tree), the N-list of item f is <(2,0):1>, <(7,3):1>, so the support of f is 1 + 1 = 2.
MINING FREQUENT ITEMSETS USING N-LISTS
As stated earlier, the main steps involved in mining frequent itemsets in PrePost+ are: 1) construct the PPC-tree and identify all frequent 1-itemsets; 2) based on the PPC-tree, construct the N-list of each frequent 1-itemset; 3) scan the PPC-tree to find all frequent 2-itemsets; 4) mine all frequent k(>2)-itemsets. Step 1 is detailed by the PPC-tree construction algorithm above; details of the other steps follow.
BUILDING N-LIST OF FREQUENT 1-ITEMSETS
Given transaction database DB and minimum support ξ, once we build the PPC-tree, it is easy to obtain the N-list of each frequent 1-itemset. By traversing the PPC-tree with pre-order, we can access each node in PPC-tree. For each node N, we insert (N.pre-order, N.post-order):N.count into the N-list of the item registered by N. Algorithm 2 shows the details of how to construct the N-lists of all frequent 1-itemsets.
Algorithm N-lists construction
Input: PPC-tree and L1, the set of frequent 1-itemsets.
Output: NL1, the set of the N-lists of frequent 1-itemsets.
Procedure N-lists_construction (PPC-tree)
1. Create NL1, let NL1[k] be the N-list of L1[k].
2. for each node N of PPC-tree accessed by pre-order traversal do
3. if (N.item-name = L1[k].item-name) then
4. insert ((N.pre-order, N.post-order) : N.count) into NL1 [k]
5. end if
6. end for
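In Python, the same traversal can be sketched as follows against the PPCNode structure from the earlier sketch (names are mine; PP-codes are stored as (pre-order, post-order, count) triples, and Property 2's support computation is included for completeness):

def build_n_lists(root, frequent_items):
    # One pre-order traversal; each node contributes the PP-code
    # (pre-order, post-order): count to the N-list of its item.
    n_lists = {item: [] for item in frequent_items}
    def walk(node):
        if node.item in n_lists:
            # appended in pre-order, so each N-list is already sorted
            # in ascending order of pre-order values
            n_lists[node.item].append((node.pre_order, node.post_order, node.count))
        for c in node.children:
            walk(c)
    walk(root)
    return n_lists

def support(n_list):
    # Property 2: the support of an item is the sum of counts in its N-list.
    return sum(z for _, _, z in n_list)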
MINING FREQUENT 2-ITEMSETS
Many studies [5, 10] reveal that mining frequent 2-itemsets is costly for Apriori-based algorithms, because the number of candidate 2-itemsets is usually huge, and it is therefore time-consuming to find frequent 2-itemsets by joining frequent 1-itemsets. Ref. [5] proposed a typical way to find all frequent 2-itemsets without joining frequent 1-itemsets. The method goes as follows: first, for each transaction, get all its 2-itemsets (subsets); then, after processing all transactions, the support of each 2-itemset is known; finally, it is easy to pick out the frequent 2-itemsets.
However, there is a better method, which finds all frequent 2-itemsets by traversing the PPC-tree. It works as follows. By traversing the PPC-tree in pre-order, we access every node of the tree. For each node N and each of its ancestors Na, we increase the count of the 2-itemset N.item-name ∪ Na.item-name by N.count. After scanning the PPC-tree, we have all 2-itemsets contained in the PPC-tree together with their supports (their final counts). To find the frequent 2-itemsets, we just need to check which of these 2-itemsets have support no less than ξ × |DB|.
Because the PPC-tree is more compressed than the corresponding transaction database and this method does not need to generate all 2-itemsets of each transaction, it is more efficient than the method in [5]. Algorithm 3 shows the details. To raise efficiency, Algorithms 2 and 3 can be implemented in the same pre-order traversal. The figure below shows the frequent 2- and 3-itemsets:
Algorithm mining frequent 2-itemsets
Input: PPC-tree and L1, the set of all frequent 1-itemsets.
Output: L2 the set of all frequent 2-itemsets.
Procedure L2_Construction (PPC-tree)
1: Let L1[k] be the kth element in L1, set L1[k].order = k.
2: Create Temp2 = int[L1.size()][L1.size()].
3: for each node N of the PPC-tree accessed by pre-order traversal do
4: for each ancestor node Na of N do
5: Temp2[N.item-name.order][Na.item-name.order] += N.count;
6: end for
7: end for
8: for each element Temp2[i][j] in Temp2 do
9: if Temp2[i][j] ≥ ξ × |DB| then
10: insert L1[i] ∪ L1[j] into L2
11: end if
12: end for
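A Python sketch of this counting pass, against the PPCNode structure from the earlier sketch, might look as follows (illustrative; order maps each item to its rank in L1, and min_count stands for ξ × |DB|):

def frequent_2_itemsets(root, order, min_count):
    # order: item -> its rank in L1 (support-descending), i.e. L1[k].order = k.
    m = len(order)
    temp2 = [[0] * m for _ in range(m)]     # Temp2 = int[L1.size()][L1.size()]
    def walk(node, ancestors):
        if node.item is not None:
            # for each ancestor Na of N: Temp2[N.order][Na.order] += N.count
            for a in ancestors:
                temp2[order[node.item]][order[a]] += node.count
            ancestors = ancestors + [node.item]
        for c in node.children:
            walk(c, ancestors)
    walk(root, [])
    items = sorted(order, key=order.get)    # rank -> item
    return {frozenset((items[i], items[j])): temp2[i][j]
            for i in range(m) for j in range(m) if temp2[i][j] >= min_count}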
For a better and more efficient mining process than PrePost, PrePost+ adopts the set-enumeration tree (Burdick et al., 2005) to represent the search space. Given a set of items I = {i1, i2, …, im}, where i1 < i2 < … < im, a set-enumeration tree can be constructed as follows:
Firstly, the root of the tree is created. Secondly, the m child nodes of the root, representing and registering the m 1-itemsets respectively, are created. Thirdly, for a node representing itemset {ijs, ijs-1, …, ij1} and registering ijs, the (m - js) child nodes of that node, representing itemsets {ijs+1, ijs, ijs-1, …, ij1}, {ijs+2, ijs, ijs-1, …, ij1}, …, {im, ijs, ijs-1, …, ij1} and registering ijs+1, ijs+2, …, im respectively, are created. Finally, the set-enumeration tree is built by executing the third step repeatedly until all leaf nodes are created; the resulting tree used for finding frequent itemsets is depicted in fig 4 (the PrePost+ tree). For example, the node in the bottom left of the tree represents itemset {b, c, e, a, f} and registers item b.
For further understanding, the following property and rationale are given:
Property 1. Given itemset P and item i (∉ P), if the support of P is equal to the support of P ∪ {i}, then for any itemset A (A ∩ P = Ø ∧ i ∉ A), the support of A ∪ P is equal to the support of A ∪ P ∪ {i}.
Rationale. The fact that the support of P is equal to the support of P ∪ {i} means that any transaction containing P also contains i. Given a transaction T, if T contains A ∪ P, it must contain P; by the previous observation, T then also contains i. That is, the support of A ∪ P is equal to the support of A ∪ P ∪ {i}.
By employing Property 1, PrePost+ can greatly reduce the search space. Let us examine Figure 4: when generating the node representing ce, we find that the support of ce is equal to the support of e, which is 4.
Let A be any superset of e that does not contain c. Note that A can be rewritten as (A - {e}) ∪ {e}; therefore, the support of A is equal to the support of A ∪ {c} according to Property 1. We denote the set of all frequent itemsets under the node representing e as FISe. In addition, FISe/c is defined as the set of frequent itemsets in FISe which do not contain c, and FISe,c as the set of frequent itemsets in FISe which do contain c. Clearly, the union of FISe,c and FISe/c is FISe. To obtain FISe,c, we would have to search the subtree rooted at the node representing ce. To obtain FISe/c, we search the subtrees rooted at the child nodes of the node representing e, except the child node representing ce. According to Property 1, FISe,c is equal to {A ∪ {c} | A ∈ FISe/c}. Therefore, we can obtain FISe without searching the subtree rooted at the node representing ce.
Algorithm 1 shows the pseudo-code of PrePost+. Steps (1)-(4) of PrePost+ are the same as the corresponding procedures of PrePost, which generate the frequent 1-itemsets, the frequent 2-itemsets, and their N-lists. The function NL_intersection() is used to generate the N-lists of (k+1)-itemsets by intersecting the N-lists of k-itemsets. Steps (6)-(9) call the procedure Building_Pattern_Tree() to search the set-enumeration tree (from level 2) and discover all frequent k-itemsets (k ≥ 3) extended from each frequent 2-itemset.
NL_intersection() is a linear algorithm and thus very efficient. It adopts a 2-way comparison to find all elements in NL1 each of which is an ancestor of some element in NL2; all such elements constitute the intermediate result. The remaining steps merge elements with the same (pre-code, post-code) pair in the intermediate result to obtain the final result. Let m and n be the cardinalities of NL1 and NL2, respectively. The computational complexity of the 2-way comparison is O(m + n), and by construction the cardinality of the intermediate result is no more than the cardinality of NL2. Therefore, the whole computational complexity of NL_intersection() is O(m + n).
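A sketch of the 2-way comparison in Python (an illustration under assumptions, not the paper's code; it presumes both N-lists are sorted by pre-order and that PP-codes are (pre, post, count) triples as in the earlier sketches):

def nl_intersection(nl1, nl2):
    # A PP-code (x1, y1) in nl1 is an ancestor of (x2, y2) in nl2
    # iff x1 < x2 and y1 > y2 (pre-order earlier, post-order later).
    # The result takes the ancestor's (pre, post) with the descendant's
    # count; entries with the same (pre, post) pair are merged by summing.
    result = []
    i = j = 0
    while i < len(nl1) and j < len(nl2):
        x1, y1, _ = nl1[i]
        x2, y2, z2 = nl2[j]
        if x1 < x2:
            if y1 > y2:
                # ancestor found for nl2[j]; keep i, since later nl2
                # elements may share the same ancestor
                if result and result[-1][0] == x1:
                    result[-1] = (x1, y1, result[-1][2] + z2)
                else:
                    result.append((x1, y1, z2))
                j += 1
            else:
                i += 1      # nl1[i]'s subtree ends before nl2[j]: advance nl1
        else:
            j += 1          # no later nl1 element can be an ancestor of nl2[j]
    return result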
The procedure Building_Pattern_Tree() employs Property 1 to construct a compressed frequent itemset tree: the subtree rooted at a child node is not generated when the support of the itemset represented by the child node is equal to the support of the itemset represented by its parent node. Steps (4)-(16) check each item in Cad_items, the set used to extend Nd. In Step (6), P1[1] means the first item in P1. P in Step (7) is an itemset with i as the first item and Nd.itemset as the remainder. Step (8) generates the N-list of P. As shown by Steps (9) and (10), if the support of P is equal to the support of Nd.itemset, only i is inserted into Nd.equivalent_items, without generating the node representing P.
This technique is called Children-Parent Equivalence pruning, and it has been used to discover maximal frequent itemsets in Burdick et al. (2005). Steps (11)-(16) find the items that are used to construct the child nodes of Nd and store them in Next_Cad_items for future extension. Steps (17)-(24) find all frequent itemsets contained in Nd. Steps (26)-(27) continue to extend the child nodes of Nd by recursively calling Building_Pattern_Tree(). Below is the algorithm showing the full details of the PrePost+ framework:
