Hashing and pruning are popular techniques for improving the performance of the traditional Apriori algorithm for association rule mining. Hashing uses a hash function to reduce the size of the candidate item set. Direct Hashing and Pruning (DHP) and Perfect Hashing and Pruning (PHP) are the basic hashing algorithms, and researchers have proposed many others, such as the Perfect Hashing Scheme (PHS), Sorting, Indexing and Trimming (SIT), and HMFS. All of these algorithms have their own pros and cons. The DHP algorithm suffers from collisions and requires additional database scans to count the frequency of collided item sets. The PHP algorithm eliminates the collision problem, but it increases the size of the hash table, which requires a large amount of memory, and it uses a complex hash function. The main objective of this paper is to reduce the number of collisions and the database scans needed to count the frequency of collided item sets, while ensuring that the size of the hash table does not increase. This paper proposes a new hashing idea, Transaction Hashing and Pruning (THP). THP arranges the item sets into vertical format and then hashes the transaction IDs (TIDs) of the candidate-k item sets into the hash table bucket corresponding to each item set.

Keywords: Association Rules, Data Mining, Hashing and Pruning

I. INTRODUCTION

Large amounts of data are stored in databases and data warehouses. For better decision making it is important to extract useful information from this huge volume of data. Association rule mining is one of the most important data mining techniques for finding hidden patterns and associations among items in transaction databases. In general, an association rule is defined as X ⇒ Y, where X and Y are two different item sets of the transaction database. Association rule mining is a two-step process:

a. Find all frequent item sets, i.e., those whose support count is at least the minimum support.

b. Generate strong association rules from the frequent item sets. Association rules that satisfy both minimum support and minimum confidence are strong association rules.
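As a concrete illustration of the two steps, the following minimal Python sketch computes support and confidence for one candidate rule; the transaction list and the rule {B} => {E} are illustrative choices, not taken from any benchmark:

```python
# Minimal sketch of support and confidence for a single rule X => Y.
# The transaction list below is a small hypothetical example.
transactions = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)

X, Y = {"B"}, {"E"}
rule_support = support(X | Y, transactions)        # sup(X ∪ Y)
confidence = rule_support / support(X, transactions)
print(rule_support, confidence)  # 0.75 1.0
```

A rule is reported as strong only when both values clear the user-supplied minimum support and minimum confidence thresholds.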

1. Paper Organization

This paper is divided into the following sections. Section II reviews the various hash-based algorithms proposed by researchers and explains the two basic hashing algorithms, DHP and PHP; frequent item sets are generated on the example dataset of [11].

Section III proposes a new hashing idea to overcome the problems of DHP and PHP algorithm.

Section IV presents experimental results on different datasets showing how THP overcomes the problems of the DHP and PHP algorithms.

Section V compares THP with DHP and PHP algorithm.

Section VI gives the conclusion and future scope of the proposed work, followed by the references.

II. RELATED WORK

1. The Direct Hashing and Pruning (DHP) algorithm [11] is an effective hash-based algorithm for candidate item set generation. It reduces the size of the candidate-2 item set, which helps trim the transaction database at earlier stages and also reduces computation time.

2. The H-Bit Array Hashing algorithm (H-BAH) [10] observes that DHP suffers from a collision problem. Quadratic probing can avoid collisions, but it increases the size of the hash table. To overcome both problems, H-BAH reduces the size of the hash table while also avoiding collisions. An H-bit array is added to the header bucket of the table to record which buckets were hashed initially. When a collision occurs, the algorithm searches the neighborhood of the originally hashed bucket to quickly place the collided item. This reduces both computation time and memory space.

3. The Perfect Hashing and Pruning (PHP) algorithm [9] removes DHP's extra work of counting the occurrences of candidate (k+1) item sets in each bucket. In PHP, each item set has its own bucket, which increases the size of the hash table. The algorithm also reduces the search space by pruning the database at each step.

4. The Perfect Hashing Scheme (PHS) [3] avoids the collision problem of DHP. It uses an encoding scheme to transform large item sets into large-2 item sets. Experimental results show that PHS is three times faster than DHP and is scalable to large databases. The paper also proposes a variant of PHS that reduces memory requirements.

5. The Multi-Phase Indexing and Pruning (MPIP) algorithm [14] generates collision-free hash tables and is suitable for large datasets. Experimental results show that MPIP is stable across different kinds of databases. MPIP provides up-to-date information to decision makers and works well on databases with newly arriving transactions. The memory required depends only on the number of items in the database, so it is easy to estimate the amount of memory needed before mining starts. Its main advantage is that it generates the frequent-2 item set directly in one database scan, without generating C1, L1, and C2. MPIP also replaces the hash tree with a hash table to reduce tree-search cost.

6. The Sorting, Indexing and Trimming (SIT) algorithm [5] is a revised version of the Apriori and MPIP algorithms that takes advantage of both: it uses MPIP to find L1 and L2 and Apriori to generate candidate item sets for k > 2. Experimental results show that SIT reduces both the number of candidate item sets and memory utilization.

7. In the Inverted Hashing and Pruning (IHP) algorithm [7], for each item, the identifiers of the transactions containing that item are hashed into a hash table associated with the item, known as the item's TID Hash Table (THT). After one complete pass over the database, the THTs of items not in the frequent-1 item set (F1) are removed, and the THTs of the F1 items are used to prune the candidate-2 item set.
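A TID Hash Table of this kind can be sketched as follows; the bucket count and the TID hash are illustrative assumptions, not the IHP authors' implementation:

```python
from collections import defaultdict

# Sketch of a per-item TID Hash Table (THT): the IDs of the transactions
# containing an item are hashed into buckets attached to that item.
N_BUCKETS = 4                       # assumed bucket count, chosen arbitrarily
transactions = {100: ["A", "C", "D"], 200: ["B", "C", "E"],
                300: ["A", "B", "C", "E"], 400: ["B", "E"]}

tht = defaultdict(lambda: defaultdict(list))    # item -> bucket -> [TIDs]
for tid, items in transactions.items():
    for item in items:
        tht[item][(tid // 100) % N_BUCKETS].append(tid)  # assumed TID hash

# An item's support is the total number of TIDs stored across its buckets;
# THTs of items that fall below min_sup are discarded after this pass.
support = {item: sum(len(tids) for tids in b.values())
           for item, b in tht.items()}
```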

8. The Multipass Inverted Hashing and Pruning (MIHP) algorithm [6] combines IHP with Multipass Apriori (M-Apriori). IHP is used to find the frequent-1 item set and to prune the candidate item sets generated at each step efficiently; M-Apriori partitions the frequent-1 item set and processes each partition separately. MIHP reduces memory space and is useful for large text databases.

9. The Hash Based Frequent Item sets - Double Hashing (HBFI-DH) algorithm [12] uses hashing to store the database in vertical format, and uses double hashing to avoid collisions and secondary clustering. Experimental results show that HBFI-DH enables fast retrieval of data and avoids unnecessary database scans.
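Double hashing itself is a standard open-addressing scheme; a generic sketch (not the HBFI-DH authors' code) shows how a key-dependent step size avoids the secondary clustering of quadratic probing:

```python
# Generic double hashing: the probe step size depends on the key itself,
# so keys that collide on the first hash still follow different probe paths.
M = 11                                # prime table size: steps stay coprime with M

def probe_sequence(key, m=M):
    h1 = key % m                      # primary slot
    h2 = 1 + (key % (m - 1))          # step size, never 0
    return [(h1 + i * h2) % m for i in range(m)]

def insert(table, key):
    for slot in probe_sequence(key):
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table full")

table = [None] * M
slots = [insert(table, k) for k in (3, 14, 25)]  # all three collide at h1 = 3
print(slots)  # [3, 8, 9]
```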

10. The HMFS algorithm [15] combines the advantages of DHP and the Pincer-Search algorithm and reduces the number of database scans. DHP, Pincer-Search, and HMFS were implemented on a Pentium III 800 MHz PC; the experimental results show that HMFS performs better on large databases than DHP and Pincer-Search.

2.1 Direct Hashing & Pruning (DHP) algorithm [11]

The DHP algorithm is a hash-based technique to improve the performance of the Apriori algorithm. It uses a hash function for candidate item set generation and uses pruning to successively reduce the size of the transaction database. The working of the DHP algorithm is described in Section 2.1.1.

2.1.1 Working of DHP algorithm

Step 1: Scan the database to count the support of each candidate-1 item set (C1) and add the items with support count >= min_sup to the large item set L1.

Step 2: Form the possible candidate-2 item sets in each transaction of the database (D2). A hash function is applied to each candidate-2 item set to find its bucket number.

Step 3: Scan the database (D2) and hash each item set of every transaction into its bucket. When different item sets hash into the same bucket, a collision occurs; this is the collision problem.

Step 4: Select only those candidate-2 item sets whose bucket count >= min_sup.

a. If there was no collision in a bucket, add the selected item sets directly to L2.

b. Otherwise, one more scan of the database is required to count the support of the collided item sets, and those with support >= min_sup are added to L2.

Step 5: Form the possible candidate-3 item sets (D3) and repeat the same procedure until Ck = ∅.
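The steps above can be sketched in Python on the dataset of Table 1, using the item orders A=1 through E=5 and the seven-bucket hash function from the worked example that follows:

```python
from itertools import combinations
from collections import Counter

# DHP pass 1 and pass 2 on the example dataset (Table 1).
transactions = {100: ["A", "C", "D"], 200: ["B", "C", "E"],
                300: ["A", "B", "C", "E"], 400: ["B", "E"]}
order = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
MIN_SUP = 2

# Step 1: count 1-itemsets and keep the frequent ones (L1).
c1 = Counter(item for items in transactions.values() for item in items)
L1 = {item for item, n in c1.items() if n >= MIN_SUP}

# Steps 2-3: hash every 2-itemset of every transaction into one of 7 buckets.
def h2(x, y):
    return (order[x] * 10 + order[y]) % 7

buckets = Counter()
for items in transactions.values():
    for x, y in combinations(sorted(items), 2):
        buckets[h2(x, y)] += 1

# Step 4: only 2-itemsets whose bucket count reaches min_sup survive;
# collided itemsets sharing a surviving bucket still need a recount (step 4b).
C2 = {(x, y)
      for items in transactions.values()
      for x, y in combinations(sorted(items), 2)
      if buckets[h2(x, y)] >= MIN_SUP}
```

Bucket 3 stays empty while buckets 0 and 6 each collect two distinct itemsets, which is exactly the collision that forces the extra counting scan in step 4b.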

Example:

Table 1: Data set D [11]

TID Items

100 A,C,D

200 B,C,E

300 A,B,C,E

400 B,E

Minimum support count (min_sup) =2

C1 = {{A:2}, {B:3}, {C:3}, {D:1}, {E:3}}

L1 = {{A:2}, {B:3}, {C:3}, {E:3}}

Table 2: Possible set of candidate-2 items set (D2)

TID Items

100 {AC}{AD}{CD}

200 {BC}{CE}{BE}

300 {AB}{AC}{AE}{BC}{BE}{CE}

400 {BE}

H(x,y) = ((order of x)*10+ (order of y)) mod 7

Table 3: Hash table of possible-2 items set

Bucket 0: {CE} {CE} {AD}, count 3
Bucket 1: {AE}, count 1
Bucket 2: {BC} {BC}, count 2
Bucket 3: (empty), count 0
Bucket 4: {BE} {BE} {BE}, count 3
Bucket 5: {AB}, count 1
Bucket 6: {AC} {CD} {AC}, count 3

C2={{AC:2},{AD:1},{CD:1},{BC:2},{CE:2},{BE:3}}

L2= {{AC: 2}, {BC:2}, {CE: 2}, {BE: 3}}

Table 4: Possible set of candidate-3 items set (D3)

TID Items

200 {BCE}

300 {ACE},{BCE}

H(x, y, z) = ((order of x)*100+ (order of y)*10+ (order of z)) mod 7

Table 5: Hash table of possible-3 items set

Bucket 2: {ACE}, count 1
Bucket 4: {BCE} {BCE}, count 2
(all other buckets are empty, count 0)

C3= {{BCE: 2}}

L3= {{BCE: 2}}

Pros & Cons of DHP algorithm

1. DHP uses a simple hash function to reduce the size of the candidate item set.

2. The hash table is small, so it requires little memory to store.

3. DHP suffers from the collision problem.

4. Extra database scans are required to count the support of collided item sets.

2.2 Perfect Hashing & Pruning (PHP) algorithm [9]

The PHP algorithm improves on the DHP algorithm by avoiding the collision problem. It uses a different hash function to map candidate item sets into hash table buckets. The distinguishing property of its hash function is that each item set has its own bucket, which avoids collisions but increases the hash table size. The hash function differs for each level of candidate-k item sets.

2.2.1 Working of PHP algorithm

Table 6: Candidate-1 item set (C1)

Items Count

A 2

B 3

C 3

D 1

E 3

Min_sup=2

H(x) = (order of x) mod n, where n= number of distinct items, n=5

Table 7: Hash table of possible-1 items set

Bucket 0: {E} {E} {E}, count 3
Bucket 1: {A} {A}, count 2
Bucket 2: {B} {B} {B}, count 3
Bucket 3: {C} {C} {C}, count 3
Bucket 4: {D}, count 1

L1= {{A: 2}, {B: 3}, {C: 3}, {E: 3}}, n=4

Database D2 after pruning

D2 = {<100; {AC}>, <200; {BCE}>, <300; {ABCE}>, <400; {BE}>}

Hash function for 2-itemsets:

H(x, y) = [(order of x) * (order of y)] mod (n*n)

Table 8: Possible set of candidate-2 items set (D2)

TID Items

100 {AC}

200 {BC}{CE}{BE}

300 {AB}{AC}{AE}{BC}{BE}{CE}

400 {BE}

Table 9: Hash table of possible-2 items set

Bucket 2: {AB}, count 1
Bucket 3: {AC} {AC}, count 2
Bucket 5: {AE}, count 1
Bucket 6: {BC} {BC}, count 2
Bucket 10: {BE} {BE} {BE}, count 3
Bucket 15: {CE} {CE}, count 2

L2= {{CE}, {BC}, {BE}, {AC}}, n=4

Database D3 after Pruning

D3 = {<200; {BCE}>, <300; {ABCE}>}

Hash function for 3-itemsets:

H(x, y, z) = [(order of x) * (order of y) * (order of z)] mod (n*n)

Table 10: Possible set of candidate-3 items set (D3)

TID Items

200 {BCE}

300 {ACE},{BCE}

Table 11: Hash table ofpossible-3 items set

Bucket 14: {BCE} {BCE}, count 2
Bucket 15: {ACE}, count 1

C3= {BCE: 2}, L3= {BCE:2}
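The PHP pass 2 above can be condensed into the following sketch; a Counter stands in for the hash table, and because the hash happens to be collision-free for this data (orders A=1, B=2, C=3, E=5; n=4), each bucket count is exactly an itemset's support:

```python
from itertools import combinations
from collections import Counter

# PHP pass 2 on the pruned database D2 from the example.
d2 = {100: ["A", "C"], 200: ["B", "C", "E"],
      300: ["A", "B", "C", "E"], 400: ["B", "E"]}
order = {"A": 1, "B": 2, "C": 3, "E": 5}
n = 4                                  # number of frequent 1-itemsets
MIN_SUP = 2

def h2(x, y):                          # collision-free for this data
    return (order[x] * order[y]) % (n * n)

buckets = Counter()
for items in d2.values():
    for x, y in combinations(sorted(items), 2):
        buckets[h2(x, y)] += 1

# With no collisions a bucket count IS the itemset's support, so L2 is
# obtained in a single pass with no recount of collided itemsets.
L2 = {(x, y) for items in d2.values()
      for x, y in combinations(sorted(items), 2)
      if buckets[h2(x, y)] >= MIN_SUP}
```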

Pros & Cons of PHP Algorithm

1. There are no collisions in the hash table because each item set has its own bucket.

2. The hash table is large compared with that of the DHP algorithm.

3. PHP uses a more complex hash function.
