Analyzing large scale PPI network of complex species



3.1 Motivation
The approach proposed by Pandey et al. has been implemented on a single machine. However, as the volume of data grows over time, i.e. for the purpose of analyzing the large-scale PPI networks of complex species, a sequential implementation of the algorithm consumes a great deal of processing time. The proposed approach therefore focuses on a parallel implementation of the Hyperclique Miner algorithm, which is used for mining frequent hyperclique patterns; the hyperclique pattern is used for association analysis of the PPI network. The parallel implementation is done with Hadoop [8], an open-source software framework that allows data to be stored and processed in a distributed environment across clusters of computers using simple programming models. It is designed to scale from single servers up to thousands of machines, each offering local computation and storage. The Hadoop software framework uses the MapReduce programming model.
3.2 MapReduce Model
Google’s MapReduce paradigm [9], [10], [11] is a distributed programming paradigm and an associated implementation that supports distributed computing over large datasets. With this technology, a programmer without any experience in parallel and distributed systems can easily utilize the resources of a large distributed system, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing.
The ideas behind MapReduce originated from the map and reduce functions of functional programming. A MapReduce framework consists of a large number of computers, called nodes, which are collectively referred to as a cluster. The Map and Reduce functions of the framework are defined over data structured as (key, value) pairs. Map can be expressed as
Map (k1, v1) → list (k2, v2) (3)
The Map function is applied in parallel to every pair (k1, v1) in the input dataset. Each call produces a list of pairs, list (k2, v2). The MapReduce framework then collects all pairs with the same key from all the lists and groups them together, creating one group per key. The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:
Reduce (k2, list (v2)) → list (v3) (4)
Each Reduce call typically produces either one value v3 or an empty return, though one call is allowed to return more than one value. The returns of all calls are collected as the desired result list.
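As a concrete illustration of the two signatures in (3) and (4), the sketch below shows the classic word-count example written against Hadoop's Java MapReduce API. The Hadoop class names (Mapper, Reducer, Text, IntWritable) are the framework's own; the example itself is ours and is not part of the proposed approach.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map (k1, v1) -> list (k2, v2): here k1 is a byte offset into the input
// file, v1 is one line of text, and each output pair is (word, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);          // emit (k2, v2)
        }
    }
}

// Reduce (k2, list (v2)) -> list (v3): the framework has already grouped all
// values for a key, so the reducer collapses them into a single count.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // emit v3
    }
}
```

The grouping between the two phases is performed entirely by the framework, so the reducer only ever sees one key together with all of its values.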
3.3 Hadoop
Hadoop is a software framework implementing the MapReduce model. It provides a distributed file system (HDFS) [13] that can store data across thousands of servers, along with a means of running work (Map/Reduce jobs) across those machines while keeping the computation close to the data. The Hadoop Map/Reduce framework has a master/slave architecture: there is a single master server, or job tracker, and several slave servers, or task trackers, one per node in the cluster. The job tracker is the point of interaction between users and the framework. HDFS is designed to reliably store very large files across the machines of a large cluster.
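To show how a job reaches the cluster, the following is a minimal driver sketch, assuming the mapper and reducer classes from the previous example; the input and output paths refer to locations in HDFS.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: packages the mapper/reducer into a job and submits it to the
// cluster; the framework then distributes map and reduce tasks to the
// task trackers, one per node.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // local pre-aggregation
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```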
3.4 Algorithm
This subsection details the parallel version of the Hyperclique Miner algorithm.
Algorithm: Parallel Hyperclique Miner
Input:
1. A database T of N transactions, T = {t1, …, tN}, where each ti ∈ T is a record with m attributes {i1, i2, …, im} and each ij ∈ {0, 1} (1 ≤ j ≤ m) is the Boolean value for the feature type Sj
2. A user specified minimum h-confidence threshold (hc)
3. A user specified minimum support threshold (supp)
Output: Hyperclique patterns with h-confidence > hc and support > supp
Method:
Begin
Scan T to get the support count of every item;
Use the map function to count the occurrences of the potential candidate set in a distributed way;
Apply the reduce function to sum up the occurrence counts of the potential candidate set;
Prune those items whose support count is less than the minimum support count; the h-confidence value of an itemset of size 1 is taken to be 1;
For itemsets of size ≥ 2, do the following (a sketch of the pruning test appears after the algorithm):
The map function performs the self-join of the candidate set Ck-1 (where k = size of the itemset, k > 1) to generate the potential candidate set;
The reduce function counts the occurrences of the candidate set and prunes itemsets based on the anti-monotone property of h-confidence and support;
Prune cross-support patterns based on the cross-support property of h-confidence;
Finally, the reduce phase computes the support count and h-confidence of all itemsets and prunes every itemset that does not meet the user-specified minimum support and h-confidence thresholds;
End;
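As an illustration of the final pruning step, the sketch below shows the test a reducer could apply to each candidate itemset. It assumes the singleton support counts are available on every reducer (for example, shipped via Hadoop's distributed cache) and uses the standard definitions from the hyperclique literature: hconf(P) = supp(P) / max over i in P of supp({i}), and the cross-support bound hconf(P) ≤ (min item support) / (max item support). The class and method names are illustrative and do not come from the original paper.

```java
import java.util.Map;
import java.util.Set;

// Illustrative sketch, not the authors' code: the pruning test applied by the
// final reduce phase to each candidate itemset. Assumes singleton support
// counts have been shipped to every reducer (e.g. via the distributed cache).
public class HypercliquePruner {
    private final Map<String, Long> itemSupport;  // supp({i}) for each item i
    private final long totalTransactions;         // N, the number of transactions
    private final double minSupp;                 // user threshold supp (a fraction)
    private final double minHconf;                // user threshold hc

    public HypercliquePruner(Map<String, Long> itemSupport, long totalTransactions,
                             double minSupp, double minHconf) {
        this.itemSupport = itemSupport;
        this.totalTransactions = totalTransactions;
        this.minSupp = minSupp;
        this.minHconf = minHconf;
    }

    // h-confidence of an itemset P: supp(P) / max_{i in P} supp({i}).
    double hConfidence(Set<String> itemset, long itemsetCount) {
        long maxItem = itemset.stream().mapToLong(itemSupport::get).max().orElse(1L);
        return (double) itemsetCount / maxItem;
    }

    // Cross-support property: hconf(P) <= min supp / max supp over the items
    // of P, so an itemset whose ratio is already below hc can be discarded
    // without counting its occurrences at all.
    boolean isCrossSupport(Set<String> itemset) {
        long min = itemset.stream().mapToLong(itemSupport::get).min().orElse(0L);
        long max = itemset.stream().mapToLong(itemSupport::get).max().orElse(1L);
        return (double) min / max < minHconf;
    }

    // Keep a candidate only if it meets both user-specified thresholds.
    boolean keep(Set<String> itemset, long itemsetCount) {
        double support = (double) itemsetCount / totalTransactions;
        return support >= minSupp
                && !isCrossSupport(itemset)
                && hConfidence(itemset, itemsetCount) >= minHconf;
    }
}
```

Because the cross-support test needs only the singleton counts, it can discard a candidate before its occurrences are ever counted, which is what makes it valuable as a pruning step in the distributed setting.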
