Performance Enhancement Of Big Data Analytics

ABSTRACT
The objective of this project is to propose a performance enhancement model for big data analytics that applies a sense-decide stage before the conventional map-reduce technique. The challenging problems in the analysis of very large data arise from the uncertainty and unstructuredness of the data pool and from the volatility and low visibility of hidden data sets. These data and metadata entities must be taken into consideration before the conventional MapReduce strategy is applied, not only for correctness but also for the performance of the analytic problem. A formal representation of the proposed sense-decide strategy, which precedes map-reduce for performance enhancement, has been developed. Meta-pipes and value-pipes techniques have been proposed and implemented to build a powerful analytic engine, and the related benchmarks of data generation, prefetching and preshuffling are considered to validate the proposed approach on a number of Hadoop clusters.

CHAPTER 1
INTRODUCTION
The World Wide Web has been adopted as an ideal platform for developing data-intensive applications, since the communication paradigm of the Internet is sufficiently open and powerful. Representative data-intensive Web applications include search engines, online auctions, webmail, and online retail sales, to name just a few. Data-intensive applications like data mining and web indexing need to access ever-expanding data sets ranging from a few gigabytes to several terabytes or even petabytes. The MapReduce programming framework can simplify the complexity of running parallel data processing functions across multiple computing nodes in a cluster, because scalable MapReduce helps programmers to distribute programs and have them executed in parallel. MapReduce automatically handles the gathering of results across the multiple machines and returns a single result set. More importantly, the MapReduce platform can offer fault tolerance that is entirely transparent to programmers.
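As an illustrative sketch of this flow, the following Python snippet simulates the map, shuffle, and reduce steps of a word count job on a single machine. The function names and the in-memory shuffle are illustrative only; a real Hadoop job distributes these steps across cluster nodes.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_, line):
    # Map: emit an intermediate (word, 1) pair for every word in the line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, values):
    # Reduce: merge all intermediate values associated with the same key.
    yield (word, sum(values))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply the user-specified map function to every record.
    intermediate = [kv for k, v in inputs for kv in map_fn(k, v)]
    # Shuffle/sort: group the intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: invoke the reduce function once per distinct key.
    results = {}
    for key, group in groupby(intermediate, key=itemgetter(0)):
        for k, v in reduce_fn(key, (v for _, v in group)):
            results[k] = v
    return results

counts = run_mapreduce([(0, "the quick fox"), (1, "the lazy dog")], map_fn, reduce_fn)
print(counts["the"])  # 2
```

In the real framework the shuffle step, fault handling, and result gathering are performed by the runtime rather than by user code.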
The Hadoop system basically consists of two major parts. The first part is the Hadoop MapReduce engine. The second is HDFS, the Hadoop Distributed File System, which is inspired by Google's GFS (the Google File System). Currently, HDFS divides files into blocks that are replicated among several different computing nodes, with no attention to whether the blocks are divided evenly. When a job is initiated, the processor of each node works on the data on its local hard disk. Hadoop can improve cluster performance because multiple nodes work concurrently to provide high throughput.
1.1 Scope of Hadoop
MapReduce is a programming model that supports parallel data processing in high-performance cluster computing environments. The MapReduce programming model is highly scalable, because jobs in MapReduce can be partitioned into numerous small tasks, each of which runs on one computing node in a large-scale cluster. The Hadoop runtime system, coupled with HDFS, manages the details of parallelism and concurrency to provide ease of parallel programming as well as reinforced reliability.
Scheduling in MapReduce differs from traditional cluster scheduling in two ways. First, the MapReduce scheduler depends largely on data locality, i.e., assigning tasks to computing nodes where the input data sets are located. Data locality plays an important role in cluster performance because the network bisection bandwidth in a large cluster is much lower than the aggregate bandwidth of the disks in the computing nodes. Second, the dependence of reduce tasks on map tasks may cause performance problems in MapReduce. This dependence can slow down a cluster through imbalanced workload: some nodes are underutilized while others are overloaded. In a long-running job containing many reduce tasks on multiple nodes, reduce tasks sit idle until the job's map phase is completed. The nodes running these idle reduce tasks are therefore underutilized, because the reduce tasks reserve them. To address this performance issue, a preshuffling scheme preprocesses intermediate data between a pair of map and reduce tasks in a long-running job, thereby increasing the computing throughput of Hadoop clusters.
1.2 Data Distribution Issues
Data locality is a determining factor for MapReduce performance. To balance the workload in a cluster, Hadoop distributes data to multiple nodes based on disk space availability. Such a data placement strategy is practical and efficient in a homogeneous environment, where computing nodes are identical in computing and disk capacity. In a homogeneous environment all nodes have identical workloads, so no data needs to be moved from one node to another. In a heterogeneous cluster, however, a high-performance node can complete local data processing faster than a low-performance node. After the fast node finishes processing the data residing on its local disk, it has to handle the unprocessed data of a slow remote node.
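One way to reduce this imbalance is to place data in proportion to node speed, so that fast and slow nodes finish their local work at roughly the same time. The sketch below illustrates the idea in Python; the function name and capacity figures are hypothetical, not part of Hadoop's placement policy.

```python
def place_blocks(num_blocks, capacities):
    """Assign block counts proportional to each node's computing capacity,
    so faster nodes hold (and locally process) more data."""
    total = sum(capacities.values())
    shares = {n: int(num_blocks * c / total) for n, c in capacities.items()}
    # Hand out any rounding remainder to the fastest nodes first.
    leftover = num_blocks - sum(shares.values())
    for node in sorted(capacities, key=capacities.get, reverse=True):
        if leftover == 0:
            break
        shares[node] += 1
        leftover -= 1
    return shares

# A node twice as fast receives twice the data.
print(place_blocks(60, {"fast": 2.0, "slow": 1.0}))  # {'fast': 40, 'slow': 20}
```

Under such a placement, the fast node no longer has to fetch unprocessed data from the slow node over the network.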
1.3 Data Locality Issue
Simply increasing cache size does not necessarily improve I/O-subsystem and CPU performance. In the MapReduce model, before a computing node launches a new application, the application relies on the master node to assign tasks. The master node informs computing nodes what the next tasks are and where the required data blocks are located. The computing nodes do not retrieve the required data and process it until assignment notifications arrive from the master node. As a result, the CPUs are underutilized, waiting long periods for notifications from the master node. Prefetching strategies are needed to parallelize these workloads and avoid such idle points.
1.4 Data Transfer Issue
A Hadoop application running on a cluster can impose a heavy network load. This is an important consideration when applications are running on large-scale Hadoop clusters. During the shuffle phase, each reduce task contacts every other node in the cluster to collect intermediate files. During the reduce output phase, the final results of the entire job are written to HDFS.
The reduce output phase might take longer than the shuffle phase. The aggregate peak throughput is the aggregate throughput at which some component of the network saturates (i.e., when the network is at its maximum throughput capacity). Once one component in the network saturates, the job as a whole cannot go any faster, even if there are other underutilized computing nodes.

CHAPTER 2
LITERATURE SURVEY
2.1 Importance of Preshuffling
The Hadoop system follows a preshuffling scheme by itself, but its efficiency can be improved by employing adaptive preshuffling, which implements a push model and a pipeline along with the preshuffling scheme. Preshuffling-enabled Hadoop clusters have been shown to be faster than native Hadoop clusters: the push model and the preshuffling scheme, powered by a 2-stage pipeline, shorten the execution times of the WordCount and Sort benchmarks compared with native Hadoop by an average of 10% and 14%, respectively [10].
In the push model, map tasks automatically send intermediate data to reduce tasks during the shuffle phase. Map tasks proactively start sending intermediate data as soon as it is produced. The push model allows reduce tasks to start their executions earlier, rather than waiting for an entire intermediate data set to become available. It improves the efficiency of the shuffle phase because reduce tasks need not be strictly synchronized with their map tasks, eliminating the need to wait for the entire data set [10].
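The push model can be sketched with Python generators: the mapper yields each intermediate pair as it is produced, and the reducer consumes the stream immediately rather than waiting for the full intermediate set. This is a single-process illustration of the idea, not the actual pipeline implementation described in [10].

```python
def mapper(records):
    # Push model: yield each intermediate pair as soon as it is produced,
    # instead of materializing the whole intermediate data set first.
    for rec in records:
        for word in rec.split():
            yield (word, 1)

def streaming_reducer(pairs):
    # The reducer starts consuming intermediate data immediately,
    # folding each arriving pair into its running counts.
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

# The reducer begins work after the first pair arrives, not after the
# entire map output has been committed to disk.
result = streaming_reducer(mapper(["a b a", "b b"]))
print(result)  # {'a': 2, 'b': 3}
```

Because the generator never builds the full intermediate list, the reducer's work overlaps the mapper's, which is the source of the shuffle-phase speedup the push model aims for.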
2.2 Prefetching and Preshuffling in a Shared MapReduce Computation Environment
Two optimization schemes, prefetching and preshuffling, improve overall performance in a shared environment while retaining compatibility with native Hadoop. They are implemented in native Hadoop as a plug-in component called HPMR (High Performance MapReduce engine) and effectively improve overall performance in a shared MapReduce computation environment [8].
The prefetching scheme exploits data locality, while the preshuffling scheme significantly reduces the network overhead required to shuffle key-value pairs.
2.3 MapReduce: Simplified Data Processing on Large Clusters
To simplify data processing on a large cluster, users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication [5].
This methodology allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system; also the implementation of MapReduce is highly scalable [5].
2.4 MapReduce Online
The Hadoop Online Prototype (HOP) is an extension to native Hadoop that supports continuous queries, enabling MapReduce programs to be written for applications such as event monitoring and stream processing. HOP retains the fault tolerance properties of native Hadoop and can run unmodified user-defined MapReduce programs. It extends the applicability of the model to pipelining behaviors while preserving the simple programming model and fault tolerance [4].
HOP provides significant new functionality, including "early returns" on long-running jobs via online aggregation and continuous queries over streaming data, and also demonstrates benefits for batch processing by pipelining jobs. Thus HOP can reduce the time to job completion [4].
2.5 Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters
In a heterogeneous Hadoop cluster, ignoring the data locality issue can noticeably reduce MapReduce performance. It is important to address the problem of how to place data across nodes in a cluster in such a way that each node has a balanced data processing load. This mechanism distributes fragments of an input file to heterogeneous nodes based on their computing capacities. The data placement scheme aims at adaptively balancing the amount of data stored in each node in order to achieve improved data-processing performance [9].
CHAPTER 3
HADOOP FRAMEWORK
3.1 Apache Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failure [2].
The Hadoop system has two core components: a distributed file system called HDFS, and the MapReduce programming framework for processing large data sets [2]. The Hadoop software stack also includes
  • Pig and Hive, user-friendly parallel data processing languages,
  • ZooKeeper, a high-availability directory and configuration service,
  • HBase, a web-scale distributed column-oriented store designed after its proprietary predecessors.
3.2 Map Reduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets. The MapReduce model was designed for unstructured data processed by large clusters of commodity hardware; the functional style of MapReduce automatically parallelizes and executes large jobs over a computing cluster. The model is capable of processing many terabytes of data on thousands of computing nodes in a cluster. MapReduce automatically handles messy details such as failures, application deployment, task duplication, and aggregation of results, thereby allowing programmers to focus on the core logic of applications [2] [3].
3.2.1 Map Reduce Model
Each MapReduce application has two major types of operations: a map operation and a reduce operation. MapReduce allows for parallel processing of the map and reduce operations in applications. Each map operation is independent of the others, meaning that all mappers can run in parallel on multiple machines.
The map phase applies user-specified logic to the input data. The results, called intermediate results, are then fed into the reduce phase, where they are aggregated and written as a final result. The input data, intermediate results, and final result are all represented in the key/value pair format [3].
3.2.2 Map Task Execution
Each map task is assigned a portion of an input file called a split. By default, a split contains a single 64 MB HDFS block, and the total number of file blocks normally determines the number of map tasks.
The execution of a map task can be separated into two stages. In the map phase, the task reads its split, organizes it into records (key/value pairs), and applies the map function to each record. After the map function has been applied to every input record, the commit phase registers the final output with the TaskTracker, which then informs the JobTracker that the task has been completed. The output of the map step is consumed by the reduce step, so the OutputCollector stores map output in a format that is easy for reduce tasks to consume [2].
Intermediate keys are assigned to reducers by applying a partitioning function. The OutputCollector applies this function to each key produced by the map function and stores each record and partition number in an in-memory buffer. The OutputCollector spills this information to disk when the buffer reaches capacity. A spill of the in-memory buffer involves sorting the records in the buffer, first by partition number and then by key. The buffer content is written to the local file system as a data file and an index file, which points to the offset of each partition in the data file. The data file contains the records, sorted by key within each partition segment.
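The partition-then-sort behavior of the spill can be sketched in a few lines of Python. The `partition` and `spill` functions here are hypothetical stand-ins for Hadoop's Partitioner and OutputCollector; a stable CRC is used in place of Hadoop's hash partitioner so the sketch is repeatable.

```python
import zlib

def partition(key, num_reducers):
    # Hypothetical partitioning function: hash the key into a reduce
    # partition (a stable CRC keeps runs repeatable).
    return zlib.crc32(key.encode()) % num_reducers

def spill(buffer, num_reducers):
    # Mirror the in-memory spill: tag each buffered record with its
    # partition number, then sort by partition number, then by key.
    records = sorted((partition(k, num_reducers), k, v) for k, v in buffer)
    # The index records where each partition's segment begins,
    # like the index file written next to the data file.
    index = {}
    for offset, (p, _, _) in enumerate(records):
        index.setdefault(p, offset)
    return records, index

records, index = spill([("b", 1), ("a", 1), ("b", 1)], num_reducers=2)
print(records)  # records sorted by (partition, key)
```

Because records are grouped by partition and sorted by key inside each partition, a reduce task can later read its own segment contiguously via the index.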
Fig. 3.1 Execution Process of MapReduce Programming Model
During the commit phase, the final output of a map task is generated by merging all the spill files produced by this map task into a single pair of data and index files. These files are registered with the TaskTracker before the task is completed. The TaskTracker reads these files to service requests from reduce tasks [3].
3.2.3 Reduce Task Execution
The execution of the reduce task contains three steps.
  • In the shuffle step, the intermediate data generated by the map phase is fetched. Each reduce task is assigned a partition of the intermediate data with a fixed key range, so the reduce task must fetch the content of this partition from every map task's output in the cluster.
  • In the sort step, records with the same key are grouped together to be processed by the next step.
  • In the reduce step, the user-defined reduce function is applied to each key and the corresponding list of values.
In the shuffle step, a reduce task fetches particular data from each map task. The JobTracker relays the location of every TaskTracker that hosts a map output to every TaskTracker that is executing a reduce task. Note that a reduce task cannot fetch the output of a map task until the map task has finished executing and committed its final output to disk.
After receiving partitions from all mappers' outputs, the reduce task enters the sort step. The output generated by the mappers for each partition is already sorted by the reduce key. The reduce task merges these runs together to produce a single run that is sorted by key. The task then enters the last, reduce step, in which the user-defined reduce function is invoked for each distinct key in sorted order, passing it the associated list of values. The output of the reduce function is written to a temporary location on HDFS. After the reduce function has been applied to each key in the reduce task's partition, the task's HDFS output file is automatically renamed from its temporary location to its final location [10].
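The sort and reduce steps above amount to a k-way merge of already-sorted runs followed by grouping by key. A minimal Python sketch, using the standard library's `heapq.merge` and `itertools.groupby` (the `reduce_task` name is illustrative, not a Hadoop API):

```python
import heapq
from itertools import groupby
from operator import itemgetter

def reduce_task(map_outputs, reduce_fn):
    # Sort step: each mapper's partition arrives already sorted by key;
    # merge the sorted runs into a single run sorted by key.
    merged = heapq.merge(*map_outputs, key=itemgetter(0))
    # Reduce step: invoke the user-defined reduce function once per
    # distinct key, in sorted order, with the list of associated values.
    out = []
    for key, group in groupby(merged, key=itemgetter(0)):
        out.append(reduce_fn(key, [v for _, v in group]))
    return out

runs = [[("a", 1), ("c", 2)], [("a", 3), ("b", 1)]]
print(reduce_task(runs, lambda k, vs: (k, sum(vs))))  # [('a', 4), ('b', 1), ('c', 2)]
```

Merging sorted runs rather than re-sorting everything is what keeps the sort step cheap even when the fetched partitions are large.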
3.3 Scheduler
In a Hadoop cluster there is a central scheduler managed by a master node, called the JobTracker. Worker nodes, called TaskTrackers, are responsible for task execution. The JobTracker is responsible not only for tracking and managing machine resources across the cluster, but also for maintaining a queue of currently running MapReduce jobs. Every TaskTracker periodically reports its state to the JobTracker via a heartbeat mechanism. TaskTrackers concurrently execute tasks on each slave node.
The JobTracker and TaskTracker transfer information through a heartbeat mechanism. The TaskTracker sends a message to the JobTracker at a specified interval (e.g., a message every 2 seconds). The heartbeat mechanism provides a communication channel between the JobTracker and the TaskTracker, and a task assignment is delivered to the TaskTracker in the form of a heartbeat. If a task fails, the JobTracker can keep track of the failure because it receives no reply from the TaskTracker. The JobTracker monitors the heartbeats received from the TaskTrackers to make task assignment decisions. If a heartbeat is not received from a TaskTracker during a specified time period, the TaskTracker is assumed to be malfunctioning [9].
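The failure-detection side of the heartbeat mechanism can be sketched as follows. This is a toy single-process model (class and method names are hypothetical), using explicit timestamps instead of a real clock:

```python
class JobTracker:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heartbeat = {}

    def heartbeat(self, tracker_id, now):
        # Each TaskTracker reports its state at a specified interval.
        self.last_heartbeat[tracker_id] = now

    def failed_trackers(self, now):
        # A TaskTracker that has not reported within the timeout
        # is assumed to be malfunctioning.
        return [t for t, ts in self.last_heartbeat.items()
                if now - ts > self.timeout]

jt = JobTracker(timeout=10)
jt.heartbeat("tt1", now=0)
jt.heartbeat("tt2", now=0)
jt.heartbeat("tt1", now=8)
print(jt.failed_trackers(now=12))  # ['tt2']
```

Note that the JobTracker only observes absence of heartbeats; it cannot distinguish a crashed node from a slow network, which is why the timeout must exceed the heartbeat interval by a safe margin.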
3.4 Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is the primary distributed storage used by Hadoop applications on clusters. Although HDFS has many similarities with existing distributed file systems, the differences are significant: HDFS is highly fault-tolerant, is designed to be deployed on cost-effective clusters, and offers high-throughput access to application data, making it suitable for applications that have large data sets.
3.4.1 HDFS Architecture
Basically, an HDFS cluster consists of a single NameNode, which manages the file system namespace and regulates clients' access to files, and a number of DataNodes. Usually, each node in a cluster has one DataNode, which manages the storage of the node on which tasks are running. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks stored in a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories, and determines the mapping of blocks to DataNodes. The DataNodes not only serve read and write requests issued by the file system's clients, but also perform block creation, deletion, and replication upon instruction from the NameNode.
The NameNode and DataNode are software modules designed to run on a GNU/Linux operating system. HDFS is built in the Java language, so any machine that supports Java can run the NameNode or DataNode modules. Use of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode module; each of the other machines in the cluster runs one instance of the DataNode module.
Fig. 3.2 HDFS Architecture
The existence of a single NameNode in a cluster greatly simplifies the architecture of HDFS. The NameNode is an arbitrator and repository for all metadata in HDFS. The HDFS system is designed in such a way that user data never flows through the NameNode.
HDFS is designed to support very large files, because Hadoop applications deal with large data sets. These applications write their data once but read it one or more times, and require these reads to be performed at streaming speeds; HDFS therefore supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB, so an HDFS file is chopped up into 64 MB chunks. Where possible, the chunks reside on different DataNodes.
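The block-splitting arithmetic can be sketched directly. The helper names are illustrative, and the round-robin replica placement below is a simplification of HDFS's actual rack-aware policy:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical HDFS block size

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    # A file is chopped into fixed-size blocks; the last may be smaller.
    return [min(block_size, file_size - off)
            for off in range(0, file_size, block_size)]

def place_replicas(num_blocks, datanodes, replication=3):
    # Simplified placement: replicas of each block go to distinct
    # DataNodes, chosen round-robin (real HDFS is rack-aware).
    return {b: [datanodes[(b + r) % len(datanodes)]
                for r in range(replication)]
            for b in range(num_blocks)}

blocks = split_into_blocks(200 * 1024 * 1024)
print(len(blocks))                  # a 200 MB file yields 4 blocks
print(blocks[-1] // (1024 * 1024))  # the last block holds the 8 MB remainder
```

A 200 MB file thus occupies three full 64 MB blocks plus one 8 MB tail block, and each block's replicas land on different DataNodes.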

CHAPTER 4
SENSE-DECIDE MAP-REDUCE APPROACH
4.1 Native Hadoop
The existing system improves the performance of MapReduce in a Hadoop cluster with the help of two techniques, namely adaptive preshuffling and prefetching.
4.1.1 Conventional Prefetching and Preshuffling Mechanisms
The disadvantage of this mechanism is that it analyzes all of the data given to the MapReduce process, which leads to less efficient performance because unwanted and wrongly formatted data are analyzed as well [10].
4.1.2 Adding Additional Mechanism
Though the prefetching mechanism may improve data retrieval time, it is not very efficient in a heterogeneous network, because different nodes have different processing capabilities. Consider two DataNodes of different processing capabilities: if the NameNode assigns two instances of the same fetching task, the node with the higher processing capability completes its task much faster and must then wait for the other node to complete its fetching task. This leads to underutilization of the available resources [6].
4.2 Sense-Decide Map-Reduce Mechanism
The Sense and Decide mechanism is an optional feature that increases the capability of the native Hadoop system. It works in coordination with the previously mentioned prefetching and preshuffling mechanisms. One of its most valuable characteristics is that it can optimize the data retrieval process [8].
Fig.4.1 Sense-Decide Map-Reduce Architecture
4.2.1 Splitting the Load
One of the most important tasks is to manage the load on the cluster. The NameNode splits a request based on the computational capability of the DataNodes. The NameNode is also concerned with how fast the data can be retrieved, and other optimizations, such as sharing the tasks, can be performed.
4.2.2 Sensing the Data
User data is stored in the Hadoop Distributed File System. When the user wishes to retrieve specific data from the data set, the user sends a request to the NameNode, which in turn sends a command to its DataNodes.
The retrieval is based on a sensing process carried out by the NameNode to determine what type of data the user is interested in; this process is the first stage of data retrieval. The NameNode senses this based on the user request. It can also be programmed to sense data based on the history of the user's requests.
4.2.3 Decide Process
The sensed data is stored in the local HDFS; it is a subset of the data set on the DataNode. The decide process determines the level of relevance of the sensed data. The reason for performing the decide operation is to reduce unnecessary retrievals, which would otherwise reduce the performance of the entire Hadoop system.
4.2.4 Outcome of Sense and Decide
After the decide process is finished, the NameNode gathers the key-value pairs needed to initiate the MapReduce process. Each key-value pair indicates the actual location of the data to be retrieved at the DataNode level and is fed as input to the MapReduce phase of the system.
4.2.5 Need for Sense and Decide
The Sense and Decide mechanism focuses on customizing the Hadoop system according to the data needs of the user. Various users on the cloud request various types of data, and retrieval is the responsibility of the NameNode, which in turn runs other mechanisms to perform the retrieval effectively. In this approach, the type of data the user requests is sensed in the initial phase; the next phase then determines what data is relevant to the user request. The output of the Sense and Decide process is fed to the prefetching and preshuffling processes of the Hadoop system.
The goal of combining Sense and Decide with prefetching and preshuffling is to reduce data retrieval time. By sensing and deciding what data is most relevant, unwanted data retrievals can be minimized, which in turn improves data retrieval efficiency [8].
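The two stages can be sketched as a pair of filters applied before any MapReduce work begins. Everything here is hypothetical (the `sense`/`decide` functions, the relevance scores, and the threshold are illustrative stand-ins for whatever policy the NameNode is programmed with):

```python
def sense(request, history):
    # Sense stage: infer the type of data the user is interested in from
    # the current request, falling back to the user's request history.
    return request.get("type") or history[-1]["type"]

def decide(candidates, sensed_type, threshold=0.5):
    # Decide stage: keep only entries relevant enough to retrieve, so
    # unnecessary retrievals never reach the MapReduce phase.
    return [c for c in candidates
            if c["type"] == sensed_type and c["relevance"] >= threshold]

candidates = [
    {"key": "log-1", "type": "log", "relevance": 0.9},
    {"key": "img-1", "type": "image", "relevance": 0.8},
    {"key": "log-2", "type": "log", "relevance": 0.2},
]
selected = decide(candidates, sense({"type": "log"}, history=[]))
print([c["key"] for c in selected])  # ['log-1']
```

Only the surviving keys are handed to prefetching, preshuffling, and MapReduce, which is how the combined mechanism cuts retrieval time.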
4.2.6 Predictive Scheduling
The scheduling process is similar to the scheduler employed in an operating system: it schedules the different tasks to be assigned to DataNodes by the NameNode. The addition in this enhancement to the Hadoop system is that the scheduler is predictive; it predicts not only the computational capability of each DataNode but also which task should be assigned to it next.
4.2.7 Fetching Process
The fetching mechanism accumulates all the data in the local HDFS based on the key-value pairs fed to the node by the NameNode. For example, consider the word count application: the map function generates an output key-value pair containing a specific word as the key and the number of occurrences of the word in the document as the value.
4.2.8 Shuffler Mechanism
The shuffler process increases the reliability of the data retrieval system by reducing the rate of process stalling, one of the well-known issues in the native Hadoop system. The shuffling is a user-defined program that is run based on what is needed by the end-user application.
4.2.9 Sort Process
Sorting is done according to user preferences; the user may wish to gather the most occurring key or the least occurring key first. The sorting is done at the DataNode, which means that more than one instance of the sort process runs at the same time on different DataNodes in the cluster.
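A preference-driven sort of word counts can be sketched in one function (the `most_first` flag is an illustrative name for the user preference, not part of any Hadoop API):

```python
def sort_counts(counts, most_first=True):
    # The user preference decides whether the most occurring or the
    # least occurring key is delivered first.
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=most_first)

counts = {"fox": 1, "the": 3, "dog": 2}
print(sort_counts(counts))                    # [('the', 3), ('dog', 2), ('fox', 1)]
print(sort_counts(counts, most_first=False))  # [('fox', 1), ('dog', 2), ('the', 3)]
```

Each DataNode would run such a sort over its own partition, and the reduction phase merges the per-node results.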
4.2.10 Reduction Process
The last phase of the data retrieval process is the reduce process. In this phase all the sort instances feed their output to the DataNode's reducer, which carries out the task of relevance matching based on the user request or on the user-defined program running on the Hadoop system. When the reduction process completes, the reduced data set, identified by the output key-value pairs, is stored again on the Hadoop Distributed File System.
4.3 SYSTEM REQUIREMENTS
The following are the hardware and software components required for implementing and deploying this project work.
4.3.1 SOFTWARE REQUIREMENTS
  • Framework: Hadoop 1.1.2 (stable release)
  • Operating system: Ubuntu 12.04 LTS
  • Programming platform: Oracle Java 6 or higher
  • Terminal: Konsole
4.3.2 HARDWARE REQUIREMENTS
  • Processor: Intel Dual Core i3 or higher
  • Hard disk: 500 GB or higher
  • RAM: 4 GB or higher
  • Interface: IEEE 802.3 or IEEE 802.11
CHAPTER 5
SENSE-DECIDE MAP-REDUCE DESIGN ASPECTS
5.1 Unified Modeling Language
UML includes a set of diagrammatic notations. UML is mainly used for visualizing, specifying, constructing, and documenting the components of software and non-software systems. UML notations are essential for making a complete and meaningful model. The graphical notations used for structural things are the most widely used in UML; these are considered the nouns of UML models. The structural things are:
  • Classes
  • Interfaces
  • Data flows
  • Collaborations
  • Use cases
  • Active classes
  • Components
  • Sequences
5.1.1 Data Flow Diagram
The Data Flow Diagram (DFD) is a graphical representation of the flow of data through an information system; it shows how information is processed from the viewpoint of the data. The DFD lets you visualize how the system operates, what the system accomplishes, and how it will be implemented when it is refined with further specification. Data flow diagrams are used to design information-processing systems, but also as a way to model whole organizations. A DFD can be associated with conceptual, logical, and physical data models and with object-oriented models. The two types of DFD, which support a top-down approach, are as follows:
  • Logical data flow diagrams are implementation-independent and describe the system, rather than how its activities are accomplished.
  • Physical data flow diagrams are implementation-dependent and describe the actual entities (devices, departments, people, etc.) involved in the current system.
A DFD also describes what kinds of information will be input to and output from the system, where the data will come from and go to, and where the data will be stored. It does not show information about the timing of processes, or about whether processes operate in sequence or in parallel.
5.1.1.1 Data Flow Diagram Level 0
Fig.5.1 Data Flow Diagram Level 0
5.1.1.2 Data Flow Diagram Level 1
Fig.5.2 Data Flow Diagram Level 1
5.1.1.3 Data Flow Diagram Level 2
Fig.5.3 Data Flow Diagram Level 2
5.1.2 Sequence Diagram
A sequence diagram shows how objects operate with one another and in what order; object interactions are performed based on a time sequence. It depicts the objects and classes involved in the scenario and the sequence of messages exchanged between the objects needed to carry out the functionality of the scenario. Sequence diagrams are typically associated with use case realizations in the logical view of the system under development. A sequence diagram shows, as parallel vertical lines (lifelines), different processes or objects that live simultaneously, and, as horizontal arrows, the messages exchanged between them, in the order in which they occur. This allows the specification of simple runtime scenarios in a graphical manner.
Fig. 5.4 Sequence Diagram
REFERENCES
[1] Apache Software Foundation, Hadoop, http://hadoop.apache.org/hadoop.
[2] Apache Software Foundation, The Pig project, http://hadoop.apache.org/pig.
[3] Boris Lublinsky, Kevin T. Smith and Alex Yakubovich, 'Professional Hadoop Solutions', John Wiley & Sons, Chapter 4, p. 124.
[4] Condie T., Conway N., Alvaro P., Hellerstein J. M., Elmeleegy K. and Sears R., 'MapReduce online', Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation, NSDI '10, Berkeley, CA, USA, USENIX Association, (2010), pp. 21-21.
[5] Dean J. and Ghemawat S., 'MapReduce: Simplified data processing on large clusters', OSDI '04, (2004), pp. 137-150.
[6] He B., Fang W., Luo Q., Govindaraju N. and Wang T., 'Mars: a MapReduce framework on graphics processors', ACM, (2008).
[7] Lin H., Ma X., Archuleta J., Feng W., Gardner M. and Zhang Z., 'MOON: MapReduce on opportunistic environments', Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, New York, NY, USA, ACM, (2010), pp. 95-106.
[8] Seo S., Woo K., Kim I. and Kim J.-S., 'HPMR: Prefetching and pre-shuffling in shared MapReduce computation environment', IEEE, (2009).
[9] Xie J., Yin S., Ruan X., Ding Z., Tian Y., Majors J., Manzanares A. and Qin X., 'Improving MapReduce performance through data placement in heterogeneous Hadoop clusters', Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium on, (2010) April, pp. 1-9.
[10] Xie J., Tian Y., Yin S., Zhang J., Ruan X. and Qin X., 'Adaptive preshuffling in Hadoop clusters', Procedia Computer Science, Volume 18, 2013, pp. 2458-2467, ISSN 1877-0509.
