
Essay: Apache Hadoop YARN – A powerful Open Source Processing Engine


Abstract

This paper describes Apache Hadoop YARN, an open-source processing engine. Hadoop is based on the Google File System paper, released in October 2003. Despite its huge success, the original technology was limited by a single point of failure. It made distributing workloads and processing huge data sets easy for developers, and its main strength was the ability to produce results in less time than other systems. The causes of failure were the centralized task/job management and the lack of support for programming models other than MapReduce.

Therefore, to address these shortcomings, a new generation of Hadoop, YARN, was introduced. YARNsim provides a virtual platform on which system architects can evaluate the design and implementation of Hadoop YARN systems. Also, application developers can tune job performance and understand the trade-offs between different configurations, and Hadoop YARN system vendors can evaluate system efficiency under limited budgets. In this paper we discuss the architecture, the experiments, and the experimental results of the Hadoop YARN system. The technology was improved and validated by running in production environments, as reported by companies such as Yahoo.

I. INTRODUCTION

The Hadoop YARN-based architecture provides a consistent level of service and response. Spark is currently one of the many data access engines that work with YARN in HDP, and it is the latest technology currently in use with Hadoop YARN.

When running Spark in a cluster, some kind of shared file system is needed (e.g. NFS mounted on the same path on every node). If such a file system exists, Spark can be deployed in standalone mode.

The fundamental idea of YARN is to split up the two major responsibilities of the JobTracker, i.e. resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager and a per-application ApplicationMaster (AM).

The ResourceManager, together with the per-node slave, the NodeManager (NM), forms a new and generic system for managing applications in a distributed manner.

The ResourceManager is also the ultimate authority that arbitrates resources among all the applications in the system. The ApplicationMaster is responsible for negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks.

The ResourceManager has a pluggable Scheduler, which is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application, offering no guarantees on restarting failed tasks either due to application failure or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a Resource Container which incorporates resource elements such as memory, CPU, disk, network etc.
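The Resource Container abstraction above can be sketched as a simple data structure; the class and field names below are illustrative assumptions for exposition, not YARN's actual Java API:

```python
from dataclasses import dataclass

@dataclass
class Container:
    """Abstract resource container: a bundle of resource elements."""
    memory_mb: int
    vcores: int

@dataclass
class NodeCapacity:
    """Free resources on one node, against which containers are allocated."""
    memory_mb: int
    vcores: int

    def can_allocate(self, c: Container) -> bool:
        return c.memory_mb <= self.memory_mb and c.vcores <= self.vcores

    def allocate(self, c: Container) -> None:
        if not self.can_allocate(c):
            raise ValueError("insufficient resources on node")
        self.memory_mb -= c.memory_mb
        self.vcores -= c.vcores

node = NodeCapacity(memory_mb=8192, vcores=8)
node.allocate(Container(memory_mb=2048, vcores=2))
print(node.memory_mb, node.vcores)  # 6144 6
```

The pure scheduler works against exactly this kind of abstract capacity, with no knowledge of what runs inside the container.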

The NodeManager is the per-machine slave, which is responsible for launching the applications’ containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager.

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress. From the system perspective, the ApplicationMaster itself runs as a normal container. Here is an architectural view of YARN:

One of the crucial implementation details for MapReduce in the new YARN system is that the existing MapReduce framework is reused without major surgery. This was crucial to ensure compatibility for existing MapReduce applications and users.

This architecture drastically reduces complexity and brings benefits such as significantly better scalability, support for multiple data-processing frameworks (MapReduce, MPI, etc.), and improved cluster utilization.

The Apache Hadoop community has decided to promote the next-generation Hadoop data-processing framework, i.e. YARN, to be a sub-project of Apache Hadoop in the ASF. Apache Hadoop YARN joins Hadoop Common (core libraries), Hadoop HDFS (storage) and Hadoop MapReduce (the MapReduce implementation) as the sub-projects of Apache Hadoop, which itself is a Top Level Project in the Apache Software Foundation. Until this milestone, YARN was a part of the Hadoop MapReduce project and is now poised to stand on its own as a sub-project of Hadoop.

In summary, Apache Hadoop YARN is an effort to take Apache Hadoop beyond MapReduce for data processing. Hadoop HDFS is the data storage layer and MapReduce was the data-processing layer. However, as we have seen, the MapReduce algorithm alone is not enough for the wide variety of Hadoop use cases. With YARN, Hadoop provides a general resource-management and distributed application framework, allowing multiple data-processing applications to be customized for the task at hand. Hadoop MapReduce is now one such application for YARN.

II. HADOOP ARCHITECTURE

There are two main components of the Hadoop architecture:

-Distributed File System

-MapReduce Engine

A. DISTRIBUTED FILE SYSTEM

The most important one is the Hadoop Distributed File System (HDFS), a file system that runs on top of the existing file system on each node of a Hadoop cluster. It is designed for very specific data access patterns: Hadoop works best with very large files, because the larger the file, the less time Hadoop spends seeking for the next data location on disk. Seeks are generally expensive operations and are useful only when you need to analyze a small subset of your data set. Since Hadoop is designed to run over your entire data set, it is best to minimize seeks by using large files.

Hadoop is designed for streaming or sequential data access rather than random access. Sequential data access means fewer seeks, since Hadoop only seeks to the beginning of each block. Hadoop uses blocks to store a file or parts of a file.

-File Blocks

Blocks are large; the default is 64MB each. Most systems run with a block size of 128MB or larger.

One HDFS block is supported by multiple operating system (OS) blocks.

Blocks are replicated to multiple nodes, which allows for node failure without data loss.
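The block arithmetic above can be illustrated with a short sketch; the function names are ours, and the 3x replication used in the storage example is HDFS's default replication factor:

```python
def hdfs_block_count(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """Number of HDFS blocks needed to store a file (last block may be partial)."""
    return -(-file_size_bytes // block_size_bytes)  # ceiling division

def raw_storage_bytes(file_size_bytes, replication=3):
    """Total raw storage consumed once every block is replicated."""
    return file_size_bytes * replication

one_gb = 1024 ** 3
print(hdfs_block_count(one_gb))                       # 16 blocks of 64MB
print(hdfs_block_count(one_gb, 128 * 1024 * 1024))    # 8 blocks of 128MB
```

Larger blocks mean fewer blocks per file, and therefore fewer seeks and less block metadata for the NameNode to track.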

B. MAPREDUCE ENGINE

A MapReduce program consists of two types of transformations that can be applied to data any number of times: a map transformation and a reduce transformation.

A MapReduce job is an executing MapReduce program that is divided into map tasks that run in parallel with each other and reduce tasks that run in parallel with each other as well.
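As a rough illustration of the two transformations, here is a single-process word-count sketch. Real Hadoop MapReduce jobs are written against the Java API and run distributed; this only mimics the map, shuffle/sort, and reduce flow:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Map transformation: emit (word, 1) pairs from each input line."""
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce transformation: sum the counts for each key after shuffle/sort."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (key, sum(count for _, count in group))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(dict(reduce_phase(map_phase(lines))))
# {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

In a real job, many map tasks would each process one input split in parallel, and the sorted intermediate pairs would be partitioned across parallel reduce tasks.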

C. MAIN NODES IN HADOOP.

HDFS

-NameNode

-DataNode

MapReduce

-JobTracker

-TaskTracker

1) NameNode: There is only one NameNode in a Hadoop cluster; it manages the file system namespace and metadata.

2) DataNode: There are many DataNodes per Hadoop cluster; each manages blocks of data, serves them to clients, and periodically reports the list of blocks it stores to the NameNode.

3) JobTracker Node: There is only one JobTracker per Hadoop cluster; it receives job requests submitted by clients and schedules and monitors MapReduce jobs on TaskTrackers.

4) TaskTracker Node: There are many TaskTrackers per Hadoop cluster; they execute MapReduce operations.
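The NameNode/DataNode block-report interaction described above can be mimicked with a toy model; the class and method names here are illustrative, not Hadoop's API:

```python
from collections import defaultdict

class NameNode:
    """Toy model of the NameNode's block map, built from DataNode block reports."""
    def __init__(self):
        # block id -> set of DataNode ids holding a replica
        self.block_locations = defaultdict(set)

    def receive_block_report(self, datanode_id, block_ids):
        """Record which blocks a DataNode currently stores."""
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode_id)

nn = NameNode()
nn.receive_block_report("dn1", ["blk_1", "blk_2"])
nn.receive_block_report("dn2", ["blk_2", "blk_3"])
print(sorted(nn.block_locations["blk_2"]))  # ['dn1', 'dn2']
```

From this map the real NameNode can answer client reads with block locations and detect under-replicated blocks after a node failure.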

III. EXPERIMENT AND RESULTS

A. HARDWARE:

The experiment was performed using 4 nodes connected directly through a 1GBit Netgear switch. All 4 nodes were Dell PowerEdge T420 servers. The master node's resources were 2x Intel Xeon E5-2420 (1.9GHz) processor units, each consisting of 6 cores. The master node had 32GB of total RAM and a 1TB SATA (3.5 in, 64MB cache) hard drive. The worker nodes were installed with 1x Intel Xeon E5-2420 (2.20GHz) processor, 32GB of RAM, and 4x 1TB SATA (3.5 in, 64MB cache) hard drives.

B. SOFTWARE:

This section describes the software used for the cluster. Ubuntu LTS was installed on all 4 nodes, allocating the entire first disk. The open-files limit per user was changed from 1024 to 65000, as suggested by the TPCx-HS benchmark (explained in a later section) for the testing criteria.
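On a Unix system, the per-process open-file limit referred to above can be inspected from Python's standard resource module; the 65000 threshold below is the value suggested for the benchmark, as stated in the text:

```python
import resource

# Query the per-process open-file limit: (soft limit, hard limit).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft} hard={hard}")

if soft < 65000:
    print("soft limit is below the 65000 recommended for the benchmark run")
```

The benchmark opens many files and sockets concurrently, which is why the common default of 1024 is too low.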

C. PERFORMANCE:

The experiment was done on Hadoop YARN with TPCx-HS to check its performance. TPCx-HS was developed as a standard Big Data Benchmark to provide an objective measure of hardware, operating system and commercial Apache Hadoop File System API compatible software distributions, and to provide the industry with verifiable performance, price-performance and availability metrics.

This section evaluates the results of the multiple experiments. The presented results were obtained by executing the TPCx-HS kit provided on the official TPC website. However, the reported times and metrics are experimental, not audited by any authorized organization, and therefore not directly comparable with other officially published full disclosure reports.

Figure 3 illustrates the times of the two cluster setups (shared and dedicated 1GBit networks) for the three datasets of 100GB, 300GB and 1TB. It can be clearly observed that in all cases the dedicated 1GBit setup performs around 5 times better than the shared setup. Similarly, Figure 4 shows that the dedicated setup achieves a 5 to 6 times higher Performance metric than the shared setup.

Table 8 summarizes the experimental results and introduces additional statistical comparisons. The Data Δ column represents the difference in percent of the Data Size to the data baseline, in our case 100GB. In the first case, scale factor 0.3 increases the processed data by 200%, whereas in the second case the data is increased by 900%. The Time (Sec) column shows the average time in seconds of two complete TPCx-HS runs for all six test configurations. The following Time Stdv (%) column shows the standard deviation of Time (Sec) in percent between the two runs. Finally, Time Δ (%) represents the difference in percent of Time (Sec) to the time baseline, in our case scale factor 0.1. Here we observe that for the shared setup the execution time is five times longer than for the dedicated setup.
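The Data Δ and Time Stdv (%) statistics described above amount to a percent difference against a baseline and a relative standard deviation, sketched here (the function names are ours):

```python
from statistics import mean, stdev

def delta_percent(value, baseline):
    """Difference in percent relative to a baseline (the Data Δ / Time Δ columns)."""
    return (value - baseline) / baseline * 100

def time_stdev_percent(run_times):
    """Standard deviation of run times expressed as a percent of their mean."""
    return stdev(run_times) / mean(run_times) * 100

print(delta_percent(300, 100))    # 300GB vs. the 100GB baseline -> 200.0
print(delta_percent(1000, 100))   # 1TB vs. the 100GB baseline -> 900.0
```

This matches the table: scale factor 0.3 (300GB) is a 200% increase over the 100GB baseline, and 1TB is a 900% increase.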

Figure 5 illustrates the scaling behavior between the two network setups based on different data sizes. The results show that the dedicated setup has a better scaling behavior than the shared setup. Additionally, the increase of data size improves the scaling behavior of both setups.

Table 9 depicts the average times of the three major phases, together with their standard deviations in percent. Clearly the data sorting phase (HSSort) takes the most processing time, followed by the data generation phase (HSGen) and finally the data validation phase (HSValidate).

1) Memory

Figure 10 shows the main memory utilization in percent of the Master node for the two network setups. In the dedicated 1Gbit setup the average memory used is around 48%, whereas in the shared setup it is around 91.4%.

In the same way, Figure 12 illustrates the main memory utilization in percent for one of the Worker nodes. For the dedicated 1Gbit case, the average memory used is around 92.3%, whereas for the shared 1Gbit case it is around 92.9%. This confirms the great resemblance in both graphics, indicating that the Worker nodes are heavily utilized in the two setups. It will be advantageous to consider adding more memory to our nodes as it can further improve the performance by enabling more parallel jobs to be executed.

IV. RELATED ADVANCEMENTS AND WORK

Work both inside and outside Apache has opened new doors for the technology. The following are related works and advancements in the field of Hadoop YARN.

A. THE HORSEPOWER OF HADOOP

The horsepower of Hadoop delivers fast and flexible results, as company data now plays a modernizing role in data architectures, and these foundations affect the health of new business workloads through their immense technology. Many organizations are exploring a Hadoop-based data environment for its flexibility and scalability in managing big data. This shows the true picture of Hadoop in today's real world.

B. THE HADOOP ECOSYSTEM TABLE

The data storage elements are reliable and redundant, comprising HDFS data storage. YARN is the cluster resource-management framework, while Spark and Storm are stream-based processing frameworks that help process data in real time. All of these are Apache projects, and Google contributed the initial developments reflected in this ecosystem table as well.

C. BIG DATA TECHNOLOGIES BUILT ON APACHE YARN

Hadoop technologies play an effective role in the fields of IT and science. Data representation is based on models covering SQL, graph processing, and other paradigms. The following systems and technologies are built on Apache YARN:

• NoSQL

• Interactive SQL

• Real-time Data Processing

• Graph processing

• Bulk Synchronous Parallel

• In-Memory

• DAG Execution

D. RECENT RESEARCH

22 MARCH 2017: RELEASE 2.8.0 AVAILABLE

Apache Hadoop 2.8.0 contains several significant features and enhancements. However, this release is initially not ready for production use; critical issues are being ironed out via testing and downstream adoption.

V. CONCLUSIONS:

The history of Hadoop's adoption has opened the door to an evolutionary architecture, and the completed designs of that transformation led to YARN.

• YARN provides favourable and greater scalability

• Effective and higher efficiency

• A larger number of frameworks and data can share a cluster in a timely manner

• Yahoo has exercised this new technology in massive-scale production use of YARN

Moreover, immense new technologies are also being developed within the Apache ecosystem; the ecosystem table and the horsepower of Hadoop are the new advancements in the field. YARN can serve the industry through its solid background and by giving researchers and individuals a valuable framework with which to play a vital role.

VI. ACKNOWLEDGMENTS

We would like to express our deepest appreciation to all who helped us complete this project report. We would especially like to thank Ms. Maham Shabir for providing us with the details of the idea for the technology presented in this report, and Mr. Arif Khatak, whose encouragement and guidance made it easy for us to express the idea and work on our presentation skills. A special thanks goes to our team mates, who worked hard day and night and whose efforts made this study of the technology possible.

VII. REFERENCES
