Essay: Big data (Page 2 of 2)

Essay details:

  • Subject area(s): Miscellaneous essays
  • Reading time: 8 minutes
  • Price: Free download
  • Published on: March 23, 2018
  • File format: Text
  • Number of pages: 2
  • Big data Overall rating: 0 out of 5 based on 0 reviews.

Text preview of this essay:

This page of the essay has 1603 words. Download the full version above.

Big data has taken many forms in data processing based on the requirements of the contemporary world. These requirements vary from company to company. For instance, a social networking service like Facebook needs to process the data based on its structure. It has entities like users, pages, articles, comments and likes the users get etc. that needs to be processed as a single unit. This is the best representation of Graph data. Facebook uses this graph data for data querying, posting new stories and a variety of other tasks like targeted advertising. To perform these tasks by processing and analyzing the Graph data some distinct techniques are required because standard techniques might not yield the desired results while working on enormous Graph data. Sometimes efficient processing is required for the large-scale graphs which has millions of vertices, billions of edges. This led to the evolution of large scale distributed graph computation frameworks like Pregel which was developed by Google and PowerGraph. But they failed to provide interactive data querying and fault tolerance. So GraphX has been developed which has the advantages of data-parallel and graph-parallel systems and uses RDD API which brings low cost fault tolerance to the processing of Graph data. It is then enhanced to GraphFrames which is built on GraphX which uses DataFrame API. The flow of work can be easily expressed by GraphFrames which transcends the performance of standalone applications. The computations in GraphFrames can be expressed in queries which is an effective way to reason the data. In this report I implement Breadth first search, Depth first search, Page rank algorithm, motif finding on the data set by obtaining Airport codes and location from OpenFlights using GraphX and GraphFrames.


In the recent years there is a huge increase in the incoming data in the fields including commercial, social networks, and medicine that can be categorized as the graph data. A graph is a mathematical representation of the data that is linked. Graphs has objects and these objects are connected by relationships. These relationships are formed by nodes which are technically known as vertices and these vertices are connected by lines which are technically known as edges. This graph data is of two types. They are directed graph in which edges in it contains direction with them and undirected graph in which edges does not have a direction with them. If World Wide Web is considered as a graph, then it contains web pages as vertices that connected with each other through hyperlinks.

For example, Facebook itself deals with over one billion users, likes, posts, comments etc. All these data can be expressed as the nodes which represents an entity and edges are the relationships in between them.


So, Facebook can look at the constructed graph and predict the people those are likely to be connected now or in the future. It can run complicated analysis on the graph structure and identify the closeness of the people. There are many other social media companies like LinkedIn, Twitter, Google which uses this graph structures to understand the structure of network of people. Even telecommunication companies can build sophisticated networks using the call patterns, calls frequency etc. to enhance the relation with the customers.

So growing size and importance of graph data led to the development of some of the specialized graph processing systems like Pregel, Giraph, PowerGraph. All these specialized graph processing systems provide abstraction to the users to efficiently execute the iterative graph algorithms. These abstractions simplify the design and implementation of graph algorithms to the real-world graph data. But the main problem is to formulate these abstractions as they run on separate run times [6]. Also, these approaches are based on vertex programming models. So, in a real-world data graphs are difficult to partition for a distributed environment. In addition to that concentrating on performance they did not consider much about the fault tolerance and do not provide the functionality to preprocess and build the data.

With the introduction of data parallel systems like MapReduce and Spark, tasks related to ETL (Extract, Transform, Load), scalability, implementing fault-tolerance became easy. Unlike the other distributed frameworks, Spark provides the control over the data by supporting in memory computations. Initially GraphX is developed which runs in Spark. This extends the Sparks fundamental data structure RDD (Resilient Distributed Dataset) to RDG (Resilient Distributed Graph) to perform wide range of graph operations. Then GraphFrames is developed based on GraphX which uses DataFrames which are more efficient than RDD’s.

Related works:


Pregel [4] is scalable API developed by Google to express arbitrary graph algorithms which uses Bulk Synchronous Parallel as an execution model. Pregel’s GAS (gather, apply, scatter) programming model uses message passing between vertices in a graph. Pregel manages this message passing into number of iterations called “supersteps”. Edges does not play a considerable role in Pregel. Instead of dealing with edges, Pregel stores information about the directed edges between vertices, outgoing edges from vertex and sends information to any vertex. Also, each vertex has an id, value, a list of its adjacent vertex id’s and the corresponding edge values. This makes easier to work on large-scale graph problems. With the introduction of Pregel many other systems like Apache Giraph, GPS, Mizan, GraphLab etc were developed [8].


Apache Giraph [10] is an iterative graph processing system mainly useful for the high scalability. It is used by Facebook to analyze the graph constructed using the data related to the users and their connections. Giraph enhanced the processing system by including master computation, sharded aggregators. It has Master node which is responsible to assign sub divisions to workers, collect statuses of the worker nodes, request checkpoints. Worker nodes is useful to invoke active vertices functions, compute local aggregation values and Zookeeper is responsible for worker, partitions mapping, checkpoint paths, aggregators.


GPS [11] is an open source implementation of Pregel. GPS is 12 times faster than Giraph because of its built-in optimizations such as single canonical vertex, reducing the allocating cost of Java objects by using message objects, improving the network usage by using per-worker rather than per-vertex message buffers and reducing thread synchronization.


GraphLab is also an open source implementation which incorporates the features of PowerGraph. It differs from other implementations as it is using vertex cuts rather than edge cuts. This is a feature that was taken from PowerGraph. Vertex cuts allow high degree vertices to be distributed across multiple

machines. This results in efficient load balancing for graphs with high degree distributions. Where in case of other works like Giraph, GPS, and Mizan they use edge cuts but do not replicate vertices. Also, GraphLab has Synchronous and Asynchronous execution modes which are useful for the effective usage of network and CPU resources.

GraphX API:

GraphX is an efficient graph processing framework which combines the most powerful aspects of Spark which are RDD, fault-tolerance, task-scheduling and embedded API’s related to SQL, machine learning, streaming. As it uses the flexibility of Spark it overcomes the problems of the conventional graph processing systems by providing the functionality to construct the graphs and post processing of graphs by supporting the wide range of gra
ph operators.

Graph Partitioning:

To process the graph in a distributed way then the graph needs to be represented in a distributed fashion These graph processing systems use graph partitioning algorithms for the efficient communication and distributed computation. Traditionally vertex-cut and edge-cut approaches are used for the graph partitioning.

Most of the graph processing systems use edge-cut which allows edges to be spanned across the machines and evenly distributes the vertices. So, the communication and distributed computation depends on the number of edges. To achieve the optimal work balance random edge-cut is used by randomly distributing the vertices across the nodes. This approach has a disadvantage because of cutting most of the edges. In the real-world scenario edge-cut approach cannot be implemented ideally as there will be millions of edges in the graph. So, in GraphX vertex-cut approach is used for the distributed graph partitioning. Unlike edge-cut, it evenly distributed the edges and allows vertices to be spanned. In this approach, efficient communication over-head and load balancing can be obtained by minimizing the machines spanned by the vertices.


In the above images A, B, C and D, E, F are in two different partitions. The edges are partitioned into two in the edge table. The routing table is used store the cutting statuses of the vertices.

Programming Abstraction:

Graph processing systems represent graph structured data as a property graph which has vertices and edges. Property graphs is extracted from the sources like social networks and web graphs which has high orders of magnitude more edges than vertices. Spark uses RDD for in-memory computation that lets application to store data in memory to reconstruct the lost data. RDDs are immutable collections that can be created by using various operators. These features of RDD is leveraged by GraphX to use RDG (Resilient Distributed Graph). RDG contains the attributes associated with vertices and edges in the graph. Each vertex contains the unique ID, each edge contains its attributes and attributes of both the vertices connected by the edge. Methods to utilize the attributes in the graph are mentioned in the listing. The vertices(), edges() methods returns the set of all vertices and edges containing the respective ID.

Listing 1: RDG interface in Scala

In addition to these methods there are some additional methods for mapping (to apply user defined function to the vertices and edges), updating (transforming the vertices and edges) and aggregating.

...(download the rest of the essay above)

About this essay:

This essay was submitted to us by a student in order to help you with your studies.

If you use part of this page in your own work, you need to provide a citation, as follows:

Essay Sauce, Big data. Available from:<> [Accessed 22-10-19].

Review this essay:

Please note that the above text is only a preview of this essay.

Review Title
Review Content

Latest reviews: