Big data | EssaySauce.com

Big data processing has taken many forms to meet the requirements of the contemporary world, and these requirements vary from company to company. For instance, a social networking service like Facebook needs to process data according to its structure. It has entities such as users, pages, articles, comments and likes that need to be processed as a single unit. Such data is naturally represented as a graph. Facebook uses this graph data for querying, posting new stories and a variety of other tasks such as targeted advertising. Processing and analyzing graph data requires distinct techniques, because standard techniques might not yield the desired results on enormous graphs. Efficient processing is sometimes required for large-scale graphs with millions of vertices and billions of edges. This led to the evolution of large-scale distributed graph computation frameworks such as Pregel, developed by Google, and PowerGraph. However, they did not provide interactive data querying and fault tolerance. GraphX was therefore developed; it combines the advantages of data-parallel and graph-parallel systems and uses the RDD API, which brings low-cost fault tolerance to the processing of graph data. It was then enhanced into GraphFrames, which is built on GraphX and uses the DataFrame API. Workflows can be expressed easily in GraphFrames, which can outperform standalone applications. Computations in GraphFrames can be expressed as queries, which is an effective way to reason about the data. In this report I implement breadth-first search, depth-first search, the PageRank algorithm and motif finding with GraphX and GraphFrames on a data set built from airport codes and locations obtained from OpenFlights.

Introduction:

In recent years there has been a huge increase in incoming data, in fields including commerce, social networks and medicine, that can be categorized as graph data. A graph is a mathematical representation of linked data. A graph has objects, and these objects are connected by relationships. The objects are represented by nodes, technically known as vertices, and the relationships by lines connecting them, technically known as edges. Graphs are of two types: directed graphs, in which every edge has a direction, and undirected graphs, in which edges have no direction. If the World Wide Web is considered as a graph, its vertices are web pages, connected to each other through hyperlinks.

For example, Facebook alone deals with over one billion users along with their likes, posts, comments and so on. All of this data can be expressed as nodes, each representing an entity, and edges, which are the relationships between them.

Source: https://developers.facebook.com/

So, Facebook can look at the constructed graph and predict which people are likely to be connected now or in the future. It can run complicated analyses on the graph structure and identify how close people are to one another. Many other companies such as LinkedIn, Twitter and Google use such graph structures to understand networks of people. Even telecommunication companies can build sophisticated networks from call patterns, call frequency and so on to improve their relationships with customers.

The growing size and importance of graph data led to the development of specialized graph processing systems such as Pregel, Giraph and PowerGraph. All of these systems provide abstractions that let users efficiently execute iterative graph algorithms. These abstractions simplify the design and implementation of graph algorithms on real-world graph data. The main problem, however, is that these abstractions run on separate runtimes [6]. These approaches are also based on vertex programming models, and real-world graphs are difficult to partition for a distributed environment. In addition, because they concentrate on performance, they give little consideration to fault tolerance and do not provide functionality to preprocess and construct the graph data.

With the introduction of data-parallel systems like MapReduce and Spark, tasks related to ETL (Extract, Transform, Load), scalability and fault tolerance became easier. Unlike other distributed frameworks, Spark gives control over the data by supporting in-memory computation. GraphX was developed first and runs on Spark. It extends Spark's fundamental data structure, the RDD (Resilient Distributed Dataset), to the RDG (Resilient Distributed Graph) to perform a wide range of graph operations. GraphFrames was then developed on top of GraphX; it uses DataFrames, which are more efficient than RDDs.

Related works:

Pregel:

Pregel [4] is a scalable API developed by Google for expressing arbitrary graph algorithms, using Bulk Synchronous Parallel (BSP) as its execution model. Pregel's vertex-centric programming model, which can be viewed in terms of gather, apply and scatter (GAS) phases, uses message passing between the vertices of a graph. Pregel organizes this message passing into a number of iterations called “supersteps”. Edges do not play a considerable role in Pregel and carry no computation of their own. Instead, each vertex stores its outgoing directed edges and can send messages to any vertex. Each vertex has an ID, a value, and a list of its adjacent vertex IDs with the corresponding edge values. This makes it easier to work on large-scale graph problems. Following the introduction of Pregel, many other systems such as Apache Giraph, GPS, Mizan and GraphLab were developed [8].
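The superstep idea can be illustrated with a minimal single-machine sketch in plain Python. It is not Pregel's API: the function name `pregel_max`, the dictionary-based graph and the sample values are assumptions made for illustration. Each vertex sends its value along its outgoing edges, adopts the largest value it receives, and stays silent (votes to halt) when nothing changes.

```python
from collections import defaultdict


def pregel_max(values, out_edges, max_supersteps=30):
    """Spread the largest vertex value through the graph in supersteps.

    values:    dict vertex_id -> numeric value (hypothetical input format)
    out_edges: dict vertex_id -> list of destination vertex ids
    """
    values = dict(values)

    # Superstep 0: every vertex is active and sends its value to its neighbours.
    inbox = defaultdict(list)
    for vertex, value in values.items():
        for dst in out_edges.get(vertex, []):
            inbox[dst].append(value)

    supersteps = 1
    # Continue while messages are in flight; no messages means all vertices halted.
    while inbox and supersteps < max_supersteps:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                # The value changed, so the vertex stays active and notifies neighbours.
                values[vertex] = best
                for dst in out_edges.get(vertex, []):
                    outbox[dst].append(best)
            # Otherwise the vertex sends nothing, which acts as a vote to halt.
        inbox = outbox
        supersteps += 1
    return values


if __name__ == "__main__":
    sample_values = {"A": 3, "B": 6, "C": 2, "D": 1}
    sample_edges = {"A": ["B"], "B": ["A", "D"], "C": ["B", "D"], "D": ["C"]}
    print(pregel_max(sample_values, sample_edges))
```

In a real deployment the vertices would be spread over many workers and the inbox would be a network exchange between supersteps; the sketch keeps only the control flow.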

Giraph:

Apache Giraph [10] is an iterative graph processing system designed for high scalability. Facebook uses it to analyze the graph built from its users and their connections. Giraph extended the processing model with master computation and sharded aggregators. A master node assigns partitions to workers, collects the statuses of the worker nodes and requests checkpoints. Worker nodes invoke the compute functions of active vertices and compute local aggregation values, while ZooKeeper keeps track of the worker-to-partition mapping, checkpoint paths and aggregators.

GPS:

GPS [11] is an open-source implementation of Pregel. GPS is reported to be 12 times faster than Giraph because of its built-in optimizations, such as a single canonical vertex, a lower allocation cost for Java objects through the use of message objects, better network usage through per-worker rather than per-vertex message buffers, and reduced thread synchronization.

GraphLab:

GraphLab is also an open-source implementation, and it incorporates features of PowerGraph. It differs from other implementations in using vertex cuts rather than edge cuts, a feature taken from PowerGraph. Vertex cuts allow high-degree vertices to be distributed across multiple machines, which results in efficient load balancing for graphs with high-degree vertices. Other systems such as Giraph, GPS and Mizan use edge cuts and do not replicate vertices. GraphLab also has synchronous and asynchronous execution modes, which help it use network and CPU resources effectively.

GraphX API:

GraphX is an efficient graph processing framework that combines the most powerful aspects of Spark: RDDs, fault tolerance, task scheduling and embedded APIs for SQL, machine learning and streaming. Because it builds on the flexibility of Spark, it overcomes the problems of conventional graph processing systems by providing functionality to construct graphs and to post-process them, supported by a wide range of graph operators.

Graph Partitioning:

To process a graph in a distributed way, the graph must be represented in a distributed fashion. Graph processing systems use graph partitioning algorithms for efficient communication and distributed computation. Traditionally, vertex-cut and edge-cut approaches are used for graph partitioning.

Most graph processing systems use the edge-cut approach, which distributes the vertices evenly and allows edges to span machines. The cost of communication and distributed computation therefore depends on the number of edges that are cut. To balance the work, a random edge-cut is used, which distributes the vertices randomly across the nodes. The disadvantage of this approach is that it cuts most of the edges. In real-world scenarios the edge-cut approach cannot be implemented ideally, as the graph contains millions of edges. GraphX therefore uses the vertex-cut approach for distributed graph partitioning. Unlike edge-cut, it distributes the edges evenly and allows vertices to span machines. In this approach, communication overhead is reduced and load is balanced by minimizing the number of machines each vertex spans.
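A vertex cut can be sketched in a few lines of plain Python. This is a hash-based random vertex cut written for illustration, not GraphX's partitioner; the names `vertex_cut` and `replication_factor` and the sample edges are assumptions. Every edge lands in exactly one partition, and the replica map records which partitions hold a copy of each vertex.

```python
import zlib
from collections import defaultdict


def vertex_cut(edges, num_partitions):
    """Assign each edge to one partition; vertices may be replicated."""
    partitions = defaultdict(list)   # partition id -> edges stored there
    replicas = defaultdict(set)      # vertex -> partitions holding a copy of it
    for src, dst in edges:
        # Deterministic hash of the edge decides its partition.
        key = f"{src}->{dst}".encode("utf-8")
        part = zlib.crc32(key) % num_partitions
        partitions[part].append((src, dst))
        replicas[src].add(part)
        replicas[dst].add(part)
    return partitions, replicas


def replication_factor(replicas):
    """Average number of partitions each vertex spans (1.0 is the ideal)."""
    return sum(len(parts) for parts in replicas.values()) / len(replicas)


if __name__ == "__main__":
    sample_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E"), ("E", "F")]
    parts, reps = vertex_cut(sample_edges, num_partitions=2)
    for part_id in sorted(parts):
        print(part_id, parts[part_id])
    print("replication factor:", replication_factor(reps))
```

The replication factor is the quantity a vertex-cut partitioner tries to keep low: the fewer machines a vertex spans, the fewer copies must be kept in sync.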

Source: https://spark.apache.org/docs

In the figure above, vertices A, B, C and D, E, F are in two different partitions. The edges are split into two partitions in the edge table. The routing table is used to store the cut status of the vertices.

Programming Abstraction:

Graph processing systems represent graph-structured data as a property graph consisting of vertices and edges. Property graphs are extracted from sources such as social networks and web graphs, which have orders of magnitude more edges than vertices. Spark uses RDDs for in-memory computation, which lets applications keep data in memory and reconstruct lost data. RDDs are immutable collections that can be created using various operators. GraphX leverages these features of RDDs in the RDG (Resilient Distributed Graph). An RDG contains the attributes associated with the vertices and edges of the graph. Each vertex has a unique ID, and each edge carries its own attributes together with the attributes of both vertices it connects. Methods for using the attributes of the graph are given in the listing. The vertices() and edges() methods return the sets of all vertices and edges, each with its respective ID.

Listing 1: RDG interface in Scala

In addition to these methods, there are methods for mapping (applying a user-defined function to the vertices and edges), updating (transforming the vertices and edges) and aggregating.
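The shape of such an interface can be sketched in plain Python. This is an illustrative analogue, not the Scala RDG interface: the class `PropertyGraph`, its method names other than `vertices()` and `edges()`, and the sample data are all assumptions. As with RDDs, each transformation returns a new graph rather than modifying the existing one.

```python
class PropertyGraph:
    """Tiny in-memory property graph (illustrative, not the GraphX API)."""

    def __init__(self, vertices, edges):
        self._vertices = dict(vertices)   # vertex id -> attribute
        self._edges = list(edges)         # (src id, dst id, attribute)

    def vertices(self):
        """All vertices as (id, attribute) pairs."""
        return list(self._vertices.items())

    def edges(self):
        """All edges as (src id, dst id, attribute) tuples."""
        return list(self._edges)

    def triplets(self):
        """Each edge joined with the attributes of both endpoints."""
        return [
            (src, self._vertices[src], dst, self._vertices[dst], attr)
            for src, dst, attr in self._edges
        ]

    def map_vertices(self, fn):
        """Mapping: apply fn(id, attribute) to every vertex; returns a new graph."""
        mapped = {vid: fn(vid, attr) for vid, attr in self._vertices.items()}
        return PropertyGraph(mapped, self._edges)

    def aggregate_to_dst(self, edge_value, combine):
        """Aggregating: fold a per-edge value into each destination vertex."""
        result = {}
        for src, dst, attr in self._edges:
            value = edge_value(src, dst, attr)
            result[dst] = combine(result[dst], value) if dst in result else value
        return result


if __name__ == "__main__":
    # Made-up airports and edge weights, for illustration only.
    g = PropertyGraph(
        {1: "SFO", 2: "JFK", 3: "ORD"},
        [(1, 2, 5), (3, 2, 2), (1, 3, 4)],
    )
    print(g.map_vertices(lambda vid, name: name.lower()).vertices())
    print(g.aggregate_to_dst(lambda s, d, a: 1, lambda x, y: x + y))  # in-degrees
```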

GraphFrames API:

The main enhancement in GraphFrames [1] is its API, which is modeled on DataFrames as found in Python and R. A DataFrame is similar to a table in a relational database (RDBMS). A GraphFrame consists of two DataFrames representing the vertices and edges of the graph. The vertex DataFrame contains the vertex attributes, and the edge DataFrame contains the edge attributes and the details of the vertices connected by each edge. In general, GraphFrames provides four tabular views: edges, vertices, patterns and triplets.

Programming Abstraction:

GraphFrames supports traditional database operators such as filter and join. These operators are enough to implement many graph processing algorithms. Like GraphX, GraphFrames also provides methods for accessing vertices and edges, filtering, mapping and more.

The code snippet below returns the out-degree for each vertex.
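A minimal sketch of that computation, written in plain Python rather than GraphFrames so that it runs without Spark. The function name `out_degrees` and the sample edges are assumptions; in GraphFrames the corresponding result is exposed as the `outDegrees` DataFrame, which is conceptually a group-by on the source column of the edge table.

```python
from collections import Counter


def out_degrees(edges):
    """Count outgoing edges per vertex from (src, dst) pairs.

    Equivalent to grouping the edge table by its source column and counting rows.
    Vertices with no outgoing edge do not appear as keys.
    """
    return Counter(src for src, _dst in edges)


if __name__ == "__main__":
    # Made-up routes, for illustration only.
    routes = [("SFO", "JFK"), ("SFO", "ORD"), ("ORD", "JFK")]
    for airport, degree in out_degrees(routes).most_common():
        print(airport, degree)
```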

Implementation:

Source: Spark Summit

GraphFrames is implemented on top of Spark SQL. In the first step, the query given by the user is parsed by the pattern parser. The query planner, which is also implemented on top of Spark, then takes the pattern as input and converts it into a logical plan. At query time it also receives the views from the GraphFrame. Because queries are very expensive to run, the planner rewrites the query to use a materialized view. In the final step Spark SQL, which is designed for working with structured and semi-structured data, retrieves the result as a DataFrame.

GraphX vs GraphFrames:

Feature                 | GraphX        | GraphFrames
Abstraction             | RDDs          | DataFrames
Core API                | Scala         | Python, Java, Scala
Return types            | Graph or RDD  | GraphFrames or DataFrames
Usages                  | Algorithms    | Algorithms, queries, motif finding
Vertex IDs              | Long          | Any type
Vertex/edge attributes  | Any type      | Any number of columns

Experimental Setup:

Software Configuration:

Hardware Configuration:

Dataset:

Algorithms:

PageRank:

PageRank is a measure widely used by search engines like Google to rank pages. The same algorithm can be applied to the airports in the dataset, where the resulting score can be interpreted as a measure of each airport's importance.
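As an illustration, the textbook power-iteration form of PageRank fits in a few lines of plain Python. This is a sketch under stated assumptions, not the GraphX or GraphFrames implementation: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count and the sample routes are all choices made here, and library implementations may scale the scores differently.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration over (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Every node shares its rank equally among its outgoing edges.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Made-up routes: three airports all fly into ORD.
    routes = [("SFO", "ORD"), ("JFK", "ORD"), ("LAX", "ORD"), ("ORD", "JFK")]
    for airport, score in sorted(pagerank(routes).items(), key=lambda kv: -kv[1]):
        print(airport, round(score, 4))
```

An airport scores highly when many routes lead into it, and more so when those routes come from airports that are themselves well connected.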

Breadth First Search:

Breadth First Search (BFS) is used to search or traverse a graph. BFS can be used to find the shortest connection, in terms of the number of flights, between two airports.
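A minimal plain-Python sketch of BFS on a route graph follows. The function name `bfs_path` and the sample routes are assumptions for illustration; GraphFrames offers its own `bfs` method, which this sketch does not use. Because BFS explores the graph level by level, the first path that reaches the destination uses the fewest flights.

```python
from collections import deque


def bfs_path(edges, start, goal):
    """Return a path with the fewest edges from start to goal, or None."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    parent = {start: None}        # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbour in adjacency.get(node, []):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return None


if __name__ == "__main__":
    # Made-up directed routes.
    routes = [("SFO", "DEN"), ("DEN", "ORD"), ("SFO", "ORD"), ("ORD", "JFK")]
    print(bfs_path(routes, "SFO", "JFK"))
```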

Relationships through Motif Finding:

Motifs are structural patterns in a graph, and motif finding is the search for such patterns. Names for vertices or edges can be omitted in a motif when they are not needed. In this paper, motif finding is applied to find flight itineraries that have an intermediate airport.
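In GraphFrames such a pattern is written as a motif string, for example `(a)-[e1]->(b); (b)-[e2]->(c)`, and passed to `find`. The plain-Python sketch below evaluates the same two-edge pattern as a self-join of the edge list; the function name `one_stop_itineraries`, the sample routes and the extra filter that drops round trips (a == c) are assumptions made for illustration.

```python
def one_stop_itineraries(edges):
    """Find (a, b, c) such that a->b and b->c are edges and a != c."""
    by_source = {}
    for src, dst in edges:
        by_source.setdefault(src, []).append(dst)

    itineraries = []
    for a, b in edges:                      # first leg: a -> b
        for c in by_source.get(b, []):      # second leg: b -> c
            if c != a:                      # drop itineraries that return to the origin
                itineraries.append((a, b, c))
    return itineraries


if __name__ == "__main__":
    # Made-up directed routes.
    routes = [("SFO", "ORD"), ("ORD", "JFK"), ("ORD", "SFO"), ("JFK", "LAX")]
    for origin, stop, destination in one_stop_itineraries(routes):
        print(f"{origin} -> {stop} -> {destination}")
```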

Analysis and Results:

The constructed graph can be used to analyze the dataset.

Departure Delays:

Departure delays from any city can be obtained from the tabular data by mapping the corresponding columns of the dataset. By focusing on these delays, quality of service can be improved, for example by increasing the number of flights or finding new methods to reduce departure delays. A comparative study can also be done by visualizing the cities that have the most delays.
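The aggregation behind such a comparison can be sketched in plain Python. The record layout (origin city, departure delay in minutes), the function name `mean_departure_delay` and the sample values are hypothetical and are not taken from the dataset used in this report.

```python
from collections import defaultdict


def mean_departure_delay(flights):
    """Average departure delay per origin city from (city, delay_minutes) records."""
    totals = defaultdict(lambda: [0.0, 0])   # city -> [sum of delays, flight count]
    for city, delay in flights:
        totals[city][0] += delay
        totals[city][1] += 1
    return {city: total / count for city, (total, count) in totals.items()}


if __name__ == "__main__":
    # Hypothetical records; negative values mean an early departure.
    sample = [("Seattle", 10), ("Seattle", 20), ("Boston", -5), ("Boston", 15)]
    averages = mean_departure_delay(sample)
    # Cities ordered from most to least delayed, ready for plotting.
    for city, delay in sorted(averages.items(), key=lambda kv: -kv[1]):
        print(city, delay)
```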

Arrival Delays:

Arrival delay is one of the major factors to be considered while analyzing the flight performance.

Analysis on cancellation:

The total number of flight cancellations on a specific day of the week can be visualized.
