SQL vs NoSQL: Feature Comparisons and MapReduce vs Parallel DBMS Analysis

4 SQL V/S NOSQL

SQL stands for Structured Query Language, a language used for defining and manipulating data. It is extremely powerful and flexible, which has made it the most popular and widely available option and a safe choice, especially for complex queries.

Although SQL is versatile, a schema must be defined for each table, and the table accepts only data that conforms to it. Changing the structure later can require substantial changes to the database and to the applications that use it.

NoSQL databases, on the other hand, have a dynamic schema suited to unstructured data, and the data can be stored in several ways: as key-value pairs, or in column-oriented, document-oriented or graph-based stores. A document store, for example, does not require a structure to be defined before a document is created. As a result, each record in the database (a document, the counterpart of a row) can have a different structure from every other, or even a unique one.
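The contrast between a fixed and a dynamic schema can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for a relational system, a plain list of JSON strings stands in for a document store, and the table and field names (users, email, tags) are hypothetical.

```python
import json
import sqlite3

# Relational side: the table schema is declared before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "Ada"))

# A row that uses an undeclared column is rejected until the schema is altered.
try:
    conn.execute(
        "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
        (2, "Bob", "bob@example.com"),
    )
    schema_rejected = False
except sqlite3.OperationalError:
    schema_rejected = True

# Document side: a hypothetical in-memory "collection" of JSON documents.
# No structure is declared, so every document may carry different fields.
collection = [
    json.dumps({"id": 1, "name": "Ada"}),
    json.dumps({"id": 2, "name": "Bob", "email": "bob@example.com"}),
    json.dumps({"id": 3, "tags": ["admin"], "address": {"city": "Oslo"}}),
]

documents = [json.loads(d) for d in collection]
field_sets = [sorted(doc.keys()) for doc in documents]
```

After the block runs, schema_rejected is True, while field_sets holds a different set of field names for each of the three documents.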

4.1 Feature Comparisons

Scalability:

SQL databases are generally vertically scalable, whereas NoSQL databases are horizontally scalable.

Vertically scalable means that you add more CPU, RAM or SSD capacity to the single machine on which the database is installed. Horizontally scalable means that you add more servers to handle high volumes of requests. This makes NoSQL databases highly scalable and a powerful tool for applications with large volumes of data, such as Big Data workloads.

Structure:

SQL databases have a table-based structure, while NoSQL databases are document-based, graph-based, column-oriented or key-value stores. Because SQL databases enforce a well-defined structure, they are better suited to multi-row transactions.
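A multi-row transaction can be illustrated with a minimal sketch. It assumes an in-memory SQLite database and a hypothetical accounts table; the transfer and balances helpers are illustrative names, not part of any system discussed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both row updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the credit to dst was rolled back too.
        return False


def balances(conn):
    """Return {account id: balance} for all accounts."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

Calling transfer(conn, 1, 2, 30) updates both rows together. Calling transfer(conn, 1, 2, 500) fails on the second update and leaves both balances unchanged, which is the all-or-nothing behaviour that makes relational systems a good fit for this kind of work.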

4.2 MapReduce v/s Parallel DBMS

Another common comparison is between MapReduce and parallel DBMSs. MapReduce is a programming model developed in 2004 by Jeffrey Dean and Sanjay Ghemawat, two Google engineers. It is mainly used for Extract, Transform and Load (ETL) operations, with the data stored in a file system called GFS (Google File System). An open-source counterpart, Hadoop, was developed largely at Yahoo under the Apache License, and it stores its data in HDFS (Hadoop Distributed File System). MapReduce and Hadoop proved to be a well-defined programming model for handling huge amounts of data, and many expected them to displace parallel relational database systems such as Teradata. Both kinds of system run in distributed environments, which depend on network bandwidth and fault tolerance to work effectively. Network bandwidth has physical limits that cannot be exceeded. Fault tolerance describes how the system copes when a node in the distributed environment fails.

The MapReduce paradigm is very useful where huge datasets need to be processed in parallel in a distributed environment. Its advantage is that it hides the complex details of parallelization, fault tolerance, data distribution and load balancing in a library. This lets the user focus on how the data should be processed rather than on managing those details.
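The division of labour described above can be sketched as a single-process toy. The user writes only map_fn and reduce_fn, and a small driver (here the hypothetical run_mapreduce) does the grouping. A real framework would also distribute the work and handle failures, which this sketch does not attempt.

```python
from collections import defaultdict


def map_fn(line):
    """User code: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """User code: combine all values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver standing in for the library: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # the "shuffle": values grouped by key
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


word_counts = run_mapreduce(["a b a", "b c"], map_fn, reduce_fn)
```

Here word_counts maps each word to its number of occurrences across both input lines.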

Parallel DBMSs run on clusters of shared-nothing nodes, each with its own CPU, memory and disks, connected through a high-speed network. These systems combine horizontal partitioning of relational tables with partitioned execution of SQL queries. Horizontal partitioning means distributing the rows of a table across the nodes of the cluster so that they can be processed in parallel. Partitioning strategies include hash, range and round-robin partitioning.
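The three partitioning strategies can be sketched as follows. This is a simplified illustration: rows are Python dictionaries, each "node" is a list, and the function names and the use of CRC32 as the hash function are assumptions made for the example.

```python
from bisect import bisect_right
from zlib import crc32


def hash_partition(rows, key, n_nodes):
    """Send each row to the node chosen by a hash of its partitioning key."""
    parts = [[] for _ in range(n_nodes)]
    for row in rows:
        node = crc32(str(row[key]).encode()) % n_nodes
        parts[node].append(row)
    return parts


def range_partition(rows, key, boundaries):
    """Send each row to the node whose key range contains it.

    `boundaries` is a sorted list of split points; a key equal to a
    boundary goes to the node on the right of that boundary.
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, row[key])].append(row)
    return parts


def round_robin_partition(rows, n_nodes):
    """Deal rows out to the nodes in turn, ignoring their contents."""
    parts = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        parts[i % n_nodes].append(row)
    return parts
```

Hash partitioning keeps rows with equal keys on the same node, range partitioning keeps neighbouring keys together, and round-robin spreads rows evenly regardless of their values.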

Horizontal partitioning of rows is required for scalable DBMS performance, and it leads in turn to the partitioned execution of SQL operators. A query is executed in parallel on all nodes, and the rows that satisfy the predicate are passed to a shuffle operator that dynamically repartitions them. A key benefit is that the parallel DBMS manages the partitioning strategies automatically. Parallel DBMSs can also perform map and reduce operations through user-defined functions.
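Partitioned execution can be sketched as three steps: filter locally on every node, shuffle the surviving rows by the grouping key, then aggregate locally. The sketch below runs sequentially in one process; the function names, the row layout and the choice of CRC32 for repartitioning are all assumptions for illustration.

```python
from collections import defaultdict
from zlib import crc32


def local_filter(partition, predicate):
    """Step 1: each node applies the predicate to its own rows."""
    return [row for row in partition if predicate(row)]


def shuffle(partitions, key, n_nodes):
    """Step 2: repartition rows so that equal keys land on the same node."""
    out = [[] for _ in range(n_nodes)]
    for part in partitions:
        for row in part:
            out[crc32(str(row[key]).encode()) % n_nodes].append(row)
    return out


def local_sum(partition, key, value):
    """Step 3: each node aggregates the groups it now owns."""
    totals = defaultdict(int)
    for row in partition:
        totals[row[key]] += row[value]
    return dict(totals)


def parallel_group_sum(partitions, predicate, key, value):
    """Simulate SELECT key, SUM(value) ... WHERE predicate GROUP BY key."""
    filtered = [local_filter(p, predicate) for p in partitions]
    repartitioned = shuffle(filtered, key, len(partitions))
    result = {}
    for part in repartitioned:
        # After the shuffle each key lives on exactly one node,
        # so merging the per-node results never overwrites a group.
        result.update(local_sum(part, key, value))
    return result
```

Because the shuffle places every row of a group on a single node, the final merge is a plain union of the per-node results.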

The authors of MapReduce and Parallel DBMSs: Friends or Foes? [12] compared the performance of parallel DBMSs and MapReduce in the following experiments to identify the performance trade-offs.

• Original MR Grep task – The task is to find a three-character pattern by scanning a data set of 100-byte records. The DB systems were about 2 times faster than Hadoop, a surprising result attributed mainly to architectural differences.

• Web Log task – The task is to read all Web server logs and calculate the ad revenue for each visiting IP address. Here the DB systems' lead over Hadoop was even larger than in the Grep task.

• Join task – The task is to join two tables, with additional aggregation and filtering. Here, too, the DB systems were much faster than Hadoop.

These performance differences can be attributed mainly to architectural differences, which in turn result from the implementation choices made in each system.

Repetitive record parsing – In its default configuration, Hadoop stores data in HDFS in its original textual format, so the user code in each Map and Reduce function must parse every record to convert its fields to the appropriate types. In a DBMS, by contrast, records are parsed once, when the data is loaded, which makes query execution faster.
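The parsing difference can be made concrete with a small sketch that counts parse calls. The two store classes, the pipe-separated record format and the helper names are hypothetical; they model the behaviour described above, not either system's actual code.

```python
def parse(line):
    """Convert one 'user|amount' text record into typed fields."""
    user, amount = line.split("|")
    return user, int(amount)


class TextStore:
    """Keeps raw text, so every scan re-parses every record."""

    def __init__(self, lines):
        self.lines = list(lines)
        self.parse_calls = 0

    def scan(self):
        for line in self.lines:
            self.parse_calls += 1
            yield parse(line)


class LoadedStore:
    """Parses each record once at load time and keeps the typed rows."""

    def __init__(self, lines):
        self.rows = [parse(line) for line in lines]
        self.parse_calls = len(self.rows)

    def scan(self):
        return iter(self.rows)


def total(store):
    """A simple query: sum the amount field over all records."""
    return sum(amount for _, amount in store.scan())
```

Running the same query three times over two records costs six parse calls with TextStore but only the two made at load time with LoadedStore.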

Pipelining – In a parallel DBMS, intermediate data is not written to disk; each operator pushes its output directly to the next operator. MapReduce, by contrast, writes intermediate results to disk, which gives good fault tolerance but adds significant performance overhead.
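The two execution styles can be contrasted in a short sketch. Python generators stand in for pipelined operators and a temporary JSON-lines file stands in for intermediate results written to disk; the operator names and the row layout are assumptions made for illustration.

```python
import json
import os
import tempfile


def select_positive(rows):
    """First operator: keep rows whose value is positive."""
    for row in rows:
        if row["value"] > 0:
            yield row


def double(rows):
    """Second operator: double the value of every row it receives."""
    for row in rows:
        yield {**row, "value": row["value"] * 2}


def pipelined(rows):
    """Rows flow from one operator to the next without touching disk."""
    return list(double(select_positive(rows)))


def materialized(rows):
    """The first operator's output is written to disk, then read back."""
    with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
        for row in select_positive(rows):
            f.write(json.dumps(row) + "\n")
        path = f.name
    try:
        with open(path) as f:
            intermediate = [json.loads(line) for line in f]
    finally:
        os.remove(path)
    return list(double(intermediate))
```

Both functions return the same rows. The materialized version pays for an extra write and read, but its intermediate file could be reused if the second stage had to be restarted.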

Scheduling – A parallel DBMS can minimize data transmission between nodes because the query execution plan is known in advance, which allows the work to be optimized before it is distributed.

Column-oriented storage – A column-store database reads only the attributes needed to answer the user's query. Hadoop/HDFS stores data row by row, which is a disadvantage in this regard.
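The effect of the storage layout can be sketched by counting how many stored values a query has to touch. The helper names and the sample fields (id, url, revenue) are hypothetical, and the model assumes that a row store fetches whole records.

```python
def to_columns(rows):
    """Convert a list of row dictionaries into one list per attribute."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_from_rows(rows, attr):
    """Row store: every record is fetched whole, then one field is used."""
    values_read = 0
    total = 0
    for row in rows:
        values_read += len(row)  # all fields of the record are read
        total += row[attr]
    return total, values_read


def sum_from_columns(columns, attr):
    """Column store: only the requested attribute is read."""
    column = columns[attr]
    return sum(column), len(column)
```

For two rows of three fields each, summing one attribute reads six values in the row layout and two in the column layout, and the gap widens as tables get wider.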

These comparisons show that each approach has its own merits and demerits, and that a combination of the two is more powerful than either alone. MR is best suited to complex analytical problems and ETL. A DBMS can be considerably faster once the data is loaded, although loading takes a considerable amount of time, and it is better suited to query-intensive operations. The reshuffle between Map and Reduce is equivalent to a GROUP BY operation in SQL. The best conclusion is that a system in which both coexist is the strongest option.
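The equivalence between the reshuffle and GROUP BY can be checked with a small sketch modelled loosely on the Web Log task. The table name, column names and sample values are hypothetical, and an in-memory SQLite database stands in for the DBMS.

```python
import sqlite3
from collections import defaultdict

visits = [("10.0.0.1", 5), ("10.0.0.2", 3), ("10.0.0.1", 7)]


def revenue_sql(visits):
    """DBMS side: the grouping is expressed declaratively with GROUP BY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (source_ip TEXT, ad_revenue INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", visits)
    rows = conn.execute(
        "SELECT source_ip, SUM(ad_revenue) FROM visits GROUP BY source_ip"
    ).fetchall()
    conn.close()
    return dict(rows)


def revenue_mapreduce(visits):
    """MR side: map emits (ip, revenue), the shuffle groups, reduce sums."""
    groups = defaultdict(list)
    for ip, revenue in visits:
        groups[ip].append(revenue)  # shuffle: collect values by key
    return {ip: sum(values) for ip, values in groups.items()}
```

Both functions return the same total per IP address; the difference lies in who performs the grouping, the query engine or the framework's shuffle phase.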

4.3 Figures

This section reviews figures published by Stack Overflow from its 2018 user survey.

Figures 1 and 2 show that relational databases such as MySQL, PostgreSQL and SQL Server top the list of the most popular databases in 2018. This suggests that the RDBMS model is not yet a legacy system and, going by the trend, is likely to remain in use for at least another 10 years.

Figures 3 and 4 show a trend towards switching to Redis, PostgreSQL, Elasticsearch and MongoDB in the near future.

MySQL and SQL Server were the most popular databases for a long time, but some companies now want to switch to PostgreSQL or to a NoSQL database such as MongoDB. The obstacle is the cost of switching, in both money and time, which may be slowing the process down.

Fig. 1. Most popular databases according to all users

Fig. 2. Most popular databases according to professionals

Fig. 3. Most loved databases by developers

Fig. 4. Most wanted databases by developers

Fig. 5. Most dreaded databases by developers

5 CONCLUSION

Although relational databases are evidently not going to disappear soon, and are in fact still the most popular choice among professionals around the world, you can expect some switching as data continues to evolve and to grow at its current pace. Data is now generated by mobile phones, smart watches, IoT devices and other Web resources.

The choice still rests with developers, who must pick the right tool for their needs. If the data is very large (petabytes), evolves continuously and has a structure that is difficult to define, a NoSQL database is the natural choice.

On the other hand, if the data is on the order of terabytes and its structure changes little, as in banking applications that store customer information and continuously add new customers, you might choose a SQL database such as PostgreSQL.

There are also plenty of cloud options, both relational, such as Amazon RDS and Google Cloud SQL, and non-relational, such as Firebase.

Just as MapReduce complements parallel DBMSs, NoSQL complements SQL databases, and many companies are switching to a model in which both SQL and NoSQL are used together.
