In this article, we discuss various solutions that support real-time data analytics, such as Spark, Storm, Samza and Trident. Organizations are adopting alternative storage infrastructure options such as NoSQL databases, HDFS (Hadoop Distributed File System) and HBase for processing big data. Many organisations seek to obtain meaningful information from large volumes of data, but conventional techniques are neither sufficient nor efficient to carry out such operations in time. This overview is followed by a discussion of big data challenges, i.e. scalability, availability, performance, precision, data integrity, data quality and data security.
In an advanced technological world, data is generated from many sources such as social networks, large e-Commerce businesses, credit cards, medical records and tourism details (flight records, railway records, accommodation logbooks). Moreover, smartphones, military surveillance (various application data), scientific research, official government records (population counts, social security numbers, voter IDs) and archives (videos, photos, documents) are piling up data as well [1]. Big data has three attributes known as the 3Vs: Velocity, Variety and Volume. Velocity refers to the speed of data, i.e. whether it is real-time or near-real-time; Variety refers to the unique data types arising from the diversity of data sources (social media such as Facebook and Twitter, blogs, microsensors, market feeds); and Volume refers to the bulkiness of the datasets (zettabytes, petabytes and terabytes of data are being generated) [2]. These 3Vs are needed for bringing out superior insights that can lead to better decision making [2]. A surplus of data is being generated by today's rigorous applications, challenging the scientific community, because large chunks of data sometimes create problems that are extremely complex from an analytics point of view [2]. The real-time processing demanded by data-intensive applications is pushing the limits of present-day data processing infrastructures. Wall Street trading, system monitoring, and command and control in military environments are a few examples of stream-based applications [3]. The dramatic escalation of high-volume data streams is straining present Big Data processing environments. The further sections of this paper explain the challenges in Big Data, then the various storage infrastructures (NoSQL, HDFS, HBase), followed by some real-time processors (Storm, Spark) that help in producing relevant outcomes, and finally the ongoing research and development in the field of real-time data analytics.
Big Data provides a lot of new opportunities but brings along a whole new set of complex problems that we have to deal with. There is hassle during the capture of data, the storage of large volumes of data, and sharing, analysis, management and visualization, while privacy and security issues add to the already complicated scenario [4]. It has been 52 years since Moore's law was stated by Intel co-founder Gordon Moore: "The number of transistors per square inch on integrated circuits will double every two years" (substantial CPU performance). This has held for the past few decades, but I/O speeds have increased only modestly and cannot handle big data sets to their full potential. Moreover, enhancement of information processing methods has also been on the slower side [4], [5]. Some challenges are discussed below:
1) Data Cleaning
The reliability of the sources cannot be guaranteed, and the quality has to be of excellent calibre to engage the resources and turn the data into meaningful information. Data may contain inconsistencies, incomplete records or noise that would be of no use to any organisation.
2) Data Capture & Storage
About 2.5 quintillion bytes of data are generated every day, while storage capacity has only doubled roughly every three years since 1980 [2]. The large volume of data that is being created is captured, but at a very high cost.
3) Data Analysis
Big Data introduces the challenge of tackling ever-increasing amounts of data that need to be analysed to extract meaningful information. This analysis helps an organisation foresee particular patterns that can then be used in making business decisions [4]. Various analytical techniques, such as statistical analysis, machine learning, visualization and data mining, are implemented to gain proper insight from the analysis.
4) Data Visualization
Genuine visualisation is very necessary, as the large datasets accumulated by organisations, e.g. e-Commerce websites such as Amazon, eBay, Flipkart and Alibaba, cover millions of users and any number of products. For such massive volumes of data, visualization is needed to get a true picture of what is going on. Interactive visualisation tools such as Tableau, Power BI, Chart.js, DataHero, Plotly etc. can manoeuvre large datasets and convert them into insightful graphs, charts etc. Sentiment analysis can be performed on such pictorial data, which helps in analysing data more efficiently.
5) Data Streaming
Data streams come from real-time applications such as stock market feeds, different types of sensor networks, blogs, traffic data etc.; patterns are extracted from the hidden values in these enormous datasets [4]. Incoming data streams are affected by variability because of their unpredictable nature and changing environment [4].
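As an illustration of extracting patterns from an unbounded stream, the sketch below computes a sliding-window average over arriving items in plain Python; the window size and the toy "sensor" readings are invented for illustration.

```python
from collections import deque

def sliding_averages(stream, window=3):
    """Yield the mean of the last `window` items as each new item arrives."""
    buf = deque(maxlen=window)          # old items fall out automatically
    for item in stream:
        buf.append(item)
        yield sum(buf) / len(buf)

# A toy stream; real streams are unbounded and unpredictable.
readings = [10, 12, 11, 30, 28]
averages = list(sliding_averages(readings))
print(averages)
```

A sudden spike (the 30 above) shows up only gradually in the window averages, which is exactly the smoothing such stream queries rely on.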
The most challenging aspect of Big Data computing is data processing. There are many programming frameworks, such as Apache Hadoop, Apache Spark, HPCC (High-Performance Computing Cluster), Storm, Samza etc. [1]. Latency can be pruned down by processing data immediately after it is received, but at the cost of a high per-item overhead. For higher efficiency, however, data has to be buffered and processed in batches. Entirely stream-based systems such as Storm and Samza deliver quite low latency at a proportionately high per-item cost [6].
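The latency/throughput trade-off described here can be made concrete with a minimal micro-batching sketch in plain Python (the batch size is an arbitrary example, not a recommendation):

```python
def process_in_batches(items, batch_size):
    """Buffer items and hand them off in fixed-size batches.

    Larger batches amortise per-item overhead (higher throughput), but
    each item waits until its batch fills (higher latency). batch_size=1
    degenerates to pure per-item stream processing."""
    batch, results = [], []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            results.append(list(batch))   # stand-in for one processing call
            batch.clear()
    if batch:                             # flush the final partial batch
        results.append(list(batch))
    return results

batches = process_in_batches(range(7), batch_size=3)
print(batches)  # → [[0, 1, 2], [3, 4, 5], [6]]
```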
1) Storm
Storm is used for processing large streaming data. It is an open-source system that is easy to operate and compatible with many programming languages, e.g. Ruby, Python, Perl, Java etc. It is scalable and fault-tolerant, and work can even be reallocated at runtime, making it elastic. For execution, end users need to build topologies (graphs of computation), in which the data flow is represented as directed edges between a set of nodes. Storm introduced a backpressure mechanism for bottleneck situations, i.e. when the influx of data is greater than the data being processed [2], [6].
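A Storm topology is a directed graph in which a spout emits tuples and bolts transform them. The plain-Python sketch below only mimics that dataflow shape for a word count; real Storm topologies are built with Storm's own spout/bolt abstractions and run on a cluster, and the sentences here are made up.

```python
from collections import Counter

def sentence_spout():                    # source node: emits the stream
    yield from ["storm processes streams", "samza processes streams"]

def split_bolt(sentences):               # bolt 1: tokenize each tuple
    for sentence in sentences:
        yield from sentence.split()

def count_bolt(words):                   # bolt 2: aggregate the stream
    return Counter(words)

# Wiring the nodes together forms the directed edges of the "topology".
counts = count_bolt(split_bolt(sentence_spout()))
print(counts["streams"])  # → 2
```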
2) Samza
Samza works similarly to Storm in that the streams are partitioned and the data items inside them are kept in order. Several parallel tasks are deployed for attaining scalability, though the number of tasks cannot be increased dynamically at runtime. One advantage of Samza is that data can be buffered between two processing steps, and this buffer can be utilised by different teams of the same organisation. Because of this, there is no need for a backpressure mechanism (necessary in the case of Storm). Samza can curb data loss, as it checkpoints the processing tasks systematically and can restart the processes from the point where the failure started [6].
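The buffering between processing steps can be sketched with a queue standing in for the intermediate store (in Samza this is a Kafka topic); the two stages and the event names below are hypothetical.

```python
import queue

# The buffer decouples the two steps: the producer never overruns the
# consumer, so no backpressure mechanism is needed, and other consumers
# could read the same buffered stream at their own pace.
buffer = queue.Queue()

def stage_one(events):
    for event in events:
        buffer.put(event.upper())       # step 1 writes into the buffer

def stage_two():
    out = []
    while not buffer.empty():
        out.append(buffer.get())        # step 2 drains at its own pace
    return out

stage_one(["click", "view"])
result = stage_two()
print(result)  # → ['CLICK', 'VIEW']
```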
3) Spark
Spark is an open-source programming model designed for elegant analysis and high processing speed. For polished Big Data analysis, cloud computing is a crucial enabler, providing on-demand resources that can be used for massive-scale computing problems of greater complexity. With in-memory caching, there is a remarkable enhancement in the performance of the framework. Spark facilitates some machine learning algorithms that help in breaking down large volumes of streaming data into small data blocks. It distributes the data automatically and transforms it into a DStream (discretized stream) [1], [6].
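The idea behind a DStream is to discretize a live stream into a sequence of small batches, each covering a fixed time interval. The plain-Python sketch below reproduces only that grouping step; the interval and the timestamped events are invented, and real Spark Streaming does this through its own APIs on a cluster.

```python
def discretize(events, interval):
    """Group (timestamp, value) pairs into one batch per time interval,
    mimicking how a DStream is a sequence of small per-interval batches."""
    batches = {}
    for ts, value in events:
        batches.setdefault(ts // interval, []).append(value)
    return [batches[k] for k in sorted(batches)]

events = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
windows = discretize(events, interval=2)
print(windows)  # → [['a', 'b'], ['c', 'd']]
```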
The definition of NoSQL is not entirely settled; it is variously read as "Not Only SQL" or "Not Relational" [7]. Cloud computing's drawback of not supporting relational database management systems has helped popularise NoSQL databases. NoSQL systems make use of many servers to scale Online Transaction Processing applications. NoSQL helps in the recovery of huge chunks of data and in the storage of bulk data through a proper mechanism [8]. Figure 2 shows the comparison between the three NoSQL databases discussed below.
1) HBase
HBase is a column-oriented, not row-oriented, data model. It can handle billions of rows and millions of columns [9]. In this type of system, tasks can be scaled over huge datasets. It has the capacity to manage substantial table updates, as it has flexible structured hosting. It is schema-free and can be accessed through APIs and other access methods such as the Java API, a RESTful HTTP API and Thrift. It supports languages such as C, C#, C++, Java, JavaScript, Perl, PHP, Ruby, Scala, Haskell etc. [11]. A Zookeeper instance supports HBase; it is used when full consistency is required, as well as for compiling and making calculations on documents. HBase makes use of triggers.
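In HBase, a cell is addressed by row key, column family and qualifier rather than by a fixed row schema. The dict-of-dicts sketch below mimics only that addressing scheme in plain Python; the row key and family names are invented for illustration and this is not the real HBase API.

```python
# A toy column-oriented store: row key -> column family -> qualifier -> value.
# Being schema-free, any row may carry any set of qualifiers.
table = {}

def put(row, family, qualifier, value):
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def get(row, family, qualifier):
    return table[row][family][qualifier]

put("user#1", "info", "name", "Ada")
put("user#1", "stats", "visits", 3)     # a second family on the same row
print(get("user#1", "info", "name"))    # → Ada
```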
2) MongoDB
MongoDB has no schema; it is a document-oriented data model. It is used when we need to query an enormous dataset quickly, and also for Data Warehousing solutions. It is accessed using JSON and can work on different operating systems such as Linux, Solaris, Windows and OS X. It supports various programming languages like C, C#, C++, Groovy, Java, PHP, Python and Scala [9]. It is memory-hungry and might not perform as efficiently as HBase because it uses a log-structured merge tree. It does not make use of triggers. MongoDB supports secondary indexes and master-slave replication methods for the amalgamation of data [9].
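The document model can be illustrated by matching JSON-like documents against a query document. The sketch below reproduces only equality matching in plain Python; the product collection is made up, and real queries go through a driver such as pymongo.

```python
def find(collection, query):
    """Return documents whose fields equal every field in the query,
    mimicking simple equality matching over schema-free documents."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in query.items())]

products = [
    {"name": "laptop", "brand": "acme", "price": 900},
    {"name": "phone",  "brand": "acme", "price": 400},
    {"name": "tablet", "brand": "zeta", "price": 300},
]
acme = find(products, {"brand": "acme"})
print([doc["name"] for doc in acme])  # → ['laptop', 'phone']
```

Note that documents need not share fields at all; a document missing the queried key simply fails to match, which is the flexibility the schema-free model provides.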
3) Cassandra
Cassandra is a column-oriented data model. For data workloads that run over multiple nodes without any single point of failure, Cassandra escalates the performance. It implements a peer-to-peer distributed system among all the nodes present in the cluster [10]. It uses the gossip protocol so that nodes can communicate with each other and watch out for misbehaving nodes in the background [10]. It provides top-notch read and write throughput. It supports many programming languages such as Java, Python, JavaScript, PHP, C++, Ruby, .NET, Go etc. [11]. It utilises triggers too. When deployed over various data centres, Cassandra uses a snitch to discover the network topology while replicating data [10].
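The absence of a single point of failure comes from placing each row's replicas on successive peer nodes around a hash ring. The toy ring below illustrates that placement idea only; the node names and replication factor are invented, and real Cassandra uses partitioners, virtual nodes and snitches rather than this sketch.

```python
import hashlib

def token(key):
    """Hash a key or node name onto the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def replicas(key, nodes, rf=2):
    """Walk the ring clockwise from the key's token and pick rf peers."""
    ring = sorted(nodes, key=token)                  # nodes ordered by token
    start = next((i for i, n in enumerate(ring)
                  if token(n) >= token(key)), 0)     # first node past the key
    return [ring[(start + i) % len(ring)] for i in range(rf)]

nodes = ["node-a", "node-b", "node-c"]
owners = replicas("user:42", nodes, rf=2)
print(owners)   # two distinct peers own this row's replicas
```

Because every node is a peer on the same ring, losing one node only shifts ownership to its neighbours instead of taking the data offline.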
The various challenges we discussed have to be addressed, and innovative methods need to be deployed to overcome these drawbacks and move into a new phase of scientific research. Challenges like data security, data aggregation, big data cleaning and imbalanced data sets have to be optimised and made reliable for extracting insight and knowledge. Hardware has to be improved so that efficiency and performance become more progressive, yielding top-notch systems that can analyse data. A few years down the line, real-time analytics can be productive in the field of research: the growing use of the internet and smartphones, plus the introduction of Artificial Intelligence, will lead to great inventions, and new creative ideas will emerge to solve these profound issues. A lot of human resources and investment will be needed to make any significant progress in the field of Big Data Analytics.
Conclusion:
In this paper, we discussed the overview, challenges and solutions to support real-time analytics, and compared alternative storage infrastructure options. From this discussion, we can conclude that Big Data has much potential to reach new heights in the field of analytics, even though it is currently constrained by limitations in storage and robust I/O techniques. Real-time streaming data is escalating exponentially and is being handled by stream processors like Spark, Samza, Storm etc. For real stream data management, we have NoSQL databases like HBase, MongoDB, Cassandra etc., which can handle such large volumes of data and store them in a structured way that is then used to obtain insightful information. There is much scope for future development in this field, as it is ever-growing, innovative and productive. The present era of Big Data is bringing a new wave of revolution in the field of science and technology.