Nowadays, Big Data and Cloud Computing have become trending topics in Information Technology (IT). Big Data refers to data at a massive scale that may be unstructured, semi-structured or structured, whereas Cloud Computing refers to a type of computing in which services are delivered over the Internet. Many organizations consider the combination of Big Data and Cloud Computing beneficial, for example as a cost-effective solution for big data analytics. However, the security of cloud computing associated with Big Data applications has become a primary concern in the industry.
1.1 Big Data and Cloud Computing
Big Data refers to extremely large and complex datasets that are difficult to process with the traditional data mining techniques and tools used to extract and transform useful information. According to Inukollu, Arsi and Ravuri (2014), Big Data Analytics has four characteristics, also known as the 4 V's: Volume, Variety, Velocity and Veracity. Figure 1.1 explains the meaning of each characteristic.
Cloud Computing refers to a technology that allows users to share, store, access and process data on the Internet rather than on local servers or personal devices. Cloud service models are usually classified into three types: Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS). However, because the cloud is highly available to all users, it faces more security issues and challenges.
Many organizations use big data to store and analyze data about their organization, business and customers, but not all of them have the fundamental knowledge to do so, especially from a security perspective. In addition, integrating big data with cloud computing may introduce security vulnerabilities. The best known of these is platform heterogeneity: the combination creates new, unfamiliar platforms in the cloud on which many existing security tools may not work. Therefore, to secure big data, new security tools and techniques are needed that work with both big data and cloud computing, such as encryption, logging, honeypot detection and access control (Inukollu et al., 2014). These techniques are discussed in the sections below.
2.0 Security Tools and Techniques
2.1 Encryption
Encryption is the process of scrambling text so that it is unreadable to anyone without the key, especially an attacker; this matters in big data systems because data collected from many places and grouped in a cluster is an attractive target for stealing critical information. Two types of encryption can be used to increase security in the cloud: data encryption and network encryption. In data encryption, different machines should use different keys to encrypt and decrypt the data they store, so that even an attacker who successfully retrieves the data cannot extract meaningful information from it or misuse it. Network encryption, in contrast, encrypts data in transit so that it is unreadable as it crosses the computer network; it is widely used on the Internet to protect information sent between browser and server, such as passwords, sensitive information and payment details. Network encryption is commonly implemented through Internet Protocol Security (IPSec), a set of open Internet Engineering Task Force (IETF) standards that provides a framework for private communication over IP networks. Encrypted packets appear identical to unencrypted packets and are easily routed through any IP network, so even if an attacker can tap the communication channel, the useful information in the packets remains difficult to extract (Inukollu et al., 2014). The authors also agree that encryption is a good security strategy for slowing down professional hackers trying to steal an organization's sensitive information.
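The per-machine key idea above can be illustrated with a small sketch. This is a toy construction (per-node key derivation plus an HMAC-based XOR keystream), intended only to show why a key stolen from one node cannot decrypt another node's data; real deployments would use a vetted cipher such as AES through a cryptographic library, not this scheme.

```python
import hashlib
import hmac

def derive_node_key(master_secret: bytes, node_id: str) -> bytes:
    """Derive a distinct key per machine, so a key stolen from one
    node cannot decrypt data stored on another node."""
    return hmac.new(master_secret, node_id.encode(), hashlib.sha256).digest()

def keystream(key: bytes, length: int) -> bytes:
    """Toy counter-mode keystream built from HMAC-SHA256 (illustration only)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hmac.new(key, counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]

def xor_crypt(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt by XOR with the keystream (symmetric operation)."""
    ks = keystream(key, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))

master = b"cluster-master-secret"          # hypothetical cluster secret
key_node1 = derive_node_key(master, "node-1")
key_node2 = derive_node_key(master, "node-2")

ciphertext = xor_crypt(key_node1, b"customer record 42")
# The correct node key recovers the plaintext; another node's key does not.
assert xor_crypt(key_node1, ciphertext) == b"customer record 42"
assert xor_crypt(key_node2, ciphertext) != b"customer record 42"
```

Even an attacker who exfiltrates the ciphertext from node 1 learns nothing usable without that node's specific key, which is the property the section describes.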
2.2 Logging
Dependable logging is essential to system and application event auditing. Integrating a cloud database into the logging framework is an attractive option because it can substantially reduce the cost of database deployment and maintenance. However, a log owner loses security control over log data once it is stored in a cloud database, and attackers could exploit this weakness to falsify log data in a cloud database environment. Pătraşcu and Patriciu (2015) describe a secure logging framework integrated with a cloud database: a secret key is used to generate signatures over log entries and blocks, and log auditors can then use the corresponding public key to verify the integrity of the log data. The authors also provide an implementation of the framework and a performance evaluation of signing and verifying log data. Their work demonstrates a method for log owners to secure log data in a cloud database, and the proposed secure logging framework can be easily deployed in a cloud computing environment.
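A minimal sketch of tamper-evident logging follows. The cited framework uses a public/secret key pair; for a self-contained example this sketch substitutes a symmetric HMAC signature and chains each entry to the previous one by its signature, so falsifying any earlier record breaks verification of everything after it. The key and entry format are assumptions for illustration.

```python
import hashlib
import hmac
import json

SECRET = b"log-owner-signing-key"  # stands in for the framework's secret key

def append_entry(chain: list, message: str) -> None:
    """Append a log entry linked to the previous entry's signature,
    then sign the entry. Tampering anywhere breaks every later link."""
    prev = chain[-1]["sig"] if chain else "genesis"
    body = json.dumps({"msg": message, "prev": prev}, sort_keys=True)
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    chain.append({"msg": message, "prev": prev, "sig": sig})

def verify(chain: list) -> bool:
    """Recompute every signature and link; any falsified record fails."""
    prev = "genesis"
    for entry in chain:
        body = json.dumps({"msg": entry["msg"], "prev": prev}, sort_keys=True)
        expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
        if entry["prev"] != prev or not hmac.compare_digest(entry["sig"], expected):
            return False
        prev = entry["sig"]
    return True

log = []
append_entry(log, "user alice logged in")
append_entry(log, "user alice read file X")
assert verify(log)

log[0]["msg"] = "user mallory logged in"  # an attacker falsifies a record
assert not verify(log)
```

With asymmetric signatures, as in the cited paper, auditors could run `verify` with only the public key, without ever holding the signing secret.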
2.3 Software Format and Node Maintenance
Advances in cloud computing are reshaping the manufacturing industry toward a scalable, on-demand, service-oriented and highly distributed cost-efficient business model. However, this also poses challenges such as reliability, availability, scalability and safety for machines and processes across spatial boundaries (Pătraşcu & Patriciu, 2015). To address these challenges, a cloud-based paradigm of predictive maintenance based on mobile agents has been explored, enabling timely information acquisition, sharing and utilization for improved accuracy and reliability in fault diagnosis, remaining-useful-life prediction and maintenance scheduling. In this paradigm, a low-cost cloud sensing and computing node is first built with an embedded Linux operating system, mobile agent middleware and open-source numerical libraries. Data sharing and interaction are achieved by mobile agents, which distribute the analysis algorithms to the cloud sensing and computing nodes so that data is processed locally and the analysis results are shared. Finally, the presented cloud-based paradigm of predictive maintenance was validated on a motor test system.
2.4 Nodes Authentication
Essentially, node authentication is the process of ensuring that a given node and its data are legitimate. Whenever a node joins a cluster, it should be authenticated; a malicious node must not be allowed to join. Authentication protocols such as Kerberos can be used to distinguish authorized nodes from malicious ones.
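The admission check can be sketched as a simple challenge-response: a joining node must prove knowledge of a secret registered for its identity before it is admitted. This is a simplified stand-in for a real protocol such as Kerberos (no tickets, no key distribution center); the node names and secrets are hypothetical.

```python
import hashlib
import hmac
import os

# Secrets provisioned out of band for known cluster members (illustrative).
REGISTERED_NODES = {"node-7": b"node7-shared-secret"}

def issue_challenge() -> bytes:
    """Fresh random challenge, so responses cannot be replayed."""
    return os.urandom(16)

def node_response(secret: bytes, challenge: bytes) -> str:
    """The joining node proves knowledge of its secret without sending it."""
    return hmac.new(secret, challenge, hashlib.sha256).hexdigest()

def admit_node(node_id: str, challenge: bytes, response: str) -> bool:
    """Admit a node only if it answers the challenge with the secret
    registered for its claimed identity."""
    secret = REGISTERED_NODES.get(node_id)
    if secret is None:
        return False  # unknown node: reject outright
    expected = hmac.new(secret, challenge, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, response)

c = issue_challenge()
assert admit_node("node-7", c, node_response(b"node7-shared-secret", c))
assert not admit_node("node-7", c, node_response(b"wrong-secret", c))  # malicious node
assert not admit_node("node-9", c, "anything")                         # unregistered node
```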
2.5 Rigorous System Testing of Map Reduce Jobs
MapReduce is a programming model and processing technique for distributed computing, originally based on Java. A MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs) (Inukollu et al., 2014). The Reduce task then takes the output of a Map as its input and combines those tuples into a smaller set of tuples; as the name MapReduce implies, the Reduce task is always performed after the Map job. Under the MapReduce model, the data-processing primitives are called mappers and reducers. Decomposing a data-processing application into mappers and reducers is sometimes nontrivial, but once an application is written in the MapReduce form, scaling it to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This easy scalability is what has attracted many developers to the MapReduce model.
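The map/shuffle/reduce flow described above can be sketched in a few lines. This is an in-memory, single-machine illustration of the model (here in Python rather than Java), using the classic word-count example; a real framework such as Hadoop distributes the same three phases across a cluster.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Run the mapper over every input record, yielding (key, value) tuples."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group all values by key (done by the framework between Map and Reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Combine each key's values into a smaller set of (key, result) tuples."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count: the mapper emits (word, 1), the reducer sums the counts.
def word_mapper(line):
    for word in line.split():
        yield word.lower(), 1

def sum_reducer(word, counts):
    return sum(counts)

lines = ["big data in the cloud", "the cloud scales big data"]
result = reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
assert result["big"] == 2 and result["cloud"] == 2 and result["scales"] == 1
```

Because the mapper and reducer only see one record or one key at a time, the framework is free to run many copies of each in parallel, which is the scalability property noted above.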
2.6 Honeypot Nodes
A honeypot detection technique is valuable for securing Big Data in cloud computing. A honeypot's ability to detect likely attacks and to analyze attacker activity can overcome the limitations of intrusion detection systems in the cloud. Combining a mobile-agent-based Intrusion Detection System (IDS) with honeyd, honeycomb and honeynet on an OpenStack cloud can improve the detection rate in the cloud environment (Saadi & Chaoui, 2016). The user is first authenticated using RSA encryption, and incoming and outgoing traffic is monitored by a firewall to make communication between the user and the provider more secure. The Honeywall receives traffic and separates the internal area, the demilitarized zone and the honeypot area, while attackers are attracted to the honeyd decoys in the honeypot zone. The system is able to detect suspicious activity while lowering the false positive and false negative rates, so data security is improved by using this consolidated security approach in cloud computing.
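The reason honeypots lower the false-positive rate can be shown with a minimal sketch: decoy services have no legitimate users, so any traffic that touches them is suspicious by construction. The addresses, zones and classification rules below are illustrative assumptions, not the cited system's actual configuration.

```python
# Real services legitimate users may contact, and decoys nobody should touch.
REAL_SERVICES = {"10.0.0.5:443", "10.0.0.6:22"}
HONEYPOT_DECOYS = {"10.0.0.99:22", "10.0.0.99:3306"}

def classify_connection(src_ip: str, dest: str, alerts: list) -> str:
    """Any connection to a decoy is flagged: legitimate traffic has no
    reason to reach the honeypot, so such alerts are rarely false positives."""
    if dest in HONEYPOT_DECOYS:
        alerts.append((src_ip, dest))   # record attacker activity for analysis
        return "suspicious"
    if dest in REAL_SERVICES:
        return "allowed"
    return "dropped"                    # unknown destination: firewall drops it

alerts = []
assert classify_connection("198.51.100.7", "10.0.0.5:443", alerts) == "allowed"
assert classify_connection("203.0.113.9", "10.0.0.99:22", alerts) == "suspicious"
assert alerts == [("203.0.113.9", "10.0.0.99:22")]
```

In the cited architecture the decoys are emulated by honeyd behind the Honeywall; the sketch keeps only the classification logic.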
2.7 Data Leak Prevention System (Data Classification)
Data classification is one of the methods of a Data Leak Prevention (DLP) system, controlling how security is applied and the level of protection in cloud computing. The data classification process identifies and groups data by sensitivity level using a set of characteristic parameters such as access control, content and storage, which provide security to the data (Rizwana & Sasikumar, 2015). The proposed parameters are used to analyze and classify data elements, first by their content and access-control parameters and then by their storage properties; the resulting classification enhances data security. Data can also be classified by degree of confidentiality into basic, confidential and highly confidential (Tawalbeh et al., 2015). In both approaches the classification is performed manually, with the user identifying the class of the data. Different confidentiality levels require different encryption methods, which saves time because the cryptographic algorithm is applied according to the degree of data confidentiality. The framework was tested and compared with different algorithms, and the results show lower processing time while guaranteeing the integrity and confidentiality of the data.
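The classify-then-protect idea can be sketched as follows. The parameter names, classification rules and cipher choices here are illustrative assumptions, not the cited authors' exact scheme; the point is that expensive encryption is only spent on data whose degree demands it, which is where the processing-time saving comes from.

```python
def classify(record: dict) -> str:
    """Assign a confidentiality degree from simple content and
    access-control parameters (illustrative rules)."""
    if record.get("contains_payment_info") or record.get("contains_credentials"):
        return "highly confidential"
    if record.get("access") == "restricted":
        return "confidential"
    return "basic"

# Stronger (slower) protection only where the degree demands it.
PROTECTION = {
    "basic": "no encryption",
    "confidential": "AES-128",
    "highly confidential": "AES-256",
}

records = [
    {"name": "press release", "access": "public"},
    {"name": "hr file", "access": "restricted"},
    {"name": "card numbers", "access": "restricted", "contains_payment_info": True},
]
plan = {r["name"]: PROTECTION[classify(r)] for r in records}
assert plan == {
    "press release": "no encryption",
    "hr file": "AES-128",
    "card numbers": "AES-256",
}
```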
2.8 Third Party Secure Data Publication to Cloud
In big data, cloud security refers at a larger scale to the set of policies and controls deployed to protect company data, and involving a third party in this paradigm can help secure the data. A third party can be evaluated on many characteristics, such as the algorithms used, parameters, environment, future scope and objectives (Katre, 2016). In the cloud context, where the infrastructure is hosted by third parties, guarding data confidentiality is the main goal. It is therefore important to implement clear data-management choices: original data must only be accessed by trusted parties or by the owner of the data, and in any unreliable context the data must be encrypted. Moreover, moving plain data into a cloud operated and administered by external operators must be accompanied by a high level of security and trust.
2.9 Access Control
In cloud storage, the Cloud Service Provider (CSP) cannot be unconditionally trusted, which is a constant potential threat to data security in company systems. Access control is a commonly used approach to mitigate this threat (Li, 2015). For sensitive and crucial data stored in the cloud, access control is one of the best ways to provide confidence and privacy protection. Access control is the process of mediating every request for the resources and data maintained by a system and determining whether each request should be granted or denied. Access control for cloud storage permits authorized users to access resources and refuses access to unauthorized users. It can be realized by two means: an access control model and an access mechanism. In an access control model, roles are created according to the access policies, and access to the cloud data is granted or denied by checking the role of each requesting user against their intended usage.
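A role-based check of the kind described above can be sketched in a few lines. The roles, users and operations are hypothetical; the essential behavior is that every request passes through one mediation point that grants or denies it.

```python
# Role -> operations that role's policy permits on cloud storage objects.
ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "editor": {"read", "write"},
    "admin": {"read", "write", "delete"},
}

USER_ROLES = {"alice": "admin", "bob": "viewer"}

def check_access(user: str, operation: str) -> bool:
    """Mediate every request: grant only if the user's role permits
    the requested operation; unknown users are always refused."""
    role = USER_ROLES.get(user)
    if role is None:
        return False  # unauthorized user
    return operation in ROLE_PERMISSIONS[role]

assert check_access("alice", "delete")       # admin may delete
assert check_access("bob", "read")           # viewer may read
assert not check_access("bob", "write")      # viewer may not write
assert not check_access("mallory", "read")   # unknown user refused
```

Separating the model (the two tables) from the mechanism (`check_access`) mirrors the two means of realization named in the text.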
3.0 Datasets
Datasets are becoming increasingly pertinent when assessing the performance of the cloud-scheduling, resource-allocation and load-balancing algorithms used for close examination of efficiency and performance in a real-world cloud. Assessing scheduling and allocation policies on cloud infrastructures under varying load and system size is a challenging problem. Real cloud workloads are hard to acquire for performance analysis and investigation because of users' data confidentiality and the policies maintained by Cloud Service Providers (CSPs); in addition, using real testbeds limits experiments to the scale of the testbed. Hence, testing performance against real-world datasets is crucial in research, since synthetic data does not realistically represent an actual dataset (Makonin et al., 2018). The most practical alternative is to investigate in a simulation environment under loads of varying behavior in the cloud environment. For cloud computing research, it is valuable to build and ensure the widespread availability of realistic datasets that show how effectively the cloud addresses user requirements.
3.1 Google Cloud Jobs Dataset
The GoCJ dataset (Hussain, Aleem & Khan, 2018) is provided as supplementary data in text and Excel file formats, together with the two dataset-generator files:
Each row in the text file describes the size of a specific job in terms of Millions of Instructions (MI). The Monte Carlo simulation method (Ghorbannia & Yalda, 2014) is employed to generate a dataset comprising any required number of jobs. The specification of the GoCJ dataset is presented in Table 1 (Description of the GoCJ dataset).
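The Monte Carlo generation step can be sketched as repeatedly drawing a job class by probability and then a uniform size within that class's MI range. The class ranges and mix below are illustrative assumptions for the sketch, not the published GoCJ proportions from Table 1.

```python
import random

# Illustrative job-size classes in Millions of Instructions (MI);
# ranges and weights are assumptions, not the GoCJ specification.
JOB_CLASSES = [
    ((15_000, 55_000), 0.5),     # small jobs
    ((59_000, 99_000), 0.3),     # medium jobs
    ((101_000, 135_000), 0.2),   # large jobs
]

def generate_dataset(n_jobs: int, seed: int = 42) -> list:
    """Monte Carlo generation: draw a class by its probability, then draw
    a uniform job size (in MI) within that class's range, n_jobs times."""
    rng = random.Random(seed)                 # seeded for reproducibility
    ranges, weights = zip(*JOB_CLASSES)
    jobs = []
    for _ in range(n_jobs):
        lo, hi = rng.choices(ranges, weights=weights)[0]
        jobs.append(rng.randint(lo, hi))
    return jobs

jobs = generate_dataset(1000)
assert len(jobs) == 1000                          # one row per job, as in the text file
assert all(15_000 <= j <= 135_000 for j in jobs)  # sizes stay within the class ranges
```

Writing one size per line then reproduces the text-file format described above.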
To conclude this overview: cloud computing can be seen as a distributed, hierarchical framework offering scheduled storage, virtual machine images (embedded virtual caches and transient storage), and network resource deployment that complements cloud functionality at the edge of the IoT. This work presented some key characteristics and general ideas about open challenges, such as how cloud computing can extend its services to the edge. It also clarified some use cases that motivate the need for cloud computing, especially the importance of real-time data analysis for the IIoT, typically in health care, STLS and the smart grid. Our work emphasizes the cloud's consequences and disruption in three main aspects: IoT, big data analytics, and storage.
Many research works have considered IIoT applications, and most of them are connected with cloud computing, particularly in developing markets such as manufacturing or oil and gas. Yet much research remains pending or uncertain regarding the progress of smart cities, which may combine smart buildings, STLS and the sensor-monitoring networks that are already deployed. The cloud provides services for wide-area connectivity, global coordination, heavy-duty computation and massive storage capacity, while edge-based deployments will facilitate user-centric services, edge resource pooling, rapid innovation and real-time processing (Chiang & Zhang, 2016). This transformative epoch is an interesting era in which to start thoroughly discovering what the cloud may look like and which changes will be brought to the world of virtual computing in the next 10 years.