
Essay: Assimilation of Hadoop into enterprise business intelligence and data warehousing


Shortcomings of Hadoop
• Metadata in Hadoop is stored in a single NameNode, which becomes a single point of failure for the entire environment; a much more expensive and robust server is therefore required to house the NameNode. However, the following alternative approaches can also be considered:
A) A different distribution of Hadoop, such as MapR, which fixes the NameNode problem, can be used.
B) Companies such as ZettaSet have built additional tooling around Hadoop, including NameNode high availability, without forking the Apache distribution.
C) Since this NameNode issue is specific to HDFS (the Hadoop Distributed File System), another approach is to replace HDFS with IBM's GPFS-SNC, which avoids the problem. GPFS is also POSIX (Portable Operating System Interface) compliant, which HDFS is not.
• A challenge with HDFS and the Hadoop tools is that, in their current state, they demand a fair amount of hand coding in languages that the average BI professional does not know well, namely Java, R, and Hive. To reduce the amount of MapReduce programming required for development there is Pig, which actually consists of Pig Latin (the language) and a runtime environment that executes Pig Latin code. This does not eliminate programming altogether, but Pig has been designed specifically for analysis and takes away the need to understand map and reduce functions. Hive provides a SQL interface to Hadoop, although some functions are not available and some perform very poorly. Next there is Jaql, a query language based on JSON (JavaScript Object Notation) that was donated to the open source community by IBM. In its BigInsights product, IBM offers an ANSI-compliant SQL interface that sits on top of Jaql.
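To make the hand-coding burden concrete, below is a minimal sketch (class and variable names are illustrative) of the Java required for even a trivial word count against the MapReduce API; the equivalent in Pig Latin is roughly three statements, which is exactly the gap Pig is meant to close.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split each input line into tokens and emit (word, 1) pairs;
            // the framework groups them by key before the reduce phase.
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

A matching reducer class and a driver that configures and submits the job are still needed, which is precisely the kind of boilerplate Pig and Hive hide from the analyst.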
Another associated problem is that some vendors exploit MapReduce directly with their own products, which is great for data integration but does not help with query processing.
• Hadoop is a difficult environment to manage; consider a cluster of hundreds of servers. Both the alternative distributions (MapR and so forth) and the build-around products (ZettaSet, BigInsights) aim to help here, and there is also the Apache ZooKeeper project, which provides synchronisation, configuration management, and other cross-cluster services. Money saved on hardware can quickly be spent in man-hours on configuration and maintenance; companies like Cloudera and Hortonworks have capitalised on this opportunity by offering managed Hadoop services to customers.
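As an illustration of the cross-cluster services ZooKeeper provides, here is a minimal sketch of storing one piece of configuration where every node in a cluster can read it; the ensemble address, znode path, and setting are assumed purely for the example.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ClusterConfigSketch {
        public static void main(String[] args) throws Exception {
            // Connect to the ZooKeeper ensemble (address is illustrative).
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 10000, event -> { });

            // Publish a cluster-wide setting under a znode; any node in the
            // cluster can read (or watch) the same path.
            if (zk.exists("/app-config", false) == null) {
                zk.create("/app-config", "batch.size=512".getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }
            byte[] data = zk.getData("/app-config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }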
• The present Hadoop environment is also a problem: the software tools available for Hadoop are few and lack adequate metadata management, and real-time processing is quite inefficient for operational BI and ad hoc analytic queries. Given the sheer volume of data involved, this problem is hard to avoid.
• A major problem is that architectural changes are required to integrate Hadoop with business intelligence and data warehousing. Like many analytical applications, Hadoop is sometimes deployed in a silo, even though maximal business value is achieved by integrating data from both Hadoop and the traditional data warehouse environment. Even if Hadoop starts out in a silo in an organization, integration with BI/DW is the most important step, and a successful integration requires adjustments to the existing architecture, which is part and parcel of architecting any new big data analytics system.
Advantages of Hadoop:
• Hadoop is a highly scalable storage platform. It can store and distribute very large data sets across hundreds of inexpensive servers that operate in parallel. Unlike traditional relational database management systems (RDBMS), which cannot scale to process such large amounts of data, Hadoop enables businesses to run applications on thousands of nodes over thousands of terabytes of data.
• Hadoop also offers a cost-effective storage solution for businesses' exploding data sets. The problem with traditional relational database management systems is that it is extremely cost-prohibitive to scale to the degree needed to process such massive volumes of data. In an effort to reduce costs, many companies in the past would down-sample data and classify it based on assumptions about which data was the most valuable; the raw data would be deleted, as it was too cost-prohibitive to keep. While this approach may have worked in the short term, it meant that when business priorities changed, the complete raw data set was no longer available. Hadoop, on the other hand, is designed as a scale-out architecture that can affordably store all of a company's data for later use. The cost savings are staggering: instead of thousands to tens of thousands of pounds per terabyte, Hadoop offers computing and storage capabilities for hundreds of pounds per terabyte.
• Hadoop enables businesses to easily access new data sources and tap into different types of data (both structured and unstructured) to generate value from that data. This means businesses can use Hadoop to derive valuable business insights from data sources such as social media, email conversations, or clickstream data. In addition, Hadoop can be used for a wide variety of purposes, such as log processing, recommendation systems, data warehousing, marketing campaign analysis, and fraud detection.
• The major advantages of Hadoop are automation and parallelisation. Hadoop provides strong fault tolerance, which slows processing somewhat but does not lead to system failure: even if some DataNodes fail, data is not lost, because the same data is replicated on other DataNodes.
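As a minimal sketch of how this replication is controlled through the standard HDFS Java API: dfs.replication is Hadoop's real configuration property, while the file path here is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Each HDFS block will be stored on three DataNodes, so the
            // failure of one node does not lose data.
            conf.setInt("dfs.replication", 3);

            FileSystem fs = FileSystem.get(conf);
            // Replication can also be raised per file, e.g. for a hot data set.
            fs.setReplication(new Path("/warehouse/hot/sales.csv"), (short) 5);
            fs.close();
        }
    }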
• One more advantage is component recovery: a failed node can rejoin the system without the complete system having to restart. Component failures do not affect the outcome of a job, which in turn ensures greater scalability, where increasing the resources increases the performance.
Scope of improvement in Hadoop
• Hadoop includes a number of security features, such as permission checks and access control for job queues. Service-level authorization is the initial authorization mechanism, ensuring that clients connecting to a particular Hadoop service have the necessary permissions. Add-on products that provide encryption or other security measures are available for Hadoop from a few third-party vendors. Even so, there is a need for more granular security at the table level in HBase, Hive, and HCatalog.
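For example, service-level authorization and Kerberos authentication are switched on through Hadoop's configuration; the sketch below sets the real property names via the Configuration API, although in practice they normally live in core-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class SecuritySketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Enforce the service-level ACLs defined in hadoop-policy.xml.
            conf.setBoolean("hadoop.security.authorization", true);
            // Require Kerberos rather than trusting the client-supplied user.
            conf.set("hadoop.security.authentication", "kerberos");
            // Apply the security settings to this client's context.
            UserGroupInformation.setConfiguration(conf);
        }
    }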
• Much of Hadoop's evolution is happening at the tool level rather than in the HDFS platform itself. After security, users' most pressing need is for better administrative tools, especially for cluster deployment and maintenance.
• HDFS has a good reputation for reliability thanks to the redundancy and failover mechanisms of the cluster on which it sits. However, HDFS is currently not a high-availability system, because its architecture centers on the NameNode: the permanent loss of NameNode data renders the cluster's HDFS inoperable, even after the NameNode is restarted. There are the Hadoop SecondaryNameNode (which provides a partial, latent backup of the NameNode) and third-party patches, but these fall short of true high availability.
• Hadoop faces serious latency issues, and improvements are needed to overcome the data latency of its batch-oriented processing. Hadoop should support real-time operation, fast query execution, and streaming data.
• Better development tools could also improve Hadoop performance to a great extent, in areas like metadata management, query design, and higher-level approaches that require less hand coding.
Notable ways in which organizations use Hadoop:
• Online travel: Cloudera's Hadoop distribution currently powers about 80 percent of all online travel booked worldwide.
• Mobile data: Hadoop plays a role in the storage and processing of mobile data for wireless providers, and a little market-share math could probably help pinpoint the customers.
• E-commerce: one large retailer (likely eBay, a major Hadoop user that manages a large marketplace of individual sellers, which would help account for those 10-plus million merchants) added 3 percent to its net profits after using Hadoop for just 90 days.
• Energy discovery: Hadoop is used to sort and process data from ships that trawl the ocean collecting seismic data that might signify the presence of oil reserves.
• Energy savings: companies like Opower use Hadoop to power services that suggest ways for consumers to save money on energy bills. Certain capabilities, such as accurate long-term bill forecasting, were hardly feasible without Hadoop.
• Infrastructure management: this is a rather common use case, as more companies (including Etsy) gather and analyze data from their servers, switches, and other IT gear.
• Image processing: a startup called Skybox Imaging uses Hadoop to store and process the high-definition images its satellites regularly capture as they attempt to detect patterns of geographic change. Skybox recently raised $70 million for its efforts.
• Fraud detection: Hadoop is used by both financial services organizations and intelligence agencies. One such user, Zions Bancorporation, has explained that moving to Hadoop lets it store all the data it can on customer transactions and spot anomalies that might suggest fraudulent behavior.
• IT security: companies also use Hadoop to process machine-generated data that can identify malware and cyber-attack patterns. IpTrust, for example, uses Hadoop to assign reputation scores to IP addresses, which lets other security products decide whether to accept traffic from those sources.
• Health care: Apixio uses Hadoop to power a service that leverages semantic analysis to give doctors, nurses, and others more relevant answers to their questions about patients' health.
Hadoop, big data, and enterprise business intelligence
Most business leaders believed that big data would revolutionize business operations in a tremendous way, but nearly a decade into the big data era only a few businesses believe they are generating strategic value from the data they collect, mostly because a plan is needed to take advantage of big data. Existing BI structures are not flexible enough: most organizations take too long to reach the ultimate goal of a centralized BI environment, and by the time they think they are done, there are new data sources, new regulations, and new customer needs, all of which require further changes to the BI environment.
A few use cases where a big data technology like Hadoop can help a BI environment are described below.
[Note: many of these big data use cases are still being developed.]
Stage 1: Hadoop for data staging and ODS
In some organizations, the BI/DW team is looking to use Hadoop to simplify, accelerate, and enhance existing ETL and data staging processes. Hadoop brings at least two significant advantages to ETL and data staging. The first is the ability to ingest massive amounts of data as-is, meaning that you do not need to pre-define the data schema before loading data into Hadoop. This includes traditional transactional data (e.g., point-of-sale transactions, call detail records, general ledger transactions, call center transactions), but also unstructured internal data (such as consumer comments, doctors' notes, insurance claims descriptions, and web logs) and external social media data (from sites such as LinkedIn, Pinterest, Facebook, and Twitter). So regardless of the structure of your incoming data, you can rapidly load it all into Hadoop, as-is, where it then becomes available for your downstream ETL, DW, and analytic processes (see Figure a below).
Figure a: data of any structure is loaded as-is into Hadoop and made available to downstream ETL, DW, and analytic processes.
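A minimal sketch of this as-is ingestion through the HDFS FileSystem API follows; no schema is declared up front, and the staging and landing paths are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RawIngestSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Raw files are copied byte-for-byte into a landing directory and
            // only interpreted later, at read time ("schema on read").
            fs.copyFromLocalFile(new Path("/staging/weblogs/day01.log"),
                                 new Path("/landing/weblogs/"));
            fs.copyFromLocalFile(new Path("/staging/pos/transactions.csv"),
                                 new Path("/landing/pos/"));
            fs.close();
        }
    }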
The second advantage that Hadoop brings to your BI/DW architecture appears once the data is in the Hadoop environment. Once it is in your Hadoop ODS, you can leverage the inherently parallel nature of Hadoop to perform the traditional ETL work of cleansing, normalizing, aligning, and creating aggregates for your EDW at massive scale.
Many ETL vendors, such as Pentaho, Talend, and Datameer, are modifying their products to seamlessly generate parallel MapReduce ETL jobs. These vendors provide drag-and-drop UIs for building such jobs, removing a great deal of complexity for the data integrator/developer.
Once the raw data is in the Hadoop environment, developers can perform data transformations and enrichments that were not easy to do before, including:
A) Parse complex, unstructured data feeds (like consumer comments, web logs, and Twitter feeds) to capture key data and metrics of importance (e.g., visitor id, session id, site id, display ad id, display ad location) that can then be integrated with the existing structured data in your EDW (a mapper sketch for this follows the list). For example, imagine the business possibilities if you were able to mine the wealth of social media data on your customers to identify their interests, passions, associations, and affiliations, and then integrate those new customer insights with the existing customer data that is meticulously maintained in your CRM system.
B) Create advanced composite metrics that require processing many days, weeks, or even months of history. Creating metrics around frequency, recency, and sequencing is possible in a Hadoop environment in a way that is not easily done within a traditional ETL environment. These composite metrics facilitate more detailed and complex analysis, yielding key performance indicators that may be better indicators and/or predictors of performance.
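Below is a minimal mapper sketch for the kind of web log parsing described in item A; the pipe-delimited log layout (visitorId|sessionId|siteId|adId|adLocation) is an assumption made purely for illustration.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WebLogParseMapper
            extends Mapper<LongWritable, Text, Text, Text> {

        private final Text visitorId = new Text();
        private final Text detail = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed layout: visitorId|sessionId|siteId|adId|adLocation
            String[] fields = value.toString().split("\\|");
            if (fields.length < 5) {
                return; // skip malformed lines rather than failing the job
            }
            visitorId.set(fields[0]);
            detail.set(fields[1] + "," + fields[2] + ","
                       + fields[3] + "," + fields[4]);
            // Keyed by visitor id so a downstream reducer can join these
            // events with the structured customer data held in the EDW.
            context.write(visitorId, detail);
        }
    }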
Stage 2: Hadoop for the data warehouse
After all of this parsing of unstructured data for new customer and business metrics, and the enrichment to create new composite metrics, the output of the Hadoop ODS can then feed your standard EDW. The Hadoop ODS has the advantage of being able to create structure out of unstructured data, which can then be integrated with your existing structured, transactional data in the EDW (see Figure b).
Figure b: structured output from the Hadoop ODS feeds the EDW.
Because of Hadoop's massive scalability and its ability to ingest huge amounts of data quickly, you can significantly accelerate traditional ETL processes and more easily meet your EDW SLAs. Not only can you shrink the latency between when a transaction or event occurs and when it is available in your EDW, but you can also provide more granular, detailed data in your EDW, especially if your EDW is based on an MPP architecture (which reduces the need for indices, aggregate tables, and materialized views, thereby saving even more EDW loading time and management effort).
Stage 3: Hadoop for analytic sandbox
The third stage of Hadoop as your ODS supports the self-provisioning and the rapid, frequent iterations typically required of an analytics sandbox environment. In this environment, data scientists can grab whatever data they need out of the Hadoop ODS without worrying about impacting the EDW environment. They can select whatever level of granularity they need, from whichever data sources they need, in order to build, test, refine, and publish their analytic results (see Figure c).
Figure c: data scientists self-provision data from the Hadoop ODS for the analytic sandbox.
With the Hadoop ODS, data scientists have the ability to store and access data they "might" need from an analytics perspective; data that may never find its way into the EDW. For example, data scientists might want to store large amounts of social media or web log data, or a wide variety of freely available third-party data (from places like data.gov), to enhance analytic modeling. They would grab this data out of the Hadoop ODS as and when they needed it, depending on their current analytic needs.
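A minimal sketch of such self-provisioned access, assuming the Hadoop ODS is exposed through HiveServer2's JDBC interface; the host, database, table, and user names are illustrative.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SandboxQuerySketch {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // HiveServer2 endpoint for the sandbox database (names assumed).
            String url = "jdbc:hive2://hive-host:10000/sandbox";
            try (Connection conn = DriverManager.getConnection(url, "analyst", "");
                 Statement stmt = conn.createStatement()) {
                // Pull only the slice of the ODS needed for the current model,
                // without touching the production EDW.
                ResultSet rs = stmt.executeQuery(
                    "SELECT visitor_id, COUNT(*) AS visits "
                    + "FROM weblog_events GROUP BY visitor_id");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }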
 
Sources:
• http://www.bloorresearch.com/analysis/problems-hadoop/
• http://www.sas.com/content/dam/SAS/en_us/doc/whitepaper2/integrating-hadoop-business-intelligence-datawarehousing-106436.pdf
• http://www.ijser.org/researchpaper/Advantages-of-Hadoop.pdf
• http://www.itproportal.com/2013/12/20/big-data-5-major-advantages-of-hadoop/
• https://gigaom.com/2012/06/05/10-ways-companies-are-using-hadoop-to-do-more-than-serve-ads/
• https://infocus.emc.com/william_schmarzo/understanding-the-role-of-hadoop-in-your-bi-environment/
