Hadoop and Big Data: Past, Present, Future

2013-10-07T12:18:11+00:00 October 7th, 2013|Uncategorized|

Every successful technology goes through several cycles of invention, discovery, socialization, adoption and continuous improvement.  Hadoop is no exception.  It has been embraced by early adopters and is now in the “discovery path” for other customers and vendors.  The adoption is well supported by third party vendors who have customized and extended their product offerings with their own Hadoop distributions and implementations to help customers adopt the new technology.

Hadoop’s roots go back several years to Google, which published papers describing the MapReduce programming model and distributed file system it had built for its own internal needs.  Hadoop is the open-source implementation of those ideas, created by Doug Cutting and subsequently championed by Yahoo.  The goal was to spread computing logic across thousands of commodity hardware systems, leveraging parallel processing.
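
The MapReduce model itself is simple to sketch.  The toy Python below is not Hadoop’s actual API, just an in-process illustration of the idea: the map phase emits (key, value) pairs independently for each input (so it can run in parallel across nodes), and the reduce phase aggregates values by key.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split.
    # Each call is independent, so splits can be processed in parallel.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/reduce: group the emitted pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["Hadoop scales out", "Hadoop uses MapReduce"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(pairs))  # {'hadoop': 2, 'scales': 1, 'out': 1, 'uses': 1, 'mapreduce': 1}
```

In real Hadoop the framework handles splitting the input, shipping the map logic to the nodes holding the data, and sorting mapper output by key before the reducers run.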

Organizations are trying to craft and fit Big Data into their overall IT strategy, particularly for their analytical platforms and data warehouses.  It is well known and proven that Hadoop can be used for clickstream analysis and social media analytics, as companies such as Google, Facebook and Yahoo have been using it for years.  But using these new technologies for analytical and data warehousing systems is something new and still in the discovery phase, which is why Gartner still places “Big Data” technology in the “hype” stage of its Hype Cycle.

Since this technology is new, many customers want to perform proof-of-concept (POC) projects to understand how it can be leveraged to analyze enormous volumes of structured and unstructured data, analysis they once thought impossible.  With the cost effectiveness of Hadoop on cheap commodity hardware, such analysis is now within reach.

With this paradigm shift, many vendors have come up with their own ways to leverage Big Data.  Below are a few of the major players that have customized their product(s), or developed new products, to help their customers leverage the evolving capabilities of Big Data technology.

  • Talend has made available their Talend Platform for Big Data that not only allows customers to source data from a Hadoop system, but also has reporting capabilities to visualize data directly from Hadoop without the need for external integration to a data warehouse.
  • MicroStrategy can transparently analyze data stored in Hadoop distributions such as Cloudera, Hortonworks and even Amazon’s EMR, using its Thrift connector to Hadoop.  This allows business users to analyze Big Data with a simple drag and drop of attributes and metrics to visualize data.  Behind the scenes, it generates HiveQL to query data from Hadoop.  It can also query data directly from Cloudera’s Impala and SAP’s in-memory HANA database.
  • Oracle can store and manage Big Data through their Big Data Appliance powered by Cloudera.  Oracle also has tight integration with Hadoop through Big Data connectors that enable SQL access to data on Hadoop directly from the Oracle database.
  • IBM offers its InfoSphere products, which help customers analyze petabytes of Big Data with relatively minor effort through pre-integrated hardware.
  • Teradata includes additional functionality that allows seamless integration of data on a Hadoop cluster with the data warehouse.
  • SAP extends its real time data platform and in-memory database HANA with integration to Hadoop. It can read from and load data to Hive and HDFS, perform rapid batch updating and loading to SAP HANA, Sybase IQ server and any other data store. It supports text data processing by analyzing web logs, social media and relevant content directly from files.
  • Microsoft has Windows working with the Hadoop distribution from Hortonworks.  They now offer a Windows Azure HDInsight service that provides Hadoop in the cloud.  That innovation allows users to build and take down a Hadoop cluster in minutes.  Organizations can analyze Hadoop data in the cloud with PowerPivot, Power View and other Microsoft BI tools.
  • EMC has its own Hadoop distribution, Pivotal HD.  They recently announced HAWQ, which integrates a Hadoop cluster within their MPP appliance, providing ultimate portability for data between the relational MPP database and a Hadoop cluster.  This has been a big challenge with other Hadoop distributions.  EMC also has a SQL interface and various tools to query unstructured data on Hadoop via SQL.

Things to Consider

It seems almost all data warehousing vendors have pieces of the Big Data ecosystem by either direct integration with Hadoop or through implementing their own Hadoop distributions in their appliances.

While Hadoop is primarily used for processing and making sense of structured and unstructured data such as clickstream, user comments and media content, organizations are now finding compelling reasons to use Hadoop in their data warehouse landscape, as mentioned earlier.

This has become a compelling use case for many customers: processing huge data volumes that used to take hours can now be done in minutes.  In a data warehouse environment this has a tremendous benefit, as source system data can be processed in parallel and brought down to a manageable size before it is loaded into a data warehouse or even a data warehouse appliance.
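
As a concrete sketch of that pre-aggregation step: Hadoop Streaming lets mappers and reducers be written as plain scripts that read and write lines on stdin/stdout.  The Python below (the field positions and the customer/amount schema are illustrative assumptions, not from any specific system) rolls raw transaction lines up to one total per customer before the result is loaded into the warehouse.

```python
import sys
from itertools import groupby

def mapper(lines):
    # Emit "customer_id<TAB>amount" for each raw transaction line.
    # The comma-separated layout (id, date, amount) is a hypothetical extract format.
    for line in lines:
        fields = line.rstrip("\n").split(",")
        customer_id, amount = fields[0], fields[2]
        yield f"{customer_id}\t{amount}"

def reducer(lines):
    # Hadoop sorts mapper output by key before the reduce phase,
    # so lines with equal keys arrive grouped together.
    keyed = (line.split("\t") for line in lines)
    for customer_id, group in groupby(keyed, key=lambda kv: kv[0]):
        total = sum(float(amount) for _, amount in group)
        yield f"{customer_id}\t{total}"

if __name__ == "__main__":
    # Under Hadoop Streaming the framework supplies the sort between phases;
    # locally, the same pipeline can be simulated with sorted().
    for out in reducer(sorted(mapper(sys.stdin))):
        print(out)
```

On a cluster the same two functions would be wired up with the streaming jar (`hadoop jar hadoop-streaming.jar -mapper ... -reducer ...`), letting the heavy reduction run in parallel across the nodes before the small aggregate lands in the warehouse.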

With several vendors releasing data integration interfaces to data on a Hadoop file system, it is becoming rather simple to bring data in and out of a Hadoop cluster to the data warehouse for analytics.

Business Intelligence (BI) vendors are also building interfaces to connect directly with data on Hadoop file system, thus eliminating the need to bring data out of Hadoop file system.  With promising data visualization products, customers can quickly pair Big Data with visual analytics to present the best data to business users on their quest to answer important business questions.

Another less exciting but very effective use of Hadoop is “data archival.”  Organizations face a constant archiving problem due to continually growing data volumes in both source systems and the data warehouse.  The challenges lie both in archiving the data and in making sure it is accessible, upon request, in a reasonable timeframe.  The best approach until now has been to archive source system data in flat files on a regular file system, rather than in a relational database management system (RDBMS), so that data can be archived in increments of files rather than in huge tables that become inaccessible later.

Within a Hadoop environment, those very same flat files can be archived on the Hadoop file system and still remain quickly accessible through simple, tight integration between Hadoop and the regular data warehouse.  Plus, as mentioned earlier, there are now several BI tools that can query data directly out of the Hadoop distributed file system.
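
A minimal sketch of that incremental archival layout follows.  The date-partitioned `dt=YYYY-MM-DD` path style is a common Hadoop/Hive convention, assumed here for illustration; the code writes each day’s extract as its own independent flat file, which could then be pushed to HDFS (e.g. with `hadoop fs -put`) without touching earlier increments.

```python
import csv
import os

def archive_increment(base_dir, table, snapshot_date, rows, header):
    # Write one day's extract as its own flat file under a
    # date-partitioned directory, e.g. <base>/orders/dt=2013-10-07/part-0000.csv.
    # Each increment is self-contained, so old data stays retrievable file by file.
    part_dir = os.path.join(base_dir, table, f"dt={snapshot_date.isoformat()}")
    os.makedirs(part_dir, exist_ok=True)
    path = os.path.join(part_dir, "part-0000.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path
```

Because every snapshot lives in its own partition directory, restoring one day’s data is a single file read rather than a scan over one huge archived table.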

Similar to application or source systems leveraging Hadoop, data warehouses can also leverage Hadoop in their data staging or preparation area to process complex logic if there is a need to crunch huge data volumes in minutes instead of hours.

Data warehouses can also benefit tremendously from the new technology if there is an analytical need that can be served by BI tools with the capability to query the underlying Hadoop file system directly.

With NoSQL databases in combination with Hadoop, organizations can combine structured data with semi-structured or unstructured data within the same environment without having to worry about the size or growth of such data.  Columnar databases such as HP Vertica and ParAccel PADB can run on hundreds of nodes of commodity hardware with high data compression, providing amazing MPP query performance.

Another promising use of Hadoop combined with NoSQL databases is helping organizations build something rapidly, making data available for business decisions in far less time than building a data warehouse or data marts, which take a lot of up-front time to deliver.  This approach will not produce the optimized solution with perfect data that only a properly built data warehouse with data quality can deliver, but it lets users see some quick results.  It can certainly help keep up with business needs until the final “optimized” solution with “perfect” data is delivered.

This rapid approach is fueled by languages such as Pig, which is not as efficient as Java MapReduce but gets the work done in half the development time: an additional technique to help organizations build something fast and then come back and optimize later.

More and more organizations are looking to use predictive analytics to anticipate the behavior of their customers with the help of statistical languages such as R; with NoSQL and Hadoop, such analysis can now be performed on millions of records.
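
As a toy illustration of the kind of model involved (in practice this would run in R or a distributed library over Hadoop-scale data, and the numbers below are invented), ordinary least squares fits a trend line that can then score new records:

```python
def fit_trend(xs, ys):
    # Ordinary least squares for one predictor:
    # slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Hypothetical data: monthly site visits vs. purchases for a customer segment.
visits = [10, 20, 30, 40]
purchases = [1, 2, 3, 4]
slope, intercept = fit_trend(visits, purchases)
print(slope * 50 + intercept)  # predicted purchases at 50 visits -> 5.0
```

The value of Hadoop here is not the formula, which is trivial, but that the sums in it can be computed in parallel over millions of records as map and reduce steps.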

There is also a challenge to consider.  The Hadoop Distributed File System (HDFS) uses a shared-nothing architecture similar to MPP appliances, so expect the same issues with Hadoop as with MPP, such as limited concurrency and high latency.  The processing power of Hadoop is only realized when there is a genuine need to process massive amounts of data; remember that every job carries initial startup overhead.  Because of the shared-nothing architecture, organizations can simply run clusters of different sizes and node configurations for different requirements to manage concurrency needs.

So What’s Next?

With several offerings from various vendors, it is a daunting task for organizations to decide which platform, or combination of platforms, would help them leverage Big Data projects efficiently.  One logical path is to review your existing technologies and vendors and then pick the vendor that has already made a significant effort in the Big Data arena.  You can hopefully leverage existing platform work with minimal fuss regarding Hadoop and other Big Data technologies.

Setting up a Hadoop cluster is certainly not free, and there are various costs associated with its establishment.  But the costs are incremental, as nodes in a cluster are added and removed as needed.  This makes the platform extremely scalable and flexible.  Cloudera estimates anywhere from $3,000 to $7,000 per node, depending on what is used and how each node in a cluster is configured.

Hadoop has numerous limitations.  It is not well suited to low-latency queries.  Security is still a concern.  Some have concerns about open source technologies in general.  There is also a steep learning curve, since much of the ecosystem departs from familiar SQL tooling.

It is important to consider everything, and specifically to have a use case or need, before jumping on the Hadoop Big Data bandwagon.  But it looks like Hadoop is going to be in the future for many of you…and it’s here to stay!