The Data Storm and Hadoop: An Introduction

Insight Post | June 28th, 2014

If you have any doubts about the data flood that is covering the globe, here are a few amazing stats.  Around the world, in just one minute…

  • 200 million+ emails are sent
  • 4 million Google searches are performed
  • 250,000 Facebook shares take place
  • 48,000 Apple apps are downloaded
  • Approximately 600 new websites are launched
  • 100 hours of YouTube video are uploaded
  • 1.3 million YouTube videos are viewed
  • 6 million Facebook pages are viewed

This huge surge of data, only superficially measured by the above metrics, is changing everything. Consumers see it in the form of recommendations, gift ideas and tightly targeted email marketing campaigns.  Businesses push hard for more and more data to drive value in energy exploration, predictive and breakthrough healthcare, breathtaking methods of fraud detection and methodologies for capturing our purchase sentiments and behavior.

The flood is in full force and it’s not likely to end any time soon.

Traditional enterprise data is still growing rapidly, but the biggest influx now comes from newer sources like social networks, blogs, chats, online product reviews, web pages, emails, documents, images, videos, music and sensors.  Just reading that list feels unstructured and chaotic, and it is exactly that.  Today's data, in all its forms, does not fit into the neat, structured world of the past.

This new collection of all kinds and sizes of data has been collectively coined Big Data.  It might be better named Huge, Fast, Unstructured and Overwhelming Data.  But for now, Big Data will do.

Interesting, But Now What?

So what is the challenge today?  Business users and IT are trying to figure out how to unite and process Big Data.  How do you connect all the dots and find meaningful connections?  How do you acquire new analytical insights that provide value to the business and its growth?  Who in your company is going to figure all this out?

Luckily, thanks to advances in the technology itself, new methodologies, platforms and ecosystems can help you store, process and analyze this data.  New applications, architectures and networks help you tie everything together.  A new breed of business analysts, data scientists and consultants is helping build the roadmaps to new business insights.

The Gateway to the Big Data Playground

There are many pieces to a Big Data infrastructure, but one of the widely recognized workhorses of the movement is Hadoop.  It’s an odd name but it is quite likely you have heard it pop up in conversations throughout your organization.

At its simplest level, Hadoop consists of two primary components: MapReduce and the Hadoop Distributed File System (HDFS).  MapReduce is the parallel-processing engine that allows Hadoop to churn through large data sets very quickly.  HDFS is a file system that lets Hadoop distribute and scale across low-cost servers, storing data on multiple compute nodes to boost performance (and usually save money).
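To make the two-phase idea concrete, here is a minimal, local-only sketch of MapReduce-style word counting. This is illustrative code, not Hadoop's actual API: on a real cluster the map and reduce steps would run in parallel across many nodes, with Hadoop's shuffle/sort step handling the grouping between them.

```python
# A toy illustration of the MapReduce flow, runnable in one process.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each word. Input must be grouped by key,
    which Hadoop's shuffle/sort guarantees between the two phases; here we
    simply sort in memory."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data big insights", "data drives insights"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'drives': 1, 'insights': 2}
```

The point of the split is that the map step touches each record independently, so Hadoop can scatter the work across as many machines as hold the data.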

Hadoop was created because existing approaches were inadequate to process and store huge amounts of data. Its roots lie in the challenge of indexing the entire World Wide Web every day. Google published a paper describing its MapReduce paradigm in 2004. Doug Cutting and Mike Cafarella, then working on the open-source Nutch web crawler, built an implementation of those ideas, and the work was spun out as the Apache Hadoop project in 2006, with Yahoo! becoming one of its biggest early contributors and users.

To ease deployment and management of Hadoop clusters compared to downloading the open-source Hadoop code bases and stitching everything together yourself, various companies have made commercial distributions available.  This article is not going to discuss all these vendors, but it's where you'll hear names like IBM BigInsights, Cloudera and Hortonworks, among numerous others.  These distributions integrate with various data warehouses, databases and other data management products, all with the goal of moving data between Hadoop clusters and other environments.

Is Hadoop the Path to Value from Big Data?

So now you understand that Hadoop is a popular new environment, built on ideas published by Google, for crunching through huge amounts of data. Should your CEO, CFO and other executives even care about this?

There are some things to consider, like how soon Big Data will affect your company and whether Hadoop is the right way to unlock its value.  In some industries, the way that Big Data will create value is not always clear. Only knowledgeable professionals with an intimate understanding of the business, and of the data it generates and can collect, will be able to find the business insights and value in their Big Data. Once you decide Big Data will make an impact, the next questions might be: what is a reasonable amount to spend, and should we consider starting with a proof of concept?

Here are a few key questions to consider:

  • How much large-scale data is coming at our business and from what sources?
  • How can we tell what that information is worth to us?
  • What are our competitors doing with Big Data?
  • What are some of the possible business insight questions we will be able to answer that we can’t answer now?
  • What decision making can be improved with more information?
  • What processes could be re-engineered and improved if we knew more?  Would that save us money?
  • What new applications will we be able to create that will help the business through increased revenue, decreased costs or improved customer satisfaction?
  • Do we have employees with the right skills to help us be more data centric in our decisions?

The big issue is not whether or not you should use Hadoop, but rather “what use should we make of data, big and small, to help the business?”

The right way to talk about Hadoop to a CEO is not to talk about it at all until you have made a strong case that the analysis of Big Data is going to be important to your business. It is likely that Hadoop will be one of the ways you get value out of Big Data.

Data Warehouse Augmentation

If your company is on the fence about Big Data overall, there is another angle from which to learn and use Hadoop.  There is growing interest in using Hadoop to relieve the increased pressures on your existing data warehouse.  For many companies, the impact of Big Data has made it nearly impossible for a single platform to meet all of the company's data warehouse needs.  Hadoop will not replace relational databases or traditional data warehouse platforms, but its remarkable price/performance gives you an option to lower costs while keeping your existing reporting and analytics infrastructure.

One of the best places to try Hadoop is at the front end with your ETL processes.  Or consider some testing at the back end where you can use Hadoop to create active archives.  There could be other activities, depending on your situation, where Hadoop’s parallel processing powers for both structured and unstructured data could improve your existing data warehouse performance.
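As a concrete illustration of the ETL idea, here is a sketch of the kind of per-record transform you might offload to Hadoop. The pipe-delimited order format and field names are hypothetical; the point is that the transform handles each record independently, which is exactly the shape of work Hadoop (for example via Hadoop Streaming, reading stdin and writing stdout) can parallelize across a cluster before the cleaned output is loaded into the warehouse.

```python
# Illustrative only: an ETL-style cleansing step on hypothetical
# pipe-delimited order records (order_id|customer|amount).
def transform(line):
    """Drop malformed records and normalize the ones we keep."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != 3:
        return None  # wrong field count: filtered out, as ETL cleansing would do
    order_id, customer, amount = fields
    try:
        cents = int(round(float(amount) * 100))  # store money as integer cents
    except ValueError:
        return None  # unparseable amount: filtered out
    return f"{order_id},{customer.strip().upper()},{cents}"

raw_feed = [
    "1001| acme |19.99",
    "1002|Globex|5.00",
    "garbage line",        # malformed: dropped
    "1003|Initech|oops",   # bad amount: dropped
]
for row in filter(None, map(transform, raw_feed)):
    print(row)  # 1001,ACME,1999 then 1002,GLOBEX,500
```

Running this logic inside the cluster means the heavy parsing and filtering happens on cheap Hadoop nodes, and only clean, load-ready rows reach the warehouse.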

Healthy Skepticism

Most executives start with a skeptical attitude about new technology, for good reasons.  They've seen the failures and watched horror stories unfold. They know that their technology team loves to play with the latest and greatest technologies. They know that some IT investments blow up as often as they work out. Gone are the days of investing in technology just because it sounds cool.

Experienced and thoughtful management will look for a realistic path to measured business value, along with the outlines of a plan for broad adoption if all goes well. The path to business value should describe the initial questions that will be answered by the technology, the processes that will be improved and the decisions that will be made better with more information. The suspected business impacts should be defined.

Then just go for it.  Dip your toe in the flood waters and see what happens.