Big Data and Cloud Data Warehousing Knowledge: Do You Have the Technical Skills You Need?

Insight Post | September 6th, 2013

by Bryan Hearron

 

BI professionals are used to working with a wide range of products and platforms, and typically carry a substantial tool belt that lets them work across many different technologies.

Over the past couple of months I took the opportunity to experiment with technologies that are entering the data warehousing ecosystem: the Cloudera sandbox, the Hortonworks sandbox, the IBM BigInsights sandbox, and Amazon Redshift. One goal of this evaluation was to answer a question for co-workers and clients: what skills do you need in your tool belt to be an outstanding technologist across these emerging technologies right now?

To be clear, there are technologies new to the BI world, like MapReduce, HDFS, HCatalog, Pig, and Hive. However, my focus was on the underpinning skills that allow a smooth transition to these emerging technologies. I welcome your feedback.

Set-Based Processing

Understanding the advantages and disadvantages of set-based processing is fundamental to working with tremendous amounts of data. Singleton processing is good for OLTP systems, but in the Big Data world you have to think in terms of "mounds of data": how do I move these mounds from point A to point B (and maybe even C, D, and so on)? Knowing how to work with large amounts of data using set-based methods is tremendously important across each of these emerging technologies.
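The contrast is easy to demonstrate on any relational store. The sketch below uses SQLite (standing in for a warehouse engine) with a made-up `sales` table: first a singleton loop issues one UPDATE per row, then a single set-based UPDATE describes the whole mound of work at once.

```python
import sqlite3
import time

# Hypothetical table for illustration: apply a 10% increase to every row.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
cur.executemany("INSERT INTO sales (amount) VALUES (?)",
                [(float(i),) for i in range(100_000)])

# Singleton approach: one statement per row, 100,000 round trips.
start = time.perf_counter()
for (row_id,) in cur.execute("SELECT id FROM sales").fetchall():
    conn.execute("UPDATE sales SET amount = amount * 1.10 WHERE id = ?", (row_id,))
singleton_secs = time.perf_counter() - start

# Set-based approach: one statement describes the whole set of work.
start = time.perf_counter()
conn.execute("UPDATE sales SET amount = amount * 1.10")
set_based_secs = time.perf_counter() - start

print(f"singleton: {singleton_secs:.3f}s  set-based: {set_based_secs:.3f}s")
```

Even on a toy in-memory database the set-based statement wins by a wide margin; on a distributed engine the gap only grows, because a single set-based statement gives the engine the whole problem to plan and parallelize.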

SQL

A solid foundation in SQL is key to moving to Hive, Impala's SQL dialect, and Hortonworks' Stinger Initiative. If you know SQL, you can quickly work your way to a firm understanding of each. Additionally, set-based processing on Redshift leans heavily on SQL skills to manage and analyze large datasets.
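This portability is concrete: the aggregate patterns you already know carry over almost verbatim. The sketch below runs a standard GROUP BY query against SQLite (table and column names are invented for illustration); the same SELECT would run essentially unchanged as HiveQL or on Redshift.

```python
import sqlite3

# Hypothetical page_views table, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_views (region TEXT, views INTEGER);
    INSERT INTO page_views VALUES
        ('east', 120), ('east', 80), ('west', 200), ('west', 50);
""")

# Plain ANSI-style SQL: SELECT / SUM / GROUP BY / ORDER BY all transfer
# directly to Hive and Redshift.
rows = conn.execute("""
    SELECT region, SUM(views) AS total_views
    FROM page_views
    GROUP BY region
    ORDER BY total_views DESC
""").fetchall()

print(rows)  # [('west', 250), ('east', 200)]
```

What changes between engines is mostly around the edges (DDL, storage formats, UDFs), not the core query vocabulary, which is why existing SQL skill transfers so quickly.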

Virtualization

I had only used virtual machines a little in the past, but I am now a big fan of the virtual-machine sandbox approach. I used both VMware Player and Oracle's VirtualBox. VirtualBox worked great with Hortonworks, but I found VMware to be a better fit for Cloudera (YMMV). For some reason VirtualBox ran into graphical lag issues with the Cloudera sandbox, and rather than spend too much time troubleshooting, I simply rebuilt it in VMware Player.

All of the experimental sandboxes I played with were easily implemented within a virtual machine. Being comfortable working with virtualization software will keep the frustration level down and allow you to focus on the new technology you are surveying. Additionally, virtualization keeps the software bloat to a minimum on your base operating system.

Command-Line Linux

Many BI solutions are mature products that have for years provided nice user interfaces encapsulating SQL generators and ETL transformations. However, the training materials I surveyed all require some knowledge of command-line Linux. Understanding the Linux profile structure, navigating directories, and checking for running or dormant processes are good baseline skills.

Linux Operating Systems

Every sandbox I experimented with was based on either Red Hat Enterprise Linux (RHEL) or CentOS. Being able to check your IP address, open a terminal, and know what you should and shouldn't do with root are also good baseline skills.

From this early experimentation, I came up with the points above.

Are there any additional skills you feel are valuable to BI professionals as many of us move into the Big Data arena?