Is Your Data Warehouse Aging Gracefully?

Chris Jordan

/ 2014-04-01

How old is your data warehouse? It’s a simple question and probably one you don’t think about much. The majority of production data warehouses are now 15-20 years old and probably very transactional centric. Over the years, you’ve probably remodeled “the house” more than a few times—adding some “rooms” and “upgrades” here and there. It’s starting to feel its age as more Business Intelligence requirements have been added, including Mobile applications and specialized analytics. And more and more ideas seem to show up in your inbox every day, especially Big Data questions.

Just because things are feeling a little stretched, you don’t just throw your existing and mature investment away. The good news is that with today’s technology, often related to the surge in Big Data technologies and ideas, there are new architectures to modernize these “grey-haired data veterans” and make them more responsive to business needs and in super-charged time-frames.

Some of the technology/platform shifts relate to the way companies are now analyzing information and enabling insights you didn’t think of a decade or two ago. This article gives you several ideas to think about as you look at the aging of your data warehouse and the new technologies that help you modernize those investments.

Idea #1: IBM’s DB2 Analytics Accelerator

Some of you are IBM shops with your data warehouse primarily built around DB2. If you’re not, skip to idea #2. This first idea is all about the IBM environment.

The IBM DB2 Analytics Accelerator (IDAA) is a workload-optimized appliance that enables companies to integrate strategic business data into operational processes. By linking z/OS and the IDAA hardware-accelerated analytics, speeds for complex queries are delivered in unprecedented response times—and in a highly secure and available environment.

This kind of superior performance and scalability, with rapid appliance deployment, provides an ideal solution for complex analysis.

A key feature is its transparent integration into DB2, which allows running unchanged workload and queries that automatically benefit from faster response times and reduced load on the mainframe central processors.

IDAA integrates behind the application layer of a DB2 for z/OS environment, which results in important performance benefits:

Acceleration of current workload and speed increase (and less frustration) for the people running reports
Ability to analyze large quantities of data quickly
Off-loading of long-running CPU-consuming queries to free up main processors, giving the system more capacity without increasing costs
No changes needed to the applications nor queries which means no training issues as well
No need for DBA and data design work in analysis, index design, query rewriting, and so on
Investing time to adapt queries to exploit specific accelerator strengths (like SUM function) brings additional benefits
Reduce monthly licensing charges by offloading complex query workloads

Idea #2: Hadoop

With the arrival of new data sources and organizations who have a need to analyze everything, from text to streaming data, it became clear that the data warehouse had some limitations. Data warehouse architects can respond by tuning existing systems, but if you really need to incorporate a variety of new data types for Big Data, you may need to look at some of the new technology out there. As a result, many companies are turning to Hadoop to modernize their data warehouse and data integration architectures.

As most of you are aware, Hadoop is an open source framework for processing, storing and analyzing massive amounts of data. It was designed to handle petabytes and exabytes of data, both structured and unstructured, distributed over multiple nodes in parallel. You can leverage Hadoop to capture interesting data in different ways, store it as additional data types and enable exploration of data prior to normalization. The data can prepared for loading into your data warehouse by using MapReduce or one the many commercially available Hadoop-based tools.

So what does all this mean? Philip Russom, research director for data management at The Data Warehousing Institute, said in a report, “Hadoop is a wonderful complement to the data warehouse, but no one that has worked their way through it would see it as a replacement for the data warehouse.”

Hadoop is worth looking at in your modernization effort. With careful evaluation and modeling, a Hadoop platform alongside your data warehouse, can be used as a data management landing zone, a pre-processing area for ETL processes and as a queryable, low-cost archive platform.

More and more experts are saying that it is getting less likely that everything can be put into a data warehouse. The key is to make all the pieces work together.

Idea #3: MPP-Based Analytic Database Architectures

Unlike your data warehouse(s), MPP analytic databases can quickly take in large amounts of primarily structured data with minimal data modeling required and can scale to handle many terabytes and even petabytes of data. These highly capable systems support near real-time results to complex SQL queries. This is something missing in Hadoop and can sometimes mean Hadoop isn’t the answer to support near real-time Big Data applications.

The fundamental characteristics of these MPP analytic databases include:

Shared-nothing architectures to ensure there is no single point of failure in the analytics environment. If one machine fails, the others keep running.
Columnar architectures are employed in some MPP analytic databases instead of the traditional storing and processing of rows in a relational database. In this columnar environment, only columns that are necessary for the “answers” are processed rather than reading the entire row. The result: split second query results. It also means tables do not need to be as structured as most relational tables.
Advanced data compression is available with MPP analytic databases. 10 to 1 compression is not uncommon. Data compression and data encoding allow for critical scaling to massive volumes of data efficiently.
Commodity hardware can be used to run MPP analytic databases. Like Hadoop clusters, most MPP systems can run on commodity hardware from Dell, IBM, HP and others. Systems can be scaled in a cost-effective manner.

MPP analytic databases are not perfect for every Big Data application, but hybrid systems and all types of new vendor products are in development.

Idea #4: Cloud

Over the past few years, the high adoption of cloud computing technology by enterprises has become one of the emerging trends in the data warehousing market. Cloud-based data warehousing is becoming viewed by many as an infrastructure design with a higher potential than traditional data warehouse deployment methods. The huge early success of Amazon Web Services Redshift, a cloud-based data warehouse, has caught many skeptics by surprise.

Some organizations are starting to view a private cloud for supporting their data warehouse and analytics projects—especially some of the newer or even departmental or vertical data mart efforts. Budget constraints for IT, insufficient time to build in-house systems, and cost advantages attached to hosted or on-demand software-as-a-service licensing are important factors responsible for the increased adoption of cloud-based data warehousing solutions by enterprises.

In addition, cloud services do not have overhead costs of space, power, and cooling that are generally incurred with the deployment of traditional in-house data warehouse solutions. Security issues concern some, but most of these issues are being removed by the various aggressive and powerful vendors coming into the marketplace.

Idea #5: In-database Analytics

This is one of the biggest developments in Big Data. In-database analytics addresses one of the biggest hassles of advanced analytics. Doing the analysis in the database, where the data resides, eliminates the costs, time and security issues associated with the old approach by doing the processing in the data warehouse itself.

In-database analytics changes this by moving the analytics algorithms and processing to where the data is stored. Elimination of data movement obviously removes the hours needed to move terabytes of information to near zero. The huge increases in data volumes in today’s data warehouses has made this capability very interesting to many IT organizations. As more organizations use data to answer important questions, the questions they ask are becoming more complex, demanding more sophisticated tools, more precise results and better performance.

All of these factors in combination have created the need for in-database processing. The introduction of the column-oriented databases, specifically designed for analytics, data warehousing and reporting, has helped make the technology possible. Several key vendors provide tools and platforms in this area.

Idea #6: Extend Your Data Warehouse with External Data Sources

There is a growing market of data virtualization vendors that allow you to easily extend the power of your data warehouse by accessing non-data warehouse and external data sources on an “as needed basis. This type of “extended virtual data warehouse” allows you to get quick access to infrequently referenced data sources without having to go through long delays of processing to get and load interesting, external data into your warehouse.

For example, many companies have current/historic sales data, along with pricing information. But think about the analytical possibilities if you were to take that data and gain additional insight by loading information about things that could drive sales such as weather, income tax distribution timeframes, regional or local population growth/decline, household demographics, and other information. You can virtually access this information and then decide to load permanently if it is needed on a more frequent basis. While this kind of information has been available for some time, there are newer types of information available and through vendors that build management tools to make the whole process more agile.

Summary

Spend some time this year and do some deep thinking about what you could do to “modernize” some of your data warehouse investments. With all the focus on Big Data, start thinking “hybrid” as you consider how to use some of the new breakthroughs in combination with your data warehouse. It will take some time and effort. There is a shortage of skilled technologists in some of these areas. But now is the time to get started.