Every five to six years, there comes a technology wave, and if you are able to catch it, it will take you a long way. Throughout my career, I’ve ridden several of these waves. MPP data warehouses brought us incredible speed for analytics and a few headaches for data integration. We’re seeing in-memory analytics reducing disk latency. Hadoop based technologies are opening up new solutions every day for storage and compute workloads while our source systems are still generating varying degrees of velocity, volume, and variety.
As a traditional ETL developer, I would usually try to figure out the best solution to acquire, cleanse, and store this data in an optimal format for analytics…usually a data warehouse. Depending on the business need, number of sources, and complexity, this approach is a long one and quite labor intensive. Source systems create new data faster than we can consume them in traditional models. Hence, we see many organizations adopting a Data Lake approach. Here, we are simply concerned with optimizing the acquisition and storage of any data source. We worry about consumption later.
While data federation has been around for years, traditional technologies typically dealt with federating a relational source with a tabular, single file extract. Today, we’re asking federation to handle relational stores, API’s, HDFS, JSON, AVRO, logs, and unstructured text. It’s a tough task, but I was pretty impressed with SAP HANA’s approach and implementation of data federation.
This post is not about SAP HANA, but rather focuses on its data federation capabilities. I will try to explain basics, best practices, and few tips and tricks I came across during my experience working with data federation in HANA.
SAP’s data federation capabilities are built in their HANA database, which is known as Smart Data Access (SDA). SDA eliminates the need to replicate data into SAP HANA, instead it lets you query remote sources from HANA. SAP calls this ability to weave in a network of data sources the in-memory data fabric. SDA allows you to create virtual tables, which points to remote tables on remote sources. It allows you to write SQL queries in HANA, which operates on virtual tables. HANA query processor optimizes these queries, and executes only the relevant part of the query in the target database, returns the results of the query to HANA, and completes the operation.
SDA was first introduced in SPS06. Features matured over several releases, and it supports connectivity to ASE, Teradata, IQ, HANA, Oracle, Sql Server, Netezza, DB2, MaxDB and Hadoop. There are just a few one-time setup steps involved when setting up remote sources for first time.
All relational databases can be setup using ODBC drivers and RDBMS drivers on the UNIX server where HANA is installed. Once the drivers are installed, then create the remote sources using HANA studio. Refer to SAP administration guide for version details.
There are few different ways to setup Hadoop remote source. The most common way is to setup using ODBC drivers and HIVE/SPARK drivers on the UNIX server where HANA is installed. Once the drivers are installed, then create the remote source using HANA studio. Other ways include connecting via archive file/virtual UDFs from HANA studio and via spark controller on Hadoop.
Sometimes it is difficult to determine what should be the optimal way to federate data especially dealing with Hadoop sources. We recommend to use divide and conquer method. You can let your remote sources process data and query them from HANA as needed. For ex:- You would push huge volume of data processing on your Hadoop system, this way you are taking advantage of commodity hardware and their cheaper processing power. You can leverage your cheaper storage options and save data in those databases, while only bringing data sufficing your analytical needs into HANA via SDA.
SDA would submit query on remote server, therefore performance will be based on how powerful remote source is configured. It may or may not be adequate for your use case, and you might choose to copy data into HANA instead.
Leveraging Statistics – HANA has the ability to calculate statistics on remote data sources. These statistics can help query optimizer decide how to join two tables including remote tables and in which order to join them. There are two types of statistics you can enable. Histogram type will only saves counts, and simple type will save information such as counts, count distinct, min, max. Depending on your needs you can enable either type to improve your performance.
Querying Hadoop – When federating data from Hadoop, there are few tips and tricks we can use for better performance: