Data Discovery & Metadata

July 18th, 2019 | Articles, Big Data, Insight Post, Technology

Introduction

Data drives modern enterprises! Modern enterprises generate terabytes of data on a daily basis, and with the advancements in the IoT (Internet of Things) that number is only going to grow. Generated data comes in myriad forms, which can generally be placed into three groups:

  1. Unstructured (raw) data: any data which has no predefined structure or data model. Such data often comes in massive sizes and it is not easy to derive any value from it. Examples include videos, text files, images, etc.
  2. Structured data: any data whose structure is predetermined, i.e. every entry matches a predetermined format. Examples include database tables and CSV files.
  3. Semi-structured data: data that has no formal structure, but is accompanied by tags which aid in understanding the data. A clear example is any XML file.

Given the various sources of data, and various formats (or the lack thereof) of ingress data, it becomes increasingly difficult to manage data, track data, and gain quick insight from it. Enterprises which are unable to gain value from the data quickly are doomed to fall behind and ultimately fail. It is here that data lakes, data discovery, and metadata come into play.

Data lakes, data discovery, metadata and the power of the cloud

Before we take a deep dive into the topic, it is important to understand the key concepts! So what is a data lake? What does data discovery mean? How important is the metadata? How can I make sense of the raw, unstructured data? And how can one take advantage of the cloud to derive value from the vast amount of data?

Simply put, data lakes are storage repositories that hold vast amounts of data, usually in its raw, unprocessed form, which traditional databases are unable to handle efficiently. Metadata is, at its core, data about data. It may hold information such as when a file was created, when it should be archived, who created it, what process it belongs to, which processes or teams can use it, etc.

Enter data discovery! Data discovery is the process of generating metadata and gaining insight from it. A well-designed data discovery system will help your team answer questions such as “Where can I find the data I need?”, “How has this data been used?”, and “Is there a better alternative for me to use in my models?”, while a bad one will only add headaches for your team.

Now that you have started tracking your data, how do you make sense of it? Your data lake has no predetermined structure and your data comes in many different formats. What can you do to start using your data in your data processing pipelines? That is the role of data catalogs and data cataloging. Data cataloging is the process of extracting a common schema from various files, and a data catalog is the store of that information.
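To make the idea of data cataloging more concrete, here is a minimal sketch of extracting a common schema from two files in different formats. It is an illustration only: the helper names (`infer_type`, `infer_schema`), the sample records, and the three coarse type names are assumptions made for this example, and a real catalog handles far more formats and types.

```python
import csv
import io
import json


def infer_type(value):
    """Guess a coarse column type from a single value."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        # CSV cells are always strings, so try to parse them as numbers
        for caster, name in ((int, "bigint"), (float, "double")):
            try:
                caster(value)
                return name
            except ValueError:
                pass
    return "string"


def infer_schema(records):
    """Merge per-record types into one {column: type} mapping."""
    schema = {}
    for record in records:
        for column, value in record.items():
            seen = schema.get(column)
            current = infer_type(value)
            if seen is None or seen == current:
                schema[column] = current
            elif {seen, current} == {"bigint", "double"}:
                schema[column] = "double"  # widen integers to floats
            else:
                schema[column] = "string"  # fall back to the widest type
    return schema


# Hypothetical sample data: the same kind of readings in CSV and JSON Lines
csv_text = "device_id,temperature\n17,21.5\n18,22\n"
json_lines = (
    '{"device_id": 19, "temperature": 20.1}\n'
    '{"device_id": 20, "temperature": 23}\n'
)

csv_records = list(csv.DictReader(io.StringIO(csv_text)))
json_records = [json.loads(line) for line in json_lines.splitlines()]

csv_schema = infer_schema(csv_records)
json_schema = infer_schema(json_records)
```

Both files end up described by the same schema, which is what lets a downstream pipeline treat them uniformly.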

One last question remains. How do you implement a data lake and the accompanying tools we’ve just touched upon in this section? And how do you do it in a scalable, cost-efficient way? One could use the Hadoop ecosystem to build an on-prem data lake, but that is difficult to scale and needs maintenance. Here is where the cloud comes into play! Various cloud providers, among them Amazon Web Services (AWS), Microsoft Azure (Azure), and Google Cloud Platform (GCP), enable you to build your data lake infrastructure in a cost-efficient, highly scalable, durable manner that requires minimal infrastructure maintenance. The rest of this post focuses on how you can take advantage of AWS and its products to start making sense of the data you have!

Building a data lake and data management infrastructure using AWS

The following is a diagram of what we are going to build!

AWS offers many services for building your business on the cloud, ranging from click-to-deploy virtual machines, fully managed databases, and machine learning to virtual networking and access management. Our focus will be on the following services: S3, AWS Lambda, AWS Elasticsearch Service, and AWS Glue. First, we will explain what those services are. We will then explain how to use each of them to handle the problem at hand, and lastly, give a high-level overview of a sample architecture that solves our problem.

So let’s jump right in and start with S3. S3 stands for Simple Storage Service and is primarily used for storing data in the cloud. Since the data is stored as a sequence of bytes, any type of data can be stored in S3, including, but not limited to, text files, images, videos, serialized objects, blobs, etc. Data is stored as objects and objects are grouped into buckets. Think of the buckets as root folders for your data.

Next comes AWS Lambda. AWS Lambda is a serverless compute service that allows you to write, deploy and run your applications without worrying about the underlying infrastructure, i.e. the provisioning of virtual machines and scaling. The execution of an AWS Lambda application can be started manually, by making an HTTPS request, or it can be triggered by other AWS services, including but not limited to S3.

One of the fully managed databases offered by AWS is the Elasticsearch engine in the form of AWS Elasticsearch Service. Elasticsearch is an open-source NoSQL database and a powerful search engine, capable of processing, searching and retrieving a vast amount of (textual) data. Elasticsearch is often accompanied by Logstash and Kibana. Logstash is a data processing pipeline able to ingest data from various sources into Elasticsearch while Kibana can be used to visualize the data stored in Elasticsearch. AWS Elasticsearch Service offers a scalable Elasticsearch cluster, where you don’t have to worry about provisioning servers to run Elasticsearch and don’t have to manage deployments.

Finally comes AWS Glue. AWS Glue is a fully managed ETL service that makes it easy for you to categorize, process, enrich and filter your data. It can also serve as a central repository for your metadata in the form of AWS Glue Data Catalog.

Now that we have introduced you to the services that you can use to create a data lake and a data management infrastructure, let’s discuss how to build one. Once you start ingesting data, you need to store that data somewhere for it to be further processed and analyzed. S3 is an obvious solution for the data lake. Let’s assume your company uses data that comes from three main sources: financial data, IoT, and inventory. You could create one S3 bucket to hold data arriving from all three sources, but a far better solution would be to create one bucket for each of the data sources. That allows you to manage data from each of the sources independently, and to limit access to the data to only those departments that will use it.
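As a rough sketch of the one-bucket-per-source idea, the snippet below builds a read-only S3 bucket policy for each of the three buckets. The bucket names, the account ID, and the IAM role ARNs are hypothetical placeholders, and the policy is deliberately minimal.

```python
import json


def read_only_policy(bucket, role_arns):
    """Build an S3 bucket policy granting read access to the given IAM roles."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DepartmentReadOnly",
                "Effect": "Allow",
                "Principal": {"AWS": sorted(role_arns)},
                # ListBucket applies to the bucket, GetObject to its objects
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


# Hypothetical mapping of buckets to the roles allowed to read them
ACCESS = {
    "example-lake-financial": ["arn:aws:iam::123456789012:role/finance-analysts"],
    "example-lake-iot": ["arn:aws:iam::123456789012:role/iot-engineers"],
    "example-lake-inventory": ["arn:aws:iam::123456789012:role/inventory-planners"],
}

policies = {
    bucket: read_only_policy(bucket, roles) for bucket, roles in ACCESS.items()
}
print(json.dumps(policies["example-lake-iot"], indent=2))
```

Each policy could then be attached with the boto3 S3 client, for example `s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))`. Keeping the policies per bucket means a change for one department never touches the data of another.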

Assuming that most of the data comes in its raw form, we now need a way to extract meaning from it. Let’s start by attempting to infer the schema from the data that we have at our disposal. Besides running ETL jobs and storing metadata, AWS Glue can also catalog the data, i.e. infer its schema by crawling through it using crawlers. AWS Glue crawlers are able to interpret many of the popular formats, such as JSON and Avro, and if you need, you can write custom classifiers to categorize your data. The extracted schema is stored in the AWS Glue Data Catalog, where it can be accessed by the rest of your pipeline.
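A crawler for one of the buckets could be set up roughly as follows. This is a sketch, not a complete setup: the function name `crawl_bucket`, the crawler naming scheme, and the values in the usage comment are assumptions, and the `glue` argument is expected to be a boto3 Glue client such as `boto3.client('glue')`. The IAM role passed in must already allow Glue to read the bucket.

```python
def crawl_bucket(glue, bucket, database, role_arn):
    """Create and start a Glue crawler over one S3 bucket.

    Returns the name of the crawler that was created.
    """
    name = f"{bucket}-crawler"  # hypothetical naming scheme
    glue.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName=database,  # Data Catalog database that receives the tables
        Targets={"S3Targets": [{"Path": f"s3://{bucket}/"}]},
    )
    glue.start_crawler(Name=name)
    return name


# Usage (requires AWS credentials; all values are placeholders):
#   import boto3
#   crawl_bucket(boto3.client("glue"), "example-lake-iot", "iot_db",
#                "arn:aws:iam::123456789012:role/example-glue-crawler")
```

Passing the client in as an argument keeps the function easy to exercise with a stub, without touching a real AWS account.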

Now that you’ve taken the first step towards better understanding the ingested data, let’s add additional information so that you can better manage and analyze your data. How do we do that? This is where AWS Lambda comes into play.

import datetime  # used to timestamp the metadata record

import boto3  # AWS SDK
from elasticsearch import Elasticsearch  # Elasticsearch client SDK

S3 has the ability to trigger an AWS Lambda function whenever an object is added or deleted, passing to the function an event that carries information such as the name of the object, the bucket in which the object is stored, etc.
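For reference, the event that S3 delivers is a dict with a list of records. A minimal sketch of unpacking it is shown here; the helper name `s3_objects` and the sample event values are made up for illustration, and only a few of the fields S3 sends are included.

```python
from urllib.parse import unquote_plus


def s3_objects(event):
    """Yield (event_name, bucket, key, size) for each record in an S3 event."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        yield (
            record["eventName"],
            s3["bucket"]["name"],
            unquote_plus(s3["object"]["key"]),  # keys arrive URL-encoded
            s3["object"].get("size"),  # may be missing, e.g. on delete events
        )


# Trimmed-down, hypothetical example of an S3 notification event
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-lake-financial"},
                "object": {"key": "2019/07/q2+report.csv", "size": 2048},
            },
        }
    ]
}

parsed = list(s3_objects(sample_event))
```

Decoding the key matters in practice: an object named `q2 report.csv` shows up in the event as `q2+report.csv`.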

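def lambda_handler(event, context):  # Lambda passes the event first, then the context
    # pull the bucket and object key out of the first S3 record in the event
    s3_info = event['Records'][0]['s3']
    bucket = s3_info['bucket']['name']
    filename = s3_info['object']['key']
    # let's initialize the AWS Glue client and the Elasticsearch client
    glue_client = boto3.client('glue')  # we will use Glue to get schema information about the data
    es_client = Elasticsearch([{
        'host': ELASTICSEARCH_HOST,
        'port': ELASTICSEARCH_PORT
    }])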

That Lambda function could then go on and add additional data about the object that was created, such as who created it, how long the object should remain in storage, which team the object is available to, the schema of the object (since we now have that information available, thanks to AWS Glue), and the size of the data.
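Some of these facts can be read directly from S3. The sketch below assumes `s3` is a boto3 S3 client (`boto3.client('s3')`) and that uploaders tag objects with a user-defined `created-by` metadata entry; that key, the function name `describe_object`, and the output field names are hypothetical.

```python
def describe_object(s3, bucket, key):
    """Collect basic facts about one S3 object using a HEAD request."""
    head = s3.head_object(Bucket=bucket, Key=key)
    return {
        "sizeBytes": head["ContentLength"],
        "contentType": head.get("ContentType"),
        "lastModified": head["LastModified"].isoformat(),
        # user-defined metadata set at upload time (x-amz-meta-* headers)
        "createdBy": head.get("Metadata", {}).get("created-by", "unknown"),
    }
```

The returned dict can be merged into the metadata record before it is stored, so that size and ownership become searchable alongside the schema.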

    glue_response = glue_client.get_table(
        CatalogId=CATALOG_ID,  # the id of the data catalog where the data resides
        DatabaseName=DATABASE_NAME,  # the name of the database in which the table resides
        Name=NAME  # the name of the table for which to retrieve the definition
    )  # the response of this call is a dict
    schema_data = glue_response['Table']['StorageDescriptor']

In short, the function could provide metadata about the object that was created. Writing such a function is not difficult, and you can write it in one of the popular programming languages: Node.js, Python, Ruby, Go, Java and .NET Core.

    # now we have to generate a metadata dict to store in Elasticsearch
    # (bucket and filename come from the S3 event that triggered the function)
    metadata_dict = {
        'created': datetime.datetime.now(),
        'owner': 'financials',
        'bucket': bucket,
        'filename': filename,
        'availableTo': ['Advanced Analytics', 'Machine Learning'],
        # additional data can be added as needed
        'schema': schema_data
    }

Once you’ve generated the required metadata, you need a place to access that metadata quickly and efficiently. This is the role of the AWS Elasticsearch Service. By enhancing the previously created Lambda function to store the generated metadata inside your Elasticsearch cluster, you now have at your disposal a powerful engine for querying your metadata. You can ask questions such as “What data has been ingested between 05/15/2019 and 06/15/2019?” or “What is the total amount of data that has been generated by the financial department in the previous year?” and visualize the results of such queries in Kibana.

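    # what is left now is to save/index the metadata to Elasticsearch
    # (doc_type is only accepted by older Elasticsearch versions; mapping types were removed in 8.x)
    es_client.index(index='metadata', doc_type='metadata', body=metadata_dict)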
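Once the metadata is indexed, the two example questions could be expressed as Elasticsearch search bodies along these lines. Treat this as a sketch under assumptions: the function names are made up, `sizeBytes` is a hypothetical field that the metadata record would need to carry, and `owner.keyword` assumes the default dynamic mapping for string fields.

```python
def ingested_between(start, end):
    """Search body: documents whose 'created' date falls within [start, end]."""
    return {"query": {"range": {"created": {"gte": start, "lte": end}}}}


def total_size_by_owner(owner, since):
    """Search body: sum of 'sizeBytes' for one owner since a given date."""
    return {
        "size": 0,  # only the aggregation is needed, not the matching documents
        "query": {
            "bool": {
                "filter": [
                    {"term": {"owner.keyword": owner}},
                    {"range": {"created": {"gte": since}}},
                ]
            }
        },
        "aggs": {"total_bytes": {"sum": {"field": "sizeBytes"}}},
    }


# Usage with the client from the Lambda function (needs a reachable cluster):
#   es_client.search(index='metadata', body=ingested_between('2019-05-15', '2019-06-15'))
#   es_client.search(index='metadata', body=total_size_by_owner('financials', 'now-1y'))
```

The same bodies can be pasted into Kibana to drive a visualization of ingestion volume over time.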

Conclusion

We’ve shown you the basic concepts of how to build a simple data lake, how to retrieve metadata about the incoming data, and how to store that metadata and gain value from it. We’ve also shown you how to use AWS to create such an infrastructure without having to worry about scalability, speed, redundancy, or networking. More importantly, we hope that you now understand the benefits of having a metadata repository that is quick and easily queryable!
