Creating and Managing a Data Mesh in AWS with Lake Formation

Maja Perusic, Iain Hunter

2024-03-14

Run your own Data-Mesh POC on AWS

If you’re interested in building a Data-Mesh on AWS, engineers from iOLAP have built an AWS CDK project to enable you to rapidly set up your own Data-Mesh proof of concept. If you’re interested in seeing a demo, please get in touch.

Intro

In our last post [link] we explained what a Data Mesh is and how it can enable organisations to think about their data as a collection of valuable products. These Data Products will be owned and managed by Data Producers typically split across lines of business. A Data Governance layer manages all the Data Products from all the Data Producers. The Governance layer has two roles:

1. Security layer, where admins secure data and ensure it is only shared with Data Consumers and third parties who have appropriate privileges.

2. Data Mart, where Data Consumers can discover all of the organisation’s Data Products and request access to them.

By structuring your organisation’s data into a Data Mesh, you can more easily avoid the pitfalls of a central data lake managed by one team. Over time, a central data lake can become a Data Swamp: terabytes of data with no rapid way to control it, access it or gain actionable insight from it.

In this post we look at turning Data Mesh theory into practice. iOLAP supports multiple cloud platforms and technologies such as Snowflake and Databricks, and all of them can be used to create a functioning Data Mesh. However, for this post we’ve selected AWS, and we’ll explain the practicalities and the challenges of building a Data Mesh in AWS using the AWS Lake Formation service.

Communication and Buy-In

It would be foolish to jump straight to the technology without first stressing the importance of listening to the concerns of the teams that will be involved in the project, and of generating their enthusiasm and buy-in. Successful projects depend on aligned, motivated teams; without them any project will struggle to launch. We can help you define, deliver and manage projects to ensure maximum success.

Data Mesh Architecture on AWS

To illustrate, we are going to imagine a Bank as a customer. We will first define the high-level architecture for a Data Mesh, which will be structured within AWS.

Our first architectural decision is to split our Data Producer, Data Governance and Data Consumers across 3 different AWS accounts. Our rationale:

1. Simplified management of users and roles. Importantly, it also allows each account to evolve separately to meet its particular use case.

2. Simplified tracking of AWS billing and attribution of AWS charges to the correct cost centre.

3. Simplified importing, securing and sharing of data products, as Lake Formation is designed with cross-account data sharing in mind.


For our bank customer, we will likely have multiple Producer accounts arranged along the lines of business, e.g. Investments, Insurance, Retail Banking, Credit Cards – each with their own Data Product Owner. Similarly, we could have multiple consumer accounts, e.g. Business Intelligence Team, Machine Learning Team, C-Level Executive team, each with different privileges and rights to view sensitive data products.

Let’s look at each account type in detail.

Data Producer Account

At a high level, each Producer account will have the same shape, utilising these core elements:

  • S3 Data Lake – Each Producer account will have its own Data-Lake. Here we’ve adopted the Medallion Architecture, so we immediately begin to break up the central data-lake. We expect the Producer to store all Data Products they want to share with the wider business within the Gold bucket.
  • ETL Pipelines – Data Producers will ingest, clean, validate and enrich data using a variety of ETL pipelines. It’s important to underline that in a Data Mesh, Data Producers own and manage their own pipelines, as they are closest to the data and understand it best. In our example we’re using a Serverless architecture for our pipelines and controlling them via Step Functions.
  • Data Sources – Data can be ingested from any number of sources. Here we’ve suggested data is ingested via Batch, RDBMS and streaming sources, but it could just as easily come from third-party sources like Fivetran or SnapLogic.
  • Lake Formation – Finally, the Lake Formation service is used to manage the data-lake, databases and tables.


A sample serverless AWS Step Functions ETL pipeline that calculates Life Expectancy for Insurance products. It operates over the Data Producer data-lake, with data secured and managed by Lake Formation.
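As a rough illustration of how a pipeline of this kind could be declared, the sketch below builds an Amazon States Language definition as a plain Python dictionary. The step names, Lambda function names, region and account ID are hypothetical placeholders chosen for this example; they are not the pipeline from our project.

```python
import json

# Hypothetical account ID and region, used only for illustration.
ACCOUNT_ID = "111111111111"
REGION = "eu-west-1"


def lambda_arn(name: str) -> str:
    """Build the ARN of a (hypothetical) Lambda function in the Producer account."""
    return f"arn:aws:lambda:{REGION}:{ACCOUNT_ID}:function:{name}"


def build_pipeline(steps: list[str]) -> dict:
    """Chain Lambda tasks into a linear Amazon States Language definition."""
    if not steps:
        raise ValueError("a pipeline needs at least one step")
    states = {}
    for i, step in enumerate(steps):
        state = {"Type": "Task", "Resource": lambda_arn(step)}
        if i + 1 < len(steps):
            state["Next"] = steps[i + 1]  # hand over to the following step
        else:
            state["End"] = True  # last step terminates the execution
        states[step] = state
    return {
        "Comment": "Sketch of a Bronze to Silver to Gold ETL pipeline",
        "StartAt": steps[0],
        "States": states,
    }


definition = build_pipeline(
    ["IngestToBronze", "CleanToSilver", "CalculateLifeExpectancy", "PublishToGold"]
)
print(json.dumps(definition, indent=2))
```

The printed JSON is the kind of document a state machine is created from. A real pipeline would normally add retries, error handling and branching, which are left out here to keep the shape visible.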

Data Governance Account

The Data Governance account is the lynchpin of the Data Mesh approach. It is here we ingest all Data Products from the various Data Producer accounts, secure them and then share them with Data Consumers who have requested access. This is done using these core elements:

  • Lake Formation – Governance Admins are given access to the Gold Buckets in all Data Producer Accounts. Admins can use Lake Formation Tags to secure the data and then share it with consumers.
  • Glue Crawlers – Crawlers are used to interrogate Data Products, understand their schema and then add them as tables to the appropriate Lake Formation database.
  • Athena – Athena is used to enable admins to view data so they can check it for correctness.
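To make the sharing step a little more concrete, here is a minimal sketch that builds the arguments for a tag-based grant in the shape accepted by the boto3 Lake Formation client's `grant_permissions` call. The account ID, tag keys and tag values are assumptions made up for this example, and the code only constructs and prints the request; actually sending it would need boto3 and Governance Admin credentials.

```python
import json


def build_tag_grant(consumer_account_id: str, tag_expression: dict[str, list[str]]) -> dict:
    """Build keyword arguments for a Lake Formation tag-based grant.

    The grant covers every table whose LF-Tags match the expression.
    """
    expression = [
        {"TagKey": key, "TagValues": sorted(values)}
        for key, values in sorted(tag_expression.items())
    ]
    return {
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {"LFTagPolicy": {"ResourceType": "TABLE", "Expression": expression}},
        "Permissions": ["SELECT", "DESCRIBE"],
        # Assumption: the Consumer Admin re-grants access to their own users.
        "PermissionsWithGrantOption": ["SELECT", "DESCRIBE"],
    }


# Hypothetical Consumer account and tag values.
request = build_tag_grant(
    "222222222222",
    {"consumer": ["machine-learning"], "confidentiality": ["public"]},
)
print(json.dumps(request, indent=2))
```

With credentials in place, the dictionary could be passed as `boto3.client("lakeformation").grant_permissions(**request)`. Whether to include the grant option is a governance decision rather than a technical one.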

Data Consumer Account

The final type of account is the Consumer Account. The shape of this will vary from team to team depending on the use case. Here we’ve modelled a Machine Learning team who are using Athena and SageMaker to interrogate the data-products, but any suite of tools can be used here.

Lake Formation is used to administer the Data Products that have been shared with each Consumer. Lake Formation Tags are used to give granular control over the columns in each Data Product. For example, the Machine Learning team may not have access to PII data in certain insurance products.
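The effect of this kind of tag-based filtering can be modelled in a few lines of plain Python. This is only a local simulation to show the idea, not how Lake Formation is implemented, and the column names and tag values are invented for the example.

```python
# Hypothetical confidentiality tags for the columns of an insurance Data Product.
COLUMN_CONFIDENTIALITY = {
    "policy_id": "public",
    "product_type": "public",
    "customer_name": "sensitive",
    "date_of_birth": "sensitive",
    "premium_band": "private",
}


def visible_columns(column_tags: dict[str, str], allowed: set[str]) -> list[str]:
    """Return the columns whose confidentiality tag is in the allowed set."""
    return [col for col, tag in column_tags.items() if tag in allowed]


# Assumption: the Machine Learning team may see public and private data only.
ml_team_columns = visible_columns(COLUMN_CONFIDENTIALITY, {"public", "private"})
print(ml_team_columns)  # ['policy_id', 'product_type', 'premium_band']
```

A query from the Machine Learning team would then behave as if the two sensitive columns did not exist in the table.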

AWS Data Mesh Challenges In-Depth

It would be a disservice to suggest that setting up a functional Data-Mesh on AWS is a trivial undertaking. The AWS documentation can be confusing and seemingly contradictory at times. However, once the Data Mesh is operational the benefits of real-time sharing of data, single source of truth, and data security are immediately apparent.

In this section we’ll pick out the elements of building a Data Mesh with Lake Formation that must be given detailed consideration to ensure the project is a success:

IAM Roles

You should spend time defining roles for all the users and admins of the Data Mesh. There are three broad categories:

  • Admins – These are the users who will configure and manage the Data Mesh within Lake Formation on each AWS Account, e.g. the Producer Admin, Governance Admin. The documentation notes the complex requirements for each admin.
  • Users – Typically these will be Data Consumers and members of Data Producer teams who may not have access to sensitive data, e.g. PII data.
  • Services – You may want to create roles for AWS services like Glue and Step Functions, with access narrowed to what each service requires.
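For the service category, each role needs a trust policy that names the one service allowed to assume it. The sketch below generates such policies as Python dictionaries. The role names are hypothetical, and the permission policies that would be attached to each role are deliberately left out.

```python
import json


def service_trust_policy(service_principal: str) -> dict:
    """Trust policy that lets a single AWS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Hypothetical role names mapped to the service allowed to assume each one.
SERVICE_ROLES = {
    "producer-glue-crawler-role": "glue.amazonaws.com",
    "producer-etl-state-machine-role": "states.amazonaws.com",
}

trust_policies = {
    name: service_trust_policy(service) for name, service in SERVICE_ROLES.items()
}
print(json.dumps(trust_policies, indent=2))
```

Keeping one role per service makes it easier to see, and later to audit, exactly which data each service can reach.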

Sample Roles for a Consumer Account:


S3 Access for Governance Account

When setting up your Data Mesh you are going to become very familiar with S3 access errors. Simplify the process of sharing data by ensuring all Data Products live in the Gold Bucket of each Producer’s Data Lake, so that the Governance Account only requires access to one bucket in each Producer account. See docs for details.
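One way to express that single-bucket access is a bucket policy on the Gold bucket. The following is a minimal sketch under assumptions: the bucket name and account ID are hypothetical, and a production policy would usually name a specific Governance role rather than the whole account.

```python
import json


def gold_bucket_policy(bucket_name: str, governance_account_id: str) -> dict:
    """Read-only bucket policy for the Governance account on one Gold bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    principal = {"AWS": f"arn:aws:iam::{governance_account_id}:root"}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GovernanceListGoldBucket",
                "Effect": "Allow",
                "Principal": principal,
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": bucket_arn,  # bucket-level actions
            },
            {
                "Sid": "GovernanceReadGoldObjects",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:GetObject",
                "Resource": f"{bucket_arn}/*",  # object-level actions
            },
        ],
    }


# Hypothetical bucket name and Governance account ID.
policy = gold_bucket_policy("insurance-gold", "333333333333")
print(json.dumps(policy, indent=2))
```

Many of the S3 access errors mentioned above come from mixing up the two resource forms: listing is granted on the bucket ARN, while reading objects is granted on the ARN with the `/*` suffix.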

Data Lake Locations in the Producer Account:


Lake Formation Tag Ontology

Lake Formation Tag Based Access Control is the magic ingredient that simplifies the management, security and sharing of data with consumers. Getting the correct set of tags will require some trial and error. However, a useful starting point is to define a tag for each line of business that will consume the data, one for the line of business that produced the data, and a data confidentiality tag.
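Writing the ontology down as data makes it easier to review and to check proposed tags against it. The sketch below is one possible reading of the starting point described above; the tag keys and values are assumptions based on the bank example, not a recommended standard.

```python
# Hypothetical ontology: tag key mapped to its allowed values.
TAG_ONTOLOGY = {
    "producer": {"investments", "insurance", "retail-banking", "credit-cards"},
    "consumer": {"business-intelligence", "machine-learning", "executive"},
    "confidentiality": {"public", "private", "sensitive"},
}


def validate_tags(tags: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the tags fit the ontology."""
    problems = []
    for key, value in tags.items():
        if key not in TAG_ONTOLOGY:
            problems.append(f"unknown tag key: {key}")
        elif value not in TAG_ONTOLOGY[key]:
            problems.append(f"invalid value {value!r} for tag {key}")
    return problems


print(validate_tags({"producer": "insurance", "confidentiality": "sensitive"}))
print(validate_tags({"producer": "mortgages", "owner": "jane"}))
```

A check like this can run before any tags are created in Lake Formation, so that typos and one-off tag values are caught while the ontology is still small.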


Define Public, Private, Sensitive Data for Column Based Access Control

Once you have imported a Data Product to the Governance Account, it is important that you use LF-Tags to mark the Data Confidentiality of each column. This gives you fine-grained access control over each data-product, and lets Consumer Admins control which of their user groups can see only Public data and which can also see Sensitive data.

Here we assign the sensitive tag to the life expectancy calculation for a customer, as we want to ensure that this data cannot be accidentally leaked to users who are not authorised to see it.
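The same assignment can be scripted rather than done in the console. This sketch builds the arguments in the shape accepted by the boto3 Lake Formation client's `add_lf_tags_to_resource` call; the database, table and column names are hypothetical, and the code stops short of calling AWS.

```python
import json


def build_column_tag_request(
    database: str, table: str, columns: list[str], tag_key: str, tag_value: str
) -> dict:
    """Build keyword arguments that attach one LF-Tag value to specific columns."""
    return {
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "LFTags": [{"TagKey": tag_key, "TagValues": [tag_value]}],
    }


# Hypothetical database, table and column names.
tag_request = build_column_tag_request(
    database="insurance_gold",
    table="life_policies",
    columns=["life_expectancy"],
    tag_key="confidentiality",
    tag_value="sensitive",
)
print(json.dumps(tag_request, indent=2))
```

With Governance Admin credentials, the dictionary could be passed as `boto3.client("lakeformation").add_lf_tags_to_resource(**tag_request)`. The tag key and value must already exist in Lake Formation before they can be assigned.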


Next Steps – Data Mesh POC

Now that we’ve created our architecture and defined the important elements of our Data Mesh, we’re ready to start constructing a POC!

Start small: we’d suggest identifying no more than 5 initial data products from 1 or 2 lines of business at most. Create your Producer, Governance, and Consumer accounts and experiment with sharing data between them. Don’t worry about ETL pipelines initially; simply sharing Parquet files across your Data Mesh will take time to perfect. You can expect this setup stage to take between 2 and 4 weeks, depending on the experience of your team.

Once you have your data sharing infrastructure in place, expanding the Data Mesh can be done like any other agile project. Work with your Data Producers to identify valuable data products and plan how to get them onto the Data Mesh, under Governance and then shared with Consumers.

Conclusion

This post has covered a lot of ground and it has hopefully inspired you to think about your data in new ways and motivated you to try creating your own Data Mesh.

To expedite this process, iOLAP has built an AWS CDK project to enable you to rapidly set up your own Data-Mesh POC. If you’re interested in seeing a demo, please get in touch.

