Team members: Alex Guerrero, Clyo Ramirez, Gabriel Rodriguez, Jeremy Williams
Data storage and data lakes are recurring topics in the IT industry. Apache Iceberg is a key piece in building an open lakehouse architecture, which can reduce the cost of traditional data warehouses.
Iceberg is an open table format, developed under the Apache Software Foundation, that manages large collections of files as tables. It supports modern analytical data lake operations such as record-level insert, update, and delete, as well as time travel queries.
This blog walks through a complete demo: how to create a table schema using AWS Glue, how to configure and create a new Iceberg table using Amazon Athena, and how Iceberg stores table data and metadata in Amazon S3. Finally, it discusses the advantages of Iceberg tables as a conclusion.
Suppose you have a new data warehouse project in which the client wants to develop a new data acquisition system to store their data, with the following requirements:
For this case you can use Iceberg tables as your open-source table format. Iceberg tables fully support SQL commands and allow you to define partition columns. Iceberg can partition data by time and store it in separate folders. For example, suppose a table has a column holding each item's year: if you pass that column as a partition column, Iceberg stores the data by year in different S3 bucket folders. With time-based partition columns, Iceberg can answer queries faster, because a query only reads the partitions that match its time filter. Iceberg also supports partitioning by month, day, and hour.
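As a sketch of what this looks like in Athena, the following DDL creates an Iceberg table partitioned by year using Iceberg's hidden partition transform (the database, table, column names, and bucket are hypothetical, not from the demo):

```sql
-- Hypothetical example: an Iceberg table partitioned by the year of sale_date.
-- Iceberg derives the partition from the column, so queries filtering on
-- sale_date automatically prune partitions.
CREATE TABLE sales_db.sales_by_year (
  item_id   string,
  amount    double,
  sale_date date
)
PARTITIONED BY (year(sale_date))
LOCATION 's3://your-bucket/sales_by_year/'
TBLPROPERTIES ('table_type' = 'ICEBERG');
```

The `month()`, `day()`, and `hour()` transforms can be used the same way in the `PARTITIONED BY` clause.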
Iceberg also makes time travel queries and data rollback possible. When you insert, update, or delete table records, Iceberg automatically creates a new snapshot, and the previous state of the table remains reachable through the snapshot history. These snapshots can be used to run time travel queries and to roll a table back to any earlier point in time.
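In Athena, a time travel query over an Iceberg table looks like the following sketch (table name and timestamp are placeholders):

```sql
-- Hypothetical: read the table as it existed at a past point in time.
SELECT *
FROM sales_db.sales_by_year
FOR TIMESTAMP AS OF TIMESTAMP '2021-04-05 00:00:00 UTC';
```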
Iceberg data is stored in an S3 bucket selected during table creation; S3 is a cheap and secure place to store Iceberg data. You can configure the bucket to be as safe as you need: for example, you can block public access to the bucket, or encrypt the objects stored in it by enabling default encryption.
To deploy this demo, we created a CSV file and uploaded it to Amazon S3, as shown in the next two images:
Figure 1. msdp transactional sales CSV data
Figure 2. msdp transactional data stored in S3
After uploading the CSV demo file, we created a new data table using an AWS Glue crawler. This table was needed to generate the data schema.
Figure 3. Add new table in AWS Glue.
The crawler asks for a table name, a database to store the table in, and the location of the CSV file to use for the data schema. In this case the location is the URL of the S3 bucket we created a few steps before. After completing these fields, we see a data catalog summary similar to the following:
Once the data catalog is complete, we can proceed to create the data schema, defining the data type of every column:
Figure 5. Data schema.
After creating the data schema and selecting the database, we created the Iceberg table. For this demo we defined the same columns as the data schema from the previous step, set the table type to Iceberg, chose business_date_2 as the partition column, and specified the S3 location where we want to store the data. The metadata location generated in this case was 's3://Bucket Name/…'. The query was run in Amazon Athena.
Figure 6. Create Iceberg data table
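The statement in Figure 6 has the general shape of the following sketch; the database name and the non-partition columns are assumptions, since only business_date_2 and the table type are stated in the text:

```sql
-- Hypothetical reconstruction of the Figure 6 DDL; only business_date_2
-- and the Iceberg table type come from the demo, the rest is illustrative.
CREATE TABLE msdp_db.msdp_sales_iceberg (
  transaction_id  string,
  amount          double,
  business_date_2 date
)
PARTITIONED BY (business_date_2)
LOCATION 's3://Bucket Name/msdp_sales_iceberg/'
TBLPROPERTIES ('table_type' = 'ICEBERG');
```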
Next, the table is populated using the CSV data from the data catalog table created two steps before; an example is shown in Figure 7.
Figure 7. Populate Iceberg Table
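The population step amounts to an `INSERT INTO … SELECT` from the crawler's catalog table into the Iceberg table, along these lines (both table names and the column list are assumptions):

```sql
-- Hypothetical: copy rows from the Glue catalog table built by the crawler
-- into the Iceberg table. Athena writes the rows as Parquet files under the
-- table's S3 data path, one folder per business_date_2 partition value.
INSERT INTO msdp_db.msdp_sales_iceberg
SELECT transaction_id, amount, business_date_2
FROM msdp_db.msdp_sales_csv;
```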
The Iceberg table can then be tested with standard SQL queries in Athena to check that everything is in order:
Figure 8. Test Iceberg table using SQL queries.
Figure 9. Query results
Figure 10. Query results
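Because Iceberg tables support record-level operations in Athena, the checks can go beyond plain `SELECT`s. A sketch, with hypothetical column values:

```sql
-- Hypothetical sanity checks: Iceberg tables in Athena support
-- record-level UPDATE and DELETE, not just reads.
SELECT count(*) FROM msdp_db.msdp_sales_iceberg;

UPDATE msdp_db.msdp_sales_iceberg
SET amount = 0
WHERE transaction_id = 'TX-0001';

DELETE FROM msdp_db.msdp_sales_iceberg
WHERE business_date_2 = DATE '2021-04-05';
```

Each of the write statements above produces a new snapshot in the table's metadata path.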
When we created the table, we specified an S3 location to store the data. Browsing to the S3 URL used when the Iceberg table was created (s3://Bucket Name/…), you will see two paths: data and metadata.
Figure 11. S3 Bucket Iceberg Table
The metadata path stores the Iceberg table configuration. The important point about the metadata path is that every write operation against the Iceberg table creates a new snapshot ID capturing the state of the table at that moment. Therefore, each insert, delete, or update produces a snapshot ID with the information for that operation.
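The snapshot history recorded under the metadata path can be inspected directly from Athena through the table's hidden `$history` metadata table (the table name is a placeholder):

```sql
-- Hypothetical: list the snapshots recorded under the metadata path.
-- Each row corresponds to one write operation against the table.
SELECT made_current_at, snapshot_id, parent_id
FROM "msdp_db"."msdp_sales_iceberg$history";
```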
Figure 12. Iceberg table configuration (Snapshot id)
The data path contains one partition folder per value of the column (or columns) selected when the Iceberg table was created. In this demo we selected the column business_date_2, so the data was stored using this partition. Examples can be seen in Figures 13 and 14.
Figure 13. Iceberg table data path (Column partition)
Figure 14. Iceberg table data path (Partition Column)
Inside every partition we find only the data that matches that partition value. To test the data stored inside the Parquet files, we created a short Jupyter notebook to check whether the Iceberg table was working as we expected. For this example we selected business_date_2 = '2021-04-05'; after running the notebook, we compared the output with the data table (Figure 1) and it matched correctly.
Figure 15. Jupyter notebook Parquet file test.
In conclusion, Iceberg tables allow users to create schemas and partitions. This table design is optimized for storage on Amazon S3, and storing the data with partitions increases the performance of SQL queries against the table.
Iceberg tables maintain versions of the Amazon S3 objects they contain. A new version is recorded after each insert, update, or delete. This means that with time travel queries you can go back and look at the contents of an Iceberg table at some point in the past: either at a certain point in time, or at a specific version, where a version is a snapshot ID associated with an Iceberg table state. It is important to mention that you cannot use time travel queries to undo dropping an Iceberg table. In fact, once an Iceberg table is dropped, all the data associated with it in S3 is also deleted. This is completely different from what happens with "normal" Athena tables, where dropping a table leaves its underlying data intact.
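Querying a specific version looks like the following sketch; the snapshot ID shown is a placeholder that would come from the table's snapshot history:

```sql
-- Hypothetical: read the table at a specific version. The number below
-- stands in for a real snapshot ID taken from the table's history.
SELECT *
FROM msdp_db.msdp_sales_iceberg
FOR VERSION AS OF 1234567890123456789;
```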