Challenges and Lessons Learned While Building a Real-Time Lakehouse using Apache Iceberg and Kafka

Jonas Böer, Elena Ouro Paz

Thursday 16:55 in Zeiss Plenary (Spectrum)

The Schwarz Group is present in thirty-two countries around the world with over ten thousand stores, hundreds of warehouses, an assortment of over two thousand different products and a single ERP system to manage them all.

Every country maintains its own databases for operational purposes, but all the data is also gathered in one central analytics platform for all countries. Not only does this platform need to be stable and reliable, but the data also needs to be made available to consumers in near real-time, within mere minutes. The existing analytics platform was based on proprietary solutions, which were expensive and required niche knowledge, severely limiting the number of available developers.

Therefore, we set out on a journey to completely redesign the analytics platform: a new solution based as much as possible on open-source technologies such as Python, Kafka and Iceberg. We chose Python for its rich ecosystem and ease of use; Kafka for fast and reliable processing of change events from over one thousand tables per country into one central hub; and Iceberg at the core, as our lakehouse table format, for its fully transparent schema evolution and the high query performance enabled by its rich metadata layer. In our presentation we will showcase the challenges we faced while designing the new architecture, how our chosen tech stack allowed us to tackle each of them, and the lessons we learned. We will focus on our challenges in four areas: scalability, performance, continuity of service and data quality.

Scalability: Ingesting changes to over one thousand tables coming from servers across thirty-two countries, supporting the operations of Europe’s largest retailer, is no easy task. Our architecture needs to support receiving tens of thousands of events per second. We will present how we set up Kafka to support our current load with room for future growth, how we use Tabular’s Iceberg sink connector to ingest all our tables, and how we leverage Avro serialization and Snappy compression of messages to reduce network traffic.
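To give a flavour of the message path, here is a minimal sketch of what the producer side can look like with Avro serialization against a schema registry and Snappy compression, using the confluent-kafka Python client. The endpoints, topic name and schema are purely illustrative, not our production configuration; the sink side (Tabular’s Iceberg connector) is configured separately in Kafka Connect and is covered in the talk.

    from confluent_kafka import Producer
    from confluent_kafka.schema_registry import SchemaRegistryClient
    from confluent_kafka.schema_registry.avro import AvroSerializer
    from confluent_kafka.serialization import SerializationContext, MessageField

    # Illustrative endpoints and topic; one topic per table and country, for example.
    SCHEMA_REGISTRY_URL = "http://schema-registry:8081"
    BOOTSTRAP_SERVERS = "kafka-1:9092,kafka-2:9092"
    TOPIC = "erp.de.sales_orders"

    # Simplified Avro schema for a single change event.
    SCHEMA_STR = """
    {
      "type": "record",
      "name": "SalesOrder",
      "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "store_id", "type": "int"},
        {"name": "amount",   "type": "double"}
      ]
    }
    """

    registry = SchemaRegistryClient({"url": SCHEMA_REGISTRY_URL})
    avro_serializer = AvroSerializer(registry, SCHEMA_STR)

    producer = Producer({
        "bootstrap.servers": BOOTSTRAP_SERVERS,
        "compression.type": "snappy",  # compress batches to reduce network traffic
        "linger.ms": 50,               # small batching window improves compression ratio
    })

    record = {"order_id": 1, "store_id": 42, "amount": 19.99}
    producer.produce(
        TOPIC,
        value=avro_serializer(record, SerializationContext(TOPIC, MessageField.VALUE)),
    )
    producer.flush()

Snappy is a common choice here because it trades a modest compression ratio for very low CPU overhead, which matters at tens of thousands of events per second.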

Performance: The large amounts of data we handle, paired with the influx of small files that real-time ingestion can produce, made ensuring performance an extremely challenging aspect of our application. We will show how we designed our data lakehouse, using Iceberg's hidden partitioning to ensure performance while staying flexible enough to evolve partition specs over time, and how we designed and implemented an effective maintenance job to reduce small files in our Iceberg tables.
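As a rough sketch of the two ideas, hidden partitioning and small-file compaction, the following PySpark snippet shows an Iceberg table partitioned by a transform of its timestamp column, plus the built-in maintenance procedures that rewrite small data files and expire old snapshots. Catalog, schema and table names are hypothetical, the session is assumed to be configured with the Iceberg Spark runtime, and our actual maintenance job differs in its scheduling and tuning.

    from pyspark.sql import SparkSession

    # Assumes a Spark session whose "lake" catalog is already configured as an
    # Iceberg catalog (spark.sql.catalog.lake.* settings on the cluster).
    spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

    # Hidden partitioning: partition by day(event_ts) without exposing a derived
    # column; the partition spec can be evolved later without rewriting the table.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS lake.sales.orders (
            order_id  BIGINT,
            store_id  INT,
            event_ts  TIMESTAMP,
            amount    DOUBLE
        )
        USING iceberg
        PARTITIONED BY (days(event_ts))
    """)

    # Compact the many small files produced by streaming ingestion into larger
    # ones (target size here is 512 MiB).
    spark.sql("""
        CALL lake.system.rewrite_data_files(
            table   => 'sales.orders',
            options => map('target-file-size-bytes', '536870912')
        )
    """)

    # Expire snapshots older than a retention window to keep metadata lean and
    # allow the replaced small files to be physically removed.
    spark.sql("""
        CALL lake.system.expire_snapshots(
            table      => 'sales.orders',
            older_than => TIMESTAMP '2024-01-01 00:00:00'
        )
    """)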

Continuity of Service: The existing analytics platform contains the core of the business data, which is used for many analytics and forecasting use cases across the organization. One of the main requirements was to ensure a smooth transition to the new architecture with as little downtime as possible. This meant keeping access easy for existing users by allowing them to retrieve the data in the same way they did in the past, with minimal changes. We will show how our architecture stays flexible by allowing access to the data from diverse query engines, and give an example of how we integrated it with Snowflake.
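Because the data lives in open Iceberg tables, consumers are not tied to a single engine. As a simple illustration, with hypothetical catalog, table and account names, the same table can be read directly with PyIceberg or queried from Snowflake once it has been registered there through a catalog integration:

    # Reading an Iceberg table directly with PyIceberg, e.g. for ad-hoc analysis.
    from pyiceberg.catalog import load_catalog
    from pyiceberg.expressions import EqualTo

    catalog = load_catalog("lake", type="rest", uri="http://iceberg-rest:8181")
    orders = catalog.load_table("sales.orders")
    arrow_table = orders.scan(row_filter=EqualTo("store_id", 42)).to_arrow()

    # Querying the same table from Snowflake, assuming it has already been
    # registered there via an external volume and catalog integration.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",      # placeholder credentials
        user="analyst",
        password="***",
        warehouse="ANALYTICS_WH",
        database="LAKE",
        schema="SALES",
    )
    cur = conn.cursor()
    cur.execute("SELECT store_id, SUM(amount) FROM orders GROUP BY store_id")
    rows = cur.fetchall()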

Data Quality: We faced some challenges when it came to consolidating the data we receive from all thirty-two countries. For some tables, the schemas diverged across countries: different sets of columns, different data types and even different primary keys. We will talk about how we handled data quality issues coming from the operational databases by using Iceberg's schema evolution capabilities, a schema registry and Kafka Connect single message transforms (SMTs).
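The schema registry and SMTs live in the Kafka Connect configuration, but some divergences ultimately show up as schema changes on the Iceberg tables themselves. As a hedged sketch of that last step, this is roughly what an explicit schema evolution looks like with PyIceberg; the table and column names are made up for the example:

    from pyiceberg.catalog import load_catalog
    from pyiceberg.types import LongType

    catalog = load_catalog("lake", type="rest", uri="http://iceberg-rest:8181")
    orders = catalog.load_table("sales.orders")

    # Iceberg tracks columns by ID, so these changes are transparent to readers
    # and do not require rewriting any data files.
    with orders.update_schema() as update:
        update.add_column("loyalty_card_id", LongType())          # column only some countries deliver
        update.rename_column("amount", "gross_amount")            # align diverging column names
        update.update_column("store_id", field_type=LongType())   # widen int to long (allowed promotion)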

Jonas Böer

Software Engineer since 2018, Data Engineer at inovex since 2022 – happy to get the job done, but prefers beautiful solutions.

Elena Ouro Paz