Open Table Formats (OTFs) such as Hudi, Iceberg and Delta Lake have disrupted the data engineering landscape in recent years. While the Parquet file format has become the de facto standard for open, interoperable columnar storage for analytical workloads, it lacks first-class support for critical features such as ACID compliance, incremental processing, flexible schema and partitioning evolution, and scalable metadata management. This gap increased development and maintenance effort when building idempotent, failure-tolerant data pipelines, and often resulted in custom frameworks. OTFs address these issues by providing a sophisticated metadata layer and improved maintenance capabilities on top of Parquet.
Driven by the promises of OTFs, we set out to replace our own read-only, Parquet-based bronze storage layer with Delta Lake. In theory, this should have improved performance, reduced maintenance and provided more flexibility. However, we stumbled upon several issues:
While the first two issues are solvable in the foreseeable future, the last one is specific to our requirements and does not align with the design decisions made for incremental processing in Delta Lake. Taken together, these points ultimately led us to return to plain Parquet.
This talk is mainly aimed at an intermediate data engineering audience but is also well suited to interested beginners. The content is relevant for all architects and data engineers who are responsible for storing and managing data for analytical workloads.