Friday 10:15
in Zeiss Plenary (Spectrum)
Data-as-Code (DaC) is a paradigm that streamlines data distribution by encapsulating dataset retrieval, together with a data contract, within a Python package. This approach makes it easy to enforce data quality, leverages semantic versioning to prevent errors in the data pipeline, and abstracts away from the Data Scientist the boilerplate code needed to load the data used by ML models, improving efficiency and consistency. This presentation will delve into the implementation of DaC, demonstrate its practical applications, and discuss the benefits it offers in modern data workflows.
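As a concrete illustration (not taken from the talk itself), a DaC package might bundle a data contract with a single `load()` function. The sketch below is one possible shape, assuming pandera for the contract; the package contents, column names, and storage path are all placeholders:

```python
import pandas as pd
import pandera as pa

# The data contract shipped with the package: column names, types, constraints.
SCHEMA = pa.DataFrameSchema(
    {
        "customer_id": pa.Column(int, pa.Check.ge(0)),
        "order_total": pa.Column(float, pa.Check.ge(0.0)),
        "order_date": pa.Column("datetime64[ns]"),
    },
    coerce=True,
)


def load() -> pd.DataFrame:
    """Retrieve the dataset and validate it against the contract.

    The retrieval details (object store, database, API, ...) stay inside the
    package, so consumers never write this boilerplate themselves.
    """
    raw = pd.read_parquet("s3://example-bucket/sales/latest.parquet")  # placeholder source
    return SCHEMA.validate(raw)
```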
This session will cover:
- Introduction to Data-as-Code (DaC):
  - What problems we want to solve with DaC
  - What is out of scope
- Implementing DaC:
  - Packaging data as Python packages
  - Defining data contracts
- Advantages of DaC:
  - Application of semantic versioning to manage data changes effectively
  - Automatic detection of breaking data changes as part of the data distribution
  - Abstraction of data loading mechanisms, allowing seamless transitions between data sources
  - Elimination of hard-coded data field names, enhancing code maintainability
  - Facilitation of unit testing through schema examples
  - Inclusion of comprehensive data descriptions and metadata
  - Centralized data distribution via the Python Package Index (PyPI)
- DaC in the real world:
  - Step-by-step walkthrough of creating and distributing a DaC package
  - Guidelines for data engineers on preparing data for DaC
  - Instructions for data scientists on consuming DaC packages in their workflows (see the consumer-side sketch after this outline)
  - Discussion of the scalability and adaptability of DaC
- Q&A session:
  - Addressing audience questions and remarks
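For the consumer side, here is a hypothetical sketch of how such a package could be used, assuming a package named `sales_dac` that exposes the `SCHEMA` and `load()` from the sketch above, a pinned semantic version range, and pandera's hypothesis-based data synthesis for generating test examples:

```python
# Hypothetical consumer-side usage. The package name, field names, and version
# range are illustrative; a major version bump would signal a breaking data change.
#
#   pip install "sales-dac>=1.2,<2"
#
from sales_dac import SCHEMA, load

df = load()                    # retrieval + contract validation, no boilerplate
print(df.columns.tolist())     # field names come from the contract, not hard-coded strings

# With pandera's data-synthesis extra (hypothesis) installed, the shipped schema
# can also generate synthetic examples for unit tests.
example = SCHEMA.example(size=5)
print(example.head())
```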
Francesco Calcavecchia
Physicist, ML Engineer, Agile adept. I’d rather have a taste of everything than specialize. Eager to learn, unlearn, try out, share, help.