PyData Stack: Building and deploying pure Python, open source data platforms

Eric Thanenthiran

Wednesday 15:10 in Helium3

Modern data platforms can be built and deployed using entirely open source Python packages. In this talk, I’ll cover what constitutes a modern data stack and which open source Python packages can be used to build a stack suitable for the needs of most developers and companies. Rather than taking a one-size-fits-all approach, I’ll first survey the rich ecosystem of technologies available and the pros and cons of each technology choice.

To make this concrete, we will demo a self-contained, deployable platform built from specific technology choices for each key component: data pipelines, transformation engine, data warehouse, presentation layer and orchestration. This implementation will use Docker, Python and yes, even some SQL.
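As a rough illustration of how the five components relate, here is a minimal sketch using only the Python standard library. It is not the platform demoed in the talk: `sqlite3` stands in for the data warehouse, and the sample records, table names and function names (`load_raw`, `transform`, `present`, `run_platform`) are all hypothetical.

```python
import sqlite3

# Hypothetical source records; a real pipeline would pull these from an API or file.
RAW_ORDERS = [
    {"order_id": 1, "customer": "acme", "amount": 120.0},
    {"order_id": 2, "customer": "acme", "amount": 80.0},
    {"order_id": 3, "customer": "globex", "amount": 50.0},
]


def load_raw(conn, rows):
    """Data pipeline: land raw records in the store unchanged."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO raw_orders VALUES (:order_id, :customer, :amount)",
        rows,
    )


def transform(conn):
    """Transformation engine: build a modelled table from raw data with SQL."""
    conn.execute("DROP TABLE IF EXISTS customer_revenue")
    conn.execute(
        "CREATE TABLE customer_revenue AS "
        "SELECT customer, SUM(amount) AS revenue, COUNT(*) AS orders "
        "FROM raw_orders GROUP BY customer"
    )


def present(conn):
    """Presentation layer: read the modelled table for a report or dashboard."""
    cur = conn.execute(
        "SELECT customer, revenue, orders FROM customer_revenue ORDER BY customer"
    )
    return cur.fetchall()


def run_platform(rows=RAW_ORDERS):
    """Orchestration: run the stages in dependency order."""
    conn = sqlite3.connect(":memory:")  # stand-in for the data store
    try:
        load_raw(conn, rows)
        transform(conn)
        return present(conn)
    finally:
        conn.close()


if __name__ == "__main__":
    for customer, revenue, orders in run_platform():
        print(f"{customer}: {orders} orders, revenue {revenue:.2f}")
```

In a real stack each function would be replaced by a dedicated tool, but the separation of concerns (load raw, model with SQL, read modelled tables) stays the same.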

Structure

  1. What is a data platform? [5 mins]
    1. Why are they useful?
    2. When do companies need them?
  2. What are the key components? [5 mins]
    1. Data pipelines
    2. Transformation engine
    3. Data store
    4. Presentation layer
    5. Orchestration
  3. Which Python packages are available for this? [10 mins]
    1. Data pipelines: dlt/PyAirbyte
    2. Transformation engine: dbt or sqlmesh
    3. Data store: S3/blob store/data warehouse/DuckDB
    4. Presentation layer: Streamlit/Superset
    5. Orchestration: Prefect/Dagster
  4. The platform [10 mins]
    1. Design
    2. Technology choices
    3. Joining it all together
  5. Deployment [10 mins]
    1. IaC via Pulumi (Python)
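The orchestration component in the outline above comes down to running steps in dependency order. The following is a minimal sketch of that idea using the standard library's `graphlib`. It is not how Prefect or Dagster are used, and those tools add scheduling, retries and observability on top. The task names and the `run_in_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_in_order(tasks, dependencies):
    """Run each task after everything it depends on; return the execution order.

    tasks:        maps a task name to a zero-argument callable.
    dependencies: maps a task name to the set of task names it depends on.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


def build_example():
    """Build a hypothetical three-step platform run that records what executed."""
    log = []
    tasks = {
        "ingest": lambda: log.append("ingest"),        # data pipeline
        "transform": lambda: log.append("transform"),  # transformation engine
        "publish": lambda: log.append("publish"),      # presentation layer
    }
    dependencies = {
        "ingest": set(),
        "transform": {"ingest"},
        "publish": {"transform"},
    }
    return tasks, dependencies, log


if __name__ == "__main__":
    tasks, dependencies, log = build_example()
    run_in_order(tasks, dependencies)
    print(" -> ".join(log))
```

Declaring dependencies as data rather than hard-coding call order is the design choice that lets an orchestrator rerun, skip or parallelise individual steps.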

Outcomes

The aim of this talk is to equip attendees with an understanding of the available technology choices and the knowledge to build their own data platforms. It should be especially useful for software or backend engineers who may be called upon to own the data stack that supports business and analyst use cases. It may also help engineers looking to re-platform expensive legacy data platforms onto a more modern data stack. For research and personal projects, spinning up a modern platform could be useful for compute-heavy analytics that have outgrown local development.

Eric Thanenthiran

I lead the Engineering function at Tasman Analytics, a boutique data consultancy. We act as an interim/fractional data team and have built many, many data stacks for our clients. We are passionate about helping clients leverage the power of their data.

Personally, I have a background in mechanical engineering and have worked across a range of sectors including sustainability, energy, property, construction and architecture. I am an engineer at heart and perennially look to hone the craft of engineering.

Writing Python makes me incredibly happy.