From Tensors to Clouds — A Practical Guide to Zarr V3 and Zarr-Python 3

Sanket Verma

Wednesday 12:25 in Palladium

Zarr is a data format for storing chunked, compressed N-dimensional arrays and is a NumFOCUS-sponsored project.

It is based on an open-source technical specification and has implementations in several languages, with Zarr-Python being the most widely used.

After the successful adoption of Specification V3, our team has worked tirelessly over the last year to ensure the Python library's compliance with the latest spec.

Outline

First, I’ll talk about:

Understanding Zarr basics (5 mins.)

  • What is Zarr, and how does it work?
    • The inner workings of Zarr using illustrated graphics
  • What is the Zarr Specification?
    • What's new in Zarr Spec V3?
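
As a rough illustration of the chunking idea behind the items above, the sketch below maps an array index to the chunk that holds it. It assumes a regular chunk grid and the `c/<i>/<j>` key layout of the V3 default chunk key encoding; the helper names `chunk_grid_shape` and `chunk_key` are hypothetical and are not part of Zarr-Python.

```python
from itertools import product
from math import ceil


def chunk_grid_shape(shape, chunks):
    """Number of chunks along each dimension (edge chunks may be partial)."""
    return tuple(ceil(s / c) for s, c in zip(shape, chunks))


def chunk_key(index, chunks, separator="/"):
    """Key of the chunk holding `index`, written in the 'c/<i>/<j>' style."""
    coords = (i // c for i, c in zip(index, chunks))
    return separator.join(["c", *map(str, coords)])


shape, chunks = (100, 250), (10, 100)
grid = chunk_grid_shape(shape, chunks)

# Enumerate every chunk key in the grid by visiting each chunk's first element.
all_keys = [
    chunk_key(tuple(g * c for g, c in zip(coords, chunks)), chunks)
    for coords in product(*map(range, grid))
]

print(grid)
print(chunk_key((42, 199), chunks))
```

Each key would correspond to one independently stored, compressed object, which is why reading a small slice only needs to touch the chunks that overlap it.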

Then, I'll be talking about the new Zarr-Python 3 and its significant features:

What's new in Zarr-Python 3? (15 mins.)

  • Major design updates
    • New storage backend
    • Creating Zarr arrays and groups asynchronously
    • New and improved codec pipeline
    • Native GPU support for creating and writing arrays
  • Changes and deprecations
    • Overview of the new API
    • Optimising performance for large arrays
    • Deprecation of several stores like LMDBStore, SQLStore, MongoDBStore, etc.
  • 3.0 Migration guide
    • Steps to migrate from Zarr-Python 2 to Zarr-Python 3
  • Extensions
    • How can Zarr-Python 3 be extended to add new custom data types, stores, chunking strategies, etc.?

Then, I’ll run a hands-on session covering the following:

Hands-on (5 mins.)

  • Creating Zarr arrays and groups using Zarr-Python 3
    • Plus a walkthrough of the new features (mentioned above)
  • Writing to and reading from cloud object storage
    • Using S3/GCS/Azure to create Zarr arrays and write data to them
  • Looking under the hood
    • Using the store and info functions to show how your Zarr data is stored and to display important information

Conclusion (5 mins.)

  • Key takeaways
  • How can you get involved?
  • Q&A

This talk aims to address an audience that works with large amounts of data and is looking for a transparent, open-source, reliable, cloud-optimised, and environmentally friendly format.

The talk will be informative, story-driven, and fun.

Attendees should have intermediate knowledge of Python and NumPy arrays.

After this talk, you’ll be able to:

  • understand the basics of Zarr and what's new in V3,
  • leverage the new functionality of Zarr-Python 3 and its improved performance,
  • make an informed decision on which data format to use for your data.

Sanket Verma

Sanket is a data scientist based out of New Delhi, India. He likes to build data science tools and products and has worked with startups, governments, and organisations. He loves building community and bringing everyone together and is Chair of PyData Delhi and PyData Global.

Currently, he looks after the community and open-source work at Zarr as its Community Manager.

When he’s not working, he likes to play the violin and computer games and sometimes thinks of saving the world!