Thursday 15:00
in Titanium3
1. Setting the Stage: Local vs. Distributed Storage (5 minutes)
- What’s the Big Deal with Storage?
- First, let’s talk about the shift from local storage (where we keep files on our own machines) to cloud-native storage (where data is spread across servers in the cloud).
- This shift is awesome but comes with new challenges: distributed systems can be tricky to work with, especially when you need to access them in a consistent way.
2. Enter fsspec: A Game Changer for File Systems (10 minutes)
What is fsspec?
- fsspec is a Python library that makes working with any kind of file system—whether it's local, in the cloud, or on a distributed system—much easier.
- It does this by giving us a unified way to interact with storage, no matter where the files actually live.
Why is fsspec Awesome?
- It simplifies file operations (like opening and reading files) across different storage systems, saving us time and mental energy.
- Plus, it’s open-source, which means you can extend it and make it work for your own unique storage setup.
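As a rough illustration of that unified interface, here is a minimal sketch in which one helper writes and reads a file on two different backends. The `roundtrip` helper and the file names are made up for this example, and fsspec's built-in in-memory filesystem stands in for a real cloud store so the snippet runs without credentials.

```python
import os
import tempfile

import fsspec


def roundtrip(url: str, text: str) -> str:
    """Write text to a URL or path, then read it back (hypothetical helper)."""
    with fsspec.open(url, "w") as f:
        f.write(text)
    with fsspec.open(url, "r") as f:
        return f.read()


# A plain local path and an in-memory URL go through the same code path.
local_path = os.path.join(tempfile.mkdtemp(), "note.txt")
memory_url = "memory://demo/note.txt"

local_result = roundtrip(local_path, "hello")
memory_result = roundtrip(memory_url, "hello")
```

Pointing the same helper at a cloud URL should only require the matching backend package (for example s3fs for `s3://` URLs) and valid credentials.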
A. Using fsspec with Pandas
- Pandas & fsspec:
- If you work with Pandas, you’re probably familiar with loading and saving data. fsspec helps make this process smoother by letting you pull data from cloud storage (like AWS S3) with no fuss.
- We’ll see how this works in practice, making it easy to work with large datasets in the cloud.
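A small sketch of the Pandas side, assuming fsspec is installed alongside Pandas. The `memory://` URL and the sample data are placeholders chosen so the example runs anywhere; they are not taken from the talk.

```python
import pandas as pd

# Sample data, invented for illustration.
df = pd.DataFrame({"city": ["Haifa", "Tel Aviv"], "visits": [120, 340]})

# Pandas hands URLs with a scheme it does not handle itself to fsspec,
# so the in-memory filesystem can stand in for a cloud bucket here.
url = "memory://demo/visits.csv"
df.to_csv(url, index=False)

loaded = pd.read_csv(url)
```

With s3fs installed, the same calls should accept an `s3://` URL, and backend settings such as credentials can be passed through the `storage_options` argument.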
B. Using fsspec with TensorFlow
- TensorFlow & fsspec:
- If you’re building machine learning models, TensorFlow needs to access training data and models, sometimes stored in the cloud.
- With fsspec, TensorFlow can seamlessly interact with cloud storage, making your ML pipelines more streamlined and less frustrating.
C. Using fsspec with PyArrow
- PyArrow & fsspec:
- PyArrow is great for high-performance data processing. When working with big data files like Parquet, fsspec makes it easy to load and save them from cloud storage without missing a beat.
4. Extending fsspec: Building Your Own Solutions (5 minutes)
- What if I Need Something Custom?
- Sometimes, you need to work with storage systems that aren’t “out of the box.” The cool part about fsspec is that it’s highly extensible.
- I’ll walk through how you can easily extend fsspec to work with your own custom storage systems, using a real-world example of how we did this.
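To give a feel for what an extension involves, here is a toy read-only filesystem backed by a Python dict. This is not the real-world example from the talk; `DictFileSystem`, the `dictfs` protocol name, and the sample file are all hypothetical. It assumes the usual fsspec extension pattern of subclassing `AbstractFileSystem` and registering the class under a protocol name.

```python
import io

import fsspec
from fsspec.spec import AbstractFileSystem


class DictFileSystem(AbstractFileSystem):
    """Toy read-only filesystem whose files live in a dict (illustration only)."""

    protocol = "dictfs"
    cachable = False  # skip fsspec's instance cache for this toy class

    def __init__(self, files=None, **kwargs):
        super().__init__(**kwargs)
        self.files = dict(files or {})  # maps path -> bytes

    def info(self, path, **kwargs):
        path = self._strip_protocol(path)
        if path in self.files:
            return {"name": path, "size": len(self.files[path]), "type": "file"}
        prefix = path + "/" if path else ""
        if any(name.startswith(prefix) for name in self.files):
            return {"name": path, "size": 0, "type": "directory"}
        raise FileNotFoundError(path)

    def ls(self, path, detail=True, **kwargs):
        path = self._strip_protocol(path)
        if path in self.files:
            entries = [self.info(path)]
        else:
            prefix = path + "/" if path else ""
            # Collect the direct children of the requested directory.
            children = sorted(
                {prefix + name[len(prefix):].split("/", 1)[0]
                 for name in self.files if name.startswith(prefix)}
            )
            if not children:
                raise FileNotFoundError(path)
            entries = [self.info(child) for child in children]
        return entries if detail else [e["name"] for e in entries]

    def _open(self, path, mode="rb", **kwargs):
        if mode != "rb":
            raise NotImplementedError("this sketch is read-only")
        if path not in self.files:
            raise FileNotFoundError(path)
        return io.BytesIO(self.files[path])


# Register the class so the protocol name resolves to it.
fsspec.register_implementation("dictfs", DictFileSystem, clobber=True)

fs = fsspec.filesystem("dictfs", files={"reports/q1.txt": b"hello"})
content = fs.cat_file("reports/q1.txt")
listing = fs.ls("reports", detail=False)
with fs.open("dictfs://reports/q1.txt", "r") as f:
    text = f.read()
```

A real backend would replace the dict lookups with calls to the storage system's API and would typically add write support as well.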
5. Wrap-Up & Key Takeaways (5 minutes)
The Big Picture:
- fsspec is a simple yet powerful tool for making cloud-native storage work seamlessly with Python data tools like Pandas, TensorFlow, and PyArrow.
- It’s the tool you didn’t know you needed to simplify your cloud storage tasks.
Final Thought:
- With fsspec, working with distributed storage doesn’t have to be hard. It makes everything feel like you’re working with local files, even when they’re scattered across the cloud.
6. Q&A Session (5 minutes)
Einat Orr
Dr. Einat Orr has 20+ years of experience building R&D organizations and leading the technology vision at multiple companies, the latest being Similarweb, which went public on the NYSE last May. Currently she serves as Co-founder and CEO of Treeverse, the company behind lakeFS, an open source platform that delivers a Git-like experience to object-storage-based data lakes. She received her PhD in Mathematics from Tel Aviv University, in the field of optimization in graph theory.