Streamlining the Cosmos: Pythonic Workflow Management for Astronomical Analysis

Raphael Hviding

Wednesday 17:50 in Palladium

As astronomical surveys continue to grow in size and sophistication, researchers face mounting challenges in building efficient, scalable, and reproducible data processing pipelines. Modern observatories, like NASA's James Webb Space Telescope (JWST), are delivering unprecedented volumes of complex and specialized data, requiring innovative approaches to transform raw observations into meaningful, scientifically valid results. This talk focuses on leveraging Pythonic workflow management tools to address the unique challenges of processing large-scale astronomical datasets efficiently and reproducibly.

I will provide a brief overview of JWST including its capabilities and the groundbreaking science it has enabled. In particular we will focus on the Pure Parallel mode which collect serendipitous observations from regions of the sky adjacent to primary science targets. These opportunistic datasets are a powerful resource for blind extragalactic surveys, offering unique opportunities to uncover faint galaxies, cosmic structures, and rare astrophysical phenomena. However, their “unscheduled” and heterogeneous nature presents significant challenges: the data arrive in raw, uncalibrated formats and require intricate, multi-step workflows—such as artifact masking, background subtraction, and galaxy spectral analysis—before becoming scientifically usable.

In this talk, I will demonstrate how tools like Snakemake and Pixi offer powerful, Pythonic solutions for these challenges. I’ll show how these tools allow scientists to design modular, scalable, and highly parallel workflows that automate the reduction and analysis process while efficiently distributing computation across high-performance computing (HPC) clusters and cloud environments. By breaking workflows into smaller, reusable components, we can improve computational performance and maintain flexibility to adapt pipelines to new datasets, instruments, or evolving scientific goals.

Reproducibility remains a critical pillar of modern science, and I will highlight how combining workflow managers with environment management tools ensures version-controlled pipelines, transparent data lineage tracking, and reliable replication of results. This enables consistent analyses across diverse systems, fostering collaboration and long-term usability of scientific products.

This talk is designed for (data) scientists and researchers working with large-scale or complex datasets with multi-step reduction/analysis pipelines. This talk will provide a GitHub repository containing resources for building modular workflows, including examples of existing infrastructures, to provide actionable takeaways. Attendees will leave with a clear understanding of how to apply modern workflow management techniques to streamline their processing pipelines, improve reproducibility, and scale their analyses to meet the demands of their datasets.

Raphael Hviding