Arguably, `glum`'s standout feature is its ability to efficiently handle datasets consisting of a mix of dense, sparse, and categorical features. To facilitate this, it relies on our (similarly open-source) `tabmat` library, which provides classes and useful methods for mixed-sparsity data. `glum` fits models by first converting input data to `tabmat` matrices, and then using those matrices for the necessary computations. Therefore, dataframe-agnosticism in our case mostly boils down to efficiently converting different kinds of dataframes to `tabmat` matrices (which themselves store data in `numpy` arrays and sparse `scipy` matrices). Most of this is smooth and straightforward, as `narwhals` provides a convenient compatibility layer for a wide range of dataframe functionality. However, we have encountered a couple of pain points that might be of interest to other package maintainers and the PyData community.
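To illustrate the idea of mixed-sparsity storage, here is a minimal, hypothetical sketch (not `glum` or `tabmat`'s actual implementation; `split_mixed_dataframe` and its threshold parameter are invented for illustration) that partitions a dataframe's columns into dense, sparse, and categorical parts, conceptually similar to how a `tabmat`-style split matrix stores them:

```python
import numpy as np
import pandas as pd
import scipy.sparse as sps


def split_mixed_dataframe(df, sparse_threshold=0.9):
    """Hypothetical helper: partition columns by how they are best stored."""
    dense_cols, sparse_cols, cat_cols = [], [], []
    for name in df.columns:
        col = df[name]
        if isinstance(col.dtype, pd.CategoricalDtype):
            cat_cols.append(name)
        elif (col == 0).mean() >= sparse_threshold:
            sparse_cols.append(name)  # mostly zeros: store sparsely
        else:
            dense_cols.append(name)
    dense = df[dense_cols].to_numpy(dtype=float)
    sparse = sps.csc_matrix(df[sparse_cols].to_numpy(dtype=float))
    # categoricals are kept as integer codes rather than materialized dummies
    cats = {name: df[name].cat.codes.to_numpy() for name in cat_cols}
    return dense, sparse, cats


df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0],
        "z": [0.0, 0.0, 0.0, 5.0],
        "c": pd.Categorical(["a", "b", "a", "b"]),
    }
)
dense, sparse, cats = split_mixed_dataframe(df, sparse_threshold=0.7)
```

The payoff of such a split is that each block can use the storage and matrix-multiplication routines best suited to it, instead of forcing everything into one dense array.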
In this talk, I demonstrate how we used `narwhals` to easily accept multiple types of dataframes. I will go into detail about categorical and sparse columns and present the challenges we encountered with them. I will also examine the benefits and challenges of supporting sparse columns in dataframe libraries and the Arrow standard. These points are meant to facilitate discussion among the participants and in the PyData community.
At the end of the talk, I will also briefly mention potential future plans for `glum` and `tabmat`, including the possibility of doing computations directly on Arrow objects without converting them to `numpy` and `scipy` arrays.
- `glum`, its backend library `tabmat`, and the main ideas that make them performant
- Making `glum` dataframe-agnostic
- How `narwhals` simplifies handling a wide variety of dataframes
- How `glum` efficiently handles mixed-sparsity data
- How `narwhals` helps to achieve dataframe-agnosticism with little effort