Arguably, glum's standout feature is its ability to efficiently handle datasets consisting of a mix of dense, sparse and categorical features. To facilitate this, it relies on our (similarly open-source) tabmat library, which provides classes and useful methods for mixed-sparsity data. glum fits models by first converting input data to tabmat matrices, and then using those matrices to do the necessary computations.
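To make the idea of mixed-sparsity data concrete, here is a minimal pure-Python sketch of computing a linear predictor when each kind of column is kept in its own compact representation. It is only an illustration and not tabmat's API: the function name `linear_predictor`, the `(kind, data, coef)` block layout, and the example values are all hypothetical.

```python
def linear_predictor(n_rows, blocks):
    """Compute X @ beta without materialising X as a single dense matrix.

    Each block is a hypothetical (kind, data, coef) triple:
      - "dense":       data is a list with one value per row, coef is a float
      - "sparse":      data maps row index -> non-zero value, coef is a float
      - "categorical": data is a list of integer codes (one per row),
                       coef is a list with one coefficient per level
    """
    out = [0.0] * n_rows
    for kind, data, coef in blocks:
        if kind == "dense":
            # Every row contributes.
            for i, x in enumerate(data):
                out[i] += x * coef
        elif kind == "sparse":
            # Only the stored non-zero entries are visited.
            for i, x in data.items():
                out[i] += x * coef
        elif kind == "categorical":
            # Implicit one-hot encoding: a lookup replaces a multiplication.
            for i, code in enumerate(data):
                out[i] += coef[code]
        else:
            raise ValueError(f"unknown block kind: {kind!r}")
    return out


blocks = [
    ("dense", [1.0, 2.0, 3.0, 4.0], 0.5),
    ("sparse", {1: 10.0}, 2.0),
    ("categorical", [0, 2, 1, 0], [0.25, 0.5, 1.0]),
]
eta = linear_predictor(4, blocks)
print(eta)  # [0.75, 22.0, 2.0, 2.25]
```

The sparse block touches one entry instead of four, and the categorical block never builds its one-hot columns, which is the kind of saving that a dedicated mixed-sparsity matrix type can exploit at scale.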
Therefore, dataframe-agnosticism in our case mostly boils down to efficiently converting different dataframes to tabmat matrices (which themselves store data in numpy arrays and sparse scipy matrices). Most of this is smooth and straightforward because narwhals provides a convenient compatibility layer for a wide range of dataframe functionality. However, we have encountered a couple of pain points that might be of interest to other package maintainers and the PyData community.
In this talk I demonstrate how we used narwhals to easily accept multiple types of dataframes. I will go into detail about categorical and sparse columns, and present the challenges we encountered with them. I will also examine the benefits and challenges of supporting sparse columns in dataframe libraries and the Arrow standard. These points are meant to facilitate discussion among the participants and in the PyData community.
At the end of the talk I will also briefly mention potential future plans for glum and tabmat, including the possibility of doing computations directly on Arrow objects without converting them to numpy arrays and scipy sparse matrices.
glum, its backend library tabmat, and the main ideas that make them performant.
Making glum dataframe-agnostic.
narwhals simplifies handling a wide variety of dataframes.

glum efficiently handles mixed-sparsity data.
narwhals helps to achieve dataframe-agnosticism with little effort.