Description
Is your feature request related to a problem?
For dataframes, Pandas currently only supports per-column quantiles; that is, given df[['c', 'a']].quantile(...), Pandas will compute the individual quantiles for columns c and a:
>>> df = pd.DataFrame({'a': [1, 0, 11, 12, 2], 'b': [1, 2, 3, 4, 5], 'c': [0, 1, 5, 2, 3]})
>>> df[['c', 'a']].quantile([0, 0.5, 1])
       c     a
0.0  0.0   0.0
0.5  2.0   2.0
1.0  5.0  12.0
It would be nice if Pandas also supported multi-column quantiles; that is, given df[['c', 'a']].quantiles(...), Pandas would compute the quantiles for the dataframe sorted by all columns. This is currently implemented by cuDF's dataframe:
In [11]: gdf = cudf.DataFrame({"a": [1, 1, 1, 1, 1], "b": [5, 4, 3, 2, 1]})
In [12]: gdf[["a", "b"]].quantiles([0, 0.5, 1])
Out[12]:
     a  b
0.0  1  1
0.5  1  3
1.0  1  5
Describe the solution you'd like
I imagine the addition of multi-column quantiles support could happen in two ways:
- the addition of a default kwarg to DataFrame.quantile to specify whether or not we want multi-column quantiles
- the addition of a new method to compute multi-column quantiles independent from the logic of quantile
In either case, my preference here would be to have this functionality accessible via DataFrame.quantiles, to maintain consistency with cuDF.
API breaking implications
I can't think of any breakages this would cause, as long as any direct changes to quantile ensure that the original behavior is maintained by default.
Describe alternatives you've considered
This could be accomplished by sorting the dataframe by all columns and then indexing based on manually computed quantiles, but I imagine there's a more performant way to do this.
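For reference, the sort-and-index workaround could look roughly like the sketch below. The helper name multi_column_quantiles is hypothetical, and rounding each quantile position down to an existing row is an assumption made so that the result is always a real row of the dataframe; it is not a claim about how cuDF implements quantiles.

```python
import numpy as np
import pandas as pd


def multi_column_quantiles(df, q):
    """Hypothetical helper: return the rows of `df` found at the requested
    quantile positions after sorting lexicographically by all columns."""
    q = np.atleast_1d(np.asarray(q, dtype=float))
    if len(df) == 0:
        raise ValueError("cannot compute quantiles of an empty dataframe")
    # Sort by every column, using the column order as the sort priority.
    ordered = df.sort_values(list(df.columns)).reset_index(drop=True)
    # Map each quantile to a row position; rounding down is an assumption
    # of this sketch, chosen so no interpolation between rows is needed.
    positions = np.floor(q * (len(ordered) - 1)).astype(int)
    # Label the selected rows with the quantiles they correspond to.
    return ordered.iloc[positions].set_axis(q, axis=0)


df = pd.DataFrame({'a': [1, 0, 11, 12, 2], 'b': [1, 2, 3, 4, 5], 'c': [0, 1, 5, 2, 3]})
result = multi_column_quantiles(df[['c', 'a']], [0, 0.5, 1])
```

With the dataframe from the first example, result holds the rows (c, a) = (0, 1), (2, 12) and (5, 11), which are whole rows of the sorted dataframe instead of independent per-column values. The full sort is what makes this approach potentially slow.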
Additional context
If this functionality were added, along with a multi-columnar searchsorted, it would enable Dask dataframes to compute sort_values with multiple sort-by columns, using an algorithm roughly similar to that of dask-cudf.
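To illustrate what a multi-columnar searchsorted would mean here, the following is a minimal pure-Python sketch. The name multi_column_searchsorted is hypothetical, the comparison is plain lexicographic tuple ordering, and missing values are not handled; it is meant to show the semantics only, not how Dask or dask-cudf actually do it.

```python
import bisect

import pandas as pd


def multi_column_searchsorted(sorted_df, values_df, side="left"):
    """Hypothetical helper: for each row of `values_df`, find the position
    where it would be inserted into `sorted_df` to keep it sorted by all
    columns. `sorted_df` must already be sorted lexicographically."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    columns = list(sorted_df.columns)
    # Represent every row as a tuple so Python compares rows lexicographically.
    haystack = list(sorted_df.itertuples(index=False, name=None))
    needles = values_df[columns].itertuples(index=False, name=None)
    find = bisect.bisect_left if side == "left" else bisect.bisect_right
    return [find(haystack, needle) for needle in needles]


# Boundary rows, e.g. as produced by a multi-column quantile computation.
boundaries = pd.DataFrame({'c': [0, 2, 5], 'a': [1, 12, 11]})
rows = pd.DataFrame({'c': [1, 2, 3, 6], 'a': [0, 12, 2, 0]})
positions = multi_column_searchsorted(boundaries, rows)
```

Here positions is [1, 1, 2, 3]: each row is assigned a slot relative to the boundary rows, which is the kind of information a partitioned sort needs to decide where each row should go.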