Skip to content

BUG: groupby.transform calls the user function ~1.5 times more than necessary #44977

Closed
@axil

Description

@axil

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
df = pd.DataFrame({'key': ['z', 'z'], 'a': [1,2], 'b': [3,4]})
def f(x):
    print(x)
    return x.sum()
df.groupby('key').transform(f)

Issue Description

For every group the user function is first called with every series of this group (which is correct), but then with the group as a whole (which is not right, as the result is not used anywhere at all).

After digging through the source code I've found remnants of the 'fast path' and 'slow path' – an optimization that has long been gone, but those extra calls to user function are still there.

The commit which obliterates this optimization is
b8b6471
ENH: Add numba engine to groupby.transform (#32854)

After it was merged in, the path output variable of the _choose_path is not used anywhere any longer. So that extra call to the user function that was only necessary setup the path is not necessary as well:

-        path = None
         for name, group in gen:
             object.__setattr__(group, "name", name)

-            if path is None:
+            if engine == "numba":
+                values, index = split_for_numba(group)
+                res = numba_func(values, index, *args)
+                if func not in self._numba_func_cache:
+                    self._numba_func_cache[func] = numba_func
+                # Return the result as a DataFrame for concatenation later
+                res = DataFrame(res, index=group.index, columns=group.columns)
+            else:
                 # Try slow path and fast path.
                 try:
                     path, res = self._choose_path(fast_path, slow_path, group)
@@ -1376,8 +1422,6 @@ class DataFrameGroupBy(GroupBy[DataFrame]):
                 except ValueError as err:
                     msg = "transform must return a scalar value for each group"
                     raise ValueError(msg) from err
-            else:
-                res = path(group)

Or maybe it was done by mistake and deletion of the res=path(group) line should be reverted.

After this commit this line from the docs is no longer valid:

If f also supports application to the entire subframe, then a fast path is used starting from the second chunk.

Expected Behavior

0 1 # <-- this is correct
1 2
Name: a, dtype: int64
0 3 # <-- this is correct
1 4
Name: b, dtype: int64
a b # <-- this is wrong (result is ignored)
0 1 3
1 2 4
a b # <-- the result is correct
0 3 7
1 3 7

Installed Versions

INSTALLED VERSIONS

commit : 66e3805
python : 3.7.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.3.5
numpy : 1.21.2
pytz : 2019.2
dateutil : 2.8.0
pip : 21.0.1
setuptools : 41.1.0
Cython : 0.29.14
pytest : 5.1.3
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.7.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fsspec : 0.8.5
fastparquet : None
gcsfs : None
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.15.1
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : 0.16.2
xlrd : 1.2.0
xlwt : None
numba : 0.51.2

Metadata

Metadata

Assignees

No one assigned

    Labels

    ApplyApply, Aggregate, Transform, MapBugGroupbyPerformanceMemory or execution speed performance

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions