Skip to content

BUG: Series(nullable).tolist() -> numpy scalars #49890

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Nov 29, 2022

Conversation

lukemanley
Copy link
Member

@lukemanley lukemanley commented Nov 24, 2022

Series.tolist documentation states:

These are each a scalar type, which is a Python scalar (for str, int, float) or a pandas scalar (for Timestamp/Timedelta/Interval/Period)

Nullable dtype behavior seems to be inconsistent with the docstring and other dtypes:

main:

import pandas as pd

type(pd.Series([1], dtype="Int64").tolist()[0])             # numpy.int64
type(pd.Series([1], dtype="int64").tolist()[0])             # int
type(pd.Series([1], dtype="int64[pyarrow]").tolist()[0])    # int

PR:

import pandas as pd

type(pd.Series([1], dtype="Int64").tolist()[0])             # int
type(pd.Series([1], dtype="int64").tolist()[0])             # int
type(pd.Series([1], dtype="int64[pyarrow]").tolist()[0])    # int

The fix also provides an improvement in performance:

import pandas as pd
import numpy as np

ser = pd.Series(np.arange(10**6), dtype="Int64")

%timeit ser.tolist()

# 45.5 ms ± 980 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <- main
# 28.5 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <- PR
@lukemanley lukemanley added Bug ExtensionArray Extending pandas with custom dtypes or arrays. labels Nov 24, 2022
Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you profile this for pyarrow as well and with NA?

@jreback
Copy link
Contributor

jreback commented Nov 24, 2022

hmm i seem to recall that we very specifically have python scalers in all of these list conversion methods (eg iter as well)

i would be surprised this doesn't break many things

edit: i think i misread your post
you are making things consistent ok!

@lukemanley
Copy link
Member Author

lukemanley commented Nov 24, 2022

Could you profile this for pyarrow as well and with NA?

pyarrow doesn't hit this method. It hits ExtensionArray.tolist. I have a separate PR in the works to make that faster.

Here is this PR profiled with NA:

import pandas as pd
import numpy as np

ser = pd.Series(np.arange(10**6), dtype="Int64")
ser.iloc[::2] = pd.NA

%timeit ser.tolist()

# 92.9 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <- main
# 59.5 ms ± 4.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <- PR
@lukemanley
Copy link
Member Author

edit: i think i misread your post
you are making things consistent ok!

Yes, sorry my wording should have been more clear. This makes nullable return python scalars which is consistent with other dtypes

@phofl
Copy link
Member

phofl commented Nov 24, 2022

Thx, missed the file name. good that casting to object is still faster

@phofl
Copy link
Member

phofl commented Nov 24, 2022

Could you merge main? We got some deprecation warnings from numpy that are fixed on main

[1],
"int64[pyarrow]",
int,
marks=td.skip_if_no("pyarrow", min_version="1.0.0"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think min version is 7 or something similar, please adjust accordingly

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to version 6 which is the min version tested, although this test should pass with any version. It looks like there are a handful of other tests that have less than 6.0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can get rid of the min version then

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, thanks. updated

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM merge when ready @phofl

@phofl phofl added this to the 2.0 milestone Nov 29, 2022
@phofl phofl merged commit c872a8b into pandas-dev:main Nov 29, 2022
@phofl
Copy link
Member

phofl commented Nov 29, 2022

@lukemanley lukemanley deleted the nullable-tolist branch December 20, 2022 00:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays.
4 participants