PyArrow StringDtype / StringArray fallback policy #42613

Closed
@TomAugspurger

In #35169 / #42597 we discussed the desired behavior of PyArrow-backed StringArray when a certain method is not implemented in pyarrow.compute.

For string methods like str_normalize, which aren't currently implemented in pyarrow.compute, I believe we (silently) cast from string[pyarrow] to an object-dtype ndarray of Python str objects at

    mask = isna(self)
    arr = np.asarray(self)

That's going to be slow and more than doubles the memory usage of the array.
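To give a rough sense of the memory cost, here is a stdlib-only sketch. It does not measure pandas or pyarrow; it assumes CPython and compares an Arrow-like layout (one contiguous UTF-8 buffer plus integer offsets) with an object-array-like layout (one pointer plus one Python str object per element). The sample data and variable names are made up for illustration.

```python
import struct
import sys
from array import array

# Hypothetical sample data: 10,000 short ASCII strings of 12 characters each.
words = [f"value-{i:06d}" for i in range(10_000)]

# Arrow-like layout: a single UTF-8 data buffer plus one offset per boundary.
data = "".join(words).encode("utf-8")
offsets = array("i", [0])
for w in words:
    offsets.append(offsets[-1] + len(w.encode("utf-8")))
arrow_like_bytes = len(data) + offsets.itemsize * len(offsets)

# Object-array-like layout: one pointer per element plus a full Python str
# object (with its per-object header) for every value.
pointer_size = struct.calcsize("P")
object_like_bytes = pointer_size * len(words) + sum(sys.getsizeof(w) for w in words)

print(f"arrow-like:  {arrow_like_bytes:,} bytes")
print(f"object-like: {object_like_bytes:,} bytes")
```

During a fallback the original Arrow data also stays alive, so the peak footprint is roughly the sum of both figures.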

These kinds of performance cliffs are difficult for users to debug. I don't think we should do that conversion on behalf of the user. If something isn't implemented yet, then I think we should raise with a message saying they should convert to string[python] dtype first.
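A minimal sketch of what the raise option could look like. This is not the pandas implementation: `NATIVE_METHODS` and `check_native` are hypothetical names, and the set of supported methods is invented for the example.

```python
# Hypothetical set of methods that have a native Arrow kernel.
# This is an illustration, not the real list supported by pandas or pyarrow.
NATIVE_METHODS = {"upper", "lower", "len"}


def check_native(method_name: str, dtype_name: str = "string[pyarrow]") -> None:
    """Raise instead of silently converting to Python objects."""
    if method_name not in NATIVE_METHODS:
        raise NotImplementedError(
            f"'{method_name}' is not implemented for {dtype_name}; "
            "convert with .astype('string[python]') first."
        )


check_native("upper")  # has a native kernel in this sketch, so nothing happens

try:
    check_native("normalize")
except NotImplementedError as exc:
    message = str(exc)
    print(message)
```

The error message names the explicit conversion, so the user decides whether to pay for the object-dtype copy.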

If we don't want to raise, we could emit a PerformanceWarning, similar to what we do for SparseArray when converting to dense.
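The warning option might be sketched as below, assuming a fallback that still runs but announces itself. The `PerformanceWarning` class here is a local stand-in for `pandas.errors.PerformanceWarning` so the example runs without pandas, and `normalize_with_fallback` is a hypothetical helper operating on a plain list rather than a real StringArray.

```python
import unicodedata
import warnings


class PerformanceWarning(Warning):
    """Local stand-in for pandas.errors.PerformanceWarning."""


def normalize_with_fallback(values, form="NFC"):
    """Normalize strings in pure Python, warning that no native kernel was used."""
    warnings.warn(
        "str.normalize is not implemented natively; "
        "falling back to Python str objects",
        PerformanceWarning,
        stacklevel=2,
    )
    # Missing values (None here) are passed through untouched.
    return [None if v is None else unicodedata.normalize(form, v) for v in values]


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # "e" followed by a combining acute accent composes to a single code point.
    result = normalize_with_fallback(["e\u0301", None])

print(result, [w.category.__name__ for w in caught])
```

Users who want the stricter behavior can promote the warning to an error with `warnings.simplefilter("error", PerformanceWarning)`.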

Labels: API Design, Arrow (pyarrow functionality), Needs Discussion (Requires discussion from core team before further action), Strings (String extension data type and string data)