Datetime parsing (PDEP-4): allow mixture of ISO formatted strings?  #50411

Closed
@jorisvandenbossche

In pandas < 2, parsing datetime strings that are all strictly ISO 8601 formatted (so there is no ambiguity), but that differ slightly in formatting because they have different resolutions, works fine:

>>> pd.to_datetime(["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"])
DatetimeIndex(['2022-01-01 09:00:00', '2022-01-02 09:00:00.123000'], dtype='datetime64[ns]', freq=None)

With the changes related to datetime parsing (PDEP-4), pandas now infers the format from the first element, and parsing the second element then fails:

>>> pd.to_datetime(["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"])
...
ValueError: time data "2022-01-02T09:00:00.123" at position 1 doesn't match format "%Y-%m-%dT%H:%M:%S"
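The failure mode can be reproduced outside pandas. The following is a minimal sketch using only the standard library, assuming that the format is guessed from the first element and then applied to every value. The helper `parse_with_single_format` and the hard-coded `inferred_format` are hypothetical names for illustration; this is not how pandas implements its parsing.

```python
from datetime import datetime

strings = ["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"]

# Assumption: this is the format that would be guessed from the first element.
inferred_format = "%Y-%m-%dT%H:%M:%S"


def parse_with_single_format(values, fmt):
    """Parse every value with one format and raise on the first mismatch."""
    parsed = []
    for position, value in enumerate(values):
        try:
            parsed.append(datetime.strptime(value, fmt))
        except ValueError as exc:
            raise ValueError(
                f'time data "{value}" at position {position} '
                f'doesn\'t match format "{fmt}"'
            ) from exc
    return parsed


try:
    parse_with_single_format(strings, inferred_format)
    error_message = None
except ValueError as exc:
    # The second string carries fractional seconds that "%S" cannot consume.
    error_message = str(exc)
```

The first string parses, while the second one is rejected because its fractional seconds are not covered by the format guessed from the first.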

This is of course expected and can be explained with the changes in behaviour. But I do wonder if we want to allow some way to still parse such data in a fast way.

For this specific case, you can't specify a format, since the formatting is inconsistent across elements. And before, this was actually fast because it took a fast ISO parsing path.
With the current dev version of pandas, I don't see any way to parse this in a performant way?
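To illustrate the trade-off, here is a rough sketch of an element-wise fallback that uses only the standard library rather than pandas. The helper name `parse_iso_elementwise` is hypothetical. It accepts both resolutions because each string is parsed on its own, but it runs in a Python-level loop, so it is presumably much slower than a vectorized ISO path on large inputs.

```python
from datetime import datetime


def parse_iso_elementwise(values):
    """Parse each ISO 8601 string independently, so resolutions may differ."""
    return [datetime.fromisoformat(value) for value in values]


parsed = parse_iso_elementwise(
    ["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"]
)
```

This works for the two strings in the example, but it does not address the performance concern raised here.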

(This also came up in pyarrow's CI. It was an easy fix on our side to update the strings in the test, but I still wanted to raise this as a specific case to consider, since with real-world data you can't necessarily update your data easily.)

cc @MarcoGorelli

Labels: Datetime (Datetime data dtype)