Skip to content

REGR?: read_sql no longer supports duplicate column names #53117

Closed
@jorisvandenbossche

Description

@jorisvandenbossche

Probably caused by #50048, and probably related to #52437 (another change in behaviour that might have been caused by the same PR). In essence the change is from using DataFrame.from_records to processing the records into a list of column arrays and then using DataFrame(dict(zip(columns, arrays))). That has slightly different behaviour.

Dummy reproducer for the read_sql change:

import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})

from sqlalchemy import create_engine
eng = create_engine("sqlite://")
df.to_sql("test_table", eng, index=False)

pd.read_sql("SELECT a, b, a +1 as a FROM test_table;", eng)

With pandas 1.5 this returns

   a    b  a
0  1  0.1  2
1  2  0.2  3
2  3  0.3  4

with pandas 2.0 this returns

   a    b
0  2  0.1
1  3  0.2
2  4  0.3

I don't know how much we want to support duplicate column names in read_sql, but it is a change in behaviour, and the new behaviour of just silently ignoring it / dropping some data also isn't ideal IMO.

cc @phofl

Metadata

Metadata

Assignees

No one assigned

    Labels

    IO SQLto_sql, read_sql, read_sql_queryRegressionFunctionality that used to work in a prior pandas version

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions