
BUG: Appending or concatenating to empty ExtensionArray removes type information #48510

Closed
@ssche

Description


Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

    import pandas as pd

    arr = []
    df = pd.DataFrame({'a': pd.array(arr, dtype=pd.Int64Dtype())})
    other = pd.DataFrame({'a': [1, 2]})

    df2 = df.append(other)
    # same issue for pd.concat(...)
    # df2 = pd.concat([df, other])
    assert df2['a'].dtype == df['a'].dtype

Running the assertion under pytest fails with:

    >       assert df2['a'].dtype == df['a'].dtype
    E       AssertionError: assert dtype('O') == Int64Dtype()
    E        +  where dtype('O') = 0    1\n1    2\nName: a, dtype: object.dtype
    E        +  and   Int64Dtype() = Series([], Name: a, dtype: Int64).dtype

Issue Description

When appending a dataframe (other) to another dataframe (df) that has an empty column of an extension dtype (here Int64Dtype, though the specific EA type doesn't matter), the resulting dataframe's column (df2['a']) loses its dtype information and falls back to object dtype.

You can run the example with arr = [1] instead of the empty list (arr = []) and observe that, as expected, the dtype is preserved and remains Int64Dtype.

I traced the issue to _concatenate_join_units and _get_empty_dtype: the filter if not unit.is_na in _get_empty_dtype discards the dtype of the empty (all-NA) column, so the Int64 dtype never reaches find_common_type. As a consequence, the elif any(is_1d_only_ea_obj(t) for t in to_concat) EA-handling branch in _concatenate_join_units is never entered.

def _get_empty_dtype(join_units: Sequence[JoinUnit]) -> DtypeObj:
    ...
    dtypes = [unit.dtype for unit in join_units if not unit.is_na]
    if not len(dtypes):
        dtypes = [unit.dtype for unit in join_units if unit.block.dtype.kind != "V"]

    dtype = find_common_type(dtypes)
    ...
def _concatenate_join_units(
    join_units: list[JoinUnit], concat_axis: int, copy: bool
) -> ArrayLike:
    """
    Concatenate values from several join units along selected axis.
    """
    if concat_axis == 0 and len(join_units) > 1:
        # Concatenating join units along ax0 is handled in _merge_blocks.
        raise AssertionError("Concatenating join units along axis0")

    empty_dtype = _get_empty_dtype(join_units)

    has_none_blocks = any(unit.block.dtype.kind == "V" for unit in join_units)
    upcasted_na = _dtype_to_na_value(empty_dtype, has_none_blocks)

    to_concat = [
        ju.get_reindexed_values(empty_dtype=empty_dtype, upcasted_na=upcasted_na)
        for ju in join_units
    ]

    if len(to_concat) == 1:
        # Only one block, nothing to concatenate.
        concat_values = to_concat[0]
        if copy:
            if isinstance(concat_values, np.ndarray):
                # non-reindexed (=not yet copied) arrays are made into a view
                # in JoinUnit.get_reindexed_values
                if concat_values.base is not None:
                    concat_values = concat_values.copy()
            else:
                concat_values = concat_values.copy()

    elif any(is_1d_only_ea_obj(t) for t in to_concat):  # <-- this branch isn't entered
        # TODO(EA2D): special case not needed if all EAs used HybridBlocks
        # NB: we are still assuming here that Hybrid blocks have shape (1, N)
        # concatting with at least one EA means we are concatting a single column
        # the non-EA values are 2D arrays with shape (1, n)

        # error: No overload variant of "__getitem__" of "ExtensionArray" matches
        # argument type "Tuple[int, slice]"
        to_concat = [
            t if is_1d_only_ea_obj(t) else t[0, :]  # type: ignore[call-overload]
            for t in to_concat
        ]
        concat_values = concat_compat(to_concat, axis=0, ea_compat_axis=True)
        concat_values = ensure_block_shape(concat_values, 2)

    else:
        concat_values = concat_compat(to_concat, axis=concat_axis)

    return concat_values
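To see that the dtype information is merely being filtered out rather than genuinely unavailable, one can call find_common_type directly on the two join units' dtypes. Note this is pandas-internal API (the import path may differ across versions), so treat it as an illustration only:

```python
import numpy as np
import pandas as pd
from pandas.core.dtypes.cast import find_common_type  # internal API

# the dtypes of the two join units in the example above
ea_dtype = pd.Int64Dtype()    # the empty extension column in df
np_dtype = np.dtype('int64')  # the incoming plain column in other

# with both dtypes present, the common dtype resolves to the
# extension dtype, not object
print(find_common_type([ea_dtype, np_dtype]))  # Int64
```

In other words, had the empty unit's dtype survived the filter, find_common_type would have produced Int64 and the EA branch would have been taken.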

Expected Behavior

Type information remains as both types are compatible (the fact that one Series is empty shouldn't matter).
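Until this is fixed, a workaround is to align the dtypes explicitly. A minimal sketch (using pd.concat, since DataFrame.append is deprecated as of pandas 1.4):

```python
import pandas as pd

df = pd.DataFrame({'a': pd.array([], dtype=pd.Int64Dtype())})
other = pd.DataFrame({'a': [1, 2]})

# cast the incoming column to the matching nullable dtype up front,
# so both join units share the same dtype and nothing is downgraded
df2 = pd.concat([df, other.astype({'a': 'Int64'})])
assert df2['a'].dtype == pd.Int64Dtype()

# or restore the dtype after concatenating
df3 = pd.concat([df, other]).astype({'a': 'Int64'})
assert df3['a'].dtype == pd.Int64Dtype()
```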

Installed Versions


commit : ca60aab
python : 3.8.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.8-200.fc36.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Thu Sep 8 19:02:21 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8

pandas : 1.4.4
numpy : 1.23.2
pytz : 2020.4
dateutil : 2.8.1
setuptools : 59.6.0
pip : 22.2.2
Cython : 0.29.32
pytest : 7.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 0.9.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : 1.3.5
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 1.1.1
matplotlib : None
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 1.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
snappy : None
sqlalchemy : 1.3.23
tables : 3.7.0
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
zstandard : None

Metadata

Assignees

No one assigned

Labels

    Dtype Conversions (Unexpected or buggy dtype conversions), ExtensionArray (Extending pandas with custom dtypes or arrays), Needs Tests (Unit test(s) needed to prevent regressions), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), good first issue
