Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
arr = []
df = pd.DataFrame({'a': pd.array(arr, dtype=pd.Int64Dtype())})
other = pd.DataFrame({'a': [1, 2]})
df2 = df.append(other)
# same issue for pd.concat(...)
# df2 = pd.concat([df, other])
assert df2['a'].dtype == df['a'].dtype
> assert df2['a'].dtype == df['a'].dtype
E AssertionError: assert dtype('O') == Int64Dtype()
E + where dtype('O') = 0 1\n1 2\nName: a, dtype: object.dtype
E + and Int64Dtype() = Series([], Name: a, dtype: Int64).dtype
Issue Description
When appending a dataframe (other) to another dataframe (df) that has an empty column of ExtensionDtype (in this case Int64Dtype, but the specific EA type doesn't matter), the resulting dataframe's column (df2['a']) loses the dtype information and turns into object dtype.
You can run the example with arr = [1] instead of the empty list (arr = []) and observe that, as expected, the dtype is not changed and remains Int64Dtype.
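For comparison, here is the non-empty case written out (same operation as the example above, using pd.concat), which keeps the nullable dtype as described:

import pandas as pd

# Non-empty EA column: the nullable dtype survives the concatenation (pandas 1.4.4).
df = pd.DataFrame({'a': pd.array([1], dtype=pd.Int64Dtype())})
other = pd.DataFrame({'a': [1, 2]})
df2 = pd.concat([df, other])
assert df2['a'].dtype == pd.Int64Dtype()  # passes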
I traced the issue to _concatenate_join_units and _get_empty_dtype: the latter drops the dtype information when the column is empty (the `if not unit.is_na` filter). As a result, the EA handling branch `elif any(is_1d_only_ea_obj(t) for t in to_concat)` in _concatenate_join_units is never entered.
def _get_empty_dtype(join_units: Sequence[JoinUnit]) -> DtypeObj:
    ...
    dtypes = [unit.dtype for unit in join_units if not unit.is_na]
    if not len(dtypes):
        dtypes = [unit.dtype for unit in join_units if unit.block.dtype.kind != "V"]

    dtype = find_common_type(dtypes)
    ...
def _concatenate_join_units(
    join_units: list[JoinUnit], concat_axis: int, copy: bool
) -> ArrayLike:
    """
    Concatenate values from several join units along selected axis.
    """
    if concat_axis == 0 and len(join_units) > 1:
        # Concatenating join units along ax0 is handled in _merge_blocks.
        raise AssertionError("Concatenating join units along axis0")

    empty_dtype = _get_empty_dtype(join_units)

    has_none_blocks = any(unit.block.dtype.kind == "V" for unit in join_units)
    upcasted_na = _dtype_to_na_value(empty_dtype, has_none_blocks)

    to_concat = [
        ju.get_reindexed_values(empty_dtype=empty_dtype, upcasted_na=upcasted_na)
        for ju in join_units
    ]

    if len(to_concat) == 1:
        # Only one block, nothing to concatenate.
        concat_values = to_concat[0]
        if copy:
            if isinstance(concat_values, np.ndarray):
                # non-reindexed (=not yet copied) arrays are made into a view
                # in JoinUnit.get_reindexed_values
                if concat_values.base is not None:
                    concat_values = concat_values.copy()
            else:
                concat_values = concat_values.copy()

    elif any(is_1d_only_ea_obj(t) for t in to_concat):  # <-- this branch isn't entered
        # TODO(EA2D): special case not needed if all EAs used HybridBlocks
        # NB: we are still assuming here that Hybrid blocks have shape (1, N)
        # concatting with at least one EA means we are concatting a single column
        # the non-EA values are 2D arrays with shape (1, n)
        # error: No overload variant of "__getitem__" of "ExtensionArray" matches
        # argument type "Tuple[int, slice]"
        to_concat = [
            t if is_1d_only_ea_obj(t) else t[0, :]  # type: ignore[call-overload]
            for t in to_concat
        ]
        concat_values = concat_compat(to_concat, axis=0, ea_compat_axis=True)
        concat_values = ensure_block_shape(concat_values, 2)

    else:
        concat_values = concat_compat(to_concat, axis=concat_axis)

    return concat_values
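For illustration only, a minimal sketch of one direction a fix could take (not the actual upstream change, and it would need checking against other all-NA concat cases): let a join unit that is empty/all-NA but still carries a concrete (non-"V") dtype contribute that dtype to find_common_type, so the EA dtype is not silently dropped:

def _get_empty_dtype(join_units: Sequence[JoinUnit]) -> DtypeObj:
    ...
    # Sketch only: also keep the dtype of units that are all-NA/empty but have a
    # concrete (non-void) dtype, so e.g. an empty Int64 column still participates
    # in find_common_type instead of being ignored.
    dtypes = [
        unit.dtype
        for unit in join_units
        if not unit.is_na or unit.block.dtype.kind != "V"
    ]

    dtype = find_common_type(dtypes)
    ...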
Expected Behavior
The dtype information should be preserved, since both dtypes are compatible and the fact that one Series is empty shouldn't matter; df2['a'].dtype should remain Int64.
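As a stop-gap, a user-level workaround (a sketch, assuming the target dtype is known up front) is to cast the column back after concatenating:

df2 = pd.concat([df, other]).astype({'a': pd.Int64Dtype()})  # or .astype({'a': 'Int64'})
assert df2['a'].dtype == pd.Int64Dtype()  # passes once the cast is applied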
Installed Versions
INSTALLED VERSIONS
commit : ca60aab
python : 3.8.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.8-200.fc36.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Thu Sep 8 19:02:21 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8
pandas : 1.4.4
numpy : 1.23.2
pytz : 2020.4
dateutil : 2.8.1
setuptools : 59.6.0
pip : 22.2.2
Cython : 0.29.32
pytest : 7.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 0.9.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : 1.3.5
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 1.1.1
matplotlib : None
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 1.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
snappy : None
sqlalchemy : 1.3.23
tables : 3.7.0
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
zstandard : None