Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 9.0.0
Fix Version/s: None
Description
Hi there,
First time submitting an issue here so apologies if there's anything I've missed.
I see the bug below, whereby the dtype of the categories themselves (within a pd.Categorical) is not preserved on a round trip via pyarrow. Hopefully the snippet below demonstrates the issue.
This causes a problem because the category dtypes need to be the same for two categoricals to be considered the same dtype (so that they can then be concatenated, for example).
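To illustrate why the category dtype matters, here is a minimal pandas-only sketch. The variable names are made up for illustration, and the two dtypes are constructed by hand rather than produced by pyarrow. The exact dtype of the concatenated result may vary between pandas versions, so the sketch only checks whether it is still categorical.

```python
import pandas as pd

values = ["a", "b", "c"]

# Same category labels, but the categories Index has a different dtype.
object_cats = pd.CategoricalDtype(pd.Index(values, dtype=object))
string_cats = pd.CategoricalDtype(pd.Index(values, dtype="string"))

s_object = pd.Series(["a", "b"], dtype=object_cats)
s_string = pd.Series(["c", "a"], dtype=string_cats)

# The two categorical dtypes compare as different ...
dtypes_equal = object_cats == string_cats

# ... so concatenation cannot keep the result categorical.
combined = pd.concat([s_object, s_string], ignore_index=True)
combined_is_categorical = isinstance(combined.dtype, pd.CategoricalDtype)

print("dtypes equal:", dtypes_equal)
print("still categorical after concat:", combined_is_categorical)
```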
My current workaround is to store the column as a plain pd.StringDtype() and then convert it to pd.Categorical in memory with pandas (which infers the category dtype from the underlying type), but this sacrifices the disk savings of storing the column as a dictionary.
Using pyarrow 9.0.0 and pandas 1.4.4.
Thanks
import pandas as pd
import pyarrow as pa
# note, Categorical column B is constructed from `pd.StringDtype`
df = pd.DataFrame({"A": ["a", "b", "c", "a"]}, dtype=pd.StringDtype())
df["B"] = df["A"].astype("category")
print(df["B"].cat.categories)
# Index(['a', 'b', 'c'], dtype='string')
# however, this is downcast to `object` during a roundtrip
print(pa.Table.from_pandas(df).to_pandas()["B"].cat.categories)
# Index(['a', 'b', 'c'], dtype='object')
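
A possible alternative workaround, sketched under assumptions: keep the column dictionary-encoded on disk and rebuild the categories with pd.StringDtype() after reading it back. The helper name `restore_string_categories` is hypothetical, and the example uses a hand-built Series as a stand-in for the round-tripped column instead of calling pyarrow.

```python
import pandas as pd


def restore_string_categories(s: pd.Series) -> pd.Series:
    """Rebuild a categorical Series so its categories use pd.StringDtype().

    Hypothetical helper: the integer codes are reused as-is and only the
    categories Index is converted, so the values are not re-encoded.
    """
    rebuilt = pd.Categorical.from_codes(
        s.cat.codes.to_numpy(),
        categories=s.dtype.categories.astype("string"),
        ordered=s.dtype.ordered,
    )
    return pd.Series(rebuilt, index=s.index, name=s.name)


# Stand-in for a column that came back with object-dtype categories.
after_round_trip = pd.Series(
    pd.Categorical(
        ["a", "b", "c", "a"],
        categories=pd.Index(["a", "b", "c"], dtype=object),
    ),
    name="B",
)

restored = restore_string_categories(after_round_trip)
print(restored.cat.categories.dtype)
```

In principle the same helper could be applied to `pa.Table.from_pandas(df).to_pandas()["B"]` from the snippet above, though it has to be called explicitly for every affected column.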