Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 4.0.1, 9.0.0, 10.0.0
Fix Version/s: None
Component/s: None
Environment: Linux and Windows.
Apache Arrow Java version: 10.0.0, 9.0.0, 4.0.1.
Pandas 1.4.2 using pyarrow 8.0.0 (anaconda3-2022.05)
Description
I encountered this bug when I loaded a dataframe stored in the Arrow IPC format.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.message.ArrowBlock;

// Java code from the "Apache Arrow Java Cookbook"
File file = new File("example.arrow");
try (
    BufferAllocator rootAllocator = new RootAllocator();
    FileInputStream fileInputStream = new FileInputStream(file);
    ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(), rootAllocator)
) {
    // Load and print every record batch in the IPC file.
    System.out.println("Record batches in file: " + reader.getRecordBlocks().size());
    for (ArrowBlock arrowBlock : reader.getRecordBlocks()) {
        reader.loadRecordBatch(arrowBlock);
        VectorSchemaRoot vectorSchemaRootRecover = reader.getVectorSchemaRoot();
        System.out.print(vectorSchemaRootRecover.contentToTSVString());
    }
} catch (IOException e) {
    e.printStackTrace();
}
Call stack:
Exception in thread "main" java.lang.IndexOutOfBoundsException: index: 0, length: 2048 (expected: range(0, 2024))
    at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:701)
    at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:955)
    at org.apache.arrow.vector.BaseFixedWidthVector.reAlloc(BaseFixedWidthVector.java:451)
    at org.apache.arrow.vector.BaseFixedWidthVector.setValueCount(BaseFixedWidthVector.java:732)
    at org.apache.arrow.vector.VectorSchemaRoot.setRowCount(VectorSchemaRoot.java:240)
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:86)
    at org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:220)
    at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:166)
    at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:197)
This bug can be reproduced with a simple dataframe created by pandas:

import pandas as pd

pd.DataFrame({'a': range(10000)}).to_feather('example.arrow')
Pandas compresses the dataframe by default. If compression is turned off, the Java reader loads the file without error, so I suspect the bounds-checking code is buggy when it loads compressed files.
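For reference, a minimal sketch of the uncompressed write (this assumes to_feather forwards the compression keyword to pyarrow.feather.write_feather, which pandas 1.4 does):

import pandas as pd

# Writing without compression works around the bug: the resulting file
# loads in the Java reader without the IndexOutOfBoundsException.
pd.DataFrame({'a': range(10000)}).to_feather('example.arrow', compression='uncompressed')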
The same file can be loaded by polars, pandas, and pyarrow, so it is unlikely to be a pandas bug.
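A minimal sketch of that cross-check (assuming pyarrow 8.0.0, as in the environment above):

import pandas as pd
import pyarrow.feather as feather

# Both readers load the compressed file without error, which points at
# the Java reader rather than the writer.
print(feather.read_table('example.arrow').num_rows)  # 10000
print(pd.read_feather('example.arrow').shape)        # (10000, 1)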