Apache Arrow / ARROW-18198

IndexOutOfBoundsException when loading compressed IPC format


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0.1, 9.0.0, 10.0.0
    • Fix Version/s: None
    • Component/s: Java
    • Labels: None
    • Environment: Linux and Windows.
      Apache Arrow Java version: 10.0.0, 9.0.0, 4.0.1.
      Pandas 1.4.2 using pyarrow 8.0.0 (anaconda3-2022.05)

    Description

      I encountered this bug when I loaded a dataframe stored in the Arrow IPC format.

       

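      // Java code from the "Apache Arrow Java Cookbook", wrapped in a
      // hypothetical ReadArrowFile class so that it compiles standalone.
      import java.io.File;
      import java.io.FileInputStream;
      import java.io.IOException;
      import org.apache.arrow.memory.BufferAllocator;
      import org.apache.arrow.memory.RootAllocator;
      import org.apache.arrow.vector.VectorSchemaRoot;
      import org.apache.arrow.vector.ipc.ArrowFileReader;
      import org.apache.arrow.vector.ipc.message.ArrowBlock;

      public class ReadArrowFile {
          public static void main(String[] args) {
              File file = new File("example.arrow");
              try (
                      BufferAllocator rootAllocator = new RootAllocator();
                      FileInputStream fileInputStream = new FileInputStream(file);
                      ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(), rootAllocator)
              ) {
                  System.out.println("Record batches in file: " + reader.getRecordBlocks().size());
                  // Load each record batch in turn and print its contents as TSV.
                  for (ArrowBlock arrowBlock : reader.getRecordBlocks()) {
                      reader.loadRecordBatch(arrowBlock);
                      VectorSchemaRoot vectorSchemaRootRecover = reader.getVectorSchemaRoot();
                      System.out.print(vectorSchemaRootRecover.contentToTSVString());
                  }
              } catch (IOException e) {
                  e.printStackTrace();
              }
          }
      }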

      Call stack:

      Exception in thread "main" java.lang.IndexOutOfBoundsException: index: 0, length: 2048 (expected: range(0, 2024))
          at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:701)
          at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:955)
          at org.apache.arrow.vector.BaseFixedWidthVector.reAlloc(BaseFixedWidthVector.java:451)
          at org.apache.arrow.vector.BaseFixedWidthVector.setValueCount(BaseFixedWidthVector.java:732)
          at org.apache.arrow.vector.VectorSchemaRoot.setRowCount(VectorSchemaRoot.java:240)
          at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:86)
          at org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:220)
          at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:166)
          at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:197)

      This bug can be reproduced with a simple dataframe created by pandas:

      import pandas as pd
      pd.DataFrame({'a': range(10000)}).to_feather('example.arrow')

      Pandas compresses the dataframe by default. If compression is turned off, Java loads the file without error. I therefore suspect that the bounds-checking code is buggy when loading a compressed file.

       

      That dataframe can be loaded in polars, pandas and pyarrow, so it's unlikely to be a pandas bug.

       

       

    People

      Assignee: Unassigned
      Reporter: Georeth Zhou (georeth)
      Votes: 0
      Watchers: 4