  1. Tika
  2. TIKA-818

Allow PDFBox to be used with RandomAccessFile vs RandomAccessBuffer to allow for a memory vs performance tradeoff


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10, 1.0
    • Fix Version/s: 1.1
    • Component/s: parser
    • Labels: None

    Description

      After upgrading to Tika 0.10, I began hitting OOM errors when processing large numbers of PDFs in parallel. The heap dump indicated that the memory was being consumed by PDFBox RandomAccessBuffer instances. After digging around, it looks like PDFBox now defaults to using RAM rather than temporary files for PDF extraction. This can be overridden to use RandomAccessFile instead.

      I propose that Tika choose between file and buffer based on the type of input stream received. If the TikaInputStream is backed by a file, RandomAccessFile should be used; for other stream types, RandomAccessBuffer can be used.
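
      The selection rule proposed above can be sketched in isolation. This is only an illustration of the memory vs performance tradeoff, not the PDFBox or Tika code: ScratchChooser, Scratch, MemoryScratch, FileScratch and choose are hypothetical stand-ins, and the boolean argument is assumed to play the role of a check such as TikaInputStream.hasFile().

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Hypothetical sketch; these types are stand-ins, not the PDFBox or Tika classes. */
final class ScratchChooser {

    /** Minimal scratch-storage contract: append bytes, report size, release resources. */
    interface Scratch extends AutoCloseable {
        void append(byte[] data);
        long length();
        @Override void close();
    }

    /** Heap-backed scratch: fast, but every byte counts against the JVM heap. */
    static final class MemoryScratch implements Scratch {
        private byte[] buf = new byte[0];
        public void append(byte[] data) {
            int old = buf.length;
            buf = Arrays.copyOf(buf, old + data.length);
            System.arraycopy(data, 0, buf, old, data.length);
        }
        public long length() { return buf.length; }
        public void close() { buf = new byte[0]; }
    }

    /** Disk-backed scratch: slower, but keeps the heap footprint small. */
    static final class FileScratch implements Scratch {
        private final File file;
        private final RandomAccessFile raf;
        FileScratch() {
            try {
                file = File.createTempFile("scratch", ".tmp");
                raf = new RandomAccessFile(file, "rw");
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        public void append(byte[] data) {
            try {
                raf.seek(raf.length()); // always write at the end of the file
                raf.write(data);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        public long length() {
            try {
                return raf.length();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        public void close() {
            try {
                raf.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            file.delete(); // remove the temporary file once it is no longer needed
        }
    }

    /** Assumed rule: file-backed input gets disk scratch, anything else gets heap scratch. */
    static Scratch choose(boolean inputHasFile) {
        return inputHasFile ? new FileScratch() : new MemoryScratch();
    }
}
```

      With many PDFs parsed in parallel, the disk-backed variant trades some speed for a bounded heap, which is the behaviour asked for when the input is already a file.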

      I believe the code to control this is here:
      https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java

      At ~line 87:
          PDDocument pdfDocument =
              PDDocument.load(new CloseShieldInputStream(stream), true);

      I'm not sure this is the best approach, and I'm curious whether there are other ideas on how to control this while keeping the interface clean.

      Attachments

        1. PDFParser.java.patch
          3 kB
          Paul Pearcy
        2. choose_inmemory_vs_temp_file_pdf_passes_tests.patch
          3 kB
          Paul Pearcy
        3. choose_inmemory_vs_temp_file_pdf.patch
          3 kB
          Paul Pearcy

          People

            Assignee: Unassigned
            Reporter: Paul Pearcy
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Original Estimate: 24h
                Remaining Estimate: 24h
                Time Spent: Not Specified