Tika / TIKA-2496

TIKA crashes / runs out of memory on simple PDF

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.15
    • Fix Version/s: None
    • Component/s: core
    • Labels: None
    • Environment: Linux, Java 8

    Description

      We're using Tika embedded in a web crawler, and today I encountered a PDF that causes an OutOfMemoryError while Tika processes it.

      We tried with -Xmx set to 5 GB; the PDF file sizes are approximately 50 MB.
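
      When a heap setting is involved, it can help to confirm that the JVM actually picked it up before blaming the parser. The following is a minimal diagnostic sketch using only the standard library; the class name HeapCheck and the default file name sample.pdf are hypothetical and are not part of Tika or PDFBox.

```java
import java.io.File;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Tika or PDFBox.
final class HeapCheck {
    private static final long MB = 1024L * 1024L;

    /** Maximum heap the JVM will attempt to use (reflects -Xmx), in MB. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / MB;
    }

    /** Heap currently in use (allocated minus free), in MB. */
    static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MB;
    }

    /** One-line summary comparing heap limits with the input file size. */
    static String report(File input) {
        // File.length() returns 0 when the file does not exist.
        return String.format(Locale.ROOT,
                "max heap: %d MB, used: %d MB, input: %d MB",
                maxHeapMb(), usedHeapMb(), input.length() / MB);
    }

    public static void main(String[] args) {
        // Assumed default path; pass the real PDF as the first argument.
        File input = new File(args.length > 0 ? args[0] : "sample.pdf");
        System.out.println(report(input));
    }
}
```

      Logging this line in the crawler just before the parse call shows whether the 5 GB limit is in effect in the process that actually runs Tika.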

      Tika version: 1.15
      The error is as follows:

      Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
      at org.apache.pdfbox.io.ScratchFileBuffer.addPage(ScratchFileBuffer.java:132)
      at org.apache.pdfbox.io.ScratchFileBuffer.ensureAvailableBytesInPage(ScratchFileBuffer.java:184)
      at org.apache.pdfbox.io.ScratchFileBuffer.write(ScratchFileBuffer.java:236)
      at org.apache.pdfbox.io.RandomAccessOutputStream.write(RandomAccessOutputStream.java:46)
      at org.apache.pdfbox.cos.COSStream$2.write(COSStream.java:266)
      at org.apache.pdfbox.pdfparser.COSParser.readValidStream(COSParser.java:1142)
      at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:970)
      at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:781)
      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:742)
      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:673)
      at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:633)
      at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:241)
      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:276)
      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1132)
      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1066)
      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:141)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)

      Please let us know how to fix this issue.
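
      Because the parser runs inside the crawler, one failing document can take the whole process down. The following is a rough sketch, under stated assumptions, of wrapping a parse call so that an OutOfMemoryError or a timeout skips the document instead. GuardedParse and parseOrSkip are hypothetical names, and the Callable stands in for whatever code invokes the parser. Recovering from OutOfMemoryError inside the same JVM is not reliable in general, so running the parse in a separate process is likely the more robust choice.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical wrapper for illustration; not part of Tika or PDFBox.
final class GuardedParse {

    /**
     * Runs parseTask on a worker thread. Returns the extracted text, or
     * Optional.empty() if the task ran out of heap, timed out, or was
     * interrupted. Other failures are rethrown.
     */
    static Optional<String> parseOrSkip(Callable<String> parseTask,
                                        long timeoutSeconds) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> result = pool.submit(parseTask);
        try {
            return Optional.ofNullable(
                    result.get(timeoutSeconds, TimeUnit.SECONDS));
        } catch (ExecutionException e) {
            // Errors thrown by the task arrive here as the cause.
            if (e.getCause() instanceof OutOfMemoryError) {
                return Optional.empty(); // skip this document
            }
            throw new IllegalStateException("parse failed", e.getCause());
        } catch (TimeoutException e) {
            result.cancel(true); // ask the worker to stop
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } finally {
            pool.shutdownNow();
        }
    }
}
```

      A caller would pass the real parse invocation as the Callable and log the document URL whenever the result is empty, so that problem files can be examined separately.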

            People

              Assignee: Unassigned
              Reporter: chimbu chelambarasan