ManifoldCF / CONNECTORS-1307

Tika extractor infinite loop on error

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: ManifoldCF 2.4
    • Fix Version: ManifoldCF 2.5
    • Component: Tika extractor
    • Labels: None
    • Environment: Windows 64-bit, java version "1.8.0_77", pdfbox-1.8.10.jar, tika-parsers-1.10.jar

    Description

      The Tika extractor gets stuck, trying to parse the same document again and again, when it hits the following error:

      FATAL 2016-04-29 10:55:45,505 (Worker thread '41') - Error tossed: null
      java.lang.StackOverflowError
      	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
      	at org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:250)
      	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
      	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
      	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
      	at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
      	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
      	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:296)
      	at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:348)
      	at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
      	at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
      	at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
      	at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
      	at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
      	at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
      	at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
      	at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
      	at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
      

      -Xss is left at its default, which I believe is 512k.
      We could increase the thread stack size, but I think this error should not lead to an endless retry of the same document in the first place.
      Thanks a lot!
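
      To illustrate the failure mode described above: StackOverflowError is a java.lang.Error, so a handler written as catch (Exception e) never sees it. The following is a minimal sketch, not the ManifoldCF or Tika code. The names OverflowGuard, Outcome, descend and extractOnce are hypothetical, and descend only stands in for a parser that adds one stack frame per nesting level, like the repeated extractImages frames in the trace. It shows one way an extraction step could turn the overflow into a "rejected" result so the caller has a reason to stop retrying.

```java
public class OverflowGuard {
    /** Hypothetical result of a single extraction attempt. */
    public enum Outcome { EXTRACTED, REJECTED }

    /** Stand-in for a recursive parser: one stack frame per nesting level. */
    public static int descend(int depth) {
        if (depth == 0) {
            return 0;
        }
        return 1 + descend(depth - 1);
    }

    /**
     * Runs one parse attempt. A StackOverflowError is caught explicitly,
     * because it is an Error and would slip past catch (Exception e).
     */
    public static Outcome extractOnce(Runnable parse) {
        try {
            parse.run();
            return Outcome.EXTRACTED;
        } catch (StackOverflowError e) {
            // By the time we get here the deep frames have been unwound,
            // so it is safe to report the document as unparseable.
            return Outcome.REJECTED;
        }
    }

    public static void main(String[] args) {
        // Shallow nesting parses normally.
        System.out.println(extractOnce(() -> descend(10)));
        // Assumed to be far deeper than a default-sized thread stack allows.
        System.out.println(extractOnce(() -> descend(50_000_000)));
    }
}
```

      Raising -Xss only moves the depth at which the overflow happens; a sufficiently nested document would still trigger it, which is why the sketch handles the error instead of relying on a larger stack.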

            People

              Assignee: kwright@metacarta.com Karl Wright
              Reporter: kavdeev Konstantin Avdeev
              Votes: 0
              Watchers: 2
