Tika / TIKA-2939

Figure out how to allow OCR'ing of large PDFs via tika-server

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: server
    • Labels: None

    Description

      Tesseract can take a long time on large PDFs, which can lead to timeouts in JAX-RS and to the connection being closed before the response is written:

      Caused by: com.ctc.wstx.exc.WstxIOException: Closed
              at com.ctc.wstx.sw.BaseStreamWriter.flush(BaseStreamWriter.java:262)
              at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:104)
      Caused by: org.eclipse.jetty.io.EofException: Closed
              at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:491)
              at org.apache.cxf.transport.http_jetty.JettyHTTPDestination$JettyOutputStream.write(JettyHTTPDestination.java:322)
              at org.apache.cxf.io.AbstractWrappedOutputStream.write(AbstractWrappedOutputStream.java:51)
              at com.ctc.wstx.sw.EncodingXmlWriter.flushBuffer(EncodingXmlWriter.java:742)
              at com.ctc.wstx.sw.EncodingXmlWriter.flush(EncodingXmlWriter.java:176)
              at com.ctc.wstx.sw.BaseStreamWriter.flush(BaseStreamWriter.java:260)
      

      I tried expanding the timeouts on the client side:

       // Apache HttpClient RequestConfig takes milliseconds; this assumes TIMEOUT is in seconds
       RequestConfig config = RequestConfig.custom()
                      .setConnectTimeout(TIMEOUT * 1000)
                      .setConnectionRequestTimeout(TIMEOUT * 1000)
                      .setSocketTimeout(TIMEOUT * 1000)
                      .build();
      

      But this doesn't solve the problem.

      Can we increase the timeout on the server side and, if so, how? Is there a maximum?

      If we can't fix the problem with timeouts, we should figure out a way to let people select only a few pages for OCR so that clients can iterate through large PDFs.
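
      A minimal sketch of the client-side half of that idea, assuming the server could one day accept a page range per request. No such option is described in this issue, so the sketch only computes the ranges a client would iterate over. `PageBatches`, `Range`, and `split` are hypothetical names used for illustration, not part of Tika.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical client-side helper: splits a page count into inclusive, 1-based page ranges. */
final class PageBatches {

    /** An inclusive, 1-based page range. */
    static final class Range {
        final int first;
        final int last;

        Range(int first, int last) {
            this.first = first;
            this.last = last;
        }

        @Override
        public String toString() {
            return first + "-" + last;
        }
    }

    /** Returns consecutive ranges of at most batchSize pages that together cover 1..totalPages. */
    static List<Range> split(int totalPages, int batchSize) {
        if (totalPages < 0) {
            throw new IllegalArgumentException("totalPages must be >= 0");
        }
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<Range> ranges = new ArrayList<>();
        int first = 1;
        while (first <= totalPages) {
            // Compare against the remaining page count so the arithmetic cannot overflow.
            int last = (totalPages - first < batchSize) ? totalPages : first + batchSize - 1;
            ranges.add(new Range(first, last));
            if (last == totalPages) {
                break;
            }
            first = last + 1;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (Range r : split(23, 10)) {
            // Placeholder: a client would send one OCR request per range here,
            // so that no single request has to outlive the server's timeout.
            System.out.println("OCR pages " + r);
        }
    }
}
```

      The batch size would need tuning so that OCR on a single batch finishes well within whatever server-side timeout applies.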

      This issue is different from TIKA-1871 in that the problem isn't chunking the large document to get the file to tika-server; rather the problem is the amount of time it can take tika-server to run OCR on every page of a large PDF and return the full results.

          People

            Assignee: Unassigned
            Reporter: Tim Allison (tallison)