TIKA-3538

TikaServer, cancelling request client-side does not kill working OCR process

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0-BETA
    • Fix Version/s: None
    • Component/s: server
    • Labels: None
    • Environment:
      OS: ArcoLinux
      Kernel: 5.10.60-1-lts
      CPU: Intel i5-8400 (6) @ 4.000GHz
      Memory: 32Gb

    Description

      It appears that cancelling a request does not stop the work in Tika. The handler finishes the job and only fails when it attempts to return the data.

      I would have expected Tika to detect client-side cancellations and propagate them to the relevant child processes, such as tesseract, thus avoiding unnecessary work.
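
      To make the expectation concrete, here is a minimal shell sketch of a parent that stops its child as soon as a cancellation is signalled. This is not Tika's code: `run_cancellable` and the cancel-flag file are hypothetical names, and the flag file only stands in for "the client has disconnected".

```shell
# Hypothetical sketch: run a command, but kill it if a cancel flag appears.
# Usage: run_cancellable <cancel_flag_file> <command> [args...]
run_cancellable() {
  local cancel_flag="$1"
  shift
  "$@" &                 # start the worker (stand-in for the tesseract child)
  local child=$!
  # Poll until the child exits on its own or a cancellation is requested.
  while kill -0 "$child" 2>/dev/null; do
    if [ -e "$cancel_flag" ]; then
      kill "$child" 2>/dev/null            # propagate the cancellation
      wait "$child" 2>/dev/null || true    # reap the killed child
      return 130                           # conventional "interrupted" status
    fi
    sleep 1
  done
  wait "$child"          # normal completion: return the child's exit status
}
```

      For example, `run_cancellable /tmp/cancel.flag sleep 60` returns 130 shortly after another process creates /tmp/cancel.flag, and `sleep` no longer runs. The behaviour I expected from the server is analogous: the closed connection plays the role of the flag.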

      I send a request like the following, where FILE is a PDF that has inline images and requires OCR.

      curl -T "$FILE" \
                -s "http://localhost:9998/tika/text" \
                -H "Accept: application/json" \
                -H "X-Tika-OCRLanguage: dan+eng" \
                -H "X-Tika-PDFextractInlineImages: true"
      

      Then I press Ctrl-C before the response is returned.
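
      The manual steps can probably be scripted as well. The sketch below assumes that a curl timeout (`--max-time`) looks the same to the server as an interrupted client, that the compose service is named `tika` as in the docker-compose.yaml in this report, and that the OCR child shows up as `tesseract` in the container's process list. The function names and the `TIKA_URL` variable are hypothetical.

```shell
# Endpoint from this report; override TIKA_URL if the server runs elsewhere.
TIKA_URL="${TIKA_URL:-http://localhost:9998/tika/text}"

# Send the request but abandon it after <seconds>, emulating a client-side cancel.
# Usage: abort_request <file> [seconds]
abort_request() {
  local file="$1"
  local seconds="${2:-2}"
  curl -T "$file" -s "$TIKA_URL" \
       --max-time "$seconds" \
       -H "Accept: application/json" \
       -H "X-Tika-OCRLanguage: dan+eng" \
       -H "X-Tika-PDFextractInlineImages: true"
}

# Succeeds if a tesseract process is still listed for the tika service.
ocr_still_running() {
  docker compose top tika | grep -q tesseract
}
```

      A possible run is `abort_request "$FILE" 2; ocr_still_running && echo "OCR still running after cancel"`. curl exits with status 28 when the timeout is reached.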

      Dockerfile:

      FROM apache/tika:2.0.0-full
      
      RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get -y install tesseract-ocr-dan
      

      docker-compose.yaml:

      version: "3.9"
      
      
      services:
        tika:
          build: tika/
          ports:
            - "9998:9998"
      

       

      Attachments

        1. tika_error.log
          18 kB
          Jens Emil Schulz Østergaard

          People

            Assignee: Unassigned
            Reporter: Jens Emil Schulz Østergaard