Tika / TIKA-630

Dealing with PDF documents from scanning programs

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • 0.10
    • None
    • general

    Description

      Hey,

      Sorry I didn't post this to the mailing list; I never got the confirmation.

      The issue is that people often don't realize there are two kinds of PDF documents: those exported from OpenOffice or MS Office, which contain real text, and those produced by scanner software, which contain only page images. When Tika processes a scanned document, it detects the PDF content type, but there are only images inside. I don't know how to deal with that. There should be a function that determines which kind of PDF a document is, so that I can route scanner-produced PDFs to OCR software.

      If there is already a way to do that, could anybody please explain how?
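
      To make the request concrete, here is a minimal sketch of the kind of check being asked for. It is not a Tika API: the class name, method names, and threshold are all hypothetical, and it assumes the caller has already extracted the text (for example with the Tika facade's parseToString) and knows the page count. The idea is simply that a PDF yielding almost no text per page is probably image-only and a candidate for OCR.

```java
// Hypothetical helper, not part of Tika. It only looks at text that was
// already extracted from the PDF (for example via new Tika().parseToString(file)).
final class ScannedPdfHeuristic {

    // Assumed threshold: fewer non-whitespace characters per page than this
    // is treated as "probably image-only". Tune it on your own documents.
    static final int MIN_CHARS_PER_PAGE = 20;

    private ScannedPdfHeuristic() {}

    /** Counts the characters in the text that are not whitespace. */
    static int countNonWhitespace(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    /**
     * Returns true when the extracted text is too sparse for the number of
     * pages, which suggests the PDF came from a scanner and needs OCR.
     */
    static boolean looksScanned(String extractedText, int pageCount) {
        if (pageCount <= 0) {
            throw new IllegalArgumentException("pageCount must be positive");
        }
        String text = (extractedText == null) ? "" : extractedText;
        return countNonWhitespace(text) < (long) MIN_CHARS_PER_PAGE * pageCount;
    }
}
```

      A caller would pass the extracted text and the page count, then send the file to OCR when looksScanned returns true. This is only a heuristic: a scanned PDF that already carries a hidden OCR text layer would pass as a normal document, and a mostly empty but genuine text PDF would be flagged.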

            People

              Assignee: Unassigned
              Reporter: Joseph Vychtrle (vychtrle)
              Votes: 1
              Watchers: 0