Description
Hey,
sorry for not posting this to the mailing list; I never got the confirmation.
The issue is that people often don't realize there are two kinds of PDF documents: those exported from OpenOffice or MS Office, which contain real text, and those produced by scanner software, which contain only page images. When Tika processes a scanned document, it correctly detects the PDF content type, but there is no text to extract, only images. I don't know how to deal with that. There should be a function that decides which kind of PDF a document is, so that I can take the scanned ones and run them through some OCR software.
If there is a way to do that, could anybody please explain how?
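To make the request concrete, here is a rough sketch of the kind of check I have in mind. It is only a heuristic, and everything in it is made up for illustration: `PdfKindGuess`, `looksScanned` and the 20-characters-per-page threshold are hypothetical and not part of Tika. It assumes the text has already been extracted from the PDF by some other means and that the page count is known.

```java
// Hypothetical helper, not part of Tika: guesses whether a PDF is a scan
// (image-only) from the amount of text that could be extracted from it.
final class PdfKindGuess {

    // Assumed threshold: fewer visible characters per page than this is
    // treated as "no real text layer". The value is arbitrary; tune it.
    static final int MIN_CHARS_PER_PAGE = 20;

    // extractedText: text already extracted from the PDF (may be null or empty).
    // pageCount: number of pages in the PDF, must be positive.
    static boolean looksScanned(String extractedText, int pageCount) {
        if (pageCount <= 0) {
            throw new IllegalArgumentException("pageCount must be positive");
        }
        if (extractedText == null) {
            return true;
        }
        // Count only non-whitespace characters, since extraction from an
        // image-only PDF can still yield blank lines and page separators.
        long visible = extractedText.chars()
                .filter(c -> !Character.isWhitespace(c))
                .count();
        return visible < (long) MIN_CHARS_PER_PAGE * pageCount;
    }
}
```

A caller would then send the document to OCR only when `looksScanned(text, pages)` returns true. This will misjudge mixed documents, for example a scanned PDF with a typed cover page, so a per-page check would probably be more reliable.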
Issue Links
- is duplicated by: TIKA-93 OCR support (Resolved)