[TIKA-630] Dealing with PDF documents from scanning programs - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 0.10
Fix Version/s: None
Component/s: general
Labels:
- ocr
- pdf

Description

Hey,

Sorry I didn't post this to the mailing list; I never got the confirmation.

The issue is that people often don't realize there are two kinds of PDF documents: those exported from OpenOffice or MS Office, and those produced by scanner software. When Tika processes a scanned document, it detects the PDF content type, but the file contains only images. I don't know how to deal with that. There should be a function that determines which kind of PDF a document is, so that I can pass the scanned ones to OCR software.

If there is a way to do that, could anybody please explain how?
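
For illustration only, here is a rough sketch of the kind of check being asked for. It is not a Tika API: the class `ScannedPdfHeuristic`, its methods, and the `minCharsPerPage` threshold are all hypothetical. It assumes the caller has already extracted the document's plain text (for example with Tika) and knows the page count, and it simply flags a PDF as probably scanned when the text is nearly empty relative to the number of pages.

```java
/**
 * Hypothetical helper, not part of Tika: guesses whether a PDF is a scan
 * by checking how little text was extracted from it.
 */
public final class ScannedPdfHeuristic {

    private ScannedPdfHeuristic() {
    }

    /** Counts the characters in the text that are not whitespace. */
    public static int countNonWhitespace(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    /**
     * Returns true when the extracted text holds fewer non-whitespace
     * characters than pageCount * minCharsPerPage. The threshold is an
     * assumption to be tuned on real documents.
     */
    public static boolean looksScanned(String extractedText, int pageCount, int minCharsPerPage) {
        if (pageCount <= 0) {
            throw new IllegalArgumentException("pageCount must be positive");
        }
        String text = (extractedText == null) ? "" : extractedText;
        long threshold = (long) pageCount * minCharsPerPage;
        return countNonWhitespace(text) < threshold;
    }

    public static void main(String[] args) {
        // A three-page document that yielded only whitespace: likely a scan.
        System.out.println(looksScanned("  \n\n  ", 3, 50));
    }
}
```

A caller could route documents for which `looksScanned` returns true to OCR software and index the rest directly. The heuristic would misclassify mixed documents (a few scanned pages inside a text PDF) and sparse text PDFs such as slide decks, so it is only a starting point.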

Issue Links

is duplicated by: TIKA-93 OCR support (Resolved)

People

Assignee: Unassigned

Reporter: Joseph Vychtrle

Votes: 1

Watchers: 0

Dates

Created: 01/Apr/11 00:01

Updated: 01/Mar/15 22:13

Resolved: 01/Mar/15 22:13