[TIKA-93] OCR support - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.7
Component/s: parser
Labels:
- memex

Description

I don't know of any decent open source pure Java OCR libraries, but there are command line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to extract text content (where available) from image files.
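As a rough illustration of the idea above, the sketch below shells out to the `tesseract` command line tool and reads back the text it writes. It assumes a `tesseract` binary is on the PATH and relies only on the basic `tesseract <image> <outputbase> -l <lang>` invocation, which writes its result to `<outputbase>.txt`. The class name `TesseractSketch`, its method names, and the timeout parameter are hypothetical and chosen for this example; this is not the parser that was eventually committed for this issue.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not part of Tika.
final class TesseractSketch {

    // Builds the argument list: tesseract <image> <outputBase> -l <language>
    static List<String> buildCommand(Path image, Path outputBase, String language) {
        return List.of("tesseract", image.toString(), outputBase.toString(), "-l", language);
    }

    // Runs tesseract on one image and returns the recognized text.
    // Assumes the "tesseract" binary is installed and on the PATH.
    static String extractText(Path image, String language, long timeoutSeconds)
            throws IOException, InterruptedException {
        // tesseract appends ".txt" to the output base name it is given.
        Path outputBase = Files.createTempFile("ocr-sketch", "");
        Path textFile = Path.of(outputBase.toString() + ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase, language))
                    .redirectErrorStream(true)
                    // Discard console output so the child cannot block on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("tesseract timed out after " + timeoutSeconds + "s");
            }
            if (process.exitValue() != 0) {
                throw new IOException("tesseract exited with code " + process.exitValue());
            }
            return Files.readString(textFile, StandardCharsets.UTF_8);
        } finally {
            // Clean up both the placeholder file and the generated text file.
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(outputBase);
        }
    }
}
```

A caller might use it as `TesseractSketch.extractText(Path.of("scan.png"), "eng", 120)`. A real parser would also need to handle a missing binary gracefully and stream the text into Tika's content handler rather than return a string.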

Attachments

- TIKA-93.patch (08/Feb/14 17:05, 21 kB, Grant Ingersoll)
- TIKA-93.patch (08/Feb/14 19:07, 28 kB, Grant Ingersoll)
- TIKA-93.patch (08/Feb/14 20:11, 38 kB, Grant Ingersoll)
- TIKA-93.patch (09/Feb/14 13:46, 40 kB, Grant Ingersoll)
- testOCR.docx (09/Feb/14 13:46, 61 kB, Grant Ingersoll)
- testOCR.pdf (09/Feb/14 13:46, 41 kB, Grant Ingersoll)
- testOCR.pptx (09/Feb/14 13:46, 77 kB, Grant Ingersoll)
- TesseractOCRParser.patch (23/Feb/14 14:28, 26 kB, Luís Filipe Nassif)
- TesseractOCRParser.patch (23/Feb/14 21:25, 25 kB, Luís Filipe Nassif)
- TesseractOCR_Tyler.patch (29/May/14 18:07, 17 kB, Tyler Bui-Palsulich)
- TesseractOCR_Tyler_v2.patch (09/Jun/14 22:22, 18 kB, Tyler Bui-Palsulich)
- Petr_tika-config.xml (22/Aug/14 07:33, 1 kB, Petr Vas)
- TesseractOCR_Tyler_v3.patch (15/Sep/14 22:24, 20 kB, Tyler Bui-Palsulich)
- TesseractOCR_Tyler_v4.patch (18/Sep/14 22:05, 20 kB, Tyler Bui-Palsulich)

Issue Links

duplicates

- TIKA-630 Dealing with PDF documents from scanning programs (Resolved)

is related to

- SOLR-6991 Update to Apache TIKA 1.7 (Closed)
- TIKA-1526 ExternalParser should trap/ignore/workarround JDK-8047340 & JDK-8055301 so Turkish Tika users can still use non-external parsers (Resolved)

People

Assignee: Chris A. Mattmann

Reporter: Jukka Zitting

Votes: 12

Watchers: 27

Dates

Created: 12/Nov/07 02:06

Updated: 06/Apr/15 17:05

Resolved: 19/Sep/14 14:20