[TIKA-1201] Add possibility for switching to pdfbox NonSequentialPDFParser - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.4
Fix Version/s: 1.5
Component/s: parser
Labels: None
Environment: all

Description

As discussed, the new NonSequentialPDFParser can improve PDF extraction by 45% and conforms more closely to the PDF specification. This parser will become the default in PDFBox 2.0.

ref.:
https://issues.apache.org/jira/browse/PDFBOX-1104
http://pdfbox.apache.org/ideas.html

We should either provide an extended parser or add a parameter to the current PDFParser so that it calls:

PDDocument.loadNonSeq(file, scratchFile);
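
The parameter approach could look roughly like the sketch below. It is only an illustration: the `Config` class, the `useNonSequentialParser` flag, and the `Loader` interface are hypothetical names and not the API added by the attached patch. The PDFBox calls are passed in as lambdas so that the sketch compiles without PDFBox on the classpath.

```java
import java.io.File;
import java.io.IOException;

public class PdfLoaderSwitch {

    /** Hypothetical abstraction over "load a document from a file". */
    interface Loader<D> {
        D load(File file) throws IOException;
    }

    /** Hypothetical parser configuration; the flag name is an assumption. */
    static final class Config {
        private boolean useNonSequentialParser = false; // keep the old parser as default

        boolean isUseNonSequentialParser() {
            return useNonSequentialParser;
        }

        void setUseNonSequentialParser(boolean value) {
            useNonSequentialParser = value;
        }
    }

    /** Picks the loader according to the flag and delegates to it. */
    static <D> D load(File file, Config config,
                      Loader<D> sequential, Loader<D> nonSequential) throws IOException {
        Loader<D> chosen = config.isUseNonSequentialParser() ? nonSequential : sequential;
        return chosen.load(file);
    }

    public static void main(String[] args) throws IOException {
        Config config = new Config();
        config.setUseNonSequentialParser(true);

        // With PDFBox 1.8.x available, the two loaders would be something like
        //   f -> PDDocument.load(f)
        //   f -> PDDocument.loadNonSeq(f, scratchFile)
        // Strings stand in for the loaded documents here.
        String chosen = PdfLoaderSwitch.<String>load(
                new File("example.pdf"), config,
                f -> "sequential", f -> "non-sequential");
        System.out.println(chosen);
    }
}
```

Keeping the flag off by default would leave existing behaviour unchanged until users opt in, which matters given the metadata differences tracked in TIKA-1203.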

Attachments

TIKA-1201.patch
03/Dec/13 00:52
6 kB
Tim Allison

Issue Links

relates to

TIKA-1203 Some metadata not extracted from PDF files when NonSequentialPDFParser is used

Closed

People

Assignee: Tim Allison
Reporter: Hong-Thai Nguyen
Votes: 0
Watchers: 5

Dates

Created: 02/Dec/13 15:26
Updated: 25/Mar/14 16:21
Resolved: 03/Dec/13 00:56