Description
As discussing, we can improve PDF extraction by 45% with this new NonSequentialPDFParser and fit more with PDF specification. This parser will be integrated by default in pdfbox 2.0.
ref.:
https://issues.apache.org/jira/browse/PDFBOX-1104
http://pdfbox.apache.org/ideas.html
We should provide an extended parser or parameter current PDFParser to call:
PDDocument.loadNonSeq(file, scratchFile);
Attachments
Attachments
Issue Links
- relates to
-
TIKA-1203 Some metadata not extracted from PDF files when NonSequentialPDFParser is used
- Closed