[TIKA-2845] Override ProcessPages in PDFTextStripper - ASF JIRA

Details

Type: Task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.21
Component/s: None
Labels: None

Description

On the PDFBox user list, lehmi confirmed (and tilman clarified) that PDFTextStripper's processPages() skips pages that lack a "Contents" element [1]. Inline images are part of the "Contents" element, so they would still be processed (e.g. in OCR).

However, a page without a "Contents" element can still carry other elements, such as an annotation with an embedded file, and those are currently missed.

We should override processPages() to process all pages.
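
The idea can be illustrated with a small, self-contained model. This is only a sketch: Page, BaseStripper, and AllPagesStripper are hypothetical stand-ins written for this example, not the PDFBox or Tika classes, and "attachment.txt" is a made-up file name. The base class skips pages without contents, as described above, and the subclass overrides processPages() so that every page is visited.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; NOT the PDFBox or Tika API.
class ProcessPagesSketch {

    /** A page that may lack a content stream yet still carry an embedded file. */
    static final class Page {
        final int number;
        final boolean hasContents;
        final String embeddedFile; // null when no annotation carries a file

        Page(int number, boolean hasContents, String embeddedFile) {
            this.number = number;
            this.hasContents = hasContents;
            this.embeddedFile = embeddedFile;
        }
    }

    /** Models the reported behavior: pages without "Contents" are skipped. */
    static class BaseStripper {
        final List<String> found = new ArrayList<>();

        void processPages(List<Page> pages) {
            for (Page page : pages) {
                if (page.hasContents) {
                    processPage(page);
                }
            }
        }

        /** Collects the embedded file of a page, if it has one. */
        void processPage(Page page) {
            if (page.embeddedFile != null) {
                found.add(page.embeddedFile);
            }
        }
    }

    /** The proposed change: visit every page, with or without "Contents". */
    static class AllPagesStripper extends BaseStripper {
        @Override
        void processPages(List<Page> pages) {
            for (Page page : pages) {
                processPage(page);
            }
        }
    }

    /** Runs a stripper over the pages and returns the embedded files it saw. */
    static List<String> run(BaseStripper stripper, List<Page> pages) {
        stripper.processPages(pages);
        return stripper.found;
    }

    public static void main(String[] args) {
        List<Page> pages = Arrays.asList(
                new Page(1, true, null),
                new Page(2, false, "attachment.txt")); // no contents, one embedded file
        System.out.println(run(new BaseStripper(), pages));     // prints []
        System.out.println(run(new AllPagesStripper(), pages)); // prints [attachment.txt]
    }
}
```

In the real parser the override would also have to respect whatever page-range and page-counter handling the superclass does; the model leaves that out.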

[1] Start of thread: https://lists.apache.org/thread.html/9f34f71f764ef2ac48bb2fe3d19aa0496fd989040a6df0c1d899a885@%3Cusers.pdfbox.apache.org%3E

Attachments

testPDFFileEmbInAnnotation_noContents.pdf, 194 kB, 03/Apr/19 13:15, Tim Allison

People

Assignee: Tim Allison
Reporter: Tim Allison
Votes: 0
Watchers: 2

Dates

Created: 03/Apr/19 13:14
Updated: 03/Apr/19 15:15
Resolved: 03/Apr/19 14:15