Details
Type: Task
Status: Resolved
Priority: Major
Resolution: Fixed
Description
On the PDFBox user list, lehmi confirmed (and tilman clarified) that PDFTextStripper's processPages() skips pages that lack a "Contents" element [1]. Inline images are part of the "Contents" element, so they would still be processed (e.g. for OCR).
However, a page without a "Contents" element can still carry other elements, such as an annotation with an embedded file, and those are currently missed.
We should override processPages() to process all pages.
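To make the proposal concrete, here is a minimal, self-contained sketch of the difference between the two loops. It deliberately does not use PDFBox: SketchPage, SkippingStripper and AllPagesStripper are hypothetical stand-ins invented for illustration, and the attachment names are made-up sample data. It only models the "skip pages without Contents" check described above, so it says nothing about what else a real override of processPages() might need to do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a PDF page; this is NOT the PDFBox page class. */
final class SketchPage {
    final boolean hasContents;      // true if the page has a "Contents" element
    final List<String> attachments; // files embedded via annotations on this page

    SketchPage(boolean hasContents, List<String> attachments) {
        this.hasContents = hasContents;
        this.attachments = attachments;
    }
}

/** Models the behavior reported on the list: pages without "Contents" are skipped. */
class SkippingStripper {
    final List<String> found = new ArrayList<>();

    void processPages(List<SketchPage> pages) {
        for (SketchPage page : pages) {
            if (page.hasContents) {
                processPage(page);
            }
        }
    }

    /** Per-page work; here it just records any embedded files on the page. */
    void processPage(SketchPage page) {
        found.addAll(page.attachments);
    }
}

/** Models the proposed change: override processPages() and visit every page. */
class AllPagesStripper extends SkippingStripper {
    @Override
    void processPages(List<SketchPage> pages) {
        for (SketchPage page : pages) {
            processPage(page);
        }
    }
}

/** Small driver showing both variants on the same two-page document. */
class ProcessPagesSketch {
    public static void main(String[] args) {
        List<SketchPage> pages = Arrays.asList(
                new SketchPage(true, Collections.<String>emptyList()),
                // No "Contents" element, but an annotation carries an embedded file.
                new SketchPage(false, Arrays.asList("attachment.xlsx")));

        SkippingStripper skipping = new SkippingStripper();
        skipping.processPages(pages);
        System.out.println("skipping variant found: " + skipping.found);

        AllPagesStripper all = new AllPagesStripper();
        all.processPages(pages);
        System.out.println("all-pages variant found: " + all.found);
    }
}
```

In this model the skipping variant never sees the file on the second page, while the overriding variant does, which is the gap this issue is meant to close.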
[1] Start of thread: https://lists.apache.org/thread.html/9f34f71f764ef2ac48bb2fe3d19aa0496fd989040a6df0c1d899a885@%3Cusers.pdfbox.apache.org%3E