Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- 1.8.9, 2.0.0
Description
As reported by Noam S. on the user mailing list:
My problem is that when trying to getText(doc) from a certain section of the PDF using setStartBookmark(item) and setEndBookmark(item), I get all the text rather than just the text from the specified section.
While trying to resolve this I realized that the writeText(doc, outputStream) method always calls the resetEngine() method. That resets all the parameters and deletes the bookmarks I set.
The two lines that reset the bookmarks were added to resetEngine in PDFBOX-1808 in [ https://svn.apache.org/r1553175 ] in an attempt to save some memory.
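The failure mode described above can be illustrated with a toy model. This is only a sketch and does not use the real PDFBox classes: `ToyStripper`, its `resetClearsBookmarks` flag, and the representation of a document as title/text pairs are all hypothetical stand-ins, chosen so the effect of a reset that also wipes the user-set bookmarks is visible in isolation.

```java
// Toy stand-in for a text stripper; NOT the real PDFTextStripper.
// A "document" is an array of {sectionTitle, sectionText} pairs and a
// "bookmark" is simply a section title (both are assumptions of this sketch).
class ToyStripper {
    private String startBookmark;
    private String endBookmark;
    // Hypothetical switch: true models the reported bug, false models the fix.
    private final boolean resetClearsBookmarks;

    ToyStripper(boolean resetClearsBookmarks) {
        this.resetClearsBookmarks = resetClearsBookmarks;
    }

    void setStartBookmark(String title) { startBookmark = title; }
    void setEndBookmark(String title) { endBookmark = title; }

    // Called at the start of every extraction. In the buggy variant it also
    // discards the bookmarks the caller configured beforehand.
    private void resetEngine() {
        if (resetClearsBookmarks) {
            startBookmark = null;
            endBookmark = null;
        }
    }

    String getText(String[][] sections) {
        resetEngine();
        StringBuilder out = new StringBuilder();
        // Without a start bookmark, extraction begins at the first section.
        boolean inRange = (startBookmark == null);
        for (String[] section : sections) {
            if (section[0].equals(startBookmark)) {
                inRange = true;
            }
            if (inRange) {
                out.append(section[1]);
            }
            // The end section is treated as inclusive in this sketch.
            if (section[0].equals(endBookmark)) {
                break;
            }
        }
        return out.toString();
    }
}
```

With both bookmarks set to the same section, the variant constructed with `true` returns the text of every section, while the variant constructed with `false` returns only the text of the selected section.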
I also found another weird piece of code in the trunk, which would mean that text extraction fails if the start and end bookmarks are identical:
if (startPage != null && endPage != null &&
        startBookmark.getCOSObject() == endBookmark.getCOSObject())
{
    // this is a special case where both the start and end bookmark
    // are the same but point to nothing. In this case
    // we will not extract any text.
    startBookmarkPageNumber = 0;
    endBookmarkPageNumber = 0;
}
Earlier, that segment was:
if( startBookmarkPageNumber == -1 && startBookmark != null &&
    endBookmarkPageNumber == -1 && endBookmark != null &&
    startBookmark.getCOSObject() == endBookmark.getCOSObject() )
{
    //this is a special case where both the start and end bookmark
    //are the same but point to nothing. In this case
    //we will not extract any text.
    startBookmarkPageNumber = 0;
    endBookmarkPageNumber = 0;
}
which makes more sense. The change was made last year in rev [ https://svn.apache.org/r1634252 ] as part of the pagetree refactoring.
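To make the difference between the two conditions concrete, here is a small sketch that evaluates both as plain predicates. The class and method names are hypothetical, pages and bookmarks are modelled as bare `Object` references, and the bookmark reference itself stands in for the result of getCOSObject(); none of this is the actual PDFBox code.

```java
// Hypothetical helper comparing the two guard conditions quoted above.
class BookmarkSpecialCase {

    // Condition as quoted from the trunk: fires when both pages were resolved
    // and the two bookmarks are the same object.
    static boolean trunkCondition(Object startPage, Object endPage,
                                  Object startBookmark, Object endBookmark) {
        return startPage != null && endPage != null
                && startBookmark == endBookmark;
    }

    // Earlier condition: fires only when both bookmarks exist, are the same
    // object, and neither could be resolved to a page (page number -1).
    static boolean earlierCondition(int startPageNumber, Object startBookmark,
                                    int endPageNumber, Object endBookmark) {
        return startPageNumber == -1 && startBookmark != null
                && endPageNumber == -1 && endBookmark != null
                && startBookmark == endBookmark;
    }
}
```

Under these assumptions, passing the same bookmark as start and end with a resolved page makes `trunkCondition` return true, so extraction would be suppressed, whereas `earlierCondition` returns false for a resolved page number and true only for the dangling case where both page numbers are -1.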
I am writing a test to prevent this from breaking in the future.
Issue Links
- is broken by
  - PDFBOX-1808 PDFTextStripper.getText - hight memory usage (Closed)
  - PDFBOX-2423 Page tree handling needs rewriting (Closed)