[PDFBOX-3079] Extracting text between bookmarks not working - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.0.0
Fix Version/s: None
Component/s: Text extraction
Labels: textextraction
Environment: Windows
External issue URL: http://www.java-forums.org/advanced-java/51032-pdox-1-6-0-extract-text-between-2-bookmarks-same-page-sos.html

Description

org.apache.pdfbox.text.PDFTextStripper does not really support extracting the content between two bookmarks. From the code in pdfbox-parent/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java, it is clear that it uses the bookmarks the user provides only to determine which pages to extract content from.

There is a business need to extract the text that lies strictly between bookmarks. Refer to the attached example program (Test.java) and sample file (test.pdf).
Extracting any of the sections on the first page returns the entire first page instead of only the content under that bookmark.
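
The behavior described above can be illustrated with a small, self-contained model. This is not PDFBox code: the `Bookmark` class, the `extractBetween` method, and the page strings are hypothetical stand-ins, assuming only that each bookmark resolves to a page index and that extraction works on whole pages. Under that assumption, two bookmarks that point into the same page cannot narrow the result below that page.

```java
import java.util.Arrays;
import java.util.List;

public class BookmarkRangeSketch {

    // Hypothetical stand-in for a bookmark: a title plus the index of the page it targets.
    static final class Bookmark {
        final String title;
        final int pageIndex;

        Bookmark(String title, int pageIndex) {
            this.title = title;
            this.pageIndex = pageIndex;
        }
    }

    // Page-granular extraction (an assumed simplification of the reported behavior):
    // the bookmarks only select the first and last page, never a position inside a page.
    static String extractBetween(List<String> pageTexts, Bookmark start, Bookmark end) {
        StringBuilder out = new StringBuilder();
        for (int i = start.pageIndex; i <= end.pageIndex && i < pageTexts.size(); i++) {
            out.append(pageTexts.get(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Page 0 holds two bookmarked sections; page 1 holds a third.
        List<String> pages = Arrays.asList(
                "Section A text. Section B text. ",
                "Section C text. ");
        Bookmark sectionA = new Bookmark("Section A", 0);
        Bookmark sectionB = new Bookmark("Section B", 0);

        // Both bookmarks resolve to page 0, so all of page 0 is returned.
        System.out.println(extractBetween(pages, sectionA, sectionB));
    }
}
```

In this model, asking for the text between "Section A" and "Section B" yields the whole first page, which matches the symptom reported here. Extracting strictly between bookmarks would additionally require the position within the page that each bookmark points to.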

Attachments

- test.pdf (207 kB, 01/Nov/15 21:22, rey bernal)
- Test.java (1 kB, 01/Nov/15 21:22, rey bernal)

People

Assignee: Unassigned

Reporter: rey bernal

Votes: 0

Watchers: 2

Dates

Created: 01/Nov/15 21:18

Updated: 03/Nov/15 06:56