PDFBox / PDFBOX-3079

Extracting text between bookmarks not working

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0
    • None
    • Component/s: Text extraction
    • Environment: Windows

    Description

      org.apache.pdfbox.text.PDFTextStripper does not really support extracting content between bookmarks. From the code in pdfbox-parent/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java, it is clear that it uses the bookmarks the user provides only to determine which pages to extract content from.

      There is a business need to extract the text that lies strictly between bookmarks. Refer to the attached example program and sample file.
      Extracting any of the sections on the first page returns the entire first page instead of only the content under that bookmark.
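
      The gap between the reported behaviour and the requested one can be sketched with a small stand-alone model. This is not PDFBox code and not the attached Test.java: Line, Bookmark, byPage and byPosition are hypothetical names, and the model assumes each bookmark resolves to a page plus a row on that page. byPage mimics page-granular selection as described above; byPosition shows the section-level result this issue asks for.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BookmarkSlice {
    /** Hypothetical model: one text line, its page index and its row on that page. */
    static final class Line {
        final int page; final int row; final String text;
        Line(int page, int row, String text) { this.page = page; this.row = row; this.text = text; }
    }

    /** Hypothetical model: a bookmark destination as a page index plus a row. */
    static final class Bookmark {
        final int page; final int row;
        Bookmark(int page, int row) { this.page = page; this.row = row; }
    }

    /** Page-granular selection: every line on pages start.page..end.page is kept. */
    static List<String> byPage(List<Line> lines, Bookmark start, Bookmark end) {
        List<String> out = new ArrayList<>();
        for (Line l : lines) {
            if (l.page >= start.page && l.page <= end.page) {
                out.add(l.text);
            }
        }
        return out;
    }

    /** Position-granular selection: lines from start (inclusive) to end (exclusive). */
    static List<String> byPosition(List<Line> lines, Bookmark start, Bookmark end) {
        List<String> out = new ArrayList<>();
        for (Line l : lines) {
            boolean atOrAfterStart = l.page > start.page
                    || (l.page == start.page && l.row >= start.row);
            boolean beforeEnd = l.page < end.page
                    || (l.page == end.page && l.row < end.row);
            if (atOrAfterStart && beforeEnd) {
                out.add(l.text);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two sections that share page 0, as in the reported case.
        List<Line> doc = Arrays.asList(
                new Line(0, 0, "Section A"),
                new Line(0, 1, "text of A"),
                new Line(0, 2, "Section B"),
                new Line(0, 3, "text of B"));
        Bookmark a = new Bookmark(0, 0);
        Bookmark b = new Bookmark(0, 2);
        System.out.println(byPage(doc, a, b));     // [Section A, text of A, Section B, text of B]
        System.out.println(byPosition(doc, a, b)); // [Section A, text of A]
    }
}
```

      When both bookmarks point into the same page, the page-granular version cannot tell the sections apart, which matches the symptom reported for the first page of test.pdf.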

      Attachments

        1. test.pdf
          207 kB
          rey bernal
        2. Test.java
          1 kB
          rey bernal

          People

            Assignee: Unassigned
            Reporter: rey bernal (lanrebr)
            Votes: 0
            Watchers: 2