PDFBox / PDFBOX-521

Improved PDF Text Extraction that notes paragraph boundaries

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0-incubator
    • Fix Version/s: 1.4.0
    • Component/s: Parsing
    • Labels: None
    • Environment: all

    Description

      The org.apache.pdfbox.util.PDFTextStripper class currently ignores paragraph demarcation in the text. It renders each line of text as it discovers it and separates every line with the same line separator, so a line break inside a paragraph looks identical to a break between paragraphs.

      This makes it difficult to identify where paragraphs (or even pages) start and stop in the extracted text, which text processing often needs in order to work with logical 'chunks' of text. Rendering into other formats (such as HTML or XML) is also easier when the document is resolved into more discrete logical text chunks.

      The request is for improved text extraction that exposes finer-grained hooks into the parsing, so that callers can identify and tag paragraph starts and stops.
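
      To make the request concrete, the following is a minimal sketch of one way paragraph boundaries could be inferred from line positions. It is not the attached patch and does not use PDFBox at all: the TextLine type, the tag method, and the drop and indent thresholds are hypothetical names chosen for illustration. A line is assumed to start a new paragraph when the vertical gap to the previous line is unusually large, or when it is indented relative to the previous line.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphTagger {

    /** One extracted line of text plus its position (hypothetical type, not a PDFBox class). */
    static final class TextLine {
        final String text;
        final float x;      // left edge of the line
        final float y;      // vertical position, increasing down the page
        final float height; // nominal line height

        TextLine(String text, float x, float y, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.height = height;
        }
    }

    /**
     * Joins lines into paragraphs and wraps each paragraph in the given markers.
     * A new paragraph starts when the vertical drop exceeds dropThreshold line
     * heights, or when the line is indented by more than indentThreshold units.
     * Both thresholds are assumptions that would need tuning per document.
     */
    static String tag(List<TextLine> lines, float dropThreshold, float indentThreshold,
                      String paragraphStart, String paragraphEnd) {
        StringBuilder out = new StringBuilder();
        TextLine previous = null;
        for (TextLine line : lines) {
            boolean startsParagraph;
            if (previous == null) {
                startsParagraph = true; // the first line always opens a paragraph
            } else {
                float drop = line.y - previous.y;
                float indent = line.x - previous.x;
                startsParagraph = drop > dropThreshold * previous.height
                        || indent > indentThreshold;
            }
            if (startsParagraph) {
                if (previous != null) {
                    out.append(paragraphEnd).append('\n'); // close the previous paragraph
                }
                out.append(paragraphStart);
            } else {
                out.append(' '); // same paragraph: join lines with a space
            }
            out.append(line.text);
            previous = line;
        }
        if (previous != null) {
            out.append(paragraphEnd).append('\n'); // close the last paragraph
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<TextLine> lines = new ArrayList<>();
        lines.add(new TextLine("First paragraph, line one.", 72f, 100f, 12f));
        lines.add(new TextLine("First paragraph, line two.", 72f, 114f, 12f));
        lines.add(new TextLine("Second paragraph after a gap.", 72f, 150f, 12f));
        lines.add(new TextLine("Third paragraph, indented.", 90f, 164f, 12f));
        System.out.print(tag(lines, 2.0f, 10f, "<p>", "</p>"));
    }
}
```

      With the sample input in main, the first two lines are joined into one paragraph, and the gap and the indent each open a new one. Passing the start and end markers as parameters is what lets the same pass produce HTML, XML, or plain-text delimiters.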

      Attachments

        1. pdftextstripper_patch.txt
          3 kB
          Mel Martinez
        2. pdftextstripper2.zip
          12 kB
          Mel Martinez

            People

              Assignee: Andreas Lehmkühler (lehmi)
              Reporter: Mel Martinez (m.martinez)
              Votes: 2
              Watchers: 5