[PDFBOX-521] Improved PDF Text Extraction that notes paragraph boundaries - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.8.0-incubator
Fix Version/s: 1.4.0
Component/s: Parsing
Labels: None
Environment: all

Description

The org.apache.pdfbox.util.PDFTextStripper class currently ignores paragraph boundaries in the text. It renders each line of text as it discovers it and writes the same line separator after every line.

As a result, it is difficult to identify where paragraphs (or even pages) start and stop in the extracted text. Those boundaries are often needed for text processing that works with logical 'chunks' of text. Rendering into other formats (such as HTML or XML) is also easier when the document is resolved into more discrete logical text chunks.

The request here is for improved text extraction that exposes finer-grained information during parsing, allowing one to identify and tag paragraph starts and stops.
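To make the request concrete, here is a minimal sketch of one way paragraph boundaries could be inferred from line geometry and then tagged. It is not the attached patch and not PDFBox code: the class `ParagraphSketch`, its `Line` type, and the parameters `lineHeight`, `gapFactor` and `indentThreshold` are all hypothetical names chosen for illustration. The heuristic assumed here is that a line starts a new paragraph when it follows an unusually large vertical gap or is indented relative to the previous line.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from PDFBox. */
class ParagraphSketch {

    /** One extracted line: its text, left edge (x) and baseline (y), with y growing down the page. */
    static final class Line {
        final String text;
        final float x;
        final float y;

        Line(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups lines into paragraphs. A line starts a new paragraph when it is the
     * first line, when the vertical gap to the previous line exceeds
     * gapFactor * lineHeight, or when it is indented by more than
     * indentThreshold relative to the previous line.
     */
    static List<List<String>> groupParagraphs(List<Line> lines, float lineHeight,
                                              float gapFactor, float indentThreshold) {
        List<List<String>> paragraphs = new ArrayList<>();
        List<String> current = null;
        Line prev = null;
        for (Line line : lines) {
            boolean startsParagraph = prev == null
                    || (line.y - prev.y) > gapFactor * lineHeight
                    || (line.x - prev.x) > indentThreshold;
            if (startsParagraph) {
                current = new ArrayList<>();
                paragraphs.add(current);
            }
            current.add(line.text);
            prev = line;
        }
        return paragraphs;
    }

    /** Wraps each paragraph in caller-supplied start and end markers. */
    static String render(List<List<String>> paragraphs, String paragraphStart,
                         String paragraphEnd, String lineSeparator) {
        StringBuilder out = new StringBuilder();
        for (List<String> paragraph : paragraphs) {
            out.append(paragraphStart);
            out.append(String.join(lineSeparator, paragraph));
            out.append(paragraphEnd);
        }
        return out.toString();
    }
}
```

With markers such as `"<p>"` and `"</p>\n"`, the same grouping could feed an HTML rendering; with a blank line as the end marker it would give plain text with visible paragraph breaks. Real documents would likely need the thresholds tuned per font size, which this sketch leaves to the caller.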

Attachments

- pdftextstripper_patch.txt (3 kB), 01/Dec/10 21:05, Mel Martinez
- pdftextstripper2.zip (12 kB), 15/Mar/10 22:06, Mel Martinez

Issue Links

- incorporates: PDFBOX-659 Newlines added in the middle of words (Closed)
- relates to: PDFBOX-533 PDFTextStripper.writeCharacters is called no where in the class (Closed)

People

Assignee: Andreas Lehmkühler

Reporter: Mel Martinez

Votes: 2

Watchers: 5

Dates

Created: 08/Sep/09 17:45

Updated: 20/Dec/10 09:38

Resolved: 16/Dec/10 16:35