[PDFBOX-4553] Break of backward compatibility from 2.0.14 to 2.0.15 - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.0.15
Fix Version/s: 2.0.16, 3.0.0 PDFBox
Component/s: Text extraction
Labels:
- regression

Description

We use PDFTextStripper to parse some PDF documents. The parsing sometimes relies on the document template and on the order of the words in the extracted text.

The following Kotlin code prints the text content of the attached file, sorted by position.

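import java.io.File
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper

fun main() {
  val pdfTextStripper = PDFTextStripper()
  pdfTextStripper.sortByPosition = true  // order the extracted text by position on the page
  // use {} closes the document once the text has been extracted
  val text = PDDocument.load(File("/path/to/file/KYPolicy2.pdf")).use { pdfTextStripper.getText(it) }
  print(text)
}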

Running this code with PDFBox 2.0.14 and 2.0.15 gives different results for the following line:

POLICY PERIOD: FROM 02/18/2018 TO 02/18/2019 (2.0.14)

POLICY PERIOD: FROM 02/18/2018 02/18/2019TO (2.0.15)

I suspect the cause is the changes made in this commit:

https://github.com/apache/pdfbox/commit/068146a9c9fe59becbd82814b6a245f8158fce22

This prevents us from safely upgrading to the newer version.
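
To illustrate why the reordering matters for template-based parsing, here is a minimal sketch using only the Kotlin standard library. The regex and the `parsePolicyPeriod` helper are hypothetical and are not part of PDFBox; they only assume a parser that expects the words in the order produced by 2.0.14.

```kotlin
// Hypothetical template: expects "FROM <date> TO <date>" in exactly this word order.
val policyPeriod = Regex("""POLICY PERIOD: FROM (\d{2}/\d{2}/\d{4}) TO (\d{2}/\d{2}/\d{4})""")

// Returns the (start, end) dates, or null when the line does not match the template.
fun parsePolicyPeriod(text: String): Pair<String, String>? =
    policyPeriod.find(text)?.destructured?.let { (start, end) -> Pair(start, end) }

fun main() {
    val from2014 = "POLICY PERIOD: FROM 02/18/2018 TO 02/18/2019"  // output of 2.0.14
    val from2015 = "POLICY PERIOD: FROM 02/18/2018 02/18/2019TO"   // output of 2.0.15
    println(parsePolicyPeriod(from2014)) // prints (02/18/2018, 02/18/2019)
    println(parsePolicyPeriod(from2015)) // prints null
}
```

A check of this kind, run against the extracted text of known documents, would flag such a change in word order before an upgrade reaches production.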

KYPolicy2.pdf

Attachments

KYPolicy2.pdf
27/May/19 11:30
25 kB
Uziel Sulkies

Issue Links

relates to

PDFBOX-4480 Problem extracting text in newline characters and spaces beetween words

Closed

PDFBOX-3464 character height 3 times higher than expected

Closed

People

Assignee: Tilman Hausherr

Reporter: Uziel Sulkies

Votes: 1

Watchers: 3

Dates

Created: 27/May/19 12:18

Updated: 28/Jun/19 04:39

Resolved: 27/May/19 17:59