Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
1.8.4, 1.8.5, 1.8.6, 2.0.0
Description
PDFBOX-1770 introduced a regression with pdfs using a Type1C font. Special characters incluing ligatures can't be extracted anymore.
The issue was originally posted on users@pdfbox:
I ran pdfbox-app version 1.8.5 over the PDF Greenstone manual:
http://www.greenstone.org/docs/greenstone3/manual.pdf
It removed the fl and fi prefixes from words like "flexible", "file" and
"first". Perhaps these genuine word prefixes have been confused with f-based
ligatures?
We were previously using a pdfbox-app 1.5.* version and wanted to switch over to
a newer one. Version 1.8.2 does not have this issue.
The command we ran:
java -cp "/path/to/pdfbox-app-1.8.5.jar" -Dline.separator="<br />"
org.apache.pdfbox.ExtractText -html "/path/to/manual.pdf"
Relevant excerpts from the output generated:
- "improve exibility, modularity, and extensibility"
the 2nd word should be "flexibillity" - "Table 1 shows the le hierarchy for Greenstone3. The rst part shows the common"
The words "file" and "first" have been truncated to "le" and "rst"
I believe this is rather a bug than intended behaviour.
Attachments
Issue Links
- breaks
-
PDFBOX-2247 Regression in text extraction between 1.8.5 and 1.8.6
- Closed