[PDFBOX-2058] The text of PDFs using Type1C can't be extracted correctly - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.8.4, 1.8.5, 1.8.6, 2.0.0
Fix Version/s: 1.8.6, 2.0.0
Component/s: Text extraction
Labels:
- type1cfont

Description

~~PDFBOX-1770~~ introduced a regression with PDFs using a Type1C font. Special characters, including ligatures, can no longer be extracted.

The issue was originally posted on users@pdfbox:

I ran pdfbox-app version 1.8.5 over the PDF Greenstone manual:
http://www.greenstone.org/docs/greenstone3/manual.pdf

It removed the fl and fi prefixes from words like "flexible", "file" and
"first". Perhaps these genuine word prefixes have been confused with f-based
ligatures?

We were previously using a pdfbox-app 1.5.* version and wanted to switch over to
a newer one. Version 1.8.2 does not have this issue.

The command we ran:
java -cp "/path/to/pdfbox-app-1.8.5.jar" -Dline.separator="<br />"
org.apache.pdfbox.ExtractText -html "/path/to/manual.pdf"

Relevant excerpts from the output generated:

"improve exibility, modularity, and extensibility"
the 2nd word should be "flexibility"
"Table 1 shows the le hierarchy for Greenstone3. The rst part shows the common"
The words "file" and "first" have been truncated to "le" and "rst"

I believe this is a bug rather than intended behaviour.
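
The symptom above is consistent with ligature glyphs being dropped rather than mapped to their letters. The following is a minimal sketch, not PDFBox's actual fix: it assumes the extracted text either still contains the Unicode ligature code points (U+FB01, U+FB02) or has already lost them. The class name `LigatureCheck` and both helper methods are hypothetical and only use the Java standard library.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class LigatureCheck {

    // Truncated forms quoted in the report: "exibility", "le", "rst".
    // Word boundaries keep "le" from matching inside words such as "Table".
    private static final Pattern TRUNCATED =
            Pattern.compile("\\b(exibility|le|rst)\\b");

    // Expands compatibility ligatures such as U+FB01 (fi) and U+FB02 (fl)
    // into plain letters using NFKC normalization.
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // Heuristic only: reports whether the text contains one of the
    // truncated words from the report, which suggests a lost ligature.
    static boolean looksTruncated(String extracted) {
        return TRUNCATED.matcher(extracted).find();
    }

    public static void main(String[] args) {
        String withLigature = "improve \uFB02exibility, modularity";
        String dropped = "improve exibility, modularity";

        System.out.println(expandLigatures(withLigature));
        System.out.println(looksTruncated(dropped));
        System.out.println(looksTruncated(expandLigatures(withLigature)));
    }
}
```

A check like `looksTruncated` could be run over the output of two pdfbox-app versions to spot the regression quickly; the word list would need to be adapted to the document being tested.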

Issue Links

breaks: PDFBOX-2247 Regression in text extraction between 1.8.5 and 1.8.6 (Closed)

People

Assignee: Andreas Lehmkühler
Reporter: Andreas Lehmkühler
Votes: 0
Watchers: 2

Dates

Created: 04/May/14 16:50
Updated: 28/Jul/14 18:43
Resolved: 05/May/14 17:57