PDFBox / PDFBOX-2058

The text of PDFs using Type1C can't be extracted correctly

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.4, 1.8.5, 1.8.6, 2.0.0
    • Fix Version/s: 1.8.6, 2.0.0
    • Component/s: Text extraction

    Description

      PDFBOX-1770 introduced a regression for PDFs using a Type1C font. Special characters, including ligatures, can no longer be extracted.

      The issue was originally posted on users@pdfbox:

      I ran pdfbox-app version 1.8.5 over the PDF Greenstone manual:
      http://www.greenstone.org/docs/greenstone3/manual.pdf

      It removed the fl and fi prefixes from words like "flexible", "file" and
      "first". Perhaps these genuine word prefixes have been confused with f-based
      ligatures?

      We were previously using a pdfbox-app 1.5.* version and wanted to switch over to
      a newer one. Version 1.8.2 does not have this issue.

      The command we ran:
      java -cp "/path/to/pdfbox-app-1.8.5.jar" -Dline.separator="<br />"
      org.apache.pdfbox.ExtractText -html "/path/to/manual.pdf"

      Relevant excerpts from the output generated:

      • "improve exibility, modularity, and extensibility"
        The second word should be "flexibility"
      • "Table 1 shows the le hierarchy for Greenstone3. The rst part shows the common"
        The words "file" and "first" have been truncated to "le" and "rst"

      I believe this is a bug rather than intended behaviour.
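
The truncated words in the excerpts ("exibility", "le", "rst") are what one would expect if the fl and fi ligature glyphs were dropped during extraction rather than mapped to text. As a rough illustration only (this is not PDFBox code, and the class and method names below are hypothetical), the following sketch uses the JDK's java.text.Normalizer to show how the Unicode ligature code points U+FB01 (fi) and U+FB02 (fl) expand to plain letters, and how extracted text could be checked for their presence. It assumes Java 8 or later.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration; not part of PDFBox.
public class LigatureCheck {

    // Expands compatibility characters such as U+FB01 (fi) and U+FB02 (fl)
    // into their plain-letter equivalents using NFKC normalization.
    // Note: NFKC also rewrites other compatibility characters, not only ligatures.
    public static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // Returns true if the text contains a Latin ligature code point
    // from the range U+FB00..U+FB06 (ff, fi, fl, ffi, ffl, long s-t, st).
    public static boolean containsLatinLigature(String text) {
        return text.chars().anyMatch(c -> c >= 0xFB00 && c <= 0xFB06);
    }

    public static void main(String[] args) {
        // Text as it might look when ligature glyphs are extracted intact.
        String intact = "\uFB02exibility and \uFB01le";
        // Text as reported in this issue, with the ligatures missing.
        String broken = "exibility and le";

        System.out.println(expandLigatures(intact));
        System.out.println(containsLatinLigature(intact));
        System.out.println(containsLatinLigature(broken));
    }
}
```

A check like this cannot recover dropped ligatures; it only helps to tell apart output where the ligature code points survived extraction from output where they were lost, as in the excerpts above.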

            People

              Assignee: Andreas Lehmkühler (lehmi)
              Reporter: Andreas Lehmkühler (lehmi)