PDFBox / PDFBOX-881

Incorrect output when word spacing is achieved by matrix translation

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.1, 1.4.0
    • Fix Version/s: 1.4.0
    • Component/s: Text extraction
    • Labels: None

    Description

      When text is extracted from a PDF document in which word spacing is achieved by translating the text matrix, versions 1.3.x and 1.4 merge adjacent words into one.

      This does not happen in the 1.2 branch. After some investigation, the error was introduced by a refactoring of the PDFStreamEngine class and is related to how textMatrixEnd is computed. In the 1.2 branch, characterSpacingWidth was added after textMatrixEnd had been computed; in 1.3 (and 1.4), characterSpacingWidth is already included in textMatrixEnd, so the extractor is unable to detect the start of a new word.
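
The effect of that ordering can be illustrated with a small, self-contained sketch. This is not PDFBox code: the class WordGapSketch, the Glyph type, the threshold value, and the gap heuristic are all assumptions made for illustration. It only shows how folding the character spacing into a glyph's recorded end position can shrink the measured gap below a word-break threshold.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: not PDFBox source code. Every name here is hypothetical.
final class WordGapSketch {

    /** A glyph drawn at startX with the given advance width (arbitrary text-space units). */
    static final class Glyph {
        final char ch;
        final double startX;
        final double width;

        Glyph(char ch, double startX, double width) {
            this.ch = ch;
            this.startX = startX;
            this.width = width;
        }
    }

    /**
     * Joins glyphs into text, emitting a space when the gap between the previous
     * glyph's recorded end and the next glyph's start exceeds gapThreshold.
     * If spacingInsideEnd is true, charSpacing is folded into the recorded end
     * (the ordering the report attributes to 1.3/1.4); otherwise it is left out
     * (the ordering the report attributes to 1.2).
     */
    static String extract(List<Glyph> glyphs, double charSpacing,
                          boolean spacingInsideEnd, double gapThreshold) {
        StringBuilder out = new StringBuilder();
        double previousEnd = Double.NaN; // NaN marks "no previous glyph yet"
        for (Glyph g : glyphs) {
            if (!Double.isNaN(previousEnd) && g.startX - previousEnd > gapThreshold) {
                out.append(' ');
            }
            out.append(g.ch);
            double end = g.startX + g.width;
            previousEnd = spacingInsideEnd ? end + charSpacing : end;
        }
        return out.toString();
    }

    /** "AB" and "CD" with character spacing 1, separated by an extra translation of 2. */
    static List<Glyph> sampleLine() {
        return Arrays.asList(
            new Glyph('A', 0, 5),
            new Glyph('B', 6, 5),   // 0 + 5 + 1
            new Glyph('C', 14, 5),  // 6 + 5 + 1, plus a translation of 2
            new Glyph('D', 20, 5)); // 14 + 5 + 1
    }

    public static void main(String[] args) {
        List<Glyph> line = sampleLine();
        System.out.println(extract(line, 1.0, false, 2.5)); // prints "AB CD"
        System.out.println(extract(line, 1.0, true, 2.5));  // prints "ABCD"
    }
}
```

With the spacing left out of the recorded end, the gap before 'C' measures 3 and exceeds the assumed threshold of 2.5; with the spacing folded in, the same gap measures 2 and the two words are merged.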

      Attachments

        1. alta_padron.pdf
          411 kB
          David Rodríguez Alfayate
        2. output_1_2.txt
          8 kB
          David Rodríguez Alfayate
        3. output_1_3.txt
          7 kB
          David Rodríguez Alfayate
        4. pdfbox-characterspacing.patch
          2 kB
          David Rodríguez Alfayate

          People

            Assignee: Andreas Lehmkühler (lehmi)
            Reporter: David Rodríguez Alfayate (erudil)
            Votes: 0
            Watchers: 0
