Uploaded image for project: 'PDFBox'
  1. PDFBox
  2. PDFBOX-759

Special characters not extracted

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 1.1.0, 1.2.0
    • 1.4.0
    • Text extraction
    • None
    • all

    Description

      When trying to extract characters for mathematic formulas, there appear to be lots of characters that don't seem to have any meaning.
      Take the example on page 80 the last formula with the binomial coefficient. The first opening bracket, when extracted using the Foxit Reader or Adobe Reader gets a character with the int value 18 and the closing bracket is the int value 19. Now when I look at the TextPosition objects using PDFBox, there is one character to the left of the 5 and that one has the glyph name spacehackarabic/space and the int value 32.
      The next problem is that there seems to be a character at the same position as the 5, a 'controlLF'. What does it do at the same position as that number?
      Mpw after the character 2 are 3 other characters, another 'controlLF' and two 'spacehackarabic/space'. There is no indication whatsoever abouth the bracket. What do those extra characters mean? And why doesn't it show the character for the bracket that I am able to extract using the other PDF readers?

      The PDF can be downloaded from http://upload.wikimedia.org/wikibooks/de/f/f6/Mathematik_Stochastik.pdf

      Attachments

        1. PDFBOX759-Mathematik_Stochastik.txt
          283 kB
          Andreas Lehmkühler
        2. Mathematik_Stochastik.pdf
          4.75 MB
          Jason Kenny

        Activity

          People

            lehmi Andreas Lehmkühler
            dragon Jason Kenny
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: