PDFBox / PDFBOX-2385

inline image with EI at the end incorrectly parsed


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.7, 1.8.8, 2.0.0
    • Fix Version/s: 1.8.8, 2.0.0
    • Component/s: Parsing

    Description

      I'm having a look at the files from TIKA-1419 that show a big decrease in the token count, and I found another problem with inline images. This time, the inline image data in the file ends like this:

      ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffEI
      Q
      

      Because of the first change in PDFBOX-2163, PDFBox assumes that this is ASCII85-encoded data, but it isn't. In my own tests, deleting the "Ascii85" test [ http://svn.apache.org/r1606177 ] and keeping the second change [ http://svn.apache.org/r1613645 ] (expecting spaces, 1-3 chars, blanks after the EI) works fine.
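
To illustrate the kind of check meant by "expecting spaces, 1-3 chars, blanks", here is a minimal Java sketch. It is not the code from r1613645; the class name `InlineImageEndSketch`, the method `looksLikeEndOfImage`, and the exact rules (whitespace after `EI`, then a printable token of 1 to 3 characters, then whitespace or end of data) are assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not PDFBox's actual implementation.
class InlineImageEndSketch {

    // PDF whitespace characters: NUL, TAB, LF, FF, CR, SPACE.
    static boolean isWhitespace(int b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    // Returns true if the "EI" at offset pos looks like the end-of-image
    // operator: it must be followed by whitespace, then a printable token
    // of 1-3 characters (the next operator), then whitespace or end of data.
    static boolean looksLikeEndOfImage(byte[] data, int pos) {
        if (pos < 0 || pos + 1 >= data.length
                || data[pos] != 'E' || data[pos + 1] != 'I') {
            return false;
        }
        int i = pos + 2;
        if (i == data.length) {
            return true; // "EI" is the very last thing in the data
        }
        if (!isWhitespace(data[i])) {
            return false; // "EI" is part of a longer run of bytes
        }
        while (i < data.length && isWhitespace(data[i])) {
            i++; // skip the whitespace after "EI"
        }
        if (i == data.length) {
            return true; // only whitespace follows
        }
        int start = i;
        while (i < data.length && !isWhitespace(data[i])) {
            int c = data[i] & 0xFF;
            if (c < 0x21 || c > 0x7E) {
                return false; // non-printable byte: probably still image data
            }
            i++;
            if (i - start > 3) {
                return false; // token too long for the assumed 1-3 char rule
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Shortened version of the sample from this issue: data, "EI", newline, "Q".
        byte[] sample = "ffffffffEI\nQ\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeEndOfImage(sample, 8));
    }
}
```

Note that the sketch deliberately does not look at the bytes before `EI`, so a run of `f` characters directly in front of it, as in the sample above, does not influence the decision.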

      I will have a look at some of the files (those with a big token count decrease) mentioned in tallison@apache.org's CSV file over the next few days or weeks.

      Attachments

        1. PDFBOX-2385-146515.pdf
          565 kB
          Tilman Hausherr
        2. PDFBOX-2385-539663.pdf
          234 kB
          Tilman Hausherr
        3. PDFBOX-2385-862497.pdf
          405 kB
          Tilman Hausherr
        4. PDFBOX-2385-893083.pdf
          245 kB
          Tilman Hausherr

            People

              Assignee: Tilman Hausherr
              Reporter: Tilman Hausherr