Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Versions: 1.8.7, 1.8.8, 2.0.0
Description
While looking at the files from TIKA-1419 that show a big decrease in token count, I found another problem with inline images. This time, the file looks like this:
ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffEI Q
Because of the first change in PDFBOX-2163, PDFBox assumes that this is ASCII85-encoded data, but it isn't. In my own tests, deleting the "Ascii85" test [ http://svn.apache.org/r1606177 ] and keeping the second change [ http://svn.apache.org/r1613645 ] (expecting spaces, 1-3 chars, blanks) works fine.
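To make the second check concrete, here is a minimal sketch in Python. It is not the PDFBox code from either revision. It assumes that "expecting spaces, 1-3 chars, blanks" means: "EI", at least one whitespace byte, a token of 1-3 non-whitespace bytes (such as the operator Q), then whitespace or the end of the data. The names `looks_like_image_end` and `find_image_end` are hypothetical.

```python
# PDF whitespace bytes: NUL, tab, line feed, form feed, carriage return, space.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def looks_like_image_end(data: bytes, pos: int) -> bool:
    """Return True if the b"EI" at data[pos:pos + 2] plausibly ends an inline image.

    Assumed rule (a reading of the issue text, not PDFBox code):
    "EI", one or more whitespace bytes, a token of 1-3 non-whitespace
    bytes, then whitespace or the end of the data.
    """
    if data[pos:pos + 2] != b"EI":
        return False
    i = pos + 2
    # At least one whitespace byte must follow "EI".
    ws_start = i
    while i < len(data) and data[i] in PDF_WHITESPACE:
        i += 1
    if i == ws_start:
        return False
    # Then a short token; the loop stops at whitespace or at the end of the data.
    token_start = i
    while i < len(data) and data[i] not in PDF_WHITESPACE:
        i += 1
    return 1 <= i - token_start <= 3


def find_image_end(data: bytes) -> int:
    """Return the index of the first plausible "EI" terminator, or -1 if none."""
    pos = data.find(b"EI")
    while pos != -1:
        if looks_like_image_end(data, pos):
            return pos
        pos = data.find(b"EI", pos + 1)
    return -1
```

With data shaped like the sample above (a run of `f` bytes followed by `EI Q`), this sketch accepts the `EI` as the end of the image without inspecting the preceding bytes at all, which is the point of dropping the "Ascii85" test.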
I will have a look at some of the files (those with a big token count decrease) mentioned in tallison@apache.org's CSV file over the next few days or weeks.
Attachments
Issue Links
relates to
- PDFBOX-2163 inline image with EI in the middle incorrectly parsed (Closed)
- PDFBOX-2376 Small regression in text extraction with PDFBox 1.8.7 vs. 1.8.6 (Closed)
- TIKA-1442 Upgrade to PDFBox 1.8.8 (Closed)
- TIKA-1419 Upgrade to PDFBox 1.8.7 (Closed)