[PDFBOX-2385] inline image with EI at the end incorrectly parsed - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.8.7, 1.8.8, 2.0.0
Fix Version/s: 1.8.8, 2.0.0
Component/s: Parsing
Labels:
- regression

Description

I'm looking at the files from TIKA-1419 where there's a big decrease in the token count, and I found another problem with inline images. This time, the file looks like this:

ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffEI
Q

Because of the first change in PDFBOX-2163, PDFBox assumes that this is ASCII85-encoded data, but it isn't. In my own tests, deleting the "Ascii85" test [ http://svn.apache.org/r1606177 ] and keeping the second change [ http://svn.apache.org/r1613645 ] (expecting spaces, 1-3 chars, blanks) works fine.
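To make the "spaces, 1-3 chars, blanks" idea concrete, here is a minimal sketch of such a check. It is not the code from r1613645; the class and method names are hypothetical, and it reflects only one reading of the heuristic: an "EI" counts as the end of the inline image if it is followed by at least one whitespace byte, then a token of 1 to 3 bytes, then whitespace or the end of the data.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not PDFBox code: illustrates one reading of the
// "whitespace, 1-3 chars, whitespace" check applied after a candidate "EI".
class InlineImageEndSketch {

    // Whitespace bytes as defined by the PDF format: NUL, TAB, LF, FF, CR, SPACE.
    static boolean isPdfWhitespace(byte b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    // True if the bytes starting at pos look like: one or more whitespace bytes,
    // then a token of 1 to 3 bytes, then whitespace or the end of the data.
    static boolean operatorFollows(byte[] data, int pos) {
        int i = pos;
        while (i < data.length && isPdfWhitespace(data[i])) {
            i++;
        }
        if (i == pos) {
            return false; // no whitespace directly after "EI"
        }
        int tokenStart = i;
        while (i < data.length && !isPdfWhitespace(data[i])) {
            i++;
        }
        int tokenLength = i - tokenStart;
        return tokenLength >= 1 && tokenLength <= 3;
    }

    // Returns the index of the first "EI" that passes the check, or -1 if none does.
    static int findInlineImageEnd(byte[] data) {
        for (int i = 0; i + 1 < data.length; i++) {
            if (data[i] == 'E' && data[i + 1] == 'I' && operatorFollows(data, i + 2)) {
                return i;
            }
        }
        return -1;
    }

    // Convenience overload for trying the check on short text samples.
    static int findInlineImageEnd(String text) {
        return findInlineImageEnd(text.getBytes(StandardCharsets.ISO_8859_1));
    }
}
```

With a shortened version of the sample above, "ffffffffEI", a line break, "Q" and another line break, this sketch accepts the "EI" at the end of the data because a one-character operator follows it. An "EI" that is followed directly by more data bytes, or by a longer token, is skipped.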

I will have a look at some of the files (those with a big token count decrease) mentioned in tallison@apache.org's CSV file over the next few days or weeks.

Attachments

- PDFBOX-2385-146515.pdf (565 kB), 26/Sep/14 17:54, Tilman Hausherr
- PDFBOX-2385-539663.pdf (234 kB), 26/Sep/14 20:33, Tilman Hausherr
- PDFBOX-2385-862497.pdf (405 kB), 26/Sep/14 21:22, Tilman Hausherr
- PDFBOX-2385-893083.pdf (245 kB), 26/Sep/14 21:25, Tilman Hausherr

Issue Links

relates to

- PDFBOX-2163 inline image with EI in the middle incorrectly parsed (Closed)
- PDFBOX-2376 Small regression in text extraction with PDFBox 1.8.7 vs. 1.8.6 (Closed)
- TIKA-1442 Upgrade to PDFBox 1.8.8 (Closed)
- TIKA-1419 Upgrade to PDFBox 1.8.7 (Closed)

People

Assignee: Tilman Hausherr

Reporter: Tilman Hausherr

Votes: 0

Watchers: 2

Dates

Created: 26/Sep/14 17:51

Updated: 13/Dec/14 14:15

Resolved: 27/Sep/14 21:38