Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
2.0.3, 2.0.4
-
Windows 10, java version "1.8.0_25"
Description
Text extraction from certain PDFs is not possible: PDFBox responds with a NullPointerException. Text extraction from the same PDFs works with version 1.8.13.
The issue was originally discovered while using the newest Apache Tika library, 1.14. I cannot downgrade to PDFBox 1.8.13 with Apache Tika 1.14.
Unfortunately, I cannot provide the PDFs that fail. However, I did some testing and found that “Token token = lexer.nextToken();” returns null.
Feb 07, 2017 12:17:40 PM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
SEVERE: Can't read the embedded Type1 font AAAAAB+Arial-BoldMT
java.io.IOException: Found token=null but expected NAME
Caused by: java.io.EOFException
at org.apache.pdfbox.io.ScratchFileBuffer.seek(ScratchFileBuffer.java:302)
at org.apache.pdfbox.pdfparser.COSParser.checkXRefOffset(COSParser.java:1177)
at org.apache.pdfbox.pdfparser.COSParser.parseXref(COSParser.java:202)
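For illustration only, the failure mode described above (a lexer that returns null once its input is exhausted, and a caller that expects a NAME token) can be sketched in plain Java. This is not the PDFBox Type1 parser: the classes NullTokenGuard, Lexer, Token and Kind and the method read are hypothetical stand-ins, and only the wording of the exception message is taken from the log above. The point of the sketch is that checking for null before inspecting the token turns a NullPointerException into an IOException that the caller can handle.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class NullTokenGuard {
    // Hypothetical token kinds; a real Type1 lexer has many more.
    enum Kind { NAME, INTEGER }

    // Hypothetical token: a kind plus its text.
    static final class Token {
        final Kind kind;
        final String text;

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    // Hypothetical lexer that returns null when no tokens are left,
    // mirroring the behaviour reported for lexer.nextToken().
    static final class Lexer {
        private final Iterator<Token> tokens;

        Lexer(List<Token> tokens) {
            this.tokens = tokens.iterator();
        }

        Token nextToken() {
            return tokens.hasNext() ? tokens.next() : null;
        }
    }

    // Reads one token and checks for null BEFORE touching its fields,
    // so truncated input raises an IOException instead of an NPE.
    static Token read(Lexer lexer, Kind expected) throws IOException {
        Token token = lexer.nextToken();
        if (token == null || token.kind != expected) {
            String found = (token == null) ? "null" : token.text;
            throw new IOException("Found token=" + found + " but expected " + expected);
        }
        return token;
    }

    public static void main(String[] args) {
        Lexer lexer = new Lexer(Arrays.asList(new Token(Kind.NAME, "FontName")));
        try {
            System.out.println(read(lexer, Kind.NAME).text); // first token is present
            read(lexer, Kind.NAME);                          // input exhausted here
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Running main prints the name of the first token and then the message of the IOException raised when the input runs out.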
Attachments
Issue Links
- relates to
- PDFBOX-3112 Avoid crazy /Length1 values in font descriptor (Closed)
- PDFBOX-2350 Type1 Parser hangs indefinitely (Closed)