Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.18, 2.0.19, 2.0.20, 2.0.21, 2.0.22
Environment: jdk 1.8, apache pdfbox, fontbox 2.0.18~, windows 10
Description
When calling PDFTextStripper.getText() with PDFBox 2.0.18 or later, some text is not extracted under certain conditions.
The missing text appears to be related to the font type, the font size, or the width and height of the text.
I have attached the text extraction results from versions 2.0.17 and 2.0.18, along with the sample data used for the test.
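For anyone comparing the two attached results, here is a minimal sketch of how to list the lines that are present in one extraction but absent from the other. It assumes both results are available as plain-text strings. The class name `ExtractionDiff`, the method `missingLines`, and the sample strings are hypothetical and not part of PDFBox. It uses only the Java standard library.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExtractionDiff {

    /**
     * Returns the non-blank lines of baseline (e.g. the 2.0.17 result) that do
     * not occur anywhere in candidate (e.g. the 2.0.18 result), in baseline order.
     * Lines are compared after trimming surrounding whitespace.
     */
    static List<String> missingLines(String baseline, String candidate) {
        Set<String> present = new HashSet<>();
        for (String line : candidate.split("\\R")) {
            present.add(line.trim());
        }
        List<String> missing = new ArrayList<>();
        for (String line : baseline.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !present.contains(trimmed)) {
                missing.add(trimmed);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical sample strings standing in for the two attached results.
        String from2017 = "Invoice 42\nTotal: 100\nThank you";
        String from2018 = "Invoice 42\nThank you";
        System.out.println(missingLines(from2017, from2018)); // prints [Total: 100]
    }
}
```

This is a whole-line comparison only. If the regression drops individual glyphs inside a line, the whole line is reported as missing, which is still enough to locate the affected text in the sample PDF.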
code
try (PDDocument pdDocument = PDDocument.load(new File(path))) {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(pdDocument); // the call described above
}
dependencies
<properties>
    <apache.pdfbox.version>2.0.18</apache.pdfbox.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>${apache.pdfbox.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>fontbox</artifactId>
        <version>${apache.pdfbox.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>xmpbox</artifactId>
        <version>${apache.pdfbox.version}</version>
    </dependency>
</dependencies>
Attachments
Issue Links
- relates to: PDFBOX-4661 Regression No Unicode mapping with Identity-H font (Closed)