[PDFBOX-5090] Missing text extraction under certain conditions starting with apache pdfbox 2.0.18 - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.18, 2.0.19, 2.0.20, 2.0.21, 2.0.22
Fix Version/s: 2.0.23, 3.0.0 PDFBox
Component/s: Text extraction
Labels:
- regression
Environment:
JDK 1.8, Apache PDFBox and FontBox 2.0.18 or later, Windows 10

Description

When calling PDFTextStripper.getText() with PDFBox 2.0.18 or later, some text is not extracted under certain conditions.

The missing text appears to be related to the font type, the font size, or the width and height of the text.

I have attached the text extraction results of version 2.0.17 and version 2.0.18 and the sample data used for the test.
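To pin down exactly which text disappears between the two versions, the two extraction results can be compared line by line. The following is only a minimal sketch: the class `ExtractionDiff`, the method `missingLines` and the sample strings are hypothetical helpers made up for illustration and are not part of PDFBox. In practice the inputs would be the strings returned by getText() under 2.0.17 and 2.0.18.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of PDFBox: lists the lines that one
// extraction result contains and another one lacks.
public class ExtractionDiff {

    /** Returns the non-blank lines of baseline that do not occur in candidate. */
    public static List<String> missingLines(String baseline, String candidate) {
        // Collect the trimmed lines of the newer output for fast lookup.
        Set<String> candidateLines = new HashSet<>();
        for (String line : candidate.split("\\R")) {
            candidateLines.add(line.trim());
        }
        // Keep every non-blank baseline line that the newer output lacks.
        List<String> missing = new ArrayList<>();
        for (String line : baseline.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !candidateLines.contains(trimmed)) {
                missing.add(trimmed);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up sample strings standing in for the real getText() output.
        String from2017 = "Title\nFirst paragraph\nSecond paragraph\n";
        String from2018 = "Title\nSecond paragraph\n";
        System.out.println(missingLines(from2017, from2018));
    }
}
```

The comparison is exact per line after trimming, so a line that is only partially lost will be reported as missing in full. That is usually enough to locate the affected text run in the page.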

code

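try (PDDocument pdDocument = PDDocument.load(new File(path))) {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(pdDocument); // some text is missing here with 2.0.18 or later
}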

dependencies

<properties>
    <apache.pdfbox.version>2.0.18</apache.pdfbox.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>${apache.pdfbox.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>fontbox</artifactId>
        <version>${apache.pdfbox.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>xmpbox</artifactId>
        <version>${apache.pdfbox.version}</version>
    </dependency>
</dependencies>

Attachments

- 独立財政機関をめぐる論点整理.pdf (27/Jan/21 07:11, 537 kB, sungwon kim)
- 独立財政機関をめぐる論点整理_3p_top.PNG (27/Jan/21 07:22, 99 kB, sungwon kim)
- textstripper_2.0.18_独立財政機関をめぐる論点整理_3p_top.PNG (27/Jan/21 07:22, 16 kB, sungwon kim)
- textstripper_2.0.18_128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG (27/Jan/21 07:22, 7 kB, sungwon kim)
- textstripper_2.0.17_独立財政機関をめぐる論点整理_3p_top.PNG (27/Jan/21 07:22, 16 kB, sungwon kim)
- textstripper_2.0.17_128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG (27/Jan/21 07:22, 8 kB, sungwon kim)
- PDFBOX-5090_reduced.pdf (28/Jan/21 04:30, 1 kB, Tilman Hausherr)
- PDFBOX-3442-DirectResources.pdf (30/Jan/21 18:07, 73 kB, Tilman Hausherr)
- 128채널심장전기도시스템을위한3차원매핑소프트웨어개발.txt (27/Jan/21 07:40, 15 kB, Tilman Hausherr)
- 128채널심장전기도시스템을위한3차원매핑소프트웨어개발.pdf (27/Jan/21 07:09, 76 kB, sungwon kim)
- 128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG (27/Jan/21 07:22, 59 kB, sungwon kim)

Issue Links

relates to

PDFBOX-4661 Regression No Unicode mapping with Identity-H font

Closed

People

Assignee: Andreas Lehmkühler

Reporter: sungwon kim

Votes: 0

Watchers: 5

Dates

Created: 27/Jan/21 07:25

Updated: 19/Mar/21 15:41

Resolved: 01/Feb/21 07:25