[PDFBOX-5719] PDFbox fix - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Invalid
Affects Version/s: 2.0.26
Fix Version/s: None
Component/s: Text extraction
Labels:
Environment:
OS: Ubuntu
Java: 16

Language:
- Java

Description

Hello,

I am experiencing an issue related to the "No Unicode Mapping" warning in the PDFBox debugger. Similar to Apache DebugBar, I save the font glyphs to disk as images and then use an AI to detect the characters. My objective is to update the font's Unicode map based on the AI results and save the PDF.

Here's my main idea: save the glyphs that have no Unicode mapping to disk as images, send each image to the AI for detection, and then update the font's Unicode mapping with the detected characters. I found a helpful example on Stack Overflow (link: https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0), where the solution involves creating a COSStream to update the font's Unicode mapping. This approach seems suitable for my needs.

I want to retrieve the ToUnicode text as shown in that question, modify the text to fix the font's Unicode mapping, and then update the font with the result. However, I am unsure how to obtain the ToUnicode text (the view that the PDF debugger shows for the ToUnicode stream).
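
A minimal sketch of the "modify the text" step only, under stated assumptions: the ToUnicode CMap has already been read into a string (in PDFBox 2.0.x the font dictionary's `ToUnicode` entry is normally a COSStream that can be decoded to text), the font uses two-byte character codes, and the CMap ends with the usual `endcmap` keyword. `ToUnicodePatcher`, `bfCharLine`, and `addBfChar` are hypothetical names, not part of PDFBox, and the sketch does not check whether the code is already mapped elsewhere in the CMap.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: edits ToUnicode CMap text as a plain string.
final class ToUnicodePatcher {

    /** Formats one bfchar line; assumes a two-byte character code. */
    static String bfCharLine(int code, String unicode) {
        StringBuilder sb = new StringBuilder();
        sb.append(String.format("<%04X> <", code));
        // Destination values in a ToUnicode CMap are UTF-16BE code units written as hex.
        for (byte b : unicode.getBytes(StandardCharsets.UTF_16BE)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    /** Inserts a one-entry bfchar section just before the last "endcmap" keyword. */
    static String addBfChar(String cmap, int code, String unicode) {
        int end = cmap.lastIndexOf("endcmap");
        if (end < 0) {
            throw new IllegalArgumentException("no endcmap keyword found");
        }
        String section = "1 beginbfchar\n" + bfCharLine(code, unicode) + "\nendbfchar\n";
        return cmap.substring(0, end) + section + cmap.substring(end);
    }

    public static void main(String[] args) {
        // Shortened, illustrative CMap body; a real ToUnicode stream has more header lines.
        String cmap = "1 begincodespacerange\n<0000> <FFFF>\nendcodespacerange\nendcmap\n";
        // Assumption for illustration: the AI recognised the glyph with code 0x0003 as "A".
        System.out.println(addBfChar(cmap, 0x0003, "A"));
    }
}
```

The patched text would then be written into a new COSStream and set as the font's `ToUnicode` entry, as in the Stack Overflow answer linked above.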

Can anyone provide assistance on how to address this issue? Any help would be greatly appreciated.

A sample PDF file is attached.

Attachments

Kommunikationsbedingungen-Einlagen_FIDOR-Bank.pdf
26/Nov/23 09:12
109 kB
MMG

People

Assignee: Unassigned

Reporter: MMG

Votes: 0

Watchers: 2

Dates

Created: 26/Nov/23 09:13

Updated: 16/Dec/23 09:58

Resolved: 16/Dec/23 09:58