Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Invalid
    • 2.0.26
    • None
    • Text extraction
    • OS: Ubuntu
      Java: 16

    Description

      Hello,

      I am running into the "No Unicode Mapping" warning shown in the PDFBox debugger. Similar to Apache DebugBar, I save the font glyphs to disk as images and then use an AI model to recognize the characters. My goal is to update the font's Unicode mapping with the AI results and save the PDF.

      Here is the plan: save the glyphs that have no Unicode mapping to disk as images, send each image to the AI for recognition, and then update the font's Unicode mapping with the results. I found a helpful example on Stack Overflow (https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0), where the answer creates a COSStream to update the font's ToUnicode mapping. This approach seems suitable for my needs.
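
      To make the last step of that plan concrete, here is a minimal sketch of turning the AI results into ToUnicode CMap text. It assumes a composite (Type0) font with 2-byte character codes, and the class and method names (ToUnicodeCMapSketch, bfcharLine, build) are made up for illustration. It only produces the text; writing it into the PDF is not shown.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: builds ToUnicode CMap text from recognized glyphs.
final class ToUnicodeCMapSketch {
    // To my knowledge, CMap syntax allows at most 100 entries per bfchar block.
    private static final int MAX_ENTRIES_PER_BLOCK = 100;

    // Formats one mapping, e.g. code 0x0003 and " " become "<0003> <0020>".
    // Assumption: character codes are 2 bytes wide. The target text is
    // written as UTF-16BE hex, so surrogate pairs take 4 bytes.
    static String bfcharLine(int code, String text) {
        StringBuilder sb = new StringBuilder(String.format("<%04X> <", code));
        for (byte b : text.getBytes(StandardCharsets.UTF_16BE)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    // Builds a complete CMap; keys are character codes, values are the
    // strings the AI recognized for the corresponding glyph images.
    static String build(Map<Integer, String> detected) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Integer, String> e : new TreeMap<>(detected).entrySet()) {
            lines.add(bfcharLine(e.getKey(), e.getValue()));
        }
        StringBuilder sb = new StringBuilder();
        sb.append("/CIDInit /ProcSet findresource begin\n");
        sb.append("12 dict begin\n");
        sb.append("begincmap\n");
        sb.append("/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def\n");
        sb.append("/CMapName /Adobe-Identity-UCS def\n");
        sb.append("/CMapType 2 def\n");
        sb.append("1 begincodespacerange\n<0000> <FFFF>\nendcodespacerange\n");
        // Split the entries into blocks of at most 100 lines each.
        for (int start = 0; start < lines.size(); start += MAX_ENTRIES_PER_BLOCK) {
            int end = Math.min(start + MAX_ENTRIES_PER_BLOCK, lines.size());
            sb.append(end - start).append(" beginbfchar\n");
            for (String line : lines.subList(start, end)) {
                sb.append(line).append('\n');
            }
            sb.append("endbfchar\n");
        }
        sb.append("endcmap\n");
        sb.append("CMapName currentdict /CMap defineresource pop\n");
        sb.append("end\nend\n");
        return sb.toString();
    }
}
```

      For example, build(Map.of(0x0003, " ", 0x0024, "A")) would emit one bfchar block with the two lines <0003> <0020> and <0024> <0041>. Fonts with 1-byte codes would need a different code width and codespace range.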

      I would like to retrieve the font's existing ToUnicode text, as shown in that question, edit it to correct the mappings, and then write it back to the font. However, I do not know how to obtain the ToUnicode text that the PDF debugger displays.
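
      As an illustration of the editing step, here is a minimal sketch that adds new bfchar entries to ToUnicode text that has already been read into a String. The class and method names (ToUnicodePatchSketch, addMappings) are hypothetical. How the text is read and written back is outside this sketch; my understanding is that in PDFBox 2.0.x the stream can be reached through font.getCOSObject().getDictionaryObject(COSName.TO_UNICODE) and decoded with COSStream.createInputStream(), but I have not verified this end to end.

```java
import java.util.List;

// Hypothetical helper: patches existing ToUnicode CMap text.
final class ToUnicodePatchSketch {
    // To my knowledge, CMap syntax allows at most 100 entries per bfchar block.
    private static final int MAX_ENTRIES_PER_BLOCK = 100;

    // Inserts the given lines (each like "<0024> <0041>") as new bfchar
    // blocks just before the final "endcmap" operator.
    // Limitation: existing entries for the same code are not removed.
    static String addMappings(String cmapText, List<String> bfcharLines) {
        int idx = cmapText.lastIndexOf("endcmap");
        if (idx < 0) {
            throw new IllegalArgumentException("no endcmap operator found");
        }
        if (bfcharLines.isEmpty()) {
            return cmapText;
        }
        StringBuilder block = new StringBuilder();
        // Make sure the new block starts on its own line.
        if (idx > 0 && cmapText.charAt(idx - 1) != '\n') {
            block.append('\n');
        }
        for (int start = 0; start < bfcharLines.size(); start += MAX_ENTRIES_PER_BLOCK) {
            int end = Math.min(start + MAX_ENTRIES_PER_BLOCK, bfcharLines.size());
            block.append(end - start).append(" beginbfchar\n");
            for (String line : bfcharLines.subList(start, end)) {
                block.append(line).append('\n');
            }
            block.append("endbfchar\n");
        }
        return cmapText.substring(0, idx) + block + cmapText.substring(idx);
    }
}
```

      Because the sketch leaves existing entries in place, a code that is already mapped incorrectly would end up with two definitions, so those lines would have to be edited or removed separately.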

      Can anyone provide assistance on how to address this issue? Any help would be greatly appreciated.

      A sample PDF file is attached.

          People

            Assignee: Unassigned
            Reporter: MMG (gholamrezaeipt)