TIKA-2846: Add per-page Unicode mapping stats to the metadata in the PDFParser

Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.21

    Description

      As part of TIKA-2749, it would be useful to gather stats on characters that do not have a Unicode mapping. Users could use this information right away to determine which pages might benefit from OCR.

      I propose two parallel arrays of ints, with one entry per page. The first would hold the total number of characters on each page, and the second would hold the number of characters on that page that have no Unicode mapping.
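
      To illustrate how a consumer might use the two proposed arrays, here is a minimal sketch in plain Java. It is not the PDFParser implementation: the class name, the method name, the 0.5 threshold, and the choice to skip pages with zero characters are all assumptions made for the example, and it does not show how the arrays would be read out of the Tika metadata.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the names and the threshold are illustrative assumptions.
public class UnmappedPageStats {

    /**
     * Returns the 0-based indices of pages whose share of unmapped
     * characters is at or above the given threshold.
     */
    public static List<Integer> pagesLikelyNeedingOcr(int[] charsPerPage,
                                                      int[] unmappedPerPage,
                                                      double threshold) {
        // The arrays are parallel: entry i in each describes page i.
        if (charsPerPage.length != unmappedPerPage.length) {
            throw new IllegalArgumentException("arrays must have one entry per page");
        }
        List<Integer> pages = new ArrayList<>();
        for (int i = 0; i < charsPerPage.length; i++) {
            if (charsPerPage[i] == 0) {
                continue; // no extracted characters, so the ratio is undefined
            }
            double ratio = (double) unmappedPerPage[i] / charsPerPage[i];
            if (ratio >= threshold) {
                pages.add(i);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        int[] chars = {1200, 0, 300};   // total characters per page
        int[] unmapped = {12, 0, 240};  // unmapped characters per page
        System.out.println(pagesLikelyNeedingOcr(chars, unmapped, 0.5)); // prints [2]
    }
}
```

      Keeping the two counts in separate parallel arrays lets the caller pick whatever ratio or cutoff suits their documents. A page with zero extracted characters is skipped here, although in practice such a page may itself be a candidate for OCR.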

      Many thanks to tilman for the guidance on how to gather this info easily.

            People

              Assignee: Tim Allison (tallison)
              Reporter: Tim Allison (tallison)
              Votes: 0
              Watchers: 3
