  1. Tika
  2. TIKA-1376

Improve embedded file name extraction in PDFParser

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6
    • Component/s: parser
    • Labels: None

    Description

      When we extract embedded files from PDFs, we currently use the key in the PDEmbeddedFilesNameTreeNode as the file name, and we store that key as the value of Metadata.RESOURCE_NAME_KEY in the embedded document's metadata.

      I think we should try to get the file name from PDComplexFileSpecification's getFilename() first. If that is null, then we should fall back to the key value.
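
      The proposed fallback can be sketched roughly as follows. This is only an illustration, not the actual PDFParser change: FileSpec is a hypothetical stand-in for PDComplexFileSpecification (only its getFilename() method is modelled), and resolveName and the sample keys are made up for the example so that it runs without PDFBox on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EmbeddedFileNames {

    /** Hypothetical stand-in for PDFBox's PDComplexFileSpecification. */
    interface FileSpec {
        String getFilename();
    }

    /**
     * Prefer the file name from the file specification; if the spec is
     * missing or its file name is null, fall back to the name tree key.
     */
    static String resolveName(String nameTreeKey, FileSpec spec) {
        String fromSpec = (spec == null) ? null : spec.getFilename();
        return (fromSpec != null) ? fromSpec : nameTreeKey;
    }

    public static void main(String[] args) {
        // Illustrative entries, shaped like the key -> spec pairs of a name tree.
        Map<String, FileSpec> embedded = new LinkedHashMap<>();
        embedded.put("key-1", () -> "report.docx"); // spec carries a file name
        embedded.put("key-2", () -> null);          // spec has no file name

        for (Map.Entry<String, FileSpec> e : embedded.entrySet()) {
            String name = resolveName(e.getKey(), e.getValue());
            System.out.println(e.getKey() + " -> " + name);
        }
    }
}
```

      In the parser, the resolved name would then be the value stored under Metadata.RESOURCE_NAME_KEY in the embedded document's metadata.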

          People

            Assignee: Tim Allison (tallison)
            Reporter: Tim Allison (tallison)