[TIKA-4057] Skip Thumbnails from Metadata When Scanning PPTX files - ASF JIRA

Details

Type: Wish
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.6.0
Fix Version/s: None
Component/s: metadata, mime
Labels: None

Description

I am scanning PPTX files with Tika parser/core version 2.6.0 and using an EmbeddedDocumentExtractor to check whether a PPTX file contains embedded images. The metadata includes a thumbnail with MIME type "image/jpeg"; the key and value for the thumbnail are "dc:title" and "/docProps/thumbnail.jpeg" respectively. As a result, even when the PPTX file contains no embedded image, the result is always "Embedded image present" because of the thumbnail. Is there any way to introduce a parameter in OfficeParserConfig that skips thumbnails while parsing? Thanks.
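Until such a parameter exists, one possible workaround is to filter the thumbnail out on the caller's side. The following is a minimal sketch, not part of Tika: the class `ThumbnailFilter` and its method `isThumbnail` are hypothetical names, and the check assumes the thumbnail is always reported under a path like the "/docProps/thumbnail.jpeg" seen above. It could be called from a custom extractor's `shouldParseEmbedded(Metadata)` with whichever metadata value carries the embedded resource name in your setup ("dc:title" in this report).

```java
import java.util.Locale;

/** Hypothetical helper: decides whether an embedded resource looks like the OOXML package thumbnail. */
public final class ThumbnailFilter {

    private ThumbnailFilter() {
    }

    /**
     * @param resourceName name or path reported for the embedded resource, may be null
     * @param contentType  MIME type reported for the resource, may be null if not yet detected
     * @return true if the resource looks like a package thumbnail and should be skipped
     */
    public static boolean isThumbnail(String resourceName, String contentType) {
        if (resourceName == null) {
            return false;
        }
        // Normalize: lower-case and drop a leading slash so both path forms match.
        String name = resourceName.toLowerCase(Locale.ROOT);
        if (name.startsWith("/")) {
            name = name.substring(1);
        }
        // Assumption: the thumbnail part lives at docProps/thumbnail.<extension>.
        if (!name.startsWith("docprops/thumbnail.")) {
            return false;
        }
        // If a MIME type is known, require it to be an image type.
        return contentType == null
                || contentType.toLowerCase(Locale.ROOT).startsWith("image/");
    }
}
```

With this in place, the "Embedded image present" check would count only image resources for which `isThumbnail` returns false. Matching on the path is a design choice made for simplicity; it would miss a thumbnail stored under a different part name.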

People

Assignee: Unassigned

Reporter: Kshitij

Votes: 0

Watchers: 1

Dates

Created: 28/May/23 07:21

Updated: 28/May/23 07:21