[PDFBOX-3068] Null metadata in 2.0 in some files that had metadata in 1.8.10 with old parser - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.8.10, 1.8.11, 2.0.0
Fix Version/s: 1.8.11, 2.0.0
Component/s: Parsing
Labels: None

Description

Tilman's observation on 'Microsoft' below revealed two things: 1) we should use our BodyContentHandler so that title metadata doesn't slip into the body content, and 2) the title and all other metadata values from PDDocumentInformation are null for at least this file: NZ/NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU

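        // Load the attached file and read its document information
        Path p = Paths.get("..NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU");
        PDDocument d = PDDocument.load(p.toFile());
        // The title comes back null even though 8 metadata keys are reported
        assertNull(d.getDocumentInformation().getTitle());
        assertEquals(8, d.getDocumentInformation().getMetadataKeys().size());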

From a manual review of a handful of the documents listed in the metadata/metadata_value_count_diffs.csv file here, this looks to be quite pervasive... unless I'm botching the way I load the documents and metadata.
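
To narrow down which keys are affected in a given file, something like the following might help. It is only a sketch: NullMetadataCheck and keysWithNullValues are hypothetical names, not part of PDFBox or Tika, and it assumes the key/value pairs have already been copied out of PDDocumentInformation into a plain Map (for example by looping over getMetadataKeys() and reading each value with getCustomMetadataValue(key), if that is how the values are being read). The sample data in main is made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of PDFBox or Tika.
final class NullMetadataCheck {

    // Returns the keys whose value is null, in the map's iteration order.
    static List<String> keysWithNullValues(Map<String, String> metadata) {
        List<String> nullKeys = new ArrayList<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            if (entry.getValue() == null) {
                nullKeys.add(entry.getKey());
            }
        }
        return nullKeys;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for values read from a document.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Title", null);
        metadata.put("Author", "example author");
        metadata.put("Producer", null);
        System.out.println(keysWithNullValues(metadata));
    }
}
```

Running the same check against the values read with 1.8.10 and with 2.0 would show whether the keys survive while only the values go missing.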

Attachments

NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU (24 kB), attached by Tim Allison, 28/Oct/15 15:45

People

Assignee: Tilman Hausherr
Reporter: Tim Allison
Votes: 0
Watchers: 5

Dates

Created: 28/Oct/15 15:44
Updated: 18/Jan/16 12:01
Resolved: 31/Oct/15 11:42