I ran into this bug during PDF/A validation. The document is not considered valid because the Producer value in the XMP metadata is not in sync with PDDocumentInformation.
PDDocumentInformation.getProducer() = ` ' (one space)
AdobePDFSchema.getProducer() = `' (empty)
Below is the metadata extracted from the PDF document:
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="">
<rdf:Description rdf:about="" xmlns:xap="">
<xap:CreatorTool>Canon </xap:CreatorTool>
<rdf:Description rdf:about="" xmlns:pdf="">
<pdf:Producer> </pdf:Producer>
<rdf:Description rdf:about="" xmlns:pdfaid="">
<?xpacket end="w"?>
As you can see, the Producer value should be equal to ` ' (one space).
The bug is located in the method DomXmpParser.removeComments. This method is invoked during the unmarshalling process and removes much more than comments: it also removes every text node that is empty after trimming, including a property value that consists only of whitespace.
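To make the effect concrete, here is a minimal sketch using only the JDK DOM API. It is not the xmpbox code: the class WhitespaceStripDemo and its methods are hypothetical stand-ins that apply the same trim().length() == 0 test to every text node, and the XML is a simplified document without namespaces.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;

public class WhitespaceStripDemo {

    // Hypothetical stand-in for the whitespace handling described above:
    // removes every Text node whose trimmed content is empty.
    static void stripBlankText(Node root) {
        NodeList children = root.getChildNodes();
        // Iterate backwards so that removals do not shift the remaining indices.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node node = children.item(i);
            if (node instanceof Text) {
                if (node.getTextContent().trim().length() == 0) {
                    root.removeChild(node);
                }
            } else if (node.getNodeType() == Node.ELEMENT_NODE) {
                stripBlankText(node);
            }
        }
    }

    // Parses the XML, strips blank text nodes, and returns the Producer value.
    static String producerAfterStrip(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            stripBlankText(doc.getDocumentElement());
            return doc.getElementsByTagName("Producer").item(0).getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root>\n  <Producer> </Producer>\n</root>";
        // The single-space value is removed together with the indentation.
        System.out.println("[" + producerAfterStrip(xml) + "]"); // prints "[]"
    }
}
```

The one-space Producer value is indistinguishable from indentation under this test, so the parsed value becomes empty while PDDocumentInformation still reports one space.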
I can fix (badly) MY issue by changing the code base from:
Text t = (Text) node;
if (t.getTextContent().trim().length() == 0) {
    // XXX is there a better way to remove useless Text ?
    node.getParentNode().removeChild(node);
}
into:
Text t = (Text) node;
if (t.getTextContent().startsWith("\n")) {
    // XXX is there a better way to remove useless Text ?
    node.getParentNode().removeChild(node);
}
But this is not a long-term fix.
IMHO, the unmarshalling process should be reworked.
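As one possible direction, and only as a sketch under my own assumptions rather than a proposed patch for DomXmpParser: whitespace-only text could be treated as removable only when its parent also contains child elements, since in that case it is most likely indentation. A whitespace-only text node that is the sole content of a leaf element would then be kept as the property value. The class SelectiveWhitespaceStrip and its method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;

public class SelectiveWhitespaceStrip {

    // Returns true if the node has at least one element child.
    static boolean hasElementChild(Node parent) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                return true;
            }
        }
        return false;
    }

    // Removes comments everywhere, but removes blank text only between elements.
    static void stripIgnorableWhitespace(Node root) {
        boolean hasElements = hasElementChild(root);
        NodeList children = root.getChildNodes();
        // Iterate backwards so that removals do not shift the remaining indices.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node node = children.item(i);
            if (node.getNodeType() == Node.COMMENT_NODE) {
                root.removeChild(node);
            } else if (node instanceof Text) {
                // Assumption: blank text next to sibling elements is indentation.
                if (hasElements && node.getTextContent().trim().isEmpty()) {
                    root.removeChild(node);
                }
            } else if (node.getNodeType() == Node.ELEMENT_NODE) {
                stripIgnorableWhitespace(node);
            }
        }
    }

    // Parses the XML, strips ignorable nodes, and returns the Producer value.
    static String producerAfterStrip(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            stripIgnorableWhitespace(doc.getDocumentElement());
            return doc.getElementsByTagName("Producer").item(0).getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root>\n  <!-- a comment -->\n  <Producer> </Producer>\n</root>";
        // The comment and the indentation are removed, the value is kept.
        System.out.println("[" + producerAfterStrip(xml) + "]"); // prints "[ ]"
    }
}
```

Unlike the startsWith("\n") workaround, this does not depend on how the packet happens to be indented, although it still does not address the wider question of reworking the unmarshalling.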