Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
Using Nutch with the following configuration, MetaTagsParser writes HTML meta tags to the metadata twice:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-(tika|metatags)</value>
</property>
The problem seems to come from MetaTagsParser.java#L104-L111:
Both the meta tags from the existing ParseResult and from the HTMLMetaTags are added to the metadata with a "metatag." prefix. But the ParseResult object already contains the HTML meta tags, because they have been added by TikaParser here: TikaParser.java#L198-L206
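To make the mechanism concrete, here is a minimal, self-contained sketch of how copying two overlapping sources under the same "metatag." prefix produces duplicates, and how a simple guard could avoid it. It does not use Nutch's classes: the `collect` method, the plain maps standing in for the ParseResult metadata and the HTMLMetaTags, and the `skipDuplicates` flag are all hypothetical, and this is not necessarily how the issue was actually fixed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the behavior described above; not Nutch code.
final class MetaTagDedupSketch {

    /**
     * Copies every tag from both sources into one multi-valued map,
     * prefixing each name with "metatag.". When skipDuplicates is true,
     * a value already stored under the same name is not added again.
     */
    static Map<String, List<String>> collect(Map<String, String> parseMetadata,
                                             Map<String, String> htmlMetaTags,
                                             boolean skipDuplicates) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        List<Map<String, String>> sources = List.of(parseMetadata, htmlMetaTags);
        for (Map<String, String> source : sources) {
            for (Map.Entry<String, String> tag : source.entrySet()) {
                List<String> values =
                    out.computeIfAbsent("metatag." + tag.getKey(), k -> new ArrayList<>());
                if (skipDuplicates && values.contains(tag.getValue())) {
                    continue; // same name and value already present
                }
                values.add(tag.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumption: the first parser has already copied the HTML meta tag
        // into the parse metadata, so both sources hold the same entry.
        Map<String, String> parseMetadata = Map.of("description", "An example page");
        Map<String, String> htmlMetaTags = Map.of("description", "An example page");

        System.out.println(collect(parseMetadata, htmlMetaTags, false));
        System.out.println(collect(parseMetadata, htmlMetaTags, true));
    }
}
```

Without the guard, metatag.description ends up holding the same value twice; with it, the value is stored once.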
This bug is a concern because it makes the segments needlessly large, especially if we want to store all metatags (by default, only metatag.description and metatag.keywords are stored, and thus duplicated).
I would also suggest making the output of Metadata::toString more readable (for instance, by adding a newline before each new metadata value). That would have made this bug much easier to spot in the parsechecker output.
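As a rough illustration of the suggested output format, the sketch below prints one name/value pair per line, with a newline before each value. It works on a plain map rather than on Nutch's Metadata class, and the `format` method and the " = " separator are assumptions made for the example only.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical formatter; it only illustrates the proposed layout.
final class MetadataFormatSketch {

    /** Renders each value on its own line, preceded by a newline. */
    static String format(Map<String, List<String>> metadata) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            for (String value : entry.getValue()) {
                sb.append('\n').append(entry.getKey()).append(" = ").append(value);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        // A duplicated value, as in the bug described above.
        metadata.put("metatag.description", List.of("An example page", "An example page"));
        metadata.put("metatag.keywords", List.of("example"));
        System.out.println(format(metadata));
    }
}
```

With one value per line, a repeated metatag.description entry shows up as two identical adjacent lines, which is easy to notice when scanning the output.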
Issue Links
- duplicates: NUTCH-1559 parse-metatags duplicates extracted metatags (Closed)
- relates to: NUTCH-1559 parse-metatags duplicates extracted metatags (Closed)