  1. Nutch
  2. NUTCH-2567

parse-metatags writes all meta tags twice

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.17
    • Component/s: None
    • Labels: None

    Description

      Using Nutch with the following configuration, MetaTagsParser writes HTML meta tags to the metadata twice:

          <property>
              <name>plugin.includes</name>
              <value>protocol-http|parse-(tika|metatags)</value>
          </property>
      

      The problem seems to come from MetaTagsParser.java#L104-L111:

      Meta tags from both the existing ParseResult and the HTMLMetaTags are added to the metadata with a "metatag." prefix. However, the ParseResult object already contains the HTML meta tags, because TikaParser added them here: TikaParser.java#L198-L206
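
      The duplication described above can be reproduced in isolation. The following is a minimal sketch, not the Nutch code itself: plain maps stand in for the ParseResult metadata and the HTMLMetaTags, and the class and method names (MetatagDuplicationSketch, copyBoth, copyBothDeduplicated) are hypothetical. The deduplicating variant is only one possible way to avoid the problem and is not necessarily the fix that was applied to Nutch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative stand-in for the behavior in this report; not the actual Nutch classes. */
public class MetatagDuplicationSketch {

    static final String PREFIX = "metatag.";

    /** Appends a value under a key, as a multi-valued metadata store would. */
    public static void add(Map<String, List<String>> md, String key, String value) {
        md.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** Like add, but skips a value that is already stored under the key. */
    public static void addIfAbsent(Map<String, List<String>> md, String key, String value) {
        List<String> values = md.computeIfAbsent(key, k -> new ArrayList<>());
        if (!values.contains(value)) {
            values.add(value);
        }
    }

    /** Copies both sources unconditionally: a tag present in both is stored twice. */
    public static Map<String, List<String>> copyBoth(
            Map<String, String> fromParseResult, Map<String, String> fromHtmlMetaTags) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        fromParseResult.forEach((k, v) -> add(out, PREFIX + k, v));
        fromHtmlMetaTags.forEach((k, v) -> add(out, PREFIX + k, v));
        return out;
    }

    /** Copies both sources but keeps each (key, value) pair only once. */
    public static Map<String, List<String>> copyBothDeduplicated(
            Map<String, String> fromParseResult, Map<String, String> fromHtmlMetaTags) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        fromParseResult.forEach((k, v) -> addIfAbsent(out, PREFIX + k, v));
        fromHtmlMetaTags.forEach((k, v) -> addIfAbsent(out, PREFIX + k, v));
        return out;
    }

    public static void main(String[] args) {
        // Assumption: both parsers saw the same <meta name="description"> tag.
        Map<String, String> tags = Map.of("description", "A test page");
        System.out.println(copyBoth(tags, tags));
        // prints {metatag.description=[A test page, A test page]}
        System.out.println(copyBothDeduplicated(tags, tags));
        // prints {metatag.description=[A test page]}
    }
}
```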

       
      This bug is a concern because it makes segments needlessly large, especially when storing all meta tags (by default, only metatag.description and metatag.keywords are stored, and thus duplicated).

      I would also suggest making the output of Metadata::toString more readable (for instance, by adding a newline before each new metadata value). That would have made this bug much easier to spot in the parsechecker output.
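
      To illustrate the readability suggestion, here is a rough sketch of a formatter that prints one metadata value per line. It works on a plain multi-valued map rather than on Nutch's Metadata class, and the names ReadableMetadataSketch and format are hypothetical. It is meant only to show how a repeated value becomes visible at a glance.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: renders multi-valued metadata with one "key=value" pair per line. */
public class ReadableMetadataSketch {

    public static String format(Map<String, List<String>> metadata) {
        StringBuilder sb = new StringBuilder();
        // Sort keys so that the output is stable and repeated values sit next to each other.
        for (Map.Entry<String, List<String>> entry : new TreeMap<>(metadata).entrySet()) {
            for (String value : entry.getValue()) {
                sb.append(entry.getKey()).append('=').append(value).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: a duplicated description, as in this report.
        Map<String, List<String>> md =
                Map.of("metatag.description", List.of("A test page", "A test page"));
        System.out.print(format(md));
        // prints:
        // metatag.description=A test page
        // metatag.description=A test page
    }
}
```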

            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: Gerard Bouchar (gbouchar)
              Votes: 1
              Watchers: 4
