  1. Nutch
  2. NUTCH-1922

DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.3
    • Fix Version/s: 2.3.1
    • Component/s: None
    • Labels: None

    Description

      When Nutch 2 finds a link to a URL that was already crawled in a previous batch, it resets the fetch status of that URL to unfetched. This makes the URL eligible for a re-fetch even though its crawl interval is not yet over.

      To reproduce, using version 2.3:

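      # Nutch configuration
      ant runtime                  # build the runtime directories
      cd runtime/local
      mkdir seeds
      # Seed list containing only a.html
      echo 'http://www.l3s.de/~gossen/nutch/a.html' > seeds/1.txt
      # Arguments: seed directory, crawl ID, number of rounds
      bin/crawl seeds test 2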

      This uses two files, a.html and b.html, that link to each other.
      In batch 1, Nutch downloads a.html and discovers the URL of b.html. In batch 2, Nutch downloads b.html and discovers the link back to a.html. This should update the score and link fields of a.html, but not its fetch status. However, running bin/nutch readdb -crawlId test -url http://www.l3s.de/~gossen/nutch/a.html | grep -a status after the crawl returns status: 1 (status_unfetched).

      The expected result is status: 2 (status_fetched).

      The cause seems to be that DbUpdateReducer assumes that a link to a URL that was not processed in the same batch always points to a new page. Before NUTCH-1556, the DBUpdate job processed all pages in the crawl DB. That change made the job skip all pages with a different batch ID, so I assume that it introduced this behavior.
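
The suspected logic can be illustrated with a small, self-contained model. This is only a sketch under assumptions, not the actual Nutch code: the class DbUpdateSketch, the Page type, the inlinks counter, and the methods applyInlinkBuggy and applyInlinkExpected are hypothetical names invented for illustration. Only the status values 1 (status_unfetched) and 2 (status_fetched) come from the readdb output above.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified, hypothetical model of the update step; not the actual Nutch code. */
public class DbUpdateSketch {
    static final int STATUS_UNFETCHED = 1; // shown by readdb as status_unfetched
    static final int STATUS_FETCHED = 2;   // shown by readdb as status_fetched

    /** Hypothetical stand-in for a crawl DB row. */
    static final class Page {
        int status;
        String batchId;
        int inlinks; // stands in for the score and link fields

        Page(int status, String batchId) {
            this.status = status;
            this.batchId = batchId;
        }
    }

    /** Suspected behavior: a link target from another batch is treated as a new page. */
    static void applyInlinkBuggy(Map<String, Page> db, String url, String currentBatch) {
        Page page = db.get(url);
        boolean visible = page != null && currentBatch.equals(page.batchId);
        if (!visible) {
            // The stored row is filtered out by batch ID, so the update starts
            // from a fresh row and overwrites the stored fetch status.
            page = new Page(STATUS_UNFETCHED, null);
            db.put(url, page);
        }
        page.inlinks++;
    }

    /** Expected behavior: only a URL that is missing from the DB becomes unfetched. */
    static void applyInlinkExpected(Map<String, Page> db, String url) {
        Page page = db.get(url);
        if (page == null) {
            page = new Page(STATUS_UNFETCHED, null);
            db.put(url, page);
        }
        page.inlinks++; // link data is updated, fetch status is left alone
    }

    public static void main(String[] args) {
        String a = "http://www.l3s.de/~gossen/nutch/a.html";

        // a.html was fetched in batch 1; batch 2 discovers a link to it.
        Map<String, Page> buggy = new HashMap<>();
        buggy.put(a, new Page(STATUS_FETCHED, "batch-1"));
        applyInlinkBuggy(buggy, a, "batch-2");
        System.out.println("buggy status: " + buggy.get(a).status);

        Map<String, Page> expected = new HashMap<>();
        expected.put(a, new Page(STATUS_FETCHED, "batch-1"));
        applyInlinkExpected(expected, a);
        System.out.println("expected status: " + expected.get(a).status);
    }
}
```

In this model, the buggy variant leaves a.html with status 1 after batch 2, while the expected variant keeps status 2 and only updates the link data. Whether the real reducer takes exactly this path is an assumption based on the observation above.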

      Attachments

        1. NUTCH-1922.patch
          2 kB
          Michiel

            People

              Assignee: Unassigned
              Reporter: Gerhard Gossen (gerhard.gossen)
              Votes: 1
              Watchers: 6
