Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 2.3
- Fix Version/s: None
- Component/s: None
Description
When Nutch 2 finds a link to a URL that was crawled in a previous batch, it resets the fetch status of that URL to unfetched. This makes the URL eligible for re-fetching even though its crawl interval has not yet elapsed.
To reproduce using version 2.3:
# Nutch configuration
ant runtime
cd runtime/local
mkdir seeds
echo http://www.l3s.de/~gossen/nutch/a.html > seeds/1.txt
bin/crawl seeds test 2
The test uses two files, a.html and b.html, that link to each other.
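For a locally hosted variant of the test, a page pair with the same link structure can be generated with the sketch below. The actual files served at www.l3s.de may differ; the markup here is only an assumed equivalent, and the class name is illustrative.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Writes two minimal pages that link to each other, mirroring the structure
 * of the a.html/b.html pair used in the reproduction. Not the actual files
 * hosted at www.l3s.de; only an assumed equivalent for local hosting.
 */
public class MakeLinkedPages {
  public static void main(String[] args) throws IOException {
    Path dir = Paths.get(args.length > 0 ? args[0] : ".");
    Files.createDirectories(dir);
    Files.write(dir.resolve("a.html"),
        "<html><body><a href=\"b.html\">b</a></body></html>".getBytes(StandardCharsets.UTF_8));
    Files.write(dir.resolve("b.html"),
        "<html><body><a href=\"a.html\">a</a></body></html>".getBytes(StandardCharsets.UTF_8));
  }
}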
In batch 1, Nutch downloads a.html and discovers the URL of b.html. In batch 2, Nutch downloads b.html and discovers the link to a.html. This should update the score and link fields of a.html, but not the fetch status. However, when I run bin/nutch readdb -crawlId test -url http://www.l3s.de/~gossen/nutch/a.html | grep -a status, it returns status: 1 (status_unfetched).
The expected output would be status: 2 (status_fetched).
The reason seems to be that DbUpdateReducer assumes that links pointing to a URL not processed in the current batch always belong to new pages. Before NUTCH-1556, the DBUpdate job processed all pages in the crawl DB, but that change made it skip pages with a different batch ID, which I assume is what introduced this behavior.
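To make the failure mode concrete, here is a simplified, self-contained sketch of this kind of batch-gated update. It does not use the real Nutch classes and is not the actual DbUpdateReducer code; the class and method names are illustrative only, and the status values 1 and 2 mirror the readdb output quoted above.

import java.util.HashMap;
import java.util.Map;

/**
 * Simplified illustration of the behaviour described above -- NOT the actual
 * DbUpdateReducer code. The crawl DB is modelled as a map from URL to a
 * status byte, and batch bookkeeping is reduced to a second map.
 */
public class BatchUpdateSketch {

  static final byte STATUS_UNFETCHED = 1;
  static final byte STATUS_FETCHED = 2;

  public static void main(String[] args) {
    Map<String, Byte> crawlDb = new HashMap<>();
    Map<String, String> lastBatch = new HashMap<>();

    // Batch 1: a.html is fetched, b.html is discovered as an outlink.
    crawlDb.put("a.html", STATUS_FETCHED);
    lastBatch.put("a.html", "batch-1");
    crawlDb.put("b.html", STATUS_UNFETCHED);
    lastBatch.put("b.html", "batch-1");

    // Batch 2: b.html is fetched and a link back to a.html is discovered.
    crawlDb.put("b.html", STATUS_FETCHED);
    lastBatch.put("b.html", "batch-2");
    updateFromInlink(crawlDb, lastBatch, "a.html", "batch-2");

    // Prints 1 (unfetched) although a.html was already fetched in batch 1.
    System.out.println("a.html status: " + crawlDb.get("a.html"));
  }

  /**
   * Faulty update step: any link target whose row was not processed in the
   * current batch is treated as a brand-new page. The correct behaviour
   * would be to update only the score and link fields of an existing row
   * and keep its fetch status.
   */
  static void updateFromInlink(Map<String, Byte> crawlDb, Map<String, String> lastBatch,
                               String url, String currentBatch) {
    boolean processedInThisBatch = currentBatch.equals(lastBatch.get(url));
    if (!processedInThisBatch) {
      crawlDb.put(url, STATUS_UNFETCHED); // resets an already fetched page
    }
  }
}

Running the sketch prints status 1 for a.html: the link-only update from batch 2 overwrites the fetched status instead of only touching the score and link fields.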
Issue Links
- relates to: NUTCH-1679 UpdateDb using batchId, link may override crawled page. (Closed)