Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 2.3
- Fix Version/s: None
- Component/s: None
Description
When Nutch 2 finds a link to a URL that was crawled in a previous batch, it resets the fetch status of that URL to unfetched. This makes the URL eligible for re-fetching even though its crawl interval has not yet elapsed.
To reproduce using version 2.3:
# Nutch configuration
ant runtime
cd runtime/local
mkdir seeds
echo http://www.l3s.de/~gossen/nutch/a.html > seeds/1.txt
bin/crawl seeds test 2
The test uses two files, a.html and b.html, that link to each other.
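For a locally hosted variant of the test, a page pair with the same link structure can be generated with the sketch below. The actual files served at www.l3s.de may differ; the markup here is only an assumed equivalent, and the class name is illustrative.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Writes two minimal pages that link to each other, mirroring the structure
 * of the a.html/b.html pair used in the reproduction. Not the actual files
 * hosted at www.l3s.de; only an assumed equivalent for local hosting.
 */
public class MakeLinkedPages {
  public static void main(String[] args) throws IOException {
    Path dir = Paths.get(args.length > 0 ? args[0] : ".");
    Files.createDirectories(dir);
    Files.write(dir.resolve("a.html"),
        "<html><body><a href=\"b.html\">b</a></body></html>".getBytes(StandardCharsets.UTF_8));
    Files.write(dir.resolve("b.html"),
        "<html><body><a href=\"a.html\">a</a></body></html>".getBytes(StandardCharsets.UTF_8));
  }
}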
In batch 1, Nutch downloads a.html and discovers the URL of b.html. In batch 2, Nutch downloads b.html and discovers the link to a.html. This should update the score and link fields of a.html, but not the fetch status. However, when I run bin/nutch readdb -crawlId test -url http://www.l3s.de/~gossen/nutch/a.html | grep -a status, it returns status: 1 (status_unfetched).
The expected output would be status: 2 (status_fetched).
The reason seems to be that DbUpdateReducer assumes that links pointing to a URL not processed in the current batch always belong to new pages. Before NUTCH-1556, the DBUpdate job processed all pages in the crawl DB, but that change made it skip pages with a different batch ID, which I assume is what introduced this behavior.
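To make the failure mode concrete, here is a simplified, self-contained sketch of this kind of batch-gated update. It does not use the real Nutch classes and is not the actual DbUpdateReducer code; the class and method names are illustrative only, and the status values 1 and 2 mirror the readdb output quoted above.

import java.util.HashMap;
import java.util.Map;

/**
 * Simplified illustration of the behaviour described above -- NOT the actual
 * DbUpdateReducer code. The crawl DB is modelled as a map from URL to a
 * status byte, and batch bookkeeping is reduced to a second map.
 */
public class BatchUpdateSketch {

  static final byte STATUS_UNFETCHED = 1;
  static final byte STATUS_FETCHED = 2;

  public static void main(String[] args) {
    Map<String, Byte> crawlDb = new HashMap<>();
    Map<String, String> lastBatch = new HashMap<>();

    // Batch 1: a.html is fetched, b.html is discovered as an outlink.
    crawlDb.put("a.html", STATUS_FETCHED);
    lastBatch.put("a.html", "batch-1");
    crawlDb.put("b.html", STATUS_UNFETCHED);
    lastBatch.put("b.html", "batch-1");

    // Batch 2: b.html is fetched and a link back to a.html is discovered.
    crawlDb.put("b.html", STATUS_FETCHED);
    lastBatch.put("b.html", "batch-2");
    updateFromInlink(crawlDb, lastBatch, "a.html", "batch-2");

    // Prints 1 (unfetched) although a.html was already fetched in batch 1.
    System.out.println("a.html status: " + crawlDb.get("a.html"));
  }

  /**
   * Faulty update step: any link target whose row was not processed in the
   * current batch is treated as a brand-new page. The correct behaviour
   * would be to update only the score and link fields of an existing row
   * and keep its fetch status.
   */
  static void updateFromInlink(Map<String, Byte> crawlDb, Map<String, String> lastBatch,
                               String url, String currentBatch) {
    boolean processedInThisBatch = currentBatch.equals(lastBatch.get(url));
    if (!processedInThisBatch) {
      crawlDb.put(url, STATUS_UNFETCHED); // resets an already fetched page
    }
  }
}

Running the sketch prints status 1 for a.html: the link-only update from batch 2 overwrites the fetched status instead of only touching the score and link fields.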
Issue Links
- relates to: NUTCH-1679 UpdateDb using batchId, link may override crawled page. (Closed)