[NUTCH-1674] Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3
Fix Version/s: 2.3
Component/s: None
Labels: None

Patch Info: Patch Available

Description

Nutch always scans the whole crawldb in each phase (generate, fetch, parse, update, index). When the crawldb is big, the scan takes longer than the actual processing.
We really need to skip records while scanning by using the filter-enabled scan from GORA-119; for example, we could read only the records that belong to a specified batchId.
In my crawl, the filter reduced the scan time from 90 min to 30 min.
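
To illustrate the idea, here is a minimal, self-contained sketch of a scan that hands back only the rows marked with a given batchId. It does not use the Gora or Nutch APIs: `Row`, `Store`, `scan` and `batchIdEquals` are hypothetical names invented for this example, and the in-memory map merely stands in for the crawldb. In a real datastore the benefit would presumably come from evaluating the filter inside the store, so that rejected rows are never shipped to the job.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class BatchIdScanSketch {

    /** Hypothetical stand-in for a crawldb row; this is not Nutch's own row type. */
    static final class Row {
        final String url;
        final String batchId; // null when the row is not marked for any batch

        Row(String url, String batchId) {
            this.url = url;
            this.batchId = batchId;
        }
    }

    /** Hypothetical in-memory store; a real backend would apply the filter on its side. */
    static final class Store {
        private final Map<String, Row> rows = new LinkedHashMap<>();

        void put(Row row) {
            rows.put(row.url, row);
        }

        /** Returns only the rows accepted by the filter, in insertion order. */
        List<Row> scan(Predicate<Row> filter) {
            List<Row> accepted = new ArrayList<>();
            for (Row row : rows.values()) {
                if (filter.test(row)) {
                    accepted.add(row);
                }
            }
            return accepted;
        }
    }

    /** Filter that keeps rows whose batch mark equals the given batchId. */
    static Predicate<Row> batchIdEquals(String batchId) {
        return row -> batchId.equals(row.batchId);
    }

    /** Small sample crawldb: two rows in batch-1, one in batch-2, one unmarked. */
    static Store sampleStore() {
        Store store = new Store();
        store.put(new Row("http://example.com/a", "batch-1"));
        store.put(new Row("http://example.com/b", "batch-2"));
        store.put(new Row("http://example.com/c", "batch-1"));
        store.put(new Row("http://example.com/d", null));
        return store;
    }

    public static void main(String[] args) {
        Store store = sampleStore();
        // Unfiltered scan: every row reaches the caller, as in the behaviour described above.
        System.out.println("full scan rows: " + store.scan(row -> true).size());
        // Filtered scan: only rows belonging to the requested batch reach the caller.
        for (Row row : store.scan(batchIdEquals("batch-1"))) {
            System.out.println("batch-1 row: " + row.url);
        }
    }
}
```

The filter is passed into `scan` rather than applied by the caller, which mirrors the point of the request: the phase asks for one batch and never sees the rest of the crawldb.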

Attachments

File | Date | Size | Uploaded by
NUTCH-1674_2.patch | 28/Nov/13 13:02 | 16 kB | Alparslan Avcı
NUTCH-1674_3.patch | 24/Dec/13 15:25 | 16 kB | Alparslan Avcı
NUTCH-1674_final.patch | 28/Apr/14 14:15 | 15 kB | Alparslan Avcı
NUTCH-1674.patch | 25/Nov/13 02:44 | 16 kB | Tien Nguyen Manh

Issue Links

depends upon: NUTCH-1714 Nutch 2.x upgrade to Gora 0.4 (Closed)
is related to: GORA-119 implement a filter enabled scan in gora (Resolved)
relates to: NUTCH-1777 Fetcher not getting all the entries in input (Closed)

People

Assignee: Unassigned
Reporter: Tien Nguyen Manh
Votes: 2
Watchers: 7

Dates

Created: 25/Nov/13 02:43
Updated: 13/Mar/24 14:51
Resolved: 15/May/14 08:14