Details
- Improvement
- Status: Closed
- Major
- Resolution: Fixed
- 2.3
- None
- None
- Patch Available
Description
Nutch always scans the whole crawldb in each phase (generate, fetch, parse, update, index). When the crawldb is big, the scan takes longer than the actual processing.
We really need to skip records while scanning, using GORA-119. For example, we could read only the records that belong to a specified batchId.
In my crawl, the filter reduced the scan time from 90 min to 30 min.
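To illustrate the idea, here is a minimal, self-contained sketch of a batchId-filtered scan compared with a full scan. It is not the Gora or Nutch API: `BatchIdScanSketch`, `PageRecord`, `scan`, `batchIdFilter`, and `sampleStore` are hypothetical names, and an in-memory list stands in for the crawldb. With a GORA-119 style filter, the predicate would presumably be evaluated by the datastore during the scan, so non-matching rows never reach the job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these names come from Nutch or Gora.
final class BatchIdScanSketch {

    // Simplified stand-in for a crawldb row.
    static final class PageRecord {
        final String url;
        final String batchId; // mark set by the generate phase; null if unmarked

        PageRecord(String url, String batchId) {
            this.url = url;
            this.batchId = batchId;
        }
    }

    // Stand-in for a store-side scan: rows rejected by the filter are
    // dropped before they are handed to the caller. A null filter means
    // "return every row", i.e. the full scan described above.
    static List<PageRecord> scan(List<PageRecord> store, Predicate<PageRecord> filter) {
        List<PageRecord> out = new ArrayList<>();
        for (PageRecord r : store) {
            if (filter == null || filter.test(r)) {
                out.add(r);
            }
        }
        return out;
    }

    // Keeps only rows marked with the given batchId; unmarked rows are skipped.
    static Predicate<PageRecord> batchIdFilter(String batchId) {
        return r -> batchId.equals(r.batchId);
    }

    // Tiny sample crawldb: two rows belong to "batch-2".
    static List<PageRecord> sampleStore() {
        List<PageRecord> store = new ArrayList<>();
        store.add(new PageRecord("http://example.org/a", "batch-1"));
        store.add(new PageRecord("http://example.org/b", "batch-2"));
        store.add(new PageRecord("http://example.org/c", null));
        store.add(new PageRecord("http://example.org/d", "batch-2"));
        return store;
    }

    public static void main(String[] args) {
        List<PageRecord> store = sampleStore();
        // Full scan: every row reaches the job.
        System.out.println("full scan: " + scan(store, null).size() + " records");
        // Filtered scan: only rows of the requested batch reach the job.
        System.out.println("filtered scan: "
                + scan(store, batchIdFilter("batch-2")).size() + " records");
    }
}
```

In this sample the full scan returns 4 records and the filtered scan returns 2. The saving in a real crawl depends on how much of the filtering the datastore can do on its own side.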
Attachments
Issue Links
- depends upon
- NUTCH-1714 Nutch 2.x upgrade to Gora 0.4
- Closed
- is related to
- GORA-119 implement a filter enabled scan in gora
- Resolved
- relates to
- NUTCH-1777 Fetcher not getting all the entries in input
- Closed