[NUTCH-2144] Plugin to override db.ignore.external to exempt interesting external domain URLs - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.12
Component/s: crawldb, fetcher
Labels: None
Patch Info: Patch Available
Flags: Patch

Description

Create a rule-based urlfilter plugin that allows a focused crawler (db.ignore.external.links=true) to fetch static resources from external domains.
More generally, the plugin should permit interesting URLs from external domains by overriding db.ignore.external.links. Which URLs count as interesting is decided by a combination of regex and MIME-type rules.

Concrete use case:
When using Nutch to crawl images from a set of domains, the crawler needs to fetch all images, including those linked from CDNs and other external domains. In this scenario, allowing all external links and then writing hundreds of regular expressions to filter them is not feasible for a large number of domains.
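
A minimal sketch of the rule check described above, in plain Java. It is not the code from the attached patch and does not use any Nutch API: the class name ExternalLinkExemption, the method shouldFollow, and the small extension-to-MIME table are all hypothetical. It also assumes that "combination" means an external URL must satisfy both a regex rule and a MIME-type rule, which the description does not specify.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch of a regex + MIME-type exemption check (not the patch's class). */
final class ExternalLinkExemption {

  // Illustrative extension-to-MIME table; a real crawler would use a proper detector.
  private static final Map<String, String> MIME_BY_EXT = Map.of(
      "jpg", "image/jpeg",
      "jpeg", "image/jpeg",
      "png", "image/png",
      "gif", "image/gif",
      "html", "text/html");

  private final List<Pattern> urlRules = new ArrayList<>();
  private final List<String> mimePrefixes;

  ExternalLinkExemption(List<String> regexes, List<String> mimePrefixes) {
    for (String regex : regexes) {
      urlRules.add(Pattern.compile(regex));
    }
    this.mimePrefixes = List.copyOf(mimePrefixes);
  }

  /** Lower-cased host of the URL, or an empty string if the URL has none. */
  static String host(String url) {
    String h = URI.create(url).getHost();
    return h == null ? "" : h.toLowerCase(Locale.ROOT);
  }

  /** Guesses a MIME type from the path extension; returns "" when unknown. */
  static String guessMime(String url) {
    String path = URI.create(url).getPath();
    if (path == null) {
      return "";
    }
    int dot = path.lastIndexOf('.');
    if (dot < 0) {
      return "";
    }
    String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
    return MIME_BY_EXT.getOrDefault(ext, "");
  }

  /**
   * Internal links are always followed. External links are followed only
   * when they match at least one regex rule AND one MIME-type prefix
   * (the AND is an assumption of this sketch).
   */
  boolean shouldFollow(String fromUrl, String toUrl) {
    if (host(fromUrl).equals(host(toUrl))) {
      return true;
    }
    boolean regexOk = urlRules.stream().anyMatch(p -> p.matcher(toUrl).find());
    String mime = guessMime(toUrl);
    boolean mimeOk = mimePrefixes.stream().anyMatch(mime::startsWith);
    return regexOk && mimeOk;
  }
}
```

With a single regex rule such as `^https?://[^/]+/.*\.(jpg|png)$` and the MIME prefix `image/`, a link from a page on example.com to an image on a CDN host would be followed, while a link to an HTML page on another domain would still be ignored. One rule of this kind covers images on any external host, so no per-domain expressions are needed.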

Attachments

ignore-exempt.patch, 19/Oct/15 14:51, 34 kB, Thamme Gowda
ignore-exempt.patch, 19/Oct/15 07:27, 92 kB, Thamme Gowda

Issue Links

is broken by: NUTCH-2221 Introduce db.ignore.internal.links to FetcherThread (Closed)

People

Assignee: Chris A. Mattmann

Reporter: Thamme Gowda

Votes: 0

Watchers: 7

Dates

Created: 19/Oct/15 06:53

Updated: 13/Mar/24 14:51

Resolved: 29/Feb/16 07:06