Details
Description
Create a rule-based urlfilter plugin that allows a focused crawler (db.ignore.external.links=true) to fetch static resources from external domains.
The generalized version: the plugin should permit interesting URLs from external domains (by overriding db.ignore.external.links). Interesting URLs are selected by a combination of regex and MIME-type rules.
Concrete use case:
When using Nutch to crawl images from a set of domains, the crawler needs to fetch all images, including those linked from CDNs and other external domains. In this scenario, allowing all external links and then writing hundreds of regular expressions to exclude the unwanted ones is not feasible for a large number of domains.
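A minimal sketch of the rule check described above, in plain Java. It is not the plugin's actual code and does not use any Nutch API: the class name `ExternalLinkRules`, the `accept` method, and the rule lists are hypothetical. It also assumes that matching either a regex rule or a MIME-type rule is enough to permit an external URL; the description only says the two are combined.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical illustration only; names and behavior are assumptions. */
class ExternalLinkRules {
    private final List<Pattern> urlPatterns;   // regex rules applied to the target URL
    private final List<String> mimePrefixes;   // MIME-type rules, e.g. "image/"

    ExternalLinkRules(List<Pattern> urlPatterns, List<String> mimePrefixes) {
        this.urlPatterns = urlPatterns;
        this.mimePrefixes = mimePrefixes;
    }

    /**
     * Returns true if the link from fromUrl to toUrl should be kept.
     * Internal links are always kept; external links are kept only when
     * a regex rule or a MIME-type rule matches (assumed OR semantics).
     * mimeType may be null when the type is not yet known.
     */
    boolean accept(String fromUrl, String toUrl, String mimeType) {
        String fromHost = hostOf(fromUrl);
        String toHost = hostOf(toUrl);
        if (fromHost == null || toHost == null) {
            return false; // unparsable URL or missing host
        }
        if (fromHost.equalsIgnoreCase(toHost)) {
            return true; // same host: not an external link
        }
        for (Pattern p : urlPatterns) {
            if (p.matcher(toUrl).find()) {
                return true;
            }
        }
        if (mimeType != null) {
            String normalized = mimeType.toLowerCase(Locale.ROOT);
            for (String prefix : mimePrefixes) {
                if (normalized.startsWith(prefix)) {
                    return true;
                }
            }
        }
        return false;
    }

    private static String hostOf(String url) {
        try {
            return URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

For the image use case, one regex rule such as `\.(png|jpe?g|gif)(\?.*)?$` plus the MIME prefix `image/` would let CDN-hosted images through while other external pages stay blocked, without listing each CDN domain.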
Attachments
Issue Links
- is broken by
  - NUTCH-2221 Introduce db.ignore.internal.links to FetcherThread (Closed)