Nutch / NUTCH-2144

Plugin to override db.ignore.external to exempt interesting external domain URLs


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.12
    • Component/s: crawldb, fetcher
    • Labels: None
    • Patch Info: Patch Available
    • Flags: Patch

    Description

      Create a rule-based urlfilter plugin that allows a focused crawler (db.ignore.external.links=true) to fetch static resources from external domains.
      More generally, the plugin should permit interesting URLs from external domains by overriding db.ignore.external.links. Which URLs count as interesting is decided by a combination of regex and MIME-type rules.

      Concrete use case:
      When Nutch is used to crawl images from a set of domains, the crawler needs to fetch all images, including those linked from CDNs and other external domains. In this scenario, allowing all external links and then writing hundreds of regular expressions to filter them is not feasible for a large number of domains.
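
      The exemption logic described above could look roughly like the following. This is a minimal, standalone sketch and not the code from the attached patch: the class name, the rule shape (a MIME-type prefix paired with a URL regex), and the extension-based MIME guess are all assumptions made for illustration, and the sketch does not use the Nutch plugin API.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration only; names and rule format are assumptions.
class ExternalLinkExemptionSketch {

    // Assumed rule shape: a MIME-type prefix and a URL regex; both must match.
    static final class Rule {
        final String mimePrefix;
        final Pattern urlPattern;

        Rule(String mimePrefix, String urlRegex) {
            this.mimePrefix = mimePrefix;
            this.urlPattern = Pattern.compile(urlRegex);
        }
    }

    // Illustrative extension-to-MIME guesses; a real crawler would use proper detection.
    static final Map<String, String> MIME_BY_EXTENSION = Map.of(
        "jpg", "image/jpeg",
        "jpeg", "image/jpeg",
        "png", "image/png",
        "gif", "image/gif",
        "html", "text/html");

    // Example rule set for the image-crawling use case.
    static final List<Rule> RULES = List.of(
        new Rule("image/", "^https?://.*\\.(?i:jpe?g|png|gif)$"));

    // Guess a MIME type from the file extension in the URL path ("" if unknown).
    static String guessMime(String url) {
        String path = URI.create(url).getPath();
        int dot = (path == null) ? -1 : path.lastIndexOf('.');
        if (dot < 0) {
            return "";
        }
        String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return MIME_BY_EXTENSION.getOrDefault(ext, "");
    }

    // A link is external when the two hosts differ (or a host is missing).
    static boolean isExternal(String fromUrl, String toUrl) {
        String fromHost = URI.create(fromUrl).getHost();
        String toHost = URI.create(toUrl).getHost();
        return fromHost == null || !fromHost.equalsIgnoreCase(toHost);
    }

    // Mimics a crawl with external links ignored, except for rule-matched URLs.
    static boolean shouldFollow(String fromUrl, String toUrl) {
        if (!isExternal(fromUrl, toUrl)) {
            return true; // internal links are always kept
        }
        String mime = guessMime(toUrl);
        for (Rule rule : RULES) {
            if (mime.startsWith(rule.mimePrefix)
                    && rule.urlPattern.matcher(toUrl).matches()) {
                return true; // exempted external URL
            }
        }
        return false; // external and not exempted
    }
}
```

      With this rule set, an image on a CDN host linked from a page on the crawled domain would be followed, while an external HTML page would still be dropped. Keeping the rules as data rather than code is one way to avoid writing per-domain regular expressions.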

      Attachments

        1. ignore-exempt.patch
          92 kB
          Thamme Gowda
        2. ignore-exempt.patch
          34 kB
          Thamme Gowda


            People

              Assignee: chrismattmann Chris A. Mattmann
              Reporter: thammegowda Thamme Gowda
              Votes: 0
              Watchers: 7
