  1. Nutch
  2. NUTCH-1941

Optional rolling http.agent.name's

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 2.3, 1.9
    • Fix Version/s: 1.10, 2.3.1
    • Component/s: fetcher, protocol
    • Labels: None
    • Patch Info: Patch Available

    Description

      In some scenarios, even when your fetcher adheres to fetcher.crawl.delay, web admins can block it based merely on the crawler name.
      I propose the ability to use rolling http.agent.name values, which could be substituted every 5 seconds, for example. Successive requests to the same domain would then be sent with different http.agent.name values.
      This behavior should be off by default.
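
      The rotation described above can be sketched roughly as follows. This is a minimal illustration only, not the code from the attached patches: the class name RollingAgentName, its constructor parameters, and the injected clock are all hypothetical, and the sketch assumes the agent names are already loaded into a list and that a caller only uses it when rotation has been explicitly enabled.

```java
import java.util.List;
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not from the NUTCH-1941 patches): picks an agent
 * name from a fixed list, moving to the next name every intervalMillis.
 */
public class RollingAgentName {
  private final List<String> names;
  private final long intervalMillis;
  private final LongSupplier clock; // injected so the rotation can be tested
  private final long start;

  public RollingAgentName(List<String> names, long intervalMillis, LongSupplier clock) {
    if (names.isEmpty()) {
      throw new IllegalArgumentException("need at least one agent name");
    }
    if (intervalMillis <= 0) {
      throw new IllegalArgumentException("interval must be positive");
    }
    this.names = List.copyOf(names);
    this.intervalMillis = intervalMillis;
    this.clock = clock;
    this.start = clock.getAsLong();
  }

  /** Returns the agent name for the current time slot, wrapping around the list. */
  public String current() {
    long elapsed = Math.max(0L, clock.getAsLong() - start);
    int index = (int) ((elapsed / intervalMillis) % names.size());
    return names.get(index);
  }
}
```

      In real use the clock would typically be System::currentTimeMillis, and the fetcher would fall back to the single configured http.agent.name whenever rotation is switched off.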

      Attachments

        1. NUTCH-1941-2x-v6.patch (6 kB, Sebastian Nagel)
        2. NUTCH-1941-ver6.patch (7 kB, Asitang Mishra)
        3. NUTCH-1941-v5.patch (4 kB, Sebastian Nagel)
        4. NUTCH-1941-itr4.patch (3 kB, Asitang Mishra)
        5. NUTCH-1941-itr3.patch (3 kB, Asitang Mishra)
        6. NUTCH-1941-ITR2.patch (3 kB, Asitang Mishra)
        7. agent.names.txt (199 kB, Lewis John McGibbney)
        8. NUTCH-1941-ver1.patch (3 kB, Asitang Mishra)
        9. nutch.patch (46 kB, Asitang Mishra)

          People

            Assignee: Sebastian Nagel
            Reporter: Lewis John McGibbney
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved: