NUTCH-2801

RobotRulesParser command-line checker to use http.robots.agents as fall-back


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.17
    • Fix Version/s: 1.18
    • Component/s: checker, robots
    • Labels: None
    • Patch Info: Patch Available

    Description

      The RobotRulesParser command-line tool, used to check a list of URLs against a single robots.txt file, should use the value of the property http.robots.agents as fall-back if no user agent names are explicitly given as a command-line argument. In this case it should behave the same as the robots.txt parser: first look for http.agent.name, then for the other names listed in http.robots.agents, and finally pick the rules for User-agent: *
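      As an illustration, the fall-back list could be assembled from the two properties roughly as in the sketch below (a hypothetical helper written for this issue, not the actual RobotRulesParser code):

      import java.util.LinkedHashSet;
      import java.util.Set;

      import org.apache.hadoop.conf.Configuration;

      /** Hypothetical sketch: build the list of agent names the checker should
       *  test when no names are given on the command line. */
      public class AgentNameFallback {

        static Set<String> agentNames(Configuration conf) {
          // http.agent.name first, then the names from http.robots.agents;
          // duplicates dropped, everything lower-cased for case-insensitive matching
          Set<String> names = new LinkedHashSet<>();
          String agentName = conf.get("http.agent.name", "").trim();
          if (!agentName.isEmpty()) {
            names.add(agentName.toLowerCase());
          }
          for (String name : conf.get("http.robots.agents", "").split(",")) {
            name = name.trim();
            if (!name.isEmpty()) {
              names.add(name.toLowerCase());
            }
          }
          return names;
        }

        public static void main(String[] args) {
          Configuration conf = new Configuration();
          conf.set("http.agent.name", "mybot");
          conf.set("http.robots.agents", "nutch,goodbot");
          System.out.println(agentNames(conf)); // [mybot, nutch, goodbot]
        }
      }

      The following session shows the current behaviour: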

      $> cat robots.txt
      User-agent: Nutch
      Allow: /
      User-agent: *
      Disallow: /
      
      $> bin/nutch org.apache.nutch.protocol.RobotRulesParser \
            -Dhttp.agent.name=mybot \
            -Dhttp.robots.agents='nutch,goodbot' \
            robots.txt urls.txt 
      Testing robots.txt for agent names: mybot,nutch,goodbot
      not allowed:    https://www.example.com/
      

      The log message "Testing ... for ...: mybot,nutch,goodbot" is misleading. Only the name "mybot" is actually checked.
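      For context, the robots.txt matching itself is delegated to crawler-commons. The sketch below (my own demo, assuming the SimpleRobotRulesParser API and that the agent names are handed over as one comma-separated string, as Nutch does) illustrates how the outcome depends on which names actually reach the parser:

      import java.nio.charset.StandardCharsets;

      import crawlercommons.robots.BaseRobotRules;
      import crawlercommons.robots.SimpleRobotRulesParser;

      /** Hypothetical demo: only the agent names passed to the parser are
       *  matched against the robots.txt groups. */
      public class RobotsFallbackDemo {
        public static void main(String[] args) {
          String robotsTxt = "User-agent: Nutch\nAllow: /\n\nUser-agent: *\nDisallow: /\n";
          byte[] content = robotsTxt.getBytes(StandardCharsets.UTF_8);
          String url = "https://www.example.com/";
          SimpleRobotRulesParser parser = new SimpleRobotRulesParser();

          // only http.agent.name: no group matches "mybot", the wildcard group applies
          BaseRobotRules onlyAgentName =
              parser.parseContent(url, content, "text/plain", "mybot");
          System.out.println(onlyAgentName.isAllowed(url)); // false -> "not allowed"

          // with the http.robots.agents fall-back: "nutch" matches the first group
          BaseRobotRules withFallback =
              parser.parseContent(url, content, "text/plain", "mybot,nutch,goodbot");
          System.out.println(withFallback.isAllowed(url)); // true
        }
      }

      With the fall-back in place, the agent names printed in the log line and the names actually checked would agree.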


People

    • Assignee: Sebastian Nagel
    • Reporter: Sebastian Nagel
    • Votes: 0
    • Watchers: 3
