Description
The RobotRulesParser command-line tool, used to check a list of URLs against a single robots.txt file, should use the value of the property http.robots.agents as a fall-back if no user agent names are explicitly given as a command-line argument. In this case it should behave the same as the robots.txt parser: look first for http.agent.name, then for the other names listed in http.robots.agents, and finally pick the rules for User-agent: *
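A minimal sketch of the proposed fall-back order, using java.util.Properties as a stand-in for Nutch's Configuration (http.agent.name and http.robots.agents are the real property names; the helper itself is hypothetical, not the actual Nutch code):

import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

public class AgentNamesFallback {
    static String agentNames(Properties conf, String cliAgents) {
        if (cliAgents != null && !cliAgents.isEmpty()) {
            return cliAgents; // explicit command-line argument wins
        }
        // Fall back to the same order the robots.txt parser uses:
        // http.agent.name first, then every name in http.robots.agents.
        Set<String> names = new LinkedHashSet<>();
        String agentName = conf.getProperty("http.agent.name");
        if (agentName != null && !agentName.trim().isEmpty()) {
            names.add(agentName.trim().toLowerCase());
        }
        for (String n : conf.getProperty("http.robots.agents", "").split(",")) {
            if (!n.trim().isEmpty()) {
                names.add(n.trim().toLowerCase());
            }
        }
        // With no names configured at all, only the rules for
        // "User-agent: *" would apply.
        return String.join(",", names);
    }
}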
$> cat robots.txt
User-agent: Nutch
Allow: /

User-agent: *
Disallow: /

$> bin/nutch org.apache.nutch.protocol.RobotRulesParser \
     -Dhttp.agent.name=mybot \
     -Dhttp.robots.agents='nutch,goodbot' \
     robots.txt urls.txt
Testing robots.txt for agent names: mybot,nutch,goodbot
not allowed:	https://www.example.com/
The log message "Testing ... for ...: mybot,nutch,goodbot" is misleading. Only the name "mybot" is actually checked.
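For illustration, a self-contained check of the robots.txt above using crawler-commons, which Nutch's robots.txt handling builds on. SimpleRobotRulesParser.parseContent accepts a comma-separated list of agent names, so passing the full fall-back list should match the "User-agent: Nutch" group and print "allowed" rather than the "not allowed" seen above. This is a sketch of the expected behavior, not the tool's actual code:

import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsCheckSketch {
    public static void main(String[] args) {
        String robotsTxt =
            "User-agent: Nutch\nAllow: /\n\nUser-agent: *\nDisallow: /\n";
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        // All three names are passed; the first matching group wins,
        // so the Nutch group's "Allow: /" applies.
        BaseRobotRules rules = parser.parseContent(
            "https://www.example.com/robots.txt",
            robotsTxt.getBytes(StandardCharsets.UTF_8),
            "text/plain",
            "mybot,nutch,goodbot");
        System.out.println((rules.isAllowed("https://www.example.com/")
            ? "allowed" : "not allowed") + ":\thttps://www.example.com/");
    }
}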