[CONNECTORS-153] Crawler should follow the robots meta tag rules - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: ManifoldCF 0.1
Fix Version/s: ManifoldCF 0.2
Component/s: Web connector
Labels: None

Description

The web crawler obeys robots.txt files, but not the robots meta tag rules. If a document includes the following meta tag, the crawler ignores the tag and fetches the document anyway:
<meta name="robots" content="noindex, nofollow" />

I recommend the following changes to the crawler, to take effect when one of the "Obey robots.txt ..." options is set:

1. <meta name="robots" content="noindex, nofollow" />

do not fetch the document at all

2. <meta name="robots" content="noindex, follow" />

do not index the document; only follow the links in it

3. <meta name="robots" content="index, nofollow" />

fetch the document, but do not follow any links in it

4. Change most of the text that appears on the page for the robots option settings to something like:
"Robots.txt usage" => "Robots.txt and Robots <meta> tag usage"
"Don't look at robots.txt" => "Ignore robots settings"
"Obey robots.txt for data caches only" => "Follow robots rules for data caches only"
"Obey robots.txt for all fetches" => "Follow robots rules for all fetches"

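Rules 1 to 3 above could be sketched roughly as follows. This is only an illustration in Python, not the web connector's actual code; the names RobotsDirectives, RobotsMetaFinder and directives_for are hypothetical. It assumes that "do not fetch" in practice means "do not ingest", since the meta tag only becomes visible once the page has been downloaded, and it also treats the standard "none" value as shorthand for "noindex, nofollow".

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass(frozen=True)
class RobotsDirectives:
    """Hypothetical result type: what the crawler may do with a page."""
    index: bool = True   # may the document be ingested?
    follow: bool = True  # may the links in the document be queued?


class RobotsMetaFinder(HTMLParser):
    """Records the content attribute of the first <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta ... />) are routed here as well.
        if tag == "meta" and self.content is None:
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() == "robots":
                self.content = attributes.get("content") or ""


def directives_for(html: str) -> RobotsDirectives:
    """Map a page's robots meta tag to index/follow permissions."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    finder.close()
    if finder.content is None:
        # No robots meta tag: assume the page may be indexed and followed.
        return RobotsDirectives()
    tokens = {t.strip().lower() for t in finder.content.split(",") if t.strip()}
    return RobotsDirectives(
        index=not ({"noindex", "none"} & tokens),
        follow=not ({"nofollow", "none"} & tokens),
    )
```

A crawler using this sketch would skip ingestion when index is False and skip link extraction when follow is False, but only if one of the "Obey robots.txt ..." options is enabled.
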
People

Assignee: Karl Wright

Reporter: Erlend Garåsen

Votes: 0

Watchers: 0

Dates

Created: 28/Jan/11 10:28

Updated: 02/Feb/11 18:14

Resolved: 28/Jan/11 12:48