ManifoldCF / CONNECTORS-153

Crawler should follow the robots meta tag rules


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: ManifoldCF 0.1
    • Fix Version/s: ManifoldCF 0.2
    • Component/s: Web connector
    • Labels: None

    Description

      The web crawler obeys robots.txt files, but not the rules in the robots meta tag. If a document includes the following meta tag, the crawler ignores the tag and fetches the document anyway:
      <meta name="robots" content="noindex, nofollow" />
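      To make the missing behavior concrete, here is a minimal sketch of how the directives in such a tag could be read out of a fetched page. It uses Python's standard html.parser module purely for illustration; the names RobotsMetaParser and robots_meta_directives are hypothetical and are not part of ManifoldCF.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated, case-insensitive list.
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_meta_directives(html):
    """Return the set of robots directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

      For the tag quoted above, robots_meta_directives would return a set containing "noindex" and "nofollow"; a page without a robots meta tag yields an empty set.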

      I recommend the following changes to the crawler, to apply whenever one of the "Obey robots.txt ..." options is set:

      1. <meta name="robots" content="noindex, nofollow" />

      • do not fetch the document at all

      2. <meta name="robots" content="noindex, follow" />

      • do not index the document; only follow the links in it

      3. <meta name="robots" content="index, nofollow" />

      • fetch the document, but do not follow any links in it

      4. Change most of the text that appears on the page for the robots option settings to something like:
      "Robots.txt usage" => "Robots.txt and Robots <meta> tag usage"
      "Don't look at robots.txt" => "Ignore robots settings"
      "Obey robots.txt for data caches only" => "Follow robots rules for data caches only"
      "Obey robots.txt for all fetches" => "Follow robots rules for all fetches"
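
      The handling proposed in items 1 to 3 can be sketched as a small decision function. This is an illustration under stated assumptions, not ManifoldCF code: the names RobotsDecision and decide are hypothetical, and because the meta tag sits inside the document itself, the sketch reads "do not fetch" in item 1 as "do not index". A missing directive is assumed to default to index and follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RobotsDecision:
    """What the crawler should do with a page (hypothetical type)."""
    index_document: bool
    follow_links: bool


def decide(content, obey_robots=True):
    """Map a robots meta 'content' value to a crawl decision.

    obey_robots stands in for the "Obey robots.txt ..." options; when it is
    False the tag is ignored, which is the current behavior described above.
    """
    if not obey_robots:
        return RobotsDecision(index_document=True, follow_links=True)
    tokens = {t.strip().lower() for t in content.split(",") if t.strip()}
    return RobotsDecision(
        index_document="noindex" not in tokens,
        follow_links="nofollow" not in tokens,
    )
```

      With this sketch, decide("noindex, follow") keeps the links but drops the document, and decide("index, nofollow") does the reverse.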

          People

            Assignee: Karl Wright (kwright@metacarta.com)
            Reporter: Erlend Garåsen (erlendfg)