NUTCH-1878 (Nutch; sub-task of NUTCH-1483 Can't crawl filesystem with protocol-file plugin)

urlnormalizer-regex to keep third slash in file:///path/index.html

Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.9, 2.2.1
    • Fix Version/s: 2.3, 1.10
    • Component/s: protocol
    • Labels: None
    • Patch Info: Patch Available

    Description

      The rule

      <!-- removes duplicate slashes -->
      <regex>
        <pattern>(?&lt;!:)/{2,}</pattern>
        <substitution>/</substitution>
      </regex>
      

      in regex-normalize.xml removes the third slash in file:///path/index.html. The resulting URL, file://path/index.html, fails to fetch because "path" is now interpreted as the host part of the URL, in the same position as "localhost" in file://localhost/path/index.html; cf. Wikipedia, RFC 1738 (1994), and RFC 3986 (2005).
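
The effect can be reproduced outside Nutch with plain java.util.regex and java.net.URI. The following is only a minimal sketch: the class name ThirdSlashDemo and its helper methods are made up for illustration and are not part of Nutch or of the attached patch. It applies the pattern from the rule above (with the XML entity &lt; decoded to <) and then asks java.net.URI which part of the result it treats as the host.

```java
import java.net.URI;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
public class ThirdSlashDemo {

    // Pattern copied from the regex-normalize.xml rule, XML entity decoded.
    static final Pattern DUPLICATE_SLASHES = Pattern.compile("(?<!:)/{2,}");

    // Applies the "removes duplicate slashes" substitution to a URL string.
    static String normalize(String url) {
        return DUPLICATE_SLASHES.matcher(url).replaceAll("/");
    }

    // Returns the host component as parsed by java.net.URI (null if absent).
    static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    // Returns the path component as parsed by java.net.URI.
    static String pathOf(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        String original = "file:///path/index.html";
        String normalized = normalize(original);

        System.out.println("original:   " + original);
        System.out.println("  host = " + hostOf(original) + ", path = " + pathOf(original));
        System.out.println("normalized: " + normalized);
        System.out.println("  host = " + hostOf(normalized) + ", path = " + pathOf(normalized));
    }
}
```

The negative lookbehind only protects the slash directly after the colon, so the second and third slashes of file:/// are still matched as a run of two and collapsed into one. After that, the first path segment sits in the authority position and is parsed as the host.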

      (split as sub-task from NUTCH-1483)

      Attachments

        1. NUTCH-1878-v1.patch
          2 kB
          Sebastian Nagel

            People

              Assignee: Unassigned
              Reporter: Sebastian Nagel (snagel)
              Votes: 0
              Watchers: 2
