Nutch / NUTCH-1892

Update the FileDumper tool to fetch only those URLs with status db_fetched in nutch


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • 2.2.1
    • None
    • nutchNewbie
    • None

    Description

      The FileDumper tool reads the data crawled by Nutch and dumps it back out as the original raw files. It currently dumps every record regardless of fetch status, duplicates, etc. As a result, it also dumps files that were fetched in error and files that were never fetched because the server made them unavailable.

      The fix should be to dump only those files that Nutch fetched successfully, i.e. those whose URLs have status db_fetched.
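
      A minimal sketch of the proposed filter, not the actual FileDumper code. The `CrawledRecord` type and the `selectFetched` helper are hypothetical stand-ins invented for illustration; only the status name db_fetched comes from this issue, and the other status strings are example values. In a real patch the check would sit wherever the tool decides whether to write a record to disk.

```java
import java.util.ArrayList;
import java.util.List;

public class DumpFilterSketch {

    // Hypothetical stand-in for one crawled entry; this is NOT a Nutch class.
    static final class CrawledRecord {
        final String url;
        final String status;   // e.g. "db_fetched", "db_gone", "db_unfetched"
        final byte[] content;

        CrawledRecord(String url, String status, byte[] content) {
            this.url = url;
            this.status = status;
            this.content = content;
        }
    }

    // The only status the issue asks the dumper to keep.
    static final String DB_FETCHED = "db_fetched";

    // Returns the records whose status is db_fetched, preserving input order.
    static List<CrawledRecord> selectFetched(List<CrawledRecord> records) {
        List<CrawledRecord> kept = new ArrayList<>();
        for (CrawledRecord r : records) {
            if (DB_FETCHED.equals(r.status)) {
                kept.add(r);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<CrawledRecord> records = new ArrayList<>();
        records.add(new CrawledRecord("http://example.com/a.pdf", "db_fetched", new byte[] {1}));
        records.add(new CrawledRecord("http://example.com/b.pdf", "db_gone", new byte[0]));
        records.add(new CrawledRecord("http://example.com/c.pdf", "db_unfetched", new byte[0]));

        // Only records that pass the filter would be written to the dump directory.
        for (CrawledRecord r : selectFetched(records)) {
            System.out.println("would dump: " + r.url);
        }
    }
}
```

      Filtering before any file is written keeps error pages and unavailable resources out of the dump directory altogether, rather than requiring a cleanup pass afterwards.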

          People

            Assignee: Unassigned
            Reporter: Prasanth Iyer (shekarprashant)
            Votes: 0
            Watchers: 2
