  1. Nutch
  2. NUTCH-1674

Use batchId filter to enable scan (GORA-119) for Fetch, Parse, Update, Index

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3
    • Fix Version/s: 2.3
    • Component/s: None
    • Labels: None
    • Patch Info: Patch Available

    Description

      Nutch always scans the whole crawldb in each phase (generate, fetch, parse, update, index). When the crawldb is big, the scan takes longer than the actual processing.
      We need to skip records while scanning by using the filter support from GORA-119; for example, we can retrieve only the records that belong to a specified batchId.
      In my crawl, the filter reduced the scan time from 90 min to 30 min.
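
      The idea can be illustrated with a minimal sketch in plain Java, with no Gora or Nutch dependency. Row, Store, scan and batchIdEquals are hypothetical names made up for this illustration; how the attached patches wire the filter into Gora's query API is not shown here. The sketch only shows the principle: the predicate is evaluated inside the scan, so rows from other batches never reach the job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class BatchIdScanSketch {

    /** Hypothetical stand-in for one crawldb row (not Nutch's WebPage). */
    static final class Row {
        final String url;
        final String batchId; // null if the row is not marked for any batch

        Row(String url, String batchId) {
            this.url = url;
            this.batchId = batchId;
        }
    }

    /** Hypothetical store; the filter runs inside scan(), before rows reach the job. */
    static final class Store {
        private final List<Row> rows = new ArrayList<>();

        void put(Row row) {
            rows.add(row);
        }

        List<Row> scan(Predicate<Row> filter) {
            List<Row> matched = new ArrayList<>();
            for (Row row : rows) {
                if (filter.test(row)) {
                    matched.add(row);
                }
            }
            return matched;
        }
    }

    /** Keeps only rows whose batchId equals the given one; null marks never match. */
    static Predicate<Row> batchIdEquals(String batchId) {
        return row -> batchId.equals(row.batchId);
    }

    public static void main(String[] args) {
        Store store = new Store();
        store.put(new Row("http://example.com/a", "batch-1"));
        store.put(new Row("http://example.com/b", null));
        store.put(new Row("http://example.com/c", "batch-2"));
        store.put(new Row("http://example.com/d", "batch-1"));

        // Unfiltered: every row is handed to the job, which must skip most of them.
        List<Row> all = store.scan(row -> true);
        // Filtered: only rows of the requested batch are handed to the job.
        List<Row> batch = store.scan(batchIdEquals("batch-1"));

        System.out.println("unfiltered rows: " + all.size());
        System.out.println("filtered rows:   " + batch.size());
    }
}
```

      In this toy data set the filtered scan hands the job 2 of the 4 rows. In a real store the saving depends on whether the backend can evaluate the filter close to the data instead of shipping every row to the client.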

      Attachments

        1. NUTCH-1674_2.patch
          16 kB
          Alparslan Avcı
        2. NUTCH-1674_3.patch
          16 kB
          Alparslan Avcı
        3. NUTCH-1674_final.patch
          15 kB
          Alparslan Avcı
        4. NUTCH-1674.patch
          16 kB
          Tien Nguyen Manh

            People

              Assignee: Unassigned
              Reporter: Tien Nguyen Manh (tiennm)
              Votes: 2
              Watchers: 7
