  1. Nutch
  2. NUTCH-2384

Nutch 2.3.1 job not properly interacting with Hadoop 2.7.1

Details

    • Test
    • Status: Closed
    • Major
    • Resolution: Incomplete
    • 2.3.1
    • None
    • nutchNewbie
    • None
    • Nutch 2.3.1 + Hadoop 2.7.1 + MongoDB

    • Important

    Description

      Hey,

      I am testing the Nutch crawler in a local environment as well as on a Hadoop cluster.

      The script is able to fetch millions of documents, but the Apache Nutch job created by running the command "ant clean runtime" fails to do so.

      When testing in the local environment, i.e., using the following command:
      bin/nutch fetch -all -crawlId <table-name>
      it ends up fetching all the URLs that are present in the queue, and I have been able to crawl over 100,000 URLs (5000 seed URLs).

      Whereas when I run the same project on the Hadoop cluster, I am not able to reach even the 100,000 mark. It has fetched only 45,000 URLs (1100 seed URLs).
      When tested with 5000 seed URLs, it still fetched only about the same amount of data.
      The plugins used in Nutch are as follows:
      protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|urlnormalizer-(pass|regex|basic)|scoring-opic
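
Nutch treats the plugin list above (the `plugin.includes` value) as a regular expression matched against plugin ids. As a rough check, the sketch below shows which ids that expression would admit. It assumes whole-id matching and uses Python's `re` module in place of Java's regex engine (the two agree for a pattern this simple). The helper name `is_included` and the candidate ids in the loop are illustrative only.

```python
import re

# plugin.includes value copied from the report above.
PLUGIN_INCLUDES = (
    r"protocol-http|urlfilter-regex|parse-(html|tika|metatags)|"
    r"index-(basic|anchor|metadata)|urlnormalizer-(pass|regex|basic)|scoring-opic"
)


def is_included(plugin_id: str, pattern: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole plugin id matches the include expression."""
    return re.fullmatch(pattern, plugin_id) is not None


if __name__ == "__main__":
    # Illustrative ids: the first three are covered by the expression,
    # the last two are not.
    for pid in ["protocol-http", "parse-tika", "index-metadata",
                "protocol-httpclient", "indexer-solr"]:
        status = "included" if is_included(pid) else "excluded"
        print(f"{pid:22s} {status}")
```

Because the match is assumed to cover the whole id, `protocol-httpclient` is not admitted just because it begins with `protocol-http`.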

      The settings I am using on the Hadoop cluster are as follows:

      mapred-site.xml:

      <property>
        <name>mapreduce.map.memory.mb</name>
        <value>1024</value>
      </property>
      <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>2048</value>
      </property>
      <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx1800m</value>
      </property>
      <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx712m</value>
      </property>
      <property>
        <name>mapred.job.tracker.http.address</name>
        <value>master:50030</value>
      </property>
      <property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <value>1024</value>
      </property>
      <property>
        <name>yarn.app.mapreduce.am.command-opts</name>
        <value>-Xmx800m</value>
      </property>
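
Each JVM heap set through `-Xmx` has to fit inside the container size requested for the same task type, with some room left for non-heap JVM memory. A minimal sketch of that check follows, using the values from the mapred-site.xml settings above. The helper names `xmx_mb` and `heap_ratio` are hypothetical, and how much headroom is enough is not stated in the report.

```python
import re

# (container size in MB, JVM options) copied from the settings above.
SETTINGS = {
    "map": (1024, "-Xmx712m"),
    "reduce": (2048, "-Xmx1800m"),
    "am": (1024, "-Xmx800m"),
}


def xmx_mb(java_opts: str) -> int:
    """Extract the -Xmx heap size in MB (only m/g suffixes are handled)."""
    match = re.search(r"-Xmx(\d+)([mMgG])", java_opts)
    if match is None:
        raise ValueError(f"no -Xmx setting found in {java_opts!r}")
    size, unit = int(match.group(1)), match.group(2).lower()
    return size * 1024 if unit == "g" else size


def heap_ratio(container_mb: int, java_opts: str) -> float:
    """Fraction of the container taken by the maximum JVM heap."""
    return xmx_mb(java_opts) / container_mb


if __name__ == "__main__":
    for name, (container_mb, opts) in SETTINGS.items():
        headroom = container_mb - xmx_mb(opts)
        print(f"{name:6s} heap/container = {heap_ratio(container_mb, opts):.2f}, "
              f"headroom = {headroom} MB")
```

All three heaps fit inside their containers. The reduce heap leaves the least room in relative terms (1800 MB of 2048 MB), which may matter if the tasks use much off-heap memory.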

      yarn-site.xml:

      <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1024</value>
        <description>minimum memory allocated to containers.</description>
      </property>
      <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>5120</value>
        <description>maximum memory allocated to containers.</description>
      </property>
      <property>
        <name>yarn.scheduler.minimum-allocation-vcores</name>
        <value>1</value>
      </property>
      <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>4</value>
      </property>
      <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>12288</value>
        <description>max memory allocated to nodemanager.</description>
      </property>
      <property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>2.1</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
        <value>100</value>
      </property>
      <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
        <description>Whether virtual memory limits will be enforced for containers</description>
      </property>

      The system has 6 GB of RAM, and the available network bandwidth is 4 Mb/s.
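
The memory figures above can be cross-checked with simple arithmetic. The sketch below compares the memory the NodeManager advertises to YARN with the physical RAM. It assumes the 6 GB figure applies to each NodeManager host, which the report does not state, and the function names are hypothetical.

```python
NODEMANAGER_MB = 12288       # yarn.nodemanager.resource.memory-mb
PHYSICAL_RAM_MB = 6 * 1024   # "6 GB" from the report, assumed to be per node
MAP_CONTAINER_MB = 1024      # mapreduce.map.memory.mb
REDUCE_CONTAINER_MB = 2048   # mapreduce.reduce.memory.mb


def max_containers(pool_mb: int, container_mb: int) -> int:
    """Number of whole containers of the given size that fit in the pool."""
    return pool_mb // container_mb


def oversubscription(advertised_mb: int, physical_mb: int) -> float:
    """Ratio of memory offered to YARN versus memory physically present."""
    return advertised_mb / physical_mb


if __name__ == "__main__":
    print("advertised / physical:",
          oversubscription(NODEMANAGER_MB, PHYSICAL_RAM_MB))
    print("map containers YARN may place:",
          max_containers(NODEMANAGER_MB, MAP_CONTAINER_MB))
    print("map containers that fit in RAM:",
          max_containers(PHYSICAL_RAM_MB, MAP_CONTAINER_MB))
    print("reduce containers YARN may place:",
          max_containers(NODEMANAGER_MB, REDUCE_CONTAINER_MB))
    print("reduce containers that fit in RAM:",
          max_containers(PHYSICAL_RAM_MB, REDUCE_CONTAINER_MB))
```

Under these assumptions YARN may schedule twice as much container memory as the host physically has. Whether that accounts for the lower fetch count on the cluster is not established by the report.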

          People

            Assignee: Unassigned
            Reporter: Shubham Gupta (shubham.gupta)