
Gobblin not working with KafkaSource and mapreduce

Details

    Description

      Hi,
      I'm trying to launch gobblin-mapreduce.sh with my job config, which is almost a copy/paste from your wiki: https://github.com/linkedin/gobblin/wiki/Kafka-HDFS-Ingestion

      I'm launching Gobblin with the command:

      ```
      bin/gobblin-mapreduce.sh --conf jobs/dump-kafka.properties --workdir work/
      ```

      But the job fails with the following repeated error in all mappers:

      ```
      java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition
      at gobblin.source.extractor.extract.kafka.KafkaWrapper$KafkaOldAPI.createFetchRequest(KafkaWrapper.java:401)
      at gobblin.source.extractor.extract.kafka.KafkaWrapper$KafkaOldAPI.fetchNextMessageBuffer(KafkaWrapper.java:333)
      at gobblin.source.extractor.extract.kafka.KafkaWrapper.fetchNextMessageBuffer(KafkaWrapper.java:136)
      at gobblin.source.extractor.extract.kafka.KafkaExtractor.fetchNextMessageBuffer(KafkaExtractor.java:239)
      at gobblin.source.extractor.extract.kafka.KafkaExtractor.readRecordImpl(KafkaExtractor.java:125)
      at gobblin.instrumented.extractor.InstrumentedExtractorBase.readRecord(InstrumentedExtractorBase.java:121)
      at gobblin.instrumented.extractor.InstrumentedExtractor.readRecord(InstrumentedExtractor.java:34)
      at gobblin.runtime.LimitingExtractorDecorator.readRecord(LimitingExtractorDecorator.java:69)
      at gobblin.instrumented.extractor.InstrumentedExtractorDecorator.readRecordImpl(InstrumentedExtractorDecorator.java:64)
      at gobblin.instrumented.extractor.InstrumentedExtractorDecorator.readRecord(InstrumentedExtractorDecorator.java:57)
      at gobblin.runtime.Task.run(Task.java:169)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
      Caused by: java.lang.ClassNotFoundException: kafka.common.TopicAndPartition
      at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
      ... 14 more
      ```

      It seems that Gobblin does not include the Kafka (and other) jars in the MapReduce tasks' classpath.
      I also tried to pass all the jars in the lib/ directory via `--jars` with the command:

      ```
      bin/gobblin-mapreduce.sh --conf jobs/dump-kafka.properties --workdir work/ --jars `ls lib/* | tr '\n' ','`
      ```

      But this time, I get an error caused by clashing Guava libraries:

      ```
      Error: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
      at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:722)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
      at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:422)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
      at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
      Caused by: java.lang.reflect.InvocationTargetException
      at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
      at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
      at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:129)
      ... 7 more
      Caused by: java.lang.NoSuchMethodError: com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;
      at gobblin.configuration.SourceState.<clinit>(SourceState.java:54)
      at gobblin.runtime.mapreduce.MRJobLauncher$TaskRunner.<init>(MRJobLauncher.java:554)
      ... 12 more
      ```

      I have hadoop 2.4.0, which uses guava 11.0.2, while the one in lib/ is guava-15.0.
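A quick way to confirm a version clash like the one described above is to list which Guava versions sit in each directory. The following is a minimal sketch, not part of Gobblin: the function name `list_jar_versions` is hypothetical, and it assumes jars follow the usual `<artifact>-<version>.jar` naming.

```shell
# List the versions of an artifact found in one or more directories.
# Assumes jars are named "<artifact>-<version>.jar" (an assumption, not a Gobblin rule).
list_jar_versions() {
  artifact=$1
  shift
  for dir in "$@"; do
    for jar in "$dir"/"$artifact"-*.jar; do
      [ -e "$jar" ] || continue        # glob matched nothing in this directory
      name=${jar##*/}                  # strip the directory part
      version=${name#"$artifact"-}     # strip the "<artifact>-" prefix
      version=${version%.jar}          # strip the ".jar" suffix
      printf '%s %s\n' "$version" "$dir"
    done
  done
}
```

It could be called as `list_jar_versions guava lib "$HADOOP_HOME/share/hadoop/common/lib"`, where the Hadoop path is an assumption about the local installation layout. Two different versions in the output would be consistent with the `NoSuchMethodError` shown above.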

      Github Url : https://github.com/linkedin/gobblin/issues/386
      Github Reporter : kzarzycki-advertine
      Github Created At : 2015-10-15T07:29:37Z
      Github Updated At : 2016-03-10T00:36:08Z

      Comments


      kzarzycki wrote on 2015-10-17T06:41:39Z : Hey, does anyone have comments on this ticket? I'd be grateful for your help with this. Thank you!
      Krzysztof

      Github Url : https://github.com/linkedin/gobblin/issues/386#issuecomment-148891116


      zliu41 wrote on 2015-10-19T19:05:02Z : Hi @kzarzycki, it seems the jars in `lib` were somehow not added to the Hadoop classpath correctly. I couldn't reproduce your errors (if you run `gobblin-mapreduce.sh` from the parent dir of `lib` it should work automatically), so I can only make some guesses. In `gobblin-mapreduce.sh`, can you replace the line
      `export HADOOP_CLASSPATH=$GOBBLIN_DEP_JARS:$HADOOP_CLASSPATH`
      with one of the following:

      ```
      export HADOOP_CLASSPATH=lib:$HADOOP_CLASSPATH
      export HADOOP_CLASSPATH=lib
      export HADOOP_CLASSPATH=.:$HADOOP_CLASSPATH
      export HADOOP_CLASSPATH=.
      ```

      Then run `gobblin-mapreduce.sh` with or without option `--jars [path-to-lib]`.

      Not sure which combination is correct so you can try these options.
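One more variant worth trying: a bare directory entry such as `lib` on a Java classpath exposes loose class files only, not the jars inside it, so listing each jar explicitly may behave differently from the options above. The sketch below is an assumption-laden illustration, not the script's actual logic; `classpath_from_dir` is a hypothetical helper.

```shell
# Join every jar in a directory into a colon-separated classpath string.
classpath_from_dir() {
  dir=$1
  cp=""
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue          # directory contains no jars
    cp=${cp:+$cp:}$jar                 # add ":" only between entries
  done
  printf '%s\n' "$cp"
}

# Hypothetical usage inside gobblin-mapreduce.sh:
# export HADOOP_CLASSPATH="$(classpath_from_dir lib)${HADOOP_CLASSPATH:+:$HADOOP_CLASSPATH}"
```

The `${HADOOP_CLASSPATH:+:...}` form avoids a trailing colon when the variable is unset, since an empty classpath entry is treated as the current directory.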

      Github Url : https://github.com/linkedin/gobblin/issues/386#issuecomment-149314749


      rsimiciuc wrote on 2015-11-02T15:09:58Z : I have the same problem. Any solution?

      Github Url : https://github.com/linkedin/gobblin/issues/386#issuecomment-153047233


      klyr wrote on 2015-11-17T09:30:12Z : Hi @kzarzycki-advertine,

      I had the same problem and struggled a while to fix it.
      In my case it was a problem with the hive-exec library embedding the (not shaded) guava library. It took precedence over the newer guava library.
      Here is the related JIRA issue: https://issues.apache.org/jira/browse/HIVE-5733

      A quick fix is to remove `hive-exec-0.13.1.jar` or to leave it out of the `--jars` option.
      Upgrading to a Hive version > 1.2.0 may also work.

      I hope it will help.

      Github Url : https://github.com/linkedin/gobblin/issues/386#issuecomment-157318430


      gilmichlin wrote on 2015-11-18T16:41:23Z : I can confirm this: I can reproduce it on HDP 2.3.0 with a build made using
      ./gradlew clean build -PuseHadoop2 -PhadoopVersion=2.7.1 -PhiveVersion=1.2.1

      Upgrading to Hive 1.2.1 did not work for me.

      I just used:
      --jars `ls lib/* | grep -v hive | tr '\n' ','`

      and it worked.

      Github Url : https://github.com/linkedin/gobblin/issues/386#issuecomment-157771981


      zliu41 wrote on 2015-11-18T16:58:19Z : @klyr @gilmichlin thanks for posting! I'll see if updating the hive version works.

      Github Url : https://github.com/linkedin/gobblin/issues/386#issuecomment-157778024


      zliu41 wrote on 2015-11-18T22:28:14Z : I've updated the hive version to 1.2.1. #466

      Github Url : https://github.com/linkedin/gobblin/issues/386#issuecomment-157885351


      gilmichlin wrote on 2015-11-18T22:37:41Z : Hive 1.2.1 did not work for me with HDP 2.3.0; only the following did:
      --jars `ls lib/* | grep -v hive | tr '\n' ','`

      Github Url : https://github.com/linkedin/gobblin/issues/386#issuecomment-157887533


      zliu41 wrote on 2015-11-19T18:22:04Z : @gilmichlin is it still because of the Guava dependency? Based on HIVE-5733 it shouldn't be a problem with Hive 1.2.0 or later.

      If so, is there any hive version that works for you?

      Github Url : https://github.com/linkedin/gobblin/issues/386#issuecomment-158145837


      gilmichlin wrote on 2015-11-20T18:46:14Z : I am going to check it out over the weekend.

      Github Url : https://github.com/linkedin/gobblin/issues/386#issuecomment-158488974


      gilmichlin wrote on 2015-11-23T18:58:26Z : It is still Guava.
      You should be able to reproduce it by loading an HDP 2.3.X VM and building with the following:

      ```
      ./gradlew clean build -PuseHadoop2 -PhadoopVersion=2.7.1 -PhiveVersion=1.2.1
      ```

      Running the following Wikipedia example

      ```
      ./bin/gobblin-mapreduce.sh --conf /opt/gobblin/job/wikipedia.pull --workdir /user/root/gobblin/ --jars `ls lib/* | tr '\n' ','`
      ```

      gives the following error:

      ```
      2015-11-23 18:50:15,857 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
      at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:134)
      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:747)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
      at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:415)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
      at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
      Caused by: java.lang.reflect.InvocationTargetException
      at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
      at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
      at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:132)
      ... 7 more
      Caused by: java.lang.NoSuchMethodError: com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;
      at gobblin.configuration.SourceState.<clinit>(SourceState.java:54)
      at gobblin.runtime.mapreduce.MRJobLauncher$TaskRunner.<init>(MRJobLauncher.java:525)
      ... 12 more
      ```

      Listing the Hive jars:

      ```
      ls -l lib/ | grep hive
      -rw-r--r-- 1 root root 47713 2015-11-18 16:17 hive-ant-1.2.1.jar
      -rw-r--r-- 1 root root 292289 2015-11-18 16:17 hive-common-1.2.1.jar
      -rw-r--r-- 1 root root 20599029 2015-11-18 16:17 hive-exec-1.2.1.jar
      -rw-r--r-- 1 root root 100580 2015-11-18 16:17 hive-jdbc-1.2.1.jar
      -rw-r--r-- 1 root root 5505100 2015-11-18 16:17 hive-metastore-1.2.1.jar
      -rw-r--r-- 1 root root 916706 2015-11-18 16:17 hive-serde-1.2.1.jar
      -rw-r--r-- 1 root root 1878543 2015-11-18 16:17 hive-service-1.2.1.jar
      -rw-r--r-- 1 root root 32390 2015-11-18 16:17 hive-shims-0.20S-1.2.1.jar
      -rw-r--r-- 1 root root 60070 2015-11-18 16:17 hive-shims-0.23-1.2.1.jar
      -rw-r--r-- 1 root root 8949 2015-11-18 16:17 hive-shims-1.2.1.jar
      -rw-r--r-- 1 root root 108914 2015-11-18 16:17 hive-shims-common-1.2.1.jar
      -rw-r--r-- 1 root root 13065 2015-11-18 16:17 hive-shims-scheduler-1.2.1.jar
      ```

      Running it like this works:

      ```
      ./bin/gobblin-mapreduce.sh --conf /opt/gobblin/job/wikipedia.pull --workdir /user/root/gobblin/ --jars `ls lib/* | grep -v hive | tr '\n' ','`
      ```
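The `ls | grep -v | tr` pipeline above leaves a trailing comma and depends on shell quoting of `\n`. A minimal sketch of the same idea as a shell function follows; `jars_csv_excluding` is a hypothetical name, and unlike `grep -v hive` it only skips jars whose file name starts with the given prefix.

```shell
# Build a comma-separated jar list for --jars, skipping jars whose
# file name starts with a given prefix (for example "hive").
jars_csv_excluding() {
  dir=$1
  prefix=$2
  csv=""
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue          # directory contains no jars
    case ${jar##*/} in
      "$prefix"*) continue ;;          # skip excluded jars
    esac
    csv=${csv:+$csv,}$jar              # add "," only between entries
  done
  printf '%s\n' "$csv"
}

# Usage with the paths from the comment above:
# ./bin/gobblin-mapreduce.sh --conf /opt/gobblin/job/wikipedia.pull \
#   --workdir /user/root/gobblin/ --jars "$(jars_csv_excluding lib hive)"
```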

      Github Url : https://github.com/linkedin/gobblin/issues/386#issuecomment-159027977


      rsimiciuc wrote on 2015-11-23T19:11:31Z : I had the same problem running Gobblin on CDH5, but I managed to solve it by shading Guava.


      Github Url : https://github.com/linkedin/gobblin/issues/386#issuecomment-159031634


      x10ba wrote on 2015-12-05T01:32:53Z : Hi, I think my error is similar to this thread, so putting it here (not sure if I need to change my properties):

      Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: gobblin.source.extractor.extract.kafka.kafkaSimpleSource

      Current sys:
      centos

      Invoke:
      [bin]$ ./gobblin-mapreduce.sh --conf ~/gobblin/gobblin-dist/conf/gobblin-mapreduce.properties --workdir ~/gobblin/work --jars ~/gobblin/gobblin-dist/lib/gobblin-core.jar

      kafkaSimpleSource lives in the gobblin-core.jar

      thanks,
      x10ba

      Github Url : https://github.com/linkedin/gobblin/issues/386#issuecomment-162124443


      qizongjun wrote on 2016-03-09T23:24:22Z : Has anyone had any luck with this? I am facing the same Kafka problem. I found the Kafka jar inside gobblin/lib, and it contains TopicAndPartition.class.

      I am using the latest Gobblin code.

      I tried removing the hive-exec.jar too. It did not work for me.

      2016-03-09 22:35:01,845 ERROR [TaskExecutor-0] gobblin.runtime.Task: Task task_kafka2hdfs_1457562888703_1 failed
      java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition
      at gobblin.source.extractor.extract.kafka.KafkaWrapper$KafkaOldAPI.createFetchRequest(KafkaWrapper.java:401)
      at gobblin.source.extractor.extract.kafka.KafkaWrapper$KafkaOldAPI.fetchNextMessageBuffer(KafkaWrapper.java:333)
      at gobblin.source.extractor.extract.kafka.KafkaWrapper.fetchNextMessageBuffer(KafkaWrapper.java:136)
      at gobblin.source.extractor.extract.kafka.KafkaExtractor.fetchNextMessageBuffer(KafkaExtractor.java:227)
      at gobblin.source.extractor.extract.kafka.KafkaExtractor.readRecordImpl(KafkaExtractor.java:123)
      at gobblin.instrumented.extractor.InstrumentedExtractorBase.readRecord(InstrumentedExtractorBase.java:121)
      at gobblin.instrumented.extractor.InstrumentedExtractor.readRecord(InstrumentedExtractor.java:34)
      at gobblin.instrumented.extractor.InstrumentedExtractorDecorator.readRecordImpl(InstrumentedExtractorDecorator.java:64)
      at gobblin.instrumented.extractor.InstrumentedExtractorDecorator.readRecord(InstrumentedExtractorDecorator.java:57)
      at gobblin.runtime.Task.run(Task.java:172)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:745)
      Caused by: java.lang.ClassNotFoundException: kafka.common.TopicAndPartition
      at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      ... 13 more

      Github Url : https://github.com/linkedin/gobblin/issues/386#issuecomment-194562251


      stakiar wrote on 2016-03-10T00:36:08Z : Adding `kafka_2.11-0.8.2.1.jar` to the `--jars` option when running `bin/gobblin-mapreduce.sh` fixes this.
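As a sketch of that fix, the Kafka jar can be located in `lib` rather than hard-coding its version. The `kafka_*.jar` pattern is an assumption based on the jar name mentioned above, and `find_kafka_jar` is a hypothetical helper, not part of Gobblin.

```shell
# Print the first Kafka jar found in a directory, or fail if none exists.
# The "kafka_*.jar" pattern is an assumption based on the jar name in the thread.
find_kafka_jar() {
  for jar in "$1"/kafka_*.jar; do
    if [ -e "$jar" ]; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  echo "no kafka_*.jar found in $1" >&2
  return 1
}

# Usage, with the job file and work directory from the original report:
# bin/gobblin-mapreduce.sh --conf jobs/dump-kafka.properties --workdir work/ \
#   --jars "$(find_kafka_jar lib)"
```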

      Github Url : https://github.com/linkedin/gobblin/issues/386#issuecomment-194589153

          People

            Assignee: Unassigned
            Reporter: Abhishek Tiwari (abti)