[GOBBLIN-91] No AbstractFileSystem for scheme: null (EMR 4.7.2, Hadoop 2.7.2) - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:

External issue URL:
https://github.com/linkedin/gobblin/issues/1162

Description

The instructions [here](http://gobblin.readthedocs.io/en/latest/user-guide/FAQs/#how-do-i-fix-unsupportedfilesystemexception-no-abstractfilesystem-for-scheme-null) have not resolved this issue.

We're trying to run Gobblin on AWS EMR 4.7.2, which has Amazon's Hadoop 2.7.2. (v2.6 seems popular for Gobblin, but the only available EMR version with 2.6 is deprecated and not recommended due to stability issues.)

Some notes:

- We're pulling from Kafka, using a custom schema/serde framework for Avro, and publishing to S3.
- Our Gobblin repo is checked out at 0.7.0 because of dependency issues related to our serde and S3.
- We build Gobblin with Gradle against Hadoop 2.7.2. A separate Clojure project builds our serde and injects it into the Gobblin lib dir, and some aws/s3 jdk libraries come in as well.
- Everything functions in standalone mode on a single EC2 instance.
- We've tried running Gobblin in MR mode on EMR both with and without the Hadoop jars in the Gobblin lib directory, and the EMR Hadoop bin and classpath dirs are being recognized.

Config details:
This script sources test environment variables and passes them off to the mapreduce config:

```
#!/bin/bash

# Sets test environment variables for schwartz-gobblin

# Set Hadoop home
export HADOOP_BIN_DIR=/usr/bin
export HADOOP_CLASSPATH=/usr/lib/hadoop

# Use the following for obtaining IAM role keys
eval $(./sts-assume-get-keys arn:aws:iam::xxxxxxx:role/xxxxxxxxx)

# Needed by Gobblin
export GOBBLIN_WORK_DIR=hdfs://<host:8020>/gobblin/work
export GOBBLIN_JOB_CONFIG_DIR=/home/hadoop/gobblin-dist/jobs
export ZOOKEEPER_CONNECT=<zookeeper1,zookeeper2,zookeeper3>

# Test specific
export SCHWARTZ_GOBBLIN_FINAL_DIR=s3a://<bucket>/gobblin-test
export SCHWARTZ_GOBBLIN_FINAL_TABLE=data
export SCHWARTZ_GOBBLIN_STATE_STORE_URI=s3a://<bucket>/gobblin-test/state-store
export SCHWARTZ_GOBBLIN_STATE_STORE_DIR=s3a://<bucket>/gobblin-test/state-store
export SCHWARTZ_GOBBLIN_PUBLISHER_URI=s3a://<bucket>/gobblin-test/data
export SCHWARTZ_GOBBLIN_JOB_LENGTH=1
```

gobblin-mapreduce.properties then looks like this:

```
###############################################################################
###################### Gobblin MapReduce configurations #######################
###############################################################################

# Source parameters
kafka.brokers=<kafka1,kafka2,kafka3>
source.class=gobblin-marsh.MarshGobblinSource
extract.namespace=gobblin-marsh

# Writer and publisher parameters
data.publisher.type=gobblin.publisher.TimePartitionedDataPublisher
data.publisher.final.dir=${env:SCHWARTZ_GOBBLIN_FINAL_DIR}
writer.builder.class=gobblin.writer.AvroDataWriterBuilder
writer.partitioner.class=gobblin.CustomTimePartitioner
writer.file.path.type=default
writer.file.path=${env:SCHWARTZ_GOBBLIN_FINAL_TABLE}
writer.destination.type=HDFS
writer.output.format=AVRO
writer.codec.type=SNAPPY
writer.staging.dir=${env:GOBBLIN_WORK_DIR}/task-staging
writer.output.dir=${env:GOBBLIN_WORK_DIR}/task-output
data.publisher.replace.final.dir=false

# File system parameters
fs.uri=hdfs://<namenodehost>:8020
writer.fs.uri=${fs.uri}
state.store.fs.uri=s3a://<bucket>/gobblin-test/state-store

# S3 parameters
state.store.dir=s3a://<bucket>/gobblin-test/state-store
data.publisher.fs.uri=${env:SCHWARTZ_GOBBLIN_PUBLISHER_URI}
fs.s3a.access.key=${env:aws_access_key_id}
fs.s3a.secret.key=${env:aws_secret_access_key}
fs.s3a.buffer.dir=<bufferdir>

# Gobblin execution parameters
taskexecutor.threadpool.size=10
taskretry.threadpool.coresize=4
taskretry.threadpool.maxsize=2
jobconf.dir=${env:GOBBLIN_JOB_CONFIG_DIR}

# Where to start; How long to run each task
bootstrap.with.offset=earliest
extract.limit.enabled=true
extract.limit.type=time
extract.limit.time.limit=${env:SCHWARTZ_GOBBLIN_JOB_LENGTH}
extract.limit.time.limit.timeunit=minutes

# Directory where error files from the quality checkers are stored
qualitychecker.row.err.file=${fs.uri}/gobblin/err

# Directory where job locks are stored
job.lock.dir=${env:GOBBLIN_WORK_DIR}/locks

# Directory where metrics log files are stored
metrics.log.dir=${env:GOBBLIN_WORK_DIR}/metrics

# Interval of task state reporting in milliseconds
task.status.reportintervalinms=5000

# MapReduce properties
mr.job.root.dir=${fs.uri}/gobblin/work/working
```
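
Since the properties above rely on `${env:...}` lookups, one cheap sanity check before launching is to confirm that every referenced variable is actually set in the launching shell. The following is a minimal sketch, not part of Gobblin: the helper name `missing_env_refs`, the sample lines, and the `namenode` host are hypothetical, and it only scans for the `${env:NAME}` pattern rather than reproducing Gobblin's own interpolation. The lookup is case-sensitive, so it would distinguish `aws_access_key_id` from `AWS_ACCESS_KEY_ID`.

```python
import os
import re

# Matches ${env:NAME} references as they appear in the properties file.
ENV_REF = re.compile(r"\$\{env:([A-Za-z_][A-Za-z0-9_]*)\}")


def missing_env_refs(properties_text, environ=None):
    """Return sorted names referenced as ${env:NAME} but absent from environ."""
    if environ is None:
        environ = os.environ
    referenced = set(ENV_REF.findall(properties_text))
    return sorted(name for name in referenced if name not in environ)


if __name__ == "__main__":
    # Hypothetical excerpt; in practice, read gobblin-mapreduce.properties.
    sample = "\n".join([
        "writer.staging.dir=${env:GOBBLIN_WORK_DIR}/task-staging",
        "fs.s3a.access.key=${env:aws_access_key_id}",
    ])
    # Hypothetical environment containing only one of the two variables.
    fake_env = {"GOBBLIN_WORK_DIR": "hdfs://namenode:8020/gobblin/work"}
    print(missing_env_refs(sample, fake_env))  # ['aws_access_key_id']
```

An empty result only means the variables exist in the shell that ran the check; it says nothing about whether their values are correct.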

The job output (marshmallow is our serde) is:

```
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/gobblin-dist/lib/gobblin-marshmallow-0.1.0-SNAPSHOT-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/gobblin-dist/lib/slf4j-log4j12-1.7.21.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread main gobblin.runtime.JobException: Job job_schwartzTest004S3v2_1469633575428 failed
at gobblin.runtime.AbstractJobLauncher.launchJob(AbstractJobLauncher.java:363)
at gobblin.runtime.mapreduce.CliMRJobLauncher.launchJob(CliMRJobLauncher.java:87)
at gobblin.runtime.mapreduce.CliMRJobLauncher.run(CliMRJobLauncher.java:64)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at gobblin.runtime.mapreduce.CliMRJobLauncher.main(CliMRJobLauncher.java:110)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
```

The gobblin-current.log contains the following:

```
...
2016-07-27 15:33:01 UTC INFO [main] org.apache.hadoop.yarn.client.RMProxy 92 - Connecting to ResourceManager at xxxxxxxxxxxxxxx:8032
2016-07-27 15:33:02 UTC INFO [main] org.apache.hadoop.mapreduce.lib.input.FileInputFormat 287 - Total input paths to process : 1
2016-07-27 15:33:02 UTC INFO [main] org.apache.hadoop.mapreduce.JobSubmitter 396 - number of splits:100
2016-07-27 15:33:02 UTC INFO [main] org.apache.hadoop.mapreduce.JobSubmitter 479 - Submitting tokens for job: job_1469481831759_0035
2016-07-27 15:33:02 UTC WARN [main] org.apache.hadoop.security.UserGroupInformation 1551 - PriviledgedActionException as:hadoop (auth:SIMPLE) cause:org.apache.hadoop.fs.UnsupportedFileSystemException: No AbstractFileSystem for scheme: null
2016-07-27 15:33:02 UTC INFO [main] org.apache.hadoop.mapreduce.JobSubmitter 441 - Cleaning up the staging area /tmp/hadoop-yarn/staging/hadoop/.staging/job_1469481831759_0035
2016-07-27 15:33:02 UTC WARN [main] org.apache.hadoop.security.UserGroupInformation 1551 - PriviledgedActionException as:hadoop (auth:SIMPLE) cause:org.apache.hadoop.fs.UnsupportedFileSystemException: No AbstractFileSystem for scheme: null
2016-07-27 15:33:02 UTC INFO [TaskStateCollectorService STOPPING] gobblin.runtime.TaskStateCollectorService 98 - Stopping the TaskStateCollectorService
2016-07-27 15:33:02 UTC WARN [TaskStateCollectorService STOPPING] gobblin.runtime.TaskStateCollectorService 119 - Output task state path hdfs://<host:8020>/gobblin/work/working/schwartzTest004S3v2/output/job_schwartzTest004S3v2_1469633575428 does not exist
2016-07-27 15:33:02 UTC INFO [main] gobblin.runtime.mapreduce.MRJobLauncher 464 - Deleted working directory hdfs://<host:8020>/gobblin/work/working/schwartzTest004S3v2
2016-07-27 15:33:02 UTC ERROR [main] gobblin.runtime.AbstractJobLauncher 321 - Failed to launch and run job job_schwartzTest004S3v2_1469633575428: org.apache.hadoop.fs.UnsupportedFileSystemException: No AbstractFileSystem for scheme: null
org.apache.hadoop.fs.UnsupportedFileSystemException: No AbstractFileSystem for scheme: null
at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:152)
at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:240)
at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:332)
at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:329)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:329)
at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:443)
at org.apache.hadoop.mapred.YARNRunner.createApplicationSubmissionContext(YARNRunner.java:360)
at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:285)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at gobblin.runtime.mapreduce.MRJobLauncher.runWorkUnits(MRJobLauncher.java:198)
at gobblin.runtime.AbstractJobLauncher.launchJob(AbstractJobLauncher.java:296)
at gobblin.runtime.mapreduce.CliMRJobLauncher.launchJob(CliMRJobLauncher.java:87)
at gobblin.runtime.mapreduce.CliMRJobLauncher.run(CliMRJobLauncher.java:64)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at gobblin.runtime.mapreduce.CliMRJobLauncher.main(CliMRJobLauncher.java:110)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
2016-07-27 15:33:02 UTC INFO [main] gobblin.util.ExecutorsUtils 125 - Attempting to shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@7eee074d[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
2016-07-27 15:33:02 UTC INFO [main] gobblin.util.ExecutorsUtils 144 - Successfully shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@7eee074d[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
2016-07-27 15:33:02 UTC INFO [main] gobblin.util.ExecutorsUtils 125 - Attempting to shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@531f7b83[Shutting down, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 1]
2016-07-27 15:33:02 UTC INFO [main] gobblin.util.ExecutorsUtils 144 - Successfully shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@531f7b83[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
2016-07-27 15:33:02 UTC INFO [main] gobblin.util.ExecutorsUtils 125 - Attempting to shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@5103fa5c[Shutting down, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 1]
2016-07-27 15:33:02 UTC INFO [main] gobblin.util.ExecutorsUtils 144 - Successfully shutdown ExecutorService: java.util.concurrent.ThreadPoolExecutor@5103fa5c[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
2016-07-27 15:33:02 UTC INFO [main] gobblin.runtime.app.ServiceBasedAppLauncher 162 - Shutting down the application
2016-07-27 15:33:02 UTC WARN [Thread-6] gobblin.runtime.app.ServiceBasedAppLauncher 157 - ApplicationLauncher has already stopped
```
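
The message `No AbstractFileSystem for scheme: null` suggests that Hadoop's `FileContext` was handed a URI with no scheme, although the log does not say which path that was. A rough way to narrow it down is to list path-like settings whose resolved value lacks a `scheme://` prefix. The sketch below is only an illustration under that assumption: `schemeless_paths` is a hypothetical helper, the sample values are made up, and Python's `urlparse` stands in for Java's `java.net.URI.getScheme()`, which returns `null` for a bare path.

```python
from urllib.parse import urlparse


def schemeless_paths(settings):
    """Return sorted keys whose value looks like a path but has no URI scheme."""
    flagged = []
    for key, value in settings.items():
        # Skip values that still contain unresolved ${...} placeholders.
        if "${" in value:
            continue
        # Treat anything containing a slash as path-like.
        if "/" in value and urlparse(value).scheme == "":
            flagged.append(key)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical, already-resolved settings for illustration only.
    sample = {
        "fs.uri": "hdfs://namenode:8020",
        "state.store.dir": "s3a://bucket/gobblin-test/state-store",
        "jobconf.dir": "/home/hadoop/gobblin-dist/jobs",
        "writer.staging.dir": "${env:GOBBLIN_WORK_DIR}/task-staging",
    }
    print(schemeless_paths(sample))  # ['jobconf.dir']
```

A flagged key is not necessarily wrong, since some settings are legitimately local paths; the list is just a set of candidates to compare against the paths the job submitter touches.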

Any help is greatly appreciated.

Github Url : https://github.com/linkedin/gobblin/issues/1162
Github Reporter : hilljb
Github Created At : 2016-07-27T16:26:15Z
Github Updated At : 2017-01-12T05:01:55Z

Comments

maiyatanglxn wrote on 2016-12-29T06:59:29Z : I have the same problem as you. Have you solved it?

Github Url : https://github.com/linkedin/gobblin/issues/1162#issuecomment-269589954
