Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 0.4.0
- None
Description
I tried to run mnist_distributed.py in a Docker container, and the launch failed.
The command I used is shown below. The Docker image tf-1.13.1-cpu-base:0.0.1 was built in advance in mini-submarine.
java -cp $(hadoop classpath --glob):/opt/submarine-current/submarine-all-0.3.0-SNAPSHOT-hadoop-3.2.jar org.apache.submarine.client.cli.Cli job run --name tf-job-001 \
 --framework tensorflow \
 --docker_image tf-1.13.1-cpu-base:0.0.1 \
 --input_path "" \
 --num_ps 1 \
 --ps_launch_cmd "export CLASSPATH=\$(/hadoop-current/bin/hadoop classpath --glob) && python mnist_distributed.py --steps 2 --data_dir /tmp/data --working_dir /tmp/mode" \
 --ps_resources memory=1G,vcores=1 \
 --num_workers 2 \
 --worker_resources memory=1G,vcores=1 \
 --worker_launch_cmd "export CLASSPATH=\$(/hadoop-current/bin/hadoop classpath --glob) && python mnist_distributed.py --steps 2 --data_dir /tmp/data --working_dir /tmp/mode" \
 --env JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
 --env DOCKER_HADOOP_HDFS_HOME=/hadoop-current \
 --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
 --env HADOOP_HOME=/hadoop-current \
 --env HADOOP_YARN_HOME=/hadoop-current \
 --env HADOOP_COMMON_HOME=hadoop-current \
 --env HADOOP_HDFS_HOME=/hadoop-current \
 --env HADOOP_CONF_DIR=/hadoop-current/etc/hadoop \
 --conf tony.containers.resources=/opt/submarine-current/submarine-all-0.3.0-SNAPSHOT-hadoop-3.2.jar,/home/yarn/submarine/mnist_distributed.py
The following is a partial NodeManager log.
2020-03-25 13:48:32,728 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1585136148243_0006_01_000001 transitioned from SCHEDULED to RUNNING
2020-03-25 13:48:32,728 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1585136148243_0006_01_000001
2020-03-25 13:48:32,740 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime: setting hostname in container to: ctr-1585136148243-0006-01-000001
2020-03-25 13:48:34,605 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime: Docker inspect output for container_1585136148243_0006_01_000001: ,ctr-1585136148243-0006-01-000001
2020-03-25 13:48:34,605 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: container_1585136148243_0006_01_000001's ip = , and hostname = ctr-1585136148243-0006-01-000001
2020-03-25 13:48:34,613 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Skipping monitoring container container_1585136148243_0006_01_000001 since CPU usage is not yet available.
2020-03-25 13:48:36,234 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 255. Privileged Execution Operation Stderr:
Docker container exit code was not zero: 255
Unable to read from docker logs(ferror, feof): 0 1
Stdout: main : command provided 4
main : run as user is yarn
main : requested yarn user is yarn
Creating script paths...
Creating local dirs...
Getting exit code file...
Changing effective user to root...
Launching docker container...
Inspecting docker container...
Writing to cgroup task files...
Writing pid file...
Writing to tmp file /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/nmPrivate/application_1585136148243_0006/container_1585136148243_0006_01_000001/container_1585136148243_0006_01_000001.pid.tmp container_1585136148243_0006_01_000001
Waiting for docker container to finish...
Removing docker container post-exit...
The following is the AM stdout log (amstdout.log).
========================================================================
LogType:amstdout.log
LogLastModifiedTime:Wed Mar 25 13:02:27 +0000 2020
LogLength:6468
LogContents:
[WARN ] 2020-03-25 13:02:25,503 method:org.apache.hadoop.util.NativeCodeLoader.<clinit>(NativeCodeLoader.java:60) Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[ERROR] 2020-03-25 13:02:25,613 method:com.linkedin.tony.ApplicationMaster.init(ApplicationMaster.java:217) Failed to create FileSystem object
org.apache.hadoop.security.KerberosAuthException: failure to login: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
    at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:71)
    at com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:133)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
    at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
    at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
    at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
    at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
    at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3487)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3477)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3319)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:227)
    at com.linkedin.tony.ApplicationMaster.init(ApplicationMaster.java:215)
    at com.linkedin.tony.ApplicationMaster.run(ApplicationMaster.java:305)
    at com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:293)
    at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1847)
    at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3487)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3477)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3319)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:227)
    at com.linkedin.tony.ApplicationMaster.init(ApplicationMaster.java:215)
    at com.linkedin.tony.ApplicationMaster.run(ApplicationMaster.java:305)
    at com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:293)
Caused by: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
    at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:71)
    at com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:133)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
    at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
    at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
    at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
    at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
    at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3487)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3477)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3319)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:227)
    at com.linkedin.tony.ApplicationMaster.init(ApplicationMaster.java:215)
    at com.linkedin.tony.ApplicationMaster.run(ApplicationMaster.java:305)
    at com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:293)
    at javax.security.auth.login.LoginContext.invoke(LoginContext.java:856)
    at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
    at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
    at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
    at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
    ... 11 more
[INFO ] 2020-03-25 13:02:25,618 method:com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:298) Application Master failed. Exiting
End of LogType:amstdout.log
*****************************************************************************
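A possible diagnostic, offered as an assumption rather than a confirmed root cause: the "invalid null input: name" failure in UnixLoginModule is commonly seen when the process inside the container runs under a UID that has no entry in the image's passwd database, so the OS user name resolves to null. The sketch below is not part of Submarine or TonY; resolve_user_name is a hypothetical helper that can be run inside the tf-1.13.1-cpu-base:0.0.1 image to check whether the current UID resolves to a name.

```python
import os
import pwd


def resolve_user_name(uid, lookup=pwd.getpwuid):
    """Return the user name for uid, or None if no passwd entry exists.

    `lookup` is injectable for testing; by default it queries the
    system passwd database, which raises KeyError for unknown UIDs.
    """
    try:
        return lookup(uid).pw_name
    except KeyError:
        return None


if __name__ == "__main__":
    uid = os.getuid()
    name = resolve_user_name(uid)
    if name is None:
        # Assumed failure mode: the JVM login module cannot derive a user name.
        print(f"uid {uid} has no passwd entry in this image")
    else:
        print(f"uid {uid} resolves to user '{name}'")
```

If the script reports a missing passwd entry when run as the UID that YARN uses for the container, adding that user to the image would be one thing to try.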