Apache Submarine / SUBMARINE-457

Running the TF MNIST example in a Docker container fails in mini-submarine

Details

    Description

      I tried to run mnist_distributed.py in a Docker container, and the launch failed.
      The following is my command. The Docker image tf-1.13.1-cpu-base:0.0.1 was built in advance in mini-submarine.

      java -cp $(hadoop classpath --glob):/opt/submarine-current/submarine-all-0.3.0-SNAPSHOT-hadoop-3.2.jar org.apache.submarine.client.cli.Cli job run --name tf-job-001 \
       --framework tensorflow \
       --docker_image tf-1.13.1-cpu-base:0.0.1 \
       --input_path "" \
       --num_ps 1 \
       --ps_launch_cmd "export CLASSPATH=\$(/hadoop-current/bin/hadoop classpath --glob) && python mnist_distributed.py --steps 2 --data_dir /tmp/data --working_dir /tmp/mode" \
       --ps_resources memory=1G,vcores=1 \
       --num_workers 2 \
       --worker_resources memory=1G,vcores=1 \
       --worker_launch_cmd "export CLASSPATH=\$(/hadoop-current/bin/hadoop classpath --glob) && python mnist_distributed.py --steps 2 --data_dir /tmp/data --working_dir /tmp/mode" \
       --env JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
       --env DOCKER_HADOOP_HDFS_HOME=/hadoop-current \
       --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
       --env HADOOP_HOME=/hadoop-current \
       --env HADOOP_YARN_HOME=/hadoop-current \
       --env HADOOP_COMMON_HOME=hadoop-current \
       --env HADOOP_HDFS_HOME=/hadoop-current \
       --env HADOOP_CONF_DIR=/hadoop-current/etc/hadoop \
       --conf tony.containers.resources=/opt/submarine-current/submarine-all-0.3.0-SNAPSHOT-hadoop-3.2.jar,/home/yarn/submarine/mnist_distributed.py
      
      

      The following is a partial NodeManager log.

      2020-03-25 13:48:32,728 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1585136148243_0006_01_000001 transitioned from SCHEDULED to RUNNING
      2020-03-25 13:48:32,728 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1585136148243_0006_01_000001
      2020-03-25 13:48:32,740 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime: setting hostname in container to: ctr-1585136148243-0006-01-000001
      2020-03-25 13:48:34,605 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime: Docker inspect output for container_1585136148243_0006_01_000001: ,ctr-1585136148243-0006-01-000001
      2020-03-25 13:48:34,605 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: container_1585136148243_0006_01_000001's ip = , and hostname = ctr-1585136148243-0006-01-000001
      2020-03-25 13:48:34,613 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Skipping monitoring container container_1585136148243_0006_01_000001 since CPU usage is not yet available.
      2020-03-25 13:48:36,234 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 255. Privileged Execution Operation Stderr:
      Docker container exit code was not zero: 255
      Unable to read from docker logs(ferror, feof): 0 1
      Stdout: main : command provided 4
      main : run as user is yarn
      main : requested yarn user is yarn
      Creating script paths...
      Creating local dirs...
      Getting exit code file...
      Changing effective user to root...
      Launching docker container...
      Inspecting docker container...
      Writing to cgroup task files...
      Writing pid file...
      Writing to tmp file /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/nmPrivate/application_1585136148243_0006/container_1585136148243_0006_01_000001/container_1585136148243_0006_01_000001.pid.tmp
      container_1585136148243_0006_01_000001
      Waiting for docker container to finish...
      Removing docker container post-exit...
      

      The following is the AM stdout log (amstdout.log).

      ========================================================================
      LogType:amstdout.log
      LogLastModifiedTime:Wed Mar 25 13:02:27 +0000 2020
      LogLength:6468
      LogContents:
      [WARN ] 2020-03-25 13:02:25,503 method:org.apache.hadoop.util.NativeCodeLoader.<clinit>(NativeCodeLoader.java:60)
      Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      [ERROR] 2020-03-25 13:02:25,613 method:com.linkedin.tony.ApplicationMaster.init(ApplicationMaster.java:217)
      Failed to create FileSystem object
      org.apache.hadoop.security.KerberosAuthException: failure to login: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
       at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:71)
       at com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:133)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
       at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
       at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
       at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
       at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
       at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
       at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
       at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
       at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
       at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
       at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3487)
       at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3477)
       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3319)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:227)
       at com.linkedin.tony.ApplicationMaster.init(ApplicationMaster.java:215)
       at com.linkedin.tony.ApplicationMaster.run(ApplicationMaster.java:305)
       at com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:293)
      
      
       at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1847)
       at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
       at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
       at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
       at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3487)
       at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3477)
       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3319)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:227)
       at com.linkedin.tony.ApplicationMaster.init(ApplicationMaster.java:215)
       at com.linkedin.tony.ApplicationMaster.run(ApplicationMaster.java:305)
       at com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:293)
      Caused by: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
       at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:71)
       at com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:133)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
       at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
       at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
       at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
       at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
       at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
       at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
       at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
       at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
       at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
       at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3487)
       at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3477)
       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3319)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:227)
       at com.linkedin.tony.ApplicationMaster.init(ApplicationMaster.java:215)
       at com.linkedin.tony.ApplicationMaster.run(ApplicationMaster.java:305)
       at com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:293)
      
      
       at javax.security.auth.login.LoginContext.invoke(LoginContext.java:856)
       at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
       at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
       at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
       at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
       at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
       at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
       ... 11 more
      [INFO ] 2020-03-25 13:02:25,618 method:com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:298)
      Application Master failed. Exiting
      
      
      End of LogType:amstdout.log
      *****************************************************************************
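
The `invalid null input: name` message in the AM log above is raised when `UnixPrincipal` is constructed with a null user name, which is what `UnixLoginModule` passes along when the operating system cannot map the process UID to a name. One possible cause, which is an assumption and is not confirmed anywhere in this report, is that the UID the container process runs as has no passwd entry inside the Docker image. A minimal sketch for checking that from a shell inside the container (the function name `check_uid_name` is hypothetical):

```shell
#!/bin/sh
# Hypothetical diagnostic, not part of the original report.
# Reports whether the UID of the current process resolves to a user name.
check_uid_name() {
  uid="$(id -u)"
  # `id -un` fails (non-zero exit) when the UID has no name in the passwd database.
  if name="$(id -un 2>/dev/null)" && [ -n "$name" ]; then
    echo "uid $uid resolves to user name: $name"
  else
    echo "uid $uid has no user name"
    return 1
  fi
}

check_uid_name || true
```

If the second message is printed when the script runs under the same UID as the YARN container, the missing user entry would be consistent with the login failure shown in the log; if the first message is printed, the cause likely lies elsewhere.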

            People

              lowc1012 Ryan Lo

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m