Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-9684

HistoryServerArchiveFetcher not working properly with secure hdfs cluster

    XMLWordPrintableJSON

Details

    Description

      With my current setup, jobmanager and taskmanager are able to talk to hdfs cluster (with kerberos setup). However, running history server gets:

       

       

      2018-06-27 19:03:32,080 WARN org.apache.hadoop.ipc.Client - Exception encountered while connecting to the server : java.lang.IllegalArgumentException: Failed to specify server's Kerberos principal name
      2018-06-27 19:03:32,085 ERROR org.apache.flink.runtime.webmonitor.history.HistoryServerArchiveFetcher - Failed to access job archive location for path hdfs://openqe11blue-n2.blue.ygrid.yahoo.com/tmp/flink/openstorm10-blue/jmarchive.
      java.io.IOException: Failed on local exception: java.io.IOException: java.lang.IllegalArgumentException: Failed to specify server's Kerberos principal name; Host Details : local host is: "openstorm10blue-n2.blue.ygrid.yahoo.com/10.215.79.35"; destination host is: "openqe11blue-n2.blue.ygri
      d.yahoo.com":8020;
      at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
      at org.apache.hadoop.ipc.Client.call(Client.java:1414)
      at org.apache.hadoop.ipc.Client.call(Client.java:1363)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
      at com.sun.proxy.$Proxy9.getListing(Unknown Source)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
      at com.sun.proxy.$Proxy9.getListing(Unknown Source)
      at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:515)
      at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1743)
      at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1726)
      at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:650)
      at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
      at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
      at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
      at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
      at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
      at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.listStatus(HadoopFileSystem.java:146)
      at org.apache.flink.runtime.webmonitor.history.HistoryServerArchiveFetcher$JobArchiveFetcherTask.run(HistoryServerArchiveFetcher.java:139)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:748)
      Caused by: java.io.IOException: java.lang.IllegalArgumentException: Failed to specify server's Kerberos principal name
      at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:677)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:422)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
      at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:640)
      at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:724)
      at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
      at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
      at org.apache.hadoop.ipc.Client.call(Client.java:1381)
      ... 28 more
      

       

       

      Changed LOG Level to DEBUG and seeing 

       

      2018-06-27 19:03:30,931 INFO org.apache.flink.runtime.webmonitor.history.HistoryServer - Enabling SSL for the history server.
      2018-06-27 19:03:30,931 DEBUG org.apache.flink.runtime.net.SSLUtils - Creating server SSL context from configuration
      2018-06-27 19:03:31,091 DEBUG org.apache.flink.core.fs.FileSystem - Loading extension file systems via services
      2018-06-27 19:03:31,094 DEBUG org.apache.flink.core.fs.FileSystem - Added file system maprfs:org.apache.flink.runtime.fs.maprfs.MapRFsFactory
      2018-06-27 19:03:31,102 DEBUG org.apache.flink.runtime.util.HadoopUtils - Cannot find hdfs-default configuration-file path in Flink config.
      2018-06-27 19:03:31,102 DEBUG org.apache.flink.runtime.util.HadoopUtils - Cannot find hdfs-site configuration-file path in Flink config.
      2018-06-27 19:03:31,102 DEBUG org.apache.flink.runtime.util.HadoopUtils - Could not find Hadoop configuration via any of the supported methods (Flink configuration, environment variables).
      2018-06-27 19:03:31,178 DEBUG org.apache.flink.runtime.fs.hdfs.HadoopFsFactory - Instantiating for file system scheme hdfs Hadoop File System org.apache.hadoop.hdfs.DistributedFileSystem
      2018-06-27 19:03:31,829 INFO org.apache.flink.runtime.webmonitor.history.HistoryServerArchiveFetcher - Monitoring directory hdfs://openqe11blue-n2.blue.ygrid.yahoo.com/tmp/flink/openstorm10-blue/jmarchive for archived jobs.
      

       

       

      The root cause is 

      https://github.com/apache/flink/blob/release-1.4.2/flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/history/HistoryServer.java#L169

       

      FileSystem refreshFS = refreshPath.getFileSystem();
      

       

      The getFileSystem() is being called before 

      FileSystem.initialize(xxx)

      ever happened. 

      So it will call

      https://github.com/apache/flink/blob/release-1.4.2/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java#L388-L390

       

      if (FS_FACTORIES.isEmpty()) {
      initialize(new Configuration());
      }
      

      and because the configuration is empty, it won't be able to connect to hdfs correctly.

       

      A workaround is to set HADOOP_CONF_DIR or HADOOP_HOME environment variables.

       But we should fix this since we have 

      fs.hdfs.hadoopconf
      

      config,  otherwise it will be confusing to users.

       

       

      Attachments

        Issue Links

          Activity

            People

              ethanli Ethan Li
              ethanli Ethan Li
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: