Description
We experienced transaction lag between the active NameNode (aNN) and the Observer NameNode (oNN), which causes problems in busier clusters. When an HDFS_DELEGATION_TOKEN is created by the aNN, the oNN does not catch up immediately, so the token is not yet in the oNN's cache when a client presents it and authentication fails.
We followed the documentation at [https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ObserverNameNode.html] to enable the oNN.
Here is our setup:
- nn1: aNN
- nn2: sNN
- nn3: sNN
- nn4: oNN
Due to heavier read traffic, we decided to add another oNN (nn5) and set dfs.client.failover.random.order=true for better read distribution. Otherwise, all traffic is routed to the first oNN in the list.
After the modification:
- nn1: aNN
- nn2: sNN
- nn3: sNN
- nn4: oNN
- nn5: oNN
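The effect of dfs.client.failover.random.order on read distribution can be illustrated with a small simulation. This is only a simplified model, not Hadoop's actual proxy-provider logic: it assumes each client probes the configured NameNodes in order and sends reads to the first observer it finds. The function names and the client count are hypothetical.

```python
import random
from collections import Counter

# Model of the five-NameNode layout listed above.
NAMENODES = [
    ("nn1", "active"),
    ("nn2", "standby"),
    ("nn3", "standby"),
    ("nn4", "observer"),
    ("nn5", "observer"),
]


def first_observer(namenodes, random_order, rng):
    """Return the first observer a client reaches when probing in order."""
    order = list(namenodes)
    if random_order:
        # Stands in for dfs.client.failover.random.order=true (simplified).
        rng.shuffle(order)
    for name, state in order:
        if state == "observer":
            return name
    return None


def read_distribution(random_order, clients=1000, seed=0):
    """Count which observer each simulated client ends up reading from."""
    rng = random.Random(seed)
    return Counter(
        first_observer(NAMENODES, random_order, rng) for _ in range(clients)
    )


fixed = read_distribution(random_order=False)
shuffled = read_distribution(random_order=True)
print("fixed order:   ", dict(fixed))
print("random order:  ", dict(shuffled))
```

In this model, the fixed order sends every client to nn4, while the shuffled order splits clients roughly evenly between nn4 and nn5.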
With this five-NameNode setup, the HDFS_DELEGATION_TOKEN issue worsened, and simple pi/MapReduce/Hive jobs started to fail.
Error from the oNN logs:
2024-01-15 11:03:26,152 WARN SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 10.xx.xx.xx:54014:null (DIGEST-MD5: IO error acquiring password) with true cause: (token (token for end-user1: HDFS_DELEGATION_TOKEN owner=end-user1, renewer=end-user1, realUser=, issueDate=1705338205996, maxDate=1705943005996, sequenceNumber=277018178, masterKeyId=2195) can't be found in cache)
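To see how soon after issuance the lookup failed, the timestamps in the log line can be decoded. The sketch below parses issueDate and maxDate (milliseconds since the epoch) and compares issueDate with the log timestamp. The log timezone is not stated in this report; UTC-6 is an assumption made here only for illustration, and the helper name is hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Excerpt of the oNN log line above (token fields kept verbatim).
LOG_LINE = (
    "2024-01-15 11:03:26,152 WARN SecurityLogger.org.apache.hadoop.ipc.Server: "
    "Auth failed for 10.xx.xx.xx:54014:null (DIGEST-MD5: IO error acquiring password) "
    "with true cause: (token (token for end-user1: HDFS_DELEGATION_TOKEN "
    "owner=end-user1, renewer=end-user1, realUser=, issueDate=1705338205996, "
    "maxDate=1705943005996, sequenceNumber=277018178, masterKeyId=2195) "
    "can't be found in cache)"
)

# ASSUMPTION: the oNN host logs in UTC-6; the report does not say.
ASSUMED_LOG_TZ = timezone(timedelta(hours=-6))
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_millis(ms):
    """Convert epoch milliseconds to an aware UTC datetime (exact integer math)."""
    return EPOCH + timedelta(milliseconds=ms)


fields = {k: int(v) for k, v in re.findall(r"(issueDate|maxDate)=(\d+)", LOG_LINE)}
issued = from_epoch_millis(fields["issueDate"])
expires = from_epoch_millis(fields["maxDate"])
lifetime = expires - issued

# The first 23 characters hold the log4j-style timestamp "YYYY-MM-DD HH:MM:SS,mmm".
logged = datetime.strptime(LOG_LINE[:23], "%Y-%m-%d %H:%M:%S,%f").replace(
    tzinfo=ASSUMED_LOG_TZ
)
gap = logged - issued

print("issued (UTC):", issued.isoformat())
print("max lifetime:", lifetime)
print("failure after issuance (assuming UTC-6 logs):", gap)
```

The token's maximum lifetime works out to 7 days. Under the UTC-6 assumption, the authentication failure was logged about 156 ms after the token was issued, which would be consistent with the oNN not yet having applied the aNN's edit.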