Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 3.0.0-alpha4, 3.1.1, 3.3.0
Fix Version/s: 3.5.0
Component/s: resourcemanager
Labels:
- pull-request-available

Description

The logs below show that removal of the app info started at 06:17:20, but the NoNodeException was not received until 06:17:35.
During the 15s interval, Curator was retrying the metadata operation. Because a ZooKeeper delete is not idempotent, one of the retry attempts succeeded on the server but its response was never received. The next retry then failed with a NoNodeException, which triggered the STATE_STORE_FENCED event and ultimately caused the current ResourceManager to switch to standby.

2023-10-28 06:17:20,359 INFO  recovery.RMStateStore (RMStateStore.java:transition(333)) - Removing info for app: application_1697410508608_140368
2023-10-28 06:17:20,359 INFO  resourcemanager.RMAppManager (RMAppManager.java:checkAppNumCompletedLimit(303)) - Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 1000, removing app application_1697410508608_140368 from memory:
2023-10-28 06:17:35,665 ERROR recovery.RMStateStore (RMStateStore.java:transition(337)) - Error removing app: application_1697410508608_140368
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
2023-10-28 06:17:35,666 INFO  recovery.RMStateStore (RMStateStore.java:handleStoreEvent(1147)) - RMStateStore state change from ACTIVE to FENCED
2023-10-28 06:17:35,666 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(898)) - Received RMFatalEvent of type STATE_STORE_FENCED, caused by org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
2023-10-28 06:17:35,666 INFO  resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1309)) - Transitioning to standby state

Solution

A NoNodeException on delete means the znode no longer exists, which is exactly the outcome the delete was meant to achieve, so the exception can be safely ignored. This avoids an unnecessary ResourceManager failover and its wider impact on the cluster.
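The idea can be illustrated with a minimal, self-contained sketch. It is not the actual patch and does not use the real ZooKeeper or Curator classes: `FlakyStore`, `NoNodeException`, `ConnectionLossException`, and `deleteIgnoringNoNode` are hypothetical stand-ins, and the exceptions are unchecked only to keep the example short. The store applies the first delete but drops its response, so the retry sees a missing node, which the helper treats as success.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-ins for the ZooKeeper exceptions (not the real classes).
class NoNodeException extends RuntimeException {
    NoNodeException(String path) { super("NoNode for " + path); }
}

class ConnectionLossException extends RuntimeException {
    ConnectionLossException(String path) { super("ConnectionLoss for " + path); }
}

// Hypothetical in-memory store that can lose the response of one delete.
class FlakyStore {
    private final Set<String> nodes = new HashSet<>();
    private boolean dropNextResponse;

    FlakyStore(boolean dropNextResponse) { this.dropNextResponse = dropNextResponse; }

    void create(String path) { nodes.add(path); }

    boolean exists(String path) { return nodes.contains(path); }

    void delete(String path) {
        if (!nodes.remove(path)) {
            throw new NoNodeException(path);        // node was already gone
        }
        if (dropNextResponse) {
            dropNextResponse = false;
            // The delete was applied, but the caller never learns about it.
            throw new ConnectionLossException(path);
        }
    }
}

class IdempotentDeleteSketch {
    // Retries the delete and treats NoNode as success. Returns the attempts used.
    static int deleteIgnoringNoNode(FlakyStore store, String path, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        ConnectionLossException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                store.delete(path);
                return attempt;
            } catch (NoNodeException e) {
                // The node is gone, which is the desired end state: do not fence.
                return attempt;
            } catch (ConnectionLossException e) {
                last = e;                           // outcome unknown, retry
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        FlakyStore store = new FlakyStore(true);
        store.create("/rmstore/app_1");
        int attempts = deleteIgnoringNoNode(store, "/rmstore/app_1", 3);
        System.out.println("deleted after " + attempts + " attempt(s)");
    }
}
```

Without the `NoNodeException` branch, the second attempt would surface as an error even though the node had been removed by the first one, which mirrors the sequence in the log above.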

Other

The same issue exists in safeCreate and also needs to be discussed and optimized.

Issue Links

links to

GitHub Pull Request #6577

GitHub Pull Request #6616

People

Assignee: Unassigned

Reporter: wangzhihui

Votes: 0

Watchers: 4

Dates

Created: 05/Dec/23 11:33

Updated: 01/Apr/24 13:02

Resolved: 21/Mar/24 07:15