Details
Bug
Status: Resolved
Major
Resolution: Fixed
3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0
None
Description
I can observe a deadlock on the driver that can be triggered rather reliably in a job with a large number of tasks when the following settings are used:
spark.decommission.enabled: true
spark.storage.decommission.rddBlocks.enabled: true
spark.storage.decommission.shuffleBlocks.enabled: true
spark.storage.decommission.enabled: true
It originates in the dispatcher-BlockManagerMaster thread, which makes a call to updateBlockInfo when shuffles are migrated. This call is not performed by a thread from the pool but by the dispatcher-BlockManagerMaster thread itself. I suppose this was done under the assumption that it would be very fast. However, if the block being updated is a shuffle index block, it calls
mapOutputTracker.updateMapOutput(shuffleId, mapId, blockManagerId)
and then waits to acquire a write lock inside the MapOutputTracker.
If the timing is bad, one of the map-output-dispatcher threads is holding this lock, e.g. as part of serializedMapStatus. That function calls MapOutputTracker.serializeOutputStatuses, and as part of that we do
if (arrSize >= minBroadcastSize) {
  // Use broadcast instead.
  // Important arr(0) is the tag == DIRECT, ignore that while deserializing !
  // arr is a nested Array so that it can handle over 2GB serialized data
  val arr = chunkedByteBuf.getChunks().map(_.array())
  val bcast = broadcastManager.newBroadcast(arr, isLocal)
which makes an RPC call to dispatcher-BlockManagerMaster. That thread, however, is unable to answer because it is blocked waiting on the aforementioned lock. Hence the deadlock.
The ingredients of this deadlock are therefore: an array large enough to take the broadcast path, and an updateBlockInfo call arriving at the wrong moment, as happens regularly during decommissioning. Versions earlier than 3.1.0 are potentially affected as well, but I could not confirm that.
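To make the cycle easier to reason about, here is a minimal standalone sketch of the same lock-ordering pattern using only java.util.concurrent. It is not Spark code: the object name, the executors standing in for dispatcher-BlockManagerMaster and a map-output-dispatcher, the "ack" reply, the choice of the write lock for the holder, and the `offload` flag are all assumptions made for illustration. The RPC wait is bounded by a timeout so the program terminates instead of hanging. The `offload = true` branch only shows that the cycle disappears when the dispatcher does not block on the lock itself; it is not a description of how the issue was actually fixed.

```scala
import java.util.concurrent.{Callable, CountDownLatch, Executors, TimeUnit, TimeoutException}
import java.util.concurrent.locks.ReentrantReadWriteLock

object DecommissionDeadlockSketch {

  /** Returns true if the simulated RPC to the dispatcher was answered within timeoutMs. */
  def rpcAnswered(offload: Boolean, timeoutMs: Long): Boolean = {
    // Hypothetical stand-ins: one thread each, like a single dispatcher thread.
    val dispatcher = Executors.newSingleThreadExecutor()          // "dispatcher-BlockManagerMaster"
    val mapOutputDispatcher = Executors.newSingleThreadExecutor() // "map-output-dispatcher"
    val workerPool = Executors.newSingleThreadExecutor()          // only used when offload = true
    val lock = new ReentrantReadWriteLock()                       // stand-in for the tracker's lock
    val lockHeld = new CountDownLatch(1)
    val updateStarted = new CountDownLatch(1)

    // Stand-in for updateMapOutput: needs the write lock.
    val updateMapOutput = new Runnable {
      override def run(): Unit = {
        lock.writeLock().lock()
        try { /* the update itself would happen here */ } finally lock.writeLock().unlock()
      }
    }

    // Stand-in for updateBlockInfo, executed on the dispatcher thread.
    val updateBlockInfo = new Runnable {
      override def run(): Unit = {
        updateStarted.countDown()
        if (offload) workerPool.execute(updateMapOutput) // dispatcher stays free
        else updateMapOutput.run()                       // dispatcher blocks on the lock
      }
    }

    // Stand-in for the serialization path: holds the lock, then needs the dispatcher.
    val serialize = new Callable[Boolean] {
      override def call(): Boolean = {
        lock.writeLock().lock()
        try {
          lockHeld.countDown()
          updateStarted.await() // force the "bad timing"
          val reply = dispatcher.submit(new Callable[String] {
            override def call(): String = "ack"
          })
          try { reply.get(timeoutMs, TimeUnit.MILLISECONDS); true }
          catch { case _: TimeoutException => false }
        } finally lock.writeLock().unlock()
      }
    }

    try {
      val result = mapOutputDispatcher.submit(serialize)
      lockHeld.await()
      dispatcher.execute(updateBlockInfo)
      result.get()
    } finally {
      dispatcher.shutdownNow()
      mapOutputDispatcher.shutdownNow()
      workerPool.shutdownNow()
    }
  }

  def main(args: Array[String]): Unit = {
    println(s"blocking dispatcher answered:  ${rpcAnswered(offload = false, timeoutMs = 300)}")
    println(s"offloading dispatcher answered: ${rpcAnswered(offload = true, timeoutMs = 2000)}")
  }
}
```

In the blocking variant the "ack" task is queued behind the update that is waiting for the lock, and the lock holder is waiting for that "ack", so without the timeout neither side would ever make progress.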
I have a stacktrace of all driver threads showing the deadlock: spark_stacktrace_deadlock.txt
A coworker of mine also wrote a patch that reproduces the issue as a test case: 0001-Add-test-showing-that-decommission-might-deadlock.patch