Details
- Bug
- Status: Resolved
- Major
- Resolution: Duplicate
- 2.6.0
- None
- None
Description
Problem
The reducer gets stuck in the copy phase and makes no progress for a very long time. After the task is killed manually a couple of times, it completes.
Observations
- Verified the GC logs and found no memory-related issues. The logs are attached.
- Verified the thread dumps and found no thread-related problems.
- According to the logs, the fetcher threads are not copying the map outputs; they are just waiting for a merge to happen.
- The merge thread is alive and in the wait state.
Analysis
After careful study of the logs, thread dumps, and code, this looks to me like a classic multi-threading issue: the merge thread goes into the wait state after the notification has already been sent.
Here is the suspect code flow.
Thread #1
Fetcher thread - notification comes first
org.apache.hadoop.mapreduce.task.reduce.MergeThread.startMerge(Set<T>)
synchronized(pendingToBeMerged) {
pendingToBeMerged.addLast(toMergeInputs);
pendingToBeMerged.notifyAll();
}
Thread #2
Merge Thread - goes to wait state (Notification goes unconsumed)
org.apache.hadoop.mapreduce.task.reduce.MergeThread.run()
synchronized (pendingToBeMerged) {
while(pendingToBeMerged.size() <= 0) {
pendingToBeMerged.wait();
}
// Pickup the inputs to merge.
inputs = pendingToBeMerged.removeFirst();
}
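To exercise this interleaving in isolation, here is a minimal sketch. It is not the Hadoop code itself. The class name MergeHandoffSketch, the method takeNext, and its timeout parameter are assumptions made for illustration; only pendingToBeMerged and startMerge mirror names from the flow above. It sends the notification first and runs the consumer second, with a bounded wait so that a missed hand-off would show up as a null return rather than a stuck thread.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

public class MergeHandoffSketch {
    // Shared queue guarded by its own monitor, as in the flow above.
    private final LinkedList<List<String>> pendingToBeMerged = new LinkedList<>();

    // Producer side, modeled on startMerge: enqueue, then notify.
    public void startMerge(List<String> toMergeInputs) {
        synchronized (pendingToBeMerged) {
            pendingToBeMerged.addLast(toMergeInputs);
            pendingToBeMerged.notifyAll();
        }
    }

    // Consumer side, modeled on the run() loop. Hypothetical addition:
    // gives up and returns null once timeoutMillis has elapsed.
    public List<String> takeNext(long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        synchronized (pendingToBeMerged) {
            while (pendingToBeMerged.size() <= 0) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return null; // nothing arrived in time
                }
                pendingToBeMerged.wait(remaining);
            }
            // Pick up the inputs to merge.
            return pendingToBeMerged.removeFirst();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        MergeHandoffSketch sketch = new MergeHandoffSketch();
        // Notification comes first, before any consumer is waiting.
        sketch.startMerge(Arrays.asList("map_0", "map_1"));
        // The consumer starts afterwards.
        List<String> inputs = sketch.takeNext(1000);
        System.out.println("picked up: " + inputs);
    }
}
```

In this sketch the consumer checks the queue size under the same lock before it waits, so it finds the queued inputs even though notifyAll() ran earlier.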
Attachments
Issue Links
- duplicates MAPREDUCE-6334 Fetcher#copyMapOutput is leaking usedMemory upon IOException during InMemoryMapOutput shuffle handler (Closed)
Thanks a lot, Jason, for the details. We are hitting exactly the same scenario (bad disk) as described in MAPREDUCE-6334. We will try the patch and update this JIRA with the results.