Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
2.7.2
-
None
-
None
Description
When reduce task encounters compression related errors, AM doesn't retry the corresponding map task.
In one of the case we encountered, here is the stack trace.
2016-01-27 13:44:28,915 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#29 at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) Caused by: java.lang.ArrayIndexOutOfBoundsException at com.hadoop.compression.lzo.LzoDecompressor.setInput(LzoDecompressor.java:196) at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:104) at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85) at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192) at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.shuffle(InMemoryMapOutput.java:97) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:537) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:336) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
In this case, the node on which the map task ran had a bad drive.
If the AM had retried running that map task somewhere else, the job definitely would have succeeded.
Attachments
Attachments
Issue Links
- is related to
-
TEZ-3833 Tasks should report codec errors during shuffle as fetch failures
- Closed