Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: 3.3.0, 3.2.1, 3.1.3, 3.3.1
Fix Version/s: None
Description
The v2 MR commit algorithm moves files from the task attempt dir into the dest dir on task commit, one by one.
It is therefore not atomic:
- if a task commit fails partway through and another task attempt commits, then unless exactly the same filenames are used, output of the first attempt may be included in the final result
- if a worker is partitioned partway through task commit and then continues after another attempt has committed, it may partially overwrite the output, even when the filenames are the same
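The two failure modes above can be reproduced with a small local-filesystem simulation. This is only a sketch under stated assumptions, not the Hadoop implementation: `v2_task_commit_steps`, `write_attempt`, the attempt IDs and the `part-*` filenames are all hypothetical, and `os.replace` stands in for the per-file rename.

```python
import os
import tempfile


def v2_task_commit_steps(attempt_dir, dest_dir):
    """Hypothetical v2-style task commit: move files one at a time.

    Written as a generator so a caller can stop (or pause) between files,
    which models a failure or a network partition partway through commit.
    """
    for name in sorted(os.listdir(attempt_dir)):
        os.replace(os.path.join(attempt_dir, name), os.path.join(dest_dir, name))
        yield name


def write_attempt(root, attempt_id, filenames):
    """Create a task attempt dir whose files contain the attempt ID."""
    attempt_dir = os.path.join(root, attempt_id)
    os.makedirs(attempt_dir)
    for name in filenames:
        with open(os.path.join(attempt_dir, name), "w") as f:
            f.write(attempt_id)
    return attempt_dir


def read_dest(dest_dir):
    """Map each committed filename to the attempt ID that produced it."""
    result = {}
    for name in sorted(os.listdir(dest_dir)):
        with open(os.path.join(dest_dir, name)) as f:
            result[name] = f.read()
    return result


def failed_commit_with_different_filenames():
    """Attempt 1 fails after one file; attempt 2 then commits fully."""
    with tempfile.TemporaryDirectory() as root:
        dest = os.path.join(root, "dest")
        os.makedirs(dest)
        a1 = write_attempt(root, "attempt_1", ["part-0000-aaa", "part-0001-aaa"])
        a2 = write_attempt(root, "attempt_2", ["part-0000-bbb", "part-0001-bbb"])
        first = v2_task_commit_steps(a1, dest)
        next(first)  # one file moved, then the attempt dies
        for _ in v2_task_commit_steps(a2, dest):
            pass
        return read_dest(dest)


def partitioned_commit_with_same_filenames():
    """Attempt 1 pauses after one file, attempt 2 commits, attempt 1 resumes."""
    with tempfile.TemporaryDirectory() as root:
        dest = os.path.join(root, "dest")
        os.makedirs(dest)
        names = ["part-0000", "part-0001"]
        a1 = write_attempt(root, "attempt_1", names)
        a2 = write_attempt(root, "attempt_2", names)
        first = v2_task_commit_steps(a1, dest)
        next(first)  # worker is partitioned after moving part-0000
        for _ in v2_task_commit_steps(a2, dest):
            pass  # second attempt commits both files
        for _ in first:
            pass  # first attempt resumes and overwrites part-0001
        return read_dest(dest)
```

In the first scenario the destination ends up holding a stray file from the failed attempt alongside the second attempt's full output; in the second, the destination holds a mix of files from both attempts even though the names match.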
Both MR and Spark assume that task commits are atomic. One option is for them to stop assuming this: we add a way to probe for a committer supporting atomic task commit, and both engines add handling for task commit failures (probably by failing the job).
Better: we remove v2 as the default, and maybe also warn when it is being used.
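A rough sketch of what such a probe, warning and failure handling could look like on the engine side. Only the property name `mapreduce.fileoutputcommitter.algorithm.version` is real; the functions `task_commit_is_atomic`, `check_committer` and `on_task_commit_failure` are hypothetical, the configuration is modelled as a plain dict, and the sketch assumes that v1 task commit is atomic on a filesystem with atomic directory rename. The `default_version=2` argument mirrors this issue's statement that v2 is currently the default.

```python
import warnings

# Real Hadoop property name; everything else in this block is illustrative.
ALGORITHM_KEY = "mapreduce.fileoutputcommitter.algorithm.version"


def task_commit_is_atomic(conf, default_version=2):
    """Hypothetical probe: treat only the v1 algorithm as atomic.

    Assumes v1 task commit is a single directory rename on a filesystem
    where that rename is atomic.
    """
    version = int(conf.get(ALGORITHM_KEY, default_version))
    return version == 1


def check_committer(conf):
    """Warn when the non-atomic v2 algorithm is selected."""
    if task_commit_is_atomic(conf):
        return True
    warnings.warn(
        "v2 commit algorithm in use: task commit is not atomic, "
        "a failed task commit may corrupt job output"
    )
    return False


def on_task_commit_failure(conf):
    """Hypothetical engine policy after a task commit fails or times out."""
    # With atomic task commit another attempt can safely be scheduled;
    # without it the destination may already hold partial output.
    return "retry_task" if task_commit_is_atomic(conf) else "fail_job"
```

For example, `on_task_commit_failure({ALGORITHM_KEY: "1"})` returns `"retry_task"`, while an empty configuration falls back to the assumed v2 default and returns `"fail_job"`.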
Issue Links
- is related to
  - SPARK-33019 Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default (Resolved)
  - MAPREDUCE-7300 PathOutputCommitter to add method failedTaskAttemptCommitRecoverable() (Resolved)
- is superseded by
  - MAPREDUCE-7341 Add a task-manifest output committer for Azure and GCS (Resolved)