Details
- Type: Improvement
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 3.2.3, 3.3.2, 3.2.4, 3.3.5, 3.3.3, 3.3.4, 3.3.6
- Labels: None
Description
When committing a big Hadoop job (for example via Spark) that has many partitions, the class FileOutputCommitter processes thousands of directories/files to rename with a single thread. This is a performance issue, caused by many waits on FileSystem storage operations.
I propose that, above a configurable threshold (default = 3, configurable via the property 'mapreduce.fileoutputcommitter.parallel.threshold'), the class FileOutputCommitter process the list of files to rename with parallel threads, using the default JVM ExecutorService (ForkJoinPool.commonPool()). A sketch of the idea is shown below.
See pull request: https://github.com/apache/hadoop/pull/6378
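A minimal sketch of the idea, not the actual patch: the class and method names below are hypothetical, and a parallel stream is used here simply because its tasks execute on ForkJoinPool.commonPool(); only the property name and default value come from this proposal.
{code:java}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical sketch of threshold-gated parallel renaming during commit. */
public class ParallelRenameSketch {

  // Property name and default taken from the proposal above.
  static final String PARALLEL_THRESHOLD_KEY =
      "mapreduce.fileoutputcommitter.parallel.threshold";
  static final int PARALLEL_THRESHOLD_DEFAULT = 3;

  static void renameAll(FileSystem fs, Configuration conf,
      List<Path> sources, Path destDir) throws IOException {
    int threshold = conf.getInt(PARALLEL_THRESHOLD_KEY, PARALLEL_THRESHOLD_DEFAULT);

    if (sources.size() <= threshold) {
      // Small batches keep the existing single-threaded behaviour.
      for (Path src : sources) {
        renameOne(fs, src, new Path(destDir, src.getName()));
      }
      return;
    }

    // Large batches fan out: a parallel stream runs its tasks on the
    // default JVM pool, ForkJoinPool.commonPool().
    try {
      sources.parallelStream().forEach(src -> {
        try {
          renameOne(fs, src, new Path(destDir, src.getName()));
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    } catch (UncheckedIOException e) {
      throw e.getCause();
    }
  }

  private static void renameOne(FileSystem fs, Path src, Path dst)
      throws IOException {
    if (!fs.rename(src, dst)) {
      throw new IOException("Failed to rename " + src + " to " + dst);
    }
  }
}
{code}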
Note that sub-class instances of FileOutputCommitter are supposed to be created at runtime depending on a configurable property (see PathOutputCommitterFactory.java: https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/PathOutputCommitterFactory.java).
However, in Parquet + Spark, for example, this is buggy and the committer cannot be changed at runtime; a configuration sketch is shown after this paragraph.
There is an ongoing JIRA and PR to fix it in Parquet + Spark: https://issues.apache.org/jira/browse/PARQUET-2416
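For illustration, a minimal sketch of how a job would normally select its committer through configuration, assuming the getCommitterFactory(Path, Configuration) entry point of the linked PathOutputCommitterFactory class and the 'mapreduce.outputcommitter.factory.class' property referenced by PARQUET-2416; the output path and factory class value are placeholders.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.output.PathOutputCommitterFactory;

/** Hypothetical sketch: the committer factory is resolved from configuration. */
public class CommitterFactorySketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Property named in PARQUET-2416; the value below is a placeholder for
    // whichever PathOutputCommitterFactory subclass the job should use.
    conf.set("mapreduce.outputcommitter.factory.class",
        "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitterFactory");
    // Resolve the factory for an (illustrative) output path.
    PathOutputCommitterFactory factory =
        PathOutputCommitterFactory.getCommitterFactory(
            new Path("/tmp/job-output"), conf);
    System.out.println("Selected committer factory: "
        + factory.getClass().getName());
  }
}
{code}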
Attachments
Issue Links
- is duplicated by
  - MAPREDUCE-7470 multi-thread mapreduce v1 FileOutputcommitter (Resolved)
- relates to
  - PARQUET-2416 honor conf "mapreduce.outputcommitter.factory.class" with PathOutputCommitterFactory in ParquetOutputFormat.getOutputCommitter (Open)
  - MAPREDUCE-7341 Add a task-manifest output committer for Azure and GCS (Resolved)
- links to