Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 0.21.0
- Labels: None
- Hadoop Flags: Reviewed
Description
The minimum unit of work for a distcp task is a file. We have files that are greater than 1 TB with a block size of 1 GB. If we use distcp to copy these files, the tasks either take a very long time or eventually fail. A better approach for distcp would be to copy all of the source blocks in parallel and then stitch the blocks back into files at the destination via the HDFS concat API (HDFS-222).
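A minimal sketch of the stitching step, assuming the block-aligned chunks have already been landed in parallel as temporary files on the destination cluster. The `ChunkStitcher` class, the `stitch` method, and the `chunks`/`target` parameters are illustrative names, not part of this issue; the only API taken from the source is `DistributedFileSystem.concat`, the HDFS concat operation introduced by HDFS-222.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ChunkStitcher {
  /**
   * Reassemble previously copied chunk files into the final target file.
   * Illustrative sketch: assumes the target already holds the first chunk
   * and the remaining chunks are passed in copy order.
   */
  public static void stitch(URI fsUri, Path target, Path[] chunks, Configuration conf)
      throws IOException {
    FileSystem fs = FileSystem.get(fsUri, conf);
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IOException("concat is only supported on HDFS");
    }
    // concat(target, sources) is a NameNode metadata operation: it appends the
    // blocks of each source file to the target and deletes the sources, so no
    // data is rewritten. HDFS requires the sources and target to have matching
    // block size and replication.
    ((DistributedFileSystem) fs).concat(target, chunks);
  }
}
```

Because concat only moves block references on the NameNode, the stitching step stays cheap even for multi-terabyte files; the expensive data movement is done by the parallel per-block copy tasks.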
Attachments
Issue Links
- blocks
  - HDFS-1776 Bug in Concat code (Closed)
- causes
  - HADOOP-15850 CopyCommitter#concatFileChunks should check that the blocks per chunk is not 0 (Resolved)
  - HADOOP-17611 Distcp parallel file copy breaks the modification time (Patch Available)
- is related to
  - HDFS-222 Support for concatenating of files into a single file (Closed)
  - HADOOP-14764 Über-jira adl:// Azure Data Lake Phase II: Performance, Resilience and Testing (Resolved)
  - HADOOP-14866 Backport implementation of parallel block copy in Distcp to hadoop 2.8 (Resolved)
- relates to
  - HADOOP-16018 DistCp won't reassemble chunks when blocks per chunk > 0 (Resolved)
  - HADOOP-16049 DistCp result has data and checksum mismatch when blocks per chunk > 0 (Resolved)
  - HADOOP-16158 DistCp to support checksum validation when copy blocks in parallel (Resolved)