Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 0.21.0
- Labels: None
- Hadoop Flags: Reviewed
Description
The minimum unit of work for a distcp task is a file. We have files that are greater than 1 TB with a block size of 1 GB. If we use distcp to copy these files, the tasks either take a very long time or eventually fail. A better approach for distcp would be to copy all of the source blocks in parallel and then stitch the blocks back into files at the destination via the HDFS concat API (HDFS-222).
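A minimal sketch of the stitching step, assuming the block-aligned chunks have already been landed in parallel as temporary files on the destination cluster. The `ChunkStitcher` class, the `stitch` method, and the `chunks`/`target` parameters are illustrative names, not part of this issue; the only API taken from the source is `DistributedFileSystem.concat`, the HDFS concat operation introduced by HDFS-222.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ChunkStitcher {
  /**
   * Reassemble previously copied chunk files into the final target file.
   * Illustrative sketch: assumes the target already holds the first chunk
   * and the remaining chunks are passed in copy order.
   */
  public static void stitch(URI fsUri, Path target, Path[] chunks, Configuration conf)
      throws IOException {
    FileSystem fs = FileSystem.get(fsUri, conf);
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IOException("concat is only supported on HDFS");
    }
    // concat(target, sources) is a NameNode metadata operation: it appends the
    // blocks of each source file to the target and deletes the sources, so no
    // data is rewritten. HDFS requires the sources and target to have matching
    // block size and replication.
    ((DistributedFileSystem) fs).concat(target, chunks);
  }
}
```

Because concat only moves block references on the NameNode, the stitching step stays cheap even for multi-terabyte files; the expensive data movement is done by the parallel per-block copy tasks.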
Attachments
Issue Links
- blocks
  - HDFS-1776 Bug in Concat code (Closed)
- causes
  - HADOOP-15850 CopyCommitter#concatFileChunks should check that the blocks per chunk is not 0 (Resolved)
  - HADOOP-17611 Distcp parallel file copy breaks the modification time (Patch Available)
- is related to
  - HDFS-222 Support for concatenating of files into a single file (Closed)
  - HADOOP-14764 Über-jira adl:// Azure Data Lake Phase II: Performance, Resilience and Testing (Resolved)
  - HADOOP-14866 Backport implementation of parallel block copy in Distcp to hadoop 2.8 (Resolved)
- relates to
  - HADOOP-16018 DistCp won't reassemble chunks when blocks per chunk > 0 (Resolved)
  - HADOOP-16049 DistCp result has data and checksum mismatch when blocks per chunk > 0 (Resolved)
  - HADOOP-16158 DistCp to support checksum validation when copy blocks in parallel (Resolved)