Description
To write files in parallel to an external storage system as in HDFS-12090, there are two approaches:
- Naive approach: use a single datanode per file that copies blocks locally as it streams data to the external service. This requires a copy of each block inside the HDFS system and then another copy to send the block to the external system.
- Better approach: a single coordination point (e.g. the Namenode or an SPS-style external client) and the Datanodes cooperate in a multipart, multinode upload.
This system needs to work with multiple backends and must coordinate across the network, so we propose an API that resembles the following:
public UploadHandle multipartInit(Path filePath) throws IOException;
public PartHandle multipartPutPart(InputStream inputStream, int partNumber, UploadHandle uploadId) throws IOException;
public void multipartComplete(Path filePath, List<Pair<Integer, PartHandle>> handles, UploadHandle multipartUploadId) throws IOException;
Here, UploadHandle and PartHandle are opaque handles in the vein of PathHandle, so they can be serialized and deserialized in the hadoop-hdfs project without knowledge of how to deserialize, e.g., S3A's version of an UploadHandle and PartHandle.
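To make the proposal concrete, here is a hedged usage sketch. The MultipartUploader interface name, the marker declarations for UploadHandle/PartHandle, the use of org.apache.commons.lang3.tuple.Pair, and the 1-based part numbering are assumptions made so the example is self-contained; only the three method signatures come from the proposal above.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang3.tuple.Pair;
import org.apache.hadoop.fs.Path;

public class MultipartUploadExample {

  /** Opaque, serializable handles in the vein of PathHandle (assumed shape). */
  interface UploadHandle extends Serializable {}
  interface PartHandle extends Serializable {}

  /** The proposed API, gathered into an interface purely for this example. */
  interface MultipartUploader {
    UploadHandle multipartInit(Path filePath) throws IOException;
    PartHandle multipartPutPart(InputStream inputStream, int partNumber,
        UploadHandle uploadId) throws IOException;
    void multipartComplete(Path filePath, List<Pair<Integer, PartHandle>> handles,
        UploadHandle multipartUploadId) throws IOException;
  }

  /** Upload each byte[] as one part, then complete with the collected handles. */
  static void upload(MultipartUploader uploader, Path dest, byte[][] parts)
      throws IOException {
    UploadHandle uploadId = uploader.multipartInit(dest);

    // Parts could be pushed by different datanodes; only the part number and
    // the opaque PartHandle need to travel back to the coordinator.
    List<Pair<Integer, PartHandle>> handles = new ArrayList<>();
    for (int i = 0; i < parts.length; i++) {
      int partNumber = i + 1;  // 1-based numbering assumed, as in S3 multipart
      PartHandle handle = uploader.multipartPutPart(
          new ByteArrayInputStream(parts[i]), partNumber, uploadId);
      handles.add(Pair.of(partNumber, handle));
    }

    // The single coordination point finishes the upload.
    uploader.multipartComplete(dest, handles, uploadId);
  }
}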
In an object store such as S3A, the implementation is straightforward. In the case of a multipart/multinode write to HDFS, we can write each block as a file part; the complete call then performs a concat on the blocks.
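As a rough illustration of the HDFS-backed idea, the sketch below stages each part as its own file and has the complete step stitch them together with FileSystem.concat(). The class name, staging layout, and handle representation (a Path per part) are assumptions for illustration, not a description of the actual implementation.

import java.io.IOException;
import java.io.InputStream;
import java.util.List;

import org.apache.commons.lang3.tuple.Pair;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsMultipartSketch {
  private final FileSystem fs;
  private final Path stagingDir;  // hypothetical directory holding in-flight parts

  public HdfsMultipartSketch(FileSystem fs, Path stagingDir) {
    this.fs = fs;
    this.stagingDir = stagingDir;
  }

  /** Stage one part as its own file; its Path stands in for the opaque PartHandle. */
  public Path putPart(InputStream in, int partNumber, String uploadId)
      throws IOException {
    Path partFile = new Path(stagingDir, uploadId + ".part-" + partNumber);
    try (FSDataOutputStream out = fs.create(partFile)) {
      IOUtils.copyBytes(in, out, 4096);
    }
    return partFile;
  }

  /** Complete by concatenating the staged parts, in part order, onto the first part. */
  public void complete(Path filePath, List<Pair<Integer, Path>> parts)
      throws IOException {
    parts.sort((a, b) -> Integer.compare(a.getLeft(), b.getLeft()));
    Path first = parts.get(0).getRight();
    Path[] rest = parts.stream()
        .skip(1)
        .map(Pair::getRight)
        .toArray(Path[]::new);
    if (rest.length > 0) {
      fs.concat(first, rest);  // each staged part becomes a block range of the target
    }
    if (!fs.rename(first, filePath)) {
      throw new IOException("Failed to rename " + first + " to " + filePath);
    }
  }
}

Note that concat is only supported by filesystems such as HDFS that can splice blocks, which is consistent with the per-backend implementations envisioned above.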
Attachments
Issue Links
- causes
  - HADOOP-16150 checksumFS doesn't wrap concat(): concatenated files don't have checksums (Resolved)
  - HDFS-13707 [PROVIDED Storage] Fix failing integration tests in ITestProvidedImplementation (Resolved)
- is depended upon by
  - HADOOP-15576 S3A Multipart Uploader to work with S3Guard and encryption (Resolved)
  - HDFS-13713 Add specification of Multipart Upload API to FS specification, with contract tests (Resolved)
- is related to
  - HBASE-20431 Store commit transaction for filesystems that do not support an atomic rename (Closed)