Details
Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Description
I noticed that I get 150 MB/s when I use the AWS CLI:
aws s3 cp s3://<bucket>/<key> - > /dev/null
versus 50 MB/s when I use the S3AFileSystem:
hadoop fs -cat s3://<bucket>/<key> > /dev/null
Looking into the AWS CLI code, the download logic is quite clever: it downloads the next few parts in parallel using range requests, then buffers them in memory so it can reorder them and expose a single contiguous stream. I translated that logic to Java and modified the S3AFileSystem to do the same thing, and can now achieve 150 MB/s download speeds as well. It is mostly done, but I have some things to clean up first. The PR is here: https://github.com/palantir/hadoop/pull/47/files
It would be great to get some other eyes on it to see what we need to do to get it merged.
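For reviewers who want the gist without reading the PR, here is a minimal standalone sketch of the prefetching idea (parallel ranged GETs, reassembled in order into one contiguous stream). It is not the PR code and not the S3AFileSystem integration; it assumes the AWS SDK for Java v1 (aws-java-sdk-s3), and the class name, part size, and queue depth are illustrative only.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.util.IOUtils;

import java.io.InputStream;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PartPrefetcher {
    private static final long PART_SIZE = 8L * 1024 * 1024; // 8 MB ranges, illustrative
    private static final int QUEUE_DEPTH = 4;               // number of parts fetched ahead

    public static void main(String[] args) throws Exception {
        String bucket = args[0];
        String key = args[1];

        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        long length = s3.getObjectMetadata(bucket, key).getContentLength();

        ExecutorService pool = Executors.newFixedThreadPool(QUEUE_DEPTH);
        Deque<CompletableFuture<byte[]>> inflight = new ArrayDeque<>();
        long nextOffset = 0;

        // Keep QUEUE_DEPTH ranged GETs in flight; consume them strictly in order
        // so the caller sees a single contiguous stream of bytes.
        while (nextOffset < length || !inflight.isEmpty()) {
            while (inflight.size() < QUEUE_DEPTH && nextOffset < length) {
                final long start = nextOffset;
                final long end = Math.min(start + PART_SIZE, length) - 1; // range end is inclusive
                inflight.addLast(CompletableFuture.supplyAsync(() -> {
                    GetObjectRequest req = new GetObjectRequest(bucket, key).withRange(start, end);
                    try (InputStream in = s3.getObject(req).getObjectContent()) {
                        return IOUtils.toByteArray(in);
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }, pool));
                nextOffset = end + 1;
            }
            byte[] part = inflight.removeFirst().join(); // block until the oldest part arrives
            System.out.write(part);                      // emit in order, like -cat > /dev/null
        }
        pool.shutdown();
    }
}

Buffering a few parts ahead is what hides per-request latency; the reorder buffer is bounded by QUEUE_DEPTH * PART_SIZE of memory, which is the main tuning trade-off the PR has to expose as configuration.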
Attachments
Issue Links
- links to