Details
Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Description
I noticed that I get 150 MB/s when I use the AWS CLI:
aws s3 cp s3://<bucket>/<key> - > /dev/null
versus 50 MB/s when I use the S3AFileSystem:
hadoop fs -cat s3://<bucket>/<key> > /dev/null
Looking into the AWS CLI code, the download logic is quite clever: it downloads the next few parts in parallel using range requests, then buffers them in memory so it can reorder them and expose a single contiguous stream. I translated that logic to Java and modified the S3AFileSystem to do the same thing, and can now achieve 150 MB/s download speeds as well. It is mostly done, but I have some things to clean up first. The PR is here: https://github.com/palantir/hadoop/pull/47/files
It would be great to get some other eyes on it to see what we need to do to get it merged.
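For reviewers who want the gist without reading the PR, here is a minimal standalone sketch of the prefetching idea (parallel ranged GETs, reassembled in order into one contiguous stream). It is not the PR code and not the S3AFileSystem integration; it assumes the AWS SDK for Java v1 (aws-java-sdk-s3), and the class name, part size, and queue depth are illustrative only.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.util.IOUtils;

import java.io.InputStream;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PartPrefetcher {
    private static final long PART_SIZE = 8L * 1024 * 1024; // 8 MB ranges, illustrative
    private static final int QUEUE_DEPTH = 4;               // number of parts fetched ahead

    public static void main(String[] args) throws Exception {
        String bucket = args[0];
        String key = args[1];

        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        long length = s3.getObjectMetadata(bucket, key).getContentLength();

        ExecutorService pool = Executors.newFixedThreadPool(QUEUE_DEPTH);
        Deque<CompletableFuture<byte[]>> inflight = new ArrayDeque<>();
        long nextOffset = 0;

        // Keep QUEUE_DEPTH ranged GETs in flight; consume them strictly in order
        // so the caller sees a single contiguous stream of bytes.
        while (nextOffset < length || !inflight.isEmpty()) {
            while (inflight.size() < QUEUE_DEPTH && nextOffset < length) {
                final long start = nextOffset;
                final long end = Math.min(start + PART_SIZE, length) - 1; // range end is inclusive
                inflight.addLast(CompletableFuture.supplyAsync(() -> {
                    GetObjectRequest req = new GetObjectRequest(bucket, key).withRange(start, end);
                    try (InputStream in = s3.getObject(req).getObjectContent()) {
                        return IOUtils.toByteArray(in);
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }, pool));
                nextOffset = end + 1;
            }
            byte[] part = inflight.removeFirst().join(); // block until the oldest part arrives
            System.out.write(part);                      // emit in order, like -cat > /dev/null
        }
        pool.shutdown();
    }
}

Buffering a few parts ahead is what hides per-request latency; the reorder buffer is bounded by QUEUE_DEPTH * PART_SIZE of memory, which is the main tuning trade-off the PR has to expose as configuration.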
Attachments
Issue Links
- links to