Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: fs/s3

    Description

      I noticed that I get 150MB/s when I use the AWS CLI

      aws s3 cp s3://<bucket>/<key> - > /dev/null

      vs 50MB/s when I use the S3AFileSystem

      hadoop fs -cat s3://<bucket>/<key> > /dev/null

      Looking at the AWS CLI code, the download logic is quite clever: it downloads the next few parts in parallel using range requests, then buffers them in memory so that it can reorder them and expose a single contiguous stream. I translated that logic to Java and modified S3AFileSystem to do the same, and I can now reach 150MB/s download speeds as well. It is mostly done, but I have some things to clean up first. The PR is here: https://github.com/palantir/hadoop/pull/47/files
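
      To make the idea concrete for reviewers, here is a minimal sketch of the technique described above: keep a few ranged reads in flight and hand the parts back in order as one stream. It is not the code from the PR. The class name ParallelRangeStream, the RangeFetcher interface (a stand-in for a ranged GET against S3), and the partSize and readAhead parameters are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative only: reads an object as fixed-size parts fetched in parallel. */
class ParallelRangeStream extends InputStream {
    /** Hypothetical stand-in for a ranged GET: returns bytes [start, end). */
    interface RangeFetcher {
        byte[] fetch(long start, long end) throws IOException;
    }

    private final RangeFetcher fetcher;
    private final long length;
    private final int partSize;
    private final int readAhead;
    private final ExecutorService pool;
    // Futures are queued in offset order, so draining the queue from the
    // head yields the parts in order even if they complete out of order.
    private final Deque<Future<byte[]>> inFlight = new ArrayDeque<>();
    private long nextOffset = 0;          // start of the next part to schedule
    private byte[] current = new byte[0]; // part currently being consumed
    private int pos = 0;                  // read position within current

    ParallelRangeStream(RangeFetcher fetcher, long length, int partSize, int readAhead) {
        if (partSize < 1 || readAhead < 1) {
            throw new IllegalArgumentException("partSize and readAhead must be >= 1");
        }
        this.fetcher = fetcher;
        this.length = length;
        this.partSize = partSize;
        this.readAhead = readAhead;
        this.pool = Executors.newFixedThreadPool(readAhead);
        schedule();
    }

    /** Tops up the queue so that up to readAhead parts are in flight. */
    private void schedule() {
        while (inFlight.size() < readAhead && nextOffset < length) {
            final long start = nextOffset;
            final long end = Math.min(start + partSize, length);
            inFlight.add(pool.submit(() -> fetcher.fetch(start, end)));
            nextOffset = end;
        }
    }

    @Override
    public int read() throws IOException {
        while (pos == current.length) {
            Future<byte[]> head = inFlight.poll();
            if (head == null) {
                return -1; // every part has been consumed
            }
            try {
                current = head.get(); // blocks until the oldest part arrives
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while waiting for a part", e);
            } catch (ExecutionException e) {
                throw new IOException("range fetch failed", e.getCause());
            }
            pos = 0;
            schedule(); // a slot freed up, so start the next fetch
        }
        return current[pos++] & 0xFF;
    }

    @Override
    public void close() {
        pool.shutdownNow();
    }

    public static void main(String[] args) throws IOException {
        // An in-memory byte array plays the role of the remote object.
        byte[] object = new byte[1000];
        for (int i = 0; i < object.length; i++) {
            object[i] = (byte) i;
        }
        RangeFetcher fake = (s, e) -> Arrays.copyOfRange(object, (int) s, (int) e);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ParallelRangeStream in = new ParallelRangeStream(fake, object.length, 64, 4)) {
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
        }
        System.out.println(Arrays.equals(object, out.toByteArray()));
    }
}
```

      The sketch only covers sequential reads. A real input stream would also need a bulk read(byte[], int, int), seek handling that discards or reuses in-flight parts, and a bound on buffered memory, none of which is shown here.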

      It would be great to get some other eyes on it to see what we need to do to get it merged.

      Attachments

        1. HADOOP-16132.001.patch
          44 kB
          Justin Uang
        2. HADOOP-16132.002.patch
          44 kB
          Justin Uang
        3. HADOOP-16132.003.patch
          52 kB
          Justin Uang
        4. HADOOP-16132.004.patch
          54 kB
          Justin Uang
        5. HADOOP-16132.005.patch
          54 kB
          Justin Uang
        6. seek-logs-parquet.txt
          6 kB
          Justin Uang

            People

              Assignee: Unassigned
              Reporter: Justin Uang (justin.uang)
              Votes: 1
              Watchers: 13