Details
- Type: Sub-task
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Reviewed
Description
A major cost in split calculation against object stores turns out to be listing the directory tree itself. Against S3, S3A needs two HEAD requests and two LIST requests to enumerate the contents of any directory path: 2 HEADs + 1 LIST for getFileStatus(), then one further LIST to query the contents.
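As a rough, illustrative cost model, the per-directory request counts above can be sketched as follows (the function name is hypothetical; the constants come straight from the description):

```python
# Hypothetical cost model for S3A's directory-by-directory tree walk.
# Per the description: getFileStatus() issues 2 HEADs + 1 LIST, and
# reading the directory contents costs one further LIST.
HEADS_PER_DIR = 2
LISTS_PER_DIR = 2  # 1 for the getFileStatus() probe + 1 for the contents


def tree_walk_requests(num_directories: int) -> int:
    """Total HTTP requests to walk a tree of num_directories directories."""
    return num_directories * (HEADS_PER_DIR + LISTS_PER_DIR)


print(tree_walk_requests(100))  # 400 requests for a 100-directory tree
```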
Listing a single directory could be improved slightly by combining the final two listings; however, walking a directory tree would still cost O(directories) requests. In contrast, a recursive listFiles() operation should be implementable as one bulk listing of all descendant paths: one LIST operation per thousand descendants.
As the result of this call is an iterator (a RemoteIterator), the ongoing listing can be implemented within the iterator itself: each further page of results is fetched only as the caller consumes the previous one.
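A minimal sketch of the comparison and of the lazy-paging idea, assuming a page size of 1000 keys per LIST as described; all names are hypothetical and this is a cost model, not the S3A implementation:

```python
import math

# Hypothetical flat-listing cost: a listObjects-style scan pages through
# all descendant objects at up to `page_size` keys per LIST request,
# independent of how many directories the tree contains.
def bulk_list_requests(num_descendants: int, page_size: int = 1000) -> int:
    return max(1, math.ceil(num_descendants / page_size))


def tree_walk_requests(num_directories: int) -> int:
    # 2 HEADs + 2 LISTs per directory, as described above.
    return 4 * num_directories


def paged_listing(pages):
    """Generator mimicking a RemoteIterator: each page of results is only
    consumed as the caller iterates, so the listing proceeds lazily."""
    for page in pages:  # each `page` stands for one LIST response
        yield from page


# e.g. a tree of 100 directories holding 5000 files in total:
print(tree_walk_requests(100))   # 400 requests for the tree walk
print(bulk_list_requests(5000))  # 5 requests for the flat listing
```

The O(directories) versus O(descendants / 1000) gap is why the flat listing wins on deep or wide trees: the request count no longer depends on the shape of the pseudo-directory hierarchy.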
Attachments
Issue Links
- depends upon
  - HADOOP-13207 Specify FileSystem listStatus, listFiles and RemoteIterator (Resolved)
- is depended upon by
  - HADOOP-13371 S3A globber to use bulk listObject call over recursive directory scan (Resolved)
  - SPARK-17593 list files on s3 very slow (Resolved)
- is related to
  - HADOOP-14755 WASB to implement listFiles(Path f, boolean recursive) through flat list (Resolved)
  - HIVE-14165 Remove Hive file listing during split computation (Closed)
- relates to
  - SPARK-17593 list files on s3 very slow (Resolved)
- supersedes
  - HADOOP-15192 S3A listStatus excessively slow -hurts Spark job partitioning (Resolved)