Details
- Type: Sub-task
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Reviewed
Description
A major cost in split calculation against object stores turns out to be listing the directory tree itself. Against S3, S3A needs two HEAD requests and two LIST requests to enumerate the contents of any directory path: 2 HEADs + 1 LIST for getFileStatus(), then one further LIST to query the contents.
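As a rough, illustrative cost model, the per-directory request counts above can be sketched as follows (the function name is hypothetical; the constants come straight from the description):

```python
# Hypothetical cost model for S3A's directory-by-directory tree walk.
# Per the description: getFileStatus() issues 2 HEADs + 1 LIST, and
# reading the directory contents costs one further LIST.
HEADS_PER_DIR = 2
LISTS_PER_DIR = 2  # 1 for the getFileStatus() probe + 1 for the contents


def tree_walk_requests(num_directories: int) -> int:
    """Total HTTP requests to walk a tree of num_directories directories."""
    return num_directories * (HEADS_PER_DIR + LISTS_PER_DIR)


print(tree_walk_requests(100))  # 400 requests for a 100-directory tree
```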
Listing a single directory could be improved slightly by combining the final two listings; however, walking a directory tree would still cost O(directories) requests. In contrast, a recursive listFiles() operation should be implementable as one bulk listing of all descendant paths: one LIST operation per thousand descendants.
As the result of this call is an iterator (a RemoteIterator), the ongoing listing can be implemented within the iterator itself: each further page of results is fetched only as the caller consumes the previous one.
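A minimal sketch of the comparison and of the lazy-paging idea, assuming a page size of 1000 keys per LIST as described; all names are hypothetical and this is a cost model, not the S3A implementation:

```python
import math

# Hypothetical flat-listing cost: a listObjects-style scan pages through
# all descendant objects at up to `page_size` keys per LIST request,
# independent of how many directories the tree contains.
def bulk_list_requests(num_descendants: int, page_size: int = 1000) -> int:
    return max(1, math.ceil(num_descendants / page_size))


def tree_walk_requests(num_directories: int) -> int:
    # 2 HEADs + 2 LISTs per directory, as described above.
    return 4 * num_directories


def paged_listing(pages):
    """Generator mimicking a RemoteIterator: each page of results is only
    consumed as the caller iterates, so the listing proceeds lazily."""
    for page in pages:  # each `page` stands for one LIST response
        yield from page


# e.g. a tree of 100 directories holding 5000 files in total:
print(tree_walk_requests(100))   # 400 requests for the tree walk
print(bulk_list_requests(5000))  # 5 requests for the flat listing
```

The O(directories) versus O(descendants / 1000) gap is why the flat listing wins on deep or wide trees: the request count no longer depends on the shape of the pseudo-directory hierarchy.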
Attachments
Issue Links
- depends upon
  - HADOOP-13207 Specify FileSystem listStatus, listFiles and RemoteIterator (Resolved)
- is depended upon by
  - HADOOP-13371 S3A globber to use bulk listObject call over recursive directory scan (Resolved)
  - SPARK-17593 list files on s3 very slow (Resolved)
- is related to
  - HADOOP-14755 WASB to implement listFiles(Path f, boolean recursive) through flat list (Resolved)
  - HIVE-14165 Remove Hive file listing during split computation (Closed)
- relates to
  - SPARK-17593 list files on s3 very slow (Resolved)
- supersedes
  - HADOOP-15192 S3A listStatus excessively slow -hurts Spark job partitioning (Resolved)