Description
Review how applications use the Hadoop FS APIs to access filesystems; identify suboptimal call patterns and tune them for better performance against HDFS and object stores.
- Assume arbitrary Hadoop 2.x releases: make no changes which are known to slow down operations on older versions of Hadoop.
- Do propose changes which deliver speedups on later versions of Hadoop, provided they neither penalize older versions nor risk causing scalability problems.
- Add more tests, especially scalable ones which also display metrics.
- Use standard benchmarks and optimization tools to identify hotspots.
- Use FS behaviour as verified in the FS contract tests as evidence that filesystems correctly implement the Hadoop FS APIs. If an API call is used in a way which hints at an expectation of different/untested behaviour, leave it alone and add new tests to the Hadoop FS contract to determine the cross-FS semantics.
- Focus on the startup, split calculation and directory scanning operations: the ones which slow down entire queries.
- Eliminate uses of isDirectory(), getLength() and exists() when a follow-on operation (getFileStatus(), delete(), ...) makes them redundant.
- Assume that FileStatus entries are not cached; the cost of creating one is one RPC call against HDFS, one or more HTTPS requests against object stores.
- Locate calls to the listing operations, identify speedups, especially on recursive directory scans.
- Identify suboptimal seek patterns (backwards as well as forwards) and attempt to reduce/eliminate through reordering and result caching.
- Try to reuse the results of previous operations (e.g. FileStatus instances) in follow-on calls.
- Optimizations to commonly used file formats (e.g. ORC) will have transitive benefits.
- Have frameworks use predicate pushdown where this delivers speedups.
- Document the best practices identified and implemented.
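To illustrate the bullet on redundant isDirectory()/exists() probes: a minimal, self-contained sketch. CountingFs is a hypothetical stub standing in for org.apache.hadoop.fs.FileSystem (not the real class), counting each call as one round trip; the real delete(path, recursive) already returns false when the path is absent, so a preceding exists() probe buys nothing.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stub modelling call costs: each method is one RPC against
 *  HDFS, one or more HTTPS requests against an object store. */
class CountingFs {
    final AtomicInteger rpcCount = new AtomicInteger();

    boolean exists(String path) {
        rpcCount.incrementAndGet();   // a getFileStatus() probe under the hood
        return true;
    }

    boolean delete(String path, boolean recursive) {
        rpcCount.incrementAndGet();
        return true;                  // returns false if the path did not exist
    }
}

public class Main {
    public static void main(String[] args) {
        CountingFs fs = new CountingFs();

        // Anti-pattern: probe, then act -- two round trips.
        if (fs.exists("/tmp/out")) {
            fs.delete("/tmp/out", true);
        }
        System.out.println("exists+delete RPCs: " + fs.rpcCount.get());

        fs.rpcCount.set(0);
        // Better: delete() reports "not found" through its return value.
        fs.delete("/tmp/out", true);
        System.out.println("delete-only RPCs: " + fs.rpcCount.get());
    }
}
```

The same reasoning applies to exists() before open() or mkdirs(): the follow-on call fails or no-ops cleanly on its own, and against an object store the saved probe is a whole HTTPS request.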
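To make the recursive-listing bullet concrete, here is a self-contained sketch with an in-memory tree (a hypothetical stand-in, not the Hadoop API): a client-side treewalk issues one listing call per directory, whereas a flat recursive listing in the style of listFiles(path, true) can be served by an object store as a single paged enumeration, modelled here as one call.

```java
import java.util.*;

public class Main {
    /** Hypothetical in-memory namespace: directory -> child paths. */
    static final Map<String, List<String>> TREE = new HashMap<>();
    static int calls = 0;

    static List<String> listStatus(String dir) {
        calls++;                      // one round trip per directory listed
        return TREE.getOrDefault(dir, Collections.emptyList());
    }

    // Anti-pattern: recursive treewalk, one call per directory.
    static List<String> treewalk(String dir) {
        List<String> files = new ArrayList<>();
        for (String entry : listStatus(dir)) {
            if (TREE.containsKey(entry)) {
                files.addAll(treewalk(entry));   // descend into subdirectory
            } else {
                files.add(entry);
            }
        }
        return files;
    }

    // Flat recursive listing: the store enumerates the whole subtree itself,
    // modelled here as a single round trip.
    static List<String> listFilesRecursive(String dir) {
        calls++;
        List<String> files = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(dir);
        while (!pending.isEmpty()) {
            for (String e : TREE.getOrDefault(pending.poll(),
                                              Collections.emptyList())) {
                if (TREE.containsKey(e)) pending.add(e); else files.add(e);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        // /data with 10 partition directories of 5 files each.
        List<String> dirs = new ArrayList<>();
        for (int d = 0; d < 10; d++) {
            String dir = "/data/part=" + d;
            dirs.add(dir);
            List<String> files = new ArrayList<>();
            for (int f = 0; f < 5; f++) files.add(dir + "/file-" + f);
            TREE.put(dir, files);
        }
        TREE.put("/data", dirs);

        calls = 0;
        System.out.println("treewalk: " + treewalk("/data").size()
            + " files, calls=" + calls);
        calls = 0;
        System.out.println("listFiles(recursive): "
            + listFilesRecursive("/data").size() + " files, calls=" + calls);
    }
}
```

The gap widens with directory count: a deep partition tree costs the treewalk one round trip per directory, which is exactly the split-calculation slowdown this JIRA targets.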
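As a sketch of the seek-reordering bullet: backward seeks are the expensive case on object stores, where seeking behind the current position can abort and reopen the HTTP connection. This self-contained example (the read list and counter are illustrative, not a real reader) counts how many backward seeks a sequence of positioned reads would trigger, before and after sorting the reads by offset.

```java
import java.util.*;

public class Main {
    /** Count the backward seeks a sequence of (offset, length) reads causes. */
    static int backwardSeeks(List<long[]> reads) {
        int backwards = 0;
        long pos = 0;
        for (long[] r : reads) {          // r = {offset, length}
            if (r[0] < pos) backwards++;  // seeking behind the current position
            pos = r[0] + r[1];
        }
        return backwards;
    }

    public static void main(String[] args) {
        // Column reads issued in schema order rather than file order.
        List<long[]> reads = new ArrayList<>(Arrays.asList(
            new long[]{9000, 500}, new long[]{100, 200},
            new long[]{4000, 300}, new long[]{700, 100}));
        System.out.println("unsorted backward seeks: " + backwardSeeks(reads));

        // Reordering by offset turns the pattern into a forward-only scan.
        reads.sort(Comparator.comparingLong(r -> r[0]));
        System.out.println("sorted backward seeks: " + backwardSeeks(reads));
    }
}
```

Where reads cannot be reordered (e.g. the format dictates footer-first access), caching the result of the earlier read avoids repeating the backward seek.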
Issue Links
- depends upon
  - SPARK-16980 Load only catalog table partition metadata required to answer a query (Resolved)
  - HIVE-14165 Remove Hive file listing during split computation (Closed)
  - HIVE-14269 Performance optimizations for data on S3 (Open)
  - MAPREDUCE-6760 LocatedFileStatusFetcher to use listFiles(recursive) (Open)
  - MAPREDUCE-6800 FileInputFormat.singleThreadedListStatus to use listFiles(recursive) (Open)
  - SPARK-17861 Store data source partitions in metastore and push partition pruning into metastore (Resolved)
  - HADOOP-13321 Deprecate FileSystem APIs that promote inefficient call patterns (Resolved)
  - HADOOP-13427 Eliminate needless uses of FileSystem#{exists(), isFile(), isDirectory()} (Resolved)
  - SPARK-14551 Reduce number of NameNode calls in OrcRelation with FileSourceStrategy mode (Resolved)
  - SPARK-16736 remove redundant FileSystem status checks calls from Spark codebase (Resolved)
  - SPARK-17159 Improve FileInputDStream.findNewFiles list performance (Resolved)
  - SPARK-18917 Dataframe - Time Out Issues / Taking long time in append mode on object stores (Resolved)
  - SPARK-17179 Consider improving partition pruning in HiveMetastoreCatalog (Closed)
  - HIVE-14323 Reduce number of FS permissions and redundant FS operations (Closed)
  - HIVE-14423 S3: Fetching partition sizes from FS can be expensive when stats are not available in metastore (Closed)
  - PIG-4442 Eliminate redundant RPC call to get file information in HPath (Closed)