Description
Review how applications use the Hadoop FS APIs to access filesystems; identify suboptimal call patterns and tune them for better performance against HDFS and object stores.
- Assume arbitrary Hadoop 2.x releases: make no changes which are known to slow down operations on older versions of Hadoop.
- Do propose changes which deliver speedups on later versions of Hadoop, provided they neither penalize older versions nor risk causing scalability problems.
- Add more tests, especially scalable ones which also display metrics.
- Use standard benchmarks and optimization tools to identify hotspots.
- Use FS behaviour as verified in the FS contract tests as evidence that filesystems correctly implement the Hadoop FS APIs. If an API call is used in a way which hints at an expectation of different/untested behaviour, leave it alone and add new tests to the Hadoop FS contract to determine the cross-FS semantics.
- Focus on the startup, split calculation and directory scanning operations: the ones which slow down entire queries.
- Eliminate uses of isDirectory(), getLength() and exists() when a follow-on operation (getFileStatus(), delete(), ...) makes them redundant.
- Assume that FileStatus entries are not cached; the cost of creating one is one RPC call against HDFS, one or more HTTPS requests against object stores.
- Locate calls to the listing operations, identify speedups, especially on recursive directory scans.
- Identify suboptimal seek patterns (backwards as well as forwards) and attempt to reduce/eliminate through reordering and result caching.
- Try to reuse the results of previous operations (e.g. FileStatus instances) in follow-on calls.
- Optimizations to commonly used file formats (e.g. ORC) will have transitive benefits.
- Have frameworks use predicate pushdown where this delivers speedups.
- Document the best practices identified and implemented.
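To illustrate the bullet on redundant isDirectory()/exists() probes: a minimal, self-contained sketch. CountingFs is a hypothetical stub standing in for org.apache.hadoop.fs.FileSystem (not the real class), counting each call as one round trip; the real delete(path, recursive) already returns false when the path is absent, so a preceding exists() probe buys nothing.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stub modelling call costs: each method is one RPC against
 *  HDFS, one or more HTTPS requests against an object store. */
class CountingFs {
    final AtomicInteger rpcCount = new AtomicInteger();

    boolean exists(String path) {
        rpcCount.incrementAndGet();   // a getFileStatus() probe under the hood
        return true;
    }

    boolean delete(String path, boolean recursive) {
        rpcCount.incrementAndGet();
        return true;                  // returns false if the path did not exist
    }
}

public class Main {
    public static void main(String[] args) {
        CountingFs fs = new CountingFs();

        // Anti-pattern: probe, then act -- two round trips.
        if (fs.exists("/tmp/out")) {
            fs.delete("/tmp/out", true);
        }
        System.out.println("exists+delete RPCs: " + fs.rpcCount.get());

        fs.rpcCount.set(0);
        // Better: delete() reports "not found" through its return value.
        fs.delete("/tmp/out", true);
        System.out.println("delete-only RPCs: " + fs.rpcCount.get());
    }
}
```

The same reasoning applies to exists() before open() or mkdirs(): the follow-on call fails or no-ops cleanly on its own, and against an object store the saved probe is a whole HTTPS request.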
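To make the recursive-listing bullet concrete, here is a self-contained sketch with an in-memory tree (a hypothetical stand-in, not the Hadoop API): a client-side treewalk issues one listing call per directory, whereas a flat recursive listing in the style of listFiles(path, true) can be served by an object store as a single paged enumeration, modelled here as one call.

```java
import java.util.*;

public class Main {
    /** Hypothetical in-memory namespace: directory -> child paths. */
    static final Map<String, List<String>> TREE = new HashMap<>();
    static int calls = 0;

    static List<String> listStatus(String dir) {
        calls++;                      // one round trip per directory listed
        return TREE.getOrDefault(dir, Collections.emptyList());
    }

    // Anti-pattern: recursive treewalk, one call per directory.
    static List<String> treewalk(String dir) {
        List<String> files = new ArrayList<>();
        for (String entry : listStatus(dir)) {
            if (TREE.containsKey(entry)) {
                files.addAll(treewalk(entry));   // descend into subdirectory
            } else {
                files.add(entry);
            }
        }
        return files;
    }

    // Flat recursive listing: the store enumerates the whole subtree itself,
    // modelled here as a single round trip.
    static List<String> listFilesRecursive(String dir) {
        calls++;
        List<String> files = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(dir);
        while (!pending.isEmpty()) {
            for (String e : TREE.getOrDefault(pending.poll(),
                                              Collections.emptyList())) {
                if (TREE.containsKey(e)) pending.add(e); else files.add(e);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        // /data with 10 partition directories of 5 files each.
        List<String> dirs = new ArrayList<>();
        for (int d = 0; d < 10; d++) {
            String dir = "/data/part=" + d;
            dirs.add(dir);
            List<String> files = new ArrayList<>();
            for (int f = 0; f < 5; f++) files.add(dir + "/file-" + f);
            TREE.put(dir, files);
        }
        TREE.put("/data", dirs);

        calls = 0;
        System.out.println("treewalk: " + treewalk("/data").size()
            + " files, calls=" + calls);
        calls = 0;
        System.out.println("listFiles(recursive): "
            + listFilesRecursive("/data").size() + " files, calls=" + calls);
    }
}
```

The gap widens with directory count: a deep partition tree costs the treewalk one round trip per directory, which is exactly the split-calculation slowdown this JIRA targets.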
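As a sketch of the seek-reordering bullet: backward seeks are the expensive case on object stores, where seeking behind the current position can abort and reopen the HTTP connection. This self-contained example (the read list and counter are illustrative, not a real reader) counts how many backward seeks a sequence of positioned reads would trigger, before and after sorting the reads by offset.

```java
import java.util.*;

public class Main {
    /** Count the backward seeks a sequence of (offset, length) reads causes. */
    static int backwardSeeks(List<long[]> reads) {
        int backwards = 0;
        long pos = 0;
        for (long[] r : reads) {          // r = {offset, length}
            if (r[0] < pos) backwards++;  // seeking behind the current position
            pos = r[0] + r[1];
        }
        return backwards;
    }

    public static void main(String[] args) {
        // Column reads issued in schema order rather than file order.
        List<long[]> reads = new ArrayList<>(Arrays.asList(
            new long[]{9000, 500}, new long[]{100, 200},
            new long[]{4000, 300}, new long[]{700, 100}));
        System.out.println("unsorted backward seeks: " + backwardSeeks(reads));

        // Reordering by offset turns the pattern into a forward-only scan.
        reads.sort(Comparator.comparingLong(r -> r[0]));
        System.out.println("sorted backward seeks: " + backwardSeeks(reads));
    }
}
```

Where reads cannot be reordered (e.g. the format dictates footer-first access), caching the result of the earlier read avoids repeating the backward seek.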
Issue Links
- depends upon
  - SPARK-16980 Load only catalog table partition metadata required to answer a query (Resolved)
  - HIVE-14165 Remove Hive file listing during split computation (Closed)
  - HIVE-14269 Performance optimizations for data on S3 (Open)
  - MAPREDUCE-6760 LocatedFileStatusFetcher to use listFiles(recursive) (Open)
  - MAPREDUCE-6800 FileInputFormat.singleThreadedListStatus to use listFiles(recursive) (Open)
  - SPARK-17861 Store data source partitions in metastore and push partition pruning into metastore (Resolved)
  - HADOOP-13321 Deprecate FileSystem APIs that promote inefficient call patterns (Resolved)
  - HADOOP-13427 Eliminate needless uses of FileSystem#{exists(), isFile(), isDirectory()} (Resolved)
  - SPARK-14551 Reduce number of NameNode calls in OrcRelation with FileSourceStrategy mode (Resolved)
  - SPARK-16736 remove redundant FileSystem status checks calls from Spark codebase (Resolved)
  - SPARK-17159 Improve FileInputDStream.findNewFiles list performance (Resolved)
  - SPARK-18917 Dataframe - Time Out Issues / Taking long time in append mode on object stores (Resolved)
  - SPARK-17179 Consider improving partition pruning in HiveMetastoreCatalog (Closed)
  - HIVE-14323 Reduce number of FS permissions and redundant FS operations (Closed)
  - HIVE-14423 S3: Fetching partition sizes from FS can be expensive when stats are not available in metastore (Closed)
  - PIG-4442 Eliminate redundant RPC call to get file information in HPath (Closed)