Details
- Improvement
- Status: Resolved
- Critical
- Resolution: Fixed
- Impala 3.1.0
- None
Description
The file handle cache currently does not cache remote file handles. As a result, clusters that do a lot of remote reads can overload the NameNode. Impala should be able to cache remote file handles.
There are some open questions about remote file handles and whether they behave differently from local file handles. In particular:
- Is there any resource constraint on the number of remote file handles open? (e.g. do they maintain a network connection?)
- Are there any semantic differences in how remote file handles behave when files are deleted, overwritten, or appended?
- Are there any extra failure cases for remote file handles? (e.g. if a machine goes down or a remote file handle is left open for an extended period of time)
The form of caching will depend on the answers, but at the very least, it should be possible to cache a remote file handle for the duration of a query so that scans of multiple columns in the same Parquet file can share file handles.
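As a rough illustration of the query-level option, the sketch below pools handles per query so that several column reads against the same file reuse one open. It is not Impala's implementation: `QueryFileHandleCache`, `FakeHandle`, `scan_columns`, and the `open_fn` callable are hypothetical names, and keying on `(path, mtime)` is an assumed way to avoid reusing a handle after a file is overwritten.

```python
import threading
from collections import defaultdict


class QueryFileHandleCache:
    """Query-scoped pool of file handles keyed by (path, mtime). Hypothetical sketch."""

    def __init__(self, open_fn):
        # open_fn stands in for the expensive remote open (an assumption, not a real API).
        self._open_fn = open_fn
        self._lock = threading.Lock()
        self._idle = defaultdict(list)  # (path, mtime) -> handles not currently in use
        self.opens = 0                  # number of times a new open was started

    def acquire(self, path, mtime):
        """Return an idle handle for this file version, or open a new one."""
        key = (path, mtime)
        with self._lock:
            if self._idle[key]:
                return self._idle[key].pop()
            self.opens += 1
        # Open outside the lock so a slow remote open does not block other scanners.
        return self._open_fn(path)

    def release(self, path, mtime, handle):
        """Hand a handle back so another column scanner can reuse it."""
        with self._lock:
            self._idle[(path, mtime)].append(handle)

    def close_all(self):
        """Close every idle handle; intended to run when the query finishes."""
        with self._lock:
            handles = [h for hs in self._idle.values() for h in hs]
            self._idle.clear()
        for h in handles:
            h.close()


class FakeHandle:
    """Stand-in for a remote file handle, used only to exercise the cache."""

    def __init__(self, path):
        self.path = path
        self.closed = False

    def close(self):
        self.closed = True


def scan_columns(cache, path, mtime, columns):
    """Read each column in turn, borrowing a handle for the duration of the read."""
    for _ in columns:
        handle = cache.acquire(path, mtime)
        try:
            pass  # a real scanner would seek to the column chunk and read here
        finally:
            cache.release(path, mtime, handle)
```

Each handle is checked out exclusively, so concurrent column scanners never share a seek position; if two scanners overlap, the pool simply holds two handles for that file. This sketch does not address the open questions above about resource limits or long-lived remote handles.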
Issue Links
- is related to
  - IMPALA-9485 Enable file handle cache for EC files (Resolved)
  - IMPALA-10214 Ozone support for file handle cache (Resolved)
  - IMPALA-8428 Add support for caching file handles on s3 (Resolved)
- relates to
  - IMPALA-10202 Enable file handle cache for ABFS files (Resolved)