Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
The current Kudu-Spark bindings implement a DefaultSource that extends RelationProvider and hands BaseRelations to Spark. As I understand it, BaseRelations are physical units of query execution and represent sets of rows. The Kudu BaseRelation (the KuduRelation) implements two traits to fit into Spark: PrunedFilteredScan, which allows column pruning and predicates to be pushed into Kudu, and InsertableRelation, which allows writes to be pushed into Kudu. The problem with these bindings is that, while they provide interfaces to insert and get data, they do not provide interfaces for passing Spark details that might be useful for optimizing a Kudu query.
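To make the shape of the current bindings concrete, here is a minimal sketch of a v1 source built from the same Spark traits. It is not the actual Kudu code: SketchDefaultSource, SketchKuduRelation, the single-column schema, and the "kudu.table" option lookup are placeholders assumed for illustration, and the scan and insert bodies are stubs.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, InsertableRelation, PrunedFilteredScan, RelationProvider}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical stand-in for the Kudu DefaultSource: Spark calls
// createRelation with the options passed to the DataFrame reader.
class SketchDefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    // "kudu.table" is assumed here as the option naming the table.
    new SketchKuduRelation(parameters("kudu.table"))(sqlContext)
  }
}

// Hypothetical stand-in for the KuduRelation.
class SketchKuduRelation(val tableName: String)(val sqlContext: SQLContext)
    extends BaseRelation
    with PrunedFilteredScan
    with InsertableRelation {

  // Placeholder schema; a real relation would derive it from the table.
  override def schema: StructType =
    StructType(Seq(StructField("key", LongType, nullable = false)))

  // Spark passes the projected columns and the filters it would like the
  // source to evaluate. This stub ignores both and returns no rows.
  override def buildScan(
      requiredColumns: Array[String],
      filters: Array[Filter]): RDD[Row] =
    sqlContext.sparkContext.emptyRDD[Row]

  // Spark passes the rows to write. This stub performs no write.
  override def insert(data: DataFrame, overwrite: Boolean): Unit = ()
}
```

Everything the source tells Spark in this sketch goes through the schema and the two method signatures above, which is the limitation described in the paragraph.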
Among other things, this is inconvenient for any datasource that wants to take such optimizations into its own hands. The Spark community appears to be revamping its DataSource APIs in the form of DataSourceV2 and, for read support, the v2 DataSourceReader. The new API provides a clear path towards implementing various optimizations that are unavailable with the current Spark bindings, without pushing changes to Spark itself.
Of note, the v2 DataSourceReader can be extended with SupportsReportStatistics, which could allow Kudu to expose statistics to Spark without having to rely on HMS (although pushing stats to HMS isn't an unreasonable approach either). More traits and details about the API can be found here.
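As a rough illustration of the statistics hook, the sketch below mixes SupportsReportStatistics into a v2 reader. It assumes the Spark 2.4 flavor of the API, where the methods are planInputPartitions and estimateStatistics; the v2 interfaces were renamed between Spark releases, so the names may differ in other versions. KuduStatsReaderSketch and its constructor arguments are hypothetical, and how Kudu would obtain the row count and size is left open.

```scala
import java.util.{Collections, OptionalLong, List => JList}

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, Statistics, SupportsReportStatistics}
import org.apache.spark.sql.types.StructType

// Hypothetical reader: rowCount and sizeBytes stand in for whatever
// statistics Kudu could supply; None means "unknown".
class KuduStatsReaderSketch(
    schema: StructType,
    rowCount: Option[Long],
    sizeBytes: Option[Long])
    extends DataSourceReader
    with SupportsReportStatistics {

  override def readSchema(): StructType = schema

  // Scan planning is out of scope for this sketch, so no partitions.
  override def planInputPartitions(): JList[InputPartition[InternalRow]] =
    Collections.emptyList[InputPartition[InternalRow]]()

  // Spark's planner can consult these estimates, for example when
  // choosing a join strategy.
  override def estimateStatistics(): Statistics = new Statistics {
    override def sizeInBytes(): OptionalLong = toOptionalLong(sizeBytes)
    override def numRows(): OptionalLong = toOptionalLong(rowCount)
  }

  // Convert a Scala Option into the Java OptionalLong the API expects.
  private def toOptionalLong(value: Option[Long]): OptionalLong =
    value match {
      case Some(v) => OptionalLong.of(v)
      case None    => OptionalLong.empty()
    }
}
```

Returning an empty OptionalLong for an unknown value lets the reader report only the statistics it actually has.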