Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
When reading a Hudi table through the Spark datasource:
1. Snapshot mode
val readHudi = spark.read.format("org.apache.hudi").load(basePath + "/*") requires the trailing "/*"; without it the read fails. This is because org.apache.hudi.DefaultSource.createRelation() globs the supplied paths with fs.globStatus() (via val globPaths = HoodieSparkUtils.checkAndGlobPathIfNecessary(allPaths, fs)), and without "/*" the glob does not resolve the .hoodie and default directories.
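A minimal snapshot-read sketch (assuming spark and basePath are already defined, e.g. in spark-shell; the temp view name is illustrative):
```scala
// Snapshot query via the Spark datasource.
// Without the trailing glob the load fails, because createRelation()
// resolves the given path with fs.globStatus():
//   spark.read.format("org.apache.hudi").load(basePath)   // fails
// With the glob the table directories are matched:
val readHudi = spark.read.format("org.apache.hudi").load(basePath + "/*")
readHudi.createOrReplaceTempView("hudi_snapshot")
spark.sql("select count(*) from hudi_snapshot").show()
```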
2. Incremental mode
Both basePath and basePath + "/*" work. This is because org.apache.hudi.DefaultSource calls DataSourceUtils.getTablePath, which supports both forms:
```scala
val incViewDF = spark.read.format("org.apache.hudi").
  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
  option(END_INSTANTTIME_OPT_KEY, endTime).
  load(basePath)

// The glob form works just as well:
val incViewDFGlob = spark.read.format("org.apache.hudi").
  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
  option(END_INSTANTTIME_OPT_KEY, endTime).
  load(basePath + "/*")
```
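Why both forms work: getTablePath resolves the table's base path from whatever path it is given by locating the .hoodie metadata folder. A conceptual sketch of that idea (not Hudi's actual implementation; findTablePath is a hypothetical helper):
```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: walk up from the supplied path (which may be
// basePath or basePath + "/*") until a ".hoodie" folder is found, and
// treat that directory as the table base path.
def findTablePath(fs: FileSystem, start: Path): Option[Path] = {
  var current: Path = start
  while (current != null) {
    if (fs.exists(new Path(current, ".hoodie"))) return Some(current)
    current = current.getParent
  }
  None
}
```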
Because incremental mode and snapshot mode do not accept the same path forms, users get confused. Having load() take basePath + "/*" (or "/*/*" for deeper partitioning) is itself confusing. I understand the glob is there to support partitioned tables.
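For example, with today's API, reading just one partition means encoding it in the path glob; a sketch assuming a hypothetical year/month/day partition layout:
```scala
// Hypothetical: select only the 2019 partition by baking it into the glob.
val year2019 = spark.read.format("org.apache.hudi").load(basePath + "/2019/*/*")
```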
But I think an API like the following would be clearer for users:
```scala
val partition = "year = '2019'"
spark.read
  .format("hudi")
  .load(path)
  .where(partition)
```
Issue Links
- is related to HUDI-2493 Verify removing glob pattern works w/ all key generators (Closed)