[HUDI-5768] Fail to read metadata table in Spark Datasource - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 0.12.0, 0.12.1, 0.12.2
Fix Version/s: 0.13.0, 0.13.1, 0.12.3
Component/s: metadata
Labels: pull-request-available
Story Points: 1
Epic Link: Metadata Table for File Listing & Query Planning

Description

With Hudi 0.13.0 and Spark 3.3.0, reading the metadata table of a table created by Hudi 0.13.0 through the Spark datasource fails with a scala.MatchError:

scala> val df = spark.read.format("hudi").load("/Users/ethan/Work/tmp/20230127-test-cli-bundle/hudi_trips_cow_backup/.hoodie/metadata")
scala> df.count
scala.MatchError: HFILE (of class org.apache.hudi.common.model.HoodieFileFormat)
  at org.apache.hudi.HoodieBaseRelation.x$2$lzycompute(HoodieBaseRelation.scala:216)
  at org.apache.hudi.HoodieBaseRelation.x$2(HoodieBaseRelation.scala:215)
  at org.apache.hudi.HoodieBaseRelation.fileFormat$lzycompute(HoodieBaseRelation.scala:215)
  at org.apache.hudi.HoodieBaseRelation.fileFormat(HoodieBaseRelation.scala:215)
  at org.apache.hudi.HoodieBaseRelation.canPruneRelationSchema(HoodieBaseRelation.scala:295)
  at org.apache.hudi.BaseMergeOnReadSnapshotRelation.canPruneRelationSchema(MergeOnReadSnapshotRelation.scala:102)
  at org.apache.spark.sql.execution.datasources.Spark33NestedSchemaPruning$$anonfun$apply0$1.applyOrElse(Spark33NestedSchemaPruning.scala:56)
  at org.apache.spark.sql.execution.datasources.Spark33NestedSchemaPruning$$anonfun$apply0$1.applyOrElse(Spark33NestedSchemaPruning.scala:50)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589)
  at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1228)
  at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1227)
  at org.apache.spark.sql.catalyst.plans.logical.Aggregate.mapChildren(basicLogicalOperators.scala:976)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
  at org.apache.spark.sql.execution.datasources.Spark33NestedSchemaPruning.apply0(Spark33NestedSchemaPruning.scala:50)
  at org.apache.spark.sql.execution.datasources.Spark33NestedSchemaPruning.apply(Spark33NestedSchemaPruning.scala:44)
  at org.apache.spark.sql.execution.datasources.Spark33NestedSchemaPruning.apply(Spark33NestedSchemaPruning.scala:39)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:211)
  at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
  at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
  at scala.collection.immutable.List.foldLeft(List.scala:91)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
  at scala.collection.immutable.List.foreach(List.scala:431)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179)
  at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:126)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:185)
  at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510)
  at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:185)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
  at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:184)
  at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:122)
  at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:118)
  at org.apache.spark.sql.execution.QueryExecution.assertOptimized(QueryExecution.scala:136)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:154)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:151)
  at org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:204)
  at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:249)
  at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:218)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:103)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856)
  at org.apache.spark.sql.Dataset.count(Dataset.scala:3160)
  ... 47 elided

Hudi 0.12.0 with Spark 3.2.1 hits the same issue.

Hudi 0.11.1 with Spark 3.2.1 can read the same metadata table.
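
For readers unfamiliar with scala.MatchError, the failure mode in the trace can be illustrated in isolation. The sketch below is not Hudi code: FileFormat, describeUnsafe, and describeSafe are hypothetical names, and the sealed trait only stands in for the HoodieFileFormat enum named in the error. It shows how a pattern match that lacks a case for HFILE throws at runtime, and how an exhaustive match avoids that. It does not claim to reproduce the change made in the linked pull request.

```scala
object MatchErrorSketch {
  // Hypothetical stand-in for org.apache.hudi.common.model.HoodieFileFormat.
  // The real type is a Java enum; a sealed trait keeps this sketch self-contained.
  sealed trait FileFormat
  case object PARQUET extends FileFormat
  case object ORC extends FileFormat
  case object HFILE extends FileFormat

  // Non-exhaustive match: no case covers HFILE, so passing it throws scala.MatchError.
  // @unchecked silences the compiler's exhaustiveness warning for the sealed trait.
  def describeUnsafe(format: FileFormat): String = (format: @unchecked) match {
    case PARQUET => "parquet"
    case ORC     => "orc"
  }

  // Exhaustive match: every known format has a case, so no MatchError is possible.
  def describeSafe(format: FileFormat): String = format match {
    case PARQUET => "parquet"
    case ORC     => "orc"
    case HFILE   => "hfile"
  }

  def main(args: Array[String]): Unit = {
    // The exhaustive version handles HFILE.
    println(describeSafe(HFILE))
    // The non-exhaustive version fails; Try captures the MatchError instead of crashing.
    val attempt = scala.util.Try(describeUnsafe(HFILE))
    println(attempt.isFailure)
  }
}
```

Because the metadata table stores its base files as HFILE, any code path that matches on the base file format without an HFILE case would fail this way as soon as the metadata table is loaded as a relation.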

Issue Links

links to: GitHub Pull Request #7924
