Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Duplicate
- Affects Version/s: 3.0.0, 3.1.0
- Fix Version/s: None
- Component/s: None
Description
In Spark 3, spark.sql.orc.impl=native and spark.sql.optimizer.nestedSchemaPruning.enabled=true are the default settings. With these defaults, querying a struct field of Hive-generated ORC files can fail.
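Until this is fixed, flipping either default should avoid the failure (a sketch of possible workarounds, assuming the pruning analysis described below is right; both disable an optimization rather than fix the bug):

// Keep all columns in the dataSchema by disabling nested schema pruning...
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "false")
// ...or fall back to the Hive ORC reader, bypassing the native reader's check.
spark.conf.set("spark.sql.orc.impl", "hive")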
For example, we got an error when running the following query in Spark 3:
spark.table("testtable").filter(col("utc_date") === "20210122").select(col("open_count.d35")).show(false)
The error is:
Caused by: java.lang.AssertionError: assertion failed: The given data schema struct<open_count:struct<d35:map<string,double>>> has less fields than the actual ORC physical schema, no idea which columns were dropped, fail to read.
    at scala.Predef$.assert(Predef.scala:223)
    at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:153)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$3(OrcFileFormat.scala:180)
    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:178)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:127)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
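For reference, the table in our test has roughly the following shape (a hypothetical sketch; the other column names are elided, and the DDL alone is not a full repro: the data files must be written by Hive's ORC writer, not Spark's, for the failure to trigger):

// Hypothetical DDL matching the table shape described above: a wide table
// (~48 columns) with one struct column, partitioned by utc_date.
spark.sql("""
  CREATE TABLE testtable (
    open_count STRUCT<d35: MAP<STRING, DOUBLE>>
    -- plus ~47 other columns, elided here
  )
  PARTITIONED BY (utc_date STRING)
  STORED AS ORC
""")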
I think the reason is that nested schema pruning is applied to the dataSchema: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SchemaPruning.scala#L75
Nested schema pruning not only prunes the unused fields of a struct; it also prunes unused columns. In my test, the dataSchema originally has 48 columns, but after nested schema pruning it is reduced to 1 column. This results in an assertion error at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L159 because column pruning is not supported for Hive-generated ORC files.
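The failing check is roughly the following (a paraphrased sketch of the logic in OrcUtils.requestedColumnIds, not the exact Spark source): when the ORC physical schema carries no usable field names, Spark can only match columns by position, so it requires the data schema to cover at least as many fields as the file.

import org.apache.orc.TypeDescription
import org.apache.spark.sql.types.StructType

// Paraphrased sketch: after nested schema pruning, dataSchema may hold only
// 1 field while the physical schema still has 48, so the assertion fails.
def checkFieldCount(dataSchema: StructType, orcPhysicalSchema: TypeDescription): Unit = {
  assert(orcPhysicalSchema.getFieldNames.size <= dataSchema.length,
    s"The given data schema ${dataSchema.catalogString} has less fields than " +
      "the actual ORC physical schema, no idea which columns were dropped, fail to read.")
}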
This issue also seems related to the Hive version: we use Hive 1.2, whose ORC files do not contain field names in the physical schema.
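One can confirm this by inspecting the physical schema of a data file directly; a minimal sketch using the ORC reader API (the file path is hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

// Print the ORC physical schema of one data file. For Hive 1.2 output this
// typically looks like struct<_col0:...,_col1:...>, i.e. positional names
// instead of the real column names.
val reader = OrcFile.createReader(
  new Path("/warehouse/testtable/utc_date=20210122/000000_0"),
  OrcFile.readerOptions(new Configuration()))
println(reader.getSchema)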
Issue Links
- duplicates SPARK-34897 "Support reconcile schemas based on index after nested column pruning" (Resolved)