Details
-
Question
-
Status: Resolved
-
Major
-
Resolution: Duplicate
-
3.0.0
-
None
-
None
-
spark3.0
set spark.sql.hive.convertMetastoreOrc=true (default value in spark3.0)
set spark.sql.orc.impl=native(default velue in spark3.0)
Description
Before I address this issue, let me talk about the issue background: The current spark version we use is 2.2, and we plan to migrate to spark3.0 in near future. Before migration, we test some query in both spark2.2 and spark3.0 to check potential issue. The data source table of these query is orc format written by spark2.2.
I find that even if column pruning is applied, spark3.0’s native reader will read all columns.
Then I do remote debug. In OrcUtils.scala’s requestedColumnIds Method, it will check whether field name is started with “_col”. In my case, field name is started with “_col”, like “_col1”, “_col2”. So pruneCols is not done. The code is below:
if (orcFieldNames.forall(_.startsWith("_col"))) {
// This is a ORC file written by Hive, no field names in the physical schema, assume the
// physical schema maps to the data scheme by index.
assert(orcFieldNames.length <= dataSchema.length, "The given data schema " +
s"$
{dataSchema.catalogString}has less fields than the actual ORC physical schema, " +
"no idea which columns were dropped, fail to read.")
// for ORC file written by Hive, no field names
// in the physical schema, there is a need to send the
// entire dataSchema instead of required schema.
// So pruneCols is not done in this case
Some(requiredSchema.fieldNames.map { name =>
val index = dataSchema.fieldIndex(name)
if (index < orcFieldNames.length)
{ index }else
{ -1 }}, false)
Although this code comment explains reason, I still do not understand. This issue only happens in this case: spark3.0 uses native reader to read table written by spark2.2.
In other cases, there is no such issue. I do another 2 tests:
Test1: use spark3.0’s hive reader (running with spark.sql.hive.convertMetastoreOrc=false and spark.sql.orc.impl=hive) to read the same table, it only reads pruned columns.
Test2: use spark3.0 to write a table, then use spark3.0’s native reader to read this new table, it only reads pruned columns.
This issue I mentioned is a block we use native reader in spark3.0. Can anyone know further reason or provide solutions?
Attachments
Issue Links
- duplicates
-
SPARK-34897 Support reconcile schemas based on index after nested column pruning
- Resolved