Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
Drill doesn't support multiple columns with the same name within a batch. When doing a join where there are matching column names, the planner inserts a project to rename one of the columns and avoid this conflict.
However, it appears that there is some case-sensitive matching somewhere in the code path, because there are some cases where this rewrite does not happen:
For example, this query does do the column name change (see 01-03):
0: jdbc:drill:> explain plan for select n3.n_name from (select n2.n_name from cp.`tpch/nation.parquet` n1, cp.`tpch/nation.parquet` n2 where n1.n_name = n2.n_name) n3 join cp.`tpch/nation.parquet` n4 on n3.n_name = n4.n_name;
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      UnionExchange
01-01        Project(n_name=[$0])
01-02          HashJoin(condition=[=($0, $1)], joinType=[inner])
01-04            HashToRandomExchange(dist0=[[$0]])
02-01              Project(n_name=[$1])
02-02                HashJoin(condition=[=($0, $1)], joinType=[inner])
02-04                  HashToRandomExchange(dist0=[[$0]])
04-01                    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/nation.parquet]], selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`n_name`]]])
02-03                  Project(n_name0=[$0])
02-05                    HashToRandomExchange(dist0=[[$0]])
05-01                      Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/nation.parquet]], selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`n_name`]]])
01-03            Project(n_name0=[$0])
01-05              HashToRandomExchange(dist0=[[$0]])
03-01                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/nation.parquet]], selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`n_name`]]])
But if I change one of the letters in one of the identifiers to uppercase, the rename goes away:
0: jdbc:drill:> explain plan for select n3.n_name from (select n2.n_name from cp.`tpch/nation.parquet` n1, cp.`tpch/nation.parquet` n2 where n1.N_name = n2.n_name) n3 join cp.`tpch/nation.parquet` n4 on n3.n_name = n4.n_name;
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      UnionExchange
01-01        Project(n_name=[$0])
01-02          HashJoin(condition=[=($0, $1)], joinType=[inner])
01-04            HashToRandomExchange(dist0=[[$0]])
02-01              Project(n_name=[$1])
02-02                HashJoin(condition=[=($0, $1)], joinType=[inner])
02-04                  HashToRandomExchange(dist0=[[$0]])
04-01                    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/nation.parquet]], selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`N_name`]]])
02-03                  Project(N_name0=[$0])
02-05                    HashToRandomExchange(dist0=[[$0]])
05-01                      Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/nation.parquet]], selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`N_name`]]])
01-03            HashToRandomExchange(dist0=[[$0]])
03-01              Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/nation.parquet]], selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`N_name`]]])
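To illustrate the suspected cause, here is a minimal sketch of how a case-sensitive duplicate check could skip the rename. It is not Drill's planner code: the class `ColumnNameUniquifier` and its `uniquify` method are hypothetical, and it assumes that column names within a batch are resolved case-insensitively, which the behavior above suggests but this report does not confirm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not taken from the Drill code base.
public class ColumnNameUniquifier {

    // Returns the input names with a numeric suffix appended to any name
    // that collides with an earlier one. The caseSensitive flag controls
    // whether "n_name" and "N_name" count as a collision.
    static List<String> uniquify(List<String> names, boolean caseSensitive) {
        Set<String> seen = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (String name : names) {
            String candidate = name;
            int suffix = 0;
            // Keep appending suffixes until the comparison key is unused.
            while (!seen.add(key(candidate, caseSensitive))) {
                candidate = name + suffix++;
            }
            result.add(candidate);
        }
        return result;
    }

    // Normalizes a name to the form used for duplicate detection.
    private static String key(String name, boolean caseSensitive) {
        return caseSensitive ? name : name.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Column names arriving at a join from its two inputs.
        List<String> joinInputs = Arrays.asList("n_name", "N_name");

        // Case-sensitive check: no collision is seen, so nothing is renamed.
        System.out.println(uniquify(joinInputs, true));   // [n_name, N_name]

        // Case-insensitive check: the second column is renamed.
        System.out.println(uniquify(joinInputs, false));  // [n_name, N_name0]
    }
}
```

Under that assumption, the duplicate check would need to use the same case-insensitive comparison as the batch, so that the two stages agree on which names collide.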
Running this query without the rewrite results in failure:
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
at java.util.ArrayList.rangeCheck(ArrayList.java:604) ~[na:1.7.0_21]
at java.util.ArrayList.get(ArrayList.java:382) ~[na:1.7.0_21]
at org.apache.drill.exec.record.VectorContainer.getValueAccessorById(VectorContainer.java:252) ~[drill-java-exec-0.7.0-incubating-SNAPSHOT-rebuffed.jar:0.7.0-incubating-SNAPSHOT]
at org.apache.drill.exec.record.AbstractRecordBatch.getValueAccessorById(AbstractRecordBatch.java:153) ~[drill-java-exec-0.7.0-incubating-SNAPSHOT-rebuffed.jar:0.7.0-incubating-SNAPSHOT]
at org.apache.drill.exec.test.generated.HashJoinProbeGen249.doSetup(HashJoinProbeTemplate.java:46) ~[na:na]
at org.apache.drill.exec.test.generated.HashJoinProbeGen249.setupHashJoinProbe(HashJoinProbeTemplate.java:97) ~[na:na]
at org.apache.drill.exec.physical.impl.join.HashJoinBatch.innerNext(HashJoinBatch.java:226) ~[drill-java-exec-0.7.0-incubating-SNAPSHOT-rebuffed.jar:0.7.0-incubating-SNAPSHOT]