Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: Impala 2.3.0
Fix Version/s: None
Description
When scanning a deeply nested Avro file, Impala gets into infinite recursion and the query hangs. The query cannot be cancelled and continues to consume 100% of one CPU core; the only remedy is to restart the impalad.
Skye, I had applied Patch Set 4 of your Avro scanner CR.
Steps to repro:
1. Copy the attached Parquet data file to a local dir
2. Copy the file somewhere into a new HDFS dir (assuming /test-warehouse/max_depth/ below)
3. In Impala, create a Parquet table using that file:
create external table max_depth_parquet
like parquet '/test-warehouse/max_depth/max_depth.parq'
stored as parquet
location '/test-warehouse/max_depth/';
4. In Hive, create an Avro table from that Parquet table:
create table max_depth_avro stored as avro as select * from max_depth_parquet;
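For context on how a schema graph can contain a cycle at all: the Avro specification allows a named record type to reference itself from one of its fields. A minimal, hypothetical example of such a self-referential schema (not the schema of the attached file):

```json
{
  "type": "record",
  "name": "Node",
  "fields": [
    {"name": "value", "type": "int"},
    {"name": "next", "type": ["null", "Node"], "default": null}
  ]
}
```

A schema converter that recurses into child types without tracking what it has already visited will never terminate on a schema like this.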
During my initial investigation I found the following:
The query hangs in AvroSchemaElement::ConvertSchema(), called from HdfsScanNode::Prepare().
I added some logging in AvroSchemaElement::ConvertSchema() to print the pointers of the traversed child elements; there appears to be a cycle in the schema graph, because the traversed pointers repeat after some number of recursive calls.