Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
Given an Avro schema with a repeated record type (an array of records), e.g.:
    [
      {
        "name": "RecordWithNestedFieldTypes",
        "namespace": "org.apache.parquet.avro",
        "type": "record",
        "fields": [
          {
            "name": "nested_record_array",
            "type": {
              "type": "array",
              "items": {
                "name": "NestedRecord",
                "namespace": "org.apache.parquet.avro",
                "type": "record",
                "fields": [
                  { "name": "int_field", "type": "int" },
                  { "name": "string_field", "type": ["null", "string"] }
                ]
              }
            }
          }
        ]
      }
    ]
AvroParquetReader fails if you project a single field of the nested array's record type, with:
    java.lang.ClassCastException: optional binary string_field (STRING) is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:247)
        at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:359)
        at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:303)
        at org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:76)
        at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.<init>(AvroRecordConverter.java:613)
        at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:562)
        at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:367)
        at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:143)
Stepping through with a debugger, this appears to happen because the #isElementType check guesses whether the array's repeated type is itself the element record by checking if fieldCount > 1: https://github.com/apache/parquet-mr/blob/945836c79b5bd3003512ace9e2d30d4cd03422f3/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L932 . So, if you project a record down to fieldCount == 1, the check treats the record as a wrapper and collapses it into its single field type. The converter then calls asGroupType() on that primitive field (string_field here), which throws the ClassCastException above.
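To make the failure mode concrete, here is a minimal sketch of the field-count heuristic described above. It is not the parquet-mr code: `ElementTypeHeuristicSketch`, `Node`, `guessIsElementType` and `resolveElement` are hypothetical names, and the model covers only the "fieldCount > 1" rule mentioned in this report, ignoring whatever other conditions the real check applies.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ElementTypeHeuristicSketch {

    /** Hypothetical stand-in for a Parquet schema type (primitive or group). */
    static final class Node {
        final String name;
        final boolean primitive;
        final List<Node> fields; // empty for primitives

        private Node(String name, boolean primitive, List<Node> fields) {
            this.name = name;
            this.primitive = primitive;
            this.fields = fields;
        }

        static Node primitive(String name) {
            return new Node(name, true, Collections.<Node>emptyList());
        }

        static Node group(String name, Node... fields) {
            return new Node(name, false, Arrays.asList(fields));
        }
    }

    /**
     * Simplified version of the guess described in the report: a repeated
     * group with more than one field is assumed to be the element record;
     * a group with exactly one field is assumed to be a wrapper.
     */
    static boolean guessIsElementType(Node repeated) {
        return repeated.primitive || repeated.fields.size() > 1;
    }

    /** Returns the type the heuristic would hand to the element converter. */
    static Node resolveElement(Node repeated) {
        return guessIsElementType(repeated) ? repeated : repeated.fields.get(0);
    }

    public static void main(String[] args) {
        Node intField = Node.primitive("int_field");
        Node stringField = Node.primitive("string_field");

        // Full record: two fields, so the group itself is the element.
        Node full = Node.group("NestedRecord", intField, stringField);
        // Projected record: one field, so the heuristic unwraps it.
        Node projected = Node.group("NestedRecord", stringField);

        System.out.println(resolveElement(full).name);
        System.out.println(resolveElement(projected).name);
    }
}
```

In this sketch the full record resolves to the group itself, while the single-field projection resolves to the primitive `string_field`. A converter that expects a record and asks that result for its group view would fail in the same way as the stack trace shows.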
Repro: you can run the following test in `TestSpecificReadWrite`:
    @Test
    public void testNestedProjectionSingleField() throws IOException {
      Path path = writeCarsToParquetFile(1, CompressionCodecName.UNCOMPRESSED, false);
      Configuration conf = new Configuration(testConf);
      Schema schema = Car.getClassSchema();

      // Project a single field from nested schema
      List<Schema.Field> projectedFields = new ArrayList<Schema.Field>();
      projectedFields.add(new Schema.Field(
          "serviceHistory",
          Schema.createUnion(
              Schema.create(Schema.Type.NULL),
              Schema.createArray(
                  SchemaBuilder.builder(schema.getNamespace())
                      .record("Service")
                      .fields()
                      .requiredString("mechanic")
                      .endRecord()))));

      Schema projectedSchema = Schema.createRecord(
          schema.getName(), schema.getDoc(), schema.getNamespace(), schema.isError());
      projectedSchema.setFields(projectedFields);
      AvroReadSupport.setRequestedProjection(conf, projectedSchema);

      try (ParquetReader<Car> reader = new AvroParquetReader<Car>(conf, path)) {
        for (Car car = reader.read(); car != null; car = reader.read()) {
          assertNotNull(car.getServiceHistory());
        }
      }
    }