PARQUET-2450

ParquetAvroReader throws exception projecting a single field of a repeated record type

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.14.0
    • Component/s: None
    • Labels: None

    Description

      Given an Avro schema with a repeated record type, e.g.:

      [
          {
            "name": "RecordWithNestedFieldTypes",
            "namespace": "org.apache.parquet.avro",
            "type": "record",
            "fields" : [
                {
                  "name" : "nested_record_array",
                  "type": {
                      "type": "array",
                      "items": {
                        "name": "NestedRecord",
                        "namespace": "org.apache.parquet.avro",
                        "type": "record",
                        "fields": [
                            {
                               "name": "int_field",
                               "type": "int"
                            },
                            {
                              "name": "string_field",
                              "type": ["null", "string"]
                            }
                        ]
                      }
                  }
               }
            ]
          }
      ] 

      Reading with AvroParquetReader fails if you project a single field of the nested array's element record, with:

      java.lang.ClassCastException: optional binary string_field (STRING) is not a group
      at org.apache.parquet.schema.Type.asGroupType(Type.java:247)
      at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:359)
      at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:303)
      at org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:76)
      at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.<init>(AvroRecordConverter.java:613)
      at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:562)
      at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:367)
      at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:143) 

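      Stepping through with a debugger, this appears to happen because the AvroRecordConverter#isElementType check guesses whether the repeated group is the array's element record by checking whether fieldCount > 1: https://github.com/apache/parquet-mr/blob/945836c79b5bd3003512ace9e2d30d4cd03422f3/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L932 . So if you project the element record down to a single field (fieldCount == 1), the converter treats the repeated group as a wrapper and collapses it into its single field's type. Building a record converter for that primitive field then fails in Type.asGroupType(), which matches the stack trace above.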
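
The ambiguity in the field-count check can be illustrated with a small standalone sketch. This is a simplified model, not the parquet-mr implementation: the `Group` class and the `looksLikeElementRecord` method are hypothetical stand-ins, and only the fieldCount > 1 condition described in this issue is modeled.

```java
import java.util.Arrays;
import java.util.List;

public class ElementTypeHeuristicSketch {

  /** Hypothetical stand-in for a Parquet group type: a name plus its child field names. */
  static final class Group {
    final String name;
    final List<String> fieldNames;

    Group(String name, String... fieldNames) {
      this.name = name;
      this.fieldNames = Arrays.asList(fieldNames);
    }
  }

  /**
   * Simplified model of the field-count check: a repeated group with more than
   * one field is assumed to be the element record itself; a group with exactly
   * one field is assumed to be a wrapper around the real element.
   */
  static boolean looksLikeElementRecord(Group repeatedGroup) {
    return repeatedGroup.fieldNames.size() > 1;
  }

  public static void main(String[] args) {
    // Full element record: two fields, so it is recognized as the element.
    Group full = new Group("array", "int_field", "string_field");
    // Same record projected down to one field: now indistinguishable,
    // by field count alone, from a single-field wrapper group.
    Group projected = new Group("array", "string_field");

    System.out.println(looksLikeElementRecord(full));      // prints "true"
    System.out.println(looksLikeElementRecord(projected)); // prints "false"
  }
}
```

In this model the projected group is classified as a wrapper even though the requested Avro element schema is still a record, which is the mismatch the ClassCastException surfaces. A field count alone cannot separate the two cases, so any robust check would presumably also need to consult the requested element schema.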

       

      Repro: you can run the following test in `TestSpecificReadWrite`:

      @Test
      public void testNestedProjectionSingleField() throws IOException {
        Path path = writeCarsToParquetFile(1, CompressionCodecName.UNCOMPRESSED, false);
        Configuration conf = new Configuration(testConf);
        Schema schema = Car.getClassSchema();
      
        // Project a single field from nested schema
        List<Schema.Field> projectedFields = new ArrayList<Schema.Field>();
        projectedFields.add(new Schema.Field(
            "serviceHistory",
            Schema.createUnion(
                Schema.create(Schema.Type.NULL),
                Schema.createArray(
                  SchemaBuilder.builder(schema.getNamespace())
                      .record("Service")
                      .fields()
                      .requiredString("mechanic")
                      .endRecord()))));
      
        Schema projectedSchema =
            Schema.createRecord(schema.getName(), schema.getDoc(), schema.getNamespace(), schema.isError());
        projectedSchema.setFields(projectedFields);
        AvroReadSupport.setRequestedProjection(conf, projectedSchema);
      
        try (ParquetReader<Car> reader = new AvroParquetReader<Car>(conf, path)) {
          for (Car car = reader.read(); car != null; car = reader.read()) {
            assertNotNull(car.getServiceHistory());
          }
        }
      } 

            People

              Assignee: Claire McGinty (clairemcginty)
              Reporter: Claire McGinty (clairemcginty)