Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 1.10.0
- Fix Version/s: None
- Component/s: None
Description
I am trying to read a Parquet file in Scala using the Avro interface (parquet-avro 1.10.0). The file was also generated using the same interface.
The data that I am writing looks like this:
case class Inner(b: Array[Int])
case class Outer(a: Array[Inner])

val data = Outer(
  Array(
    Inner(Array(1, 2)),
    Inner(Array(3, 4))
  )
)
Reading the file with parquet-tools looks like this:
$ parquet-tools cat /tmp/test.parquet
a:
.array:
..b:
...array = 1
...array = 2
.array:
..b:
...array = 3
...array = 4
But when I try to read the file I get the following exception:
Exception in thread "main" org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch: Avro field 'array' not found
    at org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:225)
    at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:130)
    at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:279)
    at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:232)
    at org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:78)
    at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.<init>(AvroRecordConverter.java:536)
    at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:486)
    at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:289)
    at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)
    at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
    at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
    at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)
    at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
    at raw.runtime.writer.parquet.avro.Lixo$.main(Lixo.scala:78)
    at raw.runtime.writer.parquet.avro.Lixo.main(Lixo.scala)
This is the code used to generate this file:
val filename = "/tmp/test.parquet"
val path = Paths.get(filename).toFile
val conf = new Configuration()

val schema: Schema = {
  val inner = Schema.createRecord("inner", "some doc", "outer", false,
    List(new Schema.Field("b", Schema.createArray(Schema.create(Schema.Type.INT)), "", null: Object)).asJava)
  Schema.createRecord("outer", "", "", false,
    List(new Schema.Field("a", Schema.createArray(inner), "", null: Object)).asJava)
}

val os = new FileOutputStream(path)
val outputFile = new RawParquetOutputFile(os)
val parquetWriter: ParquetWriter[GenericRecord] =
  AvroParquetWriter.builder[GenericRecord](outputFile)
    .withConf(conf)
    .withSchema(schema)
    .build()

val data = Outer(
  Array(
    Inner(Array(1, 2)),
    Inner(Array(3, 4))
  )
)

val record = new GenericData.Record(schema)
val fieldA = schema.getField("a").schema()
val recorData = {
  val fieldAType = fieldA.getElementType()
  data.a.map { x =>
    val innerRecord = new GenericData.Record(fieldAType)
    innerRecord.put("b", x.b)
    innerRecord
  }
}
record.put("a", recorData)

parquetWriter.write(record)
parquetWriter.close()
os.close()
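For reference, the reader side that triggers the exception is not shown above (it is only visible in the stack trace at Lixo.scala:78). A minimal sketch of what such a reader typically looks like with parquet-avro, assuming a local file opened via HadoopInputFile, would be:

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.util.HadoopInputFile

val conf = new Configuration()
val inputFile = HadoopInputFile.fromPath(new Path("/tmp/test.parquet"), conf)

// The InvalidRecordException above is thrown from reader.read(), while the
// Avro record materializer is being built against the file's list structure.
val reader = AvroParquetReader.builder[GenericRecord](inputFile)
  .withConf(conf)
  .build()

var record: GenericRecord = reader.read()
while (record != null) {
  println(record)
  record = reader.read()
}
reader.close()
```

This sketch reads records until `read()` returns null; the exact reader in the report may differ, but any reader going through AvroReadSupport hits the same code path shown in the stack trace.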
Also, if I pass the configuration option
parquet.avro.add-list-element-records = false
I get a different exception:
org.apache.avro.SchemaParseException: Can't redefine: list
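For reference, this option is presumably set on the reader's Hadoop Configuration; the key corresponds to AvroSchemaConverter.ADD_LIST_ELEMENT_RECORDS. A sketch (not necessarily the exact code used in the report):

```scala
import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
// Disable the backward-compatibility rule that treats the repeated
// "array" group inside a LIST as an element record.
conf.setBoolean("parquet.avro.add-list-element-records", false)
```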
Am I doing something wrong?
Issue Links
- is duplicated by: PARQUET-1441 "SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter" (Resolved)