Description
Avro.IO.BinaryDecoder.ReadString() fails for strings with length > 256, i.e. when the StackallocThreshold is exceeded.
This can be seen when serializing and subsequently deserializing a GenericRecord of schema
{ "type": "record", "name": "Foo", "fields": [ { "name": "x", "type": "string" } ] }
with a field x containing a string of length > 256, as done in the test case Test(257):
using System;
using System.IO;
using Avro;
using Avro.Generic;
using Avro.IO;
using Xunit;

public void Test(int n)
{
    var schema = (RecordSchema)Schema.Parse("{ \"type\":\"record\", \"name\":\"Foo\",\"fields\":[{\"name\":\"x\",\"type\":\"string\"}]}");
    var datum = new GenericRecord(schema);
    datum.Add("x", new String('x', n));

    // Serialize the record to a byte array.
    byte[] serialized;
    using (var ms = new MemoryStream())
    {
        var enc = new BinaryEncoder(ms);
        var writer = new GenericDatumWriter<GenericRecord>(schema);
        writer.Write(datum, enc);
        serialized = ms.ToArray();
    }

    // Deserialize it again and compare against the original record.
    using (var ms = new MemoryStream(serialized))
    {
        var dec = new BinaryDecoder(ms);
        var deserialized = new GenericRecord(schema);
        var reader = new GenericDatumReader<GenericRecord>(schema, schema);
        reader.Read(deserialized, dec);
        Assert.Equal(datum, deserialized);
    }
}
which yields the following exception:

Avro.AvroException: End of stream reached
   at Avro.IO.BinaryDecoder.Read(Span`1 buffer)
   at Avro.IO.BinaryDecoder.ReadString()
   at Avro.Generic.PreresolvingDatumReader`1.<>c.<ResolveReader>b__21_1(Decoder d)
   at Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass37_0.<Read>b__0(Object r, Decoder d)
   at Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass23_1.<ResolveRecord>b__2(Object rec, Decoder d)
   at Avro.Generic.PreresolvingDatumReader`1.ReadRecord(Object reuse, Decoder decoder, RecordAccess recordAccess, IEnumerable`1 readSteps)
   at Avro.Generic.PreresolvingDatumReader`1.<>c__DisplayClass23_0.<ResolveRecord>b__0(Object r, Decoder d)
   at Avro.Generic.PreresolvingDatumReader`1.Read(T reuse, Decoder decoder)
   at AvroTests.AvroTests.Test(Int32 n) in C:\Users\l.heimberg\Source\Repos\AvroTests\AvroTests\AvroTests.cs:line 41
The cause appears to be the following: when a string of length <= StackallocThreshold (= 256) is read, the buffer into which the string content is read from the stream is allocated on the stack with exactly the string's length. If the length is > StackallocThreshold, the buffer is instead obtained from ArrayPool<byte>.Shared.Rent(length), which guarantees a buffer of at least 'length' bytes but may return a larger one.
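For illustration, a minimal self-contained sketch of this size mismatch; the printed value depends on the pool's bucketing, so the 512 is an assumption about the default shared pool:

using System;
using System.Buffers;

class RentDemo
{
    static void Main()
    {
        // Rent only guarantees a minimum size; the default shared pool
        // typically rounds up to the next power-of-two bucket.
        byte[] buffer = ArrayPool<byte>.Shared.Rent(257);
        Console.WriteLine(buffer.Length); // e.g. 512, not 257
        ArrayPool<byte>.Shared.Return(buffer);
    }
}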
The string content is read from the input stream via the Read(Span<byte> buffer) method. This method always attempts to read as many bytes from the input stream as the buffer is long, and in particular fails with the exception shown above when the stream does not have enough data left. Thus, if the string has expected length > StackallocThreshold and the buffer obtained from ArrayPool<byte>.Shared.Rent(length) is larger than 'length', the Read method will either throw the above AvroException (when the string is the last element in the stream) or silently consume parts of the data items that follow in the stream; in either case the data is corrupted.
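A small self-contained sketch of the over-consumption; the names and sizes are illustrative and this is not BinaryDecoder's actual code:

using System;
using System.Buffers;
using System.IO;

class OverReadDemo
{
    static void Main()
    {
        // 257 bytes of string payload, followed by 100 bytes that belong
        // to the next data item in the stream.
        var ms = new MemoryStream();
        ms.Write(new byte[257], 0, 257);
        ms.Write(new byte[100], 0, 100);
        ms.Position = 0;

        // Filling the whole rented buffer (likely 512 bytes) also swallows
        // the following item's bytes.
        byte[] rented = ArrayPool<byte>.Shared.Rent(257);
        int consumed = ms.Read(rented, 0, rented.Length);
        Console.WriteLine(consumed); // typically 357 on the default pool, not 257
        ArrayPool<byte>.Shared.Return(rented);
    }
}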
The provided patch turns the byte array returned by the ArrayPool into a Span<byte> of the correct length using the Slice method, instead of implicitly converting the whole array to Span<byte>.
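A minimal sketch of the fixed pattern, assuming a stream-backed decoder; the names (ReadStringSketch, the looped Stream.Read) are hypothetical and not the actual patch:

using System;
using System.Buffers;
using System.IO;
using System.Text;

static class ReadStringSketch
{
    public static string ReadString(Stream stream, int length)
    {
        byte[] rented = ArrayPool<byte>.Shared.Rent(length); // may be larger than 'length'
        try
        {
            // Slice down to exactly 'length' bytes so we never attempt to
            // fill the pool's extra capacity from the stream.
            Span<byte> buffer = rented.AsSpan(0, length);
            int offset = 0;
            while (offset < length)
            {
                int n = stream.Read(buffer.Slice(offset));
                if (n == 0) throw new EndOfStreamException();
                offset += n;
            }
            return Encoding.UTF8.GetString(buffer);
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }
}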
Possibly related: https://github.com/confluentinc/confluent-kafka-dotnet/issues/1398#issuecomment-748171083
Attachments
Issue Links
- is related to: AVRO-2983 BinaryDecoder on NetStandard 2.1+ Fails To Read Large Strings (Resolved)