Unicode strings are accepted as bytes and fixed type by perl API

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.12.0
    • Component/s: perl
    • Labels: None

      AVRO-1517 - Perl: Raise error when attempting to serialize strings with ordinal values >255 to 'bytes' or 'fixed' types.

    Description

      By default in perl, a string is a sequence of bytes with values 0-255. However, if a string includes a character that cannot be represented in a single byte, perl 'upgrades' it to a non-byte Unicode string that allows ordinals outside that range. When a string operation mixes byte and non-byte strings, the byte string is first upgraded and the result is always non-byte. Upgrading consists of utf8-encoding the string and setting its utf8 flag. ('utf8' is the variant of UTF-8 that perl uses internally.)
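
      The upgrade behavior described above can be observed with the built-in utf8::is_utf8 function. This is only a minimal sketch: the variable names are illustrative and nothing here comes from the Avro code base.

```perl
use strict;
use warnings;

my $bytes = "\xE9";          # one character, ordinal 233, held as a byte string
my $wide  = "\x{263A}";      # ordinal 9786 does not fit in a byte, so the utf8 flag is set

# Mixing a byte string with a non-byte string yields a non-byte result.
my $joined = $bytes . $wide;

printf "bytes flagged:  %s\n", utf8::is_utf8($bytes)  ? "yes" : "no";
printf "wide flagged:   %s\n", utf8::is_utf8($wide)   ? "yes" : "no";
printf "joined flagged: %s\n", utf8::is_utf8($joined) ? "yes" : "no";
printf "joined length:  %d characters\n", length $joined;
```

      Note that length counts characters, not bytes, so the upgraded string still reports two characters even though its internal representation is longer.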

      The perl Avro API accepts these Unicode strings as-is for the 'bytes' and 'fixed' types. This is a problem because 1) values >255 are not valid as bytes, and encoding characters into bytes is the caller's job, not the API's; and 2) as Avro assembles the serialized data, perl 'upgrades' the whole buffer, which has the effect of utf8-encoding the serialized binary data.
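
      The second problem can be illustrated without Avro at all. In the sketch below, $header is a hypothetical stand-in for binary data that has already been serialized, and utf8::encode is used as an assumption to model what reaches a file or socket once the buffer has been upgraded.

```perl
use strict;
use warnings;

my $header  = "\xC3\x01";    # hypothetical serialized binary data, 2 bytes
my $payload = "\x{100}";     # caller supplied ordinal 256 for a 'bytes' field

# Appending the wide payload upgrades the whole buffer.
my $buffer = $header . $payload;

# Model the bytes that would be written out: the utf8 encoding of every character.
my $on_wire = $buffer;
utf8::encode($on_wire);

printf "header bytes:  %s\n", unpack("H*", $header);
printf "on-wire bytes: %s\n", unpack("H*", $on_wire);
```

      The original header byte C3 comes out as the two bytes C3 83, so data that was serialized correctly before the wide string arrived is corrupted along with it.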

      The correct behavior is for the perl Avro API to attempt to downgrade the string and, if this fails because it contains values >255, to raise an error. (The behavior of 'string' won't change; it will still accept Unicode strings as expected.)
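
      A minimal sketch of that downgrade-or-fail check follows. The helper name to_avro_bytes and its error message are hypothetical; this is not the code from the attached patch, only one way the described behavior could look using the built-in utf8::downgrade.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper: return a byte-string copy of $value, or raise an error.
sub to_avro_bytes {
    my ($value) = @_;
    my $copy = $value;                 # work on a copy, leave the caller's scalar alone
    utf8::downgrade($copy, 1)          # true second argument: return false instead of dying
        or croak "cannot serialize as bytes: string contains ordinals > 255";
    return $copy;
}

# An upgraded string whose ordinals all fit in a byte downgrades cleanly.
my $upgraded = "caf\xE9";
utf8::upgrade($upgraded);
my $ok = to_avro_bytes($upgraded);

# A string with an ordinal above 255 cannot be downgraded and raises an error.
my $died = !eval { to_avro_bytes("\x{263A}"); 1 };
my $error = $@;
```

      Downgrading a copy keeps the check free of side effects on the caller's data, which seems the safer choice for a serialization API.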

      Attachments

        1. AVRO-1517.patch
          8 kB
          John Karp

          People

            Assignee: jjatria José Joaquín Atria
            Reporter: jkarp John Karp