HTTPCORE-757 (HttpComponents HttpCore)

AbstractCharDataConsumer jams up with incomplete UTF-8 data

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.2.2
    • Fix Version/s: 5.2.3, 5.3-alpha1
    • Component/s: None
    • Labels: None

    Description

      While streaming UTF-8-encoded data with the async HTTP client, we observed the following behaviour:

      • After several minutes of consuming from our stream, the client jammed up permanently and did not recover without a restart

      Upon closer inspection, we realised that `AbstractCharDataConsumer` (which we were extending to parse our data) was being handed buffers that ended partway through a multi-byte UTF-8 character (i.e. the last character received so far was multi-byte and not all of its bytes had arrived yet), and this sent it into an infinite loop in the following code:

      @Override
      public final void consume(final ByteBuffer src) throws IOException {
          final CharsetDecoder charsetDecoder = getCharsetDecoder();
          while (src.hasRemaining()) {
              checkResult(charsetDecoder.decode(src, charBuffer, false));
              doDecode(false);
          }
      }
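
To illustrate the mechanism outside the library, here is a minimal standalone sketch (not HttpCore code; the class and method names are made up for this example). When `CharsetDecoder.decode` is called with `endOfInput` set to `false` and the input ends in an incomplete multi-byte sequence, it reports underflow and leaves those bytes unconsumed, so a loop conditioned on `src.hasRemaining()` never makes progress. The sketch caps the number of attempts so that it terminates.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of HttpCore.
class IncompleteUtf8Demo {

    // Calls decode(..., false) up to maxAttempts times, using the same loop
    // condition as consume(), and returns how many bytes are still undecoded.
    static int remainingAfter(final byte[] bytes, final int maxAttempts) {
        final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        final ByteBuffer src = ByteBuffer.wrap(bytes);
        final CharBuffer dst = CharBuffer.allocate(16);
        for (int i = 0; i < maxAttempts && src.hasRemaining(); i++) {
            final CoderResult result = decoder.decode(src, dst, false);
            if (!result.isUnderflow()) {
                throw new IllegalStateException("Unexpected result: " + result);
            }
        }
        return src.remaining();
    }

    public static void main(final String[] args) {
        // 'a' followed by only the first two of the three bytes of U+20AC (E2 82 AC)
        final byte[] truncated = {'a', (byte) 0xE2, (byte) 0x82};
        System.out.println(remainingAfter(truncated, 1000)); // prints 2
    }
}
```

However many times the decoder is invoked, the two trailing bytes stay in the buffer until the rest of the character arrives.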

      This was fairly time-consuming to figure out and required us to go deep into the brain of the library.

      We don't know how this could be improved exactly, but a couple of thoughts:

      • If this class expects a completely valid text string in the buffer with no trailing bytes:
        • Then it should throw some exception once it detects that it's failing to completely process the buffer
        • And the caller could deal with this somehow (either by catching this exception and waiting for more data, or otherwise ensuring that the input is valid before calling the consumer - though it's not clear how it could do that without also having knowledge of the encoding)
        • Alternatively, the caller could simply bubble up the exception and let us know that we shouldn't be using this class when there is only partial data. That would also have helped us to diagnose the issue
      • OTOH if this class is expected to be able to handle partially complete input:
        • Then it should store the trailing unprocessable bytes into a buffer, and prepend them to the beginning of the next input (hopefully resulting in a valid UTF-8 string, though it would also have to handle the case where it didn't)
        • This was roughly how we solved the issue on our side - we extended `AbstractBinDataConsumer` instead and handled the encoding ourselves
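
The second option (keeping the trailing bytes and prepending them to the next input) could look roughly like the sketch below. It uses only `java.nio`. `CarryOverUtf8Decoder`, `feed` and `finish` are hypothetical names for illustration, not HttpCore API and not necessarily how the fix in 5.2.3 works. It assumes UTF-8 and the decoder's default behaviour of reporting malformed input.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HttpCore.
class CarryOverUtf8Decoder {

    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    // Undecoded tail of the previous chunk (an incomplete multi-byte sequence).
    private byte[] carry = new byte[0];

    // Decodes one chunk and returns the characters that are complete so far.
    String feed(final ByteBuffer chunk) throws CharacterCodingException {
        // Prepend the bytes left over from the previous call.
        final ByteBuffer in = ByteBuffer.allocate(carry.length + chunk.remaining());
        in.put(carry).put(chunk).flip();
        // UTF-8 never produces more chars than it has input bytes.
        final CharBuffer out = CharBuffer.allocate(in.remaining());
        final CoderResult result = decoder.decode(in, out, false);
        if (result.isError()) {
            // Malformed input, as opposed to merely incomplete input.
            result.throwException();
        }
        // Whatever is still unread is an incomplete trailing sequence: keep it.
        carry = new byte[in.remaining()];
        in.get(carry);
        out.flip();
        return out.toString();
    }

    // Call at end of stream; leftover bytes mean it was cut off mid-character.
    void finish() throws CharacterCodingException {
        if (carry.length > 0) {
            throw new MalformedInputException(carry.length);
        }
    }
}
```

Feeding the bytes `61 E2 82` and then `AC 62` returns "a" from the first call and "€b" from the second. A subclass of `AbstractBinDataConsumer` could delegate to a helper like this from its data callback, which is close to the workaround described above.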

          People

            Assignee: Unassigned
            Reporter: Simon White (simon.white)