[HTTPCORE-757] AbstractCharDataConsumer jams up with incomplete UTF-8 data - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 5.2.2
Fix Version/s: 5.2.3, 5.3-alpha1
Component/s: None
Labels: None

Description

While streaming UTF-8-encoded data with the async HTTP client, we observed the following behaviour:

After several minutes of consuming from our stream, the client jammed up permanently and did not recover until it was restarted.

Upon closer inspection, we realised that `AbstractCharDataConsumer` (which we were extending to parse our data) was receiving an incomplete UTF-8 character at the end of the data received so far (i.e. the last character was multi-byte and we hadn't yet received all of its bytes). The decoder cannot consume a partial character, so `src.hasRemaining()` never became false and the following code went into an infinite loop:

@Override
public final void consume(final ByteBuffer src) throws IOException {
    final CharsetDecoder charsetDecoder = getCharsetDecoder();
    while (src.hasRemaining()) {
        checkResult(charsetDecoder.decode(src, charBuffer, false));
        doDecode(false);
    }
}
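
The stall can be reproduced outside the library with a plain `CharsetDecoder`. The following is a minimal sketch, not HttpCore code: the class name `IncompleteUtf8Demo` and the helper `bytesLeftAfterDecode` are made up for illustration. It shows that with `endOfInput` set to `false`, a truncated multi-byte sequence is left unread in the source buffer, so a loop guarded only by `src.hasRemaining()` cannot terminate on its own.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of HttpCore.
public class IncompleteUtf8Demo {

    /** Returns how many bytes the decoder leaves unread after a single decode call. */
    static int bytesLeftAfterDecode(final byte[] chunk) {
        final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        final ByteBuffer src = ByteBuffer.wrap(chunk);
        final CharBuffer dst = CharBuffer.allocate(16);
        // endOfInput = false, as in consume(): a truncated sequence is not an error.
        // The decoder reports UNDERFLOW and leaves the partial bytes in src.
        final CoderResult result = decoder.decode(src, dst, false);
        if (!result.isUnderflow()) {
            throw new IllegalStateException("Unexpected result: " + result);
        }
        return src.remaining();
    }

    public static void main(final String[] args) {
        // 'a' followed by the first two bytes of the three-byte encoding of U+20AC.
        final byte[] truncated = {0x61, (byte) 0xE2, (byte) 0x82};
        System.out.println("Unread bytes: " + bytesLeftAfterDecode(truncated));
    }
}
```

Calling `decode` again on the same buffer gives the same result, which is why the `while` loop above spins until more bytes are supplied.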

This was fairly time-consuming to figure out and required us to dig deep into the internals of the library.

We don't know how this could be improved exactly, but a couple of thoughts:

If this class expects a completely valid text string in the buffer with no trailing bytes:
- Then it should throw some exception once it detects that it's failing to completely process the buffer
- And the caller could deal with this somehow (either by catching this exception and waiting for more data, or otherwise ensuring that the input is valid before calling the consumer - though it's not clear how it could do that without also having knowledge of the encoding)
- Alternatively, the caller could simply bubble up the exception and let us know that we shouldn't be using this class when there is only partial data. That would also have helped us to diagnose the issue
On the other hand, if this class is expected to handle partially complete input:
- Then it should store the trailing unprocessable bytes into a buffer, and prepend them to the beginning of the next input (hopefully resulting in a valid UTF-8 string, though it would also have to handle the case where it didn't)
- This was roughly how we solved the issue on our side - we extended `AbstractBinDataConsumer` instead and handled the encoding ourselves
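
The second option (keep the trailing bytes and prepend them to the next input) could look roughly like the sketch below. This is an illustration only: `CarryOverUtf8Decoder`, `consume`, and `finish` are hypothetical names, the class does not extend `AbstractBinDataConsumer`, and it is neither the reporter's workaround nor necessarily the fix that went into the library. It assumes the whole decoded text can be collected in memory.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HttpCore.
public class CarryOverUtf8Decoder {

    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    private final CharBuffer chars = CharBuffer.allocate(1024);
    private final StringBuilder out = new StringBuilder();
    // Unread trailing bytes from the previous chunk (buffer is in read mode).
    private ByteBuffer pending = ByteBuffer.allocate(0);

    /** Decodes one chunk; src is always fully consumed, even if it ends mid-character. */
    public void consume(final ByteBuffer src) throws CharacterCodingException {
        // Prepend the bytes left over from the previous chunk.
        final ByteBuffer in = ByteBuffer.allocate(pending.remaining() + src.remaining());
        in.put(pending).put(src).flip();
        decodeAll(in, false);
        // Anything still unread is an incomplete sequence; keep it for the next chunk.
        pending = in;
    }

    /** Signals end of stream; dangling bytes are reported as malformed input. */
    public String finish() throws CharacterCodingException {
        decodeAll(pending, true);
        CoderResult result;
        do {
            result = decoder.flush(chars);
            drain();
        } while (result.isOverflow());
        decoder.reset();
        return out.toString();
    }

    private void decodeAll(final ByteBuffer in, final boolean endOfInput)
            throws CharacterCodingException {
        while (true) {
            final CoderResult result = decoder.decode(in, chars, endOfInput);
            drain();
            if (result.isError()) {
                result.throwException();
            }
            if (result.isUnderflow()) {
                return; // no further progress is possible with the bytes at hand
            }
            // OVERFLOW: chars was full and has just been drained, so decode again.
        }
    }

    private void drain() {
        chars.flip();
        out.append(chars);
        chars.clear();
    }
}
```

The loop exits on `UNDERFLOW` rather than on `hasRemaining()`, so a chunk that ends in the middle of a character returns normally, and the case where the next input does not complete the sequence surfaces as a `CharacterCodingException`.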

People

Assignee: Unassigned
Reporter: Simon White
Votes: 0
Watchers: 3

Dates

Created: 02/Sep/23 12:05
Updated: 20/Sep/23 17:13
Resolved: 15/Sep/23 09:19