Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
Based on reschke's comment: we are treating several content types incorrectly. org.apache.hc.core5.http.ContentType defines several content types with a default charset, although they are by definition UTF-8, do not have a charset parameter, or use another form of transport encoding. Affected are:
public static final ContentType APPLICATION_FORM_URLENCODED = create(
        "application/x-www-form-urlencoded", StandardCharsets.ISO_8859_1);
public static final ContentType APPLICATION_JSON = create(
        "application/json", StandardCharsets.UTF_8);
public static final ContentType APPLICATION_NDJSON = create(
        "application/x-ndjson", StandardCharsets.UTF_8);
public static final ContentType APPLICATION_PDF = create(
        "application/pdf", StandardCharsets.UTF_8);
public static final ContentType APPLICATION_PROBLEM_JSON = create(
        "application/problem+json", StandardCharsets.UTF_8);
public static final ContentType MULTIPART_FORM_DATA = create(
        "multipart/form-data", StandardCharsets.ISO_8859_1);
public static final ContentType MULTIPART_MIXED = create(
        "multipart/mixed", StandardCharsets.ISO_8859_1);
public static final ContentType MULTIPART_RELATED = create(
        "multipart/related", StandardCharsets.ISO_8859_1);
public static final ContentType TEXT_HTML = create(
        "text/html", StandardCharsets.ISO_8859_1);
public static final ContentType TEXT_EVENT_STREAM = create(
        "text/event-stream", StandardCharsets.UTF_8);
- application/x-www-form-urlencoded: Does not have a charset parameter: https://www.iana.org/assignments/media-types/application/x-www-form-urlencoded. The WHATWG URL Standard defines how to apply an alternative encoding (https://url.spec.whatwg.org/#urlencoded-serializing), but UTF-8 is the standard.
- application/json, application/x-ndjson, application/problem+json: There is no charset definition because JSON is always UTF-8. The charset parameter has no meaning: https://datatracker.ietf.org/doc/html/rfc8259#section-11
- application/pdf: This is a binary format; no charset applies.
- text/event-stream: Always defined as UTF-8: https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events-intro
- text/html: https://html.spec.whatwg.org/ does not define ISO-8859-1 as the default encoding. It says that the encoding must be supplied by some means, and an algorithm is applied to determine it. UTF-8 seems to be expected these days.
- multipart/mixed: Does not have a charset parameter; it is up to the parts to supply the proper encoding for byte-to-char conversion: https://datatracker.ietf.org/doc/html/rfc2046
- multipart/related: Does not have a charset parameter; it is up to the parts to supply the proper encoding for byte-to-char conversion: https://datatracker.ietf.org/doc/html/rfc2387
- multipart/form-data: Does not have a charset parameter; the RFC defines a _charset_ form field for that: https://datatracker.ietf.org/doc/html/rfc7578#section-4.6
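To illustrate the list above, here is a minimal sketch of how a Content-Type header value could be rendered so that "; charset=" is only emitted where the parameter has a meaning. This is not a proposed patch and not HttpCore API: the class ContentTypeHeaderSketch, the method headerValue and the set NO_CHARSET_PARAM are hypothetical names, and treating text/event-stream as "never send a charset" is an assumption on my part.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, NOT part of HttpCore: decides whether a "; charset="
// parameter is appended to a Content-Type header value.
final class ContentTypeHeaderSketch {

    // Media types from the list above for which a charset parameter is
    // undefined or meaningless. Including text/event-stream is an assumption.
    private static final Set<String> NO_CHARSET_PARAM = Set.of(
            "application/x-www-form-urlencoded",
            "application/json",
            "application/x-ndjson",
            "application/problem+json",
            "application/pdf",
            "text/event-stream",
            "multipart/mixed",
            "multipart/related",
            "multipart/form-data");

    // Returns the header value, omitting the charset parameter when no charset
    // is given or when the media type does not define one.
    static String headerValue(String mimeType, Charset charset) {
        String normalized = mimeType.toLowerCase(Locale.ROOT);
        if (charset == null || NO_CHARSET_PARAM.contains(normalized)) {
            return normalized;
        }
        return normalized + "; charset=" + charset.name();
    }

    public static void main(String[] args) {
        System.out.println(headerValue("application/json", StandardCharsets.UTF_8));
        System.out.println(headerValue("text/html", StandardCharsets.UTF_8));
    }
}
```

With this sketch, application/json is rendered without any parameter, while text/html still carries an explicitly supplied charset.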
The charset applies to the transport layer only and never to the semantics of the content type, e.g., for application/x-www-form-urlencoded.
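A small sketch of why this matters for application/x-www-form-urlencoded, using only java.net.URLEncoder from the JDK (the Charset overload needs Java 10 or later). The class FormEncodingSketch and the method encodePair are hypothetical names used for illustration. The serialized body is pure ASCII whatever encoding was used for the percent-escapes, so a charset on the wire cannot tell the receiver how the escapes were produced.

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of HttpClient/HttpCore.
final class FormEncodingSketch {

    // Serializes one name/value pair; the charset only controls which bytes
    // end up behind the percent-escapes.
    static String encodePair(String name, String value, Charset charset) {
        return URLEncoder.encode(name, charset) + "=" + URLEncoder.encode(value, charset);
    }

    public static void main(String[] args) {
        String value = "Z\u00fcrich"; // "Zürich", written with an escape to stay source-encoding neutral

        String utf8 = encodePair("city", value, StandardCharsets.UTF_8);
        String latin1 = encodePair("city", value, StandardCharsets.ISO_8859_1);

        System.out.println(utf8);   // city=Z%C3%BCrich
        System.out.println(latin1); // city=Z%FCrich

        // The body itself is ASCII-only, so its bytes are identical under
        // US-ASCII, ISO-8859-1 and UTF-8.
        boolean sameBytes = Arrays.equals(
                utf8.getBytes(StandardCharsets.US_ASCII),
                utf8.getBytes(StandardCharsets.UTF_8));
        System.out.println(sameBytes); // true
    }
}
```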
Issue Links
- is related to
  - HTTPCLIENT-2144 encoding of body changes during redirect with status code 307 (Resolved)
  - HTTPCLIENT-2325 Avoid adding "; charset=" for multipart/form-data requests (Resolved)
I don't know how to properly solve this for now, but it needs to be addressed.