Uploaded image for project: 'HttpComponents HttpClient'
  1. HttpComponents HttpClient
  2. HTTPCLIENT-2159

Invalid handling of charset content type parameter

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 5.4-alpha1
    • None
    • None

    Description

      Based on reschke's, comment. We are treating several content types incorrectly. We have in org.apache.hc.core5.http.ContentType several content types defined which are per definition UTF-8 and do not contain any charset parameter or have another form transport encoding. Affected are:

          public static final ContentType APPLICATION_FORM_URLENCODED = create(
                  "application/x-www-form-urlencoded", StandardCharsets.ISO_8859_1);
          public static final ContentType APPLICATION_JSON = create(
                  "application/json", StandardCharsets.UTF_8);
          public static final ContentType APPLICATION_NDJSON = create(
                  "application/x-ndjson", StandardCharsets.UTF_8);
          public static final ContentType APPLICATION_PDF = create(
                  "application/pdf", StandardCharsets.UTF_8);
          public static final ContentType APPLICATION_PROBLEM_JSON = create(
                  "application/problem+json", StandardCharsets.UTF_8);
          public static final ContentType MULTIPART_FORM_DATA = create(
                  "multipart/form-data", StandardCharsets.ISO_8859_1);
          public static final ContentType MULTIPART_MIXED = create(
                  "multipart/mixed", StandardCharsets.ISO_8859_1);
          public static final ContentType MULTIPART_RELATED = create(
                  "multipart/related", StandardCharsets.ISO_8859_1);
          public static final ContentType TEXT_HTML = create(
                  "text/html", StandardCharsets.ISO_8859_1);
          public static final ContentType TEXT_EVENT_STREAM = create(
                  "text/event-stream", StandardCharsets.UTF_8);
      

      charset applies to the transport layer only and never to the semantics of the content-type. E.g., application/x-www-form-urlencoded.

      Attachments

        Issue Links

          Activity

            michael-o Michael Osipov added a comment -

            I don't know how to properly solve this for now, but it needs to be addressed.

            michael-o Michael Osipov added a comment - I don't know how to properly solve this for now, but it needs to be addressed.
            reschke Julian Reschke added a comment -

            Maybe deprecate them, pointing to correct alternatives?

            reschke Julian Reschke added a comment - Maybe deprecate them, pointing to correct alternatives?
            michael-o Michael Osipov added a comment -

            The problem aren't those static finals, but how we process them. We treat them as opaque definitions without any further semantics. That's our problem.

            michael-o Michael Osipov added a comment - The problem aren't those static finals, but how we process them. We treat them as opaque definitions without any further semantics. That's our problem.
            michael-o Michael Osipov added a comment -

            RFC 7231, appendix B says:

            The default charset of ISO-8859-1 for text media types has been
            removed; the default is now whatever the media type definition says.
            Likewise, special treatment of ISO-8859-1 has been removed from the
            Accept-Charset header field. (Section 3.1.1.3 and Section 5.3.3)

            michael-o Michael Osipov added a comment - RFC 7231, appendix B says: The default charset of ISO-8859-1 for text media types has been removed; the default is now whatever the media type definition says. Likewise, special treatment of ISO-8859-1 has been removed from the Accept-Charset header field. (Section 3.1.1.3 and Section 5.3.3)
            michael-o Michael Osipov added a comment -

            According to the appendix B I'd inclined to move text/plain to UTF-8.

            reschke, maybe one can flag the charset as implicit or non-public and avoid its serialization on the header value?

            michael-o Michael Osipov added a comment - According to the appendix B I'd inclined to move text/plain to UTF-8. reschke , maybe one can flag the charset as implicit or non-public and avoid its serialization on the header value?
            reschke Julian Reschke added a comment -

            So the value is needed for a) Content-Type field values, and b) to make serialization decisions?

            reschke Julian Reschke added a comment - So the value is needed for a) Content-Type field values, and b) to make serialization decisions?
            michael-o Michael Osipov added a comment - - edited

            Yes, that is correct. Otherwise I see no clean way to transform a StringEntity to wire format properly, but the parameter won't appear on the wire. Do you see a better way?

            Also the how form handling stuff needs to be revised.

            michael-o Michael Osipov added a comment - - edited Yes, that is correct. Otherwise I see no clean way to transform a StringEntity to wire format properly, but the parameter won't appear on the wire. Do you see a better way? Also the how form handling stuff needs to be revised.
            reschke Julian Reschke added a comment - - edited

            Yes, adding a flag sounds like the right thing to do.

            Not sure about the default for text/plain, though... Moving to UTF-8 in general is good, but this might cause compat problems, no?

            reschke Julian Reschke added a comment - - edited Yes, adding a flag sounds like the right thing to do. Not sure about the default for text/plain, though... Moving to UTF-8 in general is good, but this might cause compat problems, no?
            michael-o Michael Osipov added a comment -

            Yes, maybe. That is just an idea. But frankly, ISO-8859-1 is basically dead and those who need it need to set it explicitly and for all character based contructors. Never rely on defaults. See JEP 400.

            michael-o Michael Osipov added a comment - Yes, maybe. That is just an idea. But frankly, ISO-8859-1 is basically dead and those who need it need to set it explicitly and for all character based contructors. Never rely on defaults. See JEP 400.

            WRT ContentType it sounds like for some the charset is configurable and for others, the charset is fixed and not configurable. At least ContentType is almost immutable (the charset is immutable), see HTTPCORE-745.

            ggregory Gary D. Gregory added a comment - WRT ContentType it sounds like for some the charset is configurable and for others, the charset is fixed and not configurable. At least ContentType is almost immutable (the charset is immutable), see HTTPCORE-745 .

            michael-o Can this issue be closed as fixed by https://github.com/apache/httpcomponents-core/pull/375 ?

            Oleg

            olegk Oleg Kalnichevski added a comment - michael-o Can this issue be closed as fixed by https://github.com/apache/httpcomponents-core/pull/375 ? Oleg
            michael-o Michael Osipov added a comment -

            olegk, no both are unrelated and this one is still fully valid.

            michael-o Michael Osipov added a comment - olegk , no both are unrelated and this one is still fully valid.
            abernal Arturo Bernal added a comment -

            Hi michael-o  reschke ,

            I’ve just made a PR to address this issue. The changes aim to mitigate the charset handling for media types that shouldn’t include a charset by default, as well as improve the overall handling of implicit charsets.

            Looking forward to your feedback!

             

            TY

            abernal Arturo Bernal added a comment - Hi michael-o   reschke , I’ve just made a PR to address this issue. The changes aim to mitigate the charset handling for media types that shouldn’t include a charset by default, as well as improve the overall handling of implicit charsets. Looking forward to your feedback!   TY
            abernal Arturo Bernal added a comment -

            In master

            abernal Arturo Bernal added a comment - In master

            People

              Unassigned Unassigned
              michael-o Michael Osipov
              Votes:
              1 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 3h 20m
                  3h 20m