Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Description
In HS2 (and other components) we rely on UTF-8 encoding, so whenever strings are stored as bytes, the UTF-8-encoded bytes are stored. Some Java APIs fall back to the default system encoding in various ways, which can produce incorrectly encoded bytes when the system default is something other than UTF-8. This patch fixes two such code paths:
1. ConstantVectorExpression
In my case, this:
LOG.info("default charset name: " + java.nio.charset.Charset.defaultCharset().name());
LOG.info("getBytes() = " + ((String) constantValue).getBytes());
LOG.info("getBytes(StandardCharsets.UTF_8) = " + ((String) constantValue).getBytes(StandardCharsets.UTF_8));
led to:
default charset name: US-ASCII
getBytes() = [B@73dcffb0
getBytes(StandardCharsets.UTF_8) = [B@2ead0b9c
On the customer side, queries returned wrong results when the filter contained a non-ASCII character (one that UTF-8 can represent but the system default charset cannot):
SELECT b FROM default.rlv_test1 where b='北京'; .... ??
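The effect can be reproduced outside Hive with a minimal sketch. The class and method names below (CharsetBytesDemo, encode) are hypothetical and not part of the patch; US-ASCII is passed explicitly to imitate a JVM whose default charset is US-ASCII, since the default cannot be reliably changed at runtime. Because printing a byte[] directly only shows its identity hash (as in the log output above), Arrays.toString is used to show the contents.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetBytesDemo {

    // Encode with an explicit charset so the result does not depend on JVM settings.
    public static byte[] encode(String value, Charset charset) {
        return value.getBytes(charset);
    }

    public static void main(String[] args) {
        // "北京" written with Unicode escapes so the source compiles the same
        // regardless of the compiler's file encoding.
        String constantValue = "\u5317\u4eac";

        byte[] utf8 = encode(constantValue, StandardCharsets.UTF_8);
        // Stand-in for getBytes() on a JVM whose default charset is US-ASCII:
        // characters the charset cannot represent are replaced with '?'.
        byte[] ascii = encode(constantValue, StandardCharsets.US_ASCII);

        System.out.println("default charset: " + Charset.defaultCharset().name());
        System.out.println("UTF-8 bytes:     " + Arrays.toString(utf8));
        System.out.println("US-ASCII bytes:  " + Arrays.toString(ascii));
        System.out.println("US-ASCII text:   " + new String(ascii, StandardCharsets.US_ASCII));
    }
}
```

With US-ASCII the two characters collapse into two '?' bytes, which matches the ?? seen in the query result; passing StandardCharsets.UTF_8 explicitly keeps the bytes stable on any system.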
2. Explain
Similarly, EXPLAIN wrote its output to a PrintStream that used a different (non-UTF-8) encoding, leading to a plan like:
Map Operator Tree:
    TableScan
      alias: test_table
      filterExpr: (b = '??') (type: boolean)
      Statistics: Num rows: 2 Data size: 352 Basic stats: COMPLETE Column stats: COMPLETE
      Filter Operator
        predicate: (b = '??') (type: boolean)
        Statistics: Num rows: 2 Data size: 352 Basic stats: COMPLETE Column stats: COMPLETE
        Select Operator
          expressions: a (type: int), '??' (type: string), c (type: string)
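The PrintStream behaviour can be illustrated the same way. This is only a sketch under assumptions: ExplainStreamDemo and printWith are hypothetical names, the plan line is a shortened stand-in, and the actual Hive code path is not shown here. It relies on the standard PrintStream(OutputStream, boolean, String) constructor, which takes the encoding name explicitly instead of using the system default.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class ExplainStreamDemo {

    // Prints the text through a PrintStream created with the named encoding
    // and returns the raw bytes that reached the underlying stream.
    public static byte[] printWith(String text, String encodingName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(buffer, true, encodingName)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encodingName, e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Shortened stand-in for one plan line; the literal is "北京" in Unicode escapes.
        String planLine = "predicate: (b = '\u5317\u4eac') (type: boolean)";

        // A stream with an ASCII encoding replaces the unmappable characters with '?'.
        String viaAscii = new String(printWith(planLine, "US-ASCII"), StandardCharsets.UTF_8);
        // A stream created with UTF-8 preserves the literal.
        String viaUtf8 = new String(printWith(planLine, "UTF-8"), StandardCharsets.UTF_8);

        System.out.println("US-ASCII stream kept the literal: " + viaAscii.equals(planLine));
        System.out.println("UTF-8 stream kept the literal:    " + viaUtf8.equals(planLine));
    }
}
```

Constructing the stream with an explicit encoding makes the printed plan independent of the host's locale settings.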
Issue Links
- relates to
  - HIVE-28544 Ensure using UTF-8 encoding in some String/Char/Varchar related operations (Resolved)
  - HIVE-26651 MultiDelimitSerDe shouldn't rely on default charset when returning the deserialized string (Closed)