[IMPALA-5675] Support CHAR/VARCHAR length counted in number of UTF-8 characters, not bytes - ASF JIRA

Details

Type: Bug
Status: In Progress
Priority: Critical
Resolution: Unresolved
Affects Version/s: Impala 2.7.0
Fix Version/s: None
Component/s: Backend
Labels: None
Environment: Cloudera distro 5.10.1
Epic Link: UTF-8

Description

We created an external table with the following query:

CREATE EXTERNAL TABLE IF NOT EXISTS SAPNSQ.ZAP_GL_EX_IM_CSV (
  GLREQUEST DECIMAL(30),
  KNUMC STRING,
  FACCP STRING,
  FCHAR VARCHAR(20),
  FCLNT VARCHAR(3),
  FCUKY STRING,
  FCURR DOUBLE,
  FDATS STRING,
  FDEC DECIMAL(8, 2),
  FFLTP FLOAT,
  FINT1 TINYINT,
  FINT2 SMALLINT,
  FINT4 BIGINT,
  FLANG STRING,
  FPREC DOUBLE,
  FQUAN DOUBLE,
  FTIMS STRING,
  FUNIT STRING,
  FSSTRING STRING,
  FCHAR40 VARCHAR(40)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
STORED AS TEXTFILE
LOCATION "hdfs:///user/nsqhdp/H_CDC_IMPQ/ZAP_GL_EX_IM_CSV/63E55F5943E95122E1000000C0A83051"

The CSV files are already present at the specified location, hdfs:///user/nsqhdp/H_CDC_IMPQ/ZAP_GL_EX_IM_CSV/63E55F5943E95122E1000000C0A83051

When we execute SELECT fchar40 FROM sapnsq.zap_gl_ex_im_csv ORDER BY fchar40 in both Hive and Impala, we get different results:

Hive (see Hive_query.png)
Impala (see Impala_query.png)

It seems that the Impala engine truncates strings that contain non-ASCII characters.
If a character is encoded in 2 bytes, Impala counts it as 2 characters instead of 1.
As a result, the FCHAR40 VARCHAR(40) column returns fewer than 40 characters for such values.

Example:
The first row contains 3 non-ASCII characters: É, Ï and ü.
The same SELECT in Impala returns that value shortened by 3 characters.
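
The arithmetic behind this example can be checked outside Impala. The following is a minimal Python sketch, assuming the VARCHAR limit is applied to UTF-8 bytes rather than characters, as the observed output suggests. The sample value is hypothetical (the real row is only visible in the attached screenshots), and truncate_bytes and truncate_chars are illustrative helpers, not Impala functions.

```python
def truncate_bytes(s: str, limit: int) -> str:
    """Keep at most `limit` UTF-8 bytes (the behavior this issue reports)."""
    raw = s.encode("utf-8")[:limit]
    # The cut may land inside a multi-byte sequence; drop the incomplete tail.
    return raw.decode("utf-8", errors="ignore")


def truncate_chars(s: str, limit: int) -> str:
    """Keep at most `limit` characters (the behavior the reporter expects)."""
    return s[:limit]


# Hypothetical 40-character value: three 2-byte characters (É, Ï, ü)
# followed by 37 ASCII characters, so 43 bytes in UTF-8.
value = "\u00c9\u00cf\u00fc" + "a" * 37

print(len(value), len(value.encode("utf-8")))
print(len(truncate_chars(value, 40)))
print(len(truncate_bytes(value, 40)))
```

Under this assumption the byte-based limit keeps 37 characters of the 40-character value, one fewer for every 2-byte character, which matches the 3-character loss described above.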

According to Impala documentation (https://www.cloudera.com/documentation/enterprise/5-7-x/topics/impala_varchar.html), Unicode should be supported:
"All data in CHAR and VARCHAR columns must be in a character encoding that is compatible with UTF-8"

Attachments

- Hive_query.png (28 kB, 18/Jul/17 08:46, Branislav Lukáč)
- Impala_query.png (24 kB, 18/Jul/17 08:46, Branislav Lukáč)

Issue Links

is related to: IMPALA-2019 Proper UTF-8 support in string functions (Resolved)

relates to: IMPALA-9662 Add builtin functions for masking UTF-8 strings (Resolved)

People

Assignee: Quanlong Huang

Reporter: Branislav Lukáč

Votes: 1

Watchers: 9

Dates

Created: 18/Jul/17 09:23

Updated: 12/Jan/21 08:41