[HIVE-28262] Single column use MultiDelimitSerDe parse column error - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.1.3, 4.1.0
Fix Version/s: 4.1.0, 4.0.1
Component/s: HiveServer2
Labels:
Environment:

Hive version: 3.1.3

Target Version/s:

4.1.0
Language:
- English

Description

ENV:

Hive: 3.1.3/4.1.0

HDFS: 3.3.1

--------------------------

Create a text file to load into the external table (e.g. /tmp/data):

1|@|
2|@|
3|@|

Create the external table:

CREATE EXTERNAL TABLE IF NOT EXISTS test_split_tmp(`ID` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' WITH SERDEPROPERTIES('field.delim'='|@|') STORED AS textfile location '/tmp/test_split_tmp';

Put the text file into the external table path:

hdfs dfs -put /tmp/data /tmp/test_split_tmp

Query the table and cast the id column to long:

select UDFToLong(`id`) from test_split_tmp;

Why use the UDFToLong function? Because it returns NULL in this case, whereas applying it to the string '1' should return the long value 1.

+--------+
| id     |
+--------+
| NULL   |
| NULL   |
| NULL   |
+--------+

Therefore, I suspect there is an issue with field splitting in MultiDelimitSerDe.

While debugging this issue, I found the following problems:

org.apache.hadoop.hive.serde2.lazy.LazyStruct#findIndexes

When fields.length = 1, the delimiter index cannot be found:

private int[] findIndexes(byte[] array, byte[] target) {
  if (fields.length <= 1) {  // bug
    return new int[0];
  }
  ...
  for (int i = 1; i < indexes.length; i++) {  // bug
    array = Arrays.copyOfRange(array, indexInNewArray + target.length, array.length);
    indexInNewArray = Bytes.indexOf(array, target);
    if (indexInNewArray == -1) {
      break;
    }
    indexes[i] = indexInNewArray + indexes[i - 1] + target.length;
  }
  return indexes;
}
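
To illustrate the first problem, here is a minimal, standalone sketch of a delimiter search that does not return early when there is only one field. It is not the code from the linked pull request: the class name DelimiterIndexSketch, the helper matchesAt, and the maxCount parameter are hypothetical, and the sketch uses a plain byte scan instead of the Bytes.indexOf call seen in the excerpt above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DelimiterIndexSketch {

  // Returns the byte offsets of up to maxCount non-overlapping occurrences of
  // target in array, scanning left to right. Unused slots stay at -1.
  // Assumption: with a single field, maxCount is 1, so a trailing delimiter
  // is still located instead of being skipped.
  static int[] findIndexes(byte[] array, byte[] target, int maxCount) {
    int[] indexes = new int[maxCount];
    Arrays.fill(indexes, -1);
    if (target.length == 0) {
      return indexes;
    }
    int found = 0;
    int pos = 0;
    while (found < maxCount && pos + target.length <= array.length) {
      if (matchesAt(array, target, pos)) {
        indexes[found++] = pos;
        pos += target.length; // jump past the whole delimiter
      } else {
        pos++;
      }
    }
    return indexes;
  }

  // True if target occurs in array starting at offset pos.
  static boolean matchesAt(byte[] array, byte[] target, int pos) {
    for (int j = 0; j < target.length; j++) {
      if (array[pos + j] != target[j]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    byte[] row = "1|@|".getBytes(StandardCharsets.UTF_8);
    byte[] delim = "|@|".getBytes(StandardCharsets.UTF_8);
    // Single-column row: the trailing delimiter is found at offset 1.
    System.out.println(Arrays.toString(findIndexes(row, delim, 1))); // prints [1]
  }
}
```

With the row 1|@| and one field, the sketch reports the delimiter at offset 1, which is the value the start-position calculation needs.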

org.apache.hadoop.hive.serde2.lazy.LazyStruct#parseMultiDelimit

When fields.length = 1, the column start position cannot be found:

public void parseMultiDelimit(byte[] rawRow, byte[] fieldDelimit) {
  ...
  int[] delimitIndexes = findIndexes(rawRow, fieldDelimit);
  ...
    if (fields.length > 1 && delimitIndexes[i - 1] != -1) { // bug
      int start = delimitIndexes[i - 1] + fieldDelimit.length;
      startPosition[i] = start - i * diff;
    } else {
      startPosition[i] = length + 1;
    }
  }
  Arrays.fill(fieldInited, false);
  parsed = true;
}

Multi-delimiter processing:

Actual: 1|@| -> 1^A, the id column starts at 0 and the next column starts at 1

Expected: 1|@| -> 1^A, the id column starts at 0 and the next column starts at 2
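
The expected start position can be checked with a small worked example. This is only a sketch of the arithmetic in the excerpt (start = delimiter index + delimiter length, then start - i * diff). It assumes diff is the delimiter length minus 1, which is not shown in the excerpt, and that every delimiter index was found. The class and method names are hypothetical.

```java
import java.util.Arrays;

class StartPositionSketch {

  // Computes column start positions in the row after each multi-byte
  // delimiter has been collapsed to a single byte (shown as ^A above).
  // delimitIndexes holds the offsets of the delimiters in the raw row.
  static int[] startPositions(int[] delimitIndexes, int delimLength) {
    // Assumption: each delimiter shrinks by (delimLength - 1) bytes.
    int diff = delimLength - 1;
    int[] starts = new int[delimitIndexes.length + 1];
    starts[0] = 0; // the first column always starts at offset 0
    for (int i = 1; i < starts.length; i++) {
      int start = delimitIndexes[i - 1] + delimLength;
      starts[i] = start - i * diff;
    }
    return starts;
  }

  public static void main(String[] args) {
    // Raw row "1|@|": the delimiter "|@|" (3 bytes) sits at offset 1.
    // start = 1 + 3 = 4, diff = 2, so the next column starts at 4 - 1 * 2 = 2.
    System.out.println(Arrays.toString(startPositions(new int[] {1}, 3))); // prints [0, 2]
  }
}
```

For the single-column row this gives start positions 0 and 2, matching the expected values listed above.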

Fix:

When fields.length = 1, the multi-character delimiter index should still be found.
When fields.length = 1, the column start position should be calculated correctly.

Attachments


CleanShot 2024-05-16 at 15.17.15@2x.png
16/May/24 07:19
227 kB
Liu Weizheng
CleanShot 2024-05-16 at 15.13.29@2x.png
16/May/24 07:19
137 kB
Liu Weizheng

Issue Links

links to

GitHub Pull Request #5252

People

Assignee: Liu Weizheng

Reporter: Liu Weizheng

Votes: 0

Watchers: 2

Dates

Due: 16/May/24

Created: 16/May/24 07:21

Updated: 25/Sep/24 04:44

Resolved: 03/Jul/24 01:14