Status: Resolved
Resolution: Fixed
3.1.3, 4.1.0
Hive version: 3.1.3
Hive: 3.1.3/4.1.0
HDFS: 3.3.1
Create a text file to load into the external table (e.g. /tmp/data):
1|@|
2|@|
3|@|
Create external table:
CREATE EXTERNAL TABLE IF NOT EXISTS test_split_tmp(`ID` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES('field.delim'='|@|')
STORED AS textfile
location '/tmp/test_split_tmp';
Put the text file into the external table path:
hdfs dfs -put /tmp/data /tmp/test_split_tmp
Query the table and cast column id to the long type:
select UDFToLong(`id`) from test_split_tmp;
Why use the UDFToLong function? Because it returns NULL in this case, whereas applying it to the string '1' should return the long value 1.
+--------+
|   id   |
+--------+
| NULL   |
| NULL   |
| NULL   |
+--------+
Therefore, I suspect there is an issue with field splitting in MultiDelimitSerDe.
While debugging this issue, I found the following problems:
- org.apache.hadoop.hive.serde2.lazy.LazyStruct#findIndexes
When fields.length = 1, the delimiter index is never found:
private int[] findIndexes(byte[] array, byte[] target) {
  if (fields.length <= 1) { // bug
    return new int[0];
  }
  ...
  for (int i = 1; i < indexes.length; i++) { // bug
    array = Arrays.copyOfRange(array, indexInNewArray + target.length, array.length);
    indexInNewArray = Bytes.indexOf(array, target);
    if (indexInNewArray == -1) {
      break;
    }
    indexes[i] = indexInNewArray + indexes[i - 1] + target.length;
  }
  return indexes;
}
- org.apache.hadoop.hive.serde2.lazy.LazyStruct#parseMultiDelimit
When fields.length = 1, the column start position is never found:
public void parseMultiDelimit(byte[] rawRow, byte[] fieldDelimit) {
  ...
  int[] delimitIndexes = findIndexes(rawRow, fieldDelimit);
  ...
    if (fields.length > 1 && delimitIndexes[i - 1] != -1) { // bug
      int start = delimitIndexes[i - 1] + fieldDelimit.length;
      startPosition[i] = start - i * diff;
    } else {
      startPosition[i] = length + 1;
    }
  }
  Arrays.fill(fieldInited, false);
  parsed = true;
}
Multi-delimiter processing:
Actual: 1|@| -> 1^A, id column starts at 0, next column starts at 1
Expected: 1|@| -> 1^A, id column starts at 0, next column starts at 2
- When fields.length = 1, the multi-character delimiter index should still be found
- When fields.length = 1, the column start position should still be calculated correctly
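The two expectations above can be illustrated with a small standalone sketch. This is not the Hive patch and does not use LazyStruct; the class MultiDelimitSketch and its methods findDelimiterIndexes and startPositions are hypothetical names introduced only for illustration. It assumes that every occurrence of the multi-character delimiter is collapsed to a single byte (such as ^A), and that start positions are expressed as offsets into that collapsed row, with one extra entry marking the position just past the last field.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not Hive's LazyStruct implementation.
class MultiDelimitSketch {

    // Returns the offset of every occurrence of target in array,
    // regardless of how many fields the table declares.
    static int[] findDelimiterIndexes(byte[] array, byte[] target) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        while (i + target.length <= array.length) {
            if (matchesAt(array, target, i)) {
                hits.add(i);
                i += target.length; // skip past the matched delimiter
            } else {
                i++;
            }
        }
        return hits.stream().mapToInt(Integer::intValue).toArray();
    }

    static boolean matchesAt(byte[] array, byte[] target, int offset) {
        for (int j = 0; j < target.length; j++) {
            if (array[offset + j] != target[j]) {
                return false;
            }
        }
        return true;
    }

    // Computes fieldCount + 1 start positions in the collapsed row
    // (assumption: each delimiter shrinks to one byte after collapsing).
    static int[] startPositions(byte[] rawRow, byte[] delim, int fieldCount) {
        int[] idx = findDelimiterIndexes(rawRow, delim);
        int diff = delim.length - 1; // bytes saved per collapsed delimiter
        int collapsedLength = rawRow.length - idx.length * diff;
        int[] start = new int[fieldCount + 1];
        start[0] = 0;
        for (int i = 1; i <= fieldCount; i++) {
            if (i - 1 < idx.length) {
                // A delimiter exists for this boundary, even when fieldCount == 1.
                start[i] = idx[i - 1] + delim.length - i * diff;
            } else {
                // No delimiter left: point one past the end of the collapsed row.
                start[i] = collapsedLength + 1;
            }
        }
        return start;
    }

    public static void main(String[] args) {
        byte[] row = "1|@|".getBytes(StandardCharsets.UTF_8);
        byte[] delim = "|@|".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(startPositions(row, delim, 1))); // prints [0, 2]
    }
}
```

For the row 1|@| with a single declared field, the sketch yields start positions 0 and 2, which matches the expected behavior described above: the id field spans one byte and the next column starts after the collapsed delimiter.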