[ORC-763] ORC timestamp inconsistencies before UNIX epoch - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.7.0, 1.6.8
Component/s: None
Labels: None

Description

I did some experiments with Hive (Java ORC 1.5.1) and Impala (C++ ORC 1.6.2) and found some bugs related to timestamps before the UNIX epoch.

From Hive:
0: jdbc:hive2://localhost:11050/default> create table orc_ts (ts timestamp) stored as orc;
0: jdbc:hive2://localhost:11050/default> insert into orc_ts values ('1969-12-31 23:59:59'), ('1969-12-31 23:59:59.0001'), ('1969-12-31 23:59:59.001');
0: jdbc:hive2://localhost:11050/default> insert into orc_ts values ('1969-12-31 23:59:58'), ('1969-12-31 23:59:58.0001'), ('1969-12-31 23:59:58.001');
0: jdbc:hive2://localhost:11050/default> select * from orc_ts;
+---------------------------+
| orc_ts.ts                 |
+---------------------------+
| 1969-12-31 23:59:59.0     |
| 1969-12-31 23:59:59.0001  |
| 1970-01-01 00:00:00.001   |
| 1969-12-31 23:59:58.0     |
| 1969-12-31 23:59:58.0001  |
| 1969-12-31 23:59:58.001   |
+---------------------------+
Please note that we inserted '1969-12-31 23:59:59.001' and got back '1970-01-01 00:00:00.001'. So Java ORC reads and writes are inconsistent with each other.

From Impala:

[localhost:21050] default> select * from orc_ts;
+-------------------------------+
| ts                            |
+-------------------------------+
| 1969-12-31 23:59:59           |
| 1969-12-31 23:59:58.000100000 |
| 1970-01-01 00:00:00.001000000 |
| 1969-12-31 23:59:58           |
| 1969-12-31 23:59:57.000100000 |
| 1969-12-31 23:59:58.001000000 |
+-------------------------------+

In Impala, the second and fifth timestamps are off by one second. The third timestamp is also off by one second, but that matches what Java returns.

https://issues.apache.org/jira/browse/ORC-306 mentions a Java bug that ORC tries to work around. It seems the data files store values in a way that works around the Java issue, which is unnecessary in C++.

Looking at the code, the Java and C++ readers construct timestamp values differently.

C++:

https://github.com/apache/orc/blob/654f777bc34841eed3b340047308f8dac7f554db/c%2B%2B/src/ColumnReader.cc#L377-L379

if (secsBuffer[i] < 0 && nanoBuffer[i] != 0) {
  secsBuffer[i] -= 1;
}

Java:

https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L1227-L1229

if (millis < 0 && newNanos > 999_999) {
  millis -= TimestampTreeWriter.MILLIS_PER_SECOND;
}

C++ checks for 'nanoBuffer[i] != 0' while Java checks for 'newNanos > 999_999'. Both checks apply only to timestamps before the epoch.

This gives us the pattern for when C++ and Java are inconsistent. The timestamps:

are before the UNIX epoch, AND
have the format YYYY-MM-DD HH:MM:ss.000XXX (a non-zero fraction smaller than one millisecond)
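
To make the disagreement concrete, the two quoted conditions can be compared side by side. This is a minimal sketch, not code from either library: the function names are hypothetical, and it assumes that the Java reader's millis is simply the decoded seconds multiplied by 1000 at the point of the check.

```cpp
#include <cstdint>

// Condition quoted from the C++ reader (ColumnReader.cc).
bool cppReaderAdjusts(int64_t secs, int64_t nanos) {
  return secs < 0 && nanos != 0;
}

// Condition quoted from the Java reader (TreeReaderFactory.java).
// Assumption: millis == secs * 1000 when the check runs.
bool javaReaderAdjusts(int64_t secs, int64_t nanos) {
  const int64_t millis = secs * 1000;
  return millis < 0 && nanos > 999999;
}

// True when exactly one of the two readers subtracts a second.
bool readersDisagree(int64_t secs, int64_t nanos) {
  return cppReaderAdjusts(secs, nanos) != javaReaderAdjusts(secs, nanos);
}
```

With secs = -1 and nanos = 100000 (the '.0001' case) only the C++ condition fires. With nanos = 0 or nanos = 1000000 both readers make the same decision.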

I checked the actual values in the data files, written by ORC Java, read by ORC C++ lib:

(gdb) print secsBuffer[0] + epochOffset // 1969-12-31 23:59:59 => correct
$1 = -1
(gdb) print secsBuffer[1] + epochOffset // 1969-12-31 23:59:59.0001 (orc C++)=> 1969-12-31 23:59:58.000100000
$2 = -1
(gdb) print secsBuffer[2] + epochOffset // 1969-12-31 23:59:59.001 (orc C++)=> 1970-01-01 00:00:00.001000000
$3 = 0

(gdb) print secsBuffer[0] + epochOffset // 1969-12-31 23:59:58 => correct
$9 = -2
(gdb) print secsBuffer[1] + epochOffset // 1969-12-31 23:59:58.0001 (orc c++)=> 1969-12-31 23:59:57.000100000
$10 = -2
(gdb) print secsBuffer[2] + epochOffset // 1969-12-31 23:59:58.001 (orc c++)=> 1969-12-31 23:59:58.001000000
$11 = -1
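
Applying the C++ reader's condition to these stored values reproduces the Impala output. The sketch below is illustrative only: cppReaderSeconds is a hypothetical helper, and it assumes the condition is evaluated on seconds relative to the UNIX epoch, i.e. on the value gdb prints as secsBuffer[i] + epochOffset.

```cpp
#include <cstdint>

// Seconds the C++ reader would report for a stored (seconds, nanos) pair.
// storedSecs is assumed to be relative to the UNIX epoch.
int64_t cppReaderSeconds(int64_t storedSecs, int64_t nanos) {
  if (storedSecs < 0 && nanos != 0) {
    storedSecs -= 1;  // the adjustment quoted from ColumnReader.cc
  }
  return storedSecs;
}
```

For the stored pair (-1, 100000) this returns -2, which is the 23:59:58.0001 that Impala shows for an inserted 23:59:59.0001. For (0, 1000000) it returns 0, matching 1970-01-01 00:00:00.001.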

The stored seconds are the same for '1969-12-31 23:59:58' and '1969-12-31 23:59:58.0001', but differ for '1969-12-31 23:59:58.001'. I think this is a bug in the Java writer. The workaround for the Java bug (ORC-306) shouldn't have any effect on the written data files.

So the stored seconds are off by one when the timestamp is before the epoch and the milliseconds are not zero.
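
One hypothesis that matches the gdb values is that the writer derives the seconds from epoch milliseconds using integer division, which truncates toward zero instead of rounding down. This is an assumption about the cause, not something confirmed from the writer's source, and both function names below are hypothetical. The sketch contrasts truncating and flooring division.

```cpp
#include <cstdint>

// Hypothetical: seconds via plain integer division (truncates toward zero).
int64_t truncatedSeconds(int64_t millis) {
  return millis / 1000;
}

// Seconds rounded toward negative infinity, so the remaining
// fraction is always a non-negative amount to add.
int64_t flooredSeconds(int64_t millis) {
  int64_t secs = millis / 1000;
  if (millis % 1000 < 0) {
    secs -= 1;
  }
  return secs;
}
```

For -999 ms (23:59:59.001) truncation gives 0 while flooring gives -1. For -1000 ms (23:59:59, and also 23:59:59.0001 once the sub-millisecond part is dropped) both give -1. Those truncated values are the ones seen in the data file.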

I think the data files should always store values that follow the spec, i.e. the number of seconds since the ORC epoch, plus the additional nanoseconds that need to be added to the timestamp. If that were true, we wouldn't need the 'if' statement above in the C++ code.
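
A minimal sketch of that reading of the spec follows. It is not the actual ORC implementation: StoredTimestamp, encode and decode are hypothetical names, and for simplicity the values are relative to the UNIX epoch instead of the ORC epoch. The point is that when the writer floors the seconds, the reader only has to add the two fields, with no special case for negative values.

```cpp
#include <cstdint>

constexpr int64_t kNanosPerSecond = 1000000000;

struct StoredTimestamp {
  int64_t secs;   // whole seconds, rounded toward negative infinity
  int64_t nanos;  // always in [0, 999999999]
};

// Writer side: split a signed nanosecond count into (secs, nanos).
StoredTimestamp encode(int64_t totalNanos) {
  int64_t secs = totalNanos / kNanosPerSecond;
  int64_t nanos = totalNanos % kNanosPerSecond;
  if (nanos < 0) {
    secs -= 1;
    nanos += kNanosPerSecond;
  }
  return {secs, nanos};
}

// Reader side: a plain sum, no adjustment before the epoch.
int64_t decode(const StoredTimestamp& ts) {
  return ts.secs * kNanosPerSecond + ts.nanos;
}
```

Under this scheme 1969-12-31 23:59:59.001 (-999000000 ns) is stored as secs = -1, nanos = 1000000 and decodes back to the same value.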

Issue Links

is related to: ORC-306 Fix incorrect workaround for bug in java.sql.Timestamp (Closed)

relates to: ORC-771 ORC timestamp consistency Test for sql.Timestamps close to epoch (Closed)

links to: GitHub Pull Request #661

People

Assignee: Gang Wu

Reporter: Zoltán Borók-Nagy

Votes: 0

Watchers: 4

Dates

Created: 17/Mar/21 14:08

Updated: 21/Sep/21 20:44

Resolved: 23/Mar/21 10:26