Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Labels: ghx-label-8
Description
Impala uses its server's timezone when converting a Unix epoch time stored in a Kudu column of the UNIXTIME_MICROS type (legacy type name TIMESTAMP) into a timestamp. The stored value carries no timezone information, while the resulting timestamp does. Impala's convention is sensible and works fine as long as the data is written and read by Impala, or by another application that follows the same convention.
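For illustration, here is a minimal Python sketch (not Impala code; the epoch value and zone names are arbitrary) of how the same stored UNIXTIME_MICROS value, which is just microseconds since the Unix epoch, reads back as a different wall-clock timestamp depending on the timezone used to decode it:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# A value as stored in a Kudu UNIXTIME_MICROS column:
# microseconds since the Unix epoch, with no timezone attached.
micros = 1_600_000_000_000_000  # 2020-09-13 12:26:40 UTC

# Impala decodes it using its server's timezone, so the same stored
# value yields a different wall-clock timestamp per server zone.
for tz in ("UTC", "America/Los_Angeles", "Europe/Berlin"):
    dt = datetime.fromtimestamp(micros / 1_000_000, tz=ZoneInfo(tz))
    print(f"{tz:20s} {dt}")
# UTC                  2020-09-13 12:26:40+00:00
# America/Los_Angeles  2020-09-13 05:26:40-07:00
# Europe/Berlin        2020-09-13 14:26:40+02:00
```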
However, Spark uses a different convention: Spark applications convert timestamps to UTC before encoding the result as Unix epoch time. So when a Spark application stores timestamp data in a Kudu table and the Impala servers run in a timezone other than UTC, reading the stored data back via Impala yields different timestamps.
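A sketch of the resulting mismatch (again plain Python with a made-up wall-clock value, standing in for the actual Spark and Impala code paths):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# The wall-clock time the Spark application intends to store.
wall_clock = datetime(2021, 6, 1, 10, 0, 0)

# Spark's convention: normalize to UTC before encoding as epoch micros.
stored_micros = int(
    wall_clock.replace(tzinfo=ZoneInfo("UTC")).timestamp() * 1_000_000
)

# Impala's convention: decode epoch micros in the *server's* timezone.
read_back = datetime.fromtimestamp(
    stored_micros / 1_000_000, tz=ZoneInfo("America/New_York")
)
print(read_back)  # 2021-06-01 06:00:00-04:00 -- four hours off the intent
```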
As of now, the workaround is to run the Impala servers in the UTC timezone, so that Spark's convention and Impala's convention produce the same result when converting between timestamps and Unix epoch times.
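Continuing the sketch above: with the server timezone set to UTC, Impala's server-local decoding coincides with Spark's UTC normalization, so the round trip is lossless:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# The value written by Spark in the previous sketch.
stored_micros = 1_622_541_600_000_000  # 2021-06-01 10:00:00 UTC

# Decoding in UTC recovers exactly the wall clock the Spark app intended.
read_back = datetime.fromtimestamp(stored_micros / 1_000_000,
                                   tz=ZoneInfo("UTC"))
print(read_back)  # 2021-06-01 10:00:00+00:00 -- matches the intent
```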
In this context, it would be great to make the timezone that Impala uses when working with UNIXTIME_MICROS/TIMESTAMP values stored in Kudu tables customizable. That would free users from the inconvenience of running their clusters in the UTC timezone when they use a mix of Spark and Impala applications against the same data in Kudu tables. Ideally, the setting would be per Kudu table, but a system-wide flag is also an option.
This is similar to IMPALA-1658.
Issue Links
- causes
  - KUDU-3363 impala get wrong timestamp when scan kudu timestamp with timezone (Resolved)
- relates to
  - IMPALA-12322 return wrong timestamp when scan kudu timestamp with timezone (Resolved)