ORC-451

Timestamp statistics is wrong if read with useUTCTimestamp=true


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.5.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: time zone for both client and server is "Europe/Moscow" (UTC+3); Hive version 3.1.0.3.0.1.0-187

    Description

      We use external ORC tables, with the time zone "Europe/Moscow" (UTC+3) on both client and server. After switching to Hive 3, which uses ORC 1.5.x, predicate pushdown started filtering out stripes that match a timestamp predicate. For example, consider this table (its ORC data is attached):

      create external table test_ts (ts timestamp) stored as orc;

      insert into test_ts values ("2018-12-24 18:30:00");

      -- Returns no rows, although the inserted row matches
      select * from test_ts where ts < "2018-12-24 19:00:00";

      -- The lowest upper bound for which the row is returned
      select * from test_ts where ts <= "2018-12-24 21:30:00";

      The issue affects only the statistics of external ORC tables. Turning predicate pushdown (PPD) off with set hive.optimize.index.filter=false; works around it.

      We believe it was introduced by ORC-341 (https://jira.apache.org/jira/browse/ORC-341).

      The UTC conversion in org.apache.orc.impl.SerializationUtils looks rather strange:

      public static long convertToUtc(TimeZone local, long time) {
        int offset = local.getOffset(time - local.getRawOffset());
        return time - offset;
      }

      In our UTC+3 time zone this adds a 3-hour offset to the timestamp (shouldn't it subtract 3 hours instead?).

      If org.apache.orc.impl.TimestampStatisticsImpl is used with useUTCTimestamp=false, the timestamp is converted back in a compatible way via SerializationUtils.convertFromUtc. However, Hive seems to override the default org.apache.orc.OrcFile.ReaderOptions with org.apache.hadoop.hive.ql.io.orc.ReaderOptions, which sets useUTCTimestamp(true) in its constructor. With useUTCTimestamp=true, the predicate evaluation in evaluatePredicateProto uses TimestampStatisticsImpl.getMaximumUTC(), which returns the timestamp as is; in the example it is "2018-12-24 21:30:00 UTC+3".

      At the same time, the predicate literal is not shifted (the value in this Tez log is in UTC+3):

      2018-12-24 22:12:16,205 [INFO] InputInitializer {Map 1} #0 |orc.OrcInputFormat|: ORC pushdown predicate: leaf-0 = (LESS_THAN ts 2018-12-24 19:00:00.0), expr = leaf-0
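      To make the mismatch concrete, here is a minimal sketch of min/max based stripe elimination using only the values from this report. It is not ORC code: the class StripePruningSketch and its two methods are hypothetical stand-ins, and it assumes the file holds a single row, so the statistics minimum equals the maximum.

```java
import java.time.LocalDateTime;

public class StripePruningSketch {
    // Hypothetical stand-in for stripe elimination on "ts < literal":
    // the stripe may contain a match only if its minimum is below the literal.
    static boolean mayContainLessThan(LocalDateTime statsMin, LocalDateTime literal) {
        return statsMin.isBefore(literal);
    }

    // Same idea for "ts <= literal".
    static boolean mayContainLessThanEquals(LocalDateTime statsMin, LocalDateTime literal) {
        return !statsMin.isAfter(literal);
    }

    public static void main(String[] args) {
        // The value that was inserted into test_ts.
        LocalDateTime written = LocalDateTime.of(2018, 12, 24, 18, 30);
        // The value the statistics report according to this issue: shifted by +3 hours.
        LocalDateTime shiftedStats = written.plusHours(3);

        LocalDateTime bound1900 = LocalDateTime.of(2018, 12, 24, 19, 0);
        LocalDateTime bound2130 = LocalDateTime.of(2018, 12, 24, 21, 30);

        // ts < 19:00 against shifted statistics: stripe is skipped, no rows returned.
        System.out.println(mayContainLessThan(shiftedStats, bound1900));       // false
        // ts <= 21:30 is the first bound that keeps the stripe.
        System.out.println(mayContainLessThanEquals(shiftedStats, bound2130)); // true
        // With unshifted statistics the first query would keep the stripe.
        System.out.println(mayContainLessThan(written, bound1900));            // true
    }
}
```

      Under these assumptions the sketch reproduces both observations: the first query loses its only matching stripe, and 21:30:00 is the smallest bound that still returns the row.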

      Attachments

        1. 000000_0 (0.2 kB, Rei Mai)
        2. hive_cfg.tar.gz (8 kB, Rei Mai)

            People

              Assignee: Unassigned
              Reporter: Rei Mai (reimai)
              Votes: 1
              Watchers: 4
