Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-12087 Performance improvements on Iceberg table queries
  3. IMPALA-11701

Skip pushing down Iceberg predicates to Impala scanner if not needed

    XMLWordPrintableJSON

Details

    • ghx-label-13

    Description

      I use impala to query iceberg table, but the query efficiency is not ideal, compared with querying the hive format table of the same data, the time-consuming increase is dozens of times.

      The sql statement used is a very simple statistical query, be like :
      select count from `db_name`.tbl_name where datekey='20221001' and event='xxx'
      ('datekey' and 'event' are the partition fields)

      My personal feeling is that impala might fetch iceberg's metadata stats and return results very quickly, but it doesn't.

      The catalog of iceberg table is of the hadoop type, and Impala can access it by creating an external table in hive. By the way, iceberg table will perform snapshot expiration and data compaction on a daily basis, so there should be no small file problems.

      I found this warning using the explain statement:

      | WARNING: The following tables are missing relevant table and/or column statistics. |
      | iceberg.gamebox_event_iceberg
      

      Query: SHOW TABLE STATS `iceberg`.gamebox_event_iceberg
      ---------------------------------------------------------------------------------------------------------------------------------------+

      #Rows #Files Size Bytes Cached Cache Replication Format Incremental stats Location

      ---------------------------------------------------------------------------------------------------------------------------------------+

      0 590509 1.91TB NOT CACHED NOT CACHED PARQUET false hdfs:///hive/warehouse/iceberg/gamebox_event_iceberg

      ---------------------------------------------------------------------------------------------------------------------------------------+

      It seems like Impala is not syncing iceberg's table and column statistics. I'm not sure if this has anything to do with slow queries.

      As shown in the screenshot, i think the query time is mainly on planning and execution backends , but I don't know what is the reason for these two time consuming.

      Attachment is the complete profile for this query.

      How do I speed up the query? Can someone help with my question?plz.....

      Attachments

        Activity

          People

            gaborkaszab Gabor Kaszab
            libra_816 Qizhu Chan
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: