Details
-
Sub-task
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
-
ghx-label-13
Description
I use impala to query iceberg table, but the query efficiency is not ideal, compared with querying the hive format table of the same data, the time-consuming increase is dozens of times.
The sql statement used is a very simple statistical query, be like :
select count from `db_name`.tbl_name where datekey='20221001' and event='xxx'
('datekey' and 'event' are the partition fields)
My personal feeling is that impala might fetch iceberg's metadata stats and return results very quickly, but it doesn't.
The catalog of iceberg table is of the hadoop type, and Impala can access it by creating an external table in hive. By the way, iceberg table will perform snapshot expiration and data compaction on a daily basis, so there should be no small file problems.
I found this warning using the explain statement:
| WARNING: The following tables are missing relevant table and/or column statistics. | | iceberg.gamebox_event_iceberg
Query: SHOW TABLE STATS `iceberg`.gamebox_event_iceberg
---------------------------------------------------------------------------------------------------------------------------------------+
#Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
---------------------------------------------------------------------------------------------------------------------------------------+
0 | 590509 | 1.91TB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs:///hive/warehouse/iceberg/gamebox_event_iceberg |
---------------------------------------------------------------------------------------------------------------------------------------+
It seems like Impala is not syncing iceberg's table and column statistics. I'm not sure if this has anything to do with slow queries.
As shown in the screenshot, i think the query time is mainly on planning and execution backends , but I don't know what is the reason for these two time consuming.
Attachment is the complete profile for this query.
How do I speed up the query? Can someone help with my question?plz.....