Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Impala 4.0.0
    • Component/s: None
    • Labels: ghx-label-10

    Description

      Since we submitted a patch supporting creating Iceberg tables through Impala in IMPALA-9688, we are preparing to implement querying Iceberg tables through Impala. But first we need to read the Impala and Iceberg code deeply to determine how to do this.

      Attachments

        1. select-iceberg.jpg
          109 kB
          Sheng Wang


          Activity

            jira-bot ASF subversion and git services added a comment -

            Commit efc627d050caeb9947af2dfd3fc8a02236c44d0e in impala's branch refs/heads/master from Fang-Yu Rao
            [ https://gitbox.apache.org/repos/asf?p=impala.git;h=efc627d ]

            IMPALA-10158: Set timezone to UTC for Iceberg-related E2E tests

            We found that the tests of test_iceberg_query and test_iceberg_profile
            fail after the patch for IMPALA-9741 has been merged and that it is due
            to the default timezone of Impala not being UTC. This patch fixes the
            issue by adding "SET TIMEZONE=UTC;" before those test queries are run.

            Testing:

            • Verified in a local development environment that the tests of
              test_iceberg_query and test_iceberg_profile could pass after applying
              this patch.

            Change-Id: Ie985519e8ded04f90465e141488bd2dda78af6c3
            Reviewed-on: http://gerrit.cloudera.org:8080/16425
            Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
            Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>


            jira-bot ASF subversion and git services added a comment -

            Commit fb6d96e001c1a04475a8fd01f757dd0605cf3279 in impala's branch refs/heads/master from skyyws
            [ https://gitbox.apache.org/repos/asf?p=impala.git;h=fb6d96e ]

            IMPALA-9741: Support querying Iceberg table by impala

            This patch mainly implements querying Iceberg tables through Impala.
            We can use the following SQL to create an external Iceberg table:
            CREATE EXTERNAL TABLE default.iceberg_test (
              level string,
              event_time timestamp,
              message string
            )
            STORED AS ICEBERG
            LOCATION 'hdfs://xxx'
            TBLPROPERTIES ('iceberg_file_format'='parquet');
            Or with just the table name and location, like this:
            CREATE EXTERNAL TABLE default.iceberg_test
            STORED AS ICEBERG
            LOCATION 'hdfs://xxx'
            TBLPROPERTIES ('iceberg_file_format'='parquet');
            'iceberg_file_format' is the file format in Iceberg; currently only
            PARQUET is supported, and other formats will be supported in the
            future. If you don't specify this property in your SQL, the default
            file format is PARQUET.

            We implemented this by treating the Iceberg table as a normal
            unpartitioned HDFS table. When querying an Iceberg table, we push
            down partition column predicates to Iceberg to decide which data
            files need to be scanned, and then transfer this information to the
            BE to do the real scan operation.

            Testing:

            • Unit test for Iceberg in FileMetadataLoaderTest
            • Create table tests in functional_schema_template.sql
            • Iceberg table query test in test_scanners.py

            Change-Id: I856cfee4f3397d1a89cf17650e8d4fbfe1f2b006
            Reviewed-on: http://gerrit.cloudera.org:8080/16143
            Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
            Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
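
            As an illustration of the predicate-pushdown step described in the
            commit message above, a minimal sketch using the Iceberg Java API
            might look like the following. The table location is the
            placeholder from the CREATE TABLE example and the predicate is
            hypothetical; this is not Impala's actual planner code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.io.CloseableIterable;

public class IcebergPlanFilesSketch {
  public static void main(String[] args) throws Exception {
    // Table location reused from the CREATE TABLE example above.
    Table table = new HadoopTables(new Configuration()).load("hdfs://xxx");

    // Push a (hypothetical) partition column predicate down to Iceberg;
    // Iceberg prunes manifests and data files using partition values and
    // column statistics.
    TableScan scan = table.newScan()
        .filter(Expressions.equal("level", "ERROR"));

    // Each FileScanTask is a data file (or a slice of one) that survived
    // pruning; only these would be turned into scan ranges for the BE.
    try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
      for (FileScanTask task : tasks) {
        System.out.println(task.file().path() + " format=" + task.file().format());
      }
    }
  }
}
```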

            skyyws Sheng Wang added a comment - edited

            Hi tarmstrong, boroknagyz, vihangk1, I have completed a first version of querying Iceberg tables through Impala. The main design treats the Iceberg table as an unpartitioned HDFS table, and includes these functions:

            1. identify the Iceberg file format by table property (see the sketch after this comment);
            2. push down Iceberg partition column predicates to Iceberg, to filter the data files that need to be scanned.

            This is a simple version and some of the code may not be good; I hope you can give some advice, thanks a lot.
            Here is the Gerrit URL: https://gerrit.cloudera.org/#/c/16143/
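
            A minimal sketch of the first function above, identifying the file
            format from a table property with PARQUET as the default. The map
            stands in for the table's TBLPROPERTIES, and the helper name is
            hypothetical, not Impala's actual code.

```java
import java.util.Map;

public class IcebergFormatSketch {
  static final String ICEBERG_FILE_FORMAT = "iceberg_file_format";

  // Hypothetical helper: read the Iceberg file format from TBLPROPERTIES,
  // falling back to the documented default (PARQUET) when absent.
  static String fileFormat(Map<String, String> tblProperties) {
    return tblProperties.getOrDefault(ICEBERG_FILE_FORMAT, "parquet");
  }

  public static void main(String[] args) {
    System.out.println(fileFormat(Map.of(ICEBERG_FILE_FORMAT, "parquet"))); // parquet
    System.out.println(fileFormat(Map.of())); // parquet (default)
  }
}
```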

            skyyws Sheng Wang added a comment -

            tarmstrong I see, and thanks for your advice. I will try to implement this function soon, based on IMPALA-9688, and I will update here if there is any progress.

            tarmstrong Tim Armstrong added a comment -

            skyyws the HdfsScanNode implementation in the backend can handle mixed-format files. E.g. see how per_type_files_ is constructed and used in be/src/exec/hdfs-scan-node-base.cc. For Hive-style tables, each partition can have a different file format, so the file format is a property of HdfsPartitionDescriptor. But there's no reason the scan node can't be modified to determine the file format in a different way.

            I think there will be a lot of code shared between IcebergScanNode.java and HdfsScanNode.java. I'm not sure the best way to achieve that code-sharing, maybe a common base class or by factoring out logic into a separate class.
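
            A hedged sketch of the base-class option mentioned above: shared
            scan-range construction lives in the base class, and subclasses
            differ only in how they discover data files and their formats. All
            class and method names here are illustrative, not Impala's actual
            frontend API.

```java
import java.util.List;

// Illustrative base class owning the logic both scan nodes would share.
abstract class BaseFileScanNode {
  // Shared logic: turn concrete data files into backend scan ranges
  // (THdfsFileSplit / TScanRangeSpec construction would live here).
  protected void addScanRanges(List<String> dataFilePaths) {
    // ...
  }

  // Subclasses differ only in how they discover files and their formats.
  protected abstract List<String> collectDataFiles();
}

class HdfsStyleScanNode extends BaseFileScanNode {
  @Override
  protected List<String> collectDataFiles() {
    // Hive-style: files come from partition descriptors, each carrying
    // its own file format.
    return List.of();
  }
}

class IcebergStyleScanNode extends BaseFileScanNode {
  @Override
  protected List<String> collectDataFiles() {
    // Iceberg-style: files come from Iceberg's planFiles() after predicate
    // pushdown; the format is read per data file from Iceberg metadata.
    return List.of();
  }
}
```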

            skyyws Sheng Wang added a comment - edited

            Hi boroknagyz, tarmstrong, I have been thinking about how to implement querying Iceberg through Impala recently, and here is my initial design. I will write a class named IcebergScanNode.java in the frontend, and this class mainly contains these functions:

            • Transform Impala conjuncts to Iceberg expressions, which means we can push down some predicates to Iceberg (see the sketch after this comment);
            • Get the specific data files, which are stored in HDFS, from Iceberg by these expressions;
            • Use these specific data files to construct the related Thrift structs, such as THdfsFileSplit/TScanRangeSpec;
            • The backend will then use these Thrift structs to construct a "SCAN HDFS" node to scan the data, and this way we can reuse the backend code.

            I have uploaded a very simple design picture as an attachment, but some questions still need to be considered:

            1. If Iceberg returns files in different formats, such as Parquet/ORC, can the backend handle these files?
            2. If not, we may decide the table data format when creating the table, maybe by tblproperties, like this: 'iceberg_table_format'='parquet'; but if so, we cannot select from an Iceberg table which has data files in different formats.
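
            A minimal sketch of the first bullet above: translating simple
            column-operator-literal conjuncts into Iceberg expressions that can
            be passed to TableScan.filter(). The SimplePredicate type is a
            stand-in for Impala's predicate representation; only the
            Expressions/Expression side is the real Iceberg API.

```java
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;

public class ConjunctTranslatorSketch {
  enum Op { EQ, LT, GT }

  // Stand-in for an Impala conjunct of the form: column <op> literal.
  record SimplePredicate(String column, Op op, Object literal) {}

  // Map a simple predicate onto the corresponding Iceberg expression.
  static Expression toIcebergExpression(SimplePredicate p) {
    switch (p.op()) {
      case EQ: return Expressions.equal(p.column(), p.literal());
      case LT: return Expressions.lessThan(p.column(), p.literal());
      case GT: return Expressions.greaterThan(p.column(), p.literal());
      default: throw new IllegalArgumentException("unsupported op: " + p.op());
    }
  }

  public static void main(String[] args) {
    Expression expr = toIcebergExpression(
        new SimplePredicate("event_time", Op.GT, "2020-01-01T00:00:00"));
    System.out.println(expr);  // the expression Iceberg would use for pruning
  }
}
```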

            People

              Assignee: skyyws Sheng Wang
              Reporter: skyyws Sheng Wang
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: