Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Impala 4.0.0
    • Component/s: None
    • Labels: ghx-label-10

    Description

      Since we submitted a patch supporting creating Iceberg tables through Impala in IMPALA-9688, we are preparing to implement querying Iceberg tables through Impala. But first we need to read the Impala and Iceberg code deeply to determine how to do this.

      Attachments

        1. select-iceberg.jpg
          109 kB
          Sheng Wang


          Activity

            jira-bot ASF subversion and git services added a comment -

            Commit efc627d050caeb9947af2dfd3fc8a02236c44d0e in impala's branch refs/heads/master from Fang-Yu Rao
            [ https://gitbox.apache.org/repos/asf?p=impala.git;h=efc627d ]

            IMPALA-10158: Set timezone to UTC for Iceberg-related E2E tests

            We found that the tests of test_iceberg_query and test_iceberg_profile
            fail after the patch for IMPALA-9741 has been merged and that it is due
            to the default timezone of Impala not being UTC. This patch fixes the
            issue by adding "SET TIMEZONE=UTC;" before those test queries are run.

            Testing:

            • Verified in a local development environment that the tests of
              test_iceberg_query and test_iceberg_profile could pass after applying
              this patch.

            Change-Id: Ie985519e8ded04f90465e141488bd2dda78af6c3
            Reviewed-on: http://gerrit.cloudera.org:8080/16425
            Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
            Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>


            jira-bot ASF subversion and git services added a comment -

            Commit fb6d96e001c1a04475a8fd01f757dd0605cf3279 in impala's branch refs/heads/master from skyyws
            [ https://gitbox.apache.org/repos/asf?p=impala.git;h=fb6d96e ]

            IMPALA-9741: Support querying Iceberg table by impala

            This patch mainly implements querying Iceberg tables through Impala.
            We can use the following SQL to create an external Iceberg table:
            CREATE EXTERNAL TABLE default.iceberg_test (
              level string,
              event_time timestamp,
              message string
            )
            STORED AS ICEBERG
            LOCATION 'hdfs://xxx'
            TBLPROPERTIES ('iceberg_file_format'='parquet');
            Or with just the table name and location, like this:
            CREATE EXTERNAL TABLE default.iceberg_test
            STORED AS ICEBERG
            LOCATION 'hdfs://xxx'
            TBLPROPERTIES ('iceberg_file_format'='parquet');
            'iceberg_file_format' is the file format in Iceberg; currently only
            PARQUET is supported, and other formats will be supported in the
            future. If you don't specify this property in your SQL, the default
            file format is PARQUET.

            We implemented this by treating the Iceberg table as a normal
            unpartitioned HDFS table. When querying an Iceberg table, we push
            down partition column predicates to Iceberg to decide which data
            files need to be scanned, and then transfer this information to the
            BE to do the real scan operation.

            Testing:

            • Unit test for Iceberg in FileMetadataLoaderTest
            • Create table tests in functional_schema_template.sql
            • Iceberg table query test in test_scanners.py

            Change-Id: I856cfee4f3397d1a89cf17650e8d4fbfe1f2b006
            Reviewed-on: http://gerrit.cloudera.org:8080/16143
            Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
            Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
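
            As an illustration of the predicate-pushdown step described in the
            commit message above, a minimal sketch using the Iceberg Java API
            might look like the following. The table location is the
            placeholder from the CREATE TABLE example and the predicate is
            hypothetical; this is not Impala's actual planner code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.io.CloseableIterable;

public class IcebergPlanFilesSketch {
  public static void main(String[] args) throws Exception {
    // Table location reused from the CREATE TABLE example above.
    Table table = new HadoopTables(new Configuration()).load("hdfs://xxx");

    // Push a (hypothetical) partition column predicate down to Iceberg;
    // Iceberg prunes manifests and data files using partition values and
    // column statistics.
    TableScan scan = table.newScan()
        .filter(Expressions.equal("level", "ERROR"));

    // Each FileScanTask is a data file (or a slice of one) that survived
    // pruning; only these would be turned into scan ranges for the BE.
    try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
      for (FileScanTask task : tasks) {
        System.out.println(task.file().path() + " format=" + task.file().format());
      }
    }
  }
}
```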

            skyyws Sheng Wang added a comment - edited

            Hi tarmstrong, boroknagyz, vihangk1, I have completed a first version of querying Iceberg tables through Impala. The main design treats the Iceberg table as an unpartitioned HDFS table, and includes these functions:

            1. identify the Iceberg file format by table property (see the sketch after this comment);
            2. push down Iceberg partition column predicates to Iceberg, to filter the data files that need to be scanned.

            This is a simple version and some of the code may not be good; I hope you can give some advice, thanks a lot.
            Here is the Gerrit URL: https://gerrit.cloudera.org/#/c/16143/
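
            A minimal sketch of the first function above, identifying the file
            format from a table property with PARQUET as the default. The map
            stands in for the table's TBLPROPERTIES, and the helper name is
            hypothetical, not Impala's actual code.

```java
import java.util.Map;

public class IcebergFormatSketch {
  static final String ICEBERG_FILE_FORMAT = "iceberg_file_format";

  // Hypothetical helper: read the Iceberg file format from TBLPROPERTIES,
  // falling back to the documented default (PARQUET) when absent.
  static String fileFormat(Map<String, String> tblProperties) {
    return tblProperties.getOrDefault(ICEBERG_FILE_FORMAT, "parquet");
  }

  public static void main(String[] args) {
    System.out.println(fileFormat(Map.of(ICEBERG_FILE_FORMAT, "parquet"))); // parquet
    System.out.println(fileFormat(Map.of())); // parquet (default)
  }
}
```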

            skyyws Sheng Wang added a comment -

            tarmstrong I see, and thanks for your advice. I will try to implement this function soon, based on IMPALA-9688, and I will update here if there is any progress.

            tarmstrong Tim Armstrong added a comment -

            skyyws the HdfsScanNode implementation in the backend can handle mixed-format files. E.g. see how per_type_files_ is constructed and used in be/src/exec/hdfs-scan-node-base.cc. For Hive-style tables, each partition can have a different file format, so the file format is a property of HdfsPartitionDescriptor. But there's no reason the scan node can't be modified to determine the file format in a different way.

            I think there will be a lot of code shared between IcebergScanNode.java and HdfsScanNode.java. I'm not sure the best way to achieve that code-sharing, maybe a common base class or by factoring out logic into a separate class.
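
            A hedged sketch of the base-class option mentioned above: shared
            scan-range construction lives in the base class, and subclasses
            differ only in how they discover data files and their formats. All
            class and method names here are illustrative, not Impala's actual
            frontend API.

```java
import java.util.List;

// Illustrative base class owning the logic both scan nodes would share.
abstract class BaseFileScanNode {
  // Shared logic: turn concrete data files into backend scan ranges
  // (THdfsFileSplit / TScanRangeSpec construction would live here).
  protected void addScanRanges(List<String> dataFilePaths) {
    // ...
  }

  // Subclasses differ only in how they discover files and their formats.
  protected abstract List<String> collectDataFiles();
}

class HdfsStyleScanNode extends BaseFileScanNode {
  @Override
  protected List<String> collectDataFiles() {
    // Hive-style: files come from partition descriptors, each carrying
    // its own file format.
    return List.of();
  }
}

class IcebergStyleScanNode extends BaseFileScanNode {
  @Override
  protected List<String> collectDataFiles() {
    // Iceberg-style: files come from Iceberg's planFiles() after predicate
    // pushdown; the format is read per data file from Iceberg metadata.
    return List.of();
  }
}
```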

            skyyws Sheng Wang added a comment - edited

            Hi boroknagyz, tarmstrong, I have been thinking about how to implement querying Iceberg through Impala recently, and here is my initial design. I will write a class named IcebergScanNode.java in the frontend, and this class mainly contains these functions:

            • Transform Impala conjuncts to Iceberg expressions, which means we can push down some predicates to Iceberg (see the sketch after this comment);
            • Get the specific data files, which are stored in HDFS, from Iceberg by these expressions;
            • Use these specific data files to construct the related Thrift structs, such as THdfsFileSplit/TScanRangeSpec;
            • The backend will then use these Thrift structs to construct a "SCAN HDFS" node to scan the data, and this way we can reuse the backend code.

            I have uploaded a very simple design picture as an attachment, but some questions still need to be considered:

            1. If Iceberg returns files in different formats, such as Parquet/ORC, can the backend handle these files?
            2. If not, we may decide the table data format when creating the table, maybe by tblproperties, like this: 'iceberg_table_format'='parquet'; but if so, we cannot select from an Iceberg table which has data files in different formats.
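
            A minimal sketch of the first bullet above: translating simple
            column-operator-literal conjuncts into Iceberg expressions that can
            be passed to TableScan.filter(). The SimplePredicate type is a
            stand-in for Impala's predicate representation; only the
            Expressions/Expression side is the real Iceberg API.

```java
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;

public class ConjunctTranslatorSketch {
  enum Op { EQ, LT, GT }

  // Stand-in for an Impala conjunct of the form: column <op> literal.
  record SimplePredicate(String column, Op op, Object literal) {}

  // Map a simple predicate onto the corresponding Iceberg expression.
  static Expression toIcebergExpression(SimplePredicate p) {
    switch (p.op()) {
      case EQ: return Expressions.equal(p.column(), p.literal());
      case LT: return Expressions.lessThan(p.column(), p.literal());
      case GT: return Expressions.greaterThan(p.column(), p.literal());
      default: throw new IllegalArgumentException("unsupported op: " + p.op());
    }
  }

  public static void main(String[] args) {
    Expression expr = toIcebergExpression(
        new SimplePredicate("event_time", Op.GT, "2020-01-01T00:00:00"));
    System.out.println(expr);  // the expression Iceberg would use for pruning
  }
}
```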

            People

              Assignee: skyyws Sheng Wang
              Reporter: skyyws Sheng Wang
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: