  1. Apache Hudi
  2. HUDI-6364

InsertOverwrite operation on consistent hashing resulting in wrong data

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: index
    • Labels: None

    Description
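
      The reproduction in this description assumes that $tableName already exists as a partitioned Hudi table using the consistent hashing bucket index; the ticket does not show the DDL. The sketch below is one guess at a matching table. The column names (id, name, price, ts, dt), the MOR table type, the key fields, and the basePath parameter are assumptions inferred from the inserted values, not taken from the report.

```scala
object CreateTableSketch {
  // Builds a CREATE TABLE statement for a Hudi table that uses the
  // consistent hashing bucket index. Column names, table type, and key
  // fields are assumptions; `basePath` is a hypothetical storage location.
  def createTableSql(tableName: String, basePath: String): String =
    s"""
       |create table $tableName (
       |  id int,
       |  name string,
       |  price double,
       |  ts long,
       |  dt string
       |) using hudi
       |tblproperties (
       |  type = 'mor',
       |  primaryKey = 'id',
       |  preCombineField = 'ts',
       |  hoodie.index.type = 'BUCKET',
       |  hoodie.index.bucket.engine = 'CONSISTENT_HASHING'
       |)
       |partitioned by (dt)
       |location '$basePath'
       |""".stripMargin
}
```

      With a Spark session that has Hudi on the classpath, the statement could be run as spark.sql(CreateTableSketch.createTableSql(tableName, basePath)) before the inserts.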

      // Initial load: nine rows into partition dt = '2021-01-05'
      spark.sql(
        s"""insert into $tableName values
           |(5, 'a', 35, 1000, '2021-01-05'),
           |(1, 'a', 31, 1000, '2021-01-05'),
           |(3, 'a', 33, 1000, '2021-01-05'),
           |(4, 'b', 16, 1000, '2021-01-05'),
           |(2, 'b', 18, 1000, '2021-01-05'),
           |(6, 'b', 17, 1000, '2021-01-05'),
           |(8, 'a', 21, 1000, '2021-01-05'),
           |(9, 'a', 22, 1000, '2021-01-05'),
           |(7, 'a', 23, 1000, '2021-01-05')
           |""".stripMargin)
      
      // Insert overwrite static partition: replaces all rows in dt = '2021-01-05'
      spark.sql(
        s"""
           | insert overwrite table $tableName partition(dt = '2021-01-05')
           | select * from (select 13, 'a2', 12, 1000) limit 10
           |""".stripMargin)

      // Insert two more rows into the overwritten partition
      spark.sql(
        s"""
           | insert into $tableName values
           | (5, 'a3', 35, 1000, '2021-01-05'),
           | (3, 'a3', 33, 1000, '2021-01-05')
           |""".stripMargin)
      

      After running the above case, we expect the snapshot to contain three rows: (13, "a2", 12.0, 1000, "2021-01-05"), (5, "a3", 35, 1000, "2021-01-05"), (3, "a3", 33, 1000, "2021-01-05").

      But the actual result contains only (13,a2,12.0,1000,2021-01-05); the two rows inserted after the overwrite are missing.

      The root cause is that after running insert overwrite on a table with a consistent bucket index, the file groups recorded in consistent_hashing_metadata no longer match the file groups on storage.
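
      To illustrate how such a mismatch could hide later writes, here is a toy model in plain Scala. It is not Hudi code: PartitionState, route, insertOverwrite and the refreshMetadata flag are hypothetical names, and the mechanism shown (records routed through stale metadata land in file groups that a snapshot read no longer considers valid) is an assumption about the failure mode rather than something the ticket spells out.

```scala
object StaleHashingMetadataSketch {
  final case class Row(id: Int, name: String)

  // Toy partition state (hypothetical, not Hudi's classes).
  // metadata: stands in for consistent_hashing_metadata (bucket -> file group id)
  // live:     file groups that a snapshot read treats as valid on storage
  // files:    rows written to each file group
  final case class PartitionState(
      metadata: Vector[String],
      live: Set[String],
      files: Map[String, Vector[Row]])

  // Route a record key to a file group using only the hashing metadata.
  def route(state: PartitionState, id: Int): String =
    state.metadata(Math.floorMod(id, state.metadata.size))

  // Regular insert: append each row to the file group chosen by the metadata.
  def insert(state: PartitionState, rows: Seq[Row]): PartitionState =
    rows.foldLeft(state) { (s, r) =>
      val fg = route(s, r.id)
      s.copy(files = s.files.updated(fg, s.files.getOrElse(fg, Vector.empty) :+ r))
    }

  // Insert overwrite as modeled here: the old file groups are replaced by
  // new ones. When refreshMetadata is false, the metadata keeps pointing at
  // the replaced file groups, which is the mismatch described in the report.
  def insertOverwrite(
      state: PartitionState,
      rows: Seq[Row],
      refreshMetadata: Boolean): PartitionState = {
    val newIds = state.metadata.indices.map(i => s"new-$i").toVector
    val written = insert(PartitionState(newIds, newIds.toSet, Map.empty), rows)
    if (refreshMetadata) written else written.copy(metadata = state.metadata)
  }

  // Snapshot read: only rows stored in live file groups are visible.
  def snapshot(state: PartitionState): Set[Row] =
    state.files.toSeq.filter { case (fg, _) => state.live(fg) }.flatMap(_._2).toSet

  def main(args: Array[String]): Unit = {
    val initial = PartitionState(Vector("old-0", "old-1"), Set("old-0", "old-1"), Map.empty)
    val loaded = insert(initial, (1 to 9).map(i => Row(i, "a")))
    val overwriteRows = Seq(Row(13, "a2"))
    val laterRows = Seq(Row(5, "a3"), Row(3, "a3"))

    val stale = insert(insertOverwrite(loaded, overwriteRows, refreshMetadata = false), laterRows)
    val fixed = insert(insertOverwrite(loaded, overwriteRows, refreshMetadata = true), laterRows)

    // Stale metadata: the later rows go to replaced file groups and are not visible.
    assert(snapshot(stale).map(_.id) == Set(13))
    // Refreshed metadata: all three rows are visible.
    assert(snapshot(fixed).map(_.id) == Set(13, 5, 3))
  }
}
```

      In this model the stale run reproduces the reported symptom (only id 13 is visible), while refreshing the metadata together with the overwrite yields the expected three rows.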

          People

            Assignee: Jing Zhang (jingzhang)
            Reporter: Jing Zhang (jingzhang)