[HUDI-1608] MOR fetches all records for read optimized query w/ spark sql - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Won't Fix
Affects Version/s: 0.7.0
Fix Version/s: None
Component/s: spark
Labels:

Description

Script to reproduce in local spark:

https://gist.github.com/nsivabalan/7250b794788516f1aec35650c2632364

```
scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, id, __op from hudi_trips_snapshot order by _hoodie_record_key").show(false)
+-------------------+------------------+----------------------+---+----+
|_hoodie_commit_time|_hoodie_record_key|_hoodie_partition_path|id |__op|
+-------------------+------------------+----------------------+---+----+
|20210210070347     |1                 |1970-01-01            |1  |null|
|20210210070347     |2                 |1970-01-01            |2  |null|
|20210210070347     |3                 |2020-01-04            |3  |D   |
|20210210070347     |4                 |1998-04-13            |4  |I   |
|20210210070347     |5                 |2020-01-01            |5  |I   |
|20210210070445     |6                 |1998-04-13            |6  |I   |
+-------------------+------------------+----------------------+---+----+
```

After an upsert, the read optimized query returns records from both commits, C1 and C2.

Also, I don't find any log files in the partitions; all of the files are parquet files.

```
ls /tmp/hudi_trips_cow/1998-04-13/
0d1e6a84-d036-42e9-806e-a3075b6bc677-0_1-23-12025_20210210065058.parquet
0d1e6a84-d036-42e9-806e-a3075b6bc677-0_1-61-25595_20210210065127.parquet

ls /tmp/hudi_trips_cow/1970-01-01/
7b836833-a656-485d-967a-871bdc653dc3-0_2-61-25596_20210210065127.parquet
7b836833-a656-485d-967a-871bdc653dc3-0_3-23-12027_20210210065058.parquet
```
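
In the listing above, each partition holds two parquet files that share a file ID prefix and differ only in write token and commit time. The following is a minimal sketch, assuming the usual Hudi base file naming `<fileId>_<writeToken>_<instantTime>.parquet`, of how one could check which base file is the latest in each file group. That is the file a read optimized query would be expected to scan. `LatestBaseFiles`, `BaseFile`, `parse`, and `latestPerFileGroup` are hypothetical helper names for illustration, not part of the Hudi API.

```scala
// Sketch only: plain Scala, no Hudi or Spark dependency.
// Assumes base file names look like <fileId>_<writeToken>_<instantTime>.parquet.
object LatestBaseFiles {
  // Hypothetical helper type, not a Hudi class.
  final case class BaseFile(fileId: String, writeToken: String, instantTime: String, name: String)

  // fileId may contain dashes; the write token is three dash-separated numbers.
  private val BaseFileName = """^(.+)_(\d+-\d+-\d+)_(\d+)\.parquet$""".r

  // Returns None for names that do not match the assumed pattern (e.g. log files).
  def parse(name: String): Option[BaseFile] = name match {
    case BaseFileName(fileId, token, instant) => Some(BaseFile(fileId, token, instant, name))
    case _                                    => None
  }

  // Keeps the base file with the greatest instant time per file ID.
  // Assumes instant times have equal width, so string comparison orders them.
  def latestPerFileGroup(names: Seq[String]): Map[String, BaseFile] =
    names.flatMap(parse)
      .groupBy(_.fileId)
      .map { case (fileId, files) => fileId -> files.maxBy(_.instantTime) }
}
```

Given the two file names from `/tmp/hudi_trips_cow/1998-04-13/`, `LatestBaseFiles.latestPerFileGroup` keeps only the file with instant time `20210210065127`.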

Source of the issue: https://github.com/apache/hudi/issues/2255

Issue Links

links to

GitHub Pull Request #2255

People

Assignee: sivabalan narayanan
Reporter: sivabalan narayanan
Votes: 0
Watchers: 2

Dates

Created: 10/Feb/21 12:02
Updated: 30/Dec/21 13:01
Resolved: 30/Dec/21 13:01