[HUDI-2400] Allow timeline server correctly sync when concurrent write to timeline - ASF JIRA

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: compaction
Labels:
- pull-request-available

Description

First, assume ~~HUDI-1847~~ is available, so that an ingestion Spark job and a compaction job can run at the same time.
Assume also that each HoodieTimeline object carries a timestamp representing the time at which it was loaded from HDFS.
Consider the following case:
1. Ingestion schedules compaction inline. The timeline is now: 1.deltaCommit.Completed, 2.Compaction.Requested (TimeStamp: 1L).
2. Ingestion keeps moving on. The ingestion job now has: 1.deltaCommit.Completed, 2.Compaction.Requested, 3.deltaCommit.Inflight (TimeStamp: 2L).
3. An independent Spark job runs compaction 2. The timeline is now: 1.deltaCommit.Completed, 2.Compaction.Inflight, 3.deltaCommit.Inflight (TimeStamp: 3L).
4. Executors in the ingestion job send requests to the timeline server while still holding the timeline with TimeStamp 2L. The timeline server, however, has a timeline with TimeStamp 3L, which is later than the client's.

According to the logic in https://github.com/apache/hudi/blob/master/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java#L137,
the server assumes that its local view of the table's timeline is behind the client's view whenever the timeline hashes differ. However, this may not be true in the case described above.
There, the hashes differ because the client's view is behind the local view.
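
The ambiguity can be illustrated with a minimal sketch. This is not Hudi's actual code: the class `HashBasedSyncSketch`, its methods, the SHA-256 digest, and the instant strings are all hypothetical stand-ins chosen for illustration. It only shows that a bare hash comparison reports "local view is behind" for the timelines from steps 2 and 3, even though the server-side timeline is the newer one.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of a hash-only staleness check (not Hudi's classes).
class HashBasedSyncSketch {

    // Stand-in for a timeline hash: a SHA-256 digest over the instant names.
    static byte[] timelineHash(List<String> instants) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (String instant : instants) {
                md.update(instant.getBytes(StandardCharsets.UTF_8));
                md.update((byte) '\n'); // separator so instant boundaries affect the digest
            }
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hash-only rule: any mismatch is interpreted as "local view is behind the client".
    static boolean isLocalViewBehind(List<String> local, List<String> client) {
        return !Arrays.equals(timelineHash(local), timelineHash(client));
    }

    public static void main(String[] args) {
        // Client (ingestion executors) still holds the timeline from step 2.
        List<String> client = Arrays.asList(
            "1.deltaCommit.Completed", "2.Compaction.Requested", "3.deltaCommit.Inflight");
        // Server holds the timeline from step 3, which is newer.
        List<String> server = Arrays.asList(
            "1.deltaCommit.Completed", "2.Compaction.Inflight", "3.deltaCommit.Inflight");

        // Prints true: the mismatch is reported as "local is behind",
        // although here it is the client that is behind.
        System.out.println(isLocalViewBehind(server, client));
    }
}
```

A hash mismatch only says that the two timelines differ; it carries no ordering information, so it cannot tell which side is stale.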

A simple solution is to add an attribute to the timeline: the timestamp used above.
The timeline server can then decide whether to sync the fileSystemView by comparing the client's and the local timestamps, rather than by checking whether the timeline hashes differ.
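
One possible shape of this timestamp-based check is sketched below. Everything in it is an assumption made for illustration: `TimelineSnapshot`, its `loadedAt` field, and `shouldSyncLocalView` are hypothetical names, not part of Hudi's API, and the timestamps are the 2L and 3L values from the scenario.

```java
// Hypothetical sketch of the proposed timestamp-based check (not Hudi's API).
class TimestampBasedSyncSketch {

    // A timeline view tagged with the time at which it was loaded from storage.
    static final class TimelineSnapshot {
        final String hash;    // timeline hash, kept only to show that it is not consulted
        final long loadedAt;  // assumed new attribute: when this timeline was read

        TimelineSnapshot(String hash, long loadedAt) {
            this.hash = hash;
            this.loadedAt = loadedAt;
        }
    }

    // Sync the local (server) view only when the client's timeline is strictly newer.
    static boolean shouldSyncLocalView(TimelineSnapshot local, TimelineSnapshot client) {
        return client.loadedAt > local.loadedAt;
    }

    public static void main(String[] args) {
        TimelineSnapshot client = new TimelineSnapshot("hash-of-step-2", 2L);
        TimelineSnapshot server = new TimelineSnapshot("hash-of-step-3", 3L);

        // Hashes differ, but the server is newer, so no sync is triggered: prints false.
        System.out.println(shouldSyncLocalView(server, client));

        // A client that loaded its timeline later than the server triggers a sync: prints true.
        System.out.println(shouldSyncLocalView(server, new TimelineSnapshot("hash-newer", 4L)));
    }
}
```

The sketch assumes that the client and server timestamps are comparable, for example because both are derived from the same storage or clock source; the description does not say how that would be guaranteed.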

Issue Links

duplicates

HUDI-2761 IllegalArgException from timeline server when serving getLastestBaseFiles with multi-writer

Closed

is a child of

HUDI-1847 Add ability to decouple configs for scheduling inline and running async

Closed

links to

GitHub Pull Request #4800

People

Assignee: Unassigned

Reporter: ZiyueGuan

Votes: 0

Watchers: 1

Dates

Created: 05/Sep/21 17:55

Updated: 14/Feb/22 02:26

Resolved: 12/Dec/21 04:40

Time Tracking

Estimated: 0.5h

Remaining: 0.5h

Logged: Not Specified