Details
- Type: Improvement
- Status: Closed
- Priority: Minor
- Resolution: Fixed
Description
When a HUDI clean plan is executed, any targeted file that was not confirmed as deleted (or as already non-existing) is marked as a "failed delete". Although these failed deletes are recorded in the `.clean` metadata, if incremental clean is used these files may never be picked up by a future clean plan unless a "full-scan" clean happens to be scheduled. Besides files unnecessarily taking up storage space for longer, this can lead to the following dataset consistency issue for COW datasets:
- Insert at C1 creates file group f1 in a partition
- Replacecommit at RC2 creates file group f2 in the partition and replaces f1
- Any reader of the partition that goes through the HUDI API (with or without using MDT) will recognize that f1 should be ignored, as it has been replaced, because the RC2 instant file is still in the active timeline
- Some completed instants later, an incremental clean is scheduled. It moves the "earliest commit to retain" to a time after instant time RC2, so it targets f1 for deletion. But during execution of the plan, it fails to delete f1.
- An archive job is eventually triggered and archives C1 and RC2. Note that f1 is still in the partition
At this point, any job/query that reads the aforementioned partition directly through DFS file system calls (without using the MDT FILES partition) will consider both f1 and f2 as valid file groups, since RC2 is no longer in the active timeline. This is a data consistency issue, and it will only be resolved if a "full-scan" clean is triggered and deletes f1.
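To illustrate what "reading the partition directly from DFS" means in this scenario, here is a minimal sketch using the standard Hadoop FileSystem API. The table path and file suffix are placeholders, and this is not Hudi code; it only shows that a raw listing surfaces every base file on storage, so once RC2 is archived nothing tells such a reader that f1 was replaced.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawPartitionListing {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder path for the partition discussed above.
    Path partition = new Path("hdfs:///warehouse/my_table/partition=2024-01-01");
    FileSystem fs = partition.getFileSystem(conf);

    // Raw listing: both f1 (replaced but never deleted) and f2 show up,
    // so a reader relying only on this listing treats both as valid data.
    for (FileStatus status : fs.listStatus(partition)) {
      if (status.getPath().getName().endsWith(".parquet")) {
        System.out.println("Visible base file: " + status.getPath().getName());
      }
    }
  }
}
```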
This specific scenario can be avoided if the user can configure HUDI clean to fail execution of a clean plan unless every targeted file is confirmed as deleted (or already absent from DFS), effectively "blocking" the clean. The next clean attempt will then re-execute the existing plan, since clean plans cannot be "rolled back".
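A minimal sketch of the proposed check, assuming a hypothetical boolean option that enables the blocking behavior; the class, flag, and method names below are placeholders and not the actual Hudi implementation.

```java
import java.util.List;

// Hypothetical post-execution check for the proposed "blocking" clean.
public class BlockingCleanCheck {

  // Placeholder for the proposed "fail clean on failed deletes" option.
  private final boolean failOnFailedDeletes;

  public BlockingCleanCheck(boolean failOnFailedDeletes) {
    this.failOnFailedDeletes = failOnFailedDeletes;
  }

  /**
   * Called after the clean plan has been executed and per-file delete results
   * are known. If any targeted file could not be confirmed as deleted (or as
   * already missing), fail the clean so the same plan is re-executed by the
   * next clean attempt instead of being committed with failed deletes.
   */
  public void validate(List<String> failedDeletePaths) {
    if (failOnFailedDeletes && !failedDeletePaths.isEmpty()) {
      throw new IllegalStateException(
          "Clean plan left " + failedDeletePaths.size()
              + " file(s) undeleted; failing so the plan is retried: " + failedDeletePaths);
    }
  }
}
```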