Details
- Bug
- Status: Closed
- Blocker
- Resolution: Fixed
- 0.11.1
- 0.5
Description
While testing how clean interacts with savepoint, I found that when the cleaner keeps the latest file versions, files belonging to a savepoint get deleted. After reading the code, I believe this is a bug.
For example, if I use "HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS" and set "hoodie.cleaner.fileversions.retained" to 2, then do the following:
1. insert, get xxxx_001.parquet
2. savepoint
3. insert, get xxxx_002.parquet
4. insert, get xxxx_003.parquet
After the fourth step, xxxx_001.parquet is deleted even though it belongs to the savepoint!
The relevant code is in hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java, in getFilesToCleanKeepingLatestVersions:
- According to the following code, on the one hand, savepointed files that fall within the retained versions are skipped without decrementing keepVersions, so they are not counted toward the number of retained versions, which I feel is unreasonable.
- On the other hand, if there is a savepointed file among the remaining (older) versions, it will be deleted, which I don't think is in line with the design philosophy of savepoints.
    while (fileSliceIterator.hasNext() && keepVersions > 0) {
      // Skip this most recent version
      FileSlice nextSlice = fileSliceIterator.next();
      Option<HoodieBaseFile> dataFile = nextSlice.getBaseFile();
      if (dataFile.isPresent() && savepointedFiles.contains(dataFile.get().getFileName())) {
        // do not clean up a savepoint data file
        continue;
      }
      keepVersions--;
    }
    // Delete the remaining files
    while (fileSliceIterator.hasNext()) {
      FileSlice nextSlice = fileSliceIterator.next();
      deletePaths.addAll(getCleanFileInfoForSlice(nextSlice));
    }
So I think the savepoint check should be moved down into the deletion loop. It can be fixed like this:
    while (fileSliceIterator.hasNext() && keepVersions > 0) {
      // Skip this most recent version
      fileSliceIterator.next();
      keepVersions--;
    }
    // Delete the remaining files
    while (fileSliceIterator.hasNext()) {
      FileSlice nextSlice = fileSliceIterator.next();
      Option<HoodieBaseFile> dataFile = nextSlice.getBaseFile();
      if (dataFile.isPresent() && savepointedFiles.contains(dataFile.get().getFileName())) {
        // do not clean up a savepoint data file
        continue;
      }
      deletePaths.addAll(getCleanFileInfoForSlice(nextSlice));
    }
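To make the difference between the two loops easier to see, here is a minimal standalone sketch that replays the four steps above. It is not Hudi code: the class SavepointCleanSketch and the methods reportedLogic and proposedLogic are hypothetical names, file slices are reduced to plain file-name strings ordered newest first, and getCleanFileInfoForSlice is replaced by simply collecting the name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the cleaner loops; file slices are modeled as file names.
class SavepointCleanSketch {

  // Mirrors the loop quoted in the report: the savepoint check sits in the "keep" loop.
  static List<String> reportedLogic(List<String> newestFirst, Set<String> savepointed, int keepVersions) {
    List<String> deletePaths = new ArrayList<>();
    Iterator<String> it = newestFirst.iterator();
    while (it.hasNext() && keepVersions > 0) {
      String file = it.next();
      if (savepointed.contains(file)) {
        // savepointed file is skipped and not counted toward keepVersions
        continue;
      }
      keepVersions--;
    }
    // everything older is deleted, savepointed or not
    while (it.hasNext()) {
      deletePaths.add(it.next());
    }
    return deletePaths;
  }

  // Mirrors the proposed change: the savepoint check moves into the "delete" loop.
  static List<String> proposedLogic(List<String> newestFirst, Set<String> savepointed, int keepVersions) {
    List<String> deletePaths = new ArrayList<>();
    Iterator<String> it = newestFirst.iterator();
    while (it.hasNext() && keepVersions > 0) {
      it.next(); // every retained version counts, savepointed or not
      keepVersions--;
    }
    while (it.hasNext()) {
      String file = it.next();
      if (savepointed.contains(file)) {
        // do not clean up a savepoint data file
        continue;
      }
      deletePaths.add(file);
    }
    return deletePaths;
  }

  public static void main(String[] args) {
    // State after step 4, newest version first; the savepoint from step 2 covers xxxx_001.parquet.
    List<String> files = Arrays.asList("xxxx_003.parquet", "xxxx_002.parquet", "xxxx_001.parquet");
    Set<String> savepointed = Collections.singleton("xxxx_001.parquet");

    System.out.println("reported: " + reportedLogic(files, savepointed, 2));
    System.out.println("proposed: " + proposedLogic(files, savepointed, 2));
  }
}
```

With two retained versions, the reported loop puts xxxx_001.parquet in the delete list while the proposed loop leaves the list empty. The sketch also shows the first concern: if the savepoint instead covered xxxx_003.parquet, the reported loop would delete nothing and keep three versions, whereas the proposed loop would keep two and delete xxxx_001.parquet.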
Thanks.