Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Description
`SourceHadoopFsEndPoint.getWatermark` currently misses certain changes to the contents of the directories it covers, because it does not use the modification time of those enclosing directories when calculating the watermark. Although those directories are not among the files to copy, they must still participate in the watermark calculation, because some compute engines, like Spark, first write files to a temp subdirectory beneath the ultimate destination directory. (Since those temp subdirectories are not valid paths, they are skipped anyway.) Once all executors have written their files, Spark moves each file from that temp subdirectory up to the enclosing destination directory. Such a move DOES NOT update the modification time of the file itself, only that of its enclosing directory. Hence the need to incorporate the enclosing directory's modification time, in order to observe such directory changes.
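
To illustrate the gap, here is a minimal sketch in plain Java. It is not the actual `SourceHadoopFsEndPoint` code: the class `WatermarkSketch`, the `Entry` type, and both method names are hypothetical, and modification times are passed in as plain `long` values rather than read from a Hadoop `FileStatus`. It assumes the watermark is derived from the maximum modification time seen.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; names and types are assumptions, not the real API. */
final class WatermarkSketch {

    /** Stand-in for a file status: its path, its enclosing dir, and its mod time (millis). */
    static final class Entry {
        final String path;
        final String parentDir;
        final long modTime;

        Entry(String path, String parentDir, long modTime) {
            this.path = path;
            this.parentDir = parentDir;
            this.modTime = modTime;
        }
    }

    /** Max mod time over the files alone; a file renamed into the dir keeps its old mod time. */
    static long filesOnlyWatermark(List<Entry> files) {
        long max = 0L;
        for (Entry f : files) {
            max = Math.max(max, f.modTime);
        }
        return max;
    }

    /** Same as above, but also folds in the mod time of each file's enclosing dir. */
    static long watermarkWithDirs(List<Entry> files, Map<String, Long> dirModTimes) {
        long max = filesOnlyWatermark(files);
        for (Entry f : files) {
            Long dirTime = dirModTimes.get(f.parentDir);
            if (dirTime != null) {
                max = Math.max(max, dirTime);
            }
        }
        return max;
    }
}
```

For example, suppose `/dest/part-0` has mod time 150, and `/dest/part-1` was written to a temp subdirectory at time 100 and then moved into `/dest` at time 200. The files-only value stays at 150, so the arrival of `part-1` goes unnoticed, whereas including the directory's mod time raises the value to 200.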