Apache Gobblin / GOBBLIN-1985

Incorporate the mod time of enclosing dirs into the `SourceHadoopFsEndPoint` watermark calculations

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: gobblin-core
    • Labels: None

    Description

      `SourceHadoopFsEndPoint.getWatermark` currently misses certain changes to the contents of the dirs it covers, because it does not use the mod time of those enclosing dirs when calculating the watermark. Although those dirs are not among the files to copy, they must still participate in the watermark calculation, because some compute engines, like Spark, first write files to a temp subdir beneath the ultimate dest dir. (Since those temp subdirs are not valid paths, they will be skipped anyway.) Once all executors have written their files, Spark moves each file from that temp subdir up to the enclosing ultimate dest dir. Such a move DOES NOT update the mod time of the file itself, only that of its enclosing dir. Hence the enclosing dir's mod time must be incorporated in order to observe such changes.
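
The failure mode can be illustrated with a minimal sketch. It assumes the watermark is simply the latest mod time across the listed entries, which may not match the actual `getWatermark` logic. `Entry`, `filesOnlyWatermark`, and `watermarkWithDirs` are hypothetical names invented for illustration, not Gobblin or Hadoop types, and the timestamps are made up.

```java
import java.util.Arrays;
import java.util.List;

public class WatermarkSketch {
    /** Hypothetical stand-in for one listing entry; not a Gobblin or Hadoop type. */
    static final class Entry {
        final String path;
        final long modTime;
        final boolean isDir;

        Entry(String path, long modTime, boolean isDir) {
            this.path = path;
            this.modTime = modTime;
            this.isDir = isDir;
        }
    }

    /** Assumed "before" behavior: only the files to copy contribute. */
    static long filesOnlyWatermark(List<Entry> entries) {
        long max = -1L;
        for (Entry e : entries) {
            if (!e.isDir) {
                max = Math.max(max, e.modTime);
            }
        }
        return max;
    }

    /** Assumed "after" behavior: enclosing dirs contribute their mod time too. */
    static long watermarkWithDirs(List<Entry> entries) {
        long max = -1L;
        for (Entry e : entries) {
            max = Math.max(max, e.modTime);
        }
        return max;
    }

    public static void main(String[] args) {
        // Listing before the move: part-1 still sits in a temp subdir, which is
        // skipped as an invalid path and therefore does not appear here.
        List<Entry> before = Arrays.asList(
            new Entry("/data", 100L, true),
            new Entry("/data/part-0", 100L, false));

        // Listing after the move at t=200: part-1 keeps its original mod time (90),
        // and only the enclosing dir's mod time advances.
        List<Entry> after = Arrays.asList(
            new Entry("/data", 200L, true),
            new Entry("/data/part-0", 100L, false),
            new Entry("/data/part-1", 90L, false));

        System.out.println("files only: " + filesOnlyWatermark(before)
            + " -> " + filesOnlyWatermark(after));
        System.out.println("with dirs:  " + watermarkWithDirs(before)
            + " -> " + watermarkWithDirs(after));
    }
}
```

In this sketch the files-only value stays at 100 across the move, so the newly arrived `part-1` goes unnoticed, while the value that includes the dir advances from 100 to 200.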

          People

            Abhishek Tiwari
            Kip Kohn
            Votes: 0
            Watchers: 1

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 1h