  1. Apache Hudi
  2. HUDI-2750

Fetch the incremental data files metadata more efficiently for the streaming source

Details

    • Type: Task
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: Common Core
    • Labels: None

    Description

      There are currently three ways to fetch the incremental data files for streaming read:

      1. Read the incremental commit metadata and resolve the data files to construct the incremental filesystem view.
      2. Scan the filesystem directly and filter the data files by start commit time, if consumption starts from the 'earliest' offset.
      3. As a more efficient variant of 2, look up the metadata table if it is enabled.
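
      Approach 2 can be pictured roughly as follows. This is a minimal illustration, not Hudi code: it assumes base file names of the form `<fileId>_<writeToken>_<instantTime>.<ext>`, that instant times are fixed-width timestamp strings that compare lexicographically, and that the start commit is inclusive. The helper names and the sample file names are hypothetical.

```python
def instant_time_of(file_name: str) -> str:
    """Extract the instant time from an assumed
    <fileId>_<writeToken>_<instantTime>.<ext> file name."""
    stem = file_name.rsplit(".", 1)[0]   # drop the extension
    return stem.rsplit("_", 1)[1]        # last underscore-separated token


def filter_files_by_start_commit(file_names, start_commit):
    """Keep files written at or after start_commit.

    Relies on fixed-width timestamps, so plain string comparison
    orders instants chronologically (an assumption of this sketch).
    """
    return [f for f in file_names if instant_time_of(f) >= start_commit]


# Hypothetical listing of one partition.
files = [
    "f1_0-1-0_20211101000000.parquet",
    "f1_0-1-0_20211105000000.parquet",
    "f2_0-1-0_20211108000000.parquet",
]
recent = filter_files_by_start_commit(files, "20211105000000")
```

      Note that every file in the listing still has to be visited, which is why this path only pays off when the reader wants most of the table anyway.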

      However, these three approaches are far from sufficient for production:

      For 1: it becomes a bottleneck when the start commit time is far in the past and the instants may already have been archived. Loading those metadata files takes too long; in our production it takes more than 30 minutes, which is unacceptable.

      For 2 and 3: they are only suitable for cases that read the full history plus the incremental data set.

      We should propose a way to look up the incremental data files for an arbitrary time interval of instants, so that the filesystem view can be constructed efficiently.
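
      One way to picture such a lookup is an index kept sorted by instant time that answers range queries without loading every commit metadata file. The sketch below is purely illustrative: it is not an existing Hudi API and not the design proposed in this issue. The class `InstantFileIndex`, its methods, and the sample data are hypothetical, and the interval semantics (start exclusive, end inclusive) are an assumption.

```python
from bisect import bisect_right, insort


class InstantFileIndex:
    """Hypothetical in-memory index from instant time to data files."""

    def __init__(self):
        self._instants = []   # instant times, kept sorted
        self._files = {}      # instant time -> list of data file names

    def add(self, instant, files):
        """Register the data files written by one instant."""
        if instant not in self._files:
            insort(self._instants, instant)
            self._files[instant] = []
        self._files[instant].extend(files)

    def files_between(self, start, end):
        """Return files written by instants in the interval (start, end]."""
        lo = bisect_right(self._instants, start)  # first instant > start
        hi = bisect_right(self._instants, end)    # first instant > end
        result = []
        for instant in self._instants[lo:hi]:
            result.extend(self._files[instant])
        return result


index = InstantFileIndex()
index.add("20211101000000", ["a.parquet"])
index.add("20211108000000", ["d.parquet"])
index.add("20211105000000", ["b.parquet", "c.parquet"])
window = index.files_between("20211101000000", "20211105000000")
```

      The two binary searches locate the interval boundaries in logarithmic time, so the cost depends on the number of instants inside the interval rather than on how far back the start commit lies.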

          People

            linliu Lin Liu
            danny0405 Danny Chen