Uploaded image for project: 'Apache Hudi'
  1. Apache Hudi
  2. HUDI-3066

Very slow file listing after enabling metadata for existing tables in 0.10.0 release

    XMLWordPrintableJSON

Details

    Description

      After 'metadata table' is enabled, File listing takes long time.

      If metadata is enabled on Reader side(as shown below), it is taking even more time per file listing task

      import org.apache.hudi.DataSourceReadOptions
      import org.apache.hudi.common.config.HoodieMetadataConfig
      
      val hadoopConf = spark.conf
      hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
      val basePath = "s3a://datalake-hudi"
      val sessions = spark
      .read
      .format("org.apache.hudi")
      .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
      .option(DataSourceReadOptions.READ_PATHS.key(), s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
      .load()
      sessions.createOrReplaceTempView("sessions") 

      Existing tables (COW) have inline clustering on and have many replace commits.

      Logs seem to suggest the delay is in view.AbstractTableFileSystemView 

      resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata

      Also many log messages in AbstractHoodieLogRecordReader

       

      2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms to read  136 instants, 9731 replaced file groups

      2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
      2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.76_0-20-515 at instant 20211217035105329
      2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
      2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.121_0-57-663', fileLen=0}
      2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.20_0-35-613', fileLen=0}
      2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek policy
      2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.62_0-34-377 at instant 20211217022049877
      2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
      2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.86_0-20-362', fileLen=0}
      2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.121_0-57-663', fileLen=0}
      2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek policy
      2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.77_0-35-590', fileLen=0}
      2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.20_0-35-613 at instant 20211216183448389
      2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
      2021-12-18 23:37:46,118 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.63_0-56-519', fileLen=0}
      2021-12-18 23:37:46,122 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.86_0-20-362', fileLen=0}
      2021-12-18 23:37:46,122 INFO s3a.S3AInputStream: Switching to Random IO seek policy
      2021-12-18 23:37:46,123 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.121_0-57-663 at instant 20211217090337935
      2021-12-18 23:37:46,123 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
      2021-12-18 23:37:46,127 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.77_0-35-590', fileLen=0}
      2021-12-18 23:37:46,127 INFO s3a.S3AInputStream: Switching to Random IO seek policy
      2021-12-18 23:37:46,141 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.21_0-56-795', fileLen=0}
      2021-12-18 23:37:46,143 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.63_0-56-519', fileLen=0}
      2021-12-18 23:37:46,143 INFO s3a.S3AInputStream: Switching to Random IO seek policy
      2021-12-18 23:37:46,144 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.86_0-20-362 at instant 20211217045206861
      2021-12-18 23:37:46,144 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
      2021-12-18 23:37:46,145 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.77_0-35-590 at instant 20211217035058289
      2021-12-18 23:37:46,145 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
      2021-12-18 23:37:46,162 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.21_0-56-795', fileLen=0}
      2021-12-18 23:37:46,162 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.63_0-56-519 at instant 20211217022306173
      2021-12-18 23:37:46,162 INFO s3a.S3AInputStream: Switching to Random IO seek policy
      2021-12-18 23:37:46,162 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
      2021-12-18 23:37:46,170 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.87_0-35-433', fileLen=0}
      2021-12-18 23:37:46,171 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.78_0-56-705', fileLen=0}
      2021-12-18 23:37:46,178 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-0000_20211216144130775001.log.21_0-56-795 at instant 20211216184106961
      2021-12-18 23:37:46,178 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge

      2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: Number of log files scanned => 437
      2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: MaxMemoryInBytes allowed for compaction => 1073741824
      2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: Number of entries in MemoryBasedMap in ExternalSpillableMap => 165
      2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: Total size in bytes of MemoryBasedMap in ExternalSpillableMap => 259380
      2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: Number of entries in BitCaskDiskMap in ExternalSpillableMap => 0
      2021-12-18 23:38:35,380 INFO log.HoodieMergedLogRecordScanner: Size of file spilled to disk => 0
      2021-12-18 23:38:35,380 INFO metadata.HoodieBackedTableMetadata: Opened 437 metadata log files (dataset instant=20211218233649435, metadata instant=20211218233649435) in 22935 ms

      2021-12-18 23:38:37,193 INFO metadata.HoodieBackedTableMetadata: Opened 437 metadata log files (dataset instant=20211218233649435, metadata instant=20211218233649435) in 22802 ms

      Sample file listing tasks

       

      Attachments

        1. metadata_files_compacted.txt
          0.9 kB
          Harsha Teja Kanna
        2. metadata_files.txt
          40 kB
          Harsha Teja Kanna
        3. metadata_timeline_archived.txt
          2 kB
          Harsha Teja Kanna
        4. metadata_timeline_compacted.txt
          8 kB
          Harsha Teja Kanna
        5. metadata_timeline.txt
          7 kB
          Harsha Teja Kanna
        6. Screen Shot 2021-12-18 at 6.16.29 PM.png
          408 kB
          Harsha Teja Kanna
        7. Screen Shot 2021-12-20 at 10.05.50 PM.png
          253 kB
          Harsha Teja Kanna
        8. Screen Shot 2021-12-20 at 10.17.44 PM.png
          291 kB
          Harsha Teja Kanna
        9. Screen Shot 2021-12-21 at 10.22.54 PM.png
          485 kB
          Harsha Teja Kanna
        10. Screen Shot 2021-12-21 at 10.24.12 PM.png
          478 kB
          Harsha Teja Kanna
        11. stderr_part1.txt
          52.88 MB
          Harsha Teja Kanna
        12. stderr_part2.txt
          13.57 MB
          Harsha Teja Kanna
        13. timeline.txt
          79 kB
          Harsha Teja Kanna
        14. writer_log.txt
          2.20 MB
          Harsha Teja Kanna

        Issue Links

          Activity

            People

              guoyihua Ethan Guo (this is the old account; please use "yihua")
              h7kanna Harsha Teja Kanna
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: