  1. Hive
  2. HIVE-19205 Hive streaming ingest improvements (v2)
  3. HIVE-19206

Automatic memory management for open streaming writers

Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0, 3.0.0
    • Fix Version/s: 3.1.0, 3.0.0
    • Component/s: Streaming
    • Labels: None

    Description

      Problem:
      When hundreds of record updaters are open, the memory required by the ORC writers keeps growing because of ORC's internal buffers. This can lead to high GC pressure or an OOM during streaming ingest.

      Solution:
      The high-level idea is for the streaming connection to keep track of all open record updaters and flush them periodically (at some interval). The number of records written to each record updater can be used as a metric to pick the candidate record updaters for flushing.
      If the ORC stripe size is 64MB, the default memory management check happens only after every 5000 rows, which may be too late when there are many concurrent writers in a process. For example, with 100 writers open and each of them holding an almost full 64MB stripe of buffered data, memory usage would be 100*64MB ~= 6GB. Once all of the record writers flush, memory usage drops to 100*~2MB, which is just ~200MB.
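The idea above can be illustrated with a minimal, self-contained sketch. It is not the code from the attached patches and does not use Hive or ORC APIs: `SketchWriter`, `WriterRegistry`, `checkAndFlush`, and the record budget are hypothetical names introduced only for illustration. The sketch assumes the registry tracks every open writer, uses the buffered record count as the flush metric, and flushes the fullest writers first until the total is back under a budget.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a record updater that buffers rows until flushed. */
class SketchWriter {
    final String name;
    long bufferedRecords = 0; // records written since the last flush
    int flushCount = 0;

    SketchWriter(String name) { this.name = name; }

    void write(long records) { bufferedRecords += records; }

    /** Pretend to write the buffered data out and release the buffers. */
    void flush() { bufferedRecords = 0; flushCount++; }
}

/** Hypothetical registry, playing the role of the streaming connection. */
class WriterRegistry {
    private final List<SketchWriter> open = new ArrayList<>();
    private final long maxBufferedRecords; // assumed budget across all writers

    WriterRegistry(long maxBufferedRecords) { this.maxBufferedRecords = maxBufferedRecords; }

    SketchWriter open(String name) {
        SketchWriter w = new SketchWriter(name);
        open.add(w);
        return w;
    }

    void close(SketchWriter w) {
        w.flush();
        open.remove(w);
    }

    long totalBuffered() {
        long total = 0;
        for (SketchWriter w : open) total += w.bufferedRecords;
        return total;
    }

    /**
     * Flushes the writers holding the most buffered records until the total
     * is within budget. Meant to be called at some interval. Returns the
     * number of writers flushed.
     */
    int checkAndFlush() {
        long total = totalBuffered();
        if (total <= maxBufferedRecords) return 0;
        List<SketchWriter> candidates = new ArrayList<>(open);
        candidates.sort(Comparator.comparingLong((SketchWriter w) -> w.bufferedRecords).reversed());
        int flushed = 0;
        for (SketchWriter w : candidates) {
            if (total <= maxBufferedRecords) break;
            total -= w.bufferedRecords;
            w.flush();
            flushed++;
        }
        return flushed;
    }
}

class FlushSketchDemo {
    public static void main(String[] args) {
        WriterRegistry registry = new WriterRegistry(100);
        SketchWriter a = registry.open("a");
        SketchWriter b = registry.open("b");
        a.write(80);
        b.write(30);
        System.out.println("flushed writers: " + registry.checkAndFlush());
        System.out.println("records still buffered: " + registry.totalBuffered());
    }
}
```

In a real process the check would be driven by a timer or by a row-count trigger, and the budget would more likely be expressed in bytes than in records. With the figures from the description, 100 writers at 64MB each hold 6400MB before flushing and roughly 100*2MB = 200MB after.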

      Attachments

        1. HIVE-19206.3.patch
          48 kB
          Prasanth Jayachandran
        2. HIVE-19206.2.patch
          47 kB
          Prasanth Jayachandran
        3. HIVE-19206.1.patch
          46 kB
          Prasanth Jayachandran

            People

              Assignee: Prasanth Jayachandran (prasanth_j)
              Reporter: Prasanth Jayachandran (prasanth_j)