Description
When Kafka Connect gives a sink task its own copy of the List<SinkRecord>, RAM utilisation shoots up: at that moment two lists exist, and the original list is cleared only after the sink worker finishes the current batch.
Originally the list is declared final and a copy of it is provided to the sink task, because sink tasks can be custom and the copy lets users process the records however they want without any risk. But one of the most popular uses of Kafka Connect is OLTP-to-OLAP replication, and during initial copying/snapshots a lot of data is generated rapidly, which fills the list to its maximum batch size and leaves us prone to "Out of Memory" errors. The only use of the list is: get filled > cloned for sink > get size > cleared > repeat. So I now take the size of the list before giving the original list to the sink task, and after the sink has performed its operations I set list = new ArrayList<>(). I did not use clear(), just in case the sink task has set our list to null.
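The two hand-off strategies described above can be sketched roughly as follows. This is a minimal illustration, not the actual Kafka Connect worker code: the class `BatchHandoffSketch`, its methods, and the use of `String` in place of `SinkRecord` are all assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for the worker's batch handling (illustrative names only).
public class BatchHandoffSketch {
    // Not final, so the reference can be replaced after each delivery.
    private List<String> messageBatch = new ArrayList<>();

    public void add(String record) {
        messageBatch.add(record);
    }

    // Original approach: the sink receives a copy, so two lists
    // exist at the same time until the original is cleared.
    public int deliverWithCopy(Consumer<List<String>> sink) {
        sink.accept(new ArrayList<>(messageBatch));
        int delivered = messageBatch.size();
        messageBatch.clear();
        return delivered;
    }

    // Proposed approach: record the size first, hand over the original list,
    // then replace it with a fresh list instead of calling clear().
    public int deliverWithoutCopy(Consumer<List<String>> sink) {
        int delivered = messageBatch.size();
        sink.accept(messageBatch);
        messageBatch = new ArrayList<>();
        return delivered;
    }

    public int pending() {
        return messageBatch.size();
    }
}
```

In the second method the size is read before the hand-off, so the count stays correct even if the sink empties or otherwise modifies the list it was given.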
There is a time vs. memory trade-off:
In the original approach the JVM does not have to spend time finding free memory.
In the new approach the JVM has to create a new list by finding free memory addresses, but this results in more free memory.