Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Release Notes Required
Description
Original issue description
For in-memory storage, Raft logging can be optimized: it doesn't need to be active while the topology is stable.
Each write can go directly to in-memory storage at a much lower cost than synchronizing it with disk, so writing the Raft log can be avoided.
Since nodes don't hold any persistent state and always join the cluster clean, a full snapshot always has to be transferred during rebalancing, so there is no need to keep a long Raft log for historical rebalancing purposes.
So we need to implement an API for the Raft component that allows configuring the Raft logging process.
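A minimal sketch of what such a configuration entry point could look like, assuming a pluggable factory approach; the names below (RaftLogStorageFactory, VolatileLogOptions, LogStorage) are hypothetical and not the actual Ignite API:

```java
/**
 * Hypothetical factory that lets a Raft group choose between the regular
 * durable log and a volatile one. Not the actual Ignite API.
 */
public interface RaftLogStorageFactory {
    /** Placeholder for the Raft log storage abstraction. */
    interface LogStorage {
        void appendEntry(long index, byte[] data);

        /** Drops all entries with index < firstIndexKept. */
        void truncatePrefix(long firstIndexKept);
    }

    /** Hypothetical knobs for the volatile log. */
    record VolatileLogOptions(int maxInMemoryEntries, boolean spillToDisk) {}

    /** Creates a log storage for the given Raft group. */
    LogStorage createLogStorage(String groupId, VolatileLogOptions options);
}
```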
More detailed description
Apparently, we can't completely avoid writing to the log. There are several situations where it needs to be kept:
- During regular workload, each node needs to hold a small portion of the log in case it becomes the leader. There might be a number of "slow" nodes outside of the quorum that need older entries re-sent to them. A log entry can be truncated only when all nodes reply with an "ack" or fail; otherwise the entry must be preserved.
- During a clean node join, the joining node will need to apply the part of the log that wasn't included in the full-rebalance snapshot. So everything starting with the snapshot's applied index has to be preserved.
The second situation is really just a special case of the first one: we can't truncate the log until we receive all acks, and we can't receive an ack from the joining node until it finishes its rebalancing procedure.
So it all comes down to aggressive log truncation to keep the log short (see the sketch below).
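A toy illustration of that truncation rule; the types and names are illustrative, not Ignite code. A node that is still rebalancing contributes the applied index of the snapshot it received, which folds the second case into the first:

```java
import java.util.Collection;

/** Illustrative sketch of the truncation rule described above. */
final class TruncationRule {
    /**
     * Returns the highest log index that may be safely truncated (inclusive).
     *
     * @param ackedIndexes Last log index acknowledged by each live peer;
     *     failed peers are assumed to be excluded by the caller, and a node
     *     that is still rebalancing contributes its snapshot's applied index.
     */
    static long safeTruncationIndex(Collection<Long> ackedIndexes) {
        // An entry may be dropped only once every live peer has acked it.
        return ackedIndexes.stream().mapToLong(Long::longValue).min().orElse(0L);
    }
}
```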
The preserved log can be quite big in practice, so a disk-offloading mechanism must be available.
The easiest way to achieve this is to write into a RocksDB instance with the WAL disabled. It will keep everything in memory until a flush, and even then the amount of flushed data will be small on a stable topology. The absence of a WAL is not an issue: the entire RocksDB instance can be dropped on restart, since it is supposed to be volatile.
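A minimal sketch of that approach using the RocksJava API; key encoding, error handling, and the read path are simplified for brevity:

```java
import java.nio.ByteBuffer;

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteOptions;

/** Sketch of a volatile, WAL-less RocksDB-backed log store. */
public class VolatileRocksLog implements AutoCloseable {
    private final RocksDB db;
    private final WriteOptions writeOpts;

    public VolatileRocksLog(String path) throws RocksDBException {
        RocksDB.loadLibrary();

        // The instance is recreated from scratch on restart, so crash
        // durability of individual writes is irrelevant.
        this.db = RocksDB.open(new Options().setCreateIfMissing(true), path);

        // Disabling the WAL keeps appends in the memtable until a flush.
        this.writeOpts = new WriteOptions().setDisableWAL(true);
    }

    public void append(long index, byte[] entry) throws RocksDBException {
        db.put(writeOpts, ByteBuffer.allocate(Long.BYTES).putLong(index).array(), entry);
    }

    @Override public void close() {
        writeOpts.close();
        db.close();
    }
}
```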
To avoid even the smallest flush, we could use an additional volatile structure, like a ring buffer or a concurrent map, to store part of the log, and transfer records into RocksDB only when that structure overflows. This sounds more complicated and makes memory management more difficult, but we should take it into consideration anyway (a sketch follows below).
- Potentially, we could use a volatile page memory region for this purpose, since it already provides good control over the amount of memory used. But memory overflow would have to be handled carefully: usually it's treated as an error and might even cause a node failure.
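A minimal sketch of that hybrid structure, assuming a concurrent map as the hot part and an abstract overflow store (e.g., the WAL-less RocksDB instance above) as the cold part; all names here are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

/**
 * Sketch: a bounded volatile buffer that spills the oldest log entries into
 * a slower store on overflow.
 */
public class SpillingLogBuffer {
    /** Abstraction over the on-disk overflow store; hypothetical. */
    public interface OverflowStore {
        void put(long index, byte[] entry);
    }

    private final ConcurrentNavigableMap<Long, byte[]> hot = new ConcurrentSkipListMap<>();
    private final OverflowStore overflow;
    private final int maxHotEntries;

    public SpillingLogBuffer(OverflowStore overflow, int maxHotEntries) {
        this.overflow = overflow;
        this.maxHotEntries = maxHotEntries;
    }

    public void append(long index, byte[] entry) {
        hot.put(index, entry);

        // On a stable topology aggressive truncation keeps the hot part
        // small, so this spill path should rarely be taken.
        while (hot.size() > maxHotEntries) {
            Map.Entry<Long, byte[]> oldest = hot.pollFirstEntry();
            if (oldest != null) {
                overflow.put(oldest.getKey(), oldest.getValue());
            }
        }
    }

    /** Drops everything up to and including the given index (acked by all). */
    public void truncatePrefix(long upToInclusive) {
        hot.headMap(upToInclusive, true).clear();
        // The overflow store would need an analogous range delete.
    }
}
```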
Attachments
Issue Links
- split to:
  - IGNITE-17319 Remove entries replicated to all nodes from RAFT log in in-memory scenario (Open)
  - IGNITE-17335 Off-heap storage for RAFT log (Open)
  - IGNITE-17337 Support limiting volatile RAFT log memory by total entries size (Open)
  - IGNITE-17334 Basic volatile RAFT log storage (Resolved)
  - IGNITE-17336 Spill-out to disk support for volatile RAFT log storage (Resolved)