Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Release Notes Required
Description
Original issue description
For in-memory storage, Raft logging can be optimized: it doesn't need to be active while the topology is stable.
Each write can go directly to in-memory storage at a much lower cost than synchronizing it with disk, so writing the Raft log can be avoided.
Since nodes don't hold any persistent state and always join the cluster clean, a full snapshot always has to be transferred during rebalancing, so there is no need to keep a long Raft log for historical rebalancing purposes.
So we need to implement an API for the Raft component that allows configuring the Raft logging process.
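A minimal sketch of what such a configuration entry point could look like, assuming a pluggable factory approach; the names below (RaftLogStorageFactory, VolatileLogOptions, LogStorage) are hypothetical and not the actual Ignite API:

```java
/**
 * Hypothetical factory that lets a Raft group choose between the regular
 * durable log and a volatile one. Not the actual Ignite API.
 */
public interface RaftLogStorageFactory {
    /** Placeholder for the Raft log storage abstraction. */
    interface LogStorage {
        void appendEntry(long index, byte[] data);

        /** Drops all entries with index < firstIndexKept. */
        void truncatePrefix(long firstIndexKept);
    }

    /** Hypothetical knobs for the volatile log. */
    record VolatileLogOptions(int maxInMemoryEntries, boolean spillToDisk) {}

    /** Creates a log storage for the given Raft group. */
    LogStorage createLogStorage(String groupId, VolatileLogOptions options);
}
```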
More detailed description
Apparently, we can't completely avoid writing to the log. There are several situations where it needs to be kept:
- During regular workload, each node needs to hold a small portion of the log in case it becomes the leader. There might be a number of "slow" nodes outside of the quorum that need older entries re-sent to them. A log entry can be truncated only when all nodes reply with an "ack" or fail; otherwise the entry must be preserved.
- During a clean node join, the joining node will need to apply the part of the log that wasn't included in the full-rebalance snapshot. So everything starting with the snapshot's applied index has to be preserved.
The second situation is really just a special case of the first one: we can't truncate the log until we receive all acks, and we can't receive an ack from the joining node until it finishes its rebalancing procedure.
So it all comes down to aggressive log truncation to keep the log short (see the sketch below).
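A toy illustration of that truncation rule; the types and names are illustrative, not Ignite code. A node that is still rebalancing contributes the applied index of the snapshot it received, which folds the second case into the first:

```java
import java.util.Collection;

/** Illustrative sketch of the truncation rule described above. */
final class TruncationRule {
    /**
     * Returns the highest log index that may be safely truncated (inclusive).
     *
     * @param ackedIndexes Last log index acknowledged by each live peer;
     *     failed peers are assumed to be excluded by the caller, and a node
     *     that is still rebalancing contributes its snapshot's applied index.
     */
    static long safeTruncationIndex(Collection<Long> ackedIndexes) {
        // An entry may be dropped only once every live peer has acked it.
        return ackedIndexes.stream().mapToLong(Long::longValue).min().orElse(0L);
    }
}
```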
The preserved log can be quite big in practice, so a disk-offloading mechanism must be available.
The easiest way to achieve this is to write into a RocksDB instance with the WAL disabled. It will keep everything in memory until a flush, and even then the amount of flushed data will be small on a stable topology. The absence of a WAL is not an issue: the entire RocksDB instance can be dropped on restart, since it is supposed to be volatile.
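A minimal sketch of that approach using the RocksJava API; key encoding, error handling, and the read path are simplified for brevity:

```java
import java.nio.ByteBuffer;

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteOptions;

/** Sketch of a volatile, WAL-less RocksDB-backed log store. */
public class VolatileRocksLog implements AutoCloseable {
    private final RocksDB db;
    private final WriteOptions writeOpts;

    public VolatileRocksLog(String path) throws RocksDBException {
        RocksDB.loadLibrary();

        // The instance is recreated from scratch on restart, so crash
        // durability of individual writes is irrelevant.
        this.db = RocksDB.open(new Options().setCreateIfMissing(true), path);

        // Disabling the WAL keeps appends in the memtable until a flush.
        this.writeOpts = new WriteOptions().setDisableWAL(true);
    }

    public void append(long index, byte[] entry) throws RocksDBException {
        db.put(writeOpts, ByteBuffer.allocate(Long.BYTES).putLong(index).array(), entry);
    }

    @Override public void close() {
        writeOpts.close();
        db.close();
    }
}
```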
To avoid even the smallest flush, we could use an additional volatile structure, like a ring buffer or a concurrent map, to store part of the log, and transfer records into RocksDB only when that structure overflows. This sounds more complicated and makes memory management more difficult, but we should take it into consideration anyway (a sketch follows below).
- Potentially, we could use a volatile page memory region for this purpose, since it already provides good control over the amount of memory used. But memory overflow would have to be handled carefully: usually it's treated as an error and might even cause a node failure.
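A minimal sketch of that hybrid structure, assuming a concurrent map as the hot part and an abstract overflow store (e.g., the WAL-less RocksDB instance above) as the cold part; all names here are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

/**
 * Sketch: a bounded volatile buffer that spills the oldest log entries into
 * a slower store on overflow.
 */
public class SpillingLogBuffer {
    /** Abstraction over the on-disk overflow store; hypothetical. */
    public interface OverflowStore {
        void put(long index, byte[] entry);
    }

    private final ConcurrentNavigableMap<Long, byte[]> hot = new ConcurrentSkipListMap<>();
    private final OverflowStore overflow;
    private final int maxHotEntries;

    public SpillingLogBuffer(OverflowStore overflow, int maxHotEntries) {
        this.overflow = overflow;
        this.maxHotEntries = maxHotEntries;
    }

    public void append(long index, byte[] entry) {
        hot.put(index, entry);

        // On a stable topology aggressive truncation keeps the hot part
        // small, so this spill path should rarely be taken.
        while (hot.size() > maxHotEntries) {
            Map.Entry<Long, byte[]> oldest = hot.pollFirstEntry();
            if (oldest != null) {
                overflow.put(oldest.getKey(), oldest.getValue());
            }
        }
    }

    /** Drops everything up to and including the given index (acked by all). */
    public void truncatePrefix(long upToInclusive) {
        hot.headMap(upToInclusive, true).clear();
        // The overflow store would need an analogous range delete.
    }
}
```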
Attachments
Issue Links
- split to:
  - IGNITE-17319 Remove entries replicated to all nodes from RAFT log in in-memory scenario (Open)
  - IGNITE-17335 Off-heap storage for RAFT log (Open)
  - IGNITE-17337 Support limiting volatile RAFT log memory by total entries size (Open)
  - IGNITE-17334 Basic volatile RAFT log storage (Resolved)
  - IGNITE-17336 Spill-out to disk support for volatile RAFT log storage (Resolved)