Details
-
Epic
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
-
In-memory store
Description
Goals
We need an in-memory store, similar to Ignite-2. This store must reuse common replication infrastructure, in other words, be integrated into raft STM and support transactions.
The raft protocol implies some persistent state: metadata, logs, snapshot.
Simplest solution - write a raft persistent state on disk (this is already implemented for org.apache.ignite.internal.storage.basic.ConcurrentHashMapPartitionStorage).
Drawback - not fully in-memory solution, doesn't much differ from a database cache
We can go the pure in-memory way - keep all raft state in a volatile store.
Raft metadata
Must not be persisted for a pure in-memory cluster, because the state is always lost on restart.
Note: a node must always be removed from the raft group when it’s removed from baseline by auto adjust and should join as new (in-memory always works with auto-adjust similarly to Ignite 2). Out of scope.
Log store
Has working in-memory implementation (currently used in tests): org.apache.ignite.raft.jraft.storage.impl.LocalLogStorage
Note: generally speaking, log is only required for "historical rebalancing" after the snapshot rebalance. It won't be needed at all once it is possible to apply snapshot and concurrent updates at the same time, for example when a solution like mvcc is implemented.
Snapshots
Can be implemented over any kv store extended with some kind of Copy-On-Write support. Not implemented currently. More details below.
COW buffer
To create an in-memory snapshot, the snapshot data is written to a separate in-memory buffer. The buffer is populated from the state machine update thread either by the update operations or by a snapshot advance mini-task which is submitted to the state machine update thread as needed.
To maintain a snapshot, the state machine needs to keep an snapshot iterator boundary key. If a key being updated is smaller or equal than the boundary key, there is no need in any additional action because the snapshot iterator has already processed this key. If a key being updated is larger than the boundary key, the old version of the key is eagerly put to the snapshot buffer and the key is marked with snapshot ID (so that the key is skipped during further iteration). Snapshot advance mini-task iterates over a next batch of the keys starting from the boundary key and puts to the snapshot buffer only keys that are not yet marked by the snapshot ID.
This approach has similar memory requirements to the first alternative, but does not require to modify the storage tree so that it can store multiple versions of the same key. This approach, however, allows for transparent snapshot buffer offloading to disk which can reduce memory requirements. It is also simpler in implementation because the code is essentially single-threaded and only requires synchronization for the in-memory buffer. The downside is that snapshot advance tasks will increase tail latency of state machine update operations.
Can be implemented on top of any kv store.
Note: we should consider the possibility of streaming the snapshot instead of storing it in memory until it is completed.