[KAFKA-6144] Allow serving interactive queries from in-sync Standbys - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.5.0
Component/s: streams
Labels:
- kip-535

Description

Currently, when a Kafka Streams (KS) cluster is expanded, the new node's partitions are unavailable during the rebalance. For large state stores this can take a very long time, and for small state stores even more than a few ms of unavailability can be a deal-breaker for microservice use cases.

One workaround is to allow stale data to be read from the state stores when the use case allows it. The use case from KAFKA-8994 is added here, as it is more descriptive.

"Consider the following scenario in a three-node Streams cluster with node A, node B and node R, executing a stateful sub-topology/topic group with 1 partition and `num.standby.replicas=1`:

t0: A is the active instance owning the partition, B is the standby that keeps replicating A's state to its local disk, and R just routes Streams interactive queries (IQs) to the active instance using StreamsMetadata.
t1: IQs pick node R as the router. R forwards the query to A, and A responds to R, which forwards the results back to the caller.
t2: The active instance A is killed and a rebalance begins. IQs to A start failing.
t3: The rebalance assignment happens and standby B is promoted to active instance. IQs continue to fail.
t4: B fully catches up to the changelog tail and rewinds offsets to A's last commit position. IQs continue to fail.
t5: IQs to R get routed to B, which is now ready to serve results. IQs start succeeding again.

Depending on the Kafka consumer group session/heartbeat timeouts, steps t2 and t3 can take a few seconds (~10 seconds based on default values). Depending on how far standby B was lagging before A was killed, t4 can take anywhere from a few seconds to minutes.

While this behavior favors consistency over availability at all times, the long unavailability window might be undesirable for certain classes of applications (e.g. simple caches or dashboards).

This issue also aims to expose information about standby B to R during each rebalance, so that an application can route queries to a standby to serve stale reads, choosing availability over consistency."
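
The routing choice described above can be sketched in plain Java. This is only an illustration of the idea, not the Kafka Streams API: the `StandbyRouting` class, the `Replica` type, its fields, and the `maxAcceptableLag` bound are all hypothetical names introduced here, and the sketch assumes the router already knows which instances are reachable and how far each standby lags.

```java
import java.util.List;
import java.util.Optional;

final class StandbyRouting {

    /** Hypothetical description of one instance hosting a copy of the store partition. */
    static final class Replica {
        final String host;
        final boolean active;     // true for the active task, false for a standby
        final boolean reachable;  // whether the router can currently reach this instance
        final long offsetLag;     // number of changelog records the local store is behind

        Replica(String host, boolean active, boolean reachable, long offsetLag) {
            this.host = host;
            this.active = active;
            this.reachable = reachable;
            this.offsetLag = offsetLag;
        }
    }

    /**
     * Picks the host a query should be forwarded to.
     * Prefers a reachable active instance (consistent read); otherwise falls back
     * to the least-lagging reachable standby within the staleness bound (stale read).
     */
    static Optional<String> route(List<Replica> replicas, long maxAcceptableLag) {
        for (Replica r : replicas) {
            if (r.active && r.reachable) {
                return Optional.of(r.host);
            }
        }
        Replica best = null;
        for (Replica r : replicas) {
            boolean usable = !r.active && r.reachable && r.offsetLag <= maxAcceptableLag;
            if (usable && (best == null || r.offsetLag < best.offsetLag)) {
                best = r;
            }
        }
        return best == null ? Optional.empty() : Optional.of(best.host);
    }

    public static void main(String[] args) {
        // State after t2 in the scenario: A is down, standby B is 42 records behind.
        List<Replica> afterFailure = List.of(
                new Replica("A", true, false, 0L),
                new Replica("B", false, true, 42L));
        System.out.println(route(afterFailure, 100L)); // tolerant of stale reads
        System.out.println(route(afterFailure, 10L));  // stricter staleness bound
    }
}
```

The staleness bound keeps the consistency/availability trade-off in the application's hands: a dashboard might accept a large lag, while a stricter caller can set the bound low and fail the query instead of serving stale data.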

Attachments

- image-2019-10-09-20-47-38-096.png (09/Oct/19 15:17, 139 kB, Navinder Brar)
- image-2019-10-09-20-33-37-423.png (09/Oct/19 15:03, 140 kB, Navinder Brar)

Issue Links

contains

KAFKA-6031 Expose standby replicas endpoints in StreamsMetadata

Resolved

is duplicated by

KAFKA-6249 Interactive query downtime when node goes down even with standby replicas

Resolved

is related to

KAFKA-6145 Warm up new KS instances before migrating tasks - potentially a two phase rebalance

Resolved

relates to

KAFKA-8994 Streams should expose standby replication information & allow stale reads of state store

Resolved

KAFKA-6555 Making state store queryable during restoration

Resolved

links to

GitHub Pull Request #7868

GitHub Pull Request #7960

GitHub Pull Request #7962

KIP-535: Allow state stores to serve stale reads during rebalance

Sub-Tasks

There are no Sub-Tasks for this issue.

People

Assignee: Navinder Brar

Reporter: Antony Stubbs

Votes: 6

Watchers: 11

Dates

Created: 30/Oct/17 17:36

Updated: 29/Jan/20 17:43

Resolved: 16/Jan/20 22:59