[KUDU-34] Create tests that inject faults into the recovery process. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: M3
Fix Version/s: None
Component/s: tserver
Labels: None
Target Version/s: Public beta

Description

Context:

On recovery, a TS gets a set of blocks and log segments to recover from (these were either already present locally or are fetched from other nodes). Replaying the log segments is required to rebuild the tablet's soft state. Simply reusing the exact same log segments as the new log, however, prevents us from running compactions/flushes freely (they would have to happen at exactly the same points as when the original log was created). This might be problematic if the recovering machine has more tablets or fewer resources than the one that originally created the log.

To solve this, recovery moves the old/fetched log segments into a timestamped recovery directory and replays them. During replay, a new log is created that matches the current state of the tablet.
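
As a rough illustration of the "move into a timestamped recovery directory" step, here is a minimal Python sketch. The function name `move_segments_to_recovery_dir`, the `<log_dir>.recovery-<timestamp>` naming, and the flat one-file-per-segment layout are all assumptions made for this example; they are not taken from the Kudu code base.

```python
import os
import time


def move_segments_to_recovery_dir(log_dir, now=None):
    """Move every segment file in log_dir into a new timestamped sibling dir.

    Returns the path of the recovery directory. The directory naming scheme
    is hypothetical and only serves the illustration.
    """
    ts = int(time.time() if now is None else now)
    log_dir = os.path.normpath(log_dir)
    recovery_dir = "%s.recovery-%d" % (log_dir, ts)

    # Snapshot the segment names before touching the filesystem.
    segments = sorted(
        name for name in os.listdir(log_dir)
        if os.path.isfile(os.path.join(log_dir, name))
    )

    os.mkdir(recovery_dir)
    for name in segments:
        # A single rename is atomic on a POSIX filesystem, but the loop as a
        # whole is not: a crash here leaves some segments moved and others
        # still in log_dir.
        os.rename(os.path.join(log_dir, name),
                  os.path.join(recovery_dir, name))
    return recovery_dir
```

The per-file loop makes the non-atomic window visible, which is exactly where a fault-injection test would want to interrupt the process.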

Possible Failure Cases:

We need to cover a series of cases to make sure that recovery will work even if failures occur mid-recovery, specifically in the following situations:

Failure Point: Bootstrapping TS fails after the bootstrap process starts but before any changes are made to the filesystem or tablet.
Expected Result: On bootup, the TS should find log segments in the log dir and no recovery dir. The TS should proceed with normal recovery.

Failure Point: Bootstrapping TS fails after the bootstrap process starts, the recovery dir has been created and the old segments have been moved there, but there are no new segments in the log dir.
Expected Result: The TS can tell that the recovery process failed, but no data was actually replayed because no new log exists. The TS should proceed with normal recovery using the segments in the recovery dir.
(Note that this assumes that moving the files is an atomic operation, which is not currently the case.)

Failure Point: Bootstrapping TS fails after the bootstrap process starts, the recovery dir has been created and the old segments have been moved there, and there are new segments in the log dir.
Expected Result: Right now there are no flushes/compactions during recovery, so the node can simply delete the segments in the log dir and restart replaying the segments in the recovery dir. Once compactions/flushes are allowed during recovery, however, this becomes trickier; a very simple solution might be to reset recovery and fetch new blocks and segments from another replica.

Failure Point: Bootstrapping TS fails after the bootstrap process starts, there are new segments in the log dir and there is no recovery dir.
Expected Result: TS should assume it failed during normal operation and proceed accordingly.
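
The four cases above reduce to a decision on two observations made at startup: whether a recovery dir exists, and whether the log dir contains segments. The Python sketch below encodes that table so a fault-injection test could assert the chosen action after each simulated crash. The names `StartupAction` and `choose_startup_action` are hypothetical, and the third branch reflects only the current "no flushes/compactions during recovery" behavior described above.

```python
from enum import Enum


class StartupAction(Enum):
    # Replay the segments found in the log dir.
    NORMAL_RECOVERY = 1
    # Replay the segments already moved into the recovery dir.
    REPLAY_RECOVERY_DIR = 2
    # Delete the partially written new segments, then replay the recovery dir.
    DISCARD_NEW_LOG_AND_REPLAY = 3


def choose_startup_action(has_recovery_dir, log_dir_has_segments):
    """Map the on-disk state seen at bootup to a recovery action."""
    if not has_recovery_dir:
        # First and fourth cases: with no recovery dir, a crash before any
        # filesystem change looks the same as a crash during normal operation.
        # (An empty log dir is not covered by the cases above; treating it as
        # normal recovery is an assumption of this sketch.)
        return StartupAction.NORMAL_RECOVERY
    if not log_dir_has_segments:
        # Second case: segments were moved, but nothing was replayed yet.
        return StartupAction.REPLAY_RECOVERY_DIR
    # Third case: a previous recovery attempt wrote part of a new log.
    return StartupAction.DISCARD_NEW_LOG_AND_REPLAY
```

Note that the first and fourth failure points are indistinguishable from the on-disk state alone, which is why both map to the same action.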

People

Assignee: Mike Percy

Reporter: David Alves

Votes: 0

Watchers: 2

Dates

Created: 04/Nov/13 18:10

Updated: 28/Aug/15 02:47

Resolved: 28/Aug/15 02:47