How In-Memory Systems Achieve High Availability

Not Just Restart
6 min read · Sep 2, 2022

Availability for RPC/RMA-Hybrid In-Memory Distributed Systems

Replication ensures read and write availability in the face of unplanned failures and provides some tolerance to slow backends. The framework is designed to avoid inter-replica coordination to keep overheads low. Self-validating responses and retries help resolve races without the need for remote or global locking.

Availability has four main topics to discuss:

1. Quorum GETs Under Three Replicas

2. Multi-Replica Mutations

3. Race Conditions

4. Quorum Repairs

Quorum GETs Under Three Replicas

When running with three replicas, copies of each key-value pair are deployed to adjacent backend tasks. Concretely, each key's consistent hash first determines the backend number i of its logical primary replica (i.e., the backend that would hold the key if no replication were in use), and copies of the key-value pair are then assigned to physical backends i, i+1, and i+2 (all mod N).
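
Here is a minimal sketch of this placement rule, assuming a hypothetical cell of N = 32 backends and using SHA-256 as the key hash (the actual hash function and cell size are not specified in the text):

```python
import hashlib

NUM_BACKENDS = 32   # N: hypothetical cell size, for illustration only
REPLICATION = 3     # R: three replicas per key

def primary_backend(key: str) -> int:
    """Consistent key hash picks the backend i of the logical primary replica."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BACKENDS

def replica_backends(key: str) -> list[int]:
    """Copies of the key-value pair go to physical backends i, i+1, i+2 (all mod N)."""
    i = primary_backend(key)
    return [(i + offset) % NUM_BACKENDS for offset in range(REPLICATION)]
```

With NUM_BACKENDS = 32, a key whose primary backend is i = 31 would be placed on backends [31, 0, 1], since the offsets wrap around mod N.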

Next, a 2×R-style lookup takes place across the replicas: the client performs an index fetch from all replicas. Although all three backends will respond, one backend's response will arrive at the client first due to load differences or network proximity.

The client then fetches the datum from the first responding backend; this backend is termed the preferred backend.

Upon receipt of its second index response, the client can attempt a per-key majority vote to ensure consistency between replicas. Deployments with three replicas and a quorum of two are known as R=3.2, and R=3.2 cells are resilient against single failures.

Each Index Entry carries a Version Number, which is globally unique and monotonically increasing within a key-value pair. A GET under R=3.2 reports a cache hit if and only if:

(1) Data Entry and its corresponding Index Entry pass checksum validation (e.g., no torn value was observed).

(2) At least two Index Entries agree on Version Number and Key Hash (i.e., a version quorum exists).

(3) The full Key in the Data Entry matches the requested Key (i.e., no hash collision).

(4) The Data Entry was fetched from a backend holding the quorum-agreed Version Number (i.e., the data came from a quorum member), again capitalizing on the tenet of self-verifying responses.
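
The four conditions can be expressed as a single client-side check. The sketch below uses hypothetical IndexEntry and DataEntry types standing in for the real RMA wire formats; only the validation logic follows the text.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexEntry:
    key_hash: int
    version: int        # globally unique Version Number
    checksum_ok: bool   # outcome of self-verification

@dataclass(frozen=True)
class DataEntry:
    key: str
    value: bytes
    source_backend: int # backend the datum was fetched from
    checksum_ok: bool

def is_quorum_hit(requested_key: str,
                  index_entries: dict[int, IndexEntry],  # backend id -> index response
                  data: DataEntry) -> bool:
    """Return True only if all four R=3.2 hit conditions hold."""
    src = index_entries.get(data.source_backend)
    # (1) The Data Entry and its corresponding Index Entry pass checksum validation.
    if src is None or not data.checksum_ok or not src.checksum_ok:
        return False
    # (2) At least two Index Entries agree on (Version Number, Key Hash).
    votes = Counter((e.version, e.key_hash)
                    for e in index_entries.values() if e.checksum_ok)
    (quorum_version, quorum_hash), count = votes.most_common(1)[0]
    if count < 2:
        return False
    # (3) The full key in the Data Entry matches the requested key.
    if data.key != requested_key:
        return False
    # (4) The datum came from a backend holding the quorum Version Number.
    return src.version == quorum_version and src.key_hash == quorum_hash
```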

Under this protocol, a successful GET quorum requires responses from only two replicas, a property useful for both failure/race tolerance and performance, because a slow backend's response can be ignored when the remaining two agree.

First-responder preference leverages this property by speculating that the preferred backend will form a quorum with at least one subsequent response; this is likely whenever mutations of a given key are sufficiently rare relative to the overall size of the corpus. When this speculation fails, i.e., when the first responder isn't part of the quorum, the client retries, preferentially fetching the datum from a distinct backend.
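
The speculation check itself is small. A sketch, reusing the illustrative IndexEntry type from above:

```python
def preferred_in_quorum(preferred: int, index_entries: dict[int, IndexEntry]) -> bool:
    """Speculation check: does the first responder agree with at least one
    later index response on (Version Number, Key Hash)?"""
    first = index_entries[preferred]
    return any(e.version == first.version and e.key_hash == first.key_hash
               for backend, e in index_entries.items() if backend != preferred)
```

When this check fails, the client would reissue the data fetch against a backend other than the preferred one and vote again.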

Multi-Replica Mutations

SET

Clients perform mutations (SETs) by sending an RPC to all replicas of a particular key. Each SET proposes a client-nominated Version Number, a tuple of {TrueTime, Client Id, Sequence Number}, such that every Version Number is globally unique and the Version Numbers emitted by a particular client ascend monotonically. By using TrueTime, a globally consistent coordinated clock, for the uppermost bits, each client eventually nominates the highest Version Number for retried mutations, which is crucial for per-client forward progress. Specifically, a mutation (SET) proceeds at a backend only when the client's proposed Version Number is higher than the Version Number stored for the datum. As a result, each KV pair has a monotonically increasing Version Number, and all backends can independently agree on the final mutation order, without requiring a common RPC arrival order.
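
A sketch of the version scheme and the per-backend acceptance rule; here time.time_ns() stands in for TrueTime, and the class and field names are illustrative rather than taken from the system:

```python
import itertools
import time
from dataclasses import dataclass

@dataclass(order=True, frozen=True)
class VersionNumber:
    # The clock occupies the uppermost position, so comparisons order by time
    # first, then by client, then by the per-client sequence number.
    true_time_us: int
    client_id: int
    sequence: int

class Client:
    """Nominates globally unique, per-client monotonically ascending versions."""
    def __init__(self, client_id: int):
        self.client_id = client_id
        self._seq = itertools.count()

    def next_version(self) -> VersionNumber:
        return VersionNumber(time.time_ns() // 1000, self.client_id, next(self._seq))

class Backend:
    """Each replica applies a SET only if the proposed version is newer."""
    def __init__(self):
        self.store: dict = {}   # key -> (VersionNumber, value)

    def set(self, key: str, value: bytes, proposed: VersionNumber) -> bool:
        current = self.store.get(key)
        if current is None or proposed > current[0]:
            self.store[key] = (proposed, value)
            return True
        return False   # stale proposal: the mutation order was already decided
```

Because every backend applies the same highest-version-wins rule, the replicas converge on the same final value regardless of the order in which the SET RPCs arrive.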

ERASE

ERASE operations are a special case of mutations. Like SETs, they use RPC and make forward progress even when a replica is down. ERASEs also require a client-nominated Version Number, which is retained so that late-arriving SETs with lower Version Numbers cannot resurrect erased values.

Unlike other operations, however, Version Numbers for erased elements cannot reside in the index region, since spending DRAM capacity on erased elements would be hard to justify. Fortunately, Version Numbers for erased elements need not be RMA-accessible, so they are stored in a per-backend sideband data structure: a fully associative, fixed-size tombstone cache on the backend's heap.
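
A sketch of such a sideband structure, reusing the illustrative VersionNumber type from the SET sketch; the capacity and the oldest-first eviction policy are assumptions for illustration, not details from the text:

```python
from collections import OrderedDict

class TombstoneCache:
    """Fixed-size, fully associative cache of Version Numbers for erased keys.
    It lives on the backend heap and is never exposed over RMA."""
    def __init__(self, capacity: int = 4096):
        self.capacity = capacity
        self._entries = OrderedDict()   # key -> VersionNumber of the ERASE

    def record_erase(self, key: str, version: "VersionNumber") -> None:
        previous = self._entries.get(key)
        self._entries[key] = version if previous is None else max(previous, version)
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # drop the oldest tombstone

    def blocks(self, key: str, proposed: "VersionNumber") -> bool:
        """A late-arriving SET is rejected if its version does not exceed the tombstone."""
        tombstone = self._entries.get(key)
        return tombstone is not None and proposed <= tombstone
```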

Compare-And-Set (CAS)

CAS operations are another special case of mutation. Like a SET, they install a new value, but only if the stored Version Number matches a provided Version Number. CAS provides a limited means of implementing conditional updates and reasoning about their success, with the provision that the Version Number is known a priori (e.g., memorized from a previous operation on the same key).
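
A sketch of the conditional-update rule, extending the hypothetical Backend class from the SET sketch:

```python
class CASBackend(Backend):
    def compare_and_set(self, key: str, value: bytes,
                        expected: VersionNumber, proposed: VersionNumber) -> bool:
        """Install the new value only if the stored Version Number matches `expected`."""
        current = self.store.get(key)
        if current is None or current[0] != expected:
            return False                       # version changed: the caller must re-read
        return self.set(key, value, proposed)  # proposed must still exceed the stored version
```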

Race Conditions

Clients usually don't coordinate mutations (SETs), and mutations of the same key may occur near-simultaneously. These mutations interact with RMA-based GETs without explicit synchronization. Race resolution depends on the self-verifying properties of GETs and the associated retries performed by clients. This strategy has the significant advantage that no RMA-based synchronization (e.g., remote locking) is needed, but the notable downside that forward progress is not guaranteed.

In R=3.2, the design objective is to provide obstruction-free forward progress for GETs: notably, they will succeed when they neither compete against a SET of the same key nor encounter a failure condition that reduces the replica count below quorum.

It is possible for repeated mutations to starve GETs, causing them to time out and report an error once their retry count and/or deadline is exhausted.

In practice, the speed differential between RMA and RPC makes this a non-concern.

To exemplify the race conditions that can arise in R=3.2, consider a race in which client 𝐶1 attempts SET 𝐾 = 𝑉1 while client 𝐶2 attempts a GET of 𝐾, where initially 𝐾 → 𝑉0.

Depending on the timing and interleaving of operations at server replicas 𝑆1 through 𝑆3, 𝐶2’s GET attempt can result in a quorum on value 𝑉0 (𝑉1), wherein the GET is logically ordered before (after) the SET, or can result in a retryable checksum failure. In this example, a checksum failure occurs when the slowest of the three data-array mutations, namely the one at 𝑆3, lands during 𝐶2’s RMA data fetch, yielding a metadata quorum for 𝑉0 but a torn read nonetheless.

Note that an inquorate outcome, wherein an operation cannot arrive at a quorum, is not possible when a GET races against a single SET. In contrast, a GET that races against multiple concurrent SETs, or that experiences a failure condition (e.g., a backend failure or a torn read), may subsequently fail to achieve quorum.

The approach overcomes such races by retrying operations that fail due to a potential race.
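
Concretely, a client might wrap a single GET attempt (such as the speculative flow described earlier) in a bounded retry loop. The error types, retry count, and deadline below are illustrative:

```python
import time

class QuorumFailure(Exception):
    """Raised when fewer than two index responses agree."""

class ChecksumFailure(Exception):
    """Raised when a torn value is observed."""

def get_with_retries(key, do_get, max_retries=5, deadline_s=0.050):
    """Resolve races by retrying: a GET that loses a race is attempted again
    until its retry count and/or deadline budget is exhausted."""
    start = time.monotonic()
    for _ in range(max_retries):
        try:
            return do_get(key)
        except (QuorumFailure, ChecksumFailure):
            if time.monotonic() - start > deadline_s:
                break   # starved by repeated mutations: give up and report an error
    raise TimeoutError(f"GET of {key!r} exhausted its retry budget")
```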

Quorum Repairs

A key with a quorum of only two backends, instead of all three, is said to have a dirty quorum.

In a dirty quorum, not all backends agree on a key’s existence or Version Number. Dirty quorums arise from backend task failures, uncoordinated eviction (∼1 in 7M GETs), and RPC failures. A second failure, such as the loss of an additional backend, causes the dirty quorum to degrade to an inquorate state, which is treated as a cache miss.

Because the cost of a cache miss can be high (e.g., it may require reading data from persistent storage), the framework supports quorum repair, in which a backend triggers an explicit on-demand recovery sourced from the two remaining healthy replicas.

To manage the risk of a dirty quorum degrading to an inquorate state, backends independently scan their cohorts (i.e., the other two backends) for missing or stale key-value pairs, detected via a KeyHash exchange to minimize overhead.

A backend observing a dirty quorum performs an on-demand repair on a key-by-key basis by:

(1) Issuing a SET to the dirty backend to install the missing key K at a new Version Number 𝑁.

(2) Updating the Version Number of the key K to 𝑁 at the repairing backend as well as the other (clean) backend.

This repair procedure, sketched below, ensures that all three backends settle on a consistent view of key K at Version Number 𝑁. The inter-scan interval can be tuned to suit the needs of the deployment; tens of seconds is typical. A similar process operates en masse whenever a backend restarts after an unplanned failure, such as a crash: restarted backends request repairs from the other two healthy backends in their cohort.
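
A sketch of the scan-and-repair path, reusing the hypothetical Backend class and VersionNumber scheme from the SET sketch; the KeyHash exchange is abstracted away as a direct comparison of stored versions, and only the simple one-dirty-replica case is handled:

```python
def scan_and_repair(self_backend: Backend, cohort: dict, next_version) -> None:
    """Periodic cohort scan: compare keys/versions with the other two backends
    and repair any key observed with a dirty quorum."""
    for key, (version, value) in list(self_backend.store.items()):
        dirty = [b for b in cohort.values()
                 if b.store.get(key, (None,))[0] != version]
        if len(dirty) == 1:        # two replicas agree; one is missing or stale
            n = next_version()     # new Version Number N for the repair
            # (1) Install the missing/stale key at the dirty backend at version N.
            dirty[0].set(key, value, n)
            # (2) Bump the repairing backend and the other clean backend to N as well.
            for b in (self_backend, *cohort.values()):
                if b is not dirty[0]:
                    b.set(key, value, n)
```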
