This is an overview of Google’s in-memory KV (key-value) caching system.
In-memory caching-based distributed systems expose a small set of frequently used commands: GET, SET, and ERASE. The GET command is the one used most, mainly to fetch a key-value (KV) pair from the backend. Efficiency matters here: a single transaction should not consume much CPU. As anyone who has used in-memory software libraries like Redis or Memcached knows, the key-value pairs are stored in memory.
Remote Procedure Call (RPC) frameworks are the backbone of modern Internet services and provide the programming abstractions used to build distributed systems. RPC frameworks include protocol versioning, memory management, auto-scaling, logging, and support for binary upgrades. These features add CPU overhead on both the client and the server, limiting operation (op) rate, bandwidth, and efficiency; an RPC often costs more than 50 CPU-𝜇s in framework and transport code across client and server. Per-RPC overheads easily dominate remote operations with little server-side computation, such as in-memory retrieval applications like distributed caches and key-value stores.
CliqueMap
Google released a paper on reducing this CPU overhead and increasing the peak op rate for such applications: they developed their own distributed caching service built around RMA (remote memory access). They designed an RPC/RMA hybrid in-memory caching service, segregating the two transports: RPC handles mutations (SET and ERASE), while RMA handles GET operations. Their research confirmed that an RMA-based design improves performance and efficiency; RPC-based systems are robust but consume more CPU, and that is exactly the overhead Google wanted to reduce. They named their service CliqueMap.
CliqueMap uses an associative hash table to enable remote access. Each KV pair is guarded by a checksum that is used to validate responses. It is designed to provide high-performance lookups, with clear responsibilities between RPCs and RMAs across dataplane, control, and management operations, and to support key productionization expectations such as evolution over time, high availability, interoperability, and ease of deployment. The main intention is clear: the dataplane retrieval path (GET) should be fast and should incur minimal CPU overhead.
RPC based Key value caching system
Memcached is one example of an RPC-based solution. Google’s Stubby RPC framework (the internal predecessor of gRPC) provides many features, such as application-to-application authentication, integrity protection, forward- and backward-versioning assistance, per-RPC ACLs, and interoperability across multiple languages. But these features cost CPU: usage is >50 CPU-𝜇s across client and server, far higher than the overheads of state-of-the-art academic RPC prototypes. That is acceptable for most use cases, but for an in-memory KVCS (key-value caching system), a minimum CPU cost of 50 𝜇s per op eclipses the CPU cost of simply accessing memory. Such a cost limits the usefulness of distributed caching, especially for systems with large working sets, for which distributed storage is a critical means to husband expensive DRAM.
RMA based Key value caching system
A number of systems support using RMA as the network transport for KVCS dataplane primitives. Challenges arise when intersecting the core ideas of these systems with requirements such as memory efficiency, agile evolution after deployment, increased availability, software interoperability, and optimization for heterogeneous networking hardware. Systems built entirely atop RMA require careful coordination between client and server binaries, making post-deployment evolution a slow process. Simplifying assumptions around failures, hardware, or the possibility of memory pre-allocation may simply not hold in practice, and can lead to performant but otherwise complex or impractical systems.
RMA/RPC hybrid Key value caching system
Both RPC and RMA are useful building blocks for higher-level systems, and use of one can complement, rather than exclude, the other. A hybrid offers a middle ground and can more readily accommodate requirements around fault tolerance and post-deployment agility.
The ideas behind this methodology fall into these categories: self-validating responses, the in-memory data structure, the 2xR GET, SET, and the client.
Self-validating responses
What exactly does this mean? Each key-value pair is guarded with a checksum. RMA is not atomic, so clients performing lookups always verify this checksum end-to-end (per KV pair). Checksum validation failures are attributed to torn reads, that is, an RMA read that observes the intermediate state of a concurrent SET operation on the server; these failures are rare, but normal. In addition to the checksum, extra metadata accompanies a response to ensure that clients and servers agree on configuration, memory layout, and version. Clients retry lookups that fail validation at the appropriate level of the stack.
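As a rough illustration of this self-validation step (not CliqueMap’s actual metadata layout or checksum), the sketch below recomputes a checksum over the key, value, and version of a fetched entry and treats any mismatch as a retryable torn read; all struct fields and function names here are assumptions.

#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical layout of a fetched entry; the real metadata differs.
struct FetchedEntry {
  std::string key;
  std::string value;
  uint64_t version = 0;
  uint64_t checksum = 0;  // stored alongside the KV pair on the server
};

// Simple FNV-1a-style checksum over key, value, and version. A torn RMA
// read (one that races with a concurrent SET) will almost certainly break it.
uint64_t ComputeChecksum(const FetchedEntry& e) {
  uint64_t h = 1469598103934665603ULL;
  auto mix = [&h](const void* data, std::size_t n) {
    const auto* p = static_cast<const uint8_t*>(data);
    for (std::size_t i = 0; i < n; ++i) { h ^= p[i]; h *= 1099511628211ULL; }
  };
  mix(e.key.data(), e.key.size());
  mix(e.value.data(), e.value.size());
  mix(&e.version, sizeof(e.version));
  return h;
}

// Returns true if the response is internally consistent. A failure is rare
// but normal, and the client simply retries the lookup.
bool ValidateResponse(const FetchedEntry& e, const std::string& wanted_key) {
  if (ComputeChecksum(e) != e.checksum) return false;  // torn read
  if (e.key != wanted_key) return false;               // key-hash collision in the index
  return true;
}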
In-memory system data structure
To manage a wide range of key and value sizes while handling key-hash collisions, it uses an associative hash table. The in-memory data structure is composed of two logical RMA-accessible regions: an index region and a data region. The index region contains fixed-size buckets, and each Bucket contains a number of fixed-size tuples known as IndexEntries. An IndexEntry is tagged with a key hash and holds a pointer (for example, a memory region identifier, offset, and size) that indicates the position in the data region where the actual KV pair is stored. Multiple DataEntries reside in the data region. This entire allocation is managed by RPC handlers.
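The following is a minimal sketch of how such a two-region layout could be declared; the field names, widths, and number of IndexEntries per Bucket are illustrative assumptions rather than CliqueMap’s real layout.

#include <cstdint>

// One slot in the index region. Its pointer fields tell a client where in
// the data region the corresponding DataEntry lives.
struct IndexEntry {
  uint64_t key_hash;   // tag clients scan for when looking up a Key
  uint32_t region_id;  // which RMA-registered data region to read
  uint32_t size;       // size of the DataEntry, so one RMA read suffices
  uint64_t offset;     // byte offset of the DataEntry within that region
};

// A fixed-size Bucket in the index region; a client fetches a whole Bucket
// with a single RMA read and scans it locally.
constexpr int kEntriesPerBucket = 8;  // illustrative
struct Bucket {
  IndexEntry entries[kEntriesPerBucket];
};

// Header of a DataEntry in the data region; the key and value bytes follow
// immediately after the header. Allocation of these entries is managed by
// the backend's RPC handlers.
struct DataEntryHeader {
  uint64_t checksum;   // guards the whole entry end-to-end
  uint64_t version;    // SETs must monotonically increase this
  uint32_t key_len;
  uint32_t value_len;
  // followed by key_len key bytes and value_len value bytes
};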
2xR GET
This is another crucial feature of this approach. Here the GET operation is referred to as a 2xR GET: two RMA reads, issued in sequence or in parallel, against the index and data regions. The flow, sketched in code after this list, is as follows:
1. The client computes a hash mapping the Key (an arbitrary string) to a fixed-size KeyHash, which uniquely identifies the backend and Bucket associated with the Key.
2. The client fetches the associated Bucket via an RMA read.
3. The client scans the Bucket for an IndexEntry with the desired KeyHash. If there is no match, then the client declares a miss.
4. The client issues a second RMA read to fetch the potentially-matching DataEntry.
5. On completion of the second RMA read, the client validates the response by:
· 1. Validating the checksum end-to-end, to guard against tearing.
· 2. Verifying that the DataEntry contains the desired Key (not just the KeyHash).
· 3. Guarding against a (very) rare 128-bit hash collision.
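Putting the steps above together, a client-side 2xR GET might look roughly like the sketch below. The RmaBackend interface, the struct fields, the 64-bit std::hash (the real key hash is much wider, per step 5.3), and ChecksumOk are illustrative assumptions; only the two-read structure and the validation steps mirror the flow described above.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>

// Illustrative layouts, mirroring the earlier data-structure sketch.
struct IndexEntry { uint64_t key_hash = 0; uint32_t region_id = 0; uint32_t size = 0; uint64_t offset = 0; };
constexpr int kEntriesPerBucket = 8;
struct Bucket { IndexEntry entries[kEntriesPerBucket]; };
struct DataEntry { std::string key; std::string value; uint64_t version = 0; uint64_t checksum = 0; };

// Hypothetical transport: each call corresponds to one RMA read against the
// backend selected by the key hash (backend selection is omitted for brevity).
struct RmaBackend {
  std::function<Bucket(uint64_t bucket_index)> ReadBucket;    // RMA read #1: index region
  std::function<DataEntry(const IndexEntry&)> ReadDataEntry;  // RMA read #2: data region
};

// Recomputes a simple checksum over key, value, and version, as in the
// self-validation sketch earlier.
bool ChecksumOk(const DataEntry& e) {
  uint64_t h = 1469598103934665603ULL;
  auto mix = [&h](const void* data, std::size_t n) {
    const auto* p = static_cast<const uint8_t*>(data);
    for (std::size_t i = 0; i < n; ++i) { h ^= p[i]; h *= 1099511628211ULL; }
  };
  mix(e.key.data(), e.key.size());
  mix(e.value.data(), e.value.size());
  mix(&e.version, sizeof(e.version));
  return h == e.checksum;
}

std::optional<std::string> Get2xR(const RmaBackend& backend, uint64_t num_buckets,
                                  const std::string& key) {
  // Step 1: map the Key to a fixed-size KeyHash that identifies the Bucket.
  const uint64_t key_hash = std::hash<std::string>{}(key);

  // Step 2: first RMA read fetches the whole Bucket.
  const Bucket bucket = backend.ReadBucket(key_hash % num_buckets);

  // Step 3: scan the Bucket for an IndexEntry tagged with the desired KeyHash.
  for (const IndexEntry& entry : bucket.entries) {
    if (entry.size == 0 || entry.key_hash != key_hash) continue;

    // Step 4: second RMA read fetches the potentially matching DataEntry.
    const DataEntry data = backend.ReadDataEntry(entry);

    // Step 5: validate. The checksum guards against tearing; the full key
    // comparison guards against hash collisions. Validation failures are
    // surfaced to the caller, whose retry layer handles them.
    if (!ChecksumOk(data) || data.key != key) return std::nullopt;
    return data.value;
  }
  return std::nullopt;  // no matching IndexEntry: declare a miss
}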
SET
The client kicks off an RPC call to the appropriate backend carrying the KV pair to be SET. On receiving a SET RPC, the backend allocates capacity for the new DataEntry in the data region, prepares an RMA-friendly pointer, and scans the relevant Bucket in the index region for an existing mapping. If such an entry is found, the backend overwrites it with the new pointer and reclaims the old DataEntry as free space. The backend successfully applies a SET only when doing so monotonically increases that KV pair’s version.
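Below is a hedged sketch of what such an RPC-driven SET handler could look like: it allocates a new DataEntry, scans the Bucket for an existing mapping, swings the pointer, reclaims the old entry, and rejects non-monotonic versions. The Backend class, its containers, and the stubbed checksum are assumptions made for illustration, not CliqueMap’s implementation.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Illustrative layouts. data_slot stands in for the (region, offset, size)
// pointer the real index would hold; an entry with size == 0 is unused.
struct IndexEntry { uint64_t key_hash = 0; uint64_t data_slot = 0; uint32_t size = 0; };
constexpr int kEntriesPerBucket = 8;
struct Bucket { IndexEntry entries[kEntriesPerBucket]; };
struct DataEntry { std::string key; std::string value; uint64_t version = 0; uint64_t checksum = 0; };

// Hypothetical backend. In the real system the index and data regions are
// RMA-registered memory; plain containers are used here for brevity.
class Backend {
 public:
  explicit Backend(std::size_t num_buckets) : buckets_(num_buckets) {}

  // RPC handler for SET. Returns false if applying the SET would not
  // monotonically increase the KV pair's version.
  bool HandleSet(const std::string& key, const std::string& value, uint64_t version) {
    const uint64_t key_hash = std::hash<std::string>{}(key);
    Bucket& bucket = buckets_[key_hash % buckets_.size()];

    // Allocate and fill the new DataEntry; the checksum keeps it
    // self-validating for concurrent RMA readers.
    const uint64_t new_slot = Allocate();
    data_region_[new_slot] = DataEntry{key, value, version, Checksum(key, value, version)};

    // Scan the Bucket for an existing mapping of this key hash.
    for (IndexEntry& entry : bucket.entries) {
      if (entry.size == 0 || entry.key_hash != key_hash) continue;
      if (data_region_[entry.data_slot].version >= version) {
        Free(new_slot);  // reject: versions must increase monotonically
        return false;
      }
      const uint64_t old_slot = entry.data_slot;
      entry.data_slot = new_slot;  // swing the pointer; a racing RMA reader may
                                   // tear here, which its checksum check catches
      Free(old_slot);              // reclaim the old DataEntry as free space
      return true;
    }

    // No existing mapping: install a fresh IndexEntry in a free slot.
    for (IndexEntry& entry : bucket.entries) {
      if (entry.size != 0) continue;
      entry = IndexEntry{key_hash, new_slot, static_cast<uint32_t>(key.size() + value.size())};
      return true;
    }
    Free(new_slot);  // Bucket full; a real backend would evict or restructure
    return false;
  }

 private:
  uint64_t Allocate() { data_region_.emplace_back(); return data_region_.size() - 1; }
  void Free(uint64_t slot) { data_region_[slot] = DataEntry{}; }  // placeholder reclamation
  // Stub; a real backend computes the end-to-end checksum shown earlier.
  static uint64_t Checksum(const std::string&, const std::string&, uint64_t) { return 0; }

  std::vector<Bucket> buckets_;
  std::vector<DataEntry> data_region_;
};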
Client
The client library transparently retries GET/SET operations to overcome transient failures, such as checksum validation failures that can arise under race conditions, subject to both a user-specified deadline and a retry count. Retries occur at a layer appropriate to the error (e.g., checksum failures may retry RMA operations, while failed RMA operations may retry on new connections). This basic design delivers RMA’s intrinsic performance advantages for common-case read operations, while the use of RPCs for SETs and other mutations eases the complexity of ensuring consistency.
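To make the layered retry behaviour concrete, here is a hedged sketch of a client-side retry wrapper: validation failures simply redo the RMA operation, while transport-level RMA failures re-establish a connection first, all bounded by a user-specified deadline and retry count. The error taxonomy and hook names are assumptions, not CliqueMap’s API.

#include <chrono>
#include <functional>
#include <optional>
#include <string>

// Illustrative error classes that a single lookup attempt can surface.
enum class GetError { kNone, kMiss, kValidationFailed, kRmaTransportFailed };

struct GetResult {
  GetError error = GetError::kNone;
  std::optional<std::string> value;
};

// Hypothetical hooks into the lower layers of the client library.
struct ClientOps {
  std::function<GetResult(const std::string& key)> AttemptGet;  // one 2xR GET attempt
  std::function<void()> ReestablishConnection;                  // heavier recovery path
};

// Retries a GET at the layer appropriate to the error, bounded by a
// user-specified deadline and retry count.
std::optional<std::string> GetWithRetries(ClientOps& ops, const std::string& key,
                                          std::chrono::steady_clock::time_point deadline,
                                          int max_retries) {
  for (int attempt = 0; attempt <= max_retries; ++attempt) {
    if (std::chrono::steady_clock::now() >= deadline) break;

    const GetResult r = ops.AttemptGet(key);
    switch (r.error) {
      case GetError::kNone:
        return r.value;               // success
      case GetError::kMiss:
        return std::nullopt;          // a genuine miss is not retried
      case GetError::kValidationFailed:
        continue;                     // e.g. a torn read: just redo the RMA reads
      case GetError::kRmaTransportFailed:
        ops.ReestablishConnection();  // retry on a fresh connection
        continue;
    }
  }
  return std::nullopt;                // deadline or retry budget exhausted
}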
Challenges and how they are addressed in this approach
Memory efficiency via RPC-based mutations (SET/ERASE)
It allocates and manages memory capacity locally, triggered by RPCs that run entirely in the backend (the in-memory caching server). Using straightforward code, these backends implement rich replacement and allocation policies, and can restructure the index and data regions on demand (e.g., via defragmentation/replacement) or even trigger background processes (e.g., memory reshaping). Self-validating lookup mechanisms ensure detection and retry of any resulting races. Importantly, these mechanisms make it possible for hybrid RMA/RPC backends to resize their memory footprint on demand, rather than wastefully pre-allocating and pre-registering peak memory capacity at startup.
Lightweight replication with client-based quoruming
To deal with slow or failed servers, it offers deployment modes in which data is replicated across multiple backends. To realize availability benefits with minimal performance cost, it adopts an uncoordinated replication approach in which replicas do not synchronize in the serving path, and clients use load-aware quoruming to resolve data consistency.
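A minimal sketch of how client-side quoruming could resolve divergent replicas is shown below: the client queries a quorum of replicas (assumed here to be ordered from least to most loaded) and keeps the result with the highest version. The quorum size, replica ordering, and version-based resolution are illustrative assumptions rather than CliqueMap’s exact policy.

#include <cstdint>
#include <functional>
#include <optional>
#include <string>

// Result of one self-validating GET against a single replica.
struct ReplicaResult {
  bool valid = false;    // false on a miss or a validation failure
  uint64_t version = 0;  // version stored with the KV pair
  std::string value;
};

// Hypothetical per-replica lookup (e.g. a 2xR GET against one backend).
using ReplicaGet = std::function<ReplicaResult(int replica_index, const std::string& key)>;

// Reads from `quorum` replicas, assumed to be ordered from least to most
// loaded by the caller, and resolves divergence by keeping the freshest
// version. Replicas do not coordinate in the serving path, so disagreement
// between them is expected and harmless.
std::optional<std::string> QuorumGet(const ReplicaGet& get, int num_replicas, int quorum,
                                     const std::string& key) {
  std::optional<ReplicaResult> best;
  int answers = 0;
  for (int r = 0; r < num_replicas && answers < quorum; ++r) {
    const ReplicaResult res = get(r, key);
    ++answers;
    if (!res.valid) continue;  // this replica missed or failed validation
    if (!best || res.version > best->version) best = res;
  }
  if (!best) return std::nullopt;
  return best->value;
}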
Self-validating responses coupled with retries as a key building block
While self-validating responses and retries resolve races in the basic design, they also turn out to be key enablers for supporting seamless binary upgrades and recovering from failures. Self-validating responses from the server make the client aware of transient changes occurring at the backend and trigger the client to retry at the appropriate layer of the stack. Given the number of backends globally and a weekly upgrade schedule, upgrades are the norm, and this approach simplifies delivery of “hitless” upgrades.
Decoupled design for non–C-family clients
It provides support for Java, Go, and Python clients via language-specific shims, enabling non–C-family components of a broader system to access the corpora stored in CliqueMap and thereby easing adoption of CliqueMap in established multi-language serving ecosystems. Each language shim launches a subprocess, which in turn contains the primary C++ client library, thereby side-stepping error-prone native-code invocation mechanisms, maximizing code reuse, and unifying debugging processes across all languages.