How Google’s In-Memory Systems Allocate Memory
Memory Allocation, Reshaping, and Availability in a Hybrid RPC/RMA In-Memory Caching System (CliqueMap)
In this blog I would like to share some details about memory allocation, reshaping, and availability in a hybrid RPC/RMA in-memory caching system. This is a continuation of my earlier blog (Decode Google’s RMA/RPC hybrid In-memory KV (key value) caching system); if you need more background on this framework, please check that blog first.
Basic Responsibilities of the In-Memory System
This framework uses an RPC protocol for mutations (SET/Erase) and an RMA protocol for GETs. The GET operation in particular is optimized, using the 2xR GET style. Memory allocation and reshaping are important attributes of this framework, and the 2xR GET style gives a lot of freedom to relocate data and change the protocol: client-side validation ensures that any server-side modification of a KV pair or its associated metadata implicitly poisons the operation, forcing a retry. Although this happens very rarely, such retries grant the backend code significant freedom when adjusting memory layout, simplifying both defragmentation and, later in the design’s evolution, dynamic resizing. Ultimately, server-side logic need only make retryable conditions transient, detectable, and rare, rather than entirely invisible.
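To make the self-verifying property concrete, here is a minimal sketch of how a client might validate a 2xR GET. The entry layout, the toy checksum, and the local stand-ins for the two RMA reads are all illustrative assumptions, not CliqueMap’s actual format.

```cpp
// Minimal, self-contained sketch of client-side validation for a 2xR GET.
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

struct IndexEntry {
  uint64_t data_offset = 0;  // pointer into the data region
  uint32_t checksum = 0;     // covers the Index Entry and Data Entry together
};

// Toy checksum (FNV-1a); the production checksum is not assumed here.
static uint32_t Fnv1a(const void* p, size_t n, uint32_t h = 2166136261u) {
  auto* b = static_cast<const uint8_t*>(p);
  for (size_t i = 0; i < n; ++i) h = (h ^ b[i]) * 16777619u;
  return h;
}

static uint32_t ChecksumOf(const IndexEntry& e, const std::string& data) {
  uint32_t h = Fnv1a(&e.data_offset, sizeof(e.data_offset));
  return Fnv1a(data.data(), data.size(), h);
}

// Local stand-ins for the two remote reads of a 2xR GET.
static std::vector<IndexEntry> g_index(16);
static std::vector<std::string> g_data(16);

std::optional<std::string> Get2xR(size_t bucket) {
  IndexEntry idx = g_index[bucket];             // RMA read #1: Index Entry
  std::string data = g_data[idx.data_offset];   // RMA read #2: Data Entry
  // Any concurrent server-side change to the pair or its metadata makes the
  // recomputed checksum mismatch, poisoning the operation.
  if (ChecksumOf(idx, data) != idx.checksum)
    return std::nullopt;  // caller retries; persistent failures fall back to RPC
  return data;
}

int main() {
  g_data[3] = "value";
  g_index[0].data_offset = 3;
  g_index[0].checksum = ChecksumOf(g_index[0], g_data[3]);
  return Get2xR(0) ? 0 : 1;
}
```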
Memory Allocation and Reshaping
Index Region Allocation
Index region memory allocation is straightforward: the index is initially provisioned on backend restart based on the expected key count, and it can be upsized at runtime when it comes under high load.
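As a rough illustration, provisioning might look like the sketch below; the per-Bucket associativity and slack factor are assumed values, not the system’s real constants.

```cpp
// Hypothetical sketch: size the index region from an expected key count.
#include <cstddef>
#include <cstdio>

struct IndexConfig {
  size_t buckets;
  size_t entries_per_bucket;  // fixed associativity per Bucket
};

IndexConfig ProvisionIndex(size_t expected_keys) {
  constexpr size_t kEntriesPerBucket = 8;  // assumed associativity
  constexpr double kSlack = 1.25;          // headroom before an upsize is needed
  size_t buckets =
      static_cast<size_t>(expected_keys * kSlack) / kEntriesPerBucket + 1;
  return {buckets, kEntriesPerBucket};
}

int main() {
  IndexConfig cfg = ProvisionIndex(1000000);
  std::printf("buckets=%zu x %zu entries\n", cfg.buckets, cfg.entries_per_bucket);
}
```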
Index Region Reshaping
To reshape the index, the backend creates a second, larger index, populates it, and then revokes remote access to the original index. At this point, client-initiated RMA operations fail. Clients retrying failed RMAs contact the in-memory system via RPC as part of their retry procedure for such errors, at which time the client also learns the new per-backend index size. For simplicity, mutations (SET/Erase) stall during an index resize.
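The sequence can be sketched as follows; the RMA-window calls here are hypothetical no-op stand-ins, and only the ordering of steps is the point.

```cpp
// Sketch of the index-resize sequence: build new index, populate it,
// revoke the old window, then advertise the new size.
#include <cstddef>
#include <mutex>
#include <vector>

struct IndexEntry { /* key hash, data pointer, checksum ... */ };
using Index = std::vector<IndexEntry>;

struct RmaWindow { const void* base = nullptr; size_t bytes = 0; };
static RmaWindow RegisterWindow(const void* p, size_t n) { return {p, n}; }
static void RevokeWindow(RmaWindow&) {}    // stands in for NIC-level revocation
static void AdvertiseIndexSize(size_t) {}  // clients learn this on RPC retry

std::mutex g_mutation_mu;  // mutations (SET/Erase) stall while this is held

void ResizeIndex(Index& index, RmaWindow& window, size_t new_buckets) {
  // For simplicity, mutations stall for the duration of the resize.
  std::lock_guard<std::mutex> stall(g_mutation_mu);

  // 1) Build and populate the second, larger index.
  Index next(new_buckets);
  for (const IndexEntry& e : index) { (void)e; /* rehash e into next */ }

  // 2) Revoke remote access to the original: in-flight client RMAs now fail,
  //    and the RPC retry path teaches clients the new per-backend index size.
  RevokeWindow(window);
  window = RegisterWindow(next.data(), next.size() * sizeof(IndexEntry));
  AdvertiseIndexSize(next.size());
  index = std::move(next);  // the old index can now be reclaimed
}
```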
Data Region Allocation
Because the data region is random-access in nature, the memory pool for Data Entries is governed by a slab-based allocator tuned to the deployment’s workload. Slabs can be repurposed to different size classes as values come and go over the lifetime of the backend task. Because all allocations occur within an RPC, the allocation logic can freely use the familiar programming abstractions provided by RPCs.
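Below is a compact sketch of what a slab-based allocator with size classes might look like; the size classes and slab size are assumptions, and oversize values and concurrency handling are elided.

```cpp
// Sketch of a slab-style allocator: fixed size classes, per-class free lists.
#include <array>
#include <cstddef>
#include <vector>

class SlabPool {
 public:
  // Assumed size classes; real deployments tune these to the workload.
  static constexpr std::array<size_t, 4> kClasses = {64, 256, 1024, 4096};

  void* Allocate(size_t bytes) {
    size_t c = ClassFor(bytes);
    if (free_lists_[c].empty()) Refill(c);  // carve a fresh slab
    void* p = free_lists_[c].back();
    free_lists_[c].pop_back();
    return p;
  }

  void Free(void* p, size_t bytes) {
    free_lists_[ClassFor(bytes)].push_back(p);
    // An emptied class's slabs could be repurposed to another size class
    // as the value-size mix drifts over the backend task's lifetime.
  }

 private:
  static size_t ClassFor(size_t bytes) {
    for (size_t i = 0; i < kClasses.size(); ++i)
      if (bytes <= kClasses[i]) return i;
    return kClasses.size() - 1;  // oversize handling elided
  }
  void Refill(size_t c) {
    constexpr size_t kSlabBytes = 64 * 1024;
    slabs_.emplace_back(kSlabBytes);
    char* base = slabs_.back().data();
    for (size_t off = 0; off + kClasses[c] <= kSlabBytes; off += kClasses[c])
      free_lists_[c].push_back(base + off);
  }
  std::array<std::vector<void*>, kClasses.size()> free_lists_;
  std::vector<std::vector<char>> slabs_;
};
```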
Data Region Reshaping
Memory registration for RMA is widely recognized to be expensive, as it requires the operating system to communicate with the RMA-capable NIC, to pin memory, and to manipulate translation tables. A naive design would pre-allocate all backend memory at startup to avoid registration at runtime, but this approach strands DRAM and drives up operating cost. Taking advantage of freely relocatable Data Entries, the pool instead resizes on demand, so that deployments can provision for common-case, rather than peak, DRAM usage. Because kernel memory-management operations have unpredictable duration, growth is initiated according to a high-watermark policy: when usage crosses the threshold, reshaping kicks off asynchronously, off the critical path, though triggered by some RPC-based operation. Reshaping works by pre-allocating the maximum possible virtual address range needed to serve the entirety of a machine’s memory capacity, but only ever populating a subset of that range. That is, the data pool is always virtually contiguous, but not fully backed by DRAM. During expansion, the backend establishes a second, larger, overlapping RMA memory window and begins advertising this new window to clients as the data region. Clients converge over time to using the second window exclusively, as usual by way of retries.
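The “virtually contiguous, but not fully backed by DRAM” idea can be sketched with plain Linux mmap/mprotect calls; the sizes here are made up, and registration of the window with the NIC is elided.

```cpp
// Sketch: reserve a huge virtually contiguous range, back only a prefix.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
  const size_t kMaxPool = 1ull << 40;  // reserve up to 1 TiB of address space
  const size_t kInitial = 1ull << 30;  // initially make only 1 GiB usable

  // PROT_NONE + MAP_NORESERVE claims address space without committing DRAM.
  void* pool = mmap(nullptr, kMaxPool, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  if (pool == MAP_FAILED) { perror("mmap"); return 1; }

  // Make the needed prefix accessible; pages are backed lazily on first touch.
  if (mprotect(pool, kInitial, PROT_READ | PROT_WRITE) != 0) {
    perror("mprotect"); return 1;
  }

  // Later, on the high-watermark trigger, grow the populated prefix without
  // moving the pool: the data region stays virtually contiguous throughout.
  if (mprotect(pool, 2 * kInitial, PROT_READ | PROT_WRITE) != 0) {
    perror("mprotect grow"); return 1;
  }

  munmap(pool, kMaxPool);
  return 0;
}
```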
As with the index region, data region downsizing occurs via non-disruptive restarts. Rollout of the reshaping feature saved 10% (50TB) of customer DRAM at launch. Shortly thereafter the underlying corpus itself shrank, and without further human intervention DRAM usage dropped by a further 50% (200TB). Since each individual backend makes an independent scaling decision, the aggregate savings is the sum of many such decisions and fluctuates over time.
Cache Eviction — A mutation (via RPC) such as SET/Erase triggers cache eviction when it encounters one of two conflict conditions:
Capacity Conflict — No spare capacity in the data region. An eviction anywhere in the data pool suffices.
Associative Conflict — No spare Index Entry in the key’s Bucket. For the newly-installed Key Value to be RMA-accessible, an existing Key Value pair must be evicted from the Bucket.
Eviction Procedure
Like reshaping, the eviction mechanism relies on the self-verifying properties inherent in the design. Because a checksum covers the Index Entry and Data Entry in combination, the RPC handler processing an eviction can nullify Index Entry pointers and modify Data Entry contents. This admits interleavings in which 2xR GETs already in progress might still complete; that is acceptable, as such GETs are considered ordered-before the eviction. Since the new inbound KV pair may differ in size from the evicted entry, multiple evictions in an appropriate size class may be needed. Once sufficient space is available, a new Data Entry is written, followed by the relevant Index Entry’s pointer into the data region, which establishes an ordering point after which the new value is visible. In practice, evictions occur at roughly half the rate of SETs.
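Here is a sketch of that install ordering, with assumed types: the new Data Entry is written while nothing points at it, and the pointer store is the publication point.

```cpp
// Sketch of evict-then-install ordering; types and sizes are assumptions.
#include <algorithm>
#include <atomic>
#include <cstring>
#include <string>

struct DataEntry { char bytes[256]; };

struct IndexSlot {
  std::atomic<DataEntry*> data{nullptr};  // the Index Entry's data pointer
};

void EvictAndInstall(IndexSlot& slot, DataEntry* fresh,
                     const std::string& value) {
  // 1) Nullify the victim's pointer. A 2xR GET already past this point may
  //    still complete; that's fine -- it is ordered-before the eviction.
  DataEntry* victim = slot.data.exchange(nullptr, std::memory_order_acq_rel);
  (void)victim;  // real code returns the victim to its size class's free list

  // 2) Write the new Data Entry while it is unreachable.
  std::memcpy(fresh->bytes, value.data(),
              std::min(value.size(), sizeof(fresh->bytes)));

  // 3) Publish: a release store of the pointer is the ordering point; readers
  //    that observe it also observe the completed Data Entry contents.
  slot.data.store(fresh, std::memory_order_release);
}
```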
Availability
Replication ensures read and write availability in the face of unplanned failures and provides some tolerance to slow backends. The framework is designed to avoid inter-replica coordination in order to keep overheads low; self-validating responses and retries again play a role, resolving races without the need for remote or global locking.
Availability has four main topics, which I will discuss in the next blog:
1. Quorum GETs Under Three Replicas
2. Multi-Replica Mutations
3. Race Conditions
4. Quorum Repairs