1. System Design for Big Data

Given n cache hosts, an intuitive hash function is key % n. It is simple and commonly used. But it has two major drawbacks:

It is NOT horizontally scalable.

Every time when adding one new cache host to the system, all existing mappings are broken. It will be a pain point in maintenance if the caching system contains a lot of data. Also, if the caching system is behind a popular service, it is not easy to schedule a downtime to update all caching mappings.
It may NOT be load balanced, especially for non-uniformly distributed data.

In real world, it is less likely that the data is uniformly distributed. Then for the caching system, it results that some caches are hot and saturated while the others idle and almost empty. In such situations, consistent hashing is a good way to improve the caching system.

1.1. What is Consistent Hashing

Consistent Hashing is a hashing strategy such that when the hash table is resized (e.g. a new cache host is added to the system), only k/n keys need to be remapped, where k is the number of keys and n is the number of caches. Recall that in a caching system using the mod as hash function, all keys need to be remapped.

Consistent hashing maps an object to the same cache host if possible. If a cache host is removed, the objects on that host will be shared by other hosts; If a new cache is added, it takes its share from other hosts without touching other shares.

1.2. When to use Consistent Hashing

Consistent hashing is a very useful strategy for distributed caching system and distributed hash tables. It can reduce the impact of host failures. It can also make the caching system more easier to scale up (and scale down).

Example of uses include:

Last.fm: memcached client with consistent hashing
Amazon: internal scalable key-value store, Dynamo
Chord: a distributed hash table by MITHow it works? As a typical hash function, consistent hashing maps a key or a cache host to an integer.

Suppose the output of the hash function are in the range of [0, 2^128) (e.g. MD5 hash). Image that the integers in the range are placed on a ring such that the values are wrapped around.

1.3. Here's how consistent hashing works

Given a list of cache servers, hash them to integers in the range. To map a key to a server, hash it to a single integer. Move clockwise on the ring until finding the first cache it encounters. That cache is the one that contains the key. See animation below as an example: key1 maps to cache A; key2 maps to cache C.

To add a new cache, say D, keys that were originally falling to C will be split and some of them will be moved to D. Other keys don't need to be touched.

To remove a cache or if a cache failed, say C, all keys that were originally mapping to C will fall into A and only those keys need to be moved to A. Other keys don't need to be touched.

1.4. Now let's consider the load balance issue

As we discussed at the beginning, the real data are essentially randomly distributed and thus may not be uniform. It may cause the keys on caches are unbalanced.

To resolve this issue, we add "virtual replicas" for caches. For each cache, instead of mapping it to a single point on the ring, we map it to multiple points on the ring, i.e. replicas. By doing this, each cache is associated with multiple segments of the ring.

If the hash function "mixes well", as the number of replicas increases, the keys will be more balanced.

1.5. Monotone Keys

If keys are known to be monotonically increased, binary searching can be used to improve the performance of locating a cache for a given key. Then the locate time can be reduced to O(log(n)).

consistent_hash