Repartition algorithms have one goal: spliting data or calculations as evenly as possible.
This is of course, done to match the available resources in either memory or treatment power.
State of the art
CRUSH, the distribution layout of Ceph
The distribution of data in Ceph is based on the idea of placement groups. We create a number of placement groups. Then the CRUSH algorithm assign placement groups to each server, data is replicated among the same placement group between multiple servers.
This leads, in my opinion, to some issues:
- determination of the number of placement groups.
- having to migrate the pool of data given changes in the cluster layout to please the placement group system.
The (approximate) maths behind the idea
The idea is based upon the splitting of a wide ring of integers among a comparably small number of instances.
As for practical purposes, the ring of unsigned integers between
264 excluded was chosen.
The rings are considered a suite of rings, each being recording a new state of the whole cluster.
R(n)be a suite of rings of the natural integers in the range
For any number
nof 64 digits in base 2, this number
n ∈ Rset.
Addition should wrap around the ring like a modulus addition. Which is the behavior of C-style unsigned integers.
For any number of
R(n)be the operator on
a ∈ R(n)and
b ∈ ℕsuch as
a+bis the lowest value
r ∈ ℕsuch as
As such, it is possible to get slices with a fixed distance to another by a simple addition of that distance.
It is worth noting that the ring doesn't possess subtraction or negative numbers, as such, distance is oriented and only defined by addition.
We them assign access points to daemons to slices of the ring. For this we use a ponderated value between a splitting ratio and the weight, for the next examples, only the weight will be demonstrated, making the splitting ratio be of
R(n)containing a daemon
redof weight 2.
This makes us a single slice, whose boundaries are the whole ring, and whose distance between boundaries is
264-1. Since there is a single slice, the weight of 2 have no influence here for now.
Sredbe a slice of this ring representing all of the ring. This slice is named, the slice of
R(0)assigned to the daemon
Now what about when we add a new daemon to the chain of rings?
R(n), this ring being equal to
R(0)with the addition of a daemon
greenof weight 1.
When adding another ring to the set, with here a weight of 1, the slices will be split among the 2, with 2/3 rounded down to the daemon with weight 2, and the
rest to the other.
Sredbe a slice of this ring representing two thirds of the values in ring starting at 0 and increasing up to
(264-1)×2÷3. This slice is named, the slice of
R(1)assigned to the daemon
Sgreenbe a slice of this ring representing one thirds of the values in ring starting at
(264-1)×2÷3excluded and increasing up to
264-1. This slice is named, the slice of
R(1)assigned to the daemon
Tmd4be the transform of a natural number expressed as a number of digits in base 2 that is multiple of 8 into a natural number of 128 digits in base 2 as defined by RFC 1320.
Thashbe the transform a number of digits in base 2 that is multiple of 8 into a natural number of 128 digits in base 2
xin the following way:
aheadbe the number formed by the first 64 digits of
atailbe the number formed by the last 64 digits of
Thash(x)is equal to the digit by digit exclusive or of
The implementation in Crystal
class Daemon getter weight : UInt64 getter location : Array(Int32) getter url : String def initialize(@weight : UInt64, @location : Array(Int32), @url : String) end end class SliceInfo JSON.mapping( slices: Array(Array(Slice)), last: Int32, shards: UInt32 ) def initialize(@slices, @last, @shards) end end
private def hash_impl(data : String) ctx = OpenSSL::Digest.new("md4") ctx << data digest = ctx.digest tag = IO::ByteFormat::BigEndian.decode(UInt64, digest) tag ^= IO::ByteFormat::BigEndian.decode(UInt64, digest+8) return tag end
High Availability Ring Protocol
A HARP system is composed of an heterogeneous number of nodes aware of their physical placement in the cluster, as well as a number of monitors connecting them. The monitors can fully access each other and are connected over the internet.
All of the nodes and monitors being aware of the layout of the monitor ring, may a connection be severed, new connections can be made easily.
Services are to implement query interfaces as well as relocation interfaces.
Relocation occur when a node is added or removed, it is done by querying the former owners for keys within a given slice.
The HARP protocol is a candidate for the implementation of the Scroda distributed currency system.
The concept of ring sets is plenty expanded within the Scroda database technical specification.
The concept of ring sets that govern the principles of HARP is very reusable. It provides a way to deterministically split data and treatment in a great number of servers or more generally of entities.