# my version of consistent hashing

When sharding, a common method of distributing your key randomly is to use some sort of hashing algorithm in your code, which is not time dependent, key dependent (i.e. evens, odds), etc.. You also want to keep in mind the work in reallocating data if you either add or remove shards.

The is a good amount of material on the subject,

and the original paper,

http://www.akamai.com/dl/technical_publications/ConsistenHashing…

if you like technical papers.

Why am I writing about it again? I don’t love the visual explanation out there (the circle) and breaking it into whole sections, so I’m going to explain it how I understand it best.

Naive hashing

Let’s say you had shards 1, 2 and 3, and wanted to add shard 4,

with naive hashing, you do some sort of modulus of the hash to assign it to a shard.

`whichshard = hash('key') % N`

where N = number of shards, % is the modulus and produces a number between 1 and N, which would be the number of the shard. A typical hash function might be crc32().

reallocation of data when adding or removing a shard is a real problem with this method. To demonstrate, we’ll assign some data represented by numbers 1-> 15 through out the nodes.

``` shard1 1, 4, 7, 10, 13
shard2 2, 5, 8, 11, 14
shard3 3, 6, 9, 12, 15```

now let’s add a shard, and redistribute these same pieces of data,

``` shard1 1, 5, 9, 13
shard2 2, 6, 10, 14
shard3 3, 7, 11, 15
shard4 4, 8, 12```

you’ll see that when you compare the data, most of them have shifted to other shards. in fact 3/4 of your data will be moved for 4 shards, 7/8 for 8 shards, etc… you’ll
have to move almost all your data each time you adjust the number of shards. This is because your algorithm using a modulus is accounting for an extra shard.

consistent hashing

``` shard1 1, 4, 7, 10, 13
shard2 2, 5, 8, 11, 14
shard3 3, 6, 9, 12, 15```

now let’s add a shard, and redistribute manually,

``` shard1 1, 4, 7, 10
shard2 2, 5, 8, 11
shard3 3, 6, 9, 12
shard4 13, 14, 15```

We see that we have chosen to assign node4 with a result from each on the first three shards, it’s does not matter which ones go to node4, and the result is none of the first three shards need data moved onto them, only moved off to the new shard. It also only accounts for 1/5 of the data; if we had 20 shards, we would only have to move about 1/20 of the data. Obviously you are going to have some sort of logic that maps the formula results to a node,

```if (result in (2,5,8,11)){
shard=2
}
etc...```

so for our demo, how does getting a random value from 1 to 15 work in a formula? We now have,

`whichshard = ceiling(hash('key')/maxvalue(hash())*N)`

ceiling would be a function that rounds up to the closest result N. N normalizes the result so instead of an answer between 0 and 1 to give you a result between 1 and N, matching your assigned values.

if we are using crc32, the max value is 4294967295, so something like,

`whichshard = ceiling(crc32('key')/4294967295 * N)`
`e.g.. -> ceiling(crc32('1234567')/4294967295 * 15) = 5`

and we see that the result ’5′, we assigned to shard2, and will always be assigned to shard2 unless we move it to another shard in our grouping definition.

Lastly, you can run the above formula as a select SQL statement, but in production, do it in your code so you are not hitting the db.

# Boo! a frightful hour in production, ghost of BIGINT

Well, we had quite an hour of uncertainty this morning, at 7:30am, the behaviour of one of our data servers changed radically, with no obvious reasons looking at the process list, error log, etc… We’ve done a number of things the past few days to ramp up for a version launch of one of our top games, which included new code, data structures, etc.. along with a change in hardware for the memcached instances.

processes were slowly building up, yet the load was low, and at least some queries certainly were processing. In retrospect I regret not looking at the show engine innodb status as that may have made it much more apparent what was going on, however considering the other two factors, I was first looking at other explanations to do with recent changes in code, the memcached changes, etc..

The head developer mentioned he looked at one of our big tables and it had 2.1 billion rows, however, we had already anticipated the datatype limitation of INT and went through the exercise of changing the primary key column to BIGINT, and so it seemed unlikely this was the cause. However, the number bothered me, and so I took a look as to exactly what is was and certainly it seemed the the table passing the threshold of int (2147483647 signed) exactly conincided with whatever problem we were having.

Looking at the processlist, sure enough there were many queries that were acting on the id 2147483647 (about 10% of the total queries). So what could be the problem? We use a lot of stored procedures to minimize network traffic, and the developer then remembered that in the stored procedure declarations the primary key was defined as INT, not BIGINT. All incoming queries (and new users) were getting a truncated max value of 2147483647 limited by the stored procedure, not the table. All users before 7:30am were processing normally albeit much more slowly, anyone signing up after 7:30am was hammering the same row, thus the slow query build up and of course contention due to the hotspot.

What’s scary is the dumb luck of us discovering this sooner than much later I never would have looked at the row count thinking it was a problem already solved. I’d like to think an eventual show engine innodb status would have revealed the row level locking on that id. Apologies to the user assigned id 2147483647 for their game data, I can’t imagine what the game experience must have been like for them.

# Galera gotchas

We’ve recently implemented Galera clustering and have been pleased with the relatively easy install and implementation. A quick description of galera is the joining of individual mysql dbs as nodes to create a cluster that features multi-threaded synchronous replication, which allows for true high availability while still using your original db and engine (innodb). Likewise, the ability to quickly break down the cluster to individual servers if need be.  Our specific set up includes two dbs and an arbitrator to avoid ‘split-brain’ in case of a node failure. The process of adding a node simply involves connecting to the group, and the data transfer and sync is automatically done via SST (state snapshot transfer) with the typical overhead associated with the familiar backup methods, mysqldump, rsync and more recently xtrabackup. With xtrabackup you truly have a non-blocking method to add and re-add nodes. Recent improvements also include the addition of IST (incremental state transfer) which allows you to disconnect a node, do work, and reconnect and quickly catch up on the missing transactions on that node.

As mentioned, the install and implementation has been quite smooth, however here are a few things to keep in mind,

1. On installation, when running into errors, it’s important you analyze both the joiner AND donor error logs, as they of course will have differing messages. We ran into what ended up being an xtrabackup issue, which was misleading from the joiner logs, but clear as day in the donor logs.

2. On initial installation, you’ll be told to set wsrep_cluster_address=gcomm:// for your first node, as there is nothing to join to. However, DON’T keep this in your my.cnf, as on restart, you’ll end up creating a new cluster (again), not join the one you’ve made. Change it to specify the ip of one of the other nodes.

3. Similar to replication, Galera will auto-increment offset by the number of nodes, this is automatic, however, keep this in mind regarding large tables and datatype limits.

4. You may be surprised to learn that some fundamental operations will lock the entire cluster for the duration of the operation without some care. Here are two examples and by no means the only statements that can be long running and cause grief,

• An alter table locks the entire cluster, even on a table that is not in use
• A load data infile also locks the entire cluster even on a table that is not in use

The first is partially due to how galera handles DDL statements, total order isolation (TOI) is the default, more info can be found here,

the fact that is affects all tables and dbs is a bug / feature, more details here,

I assume the load data infile lock is due to the synchronous requirements of the cluster waiting to receive confirmation of the commit on the second node.

you have a couple of options to avoid a cluster wide lock,

a) for the alter table scenario, as detailed at the link above, you can use a DDL handling method known as rolling schema upgrade (RSU) which automatically detaches the node for the duration of the operation, then resynchronizes when finished.

b) for both the load data infile and alter table you can do a more manual version of this by simply disconnecting each node, performing the operation, and reconnecting

c) A third version is to issue a command to only apply the command locally,

```SET wsrep_on=0;