Is MySQL/InnoDB good for high-volume inserts?

As we collect and analyze an ever-increasing amount of data, storage products with alternative file structures have become popular: LSM trees (log-structured merge trees), implemented in Cassandra and HBase, and fractal indexes, found in TokuDB/MX, an alternative storage engine for MySQL and MongoDB. The reasoning is this – to maintain b-tree indexes in a relational database such as MySQL, which clusters the data by primary key, there is a constant rebalancing process to keep the order. When your data set becomes larger than memory, this translates to a large increase in disk I/O, which of course kills performance. This is the price you pay for an optimized search algorithm, so the story goes (and when I say story, I mean promotional material from companies using non-btree structures).

So, a quick review of each of the three structures:

B-tree – nested levels of pointers, each block specifying a fractional range of the primary key. The top level is the root node, the bottom level is the leaf nodes, which contain the actual data, and the levels between are internal nodes. The branching factor of a b-tree is B, the number of nodes a block on the preceding level points to. For a block size of 16KB and, say, 100-byte rows, the branching factor could be around 160, and the tree depth for a billion rows ~4 levels.
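As a rough sketch of that depth arithmetic (the 16KB block, 100-byte row, and one-billion-row figures are just the example numbers above):

```python
import math

block_size = 16 * 1024   # InnoDB's default page size
row_size = 100           # example row size from the text
n_rows = 1_000_000_000

branching = block_size // row_size      # ~160 keys/pointers per block
depth = math.log(n_rows, branching)     # levels needed to address all rows

print(branching, round(depth, 2))       # ~160 branching, ~4 levels
```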

LSM – essentially log files that are always appended to, never updated in place (even for an update), and thus always written sequentially. Files are sorted. To search, you must scan the files in reverse chronological order. This is optimized by reducing the number of files through merging (compaction), and by limiting the range of the scan per file using the query parameters. Bloom filters can help point queries (not ranges): they maintain an extremely frugal, efficient data structure that lets a read query cut down on the number of files to be searched. As you can guess, reading is somewhat of an afterthought, traded away for insert performance.
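A minimal sketch of the Bloom-filter idea (illustrative parameters, not any particular store's implementation; real LSM engines tune the bit count and hash count per file):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k hash probes into an m-bit array.
    A miss guarantees the key is NOT in the file, so the read path
    can skip that file entirely; a hit may be a false positive."""
    def __init__(self, m_bits=1 << 16, k=4):
        self.m, self.k, self.bits = m_bits, k, 0

    def _probes(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._probes(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits >> p & 1 for p in self._probes(key))

# one filter per on-disk file; consult it before touching disk
f = BloomFilter()
for k in ("user:101", "user:102"):
    f.add(k)

print(f.might_contain("user:101"))   # True -- this file must be checked
```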

Fractal – similar to a b-tree, however each internal node block has a buffer. Writes are not immediately pushed down to the leaves; they sit in the buffers and cascade down one or more levels only when a buffer is full. Doing it this way allows many writes to one block to accumulate in memory, until a leaf node block carrying many changes is finally written to disk. A b-tree groups writes as well, but not specifically at the node level; the real advantage is that delaying the leaf node writes avoids randomly reading leaf nodes for each operation. From my understanding, with a fractal index branching factor of B^1/2 = (4MB / e.g. 100-byte records)^1/2 ≈ 200, and say just 1% of the total db size as memory (2TB -> 20GB) allocated to node buffers and pivots, 100-200 writes could be grouped into each leaf node write to disk (see below for the definition of B).
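Sketching that branching-factor arithmetic (the 4MB block and 100-byte record sizes are the example numbers above):

```python
import math

block_size = 4 * 1024 * 1024    # TokuDB's default block size
record_size = 100               # example record size
B = block_size // record_size   # records per block, ~42,000

branching = math.isqrt(B)       # fractal branching factor B^(1/2), ~200
writes_grouped = B // branching # records flushed together per leaf write

print(branching, writes_grouped)
```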

So the latter two file structures sound like a no-brainer for high-volume inserting. However, there is a critical point here that is typically glossed over: you can (and usually do) insert sequentially into b-tree structures as well, requiring very little memory to do so. The performance degradation for a b-tree only happens if your working dataset is larger than available memory and you're inserting out of primary key order (i.e. not chronologically; for example, alphabetically). Each insert then has an equal chance of landing in any block of existing data, so you are likely to be reading the entire dataset in a random fashion. If instead you are inserting sequentially, such as by time or an auto-increment primary key, your application writes to the same and/or adjacent blocks for the majority of your data, and those blocks are flushed together; the need to hold all the data in memory goes away. Secondary indexes are indeed out of order, but they are a much smaller data set, and are handled and grouped by the insert buffer. So basically, if you are inserting without a primary key, with an auto-increment primary key, or in some other time-oriented way, it's a non-issue.
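A toy simulation of that point (hypothetical sizes; 160 rows per 16KB page as in the b-tree example above): sequential inserts stay on one page until it fills, while random-order inserts scatter across the whole table, touching a "cold" page almost every time.

```python
import random

rows_per_page = 160          # ~16KB page / 100-byte rows
n_rows = 100_000

def page_switches(order):
    """Count how often consecutive inserts land on a different page.
    Back-to-back writes to the same page coalesce into one flush;
    a switch to a cold page means another random read/write."""
    last, switches = None, 0
    for row in order:
        page = row // rows_per_page
        if page != last:
            switches += 1
            last = page
    return switches

seq = list(range(n_rows))
rnd = seq[:]
random.seed(42)
random.shuffle(rnd)

print(page_switches(seq))   # 625: one touch per page
print(page_switches(rnd))   # close to 100,000: nearly every insert is cold
```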

We’ve done up to 100,000 q/s, primarily inserts, into MySQL/InnoDB on our production machines using a b-tree data structure: 32-core CPUs, multiple secondary indexes, a 3TB table, and that’s without using NoSQL APIs such as HandlerSocket or memcached (we do use stored procedures). That matches or surpasses most insert benchmarks I’ve seen, regardless of the product.

The math works out to be:

  1. B – records per block
  2. N – total number of records
  3. branching factor – B for b-tree, B^1/2 for fractal

B-tree – disk reads and writes: logN/logB; writes in memory: 1/B

Fractal – disk reads: logN/logB (times a small constant, since branching is B^1/2); disk writes, amortized: logN/(logB * B^1/2); writes in memory: 1/B

LSM – disk writes, amortized and including compaction: logN/(logB * B^1/2); disk reads: (logN/logB)^2 without Bloom filters; writes in memory: 1/B

Obviously B is somewhat arbitrary (TokuDB/fractal uses a default block size of 4MB, InnoDB/b-tree 16KB); the larger block size offsets the smaller branching factor (B vs. B^1/2), so both can read and write on disk within the same ballpark. For writes in memory, with the larger block size, fractal is arguably better, although both are quite efficient, at a tiny fraction of a disk I/O per operation. A larger block size hurts random reads, however, as more data must be read into memory. LSM is seemingly at a clear disadvantage on reads, needing to scan all files, thus a log factor more disk I/O. That last statement is a little misleading, though: the Bloom filters used by LSMs act as a crude form of index and remove most of this overhead, making reads workable, although not completely and not for ranges.
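Plugging the example numbers into those formulas (a sketch using the block sizes mentioned, 16KB for InnoDB and 4MB for TokuDB, with 100-byte records and a billion rows):

```python
import math

N = 1_000_000_000
rec = 100

# b-tree: B = records per 16KB block
B_bt = (16 * 1024) // rec
btree_read = btree_write = math.log(N) / math.log(B_bt)

# fractal: B = records per 4MB block, branching factor B^(1/2)
B_fr = (4 * 1024 * 1024) // rec
fractal_read = 2 * math.log(N) / math.log(B_fr)                   # depth with sqrt(B) branching
fractal_write = (math.log(N) / math.log(B_fr)) / math.sqrt(B_fr)  # amortized

# reads land in the same ballpark (~4 vs ~4 I/Os), while fractal's
# amortized disk writes per insert are a tiny fraction of one I/O
print(round(btree_read, 2), round(fractal_read, 2), round(fractal_write, 4))
```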

So what might be a situation where you’d want to organize your data other than chronologically? Perhaps a site such as Wikipedia, which has documents accessed with a frequency unrelated to their age. Additionally, you may want to fetch a range of documents, say all beginning with ‘Ag-’. Then it may make sense to cluster by document name. Even so, you still have a viable option with a b-tree: it might make more sense to write chronologically (sequentially) and let an alphabetical secondary index take care of the search performance.

Your alternatives are to use an LSM product, which will write sequentially with no order to worry about but has the aforementioned drawbacks reading the data, or to use fractal index technology, which will perform in the same ballpark as a b-tree when inserting sequentially and much better when not, with potentially less efficient reads, while relying on an overall less mature storage engine. TokuDB/MX also has built-in compression, which may add another dimension to the offering in terms of performance and, of course, storage space, but that’s for another post.

NUMA system performance issues – more than just swapping to consider

Almost all modern systems running a database have more than one physical processor (each with multiple cores and threads) in order to scale the number of simultaneously running threads and, ultimately, the work the system is doing. To scale memory performance as well, the architecture gives each physical processor, or node, its own ‘local’ memory. When a single process such as a MySQL database takes up more than the size of one node (i.e. more than 50% of memory on a two-socket system), the resulting performance considerations are not necessarily straightforward. There is of course the excellent article by Jeremy Cole that I recommend reading as a general overview of NUMA-based systems, along with an explanation of the most damning of these potential issues: actual swapping due to memory imbalance. There are, however, other less obvious but potentially harmful performance issues of which you should also be aware.

non-local memory percentage of use and location

As mentioned, NUMA attempts to localize memory to a node, which helps a great deal with scenarios such as independent processes or sharded/multiple-instance dbs: each process is bound to a node, and in turn the node attempts to use the local memory associated with it. It should be clarified that remote memory is simply the memory assigned to the other node(s); local/remote is relative to the particular node referred to. A large process (larger than the local memory of one node) has no choice but to use memory from multiple nodes, so there is no way to avoid remote memory access: the particular data needed cannot be determined until the work is already committed to a thread/core that may or may not have that data in local memory.

Understanding that accessing non-local memory is a necessity, it follows that the percentage of use and proximity of remote memory is a priority and can make a big difference in CPU-intensive, large-memory applications such as a high-transaction, dataset-in-memory db. A fundamental choice that exercises this is: ‘how many sockets do I choose for my system?’ Surely more is better? Not necessarily. A two-node system running a db that takes up most of the memory means you’ll likely access remote memory 50% of the time (save some internal optimizations), whereas on a four-node system you’ll be accessing remote memory 75% of the time (and 25% local). What’s more, the distance between nodes (and the memory bus the access must travel through) is also a factor, which is reported via,

numactl --hardware

two node

node distances:
node   0   1
0:  10  20
1:  20  10

four node

node distances:
node   0   1   2   3
0:  10  20  30  20
1:  20  10  20  30
2:  30  20  10  20
3:  20  30  20  10

Comparing two nodes vs. four nodes physically arranged in a square, it’s easy to see that a significant percentage of accesses require not one but two hops. Summing the two factors together, the average access time on a four-node system can be very roughly double that of a two-node system.
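A quick sketch of that arithmetic, using the distance matrices above (distance 10 = local; assuming a process whose memory is spread evenly across all nodes):

```python
def avg_distance(matrix):
    """Average access distance for one node when its accesses are
    spread evenly across every node's memory (rows are symmetric)."""
    row = matrix[0]
    return sum(row) / len(row)

two_node = [[10, 20],
            [20, 10]]
four_node = [[10, 20, 30, 20],
             [20, 10, 20, 30],
             [30, 20, 10, 20],
             [20, 30, 20, 10]]

print(avg_distance(two_node))    # 15.0 -- half local (10), half one hop (20)
print(avg_distance(four_node))   # 20.0 -- only a quarter local, some two-hop (30)
```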

memory reclaim modes

What exactly happens when a node runs out of local memory? This investigation was spurred by the fact that although we have NUMA enabled on most systems, we haven’t seen any swapping issues, even though some nodes on some systems are filled up and potentially under pressure. Swapping is indeed one possibility, however it’s not the only one, and the behaviour depends directly on your virtual memory settings. The default for systems falls under the general category of node-local reclaim, and can include behaviour such as discarding clean pages (which will need to be read in again later), writing dirty pages out, or simply going to a remote node with no attempt to make room for data on the local node. Details can be found under the zone_reclaim_mode option in the kernel’s vm sysctl documentation,

the gist of which is (the values are flags that can be combined),

0       = Zone reclaim off
1       = Zone reclaim on
2       = Zone reclaim writes dirty pages out
4       = Zone reclaim swaps pages

I’ve seen it stated that the default on two-node systems is 0, which simply goes to remote nodes when local memory is filled, while the default on four-node systems is 1, i.e. ‘free cached pages which are not dirty on the local node before going to remote’. This is what we’ve found on our current systems, and additionally it appears the default can differ depending on the version of your OS. To check,

sysctl -a | grep reclaim

The takeaway: although swapping is the most obvious and well-published performance issue, with the default settings (at least nowadays) you’re likely not to run into a memory pressure issue at all on two-node systems, while on four-node systems you may find the issue is more subtle, with continuous reading of pages from disk for seemingly no good reason, rather than the swapping you might expect. The solution is simply to change your zone_reclaim_mode to 0,

vm.zone_reclaim_mode = 0

in /etc/sysctl.conf

and reload with sysctl -p


InfiniDB 4.0.2 review

As with most data-centric companies, there comes a time when it’s necessary to aggregate data to provide meaningful insight into the raw data that is collected. At Fuse Powered, that is essentially one of the core services we provide, whether via ad-hoc queries, reports, or our dashboard.

The core database we use is of course MySQL with the default storage engine InnoDB, which is geared toward OLTP and does it very well. However, when it comes to aggregating a large number of rows, and/or loading a large amount of data quickly, this storage engine is not efficient. A class of storage engine called columnar storage is designed for just this sort of work: the data is re-oriented into separate column-based data files, so that each column can be read from disk sequentially, and stored with high compression (resulting in disk storage savings and less data to read), as the gain in read speed far outweighs the CPU overhead of decompression.
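A back-of-the-envelope sketch of why that orientation matters (hypothetical table shape): aggregating one column of a wide table touches a small fraction of the bytes a row store must read, before compression even enters the picture.

```python
n_rows = 1_000_000_000
row_bytes = 100          # full row a row store must read
col_bytes = 8            # one BIGINT column in a column store
compression = 5          # ~5x compression, as we see below

row_store_read = n_rows * row_bytes
col_store_read = n_rows * col_bytes // compression

print(row_store_read // col_store_read)   # 62: ~62x less data off disk
```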

There are a number of columnar offerings out there; however, almost all are closed source and quite pricey. Up until now we’ve used Infobright, a MySQL-based solution with a watered-down community version, stripped of much of the functionality and performance of its full version. Recently, Calpont (since re-branded after its product name, InfiniDB) has come out with open source licensing of its full 4.0.x version, a move that may potentially revolutionize how columnar databases are marketed.

This full version fills a number of core needs, such as compression, parallel processing, distributed storage, filesystem redundancy (HDFS, GlusterFS), and high-speed data loading, so we’ve been eager to test and implement it.

The architecture comprises two main functional units: user modules (UM) and performance modules (PM). Performance modules work with and store data, while user modules coordinate queries across the entire system. Scaling performance modules gives you more storage and potentially better performance from parallel work; scaling user modules allows more concurrent queries to run and provides redundancy for a failed UM. You can start with a single server that contains all units and grow out to a large distributed system, as both types of module can be added as your data increases. Each PM has a set number of dbroots, which can easily be reallocated/moved to other PMs on other servers as your data grows. It also uses a form of partitioning called extent mapping, which rules out sections of data in storage to speed up queries where possible. Queries are processed in parallel by whichever processors and nodes are appropriate.
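The extent-mapping idea can be sketched like this (a simplification with made-up file names and extent sizes; the point is that min/max values kept per extent let a filter skip whole extents without reading them):

```python
# each extent records the min and max of a column's values within it
extents = [
    {"file": "col1.ext0", "min": 0,    "max": 999},
    {"file": "col1.ext1", "min": 1000, "max": 1999},
    {"file": "col1.ext2", "min": 2000, "max": 2999},
]

def extents_to_scan(lo, hi):
    """Return only the extents whose [min, max] overlaps the predicate range."""
    return [e["file"] for e in extents if e["max"] >= lo and e["min"] <= hi]

print(extents_to_scan(1500, 1600))   # ['col1.ext1'] -- two extents skipped
```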


The following is our general experience with it to date:

Documentation – very good; answers quite a bit in terms of concepts, installation, options, etc. However, because it’s only just been released as open source, it really needs much more use and documentation by the user community to fish out bugs, suggest fixes, etc.

Installation/configuration – straightforward: enable passwordless ssh between nodes, download/install the initial rpm/deb/binary files, and then run their script, which asks a series of questions. You can adjust the main configuration file later, but only on the root PM, usually pm1, at /usr/local/Calpont/etc/Calpont.xml.

Schema transformation – you need to create InfiniDB versions of the table structures. This is going to take some time, as there are all kinds of incompatibilities when converting an InnoDB schema to InfiniDB.

Loading – load data infile vs. the InfiniDB loader. As you might guess, the loader is much faster: anywhere from 100,000 to 1 million+ rows/s for a single module load. The source of the load can be the user module, which automatically distributes the data in 10,000-row chunks, or the performance modules individually, using pre-split files, to be able to scale. You can set an error threshold over which the load aborts, with errors logged.
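The UM-sourced distribution can be pictured like this (a sketch, not InfiniDB internals; only the 10,000-row chunk size is from the description above, and the round-robin dealing is an assumption for illustration):

```python
CHUNK_ROWS = 10_000

def distribute(total_rows, n_pms):
    """Split a load into 10,000-row chunks and deal them round-robin
    to the performance modules; returns rows assigned per PM."""
    per_pm = [0] * n_pms
    full, rest = divmod(total_rows, CHUNK_ROWS)
    chunks = [CHUNK_ROWS] * full + ([rest] if rest else [])
    for i, c in enumerate(chunks):
        per_pm[i % n_pms] += c
    return per_pm

print(distribute(1_000_000, 3))   # [340000, 330000, 330000]
```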

Compression – our data is fairly typical, largely numerical with some varchar and text/blob columns here and there. It compresses at about 5x compared to uncompressed InnoDB.

Backups – pretty rudimentary, really just a file copy, which requires the system to be in read-only mode to ensure data consistency. On the other hand, being a file copy, it’s straightforward and fast.

Queries – performance is very impressive, especially for more complex queries. We had a class of aggregation queries that took 5-10 minutes on a dedicated box with 256GB of RAM; on a three-node public/shared cloud system with 16GB each, they finished in 10-15 seconds.

Logging – very well done, especially for loading: there are two files per load per table that detail the error type and the exact characters causing the issue. There are also various server logs for debugging.

Administration – much of it is done through an independent console, ‘cmconsole’, which lets you set and view various parameters and processes as well as add and reallocate modules. There are several processes per node that need to be up and running, and they can be a source of problems if they aren’t; they’re viewable with the console.

Monitoring – built-in monitoring of a number of key metrics such as disk/memory/cpu usage, networking issues, module issues, etc.

Cross-engine compatibility – it’s not just the InfiniDB storage engine; you get a full version of MySQL and can add InnoDB capability with the plugin. However, it’s pretty darn old at 5.1.39. You can also mix storage engines in your queries, although you’ll need to enable that in the configuration file.



Before we get into the issues, it should be said that we have not attempted to contact InfiniDB reps; the only resources we used were the documentation and the InfiniDB forums (with posted responses by InfiniDB reps). Some of the issues are annoyances; others are fairly serious.

The most serious revolve around loading of data. We typically (re)load and append about 1000 tables per day, and we’ve run into the following behaviour a few times in the last couple of weeks.

Locking – each load requires a table lock, and occasionally one of two things happens:

a) a table is locked and does not release the lock – we’ve even had a time or two where we’ve restarted the system and it still has not released the lock (contrary to documentation). There is a view-and-clear lock utility, but for some reason it’s not working for us at the moment either. Our solution was to hunt for the file containing the lock, and sure enough, you can find it on your parent PM at /usr/local/Calpont/data1/systemFiles/dbrm/tablelocks. Simply removing it and restarting seems to work. This behaviour seems to depend on the particular data being loaded: it happened on the same table 3 times, and we ran into most of it when initially loading our full data set.

b) a table lock can not be acquired – in this case, a load attempts to get a table lock, can’t, and aborts. There were no other queries running on the same table at the time; in fact, it seemed to happen mainly when loading other tables/dbs simultaneously.

Critical errors – we’ve run into this a few times: an error such as ‘image inconsistency’ puts the system into ‘read-only’ mode. We’ve only experienced it during a load, which I suppose makes sense, as that’s the only time we are significantly writing. When this happens during a load, it also attempts to roll back, but can’t, because again it’s been put into read-only mode. You may deal with the lock separately as above, and the read-only state is supposed to clear after a system restart, but we’ve seen it not clear. You then have two options: the advice at this link may help; otherwise, you may have to remove the entire InfiniDB installation and restore from your latest backup. We are not using a redundant file system such as HDFS or GlusterFS, which may or may not help.

Not enough memory to perform a select query – InfiniDB uses a distributed hash join to evaluate large data sets quickly, and its memory requirement is dictated by the second-largest table in the join. There is a variable, PmMaxMemorySmallSide, with a default of 64M up to a max of 4GB on the PM nodes; if the small side is larger than that setting, the join instead uses up to all of TotalUmMemory on the UM. We frequently join two tables, one at 40+ billion rows and the next at 1 billion rows, which has been an issue for certain queries, although our UM node is very modest at 16GB. Query span does make a difference (i.e. one day at a time vs. the entire year), so it’s not just a total-table-size metric.
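As a sketch of that sizing decision (the 4GB PmMaxMemorySmallSide cap and our 16GB UM are from the text; the row-size figures and the simple need = rows * bytes estimate are hypothetical simplifications):

```python
PM_MAX_SMALL_SIDE = 4 * 1024**3    # PmMaxMemorySmallSide upper limit
TOTAL_UM_MEMORY = 16 * 1024**3     # our modest UM node

def join_location(small_side_rows, row_bytes):
    """Rough guess at where the hash table of the smaller join side fits."""
    need = small_side_rows * row_bytes
    if need <= PM_MAX_SMALL_SIDE:
        return "PM"
    if need <= TOTAL_UM_MEMORY:
        return "UM"
    return "fails: not enough memory"

print(join_location(10_000_000, 50))      # ~500MB -> "PM"
print(join_location(1_000_000_000, 50))   # ~50GB  -> "fails: not enough memory"
```

This is also why narrowing the query span (one day vs. the whole year) matters: it shrinks the small side that has to fit.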

Backups needing to be in read-only mode – a self-explanatory issue; hopefully your needs don’t require frequent/real-time writes or loads 24 hours a day.

No real query plan evaluation – explain doesn’t give you much, as there are no indexes and an alternate storage mechanism and execution algorithm is used; it’s probably irrelevant anyway, since queries are evaluated in a completely different way. Query profiling (set profiling=1) works. The question arises whether differences in query structure matter: it would be interesting to take the same query written two different ways that perform differently in InnoDB (e.g. the join vs. subquery class of queries) and see how they behave here. * Indeed, there is a detailed analyzer: run ‘select calgetstats();’ and ‘select calgettrace();’ directly after a query on the command line.

SQL fickleness – various issues, such as:

- not being able to add a column anywhere other than the end of the table
- when doing mysqldump, you must use --single-transaction, otherwise it errs with a lock issue
- all non-aggregate columns in the SELECT and ORDER BY clauses must be included in the GROUP BY clause
- ‘insert into (select..)’ errs, while ‘insert into select’ works


And so, to sum up: we are using InfiniDB and recommend it; however, I’d suggest a thorough evaluation against your specific needs before making a commitment. Luckily, with an open source license, you can do just that.

Handlersocket Part III / examples / Perl

(originally posted 07/18/12)

other handlersocket entries:

part I – introduction to handlersocket

part II – handlersocket syntax Perl/PHP


Inserting Records

Single record

$res = $hs->execute_single(0, '+', ['101','mike','','2011-05-03'], 1, 0);

 die $hs->get_error() if $res != 0;

Multiple records

let’s insert four more records as a ‘batch’

$res = $hs->execute_multi([
 [0, '+', ['102','bob','','2011-12-29'], 1, 0],
 [0, '+', ['103','john','','2011-07-18'], 1, 0],
 [0, '+', ['104','jane','','2011-06-23'], 1, 0],
 [0, '+', ['105','dave','','2011-04-12'], 1, 0]
]);

for my $res (@$res) {
 die $hs->get_error() if $res->[0] != 0;
}

note: limit and offset do not affect inserts.

Reading Records

For reading records, you’ll want to additionally do something with the resulting array reference.

Something like this will work,

for (my $row = 0; $row < $ele_returned; $row += $count_cols) {
      my $user_name = $res->[$row + 0];
      my $user_email = $res->[$row + 1];
      my $created = $res->[$row + 2];
      print "$user_name | $user_email | $created\n";
}

Exact values

a) Selecting on an exact single value

$res = $hs->execute_single(0, '=', ['102'], 1, 0);

b) selecting on two or more exact values (where col1=a AND col2=b)

$res = $hs->execute_single(0, '=', ['102', 'mike'], 1, 0);

note: a select on two keys must have a multi-column index to support it

c) selecting on two or more exact values (where col1=a or col2=b)

$res = $hs->execute_multi([
[0, '=', ['102'], 1, 0],
[1, '=', ['mike'], 1, 0]
]);

note: the second element needed to use a different index (index id 1)

handlersocket needs indexes to find results, and like mysql, a multi-column index requires the leftmost prefix of the index. Again, handlersocket does not do table scans, so without a usable index you will not get any results at all.

d) Selecting on one or more values with (where col1 in (a, b, c….))

You can do this one of two ways,

$res = $hs->execute_multi([
 [0, '=', ['101'], 1, 0],
 [0, '=', ['102'], 1, 0],
 [0, '=', ['103'], 1, 0],
 [0, '=', ['105'], 1, 0]
]);

or

$res = $hs->execute_single(0, '=', [''], 10, 0, undef, undef, undef, 0,
['101','102','103','105']);

note: with the second query we can do more with it later on, such as filtering, while in the first query the arrays are independent of one another.

Limits and Offsets

These are very straightforward. We’ll start with a limit of 10, more than enough for our 5 records. (where col1>=a limit x,y)

$res = $hs->execute_single(0, '>=', ['101'], 10, 0);

101 | mike |
102 | bob |
103 | john |
104 | jane |
105 | dave |

$res = $hs->execute_single(0, '>=', ['101'], 3, 0);

101 | mike |
102 | bob |
103 | john |

$res = $hs->execute_single(0, '>=', ['101'], 10, 2);

103 | john |
104 | jane |
105 | dave |

Range Searches

handlersocket is a little odd in that only one boundary of a range search can be specified in the initial operator cell,

$res = $hs->execute_single(0, '>', ['102'], 10, 0);

103 | john |
104 | jane |
105 | dave |

a two-bounded range can indeed be done with an additional filter function (‘W’ stops the scan at a boundary, ‘F’ filters records out of those scanned),

$res = $hs->execute_single(0, '>',['102'], 10, 0, undef, undef,
[['F','!=', '0','104']]);

103 | john |
105 | dave |

the filter can be a different column as well,

$res = $hs->execute_single(0, '>',['102'], 10, 0, undef, undef,
[['F','=', '1','dave']]);

105 | dave |

as you can see, with the ability to adjust operators, filters, and columns, there are lots of combinations

updating records

similar to the selects, the key differences being the update flag ‘U’ and an array of new column values.

a single key

$res = $hs->execute_single(0, '=', ['103'], 1, 0, 'U',
['2015-12-12 00:00:00']);

a range

$res = $hs->execute_single(0, '>=', ['103'], 1, 0, 'U',
['2015-12-12 00:00:00']);

two keys

$res = $hs->execute_single(0, '=', ['102', 'mike'], 1, 0, 'U',
['2015-12-12 00:00:00']);

a range and filter

$res = $hs->execute_single(0, '>',['102'], 10, 0, 'U',
['2015-12-12 00:00:00'], [['F','=', '1','dave']]);

This is not working correctly.

an ‘IN’ and filter

$res = $hs->execute_single(0, '=',[''], 10, 0, 'U',
['2015-12-12 00:00:00'], [['F','!=', '1','mike']], 0, ['101','102','103','104']);

it’s important that the columns in the open_index call are exactly the same as in your update statement

Deleting Records


Single record

$res = $hs->execute_single(0, '=', ['103'], 1 , 0, 'D');

| user_id | user_name | user_email | created |
| 101 | mike | | 2011-05-03 00:00:00 |
| 102 | bob | | 2011-12-29 00:00:00 |
| 104 | jane | | 2015-12-12 00:00:00 |
| 105 | dave | | 2015-12-12 00:00:00 |

multiple records

$res = $hs->execute_single(0, '>=', ['103'],10 , 0, 'D');

| user_id | user_name | user_email | created |
| 101 | mike | | 2011-05-03 00:00:00 |
| 102 | bob | | 2011-12-29 00:00:00 |

$res = $hs->execute_single(0, '>=', ['103'], 2, 2, 'D');

| user_id | user_name | user_email | created |
| 101 | mike | | 2011-05-03 00:00:00 |
| 104 | jane | | 2011-06-23 00:00:00 |
| 105 | dave | | 2011-04-12 00:00:00 |

‘!=’ does not seem to be usable, or even accepted:

$res = $hs->execute_single(0, '!=', ['103'], 10, 0, 'D');

however, a filter can be made to do the same thing,

$res = $hs->execute_single(0, '>=',['101'], 10, 0, 'D',
undef,[['F','!=', '1','bob']]);

| 102 | bob | | 2011-12-29 00:00:00 |

filters and IN

$res = $hs->execute_single(0, '=',[''], $rows_returned, $offset,
'D', undef,[['F','!=', '1','bob']],0,['101','102','103','104']);

| 102 | bob | | 2011-12-29 00:00:00 |
| 105 | dave | | 2011-04-12 00:00:00 |

bugs (tested around 02-15-12)

a recent fix was implemented for filtering and ‘IN’ in combination, however,

you CAN NOT seem to use a range, filter and ‘IN’ statement at the same time, such as,

$res = $hs->execute_single(0, '>',['101'], 10, 0, undef, undef,
[['F','!=', '1','bob']],0,['101','102','103','104']);

103 | john |
104 | jane |
105 | dave | X

I had problems with the date field as well with just filtering and IN,

$res = $hs->execute_single(0, '=',[''], 10, 0, undef, undef,
[['F','>', '2','2011-07-18 00:00:00']],0,['101','102','103','104']);

101 | mike | X
102 | bob |
103 | john | X
104 | jane | X

as with the selects, all three (range, filters and IN) do not all work together

| 101 | mike | | 2011-05-03 00:00:00 |
| 102 | bob | | 2011-12-29 00:00:00 |

(105 is missing, even though the IN list should have excluded it from the delete)

putting it all together

here is a sample script in Perl,

### this script is far from optimized, particularly for many returned rows,
### but you probably shouldn’t be using handlersocket for a lot of returned data anyway

use strict;
use warnings;
use Net::HandlerSocket;

#1. establishing a connection
####### port 9999 to write, 9998 to read
#my $args = { host => '', port => 9998 };
my $args = { host => '', port => 9999 };
my $hs = new Net::HandlerSocket($args);

#2. initializing an index so that we can use it in the main logic
# MySQL tables will be opened here (if not already opened)
my $res = $hs->auth('secretwr');
die $hs->get_error() if $res != 0;

## no spaces in the column list or it will error
my $cols = "user_name,user_email,created";

### used to get the count of the columns above dynamically, needed below
my @count_cols = split(/,/, $cols);
my $count_cols = scalar(@count_cols);

$res = $hs->open_index(1, 'test', 'user', 'user_name', $cols);
die $hs->get_error() if $res != 0;

#3. main logic
my $rows_returned = 3;
my $offset = 0;
my $ele_returned = $rows_returned * $count_cols;

# select operators can be '=', '>', '<', '<=', '>=', '!='
$res = $hs->execute_single(1, '>=', ['101'], $rows_returned, $offset);
die $hs->get_error() if $res->[0] != 0;
shift(@$res);

########### multi-select version
#$res = $hs->execute_multi([
#  [1, '=', ['mike'], 1, 0],
#  [1, '=', ['jane'], 1, 0]
#]);
## this begins the loop for a multi-statement
#for my $res (@$res) {
#  die $hs->get_error() if $res->[0] != 0;
#  shift(@$res);

my $count_elements = scalar(@$res);
my $count_rows = $count_elements / $count_cols;

for (my $row = 0; $row < $ele_returned; $row += $count_cols) {
    if ($res->[$row]) {
        my $user_name  = $res->[$row + 0];
        my $user_email = $res->[$row + 1];
        my $created    = $res->[$row + 2];
        print "$user_name | $user_email | $created\n";
    }
}

print "elements: $count_elements\n";
print "rows: $count_rows\n\n";

### this is the closing bracket for the multi-statement version
# }

#4. closing the connection
$hs->close();


Handlersocket Part II / syntax Perl/PHP

(originally published 07/18/12)

other handlersocket entries:

part I – introduction to handlersocket

part III – handlersocket examples – Perl


Like the native access, you can use all handlersocket functionality via the perl-Net-HandlerSocket module, which is distributed as part of the HandlerSocket package.

If you already have HandlerSocket installed via Percona Server, you may still need to download the separate source package in order to get the client library; then simply go to the perl-Net-HandlerSocket directory within it and build the perl module per the instructions inside.

What perl-Net-HandlerSocket can do:

Like the native functionality, it can:

Select/insert/update/delete on primary and secondary keys
Filter on primary and secondary keys
‘IN’ type functionality
Queries with more than one parameter

The official documentation lives in the docs-en directory of the HandlerSocket source.

Initiate a connection


Perl:

use Net::HandlerSocket;

my $args = { host => '', port => 9999 };
 my $hs = new Net::HandlerSocket($args);

PHP:

$hs = new HandlerSocket($host, $port_wr);

very simple: choose your host, choose your port (normally 9999 or 9998; more below on this).

Error Handling

After executing a call, you can print alerts and stop the program with the following,

Perl:

die $hs->get_error() if $res != 0;

Or for multiple rows,

for my $res (@$res) {
 die $hs->get_error() if $res->[0] != 0;
}

PHP:

if (! /* whatever function you are calling */) {
 echo $hs->getError(), PHP_EOL;
}

1) edit your my.cnf settings (see configuration settings above)

2) then in your code,


my $res = $hs->auth('your_password');
die $hs->get_error() if $res != 0;

and in PHP,

if (!$hs->auth('your password')) {
    die('Fault auth');
}

The server will test the supplied password against both ports’ passwords and grant the permissions of whichever one matches.

Open an index

HandlerSocket relies completely on indexes to get data; it cannot retrieve a record without one (although it can certainly return the non-indexed columns within the row you retrieve using the index). You must open at least one index for HandlerSocket to operate on before you can query.


$res = $hs->open_index(a, 'b', 'c', 'd', 'e1,e2,e3…', 'f1,f2,f3…');

describing each position in the set above,

a – the index id; this can be any number, which you’ll use to differentiate all of your open indexes in your calls. Using the same number for a different index will overwrite the formerly opened index.
b – the database name
c – the table name
d – the name (as defined in your table) of the index you are using
e(1,2,3…) – the columns you are using in the index. If it’s a multi-column index, as in mysql you must include the left-most columns of the index (i.e. to use col2, you cannot leave out col1; you must include both).
f(1,2,3…) – the columns for which you want data returned.


$hs->openIndex(a, 'b', 'c', 'd', 'f1,f2,f3…');

Exactly the same as above EXCEPT there is no defining of partial indexes (no ‘e1, e2,…’).

Data transactions

To see specific examples, go to Handlersocket examples. The general form of the call is the following,


select, insert, update and delete can all be done with the general form below,

$res = $hs->execute_single(a, 'b', ['c', 'c2', 'c3'…], d, e, 'f',
[g1,g2,g3,…], ['h1','h2','h3','h4'], i, ['j1','j2','j3','j4']);

Not all positions are necessarily relevant; it depends on the operation you are doing.

a – index id
b – the operation, one of (>, =, !=, +)
c – the value for the index to query on OR the values for an insert
d – the limit of rows operated on, like LIMIT in mysql
e – the offset of the result, like OFFSET in mysql
f – ‘D’ for a delete or ‘U’ for an update, otherwise undef
g(1,2,3..) – if an update, an array of values to use for the update
h – an array with filter values
h1 – ‘F’ (filter those values for =, !=) or ‘W’ (stop once a value is hit, such as =)
h2 – filter operation (>, =, !=)
h3 – index id for the filter
h4 – value to filter on
i – key for IN statement
j(1,2,3..) – values for IN statement

The call itself can be execute_single or execute_multi. The latter takes an array of the series above; there are a couple of examples coming up to demonstrate.
I’ve found many positions not requiring a value may still need a placeholder, and an empty ‘’ will not work for some of them. Use undef in this case.


PHP is a little more flexible: you can use a similar executeSingle function for everything, or use specific functions for insert, update and delete,

$retval = $hs->executeSingle(a, '=', array('c'), d, e);
$retval = $hs->executeMulti(
    array(array(a1, '=', array('c1'), d1, e1),
          array(a2, '=', array('c2'), d2, e2)));
$hs->executeInsert(a, array('c1', 'c2', 'c3'));
$hs->executeUpdate(a, 'b', array('c1'), array('c2'), d, e);
$hs->executeDelete(a, 'b', array('c'));

Things to keep in mind (optimization, troubleshooting and bugs)

a) execute_multi is supposed to be much faster for many statements than looping through execute_single.

b) Contrary to the documentation, I’ve found that if you do not include definitions for the limit and offset, it will error. Using undef for each place will work, but throws a ‘Use of uninitialized value in subroutine entry at’ warning.

c) DO NOT leave spaces on either side of the commas in the ‘open index’ call for (e) and (f) – it will error. I assume this is because these are multiple values and each is not enclosed by quotes.

Introduction to Handlersocket

(originally published 07/08/13)

other handlersocket entries:

part II – handlersocket syntax Perl/PHP

part III – handlersocket examples – Perl


There is very little practical documentation on the web regarding all of the functionality behind HandlerSocket; this series of posts is an attempt to shed a little more light on what you can do with this fantastic protocol, heavy on syntax, with basic and meatier examples.

Why would I want to use HandlerSocket?

HandlerSocket is essentially a protocol that bypasses much of the work done by a traditional SQL call, and it is particularly useful for simple queries, where the work of preparing and executing the statement itself is considerable compared to the work and time of retrieving the data. In that case, the work per query for HandlerSocket is a fraction of an SQL request. Additionally, the connections are very light in comparison, so it’s feasible to have many more concurrent connections. Considering these factors, in an extremely high volume environment largely consisting of simple key lookups, it can help scale load many times over.

For a great deal more information on HandlerSocket,

Comparison to prepared statements

Certainly a valid question: prepared statements bypass the most costly part of the process as well, the parsing, which puts the performance in the same ballpark. However, there is still the open/close, lock/unlock and query execution plan to consider, making HandlerSocket somewhat faster. It also consumes quite a bit less memory (particularly if the result set is large), can handle multiple threads per connection and can serve requests in bulk, resulting in many more simultaneous requests served. Additionally, HandlerSocket gives you the flexibility to keep queries dynamically built in your code instead of defining them in a server-side prepared statement.

You can find a detailed comparison on performance here,


I’m not going to cover installation here, but there are two main ways,

1) download and install separately

2) it comes with Percona server (Percona server is mysql with many custom enhancements and extras)

Some parts of the installation document will be relevant for both cases,

For it to be accessible through a language (Perl, PHP), you’ll need to download the client for that as well, links below.

Configuration settings (done in my.cnf)

The ports

9999 = read and write
9998 = read

9998 is faster for reads, by about 20-30%. Of course it also secures read-only environments.

loose_handlersocket_port = 9998
 loose_handlersocket_port_wr = 9999


Very simple to use and optional, set plain text passwords for each port.
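As a sketch, the port and password entries might look like the following in my.cnf. The plain-secret option names are taken from the HandlerSocket configuration docs; treat them as an assumption and verify against your installed version, and note the passwords are stored and sent in plain text.

```ini
# read-only port and its optional password
loose_handlersocket_port = 9998
loose_handlersocket_plain_secret = readers_password

# read/write port and its optional password
loose_handlersocket_port_wr = 9999
loose_handlersocket_plain_secret_wr = writers_password
```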


Ways to access HandlerSocket

* Natively through telnet or nc

* C++ -> libhsclient folder at

* Perl -> perl-Net-HandlerSocket folder at

3rd party






Galera set up troubleshooting

In a previous post on Galera, I mentioned it’s important to look at both the joiner and donor logs to get a full picture of what problems you may be running into. Sometimes even that is not enough and you’ll need to spend time narrowing down the issue. It’s important to keep in mind that Galera is relatively young, so the documentation, error messages and online help are not comparable to mysql’s.

We had a situation where we were building a staging environment, duplicating our production run of Galera, yet we were running into problems not previously encountered. Two nodes (one an arbitrator) were online and we were attempting to join a third node and have xtrabackup transfer over the data. The node would initiate xtrabackup and appear to be in operation, as both nodes participated in the operation, but then quit after a few moments. Typically you might run into permissions issues, file locations, etc., with either xtrabackup or the logs; either /var/log/mysqld.log on the donor or the accompanying /var/lib/mysql/innobackup.backup.log will let you know what the problem is. In this case, however, we were getting no obvious error,

the joiner log simply quit with,

SST failed: 32 (Broken pipe)

and the donor had a little more information,

WSREP_SST: [ERROR] innobackupex finished with error: 2.  Check /var/lib/mysql//innobackup.backup.log (20130422 11:23:08.361)

[ERROR] WSREP: Failed to read from: Process completed with error: wsrep_sst_xtrabackup --role 'donor' --address '' --auth 'xxxxx:xxxxxxxx' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --gtid 'a77b0b24-ab77-11e2-0800-f92413e82717:0'

Process completed with error: wsrep_sst_xtrabackup --role 'donor' --address '' --auth 'xxxxx:xxxxxxxx' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --gtid 'a77b0b24-ab77-11e2-0800-f92413e82717:0': 22 (Invalid argument)

which in retrospect did subtly point out the issue, but the pointer to innobackup.backup.log distracted from the real one. Looking at that log, again a broken pipe error 32 was noted, which doesn’t really tell you anything.

And so commenced the testing of port/db/file permissions, switching which node was donor vs. joiner to see if the issue went both ways, comparisons to our production environment, etc. mysql seemed to be fine, xtrabackup seemed to work correctly, etc.

The test that made it obvious was switching the SST type to mysqldump, which then proceeded to behave the same way, but with some much more obvious error messages,

ERROR 2003 (HY000): Can't connect to MySQL server on '' (4) 130422 11:24:11 [ERROR] WSREP: Process completed with error: wsrep_sst_mysqldump --user 'root' --password 'xxxxxxxx' --host '' --port '3306' --local-port '3306' --socket '/var/lib/mysql/mysql.sock' --gtid 'a77b0b24-ab77-11e2-0800-f92413e82717:0': 1 (Operation not permitted) 130422 11:24:11 [ERROR] WSREP: Try 1/3: 'wsrep_sst_mysqldump --user 'root' --password 'xxxxxxx' --host '' --port '3306' --local-port '3306' --socket '/var/lib/mysql/mysql.sock' --gtid 'a77b0b24-ab77-11e2-0800-f92413e82717:0'' failed: 1 (Operation not permitted) hostname: Unknown host hostname: Unknown host

The ‘invalid argument’ in the first set of errors was referencing an unrecognized host. The staging servers were using internal ips, which work fine as far as Galera is concerned, except SST (xtrabackup/mysqldump) does not use the Galera node ips specified in ‘wsrep_cluster_address’ directly; a lookup is done and SST subsequently used the public ips, which were not open.

The solution is simple: you can use the variable,

in your my.cnf to explicitly specify what ip SST should use.




using replication wisely

Replication has a multitude of uses and is a very effective tool for solving a number of problems; however, it must be used with some forethought, particularly if you are using it to scale OLTP queries, as the drawbacks of its single-threaded, asynchronous nature can cause all sorts of issues.

Instead of a multi-slave replication setup, we recently applied it to a sharded implementation. There are a few fundamental choices regarding sharding, one of which is what to do with the common tables that are not sharded. You can keep one copy, either on one of the shards or on a separate node, or replicate these tables to all shards. Factors include the network traffic, the queries (joins across tables), and the write frequency. Considering these factors, we chose to replicate to all nodes, as we wanted to minimize code changes and minimize any additional network traffic and connection overhead from going to an additional node on reads (as we run a high volume of simple queries).

Then the question becomes how you replicate, as you only want the common tables to be replicated. Especially when starting with a few shards, you may be tempted to leave the common tables loose and/or in another db on one of the shards and either use ‘replicate-do-db’ or ‘replicate-do-table’ on the slave side, or ‘binlog-do-db’ on the master side (the db options would of course require these tables to be in a separate db). Using binlog-do-db means you only write the statements for the db in question to the binlog, while replicate-do-db/table means non-needed tables are only ignored when the SQL is replayed on the slave. In our case this difference in traffic would be 100:1; the load of a full write to the slave would be significant. However, the major drawback of binlog-do-db is that you are no longer able to use the binlog on the master for point-in-time recovery, which was a deal breaker for us. By leaving the tables in a common instance, you are forcing yourself to choose between less replication traffic and losing point-in-time backup coverage.
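As a hedged my.cnf sketch of the two filtering approaches discussed above (the db and table names are placeholders),

```ini
# master side: only statements for the 'common' db are written to the
# binlog -- less replication traffic, but the binlog can no longer be
# used for point-in-time recovery of the other dbs
binlog-do-db = common

# slave side: everything arrives in the relay log; the SQL thread
# simply skips statements outside the 'common' db (or table)
replicate-do-db = common
# or per table,
# replicate-do-table = common.lookup_table
```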

The alternative is to put these common tables in a separate mysql instance (the instance can even be on the same server). The data is written to this instance, and then the ‘common tables’ db in this instance is replicated to all shards as that db. The advantage is a separate binlog for this instance, which in our case carried modest traffic, while the original binlogs on the shards stay intact. This is also more in line with scaling the shards: going from 3 -> 20 shards will be transparent, as the common db can simply be moved to its own hardware when needed, with lots of room to grow on the replication end of things as well.


HA database pooling

For any application needing high availability/scalability on the database, you’ll probably look at a load balancing solution such as HAproxy. The typical description of the architecture boils down to your app servers connecting to (usually) more than one HAproxy server on separate hardware, with a keepalive script between them for failover purposes. The active HAproxy serves load-balanced requests to the pool of databases, in either a failover or balanced scenario.

There is an alternative to this setup. You can simply add an HAproxy instance directly on each app server (6 app servers, 6 haproxy instances), with no keepalive script. If an app server goes down, the haproxy goes down with it, which is just fine, as each app server has its own dedicated HAproxy. Each one also serves a pool of databases in either a failover or balanced scenario.

There are not many situations where I can say one option is clearly better than the other; however, for the life of me, I can’t see why you wouldn’t do the latter setup. One less vertical hardware layer, no need for haproxy failover, and no bottleneck and/or complicated loading logic at haproxy, especially as more app servers and dbs are added. Sure, there are many haproxies in play, but they all read from the same dbs and act based on exactly the same health test against those dbs, which you should certainly be doing above and beyond a connection test.

my version of consistent hashing

When sharding, a common method of distributing your keys randomly is to use some sort of hashing algorithm in your code that is not time dependent, key dependent (i.e. evens, odds), etc. You also want to keep in mind the work of reallocating data if you either add or remove shards.

There is a good amount of material on the subject,

here are a few links,

and the original paper,…

if you like technical papers.

Why am I writing about it again? I don’t love the usual visual explanation (the circle broken into whole sections), so I’m going to explain it the way I understand it best.

Naive hashing

Let’s say you had shards 1, 2 and 3, and wanted to add shard 4,

with naive hashing, you do some sort of modulus of the hash to assign the key to a shard.

whichshard = hash('key') % N

where N = the number of shards, and % is the modulus, which produces a number between 0 and N-1 that you map to one of the N shards. A typical hash function might be crc32().

Reallocation of data when adding or removing a shard is a real problem with this method. To demonstrate, we’ll assign some data, represented by the numbers 1 -> 15, across the nodes.

 shard1 1, 4, 7, 10, 13
 shard2 2, 5, 8, 11, 14
 shard3 3, 6, 9, 12, 15

now let’s add a shard, and redistribute these same pieces of data,

 shard1 1, 5, 9, 13
 shard2 2, 6, 10, 14
 shard3 3, 7, 11, 15
 shard4 4, 8, 12

you’ll see when you compare the data that most of it has shifted to other shards; in fact 3/4 of your data will move going to 4 shards, 7/8 going to 8 shards, etc. You’ll have to move almost all your data each time you adjust the number of shards, because the modulus in your algorithm is now accounting for an extra shard.
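A quick Python sketch of this (illustrative only; crc32 stands in for hash(), and shard numbers here are 0-based) confirms roughly 3/4 of the keys land on a different shard when going from 3 to 4 shards:

```python
from zlib import crc32

keys = [str(k) for k in range(1000)]

def naive_shard(key, n_shards):
    # naive hashing: modulus of the hash picks the shard
    return crc32(key.encode()) % n_shards

before = {k: naive_shard(k, 3) for k in keys}
after = {k: naive_shard(k, 4) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys moved")
```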

consistent hashing

So we turn to consistent hashing. The explanation is usually accompanied by a unit circle divided into sections, one per shard, with a range of values assigned per shard. To add a shard, it is explained, you essentially divide one of these sections into two and move only the keys you need from the shard you divided to the new one. Of course, you’ll now have two smaller shards that are half the size of the rest, when what you want is to balance the load on all shards. The explanation then continues: instead of dividing one shard into two sections, take bits from all the shards to make your new shard.

Now why push the idea that the shards are assigned full sections (i.e. 0 -> 1/5 of 15 = (1, 2, 3)) when your additional shard will need to be assigned random values (4, 9, 14)? Let’s just start with a random distribution for all shards, with the understanding that it doesn’t matter what the algorithm assigned to which node, as long as it assigns the same thing with additional shards in the mix. So let’s start with the same distribution as before (but really, any of these numbers could be assigned anywhere),

 shard1 1, 4, 7, 10, 13
 shard2 2, 5, 8, 11, 14
 shard3 3, 6, 9, 12, 15

now let’s add a shard, and redistribute manually,

 shard1 1, 4, 7, 10
 shard2 2, 5, 8, 11
 shard3 3, 6, 9, 12
 shard4 13, 14, 15

We see that we have chosen to assign shard4 a result from each of the first three shards. It does not matter which ones go to shard4, and the result is that none of the first three shards need data moved onto them, only moved off to the new shard. It also accounts for only 1/5 of the data; if we had 20 shards, we would only have to move about 1/20 of the data. Obviously, you are going to have some sort of logic that maps the formula results to a node,

if (result in (2,5,8,11)) { use shard2 }

So, for our demo, how do we get a repeatable value from 1 to 15 out of a formula? We now have,

whichshard = ceiling(hash('key')/maxvalue(hash())*N)

ceiling is a function that rounds up to the nearest integer. Dividing by the maximum possible hash value normalizes the hash to between 0 and 1, and multiplying by N scales that up to a result between 1 and N, matching your assigned values.

if we are using crc32, the max value is 4294967295, so something like,

whichshard = ceiling(crc32('key')/4294967295 * N)
e.g.. -> ceiling(crc32('1234567')/4294967295 * 15) = 5

and we see that the result, 5, was assigned to shard2, and it will always be assigned to shard2 unless we move it to another shard in our grouping definition.

Lastly, you can run the above formula as an SQL select statement, but in production, do it in your code so you are not hitting the db.
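Putting the pieces together as a Python sketch (the bucket-to-shard grouping below is the hypothetical assignment used in the demo; math.ceil plays the role of ceiling and zlib.crc32 of crc32):

```python
from math import ceil
from zlib import crc32

N = 15                   # number of buckets; stays fixed as shards change
MAX_CRC32 = 4294967295   # maxvalue(crc32)

# grouping definition: which bucket result maps to which shard
shard_map = {}
for bucket in (1, 4, 7, 10, 13): shard_map[bucket] = 'shard1'
for bucket in (2, 5, 8, 11, 14): shard_map[bucket] = 'shard2'
for bucket in (3, 6, 9, 12, 15): shard_map[bucket] = 'shard3'

def which_shard(key):
    # whichshard = ceiling(crc32('key') / maxvalue(crc32) * N)
    bucket = ceil(crc32(key.encode()) / MAX_CRC32 * N)
    return shard_map[bucket]

# adding a shard4 later means only reassigning a few buckets, e.g.
#   shard_map[13] = shard_map[14] = shard_map[15] = 'shard4'
print(which_shard('1234567'))
```

Because the bucket count N never changes, rebalancing is purely an edit to the grouping map; the formula keeps producing the same bucket for the same key.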

Boo! a frightful hour in production, ghost of BIGINT

Well, we had quite an hour of uncertainty this morning. At 7:30am the behaviour of one of our data servers changed radically, with no obvious reason looking at the process list, error log, etc. We’d done a number of things the past few days to ramp up for a version launch of one of our top games, which included new code, data structures, etc., along with a change in hardware for the memcached instances.

Processes were slowly building up, yet the load was low, and at least some queries certainly were processing. In retrospect, I regret not looking at show engine innodb status, as that may have made it much more apparent what was going on; however, considering the other factors, I first looked at other explanations to do with the recent changes in code, the memcached changes, etc.

The head developer mentioned he had looked at one of our big tables and it had 2.1 billion rows. However, we had already anticipated the datatype limitation of INT and gone through the exercise of changing the primary key column to BIGINT, so it seemed unlikely this was the cause. Still, the number bothered me, so I took a look at exactly what it was, and indeed, the table passing the threshold of INT (2147483647 signed) exactly coincided with whatever problem we were having.

Looking at the processlist, sure enough there were many queries acting on the id 2147483647 (about 10% of the total queries). So what could be the problem? We use a lot of stored procedures to minimize network traffic, and the developer then remembered that in the stored procedure declarations the primary key was defined as INT, not BIGINT. All incoming queries (and new users) were getting a max value truncated to 2147483647, limited by the stored procedure, not the table. All users created before 7:30am were processing normally, albeit much more slowly; anyone signing up after 7:30am was hammering the same row, thus the slow query build-up and, of course, contention due to the hotspot.

What’s scary is the dumb luck of discovering this sooner rather than much later; I never would have looked at the row count, thinking it was a problem already solved. I’d like to think an eventual show engine innodb status would have revealed the row-level locking on that id. Apologies to the user assigned id 2147483647 for their game data; I can’t imagine what the game experience must have been like for them.

Galera gotchas

We’ve recently implemented Galera clustering and have been pleased with the relatively easy install and implementation. A quick description of Galera: it joins individual mysql dbs as nodes to create a cluster featuring multi-threaded synchronous replication, which allows for true high availability while still using your original db and engine (innodb), and likewise the ability to quickly break the cluster back down to individual servers if need be. Our specific setup includes two dbs and an arbitrator to avoid ‘split-brain’ in case of a node failure. The process of adding a node simply involves connecting to the group; the data transfer and sync are automatically done via SST (state snapshot transfer), with the typical overhead associated with the familiar backup methods: mysqldump, rsync and, more recently, xtrabackup. With xtrabackup you truly have a non-blocking method to add and re-add nodes. Recent improvements also include the addition of IST (incremental state transfer), which allows you to disconnect a node, do work, reconnect, and quickly catch up on the missing transactions on that node.

As mentioned, the install and implementation has been quite smooth, however here are a few things to keep in mind,

1. On installation, when running into errors, it’s important you analyze both the joiner AND donor error logs, as they of course will have differing messages. We ran into what ended up being an xtrabackup issue, which was misleading from the joiner logs, but clear as day in the donor logs.

2. On initial installation, you’ll be told to set wsrep_cluster_address=gcomm:// for your first node, as there is nothing to join to. However, DON’T keep this in your my.cnf, as on restart, you’ll end up creating a new cluster (again), not join the one you’ve made. Change it to specify the ip of one of the other nodes.

3. Similar to replication, Galera will auto-increment offset by the number of nodes, this is automatic, however, keep this in mind regarding large tables and datatype limits.

4. You may be surprised to learn that some fundamental operations will lock the entire cluster for the duration of the operation without some care. Here are two examples and by no means the only statements that can be long running and cause grief,

  • An alter table locks the entire cluster, even on a table that is not in use
  • A load data infile also locks the entire cluster even on a table that is not in use

The first is partially due to how galera handles DDL statements, total order isolation (TOI) is the default, more info can be found here,

the fact that it affects all tables and dbs is a bug / feature, more details here,

I assume the load data infile lock is due to the synchronous requirements of the cluster waiting to receive confirmation of the commit on the second node.

you have a couple of options to avoid a cluster wide lock,

a) for the alter table scenario, as detailed at the link above, you can use a DDL handling method known as rolling schema upgrade (RSU) which automatically detaches the node for the duration of the operation, then resynchronizes when finished.

b) for both the load data infile and alter table you can do a more manual version of this by simply disconnecting each node, performing the operation, and reconnecting

c) A third version is to issue a command to only apply the command locally,

SET wsrep_on=0;
do your command (alter, load data, etc...)
SET wsrep_on=1;

All methods would have to be performed on all nodes and particularly with ALTER, you’ll need to consider whether inconsistent schemas on the nodes will cause replication problems.

5. As noted in the limitations section,

only innodb is supported (myisam support is starting to become experimental). When dealing with the mysql permissions db (whose tables are myisam), use ‘grant’ and ‘create’ instead of ‘insert’ if you want the commands to replicate.

6. A caveat and further explanation of the last point – for tables other than innodb, DML statements are not replicated, but DDL statements are. This is because DDL statements are passed on using a different method than DML statements. This difference also has implications that can cause some confusion: on our two nodes, we were mystified as to why the dbs would replicate DDL statements in both directions, node1 <–> node2, yet would replicate DML statements only node1 –> node2. Figuring that since the DDL was replicating both ways we could rule out a configuration restriction issue, we were wrong: the eventual cause was a replicate-wild-do-table= specification in my.cnf for a particular db, while we were using the ‘test’ db for the test. The setting would not allow DML replication, yet Galera allowed the replication of the DDL (create and alter) of the table in the ‘test’ db.

7. It may take some time for a node to become fully operational; for instance, you might not be able to log in at first, or you’ve logged in and issued a command and ‘command not found’ is returned. Just give it a little time and/or check the error logs to confirm it’s on track.

mitigating the pain of mysql restarts

Every now and then you’ll have no choice but to restart the mysql server, whether it be for editing server variables, upgrading, etc. When the server is shut down, among the operations performed, it commits all active transactions and flushes all dirty pages (data in memory) to disk; on restart, the buffers are wiped and data is re-cached as needed. On a busy server, this set of events is not trivial, as you’ll find it can take quite a bit of time for the pages to flush, and there is an increased load on the server as it re-populates the cache with its working data set. There are several things you can do to help with this process.

Be sure there are no active transactions

The server can’t shut down while there is a running transaction. Do a ‘show processlist’ to be sure.

Flush pages before server shutdown

Have you ever shut down the server, watching the dots add up on the command line, wondering what it’s doing and when it’s actually going to shut down? Most of the time is taken by the flushing of dirty pages. Innodb caches a certain amount of dirty pages, loosely specified by innodb_max_dirty_pages_pct.

The default as of 5.5.24 is 75 (%). The reason it keeps a certain percentage of dirty pages is to group data together and rearrange it in sequential order so it can be written in one go on disk, resulting in much less I/O. You can change this setting dynamically. To cause innodb to flush as frequently as possible, you can issue the following on the command line,

SET GLOBAL innodb_max_dirty_pages_pct=0;

Keep in mind this will cause additional load on the server, as you’ve now directed innodb to essentially write the data randomly, as it’s requested, in real time. You can monitor the current number of dirty pages with the following,

mysqladmin extended -r -i 1 | grep Innodb_buffer_pool_pages_dirty

you’ll see the number of dirty pages drop until it levels out (you’ll never hit 0 on a busy server). You are now at the minimum number of dirty pages and can issue your restart command.

repopulating the buffer pool

If you are worried about the extra load on the server when restarting, due to all pages needing to be read from disk instead of memory, you can quickly repopulate the cache if you have Percona’s flavour of mysql. No reason to repeat information here; go to,

warehousing and data analytics in a mysql world

Mysql in its early days had the reputation of being simple and fast but lacking critical features for many use cases. Over the years it’s become more stable and mature, with a ton of advancements in the innodb storage engine and features such as stored code, replication improvements, query planning improvements, etc. One area mysql has not tackled is BI and analytic OLAP; I can only assume it’s not been a priority with the user base to date. Oracle and SQL Server cover a different market and have these options. However, in this new era of big data, a wider array of businesses want to collect and analyze data, such as performance statistics, to improve their business offering. Fortunately, there are a few 3rd party vendors who have taken on this area with the spirit of open source and/or a tiered pay structure in mind, as you’ll find it can be very costly otherwise.

In our search to analyze the ever increasing volume of statistical data we were collecting, we considered the following products,

  • infobright
  • infinidb
  • luciddb
  • monetdb
  • fastbit
  • metakit
  • greenplum
  • hive

Almost all of these solutions implement a columnar architecture in one way or another; essentially rows become columns and columns become rows. This is done to take advantage of the layout of the data on disk and allow sequential reading (vs. random reads) of each long row (which was a column), which is the access pattern much of the time in OLAP-type queries, aggregating over a dataset much larger than memory.
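A toy Python sketch of the idea (purely illustrative; real column stores work with compressed on-disk pages, not in-memory lists): aggregating one column in the columnar layout scans a single contiguous sequence, rather than stepping over every field of every row.

```python
# row-oriented layout: one tuple per row
rows = [(1, 'alice', 9.5), (2, 'bob', 7.1), (3, 'carol', 8.3)]

# column-oriented layout: one sequence per column ("rows become columns")
cols = {
    'id':    [1, 2, 3],
    'name':  ['alice', 'bob', 'carol'],
    'score': [9.5, 7.1, 8.3],
}

# an OLAP-style aggregate on 'score':
row_total = sum(r[2] for r in rows)   # must visit every row
col_total = sum(cols['score'])        # one sequential scan of one column

assert row_total == col_total
```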

Back to the products: a number of these we rejected for one reason or another, generally because they were either developed more for the scientific community (metakit, fastbit), geared towards another open source db such as postgresql (greenplum), or focused on distributed systems (hive), which is a bit of overkill for us at this point.

so we were left with,

  • infobright
  • infinidb
  • luciddb
  • monetdb

infobright and infinidb have a free and paid version (the free versions both have limitations), luciddb and monetdb are open source.

our priorities in order were as follows,

- high compression
- ease of use (ideally a mysql client and connection)
- full DDL, DML statement use

there is a very good benchmark comparison done by percona on these 4 dbs,

keep in mind it’s a couple of years old and uses a specific set of data, so it’s definitely worth testing yourself for your particular needs.

So why do we care about compression? The pros are two-fold. First, significant compression obviously saves space, one of the premiums of an OLAP system, and the longer you can put off sharding (a distributed system), the better. Typical compression on full products ranges from a very impressive 10:1 to 40:1, so essentially you can fit 10 TB of data on a typical 1TB disk. Second, the compression means the disk arm only travels a fraction of the distance for a sequential read, making the operation faster.

Right off the bat: infinidb offers only a very modest 22% compression reduction in the free version, which is its major limitation (paid, you get 40:1 compression), and with monetdb it’s not clear there is any meaningful compression at all, the tests matching their non-committal statement about the product’s compression.

Infobright and LucidDB both offer in the neighborhood of 10:1 compression, very nice for a free product.

note: If we were in the market for a paid solution, we'd take a hard look at InfiniDB; along with the 40:1 compression, it is a MySQL build with their proprietary storage engine incorporated, making it an easy transition from stock MySQL.

So we are down to,

  • Infobright
  • LucidDB

The two have very different remaining pros and cons. Infobright is also a MySQL build with a proprietary storage engine incorporated; the drawback is the limitations placed on the free version, namely that most of the DDL and DML statement functionality has been stripped away. That means you cannot insert, update, delete, truncate, or alter a table. At first glance that may sound insane, but keep in mind that an analytics environment in many cases primarily needs only two things: to load data and to select data, and to do both quickly, which it does very well, MySQL style. If you wish to change the table or the data, you'll need to fully reload the table, and that restriction may be tough for an environment that serves customers. Our requirements, however, are only a daily load appending to the existing data and internal analysis, and we rarely change table structure on these particular tables, so we are willing to deal with a dump and reload of a table occasionally if need be.
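In practice that restricted workflow looks something like the following (table and file names here are made up): bulk load plus select is all you get in the free version, and any "change" becomes a dump and reload.

```sql
-- Daily append: bulk loading is still supported, and fast
LOAD DATA INFILE '/data/events_20120601.csv'
INTO TABLE events
FIELDS TERMINATED BY ',' ENCLOSED BY '"';

-- To "update" anything: dump, correct the file externally, reload
SELECT * INTO OUTFILE '/data/events_dump.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
FROM events;
DROP TABLE events;
-- recreate the table, then LOAD DATA INFILE the corrected dump
```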

LucidDB is a completely different animal: it requires Java to run and uses its own SQL variant and data loading methods (for instance, there is no datetime datatype; your only choice for preserving that data is varchar). It does have full DDL and DML statement functionality and no other obvious restrictions. With some learning curve we could certainly use this product, but considering the people who would use this database, such as BAs and developers, it made sense to choose a product that is as consistent as possible with what we are currently using.

So Infobright was the choice, with some workarounds to make it usable. Truncate and rename table can be emulated, and other issues had to be dealt with; for instance, in many cases you cannot simply copy an InnoDB schema over to a Brighthouse schema, as it will complain about keys, foreign keys, auto increment, unsigned datatypes, etc. We wrote a script to parse this stuff out; it's not even close to being generally usable, just for our specific tables, otherwise I'd include it here. Another example: the utf8 charset is handled badly, and you'll need to increase your datatype definition sizes 4x, e.g. varchar(32) -> varchar(128). There are a number of odd things like this, so I'm not sure I'd be as eager to use it for outside customers, but for our internal purposes it will fit just fine.
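To give a flavor of the kind of rewriting involved (this is not our script, and the table is invented purely for illustration), here is a before/after of the sort of transformation needed:

```sql
-- InnoDB original:
CREATE TABLE events (
  id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
  user_id INT UNSIGNED NOT NULL,
  name    VARCHAR(32) CHARACTER SET utf8,
  PRIMARY KEY (id),
  KEY idx_user (user_id),
  FOREIGN KEY (user_id) REFERENCES users (id)
) ENGINE=InnoDB;

-- Brighthouse version: keys, foreign keys, auto_increment and
-- unsigned stripped out; utf8 varchar sizes quadrupled:
CREATE TABLE events (
  id      INT NOT NULL,
  user_id INT NOT NULL,
  name    VARCHAR(128) CHARACTER SET utf8
) ENGINE=BRIGHTHOUSE;
```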


pt-online-schema-change

This is one of many Perl script tools from Percona that make life easier administering, monitoring and testing MySQL; this one solves the need to alter a table (adding a key, column, partitioning, etc.) on a production system that can't be taken down. Many of the current Percona tools came from Maatkit, developed by Baron Schwartz. This particular one, from my understanding, was initiated by the Facebook development team for the myriad of DB servers they administer; however, there seem to be two distinct OSC products, and the one we are going to talk about is pt-online-schema-change.

It’s very impressive from a conceptual and safe-use point of view. The official description and download page is at

how it works under the hood

The description above will tell you the basics of how it works: essentially, an empty altered copy of the table is created, then filled by selecting data from the active table in manageable chunks as a precaution against overloading a busy server. Triggers are temporarily created on the active table to keep the altered copy up to date with real-time changes. You may question how a trigger can update a row on the new table that may not be there yet. This is done by using ‘replace’ (an insert, or a delete and insert) for all insert, update and delete statements in the triggers: inserts are of course inserted; updates become inserts on the new table (or a delete and insert if the row is already there); and deletes do the delete portion of the replace, with nothing to insert afterward. To avoid conflict with the rows being selected and inserted from the active table, an ‘ignore’ flag is used on those bulk inserts, leaving rows already written by the triggers as is. Once the copy is complete, an atomic rename of the two tables is done.
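The mechanics can be sketched roughly like this (the table, column, and trigger names are made up for illustration; the real tool generates its own):

```sql
-- 1) Create an empty, altered copy of the table
CREATE TABLE _events_new LIKE events;
ALTER TABLE _events_new ADD KEY idx_name (name);

-- 2) Triggers keep the copy current; inserts and updates both
--    become REPLACEs, deletes become plain deletes
CREATE TRIGGER pt_osc_ins AFTER INSERT ON events
FOR EACH ROW REPLACE INTO _events_new (id, name) VALUES (NEW.id, NEW.name);
CREATE TRIGGER pt_osc_upd AFTER UPDATE ON events
FOR EACH ROW REPLACE INTO _events_new (id, name) VALUES (NEW.id, NEW.name);
CREATE TRIGGER pt_osc_del AFTER DELETE ON events
FOR EACH ROW DELETE FROM _events_new WHERE id = OLD.id;

-- 3) Bulk copy in chunks; IGNORE defers to rows the triggers wrote first
INSERT IGNORE INTO _events_new (id, name)
SELECT id, name FROM events WHERE id BETWEEN 1 AND 10000;

-- 4) Atomic swap once the copy is complete
RENAME TABLE events TO _events_old, _events_new TO events;
```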

safety first

There are many safeguards to ensure your data doesn’t get screwed up or your server overloaded. Among the options are the ability to control the chunk size of the selects; a cap on the number of threads connected and running due to triggered queries and general load; a dry-run vs. execute option, where execute has to be explicitly specified; a rollback feature if the process is terminated at any point along the way; options not to drop the old table and not to swap the tables; and so on. The link above has specific implementation details.

things to keep in mind (as of pt 2.1.2)

A few things that I’ve run across:

1) When dropping foreign keys, you will need to specify the foreign key name as it exists on the altered table, which will not be the same as on the active copy. The tool puts an ‘_’ in front of the name on the altered copy, so you’ll need to do the same when specifying those foreign keys.

2) As per MySQL, partitioning does not support foreign keys, so you will not be able to both partition and drop foreign keys in the same operation. You’ll most likely need to run the tool twice, separately.

3) If you are running bin-logs with statement-based replication, and without tx_isolation=READ-COMMITTED or innodb_locks_unsafe_for_binlog, the ‘insert into … select * from …’ for each chunk in the copy process will need a read lock on the table; depending on what else is going on, such as backups, this can potentially cause a lock wait timeout, with the process exiting without completing.

4) There can be a momentary stall while the tables are being switched; see below for details.
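Point 1 above looks like this in practice (the constraint name fk_user is hypothetical): a foreign key created on the active table carries an underscore-prefixed name on the altered copy, so the drop you request must use the prefixed name.

```sql
-- Constraint as it exists on the active table:
ALTER TABLE events ADD CONSTRAINT fk_user
  FOREIGN KEY (user_id) REFERENCES users (id);

-- The altered copy carries the same constraint as _fk_user,
-- so the alter passed to the tool must reference that name:
ALTER TABLE events DROP FOREIGN KEY _fk_user;
```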

production results

We successfully used pt-online-schema-change to indirectly alter a 1.7-billion-row table handling 50,000 queries per second, with no downtime, taking about 15 hours to complete. There was a momentary stall (a fraction of a second) resulting in a load spike while switching out the tables, although max_used_connections does not seem to corroborate query build-up as the reason. We’ll take a closer look next time.

Where Partitioning hurts

As a mobile gaming platform provider, we collect a tremendous amount of data; our biggest tables collect upwards of 50 million rows per day, so partitioning is a must both for our statistical collection and especially for analytics, where we piece together the entire dataset over time. Partitioning essentially breaks a table up into smaller tables, arranged by a partitioning key, in our case a ‘dayofweek’ or ‘dayofyear’ column. A query can then be restricted to just the partitions relevant to it, such as a count of the records for a given day. This is a big advantage for both reads and writes that would otherwise need to scan the entire index of a table too large to fit into memory.
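A minimal sketch of such a scheme (the table here is invented, and note that MySQL requires the partitioning column to be part of every unique key, hence the composite primary key):

```sql
CREATE TABLE analytics_events (
  id        BIGINT NOT NULL,
  dayofyear SMALLINT NOT NULL,
  event     VARCHAR(64),
  PRIMARY KEY (id, dayofyear)
) ENGINE=InnoDB
PARTITION BY RANGE (dayofyear) (
  PARTITION p_q1 VALUES LESS THAN (91),
  PARTITION p_q2 VALUES LESS THAN (182),
  PARTITION p_q3 VALUES LESS THAN (274),
  PARTITION p_q4 VALUES LESS THAN MAXVALUE
);

-- The WHERE clause lets the optimizer prune to a single partition:
SELECT COUNT(*) FROM analytics_events WHERE dayofyear = 152;
```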

For the most part, given the sorts of queries we do, it’s almost automatic for us to partition big tables. We started with a couple of the biggest, and recently partitioned everything for both stat collection and analytics in preparation for ever-increasing volume. All of the queries are based on a particular day, week, month, etc., so I was surprised to hear a report that one query seemed to run about 2x slower once we partitioned its tables.

However, looking at the query, it made sense. The query joined two tables, both partitioned. One was our largest (analytics events), restricted to a particular day of the week and already partitioned prior, so no gain there. The other was our sessions table, newly partitioned, and the query did NOT restrict it to a particular time frame (such as the same day as the analytics events). The query was correct as written, because some sessions may have started weeks ago (an app can sit in the tray unused, so the session is still technically valid and is the correct identifier related to that day’s analytics events).

The point being: if you’ve partitioned your tables and you run queries that cannot take advantage of the partitioning, the server has to do a lot more work examining all the partitions, and will be slower than with a non-partitioned table. What may be less obvious is that the same holds true when you join tables and even one of them is partitioned. We also run queries on the sessions table that do restrict to a particular time frame, and those did indeed improve, but many of the crucial queries join to that table over its entire time range. For us, the solution was to remove the partitioning on the sessions table. You’ll of course have to make your best judgment as to the right trade-off, depending on the range of the majority of your queries.
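One way to check whether a query is actually pruning is EXPLAIN PARTITIONS (the table and column names here are hypothetical):

```sql
-- Pruned: the partitions column of the plan lists only
-- the partition holding that day
EXPLAIN PARTITIONS
SELECT COUNT(*) FROM events WHERE dayofyear = 152;

-- Not pruned: the join carries no time restriction on sessions,
-- so its partitions column lists every partition of that table
EXPLAIN PARTITIONS
SELECT COUNT(*)
FROM events e
JOIN sessions s ON s.session_id = e.session_id
WHERE e.dayofyear = 152;
```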