Is mysql/innodb good for high volume inserts?

As we collect and analyze an ever increasing amount of data, data storage products with alternate file structures have become popular, such as LSMs (log structured merge trees), which are implemented in Cassandra and HBase, as well as fractal indexes, found in TokuDB/MX, an alternate file structure plugin for mysql and mongoDB. The reasoning is this – to maintain b-tree indexes in a relational database such as mysql, which clusters the data by primary key, there is a constant rebalancing process to keep the order. When your data set becomes larger than memory, this translates to a large increase in disk I/O, which of course kills performance. This is the price you pay for an optimized search algorithm, so the story goes (and when I say story, I mean promotional material from companies using non-btree structures).

so a quick review of each of the three structures,

B-tree – nested levels of pointers, each block specifying a fractional range of the primary key. The top level is the root node, the bottom level holds the leaf nodes which contain the actual data, and the levels between are internal nodes. The branching factor for a b-tree is B, the number of nodes a block on the preceding level points to. For a block size of 16KB and, say, 100 byte rows, the branching factor could be around 160 and the tree depth for a billion rows ~4 levels.
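A quick back-of-envelope check of those numbers (a sketch only – it assumes the 16KB block and 100 byte row figures above, not InnoDB's exact page accounting):

#!/usr/bin/perl
# rough estimate of b-tree branching factor and depth for the figures above
use strict;
use warnings;

my $block_size = 16 * 1024;        # innodb default block (page) size
my $row_size   = 100;              # assumed average row size
my $rows       = 1_000_000_000;

my $branching = int($block_size / $row_size);   # ~160
my $depth     = log($rows) / log($branching);   # log_B(N)

printf "branching factor ~%d, depth for 1 billion rows ~%.1f levels\n",
    $branching, $depth;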

LSM – basically log files that are always appended to, never updated in place (even for an update), and thus always written sequentially. Files are sorted. To search, you must scan the files in reverse order. This is optimized by reducing the number of files through merging (compaction), and by limiting the range of the scan per file based on the query parameters. Bloom filters can help point queries (not ranges): they maintain an extremely frugal/efficient data structure that lets a read query cut down the number of files to be searched. As you can guess, reading is somewhat of an afterthought, traded away for insert performance.

Fractal – similar to a b-tree, however there is a buffer for each internal node block. Changes are not immediately written to the lower levels, only when a buffer is full, at which point they cascade down one or more levels. Doing it this way allows many writes to one block to be collected in memory until a leaf node block, carrying many changes, is finally written to disk as a group. A b-tree groups writes as well, but not per node level; the real advantage is not having to read the leaf nodes randomly for each operation, because the leaf node write is delayed. From my understanding, with a fractal index branching factor of B^1/2 = (4MB block / ex. 100 byte records)^1/2 = 200, and say just 10GB of memory on a 2TB database allocated to node buffers and pivots, 100-200 writes could be grouped into each leaf node write to disk (see below for the definition of B).

So the latter two file structures sound like a no-brainer for high volume inserting. However, there is a critical point to keep in mind here which is typically glossed over. You can (and usually do) insert sequentially into a b-tree structure as well, requiring very little memory to do so. The performance degradation for a b-tree only happens if your working dataset is larger than available memory and you’re inserting out of order of your primary key (i.e. not chronologically – for example, alphabetically). Each insert then has an equal chance of landing in any block of existing data, so you are likely to be reading the entire dataset in a random fashion. However, if you are inserting sequentially, such as by time or an auto-increment primary key, your application is writing to the same and/or adjacent blocks for the majority of your data, and those changes will be flushed together. Your need to hold all the data in memory goes away. Secondary indexes are indeed out of order, but they are a much smaller data set and are handled and grouped by the insert buffer. So basically, if you are inserting without a primary key, with an auto-inc primary key, or in some other time-oriented way, it’s a non-issue.
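To make that concrete, here is a toy simulation (my own illustration, not InnoDB internals) counting how many distinct leaf blocks a batch of inserts touches when the key order is sequential vs. random, for a table far larger than memory:

#!/usr/bin/perl
# toy comparison: distinct leaf blocks touched by sequential vs. random-key inserts
use strict;
use warnings;

my $recs_per_block  = 160;            # ~16KB block / 100 byte rows
my $existing_blocks = 10_000_000;     # existing table, far larger than memory
my $batch           = 100_000;        # inserts flushed together

# sequential (auto-inc / time ordered): everything lands at the right-hand edge
my $seq_blocks = int($batch / $recs_per_block) + 1;

# random key order: each insert can land in any existing block
my %touched;
$touched{ int(rand($existing_blocks)) } = 1 for 1 .. $batch;
my $rand_blocks = scalar keys %touched;

print "sequential inserts touch ~$seq_blocks blocks\n";
print "random-key inserts touch ~$rand_blocks blocks (mostly random reads from disk)\n";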

We’ve done up to 100,000 q/s, primarily inserts, into mysql/innodb on our production machines using a b-tree data structure – 32-core CPUs, multiple secondary indexes, a 3TB table – and that’s without using NoSQL APIs such as handlersocket or memcached (we do use stored procedures). That matches or surpasses most insert benchmarks I’ve seen, regardless of the product.

the math works out to be,

  1. B – records per block
  2. N – total number of records
  3. branching factor –  B for btree, B^1/2 for fractal

B-tree – disk reads and writes ~ logN/logB per operation, in-memory writes ~ 1/B

Fractal – disk reads and writes ~ logN/log(B^1/2) per operation, in-memory writes ~ 1/B

LSM – disk writes ~ logN/log(B^1/2) (including compaction), disk reads ~ (logN/logB)^2, in-memory writes ~ 1/B

obviously B is somewhat arbitrary (tokuDB->fractal uses a default block size of 4MB, innodb->b-tree 16KB); a larger block size offsets the smaller branching factor (B vs. B^1/2), and so both can read and write on disk within the same ballpark. With the larger block size, fractal is arguably better for writes in memory, although both are quite efficient at a tiny fraction of a disk I/O per operation. A larger block size does hurt random reads, however, as more data must be read into memory. LSM is seemingly at a clear disadvantage on reads, needing to scan all files, and thus a log factor more disk I/O. That last statement about LSM reads is a little misleading, as the Bloom filters used by LSMs make reads workable and, acting as a crude form of index, remove most of this overhead – although not completely, and not for ranges.
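Plugging the default block sizes into the read-side formulas gives a feel for the "same ballpark" claim (again an illustration only, with an assumed 100 byte record and ~20 billion rows, roughly 2TB):

#!/usr/bin/perl
# per-lookup disk I/O estimates from the formulas above
use strict;
use warnings;

my $rows     = 20_000_000_000;                 # ~2TB of 100 byte records
my $rec_size = 100;

my $B_btree   = 16 * 1024 / $rec_size;         # ~160 records per 16KB block
my $B_fractal = 4 * 1024 * 1024 / $rec_size;   # ~40,000 records per 4MB block

my $btree_io   = log($rows) / log($B_btree);          # log_B(N)
my $fractal_io = log($rows) / log(sqrt($B_fractal));  # log_{B^1/2}(N)

printf "b-tree:  ~%.1f block reads per point lookup (16KB blocks)\n", $btree_io;
printf "fractal: ~%.1f block reads per point lookup (4MB blocks)\n",  $fractal_io;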

So what might be a situation where you’d want to organize your data other than chronologically? Perhaps a site such as Wikipedia, which has documents accessed with a frequency unrelated to the age of the document. Additionally, you may want to fetch a range of documents, say all beginning with ‘Ag-‘. Then it may make sense to cluster by document name. However, you still have a viable option with a b-tree, as it might still make more sense to write chronologically (sequentially) and let an alphabetical secondary index take care of the search performance.

Your alternatives are to use an LSM product, which will write sequentially with no order to worry about but has the aforementioned drawbacks reading the data, or to use fractal index technology, which will perform in the same ballpark as a b-tree when inserting sequentially and much better when not, with the potential for less efficient reads and reliance on an overall less mature storage engine. Now, tokuDB/MX also has built-in compression, which may add another dimension to the offering in terms of performance and of course storage space, but that’s for another post.

Numa system performance issues – more than just swapping to consider

Almost all modern systems running a database have more than one physical processor (and multiple cores and threads on each) in order to scale the number of simultaneously running threads and, ultimately, the work the system is doing. To scale memory performance as well, the NUMA architecture provides each physical processor, or node, with its own ‘local’ memory. For systems that run a process like a mysql database that takes up more than the memory of one node (i.e. more than 50% on a two-node system), the resulting performance considerations are not necessarily straightforward. There is of course the excellent article by Jeremy Cole that I recommend reading as a general overview of NUMA-based systems, along with understanding the most damning of these potential issues, actual swapping due to memory imbalance: http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/ There are, however, other less obvious but potentially harmful performance issues of which you should also be aware.

non-local memory percentage of use and location

As mentioned, NUMA attempts to localize memory to a node, which helps a great deal in scenarios such as independent processes or sharded/multiple-instance dbs. Each process is bound to a node, and in turn the node attempts to use the local memory associated with it. It should be clarified that remote memory is simply the memory assigned to the other node(s), so local/remote memory is relative to the particular node referred to. A large process (larger than the local memory of one node) has no choice but to use memory from multiple nodes, so there is no way to avoid remote memory access: the particular data needed cannot be determined until the work has already been committed to a thread/core that may or may not have that data in local memory.

So, understanding that accessing non-local memory is a necessity, it follows that the percentage of remote memory use and its proximity are a priority and can make a big difference in cpu-intensive/large-memory applications such as a high-transaction, dataset-in-memory db. A fundamental choice that exercises this is, ‘how many nodes (sockets) do I choose for my system?’ Surely more is better? Not necessarily. A two-node system running a db that takes up most of the memory means you’ll likely access remote memory 50% of the time (save some internal optimizations), whereas on a four-node system you’ll be accessing remote memory 75% of the time (and 25% local). What’s more, the distance between nodes (and the memory bus the access must travel through) is also a factor, which is represented via,

numactl --hardware

two node

node distances:
node   0   1
0:  10  20
1:  20  10

four node

node distances:
node   0   1   2   3
0:  10  20  30  20
1:  20  10  20  30
2:  30  20  10  20
3:  20  30  20  10

Comparing two nodes vs. the distances between four nodes physically arranged in a square, it’s easy to see that a significant percentage of accesses require not one but two hops. Summing the two factors together – the share of remote accesses and the distance of those accesses – the average penalty over purely local access can very roughly double on a four-node system compared to a two-node system.
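A back-of-envelope way to see this, treating the numactl distance values as relative access costs (an assumption – they are only notional units) and assuming the db's memory is spread evenly across the nodes:

#!/usr/bin/perl
# expected access cost from one node's distance row, assuming a uniform spread of memory
use strict;
use warnings;

sub avg_cost {
    my @row = @_;               # one row of the node distance matrix
    my $total = 0;
    $total += $_ for @row;
    return $total / scalar(@row);
}

my $local     = 10;
my $two_node  = avg_cost(10, 20);            # row 0 of the two-node matrix
my $four_node = avg_cost(10, 20, 30, 20);    # row 0 of the four-node matrix

printf "two nodes:  avg cost %.0f, penalty over local %.0f\n", $two_node,  $two_node - $local;
printf "four nodes: avg cost %.0f, penalty over local %.0f\n", $four_node, $four_node - $local;

In these units the expected penalty over purely local access comes out to roughly twice as large on the four-node layout.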

memory reclaim modes

What exactly happens when a node runs out of local memory? This investigation was spurred by the fact that although we have NUMA enabled on most systems, we haven’t seen any swapping issues, even though some nodes on some systems are filled up and potentially under pressure. In fact, swapping is indeed one possibility, but it’s not the only one, and the behaviour depends directly on your virtual memory settings. The default behaviour for systems falls under the general category NODE LOCAL, and can include discarding clean pages (which will need to be read in again later), writing dirty pages out, or simply going to a remote node with no attempt to make room for data on the local node. Details can be found under the zone_reclaim_mode option at,

http://lxr.free-electrons.com/source/Documentation/sysctl/vm.txt

the gist of which is,

0       = Zone reclaim off
1       = Zone reclaim on
2       = Zone reclaim writes dirty pages out
4       = Zone reclaim swaps pages

I’ve seen it stated that the default on two-node systems is 0, which simply goes to remote nodes when local memory is filled, while the default on four-node systems is 1, i.e. ‘free cached pages which are not dirty on the local node before going to remote’. This is what we’ve found on our current systems, and additionally it appears the default can differ depending on the version of your OS. To check,

sysctl -a | grep reclaim

The takeaway is that although swapping is the most obvious and well-publicized performance issue, with the default settings (at least nowadays) you’re likely not to run into a memory pressure issue at all on two-node systems, while on four-node systems you may find the issue is more subtle: continuous re-reading of pages from disk for seemingly no good reason, rather than the swapping you might expect. The solution is to simply change your zone_reclaim_mode to 0,

vm.zone_reclaim_mode = 0

in /etc/sysctl.conf

and reload with sysctl -p
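If you want to verify the live value across machines, here is a trivial sketch that reads it straight from /proc:

#!/usr/bin/perl
# print the current zone_reclaim_mode and flag anything other than 0
use strict;
use warnings;

my $path = '/proc/sys/vm/zone_reclaim_mode';
open my $fh, '<', $path or die "cannot read $path: $!";
chomp(my $mode = <$fh>);
close $fh;

print "zone_reclaim_mode = $mode", $mode == 0 ? " (ok)\n" : " (consider setting it to 0)\n";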

good additional reading on this can be found at,

http://frosty-postgres.blogspot.ca/2012/08/postgresql-numa-and-zone-reclaim-mode.html

http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases

http://queue.acm.org/detail.cfm?id=2513149


Infinidb 4.0.2 review

As with most data-centric companies, there comes a time where it’s necessary to aggregate data to provide meaningful insight into the raw data that is collected. At Fuse Powered, that is essentially one of the core services we provide, whether it be via ad-hoc queries, reports, or our dashboard.

The core database that we use is of course mysql with the default storage engine Innodb, which is geared toward OLTP and does it very well. However, when it comes to aggregating a large number of rows, and/or loading a large amount of data quickly, this default storage engine is not efficient at these sorts of operations over very large amounts of data. A class of storage engine called columnar storage is designed for just this sort of work: the data is re-oriented into separate column-based data files, where each column can be read from disk sequentially and stored with high compression (resulting in disk storage savings and less data to read), as the gain in read speed far outweighs the cpu overhead of decompression.
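As a toy illustration of why the column orientation compresses so well (my own sketch, not Infinidb's storage format), here are the same synthetic rows serialized row-wise vs. column-wise and run through zlib:

#!/usr/bin/perl
# row-wise vs. column-wise layout of the same data, compressed with zlib
use strict;
use warnings;
use Compress::Zlib;

my (@ids, @countries, @scores);
for my $i (1 .. 100_000) {
    push @ids,       $i;                              # monotonically increasing
    push @countries, (qw(CA US UK DE FR))[$i % 5];    # low cardinality column
    push @scores,    $i % 100;                        # small numeric range
}

# row-oriented: one record per line, columns interleaved
my $row_layout = join "\n", map { "$ids[$_],$countries[$_],$scores[$_]" } 0 .. $#ids;

# column-oriented: each column stored contiguously
my $col_layout = join("\n", @ids) . join("\n", @countries) . join("\n", @scores);

printf "row-wise:    %d bytes -> %d compressed\n", length($row_layout), length(compress($row_layout));
printf "column-wise: %d bytes -> %d compressed\n", length($col_layout), length(compress($col_layout));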

There are a number of columnar-based offerings out there; however, almost all are closed source and quite pricey. Up until now we’ve utilized Infobright, a mysql-based solution with a watered-down community version that strips much of the functionality and performance from its full version. Recently, Calpont (recently re-branded after its product name, Infinidb) has released an open source licensing of its full 4.0.x version, a move that may potentially revolutionize how columnar databases are marketed.

This full version fills a number of core needs, such as compression, parallel processing, distributed storage, filesystem redundancy (HDFS, GlusterFS) and high speed data loading, and so we’ve been eager to test and implement.

The architecture is comprised of two main functional units, user modules (UM) and performance modules (PM). Performance modules work with and store the data, while user modules coordinate queries across the entire system. Scaling performance modules gives you more storage and potentially better performance from parallel work being done. Scaling user modules allows more concurrent queries to be run and provides redundancy for a failed UM. You can start with a single server that contains all units and grow out to a large distributed system, as both types of modules can be added as your data increases. Each PM has a set number of dbroots, which can easily be reallocated/moved to other PMs on other servers as your data grows. It also uses a form of partitioning called extent mapping, which will rule out sections of data in storage to speed up queries where possible. Queries are processed in parallel by whichever processors and nodes are appropriate.

Results

the following is our general experience with it to date,

Documentation – Very good; answers quite a bit in terms of concepts, installation, options, etc. However, because it’s only just been released as open source, it really needs much more use and documentation by the user community to flush out bugs, suggest fixes, etc.

Installation/configuration – straightforward: enable passwordless ssh between nodes, download/install the initial rpm/deb/binary files and then run their script, which will ask a series of questions. You can adjust the main configuration file later, on the root pm only (usually pm1), at /usr/local/Calpont/etc/Calpont.xml.

Schema transformation – you need to create infinidb versions of the table structures. This is going to take some time, as there are all kinds of incompatibilities when converting an innodb schema to infinidb.

Loading – load data infile vs. the infinidb loader. As you might guess, the loader is much faster, anywhere from 100,000 to 1 million+ rows/s for a single module load. The source of the load can be the user module, which will automatically distribute the data in 10,000-row chunks, or the performance modules individually, using pre-split files, in order to scale. You can set an error threshold over which the load will abort, with errors logged.

Compression – our data is fairly typical, largely numerical with some varchar and text/blob columns here and there. Our data compresses about 5X compared to uncompressed innodb.

Backups – pretty rudimentary, really just a file copy, which requires the system be in read-only mode to ensure data consistency. However, being a file copy, it is straightforward and fast.

Queries – performance is very impressive, especially for more complex queries. We had a class of aggregation queries that was taking 5-10 minutes on a dedicated box with 256GB of ram; on a three-node public/shared cloud system with 16GB each, they finished in 10-15 seconds.

Logging – very well done, especially around loading: there are two files per load per table that detail the error type and the exact characters that caused the issue. There are also various server logs for debugging issues.

Administration – much of it is done through an independent ‘cmconsole’, which allows you to set and view various parameters and processes as well as add and reallocate modules. There are several processes per node that need to be up and running and can be a source of problems if they aren’t; these are viewable with the console.

Monitoring – built in monitoring of a number of key metrics such as disk/memory/cpu usage, networking issues, module issues, etc..

Cross engine compatibility – it’s not just the infinidb storage engine; it’s also a full version of mysql, and you can add innodb capability with the plugin. However, it’s pretty darn old at 5.1.39. You can also mix storage engines in your queries, although you’ll need to enable that in the configuration file.


Issues

Before we get into the issues, it should be said that we have not attempted to contact infinidb reps; the only resources we used were the documentation and the infinidb forums (with posted responses by infinidb reps). Now, some of the issues are annoyances, others are fairly serious.

The most serious revolve around loading of data. We typically (re)load and append about 1000 tables per day, and we’ve run into the following behavior a few times in the last couple of weeks.

Locking – each load requires a table lock, and occasionally one of two things happens,

a) a table is locked and does not release the lock – we’ve even had a time or two where we’ve restarted the system and it still had not released the lock (contrary to documentation). There is a view-and-clear lock utility, but for some reason it’s not working for us at the moment either. Our solution was to hunt for the file that contains this lock, and sure enough, you can find it on your parent pm at /usr/local/Calpont/data1/systemFiles/dbrm/tablelocks. Simply removing it and restarting seems to work. This behavior seems to depend on the particular data being loaded – it happened on the same table 3 times, and we ran into most of this when initially loading our full data set.

b) a table lock can not be acquired – in this case, a load attempts to get a table lock, can’t get it, and aborts the load. There were no other queries running on the same table at the time; in fact this seemed to happen mainly when loading other tables/dbs simultaneously.

Critical errors – we’ve run into this a few times, where an error such as ‘image inconsistency’ causes the system to go into ‘read-only’ mode. We only experienced it during a load, which I suppose makes sense as that’s the only time we are significantly writing. When encountering this during a load, it also attempts to roll back but can’t, because again it’s been put into read-only mode. You might deal with the lock separately as above, and the read-only issue is supposed to clear after a system restart, but we’ve seen cases where it doesn’t. You then have two options: the advice at this link may help, http://infinidb.co/community/infinidb-not-starting otherwise, you may have to remove the entire infinidb installation and replace it with your latest backup. We are not using a redundant file system such as HDFS or glusterFS, which may or may not help.

Not enough memory to perform a select query – Infinidb uses a distributed hash join to evaluate large data sets quickly, and its memory requirement is dictated by the second largest table in the join. There is a variable, PmMaxMemorySmallSide, with a default of 64M up to a max of 4GB on the PM nodes; if the small side is larger than the PM setting, all of TotalUmMemory on the UM is used instead. We frequently join two tables, one at 40+ billion rows and the next at 1 billion rows, which has been an issue for certain queries, although our UM node is very modest at 16GB. Query span does make a difference (i.e. one day at a time vs. the entire year), so it’s not just a total table size metric.

backups needing to be in read-only mode – a self-explanatory issue; hopefully your needs don’t require frequent/real-time writes/loads 24 hours a day.

no real query plan evaluation – explain doesn’t give you much, as there are no indexes and an alternate storage mechanism and execution algorithm are used, so it’s probably irrelevant anyway since the query is evaluated in a completely different way. Query profiling (set profiling=1) works. The question arises whether differences in query structure matter; it would be interesting to take the same query written two different ways, where one way takes longer in Innodb (e.g. the join vs. subquery class of queries), and see how they behave here. * – indeed, there is a detailed analyzer: run ‘select calgetstats()’ and ‘select calgettrace()’ directly after a query on the command line.

sql fickleness – various issues such as,

– not being able to add a column anywhere other than at the end of the table
– when doing a mysqldump, you must use --single-transaction, otherwise it will error with a lock issue
– all non-aggregate columns in the SELECT and ORDER BY clauses must be included in the GROUP BY clause
– ‘insert into (select..)’ errs, while ‘insert into select’ works


And so, to sum up: we are using Infinidb and recommend it; however, I’d suggest a thorough evaluation against your specific needs before making a commitment. Luckily, with an open source license, you can do just that.

Handlersocket Part III / examples / Perl

(originally posted 07/18/12)

other handlersocket entries:

part I – introduction to handlersocket

part II – handlersocket syntax Perl/PHP

—————————————

Inserting Records

Single record

$res = $hs->execute_single(0, '+', ['101','mike','mike@101.com','2011-05-03'], 1, 0);

 die $hs->get_error() if $res != 0;

Multiple records

let’s insert four more records as a ‘batch’

$res = $hs->execute_multi([
 [0, '+', ['102','bob','bob@102.com','2011-12-29'], 1, 0],
 [0, '+', ['103','john','john@103.com','2011-07-18'], 1, 0],
 [0, '+', ['104','jane','jane@104.com','2011-06-23'], 1, 0],
 [0, '+', ['105','dave','dave@105.com','2011-04-12'], 1, 0]
 ]);
for my $res (@$res) {
 die $hs->get_error() if $res->[0] != 0;
 shift(@$res);
}

note: Limit and offset do not affect the insert.

Reading Records

For reading records, you’ll want to additionally do something with the resulting array reference.

Something like this will work,

for (my $row = 0; $row < $ele_returned; $row+= $count_cols) {
   if($res->[$row]){
      my $user_name= $res->[$row + 0];
      my $user_email= $res->[$row + 1];
      my $created= $res->[$row + 2];
      print "$user_name | $user_email | $created\n";
   }
}

Exact values

a) Selecting on an exact single value

$res = $hs->execute_single(0, '=', ['102'], 1, 0);

b) selecting on two or more exact values (where col1=a AND col2=b)

$res = $hs->execute_single(0, '=', ['102', 'mike'], 1, 0);

note: a select on two keys must have a multi-column index to support it

c) selecting on two or more exact values (where col1=a or col2=b)

$res = $hs->execute_multi([
[0, '=', ['102'], 1, 0],
[1, '=', ['mike'], 1, 0]
]);

note the second query needed to use a different index

notes:

handlersocket needs indexes to find results and, like mysql, a multi-column index requires the leftmost prefix of the index. Again, handlersocket does not do table scans, so without a usable index you will not get any results at all.

d) Selecting on one or more values with (where col1 in (a, b, c….))

You can do this one of two ways,

$res = $hs->execute_multi([
 [0, '=', ['101'], 1, 0],
 [0, '=', ['102'], 1, 0],
 [0, '=', ['103'], 1, 0],
 [0, '=', ['105'], 1, 0]
]);

or

$res = $hs->execute_single(0, '=',[''], 1, 0,
undef,undef,undef,0,['101','102','103','105']);

note that with the second form we can do more with it later on, such as filtering, while in the first form the queries are independent of one another.

Limits and Offsets

These are very straightforward. We’ll start with a limit of 10, more than enough for our 5 records. (where col1=a limit x,y)

$res = $hs->execute_single(0, '>=', ['101'], 10, 0);

101 | mike | mike@101.com
102 | bob | bob@102.com
103 | john | john@103.com
104 | jane | jane@104.com
105 | dave | dave@105.com

$res = $hs->execute_single(0, '>=', ['101'], 3, 0);

101 | mike | mike@101.com
102 | bob | bob@102.com
103 | john | john@103.com

$res = $hs->execute_single(0, '>=', ['101'], 10, 2);

103 | john | john@103.com
104 | jane | jane@104.com
105 | dave | dave@105.com

Range Searches

handlersocket is a little odd in that only one boundary of a range search can be specified in the initial operator cell

$res = $hs->execute_single(0, '>', ['102'], 10, 0);

103 | john | john@103.com
104 | jane | jane@104.com
105 | dave | dave@105.com

a two-bounded range can indeed be done with an additional filter function: ‘W’ stops the scan once its condition is hit, while ‘F’ filters records out of the result. For example, excluding user_id 104 from the open-ended range above,

$res = $hs->execute_single(0, '>',['102'], 10, 0, undef, undef,
[['F','!=', '0','104']]);

103 | john | john@103.com
105 | dave | dave@105.com

the filter can be a different column as well,

$res = $hs->execute_single(0, '>',['102'], 10, 0, undef, undef,
[['F','=', '1','dave']]);

105 | dave | dave@105.com

as you can see with the ability to adjust operations, filters and columns, there are lots of combinations

updating records

similar to the selects, the key differences being that an update flag ‘U’ and an array of column values are specified.

a single key

$res = $hs->execute_single(0, '=', ['103'], 1, 0, 'U',
['2015-12-12 00:00:00']);

a range

$res = $hs->execute_single(0, '>=', ['103'], 1, 0, 'U',
['2015-12-12 00:00:00']);

two keys

$res = $hs->execute_single(0, '=', ['102', 'mike'], 1, 0, 'U',
['2015-12-12 00:00:00']);

a range and filter

$res = $hs->execute_single(0, '>',['102'], 10, 0, 'U',
['2015-12-12 00:00:00'], ['F','=', '1','dave']);

This is not working correctly.

an ‘IN’ and filter

$res = $hs->execute_single(0, '=',[''], 10, 0, 'U',
['2015-12-12 00:00:00'],[['F','!=', 'mike']],0,
['101','102','103','104']);

it’s important that the columns in the open_index call are exactly the same as those in your update statement

Deleting Records


Single record

$res = $hs->execute_single(0, '=', ['103'], 1 , 0, 'D');

+---------+-----------+--------------+---------------------+
| user_id | user_name | user_email   | created             |
+---------+-----------+--------------+---------------------+
| 101     | mike      | mike@101.com | 2011-05-03 00:00:00 |
| 102     | bob       | bob@102.com  | 2011-12-29 00:00:00 |
| 104     | jane      | jane@104.com | 2015-12-12 00:00:00 |
| 105     | dave      | dave@105.com | 2015-12-12 00:00:00 |
+---------+-----------+--------------+---------------------+

multiple records

$res = $hs->execute_single(0, '>=', ['103'],10 , 0, 'D');

+---------+-----------+--------------+---------------------+
| user_id | user_name | user_email   | created             |
+---------+-----------+--------------+---------------------+
| 101     | mike      | mike@101.com | 2011-05-03 00:00:00 |
| 102     | bob       | bob@102.com  | 2011-12-29 00:00:00 |
+---------+-----------+--------------+---------------------+

$res = $hs->execute_single(0, '>=', ['103'], 2, 2, 'D');

+---------+-----------+--------------+---------------------+
| user_id | user_name | user_email   | created             |
+---------+-----------+--------------+---------------------+
| 101     | mike      | mike@101.com | 2011-05-03 00:00:00 |
| 104     | jane      | jane@104.com | 2011-06-23 00:00:00 |
| 105     | dave      | dave@105.com | 2011-04-12 00:00:00 |
+---------+-----------+--------------+---------------------+

Handlersocket does not seem to accept ‘!=’ as the operator,

$res = $hs->execute_single(0, '!=', ['103'], 10, 0, 'D');

however the filter can be made to do the same thing

$res = $hs->execute_single(0, '>=',['101'], 10, 0, 'D',
undef,[['F','!=', '1','bob']]);

| 102 | bob | bob@102.com | 2011-12-29 00:00:00 |

filters and IN

$res = $hs->execute_single(0, '=',[''], $rows_returned, $offset,
'D', undef,[['F','!=', '1','bob']],0,['101','102','103','104']);

| 102 | bob | bob@102.com | 2011-12-29 00:00:00 |
| 105 | dave | dave@105.com | 2011-04-12 00:00:00 |

bugs (tested around 02-15-12)

a recent fix was implemented for filtering and ‘IN’ in combination, however,

you CAN NOT seem to use a range, filter and ‘IN’ statement at the same time, such as,

$res = $hs->execute_single(0, '>',['101'], 10, 0, undef, undef,
[['F','!=', '1','bob']],0,['101','102','103','104']);

103 | john | john@103.com
104 | jane | jane@104.com
105 | dave | dave@105.com X

I had problems with the date field as well with just filtering and IN,

$res = $hs->execute_single(0, '=',[''], 10, 0, undef, undef,
[['F','>', '2','2011-07-18 00:00:00']],0,['101','102','103','104']);

101 | mike | mike@101.com X
102 | bob | bob@102.com
103 | john | john@103.com X
104 | jane | jane@104.com X

as with the selects, all three (range, filter and IN) do not work together for deletes either,

| 101 | mike | mike@101.com | 2011-05-03 00:00:00 |
| 102 | bob | bob@102.com | 2011-12-29 00:00:00 |

(105 is missing even though the IN list excluded it)

putting it all together

here is a sample script in perl

#!/usr/bin/perl

### this script is far from optimized, particularly for many returned rows,
### however you probably shouldn't be using handlersocket for a lot of returned data anyways

use strict;
use warnings;
use Net::HandlerSocket;

#1. establishing a connection
####### port 9999 to write, 9998 to read
#my $args = { host => '127.0.0.1', port => 9998 };
my $args = { host => '127.0.0.1', port => 9999 };
my $hs = new Net::HandlerSocket($args);

#2. initializing an index so that we can use it in the main logic
# MySQL tables will be opened here (if not opened)
my $res = $hs->auth('secretwr');
die $hs->get_error() if $res != 0;

## no spaces between the column names or it will error
my $cols = "user_name,user_email,created";

### used to get the count of the columns above dynamically, needed below
my @count_cols = split(/,/,$cols);
my $count_cols = scalar(@count_cols);

$res = $hs->open_index(1, 'test', 'user', 'user_name', $cols);
die $hs->get_error() if $res != 0;

#3. main logic
my $rows_returned = 3;
my $offset = 0;
my $ele_returned = $rows_returned * $count_cols;

#select operators can be - '=', '>', '<', '<=', '>=', '!='
$res = $hs->execute_single(1, '>=', ['101'], $rows_returned, $offset);
die $hs->get_error() if $res->[0] != 0;
shift(@$res);

########### multi-select version
#$res = $hs->execute_multi([
#  [1, '=', ['mike'], 1, 0],
#  [1, '=', ['jane'], 1, 0]
#]);
## this begins the loop for a multi-statement
#for my $res (@$res) {
#    die $hs->get_error() if $res->[0] != 0;
#    shift(@$res);

my $count_elements = scalar(@$res);
my $count_rows = $count_elements / $count_cols;

for (my $row = 0; $row < $ele_returned; $row += $count_cols) {
   if($res->[$row]){
      my $user_name  = $res->[$row + 0];
      my $user_email = $res->[$row + 1];
      my $created    = $res->[$row + 2];
      print "$user_name | $user_email | $created\n";
   }
}

print "elements: $count_elements\n";
print "rows: $count_rows\n\n";

### this is the closing bracket for the multi-statement version
# }

#4. closing the connection
$hs->close();


Handlersocket Part II / syntax Perl/PHP

(originally published 07/18/12)

other handlersocket entries:

part I – introduction to handlersocket

part III – handlersocket examples – Perl

————————-

Like the native access, you can access all handlersocket functionality via HandlerSocket.pm; you can download the entire Handlersocket package at,
https://github.com/ahiguti/HandlerSocket-Plugin-for-MySQL

if you already have Handlersocket installed via Percona Server, you may still need to download the separate package in order to get HandlerSocket.pm; simply go to the perl-Net-HandlerSocket directory within it and build the perl module per the instructions inside.

What perl-Net-HandlerSocket can do:

Like the native functionality it can,

Select / insert /update/ delete on primary and secondary keys
Filter on primary and secondary keys
‘In’ type functionality
queries with more than one parameter

The official documentation is at,

https://github.com/ahiguti/HandlerSocket-Plugin-for-MySQL/blob/master/docs-en/perl-client.en.txt

Initiate a connection

perl

use Net::HandlerSocket;

my $args = { host => '127.0.0.1', port => 9999 };
 my $hs = new Net::HandlerSocket($args);

php

$hs = new HandlerSocket($host, $port_wr);

very simple: choose your host, choose your port (normally 9999 or 9998, more on this below).

Error Handling

After executing a call, you can print alerts and stop the program with the following,

perl

die $hs->get_error() if $res != 0;

Or for multiple rows,

for my $res (@$res) {
 die $hs->get_error() if $res->[0] != 0;
 shift(@$res);
}

php

if (! Whatever function) {
echo $hs->getError(), PHP_EOL;
 die(); 
}

Authorization

1) edit your my.cnf settings (see the configuration settings in the introduction post)

2) then in your code,

perl

my $res = $hs->auth('your_password');
 die $hs->get_error() if $res != 0;

php

if (!$hs->auth('your password')) {
 die('Fault auth');
 }

It will test the supplied password against both configured passwords and grant the permissions of the one that matches

Open an index

HandlerSocket relies completely on indexes to get at data; it cannot get a record without one (although it can certainly return the non-indexed columns within the row you retrieve using the index). You must open at least one index for handlersocket to operate on before you can run a query.

perl

$res = $hs->open_index(a, 'b', 'c', 'd', 'e1,e2,e3…', 'f1,f2,f3…');

describing each position in the set above,

a – the index id; this can be any number, and you’ll use it to differentiate all of your open indexes in your calls. Using the same number for a different index will overwrite the previously opened index.
b – the database name
c – the table name
d – the name (as defined in your table) of the index you are using
e(1,2,3…) – the columns you are using in the index. If it’s a multi-column index, then like mysql you must include the leftmost columns of the index (i.e. to use col2 you can not leave out col1; you must include both).
f(1,2,3…) – the columns for which you want data returned.

php

$hs->openIndex(a, 'b', 'c', 'd', 'f1,f2,f3…');

exactly the same as above EXCEPT there is no defining of partial indexes (no ‘e1,e2,…’)

Data transactions

To see specific examples, go to Handlersocket examples. The general form of the call is the following,

perl

select, insert, update and delete can all be done with the general form below,

$res = $hs->execute_single(a, 'b', ['c1','c2','c3'…], d, e, 'f',
[g1,g2,g3,…], ['h1','h2','h3','h4'], i, ['j1','j2','j3','j4']);

Not all are necessarily relevant and depends on the operation you are doing.

a – index id
b – the operation, one of (>, =, !=, +)
c – the value(s) of the index to query on OR the values for an insert
d – the limit of rows operated on, like limit in mysql
e – the offset of the result, like offset in mysql
f – ‘D’ for a delete or ‘U’ for an update, otherwise undef
g(1,2,3..) – if an update, an array of values to use for the update
h – an array with filter values
h1 – ‘F’ (filter out records failing the condition) or ‘W’ (stop the scan once the condition is hit)
h2 – filter operation (>, =, !=)
h3 – the position of the column (within the open_index columns) to filter on
h4 – the value to filter on
i – the position of the column used for the IN statement
j(1,2,3..) – values for the IN statement

the call itself can be execute_single or execute_multi. The latter is fed an array of the series above; there are a couple of examples in part III to demonstrate.
I’ve found that many positions not requiring a value may still need a placeholder, and an empty '' will not work for some of them – use undef in that case.

php

Php is a little more flexible, you can use a similar executeSingle function for everything or use specific functions for insert, update, delete

$retval = $hs->executeSingle(a, '=', array('c'), d, e);
$retval = $hs->executeMulti(
 array(array(a1, '=', array('c1'), d1, e1),
 array(a2, '=', array('c2'), d2, e2)));
$hs->executeInsert(a, array('c1', 'c2', 'c3'));
$hs->executeUpdate(a, 'b', array('c1'), array('c2'), d, e);
$hs->executeDelete(a, 'b', array('c'));

Things to keep in mind (optimization, troubleshooting and bugs)

a) execute_multi is supposed to be much faster for many statements than looping through execute_single.

b) Contrary to the documentation, I’ve found that if you do not include values for the limit and offset, the call will error. Using undef for each place will work, but throws a ‘Use of uninitialized value in subroutine entry at handlersockettest.pl’ warning.

c) DO NOT leave spaces on either side of the commas in the ‘open index’ call for (e) and (f) – it will error. I assume this is because these are multi-value strings whose individual values are not enclosed in quotes.

Introduction to Handlersocket

(originally published 07/08/13)

other handlersocket entries:

part II – handlersocket syntax Perl/PHP

part III – handlersocket examples – Perl

——————————–

There is very little practical documentation on the web covering all of the functionality behind handlersocket, so this series of posts is an attempt to shed a little more light on what you can do with this fantastic protocol, heavy on syntax, with both basic and meatier examples.

Why would I want to use HandlerSocket?

HandlerSocket is essentially a protocol that bypasses much of the work done by a traditional SQL call, and it is particularly useful for simple queries where the work of preparing and executing the statement itself is considerable compared to the work and time of retrieving the data. In that case, the work per query for Handlersocket is a fraction of that of an SQL request. Additionally, the connections are very light in comparison, so it’s feasible to have many more concurrent connections. Considering these factors, in an extremely high volume environment consisting largely of simple key lookups, it can help scale load many times over.

For a great deal more information on HandlerSocket,

http://yoshinorimatsunobu.blogspot.com/2010/10/using-mysql-as-nosql-story-for.html

Comparison to prepared statements

Certainly a valid question: prepared statements bypass the most costly part of the process as well, the parsing, which puts their performance in the same ballpark. However, there is still table open/close, lock/unlock and the query execution plan to consider, making HandlerSocket somewhat faster. It also consumes quite a bit less memory (particularly if the result set is large), can handle multiple threads per connection and can serve requests in bulk, resulting in many more simultaneous requests served. Additionally, HandlerSocket gives you the flexibility to keep queries dynamically built in your code instead of defining them in a server-side prepared statement.

You can find a detailed comparison on performance here,

http://www.mysqlperformanceblog.com/2011/03/16/where-does-handlersocket-really-save-you-time/
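For a rough sense of the difference on your own hardware, here is a benchmark sketch (my own, not from the article above) comparing a primary-key lookup through DBI prepared statements against Net::HandlerSocket, using the test.user table from part III. The host, port, credentials and the assumption of a primary key on user_id are placeholders; add an auth() call if you configured the plain_secret passwords.

#!/usr/bin/perl
# compare a simple key lookup: DBI prepared statement vs. handlersocket
use strict;
use warnings;
use DBI;
use Net::HandlerSocket;
use Benchmark qw(cmpthese);

my $dbh = DBI->connect('DBI:mysql:database=test;host=127.0.0.1',
                       'someuser', 'somepass', { RaiseError => 1 });
my $sth = $dbh->prepare(
    'SELECT user_name, user_email, created FROM user WHERE user_id = ?');

my $hs = new Net::HandlerSocket({ host => '127.0.0.1', port => 9998 });
my $err = $hs->open_index(0, 'test', 'user', 'PRIMARY',
                          'user_name,user_email,created');
die $hs->get_error() if $err != 0;

cmpthese(-5, {
    prepared_statement => sub {
        $sth->execute(102);
        $sth->fetchall_arrayref();
    },
    handlersocket => sub {
        my $res = $hs->execute_single(0, '=', ['102'], 1, 0);
        die $hs->get_error() if $res->[0] != 0;
    },
});

$hs->close();
$dbh->disconnect();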


Installation

I’m not going to cover installation here, but there are two main ways,

1) download and install separately https://github.com/DeNADev/HandlerSocket-Plugin-for-MySQL

2) it comes with Percona server (Percona server is mysql with many custom enhancements and extras) http://www.percona.com/software/percona-server/

Some parts of the installation document will be relevant for both cases,

https://github.com/ahiguti/HandlerSocket-Plugin-for-MySQL/blob/master/docs-en/installation.en.txt

For it to be accessible through a language (Perl, PHP), you’ll need to download the client for that as well, links below.

Configuration settings (done in My.cnf)

The ports

9999 = read and write
9998 = read

9998 is faster for reads, by about 20-30%. It is also read-only, which of course helps secure read-only environments.

loose_handlersocket_port = 9998
 loose_handlersocket_port_wr = 9999

Authorization

Very simple to use and optional, set plain text passwords for each port.

handlersocket_plain_secret=<secret_for_reads>
 handlersocket_plain_secret_wr=<secret_for_readwrites>

Ways to access HandlerSocket

* Natively through telnet or nc

* C++ -> libhsclient folder at https://github.com/ahiguti/HandlerSocket-Plugin-for-MySQL

* Perl -> perl-Net-HandlerSocket folder at https://github.com/ahiguti/HandlerSocket-Plugin-for-MySQL

3rd party

PHP

Java

Python

Ruby


Galera set up troubleshooting

In a previous post on galera, I mentioned that it’s important to look at both the joiner and donor logs to get a full picture of what problems you may be running into. Sometimes even that is not enough, and you’ll need to spend time narrowing down the issue. It’s important to keep in mind that Galera is relatively young, so the documentation, error messages and online help are not comparable to mysql’s.

We had a situation where we were building a staging environment, duplicating our production Galera setup, yet we were running into problems not previously encountered. Two nodes (one an arbitrator) were online and we were attempting to join a third node and have xtrabackup transfer over the data. The node would initiate xtrabackup and appear to be in operation, with both nodes participating, but then quit after a few moments. Typically you might run into permissions issues, file locations, etc. with xtrabackup, and the logs – either /var/log/mysqld.log on the donor or the accompanying /var/lib/mysql/innobackup.backup.log – will let you know what the problem is. In this case, however, we were getting no obvious error,

the joiner log simply quit with,

SST failed: 32 (Broken pipe)

and the donor had a little more information,

WSREP_SST: [ERROR] innobackupex finished with error: 2.  Check /var/lib/mysql//innobackup.backup.log (20130422 11:23:08.361)

[ERROR] WSREP: Failed to read from: Process completed with error: wsrep_sst_xtrabackup --role 'donor' --address 'xx.xx.xxx.xxx:4444/xtrabackup_sst' --auth 'xxxxx:xxxxxxxx' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --gtid 'a77b0b24-ab77-11e2-0800-f92413e82717:0'

Process completed with error: wsrep_sst_xtrabackup --role 'donor' --address 'xx.xx.xxx.xxx:4444/xtrabackup_sst' --auth 'xxxxx:xxxxxxxx' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --gtid 'a77b0b24-ab77-11e2-0800-f92413e82717:0': 22 (Invalid argument)

which in retrospect did subtly point out the issue, but the pointer to innobackup.backup.log distracted from the real problem. Looking at that log, again a broken pipe error 32 was noted, which doesn’t really tell you anything.

And so commenced the testing of port/db/file permissions, switching which node was the donor vs. the joiner to see if the issue went both ways, comparisons to our production environment, etc. mysql seemed to be fine, xtrabackup seemed to work correctly, and so on.

The test that made it obvious was switching the SST type to mysqldump, which then proceeded to behave the same way, but with some much more obvious error messages,

ERROR 2003 (HY000): Can't connect to MySQL server on 'xx.xx.xxx.xxx' (4) 130422 11:24:11 [ERROR] WSREP: Process completed with error: wsrep_sst_mysqldump --user 'root' --password 'xxxxxxxx' --host 'xx.xx.xxx.xxx' --port '3306' --local-port '3306' --socket '/var/lib/mysql/mysql.sock' --gtid 'a77b0b24-ab77-11e2-0800-f92413e82717:0': 1 (Operation not permitted) 130422 11:24:11 [ERROR] WSREP: Try 1/3: 'wsrep_sst_mysqldump --user 'root' --password 'xxxxxxx' --host 'xx.xx.xxx.xxx' --port '3306' --local-port '3306' --socket '/var/lib/mysql/mysql.sock' --gtid 'a77b0b24-ab77-11e2-0800-f92413e82717:0'' failed: 1 (Operation not permitted) hostname: Unknown host hostname: Unknown host

the ‘invalid argument’ in the first set of errors was referencing an unrecognized host. The staging servers were using internal ips, which work fine as far as Galera is concerned, except that SST (xtrabackup/mysqldump) does not use the Galera node ips specified in ‘wsrep_cluster_address’ directly – a lookup is done, and SST subsequently used the public ips, which were not open.

the solution is simple: use the variable,

wsrep_sst_receive_address=xx.xx.xxx.xxx

in your my.cnf to explicitly specify what SST should use as the ip.


using replication wisely

Replication has a multitude of uses and is a very effective tool for solving a number of problems; however, it must be used with some forethought, particularly if you are using it to scale OLTP queries, as the drawbacks of its single-threaded, asynchronous nature can of course cause all sorts of issues.

Instead of a multi-slave replication setup, we recently applied it to a sharded implementation. There are a few fundamental choices regarding sharding, one of which is what to do with the common tables that are not sharded. You can keep one copy, either on one of the shards or on a separate node, or replicate these tables to all shards. Factors include the network traffic, the queries (joins across tables), and the write frequency. Considering these factors, we chose to replicate to all nodes, as we wanted to minimize code changes and avoid the additional network traffic and connection overhead of going to another node on reads (as we run a high volume of simple queries).

Then the question becomes how you replicate, as you only want the common tables to be replicated. Especially when starting with a few shards, you may be tempted to leave the common tables loose and/or in another db on one of the shards and either use ‘replicate-do-db’ or ‘replicate-do-table’ on the slave side, or ‘binlog-do-db’ on the master side (the db options would of course require these tables to be in a separate db). Using binlog-do-db means only the tables in question are written to the binlog and relay log, while replicate-do-db/table means non-needed tables are only ignored when the SQL is replayed on the slave. In our case this difference in traffic would be 100:1, so the load of a full write to the slave would be significant. However, the major drawback of binlog-do-db is that you are no longer able to use the binlog on the master for point-in-time recovery, which is a deal breaker for us. By leaving the tables in a common instance, you are forcing yourself to choose between less replication traffic and losing point-in-time backup coverage.
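For illustration only (using a hypothetical database name ‘common’), the two filtering approaches being weighed would look something like this in my.cnf,

# on the master: write only the common db to the binlog (and lose point-in-time recovery for the rest)
binlog-do-db = common

# or on each shard (slave): everything is logged and shipped, but only the common db is replayed
replicate-do-db = common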

The alternative is to put these common tables in a separate mysql instance (the instance can be on the same server). The data is written to this instance, and the instance containing the ‘common tables’ db is then replicated to all shards as that db, with the advantage of a separate binlog for this instance – modest traffic in our case – while still keeping the original binlogs on the shards intact. This is also more in line with scaling the shards: going from 3 -> 20 shards will be transparent, as the common db can simply be moved to its own hardware when needed, with lots of room to grow on the replication end of things as well.


HA database pooling

For any application needing high availability/scalability at the database layer, you’ll probably look at a load balancing solution such as HAproxy. The typical description of the architecture boils down to a situation where your app servers connect to (usually) more than one Haproxy server on separate hardware, with a keepalive script between them for failover purposes. The active Haproxy serves load-balanced requests to the pool of databases, in either a failover or a balanced scenario.

There is an alternative to this setup. You can simply run an Haproxy instance directly on each app server (6 app servers, 6 haproxy instances), with no keepalive script. If an app server goes down, its haproxy goes down with it, which is just fine, as each app server has its own dedicated Haproxy. Each one also serves a pool of databases in either a failover or a balanced scenario.

There are not many situations where I can say one option is clearly better than the other; however, for the life of me, I can’t see the tradeoff of why you wouldn’t use the latter setup. One less vertical hardware layer, no need for haproxy failover, and no bottleneck and/or complicated load logic at the haproxy layer, especially as more app servers and dbs are added. Sure, there are many haproxies in play, but they all read from the same dbs and act based on exactly the same health check against those dbs – which should certainly be something above and beyond a simple connection test.
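As a rough sketch of what each app server’s local haproxy block might look like (illustrative only – the IPs, names and the simple mysql-check are placeholders, and as mentioned above the health check should really be something richer than a connection test),

listen mysql-pool
    bind 127.0.0.1:3306
    mode tcp
    option mysql-check user haproxy_check
    server db1 10.0.0.11:3306 check
    server db2 10.0.0.12:3306 check backup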

my version of consistent hashing

When sharding, a common method of distributing your keys randomly is to use some sort of hashing algorithm in your code – one that is not time dependent, key-value dependent (i.e. evens, odds), etc. You also want to keep in mind the work of reallocating data if you either add or remove shards.

There is a good amount of material on the subject,

here are a few links,

http://michaelnielsen.org/blog/consistent-hashing/
http://www.paperplanes.de/2011/12/9/the-magic-of-consistent-hashing.html

and the original paper,

http://www.akamai.com/dl/technical_publications/ConsistenHashing…

if you like technical papers.

Why am I writing about it again? I don’t love the usual visual explanation out there (the circle) with the range broken into whole sections, so I’m going to explain it the way I understand it best.

Naive hashing

Let’s say you had shards 1, 2 and 3, and wanted to add shard 4,

with naive hashing, you do some sort of modulus of the hash to assign it to a shard.

whichshard = hash('key') % N

where N = the number of shards and % is the modulus, which produces a number between 0 and N-1 (add 1 if you number your shards from 1) identifying the shard. A typical hash function might be crc32().

reallocation of data when adding or removing a shard is a real problem with this method. To demonstrate, we’ll assign some data, represented by the numbers 1 -> 15, across the nodes.

 shard1 1, 4, 7, 10, 13
 shard2 2, 5, 8, 11, 14
 shard3 3, 6, 9, 12, 15

now let’s add a shard, and redistribute these same pieces of data,

 shard1 1, 5, 9, 13
 shard2 2, 6, 10, 14
 shard3 3, 7, 11, 15
 shard4 4, 8, 12

you’ll see that when you compare the two layouts, most of the data has shifted to other shards – in fact 3/4 of your data will move when going to 4 shards, 7/8 when going to 8 shards, etc. You’ll have to move almost all your data each time you adjust the number of shards, because the modulus in your algorithm now accounts for an extra shard. The quick check below bears this out.
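A quick way to see the 3/4 figure (a sketch using the String::CRC32 module and made-up keys):

#!/usr/bin/perl
# count how many keys land on a different shard when going from 3 to 4 shards
use strict;
use warnings;
use String::CRC32;

my $keys  = 100_000;
my $moved = 0;
for my $k (1 .. $keys) {
    my $h = crc32("key$k");
    $moved++ if ($h % 3) != ($h % 4);
}
printf "%.1f%% of keys change shards going from 3 to 4 shards\n", 100 * $moved / $keys;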

consistent hashing

So we turn to consistent hashing. The explanation is usually accompanied by a unit circle divided up into sections, one per shard, with a range of values assigned to each shard. To add a shard, the explanation goes, you essentially divide one of these sections into two and move only the keys you need from the shard you divided to the new one. Of course, you’ll now have two smaller shards holding half the data of the rest, when what you want is to balance the load across all shards. The explanation then becomes: instead of dividing one section into two, take bits from all the shards to make your new shard. Now, why push the idea that the shards are assigned whole contiguous sections (i.e. the first slice of the range = (1, 2, 3)) when your additional shard will need to be assigned scattered values (4, 9, 14)? Let’s just start with a random distribution for all shards, with the understanding that it doesn’t matter what the algorithm assigned to which node, as long as it keeps assigning the same thing once additional shards are in the mix. So let’s start with the same distribution as before (but really, any of these numbers could be assigned anywhere),

 shard1 1, 4, 7, 10, 13
 shard2 2, 5, 8, 11, 14
 shard3 3, 6, 9, 12, 15

now let’s add a shard, and redistribute manually,

 shard1 1, 4, 7, 10
 shard2 2, 5, 8, 11
 shard3 3, 6, 9, 12
 shard4 13, 14, 15

We see that we have chosen to assign shard4 one value from each of the first three shards. It does not matter which ones go to shard4; the result is that none of the first three shards need data moved onto them, only moved off to the new shard. It also amounts to only 1/5 of the data; if we had 20 shards, we would only have to move about 1/20 of the data. Obviously you are going to have some sort of logic that maps the formula results to a node,

if (result in (2,5,8,11)){
   shard=2
}
etc...

so for our demo, how does getting a random value from 1 to 15 work in a formula? We now have,

whichshard = ceiling(hash('key')/maxvalue(hash())*N)

ceiling is a function that rounds up to the nearest integer. Dividing by the hash’s maximum value gives an answer between 0 and 1, and multiplying by N scales that up to a result between 1 and N, matching your assigned values.

if we are using crc32, the max value is 4294967295, so something like,

whichshard = ceiling(crc32('key')/4294967295 * N)
e.g.. -> ceiling(crc32('1234567')/4294967295 * 15) = 5

and we see that the result, 5, is assigned to shard2, and this key will always map to shard2 unless we move the value 5 to another shard in our grouping definition.
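Putting the formula and the grouping definition together, here is a small Perl sketch (using the String::CRC32 module and the bucket-to-shard mapping from the example – the helper name is mine):

#!/usr/bin/perl
# map a key to one of 15 fixed buckets, then to a shard via a lookup table
use strict;
use warnings;
use String::CRC32;
use POSIX qw(ceil);

my $buckets   = 15;
my $crc32_max = 4294967295;

# bucket -> shard assignment; adding shard4 only meant moving buckets 13-15
my %shard_for = (
    (map { $_ => 1 } 1, 4, 7, 10),
    (map { $_ => 2 } 2, 5, 8, 11),
    (map { $_ => 3 } 3, 6, 9, 12),
    (map { $_ => 4 } 13, 14, 15),
);

sub which_shard {
    my ($key) = @_;
    my $bucket = ceil(crc32($key) / $crc32_max * $buckets);
    $bucket = 1 if $bucket < 1;     # crc32 can return 0 for some inputs
    return $shard_for{$bucket};
}

printf "key %s -> shard %d\n", $_, which_shard($_) for qw(1234567 7654321);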

Lastly, you can run the above formula as a select SQL statement, but in production, do it in your code so you are not hitting the db.