Category Archives: Uncategorized

Is mysql/innodb good for high volume inserts?

As we are collecting an ever increasing amount of data and analyzing it, data storage products with alternate file structures have become popular, such as LSMs (log structured merge files) which is implemented in Cassandra and Hbase, as well as fractal indexes found in TokuDB/MX, which is an alternate file structure plugin for mysql and mongoDB. The reasoning is this – To maintain b-tree indexes in a relational database such as mysql which clusters the data by primary key, there is a constant rebalancing process to keep the order. When your data set become larger than memory, this translate to a large increase in disk I/0 which of course kills performance. This is the price you pay for an optimized search algorithm, so the story goes (and when I say story, I mean promotional material for companies using non-btree structures).

so a quick review of each of the three structures,

B-tree – nested levels of pointers specifying a fractional range of the primary key, per block. The top level is the root node, the bottom level are the leaf nodes which contain the actual data, and the levels between are internal nodes. Branching factor for a b-tree is B, the number of nodes a block on the preceding level points to. For a block size of 16kb, and say 100 byte rows, the branching factor could be around 160 and the tree depth for a billion rows ~ 4 levels.

LSM – basically log files that are always appending, never updated even for an update, thus always write sequentially. Files are sorted. To search, you must scan all the files in reverse order. This is optimized by reducing the number of/merging files (compaction), and limiting the range of the scan per file by the query parameters. Bloom filters can help point queries (not ranges). They do this by maintaining an extremely frugal/efficient data structure to allow a read query to cut down on the number of files to be searched. As you can guess, reading is somewhat of an afterthought, for the benefit of the insert performance.

Fractal – similar to a b-tree, however there is a cache for each internal node block.  All levels are not immediately written but only when needed (when they are full) and cascade down one or more levels. Doing it this way allows many writes to one block done in memory, until finally ready to write a grouped leaf node block with many changes on it to disk. b-tree groups writes as well, but not specific to the node level, the real advantage is not having to read the leaf nodes randomly for each query by delaying the leaf node write. From my understanding, for a branching factor for fractal indexes of B^1/2 = (4MB / ex. 100 byte records)^1/2 = 200, and say just 1% of the total db size as memory (2TB -> 10GB) was allocated to node buffers and pivots, from that, 100-200 writes could be grouped to write to a leaf node on disk (see below for B definition).

So the latter two files structures sound like they are a no-brainer for high volume inserting. However, there is a critical thing to keep in mind here and which is typically glossed over. You can (and usually do) insert sequentially in b-tree structures as well, requiring very little memory to do so. The performance degradation for a b-tree only happens if your working dataset is larger than available memory and you’re inserting out of order of your primary key (i.e. not chronologically, for example alphabetical). Your insert has an equal chance of going into any block of existing data and so you are likely to be reading the entire dataset in a random fashion. However, if you are inserting sequentially, such as by time or an auto-increment primary key, your application is writing to the same block and/or adjacent block for the majority of your data and will flushed together. Your need for all the data in memory goes away. Secondary indexes are indeed out of order, but a much smaller data set and handled and grouped by the insert buffer. So basically if you are inserting without a primary key, or with an auto-inc primary key, or other time oriented way, it’s a non-issue.

We’ve done up to 100,000 q/s primarily inserts into mysql/innodb on our production machines using a b-tree data structure, 32-core CPUs with multiple secondary indexes, into a 3TB table, and that’s without using nosql api’s such as handlersocket or memcached (we do use stored procedures). That matches or surpasses most benchmarks for inserts I’ve seen, regardless of the product.

the math works out to be,

  1. B – records per block size
  2. N – total number of records
  3. branching factor -  B for btree, B^1/2 for fractal

B-tree – reads and writes disk logN/LogB, writes in memory 1/B

Fractal – reads and writes disk logN/LogB * B^1/2, writes in memory 1/B

LSM – writes disk  logN/LogB * B^1/2 (including compaction) reads (LogN/LogB)^2, writes in memory 1/B

obviously B is somewhat arbitrary, (tokuDB->fractal uses a default block size of 4MB, innodb->b-tree 16KB) a larger block size offsets the smaller branching factor (B vs. B^1/2) and so both can read and write on disk within the same ballpark. With the larger block size, writes in memory, arguably fractal is better, although both are quite efficient at a tiny fraction of a disk I/0 per operation. A larger block size however hurts random reads as more data is required to be read into memory. LSM is seemingly at a clear disadvantage on reads, needing to scan all files, thus a log factor more disk I/O. This last statement for LSM reads is a little misleading as Bloom filters used by LSMs make reads doable and reduce most of this overhead as a crude form of index, although not completely and not for ranges.

So what might be a situation where you’d want to organize your data other than chronologically? Perhaps a site such as Wikipedia, which have documents, accessed with a frequency not related on the age of the document. Additionally you may want to fetch a range of documents, say all beginning with ‘Ag-’. Then it may make sense to cluster it by document name. However, you still have a viable option with a b-tree, as it still might make more sense to write chronologically (sequentially), and allow an alphabetical secondary index to take care of the search performance.

You alternatives are to use an LSM product, which will write sequentially with no order to worry about, but have aforementioned drawbacks reading the data, or use fractal index technology which will have performance in the same ballpark of a b-tree if inserting sequentially and much better if not, with the potential for less efficient reads, and relying on an overall less mature storage engine. Now tokuDB/MX also has built in compression which may add another dimension to the offering in terms of performance and of course storage space, but that’s for another post.

Boo! a frightful hour in production, ghost of BIGINT

Well, we had quite an hour of uncertainty this morning, at 7:30am, the behaviour of one of our data servers changed radically, with no obvious reasons looking at the process list, error log, etc… We’ve done a number of things the past few days to ramp up for a version launch of one of our top games, which included new code, data structures, etc.. along with a change in hardware for the memcached instances.

processes were slowly building up, yet the load was low, and at least some queries certainly were processing. In retrospect I regret not looking at the show engine innodb status as that may have made it much more apparent what was going on, however considering the other two factors, I was first looking at other explanations to do with recent changes in code, the memcached changes, etc..

The head developer mentioned he looked at one of our big tables and it had 2.1 billion rows, however, we had already anticipated the datatype limitation of INT and went through the exercise of changing the primary key column to BIGINT, and so it seemed unlikely this was the cause. However, the number bothered me, and so I took a look as to exactly what is was and certainly it seemed the the table passing the threshold of int (2147483647 signed) exactly conincided with whatever problem we were having.

Looking at the processlist, sure enough there were many queries that were acting on the id 2147483647 (about 10% of the total queries). So what could be the problem? We use a lot of stored procedures to minimize network traffic, and the developer then remembered that in the stored procedure declarations the primary key was defined as INT, not BIGINT. All incoming queries (and new users) were getting a truncated max value of 2147483647 limited by the stored procedure, not the table. All users before 7:30am were processing normally albeit much more slowly, anyone signing up after 7:30am was hammering the same row, thus the slow query build up and of course contention due to the hotspot.

What’s scary is the dumb luck of us discovering this sooner than much later I never would have looked at the row count thinking it was a problem already solved. I’d like to think an eventual show engine innodb status would have revealed the row level locking on that id. Apologies to the user assigned id 2147483647 for their game data, I can’t imagine what the game experience must have been like for them.