As we are collecting an ever increasing amount of data and analyzing it, data storage products with alternate file structures have become popular, such as LSMs (log structured merge files) which is implemented in Cassandra and Hbase, as well as fractal indexes found in TokuDB/MX, which is an alternate file structure plugin for mysql and mongoDB. The reasoning is this – To maintain b-tree indexes in a relational database such as mysql which clusters the data by primary key, there is a constant rebalancing process to keep the order. When your data set become larger than memory, this translate to a large increase in disk I/0 which of course kills performance. This is the price you pay for an optimized search algorithm, so the story goes (and when I say story, I mean promotional material for companies using non-btree structures).
so a quick review of each of the three structures,
B-tree – nested levels of pointers specifying a fractional range of the primary key, per block. The top level is the root node, the bottom level are the leaf nodes which contain the actual data, and the levels between are internal nodes. Branching factor for a b-tree is B, the number of nodes a block on the preceding level points to. For a block size of 16kb, and say 100 byte rows, the branching factor could be around 160 and the tree depth for a billion rows ~ 4 levels.
LSM – basically log files that are always appending, never updated even for an update, thus always write sequentially. Files are sorted. To search, you must scan all the files in reverse order. This is optimized by reducing the number of/merging files (compaction), and limiting the range of the scan per file by the query parameters. Bloom filters can help point queries (not ranges). They do this by maintaining an extremely frugal/efficient data structure to allow a read query to cut down on the number of files to be searched. As you can guess, reading is somewhat of an afterthought, for the benefit of the insert performance.
Fractal – similar to a b-tree, however there is a cache for each internal node block. All levels are not immediately written but only when needed (when they are full) and cascade down one or more levels. Doing it this way allows many writes to one block done in memory, until finally ready to write a grouped leaf node block with many changes on it to disk. b-tree groups writes as well, but not specific to the node level, the real advantage is not having to read the leaf nodes randomly for each query by delaying the leaf node write. From my understanding, for a branching factor for fractal indexes of B^1/2 = (4MB / ex. 100 byte records)^1/2 = 200, and say just 1% of the total db size as memory (2TB -> 10GB) was allocated to node buffers and pivots, from that, 100-200 writes could be grouped to write to a leaf node on disk (see below for B definition).
So the latter two files structures sound like they are a no-brainer for high volume inserting. However, there is a critical thing to keep in mind here and which is typically glossed over. You can (and usually do) insert sequentially in b-tree structures as well, requiring very little memory to do so. The performance degradation for a b-tree only happens if your working dataset is larger than available memory and you’re inserting out of order of your primary key (i.e. not chronologically, for example alphabetical). Your insert has an equal chance of going into any block of existing data and so you are likely to be reading the entire dataset in a random fashion. However, if you are inserting sequentially, such as by time or an auto-increment primary key, your application is writing to the same block and/or adjacent block for the majority of your data and will flushed together. Your need for all the data in memory goes away. Secondary indexes are indeed out of order, but a much smaller data set and handled and grouped by the insert buffer. So basically if you are inserting without a primary key, or with an auto-inc primary key, or other time oriented way, it’s a non-issue.
We’ve done up to 100,000 q/s primarily inserts into mysql/innodb on our production machines using a b-tree data structure, 32-core CPUs with multiple secondary indexes, into a 3TB table, and that’s without using nosql api’s such as handlersocket or memcached (we do use stored procedures). That matches or surpasses most benchmarks for inserts I’ve seen, regardless of the product.
the math works out to be,
- B – records per block size
- N – total number of records
- branching factor - B for btree, B^1/2 for fractal
B-tree – reads and writes disk logN/LogB, writes in memory 1/B
Fractal – reads and writes disk logN/LogB * B^1/2, writes in memory 1/B
LSM – writes disk logN/LogB * B^1/2 (including compaction) reads (LogN/LogB)^2, writes in memory 1/B
obviously B is somewhat arbitrary, (tokuDB->fractal uses a default block size of 4MB, innodb->b-tree 16KB) a larger block size offsets the smaller branching factor (B vs. B^1/2) and so both can read and write on disk within the same ballpark. With the larger block size, writes in memory, arguably fractal is better, although both are quite efficient at a tiny fraction of a disk I/0 per operation. A larger block size however hurts random reads as more data is required to be read into memory. LSM is seemingly at a clear disadvantage on reads, needing to scan all files, thus a log factor more disk I/O. This last statement for LSM reads is a little misleading as Bloom filters used by LSMs make reads doable and reduce most of this overhead as a crude form of index, although not completely and not for ranges.
So what might be a situation where you’d want to organize your data other than chronologically? Perhaps a site such as Wikipedia, which have documents, accessed with a frequency not related on the age of the document. Additionally you may want to fetch a range of documents, say all beginning with ‘Ag-’. Then it may make sense to cluster it by document name. However, you still have a viable option with a b-tree, as it still might make more sense to write chronologically (sequentially), and allow an alphabetical secondary index to take care of the search performance.
You alternatives are to use an LSM product, which will write sequentially with no order to worry about, but have aforementioned drawbacks reading the data, or use fractal index technology which will have performance in the same ballpark of a b-tree if inserting sequentially and much better if not, with the potential for less efficient reads, and relying on an overall less mature storage engine. Now tokuDB/MX also has built in compression which may add another dimension to the offering in terms of performance and of course storage space, but that’s for another post.