Is mysql/innodb good for high volume inserts?

As we are collecting an ever increasing amount of data and analyzing it, data storage products with alternate file structures have become popular, such as LSMs (log structured merge files) which is implemented in Cassandra and Hbase, as well as fractal indexes found in TokuDB/MX, which is an alternate file structure plugin for mysql and mongoDB. The reasoning is this – To maintain b-tree indexes in a relational database such as mysql which clusters the data by primary key, there is a constant rebalancing process to keep the order. When your data set become larger than memory, this translate to a large increase in disk I/0 which of course kills performance. This is the price you pay for an optimized search algorithm, so the story goes (and when I say story, I mean promotional material for companies using non-btree structures).

so a quick review of each of the three structures,

B-tree – nested levels of pointers specifying a fractional range of the primary key, per block. The top level is the root node, the bottom level are the leaf nodes which contain the actual data, and the levels between are internal nodes. Branching factor for a b-tree is B, the number of nodes a block on the preceding level points to. For a block size of 16kb, and say 100 byte rows, the branching factor could be around 160 and the tree depth for a billion rows ~ 4 levels.

LSM – basically log files that are always appending, never updated even for an update, thus always write sequentially. Files are sorted. To search, you must scan all the files in reverse order. This is optimized by reducing the number of/merging files (compaction), and limiting the range of the scan per file by the query parameters. Bloom filters can help point queries (not ranges). They do this by maintaining an extremely frugal/efficient data structure to allow a read query to cut down on the number of files to be searched. As you can guess, reading is somewhat of an afterthought, for the benefit of the insert performance.

Fractal – similar to a b-tree, however there is a cache for each internal node block.  All levels are not immediately written but only when needed (when they are full) and cascade down one or more levels. Doing it this way allows many writes to one block done in memory, until finally ready to write a grouped leaf node block with many changes on it to disk. b-tree groups writes as well, but not specific to the node level, the real advantage is not having to read the leaf nodes randomly for each query by delaying the leaf node write. From my understanding, for a branching factor for fractal indexes of B^1/2 = (4MB / ex. 100 byte records)^1/2 = 200, and say just 1% of the total db size as memory (2TB -> 10GB) was allocated to node buffers and pivots, from that, 100-200 writes could be grouped to write to a leaf node on disk (see below for B definition).

So the latter two files structures sound like they are a no-brainer for high volume inserting. However, there is a critical thing to keep in mind here and which is typically glossed over. You can (and usually do) insert sequentially in b-tree structures as well, requiring very little memory to do so. The performance degradation for a b-tree only happens if your working dataset is larger than available memory and you’re inserting out of order of your primary key (i.e. not chronologically, for example alphabetical). Your insert has an equal chance of going into any block of existing data and so you are likely to be reading the entire dataset in a random fashion. However, if you are inserting sequentially, such as by time or an auto-increment primary key, your application is writing to the same block and/or adjacent block for the majority of your data and will flushed together. Your need for all the data in memory goes away. Secondary indexes are indeed out of order, but a much smaller data set and handled and grouped by the insert buffer. So basically if you are inserting without a primary key, or with an auto-inc primary key, or other time oriented way, it’s a non-issue.

We’ve done up to 100,000 q/s primarily inserts into mysql/innodb on our production machines using a b-tree data structure, 32-core CPUs with multiple secondary indexes, into a 3TB table, and that’s without using nosql api’s such as handlersocket or memcached (we do use stored procedures). That matches or surpasses most benchmarks for inserts I’ve seen, regardless of the product.

the math works out to be,

  1. B – records per block size
  2. N – total number of records
  3. branching factor -  B for btree, B^1/2 for fractal

B-tree – reads and writes disk logN/LogB, writes in memory 1/B

Fractal – reads and writes disk logN/LogB * B^1/2, writes in memory 1/B

LSM – writes disk  logN/LogB * B^1/2 (including compaction) reads (LogN/LogB)^2, writes in memory 1/B

obviously B is somewhat arbitrary, (tokuDB->fractal uses a default block size of 4MB, innodb->b-tree 16KB) a larger block size offsets the smaller branching factor (B vs. B^1/2) and so both can read and write on disk within the same ballpark. With the larger block size, writes in memory, arguably fractal is better, although both are quite efficient at a tiny fraction of a disk I/0 per operation. A larger block size however hurts random reads as more data is required to be read into memory. LSM is seemingly at a clear disadvantage on reads, needing to scan all files, thus a log factor more disk I/O. This last statement for LSM reads is a little misleading as Bloom filters used by LSMs make reads doable and reduce most of this overhead as a crude form of index, although not completely and not for ranges.

So what might be a situation where you’d want to organize your data other than chronologically? Perhaps a site such as Wikipedia, which have documents, accessed with a frequency not related on the age of the document. Additionally you may want to fetch a range of documents, say all beginning with ‘Ag-’. Then it may make sense to cluster it by document name. However, you still have a viable option with a b-tree, as it still might make more sense to write chronologically (sequentially), and allow an alphabetical secondary index to take care of the search performance.

You alternatives are to use an LSM product, which will write sequentially with no order to worry about, but have aforementioned drawbacks reading the data, or use fractal index technology which will have performance in the same ballpark of a b-tree if inserting sequentially and much better if not, with the potential for less efficient reads, and relying on an overall less mature storage engine. Now tokuDB/MX also has built in compression which may add another dimension to the offering in terms of performance and of course storage space, but that’s for another post.

10 thoughts on “Is mysql/innodb good for high volume inserts?

  1. Pingback: Is mysql/innodb good for high volume inserts? | MySQL

    1. admin Post author

      I appreciate it Mark, as you inferred, more just general comments and observations, and definitely worth a more in depth analysis on a variety of workloads with hard data to contrast the differences, I’ll have to find the time :) We do quite a bit of data ingestion and analysis, probably a little more than a web facing company would, where Cassandra/Hbase/Hadoop ecosystem could be a natural candidate, but we also wanted to keep the benefits of a traditional rdbms (relational model, b-tree indexes, features such as stored procedures, etc..). For what we do, we’ve found a mysql centric architecture to be a flexible, highly performant and hassle free solution with some thought put into the implementation.

  2. Jaime Crespo

    Oh, I can also get 600 000/rows inserted per second on InnoDB durably on my pc consumer-grade ssd on an ideal workload, but the truth is that at Wikipedia we get edits a human-speed (not very fast). Writes are not an issue.

    Complex reads and analytics are an issue, those are only solved partially on MySQL by denormalizing its tables asynchronously or not using MySQL at all (both disk amplifications). And sometimes, performance is secondary in real life if the load is not too high: if you are limited by space (SSD & compression), write and read caching on top, reliability, human resources’ skills, etc.

    1. admin Post author

      The Wikipedia example was really just an abstract one, more about highlighting a choice for when you might choose a non-chronological primary key to make reads easier (and alternately a chronological key and a secondary index if writes are in fact an issue). As you said, I have no doubt the reality of the bulk of Wikipedia’s workload is varying types of reads with modest writes. For analytics or other non real-time complex/aggregated reads, we import to a columnar db, mysql is a tough fit for that sort of thing.

      now, regarding our throughput, we’re talking insert statements executed, not rows, and certainly you can do impressive numbers in ideal/test conditions.. the general point was that for a b-tree data structure with multiple indexes, a multi-TB table on 5.5 in a real production environment, you can certainly still do what would be considered high volume inserts if writing sequentially.

  3. Moshe Shadmon

    We (ScaleDB and MariaDB) have been working on these same challenges – can MySQL manage high-velocity data and in particular time series data?
    Here are our thoughts on the current alternatives and the solution we have –
    B-tree – if the goal is high ingestion rate – the Btree creates too much contention between threads to allow the high velocity.
    LSM – this needs to be part of the solution as it allows to ingest high velocity data to disk – in order to effectively support billions of new records per day (and sometimes billions per hour) a non-sequential write process would trigger random IOs which are too slow for high velocity data (and an in memory solution is not practical as terabytes of RAM are too expensive).
    To manage high velocity and efficiently query big data, the solution needs to extend beyond a single machine.
    So we extended MariaDB (Monty supported the MariaDB side of the solution) to provide a log (and disk based) data structure for time series data. With this setup, a ScaleDB cluster (4 MariaDB nodes and 4 storage nodes) ingests millions of inserts per second.
    We replaced the Btree indexes with Hash based indexes. The Hash indexes support point lookups over the logged data (i.e. – select all rows for a particular customer). These indexes have minimal impact on the insert time and provide efficient lookup mechanism (each Hash entry provides the row ids that satisfy the key value in a LIFO order).
    The log structure organizes the data by time. The MariaDB query mechanism was extended to support a pushdown process (which is to execute the queries on the ScaleDB storage nodes and next to the data) similar to map reduction. This approach supports BI queries where hundreds of millions of rows are evaluated within 1-2 seconds (and the log structure efficiently provides the starting point and end point for the scan).

    1. admin Post author

      Well, I can’t argue that getting rid of the b-tree indexes wouldn’t speed things up (if not writing sequentially). To that end, using only hash indexes in innodb might be a consideration if you can sacrifice the range searching.

      Now, time-series data is, in general, sequential and as mentioned, writes to a b-tree should easily stay in memory, but for data that isn’t, and you can deal with the searching drawbacks, an LSM product can work.

      I will say that Scaledb in general looks like it has some good clustering features for teams not looking to grow their own sharded solution, although I can’t speak to it’s performance/reliability in a production environment.

      1. Moshe Shadmon

        I will say that it will be very difficult to make a sharded solution efficient: If the sharding key is by time – at any given point of time – the data stream is pushed to a single machine and therefore ingestion rate is limited to what a single machine can deliver. In addition queries using time predicates will not find even distribution of the data therefore would (also) be limited to what a single machine can deliver. If the sharding key is not by time, then the system is complex. Queries would need to join data from multiple machines. It would require assumptions on data distribution and the type of queries that needs to be supported. These issue do not exist on a ScaleDB cluster as for any given time, the data is evenly distributed on all the storage nodes and all these nodes are leveraged during the insert and query processes.

        1. admin Post author

          Moshe, I know you have a product to promote but you’re baiting me here :) This post was to show you don’t have to automatically surrender to an alternate storage engine, data structure and or product to perform high volume inserts if you understand what’s going on under the hood. Innodb doing sequential inserts (which is all in memory) into a b-tree performs in the ballpark of any alternate data structure as both the math and our production environment shows. Same is true for sharding. We do it, it wasn’t that difficult and performs rock solid at a highly performant level as they are essentially individual mysql instances. We use a consistent hashing algorithm which spreads records evenly across nodes by primary key, yet is still sequential per node and therefore the dataset is always in memory. Most queries are per user (the primary key) and for queries that require data across shards, we simply query in parallel. Clusters are a double-edged sword, where they can give you many enticing automated features, yet can also fail/perform badly in unexpected ways. Again, I don’t speak to your product specifically, but we wouldn’t be comfortable using software that we didn’t write ourselves and/or didn’t understand completely for this sort of thing.

          1. Moshe Shadmon

            I do promote a product but more than that I’m intrigued by the conversation and your comments. Here are my thoughts on your latest reply:
            We tested a MySQL/Innodb setup with 2 keys. One key was time based and the second key was on a customer ID. Once the data did not fit RAM, performance dropped from 100,000 rows per second to 8,000 per second. Now, placing high velocity data in RAM has significant cost implications – for example, if you stream 1M inserts per second (of 100 bytes each row) – this means 360GB of data every hour. Keeping a month of data (not to mention a year) seems not practical. I wasn’t sure if your comment “sequential per node and therefore the dataset is always in memory” is that only the latest is in memory, which means that queries (over older data) would do the random seeks which I think we agree are slow for this type of application.
            My comments also assumed that the setup needs to equally satisfy both lookups by customer IDs and analytics for time intervals. Your latest comment is that the analytics queries are not frequently used.
            I think that it emphasizes the point which is partitioning works well if the query do not conflict with the partition criteria and that you made a choice to prefer the help desk functionality over the analytics.
            With the setup I suggested you do not need to do the compromise and I believe that I will be able to convince you that you can get better performance for a much lower TCO.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>