warehousing and data analytics in a mysql world

Mysql in it’s early days had the reputation of being simple and fast but lacking critical features for many use cases. Over the years, it’s become more stable and mature with a ton of advancements in the innodb storage engine, features such as stored code, replication improvements, query planning improvements, etc.. One area mysql has not tackled is BI and analytic OLAP. I can only assume it’s not been a priority with the user base to date. Oracle and SQL server cover a different market and have these options, however, in this new era of big data, a wider array of businesses want to collect and analyze data such as performance statistics to improve their business offering. Fortunately, there are a few 3rd party vendors who have taken on this area with the spirit of open source and/or tiered pay structure in mind, as you’ll find it can be very costly otherwise.

In our search to analyze the ever increasing volume of statistical data we were collecting, we considered the following products,

  • infobright
  • infinidb
  • luciddb
  • monetdb
  • fastbit
  • metakit
  • greenplum
  • hbase
  • cassandra

most all of these solutions implement a columnar architecture to the data in one way or another, essentially rows become columns, columns become rows. This is done to take advantage of the layout of the data and allow sequential reading (vs. random reads) of each long row (which was a column) on disk, which will happen much of the time doing OLAP type queries, aggregating on a dataset much larger than memory.

Back to the products, a number we rejected for one reason or another, generally because they were either developed more for the scientific community (metakit, fastbit), geared towards another open source db such as postgresql (greenplum), or focused on distributed systems (hbase, cassandra) which is a bit of overkill for us at this point.

so we were left with,

  • infobright
  • infinidb
  • luciddb
  • monetdb

infobright and infinidb have a free and paid version (the free versions both have limitations), luciddb and monetdb are open source.

our priorities in order were as follows,

- high compression
- ease of use (ideally a mysql client and connection)
- full DDL, DML statement use

there is a very good benchmark comparison done by percona on these 4 dbs,

Infobright and LucidDb both offer in the neighborhood of 10:1 compression, very nice for a free product.

note: Now, if we were in the market for a paid solution, we take a hard look at InfiniDB along with the 40:1 compression it is a mysql build with their proprietary storage engine incorporated making it an easy transition from stock mysql.

So we are down to,

  • Infobright
  • LucidDB

The two have very different remaining pros and cons, Infobright is also a mysql build with a proprietary storage engine incorporated, however, the drawback is the limitations they’ve implemented on the free version, namely most of the DDL and DML statement functionality has been stripped away. That means you can not insert, update, delete, truncate, or alter a table. At first glance that may sound insane, but keep in mind an analytics environment in many cases needs only primarily two things, to be able to load data, and to be able to select data, and do it quickly, which it can do very well, mysql style. However if you wish to change or alter the table or data, you’ll need to fully reload the table, and these restriction may be tough for an environment needing to serve customers, however our requirements are only a daily load, appending the existing data, internal analysis, and we rarely change table structure on these particular tables, and so we are willing to deal with a dump and reload of a table occasionally if need be.

LucidDB is a completely different animal, requiring Java to run, and uses it’s own SQL variant and data loading methods (for instance, there is no datetime datatype, you’re only choice is varchar to keep the data integrity). It does have full DDL, DML statement functionality and no other obvious restrictions. Certainly with a learning curve, we could use this product, but when considering the people that would use this database, such as BAs and developers, it made sense to choose a product that is as consistent as possible with what we are currently using.

So infobright was the choice, with some work arounds to make it usable. Truncate and rename table can be emulated, and other issues had to be dealt with, for instance, you can not simply copy over an innodb schema to a brighthouse schema in many cases, it will complain about keys, foreign keys, auto increment, unsigned datatypes, etc.. We wrote a script to parse this stuff out. Not even close to being usable generally, just our specific tables, otherwise I’d include it here. Another example the utf8 charset is handled badly, you’ll need to increase your datatype definition size 4x e.g. varchar(32) -> varchar(128). There are number of odd things such as this, I’m not so sure I’d be as eager to use for outside customers, but for our internal purposes, it will fit just fine.

One thought on “warehousing and data analytics in a mysql world

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>