As with most data centric companies, there comes a time where it’s necessary to be able to aggregate data to provide meaningful insight into raw data that is collected. At Fuse Powered that is essentially one of the cores services we provide, whether it be via ad-hoc queries, reports, or our dashboard.
The core database that we use is of course mysql with a default storage engine Innodb, which is geared toward OLTP and does it very well. However, when it comes to aggregating a large number of rows, and/or loading a large amount of data quickly, this default storage engine is not geared towards these sorts of operations efficiently when performing a very large amount of data. A class of storage engine called columnar storage is designed for just this sort of work, where the data is re-oriented, and the data is put into separate column based data files, where each of the columns can be read from disk sequentially and stored with high compression (resulting in disk storage savings and less data to read), as the speed in reading far outweighs the cpu overhead of decompression.
There are a number of columnar based offerings out there, however, almost all are closed source and quite pricey. Up until now we’ve utilized Infobright, a mysql client-based solution with a watered-down community version, stripping much of the functionality and performance from it’s full version. Recently a company, Calpont, (recently re-branded as it’s product name Infinidb) has come out with a open source licensing of it’s full 4.0.x version, a move that may potentially revolutionize how columnar databases are marketed.
This full version fills a number of core needs, such as compression, parallel processing, distributed storage, filesystem redundancy (HDFS, GlusterFS) and high speed data loading, and so we’ve been eager to test and implement.
The architecture is comprised of two main functional units, user modules (UM) and performance modules (PM). Performance modules work with and store data, while user modules coordinate queries across the entire system. Scaling performance modules gives you more storage and potentially better performance from parallel work being done. Scaling user modules allows for more concurrent queries to be run and redundancy for a failed um. You can start with a single server that contains all units, and grow out to a large distributed system as both type of modules can be added as your data increases. Each PM has a set number of dbroots, which can easily be reallocated/moved to other PMs on other servers as your data grows. It also uses a form of partitioning called extent mapping which will rule out sections of data in storage to speed up queries where possible. Queries are processed in parallel by the available processors and nodes that are appropriate.
the following is our general experience with it to date,
Documentation – Very good, answers quite a bit in terms of concepts, installation, options, etc.. However, because it’s just released as open source, really needs much more use and documentation by the user community to fish out bugs, suggest fixes, etc..
Installation/configuration – straight forward, enable passwordless ssh between nodes, download/install the initial rpm/deb/binary files and then run their script which will ask a series of questions. You can adjust the main configuration file later only on the root pm, usually pm1 at /usr/local/Calpont/etc/Calpont.xml.
Schema transformation - you need to create infinidb versions of the table structure, this is going to take some time, there are all kinds of incompatibilities when converting innodb schema to infinidb.
Loading – load data infile vs. the infinidb loader. As can be guessed, the loader is much faster, anywhere from 100,000 - 1 million+ rows/s for a single module load. The source of the loader can be at the user module, which will automatically distribute the data in 10,000 rows chunks, or at the performance modules individually using pre-split files to be able to scale. You can set an error threshold over which the load will abort, with errors logged.
Compression – our data is fairly typical, largely numerical with some varchar and text/blob columns here and there. Our data is compressed at about 5X compared to uncompressed innodb.
Backups – pretty rudimentary, really just a file copy which requires it be in read-only mode to ensure data consistency. However, it being a file copy, straight forward and fast.
Queries – performance is very impressive, especially for more complex queries, we had a class of aggregation queries that was taking 5-10 minutes on a dedicated box with 256GB ram, on a three-node public/shared cloud system 16GB each, finished in 10-15 seconds.
Logging – very well done, especially the loading, there a two files per load per table that detail the error type and the exact characters that cause the issue. Also various server logs to debug issues.
Administration – much of it is done through an independent ‘cmconsole’ which allows you to set and view various parameters and processes as well as add and reallocate modules. There are several processes per node that need to be up and running and can be a source of problems if they aren’t, viewable with the console.
Monitoring – built in monitoring of a number of key metrics such as disk/memory/cpu usage, networking issues, module issues, etc..
Cross engine compatibiity – not just the infinidb storage engine, also a full version of mysql and you can add innodb capability with the plugin. However, pretty darn old at 5.1.39. You can also mix storage engines in your queries, although you’ll need to specify it in the configuration file.
Before we get in the issues, it should be said that we have not attempted to contact infinidb reps, and any resources we used were simply the documentation and the infinidb forums (with posted responses by infinidb reps). Now, some of the issues are annoyances, other are fairly serious.
The most serious revolves around loading of data, we typical re(load) and append about 1000 tables per day and we’ve run into this behavior a few times in the last couple of weeks.
Locking – each load requires a table lock and occasionally, one of two things happen,
a) a table is locked and does not release the lock – we’ve even had a time or two where we’ve restarted the system and still has not released the lock (contrary to documentation). There is a view and clear lock utility, but for some reason it’s currently not working for us at the moment either. Our solution was to hunt for the file that contains this lock, and sure enough, you can find it on your parent pm at /usr/local/Calpont/data1/systemFiles/dbrm/tablelocks. Simply removing it and restart seems to work. This behavior seems to depend on the particular data being loaded, this happened on the same table 3 times, and we ran into most of this when initially loading our full data set.
b) table lock can not be acquired – In this case, a load is attempting to get a table lock, can’t get it and aborts the load. There were no other queries on the same table running at the same time, and in fact seemed to happen mainly when loading of other tables/dbs simultaneously.
Critical errors – we’ve ran into this a few times, where an error such as ‘image inconsistency’ initiates the system to go into ‘read-only’ mode. We only experienced it during a load which I suppose makes sense as it’s the only time we are significantly writing. When encountering this during a load it also attempts to rollback but can’t because again it’s been put into read-only mode. You might deal with the lock separately as above, and the read-only issue is suppose to clear after a system restart, but we’ve seen where it doesn’t. You then have two options. The advice at this link may help, http://infinidb.co/community/infinidb-not-starting otherwise, you may have to remove the entire infinidb installation and replace with your latest backup. We are not using a redundant file system such as HDFS or glusterFS, which may or may not help.
Not enough memory to perform select query – Infinidb uses a distributed hash join to evaluate large data sets quickly, the memory requirements of which is dictated by the second largest table in the join. There is a variable, PmMaxMemorySmallSide at a default of 64M up to a max of 4GB on the PM nodes, or all of TotalUmMemory on the UM depending on the data set size (if larger than the PM setting, uses the latter). We frequently join two tables, one at 40+ Billion rows the next at 1 Billion rows which has been an issue for certain queries, although our UM node is very modest at 16GB. Query span does make a difference (i.e. one day at a time vs. the entire year) so it’s not just a total table size metric.
backups needing to be in read only mode - is a self-explanatory issue. hopefully your needs don’t require frequent/real-time writes/loads 24 hours a day.
no real query plan evaluation – explain doesn’t give you much as it’s without indexes and use an alternate storage mechanism and execution algorithm, although probably irrelevant as it’s evaluated in a completely different way. Query profiling (set profiling=1) works. The question arises whether differences in the query structure matters and it would be interesting to see the performance of the same query written two different ways that take can longer one way in Innodb, and how they behave differently here (e.g. join vs. subquery class of queries) * – indeed there is a detail analyzer ‘select calgetstats()’ and ‘select calgettrace()’ directly after a query on the command line
sql fickleness – various issues such as,
- not being able to add a column anywhere other than the end of the table
- when doing mysqldump, must do –single-transaction, otherwise will err with lock issue
- All non-aggregate columns in the SELECT and ORDER BY clause must be included in the GROUP BY clause.
- ‘insert into (select..)’ errs, while ‘insert into select’ works
And so to sum up, we are using Infinidb and recommend it, however I’d suggest a thorough evaluation to your specific needs before making a commitment, luckily with an open source license, you can do just that.