Monthly Archives: July 2012

Where Partitioning hurts

As a mobile gaming platform provider, we collect a tremendous amount of data, our biggest tables collect upwards of 50 million rows per day, and so partitioning is a must for both our statistical collection and especially analytics, where we piece together the entire dataset over time. Partitioning allows the tables to essentially be broken up into smaller tables, arranged by a partitioning key, in this case, ‘dayofweek’ or ‘dayofyear’ column. This allows a query to be able to restrict the query to the particular partitions in question relevant to the query such as a count of the records for that day. This is a big advantage for both reads and writes that would need to scan the entire index of a table that would not fit into memory to execute the query.

For the most part, for the sorts of queries we do, it’s almost automatic for us to partition big tables, we started with a couple of the biggest, and recently partitioned everything for both stat collection and analytics in preparation for ever increasing volume. All of the queries are based on a particular day, week, month, etc.. and so I was surprised to hear of a report that a query being run seemed to be about 2X as slow once we partitioned it.

However, looking at the query, it made sense. The query involved two joined tables, both partitioned. One table was our largest (analytics events), restricted to a particular day of the week, and already partitioned prior, so no gain there. The other table was our sessions table, newly partitioned, and the query DID NOT restrict to a particular time frame (such as the same day as the analytics events). Now the query was correct, because some sessions may have been started weeks ago, (an app can sit in the tray unused and so the session is still technically valid and is the correct identifier related to that day’s analytics event).

The point being, if you’ve partitioned your tables, and you are doing queries that do not take advantage of them, the parser has to do a lot more work looking at all the partitions and will be slower than a non-partitioned table. What may not be so obvious is the same holds true if you are joining tables and even one of them is partitioned. Now we also do queries on the sessions table which do restrict to a particular time frame which indeed improved, but many of the crucial queries join to that table over it’s entire time range. For us, the solution was to remove the partitioning on the sessions table. You’ll have to of course make your best judgement as to what the best trade-off is depending on the range of the majority of your queries.