
NUMA system performance issues – more than just swapping to consider

Almost all modern systems running a database have more than one physical processor (each with multiple cores and threads) in order to scale the number of simultaneously running threads and, ultimately, the work the system can do. To scale memory performance as well, the NUMA architecture gives each physical processor, or node, its own ‘local’ memory. For a process such as a mysql database that takes up more than the size of one node’s memory (i.e. more than 50% of total memory on a two-node system), the resulting performance considerations are not necessarily straightforward. Jeremy Cole’s excellent article is recommended reading as a general overview of NUMA-based systems, and it covers the most damning of these potential issues: actual swapping due to memory imbalance.  http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/ There are, however, other less obvious but potentially harmful performance issues of which you should also be aware.

non-local memory percentage of use and location

As mentioned, NUMA attempts to localize memory to a node, which helps a great deal in scenarios such as independent processes or sharded/multiple-instance dbs: each process is bound to a node, and the node in turn attempts to use its local memory. It should be clarified that ‘remote’ memory is simply memory assigned to another node, so local/remote is always relative to the node doing the accessing. A large process (larger than the local memory of one node) has no choice but to use memory from multiple nodes, so there is no way to avoid remote memory access: which data is needed cannot be determined until a thread has already been scheduled on a core whose node may or may not hold that data in local memory.
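The kernel keeps per-node hit/miss counters that make this local/remote split observable. As a sketch of estimating the remote-allocation percentage, the awk below parses numastat-style counters; the numbers in the heredoc are invented for illustration, and in practice you would pipe in real output from `numastat` (part of the numactl package):

```shell
# numa_hit:  allocations satisfied on the intended (local) node
# numa_miss: allocations that landed on a node other than the preferred one
# The counter values below are made up for illustration only.
pct=$(awk '/^numa_hit/  { for (i = 2; i <= NF; i++) hit  += $i }
           /^numa_miss/ { for (i = 2; i <= NF; i++) miss += $i }
           END { printf "%.1f", 100 * miss / (hit + miss) }' <<'EOF'
                node0        node1
numa_hit      1000000      1000000
numa_miss      250000       250000
EOF
)
echo "${pct}% of page allocations missed their preferred node"
```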

So, understanding that accessing non-local memory is a necessity, it follows that the percentage and proximity of remote memory accesses matter, and can make a big difference in CPU-intensive, large-memory applications such as a high-transaction, dataset-in-memory db. A fundamental choice that exercises this is: how many sockets (nodes) do I choose for my system? Surely more is better? Not necessarily. On a two-node system running a db that takes up most of the memory, you’ll likely access remote memory about 50% of the time (save some internal optimizations), whereas on a four-node system you’ll be accessing remote memory about 75% of the time (and 25% local). What’s more, the distance between nodes (and the memory bus the access must travel through) is also a factor, which is reported via:

numactl --hardware

two node

node distances:
node   0   1
0:  10  20
1:  20  10

four node

node distances:
node   0   1   2   3
0:  10  20  30  20
1:  20  10  20  30
2:  30  20  10  20
3:  20  30  20  10

Comparing the two-node distances with those of four nodes physically arranged in a square, it’s easy to see that a significant percentage of accesses require not one but two hops. Summing the two factors together, the average access time on a four-node system can be very roughly double that of a two-node system.
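Weighting every node equally, the matrices above give node 0 an average distance of 15 on the two-node box versus 20 on the four-node box, before even counting the higher remote-access percentage. A quick sketch of computing this from the distance matrix (the heredoc reproduces the four-node example above; on a live system you would pipe in `numactl --hardware` output instead):

```shell
# Average distance from node 0, assuming accesses are spread evenly
# across all nodes -- a simplification for illustration.
avg=$(awk '/^ *0:/ { for (i = 2; i <= NF; i++) { s += $i; n++ }
                     printf "%.1f", s / n }' <<'EOF'
node distances:
node   0   1   2   3
  0:  10  20  30  20
  1:  20  10  20  30
  2:  30  20  10  20
  3:  20  30  20  10
EOF
)
echo "average distance from node 0: $avg"
```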

memory reclaim modes

What exactly happens when a node runs out of local memory? This investigation was spurred by the fact that although we have NUMA enabled on most systems, we haven’t seen any swapping issues, even though some nodes on some systems are filled up and potentially under pressure. Swapping is indeed one possibility, but it’s not the only one, and the behaviour depends directly on your virtual memory settings. The default allocation policy is node-local, and when a node fills up the kernel’s response can include discarding clean pages (which will need to be read in again later), writing dirty pages out, or simply allocating from a remote node with no attempt to make room on the local one. Details can be found under the zone_reclaim_mode option at:

http://lxr.free-electrons.com/source/Documentation/sysctl/vm.txt

the gist of which is (nonzero values are ORed together as a bitmask):

0       = Zone reclaim off
1       = Zone reclaim on
2       = Zone reclaim writes dirty pages out
4       = Zone reclaim swaps pages

I’ve seen it stated that the default on two-node systems is 0 (simply go to remote nodes when local memory is filled), while the default on four-node systems is 1 (‘free cached pages which are not dirty on the local node before going to remote’). This matches what we’ve found on our current systems; additionally, the default can differ depending on the version of your OS, as the kernel chooses it based on the inter-node distances reported above. To check:

sysctl -a | grep reclaim

The takeaway is that although swapping is the most obvious and widely published performance issue, with these defaults (at least nowadays) you’re likely not to run into a memory pressure issue at all on two-node systems, while on four-node systems you may find the issue is more subtle: continuous re-reading of pages from disk for seemingly no good reason, rather than the swapping you might expect. The solution is simply to change zone_reclaim_mode to 0 by setting

vm.zone_reclaim_mode = 0

in /etc/sysctl.conf

and reload with sysctl -p
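The change can also be made immediately at runtime through procfs, without touching the config file; this is the standard Linux path for the sysctl, needs root, and does not persist across a reboot:

```shell
# Takes effect immediately; pair with the /etc/sysctl.conf entry
# above so the setting survives a reboot.
echo 0 > /proc/sys/vm/zone_reclaim_mode
```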

Good additional reading on this can be found at:

http://frosty-postgres.blogspot.ca/2012/08/postgresql-numa-and-zone-reclaim-mode.html

http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases

http://queue.acm.org/detail.cfm?id=2513149