MongoDB When To Shard - Brian Carpio

I had an interesting conversation with my development team and my DBA team specifically regarding how to identify when to shard. Currently we are using Zabbix to monitor multiple mongodb servers using the mikoomi Zabbix template and php script. If you aren't leveraging Zabbix for monitoring your infrastructure then you should check it out, its got a lot of great features. Anyway, the conversation started with someone suggesting that the memory foot print might be a very important statistic leverage to determine if its time to shard. While I don't disagree that monitoring memory can be a key indicator I certainly think there are better.

Background Flushing

Using mikoomi and Zabbix you get a graph which shows how your data is flushing to disk four important metrics are displayed:

•Background Flush Average Time
•Total Background Flush Time In Last Minute
•Number Of Background Flushes In The Last Minute
•Last Background Flush Time

The reason background flush information is so important is that it allows you to determine if the amount of data being updated or added is overwhelming the disk subsystem. As you know by default mongodb flushes data to disk every minute so we need to make sure that the amount of data being flushed can flush in 60 seconds. I'd say there should be concern when background flush average time has reached 30 seconds, either fast disks need to be added to the mongodb cluster or a new shard should be introduced.

Global Locking

Another key indicator is Global locking, at one company we ran a very high performance application and tried to keep locking to under 30% once we started to surpass 30% we knew the mongodb servers would start to slow down the performance of our application layer. Locking as you know occurs during write load (and has been improved in 2.x), so this is a good indicator that the write load is too high for just a single replica set and needs to be spread across multiple replica sets in a sharded config. I'm not saying 30% is the right number for you it could be 10% it could be 50% but it's certainly a very important key indicator. Again with mikoomi and Zabbix this information is very accessible with the following items:

•Current Reader Queue Length
•Total Lock Time (microseconds) Last Minute
•Current Writer Queue Length
•Current Total Queue Length

Page Faults

I'd also say that "Memory: Page Faults/Minute" is a great key indicator this really helps assess how often mongodb is going to disk for data that is being requested. In mongodb its import to have indexes and most of the working data set in memory as this number grows it's certainly a key indicator that the amount of memory allocated to the server isn't sufficient for the working data set and either more memory needs to be added to the system or a new shard should be added. Mikoomi and Zabbix offer a item called "Memory: Page Faults/Minute" that tracks this very thing.

While this certainly isn't an exhaustive list of things which should be monitored for sharding they are defiantly some of the ones in my experience show that a new shard should be added, in some cases its simply sufficient to add memory or ssd drives to your solution to keep free of having to implement sharding.