As I described in this post, Practical vSAN: Measuring Congestion, vSAN uses congestion as one of the primary metrics in deciding whether to add latency to incoming writes. As congestion increases, overall vSAN performance declines until it eventually stops accepting incoming writes altogether and machines on the vSAN effectively hang, similar to an APD condition in the core storage (NFS/iSCSI) subsystem in ESXi.
One incredibly useful technique to “buy yourself time” is to increase the log congestion limits. Remember, this only buys you time; it does not resolve the underlying problem. The “LowLimitGB” is the threshold at which latency starts to be added. The “HighLimitGB” is the threshold at which incoming writes are halted. According to PSS the upper limit is 128GB, and 64/128 appear to be the maximums for the low and high values. You will need to execute these commands on every host in the cluster that is experiencing congestion, and we suggest setting identical values on all hosts. Also, don’t set limits larger than the size of your cache devices.
esxcfg-advcfg -s 64 /LSOM/lsomLogCongestionLowLimitGB
esxcfg-advcfg -s 128 /LSOM/lsomLogCongestionHighLimitGB
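Before changing the values, it can help to sanity-check the numbers against the constraints described above. The following is a minimal sketch, not a VMware-provided tool: `check_limits` is a hypothetical helper name, the cache device size is something you supply yourself, and the rule that the low limit must sit below the high limit is our own assumption. It prints the commands instead of running them, so you can review them first.

```shell
#!/bin/sh
# Hypothetical helper (not part of ESXi): validate proposed limits, then
# print the esxcfg-advcfg commands for review instead of executing them.
# Arguments: low limit (GB), high limit (GB), cache device size (GB).
check_limits() {
    low=$1
    high=$2
    cache=$3

    # Assumption: latency should kick in before writes are halted.
    if [ "$low" -ge "$high" ]; then
        echo "error: low limit must be below high limit"
        return 1
    fi
    # 64/128 appear to be the ceilings for the low/high values.
    if [ "$low" -gt 64 ] || [ "$high" -gt 128 ]; then
        echo "error: limits exceed the 64/128 GB ceilings"
        return 1
    fi
    # Don't set limits larger than the cache device.
    if [ "$high" -gt "$cache" ]; then
        echo "error: high limit exceeds cache device size"
        return 1
    fi

    echo "esxcfg-advcfg -s $low /LSOM/lsomLogCongestionLowLimitGB"
    echo "esxcfg-advcfg -s $high /LSOM/lsomLogCongestionHighLimitGB"
}
```

For example, `check_limits 64 128 400` prints the two commands shown above, while `check_limits 64 128 100` reports that the high limit exceeds a 100GB cache device. After applying a change, you can read a value back with `esxcfg-advcfg -g /LSOM/lsomLogCongestionLowLimitGB`.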
In one recent case, these commands bought us a few hours to diagnose an underlying hardware issue and let us finish the working day without any further performance complaints. PSS recommended reverting the values to their defaults afterward, but we ended up leaving them at these levels, as I don’t really see a downside if you’re proactive about monitoring for congestion generally. If your storage woes are being exacerbated by a background resync, our next post, Practical vSAN: Part 3 – Adjusting the number of background resync processes, will show you how to change the number of simultaneous resync processes.