Monthly Archives: October 2016

Practical vSAN: Increasing Congestion Thresholds

As I described in the post Practical vSAN: Measuring Congestion, vSAN uses congestion as one of the primary metrics in deciding whether or not to add latency to incoming writes. As congestion increases, the vSAN’s overall performance declines until it eventually stops accepting incoming writes altogether and machines on the vSAN effectively hang, similar to an APD condition in the core storage (NFS/iSCSI) subsystem of ESXi.

One incredibly useful technique to “buy yourself time” is to increase the congestion thresholds. The upper limit according to PSS is 128GB. Remember, this only buys you time; it does not resolve the underlying problem. The “LowLimitGB” value is the threshold at which latency starts to be added, and the “HighLimitGB” value is the threshold at which incoming writes are halted. 64/128 appear to be the maximums for these values. You will need to execute these commands on every host in the cluster that is experiencing congestion; we suggest setting them identically on all hosts. Also, don’t set the limits larger than the size of your cache devices.

esxcfg-advcfg -s 64 /LSOM/lsomLogCongestionLowLimitGB
esxcfg-advcfg -s 128 /LSOM/lsomLogCongestionHighLimitGB
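To confirm that a host picked up the new values, you can read them back with the -g flag (the same option paths as above):

esxcfg-advcfg -g /LSOM/lsomLogCongestionLowLimitGB
esxcfg-advcfg -g /LSOM/lsomLogCongestionHighLimitGB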

In one recent case, these commands bought us a few hours while we diagnosed an underlying hardware issue, and let us finish out the working day without any further performance complaints. We ended up leaving the values at these levels rather than reverting them to the defaults as recommended by PSS, as I don’t really see a downside if you’re proactive about monitoring for congestion generally. In our next post, Practical vSAN: Part 3 – Adjusting the number of background resync processes, we’ll show you how to change the number of simultaneous resync processes if your storage woes are being exacerbated by a background resync.

Practical vSAN: Measuring Congestion

VMware’s vSAN, while an amazing product in many respects, leaves something to be desired when it comes to troubleshooting issues while in production. Many of the knobs and dials are “under the hood” and really not exposed in a way that is obvious in a crisis. In this series of posts we’ll document some of the troubleshooting techniques and tools we’ve gathered over the last several months. As always, use these tools at your own risk; we highly advise engaging VMware PSS if you can.

Part 1: Measuring Congestion

In vSAN, as writes are committed to the cache tier, other writes are concurrently destaged to the capacity tier. In essence, the cache tier is a buffer for incoming writes to the capacity tier. If there is an issue with the underlying capacity tier, this buffer can start to fill. In vSANese, this is known as “log congestion.” In my opinion, congestion is one of the primary health metrics of a vSAN. If you start to experience persistent log congestion during a resync or other IO-intensive operation, that’s a very good sign that some underlying component has a fault or that there is a driver/firmware issue. I should also note that log congestion does not necessarily indicate a problem with the capacity tier; logs are also used when persisting data to the caching tier.

As an aside, the health check plugin reports congestion for these logs on what appears to be a scale of 0 to 255, with 255 being 100% full (though support is unclear on this exact point).

As the log buffer fills up, the vSAN starts to “add latency” to incoming write requests as a way to throttle them. When a resync is triggered and the underlying storage can’t keep up for whatever reason, these buffers WILL fill. Once they hit a certain threshold (16GB by default), latency is added. More and more latency is added until a second threshold is reached (24GB by default), at which point incoming writes are completely halted until enough data has been destaged to resume operation. At this point your entire cluster may enter a kind of bizarro-world, APD-type state, where individual hosts start dropping out of vCenter and virts hang or pause. Note that the only indication you will get from vCenter or the vSAN itself at this point is that log congestion is too high.
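These two thresholds are governed by the LSOM advanced options covered in the next post on increasing them; if you want to confirm what your hosts are currently using, you can read the values back from the ESXi shell (a quick check only; the defaults described above are 16 and 24):

esxcfg-advcfg -g /LSOM/lsomLogCongestionLowLimitGB
esxcfg-advcfg -g /LSOM/lsomLogCongestionHighLimitGB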

You can check the current size of these log buffers by running this command from each host in your vSAN:
esxcli vsan storage list |grep -A1 "SSD: true" |grep UUID|awk '{print $3}'|while read i;do sumTotal=$(vsish -e get /vmkModules/lsom/disks/$i/info |grep "Log space consumed"|awk -F \: '{print $2}'|sed 'N;s/\n/ /'|awk '{print $1 + $2}');gibTotal=$(echo $sumTotal|awk '{print $1 / 1073741824}');echo "SSD \"$i\" total log space consumed: $gibTotal GiB";done

Another useful tip is to save this into a shell script and use the unix command “watch” to observe these values over time (e.g. watch ./congestion.sh); a sketch of such a script follows.
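Here is the same one-liner broken out into a script for readability; the commands and the vsish path are exactly as above, so treat it as a sketch and adjust as needed:

#!/bin/sh
# congestion.sh - total "Log space consumed" per cache device, in GiB.
# Same commands as the one-liner above, just split out for readability.
esxcli vsan storage list | grep -A1 "SSD: true" | grep UUID | awk '{print $3}' |
while read i; do
    # Add up the "Log space consumed" values (in bytes) that vsish reports for this disk.
    bytes=$(vsish -e get /vmkModules/lsom/disks/$i/info | grep "Log space consumed" |
            awk -F: '{print $2}' | sed 'N;s/\n/ /' | awk '{print $1 + $2}')
    # Convert bytes to GiB.
    gib=$(echo $bytes | awk '{print $1 / 1073741824}')
    echo "SSD \"$i\" total log space consumed: $gib GiB"
done

Save it as congestion.sh, chmod +x it, and watch ./congestion.sh will refresh the numbers every few seconds.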

Again, any value below 16GB should not cause the vSAN to introduce latency. Any value between 16GB and 24GB is bad news. If you hit 24GB, you’re really having a bad time. Also watch these values over time: if they are generally moving upwards, that’s bad; if they are generally decreasing, you can start breathing again.

These values can be increased, which can buy you some breathing room in a crisis. You can read about that in our next post, Practical vSAN Part 2: Increasing Congestion Thresholds.