Practical vSAN: Measuring Congestion

October 10, 2016

VMware’s vSAN, while an amazing product in many respects, leaves something to be desired when it comes to troubleshooting issues in production. Many of the knobs and dials are “under the hood” and not exposed in any obvious way during a crisis. In this series of posts we’ll document some of the troubleshooting techniques and tools we’ve gathered over the last several months. As always, use these tools at your own risk; we highly advise engaging VMware PSS if you can.

Part 1: Measuring Congestion

In vSAN, as writes are committed to the cache tier, other writes are concurrently destaged to the capacity tier. In essence, the cache tier is a buffer for incoming writes to the capacity tier. If there is an issue with the underlying capacity tier, this buffer can start to fill. In vSANese, this is known as “log congestion.” In my opinion, congestion is one of the primary health metrics of a vSAN. If, during a resync or other intensive IO operation, you start to experience persistent log congestion, that’s a very good sign that some underlying component has a fault or that there is an issue with drivers/firmware. I should also note that log congestion does not always indicate a problem with the capacity tier; logs are also used when persisting data to the caching tier.

As an aside, the health check plugin reports congestion for these logs on what appears to be a scale of 0 to 255, with 255 being 100% full (though support is unclear on the exact meaning).
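If you want to poke at that raw number yourself, the same per-disk vsish node we’ll query for log space below appears to carry congestion-related counters as well. Treat this as a hedged sketch only: we haven’t confirmed the exact field names, which vary between vSAN builds, and <disk-uuid> is a placeholder for a cache-tier disk’s vSAN UUID.

# Look for congestion-related fields in the LSOM per-disk info node.
# The field names are an assumption here; adjust the grep if your build differs.
vsish -e get /vmkModules/lsom/disks/<disk-uuid>/info | grep -i congestion

If nothing matches on your build, the log space measurement below still tells the same story.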

As the log buffer fills up, the vSAN starts to “add latency” to incoming write requests as a way to throttle them. When a resync is triggered, if the underlying storage can’t keep up for whatever reason, these buffers WILL fill. Once they hit a certain threshold (16GB by default), latency is added. More and more latency is added until a second threshold is reached (24GB by default), at which point incoming writes are completely halted until enough data has been destaged to resume operation. At this point your entire cluster may enter a kind of bizarro-world APD-type state, where individual hosts start dropping out of vCenter and virts hang and pause. Note that the only indication you will have from vCenter or the vSAN itself at this point is that log congestion is too high.

You can check the current size of these log buffers by running this command on each host in your vSAN:
esxcli vsan storage list |grep -A1 "SSD: true" |grep UUID|awk '{print $3}'|while read i;do sumTotal=$(vsish -e get /vmkModules/lsom/disks/$i/info |grep "Log space consumed"|awk -F \: '{print $2}'|sed 'N;s/\n/ /'|awk '{print $1 + $2}');gibTotal=$(echo $sumTotal|awk '{print $1 / 1073741824}');echo "SSD \"$i\" total log space consumed: $gibTotal GiB";done

Another useful tip is to save this into a shell script and use the Unix command “watch” to observe these values over time (e.g. watch ./congestion.sh).
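If you do go the script route, here is the same one-liner broken out line by line (we’ve named it congestion.sh to match the watch example above). The logic is identical to the command shown earlier, just easier to read and comment; it sticks to plain POSIX sh since that’s what the ESXi shell provides.

#!/bin/sh
# congestion.sh - report total log space consumed per cache-tier SSD, in GiB.
# Same pipeline as the one-liner above, just split out with comments.

# Pull the vSAN UUID of each cache-tier (SSD) device on this host.
esxcli vsan storage list | grep -A1 "SSD: true" | grep UUID | awk '{print $3}' | while read i; do
    # Sum the two "Log space consumed" values LSOM reports for this disk (in bytes).
    sumTotal=$(vsish -e get /vmkModules/lsom/disks/$i/info |
        grep "Log space consumed" |
        awk -F: '{print $2}' |
        sed 'N;s/\n/ /' |
        awk '{print $1 + $2}')
    # Convert bytes to GiB.
    gibTotal=$(echo "$sumTotal" | awk '{print $1 / 1073741824}')
    echo "SSD \"$i\" total log space consumed: $gibTotal GiB"
done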

Again, any value below 16GB should not cause the vSAN to introduce latency. Any value between 16GB and 24GB is bad news. If you hit 24GB, you’re really having a bad time. Also watch these values over time: if they are generally moving upwards, that’s bad; if they are generally decreasing, you can start breathing again.
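To make the output self-interpreting, you could bolt a small classifier onto the script. This is just a sketch against the default 16GB/24GB thresholds described above; check_congestion is our own hypothetical helper name, and the thresholds are hard-coded, so adjust them if you’ve changed yours.

# Classify a measured value (in GiB) against the default congestion thresholds.
# 16 GiB: vSAN starts adding latency; 24 GiB: incoming writes are halted.
check_congestion() {
    gib=$1
    if awk -v g="$gib" 'BEGIN { exit !(g >= 24) }'; then
        echo "CRITICAL: $gib GiB - writes are likely being halted"
    elif awk -v g="$gib" 'BEGIN { exit !(g >= 16) }'; then
        echo "WARNING: $gib GiB - vSAN is throttling incoming writes"
    else
        echo "OK: $gib GiB - below the throttle threshold"
    fi
}

Dropping check_congestion "$gibTotal" into the loop of congestion.sh would print one of these lines per cache-tier SSD on each pass of watch.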
These thresholds can be increased, which can buy you some breathing room in a crisis. You can read about that in our next post: Practical vSAN Part 2: Increasing Congestion Thresholds