Category Archives: SysAd

Overview of Modern Storage for the Enterprise


Hyper-converged, scale-out, black-box, roll your own; on-premises enterprise data storage is a bit more than a hobby of ours here at Symbio. Until recently, most medium-scale enterprises bought their storage from one of a few vendors: HP, Dell/Compellent, Nimble, NetApp, EMC, etc. These products arrive as more-or-less plug-and-play appliances, and are usually fully supported by their respective vendors. They are also expensive. Symbio got its start using a home-brewed Linux-based storage appliance we built ourselves because we couldn’t afford anything else.

This was great until I found myself debugging production storage issues during (literally) the birth of my first child. After that experience, and at the urging of my wife, we committed to a Compellent investment. The cost was extreme for a company of Symbio’s size (the initial purchase price was nearly 20% of our annual revenue, and annual support renewals were almost 10% of revenue for the first couple of years). The Compellent served us well, and its stability allowed us to reliably quadruple the size of our customer base. After the first couple of years, however, we outpaced its performance capabilities.

As technology decision makers, we’re used to basing a storage platform choice on metrics like cost-per-GB, IOPS, features, and support reputation. This time, however, we wanted to take a deeper look at our options, because it’s clear the very ways we think about storing data are changing. This presents fantastic opportunities for cost control and new capabilities, but also presents new risks for businesses to contend with. This series of articles explores a few of the emerging trends in on-premises enterprise storage, the ideal applications for each technology, and our specific experiences with each approach. Symbio currently runs all of these systems in a production capacity.

Hyper-converged – VMware vSAN: Solutions like vSAN or Nutanix place the storage directly inside your processing hosts, then use software on the “back end” to provide performance and data redundancy. These solutions can offer extreme performance at a moderate cost, but they dramatically change failure models and require very careful, experienced planning to implement reliably. Symbio uses vSAN as our primary storage for high-performance needs, specifically databases and virtual desktops.

Traditional “Black Box” SAN – Nimble Storage: The “usual” enterprise approach: an appliance provided and supported by a vendor. This approach offers moderate performance and generally very high reliability, and it is compatible with existing thinking about failure modes (storage and compute can be treated as isolated components of an overall system). Cost is often high compared to the alternatives, but the “one-ass-to-kick” nature of the support can be of tremendous value to shops that lack deep IT talent. Symbio uses Nimble for our “general purpose” workloads; things that don’t demand extreme performance or capacity, but where we derive value from some of the “nice to have” features that aren’t available on our other solutions.

Open Source, Scale Out – Red Hat Ceph: Ceph is rapidly emerging as a favorite low-cost, high-capacity solution for shops with strong technical capability. Ceph uses a mathematical model to decide where to place data on the underlying disks, and clients talk directly to the disks to request data. This means your controller is no longer a bottleneck or a single point of failure as it is with a traditional SAN. Ceph can scale to petabytes simply and without the enormous cost a traditional SAN would require. Ceph is open source, community supported (though enterprise support is available), and runs on commodity hardware. Symbio re-purposed all our old Compellent hardware into Ceph clusters (which, yes, we will write a blog post about) and uses them as low-performance, high-capacity storage for backups and our SymbioVault off-site backup product. Ceph is presently limited in some very important ways, however.
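As a taste of the model before the deep-dive posts, here’s a minimal sketch of carving a backup volume out of a Ceph cluster. The pool name, placement-group count, and image size here are illustrative, not our production values:

# Create a replicated pool for backup images (128 placement groups as an example)
ceph osd pool create backups 128
# Create a 500GB RADOS block device in that pool (size is given in MB)
rbd create backups/vault01 --size 512000
# On a client host, map the device and put a filesystem on it
rbd map backups/vault01
mkfs.xfs /dev/rbd0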

We’ll explore each of these technologies in depth in the coming series of articles.

Practical vSAN: Increasing Congestion Thresholds


As I described in Practical vSAN: Measuring Congestion, vSAN uses congestion as one of the primary metrics in determining whether or not to add latency to incoming writes. As congestion increases, the vSAN’s overall performance will decline until eventually it stops accepting incoming writes altogether, and machines on the vSAN will effectively hang, similar to an APD condition in the core storage (NFS/iSCSI) subsystem in ESXi.

One incredibly useful technique to “buy yourself time” is to increase the congestion thresholds. The upper limit according to PSS is 128GB. Remember, this only buys you time; if you don’t resolve the underlying problem, you will eventually hit the new limits as well. The “LowLimitGB” is the threshold at which latency starts to be added. The “HighLimitGB” is the threshold at which incoming writes are halted. 64/128 appear to be the maximums for these values. You will need to execute these commands on every host in the cluster that is experiencing congestion; we suggest setting them identically on all hosts. Also, don’t set limits larger than the size of your cache devices.

# Raise the threshold at which vSAN begins adding write latency (default 16GB)
esxcfg-advcfg -s 64 /LSOM/lsomLogCongestionLowLimitGB
# Raise the threshold at which vSAN halts incoming writes entirely (default 24GB)
esxcfg-advcfg -s 128 /LSOM/lsomLogCongestionHighLimitGB
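The same utility reads values back with -g, so you can confirm the change took, and later revert to the 16/24 defaults once the underlying problem is fixed (as PSS recommends):

# Confirm the new thresholds
esxcfg-advcfg -g /LSOM/lsomLogCongestionLowLimitGB
esxcfg-advcfg -g /LSOM/lsomLogCongestionHighLimitGB
# Revert to the defaults after the crisis has passed
esxcfg-advcfg -s 16 /LSOM/lsomLogCongestionLowLimitGB
esxcfg-advcfg -s 24 /LSOM/lsomLogCongestionHighLimitGB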

In one recent case, these commands bought us a few hours to diagnose an underlying hardware issue and let us finish the working day without any further performance complaints. We ended up leaving the values at these levels rather than reverting them to default as recommended by PSS, as I don’t really see a downside if you’re proactive about monitoring for congestion generally. In our next post, Practical vSAN: Part 3 – Adjusting the Number of Background Resync Processes, we’ll show you how to change the number of simultaneous resync processes if your storage woes are being exacerbated by a background resync.

Practical vSAN: Measuring Congestion


VMware’s vSAN, while an amazing product in many respects, leaves something to be desired when it comes to troubleshooting issues in production. Many of the knobs and dials are “under the hood” and not exposed in a way that is obvious in a crisis. In this series of posts we’ll document some of the troubleshooting techniques and tools we’ve gathered over the last several months. As always, use these tools at your own risk; we highly advise engaging VMware PSS if you can.

Part 1: Measuring Congestion

In vSAN, as writes are committed to the cache tier, other writes are concurrently destaged to the capacity tier. In essence, the cache tier is a buffer for incoming writes to the capacity tier. If there is an issue with the underlying capacity tier, this buffer can start to fill. In vSANese, this is known as “log congestion.” In my opinion, congestion is one of the primary health metrics of a vSAN. If you start to experience persistent log congestion during a resync or other intensive IO operation, that’s a very good sign that some underlying component has a fault or that there is a driver/firmware issue. I should also note that log congestion does not always indicate a problem with the capacity tier; logs are also used when persisting data to the caching tier.

As an aside, the health check plugin reports this congestion on what appears to be a scale of 0 to 255, with 255 being 100% full (though support is unclear on this exactly).

As the log buffer fills up, vSAN starts to “add latency” to incoming write requests as a way to throttle them. When a resync is triggered, if the underlying storage can’t keep up for whatever reason, these buffers WILL fill. Once they hit a certain threshold (16GB by default), latency is added. More and more latency is added until a second threshold is reached (24GB by default), at which point incoming writes are completely halted until enough data has been destaged to resume operation. At this point your entire cluster may enter a kind of bizarro-world APD state, where individual hosts start dropping out of vCenter and virts hang and pause. Note, the only indication you will get from vCenter or the vSAN itself at this point is that log congestion is too high.

You can check the current size of these log buffers by running this command on each host in your vSAN:

#!/bin/sh
# For each cache-tier SSD in this host's vSAN disk groups, sum the two
# "Log space consumed" counters reported by vsish and print the total in GiB.
esxcli vsan storage list | grep -A1 "SSD: true" | grep UUID | awk '{print $3}' | while read i; do
  sumTotal=$(vsish -e get /vmkModules/lsom/disks/$i/info | grep "Log space consumed" | awk -F \: '{print $2}' | sed 'N;s/\n/ /' | awk '{print $1 + $2}')
  gibTotal=$(echo $sumTotal | awk '{print $1 / 1073741824}')
  echo "SSD \"$i\" total log space consumed: $gibTotal GiB"
done

Another useful tip is to save this into a shell script and use the unix command “watch” to observe these values over time (e.g. watch ./congestion.sh).
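For example, assuming you’ve saved the script above as congestion.sh:

chmod +x congestion.sh       # make the script executable
watch -n 5 ./congestion.sh   # re-run it every 5 seconds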

Again, any value below 16GB should not cause the vSAN to introduce latency. Any value between 16GB and 24GB is bad news. If you hit 24GB, you’re really having a bad time. Also watch these values over time: if they are generally moving upwards, that’s bad; if they are generally decreasing, you can start breathing again.
These values can be increased, which can buy you some breathing room in a crisis. You can read about that in our next post: Practical vSAN Part 2: Increasing Congestion Thresholds.

Spear Phishing: What you need to know


As your IT advisor, it is very important for us to remind you to never, ever click on attachments you are not expecting, and, if you are responding to a questionable email, to check the sender’s address for accuracy.

Here’s why:
Over the last few years, we have made excellent strides toward improving operating system security, and we have seen a decline in traditional computer viruses. However, there is still a lot of money to be made in the business of compromising your computer (or Virtual Desktop). As a result, there are a lot of people diligently trying to trick you into installing malicious software. We have all seen infected websites, usually via pop-ups, that try to trick you into thinking you have a problem that can only be fixed by installing some piece of software. If this has ever happened to you, hopefully you know to exit the web browser (Alt-F4 on the keyboard, rather than the X in the upper right) and never to install ‘security’ software from a random website.

Just as you should never trust website pop-ups, you should also be very careful about trusting your email. Our industry has spent many years developing very complicated software in an attempt to automatically remove things like spam, malicious software, and questionable web links from your incoming mail before you ever see it. This security software works remarkably well given how hard people are working to get around it. The fact is: email was never designed to be secure.

As security systems have improved over the years, smart attackers have shifted their techniques from attacking our filters and trying to get past them to more direct, personalized emails and contacts. Instead of poorly written emails that look like gibberish, we’re seeing well-written emails that reference people by name and occasionally mention specific details about your company gleaned from your website.

These social engineering techniques have been a part of the systems security landscape for a long time, but we’re now seeing enough of them that computer security folks have given the approach a name: spear phishing.

vSphere Appliance 6 update 1b issues – Regenerate your certificates!


Recently we decided to upgrade one of our clusters to the latest vSphere/ESXi 6.0 U1b.  While it’s early in the release cycle to apply these fixes to a production system, we’d been having some issues with this cluster that we hoped the upgrade would resolve.  The cluster has 4 hosts running vSAN for storage and is primarily used for VDI; we use App Volumes pretty extensively.  It uses the certificates that were generated when the appliance was deployed.

Last Friday night, we mounted the ISO and ran the update.  Following the update, things seemed fine for a while, other than an apparently cosmetic “This node cannot communicate with all the other nodes in the VSAN cluster” warning in the native VI client, which is documented in this VSAN forum discussion.  However, after approximately 2 hours online, the vpxd service would become unresponsive and the VI and Web Client tools would hang.  It would intermittently come back, but you could only click on one or two things in the VI Client before it became unresponsive again.  VCS servers would be unable to execute power operations.  If we rebooted the VCSA, it would be responsive for a few hours until hanging again.  None of this behavior presented itself before the upgrade to U1b.

We opened an SR with VMware support, but after hours of looking at logs, the best they could suggest was rebuilding the appliance from scratch and re-importing our database from the broken appliance.  Ultimately, we started looking at SSO as the probable cause of our issues: we noted our SSO logs appeared much larger than they should be, and at the moment the appliance became unresponsive, the SSO service started failing authentication requests.  Given that VMware’s solution was to rebuild the appliance, we decided we had little to lose by attempting to regenerate all of the certificates using the certificate-manager utility.  After giving that a go, the problem has resolved.  Our best guess is that one of the solution certificates was, to use the technical term, “borked,” and that U1b either has some new throttling in place or handles broken solution certificates differently than 6.0a.
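For anyone wanting to try the same fix: on the 6.0 appliance the utility lives at /usr/lib/vmware-vmca/bin/certificate-manager (run it as root from an SSH session and follow the menu prompts to regenerate the certificates). Snapshot the VCSA before you start.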

We’ll continue troubleshooting with VMware to attempt to determine the underlying cause and will update this post if we learn more.  It seems as though the common wisdom was true, at least this time: “If thouest have performance issues with vSphere, SSO and certificates are thy cause.”

VMware Horizon View: Things I wish I knew earlier


In working with VMware Horizon View over the years and through many different versions, we’ve often stumbled across problems or bugs that were worth buckling down and trying to resolve because the rest of the release was otherwise stable.  4.6 was a great release, 5.1 was the next version we stayed on, and now it looks like 6.1 will be around for a while.  Solutions for problems like USB ports not redirecting, or “source busy” errors when the console has been left logged in, have been very version specific.  However, through all of the versions of View we’ve used, there are a few things I wish I had known sooner that apply to all of them.

Linked clones can’t be moved between datastores without refreshing or expanding to a full virtual machine

Moving a linked clone pool in the View Admin console will automatically cause each Virtual Desktop to refresh, and there isn’t any other way to move them.  If you absolutely need to move a linked clone Virtual Desktop’s storage for some reason, you can blow it up into a full stand-alone Virtual Desktop and add it to a separate pool.

PCoIP has the potential to use tons of bandwidth – lock down per-client bandwidth limits

I’ve witnessed a full-screen video over PCoIP completely saturate a 50 Mbps circuit (and the video was still a little choppy).  No matter how small a remote office is or how much bandwidth is available, we always implement traffic shaping rules to prevent a single user from becoming an internet hog.  I usually start with a maximum of 40% of the entire circuit’s bandwidth per device for offices of 15 users or fewer; that way, it takes 3 devices to overload the internet connection.  This change has dramatically reduced the number of “slowness” complaints from users of our Virtual Desktops.
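To make the arithmetic concrete: on a 50 Mbps circuit, a 40% per-device cap allows each device at most 20 Mbps. Two fully saturated devices consume 40 Mbps and still leave headroom, while it takes a third to push demand past the circuit’s 50 Mbps total.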

If a user resets their Virtual Desktop while it’s starting up, it can send a Windows 7 machine into startup repair

We chased this issue for quite a while.  We would come across a Virtual Desktop where the user reported that they could not log in, and the machine was stuck in a startup repair loop.  What was happening: a user would restart their Virtual Desktop for whatever reason and then try to log back on right away.  Getting an error message along the lines of “no sources available,” they would click okay and be taken to a screen on their Teradici thin client with a “Reset VM” button.  The user would click the button, sending a reset command for their dedicated Virtual Desktop through the View Connection Server to the vSphere server, and the virtual machine would restart mid-boot.  The next time it came up, it would go into startup repair.  Once we figured out what was happening, the issue was easily resolved with the commands listed below.

From an elevated Command Prompt (cmd – right click – Run as administrator):

rem Tell Windows to skip automatic startup repair after an unclean boot
bcdedit /set {default} bootstatuspolicy ignoreallfailures

rem List the boot configuration to confirm the change
bcdedit /enum
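In the /enum output, the Windows Boot Loader entry should now list bootstatuspolicy as IgnoreAllFailures; if it doesn’t, the set command didn’t take (usually because the prompt wasn’t elevated).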


Why DevOps?


Running a managed IT service like ours means we generate a tremendous amount of logging and diagnostic data.  Most recently we’ve been using a fantastic little service called Papertrail to collect and present those logs in a searchable way.  As our service has grown, our needs have grown as well, and now we’re looking for new ways of making those logs useful to our ops team using more sophisticated analytics and heuristics.  Enter the ELK stack (Elasticsearch, Logstash, Kibana).  ELK has gained tremendous popularity in recent years for a variety of reasons, but we’ll save that for another post.  This post is about DevOps, and why it makes sense for even small organizations or individuals to learn it.  “DevOps” is a huge and hotly contested term, but loosely speaking it is the idea that you can treat infrastructure operations (“ops,” or systems administration) the way a developer treats code, using many of the same tools and patterns.  We’ll leave out the sociological and team management aspects for this post and focus on its benefits to the traditional systems administrator.

I started deploying our new ELK stack in my spare time as a little side project, and made the decision early on to use Docker (www.docker.io) to containerize the applications.  Docker, if you’re not familiar with it, is presently the poster child for the entire DevOps movement.  As I was working on this, I ran into a snag where I didn’t fully understand how ports within the containers are mapped through to the host operating system (the fix is sketched below).  This caused several hours of grumbles, until finally Matt Horton, another member of our ops team, came over to ask me what was wrong.  As I explained, I mentioned that I was using Docker for the deployment.  He said, “Oh, why didn’t you just stand up the services on a normal server?  Why are you using Docker for this?”

Which is a fantastic question.  At the time, the best I could manage was “well, not using Docker wouldn’t be very DevOpsy(tm), would it?”
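As an aside, the snag itself turned out to be simple: containers get their own network namespace, so a service listening inside a container is only reachable from outside if you publish its port with -p hostPort:containerPort. A minimal sketch of the pattern, assuming the official elasticsearch and kibana images (the container names are illustrative):

# Publish Elasticsearch's HTTP port (9200 inside the container) on host port 9200
docker run -d --name elasticsearch -p 9200:9200 elasticsearch
# Kibana listens on 5601 inside its container; map it to host port 5601
docker run -d --name kibana --link elasticsearch:elasticsearch -p 5601:5601 kibana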

But it got me thinking.  I’ve got the same feeling about the DevOps methodology that I had about server virtualization back in 2004, and I expect DevOps (regardless of the specific tool sets used) will have an even more profound effect on the way IT is delivered in the coming decades.  But that feeling isn’t a good enough answer to the question: why was I using Docker for something that could just as easily have been installed directly into the OS?  Why take a DevOps approach to solving this problem when it would have been faster and simpler to just install and configure the software by hand?

  1. Because I knew I would screw it up the first time I tried to get things deployed.  Having never deployed an ELK stack before, I knew that inevitably, after I got it set up the first time, I’d want to do it again, better.  Then again, even better.  The reality of IT operations is that most of our work is iterative; we don’t necessarily know the right way to do something until we’ve done it several different ways.  Using Docker and a DevOps methodology means that my environment is built from scratch every time I run it, so all I need to do to tweak something is change the file that specifies how the environment is built (see the sketch after this list).  This is an incredibly powerful tool for rapid optimization when you’re doing something you haven’t done before.
  2. It was incredibly easy to get started.  One of the beautiful aspects of the Docker ecosystem in particular is the extensive library of pre-built Docker images you can work from.  Getting started with the ELK stack was as easy as cloning a git repository from someone who had already done the work (if you are interested, this is what we used as a starting point for this project).  And per my next point, it was very easy to understand each step of building a functional ELK environment simply by examining the Dockerfile, rather than having to read a bunch of individual how-to documents and synthesize them into something that would ultimately be largely irreproducible.
  3. Your projects become self-documenting.  One of the most important things I learned early in my career was the importance of good documentation.  Checklists and notes all have tremendous value when supporting complex systems.  DevOps does away with the necessity for much of the tactical-level documentation ops is responsible for, as the tools generally define what should be done in lists of human- and machine-readable instructions.
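As a concrete sketch of the iterate-and-rebuild loop from point 1 (the image tag and container name are illustrative, not our real ones):

# Tear down the last attempt, rebuild the image from the edited Dockerfile,
# and bring up a fresh environment from scratch
docker rm -f elk
docker build -t symbio/elk .
docker run -d --name elk -p 5601:5601 symbio/elk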

Frankly, it’s more fun.  Most of us (hopefully) do this job because we love learning new things, designing systems and watching them work, and figuring out how to use those systems to solve problems for people.  While doing IT the old-fashioned way was also fun, being able to iterate on your designs this quickly really takes it to another level.