Virtualization Basics: Troubleshooting

on August 24, 2015

No more shooting from the hip

This week, I’ll take you through troubleshooting, a skill that allows admins to drill down into an incident to gain understanding about its underlying cause and any related effects. Last week, I went over remediation, describing it as the skill that allows virtualization admins to keep their jobs. If that’s the case, then troubleshooting allows you to build a bridge to your next career adventure. This is because troubleshooting requires you to essentially reverse engineer an incident to fully understand the what, how, why, when, and where of it.

Find the root cause of any issue

It’s a double-edged sword: the opportunities for gaining experience in troubleshooting only come through encountering and overcoming failures and incidents. Therefore, a virtualization admin must embrace every incident and failure as an opportunity to grow and show the world what he or she knows. In addition, troubleshooting is not only about finding out what happened and devising a solution to keep it from happening in the future. It’s also about uncovering the optimal solution among the N possible solutions for any given problem. This means looking at your priorities first, which could include cost, performance, or design considerations. Preventing issues makes your life easier; creating an optimal solution is what makes your boss – and your boss’s boss – happy.

Shoot the trouble out of your IT stack

As a virtualization admin, your focus should be on solving the right problem instead of chasing false positives. Time is money. Being able to quickly uncover the root of a virtual environment problem translates into being able to remediate it that much faster, and having that remediation last longer. Troubleshooting will provide you with a lot of experience, but it can also keep you in the data center long after business hours, and over weekends and holidays.

The basic troubleshooting flow to use in any situation includes the following steps:

1. Define the problem.
2. Gather and analyze relevant information.
3. Construct a hypothesis on the probable cause for the failure or incident.
4. Devise a plan to resolve the problem based on that hypothesis.
5. Implement the plan.
6. Observe the results of the implementation.
7. Repeat steps 2-6.
8. Document the solution.
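To make the loop in this flow concrete, here is a minimal Python sketch of the eight steps. Everything in it is an assumption made for illustration: the `troubleshoot` function, the `Attempt` and `Report` classes, the callback names, and the toy `state` dictionary are hypothetical and do not come from any vendor tool. It simply shows steps 2-6 repeating until an observation confirms the fix, with every attempt recorded for the documentation step.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Attempt:
    hypothesis: str
    resolved: bool


@dataclass
class Report:
    problem: str
    attempts: List[Attempt] = field(default_factory=list)

    @property
    def resolved(self) -> bool:
        # The incident is resolved only if the most recent attempt worked.
        return bool(self.attempts) and self.attempts[-1].resolved


def troubleshoot(
    problem: str,
    gather: Callable[[], dict],
    hypothesize: Callable[[dict, List[Attempt]], Optional[str]],
    plan_and_apply: Callable[[str], None],
    observe: Callable[[], bool],
    max_rounds: int = 5,
) -> Report:
    report = Report(problem)                                 # Step 1: define the problem
    for _ in range(max_rounds):                              # Step 7: repeat steps 2-6
        evidence = gather()                                  # Step 2: gather and analyze
        hypothesis = hypothesize(evidence, report.attempts)  # Step 3: hypothesis
        if hypothesis is None:                               # nothing left to try
            break
        plan_and_apply(hypothesis)                           # Steps 4-5: plan and implement
        ok = observe()                                       # Step 6: observe the results
        report.attempts.append(Attempt(hypothesis, ok))
        if ok:
            break
    return report                                            # Step 8: document the solution


# Toy usage with made-up data: the first hypothesis is wrong, the second is right.
state = {"vmotion_enabled": False}
candidates = ["port group label mismatch", "vMotion not enabled"]


def gather() -> dict:
    return dict(state)


def hypothesize(evidence: dict, attempts: List[Attempt]) -> Optional[str]:
    tried = {a.hypothesis for a in attempts}
    return next((c for c in candidates if c not in tried), None)


def plan_and_apply(hypothesis: str) -> None:
    if hypothesis == "vMotion not enabled":
        state["vmotion_enabled"] = True


def observe() -> bool:
    return state["vmotion_enabled"]


report = troubleshoot("vMotion fails on the new host", gather, hypothesize,
                      plan_and_apply, observe)
for attempt in report.attempts:
    print(attempt.hypothesis, "->", "fixed" if attempt.resolved else "not fixed")
```

The `max_rounds` cap is a design choice in this sketch: it reflects the point that an admin rarely has unlimited time to keep cycling through hypotheses.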

The problem with this approach is that it assumes a virtualization admin has unlimited time and resources to troubleshoot the root cause of every problem that arises. Unfortunately, time is a limited commodity for any IT pro. Fortunately, one way to shorten the standard troubleshooting procedural time is to leverage the skills of discovery, alerting, and remediation. After all, they form the foundation of the troubleshooting skill. Steps 1, 2, and 3 in the troubleshooting workflow are covered by discovery and alerting. Steps 4-5 are remediation, while steps 6-7 are exclusive to troubleshooting. Pairing these three skills with a proper IT tool can more quickly surface the single point of truth and provide the necessary insights to root-cause any issue in an efficient and effective manner across multiple stacks.

Follow these eight steps and you can troubleshoot anything. But, of course, there’s a caveat. As my first IT mentor always reminded me, the devil is in the details. In other words, this framework gives your approach consistency and rigor, regardless of the issue at hand. But the questions you ask, the performance data you analyze, and the logged events you focus on will vary, depending on the type of issue you encounter.

Essentially, your troubleshooting efforts will be customized for your multi-variable data center environment. To simplify things, you should try to reduce your troubleshooting analysis to a series of binary decisions: ask a question and the answer is either a 1/Yes or a 0/No.

Binary decision-making

One feature that VMware® vSphere™ admins use extensively to enable VM mobility is vMotion™. It allows admins to move a VM from one host server to another in the cluster. It’s quite useful for performing rolling maintenance on live servers in a cluster as well as re-distributing workloads across the virtual cluster.

Example issue: vMotion was working for the eight VMs in Cluster A, a 4-node cluster. After adding a fifth node to Cluster A, the virtualization admin discovered that vMotion failed to link the newly added host. The vCenter Server reports an error message that states that vMotion migration failed because the destination host did not receive data from the source host.

Let’s use the eight-step process combined with binary decision-making and consider the following facts: vMotion worked previously with the four existing host servers. It did not work with the new node. All five host servers are identical in terms of hardware configuration and software version number and licensing.
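Because the four original hosts work and the new one does not, one quick way to narrow the search is to compare the new host's settings against a known-good host. The sketch below is only an illustration: the `config_diff` helper and the setting names and values in `host1` and `host5` are hypothetical sample data, not output from vCenter or any real API.

```python
def config_diff(known_good: dict, candidate: dict) -> dict:
    """Return {setting: (known_good_value, candidate_value)} for each setting that differs."""
    keys = known_good.keys() | candidate.keys()
    return {
        key: (known_good.get(key), candidate.get(key))
        for key in sorted(keys)
        if known_good.get(key) != candidate.get(key)
    }


# Hypothetical settings collected by hand from a working host and the new host.
host1 = {
    "vmotion_enabled": True,
    "vmotion_portgroup": "vMotion",
    "vswitch_type": "standard",
}
host5 = {
    "vmotion_enabled": True,
    "vmotion_portgroup": "vmotion",
    "vswitch_type": "standard",
}

diff = config_diff(host1, host5)
for setting, (good, new) in diff.items():
    print(f"{setting}: known-good={good!r}, new host={new!r}")
```

Any setting that shows up in the diff becomes a candidate for the hypothesis in Step 3.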

Step 1: Define the problem: vMotion is not working on the new host server.

Step 2: Gather and analyze: vMotion fails only on the newly added fifth host server. Use the error messages from the vCenter logs as a guide.

Step 3: The hypothesis: vMotion was incorrectly configured on the new server since the other four are still working.

Steps 4 and 5: Apply the binary decision tree to test the hypothesis from Step 3.

1. Is vMotion enabled?
   a. If yes, go to 2.
   b. If no, enable it and go to 2.
2. VMkernel setup: Is a standard vSwitch used?
   a. If yes, go to 3.
   b. If no, go to 5.
3. Is vmkping working correctly via the VMkernel network selected for vMotion?
   a. If yes, go to Step 7.
   b. If no, go to 4.
4. Are the network and port group labels the same for host 5 as the other servers?
   a. If yes, go to Step 7.
   b. If no, fix the labels, go to Step 6.
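The questions above can be written down as a small lookup table, numbering them 1 to 4 in the order they appear. The following Python sketch is an assumption-laden illustration rather than a real diagnostic tool: the `TREE` table and the `walk` function are hypothetical names, and the yes/no answers are supplied by hand instead of being read from a host. The branch to question 5 is kept as a plain outcome because that question is not listed in the text.

```python
# Each entry: question number -> (question, action if yes, action if no).
# An int action means "go to that question"; a str action is an outcome.
TREE = {
    # On "no" for question 1, the admin enables vMotion first, then moves to 2.
    1: ("Is vMotion enabled?", 2, 2),
    2: ("Is a standard vSwitch used?", 3, "question 5 (not listed in the text)"),
    3: ("Does vmkping work over the vMotion VMkernel network?", "Step 7", 4),
    4: ("Do the network and port group labels match the other hosts?",
        "Step 7", "fix the labels, then Step 6"),
}


def walk(answers: dict) -> tuple:
    """Follow the tree with answers {question_number: bool}; return (path, outcome)."""
    node, path = 1, []
    while isinstance(node, int):
        _question, if_yes, if_no = TREE[node]
        path.append(node)
        node = if_yes if answers[node] else if_no
    return path, node


# Hand-entered answers: vMotion is enabled, a standard vSwitch is used,
# vmkping fails, and the labels on host 5 do not match the other hosts.
path, outcome = walk({1: True, 2: True, 3: False, 4: False})
print("questions asked:", path)
print("outcome:", outcome)
```

Keeping every answer to a strict yes or no is what makes the path through the tree reproducible, so the same table can be reused the next time a host is added.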

Step 6: Observe the results: Does vMotion work?

If yes, go to Step 8.
If no, go to Step 7.

Step 7: Go back to Step 2 and devise a new hypothesis for Step 3. For instance, vMotion can also fail when a VM has a locally mounted CD/ISO. Remember to disconnect it before attempting vMotion.

Step 8: Document resolution.

And that’s troubleshooting in a nutshell.

The bridge to utility

Acquiring and using troubleshooting skills allows virtualization admins to cross the career bridge. Trial by fire and overcoming IT incidents sharpen IT pros’ skills and give them many qualities that are valued by organizations inside and outside of IT. These qualities include decision-making under duress, root-cause analysis, and a firm understanding of the entirety of the stack. If you choose to leave IT, and you are a skilled, analytical problem-solver, you could consider technical marketing, pre-sales engineering, and people or product management as real career possibilities. If you choose to advance up the IT ladder, you can explore engineering design, architecture, and strategy that encompasses security, optimization, automation, and reporting. Master the four skills of discovery, alerting, remediation, and troubleshooting and you’ll be ready to soar in virtualization!
