Virtualization Basics: Remediation

on August 14, 2015

Fix It Now: It’s how you keep your job

Last week, I covered alerting as a skill that allows virtualization admins to quickly surface information about potential or likely incidents. In theory, what is surfaced is the single point of truth. But an ugly fact of reality is that no matter how good your alerting may be, things can still break. This week, I’ll take you through remediation, a truly vital skill for any IT admin, because, as my fellow Head Geek Thomas LaRock so eloquently stated, “As a systems administrator, you get paid for performance, but you keep your job with recovery.”

Remediation means fixing the problem

Your job is on the line when **IT happens. The lone objective of remediation is to get back to a working state as fast as possible, no matter what caused the problem. Getting back to a working state could mean fixing things so that they deliver acceptable application QoS, or it could mean bringing VMs and applications back online. Whether it’s a service, an application, a VM, or a system, for a virtualization admin, this is a race against time. Every second an application or system is down, or even just slow, means that less work is being done. Employees are less productive, customers are less satisfied, fewer sales are being recorded, and less revenue is being generated. Because all of these things affect the bottom line, admins should view job security in terms of how fast they can fix these complex problems. The quality of your fix determines your reward.

**IT happens – Stand and deliver

All heck is breaking loose in the virtual data center. Tickets are piling up in your mailbox, and most of them are completely indecipherable. Other IT teams responsible for the stack, including those tasked with managing servers, storage, networking, and applications, have gone into blame-game mode and have placed the “X” squarely on the virtualization team. They claim, “Well, it was never slow when it was running on physical.” And worst of all, management, adjacent teams, and business owners are breathing down your neck, second-guessing everything that you’re doing. (The world is watching and the clock is ticking. Do you cut the blue wire or the red wire?!?)

The only thing that matters at this moment is making the world right. It doesn’t matter what you do or how you do it; you just have to get things working again. If you can’t, you just might be looking for a new job. This is the kind of pressure a virtualization admin in full remediation mode experiences. The very best admins will fix issues using the simplest and most straightforward means, making sure they add little to no additional overhead to their virtualized environments.

Three magic words for when you’re feeling (IT) queasy

Take a deep breath and repeat these three magic words: Stop. Drop. Roll. Yes, these are the same steps to take if you’re on fire. They work for IT virtualization fires as well.

Stop
1. Assess the situation.
2. Focus on the steps that will lead to resolution.
Drop
1. Drop all distractions, such as unnecessary and unconnected services and processes.
2. Remove all unnecessary pseudo-IT chefs from the virtual kitchen. This means anyone not directly responsible for or connected to the stack you are trying to restore.
Roll
1. Roll out your recovery plan to get your systems, apps, and VMs back in working order.
2. Monitor key performance indicators to make sure the systems and apps are stable following the fix.
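
The second Roll step, confirming stability after the fix, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the KPI names and threshold values below are hypothetical placeholders, not values from any particular monitoring product, and the readings are assumed to come from whatever tool you already use.

```python
from statistics import mean

# Hypothetical KPI thresholds (assumptions for illustration only).
# Replace the names and limits with the ones that matter in your environment.
THRESHOLDS = {
    "cpu_ready_pct": 5.0,
    "datastore_latency_ms": 20.0,
}


def is_stable(samples, thresholds=THRESHOLDS, min_samples=3):
    """Return (stable, breaches) for KPI readings collected after a fix.

    samples maps a KPI name to a list of recent readings. A KPI counts as a
    breach when it has fewer than min_samples readings or when its average
    reading exceeds the threshold.
    """
    breaches = {}
    for kpi, limit in thresholds.items():
        readings = samples.get(kpi, [])
        if len(readings) < min_samples:
            breaches[kpi] = "not enough data"
        elif mean(readings) > limit:
            breaches[kpi] = f"avg {mean(readings):.1f} > {limit}"
    return (not breaches, breaches)
```

Calling `is_stable` with a few readings per KPI returns a flag plus the list of KPIs that still look unhealthy, which tells you whether the fix held or whether you are still in remediation mode. Averaging a handful of readings, rather than trusting a single one, is a deliberate choice so that one lucky sample right after the fix does not declare victory too early.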

What makes remediation gold?

Remediation is really about rocking that roll phase. Remember: the goal of remediation is to restore your system to a good working state. Highly available architectures, good backup plans, and disaster recovery techniques are the fundamental keys to remediation success.

High availability (HA) implementations allow virtualization admins to absorb some level of degradation or failure while buying time to work on incidents and issues. An example is VMware® HA. When a host server in the cluster fails, the VMs on it automatically restart on another host server in the cluster. It can even detect a guest OS failure and restart the affected VM on the same host server. Leveraging your discovery and alerting skills alongside your knowledge of HA technologies and applications will help you realize your full potential in remediation.
Backup plans are the next line of defense for virtualization admins as they work to minimize the mean time to resolution. The only guarantee in IT is that something will change, and if it changes for the worse, you have to be prepared to deal with it. The key to backup plans is being able to identify the steps in a complete install/deployment that are most time-intensive or most painful to deal with in the event of a failure or disaster. Once you identify these, you can create checkpoints that allow you to start at a point further along in the deployment cycle. Test your backup plan to make sure that it works as designed.
Disaster recovery (DR) brings the systems, applications, or VMs back into working order, using your backup plan as the basis. The better the DR plan and process, the better the recovery rate.
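
The checkpoint idea from the backup plan discussion above can be made concrete with a small Python sketch. Everything in it is an assumption for illustration: the step names and durations are made up, and "checkpoint after the slowest steps" is just one simple heuristic for deciding where to capture a restart point, not a prescribed method.

```python
def plan_checkpoints(steps, max_checkpoints=2):
    """Pick the slowest deployment steps as places to take a checkpoint.

    steps is an ordered list of (name, minutes) tuples. The returned names
    keep their original deployment order.
    """
    slowest = sorted(steps, key=lambda step: step[1], reverse=True)
    chosen = {name for name, _ in slowest[:max_checkpoints]}
    return [name for name, _ in steps if name in chosen]


def minutes_to_rebuild(steps, checkpoint=None):
    """Minutes of work left when restarting right after `checkpoint`.

    With no checkpoint, the whole deployment has to be repeated.
    """
    if checkpoint is None:
        return sum(minutes for _, minutes in steps)
    names = [name for name, _ in steps]
    start = names.index(checkpoint) + 1
    return sum(minutes for _, minutes in steps[start:])


# Hypothetical deployment steps with durations in minutes.
DEPLOYMENT = [
    ("install_os", 45),
    ("patch_os", 30),
    ("install_app", 20),
    ("restore_data", 60),
    ("configure_app", 10),
]
```

With these made-up numbers, a full rebuild takes 165 minutes, while restarting from a checkpoint taken after `restore_data` leaves only the 10-minute configuration step. Comparing `minutes_to_rebuild` with and without each candidate checkpoint is a quick way to see which checkpoints actually shorten the mean time to resolution.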

Virtualization admins who remediate efficiently and effectively get to keep their jobs. Remediation is the act of going from failure to recovery. Leveraging the skills of discovery and alerting makes a virtualization admin’s job easier when resolving any incident. Once you’ve restored the environment to an acceptable state, it’s time to transition into troubleshooting mode, which will lead you to the root cause of the incident.

Next week, I’ll discuss the troubleshooting skill.
