The importance of testing and redundancy

When most of us think of redundancy, we think of redundant SAS arrays, multiple power supplies, WAN links or even diverse power feeds. As our job is primarily technical in nature, we typically think of the technical things and address those first, whilst some of the other areas get forgotten.

What about those other forms of redundancy and why are they important? 
 
What other forms, you ask? Well, let me take you through a recent site visit and address some things we could do to both improve the customer experience and lessen your out-of-hours calls.
 

The Failure

The customer called our firm advising the power had gone out for half the office. This customer is a call centre, so they have a reasonably sized UPS that runs the server room as well as the call centre section of the office. Some troubleshooting showed the UPS had failed and had taken the servers/call centre with it.
[Image: “Everybody stand up, collect your PC and walk towards the light”]
 
Now here’s where the non-technical issues pop up, turning something that’s simple to fix into a real big problem!
There were no IT staff onsite and no senior management. All we had at our disposal were some call centre agents and a team leader, so no-one was able to open the server room for us to troubleshoot over the phone. The staff were sent home early. I live about 50km from the site and started heading in when the issue was escalated to me.
 

The small problem that just got very big

After visiting our head office, collecting this customer’s keys from the keysafe and arriving at 10:30PM in the nice cold and wet Melbourne weather, the smallest of oversights stopped resolution of the problem. This office is secured with a swipe card and a backup key; either can typically open the doors. I swipe the card, the reader acknowledges it’s a good card, but there’s no satisfying click of the lock. The power supply for the locks was on the UPS, so there was no power! Argh! No card access, no building access. Okay, time to use the key!
[Image: “It’s about this point I threw a tantrum not too dissimilar from my four-year-old’s”]

 
I became very unprofessional upon discovering the provided key didn’t open the outer door; I had effectively driven out in the middle of the night to accomplish nothing. Our two escalation points, the CEO and the CFO, were out of the country, and we had no idea who was opening the office in the morning.

Luckily, another key was located (after a 100km round trip) and the site was restarted, but the incident highlighted a few deficiencies.
 
The issues and resolutions
So now we have a site that was offline for a few hours, staff who were sent home early, and potential revenue that was lost. This raises the age-old question: “Why did this issue occur and how can we prevent it in future?”

These are only the examples raised by this outage, but they show the kinds of issues some of us may never think about when looking after an environment.

1.) There was no way to let staff into the server room in an emergency. 

This prevents access to the server room by remote hands. We could have avoided a site visit altogether if a key had been located onsite. Imagine if the server room were on fire: all our nice expensive equipment would be burning with a nice shiny fire extinguisher sitting outside. Now for the solution. I can see the comments already: “Hide a key somewhere and tell them when needed.”
This is all fine and dandy until there is a failure and someone learns where the key is. There are reasons the server room is locked in the first place, remember? Consider that if one staff member knows where the key is, all 200 know. Kinda defeats the purpose of a key, doesn’t it? Worse still, management re-assigns the key and it’s not there when you need it.
 
We solved the issue by installing a small keysafe; we generate a random code every time a user requires access to the server room (not often). If there is a failure, we instruct the users to open the safe with the current code, and when IT staff are on site the next day the code is changed and the documentation updated.
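If you’d rather script the rotation than invent codes by hand, something like the following does the job. This is a minimal sketch, assuming a keysafe that takes 6-digit codes; the length is an assumption, so match whatever your hardware supports.

```python
# Minimal sketch: generate a fresh one-time code for the physical keysafe.
# Assumes a keysafe that accepts 6-digit codes; adjust to suit your hardware.
import secrets

def new_keysafe_code(digits: int = 6) -> str:
    # secrets (not random) so the next code isn't predictable from the last one
    return "".join(secrets.choice("0123456789") for _ in range(digits))

if __name__ == "__main__":
    print(f"New keysafe code: {new_keysafe_code()} (remember to update the documentation!)")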
 

[Image: “HA HA HA! All your keys are belong to us”]

2.) Not all keys are the same

When we obtained our keys from the client, we tested all our RFID cards but not all the keys on the main door. One set was a master and the others were not. Not knowing this, the junior engineer, who is typically onsite during business hours, had the master set, and the general set was in our key safe.
Even if you do all have the same key, there can be minute differences between them. I would recommend you test all of them; make sure you can open everything manually if you have to.
In this case the solution was simple. The junior only needed to be able to open the server room, as the site is typically unlocked by the time he arrives, so we swapped keys.


[Image: “We’ve been going out long enough, here are my keys!”]

3.) We didn’t know what else was affected by the loss of power

The door locks don’t run off the built-in alarm battery, hence the RFID cards were effectively useless. Sure, the alarm and security system have their own battery backup, but the power for the door solenoids does not. The main gate also requires power to open; luckily the UPS didn’t run this, but it’s another thing to consider. This wasn’t a show-stopping issue here, as we had a key. It was, however, a 100km round trip, so it added considerably to the restore time.
I would recommend, if at all possible, that you test what happens when your UPS is off. What happens when a UPS battery blows up and becomes a contestant on The Biggest Loser?
[Image: Leeloo: “Big badaboom!” Corbin: “Big, big badaboom”]
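You can take some of the guesswork out of that test with basic monitoring. Here’s a rough sketch, assuming Network UPS Tools (NUT) is installed and the UPS is configured there under the name rack1; the name and the 30-second interval are illustrative only.

```python
# Rough sketch: poll a NUT-managed UPS and flag when it drops to battery.
# Assumes NUT's upsc client is installed and a UPS named "rack1" is configured.
import subprocess
import time

def ups_status(ups: str = "rack1@localhost") -> str:
    result = subprocess.run(["upsc", ups, "ups.status"],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()  # "OL" = online, "OB" = on battery, "LB" = low battery

if __name__ == "__main__":
    while True:
        if "OB" in ups_status():
            print("UPS is on battery - kick off your notification/shutdown runbook")
        time.sleep(30)
```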
 
4.) The redundant power supplies on the servers were connected to the same UPS
This can trip up a number of newcomers to the IT field. You think, “Great, everything is connected to the UPS, should be fine.”
I can tell you the UPS is really just there to stop a power spike or a brownout taking the servers offline. It’s not designed to be bulletproof or to magically keep things online for hours, especially anything small enough to plug into a power socket. Some sites we administer run multiple UPSs, distributing load between them to try and work around this. If you do go down this route, make sure you configure your servers’ power supply settings in the BIOS accordingly.
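While you’re at it, confirm that both power supplies actually report healthy, so you catch a dead feed before an outage does. A hedged sketch, assuming ipmitool is installed and your BMC exposes PSU sensors (sensor names vary by vendor):

```python
# Hedged sketch: list power supply sensor status via ipmitool so you can spot
# a PSU that has lost its feed. Assumes ipmitool is installed and can see the BMC.
import subprocess

def psu_sensor_lines() -> list[str]:
    result = subprocess.run(["ipmitool", "sdr", "type", "Power Supply"],
                            capture_output=True, text=True, check=True)
    return result.stdout.splitlines()

if __name__ == "__main__":
    for line in psu_sensor_lines():
        # healthy entries typically read "Presence detected"; anything else warrants a look
        print(line)
```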
Another pain in the rear to think about here, and this applies to smaller shops specifically, is that switches/routers/firewalls typically only have one power connection. So unless you have a bottomless pit of income (or you still live in the dotcom bubble), you can’t dual-home everything. Go ask your boss for two routers smart enough to support VRRP/HSRP, two internet connections, two core switches with stacking, etc., and see how far you get. If you have fewer than 50 staff, you typically won’t get far.
Lastly, the UPS is a beast on its own and can be the cause of many failures. I’ll create another post specifically for this soon.
[Image: “Where the hell does the second power plug go?”]
 
 
Some technical issues arose from this site visit as well. We will dive into those more deeply another time, but for now:
 
1.) It turns out that someone forgot to add the VMware guests to the autostart list, so these had to be restarted manually (a hedged sketch for auditing that list follows the second item below). A word of warning here: some companies run their primary domain controller as a guest, and vCenter Single Sign-On won’t work if that guest is offline, so make sure your ESXi root passwords are documented!
 
2.) The machine used for remote management was also a VMware guest. Thus, even if someone had got access to the server room and reset the UPS, we would have spent valuable time reconfiguring a user’s desktop for remote access.
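For the autostart problem, it’s worth scripting a check rather than trusting memory. Here’s a minimal sketch using pyVmomi, assuming a single standalone ESXi host; the host name and credentials are placeholders, and the certificate check is disabled for lab use only.

```python
# Minimal sketch: flag VMs that are NOT set to power on automatically.
# Assumes pyVmomi is installed and a single standalone ESXi host; the host
# name and credentials below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def vms_missing_autostart(host_name: str, user: str, pwd: str) -> list[str]:
    ctx = ssl._create_unverified_context()  # lab only - validate certs in production
    si = SmartConnect(host=host_name, user=user, pwd=pwd, sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.HostSystem], True)
        host = view.view[0]  # single-host assumption
        power_info = host.configManager.autoStartManager.config.powerInfo or []
        auto_on = {info.key for info in power_info if info.startAction == "powerOn"}
        return [vm.name for vm in host.vm if vm not in auto_on]
    finally:
        Disconnect(si)

if __name__ == "__main__":
    for name in vms_missing_autostart("esxi01.example.local", "root", "changeme"):
        print(f"WARNING: {name} will not start by itself after a power loss")
```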
 

In Summary

As IT pros become responsible for more and more systems these days, it’s increasingly important to consider what happens when a given piece of hardware fails. You should be able to walk into the server room, point at any piece of equipment and answer the question “What happens if that lets out the magic blue smoke?” with confidence, because you have tested what happens when it fails.

Testing itself is another area overlooked by admins, especially with backups (that’s another post), but you should use your maintenance windows to actually test that your theories are correct.
Ask yourself everything; sit down and have a think about all the possibilities. Here are some examples:

  • If all the servers go down, does the alarm still work?
  • Do your monitoring systems still send notifications if your core router goes down? What about your mail server? (See the sketch below.)
  • What if you’re stuck in the middle of nowhere and something goes offline? Does anyone else have the access, or the smarts, to get things back online?

And a big one I’ll quickly touch on here, but something else we will flesh out in another post:

  • If you are a smaller shop and you’re the only IT resource, what happens to the business if you’re reading your phone and get run over by a bus?

 

 
[Image: screen grab from Get Smart, agent getting hit by a bus. “That’s not how you catch a bus”]
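On the monitoring question above: one answer is a dead-man’s switch, where the monitoring box pings an outside service on a schedule and the outside service alerts when the pings stop. A rough sketch; the URL below is a placeholder, not a real endpoint.

```python
# Rough sketch of a dead-man's-switch heartbeat: an external service alerts
# when these pings stop arriving. The URL is a placeholder, not a real endpoint.
import time
import urllib.request

HEARTBEAT_URL = "https://heartbeat.example.com/ping/site-a"  # hypothetical

def send_heartbeat() -> None:
    try:
        urllib.request.urlopen(HEARTBEAT_URL, timeout=10)
    except OSError:
        # nothing to do locally - the external service alerts on our silence
        pass

if __name__ == "__main__":
    while True:
        send_heartbeat()
        time.sleep(60)  # set the external alert threshold to a few multiples of this
```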

 

We could go down this path for hours, and I think we will delve deeper into planning redundancy at a later stage. But the main thing to take away from this is that non-technical things can pose very big problems, turning a small technical issue into a massive one for you and your company.
 
Feel free to use this example to make a business case if you’re an external consultancy or do department-to-department billing, and get testing; if not, definitely add this to your list of things to do. Remember to plan a nice big window so that if something doesn’t go as expected, you can resolve, document and implement solutions to the issues found.
 
Have you had the simplest of oversights cause you grief? Post your experiences below.
-JA

 
