Set Notifications For When Things Get Hairy

Let me paint a scenario for you. Imagine you’ve got a nice big server with about 48 drives, and one of those drives goes out while you’re in Nassau with your feet in the sand. You’ve got your phone, but that’s not doing you any good, because that fancy monitoring system you set up only monitors network connectivity. You’ve got RAID in place, but now the array is running degraded and leaning on parity to keep up. How did it come to this? What could have been done differently to prevent this scenario?

I was in a similar situation at my job, minus the feet in the sand. At the time, we had outsourced our IT to a company that had neglected to set up any monitoring beyond some basic network monitoring.

Our outsourced IT’s thought process on redundancy.

The first person to notice the problem was me. As I was walking by the server room, I noticed some amber lights. Those little things can give a grown man a heart attack. Sure, we had spares, but had another drive failed, we would have had to do a full restore from backups, which at that point would have stopped all work in the office for at least a day.

Perhaps the most frustrating part of this scenario was how easily it could have been avoided. VMware’s vSphere server makes it trivially easy to react any way you want to any event, and it provides a plethora of events to choose from. Once you’ve picked a “trigger,” you can pick one or more actions, which include everything from running a command, to sending an email, to shutting down the server gracefully.
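To make the trigger-and-action idea concrete, here is a minimal Python sketch of the pattern. To be clear, this is not vSphere’s API. The event names, the `AlarmRule` class, and the action functions are all made up for illustration, and the actions only return a description of what they would do instead of actually sending mail or shutting anything down.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# An action takes (event name, detail) and returns a description of what it did.
# Hypothetical stand-ins: nothing here sends mail or touches a real host.
Action = Callable[[str, str], str]


@dataclass
class AlarmRule:
    """One trigger paired with one or more actions (illustrative only)."""
    trigger: str
    actions: List[Action] = field(default_factory=list)


def email_admin(event: str, detail: str) -> str:
    return f"email: [{event}] {detail}"


def run_command(event: str, detail: str) -> str:
    return f"command: handle-{event}"


def graceful_shutdown(event: str, detail: str) -> str:
    return f"shutdown: host stopping after {event}"


def dispatch(rules: List[AlarmRule], event: str, detail: str) -> List[str]:
    """Run every action attached to rules whose trigger matches the event."""
    results = []
    for rule in rules:
        if rule.trigger == event:
            for action in rule.actions:
                results.append(action(event, detail))
    return results


# Assumed example rules; the event names are invented for this sketch.
RULES = [
    AlarmRule("disk_failed", [email_admin, run_command]),
    AlarmRule("host_overheating", [email_admin, graceful_shutdown]),
]

if __name__ == "__main__":
    for line in dispatch(RULES, "disk_failed", "bay 12 offline"):
        print(line)
```

The point of the pattern is that one event can fan out to several reactions, so a failed drive can email you on the beach and kick off a command at the same time.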

Now, this is for smaller environments. As you scale up, your monitoring and notifications should be centralized on a syslog server. But for a 3-host, 15-VM environment, that’s a bit overkill.

This may seem like sysadmin 101, but it’s overlooked by so many administrators it’s scary. And don’t make the mistake of thinking that just because you’ve set up some parity with RAID, you’re covered for all scenarios. RAID should be your last line of defense, not your first. So please set up some decent monitoring, and save everyone time and headaches.

And as always, check to make sure you have good backups. Then check again.
