Server Issues

Some of you may have noticed that this site was down for about a day. Oh, who am I trying to kid? Three people noticed, if that, and most of you are probably reading this going, “Oh, you had issues? I guess so, you’re writing about them.”

Since we’re all here on the same page, let’s continue with the postmortem, shall we?

What Happened?

On Sunday morning, a little after 9:30 AM, I noticed that I was no longer getting email from my server, which was kind of a bummer as I had just gotten a cert set up for IMAP+SSL. I started poking around, doing the usual pings and traceroutes. It was clear that I wasn’t having a connectivity issue to the internet; this was an issue with my server. Web was down, SMTP, SSH, everything. Ugh, not good. This server had issues about 6 months earlier, when half of its RAM died. I should have seen this as a sign to move everything off right then and there. But I didn’t.

Luckily for me, this server is hosted by my old company, Solinus, who make the kick-ass MailFoundry Anti-Spam appliances. So I had a good buddy take a look at this box, and he noticed a few red flags right from the beginning. First and foremost, it was off. Yikes! After he booted it up, it took a LONG time just to get to a BIOS screen that would then halfway load a LILO screen before freaking out again and going back down. And this happened only once. After that, it simply wouldn’t POST. My friends, we have a system board failure.

Now, this is a sandbox server for me. It’s used for backups and hosting a few low-volume blogs for myself and a few friends. It’s also pretty old when it comes to servers and their lifetime. I bought this off of eBay back in the fall of 2006. It was several years old when I got it. When the system board finally gave up the ghost, it was at least 6 years old, if not 7 or 8.

Recovery

So what do you do when you first lose a server and know it’s a lost cause? Well, you go to your backups, of course. Have you tested your backups lately? Most people forget this all-too-critical step. Sure, you know that your backups are running, but do you know if you could really recover from them?

Luckily, I knew I had a good backup, as I had to recover a file from it last week after accidentally removing it. Whew!

Recovering the files wasn’t too bad. The only downside is that the files are backed up to a server I have sitting here at home. I have a cable modem connection with great download speeds. Guess what it doesn’t have…great UPLOAD speeds. So moving around the couple GB of data has been a bit of a pain, but once it was on the server, I was able to parse out the data pretty quickly.

Lessons learned

  • Always test your backups: I really can’t stress this one enough. Just because your backup ran doesn’t mean that you have a good backup. Try putting that data on another server and building up a clone of your current box.

  • Documentation: As with our previous example, can you restore the data to another server? Can you configure anything that you missed without looking at the live server? If it’s a one-of-a-kind server, you may not have all those permissions documented properly. It’s key to have good documentation of how things were installed and configured, not just the data that you were hosting.

  • Recovery Plan: Do you know where you are going to restore/rebuild that server? Do you have spare hardware? A VM, maybe? It helps to have those plans in place so you know exactly where you can shift loads and get things going again. Sure, this was a dev box for me, but I still like to have an idea of where things are going to go so my downtime is minimal.

  • Did I mention backups? Yup, it’s that important, so it goes on the list twice 😉
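
To make "test your backups" a little more concrete, here is a rough sketch of one way to check a restore: pull the backup onto another box, then compare checksums of every file against the original tree. This is not what I actually ran. The function names (file_digest, tree_digests, compare_trees) are made up for illustration, and it assumes Python 3.9 or newer and that both trees are reachable as local paths.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root: Path) -> dict[str, str]:
    """Map each regular file under root (by relative path) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_trees(source: Path, restored: Path) -> dict[str, list[str]]:
    """Report files missing from, extra in, or different in the restored tree."""
    src = tree_digests(source)
    dst = tree_digests(restored)
    return {
        "missing": sorted(set(src) - set(dst)),
        "extra": sorted(set(dst) - set(src)),
        "changed": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }
```

Calling compare_trees(Path("/some/live/dir"), Path("/some/restored/dir")) with your own (here hypothetical) paths gives back three lists. If all three are empty, the restore matches the source byte for byte. It says nothing about permissions, ownership, or config you forgot to back up, which is where the documentation point comes in.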

Final Thoughts

Overall, my server disaster wasn’t as bad as it could have been. I lost a few hours recovering data, but in the grand scheme of things, I lost very little of the important data. The one thing I didn’t have backed up was my home directory. There wasn’t too much that was really lost there: a little bit of mail and a few random scripts, but nothing that I would consider the end of the world.