I’ve been meaning to write this up for the past week and just haven’t had the time.
Last weekend, I was thrilled to get a call at midnight from someone on my team telling me that one of our primary authoritative DNS servers was having issues. We have three, so this typically isn’t a huge deal, but seeing as this is the one the others sync from, it tends to get a higher priority.
Unfortunately, the thing was in bad shape. I could ping it, but I couldn’t connect with SSH and it wasn’t responding to DNS queries. Connecting to the KVM showed signs that things were not right with this server. I couldn’t log in, mainly because the system appeared to have run out of processes. Another window showed that /var was missing.
The only thing I could do was reboot the thing. This was going to make for a fun night. For those not in the computer industry, a server in this state is kind of like walking into the ER. You know it’s bad and you’re going to waste most of your time waiting around for *something* to happen. And since my alternatives for the evening were going to sleep and dreaming about warm beaches or winning the Stanley Cup, I figured let’s royally screw up my night and see how far down the rabbit hole this thing goes.
Let’s put it this way: at 3 AM I opened a ticket with IBM, as I had exhausted all my tricks and was getting some pretty nice error messages. At 4 AM, when the IBM tech had exhausted his ideas, we were at the point of replacing the system board. Now, during the hour I was on the phone with the IBM tech, there were several 5-10 minute spans where I was back in the data center while he waited. There was no calling back, no fumbling through phone trees to get back to him; he just waited. It was nice to know that I didn’t have to screw around getting back in touch with this tech.
We pay for the uber-fast replacement support, so IBM had 4 hours to get me a new system board. I had told the tech at 4 AM that I was fine if they showed up at 7 so I could get some sleep first. He said that they would give me a call when they were headed out. OK, no problem, I can get a power nap in and head back out.
Unfortunately for me, but fortunately for the company, I got a very fast response. By the time I fired off some emails, checked with the nighttime NOC guy and drove home, it was 5 AM when my head finally hit the pillow. I got a call at 5:15 AM from the tech who was going to work on the server. Another call at 5:25 AM from the guy delivering the part. Another call at 5:45 AM from our NOC guy, stating that the part was there, and then I was coordinating a call to the first IBM tech and then to someone else on my team (the same poor bastard that called me) to meet the guy out there. The system board was pretty much replaced by 8 AM, and I was too out of it to resolve the issue until I woke back up at 11. But damn, IBM was fast. Yes, we pay for the high-end support, but we also live in Des Moines, Iowa. It’s not like we are in Chicago or San Francisco. I figured that there would be a bit more turnaround time. But IBM proved to be professional and very speedy in getting the system board replaced and our server back up and running.