A brief outage

Well, it had to happen at some point. My first outage since moving my website to <a href="https://www.linode.com/?r=31b44f454740d06a1014351fded0ede51d25fa33" target="_blank">Linode</a> happened last week. Here is a little bit of what happened and how I recovered.

What Happened?

Usually when I check into a server while doing whatever is on the todo list for the day, I will also check if anything needs to be updated. This time, there were a lot of updates, so it looked like there must be a new release of CentOS. I checked their site and that was exactly the case. CentOS 8.2105 had been released.

So I kicked off the upgrade. Usually this is kind of a no-op for me: let it run, maybe it needs a reboot, and away you go. I’ve gotten to the point where I will often automate this on a system and not pay attention to it at all.

While the updates were scrolling by off to the side as I partially paid attention to a meeting, I noticed the following error messages starting to pop up.

Running transaction
  Preparing        :                                                        1/1
  Running scriptlet: glibc-langpack-en-2.28-151.el8.x86_64                  1/1
error: Couldn't fork %triggerun(systemd-239-41.el8_3.2.x86_64): Cannot allocate memory

Error in <unknown> scriptlet in rpm package glibc-langpack-en
  Upgrading        : glibc-langpack-en-2.28-151.el8.x86_64                1/497
  Upgrading        : glibc-common-2.28-151.el8.x86_64                     2/497
  Running scriptlet: glibc-2.28-151.el8.x86_64                            3/497
  Upgrading        : glibc-2.28-151.el8.x86_64                            3/497
  Running scriptlet: glibc-2.28-151.el8.x86_64                            3/497
error: lua script failed: [string "%post(glibc-2.28-151.el8.x86_64)"]:7: attempt to compare number with nil

Error in POSTIN scriptlet in rpm package glibc
error: Couldn't fork %triggerin(cronie-1.5.2-4.el8.x86_64): Cannot allocate memory

These errors continued throughout the transaction even though there was plenty of memory on the system.
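When fork fails with "Cannot allocate memory" while free memory looks fine, a few kernel and shell limits are worth a look. The following is a minimal sketch of generic Linux checks, not the commands from this session, and it is only a guess at where such a problem might hide. The helper names `file_handle_usage` and `show_limits` are made up for illustration, and `show_limits` assumes bash because of `ulimit -u`.

```shell
#!/bin/bash
# Sketch only: generic places to look when fork() reports ENOMEM.
# file_handle_usage and show_limits are hypothetical helper names.

# Summarize a line in /proc/sys/fs/file-nr format:
# "<allocated> <allocated-but-unused> <max>"
file_handle_usage() {
    set -- $1    # split the line into its three fields
    echo "file handles: $1 allocated, limit $3"
}

show_limits() {
    free -m                                              # memory and swap in MiB
    file_handle_usage "$(cat /proc/sys/fs/file-nr)"      # system-wide file handles
    echo "pid_max: $(cat /proc/sys/kernel/pid_max)"      # highest PID the kernel hands out
    echo "overcommit_memory: $(cat /proc/sys/vm/overcommit_memory)"
    echo "max user processes: $(ulimit -u)"              # bash-specific flag
    echo "max open files: $(ulimit -n)"
}
```

Calling `show_limits` on the affected machine prints the current values. Whether any of them explains the errors above is a separate question.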

Oh Shit, now what?

Well…the system wasn’t giving me much at this point. Or at least, not a lot I could deal with without running into some issue. I suspected we had maxed out some sort of file handle limit or something along those lines, which was showing up in a weird way. Seeing as this is a VM, it probably needs a good kick in the teeth and a reboot will fix this, right?

Yes, and mostly no.

Since this is running on <a href="https://www.linode.com/?r=31b44f454740d06a1014351fded0ede51d25fa33" target="_blank">Linode</a>, I first cloned the machine, hoping I could boot the clone first and see what sort of trouble I was in. Worst case scenario, I was hoping I could get some of the configs copied and pasted off, since the data for the most part is backed up in MySQL databases and GitHub repos.

So the goal was to create the clone, boot it and see what happened. I got the clone completed, but I was doing this during the day, between meetings. I was on the machine in question when I had a brain fart, typed reboot && exit and let it fly.

Shit! Not what I wanted to do.

Can’t be that bad, right?

Just need to wait a bit.

Maybe a bit more.

Still not coming up.

What does the Linode console session tell me…it’s up…but with a weird name. And no networking. Oh boy…this is going to be shitty.

Basically the machine did in fact boot, but the modules for the new kernel were not loading. Even after changing grub to boot an older kernel, the kernel modules still would not load. None of the 3 kernels available to me worked. No matter what sort of grub magic / boot order changes / networking command magic I tried, I could not get the public IP to come back up on the VM.

The Recovery

I was pretty much writing off the server at this point and figured I’d need to recreate it from scratch. Most of the data is kept in GitHub, so it shouldn’t be that hard to recreate the web configs, create users, regenerate ssh keys, have Puppet do its magic to check things out, etc. But it still wasn’t fully automated and would be a pain in the ass. What else could I do?

Well, wouldn’t you know it, Linode has their own rescue image. Let’s see what that can do.

Sure enough, it can boot your VM WITH networking support. Hell yeah. Maybe I can either transfer off the data I need, making the recovery faster, or… maybe I can give the yum update another run and see if all the module dependencies get cleared up.

So, the order of operations looked like this.

  1. Boot the VM into rescue mode.
  2. mount /dev/sda /mnt
  3. chroot /mnt
  4. yum clean all; yum update -y
  5. pray everything installs correctly
  6. exit out of chroot
  7. reboot
  8. pray some more
  9. Victory!!!
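The steps above can be sketched as a small script. This is a hedged sketch rather than the exact session: the `run` and `rescue_update` helpers and the `DRY_RUN`, `ROOT_DEV` and `MOUNT_POINT` variables are made up for illustration, and it defaults to only printing the commands. The device and mount point are the ones from the list, so yours may differ. Depending on the rescue environment, yum inside a chroot may also need /proc, /sys and /dev bind-mounted first, which is not shown here.

```shell
#!/bin/sh
# Sketch of the rescue-mode steps. DRY_RUN=1 (the default) only prints commands.
DRY_RUN="${DRY_RUN:-1}"
ROOT_DEV="${ROOT_DEV:-/dev/sda}"        # root disk as seen from rescue mode
MOUNT_POINT="${MOUNT_POINT:-/mnt}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rescue_update() {
    # Steps 2-4: mount the broken root, then rerun the update inside it.
    run mount "$ROOT_DEV" "$MOUNT_POINT" || return 1
    run chroot "$MOUNT_POINT" /bin/sh -c 'yum clean all && yum update -y' || return 1
    # Steps 6-7 (leaving the chroot and rebooting) are left as manual steps.
}
```

Calling `rescue_update` with the defaults shows what would run. Setting `DRY_RUN=0` from the rescue shell runs the commands for real, after which the reboot and the praying are still up to you.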

Conclusion

I’ll admit it, I got lucky this time. Everything was recovered, and having the few sites I host here go down is not the end of the world. We’re not doing anything mission critical on this blog or the other projects going on here.

However, we can always get better at what we are doing. Since the outage, I’ve enabled backups through Linode. It’s an extra $2/mo and, quite honestly, money well spent until I can get a fully recoverable system with little to no downtime.

Beyond simple backups, I’m working on getting my Puppet configuration built out to handle all the aspects of setting up a new web system. All the user creation, configuration creation, SSL certs, etc. that I did manually because I was only going to do it once…need to be part of the Puppet configuration. I can then worry about the little bit of MySQL data that needs to be synced off to another system as well. In the end, it should be a much more robust configuration that will eventually save me $24 a year ;)
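As a rough idea of what syncing the MySQL data to another system could look like, here is a minimal sketch. It is an assumption, not something that is set up yet. `BACKUP_HOST`, `BACKUP_DIR`, `DUMP_DIR` and the helper names are hypothetical placeholders, and it assumes mysqldump can find its credentials (for example in ~/.my.cnf) and that ssh key access to the backup host already works.

```shell
#!/bin/sh
# Sketch: dump all MySQL databases and copy the compressed dump elsewhere.
# All names below are hypothetical placeholders.
BACKUP_HOST="${BACKUP_HOST:-backup.example.com}"
BACKUP_DIR="${BACKUP_DIR:-/var/backups/mysql}"
DUMP_DIR="${DUMP_DIR:-/tmp}"

# Build the uncompressed dump file name for a given date string.
dump_name() {
    echo "mysql-$1.sql"
}

backup_mysql() {
    dump="$DUMP_DIR/$(dump_name "$(date +%Y-%m-%d)")"
    # --single-transaction gives a consistent snapshot for InnoDB tables.
    mysqldump --all-databases --single-transaction > "$dump" || return 1
    gzip -f "$dump" || return 1                       # produces "$dump.gz"
    scp "$dump.gz" "$BACKUP_HOST:$BACKUP_DIR/"
}
```

Running `backup_mysql` from cron once a day would keep one dated dump per day on the other host. Pruning old dumps is left out of the sketch.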