Screwing up with class

December 13th, 2010 | by | sysadmin, tips & tricks

Those who know me know I’m a big fan of 37signals. Heck, I even applied for a job there once upon a time, along with several hundred other people.

Over the last week or so they have been having some major issues with their Campfire product. And in true 37signals style, they have an explanation and apology all wrapped up in one great blog post.

What’s so great about it?
Well, first off, it’s honest. There’s no marketing bullshit. No fake apology. It’s sincere and straightforward. They had several issues, they laid them out, and they took full responsibility for stuff breaking.

You could feel the pain. They rely on this product and were pissed it was down. More to the point, they were pissed that they were letting their customers down, many of whom, like me, are fans. It’s easier to win back fans. But there are others on the fringe whom they could have lost, and they come right out and state that they need to earn their trust back.

Then there’s the structure of the apology. Not many people realize that there is an art to writing this sort of message. The Pointy Haired Bosses (PHBs) will read anywhere from a few sentences to a couple of paragraphs, so the first section is for them. The second section is for those who have to answer to the PHBs: the middle managers who made the recommendation and are going to take the heat for it. They can state, “Yes, there was a problem and we’re getting a credit.” The third section is for the admins out there, the guys who want to hear what happened and learn from others’ issues. We may run into this ourselves, and knowledge is power here.

Conclusion
For those reading this and thinking that it doesn’t apply to them because they’re not a tech company, I say to you: wake up. Just take point number 1 as an example. No matter what profession you are in, you’re going to screw up at some point in your life. And when it comes time to apologize, be honest. Don’t sugarcoat it or bury it in a non-apology. Take responsibility, be sincere and admit you fucked up. Your customers will respect that sort of apology.

Server Issues

August 23rd, 2010 | by | site news, sysadmin

Some of you may have noticed that this site was down for about a day. Oh, who am I trying to kid? Three people noticed, if that, and most of you are probably reading this going, “oh, you had issues. I guess so, you’re writing about them.”

Since we’re all on the same page, let’s continue with the postmortem, shall we?

What Happened?

On Sunday morning, a little after 9:30 AM, I noticed that I was no longer getting email from my server, which was kind of a bummer as I had just gotten a cert set up for IMAP+SSL. I started poking around, doing the usual pings and traceroutes. It was clear that I wasn’t having a connectivity issue to the internet; this was an issue with my server. Web was down, and so were SMTP, SSH, everything. Ugh, not good. This server had issues about 6 months earlier, when half of its RAM died. I should have seen this as a sign to move everything off right then and there. But I didn’t.

Luckily for me, this server is hosted by my old company Solinus, who make the kick-ass MailFoundry Anti-Spam appliances. So I had a good buddy take a look at this box, and he noticed a few red flags right from the beginning. First and foremost, it was off. Yikes! After he booted it up, it took a LONG time just to get to a BIOS screen, which would then load halfway into a LILO screen before freaking out and going back down. And this happened only once. After that, it simply wouldn’t POST. My friends, we have a system board failure.

Now, this is a sandbox server for me. It’s used for backups and for hosting a few low-volume blogs for myself and a few friends. It’s also pretty old as servers go. I bought it on eBay back in the fall of 2006, and it was already several years old when I got it. When the system board finally gave up the ghost, it was at least 6 years old, if not 7 or 8.

Recovery

So what do you do when you lose a server and know it’s a lost cause? Well, you go to your backups, of course. Have you tested your backups lately? Most people forget this all-too-critical step. Sure, you know that your backups are running, but do you know if you could really recover from them?

Luckily, I knew I had a good backup as I had to recover a file from it last week that I accidentally removed. Whew!

Recovering the files wasn’t too bad. The only downside is that the files are backed up to a server I have sitting here at home. I have a cable modem connection with great download speeds. Guess what it doesn’t have…great UPLOAD speeds. So moving around the couple of GB of data has been a bit of a pain, but once it was on the server, I was able to parse out the data pretty quickly.

Lessons learned

  • Always test your backups. I really can’t stress this one enough. Just because your backup ran doesn’t mean that you have a good backup. Try putting that data on another server and building up a clone of your current box.
  • Documentation. As with the previous point, can you restore the data to another server? Can you configure anything that you missed without looking at the live server? If it’s a one-of-a-kind server, you may not have all those permissions documented properly. It’s key to have good documentation of how things were installed and configured, not just of the data that you were hosting.
  • Recovery plan. Do you know where you are going to restore or rebuild that server? Do you have spare hardware? A VM, maybe? It helps to have those plans in place so you know exactly where you can shift loads and get things going again. Sure, this was a dev box for me, but I still like to have an idea of where things are going to go so my downtime is minimal.
  • Did I mention backups? Yup, it’s that important, so it goes on the list twice ;)
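
One way to act on the first lesson is to restore the backup somewhere else and compare it, file by file, against the original. What follows is a minimal Python sketch of that comparison, not the procedure used for this recovery. The function names (file_digest, tree_digests, compare_trees) and the choice of SHA-256 checksums are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of one file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root: Path) -> dict[str, str]:
    """Map each regular file under root (by relative path) to its digest."""
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in root.rglob("*")
        if path.is_file()
    }


def compare_trees(source: Path, restored: Path) -> dict[str, list[str]]:
    """Report files that are missing from, extra in, or different in the restore."""
    src = tree_digests(source)
    dst = tree_digests(restored)
    return {
        "missing": sorted(set(src) - set(dst)),
        "extra": sorted(set(dst) - set(src)),
        "changed": sorted(name for name in set(src) & set(dst) if src[name] != dst[name]),
    }
```

Pointing compare_trees at the live directory and at the restored copy should give three empty lists when the restore is complete. Files that legitimately changed after the backup ran will show up under "changed", so treat the report as something to review rather than a pass/fail verdict.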

Final Thoughts

Overall, my server disaster wasn’t as bad as it could have been. I lost a few hours recovering data. But in the grand scheme of things, I lost very little of the important data. The one thing I didn’t have backed up was my home directory. There wasn’t too much that was really lost there. A little bit of mail and a few random scripts. But nothing that I would consider the end of the world.

SSH Timeouts

June 4th, 2010 | by | sysadmin, tips & tricks

Do you work in an environment where you bounce through a bunch of firewalls? Do you hang out on idle SSH connections that often get dropped after a certain amount of idle time? I do, and it has always annoyed me, to the point that once I connect to a box I will be coming back to, I run top to keep the session busy and move on. Well, not anymore. You can set your SSH client to automatically send a bit of data over your connection every X seconds. Here is how it is done for Mac and Linux boxes.

In your home directory, edit your .ssh/config file. If you don’t have one, that’s not a problem, simply create a new one. Then enter in the following line:

ServerAliveInterval 60

And you’re done! Now wasn’t that easy?

Happy terminal camping, partner!
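
If you want to roll this setting out to several accounts or machines, the edit can be scripted. Here is a minimal Python sketch, assuming you are happy to append a catch-all Host * block to the end of the file. The helper names with_keepalive and update_config are made up for this example, and the sketch leaves the file alone if it already sets ServerAliveInterval anywhere, even for a single host.

```python
from pathlib import Path


def with_keepalive(config_text: str, interval: int = 60) -> str:
    """Return config_text with a ServerAliveInterval default appended.

    If any non-comment line already sets ServerAliveInterval, the text is
    returned unchanged.
    """
    for line in config_text.splitlines():
        # ssh_config accepts both "Keyword value" and "Keyword=value".
        words = line.strip().replace("=", " ").split()
        if words and words[0].lower() == "serveraliveinterval":
            return config_text
    separator = "" if config_text == "" or config_text.endswith("\n") else "\n"
    return f"{config_text}{separator}Host *\n    ServerAliveInterval {interval}\n"


def update_config(path: Path, interval: int = 60) -> None:
    """Apply with_keepalive to the file at path, creating it if needed."""
    old = path.read_text() if path.exists() else ""
    new = with_keepalive(old, interval)
    if new != old:
        path.parent.mkdir(mode=0o700, parents=True, exist_ok=True)
        path.write_text(new)
        path.chmod(0o600)  # keep the config private to its owner
```

Calling update_config(Path.home() / ".ssh" / "config") would apply it to your own account. The block goes at the end of the file because ssh uses the first value it finds for each option, so any per-host settings earlier in the file still win.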

Monit Tricks

February 4th, 2010 | by | sysadmin, tips & tricks

Recently I had a chance to do a little monit foo with a co-worker for a rather interesting project that we will hopefully be sending off into the intertubes.

For one part of this project, I got the chance to get my hands dirty with my old friend monit. Monit, for those who don’t know, is a UNIX system administrator’s dream.

Here’s a brief run down of what monit can do from the web site:

Monit can start a process if it does not run, restart a process if it does not respond and stop a process if it uses too much resources. You can use Monit to monitor files, directories and filesystems for changes, such as timestamp changes, checksum changes or size changes. You can also monitor remote hosts; Monit can ping a remote host and can check TCP/IP port connections and server protocols. Monit is controlled via an easy to use control file based on a free-format, token-oriented syntax. Monit logs to syslog or to its own log file and notifies you about error conditions and recovery status via customizable alert.

So…with that little bit of unnecessary advertising out of the way, what was I trying to do? It was pretty simple, really: monitor a process and, if it is not running, restart it. However, there was a twist that I hadn’t dealt with before. It needed to restart as a particular user. My past experience had always been monitoring services such as an SSH server or SMTP server. I hadn’t gone down the path of monitoring an application that a user could start. But if you are doing anything like a kiosk, this type of functionality might come in handy for you.

The solution is ridiculously simple. All you need to do is add an “as” clause to the start program statement in your monit control file. Here’s an example I found online:

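# The pidfile path is an assumed example; point it at your service's real pidfile.
check process tomcat with pidfile /var/run/tomcat.pid
    start program = "/etc/init.d/tomcat start"
        as uid nobody and gid nobody
    stop program = "/etc/init.d/tomcat stop"
        # You can also use id numbers instead and write:
        as uid 99 and with gid 99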

I’m sure I’m not the only one that has run into this so I figured I would help spread the word on a very obvious and probably overlooked monit feature.

20%

January 2nd, 2010 | by | security, sysadmin

We launched our Hosted Exchange 2007 Product just over a year ago. And for the most part, things have gone great.

One of our early decisions was to balance the security of the system against making the system as user friendly as possible. Originally, we had a pretty strict password policy. We soon found that many of our customers were not too happy with this policy and thought it was too much. Were we out-of-control security freaks? Shouldn’t the customer appreciate the steps that we are taking to secure not only our servers, but their information?

Looking around at other vendors, we quickly found that we may have been a bit too harsh. Take Gmail, for example. Sure, it’s not Exchange. But then again, it has over 100 million users. If they have massive issues with security and hacking, they clearly have them under control behind the scenes so things do not get out of hand.

And have you ever been prompted to change your password on Gmail? I haven’t.

So we compromised. We altered the time between forced password changes. We altered the number of passwords that you could recycle. And we also added a somewhat buried feature in our customer portal. That feature: ‘allow passwords to never expire’.

Holy crap! Let’s just blow a huge freaking hole in the security system, shall we?

This was a feature that we were not all that happy about, but with the other measures in place we figured we would avoid passwords such as abc123. It makes the end user happy, and we have some level of security, though not as high and tight as we would like. But it’s better than having things wide open.

Now here is the shocking part of this. 20% of our users have this feature enabled. 20-freaking-percent! I was really hoping for this number to be in the 5-10% range.
But no, 1 in 5 of our users will never change their password again.

Or will they?

I’m currently developing a nag script that will send out a reminder to the end users every couple of months. Not enough to completely annoy the heck out of them. But hopefully enough to get a good portion of that 20% to change their passwords on a semi-regular basis.
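
As a rough illustration of the selection step in a nag script like this, here is a minimal Python sketch. It is not the actual script. The users_to_nag function, the 60-day default, and the idea that you already have a mapping from each never-expire mailbox to the date its password was last changed are all assumptions made for the example.

```python
from datetime import date, timedelta


def users_to_nag(last_changed: dict[str, date], today: date, max_age_days: int = 60) -> list[str]:
    """Return the users whose password is at least max_age_days old.

    last_changed maps each user with a never-expiring password to the date
    the password was last changed (hypothetical input for this sketch).
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(user for user, changed in last_changed.items() if changed <= cutoff)
```

Run on a schedule, the returned list would feed whatever mailer sends the reminders. The check is inclusive, so a password changed exactly 60 days ago gets a reminder.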

So what do you do for your password policy? Leave your tips and tricks in the comments section. We’d like to hear what you think is an acceptable policy to stay secure!
