Startups Stack Exchange Archive

How can I reduce the impact of downtime?

As I sit here waiting for downtime to be resolved on one of my most-used services, I’ve started wondering: what kinds of things can a business do to ease the customer service hit of unplanned downtime on a SaaS product?

I’m not too worried about the legal aspects of it, since presumably an SLA is upheld. I’m focusing more on the customer service, and keeping people as happy as possible during the period.

Bonus points for addressing planned downtime, but that’s not really what I’m going for with this.

Answer 1756

I’ve thought of a couple of ideas, but I’m not sure whether I love them.

Updated Service Status Page

The classic solution is to just have a page at /status on which you list out any service interruptions and any updates that come up.

###Pros

###Cons

Twitter Status Updates

This is a more modern twist on the previous solution, and happens to be what Stack Exchange does. Essentially, this would entail just tweeting out the same updates as you’d post to your status page, and perhaps link back to that page.

###Pros

###Cons

Email on down and up

A lot of B2B providers (particularly PaaS) email users when service gets interrupted, then again when it comes back up. This is again typically used in conjunction with a status page of some kind.

###Pros

###Cons

Phone Reminders

This is something I was surprised to learn during the downtime that prompted this post: Microsoft does it for Xbox Live. They have a status page, and on it people can enter their phone numbers to be alerted by a call or SMS when service is restored.

###Pros

###Cons

Answer 1757

Know that you’re down before your users do

In my experience, the single-most important part of keeping end-users happy is actually knowing you’re down before your end-users do.

It may sound silly, even ludicrous, but it is in fact extremely common, not to mention infuriating, for a service provider to learn that its site or service is down because an end-user told them so.
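
As a rough illustration of the idea, here is a minimal sketch of an external probe that polls a health endpoint and raises an alert after a few consecutive failures, so that the provider hears about an outage from its own tooling first. Everything named here is an assumption for the example: the `http_probe` helper, the `DowntimeDetector` class, the threshold of 3, and the `alert` callback are hypothetical and do not come from any particular monitoring product.

```python
import http.client
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


class DowntimeDetector:
    """Turns a stream of probe results into 'down' / 'recovered' alerts."""

    def __init__(self, probe: Callable[[], bool],
                 alert: Callable[[str], None], threshold: int = 3) -> None:
        self.probe = probe          # callable returning True when healthy
        self.alert = alert          # hypothetical hook: page someone, post a status update
        self.threshold = threshold  # consecutive failures before alerting (assumed value)
        self.failures = 0
        self.down = False

    def tick(self) -> None:
        """Run one probe and alert only on state changes."""
        if self.probe():
            if self.down:
                self.alert("recovered")
            self.failures = 0
            self.down = False
            return
        self.failures += 1
        if self.failures >= self.threshold and not self.down:
            self.down = True
            self.alert("down")
```

In use, `tick()` would be called on a schedule (say, once a minute) from a machine outside the production infrastructure, with something like `lambda: http_probe("https://example.com/health")` as the probe; the URL is a placeholder. Requiring several consecutive failures is a design choice to avoid alerting on a single dropped request.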

Communicate that you know you’re down

Once proper monitoring is in place, how you communicate that you know is basically cosmetic. It can be a status page somewhere (on a different server, of course), it can be updates on Twitter, it can be notifications by email or SMS, etc. Matthew has posted a list already. Which medium you pick, and the specifics, aren’t so important.

What really really does matter is communicating to end-users that you already know you’re down. You don’t want them to inundate you with soul-draining support requests and angry emails and what have you, and you don’t want to waste time and resources answering them instead of fixing things.

Bonus: notify users when you’re back up

In my mind this is mostly a bonus. It can be by email, by SMS, whatever. Make it opt-in (e.g. “Notify me by email when you’re back up!”), and offer it if announcing the recovery on Twitter doesn’t seem like enough.

Bonus: explain why you were down

This is also a bonus in my mind. Users care less about why the service was down than about how fast it was back up. If your audience does care about the former, it will usually be satisfied with high-level explanations (e.g. “An excavator cut the telco links while digging a trench.”)

If you do this, prefer doing it in a private, clients-only area for the same reason as you wouldn’t want prospects to visit a support forum: it can give the wrong impression to potential clients.

Unless, of course, it was a planned downtime…

Announce your planned downtimes

For planned downtimes, it’s not a bonus. You want to:

  1. Explain in advance, in a reasonably public place (e.g. the company’s blog and Twitter), that you need to be down.
  2. Send a reminder to end-users an hour before the downtime if appropriate (i.e. so that end-users can save their work).
  3. Highlight when the downtime is over in the same reasonably public place.
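
The three steps above can be sketched as a small schedule calculation. This is only an illustration: the `Notice` type, the `plan_notices` function, the channel names, and the three-day announcement lead are assumptions made for the example; only the one-hour reminder comes from the list above.

```python
from datetime import datetime, timedelta, timezone
from typing import List, NamedTuple


class Notice(NamedTuple):
    when: datetime   # when to send the notice
    channel: str     # where to send it (hypothetical channel names)
    message: str


def plan_notices(start: datetime, duration: timedelta,
                 announce_lead: timedelta = timedelta(days=3),
                 reminder_lead: timedelta = timedelta(hours=1)) -> List[Notice]:
    """Return the announcement, reminder, and all-clear notices for a maintenance window."""
    end = start + duration
    minutes = int(duration.total_seconds() // 60)
    return [
        # Step 1: public announcement well in advance (lead time is an assumed default).
        Notice(start - announce_lead, "blog+twitter",
               f"Planned maintenance on {start:%Y-%m-%d %H:%M} UTC, about {minutes} minutes."),
        # Step 2: reminder one hour before, so users can save their work.
        Notice(start - reminder_lead, "email",
               "Maintenance starts in one hour. Please save your work."),
        # Step 3: all-clear in the same public place once the window ends.
        Notice(end, "blog+twitter", "Maintenance is over. Thanks for your patience."),
    ]


if __name__ == "__main__":
    window_start = datetime(2024, 1, 10, 2, 0, tzinfo=timezone.utc)
    for notice in plan_notices(window_start, timedelta(minutes=30)):
        print(notice.when.isoformat(), notice.channel, notice.message)
```

The all-clear is scheduled at the planned end of the window; in practice it would be sent when the work is actually finished, which may be earlier or later.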

Answer 1760

Hmm, personally I would:

  1. Get a status page. Your startup is probably young, and you NEED to be transparent about outages to keep your users' trust. Check out statuspage.io (building your own might not be the best idea -- what if everything crashes?).

    a. One answer mentioned placing the status page under "/status" of your domain. This is OK, but the convention is to place it on a subdomain (status.github.com, status.trello.com, and many more). IMHO status.yourdomain.com is preferable to yourdomain.com/status simply because the implementation is SO MUCH SIMPLER. The former is simply a DNS record pointing to the server for your status page, while the latter routes to the server that is probably down (you could get around it by using load balancers etc, but that's not fun).

  2. Find affected users. If it's a few isolated cases, don't go writing a blog post about your down time. Always post to the status page, and make it clear to users that the status page exists (maybe a "Status" link in the page footer?).

    a. Write a notification to those affected. Email is probably the best, but maybe a message on the next successful login would work, too. Depending on what your service is, provide some consolation if you feel it necessary. (For example, if you provide a service that people NEED to get work done, give them a free month of your service).

    b. If you DO explain the outage, make sure that it is accurate and honest. Fabricating a story like "lol, a cat ate one of our HDDs in the secure datacenter and etc..." makes your company sound like it has no idea what it's doing, while a response like "One of our 10G peering links failed due to a PDU fault causing power to be lost to the router" would most likely go WHOOSH over the user's head. Find a balance between accuracy and jest, while still taking full blame for it. Let your users know that you REALLY DO care about them.

  3. Find out WHY. Outages are very bad (same with downtime). As a SaaS provider, you really need to have your backend together. Personally I don't use products if I don't think the company can handle my data.

    a. Make customers trust you by making downtime nonexistent. I containerize my apps with Docker atop CoreOS, which makes rolling updates really simple (nobody ever loses service, but everybody is upgraded). When was the last time you saw Facebook, Twitter or YouTube go down for "planned maintenance"?

    b. In addition to having rolling updates, you should have redundancy. You should have multiple geographically diverse (read: not located in the same building) servers, so if one blows up (or a whole rack spontaneously combusts), your users won't notice a thing. Efficient use of Anycast DNS will let you serve requests from multiple DCs. If one goes down, the user gets routed to the next available one (data replication across them can be expensive though).
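
To make the rolling-update idea from point 3a concrete, here is a minimal in-memory sketch. It is not how Docker or CoreOS implement it; the `rolling_update` function, the instance dictionaries, and the `healthy` callback are all hypothetical stand-ins. The point it tries to show is the invariant: at most one instance is out of rotation at any time, and a failed health check stops the rollout before it spreads.

```python
from typing import Callable, Dict, List


def rolling_update(instances: List[Dict], new_version: str,
                   healthy: Callable[[Dict], bool]) -> bool:
    """Upgrade instances one at a time; roll back and stop on the first unhealthy one."""
    for inst in instances:
        old_version = inst["version"]
        inst["serving"] = False        # drain: take this instance out of rotation
        inst["version"] = new_version  # stand-in for deploying the new container
        if not healthy(inst):
            inst["version"] = old_version  # roll this instance back
            inst["serving"] = True
            return False               # remaining instances keep the old version
        inst["serving"] = True         # back in rotation before touching the next one
    return True


if __name__ == "__main__":
    fleet = [{"id": i, "version": "v1", "serving": True} for i in range(3)]
    succeeded = rolling_update(fleet, "v2", healthy=lambda inst: True)
    print(succeeded, [inst["version"] for inst in fleet])
```

Because only one instance is drained at a time, the others keep answering requests throughout, which is what lets users stay online while everybody is upgraded. This only works if the fleet has enough spare capacity to run with one instance missing.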

Good luck (and great question!)

Answer 1779

Funny/Cute Status Page

A standard “hey we are down” page with a funny or cute element added on top.
Here are some examples I liked:
http://www.smashingmagazine.com/2009/06/12/effective-maintenance-pages-examples-and-best-practices/
http://waldowsocial.com/best-unsubscribe-ever/

###Pros

###Cons


All content is licensed under CC BY-SA 3.0.