How do you handle Sev 1 critical outages?
Sev 1 outages are stressful. When production is down and customers are affected and everyone is looking at your team for an answer, first and foremost stay calm… breath. This is easier said than done when you are affecting millions of dollars per hour in transactions (which is a thing in large scale payment systems, not fun), but regardless of the impact of the outage, if you loose your cool, the solution can be sitting right in front of your face and you won’t see it. Shit happens, there will always be bugs, sites will always go down at some point, accept it, find the solution and focus on not letting the same shit happen twice.
Don’t focus on blame
Establishing who caused the issue is not important? Knowing who may have been involved in the changes that led up to an issue and made changes after the issue is important to understand. Even if there may have only been one person involved, you can’t assume that they are the cause and it does no good to blame anyone when production is down. Focus on the solution.
Create time line of events
You should document all relevant changes that led up to the outage and all changes that occurred after the outage. This not only helps to discover possible causes it provides documentation that can be used during root cause analysis and investigations during similar outages.
The time line can be kept on an internal team wiki so the team has visibility and can add to it as necessary. During an outage, someone should be assigned to record all of the facts in the time line. Without facts your poking at the problem in the dark.
Never theorize before you have data. Invariably, you end up twisting facts to suit theories, instead of theories to suit facts.
Sherlock Holmes to Watson (Movie – Sherlock Holmes 2009)
I can’t tell you how many times that checking the logs first would have saved a lot of time as opposed to just poking around looking for file changes, config changes… everything else, but simply looking at the logs. The first step in investigating the issue should be looking at all of the relevant logs: event logs, custom application logs…
Communication is key
Keep a bridge line open with team
Keeping a bridge line open, even if there is nothing to discuss, keeps a real-time line of communication open and ready when someone has questions, ideas, and possible solutions.
Send regular status updates to team and stakeholders
Sending a message to announce the issue and what is known right now is good form. It lets everyone know that you are on top of the issue and working hard to solve it. If you haven’t found a resolution in a certain amount of time, sending another update explaining what has been done and any new findings lets everyone know that although you haven’t found the issue, you are still working hard on it. It may be a good idea to even post the status updates to a blog or Twitter, syndicate the updates to as many channels as you can, especially if you have a large application with many users.
Staying proactive with communication is much better than constantly having to field random calls and emails looking for information you should be readily sharing. Keep communications open and don’t try to hide, spin, or lie about the mistake.
No one makes any changes without discussing the change
While everyone is trying to solve the issue, no one should be making changes in production, even if the fix is blatantly obvious. A Sev 1 is serious and everything changed to fix it should be discussed with the team first so it can be documented and controls put in place to prevent it in the future.
If the team agrees on the change then the change should be documented on the timeline and a notification should be sent when the change is starting and when the change is finished. The change discussion and notifications can be simply talking it out over the bridge line or an IM or email. The point is don’t allow the change to get worse or be repeated by making undocumented changes that the team can’t learn from.
These are just some tips that I have learned over the years. I have seen many more sound practices, but the gist is:
- Stay calm
- Document changes to production
- Work as a team
- Learn from failure