I read a book called The Phoenix Project, a surprisingly good book about a company establishing a DevOps culture. One of the terms in the book that I had no experience with was Sev1 incident. I have since heard it repeated and found out that it is part of a common grading of incident severity. About a year after I read the book, I finally decided to research it and put more thought into a formalized workflow for incident reporting, triage, mitigation, and postmortem, which is similar to the thoughts I had on triaging failing automated tests.
So, first, the severity levels need to be defined. Fortunately, David Lutz has a good breakdown on his blog – http://dlutzy.wordpress.com/2013/10/13/incident-severity-sev1-sev2-sev3-sev4-sev5/.
- Sev1 Complete outage
- Sev2 Major functionality broken and revenue affected
- Sev3 Minor problem, bug
- Sev4 Redundant component failure
- Sev5 False alarm or alert for something you can’t fix
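As a rough sketch, the five levels above could be captured in code so that tickets and alerts share one definition. The names `Severity`, `DESCRIPTIONS`, and `most_severe` are my own for illustration and do not come from David Lutz's post.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the list above; a lower number is more severe."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


# Descriptions copied from the list above.
DESCRIPTIONS = {
    Severity.SEV1: "Complete outage",
    Severity.SEV2: "Major functionality broken and revenue affected",
    Severity.SEV3: "Minor problem, bug",
    Severity.SEV4: "Redundant component failure",
    Severity.SEV5: "False alarm or alert for something you can't fix",
}


def most_severe(levels):
    """Return the most severe level in a collection (the smallest number)."""
    return min(levels)
```

Using an `IntEnum` means levels compare naturally, so `Severity.SEV1 < Severity.SEV3` holds and sorting a queue by severity needs no extra code.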
With that, I need to define how to identify the levels. IBM has a breakdown that simplifies it on their Java SDK site – http://publib.boulder.ibm.com/infocenter/javasdk/v1r4m2/index.jsp?topic=%2Fcom.ibm.java.doc.diagnostics.142%2Fhtml%2Fbugseverity.html:
- Sev1, in development: You cannot continue development.
- Sev1, in service: Customers cannot use your product.
- Sev2, in development: Major delays exist in your development.
- Sev2, in service: Users cannot access a major function of your product.
- Sev3, in development: Major delays exist in your development, but you have temporary workarounds, or can continue to work on other parts of your project.
- Sev3, in service: Users cannot access minor functions of your product.
- Sev4, in development: Minor delays and irritations exist, but good workarounds are available.
- Sev4, in service: Minor functions are affected or unavailable, but good workarounds are available.
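To make the "in service" criteria concrete, here is a minimal sketch that reads the four pairs above as levels 1 through 4, most severe first (that ordering is my reading of the list). The `ServiceImpact` fields and the `classify` function are hypothetical names, and returning `None` when nothing matches is my own choice so that an analyst decides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceImpact:
    """Hypothetical answers an analyst records about an in-service problem."""
    product_unusable: bool = False      # customers cannot use the product
    major_function_down: bool = False   # a major function is inaccessible
    minor_function_down: bool = False   # minor functions affected or unavailable
    good_workaround: bool = False       # a good workaround is available


def classify(impact: ServiceImpact) -> Optional[int]:
    """Map the recorded impact to a level 1-4, checking the worst case first."""
    if impact.product_unusable:
        return 1
    if impact.major_function_down:
        return 2
    if impact.minor_function_down:
        # A good workaround separates level 4 from level 3.
        return 4 if impact.good_workaround else 3
    return None  # no criterion matched; leave the call to the analyst
```

For example, `classify(ServiceImpact(minor_function_down=True, good_workaround=True))` lands on level 4, while the same problem without a workaround is level 3.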
Now that we have more guidance on identifying the severity of an incident, how should it be reported? I believe that anyone can report an incident, a bug, or anything that is not working, but it is up to an analyst to determine the severity level of the report.
So, the first step is for the person who discovered the issue to open a ticket. Of course, if it is a customer and we don’t have a self-support system, they will probably report it to an employee in support or sales, and the employee will create the ticket for the customer. All tickets should be auto-routed to the analyst team, where each one is assigned to an analyst to triage. The analyst will assign the severity level and pass the ticket to engineering support, where it will be reviewed, discussed, and prioritized. The analyst in this instance can be a QA, a BA, or even a developer assigned to the task; the point is to have a dedicated team or person responsible.
During the analysis, a timeline of the failure should be established. What led up to the failure, the changes, the actions taken, and the people involved should all be laid out in chronological order. Also, during triage, a description of how to recreate the failure should be written if possible. The goal is to collect as much information about the failure as possible in one place so that the team can review it and help investigate. The level of detail required and how quickly feedback is given should be set according to the Sev level.
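The hand-offs in that paragraph can be sketched as a tiny state machine. This is only an illustration under my own assumptions: the `Ticket` fields, the three `Status` values, and the rule that a ticket must reach an analyst before it gets a severity are names and choices I made up, not part of any real ticketing system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    OPEN = "open"                          # just reported
    IN_TRIAGE = "in triage"                # assigned to an analyst
    WITH_ENGINEERING = "with engineering"  # severity set, handed off


@dataclass
class Ticket:
    summary: str
    reported_by: str
    status: Status = Status.OPEN
    analyst: Optional[str] = None
    severity: Optional[int] = None


def route_to_analyst(ticket: Ticket, analyst: str) -> Ticket:
    """Assign a newly opened ticket to an analyst for triage."""
    ticket.analyst = analyst
    ticket.status = Status.IN_TRIAGE
    return ticket


def triage(ticket: Ticket, severity: int) -> Ticket:
    """Record the analyst's severity (1-5) and hand off to engineering."""
    if ticket.status is not Status.IN_TRIAGE:
        raise ValueError("ticket must be assigned to an analyst before triage")
    if severity not in range(1, 6):
        raise ValueError("severity must be between 1 and 5")
    ticket.severity = severity
    ticket.status = Status.WITH_ENGINEERING
    return ticket
```

The guard in `triage` enforces the one rule I care about here: nobody sets a severity until a dedicated analyst owns the ticket.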
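A timeline like this is easy to keep as structured data. The sketch below assumes each entry records when something happened, who was involved, and what happened; `TimelineEvent` and `build_timeline` are hypothetical names of mine.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass
class TimelineEvent:
    when: datetime  # when the change or action happened
    who: str        # person or system involved
    what: str       # the change, action, or observation


def build_timeline(events: Iterable[TimelineEvent]) -> List[TimelineEvent]:
    """Return the events in chronological order, earliest first."""
    return sorted(events, key=lambda event: event.when)


def format_timeline(events: Iterable[TimelineEvent]) -> str:
    """Render the ordered timeline as one line per event for the ticket."""
    return "\n".join(
        f"{e.when.isoformat()} {e.who}: {e.what}" for e in build_timeline(events)
    )
```

Entries can be added in whatever order people remember them, and `build_timeline` puts them in sequence before the team reviews the ticket.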
This is turning out to be a lot deeper than I care to dive into right now, but it gives me food for thought. My takeaways so far are to:
- formalize severity levels
- define how to identify the levels
- assign someone to do the analysis and assign the levels