Failing Test Severity Index
Have you ever been to the emergency room and gone through triage? They have a standard way of identifying patients that need immediate or priority attention versus those who can wait. Actually, they have an algorithm for this called the Emergency Severity Index (ESI). Well, I have a lot of failing tests in my test ER, their bloody redness in the test reports begging to be fixed. Every day I get a ping from them: “I’ve fallen and I can’t get up.” What do I do? Fix them, of course. So, I decided to record my actions while fixing a few of them to see if any patterns arise that I can learn from. In doing so, I formalized my steps so that I have a repeatable process for fixing tests. I borrowed from ESI and decided to call my process the Failing Test Severity Index (FTSI). It sounds official, like something every tester should use, and it comes with an acronym, FTSI (foot-see). If there is no acronym, it’s not official, right?
Triage
The first step is to triage the failing tests to identify the most critical ones. I want to devote my energy to the most important tests, the ones that provide the most bang for the buck, and triage gives me an official stage in my workflow for prioritizing the tests that need to be fixed.
As a side note, I have so many failures because I inherited the tests; they were originally deployed and never added to a maintenance plan. So they have festered, with problems in the test structure itself and with code changes in the system under test that need to be addressed in the tests. With that said, there are a lot of failures, hundreds of them. This gave me an excellent opportunity to draft and tune my FTSI process.
During triage I categorize the failures to give them some sort of priority. I take what I know about the tests, the functionality they cover, and the value they provide, and grade them “critical”, “fix now”, “fix later”, “flaky”, or “ignore”.
Critical
Critical tests cover critical functionality: these failing tests cover functionality that, if broken, would cripple or significantly harm the user, the system, or the business.
Fix Now
Fix now tests provide high value, but their failures wouldn’t cause a software apocalypse. These are usually easy fixes, easy wins with a known root cause that I can tackle after the critical failures.
Fix Later
Fix later tests are important, but not necessary to fix in the near term. These are usually tests where it is a little harder to understand why they are failing. Sometimes I will mark a test as fix later, investigate the root cause, and once I have a handle on what it takes to fix it, move it to fix now. Other times I will tag a failure as fix later when it has a known root cause, but the time to fix is not in line with the value the test provides compared to other tests I want to fix now.
Flaky
Flaky tests are failing for indeterminate reasons. They fail one time and pass another, but nothing has changed in the functionality they cover. This is a tag I am questioning whether I should keep, because I could end up with a ton of flaky tests being ignored and cluttering the test report with a bunch of irritating yellow (almost as bad as red).
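If I do keep the tag, one option is to quarantine flaky tests with automatic retries instead of letting them go straight to red. This is just a sketch of what that could look like, assuming pytest with the pytest-rerunfailures plugin installed; the randomized “dependency” below is a stand-in for whatever intermittent behavior makes a real test flaky.

```python
# A sketch of quarantining a flaky test with automatic retries, assuming
# pytest and the pytest-rerunfailures plugin. The random "dependency" below
# is a stand-in for whatever intermittent behavior makes a real test flaky.
import random

import pytest


@pytest.mark.flaky(reruns=3, reruns_delay=1)  # retry up to 3 times, 1 second apart
def test_intermittent_dependency():
    # Simulates a dependency that answers correctly most of the time;
    # the reruns absorb occasional misses instead of reporting red.
    response_ok = random.random() > 0.2
    assert response_ok
```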
Ignore
Ignore tests are candidates for removal from the test suite. These are usually tests that are not providing any real value because they no longer address a valid business case or are too difficult to maintain. If I triage and tag a test as ignore, it is because I suspect the failure can be ignored, but I want to spend some extra time investigating whether I should remove it. When tests are ignored it is important not to let them stay ignored for long. Either remove them from the test suite or fix them so that they are effective tests again.
Test Tagging and Ticketing
Once the test failures are categorized, I tag them in the test tool so they are ignored during the test run. This also gives me a way to quickly identify the test code that is causing me issues. I also add the failing tests to our defect tracker. For now, failing tests are not related to product releases, so they are just entered as defects on the test suite for the system under test, and they are released when they are fixed. I do not add any details on the reason for the failure besides the test failure message and stack trace, if available. I may also add cursory information that gives clues to a possible root cause, such as a test timing out, a config issue, a data issue…obvious reasons for failures. This allows me to focus on the initial triage and not spend time investigating issues that may not be worth my time.
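As an example of what the tagging could look like, here is a sketch using pytest markers; the marker names, skip reasons, and defect ID are all illustrative, and your test tool will have its own equivalent.

```python
# An illustrative sketch of FTSI tagging in pytest. Custom markers like
# ftsi_fix_later would be registered in pytest.ini or pyproject.toml to avoid
# warnings; the defect ID and test names are made up.
import pytest


@pytest.mark.ftsi_fix_later
@pytest.mark.skip(reason="FTSI fix later - DEFECT-123: login scenario changed")
def test_login_with_expired_password():
    ...


@pytest.mark.ftsi_ignore
@pytest.mark.skip(reason="FTSI ignore - business case retired, candidate for removal")
def test_legacy_export_format():
    ...
```

The skip reason keeps each yellow entry in the report traceable back to the defect tracker, and the custom markers make it easy to pull up everything in a category, for example `pytest -m ftsi_fix_later --collect-only`.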
This process helps get rid of the red and replace it with yellow (ignored) in the test reports, so there is still an issue if we are not diligent and efficient in the process. It does clear the report of red so that we can easily see new failures. If we leave the tests as failures, we can get into the habit of overlooking the red and miss a critical new failure.
Critical Fix
After I complete triage and tagging, I focus on fixing the critical test failures. Depending on the nature of a critical failure, I will forgo triage on the other failures and immediately fix it. Fixing critical tests has to be the most important task because the life of the feature or system under test is in jeopardy.
Investigate
Once I have my most critical tests fixed, I begin investigating. I select the highest-priority tests that don’t have a known root cause and start looking for the root cause of each failure. I begin by reading through the scenario covered by the test. Then I read the code to see what the test is doing. Of course, if I were very familiar with a test I might not have to do this review, but if I have any doubts I start with gaining an understanding of the test.
Now I run the test and watch it fail. If I can clearly tell why the test fails, I move on to the next step. If not, since I have access to the source code for the system under test, I open a debug session and step through the test. If it still isn’t obvious, I may get a developer involved, or ask a QA, BA, or even an end user about their experience with the scenario.
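For a pytest suite, reproducing a single failure and dropping into the debugger when it fails might look something like the sketch below; the test path is hypothetical, and the command-line equivalent is `pytest tests/test_checkout.py::test_apply_coupon -x --pdb`.

```python
# A sketch of reproducing one failing test and opening a debugger on failure,
# assuming pytest; the test path is hypothetical.
import pytest

# -x stops at the first failure, --pdb drops into pdb at the point of failure.
pytest.main(["tests/test_checkout.py::test_apply_coupon", "-x", "--pdb"])
```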
Once I have the root cause identified, I tag the test as fix now. This is not gospel; if the fix is very easy, I will forgo the fix now tag and just fix the test.
Fix
Once I have run through triage and tagging, fixed critical tests, and investigated new failures, I focus time on fixing any other criticals, then move on to the fix now list. During the fix I also look at improving the tests through refactoring, targeting things like maintainability and performance. The goal during the fix is not only to fix the problem, but to simplify future maintenance and improve the performance of the test. I want the test to be checked in better than I checked it out.
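To make “better than I checked it out” concrete, here is a hypothetical example of the kind of cleanup I mean, sketched in pytest: duplicated inline setup pulled into a fixture so future maintenance touches one place. The order_total function is only a stand-in for the real system under test.

```python
# A hypothetical example of cleanup done while fixing a test: repeated inline
# setup is pulled into a fixture so the order shape only has to change in one
# place. order_total stands in for the real system under test.
import pytest


def order_total(order):
    # Stand-in for the real code under test.
    subtotal = sum(item["price"] * item["qty"] for item in order["items"])
    return subtotal - order["discount"]


# Before: each test built the order dict inline, so every change to the
# order structure meant editing every test that used it.

@pytest.fixture
def basic_order():
    # After: one place to update when the order structure changes.
    return {"items": [{"price": 10, "qty": 2}], "discount": 5}


def test_order_total_applies_discount(basic_order):
    assert order_total(basic_order) == 15
```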
Conclusion
This is not the most optimal process, but it is the result of going through it on hundreds of failures, and it works for the moment. I have to constantly review the list of failures, and as new failures are identified, there could be a situation where some tests are perpetually stuck on the fix later list. For now this is our process, and we can learn from it to form an opinion on how to do it better. It’s not perfect, but it’s more formal than fixing tests all willy-nilly with no clear strategy for achieving the best value for our efforts. I have to give props to the medical community for the inspiration for the Failing Test Severity Index; now we just have to possibly formalize the process algorithm like ESI and look for improvements.
Let me know what you think and send links to other processes you have had success with.