Category: Quality

Keyword Driven Testing with Gherkin in SpecFlow

Well this may be a little confusing because Gherkin is essentially a keyword driven test that uses the Given, When, Then keywords. What I am talking about is using Gherkin, specifically the SpecFlow implementation of Gherkin to create another layer of keywords on top of Gherkin to allow users to not only define the tests in plain English with Gherkin, but to also write new test scenarios without having to ask developers to implement the tests.

Even though we can look at the Gherkin scenario steps as keywords many times developers are needed for new tests because the steps can’t be reused to compose tests for new pages without developers implementing new steps. Now this may be an issue with my approach to Gherkin, but many of the examples I have seen suffer from the same problem so I’m in good company.

Keyword Driven Tests

What I was looking for is a way for developers to just create page objects and users can reuse steps to build up new scenarios they want to run without having developers implement new step definitions. This brought me full circle to one of the first test frameworks I wrote several years ago. I built a keyword driven test framework that allowed testers to open an Excel spread sheet, select action keywords, enter data, and select controls to target, to compose new test scenarios without having to involve developers in writing new tests. The spreadsheet looked something like:

Step Keyword Data Control
1 Login charlesb,topsecret
2 Open Contact
3 EnterText Charles Bryant FullName
4 Click Submit

This would be read by the test framework that would use it to drive Selenium to execute the tests. With my limited experience in test automation at the time, this became a maintenance nightmare.

Business users were able to create and run the tests, but there was a lot of duplication because the business had to write the same scenarios over and over again to test different data and expectations for related scenario. If you maintain large automated test suites, you know duplication is your enemy. For the scenario above, if they wanted to test what would happen when FullName was empty, they would have to write these four steps again in a new scenario. If there are more fields, the number of scenarios to properly cover the form could become huge. Then when the business wants to add or remove a field, change the workflow, or any other change that affects all of the duplicated tests, the change has to be done on all of them.

It would have been more maintainable if I would have created more high level keywords like Login and separated data from the scenario step definition, but I wasn’t thinking and just gave up after we had issue after issue that required fixing many scenarios because of problems with duplication. Soon I learned how to overcome this particular issue with Data Driven tests and the trick with encapsulating steps in course grained keywords (methods), but I was way past keyword driven tests and had an extreme hate for them.


You may be asking why am I trying to create a keyword driven framework on top of SpecFlow if I hate keyword driven tests. Well, I have been totally against any keyword driven approach because of my experience and I realized that it may not have been the keyword driven approach in general, but my understanding and implementation of it. I just wanted to see what it would look like and what I could do to make it maintainable. I can appreciate allowing business users to create tests on demand without having to involve developers every step of the way. I am not sold on them wanting to do it and I draw the line on giving them test recorders to write the keyword tests for them (test recorders are another fight I will have with myself later).

So, now I know what it could look like, but I haven’t figured out how I would make it maintainable yet. The source code is on GitHub and it isn’t something anyone should use as it is very naive. If you are looking for a keyword driven approach for SpecFlow, it may provide a base for one way of doing it, but there is a lot to do to make it production ready. There are probably much better ways to implement it, but for a couple hours of development it works and I can see multiple ways of making it better. I probably won’t complete it, but it was fun doing it and taking a stroll down memory lane. I still advocate creating steps that aren’t so fine grained and defined at a higher level of abstraction.

The Implementation

So the approach started with the SpecFlow Feature file. I took a scenario and tried to word it in fine grained steps like the table above.

Scenario: Enter Welcome Text
 Given I am on the "Welcome" page
 And I enter "Hello" in "Welcome"
 When I click "Submit"
 Then I should be on the "Success" page
 And "Header" text should be "Success"

Then I implemented page objects for the Welcome and Success page. Next, I implemented the first Given which allows a user to use this step in any feature scenario and open any page that we have defined as a page object that can be loaded by a page factory. When the business adds a new page a developer just has to add the new page object and the business can compose their tests against the page with the predefined generic steps.

Next, I coded the steps that allow a user to enter text in a control, click a control, verify that a specific page is open, and verify that a control has the specified text. Comparing this to my previous approach, the Keywords are the predefined scenario steps. The Data and Controls are expressed as regex properties in the steps (the quoted items). I would have to define many more keywords for this to be as robust as my previous approach with Excel, but I didn’t have to write an Excel data layer and a complex parsing engine. Yet, this still smells like a maintenance nightmare.

One problem outside of test maintenance is code maintenance. I used hard coded strings in my factories to create or select page and control objects. I could have done some reflection and used well known conventions to create generic object factories. I could have used a Data Driven approach to supply the data for scenarios and users would have to just define the actions in the tests. For example, Given I enter text in “Welcome”. Then they would define the test data in a spreadsheet or JSON file and the data can easily be changed for different environments or situations (like scalability tests). With this more generic example step the implementation would be smart enough to get the text data that needs to be entered for the current scenario from the JSON file. This was another problem with my previous keyword driven approach because I didn’t separate data from the tests so to move to another environment or provide data for different situations meant copying the Excel files and updating the data for each scenario that needed to change.


Well, that’s if for now. Maybe I can grow to love or at least like this type of keyword driven testing.

Sample Project:

Selenium WebDriver Custom Table Element

One thing I like to do is to create wrappers that simplify usage of complex APIs. I also like wrappers that allow me to add functionality to an API. And when I can use a wrapper to get rid of duplicate code, I’m down right ecstatic.

Well today I’d like to highlight a little wrapper magic that helps simplify testing HTML tables. The approach isn’t new, but it was introduced in TestPipe by a junior developer with no prior test automation experience (very smart guy).

First some explanation before code as it may be a little hard to understand. TestPipe actually provides a wrapper around Selenium Web Driver. This includes wrappers around the Driver, SearchContext, and Elements. TestPipe has a BaseControl that is basically a wrapper around WebDriver’s IWebElement. This not only allows us to bend WebDriver to our will, the biggest benefit, but it also gives us the ability to swap browser drivers without changing our test code. So, we could opt to use WatiN or some custom System.Net and helpers and not have to do much, if anything, in terms of adjusting our test code. (Side Note: this isn’t a common use case, almost as useless as wrapping the database just so you can change the database later…doesn’t happen often, if ever, but its nice that its there just in case)


Like I said, we have a BaseControl. To provide some nice custom functionality for working with tables we will extend BaseControl.

public class GridControl : BaseControl
 public GridControl(IBrowser browser)
   : base(browser, null, null)
 //...Grid methods

Breaking this down, we create a new class, GridControl, that inherits BaseControl. In the constructor we inject an IBrowser, which is the interface implemented by our WebDriver wrapper. So, outside of TestPipe, this would be like injecting IWebDriver. The BaseControl handles state for Driver and exposes it to our GridControl class through a public property. This same concept can be used to wrap text boxes, drop down lists, radio button…you get the idea.


I won’t show all of the code, but it’s on GitHub for your viewing pleasure. Instead, I will break down a method that provide some functionality to help finding cells in tables. GetCell, this method provides a simple way to get any cell in a table by the row and column.

public IControl GetCell(int row, int column)
 StringBuilder xpathCell = new StringBuilder();
 Select selector = new Select(FindByEnum.XPath, xpathCell.ToString());
 IControl control = this.Find(selector);
 return control;

This isn’t the actual TestPipe code, but enough to get the gist of how this is handled in TestPipe. We are building up an XPath query to find the tr at the specified row index and the td at the specified column index. Then we pass the XPath to a Selector class that is used in the Find method to return an IControl. IControl is the interface for the wrapper around WebDriver’s IWebElement. The Find method uses our wrapper around WebDrivers ISearchContext to find the element…who’s on first?

One problem with this approach is the string “/tbody/tr”. What if your table doesn’t have a tbody? Exactly, it won’t find the cell. This bug was found by another junior developer that recently joined the project, another very smart guy. Anyway, this is a problem we are solving as we speak and it looks like we will default to “/tr” and allow for injection of a different structure like “/tbody/tr” as a method argument. Alternatively, we could search for “/tr” then if not found search for “/tbody/tr”. We may do both search for the 2 structures and allow injection. These solutions are pretty messy, but it’s better than having to write this code every time you want to find a table cell. The point is we are encapsulating this messy code so we don’t have to look at it and getting a cell is as simple as passing the row and column we want to get from the table.

Usage in Tests

In our page object class we could use this by assigning GridControl to a property that represents a table on our page. (If you aren’t using page objects in your automated browser tests, get with the times)

//In our Page Object class we have this property
public GridControl MyGrid
        return new GridControl(this.Browser);

Then it is pretty easy to use it in tests.

//Here we use myPage, an instance of our Page Object,
//to access MyGrid to get the cell we want to test
string cell = myPage.MyGrid.GetCell(2,5).Text;
Assert.AreEqual("hello", cell);

Boom! A ton of evil duplicate code eradicated and some sweet functionality added on top of WebDriver.


The point here is that you should try to find ways to wrap complex APIs to find simpler uses of the API, cut down on duplication, and extend the functionality provided by the API. This is especially true in test automation where duplication causes maintenance nightmares.

Failing Test Severity Index

Have you ever been to the emergency room and gone through triage? They have a standard way of identifying patients that need immediate or priority attention vs those that can wait. Actually, they have an algorithm for this called the Emergency Severity Index (ESI). Well I have a lot of failing tests in my test ER with their bloody redness in test reports begging to be fixed. Everyday I get a ping from them, “I’ve fallen and I can’t get up.” What do I do? Well fix them of course. So, I decided to record my actions in fixing a few of them to see if any patterns arise that I can learn from. In doing so, I formalized my steps so that I can have a repeatable process for fixing tests. I borrowed from ESI and decided to call my process the Failing Test Severity Index (FTSI), sounds official like its something every tester should use and it comes with an acronym FTSI (foot-see). If there is no acronym it’s not official, right?


The first step is to triage the failing tests to identify the most critical tests. I want to devote my energy to the most important, or the tests that provide the most bang for the buck, and triage gives me an official stage in my workflow for prioritizing the tests that need to be fixed.

As a side note, I have so many failures because I inherited the tests and they were originally deployed and never added to a maintenance plan. So, they have festered with multiple issues with test structure and problems with code changes in the system under test that need to be addressed in the tests. With that said, there are a lot, hundreds, of failures. This gave me an excellent opportunity to draft and tune my FTSI process.

During triage I categorize the failures to give them some sort of priority. I take what I know about the tests, the functionality they test, and the value they provide and grade them “critical”, “fix now”, “fix later”, “flaky”, and “ignore”.


Critical tests cover critical functionality. These failing tests cover functionality that if broken would cripple or significantly harm the user, the system, or the business.

Fix Now

Fix now are tests that provide high value, but wouldn’t cause a software apocalypse. These are usually easy fixes or easy wins with a known root cause for the failure that I can tackle after the critical failures.

Fix Later

Fix later are important tests, but not necessary to fix in the near term. These are usually tests that are a little harder to understand why they are failing. Sometimes I will mark a test as fix later, investigate the root cause and once I have a handle on what it takes to fix them, I move them to fix now. Sometimes I will tag a failure as fix later when a test has a known root cause, but the the time to fix is not in line with the value that the test provides compared to other tests I want to fix now.


Flaky tests are tests that are failing for indeterminate reasons. They fail one time and pass another time, but nothing has changed in the functionality they cover. This is a tag I am questioning whether I should keep because I could end up with a ton of flaky test being ignored and cluttering the test report with a bunch of irritating yellow (almost as bad as red).


Ignore are tests that can be removed from the test suite. These are usually tests that are not providing any real value because they no longer address a valid business case or are too difficult to maintain. If I triage and tag a test as ignore, it is because I suspect the failure can be ignored, but I want to spend some extra time investigating if I should remove them. When a test is ignored it is important to not let them stay ignored for long. Either remove them from the test suite or fix them so that they are effective tests again.

Test Tagging and Ticketing

Once the test failures are categorized, I tag them in the test tool so they are ignored during the test run. This also gives me a way to quickly identify the test code that is causing me issues. I also add the failing tests to our defect tracker. For now failing tests are not related to product releases so they are just entered as a defect on the test suite for the system under test and they are released when they are fixed. I am not adding any details on the reason of the failure besides the test failure message and stack trace, if available. Also, I may add cursory information that gives clues to possible root cause of failure such as test timing out, config issue, data issue…obvious reasons for failures. This allows me to focus on the initial triage and not spend time on trying to investigate issues that may not be worth my time.

This process helps to get rid of the red and replace it with yellow (ignore) in the test reports so there is still an issue if we are not diligent and efficient in the process. It does clear the report of red so that we can easily see new failures. If we leave the tests as failures, we can get in the habit of overlooking the red and miss a critical new failure.

Critical Fix

After I complete triage and tagging, I will focus on fixing the critical test failures. Depending on the nature of the critical failure I will forgo triage on the other failures and immediately fix the critical failure. Fixing critical tests have to be the most important task because the life of the feature or system under test is in jeopardy.


Once I have my most critical tests fixed, I will begin investigating. I just select the most high priority tests that don’t have a known root cause and begin looking for the root cause of the failure. I begin with reading through the scenario covered by the test. Then I read the code to see what the code is doing. Of course if I was very familiar with a test I may not have to do this review, but if I have any doubts I start with gaining an understanding of the test.

Now, I run the test and watch it fail. If I can clearly tell why the test fails I move on to the next step. If not, since I have access to the source code for the system under test, I open a debug session and step through the test. If it still isn’t obvious I may get a developer involved, ask a QA, BA, or even end user about their experience with the scenario.

Once I have the root cause identified, I will tag the test as fix now. This is not gospel, if the fix is very easy, I will forgo the fix now tag and just fix the test.


Once I have run through triage and tagging, fixed critical tests, and have investigated new failures, I focus time on fixing any other criticals then move on to the fix now list. During the fix I will also focus on improving the tests through refactoring and looking at improving things like maintainability and performance of the tests. The goal during the fix is not only to fix the problem, but simplify future maintenance and improve the performance of the test. I want the test to be checked in better than I checked it out.


This is not the most optimal process at the moment, but is the result of going through the process on hundreds of failures and it works for the moment. I have to constantly review the list of failures and as new failures are identified, there could be a situation where some tests are perpetually listed on the fix later list. For now this is our process and we can learn from it to form an opinion on how to do it better. It’s not perfect, but its more formal than fixing tests all willy-nilly with no clear strategy for achieving the best value for our efforts. I have to give props to the medical community for the inspiration for the Failing Test Severity Index, now we just have to possibly formalize the process algorithm like ESI and look for improvements.

Let me know what you think and send links to other process you have had success with.

If It Looks Like a Defect is It a Defect?

Our software quality metrics work group had a discussion today and metrics around defects became an interesting topic. One of the work group members said that the concept of a defect is not relevant to agile teams. This was clarified as defect metrics within the confines of an agile sprint. I felt kind of dumb, because I didn’t know this and it appeared that there may be a consensus with it. Maybe I misunderstood, but the logic was that there are no defects in sprint because once a problem is found it is immediately fixed in the sprint. I wanted to push for defect metrics from check-in through production. The later in the software delivery pipeline that a defect is found the more it will cost, so you have to know where it was caught. I didn’t get to dig in to the topic with the group because I was contemplating whether I needed to revisit my understanding of Agile and I didn’t want to slow the group down. I already feel like a lightweight in the ring with a bunch of heavyweights :).

Defects Cost Money

After pondering it a bit, I am still of the opinion that defects exists whether you name them something else, quietly fix them before anyone notices, or collectively as a team agree not to track them. Defects are an unavoidable artifact of software development. Defect, bug, issue…it doesn’t work as expected, name it what you like or obscure it in the process, defects are always there and will be until humans and computers become perfect beings. Defects cost money when more than one person has to deal with them. If a defect is caught in an exploratory test and it is acknowledged that it must be fixed in sprint, then it will have to be retested after the fix. Pile this double testing cost on top of the development cost and defects can get expensive.

Not to mentions, defects slow sprints down. When you estimated a certain amount of story points, let’s say 2, and ultimately the story was an 8 because of misunderstandings and bad coding practices, there is a cost associated with this. Maybe estimates are stable or perfect in mature hard core Agile teams or defects just another chore in the process that don’t warrant tracking or analysis. For new teams just making the transition to agile, tracking defects provides an additional signal that something is wrong in the process. If you are unable to see where your estimate overruns are occurring you can’t take action to fix them.


If someone besides the developer finds a defect, the story should be rejected. At the end of the sprint we should be able to see how many rejections there were and at what stage the rejects occurred in the pipeline. If these number are high or trending up, especially later in the pipeline, something needs to be done and you know there is a problem because you tracked defects. It may be my lack of experience in a hard core Agile team, but I just can’t see a reason to ignore defects just because they are supposed to be fixed in sprint.

Can someone help me see the light? I thought I was agile’ish. I am sure there is an agile expert out there than can give me another view of what defects mean in agile and how my current thought process is out of place in agile. I think my fellow group members are awesome, but I usually look for a second opinion in topics I am unsure about.

Test Automation Tips: 2

#2 If you update or delete test data in a test, you need to isolate the data so it isn’t used by any other tests.

A major cause of flaky tests (tests that fail for indeterminate reasons) is bad test data. This is often caused by a previous test altering the test data in a way that causes subsequent tests to fail. It is best to isolate test data for scenarios that change the test data. You should have an easy way to isolate read only data so that you can insure it isn’t corrupted up by your tests.

Delta Seed

One way I like to do this is having what I call a delta test data seed. This delta seed loads all read only test data at the start of a test suite run. Any test data that needs to be updated or deleted is created per test. Mutable test data seeds are ran after the delta seed. So, in addition to the delta seed I will have a suite, feature or scenario seed.

Suite Seed

The suite seed is ran right after the delta seed, usually with the same process that runs the delta seed. Because the suite seed data is available to all tests being ran, it is the most riskiest seed and the least efficient unless you are running all of your tests as you may not need all of the data being loaded. I say risky, because it opens up the scenario where someone writes a test against mutable data when it should only be used by the test that will be changing the data.

Feature Seed

The feature seed would run at the beginning of a feature test during text fixture setup. This basically loads all the data used by tests in the feature. This has some of the same issues as the suite seed. All of the data is available for all tests in the feature and if someone gets lazy and writs a test against the mutable data instead of creating new data specifically for the test may result in flaky tests.

Scenario Seed

The scenario seed runs at the beginning of an individual test in the feature test fixture. This is the safest in terms of risk as the data is loaded for the test and deleted after the test so no other tests can use it. The problem I have with this is when you have a lot of tests having to create hundreds of database connections and dealing with seeding in the test can have an impact on overall and individual test time. If not implemented properly this type of data seeding can also create a maintenance nightmare. I like to use test timing as an indicator of issues with tests. If you can’t separate the time to seed the data from the time to run the test, having the seed factored into the test can affect the timing in a way that has nothing to do with what is being tested. So, you have to be careful not to pollute your test timing with seeding.

Which Seed?

Which seed to use depends on multiple factors. How you run your tests? If you are running all tests, it may be efficient to use a suite seed. If you run features in parallel, you may want a feature seed to quickly load all feature data at one time. If you run tests based on dependencies in a committed change, you may want to keep seeding granular with a scenario seed. There are many other factors that you can take into account and trial and error is a good way to go as you optimize test data seeding.


The thing to take away is you need a strategy to manage the isolation of test data by desired immutability of the data. Tests that don’t alter test data should use read only data. Tests that alter test data should use test data specifically created for the test. If you allow a test to use data altered by another test, you open yourself up to flaky test syndrome.

View more tips.

When I am writing and maintaining large functional UI tests I often realize somethings that would make my life easier. I decided to write this series of posts to describe some of the tips I have for myself in hopes that they prove to be helpful to someone else. What are some of your tips?

An Easy Win for Testable Methods

One of our developers was having an issue testing a service. Basically, he was having a hard time hitting the service as it is controlled by an external company and we don’t have firewall rules to allow us to easily reach it in our local environment. I suggested mocking the service since we really weren’t testing getting a response from the service, but what we do with the response. I was told that the code is not conducive to mocking. So, I took a look and they were right, but the fix to make it testable was a very simple refactor. Here is the gist of the code:

public SomeResponseObject GetResponse(SomeRequestObject request)
//Set some additonal properties on the request
request.Id = "12345";

//Get the response from the service
SomeResponseObject response = Client.SendRequest(request);

//Do something with the response
if (response != null)
//Do some awesome stuff to the response
response.LogId = "98765";
if (response.Id > "999")
response.Special = true;

return response;

What we want to test is the “Do something with the response” section of the code, but this method is doing so many things that we can’t isolate that section and test it…or can we? To make this testable we simply move everything that conserns “Do something with the response” to a separate method.

public SomeResponseObject GetResponse(SomeRequestObject request)
//Set some additonal properties on the request
request.Id = "12345";

//Get the response from the service
SomeResponseObject response = Client.SendRequest(request);

return ProcessResponse(response);
public SomeResponseObject ProcessResponse(SomeResponseObject response)
//Do something with the response
if (response != null)
//Do some awesome stuff to the response
response.LogId = "98765";
if (response.Id > "999")
response.Special = true;

return response;

Now we can just test the changes to ProcessResponse method in isolation away from the service calls. Since there were no changes to the service or the service client, we didn’t have to worry about testing them for this specific change. We don’t care what gets returned we just want to know if the response was properly processed and logged. We still have a hard dependency on LogResponse’s connection to the database, but I will live with this as an integration test and fight for unit tests another day. This is a quick win for testability and a step closer to making this class SOLID.

Test Automation Tips: 1

#1 Don’t test links unless you are testing links.

Unless you are specifically testing menus or navigation links, don’t automate the clicking of links to navigate to the actual page under test. Instead take the shortest route to set the test up.

Let’s say I have a test that starts by clicking the product catalog link, then clicks on a product to get to product details, just so I can test that a coupon appears in product details. This test is testing too many concepts that are out of the scope of the intent of the test. I am trying to test the coupon visibility, not links. Instead, I should navigate directly to the product details page and make my assertion.

If you use the link method and fool yourself into believing that your test is acting like a user, when links are broken your test will fail for a reason other than the reason you are testing. If you use the same broken link in multiple tests, you will have multiple failures and will have to fix and rerun the tests to get feedback on the features you are really trying to test. If the broken tests are in a nightly test, you just lost a lot of test coverage and and you don’t know if the features are passing or failing because they never got tested. Using the link method can cause a dramatic loss of test coverage. So, test links in navigation tests only when the links are front and center in the purpose of the test. For all other tests, take the shortest route to the page under.

View more tips.

When I am writing and maintaining large functional UI tests I often realize somethings that would make my life easier. I decided to write this series of posts to describe some of the tips I have for myself in hopes that they prove to be helpful to someone else. What are some of your tips?

Test Automation Tips

When I am writing and maintaining large functional UI tests I often realize somethings that would make my life easier. I decided to write this series of posts to describe some of the tips I have for myself in hopes that they prove to be helpful to someone else. What are some of your tips?

#1 Don’t test links unless you are testing links.

#2 If you update or delete test data in a test, you need to isolate the data so it isn’t used by any other tests.

#3 Don’t become complacent in testing when a change is simple.

What Makes a Good Candidate for Test Automation?

Writing large UI based functional tests can be expensive in terms of money and time. It is sometimes hard to know where to focus your test budget. New features are good candidates, especially the most common successful and exceptional paths through the feature. But, when you have a monster legacy application with little to no coverage, where to get the biggest bang for the buck can be hard to ascertain.

Bugs, Defects, Issues…It Doesn’t Work

I believe bugs provide a good candidate for automation, especially if regression is a problem for you. Even if regression is not an issue, its always good to protect against regressions. So, automating bugs are kind of a win-win in terms of risk assessment. Hopefully, when a bug is found whoever finds the bug or whoever adds it to the bug database provides reproduction steps. If the steps are a good candidate for automation, automate it.

Analyzing Bugs

What makes a bug a good candidate for test automation? When analyzing bugs for automated testing I like to evaluate on 4 basic criteria. In descending order of precedence:

  • The steps are easy to model in the test framework.
  • The steps are maintainable as an automated test.
  • The bug was found before.
  • The bug caused a lot of pain to users or the company.

It is just common sense that “bug caused a lot of pain” is the top candidate. If a bug caused a lot of pain, you don’t want to repeat it, unless you like pain. Yet, if the painful bug is a maintenance nightmare as an automated test, the steps are hard to model, and the bug wasn’t found before you may want to just mark it for manual regression. If your test matches 2 or more of the criteria I’d say it is a high priority candidate for test automation.


These are just my opinion and there is no study to prove any of it. I know this has been thought of and pondered, maybe even researched by someone. If you know where I can find some good discussions on this topic or if you want to start one, please let me know.


Finding Bad Build Culprits…Who Broke the Build!

I found an interesting Google Talk on finding culprits automatically in failing builds – This is actually a lightening talk at GTAC 2013 given by grad students Celal Ziftci and Vivek Ramavajjala. First they gave an overview of how culprit analysis is done on build failures triggered by small test and medium sized tests.

CL or change list is a term I first heard in “How Google Tests Software” and refers to a logical grouping o changes committed to the source tree. This would be like a git feature branch.

Build and Small Tests Failures

When the build fails because of a build issue we build the CLs separately until a CL fails the build. When the failure is a small test (unit test) we do the same thing. Build CLs separately and run the tests against them to find the culprit. In both cases, we can do the analysis in parallel to speed it up. This is what I covered in my post on Bisecting Our Code Quality Pipeline where git bisect is used to recurse the CLs.

Medium Tests

Ziftci and Ramavajjala define these tests as taking less than 8 minutes to run and suggest using a binary search to find the culprit. Target the middle CL, build it and if it fails, the culprit is most likely to the left, so we recurse to the left until we find the culprit. If it passes, we recurse to the right.

CL 1 – CL 2 – CL 3 – CL 4 – CL 5 – CL6

CL 1 is the last known passing CL. CL 6 was the last CL in the failing build. We start by analyzing CL 4 and if fails, then we move left and check CL 3. If CL 3 passes, we mark CL 4 as the culprit. If CL 3 fails we mark CL 2 as the culprit because we know that CL 1 was good and don’t need to continue analyzing.

If CL 4 passed, we would move right and test CL 5 and if it fails, mark CL 5 as the culprit. If it passes, then we mark CL 6 as the culprit because it is the last suspect and we don’t have to waste resources analyzing it.

Large Tests

They defined these tests as taking longer than 8 minutes to run. This was the primary focus of Ramavajjala and Ziftci’s research. They are focusing on developing heuristics that will let a tool identify culprits by pattern matching. They explained how they have a heuristic that will analyze a CL for number of files changed and give a higher ranking to CLs with more files changed.

They also have a heuristic that calculates the distance of code in the CL from base libraries, like the distance from the core Python library for example. The closer it is to the core the more likely that it is a core piece of code that has had more rigorous evaluation because there may be many projects depending on it.

They seemed to be investing a lot of time into insuring that they can do this fast. They stress caching and optimizing how they do this. It sounds interesting and once they have had a chance to run their tool and heuristics against the massive amount of tests at Google (they both became employees of Google) hopefully they can share the heuristics that prove to be most adept at finding culprits at Google and maybe anywhere.


They did mention possibly using a heuristic that looks at the logs generated by build failures to identify keywords that may provide more detail on who the culprit maybe. I had a similar thought after I wrote the git bisect post.

Many times when a test fails in larger tests there are clues left behind that we would normally manually inspect to find the culprit. If the test has good messaging on their assertions, that is the first place to look. In a large end to end test there may be many places for the test to fail, but if the failure message gives a clue of what fails it helps to find the culprit. Although, they spoke of 2 hr tests and I have never seen one test that takes 2 hours so what I was thinking about and what they are dealing with may be another animal.

There is also the test itself. If the test covers a feature and I know that only one file in one CL is included in the dependencies involved in the feature test, then I have a candidate. There is also application logs and system logs. The goal as I saw it was to find a trail that led me back to a class, method, or file that matches a CL.

The problem with me trying to seriously try and solve this is I don’t have a PhD in Computer Science, actually I don’t have a degree except from the school of hard knocks. When they talked about the binary search for medium sized tests it sounded great. I kind of know what a binary search is. I have read about it and remember writing a simple one years ago, but if you ask me to articulate the benefits of using quad tree instead of binary search or to write a particular one on the spot, I will fumble. So, trying to find an automated way to analyze logs in a thorough, fast and resource friendly manner is a lot for my brain to handle. Yet, I haven’t let my shortcomings stop me yet, so I will continue to ponder the question.

We are talking about parsing and matching strings, not rocket science. This maybe a chance for me to use or learn a new language more adept at working with strings than C#.


At any rate I find this stuff fascinating and useful in my new role. Hopefully, I can find more on this subject.