I found an interesting Google talk on automatically finding culprits in failing builds: https://www.youtube.com/watch?v=SZLuBYlq3OM. It is a lightning talk given at GTAC 2013 by grad students Celal Ziftci and Vivek Ramavajjala. First they gave an overview of how culprit analysis is done on failures triggered by builds, small tests, and medium-sized tests.
CL, or change list, is a term I first heard in “How Google Tests Software”; it refers to a logical grouping of changes committed to the source tree. This would be like a git feature branch.
Build and Small Test Failures
When the build fails because of a build issue, we build the CLs separately until a CL fails the build. When the failure is in a small test (unit test), we do the same thing: build the CLs separately and run the tests against them to find the culprit. In both cases, we can do the analysis in parallel to speed it up. This is what I covered in my post on Bisecting Our Code Quality Pipeline, where git bisect is used to work through the CLs.
Ziftci and Ramavajjala define medium-sized tests as taking less than 8 minutes to run and suggest using a binary search to find the culprit. Target the middle CL and build it. If it fails, the culprit is most likely to the left, so we recurse to the left until we find the culprit. If it passes, we recurse to the right.
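To make the parallel idea concrete, here is a minimal sketch of checking every suspect CL at once and reporting the earliest one that fails. The `passes` callable is a stand-in I made up for "build the tree at this CL and run the small tests"; it is not anything shown in the talk.

```python
from concurrent.futures import ThreadPoolExecutor


def find_first_failing_cl(cls, passes):
    """Check every CL in parallel and return the earliest one that fails.

    cls    -- CL identifiers ordered oldest to newest
    passes -- hypothetical callable: True if the build/tests are green at that CL
    """
    with ThreadPoolExecutor() as pool:
        # pool.map returns results in the same order as the input CLs
        results = list(pool.map(passes, cls))
    for cl, ok in zip(cls, results):
        if not ok:
            return cl
    return None  # nothing failed


# Pretend the breakage came in with CL 4, so CL 4, 5 and 6 are all red.
suspects = [2, 3, 4, 5, 6]
first_bad = find_first_failing_cl(suspects, lambda cl: cl < 4)
print(first_bad)  # 4
```

This trades machine time for wall-clock time: every CL gets built, but the answer arrives after roughly one build's worth of waiting.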
CL 1 – CL 2 – CL 3 – CL 4 – CL 5 – CL 6
CL 1 is the last known passing CL. CL 6 was the last CL in the failing build. We start by analyzing CL 4, and if it fails, we move left and check CL 3. If CL 3 passes, we mark CL 4 as the culprit. If CL 3 fails, the culprit is either CL 2 or CL 3, so we check CL 2. If CL 2 fails, we mark it as the culprit because we know that CL 1 was good; if it passes, we mark CL 3.
If CL 4 passed, we would move right and test CL 5 and if it fails, mark CL 5 as the culprit. If it passes, then we mark CL 6 as the culprit because it is the last suspect and we don’t have to waste resources analyzing it.
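The walk-through above can be sketched as a small binary search. This is my own illustration, not code from the talk: `passes` is a hypothetical callable that builds and tests the tree at a given CL, and the sketch assumes a single culprit, so every CL from the culprit onward fails.

```python
def find_culprit(cls, passes):
    """Binary search for the first failing CL.

    cls    -- suspect CLs, oldest to newest; the CL before cls[0] is known
              good and the build at cls[-1] is known bad
    passes -- hypothetical callable: True if the tests pass at that CL
    Returns (culprit, list of CLs that were actually tested).
    """
    lo, hi = 0, len(cls) - 1
    checked = []
    while lo < hi:
        mid = (lo + hi) // 2
        checked.append(cls[mid])
        if passes(cls[mid]):
            lo = mid + 1  # culprit is to the right
        else:
            hi = mid      # culprit is this CL or to the left
    return cls[lo], checked


suspects = [2, 3, 4, 5, 6]  # CL 1 is the last known good CL

print(find_culprit(suspects, lambda cl: cl < 3))  # (3, [4, 3, 2])
print(find_culprit(suspects, lambda cl: cl < 6))  # (6, [4, 5])
```

In the second call CL 6 is never tested: once CL 4 and CL 5 pass, it is the only suspect left.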
They defined large tests as those taking longer than 8 minutes to run. This was the primary focus of Ramavajjala and Ziftci’s research. They are developing heuristics that will let a tool identify culprits by pattern matching. They explained one heuristic that analyzes a CL for the number of files changed and gives a higher ranking to CLs with more files changed.
They also have a heuristic that calculates the distance of the code in the CL from base libraries, for example the distance from the core Python library. The closer it is to the core, the more likely it is a core piece of code that has had more rigorous evaluation, because many projects may depend on it.
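Here is a rough sketch of what the files-changed heuristic could look like. The data shape (a dict of CL id to changed paths) and the file names are my assumptions; the talk did not show an implementation.

```python
def rank_by_files_changed(changed_files):
    """Return CL ids ordered most suspicious first (most files changed).

    changed_files -- hypothetical dict mapping CL id -> list of changed paths
    """
    return sorted(changed_files, key=lambda cl: len(changed_files[cl]), reverse=True)


# Made-up change lists for illustration only
changed_files = {
    "CL 2": ["a.py"],
    "CL 3": ["a.py", "b.py", "c.py"],
    "CL 4": ["d.py", "e.py"],
}
ranking = rank_by_files_changed(changed_files)
print(ranking)  # ['CL 3', 'CL 4', 'CL 2']
```

A ranking like this does not prove anything; it only decides which CL is worth spending an expensive large test run on first.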
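One way to picture the distance heuristic is as the number of dependency hops from the changed code down to a base library. This sketch is my guess at the idea, not their algorithm: the dependency graph, the target names, and the choice to rank the farthest CL first are all assumptions.

```python
from collections import deque


def distance_to_base(target, deps, base):
    """Fewest dependency hops from target to base, or None if unreachable.

    deps -- hypothetical dict mapping a target to the targets it depends on
    """
    seen = {target}
    queue = deque([(target, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == base:
            return dist
        for dep in deps.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append((dep, dist + 1))
    return None


# Made-up dependency graph: web_ui -> billing -> utils -> stdlib
deps = {"web_ui": ["billing"], "billing": ["utils"], "utils": ["stdlib"]}
cl_targets = {"CL 2": "utils", "CL 3": "web_ui"}

distances = {cl: distance_to_base(t, deps, "stdlib") for cl, t in cl_targets.items()}
# Farther from the core means less widely exercised, so rank it as more suspicious.
ranked = sorted(distances, key=distances.get, reverse=True)
print(distances)  # {'CL 2': 1, 'CL 3': 3}
print(ranked)     # ['CL 3', 'CL 2']
```

The sort assumes every target can reach the base library; a real tool would need to decide what to do with a `None` distance.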
They seemed to be investing a lot of time into ensuring that they can do this fast. They stress caching and optimizing how they do this. It sounds interesting, and once they have had a chance to run their tool and heuristics against the massive number of tests at Google (they both became employees of Google), hopefully they can share the heuristics that prove to be most adept at finding culprits at Google and maybe anywhere.
They did mention possibly using a heuristic that looks at the logs generated by build failures to identify keywords that may provide more detail on who the culprit may be. I had a similar thought after I wrote the git bisect post.
Many times when a test fails in larger tests, there are clues left behind that we would normally inspect manually to find the culprit. If the test has good messaging on its assertions, that is the first place to look. In a large end-to-end test there may be many places for the test to fail, but if the failure message gives a clue about what failed, it helps to find the culprit. That said, they spoke of 2 hr tests and I have never seen one test that takes 2 hours, so what I was thinking about and what they are dealing with may be another animal.
There is also the test itself. If the test covers a feature and I know that only one file in one CL is included in the dependencies involved in the feature test, then I have a candidate. There are also application logs and system logs. The goal as I saw it was to find a trail that led me back to a class, method, or file that matches a CL.
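As a first stab at following that trail, here is a minimal sketch that counts how many of each CL's changed files are named in a failure log. The log line, file paths, and scoring rule are all made up for illustration; a real tool would need smarter matching than a substring check.

```python
import os


def score_cls_from_log(log_text, changed_files):
    """Count how many of each CL's changed files are mentioned in the log.

    changed_files -- hypothetical dict mapping CL id -> list of changed paths
    """
    scores = {}
    for cl, paths in changed_files.items():
        names = {os.path.basename(p) for p in paths}
        scores[cl] = sum(1 for name in names if name in log_text)
    return scores


# Made-up failure output and change lists
log_text = "System.NullReferenceException at OrderService.Submit() in OrderService.cs:line 42"
changed_files = {
    "CL 2": ["src/Billing/Invoice.cs"],
    "CL 3": ["src/Orders/OrderService.cs", "src/Orders/Cart.cs"],
}
scores = score_cls_from_log(log_text, changed_files)
print(scores)  # {'CL 2': 0, 'CL 3': 1}
```

The CL with the highest score becomes the first candidate to check, which is the same thing I would do by eye when reading a stack trace.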
The problem with me seriously trying to solve this is I don’t have a PhD in Computer Science; actually, I don’t have a degree except from the school of hard knocks. When they talked about the binary search for medium-sized tests it sounded great. I kind of know what a binary search is. I have read about it and remember writing a simple one years ago, but if you ask me to articulate the benefits of using a quad tree instead of a binary search, or to write a particular one on the spot, I will fumble. So, trying to find an automated way to analyze logs in a thorough, fast, and resource-friendly manner is a lot for my brain to handle. Still, I haven’t let my shortcomings stop me yet, so I will continue to ponder the question.
We are talking about parsing and matching strings, not rocket science. This may be a chance for me to use or learn a new language more adept at working with strings than C#.
At any rate I find this stuff fascinating and useful in my new role. Hopefully, I can find more on this subject.
I want to implement gated check-ins, but it will be some time before I can restructure our process and tooling to accomplish it. What I really want is to be able to keep the source tree green and, when it is red, provide feedback to quickly get it green again. I want to run tests on every commit and give developers feedback on their failing commits before they pollute the source tree. Unfortunately, running the tests as we have them today would take too long to do on every commit. I came across a quick blog post by Ayende Rahien on Bisecting RavenDB, and they had a solution where they used git bisect to find the culprit that failed a test. They gave no information on how it actually worked, just a tease that they are doing it. I left a comment to see if they would share some of the secret sauce behind their solution, but until I get that response I wanted to ponder it for a moment.
To speed up testing and also allow test failure culprit identification with git bisect, we would need a custom test runner that can identify which tests to run and run them. We don’t run tests on every commit; we run tests nightly against all the commits that occurred that day. When a test fails, it can be difficult to identify the culprit(s) that failed it. This is where Ayende steps in with his team’s idea to use bisect to help identify the culprit. Bisect works by traversing commits. It covers the range from the commit we mark as the last known good commit to the last commit that was included in the failing nightly test. As bisect works through the commits, it pauses at each commit it selects and allows you to test it and mark it good or bad. In our case we could run a test against a single commit. If it passes, tell bisect it’s good and to move to the next. If it fails, save the commit and failing test(s) as a culprit, tell bisect it’s bad and to move to the next. This will result in a list of culprit commits and their failing tests that we can use for reporting and bashing over the head of the culprit owners (just kidding…not).
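The bookkeeping I have in mind could look something like this sketch. It is not Ayende’s script and it does not call git; it only shows the loop that tests each commit and collects culprits. `run_tests` is a hypothetical callable that returns the names of the failing tests for a commit, and the commit ids and test names are made up.

```python
def collect_culprits(commits, run_tests):
    """Test each commit and record the ones with failing tests.

    commits   -- commit ids, oldest to newest, after the last known good commit
    run_tests -- hypothetical callable: commit id -> list of failing test names
    Returns a dict mapping culprit commit -> its failing tests.
    """
    culprits = {}
    for commit in commits:
        failures = run_tests(commit)
        if failures:  # "bad": remember the commit and what it broke
            culprits[commit] = failures
        # "good" commits need no bookkeeping; move on to the next one
    return culprits


# Made-up nightly results for illustration
fake_results = {"a1f": [], "b2e": ["OrderServiceTests.Submit"], "c3d": []}
culprits = collect_culprits(["a1f", "b2e", "c3d"], lambda c: fake_results[c])
print(culprits)  # {'b2e': ['OrderServiceTests.Submit']}
```

With real git, `git bisect run <command>` can automate the good/bad marking from a command’s exit code, but it performs a binary search for the first bad commit rather than visiting every commit, so it reports one culprit instead of a list.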
Custom Test Runner
The test runner has to be intelligent enough to run all of the tests that exercise the code included in a commit. The custom test runner has to look for testable code files in the commit change log, in our case .cs files. When it finds a code file, it will identify the class in the code file and find the test that targets the class. We are assuming one class per code file and one unit test class per code file class. If this convention isn’t enforced, then some tests may be missed or we have to do a more complex search. Once all of the test classes are found for the commit’s code files, we run the tests. If a test fails, we save the test name and maybe the failure results, exception, stack trace… so it can be associated with the culprit commit. Once all of the tests have run, if any of them failed, we mark the commit as a culprit. After the test and culprit identification is complete, we tell bisect to move to the next commit. As I said before, this will result in a list of culprits and failing test info that we can use in our feedback to the developers.
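The test lookup described above might be sketched like this. The naming convention (a class `Foo` in `Foo.cs` is tested by `FooTests`) is my assumption, as are the file and class names; the post only assumes one class per file and one test class per class.

```python
import os


def select_tests(changed_paths, known_test_classes):
    """Map a commit's changed .cs files to test classes by naming convention.

    Assumes Foo.cs holds class Foo and its tests live in a class named FooTests.
    Returns (test classes to run, changed code files with no matching test class).
    """
    selected, unmatched = [], []
    for path in changed_paths:
        class_name, ext = os.path.splitext(os.path.basename(path))
        if ext.lower() != ".cs":
            continue  # not a testable code file
        # A changed test file selects itself; otherwise apply the convention.
        if class_name in known_test_classes:
            test_class = class_name
        else:
            test_class = class_name + "Tests"
        if test_class in known_test_classes:
            if test_class not in selected:
                selected.append(test_class)
        else:
            unmatched.append(path)
    return selected, unmatched


# Made-up commit contents and test inventory
changed = ["src/Orders/OrderService.cs", "src/Orders/Cart.cs", "README.md"]
known = {"OrderServiceTests", "InvoiceTests"}
selected, unmatched = select_tests(changed, known)
print(selected)   # ['OrderServiceTests']
print(unmatched)  # ['src/Orders/Cart.cs']
```

The `unmatched` list matters as much as the selected tests: it shows where the convention isn’t being followed and a test might be missed.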
Make It Faster
We could make this fancy and look for the specific methods that were changed in the commit’s code file classes. We would then only find tests that test the methods that were changed. This would make testing focused like a laser and even faster, and we could probably employ Roslyn to handle the code analysis to make finding tests easier. I suspect tools like ContinuousTests – MightyMoose do something like this, so it’s not that far-fetched an idea, but definitely a mountain of things to think about.
Well, this is just a thought, a thesis if you will, and if it works, it will open up all kinds of possibilities to improve our Code Quality Pipeline. Thanks Ayende, and please think about open sourcing that bisect.ps1 PowerShell script 🙂