Category: Analytics

Observable Resilience with Envoy and Hystrix Works for .NET Teams

We had an interesting production issue where a service decided to stalk a Google API like a bad date and incurred a mountain of charges. The issue made me ponder the inadequate observability and resilience we had in the system. We had resource monitoring through some simple Kubernetes dashboards, but I always wanted something more robust for observability. We also didn’t have a standard policy on timeouts, rate limiting, circuit breaking, bulkheading… resilience engineering. Then my mind wandered back to a video that I thought was amazing. The video was from the Netflix team and it altered my view on observability and system resilience.

I was hypnotized when Netflix released a view of the Netflix API Hystrix dashboard. There is no sound in the video, but for some reason this dashboard was speaking loudly to me through the Matrix or something, because I wanted it badly. Like teenage me back in the day wanting a date with Janet Jackson: bad meaning bad.

Netflix blogged about the dashboard, and the simplicity of a circuit breaker monitoring dashboard blew me away. It had me dreaming of using the same type of monitoring to observe our software delivery process, marketing and sales programs, OKRs and our business in general. I saw more than microservices monitoring; I saw system-wide value stream monitoring (another topic that I spend too much time thinking about).

Unfortunately, when I learned about this Hystrix hotness I was under the impression that the dashboard required you to use Hystrix to instrument your code to send this telemetry to the dashboard. Since Hystrix is Java based, I thought it was just another cool toy for the Java community that leaves me, a .NET dev, out in the cold looking in on the party. Then I got my invitation.

I read that Envoy (on my circa 2018 cool things board and the most awesome K8s tool IMHO) was able to send telemetry to the Hystrix dashboard. This meant we, the .NET development community, could get similar visual indicators and faster issue discovery and recovery, like Netflix experienced, without the need to instrument code in any container workloads we have running in Kubernetes.

Install the Envoy sidecar, configure it on a pod, send sidecar metrics to the Hystrix dashboard and we have deep observability and a resilience boost without changing one line of .NET Core code. That may not be a good “getting started” explanation, but the point is, it isn’t a heavy lift to get the gist and be excited about this. I feel like if we had this on the system, we would have caught our Google API issue a lot sooner than we did and incurred fewer charges (even though Google is willing to give one-time forgiveness, thanks Google).

In hindsight, it is easy to identify how we failed with the Google API fiasco, umm.. my bad code. We’re a blameless team, but I can blame myself. I’d also argue that better observability into the system and improving resilience mechanisms has been a high priority of mine for this system. We haven’t been able to fully explore and operationalize system monitoring and alerts because of jumping through made-up hoops to build unnecessary premature features. If we had spent that precious time building out monitoring and alerts that let us know when request/response count has gone off the rails, if we had implemented circuit breakers to prevent repeated requests when all we get in response are errors, if we had been able to focus on scale and resilience instead of low priority vanity functionality, I think we’d have what we need to better operate in production (but this is also biased by hindsight). Real root cause – our poor product management and inability to raise the priority of observability and resilience.
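To make the circuit breaker idea concrete, here is a minimal sketch in Python (picked only for brevity; our services are .NET, and with Envoy the breaker would live in the sidecar rather than in application code). The class name, thresholds and `call` method are all hypothetical, not anything from Hystrix or Envoy.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: opens after N consecutive failures, retries after a cooldown."""

    def __init__(self, max_failures=3, reset_after_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock  # injectable so the behavior can be checked without sleeping
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_seconds:
                # Still cooling down: skip the request instead of hammering the API.
                raise RuntimeError("circuit open: request skipped")
            # Cooldown elapsed: allow one trial request (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

Wrapping the outbound call in something like `breaker.call(fetch_from_api)` would have stopped the repeated requests after a few consecutive errors instead of letting them pile up charges.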

Anyway, if you are going to scale in Kubernetes and are looking for a path to better observability and resilience, check out Envoy, Istio, Ambassador and Hystrix; they could change your production life. Hopefully, I will blog one day about how we use each of these.

An Agile Transformation

I wrote this a few years ago, but I’m going through a similar agile transformation right now. Every agile transformation is different, but this still makes sense to me even though it is just a draft post. I figured I’d just post it because I never search my drafts for nuggets of knowledge :).

If we are going to do Kanban we shouldn’t waste time formally planning sprints. Just like we don’t want to do huge upfront specifications because of waste caused by unknowns that invalidate specs, we don’t want to spend time planning a sprint because the work being done in the sprint can change anytime the customer wants to reprioritize.

We should have a backlog of prioritized features. The backlog is regularly prioritized (daily, weekly…) to keep features available to work. If we want to deliver a specific feature or set of features in two weeks, prioritize them and the team will do those features next.

There is a limit on the number of features the team can have in progress (work in progress or WIP). Features are considered WIP until they pass UAT. Production would be a better target, but saying a feature is WIP until production is a little far-fetched if you aren’t practicing “real” continuous delivery. So, for our system, passing UAT is treated as production. When the team is under their WIP limit they are free to pull the next feature from the highest-priority features in the backlog.
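The pull rule is simple enough to sketch. This is only an illustration in Python; the function name, the feature names and the limit of 2 are made up.

```python
def pull_next(backlog, in_progress, wip_limit):
    """Move the highest-priority feature into WIP if the team is under its limit.

    backlog: list of feature names, highest priority first.
    in_progress: list of features that have not yet passed UAT.
    Returns the pulled feature, or None if nothing was pulled.
    """
    if len(in_progress) >= wip_limit or not backlog:
        return None
    feature = backlog.pop(0)
    in_progress.append(feature)
    return feature


backlog = ["export report", "fix login defect", "new dashboard"]
in_progress = ["search filter"]

pulled = pull_next(backlog, in_progress, wip_limit=2)   # under the limit, pulls a feature
blocked = pull_next(backlog, in_progress, wip_limit=2)  # at the limit, pulls nothing
```

Because the decision is made at pull time, reprioritizing the backlog right up to that moment costs nothing.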

This will most likely reduce resource utilization, but it will increase throughput and improve quality. Managers may take issue with developers not being used at full capacity, but there is a reason for this madness and hopefully I can explain it.

Having features pulled into the pipeline from a prioritized backlog instead of planning a sprint allows decisions on which features to work on to be deferred until the last possible moment. This provides more agility in the flow of work in the pipeline and the product owner is able to respond quickly to optimize the product in production. Isn’t agile what we’re going for?

Pulling work with WIP limits also gives greater risk management. Since batch sizes are smaller, problems will only affect a limited amount of work in progress and risk can be mitigated as new work is introduced in the pipeline. This is especially true if we increase the number of production releases. If every change results in a production release we don’t have to worry about the branch and hotfix dance.

Focusing on a limited amount of work improves the speed at which work is done. There is no context switching and there is a single focus on moving one work item, or a limited number of work items, through the system at one time. This increases the flow of work even though there may be times when a developer is idle.

The truth is the system can only flow as fast as its slowest link, the constraint. Having one part of the system run at full capacity and overload the constraint introduces a lot of potential waste in the system. If the idle parts of the system worked to help the bottlenecked part of the system, the entire system improves. So having a full system focus is important.
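A tiny sketch of the slowest-link idea, with made-up weekly capacities for each stage (the stage names and numbers are assumptions for illustration only):

```python
# Hypothetical weekly capacity of each stage, in features.
capacity = {"development": 6, "code review": 4, "QA": 2, "deployment": 10}

# The constraint is the stage with the lowest capacity,
# and the whole system can deliver no faster than that stage.
constraint = min(capacity, key=capacity.get)
system_throughput = capacity[constraint]
```

Running development at its full 6 features a week here would only pile up 4 features of inventory in front of QA every week.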

On my current team, we have constraints that determine how quickly we can turn around a feature. Currently, code review and QA are constraints. QA is the largest constraint that limits faster deployment cycles, but more on that later. To optimize our constraints we could follow the five basic steps outlined in the Theory of Constraints (TOC) from the book The Goal:

  1. Identify the constraint(s) – in this instance it’s code review and manual testing
  2. Exploit the constraint to maximize productivity – focus on improvements on the constraint
  3. Subordinate all other steps or processes to the speed or capacity of the constraint – no new work may enter as WIP until the constraint has WIP available
  4. Elevate the constraint – prioritize work that helps remove the constraint.
  5. Repeat

To help with the code review constraint, the plan is to have developers do code reviews any time WIP stops the movement of work. With this time developers can dig in and do more thoughtful code reviews and look for ways to refactor and improve the code base. Since we are touching code, why not make recommendations to make the code better? So, we can raise the bar on what an acceptable pull request is: good syntax, style, logic, tests… everything we can think of to make the codebase more maintainable and easy to validate.

To remove the QA constraint, the plan focuses on developers creating automated tests to help lessen the work that QA has to do. The reason we don’t first focus on optimizing QA processes directly is that doing so would increase the capacity of QA without increasing the speed at which we can flow work to production. We don’t want to increase the number of features that QA can handle because it is important to take the proper time in testing. What we want to do is remove manual regression checks from QA. Exploiting QA for us means increasing QA’s effectiveness by freeing up time to do actual testing instead of just following a regression script. Having developers automate regression opens us up to deliver new features to production faster because automation runs these tests much faster than QA. QA can focus on what they do best, testing, and not running mundane scripted checks. The trick here is how we convince developers to write automated tests without causing a revolt.

In summary, today we have to wait for a manual regression test cycle to finish and can’t introduce new work because it would invalidate the regression test. With automation handling 80% or more of regression, QA can move faster and actually test more, and we increase not only throughput through the entire system but also the overall quality of the product.

Monitoring Delivery Pipeline

We track work through the delivery pipeline as features. A feature in this sense is any change: a new function, a change to an existing function, or a defect fix. Feature requests are kept in a central database. We monitor the delivery pipeline by measuring:

  • Inventory
  • Lead Time
  • Quantity: Unit of Production
  • Production Rate


Inventory

Inventory (V) is any work that has not been delivered to the customer. This is the same as work in progress (WIP). This counts all work from the backlog to a release awaiting production deployment. Whenever there is undelivered work and we have to cancel the work for some reason, we consider it an Operational Expense. Canceled work won’t be delivered to production because of a defect, incorrect specs, or because the customer pivoted or otherwise doesn’t want it. Canceled work is wasted effort and in some cases can also cause expensive unbudgeted rework. In traditional cost accounting inventory is seen as an asset, but in TOC it is a potential Operational Expense if it is not eventually delivered to the customer, so turning inventory as fast as possible without injecting defects is a goal.


Quantity

Quantity (Q) is the total number of units that have moved through our delivery pipeline. Our unit of production is a feature. When a feature is deployed to production we can increase quantity by one unit. A feature is still considered inventory until it has been delivered to the customer in production. If a customer decides they don’t want the feature, or there is some other reason to stop the deployment of the feature, it is counted as an Operational Expense and not quantity.

Flow Time

Flow time (FT) is the time it takes to move a feature, one unit, from submission to the backlog to deployed to a customer in production.

Production Rate

Production rate (PR) is the number of units delivered during a time period. This is the same as throughput. If we deliver 3 features to production in a month, our production rate is 3 features per month.

Optimize Delivery Pipeline for Flow Time

We should strive to optimize the delivery pipeline for flow time instead of production rate or throughput. The article “Theory of Constraints – Productivity Metrics in Software Development” explains this well.

Let’s say our current flow time (FT) is 1 unit (Q) in a week, or a production rate (PR) of 4 Q per month (assuming a month of 20 working days). If we optimize FT to 1 Q in 3 days, we will see a jump in PR to 6.67 Q per month, or roughly a 67% increase.
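The arithmetic can be checked with a few lines of Python. This is a sketch under two assumptions of mine: a month is 20 working days, and one unit moves through the pipeline at a time.

```python
WORKING_DAYS_PER_MONTH = 20  # assumption: 4 weeks of 5 working days


def production_rate(flow_time_days, working_days=WORKING_DAYS_PER_MONTH):
    """Units per month when one unit at a time moves through the pipeline."""
    return working_days / flow_time_days


before = production_rate(5)  # 1 unit per 5-day week
after = production_rate(3)   # 1 unit per 3 days
increase_pct = (after - before) / before * 100

print(f"{before:.2f} -> {after:.2f} units/month ({increase_pct:.0f}% increase)")
# prints "4.00 -> 6.67 units/month (67% increase)"
```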

If we focus on optimizing PR, we may still see improvement in FT, but it can also lead to nothing more than an increase in inventory as WIP increases. The PR optimization may increase Q that is undeliverable because of some bottleneck in our system, so the Q sits as inventory, ironically in a queue. The longer a feature sits in inventory the more it costs to move it through the pipeline and address any issues found in later stages of the pipeline. So, old inventory can also cause delay downstream as the team must take time to ramp up to address issues after they have moved on to another task.

So, to make sure we are optimizing for FT we focus on reducing waste or inventory in the pipeline by reducing WIP. The delivery team keeps a single-purpose focus on one unit or a limited amount of work in progress to deliver what the customer needs right now, based on priority in the backlog. Reducing inventory reduces Operational Expense. (Excuse me if I am allowing some lean thinking into this TOC explanation.)



Investment

Investment (I) is the total cost invested in the pipeline. In our case we will count this as time invested. We can sum the time invested on each unit in inventory in the pipeline to see how much is invested in WIP. We could count hours in timecards to determine this, but time cards are an evil construct. If we are good about moving cards, or even automate the movement of cards based on some event (branch created, PR submitted, PR approved…), we could treat the time a card sits in a state as a standard investment amount for that state. I’m still pondering this, but I feel like time investment based on card movement is way better than logging time.
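Here is a rough sketch of what deriving time invested from card movement could look like. The state names, timestamps and the `hours_per_state` helper are hypothetical, and it counts wall-clock hours (nights and weekends included), which a real version would probably want to adjust.

```python
from collections import defaultdict
from datetime import datetime


def hours_per_state(moves, now):
    """Sum the hours a card spent in each state.

    moves: list of (timestamp, state) tuples in chronological order.
    now: datetime used to close out the card's current state.
    """
    totals = defaultdict(float)
    # Each state ends when the next move happens; the last state runs until `now`.
    boundaries = [timestamp for timestamp, _ in moves[1:]] + [now]
    for (entered, state), left in zip(moves, boundaries):
        totals[state] += (left - entered).total_seconds() / 3600
    return dict(totals)


card_moves = [
    (datetime(2024, 1, 1, 9), "In Development"),  # e.g. branch created
    (datetime(2024, 1, 2, 9), "Code Review"),     # e.g. PR submitted
    (datetime(2024, 1, 2, 15), "QA"),             # e.g. PR approved
]
invested = hours_per_state(card_moves, now=datetime(2024, 1, 3, 15))
```

Summing `invested` across every card still in the pipeline would give the time currently tied up in WIP.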

Operating Expense

Operating expense (OE) is the cost of taking an idea and developing it to a deliverable. This is not to be confused with operational expense, which is a loss in inventory or loss in investment. Any expense, variable or fixed, that is a cost to deliver a unit is considered OE. We will just use salaries of not only developers, but also BA, QA and IT as our OE. Not sure how we will divide up our fixed salaries, maybe a function that includes time and investment. Investment would be a fraction of OE because not all of a developer’s time is invested in delivering features (still learning).


Throughput

Throughput (T) in this sense is the amount earned per unit. Traditionally, this is the same as production rate as explained earlier, but in terms of cost, we calculate throughput by taking the amount earned on production rate (features delivered to production) minus the cost of delivering the features, or the investment.

Throughput Accounting

To maximize ROI and net profit (NP) we need to increase T while decreasing I and OE.

NP = T – OE

ROI = (T – OE) / I


Average Cost Per Feature

Average cost per feature (ACPF) is the average amount spent in the pipeline to create a feature.
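A small sketch of these calculations with made-up monthly numbers. NP and ROI follow the usual throughput accounting definitions; treating ACPF as OE divided by Q is my assumption about how to compute it, so take that one with a grain of salt.

```python
def net_profit(throughput, operating_expense):
    return throughput - operating_expense


def return_on_investment(throughput, operating_expense, investment):
    return net_profit(throughput, operating_expense) / investment


def average_cost_per_feature(operating_expense, quantity):
    # Assumption: ACPF is operating expense spread evenly over delivered features.
    return operating_expense / quantity


# Hypothetical monthly figures: dollars for T, OE and I, features for Q.
T, OE, I, Q = 50_000, 30_000, 40_000, 4

np_ = net_profit(T, OE)                 # 20000
roi = return_on_investment(T, OE, I)    # 0.5
acpf = average_cost_per_feature(OE, Q)  # 7500.0
```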


There are more metrics that we can gather, monitor, and analyze; but we will keep it simple for now and learn to crawl first.

Average Lead Time Per Feature

Average lead time per feature is the average time it takes to move a feature from the backlog to production. We also calculate the standard deviation to get a sense of how varying work sizes in the pipeline affect lead time.
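Both numbers come straight out of Python's standard library. The lead times below are made up, and I am assuming the sample standard deviation is the one we want.

```python
from statistics import mean, stdev

# Hypothetical lead times, in days, for the last five delivered features.
lead_times_days = [3, 5, 4, 12, 6]

average_lead_time = mean(lead_times_days)  # 6 days
lead_time_spread = stdev(lead_times_days)  # sample standard deviation, about 3.5 days
```

A spread that is large relative to the average, as it is here because of the 12-day feature, hints that feature sizes vary a lot.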

Bonus: Estimating Becomes Easier

When we begin to monitor our pipeline with these metrics estimating becomes simpler. Instead of estimating based on time we switch to estimating based on size of feature. Since we are tracking work, we have a history to base our future size estimates on.

Issues in Transformation

Our current Q is a release, a group of features that have been grouped together for a deployment. At times we build up an inventory of features over a month before they are delivered to production. This causes an increase in inventory. It would be better to use a feature instead of a release as our Q. When a feature is ready, deliver it. This reduces inventory and increases the speed at which we get feedback.

To change our unit, Q, to feature we have to attack our largest constraint, QA. Currently, we have to sit on features or build up inventory to get enough to justify a QA test cycle. We don’t want to force a two week regression on one feature that took a couple days to complete. So, reducing the test cycle is paramount with this approach.


  • The Goal: A Process of Ongoing Improvement, by Eliyahu M. Goldratt

Adding Report to Existing TFS 2017 Project

I had an issue where I couldn’t see reports for my TFS projects because they weren’t installed. I knew this because I opened SQL Reporting Services and I didn’t see a folder for my project under the TFS collection’s folder. I did a little digging and found a command that I could run to install the reports:

  1. Open administrator command prompt on server hosting TFS.
  2. Change directory to C:\Program Files\Microsoft Team Foundation Server 15.0\Tools
    Note: 64bit would be Program Files (x86)
  3. Run TFSConfig command to add project reports

TFSConfig addprojectreports /collection:"https://{TFSServerName}/{TFSCollectionName}" /teamproject:{TFSProjectName} /template:"Scrum"

You should replace the tokens with names that fit your context (remove the braces). The template is the process template your project uses:

  • Scrum – you will have backlog items under features
  • Agile – you will have stories under features

There’s another one, CMMI, but I’ve never used it. You should see a requirements work item, but I’m not sure if this template has a feature item.

Once you run the command, the reports will be added and you will be able to see how your team is doing by viewing the reports in SQL Reporting Services.

If It Looks Like a Defect, Is It a Defect?

Our software quality metrics work group had a discussion today and metrics around defects became an interesting topic. One of the work group members said that the concept of a defect is not relevant to agile teams. This was clarified as defect metrics within the confines of an agile sprint. I felt kind of dumb, because I didn’t know this and it appeared that there may be a consensus with it. Maybe I misunderstood, but the logic was that there are no defects in sprint because once a problem is found it is immediately fixed in the sprint. I wanted to push for defect metrics from check-in through production. The later in the software delivery pipeline that a defect is found the more it will cost, so you have to know where it was caught. I didn’t get to dig in to the topic with the group because I was contemplating whether I needed to revisit my understanding of Agile and I didn’t want to slow the group down. I already feel like a lightweight in the ring with a bunch of heavyweights :).

Defects Cost Money

After pondering it a bit, I am still of the opinion that defects exist whether you name them something else, quietly fix them before anyone notices, or collectively as a team agree not to track them. Defects are an unavoidable artifact of software development. Defect, bug, issue… it doesn’t work as expected. Name it what you like or obscure it in the process, defects are always there and will be until humans and computers become perfect beings. Defects cost money when more than one person has to deal with them. If a defect is caught in an exploratory test and it is acknowledged that it must be fixed in sprint, then it will have to be retested after the fix. Pile this double testing cost on top of the development cost and defects can get expensive.

Not to mention, defects slow sprints down. When you estimated a certain amount of story points, let’s say 2, and ultimately the story was an 8 because of misunderstandings and bad coding practices, there is a cost associated with this. Maybe estimates are stable or perfect in mature hard core Agile teams, or defects are just another chore in the process that doesn’t warrant tracking or analysis. For new teams just making the transition to agile, tracking defects provides an additional signal that something is wrong in the process. If you are unable to see where your estimate overruns are occurring you can’t take action to fix them.


If someone besides the developer finds a defect, the story should be rejected. At the end of the sprint we should be able to see how many rejections there were and at what stage the rejections occurred in the pipeline. If these numbers are high or trending up, especially later in the pipeline, something needs to be done and you know there is a problem because you tracked defects. It may be my lack of experience in a hard core Agile team, but I just can’t see a reason to ignore defects just because they are supposed to be fixed in sprint.
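Counting rejections by stage doesn't need much tooling. A sketch with a made-up rejection log (the story ids and stage names are hypothetical):

```python
from collections import Counter

# Hypothetical rejection log for one sprint: (story id, stage where the defect was found).
rejections = [
    ("S-101", "code review"),
    ("S-101", "QA"),
    ("S-102", "QA"),
    ("S-103", "UAT"),
]

rejections_by_stage = Counter(stage for _, stage in rejections)
stories_rejected = len({story for story, _ in rejections})
```

Comparing `rejections_by_stage` from sprint to sprint would show whether defects are being caught early or leaking toward UAT and production.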

Can someone help me see the light? I thought I was agile’ish. I am sure there is an agile expert out there that can give me another view of what defects mean in agile and how my current thought process is out of place in agile. I think my fellow group members are awesome, but I usually look for a second opinion on topics I am unsure about.

Optimizing the Software Delivery Pipeline: Deployment Metrics

Currently, I have no way of easily determining what build version is deployed to an environment. This made me take more interest in metrics about deployments; we basically have none. I can look at the CD (continuous deployment) server and see what time a deployment was done and I can look at the builds on the server and sort of deduce which build was deployed, but I have to manually check the server to verify my assumptions. I wondered what else I am missing. Am I flying blind? Should I know more?

Metrics in the Software Delivery Pipeline

I am part of a work group that is exploring software quality metrics. So, my first instinct was to think about deployment quality metrics. After some soul searching, I decided what would be most helpful to me is to know where our bottlenecks are. We have an assembly line or pipeline that consists of various stages our software goes through as it makes its way to public consumption. Develop, build, deploy, test, and release are the major phases of our software delivery pipeline (I am not including planning or analysis right now as that is another animal).

I believe that metrics that focus on reducing time in our software delivery pipeline will be more effective than just focusing on reducing defects or increasing quality. If we can reduce defects or increase quality in faster delivery iterations, the effect of defects and poor quality will have less of an impact. This is the point of quality metrics in the first place, reducing the effects of poor quality on our customers and the business. Focusing on reducing time in the pipeline also supports our quality initiatives as the tools to reduce time, like automated CI and testing, not only reduce iteration time, but improve quality. Faster release iterations will allow us to address quality issues quicker. This is not to say that other metrics should be ignored. I just think that since we have no real metrics at the moment starting with metrics that support speeding up the pipe is a worthy first step.

Deployment Metrics

Back to the point. What metrics should I capture for deployments? If my goal is to increase throughput in the pipeline, I need to identify bottlenecks. So, I need some timing data.

  • How long does deployment take?
  • How long do the individual deployment steps take?
  • How do we report this over time so we can identify issues?

This is pretty simple and I can extract it from the deployment log on the server. Reporting would be just a matter of querying this data and displaying deployment time totals over time.
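As a sketch of the reporting side, assume the log has already been parsed into records with a build version, start and end (the records and the 1.5x threshold below are invented for illustration):

```python
from datetime import datetime
from statistics import mean

# Hypothetical records, as they might be extracted from a deployment log.
deployments = [
    {"build": "1.4.0", "start": datetime(2024, 1, 8, 10, 0), "end": datetime(2024, 1, 8, 10, 12)},
    {"build": "1.4.1", "start": datetime(2024, 1, 15, 10, 0), "end": datetime(2024, 1, 15, 10, 10)},
    {"build": "1.4.2", "start": datetime(2024, 1, 22, 10, 0), "end": datetime(2024, 1, 22, 10, 14)},
    {"build": "1.4.3", "start": datetime(2024, 1, 29, 10, 0), "end": datetime(2024, 1, 29, 10, 40)},
]


def duration_minutes(deployment):
    return (deployment["end"] - deployment["start"]).total_seconds() / 60


durations = {d["build"]: duration_minutes(d) for d in deployments}
average = mean(list(durations.values()))

# Crude flag: anything 50% slower than the average deserves a look.
slow_builds = [build for build, minutes in durations.items() if minutes > 1.5 * average]
```

The same calculation applied to individual deployment steps would show which step is responsible when a deployment like 1.4.3 runs long.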

Additional Deployment Metrics

In addition to the timing data it may be worthwhile to capture additional metrics like the size of the deployment. Deploying involves pushing packages across wires and the size of the packages can have an effect on deployment time. Issues with individual servers can affect deployment time, so knowing the servers being deployed to can help identify server issues. With the timing data, we can also capture:

  • The version of the build being deployed
  • The environment being deployed to
  • The individual servers being deployed to
  • The size and version of the packages being deployed to a server

Deployment Data

So, my first iteration of metrics centers on timing, but would also have other data to give a more robust picture of deployments. This is a naive first draft of what the data schema could look like. I suspect that this can all be captured on most CI/CD servers and augmented with data generated by the reporting tool:

  • Deployment Id – a unique identifier for the deployment, generated by the reporting tool
  • Environment Id – a unique identifier for the environment deployed to, generated by the reporting tool
  • Build Version – build version should be the version captured on the server
  • Timestamp – timestamp is the date/time the deployment record was created
  • Start – the date/time the deployment started
  • End – the date/time the deployment completed
  • Tasks – tasks are the individual steps taken by the deployment script; it is possible that there is only one step, it all depends on how deployment is scripted
    • Deployment Task Id – a unique identifier for the task, generated by the reporting tool
    • Server Id – a unique identifier for the physical server deployed to, generated by the reporting tool
    • Packages – packages represent the group of files pushed to the server, this is normally a zip or NuGet package in my scenarios
      • Package Version – the version of the package being pushed, this may be different than the software version and is generated outside of the reporting tool
      • Package Size – the physical size of the package in KB or MB (not sure which is better)
    • Start – the date/time the deployment to the server started
    • End – the date/time the deployment to the server ended

Imagine the above as some beautiful XML, JSON, or ProtoBuf, because I am too lazy to write it.
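For anyone who would rather see it than imagine it, here is a rough sketch of the same hierarchy as Python dataclasses serialized to JSON. The field names follow the list above, but the types, the KB unit and all sample values are my assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Package:
    version: str
    size_kb: int  # assumption: sizes recorded in KB


@dataclass
class DeploymentTask:
    task_id: str
    server_id: str
    start: str  # ISO 8601 date/time
    end: str
    packages: List[Package] = field(default_factory=list)


@dataclass
class Deployment:
    deployment_id: str
    environment_id: str
    build_version: str
    timestamp: str
    start: str
    end: str
    tasks: List[DeploymentTask] = field(default_factory=list)


deployment = Deployment(
    deployment_id="d-001",
    environment_id="staging",
    build_version="1.4.3",
    timestamp="2024-01-29T10:41:00",
    start="2024-01-29T10:00:00",
    end="2024-01-29T10:40:00",
    tasks=[
        DeploymentTask(
            task_id="t-001",
            server_id="web-01",
            start="2024-01-29T10:00:00",
            end="2024-01-29T10:25:00",
            packages=[Package(version="1.4.3.7", size_kb=2048)],
        )
    ],
)

# asdict() walks nested dataclasses and lists, so the result is plain JSON-ready data.
print(json.dumps(asdict(deployment), indent=2))
```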

If my goal is to increase throughput in the pipe I should probably think about a higher level of abstraction in the hierarchy so that I can relate metrics from other parts of the pipeline. For now I will focus on this as a first step to prove that this is doable and provides some value.

All I need to do is create a data parsing tool that can be called by the deployment server once a deployment is done. The tool will receive the server log and store it, parse the log and generate a data structure similar to the one above, then store the data in a database. Then I have to create a reporting tool that can present graphs and charts of the data for easy analysis. Lastly, create an API that will allow other tools to consume the data. This may be a job for CQRS and event sourcing. Easy, right :). I know there is a tool for that, but I am a sucker for punishment.


This post will take more time than I thought so I will make this a series. I will cover my thoughts on metrics for development, build, test, and release in upcoming posts (if I can remember). Then possibly some posts on my thoughts on how the metrics and tools can be used to optimize the pipeline. Pretty ambitious, but sounds like fun to me.