We had an interesting production issue where a service decided to stalk a Google API like a bad date and incurred a mountain of charges. The issue made me ponder the inadequate observability and resilience we had in the system. We had resource monitoring through some simple Kubernetes dashboards, but I always wanted something more robust for observability. We also didn’t have a standard policy on timeouts, rate limiting, circuit breaking, bulkheading… resilience engineering in general. Then my mind wandered back to a video that I thought was amazing. The video was from the Netflix team and it altered my view on observability and system resilience.
I was hypnotized when Netflix released a view of the Netflix API Hystrix dashboard – https://www.youtube.com/watch?v=zWM7oAbVL4g. There is no sound in the video, but for some reason this dashboard was speaking loudly to me through the Matrix or something, because I wanted it badly. Like teenage me back in the day wanting a date with Janet Jackson bad meaning bad.
Netflix blogged about the dashboard here – https://medium.com/netflix-techblog/hystrix-dashboard-turbine-stream-aggregator-60985a2e51df. The simplicity of a circuit breaker monitoring dashboard blew me away. It had me dreaming of using the same type of monitoring to observe our software delivery process, marketing and sales programs, OKRs and our business in general. I saw more than microservices monitoring; I saw system-wide value stream monitoring (another topic that I spend too much time thinking about).
Unfortunately, when I learned about this Hystrix hotness I was under the impression that the dashboard required you to instrument your code with Hystrix to send telemetry to it. Since Hystrix is Java based, I thought it was just another cool toy for the Java community that left me, a .NET dev, out in the cold looking in on the party. Then I got my invitation.
I read that Envoy (on my circa 2018 cool things board and the most awesome K8s tool IMHO) was able to send telemetry to the Hystrix dashboard – https://blog.envoyproxy.io/best-of-all-worlds-monitoring-envoys-activity-using-hystrix-dashboard-9af1f52b8dca. This meant we, the .NET development community, could get similar visual indicators and faster issue discovery and recovery, like Netflix experienced, without having to instrument the code in any of the container workloads we have running in Kubernetes.
Install the Envoy sidecar, configure it on a pod, send the sidecar metrics to the Hystrix Dashboard, and we have deep observability and a resilience boost without changing one line of .NET Core code. That may not be a good “getting started” explanation, but the point is that it isn’t a heavy lift to get the gist and be excited about this. I feel like if we had this on the system, we would have caught our Google API issue a lot sooner than we did and incurred fewer charges (even though Google is willing to give one-time forgiveness, thanks Google).
In hindsight, it is easy to identify how we failed with the Google API fiasco: umm, my bad code. We’re a blameless team, but I can blame myself. I’d also argue that better observability into the system and stronger resilience mechanisms have been a high priority of mine for this system. We haven’t been able to fully explore and operationalize system monitoring and alerts because we were jumping through made-up hoops to build unnecessary, premature features. If we had spent that precious time building out monitoring and alerts that let us know when the request/response count has gone off the rails, if we had implemented circuit breakers to prevent repeated requests when all we get in response are errors, if we had been able to focus on scale and resilience instead of low-priority vanity functionality, I think we’d have what we need to better operate in production (but this is also biased by hindsight). Real root cause: our poor product management and inability to raise the priority of observability and resilience.
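For what it’s worth, the circuit breaker idea itself is small enough to sketch. The snippet below is a minimal, illustrative version in Python (not our production code, and not how Hystrix or Envoy implement it): after a few consecutive failures it stops calling the remote API for a cool-down period, then lets a single trial call through. The names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold` and `reset_timeout` are all my own assumptions for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    """Hypothetical minimal breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so the breaker is testable
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still cooling down: fail fast instead of hitting the remote API.
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed (half-open): fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

You would wrap the outbound call, something like `breaker.call(fetch_from_google_api, request)` where `fetch_from_google_api` is a stand-in for whatever makes the request, and treat `CircuitOpenError` as the signal to back off. The appeal of the sidecar approach is that the proxy does this kind of thing for you, so none of it has to live in application code.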
Anyway, if you are going to scale in Kubernetes and are looking for a path to better observability and resilience, check out Envoy, Istio, Ambassador and Hystrix. They could change your production life. Hopefully, I will blog one day about how we use each of these.