Building Resilient .NET Applications using Polly

In distributed systems, outages and transient errors are inevitable. Ensuring that your application stays responsive when a dependent service goes down is critical. This article explores service resilience using Polly, a .NET library that helps handle faults gracefully. It covers basic resilience strategies and explains how to keep your service running when a dependency is unavailable.

What Is Service Resilience

Service resilience is the ability of an application to continue operating despite failures such as network issues, temporary service unavailability, or unexpected exceptions. A resilient service degrades gracefully rather than crashing outright, ensuring users receive the best possible experience even during failures.

Key aspects of resilience include:

  • Retrying failed operations: automatically attempting an operation again when a transient error occurs.
  • Breaking the circuit: stopping the system from repeatedly attempting operations that are likely to fail.
  • Falling back: providing an alternative response or functionality when a dependent service is unavailable.

Introducing Polly: The .NET Resilience Library

Polly is an open-source .NET library for handling transient faults. You define policies such as retry or circuit breaker, combine them into a policy wrap, and plug them into an application through dependency injection.

Polly provides several resilience strategies:

  • Retry: automatically reattempts an operation when a failure occurs.
  • Circuit Breaker: temporarily stops further attempts once failures exceed a threshold.
  • Fallback: provides a default value or alternative action when an operation still fails, for example after all retries are exhausted.
  • Timeout: cancels operations that take too long.

These strategies can be combined to build a robust resilience pipeline.

Key Polly Strategies for Service Resilience

Retry Policy

The retry policy is useful when failures are transient. Polly can automatically re-execute a failed operation after a configurable delay. The example below retries up to three times with exponential backoff, waiting 2, 4, and then 8 seconds:

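// Requires the Polly NuGet package, plus: using System.Net; using System.Net.Http; using Polly;
var retryPolicy = Policy
    .Handle<HttpRequestException>()
    // Retry only responses that are likely transient (5xx or 408), not client errors such as 404.
    .OrResult<HttpResponseMessage>(r =>
        (int)r.StatusCode >= 500 || r.StatusCode == HttpStatusCode.RequestTimeout)
    // Exponential backoff: 2^1, 2^2, 2^3 seconds.
    .WaitAndRetryAsync(3, retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)),
        onRetry: (outcome, timespan, retryCount, context) =>
        {
            Console.WriteLine($"Retry {retryCount}: waiting {timespan} before next attempt.");
        });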
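
Defining a policy does nothing until something is executed through it. As a minimal sketch of that step, the program below runs a simulated flaky call through a retry policy. The `GetOrderStatusAsync` function and its "fails twice, then succeeds" behavior are made up for illustration, and the delays are shortened to milliseconds so the sketch finishes quickly.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;

// Hypothetical flaky dependency: fails on the first two calls, then succeeds.
int calls = 0;
Task<string> GetOrderStatusAsync()
{
    calls++;
    if (calls < 3)
        return Task.FromException<string>(
            new HttpRequestException($"Simulated transient failure on call {calls}"));
    return Task.FromResult("Paid");
}

// Same shape as the retry policy above, but with millisecond delays.
var demoRetryPolicy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromMilliseconds(50 * Math.Pow(2, attempt)),
        onRetry: (exception, delay, retryCount, context) =>
            Console.WriteLine($"Retry {retryCount} in {delay.TotalMilliseconds} ms: {exception.Message}"));

// ExecuteAsync runs the delegate and re-runs it whenever the policy handles a failure.
string status = await demoRetryPolicy.ExecuteAsync(() => GetOrderStatusAsync());
Console.WriteLine($"Result: {status}, total calls: {calls}");
```

The caller only ever sees the final outcome: either the successful result or, if every attempt fails, the last exception.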

Circuit Breaker

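A circuit breaker prevents an application from continuously retrying an operation that is likely to fail, protecting it from cascading failures. The example below opens the circuit after three consecutive handled failures and keeps it open for 30 seconds: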

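var circuitBreakerPolicy = Policy
    .Handle<HttpRequestException>()
    // Count only likely server-side or transient failures (5xx or 408) toward breaking.
    .OrResult<HttpResponseMessage>(r =>
        (int)r.StatusCode >= 500 || r.StatusCode == HttpStatusCode.RequestTimeout)
    .CircuitBreakerAsync(
        handledEventsAllowedBeforeBreaking: 3,     // consecutive handled failures
        durationOfBreak: TimeSpan.FromSeconds(30), // calls fail fast while the circuit is open
        onBreak: (outcome, breakDelay) =>
        {
            Console.WriteLine($"Circuit breaker opened for {breakDelay.TotalSeconds} seconds.");
        },
        onReset: () =>
        {
            Console.WriteLine("Circuit breaker reset.");
        });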
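
To make the "stop calling a failing dependency" behavior concrete, here is a small sketch that hammers a simulated gateway that is always down. `CallGatewayAsync` is a hypothetical stand-in, not a real client. Once the threshold is reached, Polly rejects further calls with a `BrokenCircuitException` without invoking the delegate at all.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

// Hypothetical dependency that is always down.
int gatewayCalls = 0;
Task CallGatewayAsync()
{
    gatewayCalls++;
    return Task.FromException(new HttpRequestException("Simulated outage"));
}

// Break after 3 consecutive handled exceptions; stay open for 30 seconds.
var breaker = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(3, TimeSpan.FromSeconds(30));

for (int attempt = 1; attempt <= 5; attempt++)
{
    try
    {
        await breaker.ExecuteAsync(() => CallGatewayAsync());
    }
    catch (HttpRequestException)
    {
        // The call reached the gateway and failed.
        Console.WriteLine($"Attempt {attempt}: gateway failed, circuit is {breaker.CircuitState}.");
    }
    catch (BrokenCircuitException)
    {
        // The circuit is open, so Polly rejected the call immediately.
        Console.WriteLine($"Attempt {attempt}: rejected without calling the gateway.");
    }
}

Console.WriteLine($"Gateway was actually called {gatewayCalls} times.");
```

Five attempts are made, but only the first three reach the gateway. The remaining two fail fast, which is what protects both the caller and the struggling dependency.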

Fallback Strategy: Keeping Your Service Running

When a dependent service is down, a fallback policy provides a default or cached response instead of propagating an error. Example:

var fallbackPolicy = Policy<HttpResponseMessage>
    .Handle<HttpRequestException>()
    .OrResult(r => !r.IsSuccessStatusCode)
    .FallbackAsync(
        // Substitute response returned to the caller in place of the failure.
        fallbackAction: cancellationToken => Task.FromResult(
            new HttpResponseMessage(HttpStatusCode.OK)
            {
                Content = new StringContent("Service temporarily unavailable. Please try again later.")
            }),
        // Runs just before the fallback response is returned; a good place for logging.
        onFallbackAsync: (outcome, context) =>
        {
            Console.WriteLine("Fallback executed: dependent service is down.");
            return Task.CompletedTask;
        });
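
The "cached response" variant can be sketched the same way. In the example below, `FetchShippingOptionsAsync` and the cached string are made up for illustration. The policy swaps in the last known good value when the call throws.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;

// Hypothetical last-known-good value, for example loaded from a cache.
const string cachedOptions = "standard-shipping (cached)";

// Hypothetical dependency that is currently down.
Task<string> FetchShippingOptionsAsync() =>
    Task.FromException<string>(new HttpRequestException("Simulated outage"));

// Return the cached value whenever the call fails with HttpRequestException.
var cachedFallback = Policy<string>
    .Handle<HttpRequestException>()
    .FallbackAsync(cachedOptions);

string options = await cachedFallback.ExecuteAsync(() => FetchShippingOptionsAsync());
Console.WriteLine(options);
```

The caller gets a usable value and never sees the exception, so it is worth logging inside the policy when the fallback fires. Otherwise an outage can go unnoticed.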

Timeout Policy

A timeout policy ensures that long-running requests do not block system resources indefinitely. Example:

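// Optimistic timeout (the default): the executed delegate must honor the CancellationToken Polly passes in.
var timeoutPolicy = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(10));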
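
A timeout only works if the operation can actually be cancelled. As a small sketch, the program below runs a simulated slow call (a five-second `Task.Delay` standing in for a slow HTTP request) through a 200 ms timeout. The durations are arbitrary values chosen for the demonstration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.Timeout;

var shortTimeout = Policy.TimeoutAsync(TimeSpan.FromMilliseconds(200));

bool timedOut = false;
try
{
    // Polly supplies "ct" and cancels it when the timeout elapses,
    // so the delegate has to pass it on to the slow operation.
    await shortTimeout.ExecuteAsync(
        ct => Task.Delay(TimeSpan.FromSeconds(5), ct),
        CancellationToken.None);
}
catch (TimeoutRejectedException)
{
    timedOut = true;
    Console.WriteLine("Request abandoned after 200 ms.");
}
```

If the delegate ignored `ct`, the default optimistic timeout could not interrupt it. With an `HttpClient`, that means calling an overload that accepts the token, such as `GetAsync(url, ct)`.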

Implementing Basic Service Resilience with Polly

Example Use Case: Online Payment Processing System

Imagine an e-commerce platform, ShopEase, which processes customer payments through an external payment gateway. To ensure a seamless shopping experience, ShopEase implements the following resilience strategies:

  • Retry Policy: If the payment gateway experiences transient network issues, ShopEase retries the request automatically before failing.
  • Circuit Breaker: If the payment gateway goes down for an extended period, the circuit breaker prevents continuous failed attempts.
  • Fallback Policy: If the gateway is unavailable, ShopEase allows customers to save their cart and receive a notification when payment is available.
  • Timeout Policy: If the payment gateway takes too long to respond, ShopEase cancels the request and notifies the customer.
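
A rough sketch of how these four policies might be combined for ShopEase is shown below. Everything specific in it is an assumption: `ChargeAsync` is a hypothetical stand-in that always returns 503, the retry delays are shortened to milliseconds, and the fallback message is invented. The ordering is the part worth noting: policies passed to `Policy.WrapAsync` run from outermost to innermost.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;
using Polly.Timeout;

// Hypothetical stand-in for the payment gateway: it always answers 503.
int gatewayCalls = 0;
Task<HttpResponseMessage> ChargeAsync(CancellationToken ct)
{
    gatewayCalls++;
    return Task.FromResult(new HttpResponseMessage(HttpStatusCode.ServiceUnavailable));
}

// Treat 5xx and 408 as transient; other statuses pass through untouched.
static bool IsTransient(HttpResponseMessage r) =>
    (int)r.StatusCode >= 500 || r.StatusCode == HttpStatusCode.RequestTimeout;

var timeoutPerTry = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(10));

var circuitBreaker = Policy<HttpResponseMessage>
    .Handle<HttpRequestException>()
    .Or<TimeoutRejectedException>()
    .OrResult(r => IsTransient(r))
    .CircuitBreakerAsync(3, TimeSpan.FromSeconds(30));

var retry = Policy<HttpResponseMessage>
    .Handle<HttpRequestException>()
    .Or<TimeoutRejectedException>()
    .OrResult(r => IsTransient(r))
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromMilliseconds(100 * attempt));

// The fallback also handles BrokenCircuitException, which the retry lets through.
var fallback = Policy<HttpResponseMessage>
    .Handle<HttpRequestException>()
    .Or<TimeoutRejectedException>()
    .Or<BrokenCircuitException>()
    .OrResult(r => IsTransient(r))
    .FallbackAsync(ct => Task.FromResult(
        new HttpResponseMessage(HttpStatusCode.ServiceUnavailable)
        {
            Content = new StringContent("Payment is unavailable right now. Your cart has been saved.")
        }));

// Outermost first: fallback -> retry -> circuit breaker -> timeout -> gateway call.
var paymentPolicy = Policy.WrapAsync<HttpResponseMessage>(
    fallback, retry, circuitBreaker, timeoutPerTry);

HttpResponseMessage response = await paymentPolicy.ExecuteAsync(
    ct => ChargeAsync(ct), CancellationToken.None);
string body = await response.Content.ReadAsStringAsync();
Console.WriteLine($"{body} (gateway calls: {gatewayCalls})");
```

In this run the gateway is called three times. The third failure opens the circuit, the next retry is rejected immediately, and the fallback turns that rejection into the "cart saved" response.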

Together, these patterns keep checkout usable when the payment gateway misbehaves: short glitches are retried away, long outages fail fast instead of piling up requests, and customers get a clear next step rather than an error page.
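
The introduction mentioned wiring policies in through dependency injection. For an HTTP dependency such as the payment gateway, one possible sketch uses `IHttpClientFactory` with the Microsoft.Extensions.Http.Polly package, which also brings in `HttpPolicyExtensions`. The client name and base address below are hypothetical.

```csharp
using System;
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;
using Polly;
using Polly.Extensions.Http;
using Polly.Timeout;

var services = new ServiceCollection();

services.AddHttpClient("payment-gateway", client =>
    {
        // Hypothetical address, used only for illustration.
        client.BaseAddress = new Uri("https://payments.example.com/");
    })
    // Added first, so it is the outer handler: retries HttpRequestException, 5xx, 408,
    // and timeouts raised by the inner timeout policy.
    .AddPolicyHandler(HttpPolicyExtensions
        .HandleTransientHttpError()
        .Or<TimeoutRejectedException>()
        .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt))))
    // Added second, so it is the inner handler: a 10-second limit on each individual try.
    .AddPolicyHandler(Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(10)));

using var provider = services.BuildServiceProvider();
HttpClient gateway = provider
    .GetRequiredService<IHttpClientFactory>()
    .CreateClient("payment-gateway");

Console.WriteLine(gateway.BaseAddress);
```

Every request sent through the named client then passes through the policies automatically, so calling code does not need to know about Polly.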

Conclusion

Building resilient services means designing systems that remain robust under pressure. Polly enables implementing retries, circuit breakers, timeouts, and fallback strategies to keep services running even when dependencies fail. This improves the user experience and enhances overall application reliability.

I advocate for 12-Factor Apps (https://12factor.net/). Resilience is not explicitly one of the twelve factors, but many of them support it indirectly. A truly resilient application combines several layers: Polly inside the .NET code, Kubernetes auto-recovery and auto-scaling around it, and chaos engineering to verify that the safeguards actually work.

Applied together, these techniques let an architecture handle failure scenarios gracefully while keeping functionality consistent for users.

Multitenant Thoughts

I am building my third multitenant SaaS solution. I am not referencing any of my earlier work because I think it was way more work than it should have been. Also, I have since moved on from the whole ASP.NET Web Forms development mindset, and I want to start with a fresh perspective instead of trying to improve my big balls of spaghetti code.

Today, my thoughts center on enforcing the inclusion and processing of a tenant ID in every command and query. My tenant model keeps all tenant data in a shared database and shared tables. To keep everything segregated, every write and every read has to include a tenant ID so that we don’t mess with the wrong tenant’s data.

I have seen all kinds of solutions for this, some more complicated than I care to tackle at the moment. I am currently leaning towards enforcing it in the data repository.

I am using a generic repository for CRUD operations and an event repository for async, event-driven workflows. In the repository APIs, I want to introduce a validated tenant ID parameter on every write and read operation. This will force all clients to provide the ID when they call the repos.

I just have to update a couple of classes in the repos to enforce inclusion of the tenant ID when I write data. Every read will also use the tenant ID to scope the result set to a specific tenant’s data. I already have a proof of concept for this app, so this will be a breaking change for my existing clients. It is still not a lot of work, considering that I almost decided to enforce the tenant ID in a layer above the repo, which would have been a maintenance nightmare.
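
To make the idea concrete, here is a rough sketch of what a tenant-scoped generic repository could look like. It is not my actual code: the names (`ITenantOwned`, `TenantScopedRepository<T>`, `Invoice`) are made up, a `Guid` tenant ID is an assumption, and an in-memory list stands in for the shared table.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var repo = new TenantScopedRepository<Invoice>();
var acme = Guid.NewGuid();
var globex = Guid.NewGuid();

repo.Add(acme, new Invoice { Number = "A-1" });
repo.Add(acme, new Invoice { Number = "A-2" });
repo.Add(globex, new Invoice { Number = "G-1" });

Console.WriteLine(repo.Query(acme).Count);   // prints 2
Console.WriteLine(repo.Query(globex).Count); // prints 1

// Hypothetical marker interface: every stored entity carries its tenant.
public interface ITenantOwned
{
    Guid TenantId { get; set; }
}

public class Invoice : ITenantOwned
{
    public Guid TenantId { get; set; }
    public string Number { get; set; } = "";
}

public class TenantScopedRepository<T> where T : ITenantOwned
{
    // In-memory stand-in for a shared table that holds rows for all tenants.
    private readonly List<T> _rows = new();

    public void Add(Guid tenantId, T entity)
    {
        Validate(tenantId);
        entity.TenantId = tenantId; // the repository stamps the tenant, not the caller
        _rows.Add(entity);
    }

    public IReadOnlyList<T> Query(Guid tenantId)
    {
        Validate(tenantId);
        // Every read is scoped to the requesting tenant.
        return _rows.Where(row => row.TenantId == tenantId).ToList();
    }

    private static void Validate(Guid tenantId)
    {
        if (tenantId == Guid.Empty)
            throw new ArgumentException("A tenant ID is required.", nameof(tenantId));
    }
}
```

Because the tenant ID is a required parameter on every method, a client cannot compile a call that forgets it, and the validation catches an empty one at runtime.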

Is this best practice? No. I don’t think there is a best practice besides the fact that you should use a tenant ID to segregate tenant data in a shared data store. This solution works for my problem and I am able to maintain it in just a couple classes. If the problem changes I can look into the fancy solutions I read about.

Now, how will I resolve the tenant ID? Sub-folder, sub-domain, query string, custom domain…?