From Feature Flags to A/B Testing - Experiment Driven Development

March 24, 2026

Barret Blake, Architect

How many times does this happen? The company has an idea for a new feature in a software application. There’s general agreement that the idea is worthwhile. It gets prioritized, planned, weeks are spent on development, it rolls out, and… no one uses it. Or even worse, the users hate it and you lose business as your users go elsewhere.

In a previous blog post, Feature Flags - The Secret Weapon for .NET Developers, we covered the concept of feature flags, a process that lets you roll out new features with a minimum of risk by turning features on and off at will.

In this post we’ll talk about taking that process to the next logical step: Experiment-Driven Development (EDD). Often referred to as A/B testing, Hypothesis-Driven Development, or Experiment-Driven Product Development, EDD lets you roll out new features to a subset of users, so you can try a feature before releasing it to everyone.

Experimentation

Let’s look at a scenario. Some members of the marketing team believe that increasing the size of product images on the product details pages will influence more shoppers to buy products. Others aren’t so certain. No one wants to make a major across-the-board change if the result is going to be negative. Further, assessing the results of that change will take time. If users don’t like the change, the damage from lost customer sales could take a long time to recover from.

EDD gives you a path forward. By utilizing A/B testing scenarios and tracking the results via telemetry, you can experiment with a small subset of customers and gauge the results over time by comparing it to the main body of customers.

So, how do you approach this? There are a few key factors to consider:

Hypothesis

You need to start with a hypothesis. What is it that you are trying to achieve? What is the question that you are trying to answer? This is, after all, essentially a scientific experiment. If it’s going to be effective, it needs a hypothesis to define it.

Like any hypothesis, it should be a statement, not a question. “What happens if we make our product images 20% bigger?” is not a hypothesis. “Making our product images 20% bigger will lead to a 10% higher sales conversion rate” is a more appropriate hypothesis for this experiment.

The goal is to have a statement of assumption or belief that your experiment will either prove or disprove.

Minimum Sampling Size

In most scenarios you will have a minimum of two groups of users, but there may be more. For our example, maybe you want to run multiple scenarios: 20% larger images, 50% larger images, images on the left, images on the right, images at the top vs further down, and so forth. Whatever the scenario is that you want to test, you need to determine what the minimum sampling size is. Is it 10% of your users for 2 months? Is it 20% of the users for 1 month?

What makes an effective test sample will vary based on a lot of factors. Statistically, the larger the sample size, the better. But that has to be balanced with the risk to your reputation and sales with your customer base if the test goes poorly. Determining the proper sample size may require some experimentation of its own, but it needs to be large enough to be significant. A sample size of 1% for 1 day would not tell you anything in any test.

There are a number of tools available to help guide you through the calculation of an appropriate sample size; the Experimenter’s Calculator on the DrSimonJ website is one example, and a quick web search will turn up many more. A good rule of thumb, though, is that you need enough conversions to achieve a “minimum detectable effect” at a 95% confidence level.
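To make that rule of thumb concrete, here is a small sketch of the standard two-proportion sample size calculation at 95% confidence and 80% power. The class name and parameters are illustrative, not from any particular library:

```csharp
using System;

public static class SampleSize
{
    // Required users PER GROUP for a two-proportion test.
    // baselineRate: current conversion rate, e.g. 0.05 (5%)
    // minDetectableEffect: absolute lift you want to detect, e.g. 0.005
    public static int PerGroup(double baselineRate, double minDetectableEffect)
    {
        const double zAlpha = 1.96; // two-sided 95% confidence
        const double zBeta = 0.84;  // 80% power
        double p1 = baselineRate;
        double p2 = baselineRate + minDetectableEffect;
        double pBar = (p1 + p2) / 2;

        // Classic two-proportion sample size formula
        double numerator = Math.Pow(
            zAlpha * Math.Sqrt(2 * pBar * (1 - pBar)) +
            zBeta * Math.Sqrt(p1 * (1 - p1) + p2 * (1 - p2)), 2);

        return (int)Math.Ceiling(numerator / Math.Pow(p2 - p1, 2));
    }
}
```

For our example, detecting a lift from a 5% conversion rate to 5.5% (a 10% relative improvement) requires tens of thousands of users per group, which is why "1% for 1 day" tells you nothing.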

Tools

You need the right tool to set up the process. There are a number of commercial and open source libraries available, some of which were mentioned in the previous blog post. The key is that whatever solution you select, it needs to have the capability to be consistent with the users that are selected for each sample.

In the most basic feature flag setup mentioned in the previous post, the example selected at random which users see a feature each time a page is loaded. That isn’t an effective approach for any real testing. In fact, if a user randomly sees different things each time they open the same page, it could end up doing more harm than good. The toolset you use for feature flags needs to assign a user to a subset and consistently keep that user in the same group going forward. The right tool for the right job.

As you’re considering the toolset to use, make sure it supports that capability. Most of the commercial and full-featured open source solutions do, but it’s a good idea to make sure before adopting a tool to use.
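Under the hood, this consistent assignment is typically implemented by hashing a stable user ID into a fixed bucket, so the same user always lands in the same group without storing any state. Here’s a minimal sketch of the idea (the class and method names are illustrative, not from any particular library):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class VariantBucketer
{
    // Deterministically map a user ID to a bucket in [0, 100).
    // The same user always gets the same bucket, so the variant is stable.
    public static int GetBucket(string userId, string experimentName)
    {
        // Mixing in the experiment name gives each experiment an
        // independent split of the same user base.
        byte[] hash = SHA256.HashData(
            Encoding.UTF8.GetBytes($"{experimentName}:{userId}"));

        // Use the first 4 bytes as an unsigned int, then reduce to 0-99.
        uint value = BitConverter.ToUInt32(hash, 0);
        return (int)(value % 100);
    }

    // True if this user falls inside the rollout percentage.
    public static bool IsInVariant(string userId, string experimentName,
                                   int rolloutPercentage)
        => GetBucket(userId, experimentName) < rolloutPercentage;
}
```

Note the use of SHA256 rather than `string.GetHashCode()`, which is randomized per process in modern .NET and would reshuffle your groups on every restart.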

Telemetry

The feature flag toolset isn’t the only toolset to consider. You also need good telemetry. You can’t evaluate the results of the testing unless you have results to evaluate. For our example, you need to be able to compare the sales conversion rate of the users in the subsets with the sales conversion rates of the users in the main group. Telemetry is the path to determining that.

Whatever tool you’re using for telemetry, it needs to provide you with the data to let you draw conclusions from the experimentation. Before you begin the experiment, it’s critical that you have properly set up the logging and retention that you will need to be able to reach those conclusions.


The Why: Reasons To Use EDD

There are a number of reasons why you would want to implement A/B testing scenarios. If you’re going to invest in this approach, you need to be able to justify the expense and time involved. Let’s look at the why.

Testing a Hypothesis

The scenario we’ve been looking at is one of those reasons. You have a hypothesis such as “Making our product images 20% bigger will increase the sales conversion rate”. But whatever the hypothesis, A/B testing provides you with the path to test it in the safest mode possible.

Exploratory Comparison Testing

Maybe you don’t have a full hypothesis. Maybe you just have some change or feature that you want to try out to gauge user reaction. By utilizing A/B testing you can compare & contrast the user engagement and reaction between two or more groups just to see what the result will be.


Release Defense & Performance Evaluation

You can test new changes in all kinds of ways: Unit testing, integration testing, load testing, functional testing, and so forth. But in the end, there are times when you really have no idea how the system will react to real-world implementation of new code and features.

You can use EDD to slow the rollout of a new feature or piece of code so that you can evaluate the performance impact of that new implementation on your production systems before a problem becomes a crisis.

Or, just the opposite. Perhaps you have code that you believe will improve performance overall. A slow rollout will let you evaluate in small batches how that really goes before rolling it out across the board.

Personalization

We touched on this in the previous post, but there are times when you want specific features to only be available to a subset of users. Maybe you want to randomly select a group of users to receive a special bonus or reward. Whatever toolset you use, it should support the capability of selecting users by group, characteristic, or some other distinct value.

Regional Control

Perhaps you need to have certain features that are exclusive or restricted to users of a certain country or geographic region. This might be due to any number of factors, but oftentimes various government regulations play a large part in these decisions. Most of the tools out there will allow you to subdivide your users by region to allow you to enable or disable features based on such metrics.

AI

Everything is about AI these days. Artificial intelligence can come up with all kinds of ideas for things to do to your application. For example, maybe it was AI that suggested the image size change and not the marketing team. EDD is a great way to validate whether or not such suggestions have merit. When guided well, AI can iterate through these ideas quickly, but with that rapid iteration comes the temptation to skip a thorough evaluation of the results. Or perhaps you just aren’t sure what the overall impact of these new AI-driven features might be.

Using a subset of your users to test these new features and their worth can be a good approach to AI-driven development as well. The sample helps you beta test these new features and concepts in a limited way before rolling them out across the whole platform. It condenses the whole “market research” process that might previously have taken weeks or months into brief, rapid iterations, letting you evaluate ideas and move on much more quickly than before.

Interpreting the Results

Whatever the goal or the process, your test needs to run long enough to get statistically significant results. In other words, until it’s relatively obvious one way or the other whether your hypothesis was correct. There is an entire field of statistics dedicated to such calculations. For the statistically inclined, a p-value below 0.05 is the goal.

For our purposes, however, it should simply be enough to know that enough time has passed for us to see whether there is a meaningful difference between the various groups being tested, if any at all. If there is no real difference, it may be worth evaluating whether further experimentation is warranted.
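For the curious, the significance check itself is not much code. Here is a sketch of a two-proportion z-test for comparing conversion rates between control and variant; a |z| above 1.96 corresponds to p < 0.05 on a two-sided test (the names here are illustrative):

```csharp
using System;

public static class AbResult
{
    // Two-proportion z-test: the z statistic for the difference in
    // conversion rate between the control and variant groups.
    public static double ZScore(int controlConversions, int controlTotal,
                                int variantConversions, int variantTotal)
    {
        double p1 = (double)controlConversions / controlTotal;
        double p2 = (double)variantConversions / variantTotal;

        // Pooled rate under the null hypothesis (no difference)
        double pooled = (double)(controlConversions + variantConversions)
                        / (controlTotal + variantTotal);

        double standardError = Math.Sqrt(pooled * (1 - pooled)
                               * (1.0 / controlTotal + 1.0 / variantTotal));

        return (p2 - p1) / standardError;
    }

    // |z| > 1.96 ≈ p < 0.05 for a two-sided test
    public static bool IsSignificant(double z) => Math.Abs(z) > 1.96;
}
```

In practice a dedicated statistics library or an online calculator is the safer choice, but this is all that is happening underneath.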

Implementing EDD In .NET

We’ve talked about the what and the why; now let’s look at the how. We’ll use Azure App Configuration for our examples, but the approach works much the same way with any of the various solutions available for .NET.

Setup

Install the required packages first:

dotnet add package Microsoft.Azure.AppConfiguration.AspNetCore
dotnet add package Microsoft.FeatureManagement.AspNetCore
dotnet add package Azure.Identity

Then wire everything up in Program.cs. The key steps are connecting to App Configuration with DefaultAzureCredential, calling UseFeatureFlags(), and registering AddAzureAppConfiguration and AddFeatureManagement in the service collection. For A/B testing you also need WithTargeting() so the split is consistent per user rather than random per request:

var endpoint = builder.Configuration["Endpoints:AppConfiguration"]
    ?? throw new InvalidOperationException("App Configuration endpoint not found.");

builder.Configuration.AddAzureAppConfiguration(options =>
{
    options.Connect(new Uri(endpoint), new DefaultAzureCredential())
           .UseFeatureFlags(ff =>
           {
               ff.CacheExpirationInterval = TimeSpan.FromMinutes(5);
           });
});

builder.Services.AddAzureAppConfiguration();
builder.Services.AddFeatureManagement()
                .WithTargeting(); // ensures consistent per-user bucketing

// ...
var app = builder.Build();
app.UseAzureAppConfiguration();

Add a Custom ITargetingContextAccessor

The critical A/B testing requirement, that users must be consistently assigned to a group, is handled by our ITargetingContextAccessor. The default implementation uses HttpContext.User.Identity.Name as the UserId, which is fine for authenticated apps. For anonymous users you need a custom implementation (see the Azure documentation for details). Here’s one that falls back to a cookie-based ID for anonymous visitors:

public class UserTargetingContextAccessor : ITargetingContextAccessor
{
    private const string CookieName = "ab_user_id";
    private readonly IHttpContextAccessor _httpContextAccessor;

    public UserTargetingContextAccessor(IHttpContextAccessor httpContextAccessor)
        => _httpContextAccessor = httpContextAccessor;

    public ValueTask<TargetingContext> GetContextAsync()
    {
        var context = _httpContextAccessor.HttpContext!;

        // Use authenticated user ID if available, otherwise a stable cookie
        var userId = context.User.Identity?.IsAuthenticated == true
            ? context.User.Identity.Name!
            : GetOrCreateAnonymousId(context);

        return ValueTask.FromResult(new TargetingContext
        {
            UserId = userId,
            Groups = context.User.Claims
                .Where(c => c.Type == ClaimTypes.Role)
                .Select(c => c.Value)
                .ToList()
        });
    }

    private static string GetOrCreateAnonymousId(HttpContext context)
    {
        if (context.Request.Cookies.TryGetValue(CookieName, out var existing))
            return existing;

        var newId = Guid.NewGuid().ToString();
        context.Response.Cookies.Append(CookieName, newId, new CookieOptions
        {
            Expires = DateTimeOffset.UtcNow.AddDays(90),
            IsEssential = true
        });
        return newId;
    }
}

Register it in Program.cs:

builder.Services.AddHttpContextAccessor();
builder.Services.AddFeatureManagement()
                .WithTargeting<UserTargetingContextAccessor>();

Using the Flag in a Razor Page

In Azure App Configuration’s Feature Manager, create a flag called LargeProductImages and attach a Targeting filter with a 20% rollout. Then in your page model:

public class ProductDetailsModel : PageModel
{
    private readonly IFeatureManager _features;

    public bool ShowLargeImages { get; private set; }

    public ProductDetailsModel(IFeatureManager features)
        => _features = features;

    public async Task OnGetAsync()
    {
        ShowLargeImages = await _features.IsEnabledAsync("LargeProductImages");
    }
}

And in the Razor view, the tag helper keeps the template clean:

@addTagHelper *, Microsoft.FeatureManagement.AspNetCore

<feature name="LargeProductImages">
    <img src="@Model.ImageUrl" class="product-image product-image--large" />
</feature>
<feature name="LargeProductImages" negate="true">
    <img src="@Model.ImageUrl" class="product-image" />
</feature>
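For reference, the flag you create in Feature Manager is stored as JSON along these lines. The schema below follows the Microsoft.FeatureManagement conventions; the description is made up for this example, and the 20 matches our rollout percentage:

```json
{
  "id": "LargeProductImages",
  "description": "Show larger product images to a test cohort",
  "enabled": true,
  "conditions": {
    "client_filters": [
      {
        "name": "Microsoft.Targeting",
        "parameters": {
          "Audience": {
            "Users": [],
            "Groups": [],
            "DefaultRolloutPercentage": 20
          }
        }
      }
    ]
  }
}
```

The Users and Groups arrays let you force specific accounts or roles into the variant, which is handy for letting the marketing team preview the change before the rollout begins.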

Telemetry

This closes the loop. You need to record which variant each user was shown so you can correlate it against conversion events in your OTEL pipeline (covered in our earlier OTEL post). Let’s upgrade our ProductDetailsModel class to add the full telemetry.

public class ProductDetailsModel : PageModel
{
    private readonly IFeatureManager _features;
    private readonly ILogger<ProductDetailsModel> _logger;

    public bool ShowLargeImages { get; private set; }

    public ProductDetailsModel(IFeatureManager features,
                               ILogger<ProductDetailsModel> logger)
    {
        _features = features;
        _logger = logger;
    }

    public async Task OnGetAsync(int productId)
    {
        ShowLargeImages = await _features.IsEnabledAsync("LargeProductImages");

        // Emit a structured log event so OTEL/App Insights can group by variant
        _logger.LogInformation(
            "AB variant assigned {Variant} for user on product {ProductId}",
            ShowLargeImages ? "LargeImages" : "Control",
            productId);
    }
}

Using structured logging here means you can write a Kusto query in Application Insights (or App Insights Workbooks) that groups conversion events by Variant and calculates the rate — which is exactly what the “Telemetry” section of the post is describing, just made concrete.
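As a sketch, assuming the log event above lands in the Application Insights traces table and conversions are tracked as a hypothetical custom event named "Purchase" (both assumptions about how your telemetry happens to be configured), such a query might look like:

```kusto
// Last variant each user was assigned
let assignments = traces
| where message startswith "AB variant assigned"
| extend Variant = tostring(customDimensions.Variant)
| summarize arg_max(timestamp, Variant) by user_Id;
// Conversion rate per variant
assignments
| join kind=leftouter (
    customEvents
    | where name == "Purchase"
    | distinct user_Id
    | extend Converted = 1
) on user_Id
| summarize Users = count(), Conversions = countif(Converted == 1) by Variant
| extend ConversionRate = todouble(Conversions) / Users
```

The exact table and column names will depend on your ingestion setup, but the shape of the query, joining variant assignments to conversion events and summarizing by variant, stays the same.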

Conclusion

There is too often the attitude of “we spent the money, we’re going to use it.” Or even worse, “the project sponsor likes it so we’re going to keep it.” You need to treat that spend as a sunk cost and be willing to discard a change if the evidence doesn’t support keeping it. That’s why telemetry is so critical to the EDD process. You need to be able to support your conclusions with evidence, and good telemetry provides that.

In the end, whatever the reason or the approach, Experiment-Driven Development can provide you with the means of safely rolling out new ideas and features with a minimum of risk.

Resources