Unit testing has been around for a long time. We know how to do it. It’s a well-established pattern and it just works. But unit testing an LLM is nothing like testing a sorting algorithm. The exact same input can produce dozens of different valid outputs, none of them “wrong.” The traditional approach to writing assertions breaks down immediately. If you’ve ever tried to write unit tests for anything AI-powered, you discovered very quickly that the way you approached testing before just doesn’t work.
What Are You Actually Testing?
Before you can write your AI-related tests, you first have to ask yourself a question: What do I mean when I say I’m “testing my AI feature”? There are three basic answers to that question.
1. Testing the Integration
This is the basic test, and it’s no different from any other test you’ve created around an integration with an external service.
Given I have an external API or service that I must call, and I have created my HTTP client,
When I call that external API or service,
Then does my HTTP call succeed, or do I correctly handle any errors that are returned?
It’s your basic integration test and nothing about that changes. You already know how to do this, so tackle the easy tests first.
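If you want to exercise the error-handling half of that question without hitting a real endpoint, a stubbed HttpMessageHandler keeps the test fast and deterministic. This is a sketch under assumptions: the endpoint, route, and test names here are illustrative, and your own client will likely wrap the HttpClient call in its own error handling.

```csharp
using System.Net;
using Xunit;

// Stub handler that always returns the configured status code
public class StubHandler(HttpStatusCode status) : HttpMessageHandler
{
    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken ct) =>
        Task.FromResult(new HttpResponseMessage(status));
}

public class ExternalServiceTests
{
    [Fact]
    public async Task Call_WhenServiceIsThrottled_SurfacesTheFailure()
    {
        // Hypothetical endpoint; no network traffic actually occurs
        var http = new HttpClient(new StubHandler(HttpStatusCode.TooManyRequests))
        {
            BaseAddress = new Uri("https://example.test/")
        };

        var response = await http.GetAsync("reviews");

        // EnsureSuccessStatusCode is where most clients surface transport failures
        Assert.False(response.IsSuccessStatusCode);
        Assert.Throws<HttpRequestException>(() => response.EnsureSuccessStatusCode());
    }
}
```

The same stub works for timeouts and 5xx responses: swap the status code and assert on whatever retry or fallback behavior your client implements.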
2. Testing the Prompt
Now we get into the AI-related portions of what we need to test. There are two halves to verify. First, can I reliably steer the prompt toward the correct kind of output? If you’ve done anything at all with AI, you know that the quality of any model’s output depends on the quality of the prompt it is given. Sometimes those prompts need some guidance.
3. Testing the Output
The other half of AI testing is tied to the output. How do I test that the output the AI returned, which will be inconsistent in its wording if not in its intent, meets my expectations for this application?
Each of these three questions needs a completely different testing strategy, and conflating them is where most development teams go wrong with their testing approach.
Mocking and Isolation
This is the familiar part of the testing. To be properly tested, our LLM-related code needs to be abstracted behind an interface, IChatService or something like that, so that we can inject it and fake it in our tests. This lets us verify that our code handles the response correctly regardless of what the model itself said. It’s standard .NET dependency injection, and it’s necessary regardless of the testing framework being used, be it xUnit or any of the others.
// The abstraction
public interface IChatService
{
    Task<string> GetCompletionAsync(string prompt, CancellationToken ct = default);
}

// The real implementation (registered in DI for production)
public class AzureOpenAIChatService : IChatService
{
    private readonly ChatClient _client;

    public AzureOpenAIChatService(ChatClient client) => _client = client;

    public async Task<string> GetCompletionAsync(string prompt, CancellationToken ct = default)
    {
        var result = await _client.CompleteChatAsync(
            [new UserChatMessage(prompt)], cancellationToken: ct);
        return result.Value.Content[0].Text;
    }
}

// A feature that uses it
public class ProductDescriptionService(IChatService chat)
{
    public Task<string> SummarizeReviewsAsync(IEnumerable<string> reviews) =>
        chat.GetCompletionAsync(
            $"Summarize these customer reviews in 2 sentences:\n{string.Join('\n', reviews)}");
}
And the related test if we were using NSubstitute:
public class ProductDescriptionServiceTests
{
    [Fact]
    public async Task SummarizeReviews_CallsChatService_WithReviews()
    {
        var fakeChatService = Substitute.For<IChatService>();
        fakeChatService
            .GetCompletionAsync(Arg.Any<string>())
            .Returns("Great product overall. Minor complaints about packaging.");

        var sut = new ProductDescriptionService(fakeChatService);

        var result = await sut.SummarizeReviewsAsync(["Loved it!", "Packaging was bad"]);

        Assert.Equal("Great product overall. Minor complaints about packaging.", result);
        await fakeChatService.Received(1).GetCompletionAsync(Arg.Is<string>(p =>
            p.Contains("Loved it!") && p.Contains("Packaging was bad")));
    }
}
This setup lets us answer question #1 and ensures that the technical aspects of our code function as expected, without needing to be concerned with what the AI is actually doing or saying. As I said: That’s the easy part. Now it gets more complicated.
Golden File/Snapshot Testing
With the technical aspects of testing out of the way, we need to figure out how to test the AI-specific aspects of our code. How we test depends on what our call to the AI is supposed to be doing. For prompts that should return generally consistent results, such as JSON extraction, OCR, classification and identification, summarization, or sentiment analysis, snapshot testing is a natural fit.
For snapshot testing, we run the prompt manually once and save the result as a “golden file”: a file that stores an acceptable output from the AI, against which all future outputs will be compared. Each test’s output is then diffed against the golden file. If the diff is within a certain threshold, the test passes; outside of it, the test fails. For each test you’ll need to establish what an acceptable threshold is.
Regression testing with this approach makes it easy to flag major changes in the structure or content of the model’s output, surfacing responses that need immediate analysis so you can determine whether the change in model behavior requires further action.
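Before reaching for a library, the threshold idea can be sketched by hand. This is a minimal illustration, not a production diff: it scores the fraction of golden-file words that survive into the new output and passes when that fraction clears a threshold you choose. A real diff tool or an embedding-based comparison would be more robust.

```csharp
using System;
using System.Linq;

public static class GoldenFile
{
    // Crude similarity: fraction of golden-file words present in the new output
    public static double Similarity(string golden, string received)
    {
        var goldenWords = golden.ToLowerInvariant()
            .Split(' ', StringSplitOptions.RemoveEmptyEntries).ToHashSet();
        var receivedWords = received.ToLowerInvariant()
            .Split(' ', StringSplitOptions.RemoveEmptyEntries).ToHashSet();

        if (goldenWords.Count == 0) return 1.0;
        return goldenWords.Intersect(receivedWords).Count() / (double)goldenWords.Count;
    }

    // Pass/fail against the per-test threshold you established
    public static bool WithinThreshold(string golden, string received, double threshold) =>
        Similarity(golden, received) >= threshold;
}
```

A test would then read the saved golden file from disk and assert something like WithinThreshold(golden, output, 0.8).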
There are a number of tools available to help with snapshot testing. One is Verify, a “snapshot tool that simplifies the assertion of complex data models and documents”. It can run externally, integrate with ReSharper’s or Rider’s test runners as a plugin, or plug into your xUnit, NUnit, MSTest, or other testing framework tests.
There are more details on the Verify site, but the basic flow is always the same. You start with your “golden file”, which Verify calls the “verified” file. The test runs, captures the output, and a diff is created between the received file and the verified file. If they’re the same, the test passes; if not, it fails. Usefully, Verify lets you decide whether the received file should become the new standard, and can automate replacing the verified file with the received one. It’s a good approach for scenarios where you expect a consistent result from the LLM.
dotnet add package Verify.Xunit
dotnet add package Verify.DiffPlex # optional, for readable diffs
[UsesVerify]
public class SentimentClassifierTests
{
    [Fact]
    public async Task ClassifySentiment_PositiveReview_ReturnsExpectedStructure()
    {
        // Use the real model here — this is what generates the golden file
        var chat = new AzureOpenAIChatService(/* configured client */);

        var result = await chat.GetCompletionAsync(
            "Classify the sentiment of this review as positive, negative, or neutral. " +
            "Return JSON with fields: sentiment, confidence (0-1), keywords[].\n" +
            "Review: 'Absolutely love this product, shipping was fast!'");

        await Verify(result);
    }
}
The first time you run it, Verify creates a text file: SentimentClassifierTests.ClassifySentiment_PositiveReview_ReturnsExpectedStructure.received.txt. You review the result, and if it’s acceptable, rename it to .verified.txt instead of .received.txt. Future runs will generate new .received files and do a diff against the saved .verified version.
For pieces that you expect to change from run to run, such as dates and ID values, Verify “scrubs” the volatile values, replacing them with stable tokens before the diff. Datetime scrubbing is on by default and can be turned off with DontScrubDateTimes(); other values can be scrubbed with functions such as ScrubMember() and ScrubLinesContaining().
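As a hedged example, assuming a prompt whose JSON output embeds volatile fields (the field names here are illustrative), a VerifySettings instance can scrub the offending lines before the comparison runs:

```csharp
[Fact]
public async Task ExtractOrder_IgnoresVolatileValues()
{
    var chat = new AzureOpenAIChatService(/* configured client */);

    var result = await chat.GetCompletionAsync(
        "Extract the order details from this email as JSON, " +
        "including orderId and processedAt fields: ...");

    var settings = new VerifySettings();
    // Any line containing these markers is replaced with a stable token,
    // so run-to-run changes in IDs and timestamps don't fail the diff
    settings.ScrubLinesContaining("orderId", "processedAt");

    await Verify(result, settings);
}
```

Scrub as narrowly as you can: the more you scrub, the less the snapshot actually verifies.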
Now comes the tricky part. What do you do when the output from the LLM should generally mean the same thing, but the wording and presentation can vary wildly from test to test?
Evaluation Frameworks
Here is where things diverge from most .NET testing approaches. When you can’t test the exact output, you have to write evaluators. An evaluator is a function that scores a response from an LLM. Evaluators help you answer questions such as:
- Does it contain the required entities?
- Does it have the correct sentiment?
- Does it stay under the expected token budget?
- Does it avoid prohibited content and not go completely off the rails?
With evaluators, you run them consistently over a sample set of prompts and get a summary of the averaged results. More importantly, you need to track those results over time. LLM behavior tends to drift: the answers and the sentiment may change, and that change could be slow or rapid. You need to watch how those changes occur, and by how much, so that you can be proactive about addressing any evolving issues. You don’t want your AI-driven solution to be in the news for all the wrong reasons.
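The run-it-over-a-sample-set idea can be sketched simply: score every prompt in a small suite, assert on the average rather than on any single response, and persist each run’s average so drift shows up over time. The scoring delegate below stands in for whatever evaluator you actually use; the names are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class EvalRuns
{
    // Average a per-prompt scoring function over a sample set
    public static async Task<double> AverageScoreAsync(
        IEnumerable<string> prompts, Func<string, Task<double>> scoreAsync)
    {
        var scores = new List<double>();
        foreach (var prompt in prompts)
            scores.Add(await scoreAsync(prompt));
        return scores.Average();
    }
}

// Usage in a test: fail the run when the suite's mean quality dips,
// and log each run's average so a dashboard can chart drift over time.
// var avg = await EvalRuns.AverageScoreAsync(samplePrompts, EvaluateSummaryAsync);
// Assert.True(avg >= 3.5, $"Average quality dropped to {avg}");
```

Asserting on the mean keeps a single odd response from failing the build while still catching systematic drift.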
Again, there are a number of tools to assist with this testing, and Microsoft provides two excellent ones. The first is Prompt flow, a library for developing prompt-based applications that also includes a suite of functions to evaluate the quality and performance of the output. The second is the Microsoft.Extensions.AI.Evaluation family of libraries, designed specifically to evaluate the quality and safety of responses generated by AI models in .NET applications.
Here’s an example using the Evaluation libraries:
dotnet add package Microsoft.Extensions.AI.Evaluation
dotnet add package Microsoft.Extensions.AI.Evaluation.Quality
public class ProductSummaryEvaluationTests
{
    // ChatConfiguration wraps the IChatClient used by the "judge" model
    private static readonly ChatConfiguration EvaluatorConfig =
        new(/* IChatClient for the judge model */);

    [Fact]
    public async Task ProductSummary_MeetsQualityThresholds()
    {
        // Arrange
        var chat = new AzureOpenAIChatService(/* subject model */);
        var reviews = new[]
        {
            "Fantastic quality, will buy again.",
            "Took a while to arrive but worth the wait.",
            "Good value for money."
        };
        var prompt = $"Summarize these reviews in 2 sentences:\n{string.Join('\n', reviews)}";

        // Act — get the real AI response
        var summary = await chat.GetCompletionAsync(prompt);

        // Build the evaluation context
        var messages = new List<ChatMessage> { new(ChatRole.User, prompt) };
        var modelResponse = new ChatResponse(new ChatMessage(ChatRole.Assistant, summary));

        // Evaluate with built-in quality evaluators
        IEvaluator evaluator = new CompositeEvaluator(
            new CoherenceEvaluator(),
            new RelevanceEvaluator(),
            new FluencyEvaluator());

        var result = await evaluator.EvaluateAsync(
            messages, modelResponse, EvaluatorConfig);

        // Assert on numeric scores (each evaluator returns 1–5)
        var coherence = result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);
        Assert.True(coherence.Value >= 3, $"Coherence too low: {coherence.Value}");

        var relevance = result.Get<NumericMetric>(RelevanceEvaluator.RelevanceMetricName);
        Assert.True(relevance.Value >= 3, $"Relevance too low: {relevance.Value}");
    }
}
To enhance these, you would add your own custom evaluators, like:
public class NoHallucinationEvaluator : IEvaluator
{
    public const string MetricName = "NoHallucination";

    public IReadOnlyCollection<string> EvaluationMetricNames => [MetricName];

    public async ValueTask<EvaluationResult> EvaluateAsync(
        IEnumerable<ChatMessage> messages,
        ChatResponse modelResponse,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<EvaluationContext>? additionalContext = null,
        CancellationToken cancellationToken = default)
    {
        var sourceText = messages.Last().Text;
        var responseText = modelResponse.Text;

        // Use the judge model to check for invented facts
        var judgePrompt =
            "Does the following summary contain any claims not supported by the source reviews? " +
            $"Answer only YES or NO.\n\nSource:\n{sourceText}\n\nSummary:\n{responseText}";

        var judgment = await chatConfiguration!.ChatClient
            .GetResponseAsync(judgePrompt, cancellationToken: cancellationToken);

        var passed = judgment.Text.Trim()
            .StartsWith("NO", StringComparison.OrdinalIgnoreCase);

        return new EvaluationResult(new BooleanMetric(MetricName, passed));
    }
}
What “Good Enough” Looks Like
That’s the big question. Unlike normal unit testing, AI-focused testing forces you to be pragmatic. AI responses will not be acceptable 100% of the time, so what is acceptable? 95%? 90%? That’s where good logging and monitoring come in; for more on that, refer back to our post on OpenTelemetry. Make sure you’re logging the prompts, the results, the token usage, latency, and so forth as structured telemetry. Your acceptance criterion for AI-related functionality is a dashboard tracked over time, not a bunch of green dots in your test suite.
public class InstrumentedChatService(IChatService inner, ILogger<InstrumentedChatService> logger)
    : IChatService
{
    public async Task<string> GetCompletionAsync(string prompt, CancellationToken ct = default)
    {
        var sw = Stopwatch.StartNew();
        var response = await inner.GetCompletionAsync(prompt, ct);
        sw.Stop();

        logger.LogInformation(
            "AI completion: promptLength={PromptLength} responseLength={ResponseLength} latencyMs={LatencyMs}",
            prompt.Length, response.Length, sw.ElapsedMilliseconds);

        return response;
    }
}
// In Program.cs — wrap the real service with the instrumented decorator
builder.Services.AddScoped<AzureOpenAIChatService>();
builder.Services.AddScoped<IChatService>(sp =>
    new InstrumentedChatService(
        sp.GetRequiredService<AzureOpenAIChatService>(),
        sp.GetRequiredService<ILogger<InstrumentedChatService>>()));
Keeping Your Tests Current
AI is constantly evolving, and your tests need to evolve with it. AI-focused tests are not like most unit tests: you can’t write them once and forget about them until they start failing. They need to be regularly reviewed, evaluated, and updated. The models change, and your users will constantly find new ways to prompt the AI into doing things it shouldn’t, whether intentionally or not. Regularly review the input from your users and update your tests accordingly. What was “good enough” a few months ago probably isn’t good enough now.
Conclusion
AI features aren’t tested like functions; they’re evaluated. They’re experiments, really, and they need to be treated as such. Consider our previous post on A/B Testing: testing AI features needs to be regarded in the same manner. That same Experiment Driven Development approach needs to be at the core of your thinking when rolling out new AI-focused features. AI is constantly changing and evolving, and how you test it needs to change as well.

