Part of any good software development strategy includes writing tests for your code. For any test to be valid, it needs a good set of data to test the code against. But where does that testing data come from? How do you create testing data that validates that your tests are, well… valid?
The Wrong Strategy: Making a Copy of Production
This is the time-honored classic, and for years it was the approach everyone used. Most of the time, it’s the wrong approach. Using production data for testing, even in QA and UAT environments, exposes your company to several risks:
Security Vulnerabilities - Production data often contains sensitive and secure information like PII (Personally Identifiable Information), financial records, API keys, strategic data, sales information, and so forth. We all know that the “lower” environments are never as secure as production. Even if you get everyone following all the security best practices, the data in these lower environments is not secure. Just having a process in place to copy the data out of production to a lower environment opens up an attack vector to your production data. It’s bad security. And even if you “anonymize” the PII and financial data, there are plenty of tools and techniques available today that can help an attacker re-identify those customers based on the data that doesn’t get changed.
Legal & Regulatory Compliance - There are a number of laws across the globe to protect user data, such as HIPAA and GDPR. Exposing customer and sensitive data to people who should not have open access to it without strict controls (i.e. testers and developers and devops, oh my!) exposes your company to legal liability, fines, and lawsuits. Not to mention the reputational risk to your company’s name.
Data Staleness & Data Integrity - Snapshots are a picture in time of your data. So even if it was “real, live” production data at one point, the longer it has been since that snapshot, the less accurate that data is as a reflection of your production environment. As a consequence, the test results could be misleading. Add to that the fact that testing changes that data. If you started with a snapshot, the more you test it, the more that data gets changed. Even if you carefully design your tests to use transactions and revert those transactions after each test, things get missed, the data gets altered, and it strays over time.
Performance Impact - Making a copy of production can tie up your production system. The more data production collects over time, the longer that copy can take. This can have a significant performance hit on your production systems.
Making a copy of production is a bad idea. Don’t do it. Ever. But what can you do to create valid test data that will work? Let’s take a look at some options.
Manual Creation
The simplest path is to create hard-coded data for each test and pass that data into the test. It’s quick for simple test cases and data sets, and can be put together with relative ease. It does have a major downside, however. As the codebase gets larger and more complex, hand-crafting enough data to keep up with all the tests becomes time-consuming, tedious, and difficult to maintain. On top of that, you may miss situations that should be tested, especially edge cases you didn’t think of. It’s a good starting point for simple scenarios, but it rapidly gets overwhelming as the codebase grows.
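To make that concrete, here’s a minimal sketch of the manual approach. The Student record and the validation rule are hypothetical, invented just for this illustration:

```csharp
using System;

// Hypothetical domain type for the example.
public record Student(string FirstName, string LastName, string StudentId);

public static class ManualDataExample
{
    // A made-up rule under test: a student must have a non-empty ID.
    public static bool HasValidId(Student s) => !string.IsNullOrEmpty(s.StudentId);

    public static void Main()
    {
        // Hard-coded test data, written out by hand for this one case.
        var student = new Student("John", "Smith", "12345");
        Console.WriteLine(HasValidId(student)); // True

        // Every additional scenario (missing ID, unusual names, edge
        // cases) needs another object built by hand like this one.
        var noId = new Student("Mary", "Jones", "");
        Console.WriteLine(HasValidId(noId)); // False
    }
}
```

Two test cases take two hand-built objects; two hundred take two hundred.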
Scripting a Test Database
You could take the manual creation flow a step further and go the route of scripting out a test database that generates a new instance (or refreshes an instance) of the test database before running the unit tests. While that does get around some of the issues of making a copy of production and manual creation, it also has some of the same issues, especially around data integrity and performance. It also has the issue of size. As your codebase grows, so does the test database. It will take longer and longer to set up the instance each time you want to run your tests. And if you aren’t careful with how you write your tests, they may alter the test data so that later tests run into data in a state that was not expected, producing unexpected results that can be hard to trace to their root cause. Further, tests running in parallel may interfere with each other in unexpected ways, causing those same unpredictable results.
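The shared-state problem is easy to demonstrate even without a real database. This sketch uses a plain in-memory dictionary as a stand-in for a seeded test database (purely for illustration):

```csharp
using System;
using System.Collections.Generic;

public static class TestDatabaseSketch
{
    // Stand-in for "scripting out a test database": seed a known state.
    public static Dictionary<string, string> SeedStudents() =>
        new() { ["12345"] = "John Smith", ["67890"] = "Mary Jones" };

    public static void Main()
    {
        var db = SeedStudents();

        // Test A mutates the seeded data...
        db.Remove("12345");

        // ...so Test B, run against the same instance, now sees a state
        // the seed script never produced.
        Console.WriteLine(db.ContainsKey("12345")); // False

        // Re-seeding before each test restores predictability, but as
        // the schema and data grow, this setup step gets slower.
        db = SeedStudents();
        Console.WriteLine(db.ContainsKey("12345")); // True
    }
}
```

The same trade-off applies to a real scripted database, just with minutes of setup time instead of microseconds.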
Factory/Builder Pattern
The Builder Pattern is a great way to set up test data that can adapt to all kinds of situations. We’ve touched on the Builder Pattern previously, but it’s worth another quick overview in this context.
It takes some effort to do the initial setup with reasonable defaults for the majority of situations, but it also allows you to customize the generated data with just a little bit of enhancement.
To implement the Builder Pattern, you start with a base class that provides a set of default data and a Build() function:
public class StudentBuilder
{
    private string _firstName = "John";
    private string _lastName = "Smith";
    private string _studentId = "12345";

    public Student Build()
    {
        return new Student
        {
            FirstName = _firstName,
            LastName = _lastName,
            StudentId = _studentId
        };
    }
}
When you need to generate a student with the defaults for a unit test, you would call it as follows:
var student = new StudentBuilder().Build();
This will return a new Student object with a first name of “John”, last name of “Smith”, and a student ID of “12345”. To this we can add functions to customize the object. Most of the time, we use the pattern of “With…” to name the additional functions that add data: WithName, WithStudentId, etc. And we use the pattern of “Clear…” to name functions that clear out default values: ClearStudentId.
public class StudentBuilder
{
    private string _firstName = "John";
    private string _lastName = "Smith";
    private string _studentId = "12345";

    public StudentBuilder WithName(string firstName, string lastName)
    {
        _firstName = firstName;
        _lastName = lastName;
        return this;
    }

    public StudentBuilder ClearStudentId()
    {
        _studentId = "";
        return this;
    }

    public Student Build()
    {
        return new Student
        {
            FirstName = _firstName,
            LastName = _lastName,
            StudentId = _studentId
        };
    }
}
This lets us override our default data with more customized data:
var student = new StudentBuilder().WithName("Mary", "Jones").Build();
This would return a new Student instance with a name of “Mary Jones” and a student ID of “12345”.
This pattern lets us set up a good default data object, but also lets us make the data highly customizable to the specific needs of each individual test. Another common pattern is to have the default be null/empty values for an object instance, but to also have a function named WithDefaultValues() that we can call to populate the common fields. We can also easily chain these builders together, passing the output from one into the builder for another.
var grades = new StudentGradeBuilder().WithFreshmanGrades().Build();
var student = new StudentBuilder().WithDefaultValues().WithGrades(grades).Build();
This could be the pattern to generate a student with grades attached to their instance. The biggest advantage of the Builder Pattern is that it lets us ensure that the data we are testing with is predictable and repeatable across every iteration of the test. The biggest thing to remember with the Builder Pattern is that the order that you chain the calls together matters. If you have two different methods that alter the same piece of data, the last one wins.
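Here’s a self-contained sketch of that ordering rule, reusing the StudentBuilder shape from above. Both chained calls set the same fields, and the later call wins:

```csharp
using System;

public class Student
{
    public string FirstName { get; set; } = "";
    public string LastName { get; set; } = "";
    public string StudentId { get; set; } = "";
}

public class StudentBuilder
{
    private string _firstName = "John";
    private string _lastName = "Smith";
    private string _studentId = "12345";

    public StudentBuilder WithName(string firstName, string lastName)
    {
        _firstName = firstName;
        _lastName = lastName;
        return this;
    }

    public Student Build() => new Student
    {
        FirstName = _firstName,
        LastName = _lastName,
        StudentId = _studentId
    };
}

public static class ChainingOrderDemo
{
    public static void Main()
    {
        // Both calls set the same fields; the last one in the chain wins.
        var student = new StudentBuilder()
            .WithName("Mary", "Jones")
            .WithName("Ann", "Lee")
            .Build();

        Console.WriteLine(student.FirstName); // Ann
    }
}
```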
That gets us a long way, but it also has some limitations. For this example, what if I need a large group of students to test? Setting up each of those students could get tedious. If we were to go to the effort of setting up a group of 20 student objects using the Builder Pattern, those 20 students will be static across each iteration of the test. That might be a good thing or a bad thing depending on the needs of the test. But sometimes we also need to introduce a bit of randomness into our testing to help us discover edge cases.
Data Libraries & Tools
There are a lot of libraries available to help us generate test data. These libraries revolve around being able to generate random test data for us so we don’t have to create them ourselves. Each of the testing libraries available has things they do especially well.
Database Generators
For databases, Redgate SQL Data Generator is a common commercial tool. Another is Datanamic Data Generator. These, and the numerous other tools out there, work in much the same way: they take your database schema and some defined rules for the various fields, and generate blocks of test data for you to test with. This has advantages, in that it can generate a large amount of random but appropriate data for your test database. However, these tools share the same potential downsides around data integrity as scripting out a database yourself.
That said, there are a number of valid scenarios where you want a database full of data to test against. One common scenario is performance testing. Using the Builder Pattern for individual unit tests is perfect, but it can’t replicate in any form the scenario of massive numbers of users hitting a real database simultaneously. If you want to identify choke points and performance issues, you want a real, live, large database for those tests. These tools can help you spin up those databases in just a few minutes.
Online Data Generators
There are a number of online tools, such as Mockaroo and GenerateData.com. They all work in much the same way: you pick from common data types or define the data structure yourself, and the tool generates a batch of sample data you can export in various formats and use however you need. Some, like GenerateData.com, will even export the data as C# classes that you can drop directly into your code.
C# Libraries
For C#, there are multiple options for generating valid test data for unit and functional tests.
AutoFixture
AutoFixture is an open source library designed to automatically create instances of classes, even complex ones, populated with test data. You just instantiate the Fixture class, then call its Create<T> function, which generates random data for the public properties of the passed-in type. For example, for our Student class, we might arrange it as follows:
Fixture fixture = new Fixture();
Student student = fixture.Create<Student>();
This will create a student with random text for the FirstName, LastName and StudentId fields. It’s very basic, but it can quickly populate a new instance of an object with random data appropriate to the type.
The downside is that it is, indeed, basic. You really don’t have any control over what data gets put in there. The 3 string fields in Student will get random strings. An int will get a random, but valid integer. That’s about as far as AutoFixture will take you. But if that’s all you need, then it’s fine for what it provides.
GenFu
GenFu is another open source library for testing various types of data. A big step up from AutoFixture, the data it generates is more realistic, using helpers they call “Fillers”, which can identify common properties based on name. For instance, if you have a Person class:
class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}
and you instantiate an instance like this:
var person = A.New<Person>();
It will automatically identify FirstName and LastName and populate random data for them. If you need a List<Person>:
var people = A.ListOf<Person>();
It will automatically generate a List<Person> with 25 instances (25 is the default if you don’t specify). It’s a unique concept that I haven’t found in any other libraries. It also lets you define ranges for values like numbers. They have fillers for a bunch of data types. Unfortunately, GenFu is no longer in active development. But, for a lot of scenarios, it’s still a pretty good library to use.
Bogus
Bogus is a C# port of the Faker.js library, and it’s another topic we’ve visited previously.
Bogus excels at creating appropriate data for people, including names, physical addresses, email addresses, phone numbers, user IDs, and so forth. And it can customize that data based on locales. For instance, if you need a set of German names, or Slovakian names, it can do so.
Beyond that, it can generate all kinds of random data like product names, company names, dates and numbers, paragraphs of text, images, IP addresses, passwords, file paths, cars, and so on. There is a vast array of random data it can generate. Best of all, it’s easy to implement.
Bogus is open source, but it also has a premium paid tier that includes additional data specific to geographic locations, medical data, and a few other things.
Bogus has a couple of different approaches to generating data. Let’s enhance our StudentBuilder class with random names instead of a static one using the Bogus Faker facade.
public class StudentBuilder
{
    private string _firstName;
    private string _lastName;
    private string _studentId = "12345";

    public StudentBuilder()
    {
        // Create a new Faker instance with the English (en) locale.
        var faker = new Faker("en");
        _firstName = faker.Name.FirstName();
        _lastName = faker.Name.LastName();
    }

    public Student Build()
    {
        return new Student
        {
            FirstName = _firstName,
            LastName = _lastName,
            StudentId = _studentId
        };
    }
}
Instead of a static first and last name, this will generate a random name each time a new instance of the builder is created.
Bogus also has a great randomizer function to generate random values like IDs using any pattern. For instance:
var studentId = new Randomizer().Replace("#######");
This will generate a random 7-digit value like 8675309. Or we could use “*******” to generate 7 alphanumeric characters. It even lets you set the seed for the Randomizer if you want random but deterministic values for repeatable test output. There’s a lot more to Bogus than what I’ve covered here. For instance, you can extend it with your own custom datasets and extension methods. It’s by far the most in-depth library of this type for C# currently, and it’s still actively maintained.
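If you haven’t used seeded randomness before, the idea is easy to demonstrate with the standard library alone. Here System.Random stands in for Bogus’s Randomizer (this is not the Bogus API, just an illustration of the seeding concept):

```csharp
using System;

public static class SeededRandomDemo
{
    // Generate a pseudo-random 7-digit ID from a given seed.
    public static string RandomId(int seed)
    {
        var rng = new Random(seed);
        return rng.Next(1_000_000, 10_000_000).ToString();
    }

    public static void Main()
    {
        // Same seed, same sequence: both calls produce the same ID, so a
        // failing test can be reproduced exactly.
        Console.WriteLine(RandomId(42) == RandomId(42)); // True

        // Different seeds give different data for broader coverage.
        Console.WriteLine(RandomId(1));
        Console.WriteLine(RandomId(2));
    }
}
```

Seeded generation is what turns “random” test data into repeatable test data.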
Property Based Testing
If you don’t need structured data, there is also a different approach to testing called Property-Based Testing. Instead of designing specific data, you describe a property that should hold true for every valid input, and the library generates batches of random inputs on the fly, trying to find one that breaks the property.
For .NET, one of the more popular libraries that implements this is Hedgehog. For example (taken from the Hedgehog docs), to verify that reversing a list twice returns the original list, for randomly generated lists of up to 100 characters, you would write:
var property =
    from xs in Gen.Alpha.List(Range.LinearInt32(0, 100)).ForAll()
    select xs.Reverse().Reverse().SequenceEqual(xs);

property.Check();
When Check() runs, Hedgehog generates a batch of random lists (here, lists of alphabetic characters between 0 and 100 elements long) and asserts the property against each one, without you having to define any of the inputs yourself. The advantage of this approach, and of libraries like Hedgehog, is that they can sweep wide bands of inputs to help identify the edge cases where failures occur. They don’t have to run every possible case, of course. These libraries sample a range of assorted test values from the minimum to the maximum to get a good result set without tying up your system in excessively long test runs.
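The core idea doesn’t require a library at all. This hand-rolled sketch (plain C#, no Hedgehog) shows the mechanics: generate many random inputs, and assert the property for each one. Real property-testing libraries add conveniences like shrinking failing inputs down to minimal counterexamples:

```csharp
using System;
using System.Linq;

public static class PropertySketch
{
    // The property under test: reversing a sequence twice yields the original.
    public static bool ReverseTwiceIsIdentity(int[] xs) =>
        xs.Reverse().Reverse().SequenceEqual(xs);

    public static bool CheckProperty(int iterations = 100)
    {
        var rng = new Random(12345); // seeded for repeatable runs
        for (int i = 0; i < iterations; i++)
        {
            // Random length 0..100, random contents: each iteration is a
            // fresh test case we never had to write by hand.
            var xs = Enumerable.Range(0, rng.Next(0, 101))
                               .Select(_ => rng.Next())
                               .ToArray();
            if (!ReverseTwiceIsIdentity(xs)) return false;
        }
        return true;
    }

    public static void Main() =>
        Console.WriteLine(CheckProperty() ? "Property holds" : "Counterexample found");
}
```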
Conclusion
In the end, taking the time to set up good, reliable, and predictable data for testing is well worth the up-front effort. By combining the Builder Pattern with a data generator like Bogus, you get the best of both worlds: testing that is consistent, reliable, repeatable, and predictable. You get data that is correctly prepared, and your tests have more validity as a result. It’s an effective way to emulate your production data without exposing yourself to the legal and security risks of copying data from production.

