Tests aren't what you think!

In programming, the word “tests” evokes a metaphor of teachers quizzing students. It’s as if our software “tests” are intended to throw tricky information at our code to verify that our code figures out the answer. The word “test” suggests validation. Supposedly a “test” determines whether our code “does the right thing”. But this entire metaphor is mistaken. Our tests don’t ensure that our code does the right thing. Tests have no concept of “correct” or “incorrect”. It’s the wrong metaphor. What we call “tests” should instead be called “chains”.

Let’s say that you ride a bike, and you store that bike outside. Each night when you get home, you take the time to wrap a chain through and around your bike. Each morning when you leave home, you take the time to unwrap the chain from your bike. This process is a little bit annoying, and it takes some time that you otherwise wouldn’t spend. So why do it at all?

Anyone who’s ever had a bike stolen knows exactly why you take the time to wrap it with a chain. The chain makes sure that during the night, when you’re not paying any attention, your bike doesn’t disappear. From the moment you add the chain, to the moment you remove the chain, you can rest assured that your bike remains exactly where you left it. That’s a valuable promise, and that’s why we wrap our bikes with chains.

Software “tests” function just like a chain around a bike, and they do so for exactly the same reason. When you write a test, you impose constraints on your code based on the current behavior of your code, which you’ve personally validated. The test itself doesn’t know whether your code does the “right thing” or not, but you do. Now that you’ve decided that your code does the right thing, you write a test to chain your code to those expectations. Your code remains chained to those expectations until you come back later and alter the test, the same way that your bike remains exactly where you left it, until you come back the next morning and remove the chain.
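To make the chain idea concrete, here is a minimal sketch in Python (chosen only for brevity, the same shape applies in any language). The function format_price and its expected outputs are hypothetical, invented purely for illustration. The point is that the assertions simply record behavior we have already checked by hand, and hold the code to it.

```python
def format_price(cents: int) -> str:
    """Hypothetical production function whose output we verified by hand."""
    dollars, remainder = divmod(cents, 100)
    return f"${dollars}.{remainder:02d}"


def test_price_formatting_stays_where_we_left_it():
    # The test has no notion of "correct". It only pins down the behavior
    # we already approved, so any later change to that behavior fails loudly.
    assert format_price(1999) == "$19.99"
    assert format_price(5) == "$0.05"


if __name__ == "__main__":
    test_price_formatting_stays_where_we_left_it()
```

If someone later changes format_price to drop the leading zero in the cents, the chain catches it. Whether that change is an improvement is still your call, and making it means unlocking the chain on purpose by editing the test.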

Some developers argue that tests are a waste of time because eventually you have to come back and change the tests. The tests just get in the way of the “real work”. Wouldn’t it be nice if we could change our code without spending time changing tests? Sure, just like it would be nice if you could leave your bike outside unchained without anybody stealing it. But this is the real world. If you want to keep your bike, you have to chain it up. If you want your software project to survive, you have to chain it up. In the absence of effective tests, you will write spaghetti code, you will violate encapsulation boundaries, you will introduce lots of accidental complexity, you will create bugs that make it to your customers, you will rebreak the same behaviors over and over, you will move slower and slower over time, and you will reach the point where you tell management that you have to re-write the whole app because “reasons”.

How to chain your code

The art and science of testing is to understand what’s worth chaining up, what isn’t, and what kind of chain you should use.

What you should chain up

You chain up your bike, but you probably don’t chain up your potted plants, your deck furniture, or your dresser. Is that because those items can’t be stolen? No, it’s because those items are incredibly unlikely to be stolen. In fact, it’s so unlikely that we forgo the purchase and installation of a cheap chain to secure any of those items. Similarly, not everything in your code needs to be chained up. Sometimes code is so unlikely to break that you don’t gain anything by locking it down. That doesn’t mean your code can’t break. It just means that a breakage is so unlikely that it’s not worth writing and maintaining a test suite. For the testing purists, yes, this is a tradeoff, but as all the experts keep reminding me, software engineering is the process of repeated tradeoffs.

The best behavior to start chaining up is user interactions. Your app or package has an explicit specification for what’s supposed to happen during a given user interaction. We might be talking about a tap, double tap, long press, drag, pan, pinch, or text input. Not only does your specification dictate exactly what’s supposed to happen during these user interactions, but user interactions are also the most critical behavior in a “user interface”. This is why we use Flutter in the first place. For user interactions. By chaining up your user interactions, you guarantee that your design specification is implemented as requested, and you guarantee that your users can use your app or package as intended. Additionally, user interactions are the highest level behavior in your code. Therefore, user interaction tests exercise large swathes of your codebase, giving you a lot of additional test coverage, for free.

As a codebase grows, the code evolves from a single bundle of functionality into a collection of subsystems that all talk to each other. For example, consider Super Editor. We could think of Super Editor as one big bundle of code, and we could test that entire bundle of code through user interactions. But this is counterproductive. Within Super Editor, we have certain areas of code where we’ve installed strong encapsulation boundaries. We can think of Super Editor as a composition of a logical document, a logical document editor, logical content selections, document layouts, and document gestures. As a result, we can write targeted tests for those subsystems, and those tests cover more functionality with less code than interaction tests. Those targeted tests also tell us exactly where something is going wrong, unlike user interaction tests.

Furthermore, as a single codebase evolves into a composition of subsystems, the interactions between those subsystems become de facto protocols. In other words, the interactions between subsystems aren’t just “incidental” behavior; they’re carefully designed behavior. User interaction tests make sure that the requirements from your design team are fulfilled. Subsystem tests make sure that the requirements from your software architect are fulfilled. The subsystem protocols may be arbitrary, but those arbitrary requirements are very real, and they should remain stable.

Consider Flutter’s protocol between Widgets, Elements, and RenderObjects. On the one hand, we could say that the relationships between these objects are supposed to result in visible behavior, such as painting a UI. By that logic, we should test these objects by starting up an entire Flutter app, pumping a widget tree, and then validating the visual output. On the other hand, the communication between Widgets, Elements, and RenderObjects is complex, critical, and expected to remain as stable as possible. It’s a much better idea to write tests that chain up the precise interactions between these objects. Regardless of the visual outcome, we don’t want the Widget, Element, and RenderObject protocol to change unless it has to, and unless we intended to change it. Chaining up the protocol will give us the desired result.

As you choose what to chain up, and what to leave free, consider a few heuristics. Every user interaction specified by your design should probably be tested. Every protocol interaction should probably be tested. Any area of code that you find difficult to read or explain should probably be tested. Any area of code that you’ve broken in the past should probably be tested. Beyond that, use your discretion.

What kind of chains you should use

The same way that chains come in different lengths, thicknesses, and prices, tests come in different scopes and levels of stability, each with its own tradeoffs. Let’s describe the options from broadest to most specific.

The broadest test you can write is an end-to-end test. An end-to-end test is a test that runs your real app, on a real device or emulator, and communicates with real servers and services. An end-to-end test is the closest that your tests can get to real-world human use. As such, if your end-to-end tests are well-written, you cover all reasonable user goals, and your tests pass consistently, then those tests are your best indicator that your app does what users need it to do.

That said, end-to-end tests come at great cost. End-to-end tests require significant infrastructure. Your test runner needs to build a real app, such as an APK or IPA, rather than simply deploy your Flutter code to the Flutter test system. Building a full app means that all of your app-level configurations, such as API keys, app permissions, and signing certificates, must be configured as needed. Your test runner also needs to know how to take your compiled app and deploy it to a device or an emulator, while connecting to a special session to run test scripts and capture results. End-to-end tests take thousands of times longer to run than tests in Flutter’s local test runner because they require full app builds, and because they run real network and system behaviors. End-to-end tests are also notoriously flaky, for the same reasons. Lastly, you can’t let end-to-end tests interact with your production servers, because your end-to-end tests might corrupt production data. So your team needs to set up replicas of your production servers for the sole purpose of interacting with your end-to-end tests. Those test servers will need to be kept up-to-date with the code running on your production servers. As a result, in practice, developers minimize the number of end-to-end tests.

One level down from end-to-end tests are user interaction integration tests. Integration tests are tests that run on a real device, similar to end-to-end tests. However, you can choose to run integration tests without talking to any servers or services and instead test user journeys with fake server and service responses. In practice, there isn’t much of a point to these tests. There may be unique situations where Flutter’s local test runner can’t facilitate your test. In those cases you might choose to run those interaction tests on a real device. But this situation is rare. Typically, you either go all the way to end-to-end tests, or you stick with non-integration user interaction tests.

After integration tests, there are user interaction widget tests. I usually refer to these simply as “user interaction tests”. A user interaction test pumps a widget tree with Flutter’s local test runner, simulates various user interactions, and then checks what happens after each interaction. You might check that a drag changed a scroll offset, or a tap on a button submitted a form. User interaction tests are a great sweet spot for testing tools. These tests cover the majority of the details that you’d chain up with end-to-end tests, but these tests run a thousand times faster, they’re not flaky, and no test infrastructure is required. As mentioned earlier, when you’re starting on a new project, most of the tests that you write should be user interaction tests. They give you the biggest bang for your buck.

User interaction tests are the last type of test that focuses on product requirements. The remaining forms of tests focus on architectural requirements. For example, you may want to chain up a protocol between a few objects. These objects send messages to each other in a particular order, and under particular conditions. A test that chains up the communication protocol between a few objects is what I call a “component test”. I give it this name because a group of objects that work closely together probably has a strong encapsulation boundary around it, like a component.
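Here is a rough sketch of what a component test can look like, written in Python for brevity. The Editor, Document, and RecordingListener classes are hypothetical stand-ins that I made up for this example. They are not Super Editor’s or Flutter’s real APIs. The test chains up the order of the messages between the objects, not any visual result.

```python
class RecordingListener:
    """Fake collaborator that records every message it receives, in order."""

    def __init__(self):
        self.messages = []

    def on_document_changed(self, text):
        self.messages.append(("document_changed", text))

    def on_selection_changed(self, offset):
        self.messages.append(("selection_changed", offset))


class Document:
    def __init__(self):
        self.text = ""


class Editor:
    """Hypothetical editor. Its protocol: mutate the document, report the
    document change, and only then report the new caret position."""

    def __init__(self, document, listener):
        self._document = document
        self._listener = listener

    def insert(self, text):
        self._document.text += text
        self._listener.on_document_changed(self._document.text)
        self._listener.on_selection_changed(len(self._document.text))


def test_insert_reports_document_change_before_selection_change():
    listener = RecordingListener()
    editor = Editor(Document(), listener)

    editor.insert("hi")

    # Chain up the exact sequence of messages, not just the end state.
    assert listener.messages == [
        ("document_changed", "hi"),
        ("selection_changed", 2),
    ]


if __name__ == "__main__":
    test_insert_reports_document_change_before_selection_change()
```

If a refactor swaps the two notifications, the final document text would be identical, but this test would still fail. That is the kind of architectural requirement that an interaction test is likely to miss.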

Lastly, we have the most granular test of all: the unit test. A unit test locks down the preconditions and postconditions for a single method on a single object. Unit tests often utilize fakes and mocks because the purpose of the test is to lock down the details of a single method. Your test doesn’t care if other objects do the right thing, because those behaviors are locked down in other tests. Unit tests are great for throwing every conceivable precondition at a specific method. Each unit test can configure different arguments. With fakes and mocks you can configure those arguments to represent any situation without any infrastructure setup. Furthermore, because you’re only running a single method, it’s quick and easy to check the return value and postconditions. The downside of unit tests is that their granular nature requires that you write many of them to cover the behaviors of even a small area of your code. It’s not uncommon to write dozens of unit tests in a single PR.
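As a small illustration, here is a sketch of a unit test that throws several preconditions at one method, again in Python for brevity. The Session class and the FakeClock are hypothetical names that I invented for the example. The fake lets the test dictate what “now” is, with no infrastructure at all.

```python
class FakeClock:
    """Fake that stands in for a real clock so the test controls 'now'."""

    def __init__(self, now):
        self._now = now

    def now(self):
        return self._now


class Session:
    """Hypothetical object with the single method under test."""

    def __init__(self, clock, expires_at):
        self._clock = clock
        self._expires_at = expires_at

    def is_expired(self):
        # Behavior being chained: expired once now reaches expires_at.
        return self._clock.now() >= self._expires_at


def test_is_expired_for_each_precondition():
    # Each row is (now, expires_at, expected result).
    cases = [
        (99, 100, False),  # just before expiry
        (100, 100, True),  # exactly at expiry
        (101, 100, True),  # just after expiry
        (0, 0, True),      # degenerate zero-length session
    ]
    for now, expires_at, expected in cases:
        session = Session(FakeClock(now), expires_at)
        assert session.is_expired() is expected, (now, expires_at)


if __name__ == "__main__":
    test_is_expired_for_each_precondition()
```

Notice how much ground four rows cover for one tiny method. Multiply that by every method worth chaining and you can see why unit tests pile up quickly.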

Summary

Tests aren’t quizzes, they’re chains. Locking down your code isn’t a bug, it’s a feature. The key to effective testing is to select the right things to chain down, and chain those details at the most appropriate level of abstraction. You can chain down your real app with end-to-end tests. You can chain down user interactions in a real app with integration tests. You can chain down user interactions more cheaply with interaction widget tests. You can chain down protocols with component tests. And you can chain down all conceivable inputs and outputs of a single method with unit tests.

Here at the Flutter Bounty Hunters we've published a few packages to make it easier to write certain types of tests. We've published flutter_test_robots to more quickly and easily simulate typical user behaviors. We've published flutter_test_runners to more quickly and easily configure widget tests for different platforms such as iOS vs Mac. And we've published golden_bricks to give you a more realistic font for golden tests, while still minimizing flakiness.

Effective testing will make you a much better software engineer. Now, go lock down some of your code, and y’all come back now, you hear?