Saturday 13 February 2016

Is test isolation always a good thing?

TDD is not only about unit tests. Steve Freeman and Nat Pryce argue in their book Growing Object-Oriented Software, Guided by Tests that a TDD cycle is kick-started with an end-to-end automated test. Such tests exercise the system as a black box, which brings difficulties and issues that well-written unit tests don’t have. One such area of difficulty is test fixture management, especially fixture teardown.

In his book xUnit Test Patterns, Gerard Meszaros provides an extensive discussion of different fixture management strategies and their implications. Using Meszaros’ terminology, many of the end-to-end tests I have seen use a combination of:

  • Persistent shared immutable fixture - this is the data that’s shared by many tests - usually some initial environment setup and plumbing, uploading licenses and so on. The “immutable” part of the fixture name means that no test is allowed to modify any data that’s part of this fixture. The reason is that each test relies on this data being intact. If any test modifies the fixture, then other tests’ assumptions are broken and we cannot expect them to work reliably.
  • Persistent fresh fixture - this is the data that’s needed for a single test. This may be, for example, adding a specific employee required by the test scenario to a database. As a database is persistent storage, this data must be cleaned up either at the end of the same test that created it or at the beginning of any new test (there are some ways to achieve that safely, but that’s a topic for another post). The cleanup is needed to avoid test interactions - many tests may add an employee with the same ID number but with other data being different, so we need to prepare a clean bed for each test so that leftover data from previous tests doesn’t get in the way (by the way, there are other ways of dealing with test interactions than cleaning up, although they’re quite difficult to apply).
  • Transient fresh fixture - this part of the data is created inside a test and removed without our intervention when the test finishes running, usually by the garbage collector or another memory management mechanism. The simplest example of such a fixture is an integer variable defined in a test method - we don’t have to worry about it impacting other tests, because its lifetime is automatically tied to the lifetime of the test. A minimal code sketch contrasting the three fixture types follows this list.
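
To make the distinction more concrete, here is a minimal JUnit 4 sketch of how the three fixture types might show up in a single test class. TestEnvironment and EmployeeDatabase are hypothetical helpers invented purely for illustration - the point is only where each kind of data lives and who has to clean it up.

import org.junit.After;
import org.junit.BeforeClass;
import org.junit.Test;
import static org.junit.Assert.assertTrue;

public class EmployeeReportTest {

    // Persistent shared immutable fixture: set up once for the whole suite
    // (environment plumbing, licenses) and never modified by any test.
    @BeforeClass
    public static void setUpSharedEnvironment() {
        TestEnvironment.installLicense("test-license.xml"); // hypothetical helper
    }

    @Test
    public void shouldShowAddedEmployeeOnReport() {
        // Persistent fresh fixture: data added to persistent storage just for
        // this test - it will outlive the test unless explicitly torn down.
        EmployeeDatabase.add("E-001", "Ken"); // hypothetical persistence API

        // Transient fresh fixture: lives only inside this method and is
        // reclaimed by the garbage collector - no cleanup needed.
        String expectedName = "Ken";

        assertTrue(EmployeeDatabase.report().contains(expectedName));
    }

    // Teardown of the persistent fresh fixture - the part that disappears
    // entirely when only transient fixtures are in play.
    @After
    public void removeEmployeeAddedByTest() {
        EmployeeDatabase.remove("E-001");
    }
}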

After reading xUnit Test Patterns, my conclusion was obvious - life would be beautiful if we could use only the transient fresh fixture. When writing unit tests for classes or small clusters of classes, this kind of fixture is the only one we ever need. Things change, however, when e.g. our tests require installing the application and a database - in such a situation, using only a transient fixture would probably mean that each test performs a fresh installation of the operating system. This would take ages, so we use the other fixture types as a compromise. We accept the problems they bring, including the interactions between tests (and thus instability) they can introduce, and trade these things for shorter test execution time.

I have come across an opinion that this is not actually bad. What’s more, the opinion goes, it can even be seen as an advantage: when a single instance of an application is exercised by many different tests, it more closely resembles the way a user exercises the deployed system, and by working longer with an existing instance of the system, the suite may uncover defects that would otherwise not become visible.

I have several issues with this view on fixture management. To explain why, let’s recap the differences between how a user exercises a system and how automated end-to-end tests do it.

What are the differences between usage by automated tests and usage by a user?

There are three pretty big differences that I can think of:

An automated test has a scope, i.e. it has a beginning

When defining a test, we usually define the context in which a test should be executed, a series of steps and an expected outcome. For example, we might say:

Given there are two employees: Ken and Tom //the context
When I choose to remove Ken's data         //step 1
And list all my employees                  //step 2
Then I should only see Tom's data          //expected outcome

Here, the explicitly stated context is that we have two employees - if there were only one, executing the two steps would most probably produce different results. Of course, there is also an implicit context that’s not written here, for example that the application is installed and running and probably that I’m logged into it. The test relies on both kinds of context being the same for every execution. In other words, to produce stable outcomes, a test requires its context to be defined. Users, on the other hand, usually work in whatever context they are currently given and are able to adapt to it. For example, I want to create an e-mail account with the address "buziaczek@mymail.com", but this address already exists. No problem, I’ll just create "buziaczek-1984@mymail.com" and I’m ready to go. Most of the time, an automated test doesn’t make such decisions (we can do something to bypass such situations, e.g. generate a portion of the e-mail address randomly, but note that the test is still not making a decision).
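
As a side note, that randomized-address workaround could look roughly like the sketch below (RegistrationService is a hypothetical API used only for illustration). The randomness is baked into the script up front - the test still doesn’t react to an "address already taken" response the way a user would.

import java.util.UUID;
import org.junit.Test;
import static org.junit.Assert.assertTrue;

public class AccountRegistrationTest {

    @Test
    public void shouldCreateAccountWithFreshAddress() {
        // The random part makes a collision with existing accounts unlikely,
        // but the sequence of steps is still fixed in advance.
        String address = "buziaczek-" + UUID.randomUUID() + "@mymail.com";

        boolean created = RegistrationService.register(address); // hypothetical API

        assertTrue(created);
    }
}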

Thus, although a user may happen to follow a strict sequence of steps in a strictly defined context, this is not what their work with the application is about. If it were, we could probably automate these sequences in the software itself.

Anyway, demanding a strictly defined context for each automated test means that this context usually has to be cleaned up after or before each test (Meszaros writes about other ways, like generating distinct test data for each test, but I haven’t seen them in practice all that often). This somewhat contradicts the wish to test “as the user would” by executing tests in a partially shared context (i.e. on the same running or installed application instance). The contradiction, for me, is that we seem to value test isolation - except that we don’t.

An automated test has a scope, i.e. it has an end

A user uses a product in a continuous manner, usually taking many, many steps during each session with an application. On the other hand, when reading about good practices for automated testing, one can notice that such tutorials hardly ever advise writing long automated scenarios. It is more common to see several test cases for several different behaviors, or variants of a single behavior, than a single test case exercising all of these conditions. So in automated tests we seem to deliberately run away from testing the system the way a user would. Why is that?

This is for several reasons. One of them is that, most of the time, a failed assertion ends an automated test, so the longer the test, the harder it is to get feedback from its later parts, because there are more things that can fail along the way. Two other reasons I’d like to talk about are splitting work into manageable chunks and defect localization.

Splitting into manageable chunks means that we prefer shorter scenarios to longer ones, because each scenario is easier to maintain and modify independently when it’s smaller - the context a longer scenario builds up becomes increasingly harder to grasp, understand and manage. Users, on the other hand, don’t need to manage scenarios - they dynamically choose their goals and steps based on the current situation.

Defect localization means that when a test fails, I need to search for the root cause of the failure only among what’s in the scope of this test. When the scope of a test is smaller, I have fewer places to search. This is also one of the reasons why we give tests descriptive names, i.e. we’d rather say “the user should be able to delete an employee” than “the user should be able to play with the system” - in the first case, a failure points us closer to the root cause. The usual thing I do when I see an automated test failing is to run it in isolation (i.e. not as part of a suite) and in a clean environment, because that gives me better feedback on whether the failure is caused by something within the test’s scope. So, the broader the scope of a test, the weaker the defect localization.
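
To illustrate, compare a narrowly scoped test whose name states its purpose with a broad “play with the system” scenario (EmployeeDirectory is, again, a hypothetical API). When the first one fails, the search for the root cause starts - and usually ends - at employee removal.

import org.junit.Test;
import static org.junit.Assert.assertFalse;

public class EmployeeRemovalTest {

    // Narrow scope, descriptive name: a failure here points at removal.
    @Test
    public void userShouldBeAbleToDeleteAnEmployee() {
        EmployeeDirectory directory = EmployeeDirectory.withEmployees("Ken", "Tom"); // hypothetical API

        directory.remove("Ken");

        assertFalse(directory.listAll().contains("Ken"));
    }

    // A broad "play with the system" scenario would add, edit, remove and
    // report in a single test - any of those steps could be the one that fails.
}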

In his wonderful book, Lean-Agile Acceptance Test-Driven Development, Ken Pugh outlines three conditions we strive to achieve for automated tests:

  1. A test should fail for a well-defined reason.
  2. No other test should fail for the same reason.
  3. A test should not fail for any other reason.

And, while Ken admits this is often very hard (or impossible) to achieve, this is the ideal we should be targeting. Managing a persistent fixture introduces more complexity and possible root causes of test failures, driving us away from these three conditions.

An automated test has a scope, i.e. it has a purpose

As I said above, a test should fail for a well-defined reason. This well-defined reason for failure is the purpose of the test. In other words, every test failure should relate to the test’s purpose, which is usually conveyed in its name, for example “An added employee should appear on the employee report”. Note that we very rarely see tests like “previous test runs should not have left the application in an illegal or crashing state”, because tests aren’t usually written to check whether an application crashes when we do random things with it. Sharing parts of the fixture between tests doesn’t help them achieve their purpose.

I also said that it is probably impossible to get rid of all the things that are beyond the purpose of a test but can still impact its outcome. For example, if a test loads anything from a disk, it can fail because of an (unlikely, but possible) disk driver failure. Such things are outside the scope of the test, but still influence its result.

The point is, the more things outside the scope of an automated test can affect its outcome, the less competent the test becomes at testing what’s inside its scope. Do we want tests that are incompetent at fulfilling their purpose? I hope not. Thus, my preference is to maximize the competence of a test by reducing to a minimum the other things that can impact its outcome. This is how I believe I get the most value out of automated tests, and it is in line with Meszaros’ write-up on the goals of test automation, where he stresses ease of maintenance as one of those goals.

Automated test teardown is often very error-prone and adds a lot of instability to the test suite, making it less competent at the thing it was written for. For test-execution-time reasons, I’m afraid we’ll have to live with sharing some parts of persistent fixtures for some time; however, if I could get rid of the necessity of teardowns in tests without sacrificing their run time, I’d do so in the blink of an eye.
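
As one illustration of why teardown is so error-prone, consider cleanup placed at the end of a test body: it is silently skipped whenever an earlier assertion or exception aborts the test, and the leftover data then haunts every test that runs afterwards. A minimal sketch of the problem, again with a hypothetical EmployeeDatabase API:

import org.junit.After;
import org.junit.Test;
import static org.junit.Assert.assertTrue;

public class FragileTeardownExample {

    @Test
    public void cleanupAtTheEndOfTheTestBodyIsFragile() {
        EmployeeDatabase.add("E-001", "Ken"); // hypothetical persistence API

        assertTrue(EmployeeDatabase.report().contains("Ken"));

        // If the assertion above fails, this line never runs and "E-001"
        // leaks into every test executed after this one.
        EmployeeDatabase.remove("E-001");
    }

    // An @After (or try/finally) block is safer, but it still has to know
    // about every piece of persistent data the test might have created,
    // and it still runs against a shared, persistent store.
    @After
    public void safetyNet() {
        EmployeeDatabase.removeIfPresent("E-001"); // hypothetical helper
    }
}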

Summary

My opinion on test isolation is: the more, the better. For me, saying that a partial lack of isolation is better (because it would uncover some defects that otherwise only a user could) is taking the biggest disadvantage of automated black-box tests and pretending it’s an advantage - it relies on an accidental, potential effect of that disadvantage. Creating a test approach deliberately targeted at uncovering such defects (i.e. making a conscious effort to uncover them instead of hoping they will fall out of other tests) would, in my opinion, give a better return on investment.