Programming is writing code to solve problems. Software Engineering is the practice of using a structured process to solve problems. As engineers, we want to have a codebase we can change, extend, and refactor as required. Tests ensure our program works as intended and that changes to the codebase do not break existing functionality.

At my last job, I worked with a Senior Engineer to build out a microservices-based backend to replace our existing Django monolith. It was a greenfield project and we were encouraged to try new things. I was reading Python Testing with pytest and convinced the Senior Engineer to let me bring pytest into our project. This was fortuitous as it forced me to take the lead in writing the initial set of tests we used as a template for all of our services.

This experience reinforced the principles highlighted in The Pragmatic Programmer. It's about being pragmatic in what we test, how we test, and when we test; we should leverage tools and techniques that allow us to test our code as efficiently as possible. Testing needs to be easy and free of barriers; once testing feels like a chore, programmers won't do it... and this is how software quality slips.

We dread going into the code because either there are no tests or the tests that exist are so brittle that we're forced to rewrite tests as we write code. This is not what Software Engineering is about. Tests should enable refactoring, not hamper our ability to make changes to the codebase. We should spend our time writing business logic, not wrestling with tests.

Testing is folklore in the sense that best practices and techniques are passed down from programmer to programmer while working on projects as part of a team. If you are new to the industry and are trying to grok testing, it's hard to figure out how to get started. It feels like there is a lot of conflicting advice out there, and that's because there is. Testing is opinionated, more so than any other software engineering discipline. Folks are always arguing about what to test, how to test, and especially when to test.

This is the first in a series of posts that details my thought process for how I go about adding tests to a codebase. In this post, I provide a broad introduction to the world of testing so we can have a common vocabulary for future posts.


What is Testing

When we write code, we need to run it to ensure that it is doing what we expect it to. Tests are a contract with our code: given a value, we expect a certain result to be returned.

Running tests can be thought of as a feedback mechanism that informs us whether our program works as intended.

While passing tests cannot prove the absence of bugs, they do inform us that our code is working in the manner defined by the test. In contrast, a failing test indicates that something is not right. We need to understand why our test failed so we can modify code and/or tests, as required.

Properties of Tests

1. Fast

Tests give us confidence that our code is working as intended. A slower feedback loop hampers development as it takes us longer to find out if our change was correct. If our workflow is plagued by slow tests, we won't be running them as often. This will lead to problems down the line.

2. Deterministic

Tests should be deterministic, i.e. the same input will always result in the same output. If tests are non-deterministic, we have to find a way to account for random behavior inside of our tests.

While there is definitely non-deterministic code in production (e.g. Machine Learning and AI), we should try to make all our non-probabilistic code as deterministic as possible. There is no point in doing additional work unless our program requires it.
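One common way to keep randomness out of a test is to let the caller supply the random number generator. The sketch below assumes a hypothetical `pick_sample` function (not from the original project) and shows that a seeded generator gives the test the same result on every run:

```python
import random


def pick_sample(items, k, rng=None):
    """Return k items chosen at random; tests can inject a seeded rng."""
    if rng is None:
        rng = random.Random()  # production default: unseeded
    return rng.sample(items, k)


def test_pick_sample_is_repeatable():
    words = ["foo", "bar", "bat", "baz"]
    # Two generators seeded identically produce the same sequence,
    # so the "random" choice is the same on every run.
    first = pick_sample(words, 2, rng=random.Random(42))
    second = pick_sample(words, 2, rng=random.Random(42))
    assert first == second
    assert len(first) == 2
    assert set(first) <= set(words)
```

The test never asserts which words were picked, only that the result is repeatable and valid, so it does not depend on the internals of the generator.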

3. Automated

We can confirm our program works by running it. This could be manually running a command in the REPL or refreshing a webpage; in both cases, we are looking to see if our program does what it is supposed to do. While manual testing is fine for small projects, it becomes unmanageable as our project grows in complexity.

By automating our test suite, we can quickly verify our program works on-demand. Some developers even have their tests triggered to run on file save.

Formal Definition

Let's go over some definitions so we have a common vocabulary going forward.

A System Under Test (SUT) is the entity that is currently being tested. This could be a line of code, a method, or an entire program.

Acceptance Criteria refers to the check we perform that allows us to accept output from the system. The specificity and range of acceptance criteria depend on what we are testing: medical devices and aerospace systems require tests to be specific as there is a lot less room for error.

If Amazon makes a bad recommendation, it's not the end of the world. If IBM's Watson suggests the wrong surgery, it can be life threatening.

Testing refers to the process of entering Inputs into our System Under Test and validating Outputs against our Acceptance Criteria:

If output is okay, our test passes.

If output is not okay, our test fails and we have to debug.

Hopefully the test failure provides enough contextual information for us to find out where to look.
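The definitions above can be expressed in a few lines of code. This is only an illustrative sketch: `run_test` and `is_ascending` are made-up names, the SUT is Python's built-in `sorted`, and the acceptance criterion is "the output is in ascending order".

```python
def run_test(sut, inputs, accept):
    """Feed inputs into the system under test and validate its output."""
    output = sut(*inputs)
    return accept(output)


def is_ascending(values):
    """Acceptance criterion: every element is <= the one after it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# SUT: the built-in sorted function. Input: one unsorted list.
passed = run_test(sorted, ([3, 1, 2],), is_ascending)

# A SUT that returns its input unchanged fails the same criterion.
failed = run_test(lambda xs: xs, ([3, 1, 2],), is_ascending)
```

Real test frameworks add reporting and context around this loop, but the shape stays the same: inputs go in, outputs come out, and a check decides pass or fail.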

Benefits of Testing

A well-thought-out testing strategy paired with thorough test cases provides the following benefits:

Modify Code with Confidence

If a program does anything of interest, it has interactions between functions, classes, and modules. This means a single line change can break our program in unexpected ways. Tests give us confidence in our code. By running our tests after we modify our code, we can confirm our changes did not break existing functionality as defined by our tests.

In contrast, modifying a code base without tests is a challenge. There is no way of knowing if things are working as intended. We are programming by the seat of our pants, which is quite a risky proposition.

Identify Bugs Early

Bugs cost money. How much depends on when you find them.

Fixing bugs gets more expensive the further you are in the Software Development Life Cycle (SDLC). True Cost of a Software Bug digs into this issue.

Improve System Design

This one is a bit controversial, but I think writing code with tests in mind improves system design. A thorough test suite shows that the developer has actually thought about the problem in some depth. Writing tests forces you to use your own API; this hopefully results in a better interface.

All projects have time constraints and it's quite easy to get into the habit of taking shortcuts that increase coupling between modules, leading to complex interdependencies. We have to be wary of solving problems with spaghetti code.

Knowing we have to test our code forces us to write modular code. If something is clunky to test, there might be a better interface we can implement. Taking the time to write tests forces mindfulness upon us; we take a deep breath before looking at the problem from the perspective of a user.

Once you write testable code by using patterns like dependency injection, you'll see how adding structure makes it easier to verify our code is doing what we expect it to.
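As a rough sketch of what dependency injection can look like, the function below receives the thing it depends on (a way to fetch text for a URL) as a parameter instead of reaching out to the network itself. The names `top_word_for_page`, `fetch_text`, and `fake_fetch` are hypothetical and only serve the illustration.

```python
from collections import Counter


def top_word_for_page(url, fetch_text):
    """Return the most common word on a page as (word, occurrences).

    fetch_text is injected: any callable that takes a URL and returns text.
    """
    words = fetch_text(url).lower().split()
    return Counter(words).most_common(1)[0]


def test_top_word_for_page():
    # A hand-written fake stands in for the real network call.
    def fake_fetch(url):
        return "Python is fun Python is fast Python wins"

    result = top_word_for_page("http://example.test", fake_fetch)
    assert result == ("python", 3)
```

In production we would pass in a function that performs the real HTTP request; in tests we pass in a fake, which keeps the test fast and deterministic.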

Black Box vs White Box

Tests can be classified into two broad categories: black box testing and white box testing.

Black Box Testing refers to testing techniques in which the tester cannot see the inner workings of the item being tested.

White Box Testing is the technique in which the tester can see the inner workings of the item being tested.

As developers, we perform white box testing. We wrote the code inside of the box and know how to test it thoroughly. This is not to say that there is no need for black box testing; we should still have somebody perform testing at a higher level, since proximity to the code can lead to blind spots in our tests.

Test Pyramid

The Automated Test Pyramid provides guidance on how to structure our testing strategy. It says we should write lots of fast and cheap unit tests and a small number of slow and expensive end-to-end tests.

The Test Pyramid is not a hard and fast rule, but it provides a good place to start thinking about a testing strategy. A good rule of thumb is to write as many tests at each level as you need to have confidence in your system. We should be writing tests as we write code, iterating towards a testing strategy that works for the project we are working on.

Unit Tests

Unit tests are low-level tests that focus on testing a specific part of our system. They are cheap to write and fast to run. Test failures should provide enough contextual information to pinpoint the source of the error. These tests are typically written by developers during the Implementation phase of the Software Development Life Cycle (SDLC).

Unit tests should be independent and isolated; interacting with external components increases both the scope of our tests and the time it takes for tests to run. As we will see in a future post, replacing dependencies with test doubles results in deterministic tests that are quick to run.

How big should our unit test be? Like everything else in programming, it depends on what we are trying to do. Thinking in terms of a unit of behavior allows us to write tests around logical blocks of code.

The Test Pyramid recommends having a lot of unit tests in our test suite. These tests give us confidence that our program works as expected. Writing new code or modifying existing code might require us to rewrite some of our tests. This is standard practice; our test suite grows with our code base.

Try to be cognizant of our test suite growing in complexity. Remember, code that tests our production code is also production code. Take the time to refactor your tests to ensure they are efficient and effective.

Unit Test Example

Suppose we have the following function that takes a list of words and returns the most common word and the number of occurrences of that word:

    from collections import Counter


    def find_top_word(words):
        # Return the most common word and its number of occurrences
        word_counter = Counter(words)
        return word_counter.most_common(1)[0]

We can test this function by creating a list, running the find_top_word function over that list and comparing the results of the function to the value we expect:

    def test_find_top_word():
        words = ["foo", "bar", "bat", "baz", "foo", "baz", "foo"]
        result = find_top_word(words)
        assert result[0] == "foo"
        assert result[1] == 3

If we ever want to change the implementation of find_top_word, we can do so without fear. Our test ensures that the functionality of find_top_word cannot change without causing a test failure.

Integration Tests

Every complex application has internal and external components working together to do something interesting. In contrast to unit tests, which focus on individual components, integration tests combine various parts of the system and test them together as a group. Integration testing can also refer to testing at service boundaries of our application, i.e. when it goes out to the database, file system, or external API.

These tests are typically written by developers, but they don't have to be. By definition, integration tests are larger in scope and take longer to run than unit tests. This means that test failures require some investigation: we know that one of the components in our test is not working, but the failure's exact location needs to be found. This is in contrast to unit tests which are smaller in scope and indicate exactly where things have failed.

We should try to run integration tests in a production-like environment; this minimizes the chance that tests fail due to differences in configuration.

Integration Test Example

Suppose we have the following function that takes in a URL and a tuple of (word, occurrences). Our function creates a record and saves it to the database:

    def save_to_db(url, top_word):
        record = TopWord()
        record.url = url
        record.word = top_word[0]
        record.num_occurrences = top_word[1]
        db.session.add(record)
        db.session.commit()
        return record

We test this function by passing in known information; the function should save the information we entered into the database. Our test code pulls the newly saved record from the database and confirms its fields match the input we passed in.

    def test_save_to_db():
        url = "http://test_url.com"
        most_common_word_details = ("Python", 42)
        word = save_to_db(url, most_common_word_details)
        inserted_record = TopWord.query.get(word.id)
        assert inserted_record.url == "http://test_url.com"
        assert inserted_record.word == "Python"
        assert inserted_record.num_occurrences == 42

Notice how this is the kind of testing we do manually to confirm things are working as expected. Automating this test saves us from having to repeatedly check this functionality each time we make a change to the code.
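The example above relies on a model and a database session that live elsewhere in the application. For readers who want something they can run as-is, here is a self-contained sketch of the same idea that swaps the ORM for the standard library's `sqlite3` module and an in-memory database. The table name `top_words` and the function `save_top_word` are assumptions made for this illustration, not part of the original project.

```python
import sqlite3


def save_top_word(conn, url, top_word):
    """Insert a (word, occurrences) tuple for a URL; return the new row id."""
    cursor = conn.execute(
        "INSERT INTO top_words (url, word, num_occurrences) VALUES (?, ?, ?)",
        (url, top_word[0], top_word[1]),
    )
    conn.commit()
    return cursor.lastrowid


def test_save_top_word():
    # Arrange: a throwaway in-memory database with the table we need.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE top_words ("
        "id INTEGER PRIMARY KEY, url TEXT, word TEXT, num_occurrences INTEGER)"
    )
    # Act
    row_id = save_top_word(conn, "http://test_url.com", ("Python", 42))
    # Assert: read the row back from the database itself.
    row = conn.execute(
        "SELECT url, word, num_occurrences FROM top_words WHERE id = ?",
        (row_id,),
    ).fetchone()
    assert row == ("http://test_url.com", "Python", 42)
    conn.close()
```

An in-memory database keeps the test quick, but it is less production-like than a real database server, so it trades some fidelity for speed.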

End-to-End

End-to-end tests check to see if the system meets our defined business requirements. A common test is to trace a path through the system in the same manner a user would experience. For example, we can test a new user workflow: simulate creating an account, "clicking" the link in the activation email, logging in for the first time, and interacting with our web application's tutorial modal pop-up.

We can conduct end-to-end tests through our user interface (UI) by leveraging a browser automation tool like Selenium. This creates a dependency between our UI and our tests, which makes our tests brittle: a change to the front-end requires us to change tests. This is not sustainable as either our front-end will become static or our tests will not be run.

A better solution is to test the subcutaneous layer, i.e. the layer just below our user interface. For a web application, this would be testing the REST API, both sending in JSON and getting JSON out.

Our subcutaneous tests are our contracts with our front-end; they can be used by our front-end developers as a specification of the REST API. Tools like swagger-meqa that are built on top of the OpenAPI Specification can help us automate this process. We could also use full-featured tools like Postman to test, debug, and validate our API.

End-to-end tests are considered black box as we do not need to know anything about the implementation in order to conduct testing. This also means that test failures provide no indication of what went wrong; we would need to use logs to help us trace the error and diagnose system failure.

End-to-End Test Example

Here we are using the Flask test client to run subcutaneous testing on our REST API. There are a lot of things happening behind the scenes and the result we get back (HTTP status code) lets us know that the test either passed or failed.

    from http import HTTPStatus


    def test_end_to_end():
        client = app.test_client()
        body = {"url": "https://www.python.org"}
        response = client.post("/top-word", json=body)
        assert response.status_code == HTTPStatus.OK


Structuring Tests

Each test case can be separated into the following phases:

setting up the system under test (SUT) to the environment required by the test case (pre-conditions)

performing the action we want to test on SUT

verifying if the expected outcome occurred (post-conditions)

tearing down SUT and putting the environment back to the state we found it in
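The four phases map directly onto the hooks most test frameworks provide. The sketch below uses the standard library's `unittest` so it runs without extra dependencies; `count_lines` is a hypothetical function invented for the example.

```python
import os
import tempfile
import unittest


def count_lines(path):
    """Return the number of lines in a text file."""
    with open(path) as handle:
        return sum(1 for _ in handle)


class CountLinesTest(unittest.TestCase):
    def setUp(self):
        # Phase 1, setup: create the file the test case depends on.
        fd, self.path = tempfile.mkstemp(suffix=".txt")
        with os.fdopen(fd, "w") as handle:
            handle.write("foo\nbar\nbaz\n")

    def tearDown(self):
        # Phase 4, teardown: leave the file system as we found it.
        os.remove(self.path)

    def test_count_lines(self):
        result = count_lines(self.path)  # Phase 2: perform the action
        self.assertEqual(result, 3)      # Phase 3: verify the outcome
```

`tearDown` runs whether the test passes or fails, so the temporary file does not leak into later tests.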

There are two widely used frameworks for structuring tests: Arrange-Act-Assert and Given-When-Then.

Arrange-Act-Assert (AAA)

The AAA pattern is an abstraction for separating the different parts of our tests:

Arrange all necessary pre-conditions

Act on the SUT

Assert that our post-conditions are met

Arrange-Act-Assert Example

    def test_find_top_word():
        # Arrange
        words = ["foo", "bar", "bat", "baz", "foo", "baz", "foo"]

        # Act
        result = find_top_word(words)

        # Assert
        assert result[0] == "foo"
        assert result[1] == 3

The clear separation between the phases allows us to see if our test method is trying to test too many different things at once. Arrange-Act-Assert is the pattern I use when writing tests.

Given-When-Then (GWT)

GWT provides a useful abstraction for separating the different phases of our test:

Given a set of pre-conditions

When we perform an action on the SUT

Then our post-conditions should be as follows

GWT is widely used in Behavior Driven Development (BDD).

Given-When-Then Example

    def test_find_top_word():
        # Given a list of words
        words = ["foo", "bar", "bat", "baz", "foo", "baz", "foo"]

        # When we run the function over the list
        result = find_top_word(words)

        # Then we should see `foo` occurring 3 times
        assert result[0] == "foo"
        assert result[1] == 3


What to Test

In order to prove that our program is correct, we have to test it against every conceivable combination of input values. This type of exhaustive testing is not practical, so we need to employ testing strategies that allow us to select test cases where errors are most likely to occur.

Seasoned developers can balance writing code to solve business problems with writing tests to ensure correctness and prevent regression. Finding this balance and knowing what to test can feel more like an art than a science. Fortunately, there are a few rules of thumb we can follow to make sure our testing is thorough.

Functional Requirements

We want to make sure that all relevant requirements have been implemented. Our test cases should be detailed enough to check business requirements. There is no point building something if it doesn't meet the criteria you set forth.

Basis Path Testing

We have to test each statement at least once. If the statement has a conditional ( if or while ), we have to vary our testing to make sure we test all branches of the conditional. For example, if we have the following code:

    if x < 18:
        # statement1
    elif 18 <= x <= 35:
        # statement2
    else:
        # statement3

To make sure we hit all branches of the above conditional, we need to write the following tests:

x < 18

18 <= x <= 35

x > 35
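As a concrete sketch, here is a function with the same three branches and one test input per branch. The function name `age_group` and the labels it returns are invented for illustration.

```python
def age_group(x):
    """Classify an age using the three ranges listed above."""
    if x < 18:
        return "minor"
    elif 18 <= x <= 35:
        return "young adult"
    else:
        return "adult"


def test_age_group_covers_every_branch():
    assert age_group(10) == "minor"        # x < 18
    assert age_group(25) == "young adult"  # 18 <= x <= 35
    assert age_group(50) == "adult"        # x > 35
```

Each assertion forces execution down a different branch, so every statement in the function runs at least once.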

Equivalence Partitioning

Two test cases that result in the same output are said to be equivalent. We only require one of the test cases in order to cover that class of errors.
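To make this concrete, the sketch below groups HTTP status codes into partitions and tests a single representative from each one. The function `status_class` is hypothetical; the point is that 404 stands in for every code from 400 to 499, so testing 401, 403, and 404 separately would add little.

```python
def status_class(code):
    """Map an HTTP status code to a coarse category."""
    if 200 <= code < 300:
        return "success"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


def test_status_class_one_case_per_partition():
    # One representative input per equivalence class.
    representatives = {
        204: "success",
        404: "client error",
        503: "server error",
        302: "other",
    }
    for code, expected in representatives.items():
        assert status_class(code) == expected
```

Equivalence partitioning tells us how many cases we need; it pairs naturally with boundary analysis, which tells us where inside each partition to look.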

Boundary Analysis

"There are 2 hard problems in Computer Science: cache invalidation, naming things, and off-by-1 errors."

This is one of the oldest jokes in programming, but there is a lot of truth behind it: we often confuse whether we need a < or a <= . This is why we should always test the boundary conditions. Given the following example:

if x > 18 : # statement1 else : # statement2

To ensure we thoroughly test the boundary conditions of the code snippet above, we would need to have test cases for x=17 , x=18 , and x=19 . Be aware that writing test cases becomes more complicated if our boundary has compound conditionals.
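Written out as a test, the three boundary cases look like the sketch below. The function `is_over_18` is a hypothetical wrapper around the `x > 18` condition from the snippet.

```python
def is_over_18(x):
    """Return True only for values strictly greater than 18."""
    return x > 18


def test_boundary_around_18():
    assert is_over_18(17) is False  # just below the boundary
    assert is_over_18(18) is False  # on the boundary; True if we had written >=
    assert is_over_18(19) is True   # just above the boundary
```

The middle case is the one that catches an off-by-one mistake: if the comparison were accidentally changed to `>=`, only the `x=18` assertion would fail.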

This is a great guide on testing boundary conditions.

Classes of Bad Data

This refers to any of the following cases:

Too little data (or no data)

Too much data

Invalid data

Wrong size of data

Uninitialized data
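A few of these classes can be exercised against the find_top_word function from earlier in the post. The sketch below repeats the function so it is self-contained; the specific bad inputs are my own picks and show how the function behaves as written, not how it ought to behave.

```python
from collections import Counter


def find_top_word(words):
    word_counter = Counter(words)
    return word_counter.most_common(1)[0]


def test_find_top_word_with_bad_data():
    # No data: most_common(1) returns an empty list, so indexing fails.
    try:
        find_top_word([])
    except IndexError:
        pass
    else:
        raise AssertionError("expected IndexError for an empty list")

    # Invalid data: unhashable items (lists) cannot be counted.
    try:
        find_top_word([["foo"], ["foo"]])
    except TypeError:
        pass
    else:
        raise AssertionError("expected TypeError for unhashable items")

    # Too much data: a large input should still give the right answer.
    assert find_top_word(["foo"] * 100_000 + ["bar"]) == ("foo", 100_000)
```

Tests like these often surface design questions, such as whether an empty list should raise an error or return a sentinel value.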

Data Flow Testing

Data Flow testing traces the control flow of the program while exploring the sequence of events related to the status of data objects. For example, we get an error if we try to access a variable that has been deleted. We can use Data Flow testing to come up with additional test cases for variables that have not been tested by other tests.
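A small sketch of the idea: in the hypothetical function below, `label` is defined on two paths and used on all three. Following each definition to its use gives two test cases, and asking "is there a path where the variable is used but never defined?" gives the third.

```python
def describe(count):
    if count > 0:
        label = "positive"
    elif count < 0:
        label = "negative"
    # Bug: when count == 0, label is used without ever being defined.
    return label


def test_describe_follows_every_path_to_label():
    assert describe(5) == "positive"
    assert describe(-5) == "negative"
    # The path with no definition of label raises UnboundLocalError.
    try:
        describe(0)
    except UnboundLocalError:
        pass
    else:
        raise AssertionError("expected UnboundLocalError when count is 0")
```

Here the third case documents the bug rather than fixing it; in real code we would add an else branch and update the test accordingly.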

Error Guessing

Past experience provides insights into parts of our code base that can lead to errors. Keeping a record of previous errors can improve the likelihood that you will not make that same mistake again in the future.

Recap

Figuring out what to test and doing it efficiently is what I mean when I say Art of Developer Testing. The only way to get better at testing is by writing tests, coming up with better testing strategies, and learning about different testing techniques. Just like in software development, the more you know about something, the better you will become at it.

When to Write Tests

While there is a lot of interesting discussion about when to write tests, I feel it takes away from the point of testing. It doesn't matter when you write tests; it just matters that you write tests.

If you are interested in exploring this topic, I recommend the following links:

Conclusion

In this post, we got a broad introduction to the world of testing. Now that we are all on the same page, we can explore testing in more depth in future posts.
