Avoid and test for race conditions, deadlocks, timing issues, memory corruption, uninitialized memory access, memory leaks, and resource issues

Guidelines for development:

Simplify your synchronization logic. If it’s too hard to understand, it will be difficult to reproduce and debug complex concurrency problems.

Always obtain locks in the same order. This is a tried-and-true guideline to avoid deadlocks, but I still see code that breaks it periodically. Define an order for obtaining multiple locks and never change that order.

Don’t optimize by creating many fine-grained locks, unless you have verified that they are needed. Extra locks increase concurrency complexity.

Avoid shared memory, unless you truly need it. Shared memory access is very easy to get wrong, and the bugs may be quite difficult to reproduce.
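To make the lock-ordering guideline concrete, here is a minimal sketch. The `Account` type and `Transfer` function are hypothetical names invented for illustration; the only point is that both locks are always taken in ascending `id` order, whichever direction the transfer goes.

```cpp
#include <mutex>

// Hypothetical type for illustration: each account has a unique id that
// doubles as its position in the global lock order.
struct Account {
  int id;
  long balance;
  std::mutex mu;
};

// Moves `amount` between two accounts. The lower-id account is always locked
// first, so two concurrent transfers in opposite directions cannot deadlock.
void Transfer(Account& from, Account& to, long amount) {
  if (&from == &to) return;  // Locking the same mutex twice is undefined behavior.
  Account& first = (from.id < to.id) ? from : to;
  Account& second = (from.id < to.id) ? to : from;
  std::lock_guard<std::mutex> lock_first(first.mu);
  std::lock_guard<std::mutex> lock_second(second.mu);
  from.balance -= amount;
  to.balance += amount;
}
```

With this in place, `Transfer(a, b, 10)` and `Transfer(b, a, 10)` running on different threads both lock the lower-id account first. The sketch assumes ids are unique and never change.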

Guidelines for testing:

Stress test your system regularly. You don't want to be surprised by unexpected failures when your system is under heavy load.

Test timeouts. Create tests that mock/fake dependencies to test timeout code. If your timeout code does something bad, it may cause a bug that only occurs under certain system conditions.

Test with debug and optimized builds. You may find that a well-behaved debug build works fine, but the system fails in strange ways once optimized.

Test under constrained resources. Try reducing the number of data centers, machines, processes, threads, available disk space, or available memory. Also try simulating reduced network bandwidth.

Test for longevity. Some bugs require a long period of time to reveal themselves. For example, persistent data may become corrupt over time.

Use dynamic analysis tools like memory debuggers, ASan, TSan, and MSan regularly. They can help identify many categories of unreproducible memory/threading issues.
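As one possible shape for the timeout guideline above, the sketch below fakes a dependency that hangs and checks that the timeout path is taken. `FetchWithTimeout` and `HungDependencyTimesOut` are hypothetical names, and the 20 ms timeout is an arbitrary value chosen to keep the test fast.

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <stdexcept>
#include <string>
#include <thread>
#include <utility>

// Hypothetical code under test: runs `fetch` on a helper thread and throws
// std::runtime_error if no result arrives within `timeout`.
std::string FetchWithTimeout(std::function<std::string()> fetch,
                             std::chrono::milliseconds timeout) {
  std::packaged_task<std::string()> task(std::move(fetch));
  std::future<std::string> result = task.get_future();
  std::thread(std::move(task)).detach();
  if (result.wait_for(timeout) != std::future_status::ready) {
    throw std::runtime_error("fetch timed out");
  }
  return result.get();
}

// Test helper: returns true if a dependency that hangs triggers the timeout path.
bool HungDependencyTimesOut() {
  std::promise<void> release;
  std::shared_future<void> released = release.get_future().share();
  // Fake dependency: blocks until the test releases it, like a stalled RPC.
  auto hung_fetch = [released]() -> std::string {
    released.wait();
    return "late";
  };
  bool timed_out = false;
  try {
    FetchWithTimeout(hung_fetch, std::chrono::milliseconds(20));
  } catch (const std::runtime_error&) {
    timed_out = true;
  }
  release.set_value();  // Let the helper thread finish.
  return timed_out;
}
```

Because the fake blocks until it is released rather than sleeping for a fixed time, the test does not depend on how slow the machine is.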

Enforce preconditions

void ScheduleEvent(int timeDurationMilliseconds) {
  if (timeDurationMilliseconds <= 0) {
    timeDurationMilliseconds = 1;
  }
  ...
}

Guidelines for development:

Enforce preconditions in your functions unless you have a good reason not to.
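The example above adjusts a bad duration to 1 millisecond. Another option, sketched here on the assumption that callers can handle an exception, is to reject the bad value so the caller's bug surfaces immediately. `ValidateDurationMilliseconds` is a hypothetical helper, not part of the original example.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the duration unchanged if it satisfies the
// precondition, and throws std::invalid_argument otherwise.
int ValidateDurationMilliseconds(int timeDurationMilliseconds) {
  if (timeDurationMilliseconds <= 0) {
    throw std::invalid_argument(
        "timeDurationMilliseconds must be positive, got " +
        std::to_string(timeDurationMilliseconds));
  }
  return timeDurationMilliseconds;
}
```

Which approach fits depends on the system: adjusting the value keeps things running, while rejecting it makes the violation visible in logs and tests.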

Use defensive programming

double GetMonthlyLoanPayment() {
  double rate = GetTodaysInterestRateFromExternalSystem();
  if (rate < 0.001 || rate > 0.5) {
    throw BadInterestRate(rate);
  }
  ...
}

Guidelines for development:

When possible, use defensive programming to verify the work of your dependencies with known risks of failure like user-provided data, I/O operations, and RPC calls.

Guidelines for testing:

Use fuzz testing to test your system’s hardiness against bad data.
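For a rough idea of what a fuzz test looks like, here is a minimal random-input sketch. `ParsePercent` is a hypothetical function under test and `FuzzParsePercent` is a hypothetical driver. A coverage-guided fuzzing tool would explore inputs far more effectively than this simple loop.

```cpp
#include <cstddef>
#include <random>
#include <stdexcept>
#include <string>

// Hypothetical function under test: parses "0".."100" and throws
// std::invalid_argument on anything else.
int ParsePercent(const std::string& text) {
  if (text.empty() || text.size() > 3) throw std::invalid_argument("bad length");
  int value = 0;
  for (char c : text) {
    if (c < '0' || c > '9') throw std::invalid_argument("not a digit");
    value = value * 10 + (c - '0');
  }
  if (value > 100) throw std::invalid_argument("out of range");
  return value;
}

// Feeds `iterations` random byte strings to ParsePercent. Returns false if any
// input produces an out-of-range result. Any exception other than
// std::invalid_argument propagates to the caller and fails the run.
bool FuzzParsePercent(unsigned seed, int iterations) {
  std::mt19937 rng(seed);
  std::uniform_int_distribution<int> length(0, 8);
  std::uniform_int_distribution<int> byte(0, 255);
  for (int i = 0; i < iterations; ++i) {
    std::string input(static_cast<std::size_t>(length(rng)), '\0');
    for (char& c : input) c = static_cast<char>(byte(rng));
    try {
      int value = ParsePercent(input);
      if (value < 0 || value > 100) return false;
    } catch (const std::invalid_argument&) {
      // Rejecting bad data cleanly is the expected outcome.
    }
  }
  return true;
}
```

Passing a fixed seed keeps a failing run reproducible, which matters when the goal is to avoid unreproducible bugs.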

Don’t hide all errors from the user

Guidelines for development:

Only hide errors from the user when you are certain that there is no impact to system state or the user.

Any error with impact to the user should be reported to the user with instructions for how to proceed. The information shown to the user, combined with data available to an engineer, should be enough to determine what went wrong.

Test error handling

Guidelines for testing:

Always test your error handling code. This is usually best accomplished by mocking or faking the component triggering the error.

It’s also a good practice to examine your log quality for all types of error handling.
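Here is a small sketch of testing an error path with a fake. The `Storage` interface, the `FailingStorage` fake, and `SaveReport` are all hypothetical; the idea is that the fake forces the error, and the test then checks both the return value and the log line.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical dependency interface.
class Storage {
 public:
  virtual ~Storage() = default;
  virtual void Write(const std::string& data) = 0;
};

// Fake that always fails, used to drive the error path in tests.
class FailingStorage : public Storage {
 public:
  void Write(const std::string&) override {
    throw std::runtime_error("disk full");
  }
};

// Hypothetical code under test: reports failure and records a log line.
bool SaveReport(Storage& storage, const std::string& report,
                std::vector<std::string>& log) {
  try {
    storage.Write(report);
    return true;
  } catch (const std::exception& e) {
    log.push_back(std::string("SaveReport failed: ") + e.what());
    return false;
  }
}
```

A test can pass a `FailingStorage`, assert that `SaveReport` returns `false`, and assert that the log contains enough detail (here, the underlying "disk full" message) to diagnose the failure.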

Check for duplicate keys

Guidelines for development:

Try to guarantee uniqueness of all keys.

When it is not possible to guarantee unique keys, check whether a newly generated key is already in use before using it.

Watch out for race conditions between checking a key and using it: unless that step is synchronized, two threads can both see the same key as unused.
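One way to combine the duplicate check with synchronization is to make "check and reserve" a single operation under a lock. The `KeyRegistry` class below is a hypothetical sketch, and the `max_attempts` limit is an assumption added so the retry loop cannot spin forever.

```cpp
#include <functional>
#include <mutex>
#include <stdexcept>
#include <string>
#include <unordered_set>

// Hypothetical registry of keys that are currently in use.
class KeyRegistry {
 public:
  // Checks and records `key` in one step under the lock. Returns false if the
  // key was already taken.
  bool TryReserve(const std::string& key) {
    std::lock_guard<std::mutex> lock(mu_);
    return keys_.insert(key).second;
  }

  // Calls `generate` until it yields an unused key, up to `max_attempts` times.
  std::string ReserveNew(const std::function<std::string()>& generate,
                         int max_attempts = 100) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
      std::string key = generate();
      if (TryReserve(key)) return key;
    }
    throw std::runtime_error("could not generate an unused key");
  }

 private:
  std::mutex mu_;
  std::unordered_set<std::string> keys_;
};
```

Because the lookup and the insert happen while holding the same lock, two threads cannot both reserve the same key.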

Test for concurrent data access

Guidelines for testing:

Always test for concurrent data access if it’s a feature of the system. Actually, even if it’s not a feature, verify that the system rejects it. Testing concurrency can be challenging. An approach that usually works for me is to create many worker threads that simultaneously attempt access, plus a master thread that monitors them and verifies that some number of attempts were indeed concurrent, that each was blocked or allowed as expected, and that all completed successfully. Programmatic post-analysis of all attempts and of the changing system state may also be necessary to ensure that the system behaved well.
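The worker/master approach might be sketched as follows. `Counter`, `StressResult`, and `RunConcurrentIncrements` are hypothetical names; the calling thread plays the master role, releasing all workers at once and collecting the results.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared object under test.
class Counter {
 public:
  void Increment() {
    std::lock_guard<std::mutex> lock(mu_);
    ++value_;
  }
  long Get() {
    std::lock_guard<std::mutex> lock(mu_);
    return value_;
  }

 private:
  std::mutex mu_;
  long value_ = 0;
};

struct StressResult {
  long final_count;  // Total increments observed after all workers finish.
  int max_overlap;   // Most workers seen inside their work phase at once.
};

// Starts `workers` threads behind a start gate, lets each increment the
// counter `per_worker` times, and reports what the master thread observed.
StressResult RunConcurrentIncrements(int workers, int per_worker) {
  Counter counter;
  std::atomic<bool> start(false);
  std::atomic<int> active(0);
  std::atomic<int> max_overlap(0);
  std::vector<std::thread> threads;
  for (int w = 0; w < workers; ++w) {
    threads.emplace_back([&] {
      while (!start.load()) std::this_thread::yield();  // Wait at the gate.
      int now = ++active;
      int seen = max_overlap.load();
      while (now > seen && !max_overlap.compare_exchange_weak(seen, now)) {
      }
      for (int i = 0; i < per_worker; ++i) counter.Increment();
      --active;
    });
  }
  start.store(true);  // Release all workers together.
  for (std::thread& t : threads) t.join();
  return {counter.Get(), max_overlap.load()};
}
```

A test would assert that `final_count` equals `workers * per_worker`. Since thread scheduling is not guaranteed, a check such as `max_overlap >= 2` may need a large `per_worker` value or a retry before it reliably confirms that attempts really overlapped.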

Steer clear of undefined behavior and non-deterministic access to data

Guidelines for development:

Understand when the APIs and operations you use might have undefined behavior and prevent those conditions.

Do not depend on data structure iteration order unless it is guaranteed. It is a common mistake to depend on the ordering of sets or associative arrays.
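As a small illustration of the iteration-order guideline: `std::unordered_map` does not guarantee any particular iteration order, so code that needs a stable order should impose one explicitly. `SortedKeys` is a hypothetical helper written for this sketch.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Returns the keys of `counts` in ascending order. The map's own iteration
// order is unspecified, so the keys are copied out and sorted explicitly.
std::vector<std::string> SortedKeys(
    const std::unordered_map<std::string, int>& counts) {
  std::vector<std::string> keys;
  keys.reserve(counts.size());
  for (const auto& entry : counts) keys.push_back(entry.first);
  std::sort(keys.begin(), keys.end());
  return keys;
}
```

If sorted order is needed everywhere, an ordered container such as `std::map` is another option, at the cost of different performance characteristics.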

Log the details for errors or test failures

Guidelines for development:

Follow good logging practices, especially in your error handling code.

If logs are stored on users’ machines, give them an easy way to send you the logs.

Guidelines for testing:

Save your test logs for potential analysis later.

Anything to add?