This post has been republished on my personal website.

Lessons I’ve learned from software engineering are uniformly cynical:

Abstraction almost always fails; you can’t build something on top of a system without understanding how that system works.

Bleeding-edge methods are a recipe for disaster

Everything good is hype and you’ll only ever get a small fraction of the utility being promised.

Imagine my surprise, then, when the Z3 constraint solver from Microsoft Research effortlessly dispatched the thorniest technical problem I’ve been given in my short professional career.

The Problem

Microsoft Azure has a lot of computers in its datacenters — on the order of millions. For security, each of these computers is configured with a firewall which accepts communication from a comparatively small set of authorized servers. These firewalls aren’t created by hand — they’re automatically generated during deployment. We wanted to update the firewalls from a confused overlapping whitelist/blacklist system to a simple whitelist. Any change in this domain carries substantial risk:

Accidentally allowing connections from computers which should be blocked, a significant security issue.

Accidentally blocking connections from computers which should be allowed, a significant availability issue.

Thus we wanted strong guarantees that firewalls generated with the new method blocked & allowed the exact same connections as firewalls generated with the old method. This is very difficult; the naive solution of checking all 2⁸⁰ packets against both firewalls would take a computer 38 million years to finish at a brisk rate of one billion packets per second! There’s another way: give the problem to the Z3 theorem prover from Microsoft Research, and it checks equivalence in a fraction of a second. How?

Indistinguishable from magic

Z3 is variously described as a theorem prover, SMT solver, and constraint solver. I like to think of it as an Oracle. Let’s think — if we had access to an Oracle, what question would we ask to solve the firewall equivalence problem? First: we require an understanding of firewalls and packets.

Every piece of information sent over the network is encapsulated in a packet. Like a proper piece of correspondence, packets contain two important pieces of information: where they came from, and where they’re going. We’ll say each address is a single number, like 50 or 167. So, the packet [23, 75] came from source address 23 and is heading to destination address 75 (in real life these numbers are IPv4 or IPv6 addresses, but these are just [very large] numbers and so the simplification works).

Firewalls are lists of rules saying which packets to block and which to allow. Rules are expressed in terms of source and destination address ranges, plus a decision — block or allow. We say a packet matches a rule if the packet’s source address is in the rule’s source range and destination in the destination range, in which case the decision is applied to that packet. For example, we can write a rule to block any packets originating in the address range 100–150 and headed to an address in 60–70. This rule would block the packet [125, 65].

Rules can overlap. If a packet matches both a block and allow rule, the block rule ‘wins’ and the packet is blocked. If a packet doesn’t match any rules, it is blocked by default. A packet only gets through if it matches at least one allow rule and zero block rules.

Let’s return to the question of the question. What should we to ask? I submit the following: Oracle, what is a packet blocked by one firewall but allowed by the other? If the Oracle replies there is no such packet, we know the firewalls are equivalent (hurrah!). If it replies with an example of such a packet, we know the firewalls are not equal and have a really great lead on figuring out why they aren’t equal.

Z3, for all its amazing capabilities, can’t understand queries in plain English. The problem now becomes stating our question in a form understood by Z3: first-order logic.

The right question

First-order logic is not scary. We require only two logical operators: and and not. To ask our question of Z3, we must do three things:

Tell Z3 what a packet is. Tell Z3 what a firewall is. Tell Z3 we want to find a packet blocked by one firewall but allowed by the other.

Z3 works with popular programming languages Java, C#, C++, and Python, but for simplicity we’ll use its native language. You can follow along on the Z3 web demo here: http://rise4fun.com/Z3

The first task is easy. Our simple packets have two fields: source, and destination. We describe this to Z3 by declaring integer constants src and dst. Z3’s mission is to find values for these constants — once all the wiring is in place, their values tell us a packet accepted by one firewall but not the other. Here’s how you declare the constants in Z3:

(declare-const src Int)

(declare-const dst Int)

The second task is the real meat of the problem: tell Z3 what a firewall is. First, let’s define what it means for a packet to match a rule:

(define-fun matches ((srcLower Int) (srcUpper Int) (dstLower Int) (dstUpper Int)) Bool

(and (<= srcLower src) (<= src srcUpper) (<= dstLower dst) (<= dst dstUpper))

)

This function is true if src is in the rule’s source address range and dst is in the rule’s destination range. Otherwise it is false.

Now we define what it means for a firewall to accept or block a packet. Let’s use a simple firewall with two rules, an allow rule and a block rule. The firewall function returns true if the packet is allowed, and false if it is blocked. Here’s how we state this to Z3, using the match function defined above:

(define-fun firewall1 () Bool

(and

(matches 0 10 20 30)

(not (matches 5 10 25 30))

)

)

Z3 is now a firewall expert. On to the third task!

Satisfaction

The third task is to actually verify firewall equivalence. First, define a second firewall so we have something to check:

(define-fun firewall2 () Bool

(and

(matches 1 10 20 30)

(not (matches 5 10 25 30))

)

)

It’s time! We have everything we need. Let’s ask Z3 the question — what is a packet blocked by one firewall but allowed by the other?

(assert (not (= firewall1 firewall2)))

(check-sat)

(get-model)

Click the run button in the web demo and… boom! Z3 finds us a packet — for me, [0, 20] — that is accepted by firewall1 but blocked by firewall2. This works for any two firewalls! All we have to do is change the contents of the firewall1 and firewall2 functions.

This all seems a bit magical, so let’s break down the last step. First, we assert the two firewalls are not equivalent. Then we ask Z3 to check this assertion with the check-sat instruction! This has two possible outcomes:

The firewalls are not equivalent: check-sat returns satisfiable, and the get-model instruction provides a packet demonstrating firewall inequivalence. The firewalls are equivalent: check-sat returns unsatisfiable and no packet is produced.

Either way, we have our answer. Z3 ruthlessly tracks down values of src and dst representing a packet accepted by one firewall but not the other. This is very fast: clever logic manipulation rules enable Z3 to process 300-rule firewalls in a fraction of a second.

In the real world

Real packets don’t exactly correspond to our model. Instead of simple numbers, they use IPv4 or IPv6 source & destination addresses, port numbers, and protocol numbers. Z3 handles these with no real changes to the core logic; Z3 bitvectors are a drop-in replacement type for the address numbers in our model. The actual firewall-checking code used inside Azure has been open-sourced, and is available here.

Beyond the firewall

This problem hardly taxes Z3’s ability, which lists nonlinear constraints on real numbers in its repertoire. Despite its expansive set of use cases, Z3 significantly decreased problem complexity compared to other approaches. The code was simple to write and easy to understand. If you’re facing a thorny problem that seems like it could be stated in terms of satisfiability, I very much recommend giving Z3 a try.

For an in-depth whitepaper on this topic, see “Checking Cloud Contracts in Microsoft Azure” by Nikolaj Bjørner and Karthick Jayaraman.