Converting Boolean-Logic Decision Trees to Finite State Machines

for simpler, high-performance detection of cybersecurity events

When analyzing cybersecurity events, the detection algorithm evaluates attributes against boolean expressions to determine whether the event belongs to a class. This article describes converting boolean expressions to finite state machines to permit simpler, high-performance evaluation.

The open-source project Cyberprobe features this implementation. Conversion of rules to finite state machine (FSM) and application of the rules in FSM form is implemented in Python. Cyberprobe supports the use of millions of rules, which can be applied at greater than 200k events/second on a single processor core.

Problem

Many scanning and detection problems come down to applying boolean-logic criteria to events. For instance, suppose an interaction with a service under protection generates an event with the following attributes:

Source address: 123.123.123.123:14001

Destination address: 192.168.0.1:19001

URL: https://myservice.com/path1

I have one or more boolean expressions describing the class of thing I am trying to detect, for example:

If TCP port number is 80 or 8080 AND IP address is 10.0.0.1 AND URL is http://www.example.com/malware.dat OR http://example.com/malware.dat …

The aim is to analyze a high-rate stream of such events against a large set of boolean expressions to classify the events.

Boolean expressions quickly become unreadable when written in English, which has no built-in operator precedence.

Boolean expressions

Boolean operators are represented as functions, and type:value represents attribute type/value match terms.

    and(
        or(
            tcp:80, tcp:8080
        ),
        ipv4:10.0.0.1,
        or(
            url:http://www.example.com/malware.dat,
            url:http://example.com/malware.dat
        )
    )

A boolean expression consists of a combination of and(…), or(…) and not(…) functions, along with type:value match terms. I am using type:value pairs for match terms because that is useful in the domain I'm working in, but plain strings would work just as well.
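To make the notation concrete, here is a minimal sketch of a parser that turns the function notation into nested Python tuples. The tuple representation and the name `parse` are assumptions made for illustration, not Cyberprobe's actual parser, and the sketch assumes that values never contain commas or parentheses.

```python
import re

# A token is '(', ')', ',' or a run of other characters (an operator name or
# a type:value term). Assumption: values contain no commas or parentheses.
TOKEN = re.compile(r"\s*([(),]|[^(),\s][^(),]*)")


def parse(text):
    """Parse 'and(or(a:1, b:2), c:3)' into ('and', ('or', 'a:1', 'b:2'), 'c:3')."""
    tokens = [t.strip() for t in TOKEN.findall(text)]
    pos = 0

    def expr():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        # An operator name followed by '(' starts a function call.
        if tok in ("and", "or", "not") and pos < len(tokens) and tokens[pos] == "(":
            pos += 1  # consume '('
            children = [expr()]
            while tokens[pos] == ",":
                pos += 1
                children.append(expr())
            if tokens[pos] != ")":
                raise ValueError("expected ')'")
            pos += 1  # consume ')'
            return (tok, *children)
        return tok  # anything else is a type:value match term

    tree = expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return tree


tree = parse("""
and(
    or(tcp:80, tcp:8080),
    ipv4:10.0.0.1,
    or(url:http://www.example.com/malware.dat,
       url:http://example.com/malware.dat)
)
""")
```

Nested tuples keep the tree immutable and easy to print, which is convenient for experimenting with the steps that follow.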

Input

When an event is evaluated, its attributes are presented as type:value pairs, e.g.

ipv4:123.123.123.123

tcp:14001

ipv4:192.168.0.1

tcp:19001

url:https://myservice.com/path1
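As a small illustration, the event from the Problem section could be flattened into these type:value pairs as follows. The dictionary keys (`src`, `dest`, `url`) and the function name are hypothetical; real events will have whatever shape the event source produces.

```python
def event_terms(event):
    """Flatten an event into a list of type:value strings.

    Assumes 'src' and 'dest' hold 'address:port' strings for TCP over IPv4;
    the key names are illustrative only.
    """
    terms = []
    for key in ("src", "dest"):
        host, _, port = event[key].rpartition(":")
        terms.append(f"ipv4:{host}")
        terms.append(f"tcp:{port}")
    terms.append(f"url:{event['url']}")
    return terms


event = {
    "src": "123.123.123.123:14001",
    "dest": "192.168.0.1:19001",
    "url": "https://myservice.com/path1",
}
terms = event_terms(event)
```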

A basic evaluation algorithm

A simple approach for evaluation of a boolean expression using type:value pair input is to represent the boolean expression as a tree, and then use type:value pairs to trigger evaluation. Observations are stored in the tree.

The rules for evaluating a boolean tree against an event are:

1. For each type:value attribute, see if there is a corresponding type:value term in the boolean tree. If it exists, set the term node as true, and evaluate the parent node.

2. When evaluating a parent or node, when any child is true, the or node is true, and its parent node is evaluated.

3. When evaluating a parent and node, when ALL children are true, the and node is true, and its parent node is evaluated.

4. When evaluating a parent not node, when the child node is true, the not node is false.

5. Once evaluation of all attributes is complete, any not node which has not been deemed false (that is, its child never became true) is evaluated true, and its parent node is evaluated.
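The rules above propagate truth upwards from the matched terms. The sketch below swaps in a simpler top-down recursive evaluation, which should give the same answer once all attributes have been seen, and so is handy as a reference to check an FSM against. The nested-tuple tree format and the name `evaluate` are assumptions made for illustration.

```python
def evaluate(node, attributes):
    """Evaluate a nested-tuple boolean tree against a set of type:value strings.

    Top-down recursion rather than the bottom-up propagation described in the
    rules; the final result is the same.
    """
    if isinstance(node, str):
        # A match term is true if the event carries that attribute.
        return node in attributes
    op, *children = node
    if op == "or":
        return any(evaluate(c, attributes) for c in children)
    if op == "and":
        return all(evaluate(c, attributes) for c in children)
    if op == "not":
        # True only if the child never became true for this event.
        return not evaluate(children[0], attributes)
    raise ValueError(f"unknown operator: {op}")


tree = ("and",
        ("or", "tcp:80", "tcp:8080"),
        "ipv4:10.0.0.1",
        ("or", "url:http://www.example.com/malware.dat",
               "url:http://example.com/malware.dat"))

matching = {"tcp:80", "ipv4:10.0.0.1", "url:http://example.com/malware.dat"}
benign = {"ipv4:123.123.123.123", "tcp:14001", "ipv4:192.168.0.1",
          "tcp:19001", "url:https://myservice.com/path1"}
```

Every call walks the whole tree, which is the cost the FSM conversion is meant to avoid.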

That’s a straightforward algorithm; the point of this article is to provide an optimization.

There is a trade-off here: the algorithm which converts the boolean tree to an FSM is compute-intensive. Its complexity is non-linear in the number of nodes: it is linear in the product of the number of combination states (described below) and the number of type:value terms. In real-world scenarios, a boolean expression is converted to an FSM once, when the rule is parsed; thereafter the FSM can be used any number of times.

Converting to an FSM

Step 1: Identify the ‘basic states’

To find the FSM, we look for all of the nodes in the boolean tree where state needs to be recorded as evaluation proceeds. In the example above, or nodes and and nodes behave differently. When a child of an or node evaluates as true, its parent immediately becomes true, so no state needs to be kept for the children of or nodes. By contrast, when a child of an and node becomes true, that fact may need to be stored so that we can later determine the point at which the and node itself becomes true.

The evaluation of not nodes is also complicated: a not node can be evaluated as true by virtue of its child maintaining a false evaluation for the duration of analysis.

The rules we state here are that some nodes in the boolean tree can be described as basic states:

1. The root of a tree is inherently a hit state, which means the boolean expression is true. This is a basic state.

2. A not node is never a basic state.

3. A child of an and node is a basic state unless it is a not node.

4. A child of a not node is a basic state unless it is a not node itself.

In the above example, the basic states are the two or nodes and the ipv4:10.0.0.1 node. All three qualify under rule 3.

The implementation gives each state a state name which consists of the letter s plus a unique number, assigned in a depth-first walk. The example boolean tree with states is shown below; the three children of the and node are given states, with the parent and node representing the hit state.
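The four rules can be sketched as a single walk over the tree. This is an illustration under stated assumptions: the tree is a nested tuple, and nodes are numbered post-order starting at 1, a scheme inferred from the example's state names (s3, s4, s7) rather than confirmed against the implementation. Where rules 1 and 2 conflict (a not node at the root), the sketch treats the root as the hit state.

```python
def basic_states(expr):
    """Return {state_name: node} for the basic states of a nested-tuple tree."""
    found = {}
    counter = 0

    def walk(node, parent_op, is_root):
        nonlocal counter
        op = None if isinstance(node, str) else node[0]
        if op is not None:
            for child in node[1:]:
                walk(child, op, False)
        counter += 1  # post-order: children are numbered before their parent
        if is_root:
            found["hit"] = node              # rule 1
        elif op != "not" and parent_op in ("and", "not"):
            found[f"s{counter}"] = node      # rules 2, 3 and 4

    walk(expr, None, True)
    return found


tree = ("and",
        ("or", "tcp:80", "tcp:8080"),
        "ipv4:10.0.0.1",
        ("or", "url:http://www.example.com/malware.dat",
               "url:http://example.com/malware.dat"))
states = basic_states(tree)
```

For the example tree this yields the states s3, s4 and s7 plus hit, matching the names used in the rest of the article.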

Step 2: Identify the ‘combination states’

The basic states are nodes where partial state needs to be recorded. A single state in the FSM represents all of that partial state at once, i.e. a combination of basic states. Hence the set of combination states consists of all combinations of basic states. This includes the empty set and the union of all basic states.

Combination states need names. In my implementation, a name is formed by ordering the state numbers, separating them with hyphens, and prefixing the result with s. For example, the combination of states s4, s7 and s13 is called s4-7-13.

The empty set has a special name, init. It represents the initial state of the FSM, where no information is known.

There is also a special state, hit, which describes any combination of basic states that includes the root node evaluating to true. The other states in such a combination are ignored.

In the above example, the combination state set consists of:

init : The empty set

s3 : The first or node

s4 : The ipv4:10.0.0.1 node

s7 : The second or node

s3-4 : The first or node and the ipv4:10.0.0.1 node

s4-7 : The ipv4:10.0.0.1 node and the second or node

s3-7 : The first and second or nodes

hit : The root node
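Enumerating and naming the combinations is mechanical. The following is a minimal sketch using `itertools.combinations`; the function names are illustrative, and the basic states are passed in as plain numbers.

```python
from itertools import combinations


def combination_name(numbers):
    """Name a combination of basic-state numbers, e.g. [7, 4, 13] -> 's4-7-13'."""
    if not numbers:
        return "init"  # the empty set
    return "s" + "-".join(str(n) for n in sorted(numbers))


def combination_states(basic_numbers):
    """List the names of every subset of the given basic-state numbers."""
    basic = sorted(basic_numbers)
    return [
        combination_name(combo)
        for size in range(len(basic) + 1)
        for combo in combinations(basic, size)
    ]


names = combination_states([3, 4, 7])
```

The raw enumeration also produces s3-4-7. In the example that combination makes every child of the and node true, so the root is true and it is equivalent to hit, which would explain why it does not appear separately in the list above.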

Step 3: Find all match terms

This is the set of all type:value match nodes in the boolean expression tree.

Step 4: Find all transitions

This step is essentially about working out what every type:value match node does to every combination state. There is also a special match term, end:, which is used to evaluate what happens to not nodes once the list of terms is complete.

The algorithm is:

    For every combination state:
        Work out the state name of that 'input' combination state
        For every match term:
            Given the input state
            What state results from evaluating that term as true?
            Work out the state name of that 'output' combination state
            Record a transition (input, match term, output)
        Given the input state
        What state results from evaluating end: as true?
        Work out the state name of that 'output' combination state
        Record a transition (input, end:, output)
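The loop above can be sketched in Python as follows. This is one possible reading of the algorithm rather than the Cyberprobe code: the flattened `nodes` table, the post-order numbering, and the helper names (`build`, `holds`, `step`, `transitions`) are all assumptions made for illustration.

```python
from itertools import combinations


def build(expr, nodes, parent_op=None):
    """Flatten nested tuples into nodes[id] = (op, payload, is_basic).

    Ids are assigned post-order starting at 1; returns the id of expr.
    """
    if isinstance(expr, str):
        op, payload = "term", expr
    else:
        op = expr[0]
        payload = [build(child, nodes, op) for child in expr[1:]]
    node_id = len(nodes) + 1
    is_basic = parent_op is None or (op != "not" and parent_op in ("and", "not"))
    nodes[node_id] = (op, payload, is_basic)
    return node_id


def name(state):
    return "s" + "-".join(str(n) for n in sorted(state)) if state else "init"


def holds(node_id, nodes, state, term):
    """Is this node true, given the recorded basic states and one new term?"""
    op, payload, _ = nodes[node_id]
    if node_id in state:
        return True
    if op == "term":
        return payload == term
    if op == "or":
        return any(holds(c, nodes, state, term) for c in payload)
    if op == "and":
        return all(holds(c, nodes, state, term) for c in payload)
    # A not node is only resolved by end:, and only if its child never fired.
    return term == "end:" and not holds(payload[0], nodes, state, term)


def step(nodes, root, state, term):
    """Name of the combination state reached from `state` on `term`."""
    new = set(state) | {
        n for n, (_, _, basic) in nodes.items()
        if basic and holds(n, nodes, state, term)
    }
    return "hit" if root in new else name(new)


def transitions(nodes, root):
    """All (input, term, output) triples that change state."""
    basics = sorted(n for n, (_, _, b) in nodes.items() if b and n != root)
    terms = sorted({p for op, p, _ in nodes.values() if op == "term"}) + ["end:"]
    result = []
    for size in range(len(basics) + 1):
        for combo in combinations(basics, size):
            for term in terms:
                src, dst = name(combo), step(nodes, root, combo, term)
                if src != dst:
                    result.append((src, term, dst))
    return result


TREE = ("and",
        ("or", "tcp:80", "tcp:8080"),
        "ipv4:10.0.0.1",
        ("or", "url:http://www.example.com/malware.dat",
               "url:http://example.com/malware.dat"))
nodes = {}
root = build(TREE, nodes)
table = transitions(nodes, root)  # includes ("init", "tcp:80", "s3")
```

The nested loops make the cost visible: one evaluation per combination state per match term, plus one more per combination state for end:.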

For this analysis, when the whole boolean expression evaluates as true i.e. the root node of the boolean expression is true, we give that a special name hit .

The result is a complete set of triples: (input, term, output). If the input and output states are the same, we can ignore the transition so that the FSM only contains edges which change state.
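Once the triples exist, applying them to an event is a table lookup per attribute. The sketch below assumes that a missing (state, term) entry means "stay in the current state", in line with dropping self-transitions, and that end: is fed in after the last attribute. The function name and the hand-written transition list for and(a:1, b:2) are illustrative only.

```python
def run_fsm(transitions, terms):
    """Return True if the event's terms drive the FSM from init to hit."""
    table = {(src, term): dst for src, term, dst in transitions}
    state = "init"
    # end: is appended so that not nodes can be resolved after the last term.
    for term in list(terms) + ["end:"]:
        # No entry means the term does not change the state.
        state = table.get((state, term), state)
    return state == "hit"


# Hand-written transitions for and(a:1, b:2): a:1 is s1, b:2 is s2.
EXAMPLE = [
    ("init", "a:1", "s1"),
    ("init", "b:2", "s2"),
    ("s1", "b:2", "hit"),
    ("s2", "a:1", "hit"),
]

matched = run_fsm(EXAMPLE, ["b:2", "x:9", "a:1"])
unmatched = run_fsm(EXAMPLE, ["a:1", "x:9"])
```

Each attribute costs one dictionary lookup regardless of how deep the original boolean tree was, which is where the performance benefit comes from.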

At this point, the FSM has some inefficiencies: there may be areas of the FSM which it is not possible to navigate to from init . This is addressed in the next step.

Step 5: Remove invalid transitions

Not all combination states can be reached from init , and so some of the transitions discovered can be discarded as irrelevant.

We start by constructing a set of states which can navigate to hit :

1. Create a set containing only the combination state hit.

2. Iterate over the FSM's transitions, adding the input state of every transition whose output state is already in the set.

3. Repeat step 2 until no more states are added.

At this point we know all the states which can lead to hit. However, some transitions lead to states which are not in this set, and so can never reach hit. So, the first simplification is to redirect every transition whose output state is NOT in this set to a single state named fail.
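This first simplification is a backwards reachability search followed by a rewrite of the output states. A minimal sketch, with an illustrative function name and a hand-written transition list for and(a:1, not(b:2)), where a:1 is s1 and b:2 is s2:

```python
def redirect_dead_ends(transitions):
    """Point every transition that can no longer reach hit at 'fail'."""
    can_hit = {"hit"}
    changed = True
    while changed:  # repeat until a pass adds no new states
        changed = False
        for src, _, dst in transitions:
            if dst in can_hit and src not in can_hit:
                can_hit.add(src)
                changed = True
    return [
        (src, term, dst if dst in can_hit else "fail")
        for src, term, dst in transitions
    ]


EXAMPLE = [
    ("init", "a:1", "s1"),
    ("init", "b:2", "s2"),
    ("s1", "b:2", "s1-2"),
    ("s1", "end:", "hit"),
    ("s2", "a:1", "s1-2"),
]
simplified = redirect_dead_ends(EXAMPLE)
```

In this example, seeing b:2 at any point makes the not node false for good, so every transition into s2 or s1-2 collapses to fail.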

There is a second simplification of the FSM: some of the states are not navigable from init , and can be removed:

1. Construct a set containing only init.

2. Iterate over the FSM's transitions, adding the output state of every transition whose input state is already in the set.

3. Repeat step 2 until no more states are added.

At this point we know which areas of the FSM are not reachable from init, and they can be removed.
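The second simplification is the same search run forwards from init. A minimal sketch, again with an illustrative function name and a small hand-written transition list:

```python
def prune_unreachable(transitions):
    """Drop transitions whose input state cannot be reached from init."""
    reachable = {"init"}
    changed = True
    while changed:  # repeat until a pass adds no new states
        changed = False
        for src, _, dst in transitions:
            if src in reachable and dst not in reachable:
                reachable.add(dst)
                changed = True
    return [t for t in transitions if t[0] in reachable]


EXAMPLE = [
    ("init", "a:1", "s1"),
    ("init", "b:2", "fail"),
    ("s1", "b:2", "fail"),
    ("s1", "end:", "hit"),
    ("s2", "a:1", "fail"),  # nothing leads to s2, so this edge is dropped
]
pruned = prune_unreachable(EXAMPLE)
```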

Resultant FSM