Some experiments to mistreat the Triton concolic execution framework through simple forged C programs.

The goal of this post is to describe several attempts made internally to challenge Triton's approach to control flow graph recovery, as described in . It's part of an internal initiative to improve both Triton and Epona, a generic LLVM bytecode obfuscator developed internally.

Each section presents an attack to a specific step of the control flow graph recovery algorithm.

Ssexify A typical way to attack a tool is to forge input it cannot handle. If Triton is unable to emulate instructions, we put a clear stop to the approach. The list of supported instructions is available at this page , all we need is to generate an unusual instruction and the engine raises an exception when meeting this instruction. For instance, there is no support for vector instructions, like the one from the SSE or AVX instructions sets, so one could enforce using such instructions, for instance using SSExy from Jurriaan Bremer. Still, supporting these is not conceptually impossible, as Triton can model instructions on registers of up to 512 bits. Running a multi-threaded application is another way to attack the approach. If you use Triton as emulator, you have to simulate all external calls to libraries and to represent their behaviors into the Triton's representation. Modeling several calls to pthread_create is a significant challenge for all DSE framework (regarding the memory model). Likewise, what happens when inserting random calls to the following function? #include <unistd.h> #include <sys/wait.h> void fork_n_wait () { // in case of error, we just go on standard execution if ( fork () > 0 ) { // the parent just forwards its childs exit status int wstatus ; if ( wait ( & wstatus ) >= 0 ) if ( WIFEXITED ( wstatus )) exit ( WEXITSTATUS ( wstatus )); // lazy ass: don't handle all wait cases } } The program repeatedly forks a continuation of its own execution. It is likely to break a symbolic execution engine, until the reverser sees the pattern and patch it (Until you add some diversity to the pattern through generic obfuscation, that is ;-)).

Duplicate By definition, recovering the full control flow graph of an application requires to run the program as many times as the number of different paths in the program. One way to trick the recovery is thus to artificially generate tons of paths. A naive way to do that is to create dead blocks guarded by an opaque predicate that always return false. int foo ( int a ) { if ( opaque_false ()) { /* crap code */ return ( a + 1 ) / ( a - a ); } /* real code */ return a ; } Unfortunately, if opaque_false does not depend on a symbolic variable, Triton is not going to symbolize the expression, and the effective value ( false ) is used. If the opaque predicate seems to depend on the context, say through opaque_false(a) then the symbolic engine tries to find a model for the expression, and unless it is fed with a complex problem (see below) it finds that the expression is constant, which does not triggers a new path creation. A simple way to create artificial paths is to clone an instruction and guard it by a random predicate. Arbitrary obfuscation techniques can then be used to make the two instructions look different: unsigned foo ( unsigned a ) { if ( a & 0x1 ) return a << 1 ; else return a * 2 ; } There are two different paths, the program can legitimately takes any of them, and triton has to go through both paths. Iteratively applying this transformation in a recursive manner quickly creates a program with a large number of paths, which makes it difficult for Triton to cover all of them. However, as most paths are duplicates of others, going through a subset of the paths still yields a fully functional program, so the defense only gives the feeling that the coverage is incomplete.

Xorify As Triton uses a concolic execution, it can use a DSE optimization which only constructs expressions that involve symbolic variables, all the others are just concretized. The idea of this protection is to force Triton to create large expression trees, in order to stress triton's expression building, and maybe z3 engine. Consider the following code. It uses a Mixed Boolean Arithmetic expression instead of a plain xor to avoid trivial simplification, and builds a larger degenerated tree for tmp : #include <stdio.h> #define xor(a,b) ((a+b) - 2 * (a & b)) int main ( int argc , char * argv []) { int tmp = argc / 2 ; for ( int i = 0 ; i < 100 ; ++ i ) tmp = xor ( tmp , argc ); if ( tmp == 12 ) { puts ( "yes" ); return 0 ; } return 1 ; } When this code was first submitted to Triton, it quickly ended in a memory error, as Triton used to generate a string representation (SMT2-Lib format) for the expression (a string of > 2Gb, that is) which in turn made z3 memory swap. The reimplementation of the conversion from Triton's expression tree to z3's did solve the issue, but the tree building remains a costly operation.

Ifify Based on the previous examples, one can forge a simple example that also triggers many tree evaluation: #include <stdio.h> #define xor(a,b) ((a+b) - 2 * (a & b)) int main ( int argc , char * argv []) { int tmp = argc * 2 ; for ( unsigned long i = 0 ; i < 1000000ul ; ++ i ) if ( tmp != 0 ) tmp = xor ( tmp , argc ); if ( tmp == 12 ) { puts ( "yes" ); return 0 ; } return 1 ; } The extra test is an opaque predicate that always return true. It cunningly depends on the symbolic variable argc . Adding an extra test does not significantly increases the run time of the program (I typically measured less than 1% of slowdown in that case), but it does trigger many evaluation of the model, which in turn requires to repeatedly build an expression that significantly grows with the loop trip count. For instance, if we want to solve the condition if(tmp == 12) where tmp is involved as symbolic variable, we have to go through all loop iterations which implies a huge expression. This snippet of code aims to show us from which iteration the expression takes time to solve. The following table shows us that the expression with less than ten iterations is easy to solve and for more than ten iterations we get an exponential time during the solving process. Iteration Time (second) 0 0.010108 1 0.015272 2 0.018748 3 0.053917 4 0.047126 5 0.116106 6 0.074973 7 0.140935 8 0.280171 9 0.634627 10 1.558280 11 4.537699 12 12.705435 13 40.657662 14 117.543763 15 342.179310 16 1052.419634 17 timeout

Non Deterministic Trace During the control flow graph reconstruction, one (approach used against the Tigress protection ) merges different execution trace in order to recover the original algorithm. This idea relies on the fact that common slices of the trace belong to a common path; What if the program exhibits undeterministic behavior? Like using the value from an uninitialized register, the address of a pointer when ASLR is enabled, calling random() as in the following source code? #include <stdio.h> #define xor(a,b) ((a+b) - 2 * (a & b)) int main ( int argc , char * argv []) { int tmp = argc / 2 ; printf ( "%p

" , & tmp ); if ( random () != random ()) tmp = xor ( xor ( tmp , argc ), argc ) if ( tmp == 12 && & tmp != ( void * ) tmp ) { puts ( "yes" ); return 0 ; } return 1 ; } There are countermeasure to this pseudo-random trace issues. One is to suppress randomness, through a very debianish implementation of random() that would always return the same value (but beware, one could base the test on some probabilistic properties of random, and using a dummy random generator would not help there). Similarly, because of the internal of Triton, the program behaves as if ASLR were not activated but invalid loads (like out-of-bound access) still take a non-deterministic value.

Timeout the Solver None of the previous code attacks z3 itself. It's still relatively easy to make it timeout upon the model checking call. It's a simple as reducing to a problem that is designed to be difficult to inverse, for instance secure hashing. It's not even necessary to call a complex hashing function, something derived from md5 as the following is enough (and does not take too much time to execute, which is part of the trade-off between protection level and usability). constexpr uint32_t left_rotate ( uint32_t x , uint32_t amount ) { return (( x << amount ) | ( x >> ( 32 - amount ))); } constexpr uint32_t hash ( uint32_t x ) { uint32_t rotate_amounts [] = { 7 , 12 , 17 , 22 , 7 , 12 , 17 , 22 , 7 , 12 , 17 , 22 , 7 , 12 , 17 , 22 , 5 , 9 , 14 , 20 , 5 , 9 , 14 , 20 , 5 , 9 , 14 , 20 , 5 , 9 , 14 , 20 , 4 , 11 , 16 , 23 , 4 , 11 , 16 , 23 , 4 , 11 , 16 , 23 , 4 , 11 , 16 , 23 , 6 , 10 , 15 , 21 , 6 , 10 , 15 , 21 , 6 , 10 , 15 , 21 , 6 , 10 , 15 , 21 }; uint32_t a = 0x67452301 , b = 0xefcdab89 , c = 0x98badcfe , d = 0x10325476 ; uint32_t constants [] = { 3614090360 , 3905402710 , 606105819 , 3250441966 , 4118548399 , 1200080426 , 2821735955 , 4249261313 , 1770035416 , 2336552879 , 4294925233 , 2304563134 , 1804603682 , 4254626195 , 2792965006 , 1236535329 , 4129170786 , 3225465664 , 643717713 , 3921069994 , 3593408605 , 38016083 , 3634488961 , 3889429448 , 568446438 , 3275163606 , 4107603335 , 1163531501 , 2850285829 , 4243563512 , 1735328473 , 2368359562 , 4294588738 , 2272392833 , 1839030562 , 4259657740 , 2763975236 , 1272893353 , 4139469664 , 3200236656 , 681279174 , 3936430074 , 3572445317 , 76029189 , 3654602809 , 3873151461 , 530742520 , 3299628645 , 4096336452 , 1126891415 , 2878612391 , 4237533241 , 1700485571 , 2399980690 , 4293915773 , 2240044497 , 1873313359 , 4264355552 , 2734768916 , 1309151649 , 4149444226 , 3174756917 , 718787259 , 3951481745 }; for ( uint32_t i = 0 ; i < 6 /* increase this to increase difficulty*/ ; ++ i ) { uint32_t f = ( b & c ) | ( ~ b & d ); uint32_t to_rotate = a + f + constants [ i ] + x ; uint32_t new_b = b + left_rotate ( to_rotate , rotate_amounts [ i ]); a = d ; d = c ; c = b ; b = new_b ; } return a + ( b << 8 ) + ( c << 16 ) + ( d << 24 ); } The hash is used in the following in a very explicit, non-hacker proof way: #include <stdio.h> #include <stdint.h> int main ( int argc , char * argv []) { int tmp = argc / 2 ; constexpr auto h12 = hash ( 12 ); if ( hash ( tmp ) == h12 && tmp == 12 ) { puts ( "yes" ); return 0 ; } return 1 ; } /**/ The trick is to first compute the hash, then the actual condition, so accessing puts is equivalent to inverting the hash, which is not within z3's capability. Actually, with a loop trip count of 6, it's still within z3's scope, but increasing it to 8 makes it bail out (well, > 1h of computation and still counting). Note that this trick can be used to protect equalities, and as usual a few obfuscations suffice to make hash(tmp) == h12 && tmp == 12 less obvious to the reverser.

Conclusion As a result of this training session, Triton developers improved the code base ( ) and Epona developers gathered some insight on techniques to slowdown dynamic analysis. This does not mean it's implemented in Epona, but it also does not mean it is not ;-) Either way, looks like a good partnership to us!