Title : Cyber Grand Shellphish

Author : Team Shellphish

Date : January 25, 2017

|=-----------------------------------------------------------------------=| |=---------------------=[ Cyber Grand Shellphish ]=----------------------=| |=-----------------------------------------------------------------------=| |=------------------------=[ Team Shellphish ]=--------------------------=| |=----------------------=[ team@shellphish.net ]=------------------------=| |=----------------=[ http://shellphish.net/cgc#team ]=-------------------=| |=-----------------------------------------------------------------------=| ##### # # # # ##### ###### ##### # # # # # # # # # # ##### ##### # # # # # # # ##### # # # # # # # # ##### # ##### ###### # # ##### # # ##### ## # # ##### # # # # # ## # # # # #### # # # # # # # # # # # ##### ###### # # # # # # # # # # # # ## # # ##### # # # # # # ##### MM . MM = M M MM MM M MM MM ,M M MMM MM+ MM MM M MM MM MM MM MM MMMMD MMM MM+ MM MM MM MM +M MM :M MM MMM: MMMM MM MM MM MM MM MMD M. MM MMM M MMM MMMMM =M~ MM MM MM MM MM M~MM MM ~MM MM MM ,MMM OM MM MM MM MM MM M.MM MM MMM MMMMMM+NM MM MM MMMMMMMMMM +MMMMMM MMMN MM MM MMMMMM MM MM MM MM MMO MM MMMMMMMMMMMMM MMMMMM MMMN MM MM MMMMMN MMMMMM MM MMMMMMMMM MMMMMMM MMMM: MM MM MM MM NM: MM ?M MM MM MMM MMM MM MM ,M MMM MM MM MM MM MM? MM ~MM ~M MM :MMMMMMMMM MM MMM M M~MMMMMM MM~ MM MM MM MM NM+ MM MM NMM ~N MM ,MMMMMMM8 MM MM :MN MM MM+ MMI NMM M MMI MM MMM+ MM~ MMD~ MM |=-----------------------------------------------------------------------=| |=----------------------------=[ The Team ]=-----------------------------=| |=-----------------------------------------------------------------------=| |=---=[ zardus ]=-----=[ mike_pizza ]=---=[ anton00b ]=----=[ salls ]=---=| |=-----------------------------------------------------------------------=| |=---=[ fish ]=---------=[ nebhiros ]=---=[ cao ]=--------=[ donfos ]=---=| |=-----------------------------------------------------------------------=| |=---=[ hacopo ]=---------=[ nezorg ]=---=[ rhelmot ]=------=[ paul ]=---=| |=-----------------------------------------------------------------------=| |=----------------------------=[ zanardi ]=------------------------------=| |=-----------------------------------------------------------------------=| Hacking is often considered more than a skill. In popular culture, hackers are seen as wizards of sorts, artists with powers to access the inaccessible, or perform acts that seem impossible. Hacking, like art, has great people, who are recognized for their skills, but whose abilities cannot be captured or reproduced. In fact, a single great hacker in a team is better than a hundred mediocre ones, similar to as none of the paintings from a hundred mediocre artists can match a painting from van Gogh. Vulnerability analysis is the science of capturing and reproducing what some hackers do. Vulnerability analysis studies how one can reason, in a principled way, about finding and exploiting vulnerabilities in all types of software or hardware. By developing algorithms and tools to help humans identify flaws in software, researchers "codify" the knowledge that hackers use, in an organic way, to analyze systems and find their flaws. The resulting tools can then be used at scale, and composed to create new analysis systems. This scientific process has generated a number of useful tools, such as static analysis tools, fuzzers, and symbolic execution frameworks. However, these tools codify only a subset of the skills of a hacker, and they are still used only to augment the abilities of humans. One approach to push the codification of what human hackers do, is to take the hackers out of the equation. This is precisely what the DARPA Cyber Grand Challenge was set out to do. The DARPA Cyber Grand Challenge (CGC) was designed as a Capture The Flag (CTF) competition among autonomous systems without any humans being involved. During the competition, Cyber Reasoning Systems (CRSs) would find vulnerabilities in binaries, exploit them, and generate patches to protect them from attacks, without any human involvement at all. The separation between human and machine is key, as it forces the participants to codify, in algorithms, the techniques used for both attack and defense. Although the competition was only a first step toward capturing the art of hacking, it was an important one: for the first time, completely autonomous systems were hacking one another with code, and not human intuition, driving the discovery of flaws in complex software systems. Shellphish is a team that was founded by Professor Giovanni Vigna at UC Santa Barbara in 2005 to participate in the DEF CON CTF with his graduate students. Since then, Shellphish has evolved to include dozens of individuals (graduate students - now professors elsewhere, undergraduate students, visitors, their friends, etc.) who are somewhat connected by the Security Lab at UC Santa Barbara, but who are now spread all across the world. Nonetheless, Shellphish has never lost its "hackademic" background and its interest in the science behind hacking. Participation in many CTF competitions sparked novel research ideas, which, in addition to publications, resulted in tools, which, in turn, were put to good use during CTF competitions. Given the academic focus of Shellphish, it is no surprise that the DARPA Cyber Grand Challenge seemed like a great opportunity to put the research carried out at the UC Santa Barbara SecLab to work. Unfortunately, when the call for participation for the funded track came out, the lab was (as usual) busy with a great number of research projects and research endeavors, and there were simply no cycles left to dedicate to this effort. However, when the call for the qualification round that was open to anybody who wanted to throw their hat in the ring was announced, a group of dedicated students from the SecLab decided to participate, entering the competition as team Shellphish. With only a few weeks to spare, the Shellphish team put together a prototype of a system that automatically identifies crashes in binaries using a novel composition of fuzzing and symbolic execution. Unsurprisingly, the system was largely unstable and crashed more than the binaries it was supposed to crash. Yet, it performed well and Shellphish was one of the seven teams (out of more than a hundred participants) that qualified for the final event. Since Shellphish was not initially funded by DARPA through the funded track, it received a $750,000 award to fund the creation of the autonomous system that would participate in the final competition. The following few months focused mostly on basic research, which resulted in a number of interesting scientific results in the field of binary analysis [Driller16, ArtOfWar16], and to the dramatic improvement of angr [angr], an open-source framework created at the UC Santa Barbara SecLab to support the analysis of binaries. Eventually, the pressure to create a fully autonomous system increased to the point that academic research had to be traded for system-building. During the several months preceding the final competition event, all the energy of the Shellphish team focused on creating a solid system that could be resilient to failure, perform at scale, and be able not only to crash binaries, but also to generate reliable exploits and patches. After gruesome months of Sushi-fueled work that lead to severe Circadian rhythm sleep disorder [Inversion] in many of the team's members, the Mechanical Phish Cyber Reasoning System was born. Mechanical Phish is a highly-available, distributed system that can identify flaws in DECREE binaries, generate exploits (called Proofs Of Vulnerability, or POVs), and patched binaries, without human intervention. In a way, Mechanical Phish represents a codification of some of the hacking skills of Shellphish. Mechanical Phish participated in the final event, held in Las Vegas on August 4th, in conjunction with DEF CON, and placed third, winning a $750,000 prize. As a team, we were ecstatic: our system performed more successful exploits than any other CRS, and it was exploited on fewer challenges than any other CRS. Of course, in typical Shellphish style, the game strategy is where we lost points, but the technical aspects of our CRS were some of the best. We decided to make our system completely open-source (at the time of writing, Shellphish is the only team that decided to do so), so that others can build upon and improve what we put together. The rest of this article describes the design of our system, how it performed, and the many lessons learned in designing, implementing, and deploying Mechanical Phish. --[ Contents 1 - The Cyber Grand Challenge 2 - Finding Bugs 3 - Exploiting 4 - Patching 5 - Orchestration 6 - Strategy 7 - Results 8 - Warez 9 - Looking Forward --[ 001 - The Cyber Grand Challenge The Cyber Grand Challenge was run by DARPA, and DARPA is a government agency. As such, the amount of rules, regulations, errata, and so forth was considerably out of the range with which we were familiar. For example, the Frequently Asked Questions document alone, which became the CGC Bible of sorts, reached 68 dense pages by the time the final event came around; roughly the size of a small novella. This novella held much crucial information, and this crucial information had to be absorbed by the team, digested, and regurgitated into the squawking maw of the Mechanical Phish. --[ 001.001 - Game Format The CGC Final Event took the form of a more-or-less traditional attack-defense CTF. This means that each team had to attack, and defend against, each other team. Counter to A&D CTF tradition, and similar to the setup of several editions of the UCSB iCTF, exploits could not be run directly against opponents, but had to be submitted to the organizers. Likewise, patches could not be directly installed (i.e., with something like scp), but had to be submitted through the central API, termed the "Team Interface" (TI) by DARPA. This allowed DARPA to maintain full control over when and how often attacks were launched, and how patches were evaluated. The CGC was divided into rounds (specifically, there were 95 rounds in the final event) of at least 5 minutes each. Each round, the TI specified the currently-active challenges. A challenge could be introduced at any point, up to a maximum of 30 concurrently active challenges, and was guaranteed to remain live for a minimum of 10 rounds. All teams would start with the same binaries for a challenge. Each round, teams could submit patches for their instances of the challenge binaries, exploits for the opponents' instances, and network rules to filter traffic. Submitted exploits would go "live" on the round *after* the one during which they were submitted, while submitted patches and network rules would go live 2 rounds after. Submitting a patch, causes that team to incur one round of downtime for that challenge, in which opponents can download the patched binary. Patched binaries were visible by opponents to allow them to exploit incomplete patches. --[ 001.002 - Game Scoring The CGC employed a scoring algorithm that had serious implications for the strategies that were viable for teams to adopt. Each team would be scored on a per-round basis, and a team's final score was the sum of all of their round scores. A team's round score was, in turn, the sum of their "Challenge Binary Round Scores" for that round. This CB Round Score was the real crux of the matter. A CB Round Score was calculated for every Challenge Binary to be a simple multiplication of: - Availability: this was a measure of the performance overhead and functionality impact introduced by a patch. The performance and functionality scores each ranged from 0 (broken) to 100 (perfect), and the availability score was the minimum of these numbers. - Security: this number was 2 if the Challenge Binary was secure (i.e., it had not been exploited by any competitor that round), and 1 otherwise (if it was exploited by at least one competitor). - Evaluation: this number ranged from 1.0 to 2.0, based on how many opponents the team was hitting with their exploits (if any) for this challenge binary. For example, if an exploit was succeeding against 2 of the 6 opponents, this number would be 1.33. The round after a patch was submitted or an IDS rule was updated for a challenge binary, that challenge binary's availability score would be set to 0 and no exploits would be scheduled (either by the patching team or against the patching team) for that round. On the surface, this seems like a simple formula. However, there were several complications: - DARPA did not disclose the formula used to calculate the performance or functionality scores prior to the CFE. While this was presumably done to avoid letting teams "game the system", it led to an astonishing amount of uncertainty regarding patching. Teams knew that they had some "free" overhead (20% for the file size, and 5% each for runtime overhead and memory overhead), but did not know how additional overhead would be penalized. - The runtime overhead included overhead introduced by the network IDS as it matched submitted network rules. This overhead was fundamentally unmeasurable before the game. During the CFE, it turned out that the overhead introduced by network rules from one service could actually influence overhead of *other* services. All this uncertainty kept all but two teams from using the network IDS during the CFE. - The memory and runtime overhead was astonishingly hard to measure. As we discuss later, local measurements were so unreliable that we actually decided not to test our own patches for performance during the CFE, but optimized the techniques as much as possible beforehand and relied on the TI feedback for already-submitted patches. As we'll also discuss, DARPA themselves had some trouble keeping the measurements constant, with measurements between teams (for identical, unpatched binaries) varying by significant amounts. Overall, the performance scoring was both a strength and weakness of the game. Because of the ease with which exploits can be "broken" by simple patches, it was critical to have some patch dis-incentives. However, the measurement of such patches is a very hard problem, and led to some chaos. --[ 001.003 - Visibility With the CGC, DARPA pioneered something called the "Consensus Evaluation". This had many implications for many parts of the game, but the biggest one was a slight change to the "visibility" of certain aspects of the game to the different teams. In this section, we'll detail what information was available to competitors during the game. All interaction with the game was done via the CGC API (what DARPA called the Team Interface, or TI). The TI provided the following information: - The Challenges: all teams could retrieve the challenges that were "in play" at any given moment. - Patched Binaries: all teams could retrieve the patched binaries of *all* opponents. This prevented opponents from deploying "chump" patches that relied on simple security through obscurity. - IDS rules: all teams could retrieve the IDS rules that were being fielded by their opponents. - Availability impacts: a team could see the runtime overhead, memory overhead, and functionality impact of their submitted patches. - Crashes: a team could see what signals their binaries crashed with as a result of processing network traffic. This could give hints for whether or not exploits were being thrown against a challenge binary. - Network Traffic: a team could get most of the network traffic, including all exploits and most of the functionality polls (DARPA reserved the right to keep some of them secret). - POV Feedback: a team could get the results of their own exploits against other teams. - Round scores: teams could get the overall round scores of all competitors. This level of visibility is different than your "average" CTF. Specifically, the big difference is that the patches submitted by the Cyber Reasoning Systems were made available to opponents. In theory, this would allow opponents to analyze, and try to bypass, a CRS' patches. In practice, as we discussed in the Patching section, patched binaries were too "dangerous" to analyze. Teams were allowed to add anti-reversing code, and, in our experience, it was easy to find ways to crash analysis tools, or "clog" them, causing them to slow down or use extra resources. We did not dare to run our heavier analysis on the patched binaries, and we are not aware of any team that did. Instead, our heavy analysis was run strictly on the original, unpatched binaries, and the patched binaries were run through a quick analysis that would fix up memory offsets and so forth in the case of "chump" patches. --[ 001.004 - Practice Full autonomy is difficult to achieve. To mitigate some of this difficulty, DARPA provided CGC practice sessions through a "sparring partner". In the months leading up to the final event, the sparring partner would connect to our Cyber Reasoning System at random, unannounced times and begin a game. The presence of the sparring partner was immensely helpful in debugging interactions with the API. However, the unannounced nature of these events meant that it was rarely possible to be ready for this debugging. Additionally, since DARPA only had one sparring partner setup, and teams had to take turns interacting with it, the sparring partner sessions were very short (generally, about 30 minutes, or 5 rounds). The reduced length of these sessions made it more difficult to truly stress-test the systems, and several teams showed signs of failure due to overload during the final event itself. We mostly used the practice round to test the difference between DARPA's calculated overhead from our estimated one. This allowed us to get a vague idea of how DARPA calculated overhead, and tailor our patching techniques accordingly. --[ 001.005 - DECREE OS The binaries that made up the Cyber Grand Challenge's challenges were not standard Linux binaries. Instead, they were binaries for an OS, called DECREE, that was created specifically for the Cyber Grand Challenge. This OS was extremely simple: there were 7 system calls, and no persistence (i.e., filesystem storage, etc). The DECREE syscalls are: 1. terminate: the equivalent of exit() 2. transmit: the equivalent of send() 3. receive: the equivalent of recv() 4. fdwait: the equivalent of select() 5. allocate: the equivalent of mmap() 6. deallocate: the equivalent of munmap() 7. random: a system call that would generate random data The hardware platform for the binaries was 32-bit x86. Challenge authors were not allowed to include inline assembly code, only code which was produced by clang or included in a library provided by DARPA. The provided math library included instructions such as floating point logarithms and trigonometric operations. All other code had to be produced directly by the compiler (clang). This simple environment model allowed competitors to focus on their program analysis techniques, rather than the implementation details that go along with support complex OS environments. The DECREE OS was, in the opinion of many of us, the biggest thing that made the CGC possible with humanity's current level of technology. --[ 002 - Finding Bugs ----------------------------------------- | Driller | | | | ++++++++++++++ ++++++++++++++ | | + angr + =====> + AFL + | | + + + + | | + Symbolic + + Genetic + | | + Tracing + <===== + Fuzzing + | | ++++++++++++++ ++++++++++++++ | | | ----------------------------------------- There are a few things we considered when we thought about how we wanted to find bugs in the Cyber Grand Challenge. Firstly, we will need to craft exploits using the bugs we find. As a result, we need to use techniques which generate inputs that trigger the bugs, not just point out that there could be a bug. Secondly, the bugs might be guarded by specific checks such as matching a command argument, password or checksum. Lastly, the programs which we need to analyze might be large, so we need techniques which scale well. The automated bug finding techniques can be divided into three groups: static analysis, fuzzing, and symbolic execution. Static analysis is not too useful as it doesn't generate inputs which actually trigger the bug. Symbolic execution is great for generating inputs which pass difficult checks, however, it scales poorly in large programs. Fuzzing can handle fairly large programs, but struggles to get past difficult checks. The solution we came up with is to combine fuzzing and symbolic execution, into a state-of-the-art guided fuzzer, called Driller. Driller uses a mutational fuzzer to exercise components within the binary, and then uses symbolic execution to find inputs which can reach a different component. * Fuzzing Driller leverages a popular off-the-shelf fuzzer, American Fuzzy Lop. AFL uses instrumentation to identify the transitions that a particular input exercises when it is passed to the program. These transitions are tuples of source and destination basic blocks in the control flow graph. New transition tuples often represent functionality, or code paths, that has not been exercised before; logically, these inputs containing new transition tuples are prioritized by the fuzzer. To facilitate the instrumentation, we use a fork of QEMU, which enables the execution of DECREE binaries. Some minor modifications were made to the fuzzer and to the emulation of DECREE binaries to enable faster fuzzing as well as finding deeper bugs: - De-randomization Randomization by the program interferes with the fuzzer's evaluation of inputs - an input that hits an interesting transition with one random seed, may not hit it with a different random seed. Removing randomness allows the fuzzer to explore sections of the program which may be guarded by randomness such as "challenge-response" exchanges. During fuzzing, we ensure that the flag page is initialized with a constant seed, and the random system call always returns constant values so there is no randomness in the system. The Exploitation component of our CRS, is responsible for handling the removal of randomness. - Double Receive Failure The system call receive fails after the input has been completely read in by the program and the file descriptor is now closed. Any binary which does not check the error codes for this failure may enter an infinite loop, which slows down the fuzzer dramatically. To prevent this behavior, if the receive system call fails twice because the end-of-file has been reached, the program is terminated immediately. - At-Receive Fork Server AFL employs a fork server, which forks the program for each execution of an input to speed up fuzzing by avoiding costly system calls and initialization. Given that the binary has been de-randomized as described above, all executions of it must be identical up until the first call to receive, which is the first point in which non-constant data may enter the system. This allows the fork-server to be moved from the entry of the program, to right before the first call to receive. If there is any costly initialization of globals and data structures, this modification speeds up the fuzzing process greatly. * Network Seeds Network traffic can contain valuable seeds which can be given to the fuzzer to greatly increase the fuzzing effectiveness. Functionality tests exercise deep functionality within the program and network traffic from exploits may exercise the particular functionality which is buggy. To generate seeds from the traffic, each input to the program is run with the instrumentation from QEMU to identify if it hits any transitions which have not been found before. If this condition is met, the input is considered interesting and is added as a seed for the fuzzer. * Adding Symbolic Execution Although symbolic execution is slow and costly, it is extremely powerful. Symbolic execution uses a constraint solver to generate specific inputs which will exercise a given path in the binary. As such, it can produce the inputs which pass a difficult check such as a password, a magic number, or even a checksum. However, an approach based entirely on symbolic execution will quickly succumb to path explosion, as the number of paths through the binary exponentially increases with each branch. Driller mitigates the path-explosion problem by only tracing the paths which the fuzzer, AFL, finds interesting. This set of paths is often small enough that tracing the inputs in it is feasible within the time constraints of the competition. During each symbolic trace, Driller attempts to identify transitions that have not yet been exercised by the fuzzer, and, if possible, it generates an input which will deviate from the trace and take a new transition instead. These new inputs are fed back into the fuzzer, where they will be further mutated to continue exercising deeper paths, following the new transitions. * Symbolic Tracing To ensure that the symbolic trace using angr's concolic execution engine is identical to the native execution trace, we use pre-constraining. In pre-constrained execution, each byte of input is constrained to match the original byte of input that was used by the fuzzer. When a branch is reached that would lead to a new transition, the pre-constraints are removed and then the solver is queried for an input which reaches the new transition. Pre-constraining also has the benefit of greatly improving execution time, because one does not need to perform expensive solves to determine the locations of reads and writes to memory because all variables have only one possible value. --[ 003 - The Exploiting Component Crashes Inputs || || || || ======================== || || || || || || || \/ \/ \/ -------------------- -------------------- -------------------- | REX | | PovFuzzer | | Colorguard | | | | | | | | ++++++++++++++ | | ++++++++++++++ | | ++++++++++++++ | | + Advanced + | | + Fuzzing + | | + Finding + | | + AEG + | | + For + | | + Leaks + | | + + | | + Exploits + | | + + | | ++++++++++++++ | | ++++++++++++++ | | ++++++++++++++ | | | | | | | | ... | | | | | -------------------- -------------------- -------------------- || || || ================================================ || \/ POVs * Exploitation In the Cyber Grand Challenge, stealing flags is a little different than it is in ordinary Capture the Flag games. Instead of reading a secret "flag" file and submitting the contents to the organizers, an exploit is demonstrated by submitting a Proof of Vulnerability (POV), which the organizers will run. A POV is a binary program, which will communicate with the opponent's binary and "exploit" it. There are two ways in which a POV can be considered successful: - Type 1: Cause a segmentation fault where the instruction pointer and one additional register have values which were previously negotiated with the competition infrastructure. Must control at least 20 bits of EIP as well as 20 bits of another register. - Type 2: Read 4 contiguous bytes from the secret flag data. The flag data is located at 0x4347c000-0x4347cfff and is randomly initialized by the kernel. Different types of vulnerabilities might lend themselves to a specific type of POV. For example, a vulnerability where the user can control the address passed to puts() might only be usable as a Type 2 POV. On the other hand, a stack-based buffer overflow can clearly be used to create a Type 1 exploit simply by setting a register and EIP, but it can also be used to create a Type 2 exploit by using Return Oriented Programming or by jumping to shellcode which prints data from the flag page. * Overview The basic design for Mechanical Phish's automatic exploitation is to take crashes, triage them, and modify them to create exploits. Mechanical Phish does not need to understand the root cause of the bug, instead it only needs to identify what registers and memory it controls at crash time, and how those values can be set to produce a POV. We created two systems which are designed to go from crashes to exploits. - PovFuzzer Executes the binary repeatedly, slightly modifying the input, tracking the relationship between input bytes and registers at the crash point. This method is fast, but cannot handle complex cases. - Rex Symbolically executes the input, tracking formulas for all registers and memory values. Applies "techniques" on this state, such as jumping to shellcode, and return oriented programming to create a POV. Now this design was missing one important thing. For some challenges the buggy functionality does not result in a crash! Consider a buffer over-read, where it reads passed the end of the buffer. This may not cause a crash, but if any copy of flag data is there, then it will print the secret data. To handle this, we added a third component. - Colorguard Traces the execution of a binary with a particular input and checks for flag data being leaked out. If flag data is leaked, then it uses the symbolic formulas to determine if it can produce a valid POV. --[ 003.001 - PovFuzzer The PovFuzzer takes a crash and repeatedly changes a single byte at a time until it can determine which bytes control the EIP as well as another register. Then a Type 1 POV can be constructed which simply chooses the input bytes that correspond to the negotiated EIP and register values, inserts the bytes in the payload, and sends it to the target program. For crashes which occur on a dereference of controlled data, the PovFuzzer chooses bytes that cause the dereference to point to the flag page, in the hope that flag data will be printed out. After pointing the dereference at the flag page, it executes the program to check if flag data is printed out. If so, it constructs a Type 2 POV using that input. The PovFuzzer has many limitations. It cannot handle cases where register values are computed in a non-trivial manner, such as through multiplication. Furthermore it cannot handle construction of more complex exploits such as jumping to shellcode, or where it needs to replay a random value printed by the target program. Even so, it is useful for a couple reasons. Firstly, it is much faster than Rex, because it only needs to execute the program concretely. Secondly, although we don't like to admit it, angr might still have occasional bugs, and the PovFuzzer is a good fallback in those cases as it doesn't rely on angr. --[ 003.002 - Rex The general design of Rex is to take a crashing payload, use angr to symbolically trace the program with the crashing payload, collecting symbolic formulas for all memory and input along the way. Once we hit the point where the program crashes, we stop tracing, but use the constraint solver to pick values that either make the crash a valid POV, or avoid the crash to explore further. There are many ways we can choose to constrain the values at this point, each of which tries to exploit the program in a different way. These methods of exploiting the program are called "techniques". One quick thing to note here is that we include a constraint solver in our POVs. By including a constraint solver we can simply add all of the constraints collected during tracing and exploitation into the POV and then ask the constraint solver at runtime for a solution that matches the negotiated values. The constraint solver, as well as angr, operates on bit-vectors enabling the techniques to be bit-precise. Here we will describe the various "techniques" which Rex employs. - Circumstantial Exploit This technique is applicable for ip-overwrite crashes and is the simplest technique. It determines if at least 20 bits of the instruction pointer and one register are controlled by user input. If so, an exploit is constructed that will negotiate the register and ip values and then solve the constraints to determine the user input that sets them correctly. - Shellcode Exploit Also only applicable to ip-overwrites, this technique will search for regions of executable memory that are controlled by user input. The largest region of controlled executable memory is chosen and the memory there is constrained to be a nop-sled followed by shellcode to either prove a type-1 or type-2 vulnerability. A shellcode exploit can function even if the user input only controls the ip value, and not an additional register. - ROP Exploit Although the stack is executable by default, opponents might employ additional protections that prevent jumping to shellcode, such as remapping the stack, primitive Address Randomization or even some form of Control Flow Integrity. Return Oriented Programming (ROP) can bypass incomplete defenses and still prove vulnerabilities for opponents that employ them. It is applicable for ip-overwrite crashes as long as there is user data near the stack pointer, or the binary contains a gadget to pivot the stack pointer to the user data. - Arbitrary Read - Point to Flag A crash that occurs when the program tries to dereference user- controlled data is considered an "arbitrary read". In some cases, by simply constraining the address that will be dereferenced to point at flag data, the flag data will be leaked to stdout, enabling the creation of a type-2 exploit. Point to Flag constrains the input to point at the flag page, or at any copy of flag data in memory, then uses Colorguard to determine if the new input causes an exploitable leak. - Arbitrary Read/Write - Exploration In some cases, the dereference of user-controlled data can lead to a more powerful exploit later. For example, a vtable overwrite will first appear as an arbitrary read, but if that memory address points to user data, then the read will result in a controlled ip. To explore arbitrary reads/writes for a better crash, the address of the read or write is constrained to point to user data, then the input is re-traced as a new crash. - Write-What-Where If the input from the user controls both the data being written and the address to which it is written, we want to identify valuable targets to overwrite. This is done by symbolically exploring the crash, to identify values in memory that influence the instruction pointer, such as return addresses or function pointers. Other valuable targets are pointers that are used to print data; overwriting these can lead to type-2 exploits. --[ 003.003 - Colorguard As explained above, there are some challenges which include vulnerabilities that can leak flag data, but do not cause a crash. One challenge here is that it is difficult to detect when a leak occurs. You can check if any 4 bytes of output data are contained in the flag page, but this will have false positives while fuzzing, and it will miss any case where the data is not leaked directly. The challenge authors seem to prefer to xor the data or otherwise obfuscate the leak, maybe to prevent such a method. To accurately detect these leaks we chose to trace the inputs symbolically, using angr. However, symbolic tracing is far too slow to run on every input that the fuzzer generates. Instead, we only perform the symbolic tracing on the inputs which the fuzzer considers "interesting". The hope here is that the leak causing inputs have a new transition, or number of loops which is unique, and that the fuzzer will consider it interesting. There are definitely cases where this doesn't work, but it's a fairly good heuristic for reducing the number of traces. In an effort to further combat the slowness of symbolic execution, Colorguard takes advantage of angr's concrete emulation mode. Since no modification of the input has to be made if a flag leak is discovered, our input is made entirely concrete, with the only symbolic data being that from the flag page. This allows us to only execute symbolically those basic blocks that touch the secret flag page contents. Colorguard traces the entire input concretely and collects the symbolic expression for the data that is printed to stdout. The expression is parsed to identify any four consecutive bytes of the flag page that are contained in the output. For each set of bytes, the solver is queried to check if we can compute the values of the bytes only from the output data. If so, then an exploit is crafted which solves for these four bytes after receiving the output from the program execution. One caveat here is that challenges may use many bytes of the flag page as a random seed. In these cases we might see every byte of the flag page as part of the expression for stdout. Querying the constraint solver for every one of these consecutive four byte sequences is prohibitively slow, so it is necessary to pre-filter such expressions. Colorguard does this pre-filter during the trace by replacing any expression containing more than 13 flag bytes with a new symbolic variable. The new symbolic variable is not considered a potential leak. The number 13 was arbitrarily chosen as it was high enough to still detect all of the leaks we had examples for, but low enough that checking for leaks was still fast. --[ 003.004 - Challenge Response A common pattern that still needs to be considered is where the binary randomly chooses a value, outputs it, and then requires the user to input that value or something computed from the random value. For example, the binary prints "Solve the equation: 6329*4291" and then the user must input "27157739". To handle these patterns in the exploit, we identify any constraints that involve both user input and random data. Once identified, we check the output that has been printed up to that point if it contains the random data. If so, then we have identified a challenge-response. We will include the output as a variable in the constraints that are passed to the exploit, and then read from stdout, adding constraints that the output bytes match what is received during the exploit. Then the solver can be queried to generate the input necessary for the correct "response". --[ 004 - The Patching Component: Patcherex Original Binary (CB) || || ||=============================================== || || || || \/ \/ -------------------- -------------------- -------------------- | TECHNIQUES | | PATCHES | | BACKENDS | | | | | | | | ++++++++++++++ | | | | | | + Return + | | - AddROData() | | | | + Pointer + --------> - AddCode() | | +++++++++++++++ | | + Encryption + | | - ... | | + Detour + | | ++++++++++++++ | | | | + Backend + | | | | | | + + | | ++++++++++++++ | | | | +++++++++++++++ | | + Transmit + | | - InsertCode() | | | | + Protection + --------> - AddRWData() |==>| OR | | + + | | - ... | | | | ++++++++++++++ | | | | +++++++++++++++ | | | | | | + Reassembler + | | ++++++++++++++ | | | | + Backend + | | + + | | - AddCode() | | + + | | + Backdoor + --------> - AddRWData() | | +++++++++++++++ | | + + | | - ... | | | | ++++++++++++++ | | | | | | | | | | | | ... | | | | | -------------------- -------------------- -------------------- || || \/ Replacement Binary (RB) Patcherex, which is built on top of angr, is the central patching system of Mechanical Phish. As illustrated in the overview image, Patcherex is composed of three major components: techniques, patches, and patching backends. * Techniques A technique is the implementation of a high-level patching strategy. A set of patches (described below) with respect to a binary are generated after applying a technique on it. Currently Patcherex implements three different types of techniques: - Generic binary hardening techniques, including Return Pointer Encryption, Transmit Protection, Simple Control-Flow Integrity, Indirect Control-Flow Integrity, Generic Pointer Encryption; - Techniques aiming at preventing rivals from analyzing or stealing our patches, including backdoors, anti-analysis techniques, etc.; - Optimization techniques that make binaries more performant, including constant propagation, dead assignment elimination, and redundant stack variables removal. * Patches A Patch is a low-level description of how a fix or an improvement should be made on the target binary. Patcherex defines a variety types of patches to perform tasks ranging from code/data insertion/removal to segment altering. * Backends A backend takes patches generated from one or more techniques, and applies them on the target binary. Two backends available in Patcherex: - ReassemblerBackend: this backend takes a binary, completely disassembles the entire binary, symbolizes all code and data references among code and data regions, and then generate an assembly file. It then apply patches on the assembly file, and calls an external assembler (it was clang for CGC) to reassemble the patched assembly file to the final binary. - DetourBackend: this backend acts as a fallback to the ReassemblerBackend. It performs in-line hooking and detouring to apply patches. --[ 004.001 - Patching Techniques In this section, we describe all techniques we implemented in Patcherex. Obviously we tried to implement techniques that will prevent exploitation of the given CBs, by making their bugs not exploitable. In some cases, however, our techniques do not render the bugs completely unexploitable, but they still force an attacker to adapt its exploits to our RBs. For instance, some techniques introduce differences in the memory layout between our generated RB and the original CB an attacker may have used to develop its exploit. In addition, we try to prevent attackers from adapting exploits to our RB by adding anti-analysis techniques inside our generated binaries. Furthermore, although we put a significant effort in minimizing the speed and memory impact of our patches, it is often impossible to have performance impact lower than 5% (a CB score starts to be lowered when it has more than 5% of speed or memory overhead). For this reason we decided to "optimize" the produced RB, as we will explain later. --[ 004.001.001 - Binary Hardening Techniques We implemented some techniques for generic binary hardening. Those general hardening techniques, although not extremely complex, turned out to be very useful in the CFE. Vulnerability-targeted hardening (targeted patching) was also planned initially. However, due to lack of manpower and fear for deploying replacement CBs too many times for the same challenge, we did not fully implement or test our targeted patching strategies. * Return Pointer Encryption This technique was designed to protect from classical stack buffer overflows, which typically give an attacker control over an overwritten saved return pointer. Our defense mechanism "encrypts" every return pointer saved on the stack when a function is called, by modifying the function's prologue. The encrypted pointer is then "decrypted" before every ret instruction terminating the same function. For encryption and decryption we simply xor'ed the saved return pointer with a nonce (randomly generated during program's startup). Since the added code is executed every time a function is called, we take special care in minimizing the performance impact of this technique. First of all, this technique is not applied in functions determined as safe. We classify a function as safe if one of these three conditions is true: - The function does not access any stack buffer. - The function is called by more than 5 different functions. In this case, we assume that the function is some standard "utility" function, and it is unlikely that it contains bugs. Even if it contains bugs, the performance cost of patching such a function is usually too high. - The function is called by printf or free. Again, we assume that library functions are unlikely to contain bugs. These common library functions are identified by running the functions with test input and output pairs. This function identification functionality is offered by a separate component (the "Function identifier" mentioned in the "Warez" Section). To further improve the performance, the code snippet that encrypts and decrypts the saved return pointer uses, when possible, a "free" register to perform its computations. This avoids saving and restoring the value of a register every time the injected code is executed. We identify free registers (at a specific code location) by looking for registers in which a write operation always happens before any read operation. This analysis is performed by exploring the binary's CFG in a depth-first manner, starting from the analyzed code location (i.e., the location where the code encrypting/decrypting the return pointer is injected). Finally, to avoid negatively impacting the functionality of the binary, we did not patch functions in which the CFG reconstruction algorithm has problems in identifying the prologue and the epilogues. In fact, in those cases, it could happen that if an epilogue of the analyzed function is not identified, the encrypted return address will not be decrypted when that epilogue is reached, and consequently the program will use the still-encrypted return pointer on the stack as the return target. This scenario typically happens when the compiler applies tail-call optimizations by inserting jmp instructions at the end of a function. * Transmit Protection As a defense mechanism against Type 2 exploits, we inject code around the transmit syscall so that a binary is forbidden from transmitting any 4 contiguous bytes of the flag page. The injected code uses an array to keep track of the last transmitted bytes so that it can identify cases in which bytes of the flag page are leaked one at a time. * Simple Control-Flow Integrity To protect indirect control flow instructions (e.g., call eax, jmp ebx), we inject, before any instruction of this kind, code that checks specific properties of the target address (i.e., the address which the instruction pointer will be after the call or jump). The specific checks are: - The target address must be an allocated address (to prevent an attacker from using the indirect control-flow instruction to directly perform a Type 1 attack). To do so, we try to read from the target address, so that if the address does not point to an allocated region the program will crash before the instruction pointer is modified. This prevents simple Type 1 exploits, because the attacker must control at least 20 bits of the instruction pointer, and it is unlikely (but not impossible) that the negotiated value will end up inside an allocated memory region. - The target address must be inside the memory range where the binary's code is typically loaded (for simplicity, we consider a "potential code" address to be any below 0x4347c000). To avoid breaking programs that use dynamically allocated code we do not perform this check if we statically detect that the analyzed program calls the allocate syscall in a way which will create additional executable memory. - The target address is not a pop instruction. As a partial mitigation against ROP attacks, we dynamically check if the target of indirect calls is a pop instruction, and terminate the program otherwise. * Uninitialized Data Cleaning For each function, we identify all of the instructions that read and write to stack variables, and the stack offset that is accessed. If there is any path through the CFG such that a stack variable is read before it is written, then we consider it possible that there is an uninitialized data usage. For each variable that is detected in an uninitialized data usage, we zero that variable by adding stack cleaning code at the beginning of the function. * Stack Base Address Randomization On program's startup we add a random value (which can assume any 16-byte aligned value between 16 and 1024) to the stack pointer address. This adds indeterminism to the position of the stack, hindering any exploit making assumptions on the program's stack layout. * malloc Protection To interfere with exploitation of heap overflows, if we are able to identify a malloc-like function inside the analyzed CB, we slightly modify its behavior. In particular, we change the amount of bytes allocated by a small, pseudo-random, value. * printf Protection For every printf-like function identified, such as printf, snprintf, etc., we ensure that the function is not used to perform a "format string" attack. Specifically, if the format string parameter is neither in the binary's read-only memory nor a string already present in the binary, we stop the execution of the binary if: - The format string parameter contains a meta character (e.g., "%"). - The format string parameter points to the flag page. --[ 004.001.002 - Adversarial Techniques Certain techniques are introduced to prevent rivals from analyzing, or even running our RBs in a controlled environment, while leaving those RBs still able to run in the real game environment. These techniques are presented in this section. * Anti-analysis We add some code, executed before the original entry point of the binary, to interfere with analyses that other teams could perform on our patched binary. Specifically, we add code to: - Detect if the binary is executed using QEMU or PIN. To do so, we probe the implementation of different aspects that are difficult to emulate correctly, such as segment registers, transmission of partially allocated memory regions, syscall error codes values in case of "double failure". In addition, we add some code triggering a previously unknown implementation bug in QEMU, making it stall. Specifically, during the development of our CRS, we found that QEMU, when using user-mode emulation, does not correctly handle taking the square root of an "un-normal" floating point number, that is, a nonzero 80-bit float whose explicit integer bit (the highest bit of the mantissa) is zero. When this happens, QEMU will hang forever. Because of this anti-QEMU patch, some of our RBs broke the live visualization during the CFE public event. - Interfere with symbolic execution engines. Our adversarial code contains self-modifying code designed to be extremely hard to simulate correctly and efficiently by a symbolic execution engine. In addition, some added code is specifically designed to trigger "path explosion" conditions. - Interfere with automatic exploitation systems. We add code to transmit the flag page to file descriptor 2 (stderr). Although data transmitted to this file descriptor is not sent to a CRS interacting with a binary, an opponent could mistakenly assume that any contiguous 4 bytes transmitted from the flag page constitutes a Type 2 exploit and thus fielding a POV trying to leverage this "fake" leak. In addition, we inject "fake backdoor" code. This code is triggered if a CRS sends a specific 4-byte sequence. When triggered, it reads from the CRS 8 bytes used to set the program instruction pointer and the value of a register. For this reason, this code looks like easily exploitable to create a Type 1 POV, however the read values are xor'ed with a random value, before being used to set the instruction pointer and the register, making this code non-exploitable. In addition, to counteract an adversary fielding one of our own patched binaries as its own patched binary, we inject a backdoor in every fielded patched binary. This backdoor can be used by our CRS to exploit the patched binary we generate, but it is designed to be extremely hard to be exploited from other teams' CRSs. The backdoor is triggered when a specific 4-byte sequence is received. To detect this, the function wrapping the receive syscall is modified to keep track of the first 4 bytes a program receives. Once triggered, the backdoor sends to the CRS a "challenge" C (which is a 19-bit value), and the CRS responds with a response R (a 64-bit value). Then, the backdoor code checks if the following condition is true: first_32_bits_of(SHA1(pad(R,160))) == pad(C,32), where pad(A,N) is a function padding the input value A up to N bits by adding zeros. The challenge can be easily solved by pre-computing all the possible responses, but it is impossible for an opponent's POV to compute a solution for the challenge within the game-imposed 10-second timeout. --[ 004.001.003 - Optimization Techniques Performance is a vital concern of our patching strategy. While we stress the necessity of optimizing all our patching techniques, some overhead cannot be avoided. From analyzing binaries collected from CQE and CFE samples, we noticed that most of them are compiled with O0, i.e., without optimization enabled. We do not know why organizers decided not to optimize most of the provided challenges, but we speculated that this may have been decided to leave room for optimizations and patching. It is well-known that O0 and O1 binaries can have a huge difference in execution time. Fortunately, some of the optimization methods used in O1 are not that difficult to perform directly on binaries. Further, angr provides all necessary data-flow analysis techniques, which makes the whole optimization development easier. Finally, with the help of the ReassemblerBackend, we can easily fully remove instructions that we want to get rid of, without having to replace them with nops. Therefore, we implemented some basic in-line binary optimization techniques in order to optimize O0 binaries in CFE, which are described below. - Constant Propagation. We propagate constants used as immediates in each instruction, and eliminate unnecessary mov instructions in assembly code. - Dead Assignment Elimination. Many unnecessary assignments occur in unoptimized code. For example, in unoptimized code, when a function reads arguments passed from the stack, it will always make a copy of the argument into the local stack frame, without checking if the argument is modified or not in the local function. We perform a conservative check for cases where a parameter is not modified at all in a function and the copy-to-local-frame is unnecessary. In this case, the copy-to-local instruction is eliminated, and all references to the corresponding variable on the local stack frame are altered to reference the original parameter on the previous stack frame. Theoretically, we may break the locality, but we noticed some improvement in performance in our off-line tests. - Redundant Stack Variable Removal. In unoptimized code, registers are not allocated optimally, and usually many registers end up not being used. We perform a data-flow analysis on individual functions, and try to replace stack variables with registers. This technique works well with variables accessed within tight loops. Empirically speaking, this technique contributes the most to the overall performance gain we have seen during testing. Thanks to these optimizations, our patches often had *zero* overall performance overhead. --[ 004.002 - Patches The techniques presented in the previous section return, as an output, lists of patches. In Patcherex, a patch is a single modification to a binary. The most important types of patches are: - InsertCodePatch: add some code that is going to be executed before an instruction at a specific address. - AddEntryPointPatch: add some code that is going to be executed before the original entry point of the binary. - AddCodePatch: add some code that other patches can use. - AddRWData: add some readable and writable data that other patches can use. - AddROData: add some read-only data that other patches can use. Patches can refer to each other using an easy symbol system. For instance, code injected by an InsertCodePatch can contain an instruction like call check_function. In this example, this call instruction will call the code contained in an InsertCodePatch named check_function. --[ 004.003 - Backends We implemented two different backends to inject different patches. The DetourBackend adds patches by inserting jumps inside the original code, whereas the ReassemblerBacked adds code by disassembling and then reassembling the original binary. The DetourBackend generates bigger (thus using more memory) and slower binaries (and in some rare cases it cannot insert some patches), however it is slightly more reliable than the ReassemblerBackend (i.e., it breaks functionality in slightly less binaries). * DetourBackend This backend adds patches by inserting jumps inside the original code. To avoid breaking the original binary, information from the CFG of the binary is used to avoid placing the added jmp instruction in-between two basic blocks. The added jmp instruction points to an added code segment in which first the code overwritten by the added jmp and then the injected code is executed. At the end of the injected code, an additional jmp instruction brings the instruction pointer back to its normal flow. In some cases, when the basic block that needs to be modified is too small, this backend may fail applying an InsertCodePatch. This requires special handling, since patches are not, in the general case, independent (a patch may require the presence of another patch not to break the functionality of a binary). For this reason, when this backend fails in inserting a patch, the patches "depending" from the failed one are not applied to the binary. * ReassemblerBackend ReassemblerBackend fully disassembles the target binary, applies all patches on the generated assembly, and then assembles the assembly back into a new binary. This is the primary patching backend we used in the CFE. Being able to fully reassemble binaries greatly reduces the performance hit introduced by our patches, and enables binary optimization, which improves the performance of our RBs even further. Also, reassembling usually changes base addresses and function offsets, which achieves a certain level of "security by obscurity" -- rivals will have to analyze our RBs if they want to properly adapt their code-reusing and data-reusing attacks. We provide an empirical solution for binary reassembling that works on almost every binary from CQE and CFE samples. The technique is open-sourced as a component in angr, and, after the CGC competition, it has been extended to work with generic x86 and x86-64 Linux binaries. A detailed explanation as well as evaluation of this technique is published as an academic paper [Ramblr17]. --[ 004.004 - Patching Strategy We used Patcherex to generate, for every CB, three different Replacement Binaries (RBs): - Detouring RB: this RB was generated by applying, using the DetourBackend, all the patches generated by the hardening and adversarial techniques presented previously. - Reassembled RB: this RB was generated by applying the same patches used by the Detouring RB, but using the ReassemblerBackend instead of the DetourBackend. - Optimized Reassembled RB: this RB was generated as the Reassembled one, but, in addition all the patches generated by the optimization techniques were added. These three patched RBs have been listed in order of decreasing performance overhead and decreasing reliability. In other words, the Detouring RB is the most reliable (i.e., it has the smallest probability of having broken functionality), but it has the highest performance overhead with respect to the original unpatched binary. On the contrary the Optimized Reassembled RB is the most likely to have broken functionality, but it has a lower performance impact. --[ 004.005 - Replacement Binary Evaluation * Pre-CFE Evaluation During initial stages of Patcherex development, testing of the RBs was done by an in-house developed component called Tester. Internally, Tester uses cb-test, a utility provided by DARPA for testing a binary with pre-generated input and output pairs, called polls. We made modifications to cb-test, which enabled the testing of a binary and its associated IDS rules on a single machine, whereas the default cb-test needs 3-machines to test a binary with IDS rules. Tester can perform both performance and functionality testing of the provided binary using the pre-generated polls for the corresponding binary using its Makefile. For functionality testing, given a binary, we randomly pick 10 polls and check that the binary passes all the polls. For performance testing, we compute the relative overhead of the provided binary against the unpatched one using all the polls. However, there was huge discrepancy (~10%) between the performance overhead computed by us and that provided during sparring partner sessions for the same binaries. Moreover, during the sparring partner sessions, we also noticed that performance numbers were different across different rounds for the same binaries. Because of these discrepancies and to be conservative, for every patching strategy, we computed the performance overhead as the maximum overhead across all rounds of sparring partner sessions during which RBs with corresponding patching strategy are fielded. During internal testing we used all the available binaries publicly released on GitHub (some of which were designed for the CQE event, whereas others were sample CFE challenges). To further extend our test cases, we recompiled all the binaries using different compilation flags influencing the optimization level used by the compiler. In fact, we noticed that heavily optimized binaries (e.g., -O3), were significantly harder to analyze and to patch without breaking functionality. Specifically, we used the following compilations flags: -O0, -Os, -Oz, -O1, -O2, -O3, -Ofast. Interestingly, we noticed that some of the binaries, when recompiled with specific compilation flags, failed to work even when not patched. During the final stages of Patcherex development, we noticed that almost all generated RBs never failed the functionality and that the performance overhead was reasonable except for a few binaries. This, combined with the discrepancy inherent in performance testing, led us not to use any in-depth testing of our replacement binaries during the CFE. * CFE Evaluation For every RB successfully generated, Patcherex first performs a quick test of their functionality. The test is designed to spot RBs that are broken by the patching strategy. In particular, Patcherex only checks that every generated RB does not crash when provided with a small test set of hardcoded input strings ("B", "

", "











", etc.) We decided to perform only a minimal tests of the functionality because of for performance and reliability considerations. --[ 004.006 - Qualification Round Approaches It is worth mentioning that for the CGC qualification round, the rules were very different from the final round. Between this and the fact that our analysis tools were not yet mature it was necessary for us to approach patching very differently from the final round approaches previously described. In the qualification round, the only criteria for "exploitation" was a crash. If you could crash a binary, it meant that you could exploit it, and if your binary could crash, it meant that you were vulnerable. Furthermore, the qualification scoring formula was such that your "defense" score, i.e. how many vulnerabilities your patch protected against, was a global multiplier for your score between zero and one. This meant that if you didn't submit a patch for a binary, or if your patch failed to protect against any of the vulnerabilities, you received zero points for that challenge, regardless of how well you were able to exploit it. This is such an unconventional scoring system that when we analyzed the (publicly available) patches produced by other teams for the qualification round, we found that at least one qualifying team had a patching strategy such that whenever they discovered a crash, they patched the crashing instruction to simply call the exit syscall. This is technically not a crash, so the teams that did this did in fact receive defense points for the challenges, and accordingly did in fact receive a non-negligible score for effectively stubbing out any vaguely problematic part of the binary. Our approaches were slightly more nuanced! There were two techniques we developed for the qualification round, one "general" technique, meaning that it could be applied to a program without any knowledge of the vulnerabilities in a binary, and one "targeted" technique, meaning that it was applied based on our CRS' knowledge of a vulnerability. Each of the techniques could produce several candidate patched binaries, so we had to choose which one to submit in the end - our choice was based on some rudimentary testing to try to ascertain if the binary could still crash, and if not, to assess the performance impact of the patch. It is important to notice for CQE, the patched binaries were tested by the organizers against a fixed set of pre-generated exploits. For this reason, our patched binaries just had to prevent to be exploited when run against exploit developed for the original, unpatched, version of the program. In other words, the attacker had no way to adapt its exploits to our patches and so "security trough obscurity" techniques were extremely effective during the qualification event. --[ 004.006.001 - Fidget Our "general" patching technique for CQE was a tool called Fidget. Fidget was developed the summer prior to the announcement of the CGC for use in attack-defense CTFs. Its basic intuition is that the development of an attack makes a large number of very strong assumptions about the internal memory layout of a program, so in many cases simply tweaking the layout of stack variables is a reliable security-through-obscurity technique. At the time of the CQE, Fidget was a tool capable of expanding function stack frames, putting unused space in between local variables stored on the stack. This is clearly not sufficient to prevent crashes, as is necessary for the strange qualification scoring formula. However, the tool had an additional mode that could control the amount of padding that was inserted; the mode that we used attempted to insert thousands of bytes of padding into a single stack frame! The idea here is, of course, that no overflow attack would ever include hundreds of bytes more than strictly necessary to cause a crash. The primary issue with Fidget is that it's pretty hard to tell that accesses to different members or indexes of a variable are actually accesses to the same variable! It's pretty common for Fidget to patch a binary that uses local array and struct variables liberally, and the resulting patched binary is hilariously broken, crashing if you so much as blow on it. There are a huge number of heuristics we apply to try not to separate different accesses to the same variable, but in the end, variable detection and binary type inference are still open problems. As a result, Fidget also has a "safe mode" that does not try to pad the space in between variables, instead only padding the space between local variables and the saved base pointer and return address. We originally planned to use Fidget in the final round, since it had the potential to disrupt exploits that overflow only from one local variable into an adjacent one, something that none of our final-round techniques can address. However, it was cut from our arsenal at the last minute upon the discovery of a bug in Fidget that was more fundamental than our ability to fix it in the limited time available! Unfortunate... --[ 004.006.002 - CGrex If we know that it's possible for a binary to crash at a given instruction, why don't we just add some code to check if that specific instruction would try to access unmapped or otherwise unusable memory, and if so exit cleanly instead of crashing? This is exactly what CGrex does. Our reassembler was not developed until several weeks before the CGC final event, so for the qualification round CGrex was implemented with a primitive version of what then became the DetourBackend. Once our CRS found a crash, the last good instruction pointer address was sent to CGrex, which produced a patched binary that replaced all crashing instructions with jumps to a special inserted section that used some quirks in some syscalls to determine if the given memory location was readable/writable/executable, and exit cleanly if the instruction would produce a crash. More precisely, CGrex takes, as input, a list of POVs and a CB and it outputs a patched CB "immune" against the provided POVs. CGrex works in five steps: 1) Run the CGC binary against a given POV using a modified QEMU version with improved instruction trace logging and able to run CGC binaries. 2) Detect the instruction pointer where the POV generates a crash (the "culprit instruction"). 3) Extract the symbolic expression of the memory accesses performed by the "culprit instruction" (by using Miasm). For instance, if the crashing instruction is mov eax, [ebx*4+2] the symbolic expression would be ebx*4+2. 4) Generate "checking" code that dynamically: - Compute the memory accesses that the "culprit instruction" is going to perform. - Verify that these memory accesses are within allocated memory regions (and so the "culprit instruction" is not going to crash). To understand if some memory is allocated or not CGrex "abuses" the return values of the random and fdwait syscalls. In particular these syscalls were used by passing as one of the parameters the value to be checked. The kernel code handling these functions verifies that, for instance, the pointer were the number of random bytes returned by random is written is actually pointing to writable memory and it returns a specific error code if not. CGrex checks this error code to understand if the tested memory region is allocated. Special care is taken so that, no matter if the tested memory location is allocated or not, the injected syscall will not modify the state of the program. - If a memory access outside allocated memory is detected, the injected code just calls exit. 5) Inject the "checking" code. Steps 1 to 5 are repeated until the binary does not crash anymore with all the provided POVs. --[ 005 - Orchestration +--------+ +---------+ CGC endpoints | TI API | | IDS tap | +--------+ +---------+ . . / \ / \ | | -----------------------------|--------------|------------------------------ | | Mechanical Phish \ / \ / ' ' +------------+ +------------+ | Ambassador | |Network Dude| +------------+ +-+----------+ | | +------------+ +------------+ | | | Meister | | Scriba | | | +-----+------+ +-------+----+ | | | | | | +--------------------------------+ +--------+--------+ | | | Worker | | | | | | _----------_ | | | Poll creator AFL Driller | ( )<---------+ | | ============ === ======= | |`----------`|<-------------+ | Tester POV Tester POV Fuzzer | | |<---------------+ ====== ========== ========== | | Farnsworth | | Patcherex Colorguard Rex | ( ) | ========= ========== === | `----------` +--------------------------------+ Designing a fully autonomous system is a challenging feat from an engineering perspective too. In the scope of the CFE, the CRS was required to run without fault for at least 10 hours. Although it was proposed to allow debugging during the CFE, eventually, no human intervention was permitted. To that end, we designed our CRS using a microservice-based approach. Each logical part was split following the KISS principle ("Keep it simple, stupid") and the Unix philosophy ("Do one thing and do it well"). Specifically, the separation of logical units allowed us to test and work on every component in complete isolation. We leveraged Docker to run components independently, and Kubernetes to schedule, deploy, and control component instances across all 64 nodes provided to us. --[ 005.001 - Components Several different components of Mechanical Phish interacted closely together during the CRS (see diagram above). In the following, we will briefly talk about the role of each component. * Farnsworth Farnsworth is a Python-based wrapper around the PostgreSQL database, and stores all data shared between components: CBs, POVs, crashing inputs, synchronization structures, etc. In our design, we prohibited any direct communication between components and required them to talk "through" Farnsworth. Therefore, Farnsworth was a potential single point of failure. To reduce the associated risk for the CFE, we paid particular attention to possible database problems and mitigated them accordingly. * Ambassador Ambassador was the component that talked to the CGC Team Interface (TI) to retrieve CBs, obtain feedback, and submit RBs and POVs. In the spirit of KISS, this component is the only part of Mechanical Phish to communicate externally and the only source for the ground truth in respect to the game state. * Meister Meister coordinated Mechanical Phish. For each component, a component- specific creator decided which jobs should be run at any point in the game, based on information obtained through Farnsworth and written by Ambassador. Consequently, Meister decided which jobs to run based on the priority information of each job (as specified by the creator) and usage of the nodes in terms of CPU and memory. Note that, specifically, Meister and its creators were entirely stateless. At any point, it could crash, yet it would not kill existing jobs upon automatic restart if they were still considered important by the creators. * Scriba An important task of the CRS was to select and submit the POVs and RBs, respectively. Scriba looked at performance results of exploits and patches and decided what and when to submit (for more details on the selection strategy see Section 6 - Strategy). As mentioned previously, after Scriba decided what and when to submit, Ambassador actually submitted to the TI (as Ambassador is the only component allowed to communicate externally). * Network Dude The Network Dude component received UDP traffic coming from the IDS tap and stored it into the database via Farnsworth. Since it is required to receive packets at line-speed, neither parsing nor analysis of the network data was performed within Network Dude, instead, we relied different components to process the network traffic. * Worker Worker was the executor for analysis tasks of the CRS. It wrapped tools such as angr, Driller, Patcherex in a generic interface to be managed easily. In fact, every Worker instance referred to an entry in a jobs queue specifying task arguments and type. Since some of the workers had to execute CGC DECREE binaries for functionality and performance evaluation, we included a DECREE virtual machine running on QEMU within a worker. --[ 005.002 - Dynamic Resource Allocation Another advantage of our architecture design, alongside dependency isolation and ease of deployment, was the possibility to dynamically scale components to meet our needs. Except for the PostgreSQL database and some internal Kubernetes services, all our components could run on any node without limitation. Furthermore, when creating a job, Meister assigned it a priority based on the component and the current game status. For example, crash-generation jobs (Rex) were prioritized lower if an input crash was not considered reliable, or the analysis of a de-fielded CB was considered of no importance at all. Intuitively, all created jobs were sorted by descending values of priority, and scheduled through the Kubernetes API until all nodes' resources (CPU and memory) were saturated. Once all node resources were taken by running jobs, Meister killed jobs with lower priority to accommodate new higher priority jobs, but it did not over-provision. --[ 005.003 - Fail-over Mechanical Phish was required to run without failure for the duration of the CFE, an estimated 10 hours. Furthermore, no debugging sessions were permitted and the CRS was recommended to be resistant to minor hardware failure (or might have risked to "crash and burn"). To improve the reliability and resiliency of Mechanical Phish, we took various steps. First, every component, including Ambassador, Meister, Scriba, Network Dude, and Workers, was deployed as a Docker container. All components were designed to be entirely stateless, allowing us to restart them and move them across nodes if necessary. Although components could be terminated abruptly without any significant consequence, some components were critical and were required to be running for Mechanical Phish to function correctly: These were Ambassador, Network Dude, Scriba, and Meister (crashing and recovering is acceptable for these components). Fortunately, Kubernetes provided a way to define always-running instances through DaemonSet and ReplicationController resources. If an instance of such type is terminated or timed out, it is automatically launched on another node (to prevent Kubernetes to be a single point-of-failure, Mechanical Phish was using a highly-available Kubernetes setup with multiple masters and virtual IP addresses for access). * Database Naturally, the entire system cannot be completely stateless, and a single stateful component is required, which was Farnsworth. To prevent any failure of the node running the PostgreSQL Docker containers or the containers themselves, we leveraged PostgreSQL's built-in master-slave streaming replication for a resilient system. Specifically, for the CFE, we ran 5 instances on 5 different physical nodes evenly spread across the rack, and an additional health-checking monitoring service. The monitor service itself was run using a ReplicationController resource. If the master database container would have been considered dead by the monitor, a slave instance would have been elected as the new master and a replacement slave would have been created on a healthy node. To prevent components from failing during database disaster recovery, they accessed the database in a retry loop with exponential back-off. In turn, it would have ensured that no data would have been lost during the transition from a slave to master. * CGC Access Interfaces The CGC CFE Trials Schedule defined that specific IP addresses were required to communicate with the CGC API. Given the distributed nature of our CRS and the recommendation to survive failure, the IP addresses remained the last single point of failure as specific components needed to be run on specific physical hosts. Consequently, we used Pacemaker and Corosync to monitor our components (Ambassador and Network Dude), and assign the specific IP addresses as virtual IP addresses to a healthy instance: if a node failed, the address would move to a healthy node. --[ 006 - Strategy ---;;;;;;;-----'''''''''``' --- `' .,,ccc$$hcccccc,. `' ,;;!!!'``,;;!!' ;;;;,,.,;-------''''''' ,;;!!- .zJ$$$$$$$$$$$$$$$$$$$c,. `' ,;;!!!!' ,; ```' -;;;!'''''- `.,.. .zJ$$$$$$$$$$$$$$$$$$$$$$$$$$c, `!!'' ,;!!' !!- ' `,;;;;;;;;;;'''''```' ,c$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$c, ;!!'' ,; ,;;;!!!!!!!!''``.,;;;;!'`' z$$$$$$$$???"""""'.,,.`"?$$$$$$$$$$$ ``,;;!!! ;;.. --''```_..,;;! J$$$$$$??,zcd$$$$$$$$$$$$$$$$$$$$$$$$h ``'``' ```''' ,;;''``.,.,;;, ,$$$$$$F,z$$$$$$$$$$$$$$$$$$$c,`""?$$$$$h !!!!;;;;, --`!''''''' $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$h.`"$$$$h . `'''``.,;;;!;;;--;; zF,$$$$$$$$$$?????$$$$$$$$$$$$$?????$$r ;?$$$ $. !;.,..,.````.,;;;; ,$P'J"$$$$$$P" .,c,,.J$$$$$$$$$"',cc,_`?h.`$$$$ $L '``````' .,.. ,$$". $ $$$$P",c$$$$$$$$$$$$$$$$',$$$$$$$$$$ $$$$ $$c, !!!!!!!!!!!!!''' J$',$ $.`$$P c$$$$$$$$$$$$$$$$$$,$$$$$$$$$$$ $$$$ $$$$C `` J$ ,$P $$ ?$',$$$$???$$$$$$$$$$$$$$$??"""?$$$ <$$$ $$$$$ c ;, z$F,$$ `$$ $ ?$" "$$$.?$$$ $$$P c??c, ?$.<$$',$$$$$F $$h. -!> (' $" $F ,F ?$ $ F ,="?$$c,`$$F $$"z$$',$' ,$$P $h.`$ ?$$$$$r $$$$$hc,. ``' J$ $P J$ . $$F L ",,J$$$F <$hc$$ "$L,`??????,J$$$.` z$$$$$ $$$$$$$$$$c,'' ?F,$',$F.: $$ c$c,,,,,c,,J$$$$$$$ ?$$$c,,,c$$$$$$F. $$$$$$ `"$$$$$$$$$$$c, $$',$$ :: $$$$$$$$F"',$$$$$$$$$$h ?$$$L;;$$$??$$$$ $$$$$$ "?$$$$$$$$$$ $$$$$$ : .`F"$$$$$$$$$$$$""""?"""h $$$$$$$"$,J$$$$ $$$$$' "?$$$$$$$ $$$$$$.`.` h `$$$$$$$$$$$cccc$$c,zJ$$$$$P' $$$$$P',$$$$P $. `""?$$ $$$$$$$ ` "$c "?$$$$$$$$$$$$??$$$$$$$$" ,J$$$P",J$$$$P .. `" ?$$$$$$h ?$$c.`?$$$$$$$$$' . <$$$$$' ,$$$" ,$$$$$" !!>. . `$$$$$$$h . "$$$c,"$$$$$$$' `' `$$$P ,$$$' ,c$$$$$' ;! ```<!!!> `$$$$$$$c "$$$c`?$$$$$ : : $$$ ,$$P' z$$$$$$' ;!! $hc ```' ; `$$$$$$$. ?$$c ?$$$$ .: : $$$ $$F ,J$$$$$$' ;!! .,.. ' `$$$$$$$ "$$h`$$$$ .' ' $$$ ,$$ ,J$$$$$$' !!! ????P `$$$$$$L $$$ $$$F :.: J$$P J$F J$$$$$P ;!! -=< ?$$."$$ `$$ ?$$' `' z$$$F $P $$$$$$' !!' cc `$$$c`? ?$.`$$hc, cd$$F ,$' $$$$$$ ;!! $$$$c `$$c$$$$$$$$$",c$' $$$$$$ `!! $$$$$ `?$$$$$$$$$$$$P' $$$$$$> .. $$$$$ `"?$$$$$$$P" $$$$$$L $$c, !! <$$$$$ zc,`"""', <$$$$$$.`$$$$cc, !! J$$$$P `$$$$$$$' !' $$$$$$L `$$$$$$h ;, $$$$$L `! J$$$$$',!! $$$$$$$ `$$$$$$ ' <$$$$$. ! $$$$$$ !! ?$$$$$$ `$$$$$ ,$$$$$$$c `,`???? ;' c,?$$$$' `?$$$ $$$$$$$?? `!;;;;! . `h."?$P `$$$ ,$$$$$$$h. `''' `' `$$$P `?$ $$$$$$$$h `!' `"' ` `$$$$$$$$F !; ! ;, `$$$$$$$' `!!> `! c, ;, `?$$$$P !!> . $F !!> `""' `!! ;!> <- The General Strategy of Shellphish The Mechanical Phish was only about three months old when the final event of the Cyber Grand Challenge took place. Like any newborn, its strategic thought processes were not well developed and, unfortunately, the sleep-deprived hackers of Shellphish were haphazard teachers at best. In this section, we describe what amounted to our game strategy for the Mechanical Phish and how the rules of the Cyber Grand Challenge, combined with this strategy, impacted the final result. * Development Strategy The Shellphish CGC team was comprised completely of researchers at the UC Santa Barbara computer security lab. Unfortunately, research labs are extremely disorganized environments. Also unfortunately, as most of us were graduate students, and graduate students need to do research to survive (and eventually graduate), we were fairly limited in the amount of time that we could devote to the CGC. For example, for the CGC Qualification Event, we built our CRS in two and a half weeks. For the final event, we were able to devote a bit more time: on average, each member of the team (the size of which gradually increased from 10 to 13 over the course of the competition) probably spent just under three months on this insanity. One thing that bit us is that we put off true integration of all of the components until the last minute. This led to many last-minute performance issues, some of which we did not sort out before the CFE. * Exploitation Strategy Our exploitation strategy was simple: we attack as soon and as frequently as possible. The only reason that we found to hold back was to avoid letting a victim team steal an exploit. However, there was not enough information in the consensus evaluation to determine whether a team did or did not have an exploit for a given service (and attempting to recover this fact from the network traffic was unreliable), so we decided on a "Total War" approach. * Patching Strategy For every not-failing RB, we submitted the patch considered as the most likely to have a better performance score. Specifically, given the way in which RBs were created, we ranked RBs according to the following list: Optimized Reassembled RB, Reassembled RB, Detouring RB. This choice was motivated by the fact that detouring is slow (as it causes cache flushes due to its propensity to jumping to many code locations), whereas the generated optimized RBs are fast. Of course, we could not always rely on the reassembler (and optimizer) to produce a patch, as these backends had some failure cases. For every patch submitted, the corresponding CS is marked down for a round. Because this can be (and, in the end, was) debilitating to our score, we evaluated many strategies with regards to patching during the months before the CFE. We identified four potential strategies: - Never patch: The simplest strategy was to never patch. This has the advantage of nullifying the chances of functionality breakages and avoiding the downtime associated with patching. - Patch when exploited: The optimal strategy would be to patch as soon as we detect that a CS was being exploited. Unfortunately, detecting when a CS is exploited is very difficult. For example, while the consensus evaluation does provide signals that the binaries cause, these signals do not necessarily correlate to exploitation. Furthermore, replaying incoming traffic to detect exploitation is non-trivial due to complex behaviors on the part of the challenge binaries. - Patch after attacking: An alternative strategy is to assume that an opponent can quickly steal our exploits, and submit a patch immediately after such exploits are fired. In our latter analysis, we determined that this would have been the optimal strategy, granting us first place. - Always patch: If working under the assumption that the majority of challenge sets are exploited, it makes sense to always patch. Most teams in the Cyber Grand Challenge took this decision very seriously. Of course, being the rag-tag group of hackers that we are, we did some fast-and-loose calculations and made a result based on data that turned out to be incorrect. Specifically, we assumed that a similar fraction of the challenges would be exploited as the fraction of challenges crashed during the CQE (about 70%). At the time, this matched up with the percentage of sample CGC binaries, provided by DARPA during sparring partner rounds, that we were exploiting. Running the numbers under this assumption led us to adopt the "always patch" approach: 1. At the second round of a challenge set's existence, we would check if we had an exploit ready. If not, we would patch immediately, with the best available patch (out of the three discussed above). Otherwise, we would delay for a round so that our exploit had a chance to be run against other teams. 2. Once a patch was deployed, we would monitor performance feedback. 3. If feedback slipped below a set threshold, we would revert to the original binary and never patch again. In this way, we would patch *once* for every binary. As we discuss later, this turned out to be one of the worst strategies that we could have taken. --[ 007 - Fruits of our Labors In early August, our creation had to fight for its life, on stage, in front of thousands of people. The Mechanical Phish fought well and won third place, netting us $750,000 dollars and cementing our unexpected place as the richest CTF team in the world (with a combined winnings of 1.5 million dollars, Shellphish is the first millionaire CTF team in history!). This is really cool, but it isn't the whole story. The CGC generated enormous amounts of data, and to truly understand what happened in the final event, we need to delve into it. In this section, we'll talk specifically regarding what happened, who achieved what, and how the CGC Final Event played out. For reference throughout this section, the final scores of the CGC Final Event were: +--------------+------------------+---------+ | Team | CRS Name | Points | +--------------+------------------+---------+ | ForAllSecure | Mayhem | 270,042 | | TECHx | Xandra | 262,036 | | Shellphish | Mechanical Phish | 254,452 | | DeepRed | Rubeus | 251,759 | | CodeJitsu | Galactica | 247,534 | | CSDS | Jima | 246,437 | | Disekt | Crspy | 236,248 | +--------------+------------------+---------+ The attentive reader will notice that the score of the Mechanical Phish is the only one that is a palindrome. --[ 007.001 - Bugs The CGC was the first time that autonomous systems faced each other in a no-humans-allowed competition. As such, all of the Cyber Reasoning Systems likely faced some amount of bugs during the CFE. The most visible was Mayhem's, which resulted in the system being off-line for most of the second half of the game (although, as we discuss later in this section, that might not have hurt the system as much as one would think). Our system was no different. In looking through the results, we identified a number of bugs that the Mechanical Phish ran into during the CFE: * Multi-CB pipeline assertions We used fuzzing to identify crashes and POV Fuzzing to fuzz those crashes into POVs for challenge sets comprising of multiple binaries. Unfortunately, an accidental assert statement placed in the POV Fuzzing code caused it to opt out of any tasks involving multi-CB challenge sets, disabling our multi-CB exploitation capability. * Network traffic synchronization Due to a bug, the component that scheduled tasks to synchronize and analyze network traffic was set to download *all* recorded network traffic every minute. The volume of this traffic quickly caused it to exceed scheduling timeouts, and Mechanical Phish only analyzed network traffic for the first 15 rounds of the game. * RB submission race condition Due to a race condition between the component that determined what patches to submit and the component that actually submitted them, we had several instances where we submitted different patched binaries across different rounds, causing multiple rounds of lost uptime. * Scheduling issues Throughout the CFE, Mechanical Phish identified exploitable crashes in over 40 binaries. However, only 15 exploits were generated. Part, but not all, of this was due to the multi-CB pipeline assertions. It seems that the rest was due to scheduling issues that we have not yet been able to identify. * Slow task spawning Our configuration of Kubernetes was unable to spawn tasks quickly enough to keep up with the job queue. Luckily, we identified this bug a few days before the CFE and put in some workarounds, though we did not have time to fix the root cause. This bug caused us to under-utilize our infrastructure. --[ 007.002 - Pwning Kings Over the course of the CGC Final Event, the Mechanical Phish pwned the most challenges out of the competitors and stole the most flags. This was incredible to see, and is an achievement that we are extremely proud of. Furthermore, the Mechanical Phish stole the most flags even when taking into account only the first 49 rounds, to allow for the fact that Mayhem submitted its last flag on round 49. We've collected the results in a helpful chart, with the teams sorted by the total amount of flags that the teams captured throughout the game. +--------------+-------------------------------+--------------------------+ | Team | Flags Captured (first 49/all) | CSes Pwned (first 49/all)| +--------------+-------------------------------+--------------------------+ | Shellphish | 206 / 402 | 6 / 15 | | CodeJitsu | 59 / 392 | 3 / 9 | | DeepRed | 154 / 265 | 3 / 6 | | TECHx | 66 / 214 | 2 / 4 | | Disekt | 101 / 210 | 5 / 6 | | ForAllSecure | 185 / 187 | 10 / 11 | | CSDS | 20 / 22 | 1 / 2 | +--------------+-------------------------------+--------------------------+ Interestingly, Mayhem exploited an enormous amount of binaries before it went down, but the Mechanical Phish still achieved a higher exploitation score in the rounds that Mayhem was alive. One possibility is that the exploits launched by the Mechanical Phish were more reliable than those of Mayhem. However, its raw exploitation power should not be underestimated: within 49 rounds, before going off the grid, Mayhem managed to exploit 10 binaries. While the Mechanical Phish surpassed it over the entire game, this is still quite impressive. --[ 007.003 - Patching Kings To understand how effective our patches were, we calculated the number of flags lost (and the number of CSes on which flags were lost) throughout the game, both for the first 49 rounds (in which Mayhem was online) and for the entire game. As expected, because the Mechanical Phish patched every binary, we found that it was the least-exploited CRS. Specifically, it only leaked flags on 12 challenge sets. The runner-up in this sense (the second-place team, TECHx), leaked flags on 14 binaries. Interestingly, TECHx beat us on the total number of flags leaked. This makes sense -- they likely had some analysis to identify *when* they were being exploited, and patch only then. And, likely, their patches were more effective than our own, leading to a lower number of total flags leaked! The results, sorted by the number of compromised challenges, are: +--------------+-------------------------------+--------------------------+ | Team | Defended CS-Rounds (49/all) | CSes Compromised (49/*) | +--------------+-------------------------------+--------------------------+ | Shellphish | 29 / 68 | 7 / 12 | | TECHx | 27 / 61 | 7 / 14 | | DeepRed | 32 / 87 | 6 / 15 | | ForAllSecure | 54 / 160 | 7 / 16 | | CodeJitsu | 61 / 104 | 9 / 16 | | Disekt | 66 / 127 | 9 / 17 | | CSDS | 108 / 189 | 9 / 18 | +--------------+-------------------------------+--------------------------+ --[ 007.004 - Patching Effectiveness Before the CFE, we tested our patches against the dataset of all CQE binaries released until that point, compiled with many different compilation flags. On this dataset, our chance of breaking a binary, or lowering performance to below 85% (the guesstimated threshold at which we would pull the binary back) was under 2%, and we decided to abandon patch performance/functionality testing. On the actual CFE dataset, we fared worse. The CFE had 82 challenge sets. For these 82 challenge sets, we successfully generated patches for 78. Of these 78, we had to pull back 10 because the patch broke the binary and 7 because the patch lowered the performance of the binary below acceptable levels. This was considerably higher than expected. Furthermore, due to a race condition between the patch submission logic and the component that retrieves the game state, we did end up with situations where we would submit *multiple* patches, thinking that the round had not ended whereas, in reality, it had. In total, we made 107 RCB submissions throughout the game. This was the second-highest amount, with only CodeJitsu submitting more (108). Because very few challenges were exploited during the CFE, this was unequivocally the wrong choice. In total, we lost around 17000 points during the "downtime" rounds after the patch submissions alone. --[ 007.005 - Alternate History Case Studies Hindsight is 20/20. Even though we know, now, that we made the wrong choice regarding our patching strategy, it's still interesting to see what "could have been". In this section, we do that. To better understand the impact of our strategy decisions, we compute scores for several simulated CGC rounds where the Mechanical Phish undertook different strategies, and see what would have happened. It's very important to point out that this is all fantasy. *Every* team can look back and consider things that they might have done differently. The most obvious one is Mayhem: had they avoided crashing, they might have absolutely dominated the competition, rather than relaxedly coasting to victory. However, every other team has other "what if" moments. We explore ours here purely out of curiosity. * Mechanical Phish that never patched To calculate our score in the absence of patching, we recalculated CFE scores, assuming that, any time an exploit would be launched on a CS against *any* team, the exploit would be run against us during that round and all subsequent rounds of that CS being in play. With this calculation, our score would be 267,065, which is 12,613 points higher than the patch strategy that we did choose and would put us in second place by a margin of over 5,000 points. The prize for second place was $1,000,000. Patching at all cost us $250,000! * Mechanical Phish that patched after attacking Similar to the previous strategy, we calculated our score with a patch strategy that would delay patches until *after* we launched exploits on the corresponding CS. For the patches that would have been submitted, we used the same feedback that we received during the CFE itself. With this calculation, our score would be 271,506, which is 17,054 points higher than the patch strategy that we chose and would have put us in first place by a margin of over 1,500 points. The prize for first place was $2,000,000. Patching stupidly cost us $1,250,000 and quite a bit of glory! * Mechanical Phish that didn't do crap We were curious: did we really have to push so hard and trade so much sanity away over the months leading up to the CGC? How would a team that did *nothing* do? That is, if a team connected and then ceased to play, would they fare better or worse than the other players? We ran a similar analysis to the "Never patch" strategy previously (i.e., we counted a CS as exploited for all rounds after its first exploitation against any teams), but this time removed any POV-provided points. In the CFE, this "Team NOP" would have scored 255,678 points, barely *beating* Shellphish and placing 3rd in the CGC. To be fair, this score calculation does not take into account the fact that teams might have withheld exploits because all opponents were patched against them. However, ForAllSecure patched only 10 binaries, so it does not seem likely that many exploits were held back due to the presence of patches. One way of looking at this is that we could have simply enjoyed life for a year, shown up to the CGC, and walked away with $750,000. Another way of looking at this is that, despite us following the worst possible strategy in regards to patching, the technical aspects of our CRS were good enough to compensate and keep us in the top three positions! --[ 007.006 - Scoring Difficulties Similarly to this being the first time that autonomous systems compete against each other in a no-humans-allowed match, this was also the first time that such a match was *hosted*. The organizing team was up against an astonishing amount of challenges, from hardware to software to politics, and they pulled off an amazing event. However, as in any complex event, some issues are bound to arise. In this case, the problem was that measuring program performance is hard. We found this out while creating our CRS, and DARPA experienced this difficulty during the final event. Specifically, we noticed two anomalies in the scoring data: slight disparities in the initial scoring of challenge sets, and performance "cross-talk" between services. * Initial CS Scoring Patches can only be fielded on round 3 (after being submitted on round 2) of a binary being deployed. However we noticed that our availability scores were lower than our opponents, even on the *first* round of a challenge, when they could not yet be patched. In principle, these should all be the same, as a team has *no* way to influence this performance score. We calculated the average of the first-round CS availability scores, presented in the table below. The scores vary. The difference between the "luckiest" team, regarding their first-round CS score, and the "unluckiest" team was 1.6 percentage points. Unfortunately, Shellphish was that unluckiest team. Since the availability score was used as a multiplier for a team's total score, if the "luckiest" and "unluckiest" had their "luck" swapped, this would compensate for a total score difference of 3.2%. That is a bigger ratio than the difference between second and third place (2.9%), third and fourth place (1.1%), fourth and fifth place (1.7%), and fifth and sixth place (0.4%). The winner (Mayhem) could not have been unseated by these perturbations, but the rest of the playing field could have looked rather different. +--------------+----------------------------------+ | Team | Average First Round Availability | +--------------+----------------------------------+ | CSDS | 0.9985 | | ForAllSecure | 0.9978 | | Disekt | 0.9975 | | TECHx | 0.9973 | | CodeJitsu | 0.9971 | | DeepRed | 0.9917 | | Shellphish | 0.9824 | +--------------+----------------------------------+ * Scoring Cross-talk We also noticed that performance measurements of one challenge seem to influence others. Specifically, when we patched the binary NRFIN_00066 on round 39, we saw the performance of *all* of our other, previously-patched, binaries drop drastically for rounds 40 and 41. This caused us to pull back patches for *all* of our patched binaries, suffering the resulting downtime and decr