Software security vulnerabilities are a fact of life. So are the subsequent publicity, package updates, and painful service restarts. Administrators are used to it, users put up with it, and it is the traditional, default method.

On the other hand, in some circumstances the update & restart methods are unacceptable, leading to the development of online fix facilities like kpatch, where code may be surgically replaced in a running system. There is plenty of potential in these systems, but they are still at an early stage of deployment.

In this article, we present another option: a limited sort of live patching using systemtap. This tool, now a decade old, is conventionally thought of as a tracing widget, but it can do more. It can not only monitor the detailed internals of the Linux kernel and user-space programs, it can also change them - a little. It turns out to be just enough to defeat some classes of security vulnerabilities. We refer to these as security "band-aids" rather than fixes, because we expect them to be used temporarily.

Systemtap

Systemtap is a system-wide programmable probing tool introduced in 2005, and supported since RHEL 4 on all Red Hat and many other Linux distributions. (Its capabilities vary with the kernel's. For example, user-space probing is not available in RHEL 4.) It is system-wide in the sense that it allows looking into much of the software stack: from device drivers and the kernel core, through system libraries, to user applications. It is programmable because it runs programs written in the systemtap scripting language, which specify what operations to perform. It probes in the sense that, like a surgical instrument, it safely opens up running software so that we can peek and poke at its internals.

Systemtap's script language is inspired by dtrace, awk, and C, and is intended to be easy to understand and compact yet expressive. Here is hello-world:

probe oneshot { printf("hello world\n") }

Here is some counting of vfs I/O:

global per_pid_io # an array
probe kernel.function("vfs_read") { per_pid_io["read",pid()] += $count }
probe kernel.function("vfs_write") { per_pid_io["write",pid()] += $count }
probe timer.s(5) { exit() } # per_pid_io will be printed

Here is a system-wide strace for non-root processes:

probe syscall.* {
    if (uid() != 0)
        printf("%s %d %s %s\n", execname(), tid(), name, argstr)
}

Additional serious and silly samples are available on the Internet and also distributed with systemtap packages.

The meaning of a systemtap script is simple: whenever a given probed event occurs, pause that context, evaluate the statements in the probe handler atomically (safely and quickly), then resume the context. Those statements can inspect source-level state in context, trace it, or store it away for later analysis/summary.

The systemtap scripting language is well-featured. It has the usual control-flow statements and functions, including recursion. It deals with integral and string data types, and look-up tables are available. Unusually for a small language, full type checking is paired with type inference, so type declarations are not necessary. There are dozens of types of probes, such as timers, calls/interiors/returns from arbitrary functions, designated tracing instrumentation in the kernel or user-space, hardware performance counter values, Java methods, and more.

Systemtap scripts are run by algorithmic translation to C, compilation to machine code, and execution within the kernel as a loadable module. The intuitive hazards of this technique are mitigated by extensive checking throughout the process, both during translation and within the generated C code and its runtime. Both time and space usage are limited, so experimentation is safe. There are even modes available for non-root users to inspect only their own processes. This capability has been used for a wide range of purposes:

performance tuning via profiling

program understanding via pinpoint tracing, dynamic call-graphs, and pretty-printing of local variables line by line

Many scripts can run independently at the same time and any can be completely stopped and removed. People have even written interactive games!

Methodology

While systemtap is normally used in a passive (read-only) capacity, it may be configured to permit active manipulations of state. When invoked in "guru mode", a probe handler may send signals, insert delays, change variables in the probed programs, and even run arbitrary C code. Since systemtap cannot change the program code, changing data is the approach of choice for construction of security band-aids. Here are some of the steps involved, once a vulnerability has been identified in some piece of kernel or user code.

Plan

To begin, we need to look at the vulnerable code and, if available, the corrective patch to understand:

Is the bug mainly (a) data-processing-related or (b) algorithmic?

Is the bug (a) localized or (b) widespread?

Is the control flow to trigger the bug (a) simple or (b) complicated?

Are control flow paths to bypass the bug (a) available nearby (included in callers) or (b) difficult to reach?

Is the bug dependent mainly on (a) local data (such as function parameters) or (b) global state?

Is the vulnerability-triggering data accessible over (a) a broad range of the function (a function parameter or global) or only (b) a narrow window (a local variable inside a nested block)?

Are the bug triggering conditions (a) specific and selective or (b) complex and imprecise?

Are the bug triggering conditions (a) deliberate or (b) incidental in normal operation?

More (a)s than (b)s means it's more likely that systemtap band-aids would work in the particular situation while more (b)s means it's likely that patching the traditional way would be best.

Then we need to decide how to change the system state at the vulnerable point. One possibility is to change to a safer error state; the other is to change to a correct state.

In a type-1 band-aid, we will redirect flow of control away from the vulnerable areas. In this approach, we want to "corrupt" incoming data further in the smallest possible way necessary to short-circuit the function to bypass the vulnerable regions. This is especially appropriate if:

correcting the data is difficult, perhaps because it is in multiple locations, or because it needs to be temporary or no obvious corrected value can be computed

error handling code already exists and is accessible

we don't have a clear identification of vulnerable data states and want to err on the side of error-handling

triggering the vulnerability is almost always deliberate, so we don't want to spend the effort of performing a corrected operation

A type-2 band-aid corrects data so that the vulnerable code runs correctly. This is especially appropriate in the complementary cases from the above and if:

the vulnerable code and data can occur from real workloads so we would like them to succeed

corrected data can be practically computed from nearby state

natural points occur in the vulnerable code where the corrected data may be inserted

natural points occur in the vulnerable code where clean up code (restoring of previous state) may be inserted, if necessary

Implement

With the vulnerability-band-aid approach chosen, we need to express our intent in the systemtap scripting language. The model is simple: for each place where the state change is to be done we place a probe. In each probe handler, we detect whether the context indicates an exploit is in progress and, if so, make changes to the context. We might also need additional probes to detect and capture state from before the vulnerable section of code, for diagnostic purposes.

A minimal script form for changing state can be easily written. It demonstrates one kernel and one user-space function-entry probe, where each happens to take a parameter named p that needs to be range-limited. (The dollar sign identifies the symbol as a variable in the context of the probed program, not as a script-level temporary variable.)

probe kernel.function("foo"),
      process("/lib*/libc.so").function("bar")
{
    if ($p > 100)
        $p = 4
}

Another possible action in the probe handler is to deliver a signal to the current user-space process using the raise function. In this script, a global variable in the target program is checked at every statement in the given source file and line-number range, and a killing blow is delivered if necessary:

probe process("/bin/foo").statement("*@src/foo.c:100-200") {
    if (@var("a_global") > 1000)
        raise(9) # SIGKILL
}

Another possible action is logging the attempt at the systemtap process console:

# ...
printf("check process %s pid=%d uid=%d", execname(), pid(), uid())
# ...

Or sending a note to the sysadmin:

# ...
system(sprintf("/bin/logger check process %s pid=%d uid=%d", execname(), pid(), uid()))
# ...

These and other actions may be done in any combination.

During development of a band-aid, one should start with just tracing (no band-aid countermeasures) to fine-tune the detection of the vulnerable state. If an exploit is available, run it without systemtap, with systemtap (tracing only), and with the operational band-aid. If a normal workload can trigger the bug, run it with and without the same spectrum of systemtap supervision to confirm that we're not harming that traffic.

Deploy

To run a systemtap script, we will need systemtap on a developer workstation. Systemtap has been included in RHEL 4 and later since 2006. RHEL 6 and RHEL 7 still receive rebases from new upstream releases, though capabilities vary. (Systemtap upstream is tested against a gamut of RHEL 4 through fresh kernel.org kernels, and against other distributions.)

In addition to systemtap itself, security band-aid type scripts usually require source-level debugging information for the buggy programs. That is because we need the same symbolic information about types, functions, and variables in the program as an interactive debugger like gdb does. On Red Hat/Fedora distributions this means the "-debuginfo" RPMs available on RHN and YUM. Some other distributions make them available as "-dbgsym" packages. Systemtap scripts that probe the kernel will probably need the larger "kernel-debuginfo"; those that probe user-space will probably need a corresponding "foobar-debuginfo" package. (Systemtap will tell you if it's missing.)

Running systemtap security band-aid scripts will generally require root privileges and a "guru-mode" flag that grants permission to modify state, for example:

# stap -g band_aid.stp
[...]
^C

The script will inject instrumentation into existing and future processes and continue running until it is manually interrupted, it stops itself, or error conditions arise. In case of most errors, the script will stop cleanly, print a diagnostic message, and point to a manual page with further advice. For transient or acceptable conditions, command line options are available to suppress some safety checks altogether.

Systemtap scripts may be distributed to a network of homogeneous workstations in "pre-compiled" (kernel-object) form, so that a full systemtap + compiler + debuginfo installation is not necessary on the other computers. Systemtap includes automation for remote compilation and execution of the scripts. A related facility is available to create MOK-signed modules for machines running under SecureBoot. Scripts may also be installed for automatic execution at startup via initscripts.

Some examples

Of all the security bugs for which systemtap band-aids have been published, we analyse a few below.

CVE-2013-2094, perf_swevent_enabled array out-of-bound access

This was an older bug in the kernel's perf-event subsystem, which takes a complex command struct from a syscall. The bug was a missing range check on a field inside the struct pointed to by the event parameter. We opted for a type-2 data-correction fix, even though a type-1 failure-induction could have worked as well.

This script demonstrates an unusual technique: embedded-C code called from the script, to adjust the erroneous value. There is a documented argument-passing API between embedded-C and script, but systemtap cannot analyze or guarantee anything about the safety of the C code. In this case, the same calculation could have been expressed within the safe scripting language, but it serves to demonstrate how a more intricate correction could be fitted.

# declaration for embedded-C code
%{
#include <linux/perf_event.h>
%}

# embedded-C function - note %{ %} bracketing
function sanitize_config:long (event:long) %{
    struct perf_event *event;
    event = (struct perf_event *) (unsigned long) STAP_ARG_event;
    event->attr.config &= INT_MAX;
%}

probe kernel.function("perf_swevent_init").call {
    sanitize_config($event) # called with pointer
}
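
To see why masking with INT_MAX is enough, here is a minimal user-space sketch in plain C. It rests on an assumption not spelled out in this article: that the unpatched kernel code narrowed the 64-bit attr.config value to a signed int before using it as an array index, and only checked the upper bound. The names as_index, passes_upper_bound_check, sanitize, and SW_MAX are hypothetical stand-ins, not kernel identifiers.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in for the size of the indexed array. */
enum { SW_MAX = 9 };

/* Model of the assumed flaw: a 64-bit config value narrowed to a signed
 * int. Only the low 32 bits survive and are reinterpreted as two's
 * complement; this is written out by hand so the sketch does not rely on
 * an implementation-defined conversion. Assumes a 32-bit int. */
static int as_index(uint64_t config)
{
    uint32_t low = (uint32_t)config;
    if (low <= (uint32_t)INT_MAX)
        return (int)low;
    return (int)(low - (uint32_t)INT_MAX - 1u) + INT_MIN;
}

/* An upper-bound-only check, as assumed for the unpatched code. */
static int passes_upper_bound_check(int index)
{
    return index < SW_MAX;
}

/* The band-aid's correction: clear every bit above INT_MAX. */
static uint64_t sanitize(uint64_t config)
{
    return config & (uint64_t)INT_MAX;
}
```

Under these assumptions, a config value with the top bits set turns into a negative index that slips past the upper-bound check. After sanitize() the index is never negative, so an out-of-range value is rejected by that same check.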

CVE-2015-3456, "venom"

Here is a simple example from the recent VENOM bug, CVE-2015-3456, in QEMU's floppy-drive emulation code. In this case, a buffer-overflow bug allows some user-supplied data to overwrite unrelated memory. The official upstream patch adds explicit range limiting through a newly introduced index variable, pos, in several functions. For example:

@@ -1852,10 +1852,13 @@ static void fdctrl_handle_drive_specification_command(FDCtrl *fdctrl, int direction)
 {
     FDrive *cur_drv = get_cur_drv(fdctrl);
+    uint32_t pos;
-    if (fdctrl->fifo[fdctrl->data_pos - 1] & 0x80) {
+    pos = fdctrl->data_pos - 1;
+    pos %= FD_SECTOR_LEN;
+    if (fdctrl->fifo[pos] & 0x80) {
         /* Command parameters done */
-        if (fdctrl->fifo[fdctrl->data_pos - 1] & 0x40) {
+        if (fdctrl->fifo[pos] & 0x40) {
             fdctrl->fifo[0] = fdctrl->fifo[1];
             fdctrl->fifo[2] = 0;
             fdctrl->fifo[3] = 0;
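
The index arithmetic added by the patch can be tried in isolation. The following is a small stand-alone C sketch, not QEMU code: clamp_pos is a hypothetical helper, and FD_SECTOR_LEN is assumed here to be 512.

```c
#include <stdint.h>

#define FD_SECTOR_LEN 512u   /* assumed value; stands in for QEMU's constant */

/* Hypothetical helper mirroring the two lines added by the patch:
 *     pos = fdctrl->data_pos - 1;
 *     pos %= FD_SECTOR_LEN;
 * Unsigned arithmetic wraps around, so data_pos == 0 gives UINT32_MAX
 * before the modulo and 511 after it. */
static uint32_t clamp_pos(uint32_t data_pos)
{
    uint32_t pos = data_pos - 1u;
    pos %= FD_SECTOR_LEN;
    return pos;
}
```

Whatever value data_pos is driven to, the result stays within 0..511, so it can safely index a buffer of at least FD_SECTOR_LEN bytes.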

Inspecting the original code, we see that the vulnerable index was inside a heap object at fdctrl->data_pos. A type-2 systemtap band-aid would have to adjust that value before the code runs the fifo[] dereference, and subsequently restore the previous value. This might be expressed like this:

global saved_data_pos

probe process("/usr/bin/qemu-system-*").function("fdctrl_*spec*_command").call {
    saved_data_pos[tid()] = $fdctrl->data_pos;
    $fdctrl->data_pos = $fdctrl->data_pos % 512 # FD_SECTOR_LEN
}

probe process("/usr/bin/qemu-system-*").function("fdctrl_*spec*_command").return {
    $fdctrl->data_pos = saved_data_pos[tid()]
    delete saved_data_pos[tid()]
}

The same work would have to be done at each of the three analogous vulnerable sites unless further detailed analysis suggests that a single common higher-level function could do the job.

However, this is probably too much work. The CVE advisory suggests that any call to this area of code is likely a deliberate exploit attempt (since modern operating systems don't use the floppy driver). Therefore, we could opt for a type-1 band-aid, where we bypass the vulnerable computations entirely. We find that all the vulnerable functions ultimately have a common caller, fdctrl_write.

static void fdctrl_write (void *opaque, uint32_t reg, uint32_t value)
{
    FDCtrl *fdctrl = opaque;
    [...]
    reg &= 7;
    switch (reg) {
    case FD_REG_DOR:
        fdctrl_write_dor(fdctrl, value);
        break;
    [...]
    case FD_REG_CCR:
        fdctrl_write_ccr(fdctrl, value);
        break;
    default:
        break;
    }
}

We can disarm the entire simulated floppy driver by pretending that the simulated CPU is addressing a reserved FDC register, thus falling through to the default: case. This requires just one probe. Here we're being more conservative than necessary, overwriting only the low few bits:

probe process("/usr/bin/qemu-system-*").function("fdctrl_write") {
    $reg = (($reg & ~7) | 6) # replace register address with 0x__6
}
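
The effect of this overwrite can be checked with a small C model of the dispatch. This is a sketch, not QEMU code: the register numbers (2 for FD_REG_DOR, 7 for FD_REG_CCR) and the function names dispatch and band_aid are assumptions made for illustration, and register 6 is taken to be unhandled, as described above. Only the masking expression comes from the script.

```c
#include <stdint.h>

/* Assumed register numbers, for illustration only. */
enum { FD_REG_DOR = 2, FD_REG_CCR = 7 };

/* Hypothetical model of the switch in fdctrl_write(): returns 1 if a
 * register handler would run, 0 if control reaches the default case. */
static int dispatch(uint32_t reg)
{
    reg &= 7;
    switch (reg) {
    case FD_REG_DOR:
    case FD_REG_CCR:
        return 1;
    default:
        return 0;
    }
}

/* The band-aid's rewrite: keep the upper bits, force the low three to 6. */
static uint32_t band_aid(uint32_t reg)
{
    return (reg & ~UINT32_C(7)) | 6u;
}
```

In this model, dispatch(band_aid(reg)) returns 0 for every input, because the low three bits are always 6 after the rewrite. No handler runs, while the upper address bits are left untouched.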

CVE-2015-0235, "ghost"

This recent bug in glibc involved a buffer overflow related to dynamic allocation with a miscalculated size. It affected a function that is commonly used in normal software, and the data required to determine whether the vulnerability would be triggered or not is not available in situ. Therefore, a type-1 error-inducing band-aid would not be appropriate.

However, it is a good candidate for type-2 data-correction. The script below works by incrementing the size_needed variable set around line 86 of glibc nss/digits_dots.c, so as to account for the missing sizeof (*h_alias_ptr). This makes the subsequent comparisons work and return error codes for buffer-overflow situations.

 85
 86   size_needed = (sizeof (*host_addr)
 87                  + sizeof (*h_addr_ptrs) + strlen (name) + 1);
 88
 89   if (buffer_size == NULL)
 90     {
 91       if (buflen < size_needed)
 92         {
 93           if (h_errnop != NULL)
 94             *h_errnop = TRY_AGAIN;
 95           __set_errno (ERANGE);
 96           goto done;
 97         }
 98     }
 99   else if (buffer_size != NULL && *buffer_size < size_needed)
100     {
101       char *new_buf;
102       *buffer_size = size_needed;
103       new_buf = (char *) realloc (*buffer, *buffer_size);
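
The arithmetic behind the correction can be sketched in stand-alone C. This is not glibc code: the three typedefs are assumed stand-ins for the types behind host_addr, h_addr_ptrs, and h_alias_ptr, and the function names are hypothetical. The point is only that the two size computations differ by exactly one pointer.

```c
#include <stddef.h>
#include <string.h>

/* Assumed stand-ins for the glibc-internal types (illustration only). */
typedef unsigned char host_addr_t[16];  /* room for an IPv6 address */
typedef char *host_addr_list_t[2];      /* one address pointer plus NULL */
typedef char *alias_ptr_t;              /* what *h_alias_ptr would be */

/* The computation as excerpted above, without the alias-pointer term. */
static size_t size_needed_buggy(const char *name)
{
    return sizeof(host_addr_t) + sizeof(host_addr_list_t) + strlen(name) + 1;
}

/* The same computation with the missing term included. */
static size_t size_needed_fixed(const char *name)
{
    return size_needed_buggy(name) + sizeof(alias_ptr_t);
}
```

On a 64-bit build the difference is 8 bytes, on a 32-bit build 4 bytes. A band-aid therefore has to add a pointer-sized amount rather than a fixed constant.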

The script demonstrates an unusual technique. The variable in need of correction (size_needed) is deep within a particular function, so we need "statement" probes to correct it before the bad value is used. Because of compiler optimizations, the exact line number where the probe may be placed can't be known a priori, so we ask systemtap to try a whole range. The probe handler then protects itself against being invoked more than once (per function call) using an auxiliary flag array.

global added%
global trap = 1 # stap -G trap=0 to only trace, not fix

probe process("/lib*/libc.so.6").statement("__nss_hostname_digits_dots@*:87-102") {
    if (! added[tid()]) {
        added[tid()] = 1; # we only want to add once
        printf("%s[%d] BOO! size_needed=%d ", execname(), tid(), $size_needed)
        if (trap) {
            # The &@cast() business is a fancy sizeof(uintptr_t),
            # which makes this script work for both 32- and 64-bit glibc's.
            $size_needed = $size_needed + &@cast(0, "uintptr_t")[1]
            printf("ghostbusted to %d", $size_needed)
        }
        printf("\n")
    }
}

probe process("/lib*/libc.so.6").function("__nss_hostname_digits_dots").return {
    delete added[tid()] # reset for next call
}

This type-2 band-aid allows applications to operate as though glibc were patched.

Conclusions

We hope you enjoyed this foray into systemtap and its unexpected application as a potential band-aid for security bugs. If you would like to learn more, read our documentation, contact our team, or just go forth and experiment. If this technology seems like a fit for your installation and situation, consult your vendor for a possible systemtap band-aid.