Interesting behavior I have been dealing with lately.

Have you seen this perf profile before?

```
PerfTop: 12025 irqs/sec  kernel:21.0%  exact: 0.0% [1000Hz cycles], (all, 12 CPUs)
------------------------------------------------------------------------------------------------

             samples  pcnt function                                 DSO
             _______ _____ ________________________________________ ____________________________

            19739.00 22.3% page_fault                               [kernel.kallsyms]
             9475.00 10.7% __GI_time                                /lib64/libc-2.5.so
             4934.00  5.6% free                                     /usr/lib64/libjemalloc.so.1
             4816.00  5.4% arena_dalloc_bin_locked                  /usr/lib64/libjemalloc.so.1
             3111.00  3.5% calloc                                   /usr/lib64/libjemalloc.so.1
             2399.00  2.7% __GI_vfprintf                            /lib64/libc-2.5.so
             2258.00  2.6% malloc                                   /usr/lib64/libjemalloc.so.1
             2176.00  2.5% arena_tcache_fill_small                  /usr/lib64/libjemalloc.so.1
```

I had not seen it. Probably because no one does this many syscalls in high performance code (dur). I googled it and nothing useful came up aside from the usual major/minor page fault material explained by the Linux kernel's paging model.

I then asked the smartest guy I know, and he pointed me in the right direction. What follows is the explanation in case you run into this. I suspect no one else will run into this because it has to do with libc/kernel mismatches.

We are working on a complex event processor at AppNexus. This thing is responsible for aggregating, joining, filtering, etc. on advertising events. As such, it's very concerned with what time it is on the system: various RAM eviction mechanisms rely on windows of time for processing, so that we don't consume all the RAM on the system holding onto events longer than necessary. Turns out that we were naively calling:

```c
time_t now = time(NULL);
```

in a few key places (i.e. every time we processed an incoming event). That time() calls would eat a majority of our cycles was unexpected enough. What's more unexpected is that those time() calls would trigger so many page faults.

We are using glibc version 2.5 (the stock version on CentOS 5) on x86_64 machines. It implements the time() call with assembly that invokes the syscall through the vsyscall page provided by Linux. From the time.S file:

glibc/sysdeps/unix/sysv/linux/x86_64/time.S:

```asm
/* For the calculation see asm/vsyscall.h.  */
#define VSYSCALL_ADDR_vtime	0xffffffffff600400

/* Return the current time as a `time_t' and also put it in
   *T if T is not NULL.  Time is represented as seconds from
   Jan 1 00:00:00 1970.  */

ENTRY (time)
	/* Align stack.  */
	sub	$0x8, %rsp
	cfi_adjust_cfa_offset(8)
	movq	$VSYSCALL_ADDR_vtime, %rax
	callq	*%rax
	add	$0x8, %rsp
	cfi_adjust_cfa_offset(-8)
	ret
PSEUDO_END_NOERRNO(time)
libc_hidden_def(time)
```

However, because we want new-hotness kernels, we have a relatively new 3.8.13 kernel installed on most of these stock CentOS 5 machines. What is this magic address? Run:

```
cat /proc/self/maps
```

You should see something like:

```
00400000-00405000 r-xp 00000000 fd:00 13467750                   /bin/cat
00604000-00606000 rw-p 00004000 fd:00 13467750                   /bin/cat
01b15000-01b36000 rw-p 00000000 00:00 0                          [heap]
3b58600000-3b5861c000 r-xp 00000000 fd:00 91324604               /lib64/ld-2.5.so
3b5881c000-3b5881d000 r--p 0001c000 fd:00 91324604               /lib64/ld-2.5.so
3b5881d000-3b5881e000 rw-p 0001d000 fd:00 91324604               /lib64/ld-2.5.so
3b58a00000-3b58b4d000 r-xp 00000000 fd:00 91324605               /lib64/libc-2.5.so
3b58b4d000-3b58d4d000 ---p 0014d000 fd:00 91324605               /lib64/libc-2.5.so
3b58d4d000-3b58d51000 r--p 0014d000 fd:00 91324605               /lib64/libc-2.5.so
3b58d51000-3b58d52000 rw-p 00151000 fd:00 91324605               /lib64/libc-2.5.so
3b58d52000-3b58d57000 rw-p 00000000 00:00 0
7fa403eb5000-7fa40748c000 r--p 00000000 fd:00 120077090          /usr/lib/locale/locale-archive
7fa40748c000-7fa40748e000 rw-p 00000000 00:00 0
7fa40749e000-7fa4074a0000 rw-p 00000000 00:00 0
7ffff864b000-7ffff866c000 rw-p 00000000 00:00 0                  [stack]
7ffff87f8000-7ffff87fa000 r-xp 00000000 00:00 0                  [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0          [vsyscall]
```

These are sorted lowest to highest address in the memory map of this process. You can see the process itself, its heap, then DSOs, then locale-related data, then a few anonymous mappings, then the stack, then the vdso, then the vsyscall page.

On newer libcs we are supposed to get sys_time via the vdso. That is out of scope for this blog post, but if you are interested: http://lwn.net/Articles/446528/. Note that when that article says "The static vsyscall page will not continue to exist in its current form, and applications which still use it will continue to work but will get a little bit slower", it is severely understating how much slower they will get. More on that below.

The final page, starting at 0xffffffffff600000 and ending at 0xffffffffff601000, must contain a time() function at offset 0xffffffffff600400. We are merely dereferencing that memory address and calling the function defined there. On a modern kernel this causes a trap (page_fault) at that address and invokes special handling to emulate the syscalls that used to live there. This "fixes" the security hole and still allows legacy apps to work (much more slowly).

I created a simple test program, isolated from any other complications in my own code, to reproduce this problem:

```c
#include <time.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
	time_t now = 0;
	int count = atoi(argv[1]);

	for (int i = 0; i < count; i++) {
		now = time(NULL);
	}

	return 0;
}
```

The perf output on this simpler program is pretty revealing as to what’s happening.

```
# Events: 637 cycles
#
# Overhead  Command      Shared Object                     Symbol
# ........  .......  .................  .........................
#
    48.66%      tpf  [kernel.kallsyms]  [k] page_fault
    16.89%      tpf  libc-2.5.so        [.] __GI_time
     4.30%      tpf  [kernel.kallsyms]  [k] trace_hardirqs_off_caller
     3.68%      tpf  [kernel.kallsyms]  [k] trace_hardirqs_off_thunk
     3.22%      tpf  tpf                [.] main
     3.07%      tpf  [kernel.kallsyms]  [k] emulate_vsyscall
     2.61%      tpf  [kernel.kallsyms]  [k] error_entry
     2.46%      tpf  [kernel.kallsyms]  [k] do_page_fault
     2.30%      tpf  [kernel.kallsyms]  [k] __bad_area_nosemaphore
     2.14%      tpf  [kernel.kallsyms]  [k] is_prefetch
     1.54%      tpf  [kernel.kallsyms]  [k] restore
     1.07%      tpf  [kernel.kallsyms]  [k] spurious_fault
     1.03%      tpf  [kernel.kallsyms]  [k] system_call_after_swapgs
     0.92%      tpf  [kernel.kallsyms]  [k] error_sti
     0.92%      tpf  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
     0.78%      tpf  [kernel.kallsyms]  [k] retint_swapgs
     0.77%      tpf  [kernel.kallsyms]  [k] error_exit
     0.61%      tpf  [kernel.kallsyms]  [k] restore_args
     0.52%      tpf  [kernel.kallsyms]  [k] sys_time
     0.49%      tpf  [kernel.kallsyms]  [k] write_ok_or_segv
```

The dereference and function call on that special page at 0xffffffffff600000 causes the kernel to trigger a page_fault, since that is not memory we should be executing from. This invokes do_page_fault, which eventually calls __bad_area_nosemaphore (from arch/x86/mm/fault.c):

```c
static void
__bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
		       unsigned long address, int si_code)
{
	struct task_struct *tsk = current;

	/* User mode accesses just cause a SIGSEGV */
	if (error_code & PF_USER) {
		/*
		 * It's possible to have interrupts off here:
		 */
		local_irq_enable();

		/*
		 * Valid to do another page fault here because this one came
		 * from user space:
		 */
		if (is_prefetch(regs, error_code, address))
			return;

		if (is_errata100(regs, address))
			return;

#ifdef CONFIG_X86_64
		/*
		 * Instruction fetch faults in the vsyscall page might need
		 * emulation.
		 */
		if (unlikely((error_code & PF_INSTR) &&
			     ((address & ~0xfff) == VSYSCALL_START))) {
			if (emulate_vsyscall(regs, address))
				return;
		}
#endif
		/* Kernel addresses are always protection faults: */
		if (address >= TASK_SIZE)
			error_code |= PF_PROT;

		if (likely(show_unhandled_signals))
			show_signal_msg(regs, error_code, address, tsk);

		tsk->thread.cr2 = address;
		tsk->thread.error_code = error_code;
		tsk->thread.trap_nr = X86_TRAP_PF;

		force_sig_info_fault(SIGSEGV, si_code, address, tsk, 0);

		return;
	}

	if (is_f00f_bug(regs, address))
		return;

	no_context(regs, error_code, address, SIGSEGV, si_code);
}
```

So we've accessed a bad area (kernel memory) while trying to execute an instruction there. User accesses like this are supposed to segfault, but on x86_64 (which we are), an instruction fetch fault in the vsyscall page instead gets its vsyscall functionality emulated:

```c
bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
{
	struct task_struct *tsk;
	unsigned long caller;
	int vsyscall_nr, syscall_nr, tmp;
	int prev_sig_on_uaccess_error;
	long ret;

	/*
	 * No point in checking CS -- the only way to get here is a user mode
	 * trap to a high address, which means that we're in 64-bit user code.
	 */

	WARN_ON_ONCE(address != regs->ip);

	if (vsyscall_mode == NONE) {
		warn_bad_vsyscall(KERN_INFO, regs,
				  "vsyscall attempted with vsyscall=none");
		return false;
	}

	vsyscall_nr = addr_to_vsyscall_nr(address);

	trace_emulate_vsyscall(vsyscall_nr);

	if (vsyscall_nr < 0) {
		warn_bad_vsyscall(KERN_WARNING, regs,
				  "misaligned vsyscall (exploit attempt or buggy program) -- look up the vsyscall kernel parameter if you need a workaround");
		goto sigsegv;
	}

	if (get_user(caller, (unsigned long __user *)regs->sp) != 0) {
		warn_bad_vsyscall(KERN_WARNING, regs,
				  "vsyscall with bad stack (exploit attempt?)");
		goto sigsegv;
	}

	tsk = current;

	/*
	 * Check for access_ok violations and find the syscall nr.
	 *
	 * NULL is a valid user pointer (in the access_ok sense) on 32-bit and
	 * 64-bit, so we don't need to special-case it here.  For all the
	 * vsyscalls, NULL means "don't write anything" not "write it at
	 * address 0".
	 */
	switch (vsyscall_nr) {
	case 0:
		if (!write_ok_or_segv(regs->di, sizeof(struct timeval)) ||
		    !write_ok_or_segv(regs->si, sizeof(struct timezone))) {
			ret = -EFAULT;
			goto check_fault;
		}

		syscall_nr = __NR_gettimeofday;
		break;

	case 1:
		if (!write_ok_or_segv(regs->di, sizeof(time_t))) {
			ret = -EFAULT;
			goto check_fault;
		}

		syscall_nr = __NR_time;
		break;

	case 2:
		if (!write_ok_or_segv(regs->di, sizeof(unsigned)) ||
		    !write_ok_or_segv(regs->si, sizeof(unsigned))) {
			ret = -EFAULT;
			goto check_fault;
		}

		syscall_nr = __NR_getcpu;
		break;
	}

	/*
	 * Handle seccomp.  regs->ip must be the original value.
	 * See seccomp_send_sigsys and Documentation/prctl/seccomp_filter.txt.
	 *
	 * We could optimize the seccomp disabled case, but performance
	 * here doesn't matter.
	 */
	regs->orig_ax = syscall_nr;
	regs->ax = -ENOSYS;
	tmp = secure_computing(syscall_nr);
	if ((!tmp && regs->orig_ax != syscall_nr) || regs->ip != address) {
		warn_bad_vsyscall(KERN_DEBUG, regs,
				  "seccomp tried to change syscall nr or ip");
		do_exit(SIGSYS);
	}
	if (tmp)
		goto do_ret;  /* skip requested */

	/*
	 * With a real vsyscall, page faults cause SIGSEGV.  We want to
	 * preserve that behavior to make writing exploits harder.
	 */
	prev_sig_on_uaccess_error = current_thread_info()->sig_on_uaccess_error;
	current_thread_info()->sig_on_uaccess_error = 1;

	ret = -EFAULT;
	switch (vsyscall_nr) {
	case 0:
		ret = sys_gettimeofday(
			(struct timeval __user *)regs->di,
			(struct timezone __user *)regs->si);
		break;

	case 1:
		ret = sys_time((time_t __user *)regs->di);
		break;

	case 2:
		ret = sys_getcpu((unsigned __user *)regs->di,
				 (unsigned __user *)regs->si,
				 NULL);
		break;
	}

	current_thread_info()->sig_on_uaccess_error = prev_sig_on_uaccess_error;

check_fault:
	if (ret == -EFAULT) {
		/* Bad news -- userspace fed a bad pointer to a vsyscall. */
		warn_bad_vsyscall(KERN_INFO, regs,
				  "vsyscall fault (exploit attempt?)");

		/*
		 * If we failed to generate a signal for any reason,
		 * generate one here.  (This should be impossible.)
		 */
		if (WARN_ON_ONCE(!sigismember(&tsk->pending.signal, SIGBUS) &&
				 !sigismember(&tsk->pending.signal, SIGSEGV)))
			goto sigsegv;

		return true;  /* Don't emulate the ret. */
	}

	regs->ax = ret;

do_ret:
	/* Emulate a ret instruction. */
	regs->ip = caller;
	regs->sp += 8;
	return true;

sigsegv:
	force_sig(SIGSEGV, current);
	return true;
}
```

This basically checks that we aren't trying to write the syscall's return value into memory we have no permissions on (write_ok_or_segv), then calls sys_time, stores the answer in regs->ax, and emulates the rest of the syscall return protocol.

Going back to the test program, let's benchmark it (crudely) so we can see how much slower it really is.

10MM calls to time(NULL) on a broken system (compiled with gcc -std=c99 -O2 -o tpf tpf.c):

```
[rberton@broken_machine test_page_fault]$ time ./tpf 10000000

real	0m6.415s
user	0m2.230s
sys	0m4.046s
```

10MM calls to time(NULL) on a fairly modern Ubuntu:

```
[rberton@ubuntu ~]$ uname -a
Linux ubuntu 3.8.0-35-generic #50-Ubuntu SMP Tue Dec 3 01:24:59 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
[rberton@ubuntu ~]$ time ./tpf 10000000

real	0m0.041s
user	0m0.040s
sys	0m0.000s
```

That's about 156x faster. "A little bit slower" is missing the mark here.