by Lukas Pustina

Previously on OpenStack Crime Investigation: I was called to a crime scene, our OpenStack-based private cloud for CenterDevice. Somebody or something was causing our virtual load balancers to flap their highly available IP address. tcpdump showed me that there were time gaps between the VRRP keepalive packets, and that these gaps originated at the first hop, the master load balancer.

After discovering that the culprit might be hiding inside the virtual machine, I fetched my assistant sysdig and we ssh’ed into loadbalancer01. Together, we started to ask questions, very intense, uncomfortable questions, to finally get to the bottom of this.

sysdig is a relatively new tool that traces all syscalls and uses Lua scripts, called chisels, to draw conclusions from the captured sequence of syscalls. In essence, it is like tcpdump for syscalls. Since every process that wants to do anything useful has to use syscalls, a very detailed view of your system’s behavior emerges. sysdig is developing into a serious Swiss Army knife, and you should give it a try; in particular, read these three blog posts to learn about the capabilities that convinced me.

sysdig, being a good detective’s assistant, went to work immediately and kept a transcript of all questions and answers in a file.

$ sysdig -w transcript

I was especially interested in the sendmsg syscall, which the keepalived process uses to actually send its network packets.

After a process invokes a syscall, the process is paused until the kernel has finished the syscall; only then is the execution of the process resumed.

I was interested in the invocation, that is, the event direction from user process to kernel.

$ sysdig -r transcript proc.name=keepalived and evt.type=sendmsg and evt.dir='>'

That stirred up quite some evidence. But I wanted to find the time gaps larger than one second. I drew Lua from my concealed holster inside my jacket and fired a quick chisel right between two sendmsg syscalls:

$ sysdig -r transcript -c find_time_gap proc.name=keepalived and evt.type=sendmsg and evt.dir='>'
370964 17:01:26.375374738 (5.03543187) keepalived > sendmsg fd=13(192.168.205.8->224.0.0.18) size=40 tuple=0.0.0.0:112->224.0.0.18:0
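find_time_gap is a Lua chisel, but its core logic is simple enough to sketch in a few lines of Python (an illustrative re-implementation, not sysdig’s actual code): walk the filtered events in order and report neighbours whose timestamps lie further apart than a threshold.

```python
# Sketch of the gap-finding idea behind a chisel like find_time_gap:
# scan a sequence of (event_number, timestamp_ns) pairs and collect
# neighbours that are more than `threshold_s` seconds apart.
# Illustrative re-implementation, not sysdig's Lua chisel.

def find_time_gaps(events, threshold_s=1.0):
    """Return a list of (event_number, gap_seconds) for gaps > threshold_s."""
    gaps = []
    prev_ts = None
    for evt_num, ts_ns in events:
        if prev_ts is not None:
            gap = (ts_ns - prev_ts) / 1e9  # nanoseconds -> seconds
            if gap > threshold_s:
                gaps.append((evt_num, gap))
        prev_ts = ts_ns
    return gaps

# keepalived should send a VRRP advertisement every second, so a ~5 s
# gap sticks out. Timestamps below are the transcript times from above,
# expressed as nanoseconds since midnight.
events = [
    (368250, 61_281_339_942_868),  # 17:01:21.339942868
    (370964, 61_286_375_374_738),  # 17:01:26.375374738
    (371446, 61_286_377_770_247),  # 17:01:26.377770247
]
print(find_time_gaps(events))
```

Fed with the timestamps from the transcript, the sketch reports the same ~5.04 second gap at event 370964.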

Got you! A time gap of more than five seconds turned up. I inspected the events closely, looking around event number 370964.

$ sysdig -r transcript proc.name=keepalived and evt.type=sendmsg and evt.dir='>' | grep -C 1 370964
368250 17:01:21.339942868 0 keepalived (10400) > sendmsg fd=13(192.168.205.8->224.0.0.18) size=40 tuple=0.0.0.0:112->224.0.0.18:0
370964 17:01:26.375374738 1 keepalived (10400) > sendmsg fd=13(192.168.205.8->224.0.0.18) size=40 tuple=0.0.0.0:112->224.0.0.18:0
371446 17:01:26.377770247 1 keepalived (10400) > sendmsg fd=13(192.168.205.8->224.0.0.18) size=40 tuple=0.0.0.0:112->224.0.0.18:0

There, between lines 1 and 2. There it was: a gap much larger than one second. Now it was clear. The perpetrator was hiding inside the virtual machine, causing lags in the processing of keepalived’s sendmsg. But who was it?

After some ruling out of innocent bystanders, I found this:

$ sysdig -r transcript | tail -n +368250 | grep -v "keepalived\|haproxy\|zabbix"
369051 17:01:23.621071175 3 (0) > switch next=7 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369052 17:01:23.621077045 3 (7) > switch next=0 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369053 17:01:23.625105578 3 (0) > switch next=7 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369054 17:01:23.625112785 3 (7) > switch next=0 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369055 17:01:23.628568892 3 (0) > switch next=25978(sysdig) pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369056 17:01:23.628597921 3 sysdig (25978) > switch next=0 pgft_maj=0 pgft_min=2143 vm_size=99684 vm_rss=6772 vm_swap=0
369057 17:01:23.629068124 3 (0) > switch next=7 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369058 17:01:23.629073516 3 (7) > switch next=0 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369059 17:01:23.633104746 3 (0) > switch next=7 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369060 17:01:23.633110185 3 (7) > switch next=0 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369061 17:01:23.637061129 3 (0) > switch next=7 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369062 17:01:23.637065648 3 (7) > switch next=0 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369063 17:01:23.641104396 3 (0) > switch next=7 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369064 17:01:23.641109883 3 (7) > switch next=0 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369065 17:01:23.645069607 3 (0) > switch next=7 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369066 17:01:23.645075551 3 (7) > switch next=0 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369067 17:01:23.649071836 3 (0) > switch next=7 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369068 17:01:23.649077700 3 (7) > switch next=0 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369069 17:01:23.653073066 3 (0) > switch next=7 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369070 17:01:23.653078791 3 (7) > switch next=0 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369071 17:01:23.657069807 3 (0) > switch next=7 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369072 17:01:23.657075525 3 (7) > switch next=0 pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369073 17:01:23.658678681 3 (0) > switch next=25978(sysdig) pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
369074 17:01:23.658698453 3 sysdig (25978) > switch next=0 pgft_maj=0 pgft_min=2143 vm_size=99684 vm_rss=6772 vm_swap=0

The processes masked behind PIDs 0 and 7 were consuming the virtual CPU. 0 and 7? Such low process IDs indicate kernel threads! I double-checked with sysdig’s supervisor, just to be sure. I was right.
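If you want to verify this kind of hunch yourself on a Linux machine, a simple heuristic helps: kernel threads expose an empty /proc/&lt;pid&gt;/cmdline, while user-space processes do not. A minimal sketch of that check (a hypothetical helper, Linux-specific):

```python
# Heuristic kernel-thread check on Linux: kernel threads have an empty
# /proc/<pid>/cmdline, user-space processes do not. PID 0 is the idle
# task and has no /proc entry at all, which is why it shows up nameless.
# Illustrative helper, not part of sysdig.
import os

def is_kernel_thread(pid):
    """Return True if /proc/<pid>/cmdline exists and is empty."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            return f.read() == b""
    except FileNotFoundError:
        return False  # no such process

print(is_kernel_thread(os.getpid()))  # the running interpreter is user space -> False
```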

The kernel, the perpetrator? This hard-working, ever-reliable guy? No, I could not believe it. There had to be more to it. And besides, the virtual machine was not under load. So what could the virtual machine’s kernel be doing … except … it was not doing anything at all? Maybe it was waiting for IO! Could that be? Since all hardware of a virtual machine is, well, virtual, the IO wait could only occur at the bare metal level.
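Whether a machine really spends its time waiting for IO can be read from the fifth field of the cpu line in /proc/stat (the same counter that feeds the %wa column in top). A minimal sketch of that check, assuming the standard Linux field order:

```python
# Read the aggregate CPU iowait counter from /proc/stat on Linux.
# The "cpu" summary line lists, in clock ticks:
#   user nice system idle iowait irq softirq ...
# so the iowait counter is the 5th numeric field. Illustrative sketch.

def cpu_iowait_ticks():
    """Return cumulative iowait ticks, or None if /proc/stat is unavailable."""
    try:
        with open("/proc/stat") as f:
            for line in f:
                if line.startswith("cpu "):
                    fields = line.split()
                    return int(fields[5])
    except FileNotFoundError:
        pass
    return None

print(cpu_iowait_ticks())
```

Sampling this counter twice and comparing the deltas on the bare metal host would reveal whether the hypervisor is stuck waiting for its disks.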

I left loadbalancer01 for its bare metal host, node01. On my way, I was trying to wrap my head around what was happening. What is this all about? So many proxies, so many things happening in the shadows. So many false suspects. This is not a regular case. There is something much more vicious happening behind the scenes. I must find it …

Stay tuned for the all-new, final episode of OpenStack Crime Investigation, coming tomorrow on the codecentric blog.