SO_REUSEPORT and accept(2) performance

Hi all,

The brief summary (as of 9a1272865c3cb6e079e1554031dcc712a881598b):

- nonblocking accept(2)+kqueue(2) w/ SO_REUSEPORT: 335Kconns/s
- nonblocking accept(2)+kqueue(2) w/o SO_REUSEPORT: 100Kconns/s

On DragonFly, in addition to better load balance, SO_REUSEPORT also gives us more than a 200% performance boost (3.35x the accept rate)!

Since the test runs on a 1000Mbps network, nonblocking accept(2) w/ SO_REUSEPORT has maxed out the network device's input path: netstat -I reports ~1.35Mpps, which matches the theoretical value for a 1000Mbps network (for each connection, the input consists of a 78B SYN, a 66B ACK, a 66B FIN and a 66B ACK).

The testing server hardware:

    CPU: Intel i7-2600 w/ hyperthreading enabled (8 HT)
    NIC: Broadcom 5719 (4 RX queues and 4 TX queues, using MSI-X)

The testing server software is tools/tools/netrate/accept_connect/kq_accept_server:

    kq_accept_server -p 5000 -i 8 [-r]

(8 user space processes accept connections; -r turns on SO_REUSEPORT)

The testing client software is tools/tools/netrate/accept_connect/connect_client:

    connect_client -p 5000 -4 10.0.0.49 -i 64

(64 user space processes do the connect)

    route change -net 10.0.0.0/24 -msl 10
    sysctl net.inet.ip.portrange.last=40000

(these two settings make sure that the client won't run out of local ports)

The network configuration:

                            +---------+
                  |+--- emx | client1 |
                  ||        +---------+
                  ||
                  ||        +---------+
                  |+--- emx | client2 |
+--------+        ||        +---------+
| server | bnx ---+|
+--------+        ||        +---------+
                  |+--- emx | client3 |
                  ||        +---------+
                  ||
                  ||        +---------+
                  |+--- bce | client4 |
                            +---------+

"client1"~"client4" run the testing client software simultaneously as shown above, mainly to generate enough traffic. "server" runs the testing server software.
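For readers who have not used the option, here is a minimal sketch of the per-process listen socket model that -r selects. It is not the kq_accept_server source (which is C); it is a Python illustration, and the names make_listener and accept_ready are hypothetical. It assumes the platform exposes SO_REUSEPORT, and it uses selectors.DefaultSelector, which picks kqueue(2) where that is available.

```python
import selectors
import socket


def make_listener(port, reuseport=True, host="127.0.0.1", backlog=128):
    """Create a nonblocking TCP listen socket, optionally with SO_REUSEPORT."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if reuseport:
        # Must be set before bind() so several sockets can share the port.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(backlog)
    sock.setblocking(False)
    return sock


def accept_ready(listener, selector, timeout=1.0):
    """Wait for readiness, then drain the accept queue without blocking."""
    accepted = []
    if selector.select(timeout):
        while True:
            try:
                conn, _addr = listener.accept()
            except BlockingIOError:
                break  # accept queue is empty for now
            accepted.append(conn)
    return accepted
```

In this model each worker process would call make_listener(5000) itself, register the socket with its own selector, and loop on accept_ready, so no listen socket is shared between workers.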
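The "matches theoretical value" remark in the summary can be sanity-checked with a little arithmetic. The sketch below assumes that the quoted packet sizes exclude the 4B Ethernet FCS and that each frame also costs 8B of preamble/SFD plus 12B of inter-frame gap on the wire; those overheads are my assumption, not something measured in this test. The function names are hypothetical.

```python
LINK_BPS = 1_000_000_000          # 1000Mbps link
FRAME_SIZES = [78, 66, 66, 66]    # SYN, ACK, FIN, ACK in bytes, per connection

# ASSUMPTION: per-frame wire overhead not included in the sizes above:
# 4B FCS + 8B preamble/SFD + 12B inter-frame gap.
WIRE_OVERHEAD = 4 + 8 + 12


def max_conns_per_sec(link_bps=LINK_BPS, frames=FRAME_SIZES,
                      overhead=WIRE_OVERHEAD):
    """Connections/s that fit into the link's input direction."""
    bytes_per_conn = sum(size + overhead for size in frames)
    return (link_bps / 8) / bytes_per_conn


def max_input_pps(link_bps=LINK_BPS, frames=FRAME_SIZES,
                  overhead=WIRE_OVERHEAD):
    """Input packets/s at that connection rate."""
    return max_conns_per_sec(link_bps, frames, overhead) * len(frames)


if __name__ == "__main__":
    print(f"{max_conns_per_sec():.0f} conns/s, {max_input_pps():.0f} pps")
```

Under these assumptions the limit works out to roughly 336Kconns/s and 1.34Mpps, in line with the measured 335Kconns/s and ~1.35Mpps. If the quoted sizes already include the FCS, pass overhead=20, which raises the limit to roughly 351Kconns/s.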
Statistics:

w/ SO_REUSEPORT

    nonblocking accept(2) rate: 335Kconns/s
    NIC interrupt rate: 6000/s on the first 4 HT
    CPU idle time on HTs processing interrupts: ~15%
    CPU idle time on HTs not processing interrupts: ~20%
    Token contention rate: ~500/s
      (mostly TCP listen completion queue pool token and TCP porthash token)

This shows that w/ SO_REUSEPORT:

- We still have CPU time to process more connections.
- There is only minor TCP listen completion queue contention, and this could probably be further reduced by binding each process to a specific CPU.

w/o SO_REUSEPORT

    nonblocking accept(2) rate: 100Kconns/s
    NIC interrupt rate: 6000/s on the first 4 HT
    CPU idle time on HTs processing interrupts: ~5% - 70%
    CPU idle time on HTs not processing interrupts: ~10% - 80%
    Token contention rate: ~20K/s - 600K/s
      (mostly TCP listen completion queue pool token)

This shows that w/o SO_REUSEPORT:

- TCP listen completion queue contention is obviously too high, i.e. we are facing a scaling problem with this TCP listen socket usage model.

Best Regards,
sephe

--
Tomorrow Will Never Die