Bug Description

[Impact]

The bug affects multiple users and introduces a user-visible delay (~25 seconds) on SSH connections after a large number of sessions have been processed. This has a serious impact on large systems and servers running our software.

The currently proposed fix is a safe workaround for the bug, as proposed by dbus upstream. The workaround makes uid 0 immune to the pending_fd_timeout limit, which otherwise kicks in and causes the original issue.

[Test Case]

lxc launch ubuntu:x test

lxc exec test -- login -f ubuntu

ssh-import-id <whatever>

Then run a script as follows (passing in ubuntu@<container-ip>):

while true; do

(time ssh "$1" "echo OK > /dev/null") 2>&1 | grep ^real >> log

done

Then check the log file for any SSH sessions that take 25+ seconds to complete.
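
As a minimal sketch of that check: assuming the log holds the `real` lines written by bash's `time` keyword (for example `real 0m25.232s`), the helper below counts the entries at or above a threshold. The function name `count_slow_sessions` and the 25-second default are illustrative and not part of the original test case.

```shell
#!/bin/bash
# Count log entries whose wall-clock time is at or above a threshold (seconds).
# Usage: count_slow_sessions LOGFILE [THRESHOLD_SECONDS]
# (hypothetical helper; the default threshold of 25 comes from the observed delay)
count_slow_sessions() {
    logfile=$1
    threshold=${2:-25}
    awk -v limit="$threshold" '
        $1 == "real" {
            # $2 looks like "0m25.232s": minutes before the "m", seconds after it
            split($2, parts, "m")
            total = parts[1] * 60 + (parts[2] + 0)
            if (total >= limit) slow++
        }
        END { print slow + 0 }
    ' "$logfile"
}
```

Running `count_slow_sessions log` prints the number of logins that took 25 seconds or longer; a non-zero result suggests the delay has been reproduced.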

Multiple instances of the same script can be used at the same time.
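
Running several instances at once could look roughly like the sketch below. It assumes bash, uses a bounded number of logins per worker instead of the endless loop, and introduces the names `timed_logins`, `parallel_logins`, and the `SSH_BIN` override purely for illustration; none of them come from the original test case.

```shell
#!/bin/bash
# Run COUNT timed SSH logins against TARGET, appending the "real" timings to LOGFILE.
# SSH_BIN is a hypothetical override so the loop can be tried without a real host.
timed_logins() {
    local target=$1 count=$2 logfile=$3
    local i
    for ((i = 0; i < count; i++)); do
        # bash's "time" keyword reports on stderr, so group it and redirect
        { time "${SSH_BIN:-ssh}" "$target" "echo OK > /dev/null"; } 2>&1 \
            | grep '^real' >> "$logfile"
    done
}

# Start WORKERS copies of the loop in the background and wait for all of them.
parallel_logins() {
    local target=$1 workers=$2 count=$3 logfile=$4
    local w
    for ((w = 0; w < workers; w++)); do
        timed_logins "$target" "$count" "$logfile" &
    done
    wait
}
```

For example, `parallel_logins ubuntu@<container-ip> 4 500 log` would start four workers of 500 logins each, all appending to the same log file.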

[Regression Potential]

The fix has a rather low regression potential, as the workaround is a very small change affecting only one particular case: the handling of uid 0. The fix has been tested by multiple users and has been in zesty for a while, with multiple people involved in reviewing the change. It is also a change that has been proposed by upstream.

[Original Description]

I noticed on a system that accepts large numbers of SSH connections that after a while, SSH sessions were taking ~25 seconds to complete.

Looking in /var/log/auth.log, systemd-logind starts failing with the following:

Jun 10 23:55:28 test sshd[3666]: pam_unix(sshd:session): session opened for user ubuntu by (uid=0)

Jun 10 23:55:28 test systemd-logind[105]: New session c1052 of user ubuntu.

Jun 10 23:55:28 test systemd-logind[105]: Failed to abandon session scope: Transport endpoint is not connected

Jun 10 23:55:28 test sshd[3666]: pam_systemd(sshd:session): Failed to create session: Message recipient disconnected from message bus without replying
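
A quick way to see how often this failure occurs is to count the matching lines in the auth log. The sketch below assumes the message text shown above and the default path /var/log/auth.log; the function name `count_session_failures` is illustrative.

```shell
#!/bin/bash
# Count pam_systemd "Failed to create session" messages in an auth log.
# Usage: count_session_failures [LOGFILE]   (defaults to /var/log/auth.log)
# (hypothetical helper; the pattern is taken from the log excerpt in this report)
count_session_failures() {
    logfile=${1:-/var/log/auth.log}
    # grep -c prints the number of matching lines, including 0 when none match
    grep -c 'pam_systemd(sshd:session): Failed to create session' "$logfile"
}
```

A count that keeps growing while logins are slow would be consistent with the behaviour described here.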

I reproduced this in an LXD container by doing something like:

lxc launch ubuntu:x test

lxc exec test -- login -f ubuntu

ssh-import-id <whatever>

Then ran a script as follows (passing in ubuntu@<container-ip>):

while [ 1 ]; do

(time ssh $1 "echo OK > /dev/null") 2>&1 | grep ^real >> log

done

In my case, after 1052 logins, the 1053rd and thereafter were taking 25+ seconds to complete. Here are some snippets from the log file:

$ cat log | grep 0m0 | wc -l

1052

$ cat log | grep 0m25 | wc -l

4

$ tail -5 log

real 0m0.222s

real 0m25.232s

real 0m25.235s

real 0m25.236s

real 0m25.239s
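
The login at which the delay first appears can also be read straight from the log. This is a minimal sketch that assumes one `real` line per login in the format shown above; the function name `first_slow_login` and the 25-second default are illustrative.

```shell
#!/bin/bash
# Print the 1-based index of the first login at or above the threshold (seconds).
# Prints nothing if no login was that slow.
# Usage: first_slow_login LOGFILE [THRESHOLD_SECONDS]
first_slow_login() {
    logfile=$1
    threshold=${2:-25}
    awk -v limit="$threshold" '
        $1 == "real" {
            n++
            # $2 looks like "0m25.232s": minutes before the "m", seconds after it
            split($2, parts, "m")
            total = parts[1] * 60 + (parts[2] + 0)
            if (total >= limit) { print n; exit }
        }
    ' "$logfile"
}
```

Run as `first_slow_login log`, this reports the index of the first slow login, which can be compared against the session number in the systemd-logind messages.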

ProblemType: Bug

DistroRelease: Ubuntu 16.04

Package: systemd 229-4ubuntu5

ProcVersionSignature: Ubuntu 4.4.0-22.40-generic 4.4.8

Uname: Linux 4.4.0-22-generic x86_64

ApportVersion: 2.20.1-0ubuntu2

Architecture: amd64

Date: Sat Jun 11 00:09:34 2016

MachineType: Notebook W230SS

ProcEnviron:

TERM=xterm-256color

PATH=(custom, no user)

ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.4.0-22-generic root=/dev/mapper/ubuntu--vg-root ro quiet splash

SourcePackage: systemd

SystemdDelta:

[EXTENDED] /lib/systemd/system/rc-local.service → /lib/systemd/system/rc-local.service.d/debian.conf

[EXTENDED] /lib/systemd/system/systemd-timesyncd.service → /lib/systemd/system/systemd-timesyncd.service.d/disable-with-time-daemon.conf

2 overridden configuration files found.

UpgradeStatus: No upgrade log present (probably fresh install)

dmi.bios.date: 04/15/2014

dmi.bios.vendor: American Megatrends Inc.

dmi.bios.version: 4.6.5

dmi.board.asset.tag: Tag 12345

dmi.board.name: W230SS

dmi.board.vendor: Notebook

dmi.board.version: Not Applicable

dmi.chassis.asset.tag: No Asset Tag

dmi.chassis.type: 9

dmi.chassis.vendor: Notebook

dmi.chassis.version: N/A

dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr4.6.5:bd04/15/2014:svnNotebook:pnW230SS:pvrNotApplicable:rvnNotebook:rnW230SS:rvrNotApplicable:cvnNotebook:ct9:cvrN/A:

dmi.product.name: W230SS

dmi.product.version: Not Applicable

dmi.sys.vendor: Notebook