Unexpected roaming fees are the worst. You’re just cruising along, having a jolly old time, and then boom. $20 per megabyte??? Should have read the fine print. Of course, if you had known to read the fine print, you probably would have already known about the roaming fees, and therefore not needed to read the fine print. And so it goes, in life and in ssh.

What, ssh has roaming??? Should have read the fine print. The Qualys Security Advisory is more than thorough. Now that we’ve read the fine print, what can we do differently?

The main bug (ignoring the second overflow for now) is that some sensitive memory was recycled and leaked. The possibility of this happening has been known for some time, and there’s some countermeasures in place, but they’re not foolproof.

First, stdio frees but does not clear its buffers. A sensible choice, since stdio is also used for a lot of not sensitive data and performance is important here. Or is it? Closing files and freeing buffers probably isn’t the fast path. Maybe we can clean up a little better. Or use mmap/munmap like glibc? Maybe worth doing, but I don’t want to see every subsystem in libc develop its own little malloc replacement. We already have quite a few, with their own quirks and bugs. (And future versions of glibc have actually switched from mmap to malloc.)

One level down, free itself already clears the first 2k by default (since 5.6). This was designed as a mitigation against a different class of use after free bugs, and I didn’t consider it worth the performance sacrifice to always clear large buffers. As part of this consideration, I figured that even for data in buffers, most keys are smaller than 2048 bytes.

In my experimentation, the default behavior here should be sufficient to prevent exposure of keys. I’m not sure how Qualys got the key to leak in 5.8. I instrumented ssh to print the contents of the stdio buffer before and after fclose and confirmed there’s no key data there, but I’ve only had a little time to play with it. Need to spend some more time with the exploit.

OpenBSD free also has some randomization features to thwart grooming efforts, but many of these work better for smaller chunks. The behavior of the free page list is still mostly predictable. Of course, even then this would depend on whether the attacker gets one shot or unlimited. Exploiting the roaming bug looks noisy, but perhaps goes unnoticed?

One level up, perhaps ssh should avoid stdio for reading keys? Independently mmap some memory and read into it? This would provide separation, but I don’t know. Akamai proposed something like this after Heartbleed, but it took a few tries to get correct, and reinventing stdio is also a great way to introduce new bugs. If you can’t trust your tools, get better tools, or something. As a sidenote, signify doesn’t use stdio, although I can’t claim that was the result of some prescient analysis.

I was happy to see that the effort spent pouring explicit_bzero all over ssh actually paid off and closed one potential exploit vector. So on the one hand, it seems like there will always be another leak somewhere (stdio in this case), but on the other it seems if we keep plugging away we might actually plug them all.

On the memory integrity mitigation side of things, I think we did alright. It’s hard to prevent bugs before you know what they are, and sometimes the chips don’t land the way we like. Certainly, we need to very carefully consider what’s happening in stdio, though I still need to reproduce the exploit before we make any changes.

The leak was also mitigated if one used ssh-agent, since then only encrypted keys were leaked. One idea that’s come up is ssh could always start a mini-agent to process encrypted keys, so that leaks in the main process have nothing to disclose. Privilege separation has unfortunately focused mostly on the server side and on reducing the damage of remote code execution, overlooking some other classes of bugs.

Another salient point revealed is that legacy compat code always comes back to bite. The ssh client was loading private keys, even if it didn’t need them, because the code to load public keys first loads the private key, then ignores it, then adds the .pub extension and tries again. All because 900 years ago, the long deprecated SSH1 stored the public key in the same identity file as the private key. When the code was adapted for separate files, the private key name was retained as the canonical name for the keypair. Damn.

If I have to choose, I must say that I’d rather have my experimental feature infoleaks be post auth client side shaped instead of pre auth server side, but I’d quite happily settle for fewer stupid experiments overall. Although, now that the word is out, it sounds like roaming would have been a feature people found useful. Could have converted a few mosh users, anyway.

I did like somebody’s suggestion that ssh wouldn’t have bugs if it were developed in git.

By the way, if you’re rotating keys as a precaution or even if you’re not, it would be a good time to revisit using the new private key format which offers much better protection against password cracking of encrypted keys.