In the wake of the Ars Technica article on NTP vulnerabilities, and Slashdot coverage, there has been sharply increased public interest in the work NTPsec is doing.

A lot of people have gotten the idea that I’m engaged in a full rewrite of the code, however, and that’s not accurate. What’s actually going on is more like a really massive cleanup and hardening effort. To give you some idea how massive, I report that the codebase is now down to about 43% of the size we inherited – in absolute numbers, down from 227KLOC to 97KLOC.

Details, possibly interesting, follow. But this is more than a summary of work; I’m going to use it to talk about good software-engineering practice by example.

The codebase we inherited, what we call “NTP Classic”, was not horrible. When I was first asked to describe it, the first thought that leapt to my mind was that it looked like really good state-of-the-art Unix systems code – from 1995. That is, before API standardization and a lot of modern practices got rolling. And well before ubiquitous Internet made security hardening the concern it is today.

Dave Mills, the original designer and implementor of NTP, was an eccentric genius, an Internet pioneer, and a systems architect with vision and exceptional technical skills. The basic design he laid down is sound. But it was old code, with old-code problems that his successor (Harlan Stenn) never solved. Problems like being full of port shims for big-iron Unixes from the Late Cretaceous, and the biggest, nastiest autoconf hairball of a build system I’ve ever seen.

Any time you try to modify a codebase like that, you tend to find yourself up to your ass in alligators before you can even get a start on draining the swamp. Not the least of the problems is that a mess like that is almost forbiddingly hard to read. You may be able to tell there’s a monument to good design underneath all the accumulated cruft, but that’s not a lot of help if the cruft is overwhelming.

One of the things the cruft was overwhelming was the effort to secure and harden NTP. This was a serious problem; by late last year (2014) NTP was routinely cracked and in use as a DDoS amplifier, with consequences Ars Technica covers pretty well.

I got hired (the details are complicated) because the people who brought me on believed me to be a good enough systems architect to solve the general problem of why this codebase had become such a leaky mess, even if they couldn’t project exactly how I’d do it. (Well, if they could, they wouldn’t need me, would they?)

The approach I chose was to start by simplifying. Chiseling away all the historical platform-specific cruft in favor of modern POSIX APIs, stripping the code to its running gears, tossing out as many superannuated features as I could, and systematically hardening the remainder.

To illustrate what I mean by ‘hardening’, I’ll quote the following paragraph from our hacking guide:

* strcpy, strncpy, strcat: use strlcpy and strlcat instead.
* sprintf, vsprintf: use snprintf and vsnprintf instead.
* In scanf and friends, the %s format without length limit is banned.
* strtok: use strtok_r() or unroll this into the obvious loop.
* gets: use fgets instead.
* gmtime(), localtime(), asctime(), ctime(): use the reentrant *_r variants.
* tmpnam(): use mkstemp() or tmpfile() instead.
* dirname(): the Linux version is re-entrant but this property is not portable.

This formalized an approach I’d used successfully on GPSD – instead of fixing defects and security holes after the fact, constrain your code so that it cannot have defects. The specific class of defects I was going after here was buffer overruns.

OK, you experienced C programmers out there are thinking “What about wild-pointer and wild-index problems?” And it’s true that the achtung verboten above will not solve those kinds of overruns. But another prong of my strategy was systematic use of static code analyzers like Coverity, which actually is pretty good at picking up the defects that cause that sort of thing. Not 100% perfect, C will always allow you to shoot yourself in the foot, but I knew from prior success with GPSD that the combination of careful coding with automatic defect scanning can reduce the hell out of your bug load.

Another form of hardening is making better use of the type system to express invariants. In one early change, I ran through the entire codebase looking for places where integer flag variables could be turned into C99 booleans. The compiler itself doesn’t do much with this information, but it gives static analyzers more traction.

Back to chiseling away code. When you do that, and simultaneously code-harden, and use static analyzers, you can find yourself in a virtuous cycle. Simplification enables better static checking. The code becomes more readable. You can remove more dead weight and make more changes with higher confidence. You’re not just flailing.

I’m really good at this game (see: 57% of the code removed). I’m stating that to make a methodological point; being good at it is not magic. I’m not sitting on a mountaintop having satori, I’m applying best practices. The method is replicable. It’s about knowing what best practices are, about being systematic and careful and ruthless. I do have an advantage because I’m very bright and can hold more complex state in my head than most people, but the best practices don’t depend on that personal advantage – its main effect is to make me faster at doing what I ought to be doing anyway.

A best practice I haven’t covered yet is to code strictly to standards. I’ve written before that one of our major early technical decisions was to assume up front that the entire POSIX.1-2001/C99 API would be available on all our target platforms and treat exceptions to that (like Mac OS X not having clock_gettime(2)) as specific defects that need to be isolated and worked around by emulating the standard calls.

This differs dramatically from the traditional Unix policy of leaving all porting shims back to the year zero in place because you never know when somebody might want to build your code on some remnant dinosaur workstation or minicomputer from the 1980s. That tradition is not harmless; the thicket of #ifdefs and odd code snippets that nobody has tested in Goddess knows how many years is a major drag on readability and maintainability. It rigidifies the code – you can wind up too frightened of breaking everything to change anything.

Little things also matter, like fixing all compiler warnings. I thought it was shockingly sloppy that the NTP Classic maintainers hadn’t done this. The pattern detectors behind those warnings are there because they often point at real defects. Also, voluminous warnings make it too easy to miss actual errors that break your build. And you never want to break your build, because later on that will make bisection testing more difficult.

Yet another important thing to do on an expedition like this is to get permission – or give yourself permission, or fscking take permission – to remove obsolete features in order to reduce code volume, complexity, and attack surface.

NTP Classic had two control programs for the main daemon, one called ntpq and one called ntpdc. ntpq used a textual (“mode 6”) packet protocol to talk to ntpd; ntpdc used a binary one (“mode 7”). Over the years it became clear that ntpd’s handler code for mode 7 messages was a major source of bugs and security vulnerabilities, and ntpq mode 6 was enhanced to match its capabilities. Then ntpdc was deprecated, but not removed – the NTP Classic team had developed a culture of never breaking backward compatibility with anything.

And me? I shot ntpdc through the head specifically to reduce our attack surface. We took the mode 7 handler code out of ntpd. About four days later Cisco sent us a notice of a critical DoS vulnerability that wasn’t there for us precisely because we had removed that stuff.

This is why ripping out 130KLOC is actually an even bigger win than the raw numbers suggest. The cruft we removed – the portability shims, the obsolete features, the binary-protocol handling – is disproportionately likely to have maintainability problems, defects and security holes lurking in it and implied by it. It was ever thus.

I cannot pass by the gains from taking a poleaxe to the autotools-based build system. It’s no secret that I walked into this project despising autotools. But the 31KLOC monstrosity I found would have justified a far more intense loathing than I had felt before. Its tentacles were everywhere. A few days ago, when I audited the post-fork commit history of NTP Classic in order to forward-port their recent bug fixes, I could not avoid noticing that a disproportionately large percentage of their commits were just fighting the build system, to the point where the actual C changes looked rather crowded out.

We replaced autotools with waf. It could have been scons – I like scons – but one of our guys is a waf expert and I don’t have time to do everything. It turns out waf is a huge win, possibly a bigger one than scons would have been. I think it produces faster builds than scons – it automatically parallelizes build tasks – which is important.

It’s important because when you’re doing exploratory programming, or mechanical bug-isolation procedures like bisection runs, faster builds reduce your costs. They also have much less tendency to drop you out of a good flow state.

Equally importantly, the waf build recipe is far easier to understand and modify than what it replaced. I won’t deny that waf dependency declarations are a bit cryptic if you’re used to plain Makefiles or scons productions (scons has a pretty clear readability edge over waf, with better locality of information) but the brute fact is this: when your build recipe drops from 31KLOC to 1.1KLOC you are winning big no matter what the new build engine’s language looks like.

The discerning reader will have noticed that though I’ve talked about being a systems architect, none of this sounds much like what you might think systems architects are supposed to do. Big plans! Bold refactorings! Major new features!

I do actually have plans like that. I’ll blog about them in the future. But here is truth: when you inherit a mess like NTP Classic (and you often will), the first thing you need to do is get it to a sound, maintainable, and properly hardened state. The months I’ve spent on that are now drawing to a close. Consequently, we have an internal schedule for first release; I’m not going to announce a date, but think weeks rather than months.

The NTP Classic devs fell into investing increasing effort merely fighting the friction of their own limiting assumptions because they lacked something that Dave Mills had and I have and any systems architect necessarily must have – professional courage. It’s the same quality that a surgeon needs to cut into a patient – the confidence, bordering on arrogance, that you do have what it takes to go in and solve the problem even if there’s bound to be blood on the floor before you’re done.

What justifies that confidence? The kind of best practices I’ve been describing. You have to know what you’re doing, and know that you know what you’re doing. OK, and I fibbed a little earlier. Sometimes there is a kind of Zen to it, especially on your best days. But to get to that you have to draw water and chop wood – you have to do your practice right.

As with GPSD, one of my goals for NTPsec is that it should not only be good software, but a practice model for how to do right. This essay, in addition to being a progress report, was intended as a down payment on that promise.

To support my work on NTPsec, please pledge at my Patreon or GoFundMe pages.