16 January 2016, 10:52 AM

(Details have been changed to protect the guilty and to make the story more entertaining.)

I work for a company that builds robotic kiosks (think vending machines that cut keys), which are stationed in various stores across the country. One morning, I arrive at work to find a coworker looking troubled. “We’ve got a kiosk in Florida that’s stuck in the middle of shutting down. Can you take a look?” This isn’t my area of expertise, but somebody’s got to deal with this, and it looks like I’m somebody.

I try to SSH into the kiosk, but the connection is refused with the message “The system is going down for power off in 1 minute!” “It’s been saying that since I arrived half an hour ago,” my coworker explains. “It’s not actually going to shut down in a minute.”

I half-recognize this symptom. In Linux, if the file /etc/nologin exists, then all non-root SSH connections will be refused and the contents of the file will be sent in reply. I know this because when I configured the battery backup system, I made sure to disable any option that would create this file during a power outage. Maybe I’m the right somebody after all.
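As a rough illustration of that behavior, here is a small shell sketch of what a non-root login attempt runs into. It is not how sshd or PAM actually implements the check; the function name nologin_message and its optional path argument are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: report whether logins would be blocked.
# Prints the contents of the nologin file and returns 0 if it exists;
# returns 1 if the file is absent. The path argument exists only so the
# function can be tried against a scratch file instead of /etc/nologin.
nologin_message() {
    nologin_file=${1:-/etc/nologin}
    if [ -e "$nologin_file" ]; then
        cat "$nologin_file"
        return 0
    fi
    return 1
}
```

Run on the stuck kiosk, something like this would presumably have printed the same power-off notice that SSH was sending back.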

Our kiosks write major events to a remote log, so I open that up:

    Tue 10:00 PM  Power outage! Continuing on battery power.
    Tue 10:00 PM  Status changed from ALL_OK to DEGRADED.
    Tue 10:33 PM  Shutting down kiosk due to low batteries!
    Wed  7:04 AM  Status changed from DEGRADED to ALL_OK.

My coworker and I exchange glances. “This kiosk shut down last night, but it started up again in the morning. Why does it think it’s still shutting down?”

“…wait,” I say, “It shut down last night! I know this kiosk.” It’s in a large chain store where the electric outlets are on timers. Some pencil pusher at Corporate realized the company could save a few cents by turning off the vending machines when the stores are closed.

“This kiosk loses power every night at 10:00 sharp, and the battery backup kicks in,” I continue. “A little over half an hour later, the battery gives a warning that it’s nearly exhausted, we tell our systems to finish what they’re doing and quit, and we tell the OS to power off in 2 minutes. It usually takes our systems 30 seconds to quit, so the 2 minute buffer is plenty. Then, every morning at 7:00 on the dot, power is restored, the kiosk boots up, and 4 minutes later it checks in with us again.”

“I know. You’ve been telling us for weeks.” My coworker rolls his eyes. “The frequent power outages will prematurely shorten the battery life on this kiosk, but there’s not much we can do. We’re a small company and this kiosk is deployed into a major retailer, so we don’t have much leverage. The outages are not hurting sales because they only happen when the store is closed. If we have to replace the battery more often, it’s not a big deal.”

He’s got a point.

The parts of the system I own didn’t create the nologin file, so let’s see what else happens during a shutdown. I check the man page, and see

    shutdown arranges for the system to be brought down in a safe way. All logged-in users are notified that the system is going down and, within the last five minutes of TIME, new logins are prevented.

That seems relevant. Does it use the nologin file? I run dpkg --search $(which shutdown) to see which software package contains the code that runs during shutdown, and get back upstart: /sbin/shutdown. I then run apt-get source upstart and start combing through the source code.

There it is, right in util/shutdown.c: a function called timer_callback that runs once per minute after a shutdown is requested. Part of it reads:

    /* Write /etc/nologin with less than 5 minutes remaining */
    if (delay <= 5) {
        FILE *nologin;

        nologin = fopen (ETC_NOLOGIN, "w");
        if (nologin) {
            fputs (msg, nologin);
            fclose (nologin);
        }
    }

There’s also a function called shutdown_now, whose third-to-last line is

unlink (ETC_NOLOGIN);

So that’s how the nologin file gets created: once there are 5 minutes or less to go, shutdown rewrites it every minute. When the 2 minutes are up and the system actually shuts down, shutdown_now deletes it again.

Suddenly, all the pieces fall into place. The nightly blackouts had been wearing out the battery, and it had finally reached the point where it couldn’t power the computer for the full 2 minutes. The computer started to shut down last night, which wrote the nologin file, and the battery conked out before the file could be deleted again.
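The race can be reproduced with a toy model in shell. This is only a sketch of the sequence described above, not the real upstart code; simulate_shutdown and its arguments (a scratch path, the seconds of battery left, the shutdown delay) are invented for illustration.

```shell
#!/bin/sh
# Toy model of the race between shutdown and a dying battery.
# $1: path standing in for /etc/nologin
# $2: seconds of battery left when the shutdown is requested (assumed input)
# $3: shutdown delay in seconds (defaults to the 2 minutes from the story)
simulate_shutdown() {
    nologin_file=$1
    battery_seconds=$2
    delay_seconds=${3:-120}

    # Stand-in for timer_callback: the delay is under 5 minutes, so the
    # file is written while the countdown is running.
    printf '%s\n' 'The system is going down for power off!' > "$nologin_file"

    # Battery dies before the delay expires: shutdown_now never runs,
    # nothing unlinks the file, and it survives on disk until next boot.
    if [ "$battery_seconds" -lt "$delay_seconds" ]; then
        return 0
    fi

    # Stand-in for shutdown_now: power lasted, so the file is removed.
    rm -f "$nologin_file"
}
```

Calling it with 90 seconds of battery leaves the file behind; calling it with 150 seconds removes it, which matches the difference between a worn-out battery and a healthy one.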

The fix for this was easy enough: edit /etc/rc.local to delete the nologin file every time we boot up, and change the battery backup config to shut down when the battery has slightly more of a charge remaining. I suppose another option would be to modify shutdown.c not to write the nologin file at all, recompile, and modify our Linux distro to use the custom shutdown utility, but that’s a can of worms I’d rather not open.
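The boot-time half of that fix might look something like the sketch below. The exact lines we added are not reproduced here; clear_stale_nologin and its optional path argument are hypothetical, and the argument is there only so the function can be tried on a scratch file.

```shell
#!/bin/sh
# Hypothetical boot-time cleanup, in the spirit of the /etc/rc.local fix.
# Assumption: if the machine is booting, any nologin file is presumably
# left over from a shutdown that never finished, so it can be removed.
clear_stale_nologin() {
    # -f: succeed quietly when the file is already gone (the normal case).
    rm -f "${1:-/etc/nologin}"
}
```

In /etc/rc.local itself this boils down to a single rm -f /etc/nologin line placed before the script exits.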

So now that we have the fix, how do we roll it out? We can’t SSH into the kiosk. We can’t push out updates without an SSH connection. Although the nologin file only prevents non-root SSH sessions, our version of Linux also disables all direct root logins. If we had physical access to the kiosk, we could log in without SSH, but we’re in New York and it’s in Florida.

Do you see the elegant solution? We didn’t. We hired one of our contractors in Florida to visit the kiosk, but they weren’t available until the next day.

That evening at 10:00 PM sharp, the power went out at the kiosk. A little over half an hour later, the battery was nearly exhausted and the kiosk started its shutdown. This time, though, the battery lasted the full 2 minutes, and the kiosk deleted its nologin file. The next morning, at 7:00 on the dot, it regained power, booted up, and by 7:04 we were able to log in again. By the time the contractor arrived, everything was fixed.

If this sort of thing interests you and you’re looking for a job in New York City, KeyMe is currently hiring (early 2016). Send a cover letter and résumé to alan@key.me if you have experience in any of these areas: robotics, computer vision, machine learning, Linux, Python, Ruby on Rails, C++.