Environment:

Customer using Exchange Online/Office 365 with no Exchange servers on-prem. Two ADFS 2.0 servers running on Server 2008 R2, enabling them to logon to Exchange Online via SSO (Single Sign On).

Issue:

After rebooting the two ADFS servers post Windows Updates the customer could no longer login to OWA & would receive a “503 Service Unavailable” error message via IIS on the two ADFS servers.

Background:

I have to hang my head in shame with this one as I really should have figured this out sooner. Initial troubleshooting showed that the ADFSAppPool was stopped in IIS. It would start but as soon as you tried accessing it, it would stop again. Nothing at all in the Application or ADFS logs in Event Viewer (more on this poor bit of troubleshooting on my part later).The ADFS service account it was running under looked ok; the App Pool would start & so would the ADFS Service (both running under this account) so it seemed to not be a credential issue (at least I got that part right). I even went as far as to reinstall ADFS & IIS on the non-primary ADFS server in the event it was something in IIS. I was clearly out-classed on this seemingly simple issue.

Resolution:

Because the customer was down & I was scratching my head, I decided to escalate the issue to Microsoft; at which point they resolved the issue in about 5min.

Now before I say the fix I’d just like to say I consider myself a good troubleshooter. I’ve been troubleshooting all manner of Microsoft, Cisco, etc technologies for more than a decade & made a pretty successful career out of it. I even managed to pass both the MCM 2010 & MCSM 2013 lab exams on the 1st attempt; but today was not my day. I spent over 2 hrs on this & I broke the cardinal rule of troubleshooting; I overlooked the simple things. Like many of us do I started digging a hole of deep troubleshooting, expecting this to be an incredibly complex issue; I was looking at SPN’s, SQL Permissions, checking settings in Azure, etc. I should have just looked back up in the sky instead of trying to dig a hole a mile deep but only 3 ft wide, because for some idiotic reason I chose to overlook the System Event logs….

I suppose once I saw nothing in the Application or ADFS logs I just moved on quickly to the next possibility but in a few short minutes the Microsoft Engineer checked the System Logs & saw Event 5021 from IIS stating that the service account did not have Batch Logon Rights (more on the event here). This lead him to look at Group Policy settings & sure enough, there was a GPO allowing only the Domain Admins group to log on as a batch job. (Reference 1 & 2). It seems this setting took effect after the ADFS servers were rebooted post Windows Updates. Not sure how the GPO got there as this solution was working for 2 years beforehand but it certainly was ruining our day today. After the GPO was modified to allow the ADFS service account to log on as a batch job, the issue was resolved after some service restarts.

Moral of the story:

Never overlook the obvious! It’s the best advice I can give to anyone, anywhere, & who has to troubleshoot anything. I’d like to say this is the 1st time this has happened to me but it’s not. Overlooking typos, not checking to see if a network cable is plugged in, not checking to see if a service is started… It happens to the best of us. I suppose overlooking the simple solution is just part of the human condition…..or at least whatever condition I have….. 🙂