Overview

Much is made about a healthy Active Directory environment being a prerequisite for a healthy Exchange deployment. This can be especially challenging when there are separate teams managing AD & Exchange; meaning sometimes things can slip through the cracks.

Issue

A colleague of mine recently ran into an issue when preparing to deploy Exchange 2013 into an existing Exchange Organization. While running Setup /PrepareAD, the process would fail at about 14%, stating the domain controller is not available. It was determined that the DC holding all of the FSMO roles was in the process of a reboot. At first the assumption was that this was coincidental; possibly the work of the AD team. After the server came back up, /PrepareAD was run again & had the exact same result! So it appeared something that the /PrepareAd process was doing was the culprit. The event logs on the DC gave the below output:

EVENTID: 1000

Faulting application name: lsass.exe, version: 6.3.9600.16384, time stamp: 0x5215e25f

Faulting module name: ntdsai.dll, version: 6.3.9600.16421, time stamp: 0x524fcaed

Exception code: 0xc0000005

Fault offset: 0x000000000019e45d

Faulting process id: 0x1ec

Faulting application start time: 0x01d0553575d64eb5

Faulting application path: C:\Windows\system32\lsass.exe

Faulting module path: C:\Windows\system32

tdsai.dll

Report Id: 53c0474e-c12d-11e4-9406-005056890b81

Faulting package full name:

Faulting package-relative application ID:

EVENTID: 1015

A critical system process, C:\Windows\system32\lsass.exe, failed with status code c0000005. The machine must now be restarted.

The logs were saying that the Lsass.exe process was crashing, leading to the Domain Controller restarting (see image below).

The easiest path of troubleshooting lead towards moving the FSMO roles to another server & seeing if the issue followed it. Setup /PrepareAD was run again & the issue did in fact follow the FSMO roles.

Resolution

It was at this point that I was engaged & I had a feeling this was either a performance issue on the domain controllers or something buggy at play. Before too long I was able to find the below MS KB for an issue that seemed to match our symptoms:

“Lsass.exe process and Windows Server 2012 R2-based domain controller crashes when the server runs under low memory”

http://support.microsoft.com/en-us/kb/3025087

The customer was more than willing to install the hotfix, but we soon realized that we also had to install the prerequisite update package below (which was sizeable):

Windows RT 8.1, Windows 8.1, and Windows Server 2012 R2 update: April 2014

http://support.microsoft.com/en-us/kb/2919355

During this time, the domain controller was also updated to .NET 4.5.2. After all of this was done, Setup /PrepareAD completed successfully. My colleague was 90% certain the hotfix was the fix, but also noted that before the patch the DC’s CPU utilization was consistently running at 60%. After the updates, it now sits in the 20-30% range. So regardless, we saw much better performance & stability after updating the Domain Controllers.

Conclusions

While I understand we can’t all be up to date on our patching 100% of the time, there is some health checking we can do to the environments we manage.

For all Windows servers, I strongly recommend getting a performance baseline of the big 3: Disk, Memory, & CPU. I like to say that you can’t truly say what bad performance is defined by if you don’t have a definition of good performance in the first place. Staying up to date with Windows Updates can greatly help with this. Even though a system may have performed to a certain level at one point in time, doesn’t mean any number of other variable couldn’t have changed since then to result in poor performance today; often times, vendor updates can remedy this.

As for Domain Controllers, they’re one of the easiest workloads to test with, since a new DC can be created with relative ease. You can use a test environment (recommended) or simply deploy Windows updates to a select number of domain controllers & then compare the current behavior with your baseline.

In this customer’s case, these performance/stability issues could have resulted in any number of applications to fail that relied on AD. Some failures may have been silent, while others could’ve been show stoppers like this one.