In-Depth

Fixing Active Directory Disasters: A How-To Guide

SOLUTIONS: 1. How to recreate SYSVOL and junction points when SYSVOL has been deleted from all DCs:

Stop the FRS service on all DCs.

Create the SYSVOL folder tree manually (Mydomain.com stands for the FQDN of your domain):

SYSVOL
   Domain
      DO_NOT_REMOVE_NtFrs_PreInstall_Directory
      Policies
   Staging Areas
   Staging\Domain
   SYSVOL
      Mydomain.com

Set the ACLs on the "DO_NOT_REMOVE_NtFrs_PreInstall_Directory" folder: Administrators (domain admins) and System should both have ONLY "Special Permissions." Then set the "DO_NOT_REMOVE..." directory as Hidden and Read-only.

Create the junction points. Make sure FRS is stopped on the DC where you run these commands:

linkd "%systemroot%\SYSVOL\SYSVOL\mydomain.com" %systemroot%\SYSVOL\DOMAIN

linkd "%systemroot%\SYSVOL\staging areas\mydomain.com" %systemroot%\SYSVOL\Staging\Domain

NOTE: If SYSVOL is not stored on the Windows system disk, replace %systemroot% (typically C:\Windows) in the linkd commands to reflect the path to SYSVOL.

How to Build the Default Domain Policy and Default Domain Controller Policy: If you have backups of the Default Domain Policy and the Default Domain Controller Policy, simply restore them to the proper location in SYSVOL. If you don't have backups, run Microsoft's DCGPOFIX tool from the command line of the Primary Domain Controller. See KB 833783. The tool will prompt you to restore the Default Domain Policy and will ask if you want to restore the Default Domain Controller Policy. You should answer "Yes" to both of the questions. WARNING: This tool creates a brand-new Default Domain Policy and Default Domain Controller Policy -- don't use it if you have a copy of these policies somewhere.

Replicate SYSVOL for this DC by starting FRS:

C:\>net start "File Replication Service"

NOTE: Do NOT use the Burflags procedure. This can cause the SYSVOL directory to disappear. Make sure FRS is working. TIP: Create a text file such as DC1.txt (on DC1) in the SYSVOL\SYSVOL directory (so it's easy to find). Let replication take place. This file should end up in this location on all DCs. Any DC without it is not replicating properly through FRS. Remember, this could be due to an AD replication failure as well.
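The marker-file tip is easy to automate once you have more than a handful of DCs. Here is a minimal Python sketch, not a finished tool. It assumes the SYSVOL share on each DC exposes the SYSVOL\SYSVOL folder, so the marker shows up at \\DC\SYSVOL\DC1.txt. The function names and the DC names are hypothetical, and the `exists` parameter is only there so the check can be dry-run without a network.

```python
import os
from typing import Callable, Iterable, List


def marker_path(dc: str, marker: str) -> str:
    # Assumption: the SYSVOL share maps to the SYSVOL\SYSVOL folder on each DC,
    # so a file dropped there is visible as \\<dc>\SYSVOL\<marker>.
    return "\\\\" + dc + "\\SYSVOL\\" + marker


def dcs_missing_marker(
    dcs: Iterable[str],
    marker: str,
    exists: Callable[[str], bool] = os.path.exists,
) -> List[str]:
    """Return the DCs on which the marker file has not arrived yet."""
    return [dc for dc in dcs if not exists(marker_path(dc, marker))]
```

Run from a Windows machine that can reach the shares, a call such as `dcs_missing_marker(["DC1", "DC2", "DC3"], "DC1.txt")` returns the DCs that still need attention. An empty list only says the file arrived; it says nothing about why a missing DC failed.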

2. How to recreate junction points if the SYSVOL tree exists but junction points don't exist:

Stop FRS:

C:\>net stop "File Replication Service"

Create the junction points. Make sure FRS is stopped on the DC:

linkd "%systemroot%\SYSVOL\SYSVOL\mydomain.com" %systemroot%\SYSVOL\DOMAIN

linkd "%systemroot%\SYSVOL\staging areas\mydomain.com" %systemroot%\SYSVOL\Staging\Domain

NOTE: If SYSVOL isn't stored on the Windows system disk, replace %systemroot% (typically C:\Windows) in the linkd commands to reflect the path to SYSVOL.
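Because the two linkd lines differ only in the SYSVOL location and the domain name, it can help to generate them instead of retyping them on every DC. The Python sketch below only builds the command text; it does not run linkd. The function name is hypothetical, and quoting both arguments is my own choice so that a SYSVOL path containing spaces still works.

```python
import ntpath
from typing import List


def build_linkd_commands(sysvol_root: str, domain_fqdn: str) -> List[str]:
    r"""Return the two linkd command lines for one DC.

    sysvol_root is the folder holding the SYSVOL tree, for example
    %systemroot%\SYSVOL, or D:\SYSVOL when SYSVOL lives on another disk.
    """
    pairs = [
        # Junction under SYSVOL\SYSVOL\<fqdn> pointing at SYSVOL\DOMAIN
        (ntpath.join(sysvol_root, "SYSVOL", domain_fqdn),
         ntpath.join(sysvol_root, "DOMAIN")),
        # Junction under SYSVOL\staging areas\<fqdn> pointing at SYSVOL\Staging\Domain
        (ntpath.join(sysvol_root, "staging areas", domain_fqdn),
         ntpath.join(sysvol_root, "Staging", "Domain")),
    ]
    return ['linkd "{}" "{}"'.format(link, target) for link, target in pairs]
```

For example, `build_linkd_commands(r"D:\SYSVOL", "mydomain.com")` yields the same two commands with the alternate disk already substituted, which covers the NOTE above.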

Lingering Objects

No AD disaster recovery discussion would be complete without a section on Lingering Objects (LOs). LOs are usually the result of some other disaster, but they can also cause a lot of headaches for IT pros on their own. I've found a number of environments where LOs exist -- and have existed for some time -- but have never been cleaned up. This is likely because AD still works, apart from anomalies such as objects showing up in one domain and not in another. LOs are hard to clean up, and the problem mostly applies to multiple-domain forests. To make a long story short, LOs are caused by a DC being inaccessible to other DCs for longer than the tombstone lifetime (TSL) and then coming back online. The TSL default varies with the version of Windows you're using and is customizable. If the DC comes back online after the TSL has expired and garbage collection has purged the deleted objects from the healthy DCs, it can replicate those objects back to the healthy DCs and reanimate them. Typically, this is a problem on global catalog servers (GCs), when read-only objects are replicated back.
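The rule behind LOs is simple date arithmetic, sketched below in Python. The function name is hypothetical, and the TSL is passed in rather than assumed, since the default depends on your Windows version and may have been customized.

```python
from datetime import datetime, timedelta


def exceeds_tombstone_lifetime(
    last_replication: datetime, now: datetime, tsl_days: int
) -> bool:
    """True when a DC has been out of contact for longer than the TSL.

    A DC in this state should be treated as a potential source of
    lingering objects before it is allowed to replicate again.
    """
    return now - last_replication > timedelta(days=tsl_days)
```

A DC that last replicated on January 1 and reappears on August 1 has been away 212 days: over the line with a 180-day TSL, but still inside a 365-day TSL.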

Events 1864, 2042 and 1988 in the Directory Services event log are good indicators of LOs. You can see the related messages in the event logs and in Repadmin /showrepl output.
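If you already export event log entries from your DCs, a few lines of Python can flag the ones worth a closer look. This is only a sketch: the record layout (a dict with "dc" and "id" keys) is a hypothetical export format, not something Windows produces by itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Event IDs named above as indicators of lingering objects.
LO_EVENT_IDS = {1864, 1988, 2042}


def lo_indicators_by_dc(events: Iterable[dict]) -> Dict[str, List[int]]:
    """Group the LO-related event IDs seen on each DC.

    Each event is assumed to be a dict with "dc" and "id" keys.
    """
    found = defaultdict(set)
    for event in events:
        if event["id"] in LO_EVENT_IDS:
            found[event["dc"]].add(event["id"])
    return {dc: sorted(ids) for dc, ids in found.items()}
```

DCs that never logged one of the three events are simply absent from the result.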

When LOs try to replicate, they can cause replication between two DCs to stop. If the very important "Strict Replication Consistency" registry value is set to 1, which means Strict behavior, and a replication partner wants to modify an object that doesn't exist on the DC, replication with that partner is shut off. A very helpful message to this effect shows up in the output of Repadmin /showrepl and Repadmin /replsum, in the Directory Services event log, and in other reports and logs:

The Active Directory cannot replicate with this server because the time since the last replication with this server has exceeded the tombstone lifetime.

There are other messages that are pretty obvious. This is good! It isolates the bad machine so you don't have to clean up all the DCs. I've seen many environments where this registry value is set to "loose" (0), which means the DCs will replicate LOs. Not good. If you have an environment that started out with Windows 2000 and has been upgraded (as opposed to a fresh install of the entire forest) to Windows 2003, 2008, etc., then this setting is probably "loose," as that was the default in Windows 2000. The value is located at:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\NTDS\Parameters
ValueName = Strict Replication Consistency

Thanks to some diligent work on Microsoft's part, LOs went from being a hideous nightmare in Windows 2000 to being fairly easy to clean up in 2003 and later. The key tool is good ol' Repadmin and the /removelingeringobjects switch. Can't find this option in the online help for Repadmin? Try Repadmin /experthelp.

SOLUTION: If you have any Windows 2000 DCs in your environment and they contain LOs, the solution is to replace them (don't upgrade them) with Windows 2008 DCs (assuming you can't get Windows 2003).

For Windows 2003 and later, the short answer is:

Set all DCs to StrictReplicationConsistency = 1. Failure to do this will allow the LOs to keep replicating. Use the Repadmin command to quickly set this on all DCs (add all the DCs in the DC_LIST; see the online help for Repadmin for details):

repadmin /regkey DC_LIST +strict

Use the Repadmin /removelingeringobjects command:

Repadmin /removelingeringobjects <Dest_DC_LIST> <Source DC GUID> <NC> [/ADVISORY_MODE]

Dest_DC_LIST: list of DCs to operate on
Source DC GUID: the DSA GUID of a reliable DC (preferably the PDC)
NC: naming context of the domain the lingering objects exist in
/ADVISORY_MODE: shows what will happen when you execute the command for real, without removing anything

So a sample command would be:

C:\>Repadmin /removelingeringobjects wtec-dc1 f5cc63b8-cdc1-4d43-8709-22b0e07b48d1 dc=wtec,dc=adapps,dc=hp,dc=com

This has to be done on all DCs in the forest and can easily be scripted.
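As a rough idea of what "easily scripted" can look like, here is a Python sketch that generates the Repadmin command lines for a list of DCs. It only produces text for you to review and run yourself; it does not call Repadmin. The function name is hypothetical, and defaulting to /ADVISORY_MODE is my own precaution rather than anything Repadmin requires.

```python
import uuid
from typing import Iterable, List


def build_lo_cleanup_commands(
    dest_dcs: Iterable[str],
    source_dsa_guid: str,
    naming_context: str,
    advisory: bool = True,
) -> List[str]:
    """Build the strict-consistency and LO-removal commands for each DC."""
    # uuid.UUID raises ValueError on a malformed GUID, such as one with a
    # stray space picked up when copying it from a report.
    guid = str(uuid.UUID(source_dsa_guid))
    dcs = list(dest_dcs)

    # First make every DC strict so LOs stop spreading.
    commands = ["repadmin /regkey {} +strict".format(dc) for dc in dcs]

    # Then clean each DC against the reliable source DC.
    for dc in dcs:
        cmd = "repadmin /removelingeringobjects {} {} {}".format(
            dc, guid, naming_context
        )
        if advisory:
            cmd += " /ADVISORY_MODE"
        commands.append(cmd)
    return commands
```

Generate the list once with the default advisory setting, review what Repadmin reports, then generate it again with `advisory=False` for the real run. In a multiple-domain forest, repeat this for each naming context that holds LOs.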

Armageddon: Recovering a Forest When the Root Domain Goes Away with No Backup

This example is from an actual case I worked about a year ago. It was easy to see the glaring design error in this configuration: The root domain had only one DC (see Figure 4). I was called when the single DC in the root domain went down and the company's IT staff couldn't recover it. It had a RAID 5 disk array but, as fate would have it, the IT folks lost two disks from the array. To make matters worse, the backup was 11 months old. A true disaster!



Figure 4. A root domain with only one DC led to disaster.

The child domain had all the user accounts and, interestingly, there was no user outage -- no complaints. My first thought was LOs, but because there were no other DCs in the root domain, there could be no lingering objects in that domain. There could, however, be LOs on the GCs in the child domain.

SOLUTION: We designed the following plan:

1. Set the tombstone lifetime to 365 days so we don't have to risk adding LOs. This is done via the ADSIEdit tool -- modify the TSL attribute (tombstoneLifetime) at: cn=Directory Service,cn=Windows NT,cn=Services,cn=Configuration,dc=mydomain,dc=com
2. Restore the backup to the DC in the root domain.
3. Set the system time on the DC in the forest root domain to the current date/time.
4. Set StrictReplicationConsistency to 1 on all DCs.
5. "Demote" the GC in the child domain to a DC.
6. Do a health check:
   Event logs
   Validate the trust
   Log on from a machine in the root domain using an account in the child domain and vice versa
   Add test users and sites in each domain and see if they replicate to all DCs
7. "Demote" the GC in the child domain to a DC.
8. Let replication take place and update the root DC.
9. Promote at least one DC to GC.
10. Check event logs for errors.
11. Build a second DC for the root domain.
12. Set the TSL to 180 days (minimum).
13. Back up all DCs.
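The distinguished name in the TSL step is the easiest part of this plan to mistype, as it has to end in the DN of your own forest root domain. A small Python sketch can derive it from the FQDN. The function name is hypothetical, and the sketch builds only the string; the edit itself is still made in ADSIEdit.

```python
def directory_service_dn(forest_root_fqdn: str) -> str:
    """Return the DN of the Directory Service object for a forest root domain.

    The tombstone lifetime is stored on this object in the
    Configuration partition.
    """
    labels = [label for label in forest_root_fqdn.strip(".").split(".") if label]
    if not labels:
        raise ValueError("forest root FQDN is empty")
    domain_dn = ",".join("dc=" + label for label in labels)
    return (
        "cn=Directory Service,cn=Windows NT,cn=Services,cn=Configuration,"
        + domain_dn
    )
```

For mydomain.com this returns cn=Directory Service,cn=Windows NT,cn=Services,cn=Configuration,dc=mydomain,dc=com.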

Actually, we did all this in a lab first. Using the current backups of the child domain DCs and the old backup of the root domain DC, we reproduced the environment. Then we executed the procedure just described. The health check in the test environment actually turned up a few DNS errors -- unrelated to this procedure -- so we fixed those and some other issues in the production environment. At that point, we were confident that the restore would work, and it worked without incident. The interesting thing is that we did this during business hours and experienced no outages or complaints from users.

AD disasters are easy to cause, and not always easy to recover from. It's important for any AD administrator to be familiar with the warning signs and pay attention to logs and reports.

Pay attention and avoid disasters -- I hope these tips help!