As stated in my previous blog posts, we’re using a two node DAG spanning across two datacenters/two AD Sites. The problem with this scenario is the witness server, or should I say the location of the witness server. I learned this the hard way, as we had minor problems with one of our datacenters. It also happened to be the datacenter where the witness server is located. This resulted in unresponsive/non-working email for some users, even though the HA aspect of Exchange via the Load Balancer was working fine.
I’ll borrow some pictures and text from http://searchexchange.techtarget.com/tip/How-Dynamic-Quorum-keeps-Exchange-2013-clusters-running-during-failure as I’m too lazy to rewrite everything. (We’re using dynamic quorum by default btw, as our servers are Windows Server 2012 R2).
“In a planned data center shutdown — where power to the data center and a host is often cut cleanly — we would have the opportunity to change the FSW to a host in the secondary data center. This allows for maintenance, but it does not help the inevitable event where an air conditioning unit overheats one weekend, servers begin to shut down, email stops working — and somebody has to get everything up and running again.
Dynamic Quorum with Windows Server 2012 and Exchange 2013 protects not only against this scenario above, but also against scenarios where the majority of nodes in a cluster fail. In another example, we see that in the primary site, we’ve lost both one Exchange node and the FSW (Fig 1). (This happened to us).
In our example, Dynamic Quorum can protect against a data center failure while the Exchange DAG remains online. This means that when the circumstances are right (we’ll come to that in a moment), a power failure in your primary data center can occur and Exchange can continue to stay up and running. This can even happen for smaller environments without the need to place the FSW in a third site.”
Fig 1. Loss of the first node and FSW with Dynamic Quorum.
“The key caveat is that the cluster must shut down cleanly. In the previous example, where the first data center failed, we relied on a mechanism to coordinate data center shutdown. This doesn’t need to be complicated, and a well-designed data center often will have this built in.
This can also protect against another scenario where there are three-node Exchange DAGs in a similar configuration — with two Exchange nodes present in the first data center and a single node present in a second data center. As the two nodes in the first data center shut down cleanly, Dynamic Quorum will ensure the remaining node keeps the DAG online.”
Some similar information can also be found at:
Well, this would all be too good to be true if it wasn’t for the “the cluster must shut down cleanly” -part. This got me thinking about alternatives. What about a third Exchange server and skipping the witness server altogether? Well, it doesn’t work any better as stated above. It’s the same dilemma if two of the nodes looses power. The solution as I can see it is (briefly) explained in the below article, DAC – Database Activation Coordination mode. This, together with an alternative witness server is the recipe for a better disaster plan. With DAC and an alternative witness server in place, you can force the exchange servers in a AD-Site to connect to the local witness server. It requires some manual work (in case disaster strikes) though, but it’s doable.
So, what’s up with the DAC mode and the alternative witness server? Lets have a look. First, let’s do some homework and have a look at DAC:
“DAC mode is used to control the database mount on startup behavior of a DAG. This control is designed to prevent split brain from occurring at the database level during a datacenter switchback. Split brain, also known as split brain syndrome, is a condition that results in a database being mounted as an active copy on two members of the same DAG that are unable to communicate with one another. Split brain is prevented using DAC mode, because DAC mode requires DAG members to obtain permission to mount databases before they can be mounted”.
“Datacenter Activation Coordination (DAC) mode has nothing whatsoever to do with failover. DAC mode is a property of the DAG that, when enabled, forces starting DAG members to acquire permission from other DAG members in order to mount mailbox databases. DAC mode was created to handle the following basic scenario:
- You have a DAG extended to two datacenters.
- You lose the power to your primary datacenter, which also takes out WAN connectivity between your primary and secondary datacenters.
- Because primary datacenter power will be down for a while, you decide to activate your secondary datacenter and you perform a datacenter switchover.
- Eventually, power is restored to your primary datacenter, but WAN connectivity between the two datacenters is not yet functional.
- The DAG members starting up in the primary datacenter cannot communicate with any of the running DAG members in the secondary datacenter”.
In short: Enable DAC mode on your Exchange servers if using more than two nodes.
Alternative witness server
Now that we have some basic understanding about DAC, let’s look at the Alternative witness server (AWS):
I think it’s quite well summarized in the first article:
“The confusion lies in the event of datacenter activation; that the alternate file share witness would automatically come online as a means to provide quorum to the surviving DAG members and keep the databases mounted. So in many ways, some people view it as redundancy to the file share witness for an even numbered DAG.
In reality, the alternate file share witness is only invoked when an admin goes through procedures of activating the mailbox servers who lost quorum. DAC mode dramatically simplifies the process and when the “Restore-DatabaseAvailabilityGroup” cmdlet is executed during a datacenter activation, the alternate file share witness will be activated.”
The second article also has some nice overall information about High Availability Misconceptions. I suggest you read it.
In short: Manual labor is required even though you have configured an alternative witness server.
So, what to do when disaster strikes? First, have a look at the TechNet article “Datacenter switchovers”:
Then have a look at:
https://smtpport25.wordpress.com/2010/12/10/exchange-2010-dag-local-and-site-drfailover-and-fail-back/ for some serious deep diving into the subject. This has to be one of the most comprehensive articles about DAG/Failover/DAC/you name it on the Internet.
I’ll summarize the TechNet and the smtpport25 articles into actions:
“There are four basic steps that you complete to perform a datacenter switchover, after making the initial decision to activate the second datacenter:
Terminate a partially running datacenter This step involves terminating Exchange services in the primary datacenter, if any services are still running. This is particularly important for the Mailbox server role because it uses an active/passive high availability model. If services in a partially failed datacenter aren’t stopped, it’s possible for problems from the partially failed datacenter to negatively affect the services during a switchover back to the primary datacenter”.
The sub-chapter Terminating a Partially Failed Datacenter has details on how to do this, and smtpport25 has even more information. If you start reading from “Figure 19” onwards in the smtpport25 article you’ll find this:
In figure 20. Marked in red has the details about started mailbox servers and Stopped Mailbox Servers. Started mailbox servers are the servers which are available for DAG for bringing the Database online. Stopped mailbox Servers are no longer participating in the DAG. There may be servers which are offline or down because of Datacenter failures. When we are restoring the service on secondary site, ideally all the servers which are in primary should be marked as stopped and they should not use when the services are brought online”.
So, in other words we should move the primary servers into Stopped State. To do that, use the PowerShell command:
Stop-DatabaseAvailabilityGroup -Identity DAG1 -Mailboxserver AMBX1 –Configurationonly
Stop-DatabaseAvailabilityGroup -Identity DAG1 -Mailboxserver AMBX2 –Configurationonly
Then, TechNet and smtpport25 have different information:
TechNet tells you to:
“2.The second datacenter must now be updated to represent which primary datacenter servers are stopped. This is done by running the same Stop-DatabaseAvailabilityGroup command with the ConfigurationOnly parameter using the same ActiveDirectorySite parameter and specifying the name of the Active Directory site in the failed primary datacenter. The purpose of this step is to inform the servers in the second datacenter about which mailbox servers are available to use when restoring service”.
The above should be enough if the DAG is in DAC mode (which it is).
Smtpport25 however doesn’t mention DAC mode at all in this case, instead they use the non-DAC mode approach from TechNet, with a little twist:
- First, stop the cluster service on the secondary site/datacenter, Net stop Clussvc
- Then, restore DAG on the secondary site, Restore-DatabaseAvailabilityGroup -Identity DAG01 -ActiveDirectorySite BSite
I honestly don’t know which of the solutions are correct, and I hope I won’t have to find out in our production environment anytime soon 🙂
Next step would be to Activate the Mailboxes Servers, again following different information whether the DAG is in DAC mode or not. I won’t paste all the text here as it is available in the TechNet article.
Then, following on to the chapter Activating Client Access Services:
Activate Client Access services This involves using the URL mapping information and the Domain Name System (DNS) change methodology to perform all required DNS updates. The mapping information describes what DNS changes to perform. The amount of time required to complete the update depends on the methodology used and the Time to Live (TTL) settings on the DNS record (and whether the deployment’s infrastructure honors the TTL).
We do not need to perform this step as we’re using Zen Load Balancer 🙂
And lastly, I won’t copy/paste information regarding Restoring Service to the Primary Datacenter, it’s already nicely written in the TechNet or smtpport25 article. I sure do hope I won’t have to use the commands though 🙂