Using FarmGuardian to enable HA on Back-ends in Zen Load Balancer

We’ve been using the Zen Load Balancer Community Edition in production for almost a year now and it has been working great. I previously wrote a blog post about installing and configuring Zen, and now it was time to look at the HA aspect of the back-end servers defined in the various Zen farms. Zen itself is quite easy to set up in HA mode: you simply configure two separate Zen servers according to Zen’s own documentation. This is all very nice, and it works as it should. The thing that confused me the most (until now), however, is the HA aspect of the back-ends. I somehow assumed that if you specify two back-ends in Zen and one of them fails, Zen automatically uses the back-end that is still working and marked as green (status dot). Well, this isn’t the case. I don’t know if I should blame myself or the poor documentation – or both. Anyway, an example is probably better. Here’s an example of L4xNAT farms for Exchange (with two back-ends):

zen_farm_table2017

I guess it’s quite self-explanatory; we’re load balancing the “normal” port 443 plus IMAP and SMTP. (Not all of the SMTP ports are open to the Internet though, only towards our 3rd party SMTP server.) The HTTP farm is used for HTTP to HTTPS redirection for OWA.

Furthermore, expanding the Exchange-OWAandAutodiscover-farm:

zen_owa_and_autodiscover_farm2017

 

and the monitoring part of the same farm:

zen_owa_and_autodiscover_farm_monitoring2017

 

This clearly shows that the “load balancing part” of Zen is working – the load is evenly distributed. You can also see that the status is green on both back-ends. Fine. Now you would THINK that the status turns RED when a back-end is down, and that all traffic would then flow through the other server. Nope. Not happening. I was living in that illusion though 😦 As I said before, this is probably a combination of my own lack of knowledge and poor documentation. Also, as far as I know there are no clear rules for which farm type you should use when building farms; Zen’s own documentation seems to favour l4xnat for almost everything. With HTTP farms you get HA on the back-ends out of the box (you can specify back-end response timeouts and checks for resurrected back-ends, for example). Then again, you’ll also have to use SSL offloading with an HTTP farm, which is a whole different chapter/challenge when used with Exchange. With an l4xnat farm you do NOT get HA on the back-ends out of the box – you have to use FarmGuardian instead. Yet another not-so-well-documented feature of Zen.

FarmGuardian “documentation” is available at https://www.zenloadbalancer.com/farmguardian-quick-start/. Have a look for yourself and tell me if it’s obvious how to use FarmGuardian after reading.

Luckily I found a few hits on Google (not that many) from people trying to achieve something similar:

https://sourceforge.net/p/zenloadbalancer/mailman/message/29228868/
https://sourceforge.net/p/zenloadbalancer/mailman/message/32339595/
https://sourceforge.net/p/zenloadbalancer/mailman/message/27781778/
https://sourceforge.net/p/zenloadbalancer/mailman/zenloadbalancer-support/thread/BLU164-W39A7180399A764E10E6183C7280@phx.gbl/

These gave me some ideas. Well, I’ll spare you the pain of googling and instead I’ll present our (working) solution:

zen_owa_and_autodiscover_farm_with_farmguardian_enabled2017

First off, you’ll NEED a working script or command for the check part. Our solution is a script that checks that every virtual directory is up and running on each Exchange back-end. If one is NOT, the “broken” back-end is put in down mode and all traffic flows through the other (working) one instead. I chose 60 seconds for the check interval, as Outlook times out after one minute by default (if a connection to the Exchange server can’t be established). Here’s the script, which is based on a script found at https://gist.github.com/phunehehe/5564090:

zen_farmguardian_script2017

Big thanks to the original script writer and to my workmate who helped me modify the script. Sorry, it’s only available in “screenshot form”.
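For those who don’t want to type from a picture: here’s a rough sketch of what such a check could look like. This is NOT a copy of the script in the screenshot, just a reconstruction based on the description above – the list of virtual directories and the curl options are my own assumptions, so adjust them to your environment. The important part is that FarmGuardian treats a non-zero exit code as “back-end down”.

  #!/bin/bash
  # check_multi_utl.sh - sketch/reconstruction, NOT the original screenshotted script.
  # FarmGuardian runs this against each back-end and marks it down on a non-zero exit code.
  # Usage: ./check_multi_utl.sh <exchange-backend-ip>

  HOST="$1"
  [ -z "$HOST" ] && { echo "Usage: $0 <exchange-backend-ip>"; exit 2; }

  # Virtual directories to probe (assumption - adjust to your environment)
  VDIRS="/owa /ecp /EWS/Exchange.asmx /Microsoft-Server-ActiveSync /Autodiscover/Autodiscover.xml /OAB /rpc"

  for VDIR in $VDIRS; do
      # -k because the certificate won't match the raw back-end IP.
      # Any HTTP status code (200/301/302/401/403...) counts as "alive";
      # "000" means curl got no response at all within the timeouts.
      CODE=$(curl -k -s -o /dev/null --connect-timeout 5 --max-time 10 \
             -w '%{http_code}' "https://$HOST$VDIR")
      if [ "$CODE" = "000" ]; then
          echo "$(date) https://$HOST$VDIR is not responding"
          exit 1
      fi
  done

  exit 0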

You can manually test the script by running ./check_multi_utl.sh “yourexchangeserverIP”  from a Zen terminal:

zen_farmguardian_script_manual_testing_from_terminal2017

The default scripts in Zen are located in /usr/local/zenloadbalancer/app/libexec, by the way. This is a good place to stash your own scripts as well.
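For example, something along these lines (assuming you saved the script as check_multi_utl.sh; the back-end address is the same placeholder as above, and as far as I can tell FarmGuardian only cares about the exit code):

  # Copy the check script into Zen's script directory and make it executable
  cp check_multi_utl.sh /usr/local/zenloadbalancer/app/libexec/
  chmod +x /usr/local/zenloadbalancer/app/libexec/check_multi_utl.sh

  # Quick manual test against one back-end; a non-zero exit code means "down"
  /usr/local/zenloadbalancer/app/libexec/check_multi_utl.sh yourexchangeserverIP
  echo "exit code: $?"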

 

You can find the logs in /usr/local/zenloadbalancer/logs. Here’s a screenshot from our log (with everything working):

zen_farmguardian_log2017
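If you’d rather follow the log live from a terminal instead of the web interface, something like this works (the exact log file name for your farm is something you’ll have to check in the directory first):

  # See which log files exist, then follow the one belonging to your farm
  ls /usr/local/zenloadbalancer/logs
  tail -f /usr/local/zenloadbalancer/logs/<your-farm-log-file>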

 

And lastly I’ll present a couple of screenshots illustrating how it looks when something is NOT OK:

(These screenshots are from my own virtual test environment, I don’t like taking down production servers just for fun 🙂 )

zen_owa_and_autodiscover_farm_monitoring_host_down2017

FarmGuardian will react and present a red status symbol. In this test I took down the OWA virtual directory on ex2. When the problem is fixed, the status will return to normal (green dot).

 

and in the log:

zen_farmguardian_log_when_failing2017

The log will tell you that the host is down.

 

Oh, and as a bonus for those of you wondering how to do an HTTP to HTTPS redirect in Zen:

zen_http_to_https_redirect2017

Create a new HTTP farm and leave everything at its defaults. Add a new service (name it whatever you want) and then just add the rule for the redirection. Yes, it’s actually that simple. At least after you find the documentation 🙂
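If you’re curious about what this ends up as under the hood: as far as I know, Zen’s HTTP farms are based on Pound, so the farm’s configuration file (somewhere under /usr/local/zenloadbalancer/config) should contain something roughly like the block below. The file name, service name and target URL here are purely illustrative – the GUI settings in the screenshot above are what we actually used.

  ListenHTTP
      # ...defaults generated when the farm was created...
      Service "redirect"
          # Send every plain HTTP request to the HTTPS OWA address
          Redirect "https://exzen.ourdomain.com/owa"
      End
  End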

And there you have it. Both the Zen servers AND the back-ends working in HA-mode. Yay 🙂

Load Balancing Exchange 2013 (CAS) with clustered (Zen) Load Balancers

I decided to skip my post about Exchange Database Availability Groups (DAGs), as all the information needed was already very well documented. All I did was follow the excellent guide from exchangeserverpro.com, Installing an Exchange Server 2013 Database Availability Group, and I got my DAG up and running in no time. We already have a working DAG environment up and running as well.

The part that needed some extra attention was high availability/load balancing, mainly load balancing the CAS. (DAGs are more or less highly available/failover tolerant by design, but if the CAS goes down they’re rather useless.)

UPDATE 22.2.2017: I published a new blog post about Using FarmGuardian to enable HA on Back-ends in Zen Load Balancer. Use it as a complement to this guide.

I’ll start off by making illustrations of the current and soon-to-be situations.

Current situation

 

exchange_current_setup

Fig 1. Current setup

  • 1 Exchange server used as CAS proxy/redirect.
    • No user mailboxes
    • Not actually needed; we could also use DNS round robin. The server was originally meant to replace the Exchange 2010 server… then plans changed. I won’t go into any details here.
    • Will be taken out of production and replaced by the Zen Load Balancing cluster
    • Single point of failure
  • 1 server running Exchange 2010. Existing users will be migrated from this server to the (two) Exchange 2013 servers very soon (once the certificate issues are fixed).
    • When this is done, the server will be taken out of production.
  • 2 Exchange 2013 servers, running in two different physical locations (though in the same domain).
    • DAG is used between the two servers
    • All users will be moved to these two servers
  • Single namespace

 

Soon-to-be situation

Goals:

  • No (CAS) single point of failure
  • Only Exchange 2013 servers in the environment

 

exchange_soon_to_be_setup

Fig 2. Soon-to-be setup

  • 2 Clustered Zen Load Balancers (in different physical locations)
  • 2 Exchange 2013 servers (the existing ones from Fig 1)
  • Single namespace (exzen)

This whole setup might seem a bit small, but at the moment it’s sufficient. We’ll expand when the need arises. Only calendars and contact information are currently stored on the Exchange servers; email is handled by our 3rd party IMAP server.

NOTE! If you are using VMware in your environment, be sure to check this information:

https://keepingitclassless.net/2013/04/virtual-routing-part-2-fhrp-issues-in-vmware-vsphere/

It will save you time and nerves when trying to figure out why replication isn’t working.

 

Installing Zen Load Balancing Cluster

Introduction

I was given the task of finding a good load balancing solution for Exchange. In its simplest configuration you could just use DNS round robin (http://exchangeserverpro.com/exchange-2013-client-access-server-high-availability/), but I wasn’t too convinced by this idea. Round robin seemed more like the poor man’s load balancer. That said, it will work. I even gave it a go in my test environment, but the clients got a little (too) confused. Not good. I decided to move on to a “real” load balancer.

Luckily a Layer 4 solution works fine with Exchange 2013, so there’s no need for a more complex Layer 7 solution. Keeping it simple is key. Some facts, quoted from the article linked below:

“In Exchange Server 2010 it’s often said that the main skill administrators needed to learn was how to deploy and manage load balancing. The concept of the RPC Client Access Array, the method used to distribute MAPI traffic between Client Access Servers, was a common area of pain. Modern advances in Layer 7 load balancing also allowed for SSL offload, service level monitoring and load balancing and intelligent affinity using cookies to mitigate against some of Exchange 2010’s shortcomings.

What we’re getting at is the two key improvements in Exchange 2013 that make load balancing suddenly quite simple. HTTPS-only access from clients means that we’ve only got one protocol to consider, and HTTP is a great choice because its failure states are well known, and clients typically respond in a uniform way.”

Source: http://www.msexchange.org/articles-tutorials/exchange-server-2013/high-availability-recovery/introducing-load-balancing-exchange-server-2013-part1.html

That said, there are many load balancing solutions available out there. Windows Network Load Balancing (WNLB) is one example that comes to mind, but it has limitations:

“WNLB can’t be used on Exchange servers where mailbox DAGs are also being used because WNLB is incompatible with Windows failover clustering. If you’re using an Exchange 2013 DAG and you want to use WNLB, you need to have the Client Access server role and the Mailbox server role running on separate servers. “

Source: https://technet.microsoft.com/en-us/library/jj898588%28v=exchg.150%29.aspx

That’s no good. After some investigation I found a webpage that compares open-source load balancers: http://wso2.com/library/articles/2014/03/wso2-elb-vs-other-open-source-load-balancers/. This was promising: Linux-based load balancers that use very few resources in our data center – and they’re free as well. Brilliant.

I started out by testing HAProxy, as I had heard some good things about it. Well… long story short: it turned out to be a bit of a pain to configure, and I didn’t want to waste all of my energy on configuration.

Next candidate: Zen Load Balancer. Oh, what a difference. Very easy to install, very easy to configure, and a nice web interface from which you can configure everything needed. There are many commercial alternatives available as well, but among the open-source options Zen seemed to come very close to them.

 

Preparations

Be sure to use a single namespace in Exchange 2013. This is actually best practice even without a load balancer, and it makes your life easier. If you have no idea what I’m talking about, have a look at the following links for example:

http://3techies.com/?p=194
http://www.msexchangegeek.com/inside-the-exchange-2013-single-namespace-part-1/
http://blogs.technet.com/b/exchange/archive/2014/02/28/namespace-planning-in-exchange-2013.aspx
http://blog.netwrix.com/2014/03/21/configuring-exchange-2013-for-site-resilience-2/

While you’re at it, have a look at managing certificates as well: http://www.msexchange.org/articles-tutorials/exchange-server-2013/management-administration/managing-certificates-exchange-server-2013-part1.html. We’re using a SAN certificate for autodiscover.ourdomain.com and exzen.ourdomain.com.

DNS: Instead of having the clients point to the CAS, point them to the load balancer. (You then configure the load balancer with the real IPs of the back-end Exchange servers.) More info in the next chapter.
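As a minimal sketch (zone-file style, using the names and the virtual IP that show up later in this post – the records in your own zone will obviously differ):

  ; Client-facing namespace points at the Zen virtual IP, not at any single CAS
  exzen          IN  A  10.0.0.61
  autodiscover   IN  A  10.0.0.61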

 

Installation

 

Pictures say more than words, so here you go:

zen_global_view

Fig 3. Zen Load Balancer Global View

 

zen_farms

Fig 4. Zen Farms

 

zen_backend_status

Fig 5. Zen backend status

 

zen_interfaces

Fig 6. Zen Interfaces. Both Zen servers have dual NICs, with one dedicated to the cluster service (eth1).

Zen 1:

  • Physical IP (eth0): 10.0.0.60
  • Virtual IP (virtual network interface): 10.0.0.61
  • Physical IP (eth1): 10.0.0.80
  • Virtual IP (used for cluster service): 10.0.0.85

Zen 2:

  • Physical IP (eth0): 10.0.0.70
  • Virtual IP (virtual network interface): 10.0.0.61
  • Physical IP (eth1): 10.0.0.90
  • Virtual IP (used for cluster service): 10.0.0.85

 

zen_rsa_between_hosts

Fig 7. RSA communication between cluster hosts

 

zen_cluster1

Fig 8. Zen Cluster configuration – success (with failover).

 

zen_master_node

Fig 9. Cluster status – master

 

zen_backup_node

Fig 10. Cluster status – backup

 

DNS settings:

exchange_zen_dns

Fig 11. DNS settings

 

Outlook connection status:

outlook_connection_status

Fig 12. Outlook connection status. Connected to exzen (which points to the Zen Load Balancer in DNS).

 

Testing failover

Failover is luckily easy to test in a virtual environment, as you can just suspend a virtual machine 🙂 I did a test run (suspending zen1) while constantly pinging the virtual IP and keeping an eye on the Outlook connection status window. Here’s the result:

http://youtu.be/DLCMVVw2tN4

  • At 00:11 the ping reply stops for a brief moment (when I hit suspend on zen1)
  • At 00:15 the client notices that the proxy server is down/suspended (ping requests time out)
  • At 00:26 the proxy server replies to ping again (automatic failover to zen2/“backup”)
  • At 00:40 I resume/re-activate zen1
  • At 00:42 Outlook notices this (established/connecting changes state in Outlook Connection Status)
  • At 00:45 the ping reply stops for a brief moment (as zen1 resumes/comes back online after the “failure”)
  • At 00:46 ping requests time out (zen1 is configuring itself to become the “master” again)
  • At 00:55 all connectivity from Outlook to the proxy is temporarily lost
  • At 00:58 the proxy server answers ping requests again
  • At 01:05 the connection to the proxy server is active/established in Outlook again
  • From 01:06 onwards everything is back to normal.

To sum it up: Outlook is only offline for about 10 seconds (00:55 – 01:05), and it never even complains about being offline (no yellow exclamation mark). Not bad. Ten seconds of downtime is something we can definitely live with.

 

Final words

As you can see, everything is working nicely in my test environment. Before going to production there are a couple of things that should be considered:

  • Certificate for the Zen Load Balancer web interface. Is one needed, or is the interface blocked from the outside world?
    • Other security considerations on the load balancer(s), such as open ports etc. The networking team will decide this.
  • After the cluster is up and running in production, test it only on a couple of clients at first.
  • Probably lots more I can’t think of right now…