Health Checking / Monitoring Exchange Server 2013/2016

I‘ve never wrote about monitoring / health checking before so here we go. There are maaaaany different ways of monitoring servers, so I’ll just present my way of monitoring things (in the Exchange environment). If you’re using SCOM or Nagios, you’re already halfway there. The basic checks in SCOM or Nagios will warn you about low disk space and high CPU load and so forth. But what if a Exchange service or a DAG renders errors for example? Exchange is a massive beast to master, so in the end you’ll need decent monitoring tools to make your life easier (before disaster strikes).

We’re using Nagios for basic monitoring which is working great. That said, from time to time I’ve noticed some small problems that Nagios won’t report. I have then resorted to PowerShell commands or the windows event log. These problems would probably have been noticed (in time) if we had decent/additional Exchange-specific monitoring in place. There are Exchange-plugins available for Nagios (I’ve tried a few), but they aren’t as sophisticated as custom PowerShell scripts made by “Exchange experts”. It’s also much easier running a script from Task Scheduler than configuring Nagios. At least that’s my opinion.

Anyhow, our monitoring/health checking consist of three scripts, namely:

add-pssnapin *exchange* -erroraction SilentlyContinue
$body=Get-HealthReport -Server “yourserver” | where {$_.alertvalue -ne “Healthy” -and $_.AlertValue -ne “Disabled”} | Format-Table -Wrap -AutoSize; Send-MailMessage -To “me@ourdomain.com” -From “HealthSetReport@yourserver” -Subject “HealthSetReport, yourserver” -Body ($body | out-string ) -SmtpServer yoursmtpserver.domain.com

 

I have all of these scripts set up as scheduled tasks. You’d think that setting up a scheduled task is easy. Well, not in my opinion. I had to try many different techniques but at least it’s working now.

For Paul’s script I’m using the following settings:

ExServerHealth_schedtask_general

  • For “Triggers” I’m using 02:00 daily.
  • For “Actions” I’m using:
    • Start a program: powershell.exe
    • Add arguments (optional): -NoProfile -ExecutionPolicy Bypass -File “G:\software\scripts\Test-ExchangeServerHealth.ps1” –SendEmail
    • Start in (optional): G:\software\scripts

 

The same method wouldn’t work for Steve’s script though. I used the same “Run with highest privileges” setting, but running the PowerShell command similar to the above wouldn’t work. (This was easily tested running from a cmd promt manually (instead of powershell)). My solution:

  • Triggers: 01:00 every Saturday (yeah, I don’t feel the need to run this every night. Paul’s script will report the most important things anyways).
  • Actions:
    • Start a program: C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe
    • Add arguments (optional): -NonInteractive -WindowStyle Hidden -command “. ‘C:\Program Files\Microsoft\Exchange Server\V15\bin\RemoteExchange.ps1’; Connect-ExchangeServer -auto; G:\software\scripts\Get-ExchangeEnvironmentReport.ps1 -HTMLReport G:\software\scripts\environreport.html -SendMail:$true -MailFrom:environreport@yourserver -MailTo:me@ourdomain.com -MailServer:yoursmtpserver.domain.com
    • Start in (optional): (empty)

 

My own script:

  • Triggers: 00:00 every Sunday (yeah, I don’t feel the need to run this every night. Paul’s script will report the most important things anyways).
  • Actions:
    • Start a program: powershell.exe
    • Add arguments (optional): -NoProfile -ExecutionPolicy Bypass -File “G:\software\scripts\Get-HealthSetReport.ps1”
    • Start in (optional): G:\software\scripts

 

Sources:

http://practical365.com/exchange-server/powershell-script-exchange-server-health-check-report/ (Paul’s script)
http://www.stevieg.org/2011/06/exchange-environment-report/ (Steve’s script)

https://technet.microsoft.com/en-us/library/jj218724%28v=exchg.160%29.aspx
http://www.msexchange.org/kbase/ExchangeServerTips/ExchangeServer2013/Powershell/scheduling-exchange-powershell-task.html
http://blog.enowsoftware.com/solutions-engine/bid/186014/Introduction-to-Managed-Availability-How-to-Check-Recover-and-Maintain-Your-Exchange-Organization-Part-II

 

Exchange also checks its own health. Let me copy/paste some information:

One of the interesting features of Exchange Server 2013 is the way that Managed Availability communicates the health of individual Client Access protocols (eg OWA, ActiveSync, EWS) by rendering a healthcheck.htm file in each CAS virtual directory. When the protocol is healthy you can see it yourself by navigating to a URL such as https://mail.exchangeserverpro.net/owa/healthcheck.htm.
When the protocol is unhealthy the page is unavailable, and instead of the HTTP 200 result above you will see a “Page not found” or HTTP 404 result instead.

Source: http://practical365.com/exchange-server/testing-exchange-server-2013-client-access-server-health-with-powershell/

Further reading: https://blogs.technet.microsoft.com/exchange/2014/03/05/load-balancing-in-exchange-2013/ and the chapter about Health Probe Checking

We have no need to implement this at the “Exchange server-level” though, as these checks are already done in our Zen Load Balancer (described in this blog post). I guess you could call this the “fourth script” for checking/monitoring server health.

 

Reports

So, what all this adds up to are some nice reports. I’ll get a daily mail (generated at 02:00) looking like this (and hopefully always will 🙂 ):

ExServerHealth_screenshot_from_report

Daily report generated from Test-ExchangeServerHealth.ps1 script

 

ExEnvironmentreport_screenshot_from_report

Weekly report generated from Get-ExchangeEnvironmentReport.ps1 script

 

ExServerHealthSet_screenshot_from_report

Weekly report generated from Get-HealthSetReport.ps1 script (The problem was already taken care of 🙂 )

 

As you can see from the screenshots above, these checks are all reporting different aspects of the Exchange server health.

This quite much covers all the necessary monitoring in my opinion and with these checks in place you can probably also sleep better during the nights 🙂 Even though these checks are comprehensive, I’m still planning on even more checks. My next step will be an attempt at Real-time event log monitoring using NSClient++ / Nagios. Actually the Nagios thought was buried and I will instead focus on the ELK stack which is a VERY comprehensive logging solution.