How should uptime be calculated? The answer depends a lot on whom you ask and what their vested interest is; we all have our biases. Early on, we tried to strike the best balance we could in how we record and report uptime and downtime data. Our customers use our data for a variety of purposes (internal reporting, SLAs, advertising, etc.), so we want to be fair while providing data that is consistent, accurate, and meaningful.
For us, there have always been two main issues or points of contention.
We check whether a device is up or down at the specified monitoring interval. We don't check more frequently while the device is down, because doing so would distort the statistical value of the data. It's true that we overstate downtime when we don't detect the end of an outage immediately, but we also overstate uptime when we don't detect the beginning of an outage immediately. Think of the monitoring interval as a random sampling of whether a system is up or down. If you alter the interval while the system is down, you ruin the randomness of the sample.
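To make the sampling argument concrete, here is a minimal sketch in Python. It is not our monitoring code: the function name, the 45-minute outage, and the 60-minute interval are all assumptions chosen for illustration. It tries every possible alignment between the check schedule and the outage and averages the downtime that fixed-interval checks would record.

```python
def recorded_downtime(offset, interval, outage_start, outage_end, horizon):
    """Minutes of downtime recorded by checks at offset, offset + interval, ...

    Downtime is counted from the first failed check to the next
    successful check, which is what a fixed-interval monitor can observe.
    """
    total, down_since = 0, None
    t = offset
    while t <= horizon:
        down = outage_start <= t < outage_end
        if down and down_since is None:
            down_since = t                # outage first noticed at this check
        elif not down and down_since is not None:
            total += t - down_since       # outage seen as over at this check
            down_since = None
        t += interval
    return total


interval = 60                        # minutes between checks (assumed)
outage_start, outage_end = 30, 75    # hypothetical 45-minute outage
horizon = 240                        # long enough to see the recovery

# Try every one-minute alignment of the check schedule against the outage.
results = [
    recorded_downtime(o, interval, outage_start, outage_end, horizon)
    for o in range(interval)
]
average = sum(results) / len(results)
print(sorted(set(results)))  # [0, 60]
print(average)               # 45.0
```

Any single alignment records either 0 or 60 minutes for this outage, so an individual measurement can be off in either direction. Averaged over all alignments, the recorded downtime equals the true 45 minutes, which is why the over- and under-counting balance out as long as the interval stays fixed.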
Here’s an example where the monitoring interval is 60 minutes (which we do not recommend for important systems but it helps the illustration):
In the example, what is the actual uptime percentage during the two-hour period? The URL was only available 50% of the time, right? If we were to start checking the URL very frequently while it was offline, we would record only 30 minutes of downtime and report the uptime as 75%, which would be very inaccurate.
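The arithmetic can be sketched in Python under one assumed timeline that is consistent with the numbers above: checks at 00:00, 01:00, and 02:00, with an outage running from 00:30 to 01:30. Those timings, the one-minute recheck, and every name in the code are assumptions for illustration only.

```python
WINDOW = 120        # minutes in the reporting period (two hours)
INTERVAL = 60       # normal monitoring interval, in minutes
OUTAGE = (30, 90)   # assumed outage: minute 30 up to (not including) minute 90


def is_down(t, outage=OUTAGE):
    """True if the hypothetical URL is unavailable at minute t."""
    start, end = outage
    return start <= t < end


def recorded_downtime(interval, recheck=None, window=WINDOW, outage=OUTAGE):
    """Minutes of downtime a monitor would record over the window.

    interval: minutes between checks after a successful check.
    recheck:  minutes between checks after a failed check
              (defaults to interval, i.e. no speed-up while down).
    """
    recheck = interval if recheck is None else recheck
    t, down_since, total = 0, None, 0
    while t <= window:
        if is_down(t, outage):
            if down_since is None:
                down_since = t               # outage first detected here
            t += recheck
        else:
            if down_since is not None:
                total += t - down_since      # outage seen as over here
                down_since = None
            t += interval
    if down_since is not None:
        total += window - down_since         # still down when the window ends
    return total


def uptime_percent(downtime, window=WINDOW):
    return 100 * (window - downtime) / window


actual = uptime_percent(OUTAGE[1] - OUTAGE[0])
fixed = uptime_percent(recorded_downtime(INTERVAL))
sped_up = uptime_percent(recorded_downtime(INTERVAL, recheck=1))
print(actual, fixed, sped_up)  # 50.0 50.0 75.0
```

With a fixed interval, the outage is detected 30 minutes late and its end is detected 30 minutes late, so the two errors cancel. Speeding up checks while the URL is down removes only the second error, which is what inflates the reported uptime to 75%.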
Our goal is to report a statistically accurate picture of uptime over a period of time. To achieve that we need two things:
Note: You can always click the Check Now button for a device as soon as you resolve an outage and we’ll stop recording downtime. The effect of that over time would be to make your uptime look a little better than it really is, but sometimes you just need to get the Device OK alert right now so you can go back to bed.
By the way, we’re not promoting 1-minute monitoring intervals just because they cost more. We think customers should choose the interval based on how long an outage they can tolerate before receiving an alert.
Also, keep in mind that our Synapse technology allows us to detect many kinds of outages within seconds, regardless of your monitoring interval. If most of your outages are caused by network failures, we can build up accurate uptime statistics very quickly.
The other question is what to do about maintenance time. One school of thought says that if you take your site down for scheduled maintenance, and you’ve informed your visitors about it in advance, then it’s not really downtime and shouldn’t be counted against your uptime percentage. Uptime purists maintain that any time a site is unavailable, it’s downtime as far as the site visitors are concerned.
Our belief is that it really depends on the application, the expectations of the particular set of site users, the organization, the organization’s stakeholders, and on and on.
In an effort to strike the best balance, our approach is to simply exclude maintenance time from the uptime calculation. Here’s an example:
What is the uptime percentage during the time period of 01:00 - 02:00?
We calculate uptime using this formula: Uptime / (Uptime + Downtime)
So time spent in maintenance doesn’t count for you and it doesn’t count against you; it just doesn’t count.
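As a small sketch of the formula, assume a hypothetical hour that contains 40 minutes of uptime, 10 minutes of downtime, and 10 minutes of maintenance. These numbers and the function name are made up for illustration and are not taken from a real report.

```python
def uptime_percent(up_minutes, down_minutes):
    """Uptime / (Uptime + Downtime), expressed as a percentage.

    Maintenance minutes are never passed in, so they appear in neither
    the numerator nor the denominator.
    """
    counted = up_minutes + down_minutes
    if counted == 0:
        return None  # the whole period was maintenance; nothing to report
    return 100.0 * up_minutes / counted


# Hypothetical hour: 40 min up, 10 min down, 10 min maintenance (ignored).
print(uptime_percent(40, 10))  # 80.0
```

For comparison, treating the same 10 maintenance minutes as downtime would give 40 / 60 (about 66.7%), and treating them as uptime would give 50 / 60 (about 83.3%). Excluding them lands between those two positions.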
If you’re reading this, it probably means you care more about your uptime statistics than the average user. You may be interested in publishing your statistics for others to see. If so, here’s how to do that.