The IT industry is awash with credible metrics for measuring performance, efficiency, resiliency and a host of other important infrastructure characteristics, and suppliers routinely use and abuse them in their marketing materials.
Reliability statistics, in particular, are often seized on by suppliers as a selling point, with much talk of the “five (or more) nines” of uptime and availability they can offer their customers.
Suppliers love to boast about the high availability of their systems by claiming to have a 99.999% (and counting) uptime service-level agreement (SLA) because it is what prospective buyers want to hear.
And while it may sound great, each additional nine a supplier adds to its uptime and availability score delivers a relatively small increase in reliability, while the cost continues to rise.
To summarise, the higher the uptime figure quoted, the more money it is likely to cost to run.
Reading between the uptime lines
What many user organisations might be unaware of is that, in isolation, SLAs are pretty meaningless as an availability metric or measurement tool. Instead, a supplier’s SLA is better thought of as a complementary layer on top of the company’s internal SLAs.
SLAs require commitment. It is great to have a 99.99% SLA, but if there are no staff available at the time an issue occurs, it will be very hard to meet it.
If the system doesn’t achieve whatever the agreed SLA states, the supplier compensation will be minimal or non-existent, and the terms and conditions spelling out exactly what users are entitled to can be difficult to understand. In short, it is a very grey area.
For example, many of the major cloud providers promise 99.9%-plus availability, but when an outage occurs, refunds tend to be offered in the form of service credits, which often amount to only a percentage of the overall service costs.
For customers with no real disaster recovery or backup plans in place, such an outage may have a big impact on their ability to trade, contributing to lost orders, missed revenue opportunities and (possibly) even reputational damage. And yet the compensation offered by the average public cloud provider rarely covers the cost of the staff overtime required to bring the service back.
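To see why service credits rarely make a customer whole, it helps to put some numbers side by side. The Python sketch below is purely illustrative: the credit tiers, the monthly fee, the 730-hour month and the hourly loss figure are all invented for the example and do not reflect any particular provider's terms, which should always be checked in the contract itself.

```python
# Hypothetical credit schedule: (minimum monthly availability %, credit as % of monthly fee).
# Real providers publish their own tiers; these values are assumptions for illustration.
CREDIT_TIERS = [
    (99.9, 0),    # SLA met: no credit
    (99.0, 10),   # below 99.9% but at least 99.0%
    (95.0, 25),   # below 99.0% but at least 95.0%
    (0.0, 100),   # below 95.0%
]


def monthly_availability(outage_hours: float, hours_in_month: float = 730.0) -> float:
    """Return the availability percentage for a month with the given outage."""
    return 100 * (1 - outage_hours / hours_in_month)


def service_credit(monthly_fee: float, outage_hours: float) -> float:
    """Return the credit owed under the hypothetical tier table above."""
    availability = monthly_availability(outage_hours)
    for threshold, credit_percent in CREDIT_TIERS:
        if availability >= threshold:
            return monthly_fee * credit_percent / 100
    # Only reached if the outage is longer than the month itself.
    return monthly_fee


if __name__ == "__main__":
    fee = 5_000.0             # hypothetical monthly service fee
    outage = 4.0              # hypothetical outage length in hours
    loss_per_hour = 10_000.0  # hypothetical lost revenue per hour of downtime

    print(f"availability: {monthly_availability(outage):.3f}%")
    print(f"credit:       {service_credit(fee, outage):.2f}")
    print(f"lost revenue: {outage * loss_per_hour:.2f}")
```

Under these assumed figures, the credit is a fraction of one month's fee, while the business loss scales with every hour the service is down.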
Another thing to be mindful of is the four-hour SLA response window a lot of suppliers offer up in their terms and conditions. It may sound great, but it does not necessarily guarantee the issue is going to be fixed within four hours.
Instead, it usually means the supplier will acknowledge the issue and start troubleshooting within that time. The systems themselves could be offline for far longer, so it is important to account for that in any disaster recovery planning.
As always, read the agreement document carefully before signing to see what’s being promised.
Even when taken to the extreme, SLAs are no protection against human error causing downtime and availability issues, as we witnessed with the recent spate of banking system availability issues in the UK. In the case of TSB and the botched migration of its IT systems, its SLAs ended up not counting for very much at all.
That said, SLA metrics as a general tool can have human consequences, for example when a company agrees that the availability of a web-based application will adhere to a specific service level.
Over-egging the importance of availability
The business, as a consumer of technology and applications, must stop at some point and think, “Do we really need 99.999% uptime?” This is especially so when an extra nine commands a relatively horrific price tag: 99.999% equates to about 5 minutes 16 seconds of unplanned downtime per year, and moving to 99.9999% trims that to just over 31 seconds.
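The arithmetic behind these figures is easy to check. The following is a minimal Python sketch, assuming a 365.25-day year (the basis on which 99.999% works out at roughly 5 minutes 16 seconds). The function names are illustrative and not taken from any supplier's tooling.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: average year including leap days


def downtime_per_year(availability_percent: float) -> float:
    """Return the seconds of downtime per year allowed by an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


def format_duration(seconds: float) -> str:
    """Format a number of seconds as hours, minutes and seconds (rounded to the second)."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    # Print the permitted annual downtime for three to six nines.
    for nines in (99.9, 99.99, 99.999, 99.9999):
        print(f"{nines}% -> {format_duration(downtime_per_year(nines))}")
```

Running it for three to six nines shows how quickly the absolute saving shrinks: each extra nine removes nine-tenths of whatever downtime is left, which is a smaller and smaller amount of time.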
Meanwhile, it is worth remembering that planned maintenance events, including upgrades and reconfigurations, are not included in those availability stats, so there will be downtime, albeit planned, that exceeds that five-minute window of unplanned outages. That planned downtime also includes occasions when maintenance goes wrong, a scenario most administrators will surely relate to.
Even if a device comes with a 99.9% uptime SLA, that still allows for almost nine hours of downtime, or just over one working day, per calendar year. Unfortunately, it is not possible to ensure that downtime only occurs during off-hours.
Uptime essentially boils down to buying a high level of assurance that devices will not go down at inopportune moments. With that in mind, does the company in question really need five nines availability and the somewhat extreme costs that come with it?
To add further fuel to the fire, the availability relates to a particular device – the ancillary components of a storage array, for example, do not count towards the uptime. If a storage switch fails and prevents storage area network (SAN) access, it is not the SAN’s SLA that is affected.
Obviously storage switches and such should be redundant, and – in effect – the SLA for the system as a whole is only as good as the weakest link.
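The weakest-link point can be made concrete with a little probability. The sketch below assumes component failures are independent, which real systems only approximate, and uses illustrative availability figures rather than any vendor's published numbers. Components that must all work (a chain) multiply their availabilities together, while a redundant set is down only when every member is down at once.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures; inputs are fractions such as 0.999.
    """
    return prod(availabilities)


def redundant_availability(availabilities):
    """Availability of redundant components where any one being up is enough."""
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    array = 0.99999   # hypothetical five-nines storage array
    switch = 0.999    # hypothetical three-nines storage switch

    single_path = series_availability([array, switch])
    dual_path = series_availability([array, redundant_availability([switch, switch])])

    print(f"array + one switch:   {single_path:.6f}")
    print(f"array + two switches: {dual_path:.6f}")
```

With these assumed figures, a single three-nines switch drags the combined availability below 99.9%, whereas a redundant pair of switches brings it back close to the array's own figure.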
Looking at it from a business perspective, it comes down to the SLAs you provide to your clients. The SLAs you receive from your suppliers should be at least equal to, if not better than, those you offer your clients, as it all comes down to the amount of risk a business is prepared to sustain versus the cost.
It is worth bearing in mind that SLAs refer not just to hardware, but to services and applications. As long as the application is available per the agreed SLA, it does not matter what hardware failure occurs behind it.
In summary, SLAs are there to provide a degree of certainty that the item in question will be available when needed. Making intelligent decisions to ensure there are no obvious single points of failure will help guard against most outages at a reasonable cost. It is really only those companies that face severe penalties for downtime that need to worry about five nines and above.