Cloud availability: third-party measurements needed!

When subscribing to a cloud infrastructure (I'm mainly interested in cloud storage, but the issue concerns computing applications no less than storage), we expect availability no worse than what our in-house infrastructure would provide. However, what do we know about the availability of public cloud platforms? The Amazon policy, for example, states that they will "use commercially reasonable efforts to make Amazon EC2 and Amazon EBS each available with a Monthly Uptime Percentage of at least 99.95%". Though this is not a strictly guaranteed service level, taking it at face value would mean that we can expect slightly better than three-nines availability.
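To make the "nines" concrete, here is a minimal sketch that converts an uptime percentage into the downtime it tolerates. The function name and the 30-day month are my own assumptions for illustration; the actual billing period and the way the provider computes its Monthly Uptime Percentage may differ.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime tolerated over `days` days at the given uptime.

    Assumption: a month of exactly `days` days, not the provider's own
    definition of the measurement period.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


# Compare three nines, the quoted 99.95%, and four nines.
for level in (99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(level)
    print(f"{level}% uptime -> {minutes:.1f} minutes of downtime per month")
```

Under these assumptions, 99.95% leaves room for roughly 21.6 minutes of downtime in a 30-day month, against about 43.2 minutes for plain three nines.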

Even though this is not stellar performance, reports of long outages that far exceed these expectations appear from time to time (see, e.g., the "Another Amazon Outage Exposes the Cloud's Dark Lining" piece on Bloomberg Business Week). Many of these outage reports, published in the general press as well as on specialized websites, have been collected and analyzed in the papers "Downtime statistics of current cloud solutions" and "The availability of cloud-based services: is it living up to its promise?", which highlight the possibility of availability much worse than the advertised figure (if any is advertised at all).

Unfortunately, those studies are based on reported outages, so they cannot provide a very accurate statistical picture of the phenomenon. A recent paper, "The Need for End-to-End Evaluation of Cloud Availability", presented at the PAM 2014 conference, provides some interesting results based on active measurements. The paper compares two approaches, based respectively on ICMP echoes (measurements carried out at the network level) and on HTTP requests (measurements carried out end to end). It turns out that the HTTP method is more accurate, especially for cloud storage, where it takes into account the actual availability of the whole back-end infrastructure.
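As an illustration of what an end-to-end check looks like, here is a minimal sketch of a single HTTP probe using only the Python standard library. It is not the tool used in the paper: the function name, the timeout, the rule that any 2xx/3xx status counts as "available", and the injectable `opener` parameter are all my own assumptions.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Issue one HTTP request and return (available, elapsed_seconds).

    Assumption: the service counts as available when the request completes
    with a 2xx or 3xx status within `timeout` seconds. `opener` is a
    hypothetical hook so the probe can be exercised without a network.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            response.read(1)  # make sure the back end actually serves data
            available = 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, DNS failures, refused connections
        # and timeouts alike.
        available = False
    return available, time.monotonic() - start
```

Calling `http_probe` on the URL of a small object stored in the cloud exercises the whole chain (front end, authentication, storage back end), whereas an ICMP echo only shows that some host answers at the network level.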

However, the paper falls short of giving an accurate statistical representation of the outage phenomenon in clouds. The granularity chosen by the authors, on the order of 10 minutes, is too coarse for that purpose: it overlooks short out-of-service glitches (and most outages are indeed of very short duration). In addition, the measurements must be cleaned of the outages that may be due to whatever lies between the user and the cloud, so that only the outages actually caused by the provider's infrastructure are attributed to the cloud.
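The effect of the probing interval can be seen with a small sketch. The outage list below is made up purely for illustration (it is not measured data), and the model is deliberately simplistic: probes are sent at times 0, T, 2T, ... and an outage is detected only if at least one probe falls inside it.

```python
def detected_outages(outages, period_s):
    """Return the outages hit by at least one periodic probe.

    `outages` is a list of (start_s, duration_s) pairs in whole seconds;
    probes are assumed to be sent at 0, period_s, 2 * period_s, ...
    """
    hits = []
    for start, duration in outages:
        end = start + duration
        # First probe instant at or after the outage start (integer ceiling).
        first_probe = -(-start // period_s) * period_s
        if first_probe < end:
            hits.append((start, duration))
    return hits


# Hypothetical outages: two short glitches and one 15-minute outage.
outages = [(125, 30), (1210, 20), (4000, 900)]

for period in (600, 10):  # 10-minute probes versus 10-second probes
    seen = detected_outages(outages, period)
    print(f"probing every {period} s: {len(seen)} of {len(outages)} outages seen")
```

With this made-up input, the 10-minute probes catch only the long outage, while the 10-second probes catch all three. Any outage shorter than the probing interval can slip between two consecutive probes, which is exactly how short glitches disappear from the statistics.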

 

