Cloud Computing's RAIC? What's that?

Well over a year ago, in a conversation with Alexis Richardson, I came
up with a catchy acronym to articulate an idea that I had been kicking around
as a simple way to respond to all of the Sturm und Drang in the
press and the blogosphere about “lock-in”, “data portability” and reliability
of cloud computing providers. I said — “You know what, mate, done
properly, it would be like a RAID setup — it would be an array of cloud
providers. Umm, yeah, it would be RAIC! ‘Redundant Array of
Independent Cloud providers'”. Alexis, as I recall, burst out laughing,
and said something like “You better trademark that, Mark. That’s
great.”

Wile E. Coyote

A few weeks later, I sat down, and wrote a
blog post to try to describe the idea in some detail. That post has
since become the most popular post on my blog, ever, but that’s largely
because people hotlink to the image of Wile E. Coyote that I included in it,
apparently, and has little to do with the rest of the content. And as it turned out, it’s good I didn’t try to follow Alexis’
advice about that trademark stuff, as an angry commenter let me know (quite correctly) that he was
the first to publish the term. 😀

Despite all that, the term has gotten some traction. I encounter it
now from time to time in other people’s writings, and I get a lot of
questions about it. By and large, the questions are a consequence of my
own ~~laziness~~ bandwidth constraints. That first blog post was never
intended to stand alone — I meant to follow it up with one or more posts,
expanding on the idea and explaining what I meant in more detail. Since
I never got around to doing that, I can’t blame anyone but myself if people
are left confused, or have questions.

A few months ago, I was asked by a CSC colleague in Holland if I could
contribute a chapter to a book that is being published (in Dutch) there in the coming
year on cloud computing. I said, “Sure, I’ll write about RAIC!”
And so I did. What follows is the English-language input I
provided.

Redundant Arrays of Independent Cloud computing providers – RAIC

At one point, many years ago, during the early period of what has since
come to be known as the “client/server revolution”, the
reliability of hard disk drives in mainframe systems was a powerful sales
argument for manufacturers of such systems. Defending their markets from new
and aggressive competitors, they made the argument that hard drives used by
their competitors were too unreliable (by comparison) and offered
unacceptable performance for mission critical work. This argument was
helped immensely by the fact that it was, by and large, true.

In 1988, David A. Patterson, Garth A. Gibson and Randy H. Katz at the
University of California, Berkeley, published a paper entitled “A Case for
Redundant Arrays of Inexpensive Disks (RAID)” at the SIGMOD (the Association for
Computing Machinery’s Special Interest Group on Management of Data)
Conference [1]. This paper laid the foundation for a relatively simple,
but extraordinarily effective response to the limitations of disk storage in
low cost, client server systems. Simple queuing theory
demonstrates that an array of service providers, working in parallel, provides
higher bandwidth than any equivalent single service provider can. But
low cost disks being used in client server systems seemed unsuitable for such
parallel arrays, because they were of relatively low quality, and
correspondingly unreliable. The RAID idea was to combine N disks in a
redundant manner. This would compensate for the inherent unreliability
of the hardware, and allow systems to exploit parallelism for higher
bandwidth. The Berkeley paper went on to outline several different
implementation strategies, described as “levels”, defining five
of them. The genius of the idea was that it took a perceived constraint
– low cost, low quality disks – and leveraged it to produce a
solution. In other words, RAID leveraged a core attribute of the new model to
solve some of its constraints.

RAID was a tremendous success. The commercial implementations in the
marketplace have often differed in many ways from the academic ideal embodied
by the Berkeley paper, and the precise meaning of a particular “RAID
level” has often been ambiguous as a result. But as a general
concept for system design, RAID has served as one of the core building blocks
of commercial IT in the last 20 years. At an inflection point
in the history of IT, where the economic advantages of client server systems
were exerting enormous pressure on the industry to find a way to exploit them,
the idea of RAID emerged as a central enabling technique. In a very real way,
RAID helped pave the way for all of the subsequent developments that
leveraged this potential, including the Internet, the Web, and what people
have now begun calling “cloud computing.” RAID was a
conceptual milestone in the design of IT systems.

Arguably, the pressure that is now being exerted on the IT industry by the
economic advantages of cloud computing represents the next major inflection
point in the history of technology. Like the client server inflection
point before it, cloud computing presents us with a “perfect
storm” of correlated factors, all of which have now come together to
create a model of system design that is disruptive due to the business
opportunities that it is enabling.

However, like the client server inflection point before it, there are
significant gaps in the conceptual framework of cloud computing design and
architectural patterns. These gaps manifest themselves as problems,
constraints and challenges, some of which make the use of the new model
untenable in certain use cases. As with the mainframe before them,
today’s entrenched models of computing have certain attributes – such as
reliability – which are expressed and implemented in ways that cannot
yet be replicated using the newer model. And, as at the client server
inflection point, these problems and constraints are being held up
by entrenched interests as justification for rejection of the new model
– “this doesn’t work!”

RAIC – Redundant Arrays of Independent Cloud providers – is a
conceptual response to some of these constraints. Like RAID before it,
RAIC proposes a particular set of design patterns, which can be used to not
only mitigate certain constraints, but also allow new potential benefits to
accrue, particularly for enterprise customers.

Constraints, problems and challenges facing cloud computing

Cloud computing is a very young conceptual model. Arguably, a
consensus on what the term means has still not been reached in the industry,
and to the extent that any consensus does exist, however rough, it has only
emerged in the last year. It is therefore hardly surprising that the
model that it represents has a number of problems, gaps, constraints and
challenges that have yet to be resolved.

Prominent amongst these are the following issues:

Reliability: Cloud computing
providers have business models that are optimised for their initial, and
primary, customer base: providers of consumer-facing Web services. As
such, they offer levels of reliability that are suitable for the consumer
Web. These levels of reliability are inadequate for many (if not all)
transactional enterprise workloads. Moreover, due to constraints in
their own business models, consumer-oriented cloud computing providers have
proven reluctant to change this – they have been slow to offer a
different set of terms to enterprise customers, and slow to offer any kinds
of guarantees or Service Level Agreements (SLAs), which are standard
approaches in traditional outsourcing and hosting relationships in the
enterprise market. Above and beyond that, increasing reliance on
Internet-based sourcing providers calls into question the reliability of the
Internet itself. In a world where an accident in the Mediterranean can
take India, parts of Africa and Asia effectively offline for days [2], this
question is more than academic.
Lock-in/out: As befits its
relative youth, cloud computing is a domain that encompasses a broad and
diverse array of solutions, many of them competing with one another as
solutions for the same class of problem. These competitive solutions
are, by their nature, largely incompatible with one another, and very often
proprietary, so that there is little to no transparency for a customer into
the implementation itself. Choosing a provider in such a context
carries significant risks. Should that solution prove to be the loser
in the competitive marketplace, customers that have committed to it will find
themselves in a sub-optimal situation. The recent demise of the
Platform as a Service provider Coghead [3] provides an object lesson in these
sorts of risks.
Data portability: Closely
correlated with the problem of lock-in is the problem of data formats.
If a provider’s solution uses proprietary formats, a customer may have
a significant data transformation burden to bear, should they decide to
extract the data for storage elsewhere. Moreover, in some cases,
providers who have optimised for the consumer-facing market are not prepared
to even provide direct access to customer data. In some cloud business
models, customer data is a valuable good, and earning revenue on it a key
part of the provider’s own profit structure. There are unresolved
debates about the boundaries of ownership of data to be drawn between a
customer and a provider, and not all cloud providers have strategies that are
acceptable for enterprise use cases.
Data size and the laws of
physics: In an increasing number of cases, and as startling as it may
seem, the speed of light is becoming a serious business constraint on the use
of cloud computing providers. An example will serve to illustrate the
problem: consider an enterprise using a cloud computing provider to
host a Business Intelligence (BI) solution. In our (completely
contrived and artificial) example, we will assume that the business in
question has decided to store both raw and aggregate transactional data in
the cloud. This is not as far-fetched as one might assume – in
many businesses, large quantities of data are collected on individual
transactions solely for the purpose of serving as the basis for later
aggregated figures. In such a case, an argument can be made for an
“elastic” solution architecture, where the resources required to
collect and store the raw data do not run twenty four hours a day, seven days
a week, but only as needed. So, to return to our fictional example, we
have a business that is collecting vast quantities of data, and storing them
with a cloud provider. Let us also assume that the business has been
operating this way for some time, and has accumulated many terabytes of data
in the cloud. What happens, however, if the sourcing relationship with
that provider suddenly goes sour, and the business wants to terminate
it? If the provider only offers an Internet-based
“pipeline” to the customer’s data, the amount of time it
will take to “pull the data out” of the cloud is a function of
the amount of data and network bandwidth available. As one industry
veteran put it, speaking at the first CloudCamp in London and describing a
situation confronted by a start-up company he was involved with, “even
if we had run a batch job round the clock for a month, we still would not
have been able to extract all of our data.” [4] The laws of physics
place an implacable limit on such things.
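
To put rough numbers on that limit, here is a minimal sketch in Python. It assumes an idealised link running at a constant rate. The function name transfer_days, the efficiency parameter and the example figures (100 terabytes, 100 Mbit/s) are our own illustrative assumptions, not measurements from any provider.

```python
def transfer_days(data_terabytes: float, bandwidth_mbps: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move `data_terabytes` over a link of `bandwidth_mbps`.

    Assumptions (ours, for illustration only): decimal units
    (1 TB = 10**12 bytes, 1 Mbit/s = 10**6 bits per second) and a constant
    transfer rate. `efficiency` is the fraction of the nominal bandwidth
    that is actually usable (1.0 = no protocol overhead, no throttling).
    """
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth_mbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    total_bits = data_terabytes * 1e12 * 8
    seconds = total_bits / (bandwidth_mbps * 1e6 * efficiency)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    # Hypothetical example: 100 TB over a dedicated 100 Mbit/s line.
    print(f"{transfer_days(100, 100):.1f} days")
```

Under these assumptions, 100 terabytes over a dedicated 100 Mbit/s line takes roughly three months of round-the-clock transfer, before any protocol overhead or throttling is taken into account.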

What is RAIC, and how does it work

RAIC presents us with design patterns that can mitigate all of these
constraints. In a nutshell, the idea is simply this: keep multiple,
redundant copies of all data with multiple cloud providers. The design
patterns embodied by the various RAID levels may provide templates for
similar patterns here, but that is the stuff of future work. In this
paper, we will limit the discussion to the implications of the simplest
possible such model – the equivalent of RAID level 1, mirroring of
data.

Conceptually, RAIC involves mirroring a business’s data with
multiple cloud providers. Rather than establishing an “eggs in
one basket” sourcing model, as shown in Figure 1, RAIC suggests a model
where a business has a commercial relationship with multiple cloud providers,
and writes all data to each and every one of these providers in
parallel. Figure 2 depicts RAIC in action.

Figure 1 – the “all eggs in one basket” model

RAIC, like RAID before it, capitalises on existing technologies (such as,
in the figures shown here, Virtual Private Networking (VPN) techniques), but
also leverages attributes of the components themselves to enable the model
itself. In RAIC’s case, the low cost of cloud computing services,
and the lack of capital expenditures needed to enable the model, are what
make it a viable solution.

Figure 2 – RAIC

Advantages RAIC provides

Let us now revisit the constraints, problems and challenges listed
earlier, and examine how the RAIC concept can mitigate each, in turn.

Reliability: This is the most
obvious benefit of the model, and arguably the easiest to understand.
Reliability becomes a function of the number of providers. More
providers equate to higher reliability. Moreover, distributing the
pool of providers across a number of geographies could enable a design that
was resistant to transient, localised problems with the Internet. An
enterprise using a global RAIC could effectively achieve the same aggregate
reliability as the global Internet itself – this is the same argument
as that made on behalf of the Content Distribution Network (CDN) concept,
with the difference that CDN is a read-only solution, whereas a RAIC is a
read-write design pattern. An event that caused the entire, global
Internet to fail would be likely to be a cataclysm of such apocalyptic
proportions that the failure of business systems might not be the highest
priority issue.
Lock-in/out: Essentially, a RAIC
system design eliminates the concern of lock-in. If a customer
employing a RAIC strategy decides to terminate a commercial relationship with
a provider, this presents no problem – the customer is no longer in a
relationship of sole dependency with such a provider. Customers will,
of course, have to balance risks involved in changes to reliability and
availability (at least until one provider can be replaced by another), but
this is a straightforward business decision, and one that RAIC enables the
business to make, by breaking the sole dependency on a single provider.
Data portability, data size and the laws of physics: In our experience, this is the least intuitive of the
advantages of a RAIC-like model, but in our judgement, the most
compelling. Put simply, RAIC sidesteps these problems. It
doesn’t solve them, per se – it enables a business to
simply go around them. Consider the “we’re terminating our
relationship” scenario suggested in the lock-in constraint. In a
RAIC system, a customer would merely issue a “delete” job on
their way out the door. There is no “data portability”
problem, because RAIC eliminates the need to ever move the data, in
bulk. Similarly, this mitigates the problems posed by large datasets
(vs. the laws of physics) in a straightforward way. If a business never
needs to move its data, it need not be concerned with the fact that it
isn’t feasible to do so. Of course, this assumes a pre-requisite:
that a “delete” command to a cloud computing provider really does
what the customer wants it to – that “delete” really means
“delete”. But, in our experience, customers will find it
easier to negotiate the terms of “delete” than an attempt
to re-write a provider’s cost model, not to mention the implacable laws
of physics.
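
The claim that reliability becomes a function of the number of providers can be made concrete with a small, hedged calculation. The sketch below assumes that providers fail independently of one another, that each has the same availability, and that mirrored data counts as available as long as at least one provider can be reached. Under those assumptions the aggregate availability is 1 - (1 - a)^n. The function names and the 99% figure are ours, chosen purely for illustration. Real providers share failure modes (not least the Internet itself), so actual gains will be smaller.

```python
HOURS_PER_YEAR = 8760


def mirrored_availability(provider_availability: float, providers: int) -> float:
    """Availability of data mirrored across `providers` providers.

    Assumes independent failures and identical per-provider availability;
    the data is reachable if at least one mirror is reachable.
    """
    if not 0 <= provider_availability <= 1:
        raise ValueError("provider_availability must be between 0 and 1")
    if providers < 1:
        raise ValueError("providers must be at least 1")
    return 1 - (1 - provider_availability) ** providers


def expected_downtime_hours(availability: float) -> float:
    """Expected hours per year during which no mirror is reachable."""
    return (1 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical figure: every provider is available 99% of the time.
    for n in (1, 2, 3):
        a = mirrored_availability(0.99, n)
        print(n, f"{a:.6f}", f"{expected_downtime_hours(a):.4f} h/year")
```

With a hypothetical 99% per provider, one provider implies about 87.6 hours of unavailability a year; three independent mirrors bring that down to well under a minute.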

Like RAID before it, RAIC system designs hint at tremendous opportunity
for optimisation, and new capabilities that might emerge as a consequence of
the same. Consider the question of how to implement a mechanism to
ensure that data is written in parallel to each of the cloud providers
involved. We have not detailed any particular implementation
strategies, nor do we intend to: these are left as an exercise for the
reader. But allow us to explore some of the implications of various
strategies for a moment, in order to highlight what we see as fertile ground
for optimisation and the emergence of new capabilities.

A naïve implementation of RAIC might simply write all transactions to
all providers concurrently – in parallel, but synchronously. This
would be simple enough to do, and would work. But this is certainly not
the only possible implementation strategy. It is almost as
straightforward to imagine more complex implementations, using some form of
asynchronous messaging. Imagine a system where transactions were first
written to one provider, in a synchronous manner, and then propagated, using
asynchronous messaging techniques, to the other providers. This is
similar to the design patterns used to implement federated databases.
By extension, it is simple to imagine any number of permutations of this sort
of design, ranging from an intermediate messaging broker, to peer-to-peer
quorum algorithms that distribute the role of the broker as well.
Further, these various approaches clearly have complex, differing
implications for the role of data in an overall system. Ideas like BASE
[5], the CAP theorem [6] and “eventual consistency” [7] will all
have a role to play here.
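
The two strategies just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not an implementation recommendation: InMemoryProvider is a hypothetical stand-in for a real provider's storage API, and the names write_synchronously and AsyncReplicator are ours. Error handling, retries and conflict resolution are deliberately left out.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


class InMemoryProvider:
    """Hypothetical stand-in for one cloud provider's storage API."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self.data[key] = value


def write_synchronously(providers, key, value):
    """Naive strategy: write to every provider in parallel and
    return only once all of them have acknowledged the write."""
    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        futures = [pool.submit(p.put, key, value) for p in providers]
        for future in futures:
            future.result()  # re-raises any provider error


class AsyncReplicator:
    """Primary-first strategy: synchronous write to one provider,
    then queued, asynchronous propagation to the others."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self._pending = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, key, value):
        self.primary.put(key, value)      # caller waits only for this
        self._pending.put((key, value))   # replicas catch up later

    def _drain(self):
        while True:
            key, value = self._pending.get()
            try:
                for replica in self.replicas:
                    replica.put(key, value)
            finally:
                self._pending.task_done()

    def wait_until_consistent(self):
        """Block until every queued write has reached the replicas."""
        self._pending.join()


if __name__ == "__main__":
    providers = [InMemoryProvider(n) for n in ("a", "b", "c")]
    write_synchronously(providers, "order-1", "paid")
    replicator = AsyncReplicator(providers[0], providers[1:])
    replicator.write("order-2", "shipped")
    replicator.wait_until_consistent()
    print(all(p.data == providers[0].data for p in providers))
```

Between the call to write and the moment the queue drains, the replicas lag behind the primary. That window is exactly where questions of eventual consistency arise.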

Broader implications for system design

These considerations imply that RAIC is only a starting point, a
foundational design pattern that enables other, more complex patterns in
turn.

We think it will be useful to explore some of these broader implications,
and to place RAIC in a conceptual framework that relates it to other aspects
of system design. However, before doing so, let us first make one thing
clear with regard to RAIC’s relationship to the concept of cloud
computing itself.

RAIC seems easiest to understand as a metaphor for data storage, and this
is not a coincidence – ultimately, RAIC is about the storage of data
with different providers. However, this sometimes leads people to
assume that it is also only applicable at the Infrastructure as a Service
layer of the SPI stack. This seems to be a natural consequence of the
nature of the SPI stack and the separation of concerns that it seems to
imply. For most people, “data storage” equates to things
like “hard disks” and “databases”. Those are
concepts, moreover, that one finds most prominently at the IaaS layer of the
SPI stack. Ergo, RAIC equates to IaaS.

The problem with this is that it unnecessarily restricts the applicability
of the pattern. RAIC is perhaps easiest to understand at the IaaS
level, but that does not mean that it does not apply to the Platform as a
Service (PaaS) or Software as a Service (SaaS) levels as well. Figure 3
demonstrates the pattern at the SaaS level.

Figure 3 – RAIC at the SaaS level of the SPI stack

In this figure, three SaaS providers of enterprise applications are being
used in parallel by an organisation. In the simplest possible example,
imagine an online spreadsheet that uses the APIs of these providers to store
the associated data. Of course, this presents implementation challenges
– in particular, with regard to a common user interface to these
services, which would otherwise be seen as, and provided by, one of the
service providers themselves – but the overall point should be
clear. Similar examples can be contrived for e-mail, for example.

What this observation, as well as our earlier remarks about various
implementation and data storage strategies, demonstrates is that RAIC is a
manifestation of deep design principles, with broad applicability.

Consider the formal, scientific definition of
“redundancy”. Redundancy in engineering means the precise
duplication of components [8]. Strictly speaking, insofar as the various
disks in a RAID system are not precisely identical (identical
manufacturer, identical model, identical attributes such as size, etc.), then
it is incorrect to speak of these components as being
“redundant”. The general usage of the term focuses on
isomorphism – components that are not identical, but
structurally equivalent. Two SCSI disk drives, made by different
manufacturers, seem to be an example. What this common interpretation
misapprehends is that it is not the isomorphism of such components that
enables their interchangeability – it is their isofunctional
nature. “Isofunctional” means “behaving
in the same manner” – more strictly, one component is isofunctional
to another if, given the same inputs, both produce the same outputs.
RAID arrays can (and do) have disks that are very different from one another
(different manufacturers, different sizes, etc.), but they all behave the
same way, due to their conformance with the SCSI standard interface.
They are isofunctional. The more precise term for the property of
being isofunctional without being isomorphic is, strangely enough,
“degeneracy” [9]. Thus, for both RAID and RAIC, the usage of the
term “redundant” is, strictly speaking, wrong.
“Degenerate” would be the more accurate term. We suspect
the terms would enjoy less popularity, however, were they more correct in
this sense.
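
A toy illustration in Python may help here. The two classes below are our own hypothetical examples: they are structured quite differently inside (one uses a dictionary, the other an append-only log), yet they respond identically to the same sequence of operations. The helper same_outputs only checks the inputs it is given, so it is evidence of isofunctional behaviour rather than proof.

```python
class DictStore:
    """Keeps the current value for each key in a dictionary."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value

    def get(self, key):
        return self._items.get(key)


class LogStore:
    """Keeps every write in an append-only log; the latest entry wins."""

    def __init__(self):
        self._log = []

    def put(self, key, value):
        self._log.append((key, value))

    def get(self, key):
        for logged_key, value in reversed(self._log):
            if logged_key == key:
                return value
        return None


def same_outputs(a, b, operations):
    """Apply the same operations to both components and report
    whether every output matched."""
    for name, *args in operations:
        if getattr(a, name)(*args) != getattr(b, name)(*args):
            return False
    return True


if __name__ == "__main__":
    operations = [
        ("put", "x", 1),
        ("get", "x"),
        ("put", "x", 2),
        ("get", "x"),
        ("get", "missing"),
    ]
    print(same_outputs(DictStore(), LogStore(), operations))
```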

We raise this point out of more than simple academic curiosity. The
field where the term “degeneracy” is most commonly used, and
where the most care is paid to a precise distinction between these terms, is
biology.

Summary

Cloud computing is driving a rapid change in the overall complexity of IT
system designs. It is our belief that “traditional” models
of IT system design are being pushed to the outer limits of their utility by
this change – we believe that these models are already beginning to
fail us, and that this will increasingly become the case. When we
search for alternative models of system design to guide us, we find biology
to be the most promising source. Biomimesis is a term that describes
the explicit attempt to design systems that mimic biological models
[10]. One of the most profound differences between biological models
and “traditional” IT systems is their relationship to
duplication. Traditional IT system design approaches strive to
eliminate “unnecessary” duplication – ranging from attempts
to “normalise” database designs, to attempts to eliminate
duplicate code in inheritance-based type systems, to the focus on reuse that
characterises SOA. Biological systems, on the other hand, are rife with
duplication – duplicate genes, duplicate cells, duplicate processes,
duplicate organs, and so on. Often, this duplication is not isomorphic
in nature, but isofunctional, as in the brain’s well-documented ability
to heal after serious injuries by “rewiring” parts of itself to
perform lost functions. It is in this context that biologists find
themselves needing to be quite precise about the distinction between
“redundancy” and “degeneracy”.

We do not think it is a coincidence that the core of the RAIC concept –
duplication – is also one of the core design principles of biological
systems. We believe it is a consequence – a
consequence of the pressure being exerted on us as designers to tame the
ever-increasing complexity of our systems.

There is no silver bullet [11], and RAIC is not one. But we do
believe it has significant value as a foundational design pattern, one that will
enable businesses both to exploit cloud computing in a manner that is
reliable enough to meet business goals, and to develop new, emergent
capabilities, which we can now only dimly imagine.