Google Teh Evil? Cloud economics, BigTable + GFS vs. EU privacy laws

February 5, 2009

At the recent Google IO conference, Google Fellow Jeff Dean gave a talk about the “inner workings” of Google’s data centres. There was a writeup on C-Net — the talk seems to (re)use some material from the deck available here. This is fascinating stuff. Buried in this stream of data (in the C-Net piece, the money shot, for my purposes here, is literally in the last paragraph) is information about the nature of the architecture of GFS (Google File System) and BigTable, the technologies used by Google to store (and retrieve) data scalably. Arguably, the combination of GFS and BigTable is Google’s cloud computing offering — GAE (Google App Engine) is just one stack that might run on top of it.

There’s an aspect of this architecture that didn’t get a lot of press, and doesn’t seem to have registered with a larger audience yet, and I think it should — if for no other reason than I think it hands Amazon a big fat advantage in European (and possibly Asian) cloud computing markets.

In Dean’s slides, there’s the following statement:

Scheduling system + GFS + BigTable + MapReduce work well within single clusters

Followed by:

Truly global systems to span all our datacenters • Global namespace with many replicas of data worldwide
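To make the first of those slides concrete: the MapReduce pattern Dean names can be sketched, very loosely, as a single-machine toy. The real thing runs distributed across a cluster with GFS underneath; this only illustrates the map/shuffle/reduce shape, using an invented word-count example.

```python
from collections import defaultdict

# Map phase: emit a (word, 1) pair for every word in every document.
def map_phase(docs):
    for doc in docs:
        for word in doc.split():
            yield word, 1

# Shuffle: group the intermediate pairs by key (the framework does this
# between the map and reduce phases in a real MapReduce run).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: sum the counts for each word.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the cloud", "the law", "the cloud the law"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # → 4
```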

In the C-Net article, he’s quoted as saying:

“We want our next-generation infrastructure to be a system that runs across a large fraction of our machines rather than separate instances,” Dean said.

Right now some massive file systems have different names–GFS/Oregon and GFS/Atlanta, for example–but they’re meant to be copies of each other. “We want a single namespace,” he said.

So what’s the big deal with that? Well, in a nutshell, European law versus U.S. law. Wildly different understandings of privacy and data protection, coupled with even more wildly different attitudes about government powers, result in a situation ripe for conflict. Things like the Patriot Act have resulted in a situation where European organisations simply categorically forbid any storage of data in the United States — and note, for further splitting of hairs, that it’s unclear what the “storage” of data really means, and it may be broad enough to include processing of data within U.S. jurisdictions, even if it’s persisted elsewhere.

There are laws on the books in several European countries (like Germany, where I live) that literally forbid situations like the ones that seem likely to result from Google’s architecture. Now, IANAL (I am not a lawyer), and it’s possible that I am completely wrong for that reason alone. Perhaps a nuanced reading of European privacy and data protection laws simply makes the apparent problem go away. There’s one hell of a lot of documentation to read on the subject, more than enough to keep any number of lawyers busy for years, as a quick glance at some of the links I’ve been squirrelling away on the topic should make evident.

When I was at the Enterprise 2.0 conference in Boston in June, there was a session called “An Evening in the Clouds”, and during a Q&A session at the end, I asked Google’s Jeff Keltner about the issue directly. He kind of dodged the question with a corp-speak “We’re evaluating that” answer (and, in all fairness, I would have done the same thing, in his place), but then went on to suggest that, in fact, I might be overstating the problem. He suggested that a lot of the worries he’s encountered from European customers wound up being FUD (fear, uncertainty and doubt) about things like the Patriot Act, and that when the lawyers all sat down together and really examined the problems, lots of them just went away.

Maybe.

But I think this argument underestimates the deep aversion to risk in enterprisey organisations. Many, if not most, buying decisions never make it to a desk in the legal department — they get made much earlier in the buying process, in decisions about who’s even in the running. And in many large organisations, that risk aversion is the flip side of the popularity of the “path of least resistance” strategy. If there’s even a suggestion that going with Google for cloud services might not be the path of least resistance (because, say, it’s merely unclear what the legal ramifications might be), that will often be enough to skew the decision-making process against them. And when I see things like this, I see a case in point. The bottom line is that there are laws on the books in the EU that stand in direct conflict with the needs of Google’s architecture, and no amount of hand-waving will make that fact go away.

On the other hand, it’s possible that Google’s architecture can be adapted to allow for a more nuanced implementation. On those same slides from Jeff Dean, we also see this statement:

Users specify high-level desires: “Store this data on at least 2 disks in EU, 2 in U.S. & 1 in Asia”

An API that could do that implies an API that could also be used to store data only on “disks” in a particular region for regulatory, rather than performance, reasons. But again, it’s not clear that storage alone is the problem, and Google’s ambition of a global namespace implies that data will flit back and forth, sometimes on U.S. hardware, sometimes not. Since it’s not clear whether that’s in scope under existing EU laws, we’re back to the path-of-least-resistance problem.
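Here is a hypothetical sketch of what such a placement-policy API might look like, in the spirit of the slide’s “at least 2 disks in EU, 2 in U.S. & 1 in Asia” example. Nothing here is Google’s actual API; the class, method, and region names are all invented. The point is that the same mechanism expresses both the performance policy from the slide and a regulatory “EU disks only” fence.

```python
from typing import Dict, List, Optional, Set

class PlacementPolicy:
    """Invented policy object: per-region replica minimums, plus an
    optional fence restricting which regions may hold replicas at all."""

    def __init__(self, min_replicas: Dict[str, int],
                 allowed_regions: Optional[Set[str]] = None):
        self.min_replicas = min_replicas        # region -> minimum replicas
        self.allowed_regions = allowed_regions  # None means "anywhere"

    def satisfied_by(self, replica_regions: List[str]) -> bool:
        # Every per-region minimum must be met...
        for region, minimum in self.min_replicas.items():
            if replica_regions.count(region) < minimum:
                return False
        # ...and, for the regulatory variant, no replica may leave the fence.
        if self.allowed_regions is not None:
            return all(r in self.allowed_regions for r in replica_regions)
        return True

# The performance policy from the slide.
perf = PlacementPolicy({"EU": 2, "US": 2, "ASIA": 1})
# A regulatory policy: this data lives on EU disks, full stop.
eu_only = PlacementPolicy({"EU": 2}, allowed_regions={"EU"})

print(perf.satisfied_by(["EU", "EU", "US", "US", "ASIA"]))  # True
print(eu_only.satisfied_by(["EU", "EU", "US"]))             # False
```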

Amazon seems to be in a better position here. Their architecture does not rely on an increasing degree of globalisation the way Google’s seems to, so they have no difficulty adapting to the current legal landscape, and that is what they are doing with their European operations, which are currently limited to S3, but which will supposedly be expanded to include EC2 and other services “real soon now”.
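For illustration, keeping data inside Europe on Amazon’s side is a single bucket setting. Note the SDK (boto3) and region name below reflect today’s AWS API rather than the 2009-era offering, and the bucket name is made up.

```python
# Request parameters for an S3 bucket pinned to Amazon's EU (Ireland)
# region; objects in the bucket are stored in that region only, unless
# cross-region replication is explicitly configured later.
bucket_request = {
    "Bucket": "example-eu-only-bucket",  # invented name
    "CreateBucketConfiguration": {"LocationConstraint": "eu-west-1"},
}

# With AWS credentials configured, the actual call would be:
#   import boto3
#   s3 = boto3.client("s3", region_name="eu-west-1")
#   s3.create_bucket(**bucket_request)

print(bucket_request["CreateBucketConfiguration"]["LocationConstraint"])
```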

James Urquhart came up with a fascinating meme related to this issue, which he calls “follow the law” computing (and make sure to follow some of the links from James’ blog as well). The basic idea is that software would become aware of these issues, and be cleverly partitioned to delegate processing (and, presumably, storage) to the legal jurisdiction that provides the most favourable environment for it. That’s a brilliant idea, essentially the flip side of what I’ve been musing about here — for certain transactions, it may be economically ideal to conduct them in a particular jurisdiction (say, the Cayman Islands), and therefore, the software would be partitioned to do just that. My imagination runs wild with that idea — consider the possibilities that open up: market forces would come into play on the keepers of legal jurisdictions (typically, countries). Jurisdictions could find themselves competing to provide the most favourable environments — that already happens, of course, but software like this would dramatically accelerate the effect (similar to the way automation changed currency trading markets). The mind boggles at the implications.
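A toy sketch of what “follow the law” computing might look like in code: a scheduler that routes a workload to the cheapest jurisdiction whose rules permit it. Every jurisdiction name, rule, and price here is invented for illustration; a real system would consult actual legal constraints, not a three-entry dictionary.

```python
# Each jurisdiction advertises its constraints and its price; the
# scheduler picks the most favourable one that satisfies the workload's
# legal requirements. All data below is made up.
JURISDICTIONS = {
    "germany": {"allows_pii": True,  "cost_per_hour": 1.00},
    "us-east": {"allows_pii": False, "cost_per_hour": 0.60},
    "cayman":  {"allows_pii": False, "cost_per_hour": 0.40},
}

def place(workload):
    """Return the cheapest jurisdiction whose rules permit this workload."""
    candidates = [
        (props["cost_per_hour"], name)
        for name, props in JURISDICTIONS.items()
        if not workload["handles_pii"] or props["allows_pii"]
    ]
    if not candidates:
        raise ValueError("no jurisdiction permits this workload")
    return min(candidates)[1]

print(place({"handles_pii": True}))   # → germany (only legal option)
print(place({"handles_pii": False}))  # → cayman (cheapest option)
```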

These are complex issues, and there are no clear-cut answers to some of these things (in other words, stuff for lawyers to do). Having said that, I do think there’s cause for concern. At work, I was talking about this whole topic recently with somebody, and the general tenor of the conversation was something like “Google vs. the EU? Google will lose. MSFT did, after all.” We had a chuckle full of Schadenfreude at Google’s expense. “But,” I said, in a tone not free of sarcasm, “if there is a company capable of changing the world to fit its architecture, rather than the other way round, then surely it’s Goog.” The next day, this showed up in the New York Times, and the last paragraph (again!) almost had me spew my tea onto my PowerBook:

In addition, businesses that operate on both sides of the Atlantic are pushing to make sure they are not caught between conflicting legal obligations.

“This will require compromise,” said Peter Fleischer, the global privacy counsel for Google. “It will require people to agree on a framework that balances two conflicting issues: privacy and security. But the need to develop that kind of framework is becoming more important as more data moves onto the Internet and circles across the global architecture.”

Indeed. Why do I get the sneaking feeling that Google, doer of good, is not putting my interests above its own in this?