Double Dutch – Greenplum and Teradata

In 2004, Teradata withdrew from the public TPC-H data warehousing benchmark. At the time we took a certain amount of heat for our decision – with various analysts claiming that our withdrawal was part of a strategy intended to deliberately obscure our pricing model; and various competitors claiming that it amounted to an admission that we had lost our leadership position in the marketplace to the “me-too MPP-lite” products that they had brought to market.

In fact, our reasoning was very simple – and far less calculating. It was based on the old saw that whilst benchmarks may not lie, this has never been known to stop liars from using benchmarks. Well-constructed and well-executed benchmarks – those that represent real-world requirements and operating constraints – can certainly help organizations to understand the relative strengths and weaknesses of competing analytic database products. But many benchmarks are poorly designed and poorly executed. Benchmarks that fail even to reflect the complexity of an organization’s current analytical requirements – let alone the anticipated future complexity arising from the requirement to store ever greater volumes of detailed, integrated data and to expose them to ever greater numbers of users for analysis – are at best a waste of time and effort, and at worst seriously misleading. The TPC-H benchmark – consisting as it does mostly of trivial queries run one-at-a-time against an 8-table physical data model – was light years away from the reality of real-world data warehouse implementations even seven years ago. It had failed to advance with evolving requirements and had become an irrelevance.

The intervening years have seen data warehouses continue to grow in size (by one recent estimate, by 40% per annum); the continued rise in the importance of advanced analytical workloads, requiring sophisticated in-database processing; and the widespread deployment of tactical workloads alongside traditional decision-support workloads. The TPC-H specification now looks like an historical curiosity; an artefact of earlier, more carefree times, when the industry was young, requirements were simpler – and I still had my own hair.

So imagine my surprise when a vendor whose analytic DBMS technology still cannot run even the obsolete TPC-H benchmark unless some of the queries are first modified claimed in a recent interview that their analytic DBMS is twice as fast as Teradata, at half the cost. The vendor in question is EMC; the product for which these extravagant claims are made is the Greenplum analytic DBMS; and the claimant is EMC / Greenplum’s VP Central and Eastern Europe, Uwe Weimar, in an interview with Hans Lamboo for Database Magazine, issue 7, published November 8th, 2011.

I thoroughly recommend this article to all of you. It made me laugh out loud – and it wasn’t even the predictable inaccuracies of Google Translate’s interpretation of the original Dutch text that were the most unintentionally hilarious parts of the whole exercise.

For those of you without the patience to engage with the full article, I will summarize Weimar’s key claims, as they are reported in the article –

  • Greenplum has “maximum scalability”, arising from its “unique shared nothing massively parallel [processing] technology”. As some readers will doubtless be aware, Teradata pioneered the use of shared nothing massively parallel processing (MPP) architectures for data processing in… 1979.
  • “Greenplum has the fastest database” because it is based on open source PostgreSQL. No other explanation is offered by Weimar for this claim – which seems debatable at the very least, given that the Greenplum optimizer cannot even process correlated sub-queries (hence that little difficulty with running the antiquated TPC-H benchmark in its unmodified form; see the example after this list). And isn’t this a little bit like me claiming that my car is faster than yours “because it is red”, or that my wife is smarter than you “because she has brown eyes”? (Actually, my wife doesn’t need to be cleverer than you; it’s enough for her that she’s smarter than I am.)
  • “Teradata comes as an appliance” – well, yes, mostly it does, although there is also a software-only version of the product. What Weimar conveniently fails to mention is that the Teradata model – a vertically integrated appliance, consisting of hardware, operating system and DBMS, integrated and optimized with one another – is now completely dominant, with Oracle (Exadata) and IBM (DB2 and Netezza) both going to market this way. Even Greenplum, of course, has an appliance – the EMC Greenplum DCA machine – that it leads with in the vast majority of its sales motions.
  • We also learn that Teradata sells a “beautiful NCR appliance”. Whilst we’re flattered by the compliment, Teradata has been an independent and publicly traded company since we spun off from NCR in late 2007. Nothing in any Teradata system shipped since that time is badged or sourced from NCR.
  • “Teradata provides half as good performance at double the price” – leaving aside the lack of any evidence or corroboration whatsoever for this bold and implausibly generic statement, Weimar does rather assume that you don’t want to run a query that contains something as basic as a correlated sub-query. In which case, of course, you can’t run on Greenplum at all – quickly or otherwise – as this basic functionality simply isn’t supported. And given that Teradata offers five different systems with radically different price / performance characteristics, with which member of the Teradata platform family is this comparison supposed to be made, anyway?
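
For readers who don’t write SQL for a living: a correlated sub-query is simply an inner query that references a column of the outer query, so that it must (logically) be re-evaluated for every outer row. TPC-H query Q4 is the canonical example; the sketch below is a simplified rendering of it against the standard TPC-H schema, with illustrative dates:

```sql
-- Simplified sketch of TPC-H Q4: count orders, by priority, that contain
-- at least one line item delivered later than its committed date.
SELECT o_orderpriority,
       COUNT(*) AS order_count
FROM   orders
WHERE  o_orderdate >= DATE '1995-01-01'
  AND  o_orderdate <  DATE '1995-04-01'
  AND  EXISTS (
         SELECT 1
         FROM   lineitem
         WHERE  l_orderkey = o_orderkey   -- references the outer query's
                                          -- orders row: the "correlation"
           AND  l_commitdate < l_receiptdate
       )
GROUP BY o_orderpriority
ORDER BY o_orderpriority;
```

An optimizer that cannot process the correlated EXISTS predicate obliges the user to rewrite the query by hand – typically as a join against a derived table – which is exactly the kind of modification alluded to above.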

At least as striking as Weimar’s claims is his failure to explain that whilst the Greenplum DBMS might be based on PostgreSQL, it also includes a proprietary code base and is emphatically not open source.

Now it’s time for Bart Sjerps (Advisory Technology Consultant (EMEA) for EMC / Greenplum) to join the interview. Sjerps tells us that Greenplum is “strategic” to EMC, because EMC’s headcount has grown from 40,000 to 56,000 in the past year. If, like me, you are excessively pedantic, analytic and rarely get invited to dinner parties, you may be concerned at this point that these statistics conflate total EMC headcount with Greenplum headcount, perhaps to lead us to believe that Greenplum is more important to EMC than it actually is. (When did we all become so cynical? And do you have plans for this weekend? We could all get together and be sarcastic.)

Next Sjerps turns his attention to the buzzword du jour – in-memory computing – and treats us to a survey of sister company VMware’s GemFire in-memory technology. Reading the interview, you could be forgiven for thinking that this technology is integrated with the Greenplum technology – except that in the next breath, Sjerps appears to suggest that in fact this technology should be deployed alongside Greenplum and will require an administrator to manually assign “hot” data to the in-memory system. Since the temperature of data changes rapidly and unpredictably, this approach is basically irrelevant for large-scale data warehousing – and the reason that our own Teradata Virtual Storage product was designed with the explicit goal of automating the migration of hot and cold data between higher- and lower-performance storage devices.

At this point in the interview, Sjerps appears to claim that the Teradata RDBMS only runs on specialized hardware. For the record: it doesn’t. Teradata systems use x86 processors and have always been built largely from Intel Standard High Volume (SHV) server technology. We once flirted with Itanium – didn’t everyone? – and we use hardware compression technology alongside the x86 processors in the latest version of our Data Warehouse Appliance (one member of the aforementioned family of products). But that’s about as “out there” and “non-standard” as we get with hardware because, at heart, we’re software guys. The “proprietary ASICs” providing core functionality that Sjerps refers to are found in the Netezza Performance Server, not in any Teradata system.

Sjerps does make another interesting observation, although he doesn’t say it in so many words: Intel is not delivering us faster CPUs any more; it’s delivering us more processing cores on each CPU socket. In Sjerps’ own words: “Intel CPU performance doubles every two years. Greenplum does not have to do anything”.

As I noted in my last post about Exadata, multi-core CPUs represent a fundamental architectural change. On the plus side, they mean that Moore’s law can be sustained where the Laws of Physics would otherwise have got in the way. But there is a catch. Or more precisely, there are two catches where analytical workloads executing against large volumes of detailed data are concerned.

The first catch is that the doubling in performance is only realized for explicitly parallel, multi-threaded software. Clock speeds are actually slower than they were a couple of hardware generations ago, so serial processes run slower than they used to. As I explained in the Exadata post, many years ago – and long before virtualization was trendy – much smarter Engineers than me (OK, actual Engineers) took the decision to virtualize the CPU in the Teradata DBMS, so that multi-threaded parallelism is built into the Teradata software, not bolted on after the fact.
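
The interview doesn’t put numbers on this, but Amdahl’s law (my gloss, not the article’s) makes the point precisely. If a fraction $p$ of a workload can execute in parallel across $N$ cores, the best achievable speed-up is:

$$S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}$$

For a purely serial query, $p = 0$ and $S(N) = 1$ no matter how many cores Intel adds – and if clock speeds have actually fallen, that query now runs slower than it did two hardware generations ago. Only software that keeps $p$ close to 1 turns extra cores into extra throughput.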

The second is that unless the I/O sub-system also gets twice as fast every two years, those super-fast-shiny-processors are left there, open-mouthed, waiting for data to process. (This realization, incidentally, is what motivated us to virtualize the storage sub-system in Teradata, so that we can now support a tiered hierarchy of storage; with frequently accessed data on super-fast-but-rather-expensive SSD storage; less frequently accessed data on rather-more-economical-but-much-slower magnetic storage; and automated, software-driven migration of the data between the two tiers).
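
Put slightly more formally (again, my formulation rather than the article’s), a scan pipeline can only run as fast as its slowest stage:

$$R_{\text{effective}} = \min\left(R_{\text{CPU}},\ R_{\text{I/O}}\right)$$

Double $R_{\text{CPU}}$ while $R_{\text{I/O}}$ stands still and the effective rate barely moves; the extra cores simply sit idle.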

What is Greenplum’s strategy for ensuring that its DBMS software can scale with Intel’s multi-core CPU architecture, and how will Greenplum exploit emerging storage technologies to make sure that data gets to those processors without them stalling? Alas, in this article at least, Sjerps doesn’t say.

As an aside, ask yourself whether kicking back on the beach and relying on Intel to double data warehouse performance for you every two years is really something to shout about. Assume that those commentators claiming average data warehouse growth rates of 40% per annum are correct. That means that a data warehouse that is 10 TB in size at the end of this year has a fighting chance of being 19.6 TB in size at the end of 2013 (since we’re all friends, let’s call it 20 TB in round numbers).

If Greenplum relies on improved hardware to double performance – meaning, incidentally, that customers will have to throw away the system that they were sold in 2011 and buy an entirely new one at the end of 2013, since, unlike Teradata, Greenplum has no proven references for hardware co-existence – then, all other things being equal, the performance-to-data ratio of the new system will be exactly the same as the performance-to-data ratio of the old one that it replaces. In this scenario, you will literally be paying Greenplum every two years – just to stand still.

You (just about) get the help you need to deal with the requirement to store ever greater volumes of detailed, integrated data – so long as you don’t mind forking out for a brand new system with twice as much storage as the last one – but you’re on your own when it comes to the requirement to expose them to ever greater numbers of users for analysis so that you can actually derive some value from them.
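
For the sceptics, the arithmetic behind those numbers (mine, not the article’s) is straightforward:

$$10\ \text{TB} \times 1.4^{2} = 19.6\ \text{TB} \approx 20\ \text{TB}$$

So a system that delivers twice the performance against roughly twice the data has a performance-to-data ratio of $2 / 1.96 \approx 1.02$ times that of the system it replaces – which is to say, near enough unchanged.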

Whilst it is easy to see the attraction of this business model to Weimar and Sjerps – those beach holidays don’t pay for themselves! – it may prove less appealing to the guy or girl asked to keep writing the cheques every two years. In reality, explicitly parallel software that can scale with the latest-and-greatest hardware is the absolute bare minimum that you should demand of your analytic DBMS technology vendor; and if you’re going to out-run those exploding data volumes, said vendor’s technology had better go very much further than this, by making the analytic DBMS software smarter every year, year after year.

In plain English: it simply isn’t enough to scale with increasingly powerful hardware; we need our RDBMS software to use the available hardware resources ever more efficiently.

Empty vassals?
The analytic database market is growing rapidly. Technology – smartphones, sensor networks, the Internet and Social Media – continues to make it possible to create more and more detailed data; and organizations continue to collect and integrate these data so that they can exploit them for business advantage.

Fast-growing markets can support many products; even less mature, less robust and less functional products can find a profitable niche that, if they are smart, their owners can then exploit to generate the revenues that they need to re-invest in making their products more mature, more robust and more functional.

Precisely because a fast-growing market can support many products, a vendor who is confident that his or her analytic DBMS product meets a genuine market need and is differentiated from the competition shouldn’t have to spend much time trashing the competition to win new business. After all, if Greenplum is as important to EMC as Weimar and Sjerps claim it is, EMC will presumably continue to invest in it. (Who knows? One day it may even support correlated sub-queries!) And if the product is already as performant as they claim it is, then they can enter into demanding, well-designed benchmarks, secure in the knowledge that their product will do the talking for them. All of which raises two very obvious and very interesting questions which, in the spirit of not trashing the competition, I will let you contemplate for yourself.

I will leave you with a few parting thoughts.

Later in the same article, Weimar is quoted as asserting that Teradata customers don’t buy their systems, but rather are forced to lease them. That is simply untrue – and given that unit list pricing for Teradata systems has been available at www.teradata.com for several years now, it’s difficult to accept that Weimar didn’t know this claim to be incorrect when he made it. Then again, he apparently still thinks that Teradata is part of NCR, that the Teradata RDBMS runs only on proprietary chip-sets – and that Teradata has only one product line. So perhaps it really was an honest mistake. Sun Tzu once said, “If you know yourself and you know your opponent, then you need not fear the result of a thousand battles”. For the sake of his stakeholders, let’s hope that Weimar knows himself rather better than he appears to know his competitors.

In concluding the interview, Weimar is quoted as claiming, “Our main target market is the replacement of Teradata”. If you ever meet Weimar face-to-face, I encourage you to ask him if he really means what he says. If he answers “yes”, look him in the eyes, ask him to speak slowly and make him tell you exactly how many customers – meaning organizations that have paid money for product, not organizations that have taken delivery of an evaluation machine – have migrated from Teradata to Greenplum. Because we count four since Greenplum was founded. Not in the Central and Eastern European markets that you might assume Weimar is most concerned with, but globally. And even that figure requires a generous interpretation of the word “replaced” – and ignores the fact that we, in turn, have replaced Greenplum at two or three customers (whether it is two or three again hinges on your definition of “replaced”).

In case those data make it sound like this is a two-horse race that may come down to a photo finish, you should know that in 2011 we have competed directly with Greenplum – thus far and by our own count – in fourteen competitions where the result is in. And that we have defeated them in thirteen of those contests.

All of which is rather hard to reconcile with Weimar’s fighting talk. If Greenplum really is twice as fast, at half the cost, why did thirteen out of fourteen customers who evaluated both solutions choose Teradata?

Martin Willcox 
