Hadoop Can't Do That - SmartData Collective

I just got back from a little executive summit conference in Dallas for Chief Data Officers. Frustratingly, I heard a lot of folks telling me what Hadoop CAN’T do. Now, I know that Hadoop can’t bring about world peace or get my husband to put the toilet seat down, but the things people keep saying it can’t do are things that I’ve personally DONE on Hadoop clusters, so I know they’re doable.

If you asked most people if water could cut through steel, they would probably tell you it can’t. They would be wrong, too.

Going to the Evanta CDO Summit was a surprise. I first had to pinch-hit for one of our sales execs who needed surgery, and then for my CMO when he missed his flight out from California. So, in the space of one day, the day before the conference, I went from “not going” to “going, but just talking to people” to “presenting.” Whee!

So, after doing tech support for one of our data scientists, Josh Poduska, for a KNIME sneak peek training at our local meetup group until about 9 PM, I hopped in a car and drove to Dallas. Got there at 1:30 in the morning, just as the deluge started. I was in Houston last weekend to speak on panels at Comicpalooza in my fiction writer role, and rode out the storms there in a Denny’s. Three years ago, Texas was so dry all the crops were burnt brown and the lakes were all but gone. This year, everything’s overflowing.

During the summit, I heard a talk by Rob Saker, CDO of Crossmark. He made a statement during his presentation that really stuck with me.

“Data is like water. Not enough, and you die. Too much, and you’re flooded. Properly focused, it can cut through steel.”

Surrounded by flooding in a state that was dying for lack of water a few years back, that struck me as an exceptionally apt metaphor. Before, businesses were not able to analyze even the small amount of data they could hold onto. They were dying for the lack of it. Now, they’re flooded with data, but struggling to get their arms around it. Hadoop is the life saver, but people have pretty set notions about what Hadoop can’t do.

If you asked most people if water could cut through steel, they would probably tell you it can’t. As my maker husband, who loves computer-controlled water jets, lasers and router machines, could tell you, water can and does cut through just about anything. Similarly, Hadoop, used properly, can accomplish just about any data analysis task.

Here are some of the things people mentioned at the summit that Hadoop can’t do:

Low Latency SQL access (with ACID compliance)

You can’t do low latency, interactive SQL on Hadoop data, and there’s certainly no way to get anything like transactional integrity.

I’m not going to beat this dead horse too much here. There are bunches of ways to access Hadoop data with SQL. That space is actually becoming a bit crowded. I already talked about the fact that “Not All Hadoop Users Drop ACID” and pointed out that SQL access was one great way to “Bridge the Big Data Analytics Skills Gap.” Despite all the options and information out there about this, I just saw an article this morning pointing out that Hadoop had no SQL access. Sigh. (Not pointing to that one. It was not worth a link.)

The other myth around this is that SQL access on Hadoop is crazy expensive because you need lots of specialized skills and time to make it work. There’s a recent article by Tamara Dull on the Smart Data Collective called “Will You Always Save Money with Hadoop?” comparing costs over time. It basically concludes that if you need sophisticated SQL access, it’s more economical to just use an old school data warehouse, no matter how much data you have. This runs completely counter to the argument for why SQL access to Hadoop data is a good idea. SQL access gives your business analysts, the folks who already work for you and already know the data, access to all the data using tools and a language that they’re already fluent in.

How is that crazy expensive? That’s the most economical possible way to handle large data sets. I find it very hard to believe that buying a bigger Oracle or Netezza appliance is a better way to go financially.

CRUD operations

Hadoop is append only. You can’t do inserts, updates, or deletes.

Okay, well, don’t tell that to Splice Machine or MarkLogic. For that matter, don’t tell it to MapR, one of the big three Hadoop distributors, or the guys at Pivotal who make HAWQ. Don’t tell it to us at Actian, for sure, because we’ll laugh at you. Or, maybe not. We’re generally polite, so we’ll wait until you’re not in the room, and then laugh at you. Maybe we’ll call the guys from Splice Machine, MarkLogic, MapR and Pivotal over to share the joke.

We do feel like we have an edge on the other guys because Actian Vortex uses a technique that not only handles full insert, update and delete operations, but handles them with high concurrency without slowing down query speed. Heck, Teradata doesn’t even do that. Most analytics databases can’t do that. And we do it. On Hadoop. All the time.

Batch processing without writing code

If you want to use Hadoop, you have to hire an army of expensive MapReduce coders.

Not so much, no. I can’t code MapReduce to save my life, but I’ve designed data preparation and machine learning workflows, tested them out, executed them on a Hadoop cluster, tweaked and monitored them, and put the answers I got from them to use. I’ve been working on a team with two data scientists and an infrastructure specialist for the past couple of years. The infrastructure specialist stands up clusters, maintains and builds workflows on Hadoop on a daily basis, and never touches MapReduce. The data scientists do their jobs on Hadoop all the time, and neither one has ever coded a word of MapReduce. Our marketing analytics department does analysis on data on Hadoop regularly, and none of them speak MapReduce.

Hadoop today is not the batch-only, base-level MapReduce + HDFS starting point that it was a decade ago. Yet, that’s still what many people think of when they hear the word Hadoop. Many people even think of Hadoop and MapReduce as synonymous. That just isn’t the case.

YARN has turned Hadoop into a cluster operating system that can support many types of execution engines. Spark, Actian DataFlow, and Tez are all examples of ways to process data on Hadoop clusters that don’t use MapReduce.

Even MapReduce jobs don’t really require MapReduce coders. At this point, there are half a dozen different user interface applications that will let you design a MapReduce ETL process without writing a bit of code. Informatica and Pentaho, for example, will let you design your ETL workflows in the same interfaces you’re accustomed to, then will turn those into MapReduce jobs for you and execute them on a nearby cluster.

I’m not saying that’s the best way to go. MapReduce is slow, and MapReduce auto-generated by an interface is going to be even slower than usual. But if you’ve got the time, it will do the job. Spark, Tez and DataFlow are all faster execution engines, and DataFlow workflows can be created in the KNIME user interface. So, there are those options as well.

Whatever method you decide to use, you can do all of the batch data crunching that you expect from Hadoop, without writing a single word of MapReduce code, or hiring a single MapReduce coder.

Other stuff

There are a lot of other things that Hadoop CAN do right now that I keep hearing people saying it can’t. Supposedly, it doesn’t have any security, for instance. Have you heard of Kerberos, Knox, Sentry, …? Supposedly, it doesn’t have role-based authentication, or encryption, or audit capability. Actian does a lot of work for the financial and healthcare industries, among others. If those limitations were real, if that data wasn’t secure, those companies simply couldn’t use that software. They would be legally obligated not to use it.

The data management industry had this crazy hype machine going for a while that said that Hadoop could do everything from cure cancer to get stubborn stains out of your socks. People, naturally, got disillusioned. Now, practical, sensible people like Chief Data Officers are on the other end of the pendulum swing. They now have ideas about what Hadoop can’t do that were set when the software was in its infancy. Hadoop has grown up and, while it still can’t wash your socks, it can process data efficiently, economically, and without requiring a massive investment in hardware or hard-to-find skills.

There’s more, too. I hear about data quality, life cycle management, data curation, … a lot of the stuff that Tony Baer and I both talked about on our “5 Tips for Getting Value Out of Hadoop” webinar last Thursday. That’s all stuff that people keep telling me that Hadoop can’t do.

Right, and water can’t cut through steel.