Actian DataFlow, the Little Hadoop Engine That Could, But Probably Won’t

Paige Roberts
Last updated: 2015/07/27 at 5:21 AM

In Hadoop’s ecosystem of massively parallel cluster computing frameworks, Actian DataFlow is an anomaly. It’s a powerful little engine that thinks it can take on any data processing problem, no matter the scale. The trouble is that unlike MapReduce, Tez, Spark, Storm and all of the other Hadoop engines, DataFlow is proprietary, not open source.

Having worked at Actian, at Pervasive Software before that, and before that at a little startup ETL software company called Data Junction, where DataFlow was born, I know way more about this engine than the shiny paint on the surface. I know it down to the dirt and grease under the wheels.

Since I no longer work for Actian, I now have the option to give my completely honest opinion of DataFlow’s strengths and weaknesses. I no longer have any party line to toe. Perhaps surprisingly, that hasn’t significantly changed what I have to say about it. I still think that this little engine could take over the Hadoop execution world, but I also think that it probably won’t.

Data Processing Paradigm (How does it work?)

DataFlow was originally invented back in the early 2000s for the multi-core revolution. As Moore’s Law started to slow down, a lot of hardware folks adapted to chips no longer getting faster at the same rate by putting more and more cores on each chip. DataFlow was designed to automatically scale up at runtime to make best use of all those cores, without knowing ahead of time how many cores it was going to be running on. Its power lay in a philosophy of “create once, run many” and of leaving no hardware power behind. It squeezed levels of performance out of standard hardware that no one previously believed possible.

Then along came Hadoop, and instead of spending money on scaling up on machines with more and more cores, businesses started scaling out to do their data processing on multiple computers. A few code tweaks later, and voilà, DataFlow was an engine for clusters that detected available cores and nodes, and automatically parallelized jobs at runtime to make best use of all available hardware.
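To make the “create once, run many” idea concrete, here is a minimal Python sketch of a job that discovers how many workers it has only when it runs. It is not DataFlow code and does not use the DataFlow API; the function names `split_into_partitions` and `run_parallel` are made up for illustration, and a thread pool stands in for real multi-core or multi-node execution (in CPython, threads will not speed up pure Python computation).

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, partition_count):
    """Deal records round-robin into partition_count lists."""
    partitions = [[] for _ in range(partition_count)]
    for index, record in enumerate(records):
        partitions[index % partition_count].append(record)
    return partitions


def run_parallel(records, operator, worker_count=None):
    """Apply operator to every record, sizing the worker pool at runtime."""
    # The job definition never hard-codes a core count; it is discovered here.
    worker_count = worker_count or os.cpu_count() or 1
    partitions = split_into_partitions(records, worker_count)
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        # Each partition is processed independently by one worker.
        partial_results = list(
            pool.map(lambda part: [operator(r) for r in part], partitions)
        )
    # Flatten the per-partition results; the original record order is not kept.
    return [result for part in partial_results for result in part]


if __name__ == "__main__":
    cleaned = run_parallel(["  a ", "b  ", " c"], str.strip)
    print(sorted(cleaned))
```

The same call works unchanged on a two-core laptop or a sixty-four-core server, because the partition count is derived from the hardware at runtime rather than written into the job.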

DataFlow uses the same high-performance computing DAG strategy that gives Tez its advantages over MapReduce, but it doesn’t have any of the MapReduce baggage since it actually pre-dates it by quite a few years. DataFlow was never influenced by the MapReduce kangaroo data processing paradigm. Pipelining data in memory was the focus when DataFlow was created. Since it was intended to be a next-gen data- and compute-intensive ETL engine, a lot of thought was put into transforming data many times in many ways as efficiently as possible. Later, parallel machine learning and predictive analytics operators were added that took advantage of the same multiple pipeline strategy, but at its heart, DataFlow is an ETL engine.
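The difference between pipelining in memory and hopping between stages can be sketched with Python generators. This is an illustration of the general technique only, not DataFlow’s implementation; the operator names and the `email` field are assumptions invented for the example, and the three operators form a simple linear chain rather than a full DAG.

```python
def read_records(rows):
    """Source operator: emit one record at a time."""
    for row in rows:
        yield row


def trim_fields(records):
    """Transform operator: strip whitespace from every string field."""
    for record in records:
        yield {
            key: value.strip() if isinstance(value, str) else value
            for key, value in record.items()
        }


def keep_valid(records, required_field):
    """Filter operator: drop records whose required field is empty."""
    for record in records:
        if record.get(required_field):
            yield record


def build_pipeline(rows):
    """Chain the operators; nothing runs until the result is consumed."""
    return keep_valid(trim_fields(read_records(rows)), "email")


if __name__ == "__main__":
    rows = [
        {"name": " Ann ", "email": " ann@example.com "},
        {"name": "Bob", "email": "   "},
    ]
    for record in build_pipeline(rows):
        print(record)
```

Each record flows through all three operators before the next one is read, so no intermediate data set is ever materialized between steps. A chain of MapReduce jobs, by contrast, writes each job’s output to storage before the next job starts.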

Interface (How do you use it?)

The folks developing DataFlow really had their fellow engineers in mind when they designed the framework, so a lot of emphasis was put on making it easier to develop with. Like many of the Hadoop engines, the first users of DataFlow were the same people who invented it. DataFlow’s creators built a lot of abstraction into the framework itself to handle the difficult parallel aspects of multi-threaded application building. The DataFlow Java API is a breeze compared to writing MapReduce or most other types of parallel code. Most Java programmers can pick it up in about a week.

But who wants to spend weeks writing code when you can string together pre-built parallel operators that someone else already put the work into? The JavaScript interface lets you build entire ETL and/or analytics applications in a few minutes.

Or, thanks to Actian’s partnership with KNIME, the open source, Eclipse-based data mining platform, you can drag and drop to build applications with a mouse.

Weak Areas (What is it NOT good for?)

DataFlow’s biggest weakness is obvious. It’s not open source. I’ve got a lot of love and admiration for this zippy little engine, but it’s just not going to make it up that Hadoop elephant-shaped hill without support from the open source community.

Right now, KNIME is the only open source community that even notices DataFlow’s existence, but they can’t touch the source code. So, even if the data mining and predictive analytics folks WANTED to improve, support and build around this engine, they couldn’t. For them, it’s just a handy bit of freeware that they can use to boost speed on larger data mining jobs.

Best Use Case (What is it good for?)

Like Spark, DataFlow doesn’t really require Hadoop. It will run fine on anything from a laptop to a supercomputer, almost any platform with a JVM. However, DataFlow has worked hand-in-hand with Hadoop development. It has its own built-in cluster manager and resource allocation capabilities created specifically so it could share resources on pre-YARN Hadoop versions. Then, DataFlow was practically first in line for the new YARN-Ready certification. DataFlow edged in through the back door as a second-class citizen, then YARN opened the door and made it welcome.

Like all Hadoop engines, and Hadoop itself, DataFlow was built to solve a problem, mainly compute-intensive data matching and profiling jobs that bogged down and took forever. The little engine was put to work more than a decade ago in Pervasive’s data quality tools, Data Profiler, Data Matcher and Data MatchMerge. Years before Hadoop was more than a crazy idea that Google did a research paper on, DataFlow (or DataRush as it was once called) was executing parallel fuzzy matching algorithms for high-speed record de-duplication, and blowing through data quality validation jobs against hundreds of business rules in seconds on plain old desktop computers. It’s had a lot of battle testing, and been refined by the use, abuse and demands of real users over those years.
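For readers who haven’t met fuzzy matching, the sketch below shows why record de-duplication is so compute intensive. It is a generic, single-threaded illustration built on Python’s standard `difflib` module, not the algorithm used in the Pervasive tools; the `similarity` and `deduplicate` helpers and the 0.85 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(left, right):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, left.lower(), right.lower()).ratio()


def deduplicate(names, threshold=0.85):
    """Keep the first record of every group of near-identical names."""
    kept = []
    for name in names:
        # Every new record is compared with every record kept so far,
        # so the work grows roughly with the square of the record count.
        if not any(similarity(name, seen) >= threshold for seen in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    print(deduplicate(["Jon Smith", "John Smith", "Jane Doe"]))
```

Because each record may have to be compared against all the others, the number of comparisons explodes as data volumes grow, which is the kind of workload that benefits from being spread across every available core.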

That level of maturity and time-tested solidity isn’t something you see yet in other Hadoop engines. If you have old-school ETL and data quality problems at modern massive scales, DataFlow can power through those at unmatched speeds, dependably. That’s DataFlow’s sweet spot.

Also, if you need basic statistics or machine-learning-style analytics, DataFlow handles those fairly well. The library of operators is limited, but if they meet your needs, the performance is excellent.

General Comparison to Other Options

If you look at sheer power to do what a Hadoop engine should do, which is crunch through data at high speeds, DataFlow looks pretty darned impressive. Spark is the only batch-style engine that even approaches DataFlow’s speed, and DataFlow doesn’t have the huge memory requirements that Spark has. (Yes, I know, Spark Streaming. Spark Streaming does micro-batching, not true stream processing. And so does DataFlow.)

Unfortunately, the ability to get the job done isn’t the only thing that guides adoption. Per-node software license fees are not popular with the companies that are likely to choose Hadoop, which tend to favor open source. The ease of use and generally higher processing speed might make up for that with some companies, and the head start of several years in software maturity could also help ease the pain of proprietary license costs.

The problem is that Spark has something in the neighborhood of 300 committers, and an entire ecosystem of its own being built around it. No company smaller than IBM or Oracle can afford to pay that many developers. Actian doesn’t stand a chance of keeping up. MapReduce and Tez have their own built-in communities and integrated stacks, as does Storm. That’s the one thing that every successful open source project absolutely must have: community support.

DataFlow functionality may be ahead for now, thanks to Actian CTO Mike Hoskins’ ability to see into the future and build software to solve problems other folks didn’t even know were problems yet. But it won’t take long for open source to make up that lead and pass DataFlow. Without community support, this powerful little engine that could, won’t.
