© 2008-25 SmartData Collective. All Rights Reserved.
Actian DataFlow, the Little Hadoop Engine That Could, But Probably Won’t

Paige Roberts

In Hadoop’s ecosystem of massively parallel cluster computing frameworks, Actian DataFlow is an anomaly. It’s a powerful little engine that thinks it can take on any data processing problem, no matter the scale. The trouble is that unlike MapReduce, Tez, Spark, Storm and all of the other Hadoop engines, DataFlow is proprietary, not open source.

Contents
  • Data Processing Paradigm (How does it work?)
  • Interface (How do you use it?)
  • Weak Areas (What is it NOT good for?)
  • Best Use Case (What is it good for?)
  • General Comparison to Other Options

Having worked at Actian, and Pervasive Software before that, and a little startup ETL software company called Data Junction before that where DataFlow was born, I know way more about this engine than the shiny paint on the surface. I know it down to the dirt and grease under the wheels.

Since I no longer work for Actian, I now have the option to give my completely honest opinion of DataFlow’s strengths and weaknesses. I no longer have any party line to toe. Perhaps surprisingly, that hasn’t significantly changed what I have to say about it. I still think that this little engine could take over the Hadoop execution world, but I also think that it probably won’t.

The Little Actian DataFlow Engine That Could

Data Processing Paradigm (How does it work?)

DataFlow was originally invented back in the early 2000s for the multi-core revolution. As Moore’s Law started to slow down, hardware makers compensated for chips no longer getting faster at the same rate by adding more and more cores. DataFlow was designed to scale up automatically at runtime to make the best use of all those cores, without knowing ahead of time how many it would be running on. Its power lay in a philosophy of “create once, run many” and leaving no hardware power behind. It squeezed performance out of standard hardware that no one previously believed possible.

Then along came Hadoop, and instead of spending money to scale up on machines with more and more cores, businesses started scaling out, doing their data processing across multiple computers. A few code tweaks later, and voila, DataFlow was a cluster engine that detected available cores and nodes, and automatically parallelized jobs at runtime to make the best use of all available hardware.
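The “create once, run many” idea is easy to sketch. The toy example below (plain Python with the standard library, not DataFlow’s actual API) shows a job written once that discovers how many cores it has at launch and sizes its worker pool accordingly; the per-record transformation is a made-up stand-in:

```python
import os
from concurrent.futures import ProcessPoolExecutor

def transform(record):
    # Stand-in for a CPU-bound per-record transformation.
    return record * record

def run_pipeline(records):
    # The degree of parallelism is chosen at runtime from the hardware,
    # not hard-coded when the job is written ("create once, run many").
    workers = os.cpu_count() or 1
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transform, records, chunksize=256))

if __name__ == "__main__":
    print(run_pipeline(range(10)))  # → [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The same script makes full use of a 4-core laptop or a 64-core server without being edited, which is the essence of runtime parallelization.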

DataFlow uses the same high-performance computing DAG strategy that gives Tez its advantages over MapReduce, but it carries none of the MapReduce baggage, since it actually pre-dates MapReduce by quite a few years. DataFlow was never influenced by MapReduce’s kangaroo-style data processing paradigm (hop, land on disk, hop again). Pipelining data in memory was the focus when DataFlow was created. Since it was intended to be a next-gen, data- and compute-intensive ETL engine, a lot of thought went into transforming data many times, in many ways, as efficiently as possible. Later, parallel machine learning and predictive analytics operators were added that took advantage of the same multiple-pipeline strategy, but at its heart, DataFlow is an ETL engine.
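As a conceptual illustration of pipelining (again, not DataFlow’s real operators), Python generators show how records can flow through a chain of operators in memory, with no intermediate result set materialized between stages; the operator names and sample rows are invented for the sketch:

```python
def read_source(rows):
    # Source operator: emits records one at a time.
    for row in rows:
        yield row

def filter_op(records, predicate):
    # Transform operator: records stream through; nothing lands on disk.
    for rec in records:
        if predicate(rec):
            yield rec

def project_op(records, field):
    # Projection operator: keeps a single field from each record.
    for rec in records:
        yield rec[field]

# Wiring operators into a pipeline: each record moves through every stage
# in memory before the next record is even read from the source.
rows = [{"id": 1, "amount": 50}, {"id": 2, "amount": 500}, {"id": 3, "amount": 75}]
pipeline = project_op(filter_op(read_source(rows), lambda r: r["amount"] < 100), "id")
print(list(pipeline))  # → [1, 3]
```

A real engine adds parallel partitions of each operator and back-pressure between them, but the shape — a DAG of streaming operators rather than MapReduce’s write-to-disk-between-phases model — is the same.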

Interface (How do you use it?)

The folks developing DataFlow really had their fellow engineers in mind when they designed the framework, so a lot of emphasis was put on making it easier to develop with. Like many of the Hadoop engines, the first users of DataFlow were the same people who invented it. DataFlow’s creators built a lot of abstraction into the framework itself to handle the difficult parallel aspects of multi-threaded application building. The DataFlow Java API is a breeze compared to writing MapReduce or most other types of parallel code. Most Java programmers can pick it up in about a week.

But who wants to spend weeks writing code when you can string together pre-built parallel operators that someone else has already put the work into? The JavaScript interface lets you build entire ETL and/or analytics applications in a few minutes.

Or, thanks to Actian’s partnership with KNIME, the open source Eclipse-based data mining platform, you can build applications by dragging and dropping with a mouse.

Weak Areas (What is it NOT good for?)

DataFlow’s biggest weakness is obvious. It’s not open source. I’ve got a lot of love and admiration for this zippy little engine, but it’s just not going to make it up that Hadoop elephant-shaped hill without support from the open source community.

Right now, KNIME is the only open source community that even notices DataFlow’s existence, but they can’t touch the source code. So, even if the data mining and predictive analytics folks WANTED to improve, support and build around this engine, they couldn’t. For them, it’s just a handy bit of freeware that they can use to boost speed on larger data mining jobs.

Best Use Case (What is it good for?)

Actian DataFlow Engine Power

Like Spark, DataFlow doesn’t really require Hadoop. It will run fine on anything from a laptop to a super-computer, almost any platform with a JVM. However, DataFlow has evolved hand-in-hand with Hadoop. It has its own built-in cluster manager and resource allocation capabilities, created specifically so it could share resources on pre-YARN Hadoop versions. Then DataFlow was practically first in line for the new YARN-Ready certification. DataFlow edged in through the back door as a second-class citizen; then YARN opened the door and made it welcome.

Like all Hadoop engines, and Hadoop itself, DataFlow was built to solve a problem: namely, compute-intensive data matching and profiling jobs bogging down and taking forever. The little engine was put to work more than a decade ago in Pervasive’s data quality tools: Data Profiler, Data Matcher and Data MatchMerge. Years before Hadoop was more than a crazy idea in a Google research paper, DataFlow (or DataRush, as it was once called) was executing parallel fuzzy matching algorithms for high-speed record de-duplication, and blowing through data quality validation jobs against hundreds of business rules in seconds on plain old desktop computers. It has had a lot of battle testing, and it has been refined by the use, abuse and demands of real users over those years.
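For readers unfamiliar with fuzzy-matching de-duplication, here is a toy sketch of the general idea using Python’s standard-library difflib. The similarity measure and the 0.85 threshold are illustrative assumptions; production matching engines use richer measures (phonetic keys, edit-distance variants) plus blocking so they never have to compare every pair of records:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Crude similarity ratio in [0, 1] between two strings.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def dedupe(names, threshold=0.85):
    # Keep a record only if it isn't a near-duplicate of one already kept.
    kept = []
    for name in names:
        if all(similarity(name, k) < threshold for k in kept):
            kept.append(name)
    return kept

records = ["Acme Corp.", "ACME Corp", "Acme Corporation", "Zenith Ltd"]
print(dedupe(records))
```

Here “ACME Corp” is recognized as a near-duplicate of “Acme Corp.” and dropped, while the dissimilar names survive; parallelizing thousands of such comparisons per second is exactly the kind of work the article describes.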

That level of maturity and time tested solidity isn’t something you see yet in other Hadoop engines. If you have old school ETL and data quality problems at modern massive scales, DataFlow can power through those at unmatched speeds, dependably. That’s DataFlow’s sweet spot.

Also, if you need basic statistics or machine learning style analytics, DataFlow handles those fairly well. The library of operators is limited, but if they meet your needs, the performance is excellent.

General Comparison to Other Options

If you look at sheer power to do what a Hadoop engine should do, crunch through data at high speed, DataFlow looks pretty darned impressive. Spark is the only batch-style engine that even approaches DataFlow’s speed, and DataFlow doesn’t have Spark’s huge memory requirements. (Yes, I know about Spark Streaming. Spark Streaming does micro-batching, not true stream processing. And so does DataFlow.)
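Micro-batching, for reference, just means grouping arriving records into small batches and processing each batch as a unit, so latency can never drop below the batch boundary the way it can in true record-at-a-time streaming. A minimal sketch (a conceptual illustration, not Spark Streaming or DataFlow code):

```python
from itertools import islice

def micro_batches(stream, batch_size):
    # Group an incoming stream into fixed-size batches; each batch is
    # processed as a unit, so per-record latency is bounded below by
    # how long it takes a batch to fill.
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

events = range(7)
print(list(micro_batches(events, 3)))  # → [[0, 1, 2], [3, 4, 5], [6]]
```

A true streaming engine like Storm instead hands each record to the next operator the moment it arrives, which is the distinction the parenthetical above is drawing.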

Unfortunately, the ability to get the job done isn’t the only thing that guides adoption. Per-node software license fees are not popular with the open-source-favoring companies that are likely to choose Hadoop. The ease of use and generally higher processing speed might make up for that at some companies, and the several-year head start in software maturity could also help ease the pain of proprietary license costs.

The problem is that Spark has something in the neighborhood of 300 committers, and an entire ecosystem of its own being built around it. No company smaller than IBM or Oracle can afford to pay that many developers, so Actian doesn’t stand a chance of keeping up. MapReduce and Tez have their own built-in communities and integrated stacks, as does Storm. That’s the one thing every successful open source project absolutely must have: community support.

DataFlow’s functionality may be ahead for now, thanks to Actian CTO Mike Hoskins’ ability to see into the future and build software to solve problems other folks didn’t even know were problems yet. But it won’t take long for open source to make up that lead and pass DataFlow. Without community support, this powerful little engine that could, won’t.
