First Look: Datameer

As part of an ongoing expansion of our ecosystem mapping to include more Hadoop-based products, I recently got an update from Datameer. Datameer was founded back in 2009 by Stefan Groschupf, who was one of the original contributors to Nutch, the open source project that spun off Hadoop. Prior to starting Datameer, he and the founding team were architecting and implementing custom distributed big data analytic systems. After several years of implementing the same kind of custom solutions over and over, they started Datameer to productize their experience. Datameer is a Hadoop-centric product, purpose-built from the beginning as a Hadoop solution. The company is VC funded and has a team of about 80, headquartered in San Mateo with a core engineering team in Germany.

Datameer’s product is an analytic application – a BI and analytics layer – that runs on Hadoop. Datameer aims to abstract the complexity of Hadoop through three key elements:

Wizard-based data integration for business users that ingests data into Hadoop from 55+ different sources (databases, some cloud sources, email, etc.) or through an open API.
Point and click analytics based on an interactive spreadsheet.
The spreadsheet has 240+ pre-built analytic functions (from nearest neighbor and text analytics to simple joins) and provides instant previews based on a Smart Sample built when the data was integrated. Once your analytic pipeline is defined, functions can be executed against the full data set at once, on a schedule, or on demand.
Drag and drop visualization based on HTML 5.
The graphics start with a blank canvas for flexibility and allow annotation (including video and other more advanced annotations). All visualizations are updated automatically with new data. Part of the driver for this was a realization that business users need to present results, and that taking a screenshot of dashboards to annotate elsewhere meant the data displayed was instantly out of date.

There is also an app market for packaged analytic solutions, which can be built by any user of the product using a one-click “create app” option and then shared or sold. Datameer aims to play nice with all the various Hadoop distributions and has bi-directional connections to a wide range of databases, making it easy to fit into a company’s current environment.

Customers tend to be either larger companies with lots of existing databases and BI tools, where Datameer acts as a data hub that brings data sources together in Hadoop, or emerging web 2.0 businesses that have all their data in Hadoop and don’t have a lot of historical BI infrastructure. Datameer claims plenty of large customers in both categories – banks, telcos, online gaming, etc.

They recently announced 3.0, which builds on these core functions in a number of ways. In particular they have introduced what they call “Smart Analytics” – or self-service machine learning for Hadoop. This delivers a set of four key algorithms:

Clustering
The k-means algorithm is used to group data into clusters that are alike.
Column dependencies
Shows the degree of correlation between columns in a sheet.
Decision trees
Shows the different combinations of data attributes that result in a desired outcome.
Recommendations
Develops a rating or preference for an item not yet associated with a record, based on the items that are associated with it.
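To make the clustering entry a little more concrete, here is a minimal sketch of k-means in plain Python. It is not Datameer's implementation, which runs on Hadoop and is not described in the briefing. The function name `kmeans`, the choice of the first k points as starting centroids, and the sample data are all assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=20):
    """Group points into k clusters; returns (centroids, labels)."""
    # Assumption: seed with the first k points to keep the sketch deterministic.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for i, members in enumerate(clusters):
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:  # no movement means the clusters are stable
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical two-column data with two visibly separate groups.
points = [(1, 1), (1.5, 2), (8, 8), (9, 9.5), (1, 0.5), (8.5, 9)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The labels are the kind of result that could be written next to the source rows, so rows that are alike end up sharing a cluster number.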

A simple click takes a set of data and runs these machine learning algorithms against it. A graphical preview, using the Smart Sample, is available. Running the algorithms creates new sheets containing the source data as well as the results of the machine learning algorithms. These sheets can be stored back, used for additional analysis, etc. The algorithms are based on available public algorithms, and the implementation is designed both to execute against Hadoop and to allow very non-technical users to use machine learning models without a data scientist.
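As an illustration of what a result sheet might hold for the column dependencies algorithm, the sketch below computes a correlation for every pair of columns. Pearson correlation is assumed here as one common measure; the briefing did not say which measure Datameer uses. The names `pearson` and `column_dependencies` and the sample columns are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column is treated as showing no dependency
    return covariance / (spread_x * spread_y)


def column_dependencies(sheet):
    """Map each pair of column names to the correlation between them."""
    names = list(sheet)
    return {
        (a, b): pearson(sheet[a], sheet[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical sheet: column name -> list of values.
sheet = {
    "visits": [1, 2, 3, 4, 5],
    "spend": [2, 4, 6, 8, 10],
    "returns": [5, 4, 3, 2, 1],
}
for pair, value in column_dependencies(sheet).items():
    print(pair, round(value, 3))
```

Values near 1 or -1 indicate columns that move together or in opposite directions, and values near 0 indicate little linear relationship.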

In addition, Datameer allows you to take a PMML model and then execute it against data stored in Hadoop. A PMML file generated from any data mining environment can be loaded and is turned into a custom Datameer function. This function can then be used like any other in Datameer, so the scores calculated from the model can be stored back into Hadoop and visualized/analyzed or pushed out to another environment (using Datameer as a data integration platform). With increased support among data mining tools for big data, this is very timely, allowing you to extract data to your modeling environment (from Datameer perhaps), build the model(s) you want and then push them back into Datameer for ongoing use in your analysis.
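The idea of turning a PMML file into a function can be sketched in a few lines of Python. This is not how Datameer does it internally, which was not covered in the briefing. It handles only a simple PMML RegressionModel with numeric predictors, and the model content, field names and the helper `load_regression_function` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny hypothetical PMML document: score = 1.5 + 2.0 * x1 - 0.5 * x2
PMML_TEXT = """
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header/>
  <DataDictionary numberOfFields="3">
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="x2" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x1"/>
      <MiningField name="x2"/>
      <MiningField name="y" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x1" exponent="1" coefficient="2.0"/>
      <NumericPredictor name="x2" coefficient="-0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
""".strip()


def local_name(tag):
    """Drop the XML namespace prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def load_regression_function(pmml_text):
    """Parse a simple PMML regression model and return a scoring function."""
    root = ET.fromstring(pmml_text)
    table = next(el for el in root.iter()
                 if local_name(el.tag) == "RegressionTable")
    intercept = float(table.get("intercept"))
    terms = [
        (p.get("name"), float(p.get("coefficient")), int(p.get("exponent", "1")))
        for p in table
        if local_name(p.tag) == "NumericPredictor"
    ]

    def score(record):
        # record maps field names to numeric values
        return intercept + sum(c * record[name] ** e for name, c, e in terms)

    return score


score = load_regression_function(PMML_TEXT)
print(score({"x1": 3.0, "x2": 4.0}))
```

Once loaded, `score` behaves like any other function: it can be applied row by row and its output written out as a new column of scores.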

You can get more information on Datameer and its support for PMML here.