Incorporating MapReduce in the Analytics Environment

Contents

Scenario 1: A MapReduce platform as a “live” archive
Scenario 2: A MapReduce platform as an ETL and filtering platform
Scenario 3: A MapReduce platform as an exploration engine

A few weeks ago, I attended the Hadoop World show in New York to hear first-hand how organizations are making use of the new technology. In future postings, I may address which claims I thought entered the hype zone and which value propositions seemed weak. Here, however, I want to focus on three specific cases where a MapReduce platform, such as Hadoop, clearly has an important and valuable role to play. These three examples alone are enough to justify looking at incorporating MapReduce into your analytics environment.

I am going to use the generic term MapReduce in this column. Hadoop is an open source implementation of the MapReduce framework, but there are others, such as Teradata’s Aster Data offering. The important part for our purposes here is what a MapReduce framework can do, not which specific implementation you choose.

Scenario 1: A MapReduce platform as a “live” archive

This is a theme I heard repeatedly, and it was explicitly mentioned by speakers from show sponsor Cloudera. Consider a huge mass of historical data that rarely needs to be accessed. Traditionally, such data would be archived to tape or some other medium and sent to storage (if it wasn’t simply deleted). While in theory the data could be accessed if needed, recovering it is difficult and expensive. In practice, people rarely leverage the archives.

With disk space being so cheap, the data can now be stored on the inexpensive, commodity hardware of a MapReduce platform instead. The data remains accessible at any time, which makes it a “live” archive. Perhaps few users will touch the data over time, but when they need to, they can. MapReduce is an inexpensive way to enable such an archive. It wouldn’t make sense to keep this archive live in a more expensive, formal data warehousing environment.

Scenario 2: A MapReduce platform as an ETL and filtering platform

One of the biggest challenges with big data sources, such as web logs, sensor data, or even masses of email messages, is extracting the key pieces of valuable information from the noise. Loading raw web logs into a database system only to throw away 90% or more of the data during processing isn’t the best way to go. Loading large, raw files into a MapReduce environment for initial processing makes terrific sense.

A MapReduce platform can be used to read in the raw data, apply appropriate filters and logic against it, and output a more structured, usable set of data. That reduced set of data can then be further analyzed in the MapReduce environment or migrated into a traditional analysis environment. The key is that only the important pieces of data remain, which makes it much more manageable. Typically, only a small percentage of a raw big data feed is required for a given business problem. MapReduce is a great tool to extract those pieces.
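To make the filtering idea concrete, here is a minimal sketch in plain Python of the map-and-reduce pattern applied to a web log. It is an illustration only, not Hadoop's API or any vendor's implementation. The log format, the sample lines, and the names `map_record`, `reduce_counts`, and `run_job` are all assumptions of mine; a real platform would distribute the same two steps across many machines.

```python
from collections import defaultdict

# Hypothetical raw log lines in an assumed format: "timestamp user_id url status"
RAW_LOG = [
    "2011-11-08T10:00:01 u1 /product/42 200",
    "2011-11-08T10:00:02 u1 /static/logo.png 200",
    "2011-11-08T10:00:03 u2 /product/42 200",
    "2011-11-08T10:00:04 u2 /product/7 404",
    "malformed line",
    "2011-11-08T10:00:05 u3 /product/7 200",
]


def map_record(line):
    """Map step: parse one raw line and emit (url, 1) only for the
    records we care about (successful product page views)."""
    parts = line.split()
    if len(parts) != 4:
        return  # drop malformed lines: part of the "noise"
    _timestamp, _user_id, url, status = parts
    if status == "200" and url.startswith("/product/"):
        yield url, 1


def reduce_counts(key, values):
    """Reduce step: collapse all values emitted for one key into a total."""
    return key, sum(values)


def run_job(lines):
    """Single-process stand-in for the framework: group the mapped
    pairs by key (the "shuffle"), then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_record(line):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in sorted(groups.items()))


result = run_job(RAW_LOG)
print(result)
```

Of the six raw lines, only three survive the filter, and they are reduced to two summary rows. That small, structured output is what would be analyzed further or migrated elsewhere.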

Scenario 3: A MapReduce platform as an exploration engine

Another recurrent theme at the show was the concept of a MapReduce platform being used for discovery and exploratory analysis. This is another solid application for MapReduce. Once raw data has been read and processed, further analysis can be done against the data within the MapReduce environment. As always, many paths of analysis may be tried before a successful one is found. Once the data is in a MapReduce setting, utilizing tools to analyze it where it already sits makes sense.

This scenario leads to a major decision point, however. Once a set of data is found to have high value via analysis in MapReduce, an important next step is to combine the new data with existing data, because each data source becomes even more valuable when combined with the others. Once you have distilled the data down to what is important, it should be loaded into the corporate systems that users widely have access to. It doesn’t make sense to pull all of the data out of a data warehouse, for example, just to match it with one new source of big data. It makes more sense to load the one new source of data alongside all the other pre-existing data within a data warehouse.
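As a rough sketch of what loading the one new source alongside existing data looks like, the Python below uses an in-memory SQLite database as a stand-in for a data warehouse. The table names, columns, and sample rows are hypothetical, and a real warehouse would use its own bulk-loading tools rather than row inserts. The point is only the direction of movement: the small distilled set travels to the existing data, not the other way around.

```python
import sqlite3

# In-memory SQLite database standing in for the data warehouse (an assumption).
conn = sqlite3.connect(":memory:")

# Pre-existing corporate data that already lives in the warehouse (hypothetical).
conn.execute("CREATE TABLE customers (customer_id TEXT PRIMARY KEY, segment TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("u1", "retail"), ("u2", "wholesale"), ("u3", "retail")],
)

# Distilled output of the MapReduce job (hypothetical): one row per customer,
# far smaller than the raw feed it was derived from.
distilled_views = [("u1", 5), ("u2", 2)]

# Load only the distilled set next to the existing tables.
conn.execute("CREATE TABLE product_views (customer_id TEXT, views INTEGER)")
conn.executemany("INSERT INTO product_views VALUES (?, ?)", distilled_views)

# The new source can now be combined with existing data where users already work.
rows = conn.execute(
    """
    SELECT c.segment, SUM(v.views) AS total_views
    FROM customers AS c
    JOIN product_views AS v ON v.customer_id = c.customer_id
    GROUP BY c.segment
    ORDER BY c.segment
    """
).fetchall()
print(rows)
```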

That last point is one where those loyal to MapReduce may differ with me. Many discussions at Hadoop World suggested that it does make sense to pull all corporate data into MapReduce. I predict that in the long run, however, things will go as I suggest above. Keeping data movement to a minimum is essential, as is making it available to as wide an audience as possible. For these reasons, MapReduce environments will augment, rather than replace, traditional environments.