A Technical Look at Big Data

July 30, 2013

Big data can mean a number of things—and guide business decisions in countless ways. It can involve analyzing sheer volumes of data, from 100s of terabytes to petabytes. Or it can imply getting data and information from unlikely sources—sensors from a machine on an assembly line, Twitter and Facebook, or a company’s visitor web log. Big data also entails analyzing streaming data, such as stock market tickers and the external factors that make the market go up or down. In all instances, at the core of big data is an effort to understand behavior—and to use that understanding to make predictions and guide smart next steps.

That said, questions abound about how to make the most of big data—and use it strategically to inform key decisions in your business or organization. While there’s no easy answer, and many companies don’t have the time or expertise to craft and implement a plan, the first step is understanding the tools and technologies behind big data—and their potential to deliver deep insights to your team.

Hadoop

When the conversation turns to big data, Apache’s technology, Hadoop, comes up time and again. But if you ask most people how Hadoop actually works, they likely won’t know. Keep reading, and we’ll do our best to explain.

In a nutshell, Hadoop is a large-scale data processing and storage system. It uses the MapReduce programming model (first developed by Google) to process data and the Hadoop Distributed File System (HDFS) to store it. Here’s an abridged version of how it works: MapReduce jobs roll up the original input data into aggregates grouped by the job’s “keys,” which can be anything the code dictates. Once a job runs, the aggregates for each key are written to HDFS, which spreads the data across many low-cost commodity servers.
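To make the key-based roll-up concrete, here is a minimal single-machine sketch in Python. It is not Hadoop code: the function names and the sample web log records are hypothetical, and a real job would run the same three steps across a cluster and write the result to HDFS.

```python
from collections import defaultdict

def map_phase(records):
    """Emit one (key, value) pair per input record; here the key is the page visited."""
    for record in records:
        yield record["page"], 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Collapse each key's values into a single aggregate (a count of hits)."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical web log records, invented for illustration.
web_log = [
    {"page": "/alarms", "visitor": "a"},
    {"page": "/cameras", "visitor": "b"},
    {"page": "/alarms", "visitor": "c"},
]

hits_per_page = reduce_phase(shuffle(map_phase(web_log)))
print(hits_per_page)  # {'/alarms': 2, '/cameras': 1}
```

Changing what `map_phase` emits as the key (visitor, hour, region) changes the aggregate without touching the other steps, which is the sense in which the keys "can be anything the code dictates."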

What makes Hadoop so popular—and powerful? Hadoop’s strength lies in its flexibility to add thousands of computers to the solution to improve the performance of the jobs and provide added data storage. All of the jobs working to break down the data operate in parallel across the many different servers in the Hadoop cluster.
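The idea of splitting work across machines and merging the partial results can be illustrated on a single computer. The sketch below uses Python threads as stand-ins for servers, which is an assumption made purely for illustration; the function names and sample lines are hypothetical, and Hadoop itself distributes work across separate machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_partition(lines):
    """Work done by one 'server': count the words in its own slice of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def parallel_word_count(lines, workers):
    """Split the input into one partition per worker, count in parallel, then merge."""
    partitions = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_partition, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

# Hypothetical input, invented for illustration.
lines = ["alarm camera", "camera lock", "alarm alarm"]
result = parallel_word_count(lines, workers=2)
print(dict(result))
```

The merged totals do not depend on how many workers share the input, so adding workers changes only how the work is divided.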

Learn about the complete technical architecture of Hadoop here.  

More on Hadoop: What’s in its stack?

While Hadoop can process large data sets and distribute aggregations across many servers, it responds to user queries far more slowly than traditional transactional databases do. So you can think of Hadoop as the technology that retrieves, aggregates, and stores the data, but not the technology that serves data to your users. For that, we turn to other systems, such as a scheduled processing architecture, that allow users to interact with the data in a timely manner (meaning fast).

In the Apache world of Hadoop, several technologies make up the Hadoop stack:

  • HBase – a NoSQL database that provides structured storage for the data
  • Hive – often used as the bridge between the unstructured data sources and business intelligence (BI) components, such as PowerPivot and SSAS, that expect a tabular dataset
  • Pig – seen as the extract-transform-load (ETL) layer for Hadoop; its scripts run as MapReduce jobs that write a flat file of data to HDFS, which can then be used as a source for another part of the big data architecture
  • Sqoop – a tool for transferring bulk data between HDFS and a relational database, such as SQL Server

Hadoop also comes in many different flavors. Among the most popular implementations are:

  • Microsoft HDInsight – MS + Hortonworks
    • Can be installed on a Windows platform or used in Windows Azure
    • JavaScript MapReduce framework
    • Includes an ODBC driver for Hive
    • Browser interface for interactions
    • Works with the MS BI stack very well
  • EMR – Amazon
    • Works with Amazon’s S3 storage
    • Data is retrieved from S3, MapReduce jobs execute against the buckets, and results are stored back in S3
    • All command line interface
  • CDH4 – Cloudera
    • Virtual Machine available free – requires VMware
    • Hue is the browser interface
    • Creates MapReduce jobs, similar to under the hood of EMR and HDInsight
    • New engine for CDH4 – Impala
      • bypasses MapReduce – queries interactively over the data
      • Brings BI tools and Hadoop closer together
      • Query with Beeswax
      • In Beta no
      • Not an Apache Software Foundation project – owned by Cloudera

What about implementation?

To implement a big data solution, one can pursue any number of technical paths—and end up in the same place. The real challenge of big data isn’t the technology—it’s figuring out the business value of the data, and exactly what insights can be gained from the solution. Another challenge involves determining the scale—and answering such questions as how much data to gather and aggregate, how often it needs to be refreshed, and how often the analysis will be used and reexamined (down the road) for new answers.

For a one-time implementation, one could structure the data by grabbing a past event—a promotion, for example—from several sources, such as an internal warehouse of data, web logs of visitors who came to the online store during the promotion period, and sales from the warehouse during the time of the promotion. These three pieces could be put together using Data Explorer, PowerPivot, and PowerView to provide some initial analytics of the promotion itself—and to answer such questions as: Did web traffic to the online store increase for the target market? Did sales increase for the target product? What were the costs associated with the promotion, and did the increase of sales for the period offset those costs?
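The last of those questions is simple arithmetic once the figures are in one place. The following Python sketch shows one way to frame it; the function name, the profit margin parameter, and every number are hypothetical, and a real analysis would pull these values from the warehouse and web logs.

```python
def promotion_summary(baseline_sales, promo_sales, promo_cost, margin):
    """Compare sales during a promotion with a baseline period.

    margin is the fraction of each sales dollar kept as profit (an assumed input).
    """
    incremental_sales = promo_sales - baseline_sales
    incremental_profit = incremental_sales * margin
    return {
        "incremental_sales": incremental_sales,
        "incremental_profit": incremental_profit,
        "net_gain": incremental_profit - promo_cost,
        "offset_costs": incremental_profit >= promo_cost,
    }

# Hypothetical figures, invented for illustration.
summary = promotion_summary(
    baseline_sales=100_000,
    promo_sales=130_000,
    promo_cost=8_000,
    margin=0.25,
)
print(summary)
```

With these made-up figures, sales rise by 30,000 but the added profit of 7,500 falls 500 short of the promotion's cost, which shows why "did sales increase?" and "did the increase offset the costs?" are separate questions.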

These are good questions, and can certainly guide future promotions. But the more useful, strategic questions need to be asked earlier on, before the promotion even starts. These questions include: Who should we target in the first place? What data can point us to that target? 

Big Data in Action

The best way to understand big data is to see it in action. Consider this business scenario of a company that sells security products. Ideally, the company would market to areas across the country with high crime rates but low sales of security products. The reasoning? If sales are already high, more marketing probably won’t lead to a spike. And if crime is low, the need simply isn’t there. However, if the company can find the sweet spot in which sales have the most potential to spike, it can identify and go after a target.
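The sweet-spot reasoning can be sketched as a simple filter over per-region figures. In this Python sketch, the median cutoffs are an assumption chosen for illustration (the scenario does not define "high" or "low"), and the region names and numbers are invented.

```python
from statistics import median

def find_target_regions(regions):
    """Return regions with above-median crime and below-median sales, highest crime first."""
    crime_cutoff = median(r["crime_rate"] for r in regions)
    sales_cutoff = median(r["sales_per_capita"] for r in regions)
    targets = [
        r for r in regions
        if r["crime_rate"] > crime_cutoff and r["sales_per_capita"] < sales_cutoff
    ]
    return sorted(targets, key=lambda r: r["crime_rate"], reverse=True)

# Hypothetical per-region data, invented for illustration.
regions = [
    {"name": "A", "crime_rate": 50, "sales_per_capita": 2.0},
    {"name": "B", "crime_rate": 20, "sales_per_capita": 9.0},
    {"name": "C", "crime_rate": 45, "sales_per_capita": 3.0},
    {"name": "D", "crime_rate": 15, "sales_per_capita": 8.0},
    {"name": "E", "crime_rate": 10, "sales_per_capita": 1.5},
    {"name": "F", "crime_rate": 30, "sales_per_capita": 7.0},
]

target_names = [r["name"] for r in find_target_regions(regions)]
print(target_names)  # ['A', 'C']
```

Region F is excluded because sales there are already high, and region E because crime is low, matching the two exclusions described in the scenario.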

To extend the scenario further, and show you how the big data puzzle pieces fit together, let’s look at the data sources we might pull from to determine our market target. These include:

  • A data warehouse that contains sales and promotion information;
  • An internal online website for customer transactions, with web log data not yet utilized to analyze traffic hits; and
  • A public market store available for crime data in the United States (this data is currently available in the Windows Azure Marketplace).

Now, here’s what the implementation looks like as we pull the data into Excel PowerPivot/PowerView:

What technologies would enable us to aggregate this data in one place? We’d start with Hadoop, and use it to:

  • Import the web log data into an HDInsight instance, since this is the true unstructured data source.
  • Create a Hive structure on top of this data to provide a tabular set of data.
  • Create an ODBC connection to the Hive source to be used as a data source.
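To show what "a tabular set of data" on top of raw web logs means, here is a small Python sketch that turns one log line into named columns. It is not Hive code. It assumes the logs use the Common Log Format, which the scenario does not specify, and the sample line and column names are hypothetical.

```python
import re

# Pattern for the Common Log Format (an assumption about the log layout).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Turn one raw log line into a row of named columns, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    return row

# Hypothetical log line, invented for illustration.
sample = '203.0.113.7 - - [30/Jul/2013:10:15:32 -0700] "GET /alarms HTTP/1.1" 200 5120'
row = parse_line(sample)
print(row)
```

Each parsed row has the same named fields, which is the kind of structure a BI tool expects to find when it connects over ODBC.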

Next, using Microsoft’s Business Intelligence stack, we’d:

  • Set up a feed of the U.S. government crime data (from Windows Azure Marketplace) into a file;
  • Identify and pull from an existing data warehouse that stores sales and promotion data;
  • Combine these data sources into an SSAS tabular cube instance, which would serve as the source for the client toolset that interacts with the data, and schedule it for processing; and
  • Build PowerView reports to combine data from promotions, sales, and web log visits, and to analyze potential customer bases, etc., against the SSAS cube.
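Combining the sources comes down to relating them on a shared key. The Python sketch below joins three per-region lookups into one row per region; using region as the key is an assumption made for illustration, as are the function name and all of the figures. In the actual solution the SSAS tabular model would hold these relationships.

```python
def combine_by_region(sales, visits, crime):
    """Join three per-region lookups into one row per region."""
    all_regions = sorted(set(sales) | set(visits) | set(crime))
    return [
        {
            "region": region,
            "sales": sales.get(region, 0.0),       # warehouse sales, 0.0 if none recorded
            "web_visits": visits.get(region, 0),   # web log visits, 0 if none recorded
            "crime_rate": crime.get(region),       # None when no crime data exists
        }
        for region in all_regions
    ]

# Hypothetical inputs, invented for illustration.
sales = {"North": 1200.0, "South": 300.0}
visits = {"North": 40, "South": 95, "East": 10}
crime = {"North": 12.5, "South": 48.0}

combined = combine_by_region(sales, visits, crime)
for entry in combined:
    print(entry)
```

Keeping regions that appear in only one source (here, East) makes gaps in the data visible instead of silently dropping them.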

From a tabular point of view, the relational architecture might look like this:

And from a reporting perspective, we might generate the following series of reports to use in the analytics:

What does this tell us about big data?

Big data is really just a term that describes how to gain predictive analytics from the wide range of information-gathering tools available today. Many software systems and players occupy the big data space—and a lot of terms and technologies make it work. The bottom line, though, is that the technology will do nothing without a clearly defined business value.

So now that you see the big picture, the real task is figuring out what to do with the data—and how it can help you solve problems, explore new ways of thinking, and ultimately lead to profitable decisions.