A Technical Look at Big Data

Chuck Rivel
Last updated: 2013/07/30 at 8:00 AM

Big data can mean a number of things, and it can guide business decisions in countless ways. It can involve analyzing sheer volumes of data, from hundreds of terabytes to petabytes. Or it can imply getting data and information from unlikely sources: sensors on an assembly-line machine, Twitter and Facebook, or a company’s web visitor log. Big data also entails analyzing streaming data, such as stock market tickers and the external factors that make the market go up or down. In all instances, at the core of big data is an effort to understand behavior, and to use that understanding to make predictions and guide smart next steps.

That said, questions abound about how to make the most of big data—and use it strategically to inform key decisions in your business or organization. While there’s no easy answer, and many companies don’t have the time or expertise to craft and implement a plan, the first step is understanding the tools and technologies behind big data—and their potential to deliver deep insights to your team.

Hadoop

When the conversation turns to big data, Apache’s technology, Hadoop, comes up time and again. But if you ask most people how Hadoop actually works, they likely won’t know. Keep reading, and we’ll do our best to explain.

In a nutshell, Hadoop is a large-scale data processing and storage system. It uses the MapReduce programming model (first developed by Google) to process data and the Hadoop Distributed File System (HDFS) to store it. Here’s an abridged version of how it works: a MapReduce job rolls up the original input data into aggregates defined by the job’s “keys,” which can be anything the code dictates. Once the job has run, the aggregates for each key are stored in HDFS, which spreads the data across many low-cost commodity servers.
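
The roll-up by key can be sketched in a few lines of plain Python. This is only an illustration of the MapReduce idea, not Hadoop code: the sample records and the function names `map_phase`, `shuffle`, and `reduce_phase` are assumptions made for the example, and everything runs in a single process.

```python
from collections import defaultdict

# Hypothetical input: (page, bytes_sent) pairs taken from a web log.
records = [
    ("/home", 512),
    ("/products", 2048),
    ("/home", 256),
    ("/checkout", 1024),
    ("/products", 1024),
]

def map_phase(record):
    """Emit (key, value) pairs; here the key is the page path."""
    page, bytes_sent = record
    yield page, bytes_sent

def shuffle(pairs):
    """Group all emitted values by key, as a framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Roll the values for one key up into a single aggregate."""
    return key, {"hits": len(values), "bytes": sum(values)}

mapped = [pair for record in records for pair in map_phase(record)]
aggregates = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(aggregates)
```

Changing what `map_phase` emits as the key (a date, a customer ID, a region) changes the aggregate, which is what “anything the code dictates” means in practice.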

What makes Hadoop so popular—and powerful? Hadoop’s strength lies in its flexibility to add thousands of computers to the solution to improve the performance of the jobs and provide added data storage. All of the jobs working to break down the data operate in parallel across the many different servers in the Hadoop cluster.
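
To make the parallelism concrete, here is a rough sketch in which the input is cut into splits, each split is aggregated independently, and the partial results are merged. It assumes a thread pool as a stand-in for the servers in a cluster; the names `count_keys` and `parallel_count` are hypothetical, and a real cluster distributes the work across machines rather than threads.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_keys(split):
    """Work done independently on one split of the input."""
    return Counter(split)

def parallel_count(keys, workers=3):
    # Cut the input into one split per worker (a stand-in for data blocks).
    splits = [keys[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_keys, splits))
    # Merge the partial aggregates into the final result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

page_hits = ["/home", "/products", "/home", "/checkout", "/products", "/home"]
totals = parallel_count(page_hits)
print(totals)
```

Because each split is counted on its own, the result is the same whatever the number of workers; adding workers only changes how the work is divided.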

More on Hadoop: What’s in its stack?

While Hadoop can process large data sets and distribute aggregations across many servers, it responds to user queries far more slowly than traditional transactional databases do. So you can think of Hadoop as the technology that retrieves, aggregates, and stores the data, but not the technology that serves data to your users. For that, we turn to other systems, which process the aggregated data on a schedule and let users interact with it quickly.

In the Apache world of Hadoop, several technologies make up the Hadoop stack:

  • HBase – a NoSQL database that provides structured storage for the data
  • Hive – often used as the bridge between unstructured data sources and business intelligence (BI) components, such as PowerPivot and SSAS, that expect a tabular dataset
  • Pig – seen as the extract-transform-load (ETL) tool for Hadoop; its scripts run as MapReduce jobs that write a flat file of data to HDFS, which can then be used as a source for another part of the big data architecture
  • Sqoop – a tool for transferring bulk data between HDFS and a relational database, such as SQL Server
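
The hand-off from a flat file to a relational database can be pictured with a small sketch. It does not use Sqoop or HDFS: an in-memory string stands in for the flat file, SQLite stands in for the relational database, and the table name `page_hits` and the function `bulk_load` are invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical flat file, as an aggregation job might leave it behind.
flat_file = io.StringIO(
    "page,hits\n"
    "/home,3\n"
    "/products,2\n"
    "/checkout,1\n"
)

def bulk_load(source, connection):
    """Copy every row of the flat file into a relational table."""
    connection.execute(
        "CREATE TABLE page_hits (page TEXT PRIMARY KEY, hits INTEGER)"
    )
    rows = [(r["page"], int(r["hits"])) for r in csv.DictReader(source)]
    connection.executemany("INSERT INTO page_hits VALUES (?, ?)", rows)
    connection.commit()
    return len(rows)

db = sqlite3.connect(":memory:")
loaded = bulk_load(flat_file, db)
top_page = db.execute(
    "SELECT page, hits FROM page_hits ORDER BY hits DESC LIMIT 1"
).fetchone()
print(loaded, top_page)
```

Once the aggregates sit in a relational table, ordinary SQL and BI tools can query them without touching the cluster.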

Hadoop also comes in many different flavors. Among the most popular implementations are:

  • Microsoft HDInsight – MS + Hortonworks
    • Can be installed on a Windows platform or used in Windows Azure
    • JavaScript MapReduce framework
    • Includes an ODBC driver for Hive
    • Browser interface for interactions
    • Works very well with the MS BI stack
  • EMR – Amazon
    • Works with Amazon’s S3 storage
    • Data is retrieved from S3, MapReduce jobs execute against the buckets, and results are stored back in S3
    • All command line interface
  • CDH4 – Cloudera
    • Virtual machine available free – requires VMware
    • Hue is the browser interface
    • Creates MapReduce jobs under the hood, similar to EMR and HDInsight
    • New engine for CDH4 – Impala
      • bypasses MapReduce – queries interactively over the data
      • Brings BI tools and Hadoop closer together
      • Query with Beeswax
      • In Beta no
      • Not an Apache Software Foundation project – owned by Cloudera

What about implementation?

To implement a big data solution, one can pursue any number of technical paths—and end up in the same place. The real challenge of big data isn’t the technology—it’s figuring out the business value of the data, and exactly what insights can be gained from the solution. Another challenge involves determining the scale—and answering such questions as how much data to gather and aggregate, how often it needs to be refreshed, and how often the analysis will be used and reexamined (down the road) for new answers.

For a one-time implementation, one could structure the data by grabbing a past event—a promotion, for example—from several sources, such as an internal warehouse of data, web logs of visitors who came to the online store during the promotion period, and sales from the warehouse during the time of the promotion. These three pieces could be put together using Data Explorer, PowerPivot, and PowerView to provide some initial analytics of the promotion itself—and to answer such questions as: Did web traffic to the online store increase for the target market? Did sales increase for the target product? What were the costs associated with the promotion, and did the increase of sales for the period offset those costs?
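
The last of those questions comes down to simple arithmetic, sketched below. The function `promotion_result` and every figure passed to it are made up for illustration, and the sketch assumes a single gross margin applies to all sales in the period.

```python
def promotion_result(baseline_sales, promo_sales, gross_margin, promo_cost):
    """Compare the extra margin earned during a promotion with its cost."""
    incremental_sales = promo_sales - baseline_sales
    incremental_margin = incremental_sales * gross_margin
    return {
        "incremental_sales": incremental_sales,
        "incremental_margin": incremental_margin,
        "net_gain": incremental_margin - promo_cost,
        "paid_for_itself": incremental_margin >= promo_cost,
    }

# All figures below are invented for illustration.
result = promotion_result(
    baseline_sales=80_000,
    promo_sales=110_000,
    gross_margin=0.5,
    promo_cost=12_000,
)
print(result)
```

The baseline is the hard part in practice: it has to come from comparable periods without a promotion, which is where the warehouse history earns its keep.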

These are good questions, and can certainly guide future promotions. But the more useful, strategic questions need to be asked earlier on, before the promotion even starts. These questions include: Who should we target in the first place? What data can point us to that target? 

Big Data in Action

The best way to understand big data is to see it in action. Consider a company that sells security products. Ideally, the company would market to areas across the country with high crime rates but low sales of security products. The reasoning? If sales are already high, more marketing probably won’t lead to a spike. And if crime is low, the need simply isn’t there. However, if the company can find the sweet spot in which sales have the most potential to spike, it can identify and go after a target.
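
One simple way to express that sweet spot in code is to keep the regions whose crime rate is above the median and whose sales rate is below it. The regions, the numbers, and the median cutoffs below are all assumptions chosen for illustration; a real analysis would pick its thresholds with more care.

```python
from statistics import median

# Hypothetical regions: crime and units sold, both per 100,000 residents.
regions = [
    {"name": "Region A", "crime_rate": 950, "sales_rate": 40},
    {"name": "Region B", "crime_rate": 900, "sales_rate": 310},
    {"name": "Region C", "crime_rate": 210, "sales_rate": 35},
    {"name": "Region D", "crime_rate": 780, "sales_rate": 60},
    {"name": "Region E", "crime_rate": 400, "sales_rate": 280},
    {"name": "Region F", "crime_rate": 850, "sales_rate": 55},
]

def find_targets(rows):
    """Keep regions with above-median crime and below-median sales."""
    crime_cutoff = median(r["crime_rate"] for r in rows)
    sales_cutoff = median(r["sales_rate"] for r in rows)
    targets = [
        r for r in rows
        if r["crime_rate"] > crime_cutoff and r["sales_rate"] < sales_cutoff
    ]
    # Highest crime first: the assumed proxy for unmet need.
    return sorted(targets, key=lambda r: r["crime_rate"], reverse=True)

targets = [r["name"] for r in find_targets(regions)]
print(targets)
```

Region B drops out because its sales are already high, and Region C because crime is low, which mirrors the reasoning in the scenario.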

To extend the scenario further, and show you how the big data puzzle pieces fit together, let’s look at the data sources we might pull from to determine our market target. These include:

  • A data warehouse that contains sales and promotion information;
  • An internal website for customer transactions, whose web log data has not yet been used to analyze traffic; and
  • A public data marketplace that offers crime data for the United States (this data is currently available in the Windows Azure Marketplace).

[Figure: what the implementation looks like as we pull the data into Excel PowerPivot/PowerView]

What technologies would enable us to aggregate this data in one place? We’d start with Hadoop, and use it to:

  • Import the web log data into an HDInsight instance, since this is the one truly unstructured data source.
  • Create a Hive structure on top of this data to provide a tabular set of data.
  • Create an ODBC connection to the Hive source so it can be used as a data source.
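
Putting a tabular structure on top of raw log text amounts to declaring how each line splits into columns. The sketch below does that in plain Python rather than Hive. It assumes the logs follow the Common Log Format, and the `Hit` columns, the `parse_line` helper, and the sample lines are invented for the example.

```python
import re
from typing import NamedTuple, Optional

class Hit(NamedTuple):
    ip: str
    timestamp: str
    method: str
    path: str
    status: int

# Pattern for the Common Log Format; real logs may need a different one.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def parse_line(line: str) -> Optional[Hit]:
    """Turn one raw log line into a row, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    return Hit(fields["ip"], fields["timestamp"], fields["method"],
               fields["path"], int(fields["status"]))

raw_log = [
    '203.0.113.7 - - [30/Jul/2013:08:00:01 -0400] "GET /products HTTP/1.1" 200 2048',
    '198.51.100.4 - - [30/Jul/2013:08:00:05 -0400] "GET /missing HTTP/1.1" 404 512',
    'not a log line',
]
rows = [hit for hit in map(parse_line, raw_log) if hit is not None]
for row in rows:
    print(row)
```

Lines that do not fit the pattern are skipped rather than failing the whole load, which is usually the safer default for log data.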

Next, using Microsoft’s Business Intelligence stack, we’d:

  • Set up a feed of the U.S. government crime data (from Windows Azure Marketplace) into a file;
  • Identify and pull from an existing data warehouse that stores sales and promotion data;
  • Combine these data sources into an SSAS tabular cube instance, which would be scheduled for processing and would serve as the source that the client toolset interacts with; and
  • Build PowerView reports to combine data from promotions, sales, and web log visits, and to analyze potential customer bases, etc., against the SSAS cube.
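
Conceptually, the combining step is a join across the sources on a shared key. The sketch below uses an in-memory SQLite database as a stand-in for the tabular model, and it assumes each source has already been reduced to one row per region; the table names, column names, and values are all hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE sales  (region TEXT, units_sold INTEGER);
    CREATE TABLE crime  (region TEXT, incidents INTEGER);
    CREATE TABLE visits (region TEXT, web_hits INTEGER);
    INSERT INTO sales  VALUES ('North', 120), ('South', 45), ('West', 300);
    INSERT INTO crime  VALUES ('North', 800), ('South', 950), ('West', 200);
    INSERT INTO visits VALUES ('North', 5000), ('South', 1500);
""")

# LEFT JOIN keeps regions that have no web log rows yet (West here).
combined = db.execute("""
    SELECT s.region, s.units_sold, c.incidents,
           COALESCE(v.web_hits, 0) AS web_hits
    FROM sales AS s
    JOIN crime AS c ON c.region = s.region
    LEFT JOIN visits AS v ON v.region = s.region
    ORDER BY c.incidents DESC
""").fetchall()

for row in combined:
    print(row)
```

With the sources side by side in one result, a report can compare sales, crime, and web traffic for the same region in a single view.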

[Figure: the relational architecture, from a tabular point of view]

[Figure: the series of reports we might generate for use in the analytics]

What does this tell us about big data?

Big data is really just a term that describes how to gain predictive analytics from the wide range of information-gathering tools available today. Many software systems and players occupy the big data space—and a lot of terms and technologies make it work. The bottom line, though, is that the technology will do nothing without a clearly defined business value.

So now that you see the big picture, the real task is figuring out what to do with the data—and how it can help you solve problems, explore new ways of thinking, and ultimately lead to profitable decisions.
