Taking the Mystery Out of Big Data

DougLautzenheiser
Last updated: 2015/07/11 at 12:14 AM
Today’s companies have the potential to benefit from incredibly large amounts of data.
 
To shake off the mystery of this “Big Data,” it’s useful to know its history.
 
In the not-so-distant past, firms tracked their own internal transactions and master data (products, customers, employees, and so forth) but little else. Companies probably had very large databases only if their industry called for high-volume, high-speed applications such as telecommunications, shipping, or point of sale. Even then, these transactions were all formatted in a standard way and could be saved inside a relational database, built on the model that IBM researcher Edgar Codd proposed in 1970.
 
This was perfectly fine for corporate computing in the 1970s and 1980s. Then, in the mid-1990s, along came the World Wide Web, browsers, and e-commerce. Before the end of that decade, a web search engine company named Google was struggling to track all of the changes happening across the world’s web pages. The traditional computing option would have been to scale up: get a bigger platform, a more powerful database engine, and more disk space.
 
But spending money wasn’t a good option for a little operation like Google; it was well behind the established search engines like Lycos, WebCrawler, AltaVista, Infoseek, Yahoo, and others.
 
Google decided on a strategy of scaling out instead of up. Using easily obtained commodity computers, the company spread out not only the data but also the application processing. Instead of buying one big supercomputer, it used thousands of run-of-the-mill boxes all working together. On top of this distributed data framework, Google built a processing engine using a software technique known as Map-Shuffle-Reduce: a map step emits key-value pairs, a shuffle step groups them by key, and a reduce step combines each group into a result.
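
To make the Map-Shuffle-Reduce idea concrete, here is a minimal single-machine sketch in Python that counts words. It is only an illustration of the three phases, not Google's or Hadoop's implementation; the function names and the sample documents are made up for this example, and a real cluster would run the map and reduce steps on many machines at once.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's list of values into one count."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(documents):
    # On a cluster, each document (or block of data) would be mapped
    # on a different machine; here everything runs in one process.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(mapped))

# Hypothetical sample input
counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Because each map call looks at only one document and each reduce call looks at only one key, the work can be split across as many machines as there are documents or keys.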
 
Of course, a scale-out paradigm meant Google now had multiple places where a failure could happen when writing data or running a software process. One or more of those thousands of cheap computers could crash and mess up everything. To deal with this, Google added automated data replication and fail-over logic to handle bad situations under the covers and still make everything work as expected for the user.
 
In 2003, Google published a paper explaining its distributed data storage methods (the Google File System) to the world. The following year, it disclosed details of its parallel-processing engine (MapReduce).
 
One reader of Google’s white papers was Doug Cutting, who was working on Nutch, an open-source web crawler and search engine under the Apache Software Foundation. Like Google, Doug had run into issues handling the complexity and size of large-scale search problems. Within a couple of years, Doug applied Google’s techniques to Nutch and had it scaling out dramatically.
 
Understanding its importance, Doug shared his success with others. In 2006 while working with Yahoo, Doug started an Apache project called “Hadoop,” named after his son’s stuffed toy elephant. By 2008, individuals familiar with this new Hadoop open-source product were forming companies to provide complementary products and services.
 
With our history lesson over, we are back to the present. Today, Hadoop is an entire “ecosystem” of offerings available not only from the Apache Software Foundation but from for-profit companies such as Cloudera, Hortonworks, MapR, and others. Volunteers and paid employees around the world work diligently and passionately on these open-source Big Data software offerings.
 
When people say “Big Data,” they typically mean the need to accumulate and analyze massive amounts of very diverse and unstructured data that cannot fit on a single computer. Big Data work is usually done using the following:
 
  • Scale-out techniques to distribute data and process it in parallel
  • Lots of commodity hardware
  • Open-source software (in particular, Apache Hadoop)
 
 
 
 
Large companies with terabytes of transactions stored in an enterprise data warehouse on database computers or applications like Teradata or Netezza are not doing Big Data. Sure, they have very large databases but that’s not “big” in today’s sense of the word.
 
Big Data comes from the world around the company; it’s generated rapidly from social media, server logs, machine interfaces, and so forth. Big Data doesn’t follow any particular set of rules, so you will be challenged when trying to slap a static layout on top of it and make it conform. That’s one big reason why traditional relational database management systems (RDBMSs) cannot handle Big Data.
 
The term “Hadoop” usually refers to several pieces of Big Data software:
 
  • The “Common” modules, handling features such as administration, management, and security
  • The distributed data engine, known as Hadoop Distributed File System (HDFS)
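  • The parallel-processing engine (either the traditional MapReduce framework, which since Hadoop 2 runs on top of the YARN resource manager, or an emerging engine called Spark)
  • A distributed database on top of HDFS (HBase, for fast reads and writes against very large tables); Cassandra, a separate Apache project with its own storage layer rather than HDFS, is often chosen for active, operational needs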
 
 
In addition to the basic Hadoop software, however, there are lots of other pieces. For putting data into Hadoop, for example, you have several options:
 
  • Programmatically with languages (e.g., Java, Python, Scala, or R), you can use Application Programming Interfaces (APIs) or a Command Line Interface (CLI)
  • Streaming data using the Apache Flume software
  • Bulk transfers to and from relational databases using the Sqoop module
  • Messages using the Apache Kafka product
 
 
When pulling data out of Hadoop, you have other open-source options:
 
  • Programmatically with languages
  • HBase, Hive with HiveQL, or Pig with Pig Latin, which all provide easier access than using MapReduce against the underlying distributed file system
  • Elasticsearch or Solr for searching
  • Mahout for scalable machine learning
  • Drill, an always-active “daemon” process, which acts as a query engine for data exploration
 
 
But why would you want the complexity of this “Big Data”?
 
It was obvious for Google and Nutch, search engines trying to scour and collect bytes from the entire World Wide Web. It was their business to handle Big Data.
 
Any large firm is on the other end of Google: it has a web site which people browse and use, quite probably navigating to it from Google’s search results. One Big Data use case for most companies would therefore be large-scale analysis of their web server logs. In particular, they could look for suspicious behavior that suggests some type of hacking attempt, such as repeated failed logins from a single address. Big Data can help protect your company from cybercrime.
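
As a rough sketch of what such log analysis might look like, the Python below counts rejected requests per client address and flags the heavy offenders. Everything here is an assumption for illustration: the log lines are presumed to follow the common Apache-style access log layout, treating HTTP 401 and 403 responses as "suspicious" is just one possible rule, and the threshold and sample addresses are invented. At Big Data scale the same counting logic would run in parallel across many log files.

```python
import re
from collections import Counter

# Assumed layout: client address, two ignored fields, [timestamp], "request", status, size
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) ')

def suspicious_ips(log_lines, threshold=3):
    """Return {address: count} for clients with at least `threshold` 401/403 responses."""
    failures = Counter()
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match and match.group(2) in {"401", "403"}:
            failures[match.group(1)] += 1
    return {ip: n for ip, n in failures.items() if n >= threshold}

# Hypothetical sample log lines (documentation-only IP addresses)
sample_log = [
    '203.0.113.9 - - [11/Jul/2015:00:12:01 +0000] "POST /login HTTP/1.1" 401 128',
    '203.0.113.9 - - [11/Jul/2015:00:12:02 +0000] "POST /login HTTP/1.1" 401 128',
    '198.51.100.7 - - [11/Jul/2015:00:12:03 +0000] "GET /index.html HTTP/1.1" 200 5120',
    '203.0.113.9 - - [11/Jul/2015:00:12:04 +0000] "POST /login HTTP/1.1" 403 128',
    '198.51.100.7 - - [11/Jul/2015:00:12:05 +0000] "POST /login HTTP/1.1" 401 128',
]

print(suspicious_ips(sample_log))
```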
 
If you offer products online, a common Big Data use case would be as a “recommendation engine.” A smart Big Data application can provide each customer with personalized suggestions on what to buy. By understanding the customer as an individual, Big Data can improve engagement, satisfaction, and retention.
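
One simple way to picture a recommendation engine is "customers who bought this also bought that." The Python sketch below counts how often items appear together in past orders and suggests the most frequent companions. It is a toy illustration under stated assumptions, not how any particular product works: the function names and the sample baskets are hypothetical, and production systems typically use far richer models.

```python
from collections import Counter, defaultdict
from itertools import permutations

def build_cooccurrence(baskets):
    """Count how often each pair of items was bought in the same order."""
    co = defaultdict(Counter)
    for basket in baskets:
        for a, b in permutations(set(basket), 2):
            co[a][b] += 1
    return co

def recommend(co, purchased, top_n=2):
    """Suggest the items most often bought alongside what the customer already has."""
    scores = Counter()
    for item in purchased:
        for other, count in co.get(item, {}).items():
            if other not in purchased:
                scores[other] += count
    # Sort by score (highest first), then by name so ties are stable
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

# Hypothetical order history
baskets = [
    ["tent", "stove", "lantern"],
    ["tent", "stove"],
    ["tent", "sleeping bag"],
    ["stove", "fuel"],
]
co = build_cooccurrence(baskets)
print(recommend(co, {"tent"}))
```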
 
Big Data can be a more cost-effective method of extracting, transforming, and loading (ETL) data into an enterprise data warehouse. Apache open-source software might replace and modernize your expensive proprietary commercial off-the-shelf (COTS) ETL package and database engines. Big Data could reduce the cost and time of getting your BI results.
 
It’s a jungle out there; there’s fraud happening. You may have some bad customers with phony returns, a bad manager trying to game the system for bonuses, or entire groups of bad hackers actively involved in scamming money from your company. Big Data can “score” financial activities and provide an estimate of how likely each individual transaction is to be fraudulent.
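
To give a feel for what "scoring" a transaction means, here is a deliberately simple Python sketch that rates how unusual an amount is compared with a customer's past amounts. This is a heuristic stand-in, not a real fraud model and not a true probability: the function name, the cap of four standard deviations, and the sample history are all assumptions chosen for illustration. Real systems usually train statistical models on many more signals than the amount alone.

```python
from statistics import mean, stdev

def fraud_score(history, amount):
    """Return a 0-to-1 score of how unusual `amount` is versus past amounts."""
    if len(history) < 2:
        return 0.5  # too little history to judge either way (arbitrary choice)
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # Every past amount was identical: anything different is maximally unusual
        return 0.0 if amount == mu else 1.0
    z = abs(amount - mu) / sigma  # distance from typical, in standard deviations
    return min(z / 4.0, 1.0)      # arbitrary scaling: 4+ deviations scores 1.0

# Hypothetical purchase history for one customer
history = [20, 25, 22, 19, 24]
print(fraud_score(history, 23))   # a typical amount, low score
print(fraud_score(history, 500))  # far outside the usual range, high score
```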
 
Most companies have machine-generated data: time-and-attendance boxes, garage security gates, badge readers, manufacturing machines with logs, and so forth. These are examples of the emerging tsunami of “Internet of Things” data. Capturing and analyzing time-series events from IoT devices can uncover high-value insights of which we would otherwise be ignorant.
 
The real key to Big Data success is having specific business problems you need to solve and on which you would take immediate action.
 
One of my clients was great about focusing on problems and taking actions. They had pharmacies inside their retail stores and, each week, a simple generated report showed the top 10 reasons insurance companies rejected their pharmacy claims. Somebody was then responsible for making sure the processing problems behind the top reasons went away.
 
Likewise, the company’s risk management system identified weekly the top 10 reasons customers got hurt in the stores (by the way, the next time you are in a grocery store, thank the worker sweeping up spilled grapes from the floor around the salad bar). This sounds simple, but you might be surprised by the extreme business benefits obtained from constantly solving the problems at the top of a dynamic Top-10 list.
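
A Top-10 report like the ones above takes very little code once the events are collected. The Python sketch below tallies reason codes and returns the most frequent ones; the function name and the sample rejection reasons are hypothetical, made up for this illustration rather than taken from the client's system.

```python
from collections import Counter

def top_reasons(events, n=10):
    """Return the n most frequent reasons as (reason, count) pairs, most common first."""
    return Counter(events).most_common(n)

# Hypothetical week of claim rejections
rejections = [
    "missing member id", "refill too soon", "missing member id",
    "drug not covered", "refill too soon", "missing member id",
]

for reason, count in top_reasons(rejections, n=3):
    print(f"{count:3d}  {reason}")
```

Rerunning the same tally every week is what makes the list dynamic: once the top problem is fixed, it drops off and the next one takes its place.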
 
Today, your company may be making the big mistake of ignoring the majority of data around it. Hadoop and its ecosystem of products and partners make it easier for everybody to get value from Big Data.
 
We are truly just at the beginning of this Big Data story. Exciting things are still ahead. 

DougLautzenheiser July 11, 2015