Managing Big Data Integration and Security with Hadoop

The explosion of the World Wide Web into our lives was like being given a gigantic toy chest with anything you would ever want to play with in there. As the Web grew from hundreds of pages, which could be manually indexed and queried, to millions of pages, with thousands being added every day, the challenge was to figure out how to find something specific when you wanted it. Search engines like Yahoo and Google were the first to realize that in order to make the Web into something usable and manageable, an automated process would have to be developed to deal with the big data – to store it and to make sense of it, categorize it, and retrieve it on command. The need for better, automated search engines was born.

This need was a driving force in the development of Hadoop, an open-source framework that enables storage of enormous amounts of data as well as running associated applications using a distributed network of associated computers to accomplish Big Data tasks. The framework was originally part of a larger project put together by Mike Cafarella, a database management expert, and Doug Cutting, a proponent of open-source technology. Together they created a web-crawler and analytics system, called Nutch, which used cluster computing – distributing data and processing across a number of connected machines – to permit the execution of multiple tasks simultaneously.

This was exactly what the Web – particularly its search engines – was looking for. In fact, this development paralleled efforts at Google to use automation and distributed computing to accomplish better, faster, and more efficient processing of big data. In 2006, Cutting joined Yahoo, taking with him the Nutch system. Eventually, Nutch became two systems: the web-crawler portion, which drove Yahoo’s search functions, and the distributed processing portion which became Hadoop. (Hadoop, as a matter of interest, was the name of a yellow stuffed elephant that belonged to Cutting’s son.) Yahoo released Hadoop into the open-source community, where it is now maintained by the Apache Software Foundation, along with a set of associated technologies.

Hadoop is made up of four core technologies:

  • Hadoop Common, the libraries and utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS), which, as its name indicates, is the file system used by Hadoop. It is Java-based and highly scalable, permitting storage of data across multiple machines without prior organization – essentially making a community of nodes operate as though under a single file system.
  • MapReduce is a programming model that uses parallel processing for big data sets consisting of structured and unstructured data, reliably and with a high tolerance for variation and faults.
  • YARN (Yet Another Resource Negotiator) is a framework for resource management, as its name implies. It handles the scheduling of resource requests from the many distributed applications that are part of a Hadoop implementation.
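To make the MapReduce programming model listed above more concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop's Java API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample input are illustrative assumptions. It only shows the three conceptual steps that a real cluster would spread across many nodes: map each input split to key/value pairs, group the pairs by key, and reduce each group to a result.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of input splits."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: each string stands in for one split of a larger data set.
splits = ["big data needs big tools", "Hadoop stores big data"]
counts = word_count(splits)
print(counts)
```

In a real Hadoop job, the map and reduce functions run in parallel on the nodes that hold the data, and the framework handles the grouping step and retries failed tasks.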

Hadoop also integrates with a number of other processing and support applications in areas specifically geared towards supporting large data sets. Some of those supporting areas and applications include the following:

Data access.

  • The programming language Pig (which includes Pig Latin) is specifically intended to analyze any type of data without requiring the user to spend a lot of time setting up mapping and reduction programs;
  • Hive, which provides a SQL-like query language and breaks each statement down into MapReduce jobs that are distributed across the cluster;
  • Flume, which collects and aggregates large amounts of data from applications and moves them into the HDFS file system;
  • Spark is an open-source cluster computing data analytics program that, in certain circumstances, can be 100 times faster than MapReduce;
  • A data transfer program called Sqoop that can extract, load, and transform structured data;
  • HBase, a non-relational (NoSQL) database that runs on top of HDFS and can support very large tables;
  • Avro, which is a system that serializes data;
  • DataTorrent streaming software;
  • A data collection system, Chukwa, which is meant to work within large distributed systems;
  • Tez, which is a generalized data flow framework that uses the Hadoop module YARN to execute tasks in order to process data in both batch and interactive use modes.


  • Solr, a reliable, scalable enterprise tool from Apache with powerful search capabilities that includes hit highlighting, indexing, a central configuration system, and failover and recovery routines.


  • A high-performance coordination service for distributed applications, named ZooKeeper™;
  • The Kerberos authentication protocol;
  • Oozie, a workflow management tool.

Big Data and Hadoop

We know from earlier discussions that Big Data is the commercial power of the internet. All the data being generated through e-commerce, social media, and user activities is virtually useless commercially without a set of applications to make sense of it all. Despite some real or perceived drawbacks, Hadoop is one of the recognized leaders among the core technologies aimed at managing Big Data. Its distributed architecture is ideal for gathering and managing information from the internet. However, as of July 2015, it was only in third place as a choice for enterprise Big Data management, after enterprise data warehousing and cloud computing.

Hadoop is not itself a data management tool; it is the framework that allows collection and storage of massive amounts of data and provides hooks for data analytics software to plug into. Because it is part of a technology ecosystem, much of its usefulness depends on the tools and applications that are developed and integrated into that ecosystem. Mike Gualtieri, an analyst at Forrester, describes Hadoop as a “data operating system” – something sorely needed to cope with the amounts of data generated and gathered online. But Gualtieri adds that Hadoop is something more: it is an application platform, one specifically geared to run applications that deal with Big Data.

There are many ways of getting data into Hadoop, where other applications and modules are added to the framework to process and analyze the data. You can use simple Java commands, write a shell script using “put” commands, mount a volume that uses HDFS and “put” files there as they are acquired, or use Sqoop, Flume, or a SAS (Statistical Analysis System) data loader, all designed specifically for Hadoop. Once the data is there, you can apply any of a number of data-processing solutions, depending on your goals for your data: analytics, queries, delivery, trend identification.
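As a rough illustration of the scripted “put” approach mentioned above, the following Python sketch builds the command line for Hadoop's `hdfs dfs -put` file system shell command. The helper name build_put_command and both paths are hypothetical, and actually running the command assumes a machine with a configured Hadoop client, so the execution step is left as a comment.

```python
import shlex


def build_put_command(local_path, hdfs_dir):
    """Return the argument list that copies one local file into an HDFS directory."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_dir]


# Hypothetical paths, chosen only for illustration.
command = build_put_command("/var/log/app/events.log", "/data/raw/events")
print(shlex.join(command))

# On a machine with a configured Hadoop client, the copy could be run with:
# import subprocess
# subprocess.run(command, check=True)
```

Building the argument list separately from running it keeps the script easy to test without a cluster. For recurring or high-volume loads, tools such as Sqoop or Flume are usually a better fit than hand-written scripts.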
Almost anything you would expect to be able to do with your data can be implemented through a Hadoop plug-in or module.

Hadoop’s potential is enormous, but it faces certain challenges. Not the least of these is that, in terms of deployment, it is still a new technology. Many businesses that had an ear to the ground and anticipated Hadoop as their framework for gathering, storing, serving, and analyzing Big Data are finding that there is a dearth of experts to help them with integration.

This is a problem often encountered by open-source platforms. When a company has an interest in having organizations adopt its proprietary technologies, it invests in salespeople, post-sales support, trainers, and other experts to assist businesses in making the most of their investment. Those experts have access to a core group of developers to help them develop their expertise. When a technology is open-source, a variety of companies develop applications for it, and experts have to build their expertise by gathering information from many disparate sources. Many individual companies spawn many individual solutions. Developer companies do have experts at implementing their own add-ons, modules, and distributions. But who can help a business that is new to Big Data management decide between systems? While it may be in the interests of developer companies to evangelize the core technology, there is no single source from which to derive knowledge and expertise. This can be intimidating in theory, but practically speaking, experts are created every day as Hadoop gains momentum in the business community. In other words, this is a short-term issue that is solved in the normal course of the maturing of the technology.

Other challenges that face Hadoop include the fact that its MapReduce technology is not especially good at interactive analytics.
MapReduce is not particularly efficient as a standalone technology, though it provides robust services in situ. It works primarily by splitting big data sets into many smaller ones, and in many situations this can degrade performance. However, as we saw above, alternatives to MapReduce, such as Spark, are being built, and one would expect to see more alternatives that build on or substitute for that module in the Hadoop core set in situations where MapReduce is not the ideal technology.

Big Data inherently comes with security issues, due to the size of the data sets and the number of sources from which data is siphoned. The problem is not exclusive to Hadoop, although the distributed computing model does open it up to its own set of security challenges. But again, software development, such as the Kerberos authentication mentioned above, is starting to address this issue.

Although it is always possible to call out the flaws in a new technology, frameworks are frameworks because they provide a broad opportunity for developers to fill in the gaps. As organizations install Hadoop and find the add-ins and modules that help them accomplish what they want, developers will respond by creating tools targeted to the remaining pain points. If you are considering Hadoop, it’s good to be aware of its current limitations, but it’s also important to recognize that development work continues, and implementation is rapidly creating experts in the technology.

Core Business Functions That Can Benefit from Hadoop

When trying to describe frameworks, the technology can seem somewhat fuzzy, because a framework like Hadoop is meant to work most effectively when integrated with modules that perform specific functions. It is helpful to look at some real-world use cases where Hadoop supports some common core business processes.

  • Enterprise Data Management. Every large organization has a data warehouse, and almost all of them struggle with the problem of managing that data. Hadoop provides a data operating system that will allow you to perform ELT (extract, load, and transform) functions. This means that Hadoop provides a means of finding, gathering, identifying, uploading, normalizing, storing, repurposing, and serving your data in response to queries and analytical systems. The MapR data hub architecture allows all this to happen while still permitting storage of data in just about any native format, increasing the flexibility and resilience of the data.
  • Risk Management and Risk Modelling. Fraud technology never stops evolving, but effective Big Data analysis can help identify suspicious activity much more quickly. By analyzing large amounts of unstructured and structured data from many different sources, threat identification can be performed more quickly and preventative action taken. Analysis is not limited to one type of data, for instance, customer commerce; Hadoop can integrate data gathered from additional, unstructured sources such as help lines, customer chats, emails, network activity, server logs and so on, to identify patterns and, more importantly, deviations from those patterns that might indicate a data security threat. The financial services sector has invested heavily in using Hadoop to assist in risk analysis. Sophisticated machine learning modules can be built onto Hadoop that can process hundreds or even thousands of indicators from many different and differentiated sources.
  • Customer Attrition (Churn) and Sentiment Analysis. Businesses struggle to understand the factors that influence customer attrition, and to understand the significance by looking for patterns and trends. The term “Customer Churn” attempts to quantify customer attrition by defining it as the number of customers who discontinue a service during a specified time period divided by the average total number of customers over that same period. Finding this number can tell a company whether they are losing customers at an unsustainable rate – as opposed to a normal turnover rate – but it can’t tell an organization why it’s happening. By gathering a large amount of data from a variety of different sources, again utilizing customer interactions by phone, chat, email and other sources, companies can start comparing disparate data to identify solvable issues. For instance, they can identify if there is a particular problem product or a policy that is driving customers away, or a service that customers need that the organization does not supply. They can zero in on a problem with dropped calls on their help lines, and compare that with the number of people who seek help through a social media page. They can analyze the sentiment of people requiring customer service in the different venues for interaction. Real-time data can even be used to load-balance customer service calls to provide the best customer-retention service possible.
  • Targeted Advertising and Ad Effectiveness. Analysis of customer behavior can help targeted advertising be extremely effective, but in order to have the numbers and predictive analysis that makes internet advertising a success, data has to be gathered from multiple sources and analyzed in ways that produce results. Social network mining, ad clicks, and customer behavior and preferences obtained through surveys are just some of the inputs that assist in effective ad targeting and placement. MapR distributions can take a wide variety of incoming data and format it for analytics, allowing organizations to implement effective recommendation engines, sentiment aggregation, predictive trend analysis, and more, all of which contribute to an organization’s ability to target their advertising in the most effective way possible.
  • Operational Insight and Intelligence. Large organizations become siloed, making it difficult to analyze and improve operations. Information from departments that operate quite differently has to be normalized and compared. With Hadoop, you can take a wide variety of granular measurements and input from different departments and, using MapReduce, set up an environment that extracts meaningful information on operational efficiencies. From there, your organization can identify patterns of success and failure, in turn allowing you to make improvements in workflows, interdepartmental interactions, and internal functions that increase the profitability of your business. Hadoop’s Big Data analytics allow an organization to look at its supply chain logistics, quality assurance procedures, and infrastructure performance, allowing it to identify weak points and predict potential trouble spots. MapR also integrates high availability (HA), data protection, and disaster recovery (DR) capabilities to protect against both hardware failure and site-wide failure.
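The customer churn definition given in the list above (customers who discontinue a service during a period, divided by the average number of customers over that period) can be written as a short calculation. This is a minimal sketch: the function name and the sample figures are hypothetical, and it assumes the average customer count can be approximated by the mean of the counts at the start and end of the period.

```python
def churn_rate(customers_lost, customers_at_start, customers_at_end):
    """Return churn as customers lost divided by the average customer count.

    Assumption: the period average is approximated by the mean of the
    customer counts at the start and at the end of the period.
    """
    average_customers = (customers_at_start + customers_at_end) / 2
    if average_customers == 0:
        raise ValueError("average customer count must be greater than zero")
    return customers_lost / average_customers


# Hypothetical figures for one month.
rate = churn_rate(customers_lost=50, customers_at_start=1000, customers_at_end=1000)
print(f"Monthly churn: {rate:.1%}")
```

The resulting number shows only how fast customers are leaving. Explaining why they leave still requires the cross-source analysis described above.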

The future of Big Data management offers many possibilities, and the most likely scenario is for options to grow and mature. An open-source framework like Hadoop offers endless possibilities for development, and with a strong management group like the Apache Software Foundation behind it, one can expect increasing numbers of modules and technologies to integrate with Hadoop to enable your business to achieve its Big Data goals – and maybe even go significantly beyond what you can envision today.