Cookies help us display personalized product recommendations and ensure you have great shopping experience.

By using this site, you agree to the Privacy Policy and Terms of Use.
Accept
SmartData CollectiveSmartData Collective
  • Analytics
    AnalyticsShow More
    data analytics and truck accident claims
    How Data Analytics Reduces Truck Accidents and Speeds Up Claims
    7 Min Read
    predictive analytics for interior designers
    Interior Designers Boost Profits with Predictive Analytics
    8 Min Read
    image fx (67)
    Improving LinkedIn Ad Strategies with Data Analytics
    9 Min Read
    big data and remote work
    Data Helps Speech-Language Pathologists Deliver Better Results
    6 Min Read
    data driven insights
    How Data-Driven Insights Are Addressing Gaps in Patient Communication and Equity
    8 Min Read
  • Big Data
  • BI
  • Exclusive
  • IT
  • Marketing
  • Software
Search
© 2008-25 SmartData Collective. All Rights Reserved.
Reading: Data Lakes: Safe Way to Swim in Big Data?
Share
Notification
Font ResizerAa
SmartData CollectiveSmartData Collective
Font ResizerAa
Search
  • About
  • Help
  • Privacy
Follow US
© 2008-23 SmartData Collective. All Rights Reserved.
SmartData Collective > Big Data > Data Lakes: Safe Way to Swim in Big Data?
Big Data

Data Lakes: Safe Way to Swim in Big Data?

Dave Menninger
Dave Menninger
9 Min Read
SHARE

It has been more than five years since James Dixon of Pentaho coined the term “data lake.” His original post suggests, “If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state.” The analogy is a simple one, but in my experience talking with many end users there is still mystery surrounding the concept.

It has been more than five years since James Dixon of Pentaho coined the term “data lake.” His original post suggests, “If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state.” The analogy is a simple one, but in my experience talking with many end users there is still mystery surrounding the concept. In this post I’d like to clarify what a data lake is, review the reasons an organization might consider using one and the challenges they present, and outline some developments in software tools that support data lakes.

Data lakes offer a way to deal with big data. A data lake combines massive storage capabilities for any type of data in any format as well as processing power to transform and analyze the data. Often data lakes are implemented using Hadoop technology. Raw, detailed data from various sources is loaded into a single consolidated repository to enable analyses that look across any data available to the user. To understand why data lakes have become popular it’s helpful to contrast this approach with the enterprise data warehouse (EDW). In some ways, an EDW is similar to a data lake. Both act as a centralized repository for information from across an organization. However, the data loaded into an EDW is generally summarized, structured data. EDWs are typically based on relational database technologies, which are designed to deal with structured information. And while advances have been made in the scalability of relational databases, they are generally not as scalable as Hadoop. Because these technologies are not as scalable, it is not practical to store all the raw data that come into the organization. Hence there is a need for summarization. In contrast, a data lake contains the most granular data generated across the organization. The data may be structured information, such as sales transaction data, or unstructured information, such as email exchanged in customer service interactions.

Hadoop is often used with data lakes becausevr_Big_Data_Analytics_21_external_data_sources_for_big_data_analytics it can store and manage large volumes of both structured and unstructured data for subsequent analytic processing. The advent of Hadoop made it feasible and more affordable to store much larger volumes of information, and organizations began collecting and storing the raw detail from various systems throughout the organization. Hadoop has also become a repository for unstructured information such as social media and semistructured data such as log files. In fact, our benchmark research shows that social media data is the second-most important source of external information used in big data analytics.

More Read

Influencing Shoppers Beyond the Store
If You Think Data Security is IT’s Responsibility, Think Again
More Ways to get a Scoring Model wrong
How to Create and Deploy Effective Metrics
Big Brother… or do I mean Big Data?

In addition to handling larger volumes and more varieties of information, data lakes enable faster access to information as it is generated. Since data is gathered in its raw form, no preprocessing is needed. Therefore, information can be added to the data lake as soon as it is generated and collected. This approach has caused some controversy with many industry analysts and even vendors to raise concerns about data lakes turning into data swamps. In general, the concerns about data lakes becoming data swamps center around the lack of governance of the data in a data lake, an appropriate topic here. These collections of data should be governed like any other set of information assets within an organization. The challenge was that most of the governance tools and technologies had been developed for relational databases and EDWs. In essence, the big data technologies used for data lakes had gotten ahead of themselves, without incorporating all the features needed to support enterprise deployments.

Another, perhaps, minor controversy centers around terminology. I raise this issue so that, regardless of the terminology a vendor chooses, you can recognize data lakes and be aware of the challenges. Cloudera uses the term Enterprise Data Hub to represent essentially the same concept as a data lake. Hortonworks embraces the data lake terminology as evidenced in this post. IBM acknowledges the value of data lakes as well as its challenges in this post, but Jim Kobielus, IBM’s Big Data Evangelist, questioned the terminology in a more recent post on LinkedIn, and the term “data lake” is not featured prominently on IBM’s website.

Despite the controversy and challenges, data lakes are continuing to grow in popularity. They provide important capabilities for data science. First, they contain the detailed data necessary to perform predictive analytics. Second, they allow efficient access to unstructured data such as social media or other text from customer interactions. For business, this information can create a more complete profile of customers and their behavior. Data lakes also make data available sooner than it might be available in a conventional EDW architecture. Our data and analytics in the cloud benchmark research show that one in five (21%) organizations are integrating their data in real time. The research also shows that those who integrate their data more often are more satisfied and more confident in their results. Granted, a data lake contains raw information, and it may require more analysis or manipulation since the data is not yet cleansed, but time is money and faster access can often lead to new revenue opportunities. Half the participants in our predictive analytics benchmark research said they have created new revenue opportunities with their analytics.

Cognizant of the lack of governance and management tools some organizations hesitated to adopt data lakes, while others went ahead. Vendors in this space have advanced their capabilities in the meantime. Some, such as Informatica, are bringing data governance capabilities from the EDW world to data lakes. I wrote about the most recent release of Informatica’s big data capabilities, which it calls Intelligent Data Lake. Other vendors are bringing their EDW capabilities to data lakes as well. Information Builders and Teradata both made data lake announcements this spring. In addition, a new category of vendors is emerging focused specifically on data lakes. Podium Data says it provides an “enterprise data lake management platform,” Zaloni calls itself “the data lake company,” and Waterline Data draws its name “from the metaphor of a data lake where the data is hidden below the waterline.”

Is it safe to jump in? Well, just like you shouldn’t jump into a lake without knowing how to swim, you shouldn’t jump into a data lake without plans for managing and governing the information in it. Data lakes can provide unique opportunities to take advantage of big data and create new revenue opportunities. With the right tools and training, it might be worth testing the water.

Share This Article
Facebook Pinterest LinkedIn
Share

Follow us on Facebook

Latest News

data analytics and truck accident claims
How Data Analytics Reduces Truck Accidents and Speeds Up Claims
Analytics Big Data Exclusive
predictive analytics for interior designers
Interior Designers Boost Profits with Predictive Analytics
Analytics Exclusive Predictive Analytics
big data and cybercrime
Stopping Lateral Movement in a Data-Heavy, Edge-First World
Big Data Exclusive
AI and data mining
What the Rise of AI Web Scrapers Means for Data Teams
Artificial Intelligence Big Data Exclusive

Stay Connected

1.2kFollowersLike
33.7kFollowersFollow
222FollowersPin

You Might also Like

Social Media Analytics: How our approach blends the best analytics technology

5 Min Read
big data in healthcare technology
Big DataExclusive

The Powerful Role of Big Data In The Healthcare Industry

8 Min Read

MIT Information Quality Symposium

2 Min Read

Adding Business to Analytics

5 Min Read

SmartData Collective is one of the largest & trusted community covering technical content about Big Data, BI, Cloud, Analytics, Artificial Intelligence, IoT & more.

AI chatbots
AI Chatbots Can Help Retailers Convert Live Broadcast Viewers into Sales!
Chatbots
AI and chatbots
Chatbots and SEO: How Can Chatbots Improve Your SEO Ranking?
Artificial Intelligence Chatbots Exclusive

Quick Link

  • About
  • Contact
  • Privacy
Follow US
© 2008-25 SmartData Collective. All Rights Reserved.
Go to mobile version
Welcome Back!

Sign in to your account

Username or Email Address
Password

Lost your password?