Safe Big Data


Big Data is one of the most important business and government developments of the twenty-first century, but crucial security and privacy issues prevent it from developing as quickly as it could. Large-scale cloud infrastructures, the variety of data source formats and media, high-volume inter-cloud migrations and dynamic streaming data acquisition have all increased privacy and security concerns. These issues are compounded by the four V’s of Big Data: volume, velocity, variety and veracity.

Big Data Volume

It’s estimated that 43 trillion gigabytes of new data will be created by the year 2020. This is an increase of 300 percent from 2005 figures. As the volume of readily available data expands, it has become increasingly important to distinguish sensitive information from public information. The average company in the United States has 100,000 gigabytes of data stored locally and in the cloud. This poses a great security risk when those data stores become the targets of thieves and criminals.

Big Data Velocity

In a typical session, the New York Stock Exchange captures 1,000 gigabytes of data. One of the great challenges of Big Data is striking a balance between analyzing data quickly in real time and maintaining a highly secure and private environment. Secure systems require encryption and other protections that necessarily slow down the processing of data. When data is flying through servers at lightning speed, the task of managing it becomes increasingly difficult.

Big Data Variety

Healthcare data is among the most privacy- and security-sensitive types of data. In 2011, healthcare data was estimated to exceed 161 billion gigabytes, and that’s just one type of data. On Facebook, 30 billion pieces of content are shared every month. Video, images, text-based communications, audio and microdata all require different types of processing to sort, manage and categorize crucial information. The technology required to manage this data is still in development, and data scientists are employed full-time to manage, interpret and disseminate it.

Big Data Veracity

The accuracy of data is another concern for Big Data. With so much data to sort through, it’s important to create filters that can automatically tell the difference between high-quality, timely data and outdated, poor-quality data. Getting accurate data is big business: inaccurate data is estimated to cost the U.S. economy $3.1 trillion per year. When data is inaccurate, businesses, governments and organizations can’t make accurate decisions. As the move by the U.S. government to track terrorists in the Deep Web shows, it’s getting to a point where an inability to sift through Big Data can cost lives.
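As a rough illustration of such a filter, the sketch below keeps only records that are complete and recent. The field names (`timestamp`, `source`, `value`), the 30-day cutoff and the function names are assumptions made for this example; a real pipeline would define its own schema and quality rules.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical required fields; a real pipeline would define its own schema.
REQUIRED_FIELDS = ("timestamp", "source", "value")


def is_trustworthy(record, now, max_age=timedelta(days=30)):
    """Return True if the record is complete and recent enough to use."""
    # Reject records with missing or empty required fields.
    if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
        return False
    # Reject records older than the cutoff, or stamped in the future.
    age = now - record["timestamp"]
    return timedelta(0) <= age <= max_age


def filter_records(records, now=None):
    """Keep only the records that pass the quality checks."""
    now = now or datetime.now(timezone.utc)
    return [r for r in records if is_trustworthy(r, now)]
```

Checks like these only cover timeliness and completeness. Judging whether a value is actually correct usually needs domain-specific rules or cross-checks against other sources.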

Top Challenges for Big Data

The Cloud Security Alliance (CSA) conducted a review of the challenges facing Big Data and compiled a list of the top ten current challenges. One of the chief concerns is the ability to conduct massive computations in highly secure distributed programming frameworks. Security and speed typically aren’t complementary, so creating highly efficient servers that can process massive amounts of data is a primary concern.

The second challenge stems from the fact that Big Data consists largely of non-relational data. Put simply, relational data stores impose a strict schema, so the data in them is highly structured and organized. Most Big Data is non-relational and, as a result, generally less organized. Non-relational data doesn’t require software programmers to re-architect an entire system if the stored data needs to be used for an entirely different purpose later on. This flexibility is crucial for using Big Data effectively. As strange as it may seem, overly organized data is a detriment to working with Big Data. The challenge is creating best practices for securing non-relational data stores.

The third issue is finding ways to secure data storage and transaction logs that may be created in real time. Since data and transaction logs are often stored in multi-tiered media, it’s important for companies and organizations to protect this data from unauthorized access while keeping it available. Policy-based encryption can help ensure that only verified, authenticated users and applications can access the data.

The Challenges of the Big Data Ecosystem

Cloud security requires different types of training for each of the main aspects of Big Data management. The four main aspects are infrastructure security, data privacy, data management, and integrity and reactive security. Each of them needs a different approach.

Infrastructure security relies on breaking data down into pieces and analyzing it at a micro level. Capabilities then need to be introduced that prevent sensitive data from leaking, using data sanitization and de-identification.
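To make de-identification more concrete, here is a minimal sketch that drops some fields outright and replaces direct identifiers with a keyed hash. The field names, the `DROP_FIELDS` and `DIRECT_IDENTIFIERS` sets and the function names are hypothetical, chosen only for illustration. Keyed hashing is pseudonymization rather than full anonymization, so it should be read as one possible building block, not a complete solution.

```python
import hashlib
import hmac

# Hypothetical field lists; which fields count as sensitive depends on the data set.
DIRECT_IDENTIFIERS = {"name", "email"}
DROP_FIELDS = {"ssn"}


def pseudonymize(value, secret_key):
    """Replace a value with a keyed hash so records can still be joined."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def de_identify(record, secret_key):
    """Return a copy of the record with sensitive fields dropped or pseudonymized."""
    cleaned = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue  # remove the field entirely
        if field in DIRECT_IDENTIFIERS:
            cleaned[field] = pseudonymize(str(value), secret_key)
        else:
            cleaned[field] = value
    return cleaned
```

Because the same key always produces the same pseudonym, analysts can still group records by person without seeing the underlying name or email address. The key itself then becomes sensitive and has to be protected.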

Beyond the data itself, employees and everyone else who touches the ecosystem must be highly trained, trustworthy and qualified. This adds to the already challenging upfront costs and to the hard-to-pin-down ROI that companies need to justify those costs, forcing businesses either to headhunt top talent or to provide high-quality training, both of which are expensive.

Data privacy has historically concentrated on protecting the systems that manage data rather than the data itself. Since these systems have proven to be vulnerable, a new approach that encapsulates data in cloud-based environments is necessary. New algorithms must also be created to provide better key management and secure key exchanges.

Data management concerns itself with secure data storage, secure transaction logs, granular audits and data provenance. This aspect must be concerned with validating and determining the trustworthiness of data. Fine-grained access controls along with end-to-end data protection can be used to verify data and make data management more secure. 
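One common way to make a transaction log tamper-evident is to chain entries together with hashes, so that changing an earlier entry breaks every later link. The sketch below illustrates the idea under simple assumptions: the log is an in-memory list, payloads are JSON-serializable dictionaries, and the names `append_entry` and `verify_log` are made up for this example.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash, payload):
    """Hash the previous entry's hash together with this entry's payload."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + body).hexdigest()


def append_entry(log, payload):
    """Append a payload to the log, linking it to the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "prev": prev_hash,
                "hash": _entry_hash(prev_hash, payload)})


def verify_log(log):
    """Return True if every entry still matches its hash and links to its predecessor."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this detects edits and reordering inside the log, but not the removal of entries from the end. Catching that would require storing the latest hash somewhere an attacker cannot reach.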

Integrity and reactive security is concerned with endpoint validation and filtering, as well as real-time security monitoring. Each endpoint must be thoroughly vetted to ensure that malicious data isn’t submitted. Using Trusted Platform Module chips along with host- and mobile-based security controls can reduce some of the risk associated with untrusted endpoints.
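On the software side, endpoint validation and filtering might look something like the following sketch, which accepts a submission only if it comes from a known endpoint, carries a valid HMAC signature and contains a plausible value. The endpoint registry, the `reading` field and the accepted range are all assumptions invented for the example. It does not model hardware-backed measures such as Trusted Platform Module chips.

```python
import hashlib
import hmac
import json

# Hypothetical registry of per-endpoint secrets; a real system would keep
# these in a key store rather than in source code.
ENDPOINT_KEYS = {"sensor-01": b"demo-secret-01"}


def sign(endpoint_id, payload, key):
    """Compute the signature an endpoint attaches to its submission."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    message = endpoint_id.encode("utf-8") + b"|" + body
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def accept(endpoint_id, payload, signature):
    """Accept a submission only from a known endpoint with a valid signature and a plausible value."""
    key = ENDPOINT_KEYS.get(endpoint_id)
    if key is None:
        return False  # unknown endpoint
    expected = sign(endpoint_id, payload, key)
    if not hmac.compare_digest(expected, signature):
        return False  # payload altered or signed with the wrong key
    reading = payload.get("reading")
    if isinstance(reading, bool) or not isinstance(reading, (int, float)):
        return False  # wrong type
    return -50.0 <= reading <= 150.0  # assumed plausible range
```

The signature check establishes that the data came from a registered endpoint and was not modified in transit. The type and range checks filter out values that are well-formed but implausible.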


The cloud environment has made working with Big Data easier and more productive, and it has lowered the barrier to entry for businesses getting their feet wet. Data integration and cloud services allow tools such as BI platforms and Hadoop to work together, but it’s important to realize that significant security risks remain. Companies, organizations and governments can use the CSA working group’s research and findings to analyze the current state of their infrastructure and applications and to create more secure environments that better protect against unauthorized access.