When Is ‘Big Data’ Too Big for Analytics?

TimManns

Apologies for the lack of recent posts.  I’ve been *very* busy on many Data Mining Analytics projects in my role as a Data Mining Consultant for SAS.  The content of my work is usually sensitive, so discussing it in any level of detail in public blog posts is difficult.


This specific post is to help promote the launch of the new IAPA website and increase focus on Analytics in Australia (and Sydney, where I am normally based).  The topic of this post is something that has been at the forefront of my mind and seems to be a central theme of many of the projects I have been working on recently.  It is certainly a current problem for many Marketing/Customer Analytics departments.  So here are a few thoughts and comments on ‘big data’. Apologies for typos, it is mostly written piecemeal on my iPhone during short 5 minute breaks…

How big is too big (for Analytics)?
I frequently read Analytics blogs and e-magazines that talk about the ‘new’ explosion of big data. Although I am unconvinced it is new, or will improve anytime soon, I do agree that despite technology advances in analytics, the growth of data generation and storage seems to be outpacing most analysts’ ability to transform data into information and utilise it to greater benefit (both operationally and analytically). The term ‘Analysis Paralysis’ has never been so relevant!

But from a practical perspective, what conditions cause data to become unwieldy? For example, take a typical customer services based organisation such as a bank, telco, or public department: how can the data (de)-evolve to a state that makes it ‘un-analysable’ (what a horrible thought..)? Even given mild (by today’s standards) numbers of variables and records, certain practices and conditions can lead to bottlenecks, widespread performance problems, and delays that make any delivery of Analytics very challenging.


So, below is a series of my most recent observations from Analytics projects I have been involved with that resolved, or ran into, ‘big data’ problems:

– Scalable Infrastructure
Data will grow. Fast. In fact it will probably more than double in the next few years. The CPU capacity of data warehousing and analytics servers needs to improve to match.

As an example, I was working on a telco Social Network Analysis project recently where we were processing weekly summaries of mobile telephone calls for approximately 18 million individuals. My role was to analyse the social interactions between all customers and build dozens of propensity scores, using the social influence of others to predict behaviour. In total I was probably processing hundreds of millions of records of data (by a dozen or so variables). This was more than the client typically analysed.
After a week of design and preliminary work I began to consider ways to optimise the performance of my queries and computations, and I asked about the server specifications. I assumed some big server with dozens of processors, but unfortunately what I was connecting to was a dual core 4GB desktop PC under an Analyst’s desk…
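To put rough numbers on that mismatch, here is a back-of-envelope sketch in Python. The figures are assumptions for illustration only: ‘hundreds of millions’ of records is taken as 300 million, ‘a dozen or so’ variables as 12, and every value is assumed to be stored as an 8-byte number. The function name is hypothetical.

```python
def dense_footprint_gib(rows: int, columns: int, bytes_per_value: int = 8) -> float:
    """Approximate in-memory size of a dense numeric table, in GiB."""
    return rows * columns * bytes_per_value / 2**30


# Assumed figures (not from the project itself): 300 million rows,
# 12 variables, 8 bytes per value.
needed = dense_footprint_gib(rows=300_000_000, columns=12)
available_gb = 4  # RAM of the desktop PC described above

print(f"Needed: about {needed:.1f} GiB; available: {available_gb} GB")
```

Under these assumptions the raw table alone is several times larger than the machine's memory, before any joins, sorts, or intermediate tables are created.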

– Variable Transformations

A common mistake by inexperienced data miners is to ignore or short-cut comprehensive data preparation steps. All data that involves analysis of people is certain to include unusual characteristics. One person’s outlier is another’s screw-up 🙂
So, what is the best way to account for outliers, skewed distributions, sparse data, or data features that are very likely erroneous? Well, an approach (that I am not keen on) taken by some is to apply several variable transformations indiscriminately to all ‘raw’ variables and subsequently let a variable selection process pick the best input variables for propensity modeling etc. When combined with data which represents transposed time series (so one variable represents a value in ‘month1’, the next variable the same value dimension in ‘month2’, etc) this can easily generate in excess of 20,000 variables (by say 10 million customers…). It is true there are variable selection methods that handle 20,000 variables quite well, but the metadata and processing needed to create those datasets is often significant, and the whole process often incurs excessive costs in terms of time to delivery of results.
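A small Python sketch shows how quickly the column count multiplies. All inputs here are hypothetical: 100 raw measures, 24 months of transposed history, and 9 transformations per value. The naming pattern `measure_mN_transform` is likewise just one possible convention, not one taken from any real project.

```python
from itertools import product


def derived_variable_names(measures, months, transforms):
    """Build one column name per (measure, month, transform) combination."""
    return [
        f"{measure}_m{month}_{transform}"
        for measure, month, transform in product(
            measures, range(1, months + 1), transforms
        )
    ]


# Hypothetical inputs: 100 raw measures, 24 months, 9 transformations.
measures = [f"var{i:03d}" for i in range(1, 101)]
transforms = ["raw", "log", "sqrt", "rank", "zscore",
              "capped", "binned", "delta", "ratio"]

names = derived_variable_names(measures, months=24, transforms=transforms)

print(len(names))   # 100 * 24 * 9 = 21600 columns
print(names[:3])    # first few generated names
```

Even with these modest inputs the result passes 20,000 columns, and every one of them has to be computed, stored, and documented for each customer.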
An additional problem that may arise when you start working with many thousands of variables is that variable names need to be easily understood and interpretable. The last thing a data miner wants to do is spend hours working out what those transformed and selected important variables in the propensity model actually mean and represent in the raw data.
Which leads me to my next point..

– Variable / Data Understanding
One of the core skills of a good data miner is the ability to understand and translate complex data in order to solve business problems.
As organisations obtain more data it is not just about more records; often the data reveals new subtle operational details and customer behaviours not previously known, or completely new sources of data (Facebook, social chat, location based services etc). This in turn often requires extended knowledge of the business and operational systems to enable the correct data warehouse values or variable manipulations and selections to be made.
An analyst is expected to understand most parts of an organisation’s data at a level of detail most individuals in the organisation are not concerned with, and this is often a monumental task.
As an example of ‘big data’ bad practice, I’ve encountered verbose variable names which immediately require truncation (due to IT / variable name limit reasons), others which make it difficult to understand the value or meaning of the variable, and naming conventions which are undocumented. For example: “number_of_broken_promises” is one of the funniest long variable names I’ve seen, whilst others such as “ccxs_ytdspd_m1_pct” can be guessed when you have the business context but definitely require detailed documentation or a key.
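One lightweight safeguard is to audit variable names against a length limit and a data dictionary before modelling starts. The Python sketch below is only an illustration: the function name, the 20-character limit, and the dictionary entry are all hypothetical, and the real limit depends on the systems involved.

```python
def audit_variable_names(names, data_dictionary, max_length=32):
    """Return (names that exceed max_length, names missing from the dictionary)."""
    too_long = [n for n in names if len(n) > max_length]
    undocumented = [n for n in names if n not in data_dictionary]
    return too_long, undocumented


# Hypothetical data dictionary; the description is a placeholder.
data_dictionary = {
    "number_of_broken_promises": "placeholder description for illustration",
}
names = ["number_of_broken_promises", "ccxs_ytdspd_m1_pct"]

# max_length=20 is an assumed limit chosen to trigger the check.
too_long, undocumented = audit_variable_names(names, data_dictionary, max_length=20)

print("Too long:", too_long)
print("Undocumented:", undocumented)
```

With these inputs the long name is flagged for truncation and the cryptic one is flagged as missing from the dictionary, which are the two problems described above.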

– Diverse Skillsets

‘Big data’ often requires big warehouse and analytics systems (see point 1), so an analyst must properly understand how these systems work.
Through personal experience I’m always aware of table indexes on a Teradata system, for example. By default the first column in a warehouse table will be the index, so if that column is a poorly managed or repetitive variable such as ‘gender’ or ‘end_date’ then the technology of a big data system works against you. I’ve seen this type of user error on temp tables or analytics output tables far too many times.  Big Data often involves bringing together information from a greater number of sources, so understanding the source systems and data warehouse involved is an important challenge.
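The effect of a repetitive index column can be illustrated with a toy simulation in Python. This is a sketch under stated assumptions, not Teradata's actual row-hashing algorithm: it simply hashes each index value with MD5 and assigns the row to one of 8 hypothetical processing units (AMPs, in Teradata terms), then counts rows per unit.

```python
import hashlib
from collections import Counter


def rows_per_unit(index_values, units=8):
    """Count rows per processing unit when rows are placed by hashing the index value."""
    counts = Counter({u: 0 for u in range(units)})
    for value in index_values:
        digest = hashlib.md5(str(value).encode()).digest()
        counts[int.from_bytes(digest[:4], "big") % units] += 1
    return [counts[u] for u in range(units)]


n = 100_000  # hypothetical number of rows

# Index on a two-valued column: every row lands on at most two units.
by_gender = rows_per_unit(["F" if i % 2 else "M" for i in range(n)])

# Index on a unique column: rows spread across all units.
by_customer_id = rows_per_unit(range(n))

print("index on gender:     ", by_gender)
print("index on customer_id:", by_customer_id)
```

With the two-valued index most units sit idle while one or two do all the work, which is why a parallel system can end up performing like a much smaller one.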
I hope this helps.  I strongly recommend getting involved with the IAPA and Sydney Data Miner’s Meetup if you are based in Australia or Sydney.
 – Tim