To Sample Or Not To Sample… Does It Even Matter?

BillFranks
Last updated: 2012/04/08 at 6:24 AM

So the question is…when do you sample and when do you not?  And does it even matter anymore in the world of big data?  As I’ll lay out here, in most cases today there is no point in wasting energy worrying about it.  As long as a few basic criteria are met, do whatever you prefer.

First, let’s take care of the cases where sampling just won’t work.  If you need to find the top 100 spending customers, you can’t do that with a sample.  You’ll have to look at every single customer to accurately identify the top 100.  Such scenarios are common, but they aren’t the most prevalent type of analytic requirement, even if they do represent an easy victory for the “no sampling” crowd.  Similarly, even a model built on a sample will need to be applied to the universe to use it appropriately.  So, when it comes time to deploy, sampling isn’t an option.
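
To make the top 100 case concrete, here is a minimal Python sketch. The customer population, the log-normal spend distribution, and the 10% sampling rate are all assumptions chosen for illustration, not figures from any real dataset. It ranks customers within a random sample and counts how many of the true top 100 the sample even contains.

```python
import random

def top_n_ids(spend_by_customer, n):
    """Return the set of customer ids with the n highest spend values."""
    ranked = sorted(spend_by_customer, key=spend_by_customer.get, reverse=True)
    return set(ranked[:n])

rng = random.Random(42)

# Hypothetical population: 100,000 customers with skewed (log-normal) spend.
population = {cid: rng.lognormvariate(4.0, 1.0) for cid in range(100_000)}
true_top = top_n_ids(population, 100)

# Draw a 10% simple random sample and rank customers within it.
sample_ids = rng.sample(list(population), 10_000)
sample = {cid: population[cid] for cid in sample_ids}
sample_top = top_n_ids(sample, 100)

# Only the true top customers that happened to be sampled can be recovered.
overlap = len(true_top & sample_top)
print(f"True top-100 customers recovered from a 10% sample: {overlap}")
```

With a 10% sample, only about one in ten of the true top customers is expected to land in the sample at all, so no amount of clever ranking can recover the rest.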

Second, let’s remember that many analytic processes are going to deal with or remove outliers and extreme values in some way.  As opposed to the “top 100” question above, many of the top or bottom observations may be removed or adjusted so as not to have too much influence.  Even if such observations are available in a dataset, they won’t be used.

The point above is important.  When building a customer propensity model, for example, you want it to apply broadly to the “typical” customer.  Perhaps there really is a customer who spends 1,000 times as much as the next highest customer.  Even if true, that customer is so extreme and atypical that you shouldn’t include them in your model.  The model is meant to differentiate the masses, and a few extreme customers can compromise the power of the model for the purpose it was intended.  Any customer who is legitimately that extreme is worthy of special handling from an organization to begin with.  You don’t need a model to tell you that.
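
The influence of a single extreme customer can be sketched in a few lines of Python. The spend values below are made up for illustration, and capping at the 99th percentile is just one common way of adjusting extreme values; it is an assumption here, not a method the article prescribes.

```python
import statistics

def cap_at_percentile(values, pct):
    """Cap values above the pct-th percentile (simple one-sided winsorizing).

    pct is an integer percentage such as 99.
    """
    ordered = sorted(values)
    cutoff_index = min(len(ordered) - 1, len(ordered) * pct // 100)
    cutoff = ordered[cutoff_index]
    return [min(v, cutoff) for v in values]

# Hypothetical data: 999 typical customers spending between 100 and 149,
# plus one customer who spends 1,000 times the next highest amount.
spend = [100.0 + (i % 50) for i in range(999)] + [149_000.0]

raw_mean = statistics.mean(spend)
capped = cap_at_percentile(spend, 99)
capped_mean = statistics.mean(capped)

print(f"mean with the extreme customer:      {raw_mean:.2f}")
print(f"mean after capping at 99th percentile: {capped_mean:.2f}")
```

One customer out of a thousand more than doubles the average here, which is why such observations are usually capped or set aside before modeling.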

Last, let’s come back to a typical scenario.  You need an average.  Or you want to get parameter estimates from some sort of predictive model.  Statistically speaking, a sample of sufficient size that is correctly drawn to mimic the population is going to get you essentially the same answer as if you used all of the data, within a margin of error that shrinks as the sample grows.  For most types of metrics and models, there is no practical difference between the results from a sample and the results from the universe.
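
As a rough illustration of that claim, the sketch below compares the mean of a simulated "universe" with the mean of a simple random sample drawn from it. The data are synthetic (a gamma distribution picked arbitrarily), the sizes are assumptions, and the interval uses the usual normal approximation without a finite population correction.

```python
import math
import random
import statistics

rng = random.Random(7)

# Hypothetical universe: 200,000 order values with a skewed distribution.
universe = [rng.gammavariate(2.0, 25.0) for _ in range(200_000)]
full_mean = statistics.fmean(universe)

# Simple random sample without replacement.
n = 10_000
sample = rng.sample(universe, n)
sample_mean = statistics.fmean(sample)

# Standard error of the sample mean and an approximate 95% interval.
std_err = statistics.stdev(sample) / math.sqrt(n)
low = sample_mean - 1.96 * std_err
high = sample_mean + 1.96 * std_err

print(f"mean of the universe: {full_mean:.2f}")
print(f"mean of the sample:   {sample_mean:.2f}")
print(f"approx. 95% interval: ({low:.2f}, {high:.2f})")
```

The standard error falls with the square root of the sample size, so quadrupling the sample only halves the margin of error. That is the sense in which a "sufficiently large" sample stops buying much extra precision.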

There are those who will vehemently argue that if you don’t need to sample, then don’t.  I can see that view.  One hole in this view, however, is that a correct modeling process will involve some combination of development and validation data sets…and these are effectively samples anyway!  Others will argue that you should only use the amount of data needed and that using more than the minimal sample required is a waste of time and resources.  I can also see this view.  One hole in this view is that if the resources available can easily handle all the data in a timely manner, then not much is wasted.
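
The remark about development and validation data sets can be seen in a small Python sketch. The function name `split_dev_validation`, the 25% validation fraction, and the record ids are hypothetical; the point is only that each partition is a random sample of the full data.

```python
import random

def split_dev_validation(records, validation_fraction, seed):
    """Randomly partition records into development and validation sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    validation = shuffled[:n_validation]
    development = shuffled[n_validation:]
    return development, validation

# Hypothetical universe of 1,000 customer ids.
customer_ids = range(1_000)
development, validation = split_dev_validation(customer_ids, 0.25, seed=1)

print(f"development records: {len(development)}")
print(f"validation records:  {len(validation)}")
```

Even when a model is built "on all the data", it is typically fit on the development sample and checked on the validation sample.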

Where I net out is that I really don’t care.  If someone doing a project for me wants to sample, I’m OK with that as long as the sample is sufficiently large and drawn correctly.  If someone wants to use the universe, I’m OK with that too as long as the extra resources required compared to a sample aren’t meaningful in practice.  I am confident I’ll get the same results, so I’ll stay out of the argument over sampling.

I realize that this position of indifference may concern virtually everyone since most people land on one side of the fence or the other.  I guess my point is simply that there are plenty of other, more “meaty” topics to spend time debating when developing an analytic process.  I don’t see the use in losing much sleep over whether or not to sample in today’s world.  If the systems and tools in use can handle it either way, then I’ll let you have it your way!

One last unrelated note…if you think that you or someone you know might be an analytic superhero, be sure to check out the Analytic Superheroes site! 

Bill Franks is Chief Analytics Officer for The International Institute For Analytics (IIA). Franks is also the author of Taming The Big Data Tidal Wave and The Analytics Revolution. His work has spanned clients in a variety of industries for companies ranging in size from Fortune 100 companies to small non-profit organizations. You can learn more at http://www.bill-franks.com.
