SmartData Collective

To Sample Or Not To Sample… Does It Even Matter?

By Bill Franks

So the question is…when do you sample and when do you not?  And does it even matter anymore in the world of big data?  As I’ll lay out here, in most cases today there is no point in wasting energy worrying about it.  As long as a few basic criteria are met, do whatever you prefer.

First, let’s take care of the cases where sampling just won’t work.  If you need to find the top 100 spending customers, you can’t do that with a sample.  You’ll have to look at every single customer to accurately identify the top 100.  Such scenarios, while common, aren’t the most prevalent type of analytic requirement, but they do represent an easy victory for the “no sampling” crowd.  Similarly, even a model built on a sample will need to be applied to the universe to use it appropriately.  So, when it comes time to deploy, sampling isn’t an option.
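
To make the “top 100” case concrete, here is a minimal Python sketch.  The customer data is synthetic, and the names `spend` and `top_n` are hypothetical, invented only for this illustration.  It ranks customers once using the full universe and once using a 10% random sample, then counts how many of the true top 100 the sample recovers.

```python
import heapq
import random

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical universe: customer id -> total spend (synthetic values).
spend = {cid: random.lognormvariate(4.0, 1.0) for cid in range(100_000)}

def top_n(ids, n=100):
    """Return the n highest-spending customer ids among `ids`."""
    return heapq.nlargest(n, ids, key=spend.__getitem__)

# Ranking over every customer gives the true top 100.
true_top = top_n(spend.keys())

# Ranking over a 10% simple random sample can only return sampled customers.
sample_ids = random.sample(list(spend), 10_000)
sample_top = top_n(sample_ids)

overlap = len(set(true_top) & set(sample_top))
print(f"true top-100 customers found via the sample: {overlap} of 100")
```

A true top-100 customer can only show up in the sample’s ranking if that customer happened to be drawn, so a 10% sample should be expected to find only about one in ten of them.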

Second, let’s remember that many analytic processes are going to deal with or remove outliers and extreme values in some way.  As opposed to the “top 100” question above, many of the top or bottom observations may be removed or adjusted so as not to have too much influence.  Even if such observations are available in a dataset, they won’t be used.

The point above is important.  When building a customer propensity model, for example, you want it to apply broadly to the “typical” customer.  Perhaps there really is a customer who spends 1,000 times as much as the next highest customer.  Even if true, that customer is so extreme and atypical that you shouldn’t include them in your model.  The model is meant to differentiate the masses, and a few extreme customers can compromise the power of the model for the purpose for which it was intended.  Any customer who is legitimately that extreme is worthy of special handling from an organization to begin with.  You don’t need a model to tell you that.
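
As a rough illustration of removing or adjusting extreme values, the Python sketch below caps or drops anything above the 99th percentile of spend.  The data is synthetic, and the 99th-percentile cutoff is an assumption chosen only for the example, not a recommended threshold.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Synthetic spend values plus one hypothetical customer at 1,000x the maximum.
spend = [random.lognormvariate(4.0, 0.8) for _ in range(10_000)]
spend.append(max(spend) * 1000)

# quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
cap = statistics.quantiles(spend, n=100)[98]

capped = [min(x, cap) for x in spend]       # adjust: pull extremes down to the cap
trimmed = [x for x in spend if x <= cap]    # remove: drop extremes entirely

print(f"mean with the extreme customer: {statistics.fmean(spend):.2f}")
print(f"mean after capping:             {statistics.fmean(capped):.2f}")
print(f"mean after trimming:            {statistics.fmean(trimmed):.2f}")
```

Either way, the extreme customer no longer drives the result, which is why it matters little whether that customer was in the dataset to begin with.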

Last, let’s come back to a typical scenario.  You need an average.  Or you want to get parameter estimates from some sort of predictive model.  Statistically speaking, a sample of sufficient size that is correctly drawn to mimic the population is going to get you effectively the same answer as if you used all of the data.  For most types of metrics and models, any difference between the results from a sample and from the universe is too small to matter in practice.
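
The claim about averages is easy to check with a small simulation.  The Python sketch below uses a synthetic “universe” of spend values (the distribution and sizes are assumptions made up for the example) and compares the mean of a simple random sample with the mean of all the data, along with the usual standard error of the sample mean.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical universe: 200,000 synthetic customer spend values.
population = [random.gammavariate(2.0, 50.0) for _ in range(200_000)]

# Simple random sample without replacement: every record is equally likely.
sample = random.sample(population, 10_000)

pop_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

# Standard error of the sample mean: s / sqrt(n).
std_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"universe mean: {pop_mean:.2f}")
print(f"sample mean:   {sample_mean:.2f} (+/- {1.96 * std_error:.2f} at ~95%)")
```

Because the standard error shrinks with the square root of the sample size, a correctly drawn sample of a few thousand records already pins the average down tightly, however large the universe is.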

There are those who will vehemently argue that if you don’t need to sample, then don’t.  I can see that view.  One hole in this view, however, is that a correct modeling process will involve some combination of development and validation data sets…and these are effectively samples anyway!  Others will argue that you should only use the amount of data needed and that using more than the minimal sample required is a waste of time and resources.  I can also see this view.  One hole in this view is that if the resources available can easily handle all the data in a timely manner, then not much is wasted.
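
The development and validation split mentioned above is itself a sampling step.  A minimal Python sketch, assuming a 70/30 split and using a hypothetical list of customer ids as a stand-in for real records:

```python
import random

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in for the full set of modeling records.
customer_ids = list(range(1_000))

# Shuffle a copy, then cut it 70/30 into development and validation sets.
shuffled = random.sample(customer_ids, k=len(customer_ids))
cut = len(shuffled) * 7 // 10
development, validation = shuffled[:cut], shuffled[cut:]

print(f"development records: {len(development)}")
print(f"validation records:  {len(validation)}")
```

Each of the two sets is a random sample of the universe, even when the project as a whole is described as using all of the data.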

Where I net out is that I really don’t care.  If someone doing a project for me wants to sample, I’m ok with that as long as the sample is sufficiently large and drawn correctly.  If someone wants to use the universe, I’m ok with that too as long as the extra resources required compared to a sample aren’t pragmatically meaningful.  I am confident I’ll get the same results, so I’ll stay out of the argument over sampling.

I realize that this position of indifference may concern virtually everyone since most people land on one side of the fence or the other.  I guess my point is simply that there are plenty of other, more “meaty” topics to spend time debating when developing an analytic process.  I don’t see the use in losing much sleep over whether or not to sample in today’s world.  If the systems and tools in use can handle it either way, then I’ll let you have it your way!

One last unrelated note…if you think that you or someone you know might be an analytic superhero, be sure to check out the Analytic Superheroes site! 

Bill Franks is Chief Analytics Officer for The International Institute For Analytics (IIA). Franks is also the author of Taming The Big Data Tidal Wave and The Analytics Revolution. His work has spanned clients in a variety of industries for companies ranging in size from Fortune 100 companies to small non-profit organizations. You can learn more at http://www.bill-franks.com.
