SmartData Collective

Counting Observations

DeanAbbott

Data is fodder for the data mining process. One fundamental aspect of the data we analyze is its size, which is most often characterized by the number of observations and the number of variables in the given set of data, typically measured as counts of “rows” and “columns”, respectively. It is worth taking a closer look at this, though, as questions such as “Do we have enough data?” depend on an apt measure of how much data we have.

Outcome Distributions

In many predictive modeling situations, cases are spread fairly evenly among the possible outcomes, but this is not always true. Many fraud detection problems, for instance, involve extreme class imbalance: target class cases (known frauds) may represent a small fraction of 1% of the available records. Despite having many total observations of customer behavior, observations of fraudulent behavior may be rather sparse. Data miners who work in the fraud detection field are acutely aware of this issue and characterize their data sets not just by ‘total number of observations’, but also by ‘observations of the behavior of interest’. When assessing an existing data set, or specifying a new one, such an analyst generally employs both counts.
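
As a rough illustration of reporting both counts, the following Python sketch tallies the total number of observations alongside the number belonging to the class of interest. The function name `count_observations`, the label strings and the sample data are all hypothetical, made up purely for this example.

```python
from collections import Counter


def count_observations(labels, target):
    """Return (total count, target-class count, target-class fraction)."""
    counts = Counter(labels)
    total = sum(counts.values())
    n_target = counts[target]  # Counter returns 0 for labels it never saw
    fraction = n_target / total if total else 0.0
    return total, n_target, fraction


# Hypothetical data: 3 known frauds among 2,000 records.
labels = ["fraud"] * 3 + ["legit"] * 1997

total, n_fraud, frac = count_observations(labels, "fraud")
print(f"total observations: {total}")
print(f"observations of the behavior of interest: {n_fraud} ({frac:.2%})")
```

Seen this way, the data set is better described as having a handful of fraud observations than as having thousands of records.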

Numeric outcome variables may also suffer from this problem. Most numeric variables are not uniformly distributed, and areas in which outcome data is sparse (for instance, long tails of high personal income) may be poorly represented in models derived from that data.

With both class and numeric outcomes, it might be argued that outcome values which are infrequent are, by definition, less important. This may or may not be so, depending on the modeling process and our priorities. If the model is expected to perform well on the top personal income decile, then data should be evaluated by how many cases fall in that range, not just on the total observation count.
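
The same idea applies to numeric outcomes. The sketch below counts how many cases fall in a range of interest, next to the total. The income figures and the threshold are invented for illustration, and the threshold is assumed to come from an external population reference rather than from the sample itself.

```python
import math


def count_in_range(values, low=-math.inf, high=math.inf):
    """Count values v satisfying low <= v < high."""
    return sum(1 for v in values if low <= v < high)


# Hypothetical personal incomes in the available data.
incomes = [18_000, 25_000, 31_000, 40_000, 47_000,
           52_000, 61_000, 75_000, 140_000, 410_000]

# Hypothetical lower bound of the top income decile in the wider population.
TOP_DECILE_THRESHOLD = 130_000

n_total = len(incomes)
n_top = count_in_range(incomes, low=TOP_DECILE_THRESHOLD)
print(f"total observations: {n_total}, in range of interest: {n_top}")
```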

Predictor Distributions

Issues of coverage occur on the input variable side, as well. Keeping in mind that generalization is the goal of discovered models, the total record count by itself seems inadequate when, for example, data are drawn from a process which has (or may have) a seasonal component. Having 250,000 records in a single data set sounds like a lot, but if they are drawn only from October, November and December, then one might reasonably take the perspective that only 3 “observations” of monthly behavior are represented, out of 12 possibilities. In fact (assuming some level of stability from year to year), one could argue that not only should all 12 calendar months be included, but that they should be drawn from multiple historical years, so that there are multiple observations for each calendar month.
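
One way to check this kind of coverage is to count distinct calendar months, and the number of years observed for each, rather than raw records. The sketch below assumes each record carries a date; the function name `month_coverage` and the sample dates are hypothetical.

```python
from collections import defaultdict
from datetime import date


def month_coverage(dates):
    """Map each calendar month (1-12) to the set of years in which it appears."""
    years_by_month = defaultdict(set)
    for d in dates:
        years_by_month[d.month].add(d.year)
    return dict(years_by_month)


# Hypothetical record dates, all drawn from October through December.
dates = [
    date(2008, 10, 3), date(2008, 11, 17), date(2008, 12, 24),
    date(2009, 10, 9), date(2009, 12, 1),
]

coverage = month_coverage(dates)
months_seen = len(coverage)
missing = [m for m in range(1, 13) if m not in coverage]

print(f"records: {len(dates)}, calendar months represented: {months_seen} of 12")
print(f"months with no data: {missing}")
for month in sorted(coverage):
    print(f"month {month}: observed in {len(coverage[month])} year(s)")
```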

Other groupings of cases in the input space may also be important. For instance, hundreds of retail sales may be observed, but if they come from only 25 salespeople out of a sales force of 300, then treating the simple record count as the “observation count” may be deceiving.

Validation Issues

Observations as aggregates of single records should be considered during the construction of train/test data, as well. When pixel-level data are drawn from images for the construction of a pixel-level classifier, for instance, it makes sense to avoid having some pixels from a given image serve as training observations while other pixels from that same image serve as validation observations. Entire images should be labeled as “train” or “test”, and pixels drawn as observations accordingly, to avoid “cheating” during model construction based on the inherent redundancy in image data.
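
A minimal sketch of such a group-wise split follows, assuming each pixel record carries an identifier for its source image. The function `split_by_group`, the record layout and the 30% test fraction are illustrative choices, not anything prescribed above.

```python
import random


def split_by_group(records, group_of, test_fraction=0.3, seed=0):
    """Assign whole groups to train or test so that no group spans both."""
    groups = sorted({group_of(r) for r in records})
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if group_of(r) not in test_groups]
    test = [r for r in records if group_of(r) in test_groups]
    return train, test


# Hypothetical pixel records: (image_id, pixel_index, class_label).
records = [(f"img{i}", p, p % 2) for i in range(5) for p in range(4)]

train, test = split_by_group(records, group_of=lambda r: r[0])
train_images = {r[0] for r in train}
test_images = {r[0] for r in test}

print(f"train: {len(train)} pixels from {len(train_images)} images")
print(f"test: {len(test)} pixels from {len(test_images)} images")
print(f"images shared between train and test: {len(train_images & test_images)}")
```

Because the shuffle is applied to image identifiers rather than to pixels, every pixel from a given image lands on the same side of the split.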

Conclusion

This posting has only briefly touched on some of the issues which arise when attempting to measure the volume of data in one’s possession, and has not explored yet more subtle concepts such as sampling techniques, observation weighting or model performance measures. Hopefully, though, it gives the reader some things to think about when assessing data sets in terms of their size and quality.
