© 2008-25 SmartData Collective. All Rights Reserved.

Counting Observations

DeanAbbott

Data is fodder for the data mining process. One fundamental aspect of the data we analyze is its size, which is most often characterized by the number of observations and the number of variables in the data set, typically measured as counts of “rows” and “columns”, respectively. It is worth taking a closer look at this, though, because questions such as “Do we have enough data?” depend on an apt measure of how much data we have.


Outcome Distributions

In many predictive modeling situations, cases are spread fairly evenly among the possible outcomes, but this is not always true. Many fraud detection problems, for instance, involve extreme class imbalance: target class cases (known frauds) may represent a small fraction of 1% of the available records. Despite having many total observations of customer behavior, observations of fraudulent behavior may be rather sparse. Data miners who work in the fraud detection field are acutely aware of this issue and characterize their data sets not just by ‘total number of observations’, but also by ‘observations of the behavior of interest’. When assessing an existing data set, or specifying a new one, such an analyst generally employs both counts.
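
As a rough illustration of keeping both counts side by side, here is a minimal Python sketch. The function name `outcome_counts`, the class labels and the record counts are all made up for the example; they are not taken from any real fraud data set.

```python
from collections import Counter

def outcome_counts(labels, target):
    """Return (total observations, observations of the target class, target share)."""
    counts = Counter(labels)
    total = sum(counts.values())
    n_target = counts[target]  # Counter returns 0 for a label it never saw
    share = n_target / total if total else 0.0
    return total, n_target, share

# Hypothetical data: 100,000 transactions, 250 of them known frauds.
labels = ["fraud"] * 250 + ["legit"] * 99_750
total, n_fraud, share = outcome_counts(labels, "fraud")
print(f"total={total}, fraud={n_fraud}, share={share:.2%}")
```

Reporting the pair (total, target count) rather than the total alone makes the sparsity of the behavior of interest visible at a glance.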

Numeric outcome variables may also suffer from this problem. Most numeric variables are not uniformly distributed, and regions in which outcome data is sparse (for instance, the long tail of high personal incomes) may be poorly represented in models derived from that data.

With both class and numeric outcomes, it might be argued that outcome values which are infrequent are, by definition, less important. This may or may not be so, depending on the modeling process and our priorities. If the model is expected to perform well on the top personal income decile, then data should be evaluated by how many cases fall in that range, not just on the total observation count.
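
The same idea can be sketched for a numeric outcome: count the cases that fall in the range the model is expected to handle well. The helper `count_in_range`, the income values and the cutoff below are hypothetical; in practice the cutoff would come from whatever defines the range of interest (for example, a known population decile boundary).

```python
def count_in_range(values, low=float("-inf"), high=float("inf")):
    """Count observations whose outcome lies in the closed interval [low, high]."""
    return sum(1 for v in values if low <= v <= high)

# Hypothetical incomes and a hypothetical lower bound for the range of interest.
incomes = [28_000, 35_000, 41_000, 47_000, 52_000,
           58_000, 66_000, 75_000, 98_000, 240_000]
cutoff = 90_000  # assumed boundary, not derived from real data

n_total = len(incomes)
n_high = count_in_range(incomes, low=cutoff)
print(f"{n_total} observations in total, {n_high} in the range of interest")
```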

Predictor Distributions

Issues of coverage occur on the input variable side as well. Keeping in mind that generalization is the goal of discovered models, the total record count by itself seems inadequate when, for example, data are drawn from a process which has (or may have) a seasonal component. Having 250,000 records in a single data set sounds like a lot, but if they are drawn only from October, November and December, then one might reasonably take the perspective that only 3 “observations” of monthly behavior are represented, out of 12 possibilities. In fact, assuming some level of stability from year to year, one could argue that not only should all 12 calendar months be included, but that they should be drawn from multiple historical years, so that there are multiple observations for each calendar month.
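
One way to check seasonal coverage, sketched here under the assumption that each record carries a date, is to tabulate which calendar months appear and in how many distinct years. The function `month_coverage` and the sample dates are invented for illustration.

```python
from collections import defaultdict
from datetime import date

def month_coverage(dates):
    """Map each calendar month (1-12) to the set of years in which it was observed."""
    years_by_month = defaultdict(set)
    for d in dates:
        years_by_month[d.month].add(d.year)
    return dict(years_by_month)

# Hypothetical records: many rows, all from the fourth quarter of a single year.
dates = [date(2008, m, day) for m in (10, 11, 12) for day in range(1, 29)]

coverage = month_coverage(dates)
months_seen = len(coverage)
print(f"{len(dates)} records, {months_seen} of 12 calendar months observed")
for month in sorted(coverage):
    print(f"month {month}: observed in {len(coverage[month])} year(s)")
```

A month that is missing from the result, or that appears in only one year, marks a gap that the raw record count would hide.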

Other groupings of cases in the input space may also be important. For instance, hundreds of retail sales may be observed, but if they come from only 25 salespeople out of a sales force of 300, then treating the simple record count as the “observation count” may be deceiving.
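
A similar check applies to any grouping variable. The sketch below assumes each record carries a group identifier and that the size of the full population of groups is known; the name `group_coverage` and the salesperson IDs are hypothetical.

```python
def group_coverage(group_ids, population_size):
    """Return (number of distinct groups observed, fraction of the population covered)."""
    n_groups = len(set(group_ids))
    return n_groups, n_groups / population_size

# Hypothetical sales records: 500 rows from 25 salespeople, sales force of 300.
salesperson_ids = [f"sp{i % 25:03d}" for i in range(500)]

n_seen, frac = group_coverage(salesperson_ids, population_size=300)
print(f"{len(salesperson_ids)} records from {n_seen} salespeople "
      f"({frac:.1%} of the sales force)")
```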

Validation Issues

Observations as aggregates of single records should be considered during the construction of train/test data as well. When pixel-level data are drawn from images for the construction of a pixel-level classifier, for instance, it makes sense to avoid having some pixels from a given image serve as training observations while other pixels from that same image serve as validation observations. Entire images should be labeled as “train” or “test”, and pixels drawn as observations accordingly, to avoid “cheating” during model construction based on the inherent redundancy in image data.
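
A group-wise split along these lines might look like the following sketch, which assigns whole images to either side before any pixels are drawn. The function `split_by_group`, the record layout `(image_id, pixel_index)` and the 30% test fraction are assumptions made for the example, not a prescribed procedure.

```python
import random

def split_by_group(records, group_key, test_fraction=0.3, seed=0):
    """Assign whole groups to train or test so that no group straddles the split."""
    groups = sorted({group_key(r) for r in records})
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if group_key(r) not in test_groups]
    test = [r for r in records if group_key(r) in test_groups]
    return train, test

# Hypothetical pixel records: 10 images, 100 pixels each.
pixels = [(img, px) for img in range(10) for px in range(100)]

train, test = split_by_group(pixels, group_key=lambda r: r[0])
train_images = {r[0] for r in train}
test_images = {r[0] for r in test}
print(len(train_images), "training images,", len(test_images), "test images")
print("images on both sides:", len(train_images & test_images))
```

Note that the split is sized in images rather than pixels, so the effective number of independent validation observations here is the number of test images, not the number of test pixels.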

Conclusion

This posting has only briefly touched on some of the issues which arise when attempting to measure the volume of data in one’s possession, and has not explored yet more subtle concepts such as sampling techniques, observation weighting or model performance measures. Hopefully though, it gives the reader some things to think about when assessing data sets in terms of their size and quality.

TAGGED:data mining