Counting Observations

Dean Abbott · Last updated: 2010/01/23 at 3:53 PM · 6 Min Read

Data is fodder for the data mining process. One fundamental aspect of the data we analyze is its size, most often characterized by the number of observations and the number of variables in the given set of data, typically measured as counts of “rows and columns”, respectively. It is worth taking a closer look at this, though, since questions such as “Do we have enough data?” depend on an apt measure of how much data we have.


Outcome Distributions

In many predictive modeling situations, cases are spread fairly evenly among the possible outcomes, but this is not always true. Many fraud detection problems, for instance, involve extreme class imbalance: target class cases (known frauds) may represent a small fraction of 1% of the available records. Despite having many total observations of customer behavior, observations of fraudulent behavior may be rather sparse. Data miners who work in the fraud detection field are acutely aware of this issue and characterize their data sets not just by ‘total number of observations’, but also by ‘observations of the behavior of interest’. When assessing an existing data set, or specifying a new one, such analysts generally employ both counts.
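
To make this concrete, here is a minimal sketch of counting both quantities; it assumes a pandas DataFrame with a binary is_fraud label (both the data and the column name are hypothetical):

```python
import pandas as pd

# Hypothetical transaction records with a binary fraud label.
df = pd.DataFrame({
    "amount": [25.0, 310.5, 12.9, 999.0, 48.2],
    "is_fraud": [0, 0, 0, 1, 0],
})

total_count = len(df)                     # total observations
target_count = int(df["is_fraud"].sum())  # observations of the behavior of interest

print(f"{total_count} records, {target_count} frauds "
      f"({target_count / total_count:.1%} of the data)")
```

Reporting both numbers side by side is what keeps a nominally “large” data set from hiding a vanishingly small target class.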

Numeric outcome variables may also suffer from this problem. Most numeric variables are not uniformly distributed, and regions where outcome data are sparse (for instance, the long tail of high personal incomes) may be poorly represented in models derived from that data.

With both class and numeric outcomes, it might be argued that outcome values which occur infrequently are, by definition, less important. This may or may not be so, depending on the modeling process and our priorities. If the model is expected to perform well on the top personal-income decile, then the data should be evaluated by how many cases fall in that range, not just by the total observation count.
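
As a quick illustration (a sketch assuming a pandas Series of incomes; the values are made up), the coverage check is a one-liner against the decile cutoff:

```python
import pandas as pd

# Hypothetical personal-income outcome variable.
income = pd.Series([18_000, 32_000, 41_000, 55_000, 62_000,
                    75_000, 88_000, 120_000, 250_000, 900_000])

decile_cutoff = income.quantile(0.9)              # 90th-percentile threshold
top_decile_count = int((income >= decile_cutoff).sum())

print(f"{top_decile_count} of {len(income)} cases fall in the top income decile")
```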

Predictor Distributions

Issues of coverage occur on the input-variable side as well. Keeping in mind that generalization is the goal of discovered models, the total record count by itself seems inadequate when, for example, the data are drawn from a process which has (or may have) a seasonal component. Having 250,000 records in a single data set sounds like many, but if they are drawn only from October, November and December, then one might reasonably take the perspective that only 3 “observations” of monthly behavior are represented, out of 12 possibilities. In fact (assuming some level of stability from year to year), one could argue not only that all 12 calendar months should be included, but that they should be drawn from multiple historical years, so that there are multiple observations for each calendar month.
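
A simple coverage check along these lines (a sketch, assuming a timestamped pandas DataFrame with hypothetical column names) counts distinct calendar months and distinct year-month pairs rather than raw rows:

```python
import pandas as pd

# Hypothetical transaction log with timestamps.
df = pd.DataFrame({
    "date": pd.to_datetime(["2009-10-05", "2009-11-17", "2009-12-02",
                            "2008-10-21", "2008-11-30"]),
    "sales": [120.0, 85.5, 240.0, 60.0, 95.0],
})

months_covered = df["date"].dt.month.nunique()        # distinct calendar months (of 12)
year_months = df["date"].dt.to_period("M").nunique()  # distinct (year, month) pairs

print(f"{len(df)} records, covering {months_covered} of 12 calendar months "
      f"across {year_months} year-month combinations")
```

The same distinct-count idea applies to any grouping variable on the input side, such as the salespeople in the example below.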

Other groupings of cases in the input space may also be important. For instance, hundreds of observations of retail sales may be available, but if they come from only 25 salespeople out of a sales force of 300, then the simple record count as “observation count” may be deceiving.

Validation Issues

Observations as aggregates of single records should be considered during the construction of train/test data as well. When pixel-level data are drawn from images to build a pixel-level classifier, for instance, it makes sense to avoid having some pixels from a given image serve as training observations while other pixels from that same image serve as validation observations. Entire images should be labeled as “train” or “test”, and pixels drawn as observations accordingly, to avoid “cheating” during model construction based on the inherent redundancy in image data.
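
One way to enforce this image-level separation is a group-wise split; here is a sketch using scikit-learn’s GroupShuffleSplit, with synthetic stand-in data (the array names and sizes are hypothetical):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical pixel-level data: features X, labels y, and the source image of each pixel.
rng = np.random.default_rng(0)
X = rng.random((1000, 5))                    # 1,000 pixels, 5 features each
y = rng.integers(0, 2, size=1000)            # per-pixel class labels
image_ids = rng.integers(0, 20, size=1000)   # each pixel belongs to one of 20 images

# Split whole images, not individual pixels, so no image straddles train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=image_ids))

# No image contributes pixels to both sides of the split.
assert set(image_ids[train_idx]).isdisjoint(image_ids[test_idx])
```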

Conclusion

This posting has only briefly touched on some of the issues which arise when attempting to measure the volume of data in one’s possession, and has not explored yet more subtle concepts such as sampling techniques, observation weighting or model performance measures. Hopefully, though, it gives the reader some things to think about when assessing data sets in terms of their size and quality.

TAGGED: data mining