
Principal Components for Modeling

Editor SDC

Problem Statement

Analysts constructing predictive models frequently encounter the need to reduce the size of the available data, both in terms of variables and observations. One reason is that data sets are now available which are far too large to be modeled directly in their entirety using contemporary hardware and software. Another reason is that some data elements (variables) have an associated cost. For instance, medical tests bring an economic and sometimes human cost, so it would be ideal to minimize their use if possible. Another problem is overfitting: Many modeling algorithms will eagerly consume however much data they are fed, but increasing the size of this data will eventually produce models of increased complexity without a corresponding increase in quality. Model deployment and maintenance, too, may be encumbered by extra model inputs, in terms of both execution time and required data preparation and storage.

Naturally, the goal in data reduction is to decrease the size of the needed data while maintaining, as much as possible, model performance; this process must be performed carefully.

A Solution: Principal Components

Selection of candidate predictor variables to retain (or to eliminate) is the most obvious way to reduce the size of the data. If model performance is not to suffer, though, then some effective measure of each variable’s usefulness in the final model must be employed, which is complicated by the correlations among predictors. Several important procedures have been developed along these lines, such as forward selection, backward selection, and stepwise selection.
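
As a rough, illustrative sketch (not from the original article), forward selection of the kind just mentioned can be run with scikit-learn's SequentialFeatureSelector; the estimator, dataset, and the budget of four retained variables below are assumptions chosen purely for demonstration.

```python
# Illustrative sketch: forward variable selection with scikit-learn.
# The dataset, estimator, and number of variables to keep are assumptions,
# not choices made in the article.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)   # 10 candidate predictor variables

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=4,   # assumed budget of retained variables
    direction="forward",      # "backward" starts full and drops variables instead
    cv=5,                     # cross-validated scoring guards against overfitting
)
selector.fit(X, y)
print("Retained columns:", selector.get_support(indices=True))
```

Changing direction to "backward" gives backward elimination; classic stepwise selection alternates between adding and dropping variables.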

Another possibility is principal components analysis (“PCA” to its friends), a procedure from multivariate statistics which yields a new set of variables (the same number as before), called the principal components. Conveniently, all of the principal components are simply linear functions of the original variables. As a side benefit, all of the principal components are completely uncorrelated. The technical details will not be presented here (see the reference, below), but suffice it to say that if 100 variables enter PCA, then 100 new variables (called the principal components) come out. You are now wondering, perhaps, where the “data reduction” is? Simple: PCA constructs the new variables so that the first principal component exhibits the largest variance, the second principal component exhibits the second largest variance, and so on.

How well this works in practice depends completely on the data. In some cases, though, a large fraction of the total variance in the data can be compressed into a very small number of principal components. The data reduction comes when the analyst decides to retain only the first n principal components.
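
As a concrete (and hedged) illustration of that variance compression, here is a minimal scikit-learn sketch; the dataset, the standardization step, and the choice to keep three components are assumptions added for this example, not part of the original article.

```python
# Illustrative sketch: squeezing many variables into a few principal components.
# Dataset and the number of retained components are assumptions for this example.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)   # 30 original variables
X_std = StandardScaler().fit_transform(X)    # put variables on a common scale

pca = PCA(n_components=3)                    # retain only the first 3 components
scores = pca.fit_transform(X_std)            # reduced data: one row per observation, 3 columns

print("Fraction of total variance kept:", pca.explained_variance_ratio_.sum())
print("Loading matrix shape:", pca.components_.shape)   # 3 x 30: all originals still used
```

Standardizing first is a common (though optional) choice, since otherwise variables measured on large numeric scales dominate the variance that PCA concentrates.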

Note that PCA does not eliminate the need for the original variables: they are all still used in the calculation of the principal components, no matter how few of the principal components are retained. Also, statistical variance (which is what is concentrated by PCA) may not correspond perfectly to “predictive information”, although it is often a reasonable approximation.

Last Words

Many statistical and data mining software packages will perform PCA, and it is not difficult to write one’s own code. If you haven’t tried this technique before, I recommend it: It is truly impressive to see PCA squeeze 90% of the variance in a large data set into a handful of variables.
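
For readers who do want to roll their own, a bare-bones version can be built from the eigenvectors of the covariance matrix (the eigenanalysis route noted below); this is only a minimal sketch, and the function name and toy data are illustrative assumptions.

```python
# Illustrative sketch: PCA from scratch via eigenanalysis of the covariance matrix.
# Function name and toy data are assumptions for demonstration only.
import numpy as np

def principal_components(X):
    """Return loadings (as columns) and component variances, largest variance first."""
    X_centered = X - X.mean(axis=0)           # center each variable
    cov = np.cov(X_centered, rowvar=False)    # covariance matrix of the variables
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigen-decomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]         # sort components by descending variance
    return eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                # toy data: 500 observations, 10 variables
loadings, variances = principal_components(X)
scores = (X - X.mean(axis=0)) @ loadings[:, :2]   # keep only the first 2 components
print("Share of variance kept:", variances[:2].sum() / variances.sum())
```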

Note: Related terms from the engineering world: eigenanalysis, eigenvector and eigenfunction.

Reference

For the down-and-dirty technical details of PCA (with enough information to allow you to program PCA), see:

Multivariate Statistical Methods: A Primer, by Manly (ISBN: 0-412-28620-3)

Note: The first edition is adequate for coding PCA, and is at present much cheaper than the second or third editions.

Tagged: data quality, data reduction, predictive modeling