
Decision Tree Bagging System (R code)

Editor SDC

I previously posted a note on decision trees, then explained how they can be improved by model averaging over ensembles of trees trained on bootstrap samples. I then implemented the idea in Matlab, and now I’m finally sharing it here coded in R, with an example to walk through. This should be the simplest way to learn how a trading system like this works, and it’s open source.

The code is concise at about 100 lines. Here’s the main system, the sample data used in the example below, and a small harness to load the data and configure the workspace.
As I’ve mentioned multiple times, machine learning systems can take in basically any data and then automatically harvest as much alpha as possible from it. The differences between an advanced algorithm (tree bagging, SVMs, etc.) and a primitive one (linear regression, nearest neighbors, LDA, etc.) usually translate, in trading, to finding more complex nonlinear patterns, better control of overfitting, and, of course, slower runtimes.
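If the mechanics are new to you, here is a minimal sketch of tree bagging in R, separate from my system’s code (the rpart package and the function names are my choices for illustration): fit one tree per bootstrap resample, then combine the trees by majority vote.

  library(rpart)  # CART decision trees

  # Fit n.trees trees, each on a bootstrap resample of the rows.
  # X is a data frame of features; y is a factor of +1/-1 labels.
  bag.trees <- function(X, y, n.trees = 40) {
    train <- cbind(X, y = y)
    lapply(1:n.trees, function(i) {
      idx <- sample(nrow(train), replace = TRUE)  # bootstrap sample
      rpart(y ~ ., data = train[idx, ], method = "class")
    })
  }

  # Ensemble prediction: each tree votes and the majority wins.
  bag.predict <- function(trees, X.new) {
    votes <- sapply(trees, function(tr) as.character(predict(tr, X.new, type = "class")))
    votes <- matrix(votes, nrow = nrow(X.new))
    apply(votes, 1, function(v) names(which.max(table(v))))
  }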
In this example we’re going to try to squeeze some alpha out of GLD, an actively traded gold ETF. After a little thought, we decide on some inputs that might have predictive value given what we know about macroeconomics and the market: the movements of two big gold miners, Freeport-McMoRan (FCX) and Rio Tinto’s ADR (RTP), bonds (DHY), the performance of the financial sector (XLF), and the S&P 500 (SPY). I recommend using factors such as bond prices, the overall market, and the price of relevant commodities in any machine learning system, because I’ve found they often improve performance. If you look at the sample data, which you should have downloaded above, you’ll see I’ve compiled all of this for you. We will use weekly periods and backtest as far back as 2001.
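If you would rather rebuild the inputs than use my file, the gist is just weekly log returns arranged in columns. A minimal sketch, assuming a CSV of weekly closing prices with one column per ticker (the file name and layout are placeholders, not necessarily my sample data’s format):

  # Weekly log returns from a CSV of weekly closes, one column per ticker.
  # File and column names are placeholders; match them to your data.
  prices  <- read.csv("weekly_closes.csv")
  tickers <- c("GLD", "FCX", "RTP", "DHY", "XLF", "SPY")
  rets    <- apply(prices[, tickers], 2, function(p) diff(log(p)))

  data    <- rets[, c("FCX", "RTP", "DHY", "XLF", "SPY")]  # predictors, by columns
  targets <- ifelse(rets[, "GLD"] > 0, 1, -1)              # 1 = long, -1 = short
  returns <- rets[, "GLD"]                                 # used for the equity curve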
To run the system from the code I’ve provided, open R (on Windows it’s RGui; download the installer from r-project.org) and first enter the command > setwd('C:\\[whatever folder you downloaded the files into]'). This sets the working directory. Next, copy and paste or type > source('rungoldtreesys.r') to load the data and the tree bagging system code. Now you can backtest the system with whatever parameters you’d like. In the continuation of the example below, I ran it with this command and these parameters: > factormodel.tree(data, targets, returns, btsamples=130, horizon=1, trainperiods=8, leverage='kelly', keepNFeatures=10, treesInBag=40, endPd=150).
Now I’ll explain how to interpret the results. While the system is backtesting, it prints the prediction at each period so you can see how fast it’s running. The final text output summarizes all the predictions and confidence values, along with the overall accuracy as the fraction of predictions that were correct. There are also three plots, showing the estimated importance of each variable and the decrease in out-of-bag error rate as trees are added to the ensemble (generated only for the first backtest period, just to give an idea; you don’t want hundreds of charts). Here’s one I got estimating variable importance:
We find that the previous period’s return is the most useful, followed by FCX’s return over the previous 4 weeks and bond yield levels. FCXHighLow is the difference between FCX’s weekly high and low, and FCXVolNorm is the volume of FCX shares traded during the week; both were found to be useless, as we might expect. Read more about tree bagging to learn exactly how importance is measured.
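In short, the standard recipe estimates importance by permutation: shuffle one variable at a time in rows a tree never trained on and measure how much accuracy drops; useless variables barely move it. A minimal sketch, again separate from my system’s code (the function name is mine):

  # Permutation importance sketch: shuffle one feature at a time in held-out
  # rows and measure the drop in accuracy; averaging over trees gives the chart.
  perm.importance <- function(tree, X.oob, y.oob) {
    acc  <- function(X) mean(as.character(predict(tree, X, type = "class")) ==
                             as.character(y.oob))
    base <- acc(X.oob)
    sapply(names(X.oob), function(col) {
      X.perm <- X.oob
      X.perm[[col]] <- sample(X.perm[[col]])  # destroy this feature's information
      base - acc(X.perm)                      # large drop = important feature
    })
  }

Next we look at the error rate of the ensemble as more decision trees are “grown”: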
During feature selection the error rate falls and then rises as the ensemble gets “confused” by the useless variables we identified above. Then, in the actual model building, accuracy finishes at about 57.5%. This is just the model built by the backtester to predict one period into the future. The real power of ensemble/bagging learners is that the error keeps falling as more components are added, up to a point.
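You can reproduce that picture with the toy functions above: score growing prefixes of the ensemble on rows held out of training, assuming trees came from bag.trees(). (True out-of-bag scoring per tree takes a little more bookkeeping, so this sketch uses a single holdout split, X.holdout and y.holdout, instead.)

  # Error rate of the growing ensemble, one point per bag size.
  err.by.size <- sapply(seq_along(trees), function(k) {
    pred <- bag.predict(trees[1:k], X.holdout)
    mean(pred != as.character(y.holdout))
  })
  plot(err.by.size, type = "l", xlab = "trees in bag", ylab = "error rate")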
Finally, let’s look at the equity curve produced with the parameters above. Note that one more parameter, the random seed, controls how the bootstrap samples are drawn, so results vary from run to run.
Over 120 weeks (about 2 years), the system made about 90%. Two things to keep in mind: this ignores trading costs (which should be negligible here, since it’s weekly trading of a single security), and, more importantly, it uses full Kelly betting, which is probably too volatile for a human to tolerate; above we see a 40% drawdown. However, when searching for alpha it’s good to have sensitive tools.
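For reference on the sizing: with an even-payoff up/down bet, the full-Kelly fraction works out to f* = p - (1 - p) = 2p - 1, where p is the probability the trade wins, so the stake scales directly with the edge:

  # Full Kelly for an even-payoff directional bet: stake 2p - 1 of equity,
  # where p is the estimated probability the trade wins.
  kelly.fraction <- function(p) 2 * p - 1
  kelly.fraction(0.575)  # 0.15: a 57.5%-accurate signal bets 15% of equity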
If you give this system data and buy/short targets, it will pull as much alpha from the data as is possible for the underlying algorithm.
Finally, I’ll explain the system’s parameters so you can experiment with and modify the code yourself (an example call with different settings follows the list).
data : All the data you’re giving to the system, arranged by columns
targets : Either 1 or -1 for a long or short position, aligned in time with the corresponding data
returns : Similar to targets, but used for computing the equity curve
backtest : Don’t mess with this one; possible future functionality
verbose : Same as above
btsamples : The number of periods to evaluate the system on
skip : Keep one of every skip data points (ignore the other skip-1). Used for faster testing over long periods
horizon : Number of periods out to predict (equivalently, number of lags for the data)
dataperiods : Don’t mess with this one; possible future functionality
speedUpFactor : Whether or not to train a model every backtest period; not tested, likely broken
trainperiods : Number of periods of data to train on. More = longer memory of the past; less = shorter memory
leverage : Either ‘kelly’ or a positive decimal number, e.g. 2 means 2X returns
keepNFeatures : Number of features to retain after feature selection
treesInBag : Number of trees to grow. More trees smooth the confidence values but take longer
startPd & endPd : Used to test over a specific interval. No date functionality; kept simple for now.
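For instance, a longer and less aggressive run than the example above (the parameter values here are purely illustrative):

  # Longer test window, fixed 1X leverage, and a bigger bag of trees.
  factormodel.tree(data, targets, returns, btsamples = 200, horizon = 1,
                   trainperiods = 16, leverage = 1, keepNFeatures = 8,
                   treesInBag = 100, endPd = 220)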
Please leave a message if you have any suggestions, questions, or ideas. 
