R Programming Language

Benchmarking bigglm

By Joseph Rickert

In a recent blog post, David Smith reported on a talk that Steve Yun and I gave at STRATA in NYC about building and benchmarking Poisson GLM models on various platforms. The results presented showed that the rxGlm function from Revolution Analytics' RevoScaleR package, running on a five-node cluster, outperformed both a MapReduce/Hadoop implementation and an implementation in legacy software running on a large server. An alert R user posted the following comment on the blog:

As a poisson regression was used, it would be nice to also see as a benchmark the computational speed when using the biglm package in open source R? Just import your csv in sqlite and run biglm to obtain your poisson regression. Biglm also loads in data in R in chunks in order to update the model so that looks more similar to the RevoScaleR setup then just running plain glm in R.

This seemed like a reasonable, simple enough experiment, so we tried it. The benchmark results presented at STRATA were based on a 145-million-record file, but as a first step I thought I would try a 14-million-record subset that I already had loaded on my PC, a quad-core Dell with an i7 processor and 8 GB of RAM. It took almost an hour to build the SQLite database:

# Make a SQLite database out of the csv file
library(sqldf)
sqldf("attach AdataT2SQL as new")
file <- file.path(getwd(), "AdataT2.csv")
read.csv.sql(file,
             sql = "create table main.AT2_10Pct as select * from file",
             dbname = "AdataT2SQL", header = TRUE)

. . . and then just a couple of lines of code to set up the connections and run the model.

# BIGGLM
library(biglm)
library(RSQLite)   # provides dbConnect(), dbDriver(), dbGetQuery()

#--------------------
# Set up the database connections
# ('formula' holds the Poisson model specification defined earlier)
tablename <- "main.AT2_10Pct"
DB        <- file.path(getwd(), "AdataT2SQL")
conn      <- dbConnect(dbDriver("SQLite"), dbname = DB)
modelvars <- all.vars(formula)
query     <- paste("select ", paste(modelvars, collapse = ", "), " from ", tablename)

#--------------------
# Run bigglm
gc()
system.time(
  model <- bigglm(formula = formula,
                  data = dbGetQuery(conn, query),
                  family = poisson(),
                  chunksize = 10000,
                  maxit = 10)
)

Unfortunately, the model didn’t run to completion.  The error messages returned were of the form:

#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
#  contrasts can be applied only to factors with 2 or more levels
#In addition: There were 50 or more warnings (use warnings() to see the first 50)
#Timing stopped at: 186.39 80.81 470.33

warnings()
#1: In model.matrix.default(tt, mf) : variable 'V1' converted to a factor
#2: In model.matrix.default(tt, mf) : variable 'V2' converted to a factor

This error suggests that, while chunking through the data, bigglm came across a variable that needed to be converted into a factor. But since the chunk in memory contained only one value for that variable, bigglm threw an error.
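The failure mode is easy to reproduce in open-source R with a toy data frame (hypothetical data, not from the benchmark): fitting a Poisson glm on a "chunk" in which a character column takes only one value dies with the same contrasts error.

# Minimal sketch of the failure mode (hypothetical data):
# in this chunk the predictor V1 takes only one value,
# so the factor it becomes has a single level.
chunk <- data.frame(y  = rpois(100, lambda = 2),
                    V1 = rep("A", 100),
                    stringsAsFactors = TRUE)
glm(y ~ V1, family = poisson(), data = chunk)
# Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
#   contrasts can be applied only to factors with 2 or more levels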

In general, factors present a significant challenge for external-memory algorithms. Not only might an algorithm fail to create factor variables; even when the algorithm runs, there may be unanticipated consequences that cause big trouble downstream. For example, variations in text can cause automatic factor conversion to create several versions of the same level. This, in turn, may make it impossible to merge files, or cause an attempt to predict results on a hold-out data set to fail because the factor levels differ. Even more insidiously, when hundreds of variables are involved in a model, an analyst might not notice a few bogus factor levels.
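A small illustration of the text-variation problem (column name and values are made up): a stray lowercase value in a hold-out file silently produces a different set of levels. Declaring the levels explicitly at least turns the mismatch into a visible NA.

# Hypothetical illustration: a typo in the hold-out data
# creates a bogus extra factor level.
train <- data.frame(region = factor(c("North", "South", "South")))
test  <- data.frame(region = factor(c("north", "South")))  # note "north"

levels(train$region)  # "North" "South"
levels(test$region)   # "north" "South"  -- silently different

# Forcing the training levels makes the bad value an NA you can see:
test$region <- factor(as.character(test$region),
                      levels = levels(train$region))
test$region  # <NA> South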

bigglm does not provide a mechanism for setting factor levels on the fly. In my opinion, far from being a fault, this was an intelligent design choice. rxGlm, RevoScaleR's function for building GLM models, does provide some capability to work with factors on the fly, but this is not recommended practice: too many things can go wrong. The recommended way to do things is to use RevoScaleR's rxFactors function on data stored in RevoScaleR's native .XDF file format. rxFactors gives the user very fine control of factor variables: levels can be set, sorted, created and merged.
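rxFactors itself belongs to the proprietary RevoScaleR package, but for in-memory data the same kinds of operations look like this in base R (a sketch with made-up values):

x <- c("low", "med", "high", "med")

f <- factor(x, levels = c("low", "med", "high"))  # set the levels and their order
f <- relevel(f, ref = "med")                      # choose the reference level
levels(f)[levels(f) == "high"] <- "med"           # merge "high" into "med"
levels(f)                                         # "med" "low"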

The analogous course of action with bigglm would be to set up the factor variables properly in the database. Whenever I have database problems, my go-to guy is my colleague Steve Weller. Steve loaded the data into a MySQL database installed on a quad-core PC with 8 GB of RAM running Windows 2008 Server R2 Standard. He manually added new indicator variables to the database corresponding to the factor levels in the original model, and built a model that was almost statistically equivalent to the original (we never quite got the contrasts right) but good enough to benchmark. It took bigglm about 27 minutes to run working off the MySQL database; by comparison, rxGlm completed in less than a minute on Steve's test machine. We have not yet tried to run bigglm on the entire 145-million-record dataset. It would be nice to know whether bigglm scales linearly with the number of records. If it does, the roughly tenfold increase in rows would bring bigglm in at about 4.5 hours for the entire data set, considerably longer than the 54 minutes it took RevoScaleR to process the large data set on my PC.
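For anyone reproducing this, here is a hedged sketch of that indicator-variable workaround, written against the SQLite database built above rather than Steve's MySQL setup, and with a made-up column name (region):

# Hypothetical sketch: add a 0/1 indicator column for one factor level,
# so bigglm sees plain numeric columns and never builds the factor itself.
library(DBI)
library(RSQLite)

conn <- dbConnect(SQLite(), "AdataT2SQL")
dbExecute(conn, "ALTER TABLE main.AT2_10Pct ADD COLUMN region_south INTEGER")
dbExecute(conn, "UPDATE main.AT2_10Pct
                 SET region_south = CASE WHEN region = 'South' THEN 1 ELSE 0 END")
# Repeat for every non-reference level of every categorical variable,
# then use the indicator columns in the bigglm formula.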

It would be nice to hear from R users who have built bigglm models on large data sets. Unfortunately, I cannot make the proprietary data set used in the benchmark available. However, it should not be too difficult to find a suitable publicly-available substitute. Our benchmark data set had 145,814,000 rows and 139 variables. These included integer, numeric, character and factor variables. There were 40 independent variables in the original model. If you try, be prepared to spend a little time on the project. It is not likely to be as easy as the beguiling phrase in the comment to the blog post (“Just import your csv in sqlite and run biglm…”) would indicate.
