SmartData Collective
© 2008-23 SmartData Collective. All Rights Reserved.

Simple Tools for Building a Recommendation Engine

DavidMSmith
Last updated: 2012/04/19 at 3:35 PM

Revolution’s resident economist, Saar Golde, is very fond of saying that “90% of what you might want from a recommendation engine can be achieved with simple techniques”. To illustrate this point (without doing a lot of work), we downloaded the million-row movie dataset from www.grouplens.org with the idea of taking just the first obvious exploratory step: finding the good movies. Three zipped .dat files comprise this dataset. The first file, ratings.dat, contains 1,000,209 records of UserID, MovieID, Rating, and Timestamp for 6,040 users rating 3,952 movies. Ratings are whole numbers on a 1 to 5 scale. The second file, users.dat, contains the UserID, Gender, Age, Occupation and Zip-code for each user. The third file, movies.dat, contains the MovieID, Title and Genre associated with each movie.
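As a rough illustration of the file layout described above, here is a minimal base R sketch that parses a few lines in the "::"-delimited format the GroupLens .dat files use. The helper name read_dat and the sample records are assumptions made for this example; they do not come from our scripts, which use rxImport instead.

```r
# Hypothetical helper: read a "::"-delimited file into a data frame.
read_dat <- function(path, col_names) {
  lines <- readLines(path)
  # Split each line on the literal "::" separator
  parts <- strsplit(lines, "::", fixed = TRUE)
  df <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
  names(df) <- col_names
  df
}

# Illustrative sample records (made up for this sketch)
tmp <- tempfile(fileext = ".dat")
writeLines(c("1::10::5::978300760",
             "1::20::3::978302109",
             "2::10::4::978301968"), tmp)

ratings <- read_dat(tmp, c("UserID", "MovieID", "Rating", "Timestamp"))

# Everything arrives as character; convert the numeric columns
ratings$Rating    <- as.integer(ratings$Rating)
ratings$Timestamp <- as.integer(ratings$Timestamp)
str(ratings)
```

The same helper would apply to users.dat and movies.dat with their own column names.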

Although the movies dataset is not a large file and can easily be read into my laptop’s memory (a Dell with four 1.86GHz cores and 8GB of RAM), it turns out that the laptop doesn’t provide R with enough resources to run the simple regression model we had in mind. To get around this, we used the rxImport function to import the files into the .XDF format used by the RevoScaleR package, transformed all of the variables except the timestamp into factors, and merged them into one big file called “RUM”. The script “XDF import.R” shows how all of this was done. Everything in this script is straightforward except maybe the few lines of code beginning at line 101, where a little work is needed to reconcile the levels of the MovieID factor variable between the ratings and movies files. Working with factors can be a little tricky in R, so it is nice to have the function rxFactors for setting factor levels in XDF files.
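To show why the factor levels need reconciling, here is a small base R sketch of the idea on in-memory data frames. It is not the rxFactors code from “XDF import.R”; the toy tables and the variable names are assumptions for illustration only.

```r
# Toy data: the movies table lists a movie (ID 3) that nobody rated
movies  <- data.frame(MovieID = c(1, 2, 3),
                      Title   = c("Movie A", "Movie B", "Movie C"))
ratings <- data.frame(UserID  = c(1, 1, 2),
                      MovieID = c(2, 1, 2),
                      Rating  = c(4, 5, 3))

# Build both factors over one shared set of levels, so that the same
# level always refers to the same movie in both tables
all_ids <- sort(union(movies$MovieID, ratings$MovieID))
movies$MovieID  <- factor(movies$MovieID,  levels = all_ids)
ratings$MovieID <- factor(ratings$MovieID, levels = all_ids)

# With matching levels, merging on MovieID lines up as expected
rum <- merge(ratings, movies, by = "MovieID")
rum
```

If each factor were built separately, the ratings table would have two levels and the movies table three, which is the kind of mismatch that has to be fixed before merging.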

Once we had the RUM file, we could take the first naïve step of finding the best movie according to the user ratings. This can be done with the RevoScaleR function rxCube, which produces a cross-tabulation in long form. The following code, which tabulates ratings by title, produces the results in Table 1.

cube.3

head(cube.3)

Table 1

   fTitle                    Rating   Counts
1  Toy Story (1995)          4.147      2077
2  GoldenEye (1995)          3.541       888
3  City Hall (1996)          3.063       128
4  Curdled (1996)            3.050        20
5  Ed’s Next Move (1996)     4.250         8
6  Extreme Measures (1996)   2.942       121

Rating contains the average rating and Counts gives the number of ratings that went into computing the average. After sorting cube.3 and including only movies with more than 50 ratings we come up with our top-six list in Table 2.

Table 2

   Title                                                                Rating   Counts
1  Sanjuro (1962)                                                       4.609        69
2  Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)  4.561       628
3  Shawshank Redemption, The (1994)                                     4.555      2227
4  Godfather, The (1972)                                                4.525      2223
5  Close Shave, A (1995)                                                4.521       657
6  Usual Suspects, The (1995)                                           4.517      1783
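The averaging, filtering and sorting behind Tables 1 and 2 can be sketched in base R as follows. This is only an illustration on made-up ratings, using tapply rather than rxCube; the cutoff variable min_count is a name chosen for this example.

```r
# Toy ratings table (made-up values)
rum <- data.frame(
  fTitle = c("A", "A", "A", "B", "B", "C"),
  Rating = c(5, 4, 5, 3, 4, 5)
)

# Mean rating and number of ratings per title
means  <- tapply(rum$Rating, rum$fTitle, mean)
counts <- tapply(rum$Rating, rum$fTitle, length)
cube <- data.frame(fTitle = names(means),
                   Rating = as.numeric(means),
                   Counts = as.integer(counts))

# Keep titles with enough ratings, then sort from best to worst.
# The article uses a cutoff of 50; a cutoff of 1 suits this tiny example.
min_count <- 1
top <- cube[cube$Counts > min_count, ]
top <- top[order(-top$Rating), ]
top
```

Title C has the highest average here but only one rating, so the cutoff removes it, which is the same reason Ed’s Next Move (8 ratings) does not appear in Table 2.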

(The code for this and the rest of our analysis is in the script “Best movie.R”.) This list certainly looked good to me: two films by Akira Kurosawa on top and The Godfather not far behind. These people know a good film when they see one. Well, maybe so, but maybe taking the ratings at face value from these particular movie buffs is not the best we can do. Rather than just averaging ratings, we can use a simple regression to look for the best movie while controlling for the users. The RevoScaleR functions to do this are:

form

mod.1

The regression looks innocuous enough. However, since UserID and fTitle are both factors, we are computing a regression with 9,746 fixed effects! lm ran out of resources on my laptop trying to do this regression, but rxLinMod completed the job in about 53 seconds. Sorting the model coefficients from high to low produces a new best movie list. Table 3 includes enough movies to show where the movies in our original list landed.

Table 3

    Title                                                                Coef    Coef Std
1   Song of Freedom (1936)                                               3.298   0.9127
2   Mamma Roma (1962)                                                    2.173   0.6445
3   Schlafes Bruder (Brother of Sleep) (1995)                            1.717   0.9159
4   Hour of the Pig, The (1993)                                          1.637   0.6410
5   Smashing Time (1967)                                                 1.636   0.6481
6   Ulysses (Ulisse) (1954)                                              1.583   0.9133
7   Gate of Heavenly Peace, The (1995)                                   1.577   0.5244
8   Apple, The (Sib) (1998)                                              1.554   0.3058
9   I Am Cuba (Soy Cuba/Ya Kuba) (1964)                                  1.532   0.4076
10  Lamerica (1994)                                                      1.490   0.3243
11  One Little Indian (1973)                                             1.404   0.9066
12  Follow the Bitch (1998)                                              1.401   0.9178
13  Sanjuro (1962)                                                       1.382   0.1213
14  Bells, The (1926)                                                    1.323   0.6404
15  Lured (1947)                                                         1.315   0.9090
16  Leather Jacket Love Story (1997)                                     1.311   0.6410
17  Chain of Fools (2000)                                                1.298   0.9127
18  Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)  1.281   0.0645
19  Shawshank Redemption, The (1994)                                     1.254   0.0568
20  Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)                        1.253   0.0678
21  For All Mankind (1989)                                               1.245   0.1822
22  Jar, The (Khomreh) (1992)                                            1.243   0.9045
23  Paths of Glory (1957)                                                1.232   0.0801
24  Close Shave, A (1995)                                                1.230   0.0641
25  Usual Suspects, The (1995)                                           1.229   0.0576
26  Godfather, The (1972)                                                1.229   0.0568
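For readers without RevoScaleR, the shape of this regression can be sketched with base R’s lm on a toy dataset small enough to fit in memory. The data below are made up, and the sketch relies on lm’s default treatment contrasts, so each movie coefficient is measured relative to the first movie level; this may differ from how rxLinMod reports its coefficients.

```r
# Made-up ratings: four users, three movies, not every pair observed
rum <- data.frame(
  UserID = factor(c("u1", "u1", "u1", "u2", "u2", "u2",
                    "u3", "u3", "u4", "u4")),
  fTitle = factor(c("M1", "M2", "M3", "M1", "M2", "M3",
                    "M1", "M2", "M2", "M3")),
  Rating = c(3, 4, 5, 1, 3, 4, 2, 4, 5, 5)
)

# Movie effects while controlling for who did the rating
mod <- lm(Rating ~ fTitle + UserID, data = rum)
coefs <- summary(mod)$coefficients

# Pull out the movie coefficients and their standard errors
movie_rows <- grepl("^fTitle", rownames(coefs))
movie_coefs <- data.frame(
  Title   = sub("^fTitle", "", rownames(coefs)[movie_rows]),
  Coef    = coefs[movie_rows, "Estimate"],
  CoefStd = coefs[movie_rows, "Std. Error"],
  row.names = NULL
)

# Sort from high to low, as in Table 3
movie_coefs[order(-movie_coefs$Coef), ]
```

On the full dataset this same formula creates thousands of dummy columns, which is why lm ran out of resources on the laptop.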

Pretty interesting: not only did the original top six move down, they also moved out of their original order. Moreover, having done the regression, we also have the standard deviations of the coefficients available. Now we have not only a measure of how good a movie is but also an estimate of the uncertainty of its quality, and we may want to use this to adjust our recommendations. For example, the 17th movie on the list, Chain of Fools, has a coefficient of 1.298 with a standard deviation of 0.9127. This is a fairly large spread: one standard deviation down already gives about 0.385. Near the bottom of its range of uncertainty, the coefficient of Chain of Fools would be smaller than the coefficients of the rest of the movies on the top-26 list. But just below Chain of Fools, in the 18th slot, Seven Samurai has a coefficient of 1.281 with a much smaller standard deviation: 0.0645. It appears that this would be a better movie to recommend.
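One simple way to act on this is to rank movies by a conservative score, the coefficient minus some multiple of its standard deviation. The sketch below applies that idea to four rows copied from Table 3. The multiplier k = 2 and the column name Lower are assumptions for illustration; the analysis above does not prescribe a particular adjustment.

```r
# Four rows taken from Table 3
top <- data.frame(
  Title   = c("Song of Freedom (1936)",
              "Sanjuro (1962)",
              "Chain of Fools (2000)",
              "Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)"),
  Coef    = c(3.298, 1.382, 1.298, 1.281),
  CoefStd = c(0.9127, 0.1213, 0.9127, 0.0645)
)

# Conservative score: coefficient minus k standard deviations (k is an assumption)
k <- 2
top$Lower <- top$Coef - k * top$CoefStd

# Rank by the conservative score instead of the raw coefficient
ranked <- top[order(-top$Lower), ]
ranked[, c("Title", "Coef", "CoefStd", "Lower")]
```

With this choice of k, Chain of Fools drops to the bottom of the four while Seven Samurai moves ahead of it, which matches the intuition described above.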

Although we have not shown it, our model also gives user fixed effects. So, we can see who the tougher raters are and, combined with other information about users, we might be able to figure out how to recognize a tough movie rater.

I think Saar made his point: good progress can be made with simple tools. Moreover, by thinking like a statistician (or an economist) and not merely letting the machine learning algorithms drive the process, it is possible to gain useful insight into how to make better recommendations.

 

By Joseph Rickert
