By using this site, you agree to the Privacy Policy and Terms of Use.
Accept
SmartData Collective
  • Analytics
    AnalyticsShow More
    construction analytics
    5 Benefits of Analytics to Manage Commercial Construction
    5 Min Read
    benefits of data analytics for financial industry
    Fascinating Changes Data Analytics Brings to Finance
    7 Min Read
    analyzing big data for its quality and value
    Use this Strategic Approach to Maximize Your Data’s Value
    6 Min Read
    data-driven seo for product pages
    6 Tips for Using Data Analytics for Product Page SEO
    11 Min Read
    big data analytics in business
    5 Ways to Utilize Data Analytics to Grow Your Business
    6 Min Read
  • Big Data
  • BI
  • Exclusive
  • IT
  • Marketing
  • Software
Search
© 2008-23 SmartData Collective. All Rights Reserved.
Reading: Wikipedia Page Traffic Statistics Dataset
Share
Notification Show More
Latest News
cloud-centric companies using network relocation
Cloud-Centric Companies Discover Benefits & Pitfalls of Network Relocation
Cloud Computing
construction analytics
5 Benefits of Analytics to Manage Commercial Construction
Analytics
database compliance guide
Four Strategies For Effective Database Compliance
Data Management
Digital Security From Weaponized AI
Fortifying Enterprise Digital Security Against Hackers Weaponizing AI
Security
DevOps on cloud
Optimizing Cost with DevOps on the Cloud
Development
Aa
SmartData Collective
Aa
Search
  • About
  • Help
  • Privacy
Follow US
© 2008-23 SmartData Collective. All Rights Reserved.
SmartData Collective > Big Data > Data Mining > Wikipedia Page Traffic Statistics Dataset
Data Mining

Wikipedia Page Traffic Statistics Dataset

Editor SDC
Last updated: 2009/06/12 at 12:08 AM
Editor SDC
7 Min Read
SHARE
- Advertisement -

Contents
WikistatsWikilinks (1.1G)Wikidump (29G)

I’ve published a Wikipedia Page Traffic Data Set containing a 320 GB sample of the data used to power trendingtopics.org (I’ll talk about Trending Topics more in a upcoming post). The EBS snapshot includes 7 months of hourly page traffic statistics for over 8 Million Wikipedia articles (~ 1 TB uncompressed) along with the associated Wikipedia content, linkgraph, & metadata. The english Wikipedia subset contains ~2.5 Million articles.

- Advertisement -

It only takes a couple of minutes to sign up for an Amazon EC2 account and set up access to the data as an EBS volume from the Amazon Management Console.

If you want to work entirely from the command line, you will need to complete the steps in the Getting Started Guide. When you are set up to use EC2, launch a small EC2 Ubuntu instance from your local machine:

More Read

data mining helps with offsite SEO

Can Data Mining Aid with Off-Page SEO Strategies?

3 Data Mining Tips for Companies Trying to Understand their Customers
5 Data Mining Tips to Leverage the Benefits of Surveys
Perform Data Mining With Web Scrapers to Track Prices
Data Mining Vital Statistics Yields Fascinating Societal Insights
    $ ec2-run-instances ami-5394733a -k gsg-keypair -z us-east-1a

Once it is running and you have the instance id, create and attach an EBS Volume using the public snapshot snap-753dfc1c (make sure the volume is created in the same availability zone as the ec2 instance)

    $ ec2-create-volume --snapshot snap-753dfc1c -z us-east-1a    $ ec2-attach-volume vol-ec06ea85 -i i-df396cb6 -d /dev/sdf

…

- Advertisement -

I’ve published a Wikipedia Page Traffic Data Set containing a 320 GB sample of the data used to power trendingtopics.org (I’ll talk about Trending Topics more in a upcoming post). The EBS snapshot includes 7 months of hourly page traffic statistics for over 8 Million Wikipedia articles (~ 1 TB uncompressed) along with the associated Wikipedia content, linkgraph, & metadata. The english Wikipedia subset contains ~2.5 Million articles.

It only takes a couple of minutes to sign up for an Amazon EC2 account and set up access to the data as an EBS volume from the Amazon Management Console.

If you want to work entirely from the command line, you will need to complete the steps in the Getting Started Guide. When you are set up to use EC2, launch a small EC2 Ubuntu instance from your local machine:

    $ ec2-run-instances ami-5394733a -k gsg-keypair -z us-east-1a

Once it is running and you have the instance id, create and attach an EBS Volume using the public snapshot snap-753dfc1c (make sure the volume is created in the same availability zone as the ec2 instance)

- Advertisement -
    $ ec2-create-volume --snapshot snap-753dfc1c -z us-east-1a    $ ec2-attach-volume vol-ec06ea85 -i i-df396cb6 -d /dev/sdf

Next, ssh into the instance and mount the volume

    $ ssh root@ec2-12-xx-xx-xx.z-1.compute-1.amazonaws.com    root@domU-12-xx-xx-xx-75-81:/mnt# mkdir /mnt/wikidata    root@domU-12-xx-xx-xx-75-81:/mnt# mount /dev/sdf /mnt/wikidata

See the README files in each subdirectory for more details on these datasets…

Wikistats

The good stuff is sitting in 5000 files in /mnt/wikidata/wikistats/pagecounts/

    /mnt/wikidata/wikistats/pagecounts# ls -l | wc -l    5068    /mnt/wikidata/wikistats/pagecounts# ls -lh |head    total 260G    -rw-r--r-- 1 root root  49M 2009-02-26 13:34 pagecounts-20081001-000000.gz    -rw-r--r-- 1 root root  46M 2009-02-26 13:34 pagecounts-20081001-010000.gz    -rw-r--r-- 1 root root  47M 2009-02-26 13:34 pagecounts-20081001-020000.gz    -rw-r--r-- 1 root root  44M 2009-02-26 13:34 pagecounts-20081001-030000.gz    -rw-r--r-- 1 root root  45M 2009-02-26 13:34 pagecounts-20081001-040000.gz    -rw-r--r-- 1 root root  47M 2009-02-26 13:35 pagecounts-20081001-050001.gz    -rw-r--r-- 1 root root  45M 2009-02-26 13:35 pagecounts-20081001-060000.gz    -rw-r--r-- 1 root root  50M 2009-02-26 13:35 pagecounts-20081001-070000.gz    -rw-r--r-- 1 root root  51M 2009-02-26 13:35 pagecounts-20081001-080000.gz

This directory contains hourly Wikipedia article traffic logs covering the 7 month period from October 01 2008 to April 30 2009, this data is regularly logged from the wikipedia squid proxy by Domas Mituzas.

Each log file is named with the date and time of collection: pagecounts-20090430-230000.gz

- Advertisement -

Each line has 4 fields:

projectcode, pagename, pageviews, bytes
    en Barack_Obama 997 123091092    en Barack_Obama%27s_first_100_days 8 850127    en Barack_Obama,_Jr 1 144103    en Barack_Obama,_Sr. 37 938821    en Barack_Obama_%22HOPE%22_poster 4 81005    en Barack_Obama_%22Hope%22_poster 5 102081

Wikilinks (1.1G)

Contains a wikipedia linkgraph dataset provided by Henry Haselgrove.

These files contain all links between proper english language Wikipedia pages, that is pages in “namespace 0″. This includes disambiguation pages and redirect pages.

In links-simple-sorted.txt, there is one line for each page that has links from it. The format of the lines is ready for processing by Hadoop:

    from1: to11 to12 to13 ...    from2: to21 to22 to23 ...    ...

where from1 is an integer labelling a page that has links from it, and to11 to12 to13 … are integers labelling all the pages that the page links to. To find the page title that corresponds to integer n, just look up the n-th line in the file titles-sorted.txt.

- Advertisement -

Wikidump (29G)

Contains the raw Wikipedia dumps from March along with some processed versions of the data. One of the useful files I created provides a direct lookup table for wikipedia article redirects in page_lookup_redirects.txt, which can be useful for name standardization and search:

Here is a sample query run when the file is loaded into MySQL:

   mysql> select redirect_title, true_title from page_lookups               where page_id = 534366;   +------------------------------------------------+--------------+   | redirect_title                                 | true_title   |   +------------------------------------------------+--------------+   | Barack_Obama                                   | Barack Obama |   | Barak_Obama                                    | Barack Obama |   | 44th_President_of_the_United_States            | Barack Obama |   | Barach_Obama                                   | Barack Obama |   | Senator_Barack_Obama                           | Barack Obama |                           .....                           .....            | Rocco_Bama                                     | Barack Obama |   | Barack_Obama's                                 | Barack Obama |    | B._Obama                                       | Barack Obama |   +------------------------------------------------+--------------+   110 rows in set (11.15 sec)    

Editor SDC June 12, 2009
Share this Article
Facebook Twitter Pinterest LinkedIn
Share
- Advertisement -

Follow us on Facebook

Latest News

cloud-centric companies using network relocation
Cloud-Centric Companies Discover Benefits & Pitfalls of Network Relocation
Cloud Computing
construction analytics
5 Benefits of Analytics to Manage Commercial Construction
Analytics
database compliance guide
Four Strategies For Effective Database Compliance
Data Management
Digital Security From Weaponized AI
Fortifying Enterprise Digital Security Against Hackers Weaponizing AI
Security

Stay Connected

1.2k Followers Like
33.7k Followers Follow
222 Followers Pin

You Might also Like

data mining helps with offsite SEO
Data Mining

Can Data Mining Aid with Off-Page SEO Strategies?

10 Min Read
using data mining to learn more about customers
Big Data

3 Data Mining Tips for Companies Trying to Understand their Customers

6 Min Read
surveys data
Data Mining

5 Data Mining Tips to Leverage the Benefits of Surveys

11 Min Read
data mining is game changer for small businesses
Data Mining

Perform Data Mining With Web Scrapers to Track Prices

7 Min Read

SmartData Collective is one of the largest & trusted community covering technical content about Big Data, BI, Cloud, Analytics, Artificial Intelligence, IoT & more.

giveaway chatbots
How To Get An Award Winning Giveaway Bot
Big Data Chatbots Exclusive
ai is improving the safety of cars
From Bolts to Bots: How AI Is Fortifying the Automotive Industry
Artificial Intelligence

Quick Link

  • About
  • Contact
  • Privacy
Follow US

© 2008-23 SmartData Collective. All Rights Reserved.

Removed from reading list

Undo
Go to mobile version
Welcome Back!

Sign in to your account

Lost your password?