[DEPRECATED] Wikistats pageview files

Maintained by WMF Analytics

NOTE: This dataset has had some problems, and no new data has been generated since September 2020. We are phasing it out in favor of Pageviews Complete. That new dataset is still a work in progress, with some formatting issues left to fix. When it is finished we will announce it widely and explain how to migrate.

Hourly page views per article for around 30 million article titles (as of Sept 2013) in 800+ Wikimedia wikis. Repackaged (with extreme shrinkage, without losing granularity), corrected, and reformatted. Available as daily files and two monthly files (see notes below).

Hourly page views per wiki, corrected for site outages and underreporting. Also repackaged, as one tar file per year.

Raw data for reports at stats.wikimedia.org.


Notes for hourly page views

Both sets of hourly files are derived from the best data available at the time:

The huge hourly files for page views per article per wiki have been massively compressed by merging 720 files per month (24 hourly files for each of 30 days). This removes a great deal of redundancy: 80% of record space is the article title, and a title can occur in all 720 files. All of this shrinkage is achieved without losing hourly granularity.

Line format:

In the wiki code field, the subproject is the language code (fr, el, ja, etc.) or a name such as meta or commons.

The project is one of b (wikibooks), k (wiktionary), n (wikinews), o (wikivoyage), q (wikiquote), s (wikisource), v (wikiversity), z (wikipedia), m (wikimedia subprojects: commons, meta, species, etc.). An .m project suffix combined with a language subproject, e.g. en.m, means the page counts come from the mobile site.
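
As a rough illustration of the wiki code field, the sketch below splits a code into its subproject and project parts. It assumes the two parts are joined by a dot, as in the en.m example above; the function name split_wiki_code and the PROJECT_NAMES table are hypothetical helpers, not part of any Wikimedia tool. Because m is ambiguous (wikimedia subproject or mobile site), the sketch only describes it and does not try to decide between the two.

```python
# Project letters as listed in the notes above.
PROJECT_NAMES = {
    "b": "wikibooks",
    "k": "wiktionary",
    "n": "wikinews",
    "o": "wikivoyage",
    "q": "wikiquote",
    "s": "wikisource",
    "v": "wikiversity",
    "z": "wikipedia",
    # Ambiguous: wikimedia subproject (commons, meta, ...) or,
    # with a language subproject such as "en", the mobile site.
    "m": "wikimedia subproject or mobile site",
}


def split_wiki_code(wiki_code):
    """Split e.g. 'fr.b' into ('fr', 'b', 'wikibooks').

    Assumes a single dot separates subproject and project.
    """
    subproject, dot, project = wiki_code.partition(".")
    if not dot or not subproject or not project:
        raise ValueError(f"expected 'subproject.project', got {wiki_code!r}")
    # Unknown project letters are passed through with a generic label.
    description = PROJECT_NAMES.get(project, "unknown project")
    return subproject, project, description


if __name__ == "__main__":
    print(split_wiki_code("fr.b"))
    print(split_wiki_code("commons.m"))
```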

Hourly counts can be deciphered as follows:

Hour:
from 0 to 23, written as 0 = A, 1 = B ... 22 = W, 23 = X
Day:
from 1 to 31, written as 1 = A, 2 = B ... 25 = Y, 26 = Z, 27 = [, 28 = \, 29 = ], 30 = ^, 31 = _
Example: 33 views on day 2, hour 4, and 155 views on day 3, hour 7 are coded as 'BE33,CH155'
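
The letter coding above can be decoded with simple character arithmetic, since the day symbols A to Z followed by [, \, ], ^ and _ are consecutive ASCII characters. The following is a minimal sketch, not an official parser: the function name decode_hourly_counts is hypothetical, and it assumes entries are separated by commas exactly as in the example.

```python
def decode_hourly_counts(encoded):
    """Decode a string such as 'BE33,CH155' into (day, hour, views) tuples."""
    result = []
    for item in encoded.split(","):
        if not item:
            continue  # tolerate empty entries, e.g. from a trailing comma
        if len(item) < 3:
            raise ValueError(f"entry too short: {item!r}")
        day = ord(item[0]) - ord("A") + 1   # 'A' -> 1 ... 'Z' -> 26 ... '_' -> 31
        hour = ord(item[1]) - ord("A")      # 'A' -> 0 ... 'X' -> 23
        views = item[2:]
        if not (1 <= day <= 31 and 0 <= hour <= 23):
            raise ValueError(f"day or hour out of range: {item!r}")
        if not (views.isascii() and views.isdigit()):
            raise ValueError(f"view count is not a number: {item!r}")
        result.append((day, hour, int(views)))
    return result


if __name__ == "__main__":
    # The example from the notes: day 2 hour 4, and day 3 hour 7.
    print(decode_hourly_counts("BE33,CH155"))
```

For the example string, this yields 33 views for day 2, hour 4, and 155 views for day 3, hour 7.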

Source for this information: https://lists.wikimedia.org/pipermail/wikitech-l/2011-August/054591.html.


All Analytics datasets are available under the Creative Commons CC0 dedication.