Analytics Datasets: Commons Impact Metrics
Overview
This collection of datasets details how Commons media is edited, used, and accessed across Wikimedia projects. Currently, we publish data about categories on an allow-list, curated jointly with the GLAM community.
The files available for download are all in TSV (tab-separated values) format, with list values separated by the "|" character. The schema for each dataset is detailed below.
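As a minimal sketch of reading one of these files, the Python below parses tab-separated rows and splits a list-valued field on "|". The column order, the absence of a header row, and the sample row are all assumptions made for illustration; check them against the actual file before relying on this.

```python
import csv
import io

LIST_SEPARATOR = "|"


def split_list(value):
    """Split a "|"-separated list field; an empty field yields an empty list."""
    return value.split(LIST_SEPARATOR) if value else []


def read_rows(handle, field_names, list_fields):
    """Yield one dict per TSV row, with the named list fields split into lists."""
    # QUOTE_NONE: treat quote characters as ordinary text (an assumption
    # about how the files are written).
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for values in reader:
        row = dict(zip(field_names, values))
        for name in list_fields:
            row[name] = split_list(row.get(name, ""))
        yield row


# Hypothetical sample row: three media file fields in an assumed order.
sample = io.StringIO("Example_file.jpg\tBITMAP\tCategory_A|Category_B\n")
rows = list(
    read_rows(sample, ["media_file", "media_type", "categories"], ["categories"])
)
print(rows[0]["categories"])
```

For a downloaded file, the same function can be given a handle from `open(path, encoding="utf-8", newline="")` in place of the in-memory sample.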
Category metrics snapshot
Field | Description |
category | The name of the category this row refers to. It matches the title of the category page in Commons, in URL form (with underscores). |
primary_categories | The names of the top ancestor categories of this row’s category. They should be in the Commons institution category allow-list. Ideally, there would be only one primary category, but since we cannot control that from MediaWiki, we accept multiple primary categories. List items are separated by the bar (“|”) character. |
media_file_count | The number of media files directly contained in this category (shallow). |
media_file_count_deep | The number of media files contained in this category and all its recursive subcategories (deep). |
used_media_file_count | The number of media files directly in this category (shallow) that are featured in at least one wiki page. |
used_media_file_count_deep | The number of media files in this category tree (deep) that are featured in at least one wiki page. |
leveraging_wiki_count | The number of wikis featuring at least one media file directly in this category (shallow). |
leveraging_wiki_count_deep | The number of wikis featuring at least one media file from this category tree (deep). |
leveraging_page_count | The number of (namespace=0) pages featuring at least one media file directly in this category (shallow). |
leveraging_page_count_deep | The number of (namespace=0) pages featuring at least one media file from this category tree (deep). |
month | The month, in YYYY-MM form, after the end of which the data is calculated. For example, data calculated after the end of March 2024 (even if the calculation runs on, e.g., April 4th) has the value “2024-03”. This keeps the field consistent with the sibling incremental datasets (Pageviews by category, Pageviews by media file, and Edits). |
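The month labelling rule for snapshots can be sketched as a small function. It assumes the calculation runs during the month immediately following the one being labelled, as in the April 4th example; the function name `snapshot_month` is hypothetical and not part of the dataset tooling.

```python
from datetime import date


def snapshot_month(calculation_date):
    """Return the YYYY-MM label of the last month that ended before calculation_date."""
    year, month = calculation_date.year, calculation_date.month
    if month == 1:
        # January calculations describe December of the previous year.
        year, month = year - 1, 12
    else:
        month -= 1
    return f"{year:04d}-{month:02d}"


print(snapshot_month(date(2024, 4, 4)))  # 2024-03
```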
Media file metrics snapshot
Field | Description |
media_file | The name of the media file this row refers to. It matches the title of the media file page in Commons, in URL form (with underscores). |
media_type | The media type of the media file, taken from the Image table (img_media_type): BITMAP, VIDEO, etc. |
categories | The names of the categories that the media file is directly associated with. List items are separated by the bar (“|”) character. |
primary_categories | The names of the top ancestor categories of the media file. They should be in the Commons institution category allow-list. Ideally, there would be only one primary category, but since we cannot control that from MediaWiki, we accept multiple primary categories. List items are separated by the bar (“|”) character. |
leveraging_wiki_count | The number of wikis featuring this media file in at least one (namespace=0) page. |
leveraging_page_count | The number of (namespace=0) pages featuring this media file across all wikis. |
month | The month, in YYYY-MM form, after the end of which the data is calculated. For example, data calculated after the end of March 2024 (even if the calculation runs on, e.g., April 4th) has the value “2024-03”. This keeps the field consistent with the sibling incremental datasets (Pageviews by category, Pageviews by media file, and Edits). |
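Because a media file can carry several primary categories, grouping by primary category counts such a file once per category. The sketch below illustrates this with rows whose primary_categories field has already been split into a list; the rows and category names are invented for illustration.

```python
from collections import Counter


def files_per_primary_category(rows):
    """Count media files under each primary category they are listed with."""
    counts = Counter()
    for row in rows:
        for primary in row["primary_categories"]:
            counts[primary] += 1
    return counts


# Hypothetical rows; only the fields used here are shown.
rows = [
    {"media_file": "File_one.jpg", "primary_categories": ["Institution_A"]},
    {
        "media_file": "File_two.jpg",
        "primary_categories": ["Institution_A", "Institution_B"],
    },
]
counts = files_per_primary_category(rows)
print(counts["Institution_A"], counts["Institution_B"])  # 2 1
```

Note that the per-category counts sum to 3 here although there are only 2 files, so these counts should not be added together to obtain a total number of files.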
Pageviews by category
Field | Description |
category | The name of the category this row refers to. It matches the title of the category page in Commons, in URL form (with underscores). |
category_scope | Either “shallow” (only media files directly associated with the category were used to aggregate pageviews) or “deep” (all media files within the category and all its recursive subcategories were used to aggregate pageviews). |
primary_categories | The names of the top ancestor categories of this row’s category. They should be in the Commons institution category allow-list. Ideally, there would be only one primary category, but since we cannot control that from MediaWiki, we accept multiple primary categories. List items are separated by the bar (“|”) character. |
wiki | The canonical name of the viewed wiki, e.g. “en.wikipedia” or “fr.wiktionary”. Only wikis that feature at least one media file of the corresponding category appear here. |
page_title | The title of the viewed (namespace=0) page, in URL form (with underscores). Only (namespace=0) pages featuring at least one media file of the corresponding category appear here. |
pageview_count | Aggregated pageview count for (namespace=0) pages featuring at least one media file from the category/scope. Rows with pageview_count=0 should be omitted. |
month | The month for which we aggregate the data (YYYY-MM). |
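Since the “deep” scope already covers the media files counted under “shallow”, totals are best computed for one scope at a time. The following sketch sums pageview_count for a single category, scope, and month; the function name and the sample rows are hypothetical.

```python
def category_pageviews(rows, category, scope, month):
    """Sum pageview_count over all wikis and pages for one category, scope, and month."""
    if scope not in ("shallow", "deep"):
        raise ValueError("scope must be 'shallow' or 'deep'")
    return sum(
        int(row["pageview_count"])
        for row in rows
        if row["category"] == category
        and row["category_scope"] == scope
        and row["month"] == month
    )


# Hypothetical rows; values are read from TSV as strings.
rows = [
    {"category": "Cat_A", "category_scope": "shallow", "wiki": "en.wikipedia",
     "page_title": "Page_one", "pageview_count": "10", "month": "2024-03"},
    {"category": "Cat_A", "category_scope": "deep", "wiki": "en.wikipedia",
     "page_title": "Page_one", "pageview_count": "10", "month": "2024-03"},
    {"category": "Cat_A", "category_scope": "deep", "wiki": "fr.wiktionary",
     "page_title": "Page_two", "pageview_count": "5", "month": "2024-03"},
]
print(category_pageviews(rows, "Cat_A", "deep", "2024-03"))  # 15
```

Summing over both scopes at once would count Page_one twice in this sample.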
Pageviews by media file
Field | Description |
media_file | The name of the media file this row refers to. It matches the title of the media file page in Commons, in URL form (with underscores). |
categories | The names of the categories that the media file is directly associated with. List items are separated by the bar (“|”) character. |
primary_categories | The names of the top ancestor categories of the media file. They should be in the Commons institution category allow-list. Ideally, there would be only one primary category, but since we cannot control that from MediaWiki, we accept multiple primary categories. List items are separated by the bar (“|”) character. |
wiki | The canonical name of the viewed wiki, e.g. “en.wikipedia” or “fr.wiktionary”. Only wikis that feature the media file at least once appear here. |
page_title | The title of the viewed (namespace=0) page, in URL form (with underscores). Only (namespace=0) pages featuring the media file appear here. |
pageview_count | Aggregated pageview count for (namespace=0) pages featuring the media file. Rows with pageview_count=0 should be omitted. |
month | The month for which we aggregate the data (YYYY-MM). |
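Because rows with a pageview_count of 0 are expected to be absent, a lookup for a given file, wiki, page, and month needs a default. One possible approach, assuming that a missing row can be read as zero pageviews (it may also mean the page did not feature the file), is sketched below with hypothetical names and sample data.

```python
def build_pageview_index(rows):
    """Index pageview counts by (media_file, wiki, page_title, month)."""
    index = {}
    for row in rows:
        key = (row["media_file"], row["wiki"], row["page_title"], row["month"])
        index[key] = index.get(key, 0) + int(row["pageview_count"])
    return index


def pageviews(index, media_file, wiki, page_title, month):
    """Return the indexed count, or 0 when no row exists for the key."""
    return index.get((media_file, wiki, page_title, month), 0)


# Hypothetical row; values are read from TSV as strings.
rows = [
    {"media_file": "File_one.jpg", "wiki": "en.wikipedia",
     "page_title": "Page_one", "pageview_count": "7", "month": "2024-03"},
]
index = build_pageview_index(rows)
print(pageviews(index, "File_one.jpg", "en.wikipedia", "Page_one", "2024-03"))  # 7
print(pageviews(index, "File_one.jpg", "en.wikipedia", "Page_one", "2024-04"))  # 0
```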
Edits
Field | Description |
user_name | The name of the user who performed the edit, resolved from the actor table’s actor_name. If no actor is found, it is set to ‘anonymous’. If the name has been suppressed, it is set to ‘redacted’. |
edit_type | Either “create” (for the first revision of a media file page) or “update” (for all other revisions of the media file page). |
media_file | The name of the edited media file. It matches the title of the media file page in Commons, in URL form (with underscores). |
categories | The names of the categories that the media file is directly associated with. List items are separated by the bar (“|”) character. |
primary_categories | The names of the top ancestor categories of the media file. They should be in the Commons institution category allow-list. Ideally, there would be only one primary category, but since we cannot control that from MediaWiki, we accept multiple primary categories. List items are separated by the bar (“|”) character. |
dt | The timestamp of the edit. |
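As an illustration of working with the Edits fields, the sketch below tallies “create” and “update” edits per user while skipping the ‘anonymous’ and ‘redacted’ placeholder values. The function name and sample rows are hypothetical, and the dt field is left unparsed because its exact format is not specified here.

```python
from collections import Counter

# Placeholder values described for user_name.
PLACEHOLDER_USERS = {"anonymous", "redacted"}


def edit_counts_by_user(rows):
    """Return {user_name: Counter of edit_type}, skipping placeholder user names."""
    counts = {}
    for row in rows:
        user = row["user_name"]
        if user in PLACEHOLDER_USERS:
            continue
        counts.setdefault(user, Counter())[row["edit_type"]] += 1
    return counts


# Hypothetical rows; only the fields used here are shown.
rows = [
    {"user_name": "Example_user", "edit_type": "create", "media_file": "File_one.jpg"},
    {"user_name": "Example_user", "edit_type": "update", "media_file": "File_one.jpg"},
    {"user_name": "Example_user", "edit_type": "update", "media_file": "File_one.jpg"},
    {"user_name": "anonymous", "edit_type": "update", "media_file": "File_one.jpg"},
]
counts = edit_counts_by_user(rows)
print(counts["Example_user"]["create"], counts["Example_user"]["update"])  # 1 2
```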
All Analytics datasets are available under the Creative Commons CC0 dedication.