Introduction: Retrieval of Storting data in R

A wide variety of parliamentary data have been made available to the public in several countries over the last decade. Be it through frontend websites or back-end APIs, researchers studying parliaments have never had easier access to large amounts of data than they do now. However, data scraped from both frontends and APIs often come in formats (.html, .xml, .json, etc.) that require substantial structuring and pre-processing before they are ready for subsequent analyses.

In this vignette, I present the stortingscrape package for R. stortingscrape makes it easy to retrieve data from the Norwegian parliament (Stortinget) through its easily accessible back-end API. The data requested using the package require little to no further structuring. The scope of the package, discussed further below, ranges from general data on the parliament itself (rules, session info, committees, etc.) to data on the parties, biographies of the MPs, questions, hearings, debates, votes, and more.

Although this is the first attempt to make data on Stortinget more easily accessible, stortingscrape does not live in a vacuum. A variety of parliamentary data for different countries are available for researchers to use freely. For parliamentary debates, Thomas, Pang, and Lee (2006) were among the first to gather and make data available. Their data cover the proceedings of the 2005 U.S. House debates. Eggers and Spirling (2014) structured the UK Hansard speech data, which span from 1802 to 2010. Beelen et al. (2017) provided continuously updated data for the Canadian parliament, Rauh and Schwalbach (2020) made available a collection of speech data from 9 countries, and Turner-Zwinkels et al. (2021) developed a day-by-day dataset of MPs in Germany, Switzerland, and the Netherlands in the period between 1947 and 2017. These examples are, however, different from stortingscrape in that they are finished datasets ready for download and have limited scope.

The main goal of stortingscrape is to allow researchers to access any data from the Norwegian parliament easily, while still being able to structure the data according to one’s needs. Most importantly, the package is designed to facilitate weaving together different parts of the data.stortinget.no API.

I will start this vignette by briefly discussing the openly accessible data.stortinget.no API. Next, I will describe the philosophy, scope and general usage of the stortingscrape package. Finally, I will present some minimal examples of possible workflows for working with the package, before I summarize.

Stortinget’s API

The Norwegian parliament was comparatively early in granting open access to its data through an API when it launched data.stortinget.no in 2012. The general purpose of the API is to provide transparency in the form of raw data, mirroring the frontend web-page information from stortinget.no. The format of the API has been fairly consistent over the time of its existence, but there have been some small style changes over different versions.¹ stortingscrape was built under version 1.6 of the API.

Except for content that is blocked for the public (e.g. debates behind closed doors), the API contains all recorded data produced in Stortinget. These data include data on individual MPs, transcripts from debates, voting results, hearing input, and much more. An exhaustive list of all data sources in the API is available online.² The data available in the API can be accessed in XML or JSON format³, both of which are flexible formats for structuring data as nested lists.

As an example, the raw data input for general information about a single MP⁴ looks like this:

#> <person>
#>   <respons_dato_tid>2021-08-13T14:59:48.2114895+02:00</respons_dato_tid>
#>   <versjon>1.6</versjon>
#>   <doedsdato>0001-01-01T00:00:00</doedsdato>
#>   <etternavn>Aasen</etternavn>
#>   <foedselsdato>1967-02-21T00:00:00</foedselsdato>
#>   <fornavn>Marianne</fornavn>
#>   <id>MAAA</id>
#>   <kjoenn>kvinne</kjoenn>
#> </person>

This is the typical XML structure in the API, although other parts of the data are more complex in that the XML tree can be nested multiple times. This will be discussed further in the next section.
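To make the flat case concrete, here is a minimal sketch of how a record like the one above could be turned into a one-row data frame. It assumes the xml2 package is installed and uses a shortened copy of the record; it illustrates the general idea and is not taken from the package's internal code.

```r
library(xml2)

# Shortened copy of the <person> record shown above
raw <- "<person>
  <versjon>1.6</versjon>
  <etternavn>Aasen</etternavn>
  <fornavn>Marianne</fornavn>
  <id>MAAA</id>
  <kjoenn>kvinne</kjoenn>
</person>"

doc <- read_xml(raw)

# Each child element becomes one column: tag name -> column name, text -> value
fields <- xml_children(doc)
mp <- as.data.frame(as.list(setNames(xml_text(fields), xml_name(fields))))

mp
```

Because every child element holds a single value, the record maps cleanly onto one row. Records whose children contain further children do not, which is where the nested cases discussed later come in.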

Package philosophy, scope, and usage

stortingscrape aims to make Norwegian parliamentary data easily accessible, while also being flexible enough for tailoring the different underlying data sources to one’s needs. Indeed, contrary to most open source parliamentary speech data, stortingscrape aims at giving the user as much agency as possible in tailoring data for specific needs. In addition to user agency, the package is built with a core philosophy of simplifying data structures, making workflows between different parts of the Storting API seamless, and limiting data duplication between functions.

Because many analysis tools in R require two-dimensional data formats, the stortingscrape package prioritizes converting the nested XML format to data frames when possible. However, some sources of data from the Storting API are nested in a way that makes retaining all data in a two-dimensional space either impossible or too verbose. For example, the get_mp_bio() function, which extracts a specific MP’s biography by id, has data on MP personalia, the parliamentary periods in which the MP had a seat, vocations, literature authored by the MP, and more. In order to make all these data workable, the resulting format from the function call is a list of data frames, one for each part of the data. The different list elements are, however, easily combined for different applications of the data.
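The following sketch illustrates why a list of data frames can be preferable to one flat table. The object bio is a hypothetical stand-in for a biography result: its element names, column names, and values are all invented for illustration and do not reproduce the actual output of get_mp_bio().

```r
# Hypothetical stand-in for a biography result; all names and values are invented
bio <- list(
  root         = data.frame(mp_id = "XXXX", birth_year = 1970),
  parl_periods = data.frame(mp_id = "XXXX",
                            period_id = c("2005-2009", "2009-2013")),
  vocation     = data.frame(mp_id = "XXXX",
                            job = c("teacher", "journalist", "advisor"))
)

# Each part keeps its natural number of rows
sapply(bio, nrow)

# Forcing everything into one table multiplies the rows (1 * 2 * 3 = 6)
flat <- Reduce(function(x, y) merge(x, y, by = "mp_id"), bio)
nrow(flat)
```

Keeping the parts separate avoids this duplication, while a targeted merge() of just the parts needed for an analysis remains a one-liner.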

One of the core thoughts behind the workflow of the package is to make it easy to combine different parts of the API and to extract the data you actually need. To facilitate this, most functions within stortingscrape are built to work seamlessly with the apply() family or control flow constructs in R. Because we do not want to overload the API with calls in rapid succession, functions that are expected to be run repeatedly have a good_manners argument. This makes R sleep for the set number of seconds after calling the API. It is recommended to set this argument to 2 seconds or higher when making multiple calls to the API. Generally, the package follows the recommendations given by the httr2 package (Wickham 2023)⁵.
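The pattern behind good_manners can be sketched without touching the API. The function polite_fetch() below is a hypothetical stand-in for an extraction function: it returns a dummy data frame instead of requesting anything and then sleeps, which is the behaviour described above. The IDs and the short pause are for illustration only.

```r
# Hypothetical stand-in for an extraction function with a good_manners argument
polite_fetch <- function(id, good_manners = 0) {
  # A real function would request data for `id` from the API here
  result <- data.frame(id = id, fetched_at = format(Sys.time(), "%H:%M:%OS3"))
  # Pause after the call so that repeated calls do not hammer the server
  Sys.sleep(good_manners)
  result
}

ids <- c("17631", "17632", "17633")

# The extra argument is passed through lapply() to every call;
# 0.1 seconds keeps the sketch fast, the text recommends 2 or more in practice
results <- lapply(ids, polite_fetch, good_manners = 0.1)
results <- do.call(rbind, results)

results
```

The same shape applies to a for loop: call, store, pause, repeat.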

Most of the data from Stortinget’s API and frontend web page are interconnected through ids for the various sources (session id, MP id, case id, question id, vote id, etc.). The core extraction methods in stortingscrape are based around these. One of the major benefits of this is that whether you want to extract, for instance, a single question found on the frontend web page, or all questions for a parliamentary session, the package is flexible enough to suit both needs (see the workflow section). It also enables users to quickly retrieve data they have found on the frontend web-page.⁶

Because of the interconnectedness of the API’s data, some data sources overlap. For instance, the API returns the name of the MP in the general MP information (get_mp()), in the biography (get_mp_bio()), and in the list of all MPs for a period (get_parlperiod_mps()). In stortingscrape, however, only get_mp() returns MP names, because these data sources are easily merged by the MP’s id (see the workflow section).
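As a rough illustration of such a merge, the sketch below joins two mock data frames on mp_id. The column names mp_id, party_id, and period_id follow those shown in this vignette; the name columns and every value are placeholders invented for the example.

```r
# Mock stand-in for a period-level MP list (no names included)
period_mps <- data.frame(
  mp_id     = c("AAA", "BBB", "CCC"),
  party_id  = c("A", "H", "Sp"),
  period_id = "2017-2021"
)

# Mock stand-in for general MP information; name columns are assumptions
mp_info <- data.frame(
  mp_id      = c("CCC", "AAA", "BBB"),
  first_name = c("First C", "First A", "First B"),
  last_name  = c("Last C", "Last A", "Last B")
)

# Left join on the shared ID: every MP in the period list keeps one row
merged <- merge(period_mps, mp_info, by = "mp_id", all.x = TRUE)

merged
```

Since the ID is the only shared column, row order in the two sources does not matter, and all.x = TRUE keeps MPs for whom no name was retrieved.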

The scope of stortingscrape is almost the entire API of Stortinget, with some notable shortcomings. First, there are no functions for dynamically updated data sources, such as current speaker lists (https://data.stortinget.no/dokumentasjon-og-hjelp/talerliste/). Second, as mentioned above, duplicated data is avoided whenever possible. Third, certain unstandardized image sources, such as publication attachment figures, are not supported in the package. And finally, publications from the get_publication() function can be retrieved, but are returned in a parsed XML data format from the rvest package because these data are not standardized across different publications.

There are three overarching sources of data in stortingscrape: 1) Parliamentary structure data, 2) MP data, and 3) Parliamentary activity data. These are, in most cases, linked by various forms of ID tags. For example, retrieving all MPs for a given parliamentary period (get_parlperiod_mps()) will give access to MP IDs (mp_id) for that period, which can be used to extract biographies, pictures, speech activity, and more for those MPs. Next, I will showcase some examples of what a typical workflow for using stortingscrape could look like.

Workflow

In the following section, I will discuss some examples of data extraction with stortingscrape. I start by showing basic extraction of voting data based on vote IDs from the frontend web-page, stortinget.no. Next, I exemplify the large set of period and session specific data by retrieving all MPs for a specific parliamentary period and all interpellations for a specified parliamentary session. Finally, I show how the different functions of the stortingscrape package work together by merging data on cases with their associated vote results. Note that the vignette is built using the examples in the data folder of the package.⁷

data_files <- data(package = "stortingscrape")$results[,"Item"]
data(list = data_files)

Basic extraction

The basic extraction of specific data from Stortinget’s API revolves around various forms of ID tags. For example, all MPs have a unique ID, all cases have unique IDs, all votes have unique IDs, and so on. In the following example, I go from a case on economic measures for the Covid pandemic to the party distribution on a specific vote in this case. The case was proposed and treated relatively rapidly in the Storting during the early days of June 2021. The case in its entirety can be found on stortinget.no, where you will see the procedure steps from a government proposal, through work in the finance committee, to debate and decision. Let’s say a particular proposal under the case caught our eye, for instance vote number 61 from the Labor Party asking the government to propose a plan for incorporating the International Labor Organization’s core conventions into the Human Rights Act (menneskerettighetsloven).

As can be seen from the link to the case itself, we have an ID within the URL: “85196”. This is the case ID. We can use the get_vote() function from stortingscrape to extract all votes on this case:

## covid_relief <- get_vote("85196")

We now have a data frame with 71 votes over 22 variables. The data structure for some selected variables looks like this:

head(covid_relief[, c("case_id", "vote_id", "n_for", "n_against", "adopted")])
#>   case_id vote_id n_for n_against adopted
#> 1   85196   17631     1        87   false
#> 2   85196   17632     6        81   false
#> 3   85196   17633    14        74   false
#> 4   85196   17634    42        46   false
#> 5   85196   17635    40        48   false
#> 6   85196   17636    15        73   false

As we are interested in the result of proposal 217 from the Labor Party, we can extract the ID of this particular vote from our data:


covid_relief$vote_id[which(grepl("217", covid_relief$vote_topic))]
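One thing to keep in mind with this kind of lookup is that a plain substring match on "217" also hits longer numbers that contain it. The sketch below uses an invented data frame (IDs and topic strings are placeholders, and the real vote_topic text is in Norwegian) to show how word boundaries can narrow the match.

```r
# Mock vote data; IDs and topic strings are invented for illustration
votes <- data.frame(
  vote_id    = c("1001", "1002", "1003"),
  vote_topic = c("Proposal no. 217 on behalf of party X",
                 "Proposal no. 2170 on behalf of party Y",
                 "Proposal no. 21 on behalf of party Z")
)

# Substring match: also picks up "2170"
loose <- votes$vote_id[grepl("217", votes$vote_topic)]

# Word boundaries (\\b) require "217" to stand alone as a number
strict <- votes$vote_id[grepl("\\b217\\b", votes$vote_topic, perl = TRUE)]

loose
strict
```

If the loose pattern returns more than one ID, inspecting the matching vote_topic values before proceeding is a cheap safeguard.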

To get the personal MP vote results for this particular vote, we can use the get_result_vote() function:⁸

## covid_relief_result <- get_result_vote("17689")

head(covid_relief_result[, c("vote_id", "mp_id", "party_id", "vote")])
#>   vote_id mp_id party_id          vote
#> 1   17689   SSA        H           mot
#> 2   17689   EAG        H ikke_tilstede
#> 3   17689   PTA      FrP           mot
#> 4   17689   DTA        A ikke_tilstede
#> 5   17689  KAAN       SV           for
#> 6   17689  KAND       Sp           for

Looking only at the first six rows of the data, readers familiar with the Norwegian political system will suspect that this was an opposition versus government vote, but we can also easily get the distribution of votes by party:


table(covid_relief_result$party_id, 
      covid_relief_result$vote) |>
  addmargins()
#>      
#>       for ikke_tilstede mot Sum
#>   A    27            21   0  48
#>   FrP   0            12  14  26
#>   H     0            22  23  45
#>   KrF   0             5   3   8
#>   MDG   1             0   0   1
#>   R     1             0   0   1
#>   SV    5             6   0  11
#>   Sp    8            12   0  20
#>   Uav   0             0   1   1
#>   V     0             5   3   8
#>   Sum  42            83  44 169

As suspected, the vote was divided between the opposition (A, MDG, R, Sp, and SV) and the government parties (H, KrF, V, and FrP), and the proposal was rejected by a thin margin of 2 votes (42 for, 44 against). Of course, this is a minimal example; I will highlight methods for extracting multiple votes below.
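The bloc-level picture can also be computed rather than read off the table. The sketch below re-enters the party counts from the table above by hand (in practice they would come from covid_relief_result) and uses the bloc assignment given in the text; the column names and the "other" category for the independent MP are choices made for this example.

```r
# Party-level counts copied from the table above
votes <- data.frame(
  party     = c("A", "FrP", "H", "KrF", "MDG", "R", "SV", "Sp", "Uav", "V"),
  n_for     = c(27,  0,  0, 0, 1, 1, 5,  8, 0, 0),
  n_absent  = c(21, 12, 22, 5, 0, 0, 6, 12, 0, 5),
  n_against = c( 0, 14, 23, 3, 0, 0, 0,  0, 1, 3)
)

# Bloc assignment as described in the text; everything else is "other"
votes$bloc <- ifelse(votes$party %in% c("A", "MDG", "R", "Sp", "SV"), "opposition",
              ifelse(votes$party %in% c("H", "KrF", "V", "FrP"), "government",
                     "other"))

# Sum the three vote counts within each bloc
by_bloc <- aggregate(cbind(n_for, n_absent, n_against) ~ bloc,
                     data = votes, FUN = sum)

# Margin by which the proposal failed
margin <- sum(votes$n_against) - sum(votes$n_for)

by_bloc
margin
```

All 42 votes in favour come from the opposition bloc, while the votes against come from the government bloc plus one independent MP.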

Sequences of data extraction

Below, I show two examples of sequentially extracting data of interest.

Example 1: From periods to interpellations

Most of the mentioned IDs for Stortinget’s data are not only extractable from the frontend web-page, but also from the back-end API. These data can be retrieved by various forms of parliamentary period or session specific functions in stortingscrape. In this section, I will show how to get all MPs for a specific parliamentary period and all interpellations for a parliamentary session.

First, however, I note that IDs for periods and sessions are accessed through two core functions in the package:

## parl_periods <- get_parlperiods()
## parl_sessions <- get_parlsessions()

tail(parl_periods[,c("id", "years")])
#>         id     years
#> 15 1965-69 1965-1969
#> 16 1961-65 1961-1965
#> 17 1958-61 1958-1961
#> 18 1954-57 1954-1958
#> 19 1950-53 1950-1954
#> 20 1945-49 1945-1950
tail(parl_sessions[,c("id", "years")])
#>         id     years
#> 34 1991-92 1991-1992
#> 35 1990-91 1990-1991
#> 36 1989-90 1989-1990
#> 37 1988-89 1988-1989
#> 38 1987-88 1987-1988
#> 39 1986-87 1986-1987

The parliamentary period IDs are mainly used for MP data; Norwegian MPs are elected for four-year terms, with no constitutional arrangement for snap elections. The MP data also stretch much further back in time than most of the other data in the API:


parl_periods$id[nrow(parl_periods)]
#> [1] "1945-49"

## mps4549 <- get_parlperiod_mps("1945-49")
head(mps4549[, c("mp_id", "county_id", "party_id", "period_id")])
#>   mp_id county_id party_id period_id
#> 1  AAKU        VA        A   1945-49
#> 2  AARY        AA        A   1945-49
#> 3  ALKJ        He        H   1945-49
#> 4  ALVÅ        Fi      NKP   1945-49
#> 5  AMSK        ST        A   1945-49
#> 6  ANBØ        SF        V   1945-49

From these data, it is a short step to extracting richer data on individual MPs, as will be demonstrated below.

Content data, however, use parliamentary session IDs rather than period IDs. The functions for these data follow the naming pattern get_session_*(). For example, we can access all interpellations from the 2002-2003 session with the get_session_questions() function:

## interp0203 <- get_session_questions("2002-2003", q_type = "interpellasjoner")
dim(interp0203)
#> [1] 22 26

Here, we have 22 interpellations over 26 different variables. Unfortunately, the API only gives the question and not the answer for the different types of question requests. Retrieval of question answers is a daunting task, because the answers are only accessible through the unstandardized get_publication() function.

Example 2: From cases to MP vote results

Next, I showcase how to go from cases in a session, through extracting a case of interest and its vote results, to vote matrices for that case.

First, I extract all cases in the 2019-2020 session:

## cases <- get_session_cases("2019-2020")

The cases object will here contain all cases treated in the 2019-2020 parliamentary session. Do note that cases is a list of 4 elements ($root, $topics, $proposers, and $spokespersons). In the following, I use the case ID in $root to access vote information for a case – in this example the 48th row in the data:⁹


# The case titles are, unfortunately, not translated
cases$root$title_short[48]
#> [1] "Representantforslag om å reversere avgiftsøkninger på alkoholfrie drikkevarer og bevare arbeidsplasser i norsk næringsmiddelindustri og varehandel"

## vote <- get_vote(cases$root$id[48])

vote[, c("case_id", "vote_id", 
         "alternative_vote", 
         "n_for", "n_absent", "n_against")]
#>   case_id vote_id alternative_vote n_for n_absent n_against
#> 1   78686   15404               -1     1       82        86
#> 2   78686   15405            15406    46       82        41
#> 3   78686   15406            15405    41       82        46

The output gives us a data frame of three votes over 22 variables, one of which is the vote ID for each of the votes. We can use this variable to retrieve roll-call data, using the get_result_vote() function:

## vote_result <- lapply(vote$vote_id, get_result_vote, good_manners = 5)
names(vote_result) <- vote$vote_id

vote_result <- do.call(rbind, vote_result)
head(vote_result[, 3:ncol(vote_result)])
#>         vote_id mp_id party_id          vote permanent_sub_for sub_for
#> 15404.1   15404   SSA        H           mot              <NA>    <NA>
#> 15404.2   15404   EAG        H           mot              <NA>    <NA>
#> 15404.3   15404   PTA      FrP           mot              <NA>    <NA>
#> 15404.4   15404   DTA        A           mot              <NA>    <NA>
#> 15404.5   15404  KAAN       SV           mot              <NA>    <NA>
#> 15404.6   15404   MAA       Sp ikke_tilstede              <NA>    <NA>

We can then make an overall proportion table of the vote distribution within each party across the three votes:

# Rows are vote results, columns are parties; proportions sum to 1 within each party
table(vote_result$vote, vote_result$party_id,
      dnn = c("Vote result", "Party")) |>
  prop.table(margin = 2) |>
  round(digits = 2)
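The long roll-call format can also be reshaped into a vote matrix with one row per MP and one column per vote. The sketch below uses a small mock data frame with the same three columns as vote_result (vote_id, mp_id, vote); the MP IDs and the individual votes are invented, so it shows the reshaping step only and not the actual results.

```r
# Mock long-format roll-call data; MP IDs and votes are invented
vote_long <- data.frame(
  vote_id = rep(c("15404", "15405"), each = 3),
  mp_id   = rep(c("AAA", "BBB", "CCC"), times = 2),
  vote    = c("mot", "mot", "ikke_tilstede",
              "for", "mot", "ikke_tilstede")
)

# One row per MP, one "vote.<vote_id>" column per vote
vote_wide <- reshape(vote_long,
                     idvar     = "mp_id",
                     timevar   = "vote_id",
                     direction = "wide")

vote_wide
```

With the real data, MPs missing from one of the votes would get NA in the corresponding column, which is worth checking before any scaling or agreement analysis.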

¹ See stortingscrape::get_publication(), for instance.
² See https://martigso.github.io/stortingscrape/functions.html
³ stortingscrape works exclusively with XML.
⁴ stortingscrape::get_mp("MAAA")
⁵ See especially https://httr2.r-lib.org/articles/wrapping-apis.html
⁶ This is possible because the IDs are embedded in the URLs on stortinget.no.
⁷ This is done in order to not call the API each time the vignette is built.
⁸ I have not decided whether data values should be translated or not. In this case, “for” is “for”, “mot” is “against”, and “ikke_tilstede” is “absent”.
⁹ I will note that it is possible to extract vote information on all cases by using either the apply() family or control flow constructs available in R. However, in this case, calling the API 616 times (nrow(cases[["root"]])) requires pausing between calls (with the good_manners argument), which increases running time substantially.

stortingscrape: An R package for accessing data from the Norwegian parliament