In rscopus, we use the Scopus API to answer queries about authors and affiliations. Here we will work through an example from Clarke Iakovakis.
First, let’s load in the packages we’ll need.
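The loading chunk is not shown in this rendering; a minimal sketch of the loads this example relies on (dplyr and purrr are used for the data wrangling later in the vignette):

```r
library(rscopus) # Scopus API wrappers
library(dplyr)   # filter/mutate/joins used below
library(purrr)   # map/map_df over subject areas
```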
Next, we need to see if we have an API key available; see the API key vignette for more information on how to set the keys up. We will use the have_api_key() function to check.
Here we will create a query of a specific affiliation, subject area, publication year, and type of access (OA = open access). Let’s look at the different types of subject areas:
rscopus::subject_areas()
#> [1] "AGRI" "ARTS" "BIOC" "BUSI" "CENG" "CHEM" "COMP" "DECI" "DENT" "EART"
#> [11] "ECON" "ENER" "ENGI" "ENVI" "HEAL" "IMMU" "MATE" "MATH" "MEDI" "NEUR"
#> [21] "NURS" "PHAR" "PHYS" "PSYC" "SOCI" "VETE" "MULT"
These categories are helpful because searching all the documents in one call would be too large a request, and we may also get rate limited. Instead, we can search each subject area separately, store and save the results, merge them, and then analyze the combined output.
The author of this example was analyzing data from OSU (Oklahoma State University) and uses the affiliation ID for that institution (60006514). If you know the institution name but not the ID, you can use process_affiliation_name to retrieve it. Here we make the queries for each subject area:
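As an aside on the affiliation-ID lookup: a hedged sketch of retrieving the ID from the institution name with process_affiliation_name (the exact shape of the return value may vary by rscopus version):

```r
if (have_api_key()) {
  # Look up the Scopus affiliation ID from the institution name;
  # for Oklahoma State University this should correspond to 60006514
  osu_affil = process_affiliation_name("Oklahoma State University")
  print(osu_affil)
}
```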
Let's pull the information for one subject area. Note that the count may depend on your API key limits. We are also asking for the COMPLETE view rather than the STANDARD view. The max_count is set to 20000, which may not be enough for your query; adjust it as needed.
if (have_api_key()) {
  make_query = function(subj_area) {
    paste0("AF-ID(60006514) AND SUBJAREA(",
           subj_area,
           ") AND PUBYEAR = 2018 AND ACCESSTYPE(OA)")
  }
  i = 3
  subj_area = subject_areas()[i]
  print(subj_area)
  completeArticle <- scopus_search(
    query = make_query(subj_area),
    view = "COMPLETE",
    count = 200)
  print(names(completeArticle))
  total_results = completeArticle$total_results
  total_results = as.numeric(total_results)
} else {
  total_results = 0
}
#> [1] "BIOC"
#> Warning in scopus_search(query = make_query(subj_area), view =
#> "COMPLETE", : STANDARD view can have a max count of 200 and COMPLETE 25
#> The query list is:
#> list(query = "AF-ID(60006514) AND SUBJAREA(BIOC) AND PUBYEAR = 2018 AND ACCESSTYPE(OA)",
#> count = 25, start = 0, view = "COMPLETE")
#> $query
#> [1] "AF-ID(60006514) AND SUBJAREA(BIOC) AND PUBYEAR = 2018 AND ACCESSTYPE(OA)"
#>
#> $count
#> [1] 25
#>
#> $start
#> [1] 0
#>
#> $view
#> [1] "COMPLETE"
#>
#> Response [https://api.elsevier.com/content/search/scopus?query=AF-ID%2860006514%29%20AND%20SUBJAREA%28BIOC%29%20AND%20PUBYEAR%20%3D%202018%20AND%20ACCESSTYPE%28OA%29&count=25&start=0&view=COMPLETE]
#> Date: 2019-09-17 18:52
#> Status: 200
#> Content-Type: application/json;charset=UTF-8
#> Size: 171 kB
#> Total Entries are 82
#> 4 runs need to be sent with current count
#>
#> Number of Output Entries are 82
#> [1] "entries" "total_results" "get_statements"
Here we see the total number of results for the query. This is useful to check whether total_results is 0, or greater than the max count specified (in which case not all matching Scopus records were returned).
The gen_entries_to_df function attempts to turn the parsed JSON from the API output into something more manageable. You may want to inspect the get_statements element in the output of completeArticle. The original content can be extracted using httr::content(), where the "as" type can be specified (such as "text"), and then jsonlite::fromJSON can be used explicitly on the JSON output. Alternatively, any arguments to jsonlite::fromJSON, such as flatten or simplifyDataFrame, can be passed directly into httr::content().
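A sketch of that manual route, assuming get_statements holds the raw httr response(s) (depending on the rscopus version this may be a single response object or a list of them); note that parsing a response body uses jsonlite::fromJSON, while toJSON serializes in the other direction:

```r
if (have_api_key()) {
  # Grab the raw response (take the first if it is a list of responses)
  resp = completeArticle$get_statements
  if (is.list(resp) && !inherits(resp, "response")) {
    resp = resp[[1]]
  }
  # Extract the body as text, then parse the JSON explicitly
  txt = httr::content(resp, as = "text")
  parsed = jsonlite::fromJSON(txt, flatten = TRUE)
  print(names(parsed))
}
```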
These are all alternative options, but here we will use rscopus::gen_entries_to_df. The output is a list of data.frames after we pass in the entries element from the list.
if (have_api_key()) {
  # areas = subject_areas()[12:13]
  areas = c("ENER", "ENGI")
  names(areas) = areas
  results = purrr::map(
    areas,
    function(subj_area) {
      print(subj_area)
      completeArticle <- scopus_search(
        query = make_query(subj_area),
        view = "COMPLETE",
        count = 200,
        verbose = FALSE)
      return(completeArticle)
    })
  entries = purrr::map(results, function(x) {
    x$entries
  })
  total_results = purrr::map_dbl(results, function(x) {
    as.numeric(x$total_results)
  })
  total_results = sum(total_results, na.rm = TRUE)
  df = purrr::map(entries, gen_entries_to_df)
  MainEntry = purrr::map_df(df, function(x) {
    x$df
  }, .id = "subj_area")
  ddf = MainEntry %>%
    filter(as.numeric(`author-count.$`) > 99)
  if ("message" %in% colnames(ddf)) {
    ddf = ddf %>%
      select(message, `author-count.$`)
    print(head(ddf))
  }
  MainEntry = MainEntry %>%
    mutate(
      scopus_id = sub("SCOPUS_ID:", "", `dc:identifier`),
      entry_number = as.numeric(entry_number),
      doi = `prism:doi`)
  #################################
  # remove duplicated entries
  #################################
  MainEntry = MainEntry %>%
    filter(!duplicated(scopus_id))
  Authors = purrr::map_df(df, function(x) {
    x$author
  }, .id = "subj_area")
  Authors$`afid.@_fa` = NULL
  Affiliation = purrr::map_df(df, function(x) {
    x$affiliation
  }, .id = "subj_area")
  Affiliation$`@_fa` = NULL
  # keep only these non-duplicated records
  MainEntry_id = MainEntry %>%
    select(entry_number, subj_area)
  Authors = Authors %>%
    mutate(entry_number = as.numeric(entry_number))
  Affiliation = Affiliation %>%
    mutate(entry_number = as.numeric(entry_number))
  Authors = left_join(MainEntry_id, Authors)
  Affiliation = left_join(MainEntry_id, Affiliation)
  # first filter to get only OSU authors
  osuauth <- Authors %>%
    filter(`afid.$` == "60006514")
}
#> [1] "ENER"
#> Warning in scopus_search(query = make_query(subj_area), view =
#> "COMPLETE", : STANDARD view can have a max count of 200 and COMPLETE 25
#> [1] "ENGI"
#> Warning in scopus_search(query = make_query(subj_area), view =
#> "COMPLETE", : STANDARD view can have a max count of 200 and COMPLETE 25
#> Joining, by = c("entry_number", "subj_area")
#> Joining, by = c("entry_number", "subj_area")
In the end, we have the author-level information for each paper. The entry_number column will join these data.frames if necessary. In this example, the df element has the paper-level information, and the author data.frame has author information, including affiliations. There can be multiple affiliations, even within an institution, such as multiple department affiliations within one institution affiliation. The affiliation element holds the affiliation details and can be merged with the author information.
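For example, a minimal sketch of attaching paper-level columns to the OSU author rows, using only the column names created above (entry_number, subj_area, scopus_id, doi):

```r
if (total_results > 0) {
  # Attach paper-level identifiers to each OSU author record;
  # entry_number and subj_area together identify a paper within a search
  osu_papers = osuauth %>%
    left_join(
      MainEntry %>% select(entry_number, subj_area, scopus_id, doi),
      by = c("entry_number", "subj_area"))
  print(head(osu_papers))
}
```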
Here we look at the funding agencies listed on all the papers. This can show whether there is a pattern between the funding sponsor and open-access publication. More broadly, if a specific funder requires open access, we would like to see the funding of all the papers; this check allows libraries and researchers to ensure they are following the funding agency's guidelines.
if (total_results > 0) {
  cn = colnames(MainEntry)
  cn[grep("fund", tolower(cn))]
  tail(sort(table(MainEntry$`fund-sponsor`)))
  funderPoland <- filter(
    MainEntry,
    `fund-sponsor` == "Ministerstwo Nauki i Szkolnictwa Wyższego")
  dim(funderPoland)
  osuFunders <- MainEntry %>%
    group_by(`fund-sponsor`) %>%
    tally() %>%
    arrange(desc(n))
  osuFunders
}
#> # A tibble: 25 x 2
#> `fund-sponsor` n
#> <chr> <int>
#> 1 Ministerstwo Nauki i Szkolnictwa Wyższego 22
#> 2 <NA> 14
#> 3 National Science Foundation 4
#> 4 Australian Research Council 2
#> 5 Natural Science Foundation of Fujian Province 2
#> 6 Oklahoma State University 2
#> 7 University of Oklahoma Health Sciences Center 2
#> 8 Advanced Scientific Computing Research 1
#> 9 Austrian Science Fund 1
#> 10 China Scholarship Council 1
#> # … with 15 more rows
The Scopus API has limits for different searches and calls. Using a combination of APIs, we can gather all the author information we would like. This gives us a full picture of the authors and co-authorship at a specific institution in specific scenarios, such as the open-access publications from 2018.
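As one example of combining APIs, rscopus also wraps author-level endpoints. A hedged sketch using one of the author IDs retrieved above (the authid column name comes from the COMPLETE-view author output; the exact columns of the result may vary by rscopus version):

```r
if (have_api_key() && total_results > 0) {
  # Pull the publication record for one OSU author via the author API
  one_author = author_df(au_id = osuauth$authid[1], verbose = FALSE)
  print(head(one_author))
}
```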