This package provides access to data frames of values from the COVIDcast
endpoint of the Epidata API. Using the
covidcast_signal()
function, you can fetch any data you may
be interested in analyzing, then use
plot.covidcast_signal()
to make plots and maps. Since the
data is provided as a simple data frame, you can also wrangle it into
whatever form you need to conduct your desired analyses using other
packages and functions.
This package is available on CRAN, so the easiest way to install it is simply
To obtain smoothed estimates of COVID-like illness from our COVID-19 Trends and
Impact Survey for every county in the United States between
2020-05-01 and 2020-05-07, we can use
covidcast_signal()
:
library(covidcast)
library(dplyr)
cli <- covidcast_signal(data_source = "fb-survey", signal = "smoothed_wcli",
start_day = "2020-05-01", end_day = "2020-05-07",
geo_type = "county")
knitr::kable(head(cli))
data_source | signal | geo_value | time_value | source | geo_type | time_type | issue | lag | missing_value | missing_stderr | missing_sample_size | value | stderr | sample_size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
fb-survey | smoothed_wcli | 01000 | 2020-05-01 | fb-survey | county | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.8260625 | 0.1341381 | 1676.2773 |
fb-survey | smoothed_wcli | 01001 | 2020-05-01 | fb-survey | county | day | 2020-09-03 | 125 | 0 | 0 | 0 | 1.0707790 | 0.8213119 | 109.0866 |
fb-survey | smoothed_wcli | 01003 | 2020-05-01 | fb-survey | county | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.5081644 | 0.2800777 | 572.3194 |
fb-survey | smoothed_wcli | 01015 | 2020-05-01 | fb-survey | county | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.5277609 | 0.5192431 | 118.8275 |
fb-survey | smoothed_wcli | 01031 | 2020-05-01 | fb-survey | county | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.3733811 | 0.3367309 | 112.2687 |
fb-survey | smoothed_wcli | 01045 | 2020-05-01 | fb-survey | county | day | 2020-09-03 | 125 | 0 | 0 | 0 | 1.2369542 | 0.6464530 | 108.5803 |
covidcast_signal()
returns a data frame. (Here we’re
using knitr::kable()
to make it more readable.) Each row
represents one observation in one county on one day. The county FIPS
code is given in the geo_value
column, the date in the
time_value
column. Here value
is the requested
signal—in this case, the smoothed estimate of the percentage of people
with COVID-like illness, based on the symptom surveys, and
stderr
is its standard error. See the
covidcast_signal()
documentation for details on the
returned data frame.
To get a basic summary of the returned data frame:
## A `covidcast_signal` dataframe with 7030 rows and 15 columns.
##
## data_source : fb-survey
## signal : smoothed_wcli
## geo_type : county
##
## first date : 2020-05-01
## last date : 2020-05-07
## median number of geo_values per day : 1015
The COVIDcast API makes estimates available at several different
geographic levels, and covidcast_signal()
defaults to
requesting county-level data. To request estimates for states instead of
counties, we use the geo_type
argument:
cli <- covidcast_signal(data_source = "fb-survey", signal = "smoothed_wcli",
start_day = "2020-05-01", end_day = "2020-05-07",
geo_type = "state")
knitr::kable(head(cli))
data_source | signal | geo_value | time_value | source | geo_type | time_type | issue | lag | missing_value | missing_stderr | missing_sample_size | value | stderr | sample_size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
fb-survey | smoothed_wcli | ak | 2020-05-01 | fb-survey | state | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.3661909 | 0.1469918 | 1560.000 |
fb-survey | smoothed_wcli | al | 2020-05-01 | fb-survey | state | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.7764020 | 0.1010989 | 7360.237 |
fb-survey | smoothed_wcli | ar | 2020-05-01 | fb-survey | state | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.7065584 | 0.1051584 | 4781.483 |
fb-survey | smoothed_wcli | az | 2020-05-01 | fb-survey | state | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.6025853 | 0.0724214 | 10973.073 |
fb-survey | smoothed_wcli | ca | 2020-05-01 | fb-survey | state | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.4139045 | 0.0306336 | 50482.138 |
fb-survey | smoothed_wcli | co | 2020-05-01 | fb-survey | state | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.5984794 | 0.0717395 | 9888.894 |
One can also select a specific geographic region by its ID. For example, this is the FIPS code for Allegheny County, Pennsylvania:
cli <- covidcast_signal(data_source = "fb-survey", signal = "smoothed_wcli",
start_day = "2020-05-01", end_day = "2020-05-07",
geo_type = "county", geo_value = "42003")
knitr::kable(head(cli))
data_source | signal | geo_value | time_value | source | geo_type | time_type | issue | lag | missing_value | missing_stderr | missing_sample_size | value | stderr | sample_size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
fb-survey | smoothed_wcli | 42003 | 2020-05-01 | fb-survey | county | day | 2020-09-03 | 125 | 0 | 0 | 0 | 0.6270520 | 0.2511377 | 2554.564 |
fb-survey | smoothed_wcli | 42003 | 2020-05-02 | fb-survey | county | day | 2020-09-03 | 124 | 0 | 0 | 0 | 0.6453498 | 0.2599037 | 2509.176 |
fb-survey | smoothed_wcli | 42003 | 2020-05-03 | fb-survey | county | day | 2020-09-03 | 123 | 0 | 0 | 0 | 0.5523067 | 0.2497662 | 2473.456 |
fb-survey | smoothed_wcli | 42003 | 2020-05-04 | fb-survey | county | day | 2020-09-03 | 122 | 0 | 0 | 0 | 0.1430772 | 0.0804642 | 2493.730 |
fb-survey | smoothed_wcli | 42003 | 2020-05-05 | fb-survey | county | day | 2020-09-03 | 121 | 0 | 0 | 0 | 0.1861889 | 0.0960907 | 2415.204 |
fb-survey | smoothed_wcli | 42003 | 2020-05-06 | fb-survey | county | day | 2020-09-03 | 120 | 0 | 0 | 0 | 0.3124150 | 0.1218194 | 2465.422 |
By default, this package submits queries to the API anonymously. All
the examples in the package documentation are compatible with anonymous
use of the API, but there
are some limits on anonymous queries, including rate limits on the
number of queries that can be submitted per hour. To lift these limits,
see the “API keys” section of the covidcast_signal()
documentation for information on how to register for and use an API
key.
This package provides convenient functions for plotting and mapping these signals. For example, simple line charts are easy to construct:
For more details and examples, including choropleth and bubble maps,
see vignette("plotting-signals")
.
Above we used data from Delphi’s symptom surveys, but the COVIDcast API includes numerous data streams: medical claims data, cases and deaths, mobility, and many others; new signals are added regularly. This can make it a challenge to find the data stream that you are most interested in.
The COVIDcast
Data Sources and Signals documentation lists all data sources and
signals available through COVIDcast. When you find a signal of interest,
get the data source name (such as jhu-csse
or
fb-survey
) and the signal name (such as
confirmed_incidence_num
or smoothed_wcli
).
These are provided as arguments to covidcast_signal()
to
request the data you want.
The COVIDcast API identifies counties by their 5-digit FIPS code and metropolitan areas by their CBSA ID number. (See the geographic coding documentation for details.) This means that to query a specific county or metropolitan area, we must have some way to quickly find its identifier.
This package includes several utilities intended to make the process
easier. For example, if we look at ?county_census
, we find
that the package provides census data (such as population) on every
county in the United States, including its FIPS code. Similarly, by
looking at ?msa_census
we can find data about metropolitan
statistical areas, their corresponding CBSA IDs, and recent census
data.
(Note: the msa_census
data includes types of area beyond
metropolitan statistical areas, including micropolitan statistical
areas. The LSAD
column identifies the type of each area.
The COVIDcast API only provides estimates for metropolitan statistical
areas, not for their divisions or for micropolitan areas.)
Building on these datasets, the convenience functions
name_to_fips()
and name_to_cbsa()
conduct
grep()
-based searching of county or metropolitan area names
to find FIPS or CBSA codes, respectively:
## Allegheny County
## "42003"
## Pittsburgh, PA
## "38300"
Since these functions return vectors of IDs, we can use them to
construct the geo_values
argument to
covidcast_signal()
to select specific regions to query.
You may also want to convert FIPS codes or CBSA IDs back to
well-known names, for instance to report in tables or graphics. The
package provides inverse mappings county_fips_to_name()
and
cbsa_to_name()
that work in the analogous way:
## 42003
## "Allegheny County"
## 38300
## "Pittsburgh, PA"
See their documentation for more details (for example, the options for handling matches when counties have the same name).
If we are interested in exploring the available signals and their
metadata, we can use covidcast_meta()
to fetch a data frame
of the available signals:
data_source | signal | time_type | geo_type | min_time | max_time | num_locations | min_value | max_value | mean_value | stdev_value | last_update | max_issue | min_lag | max_lag |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
chng | smoothed_adj_outpatient_cli | day | county | 2020-02-01 | 2023-02-14 | 3118 | 0.0009331 | 99.92012 | 2.227852 | 3.843579 | 1683566979 | 2023-02-19 | 3 | 674 |
chng | smoothed_adj_outpatient_cli | day | hhs | 2020-02-01 | 2023-06-14 | 10 | 0.0061953 | 20.77577 | 2.530157 | 2.531306 | 1687231582 | 2023-06-19 | 5 | 674 |
chng | smoothed_adj_outpatient_cli | day | hrr | 2020-02-01 | 2023-06-14 | 306 | 0.0010292 | 50.81590 | 2.350355 | 2.763442 | 1687231582 | 2023-06-19 | 5 | 674 |
chng | smoothed_adj_outpatient_cli | day | msa | 2020-02-01 | 2023-06-14 | 392 | 0.0007662 | 99.99898 | 2.153204 | 3.000248 | 1687231583 | 2023-06-19 | 5 | 674 |
chng | smoothed_adj_outpatient_cli | day | nation | 2020-02-01 | 2023-06-14 | 1 | 0.0154639 | 12.08697 | 2.778260 | 2.344107 | 1687231583 | 2023-06-19 | 5 | 674 |
chng | smoothed_adj_outpatient_cli | day | state | 2020-02-01 | 2023-06-14 | 55 | 0.0013343 | 33.23859 | 2.264207 | 2.563880 | 1687231583 | 2023-06-19 | 5 | 674 |
The covidcast_meta()
documentation describes the columns
and their meanings. The metadata data frame can be filtered and sliced
as desired to obtain information about signals of interest. To get a
basic summary of the metadata:
(We silenced the evaluation because the output of
summary()
here is still quite long.)
The COVIDcast API records not just each signal’s estimate for a given location on a given day, but also when that estimate was made, and all updates to that estimate.
For example, consider using our doctor
visits signal, which estimates the percentage of outpatient doctor
visits that are COVID-related, and consider a result row with
time_value
2020-05-01 for geo_values = "pa"
.
This is an estimate for the percentage in Pennsylvania on May 1, 2020.
That estimate was issued on May 5, 2020, the delay being due to
the aggregation of data by our source and the time taken by the
COVIDcast API to ingest the data provided. Later, the estimate for May
1st could be updated, perhaps because additional visit data from May 1st
arrived at our source and was reported to us. This constitutes a new
issue of the data.
By default, covidcast_signal()
fetches the most recent
issue available. This is the best option for users who simply want to
graph the latest data or construct dashboards. But if we are interested
in knowing when data was reported, we can request specific data
versions using the as_of
, issues
, or
lag
arguments. (Note these are mutually exclusive; only one
can be specified at a time.)
First, we can request the data that was available as of a
specific date, using the as_of
argument:
covidcast_signal(data_source = "doctor-visits", signal = "smoothed_adj_cli",
start_day = "2020-05-01", end_day = "2020-05-01",
geo_type = "state", geo_values = "pa", as_of = "2020-05-07")
## A `covidcast_signal` dataframe with 1 rows and 15 columns.
##
## data_source : doctor-visits
## signal : smoothed_adj_cli
## geo_type : state
##
## data_source signal geo_value time_value source geo_type
## 1 doctor-visits smoothed_adj_cli pa 2020-05-01 doctor-visits state
## time_type issue lag missing_value missing_stderr missing_sample_size
## 1 day 2020-05-07 6 0 5 5
## value stderr sample_size
## 1 2.581509 NA NA
This shows that an estimate of about 2.3% was issued on May 7. If we
don’t specify as_of
, we get the most recent estimate
available:
covidcast_signal(data_source = "doctor-visits", signal = "smoothed_adj_cli",
start_day = "2020-05-01", end_day = "2020-05-01",
geo_type = "state", geo_values = "pa")
## A `covidcast_signal` dataframe with 1 rows and 15 columns.
##
## data_source : doctor-visits
## signal : smoothed_adj_cli
## geo_type : state
##
## data_source signal geo_value time_value source geo_type
## 1 doctor-visits smoothed_adj_cli pa 2020-05-01 doctor-visits state
## time_type issue lag missing_value missing_stderr missing_sample_size
## 1 day 2020-07-04 64 0 5 5
## value stderr sample_size
## 1 5.973572 NA NA
Note the substantial change in the estimate, to over 5%, reflecting new data that became available after May 7 about visits occurring on May 1. This illustrates the importance of issue date tracking, particularly for forecasting tasks. To backtest a forecasting model on past data, it is important to use the data that would have been available at the time, not data that arrived much later.
By using the issues
argument, we can request all issues
in a certain time period:
covidcast_signal(data_source = "doctor-visits", signal = "smoothed_adj_cli",
start_day = "2020-05-01", end_day = "2020-05-01",
geo_type = "state", geo_values = "pa",
issues = c("2020-05-01", "2020-05-15")) %>%
knitr::kable()
data_source | signal | geo_value | time_value | source | geo_type | time_type | issue | lag | missing_value | missing_stderr | missing_sample_size | value | stderr | sample_size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
doctor-visits | smoothed_adj_cli | pa | 2020-05-01 | doctor-visits | state | day | 2020-05-07 | 6 | 0 | 5 | 5 | 2.581509 | NA | NA |
doctor-visits | smoothed_adj_cli | pa | 2020-05-01 | doctor-visits | state | day | 2020-05-08 | 7 | 0 | 5 | 5 | 3.278896 | NA | NA |
doctor-visits | smoothed_adj_cli | pa | 2020-05-01 | doctor-visits | state | day | 2020-05-09 | 8 | 0 | 5 | 5 | 3.321781 | NA | NA |
doctor-visits | smoothed_adj_cli | pa | 2020-05-01 | doctor-visits | state | day | 2020-05-12 | 11 | 0 | 5 | 5 | 3.588683 | NA | NA |
doctor-visits | smoothed_adj_cli | pa | 2020-05-01 | doctor-visits | state | day | 2020-05-13 | 12 | 0 | 5 | 5 | 3.631978 | NA | NA |
doctor-visits | smoothed_adj_cli | pa | 2020-05-01 | doctor-visits | state | day | 2020-05-14 | 13 | 0 | 5 | 5 | 3.658009 | NA | NA |
doctor-visits | smoothed_adj_cli | pa | 2020-05-01 | doctor-visits | state | day | 2020-05-15 | 14 | 0 | 5 | 5 | 3.662286 | NA | NA |
This estimate was clearly updated many times as new data for May 1st arrived. Note that these results include only data issued or updated between 2020-05-01 and 2020-05-15. If a value was first reported on 2020-04-15, and never updated, a query for issues between 2020-05-01 and 2020-05-15 will not include that value among its results.
After fetching multiple issues of data, we can use the
latest_issue()
or earliest_issue()
functions
to subset the data frame to view only the latest or earliest issue of
each observation.
Finally, we can use the lag
argument to request only
data reported with a certain lag. For example, requesting a lag of 7
days means to request only issues 7 days after the corresponding
time_value
:
covidcast_signal(data_source = "doctor-visits", signal = "smoothed_adj_cli",
start_day = "2020-05-01", end_day = "2020-05-07",
geo_type = "state", geo_values = "pa", lag = 7) %>%
knitr::kable()
## Warning: Data not fetched for the following days: 2020-05-01
data_source | signal | geo_value | time_value | source | geo_type | time_type | issue | lag | missing_value | missing_stderr | missing_sample_size | value | stderr | sample_size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
doctor-visits | smoothed_adj_cli | pa | 2020-05-01 | doctor-visits | state | day | 2020-05-08 | 7 | 0 | 5 | 5 | 3.278896 | NA | NA |
doctor-visits | smoothed_adj_cli | pa | 2020-05-02 | doctor-visits | state | day | 2020-05-09 | 7 | 0 | 5 | 5 | 3.225292 | NA | NA |
doctor-visits | smoothed_adj_cli | pa | 2020-05-05 | doctor-visits | state | day | 2020-05-12 | 7 | 0 | 5 | 5 | 2.779908 | NA | NA |
doctor-visits | smoothed_adj_cli | pa | 2020-05-06 | doctor-visits | state | day | 2020-05-13 | 7 | 0 | 5 | 5 | 2.557698 | NA | NA |
doctor-visits | smoothed_adj_cli | pa | 2020-05-07 | doctor-visits | state | day | 2020-05-14 | 7 | 0 | 5 | 5 | 2.191677 | NA | NA |
Note that though this query requested all values between 2020-05-01 and 2020-05-07, May 3rd and May 4th were not included in the results set. This is because the query will only include a result for May 3rd if a value were issued on May 10th (a 7-day lag), but in fact the value was not updated on that day:
covidcast_signal(data_source = "doctor-visits", signal = "smoothed_adj_cli",
start_day = "2020-05-03", end_day = "2020-05-03",
geo_type = "state", geo_values = "pa",
issues = c("2020-05-09", "2020-05-15")) %>%
knitr::kable()
data_source | signal | geo_value | time_value | source | geo_type | time_type | issue | lag | missing_value | missing_stderr | missing_sample_size | value | stderr | sample_size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
doctor-visits | smoothed_adj_cli | pa | 2020-05-03 | doctor-visits | state | day | 2020-05-09 | 6 | 0 | 5 | 5 | 2.788618 | NA | NA |
doctor-visits | smoothed_adj_cli | pa | 2020-05-03 | doctor-visits | state | day | 2020-05-12 | 9 | 0 | 5 | 5 | 3.015368 | NA | NA |
doctor-visits | smoothed_adj_cli | pa | 2020-05-03 | doctor-visits | state | day | 2020-05-13 | 10 | 0 | 5 | 5 | 3.039310 | NA | NA |
doctor-visits | smoothed_adj_cli | pa | 2020-05-03 | doctor-visits | state | day | 2020-05-14 | 11 | 0 | 5 | 5 | 3.021245 | NA | NA |
doctor-visits | smoothed_adj_cli | pa | 2020-05-03 | doctor-visits | state | day | 2020-05-15 | 12 | 0 | 5 | 5 | 3.048725 | NA | NA |