library(zctaCrosswalk)
library(dplyr)
While creating this package I was acutely aware that ZCTAs change frequently. For example, back in 2016 I created a similar package called choroplethrZip. That package is now out of date, because the underlying data it stores became out of date. I expect that something similar will eventually happen with this package.
This vignette is written as a “note to my future self” in case I wind up needing to write a similar package again in the future. It is also intended to increase the number of people who understand how to create packages like this.
The core data structure in this package is ?zcta_crosswalk:
data(zcta_crosswalk)
print(zcta_crosswalk, n = 5)
#> # A tibble: 46,960 × 9
#> zcta zcta_numeric state_name state_usps state_fips state_fips_numeric
#> <chr> <int> <chr> <chr> <chr> <int>
#> 1 00601 601 puerto rico PR 72 72
#> 2 00601 601 puerto rico PR 72 72
#> 3 00602 602 puerto rico PR 72 72
#> 4 00602 602 puerto rico PR 72 72
#> 5 00603 603 puerto rico PR 72 72
#> # ℹ 46,955 more rows
#> # ℹ 3 more variables: county_name <chr>, county_fips <chr>,
#> # county_fips_numeric <int>
Most of the effort in creating this package was spent creating this data structure. Let’s see how it was created.
Start by looking at the contents of the function ?get_zcta_crosswalk:
get_zcta_crosswalk = function() {
  url = "https://www2.census.gov/geo/docs/maps-data/data/rel2020/zcta520/tab20_zcta520_county20_natl.txt"
  zcta_crosswalk = read_delim(file = url, delim = "|")

  # Select and rename columns
  zcta_crosswalk = zcta_crosswalk |>
    rename(zcta        = .data$GEOID_ZCTA5_20,
           county_fips = .data$GEOID_COUNTY_20,
           county_name = .data$NAMELSAD_COUNTY_20) |>
    select(.data$zcta, .data$county_fips, .data$county_name)

  # 1. The county FIPS is always 5 characters. And the first 2 characters always
  #    indicate the state. See https://en.wikipedia.org/wiki/FIPS_county_code.
  #    Breaking out the state allows for easier state selection later.
  # 2. This file has all counties, some of which do not have a ZCTA. Remove
  #    those counties.
  zcta_crosswalk |>
    mutate(state_fips = str_sub(.data$county_fips, 1, 2)) |>
    filter(!is.na(.data$zcta))
}
The function reads and transforms the contents of a URL. At the time of this writing, that is the URL for the Census Bureau’s “2020 ZCTA to County Relationship File”, a file which I mentioned earlier.
This means that if Census publishes an updated dataset in the same format tomorrow you could just change the URL, rerun the code and get the updated data in R. (Note that I do not know when Census plans to update this dataset or whether they plan to publish it in the same format.)
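One way to make the "just change the URL" idea concrete is to pass the file location in as an argument. The sketch below is not part of zctaCrosswalk: `read_zcta_county_file` is a hypothetical name, it uses base R instead of readr and stringr, and it assumes that any future file keeps the three column names used in get_zcta_crosswalk. The two-row file it reads is a toy stand-in, not Census data.

```r
# Hypothetical helper (not in zctaCrosswalk): the same transformation as
# get_zcta_crosswalk(), but with the file location supplied by the caller.
read_zcta_county_file = function(path_or_url) {
  # Read every column as character so that leading zeros survive,
  # and treat empty fields as NA.
  tab = read.delim(path_or_url, sep = "|", colClasses = "character",
                   na.strings = "", quote = "")

  out = data.frame(
    zcta        = tab$GEOID_ZCTA5_20,
    county_fips = tab$GEOID_COUNTY_20,
    county_name = tab$NAMELSAD_COUNTY_20,
    stringsAsFactors = FALSE
  )

  # First two characters of the county FIPS code identify the state
  out$state_fips = substr(out$county_fips, 1, 2)

  # Drop counties that have no ZCTA
  out[!is.na(out$zcta), ]
}

# Toy file for illustration only: one normal row, one row with no ZCTA
tmp = tempfile(fileext = ".txt")
writeLines(c(
  "GEOID_ZCTA5_20|GEOID_COUNTY_20|NAMELSAD_COUNTY_20",
  "90210|06037|Los Angeles County",
  "|06037|Los Angeles County"
), tmp)

toy = read_zcta_county_file(tmp)
print(toy)
```

Passing the Census URL instead of `tmp` should behave the same way, provided the published format has not changed.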
If you open the URL referenced in ?get_zcta_crosswalk in a browser you will see rows like this:
221704258470394|90210|ZCTA5 90210|27823432|153478|G6350|B5|S|275901063468976|06037|Los Angeles County|10513491099|1787501506|G4020|H1|A|27823432|153478
This tells us that ZCTA 90210 is in Los Angeles County. It also tells us that Los Angeles County has FIPS Code 06037.
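To see where those values sit in the row, you can split the sample line on the pipe character. This is only an illustration in base R; the field positions below are read off the sample row itself, not from the file's header.

```r
row = "221704258470394|90210|ZCTA5 90210|27823432|153478|G6350|B5|S|275901063468976|06037|Los Angeles County|10513491099|1787501506|G4020|H1|A|27823432|153478"

# fixed = TRUE treats "|" as a literal character, not a regex "or"
fields = strsplit(row, "|", fixed = TRUE)[[1]]

length(fields)  # 18 pipe-delimited fields
fields[2]       # "90210"              (the ZCTA)
fields[10]      # "06037"              (the county FIPS code)
fields[11]      # "Los Angeles County" (the county name)
```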
Unfortunately, the file does not directly contain any state information. And since I wanted to run queries like “Get all ZCTAs in a given state”, I needed to add that in.
I started by splitting out the first two characters of each County FIPS Code into a new column called state_fips. This allows a user to search for ZCTAs in a state if they know the state’s FIPS code.
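The split itself is a one-liner. The sketch below uses base R's `substr`, which for this purpose does the same thing as the `str_sub` call in the function above. It also shows why the FIPS code has to stay a character string: converting it to a number drops the leading zero and the first two characters no longer identify the state.

```r
county_fips = "06037"   # Los Angeles County, from the sample row

# Character code: the first two characters are the state FIPS code
state_fips = substr(county_fips, 1, 2)
state_fips              # "06"

# Numeric code: the leading zero is lost, so the same split goes wrong
as_number = as.integer(county_fips)
as_number                                   # 6037
wrong = substr(as.character(as_number), 1, 2)
wrong                                       # "60"
```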
However, this does not help us if users want to select a state by its name or Postal Code Abbreviation. To address this limitation I created a new dataframe called state_names, and used it to join against the results of ?get_zcta_crosswalk:
data(state_names)
print(state_names, n = 5)
#> # A tibble: 56 × 4
#> full usps fips_numeric fips_character
#> <chr> <chr> <int> <chr>
#> 1 alaska AK 2 02
#> 2 alabama AL 1 01
#> 3 arkansas AR 5 05
#> 4 arizona AZ 4 04
#> 5 california CA 6 06
#> # ℹ 51 more rows
One thing to keep in mind is that while there are technically only 50 states, “state” in this dataset really means “any top level administrative region”. This dataset contains 56 states (the extra ones are: the District of Columbia, Puerto Rico, US Virgin Islands, American Samoa, Guam and the Northern Mariana Islands).
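The join described above can be sketched with two toy data frames. This is not the package's actual code, which this vignette does not show. `crosswalk_toy` and `state_names_toy` are hypothetical names, and their rows are copied from the printed output earlier in this vignette. The sketch uses base R's `merge`; `dplyr::left_join` would be an equivalent choice.

```r
# Two rows of the crosswalk after the state_fips column has been added
crosswalk_toy = data.frame(
  zcta       = c("00601", "90210"),
  state_fips = c("72", "06"),
  stringsAsFactors = FALSE
)

# Two rows in the same shape as state_names
state_names_toy = data.frame(
  full           = c("california", "puerto rico"),
  usps           = c("CA", "PR"),
  fips_numeric   = c(6L, 72L),
  fips_character = c("06", "72"),
  stringsAsFactors = FALSE
)

# Left join on the character FIPS code
joined = merge(crosswalk_toy, state_names_toy,
               by.x = "state_fips", by.y = "fips_character",
               all.x = TRUE)

# A state can now be selected by name or postal abbreviation
ca_zctas = joined$zcta[joined$usps == "CA"]
pr_zctas = joined$zcta[joined$full == "puerto rico"]
ca_zctas   # "90210"
pr_zctas   # "00601"
```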
I believe that it would be useful for R to have a standalone package that contains a data frame like this for all FIPS codes. I did not break state_names out into a separate package because it covers only the 56 state-level entities, while the full list of FIPS codes is much larger.
The code I used to generate state_names is in inst/gen_state_states.R.
Note that while R has two built-in vectors that deal with state names (state.abb and state.name), they cannot help us here because: (1) they do not contain FIPS codes and (2) they only contain 50 states.
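Both limitations are easy to confirm in a fresh R session, since state.abb and state.name ship with base R's datasets package:

```r
# Both vectors have exactly 50 entries
n_names = length(state.name)
n_abbs  = length(state.abb)
n_names   # 50
n_abbs    # 50

# State-level entities outside the 50 states are missing
has_dc = "District of Columbia" %in% state.name
has_pr = "PR" %in% state.abb
has_dc    # FALSE
has_pr    # FALSE

# Both are plain character vectors, so they carry no FIPS codes
is.character(state.name)
is.character(state.abb)
```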
If you would like to learn more about ZCTAs (including how they differ from ZIP Codes), I recommend two references:
One of my recollections from that meeting is Jon explaining that ZIP Codes are designed to follow roads. This means that different sides of a single block can have different ZIP codes. Census geography, however, treats blocks as atomic. This means that all homes on a single block must have the same ZCTA. This difference in construction means that ZIPs and ZCTAs are unlikely to ever truly be identical.
My primary concern with this dataset is that people will assume that it is a crosswalk for present-day ZIP Codes. As stated above, ZCTAs rarely (if ever) line up perfectly with ZIP Codes. Additionally, this dataset was published in 2020, and it is not clear how many changes have occurred to ZIP Codes in the interim.
I would like to thank my employer, MarketBridge, for supporting the development of this package. This package would not have been developed without their support.