In September 2018 I created an account with the Integrated Public Use Microdata Series (IPUMS) data library at the University of Minnesota..[@IPUMS:2020] My objective then was to understand the evolution of accounting and auditing as a percent of the US workforce over time. In redoing this in March 2020, it seems sensible to start by summarizing the large IPUMS data extract into a smaller table of occupation codes by year and save this as OCC1950.rda
. Then can then analyze that much smaller dataset in a separate vignette.
After logging into the IPUMS website, the two primary options are “Select Data” and “My Data”. The latter shows previous selections with the following options:
In 2018 from “Select Data”, I chose “Select harmonized variables” > Person > Work.
From here, I first selected “OCC1950
= Occupation 1950 basis” and got something useful doing that.
I tried other similar extracts as outlined in the Appendix below. I decided the best extract for my purposes seemed to be the one I started with using OCC1950
. The rest of this vignette describes summarizing that into a matrix estimating the percent of the US population by year in each OCC1950
code in March 2020.
# The code in this snippet is a slight modification
# of the R code from usa_00006.R, 2020-03-18.
if (!require("ipumsr")){
msg <- paste("Reading IPUMS data into R",
"requires the ipumsr package. It can be",
"installed using:\ninstall.packages('ipumsr')")
stop(msg)
}
## Loading required package: ipumsr
# NOTE: base::dir works differently
# within an R Markdown file than
# from an ordinary command prompt,
# at least under macOS 10.15.3 on 2020-03-18.
# With getwd() = the parent directory of
# "~/fda/vignettes",
# When I highlighted "dir()" and executed it
# using <command + enter>, I got the contents
# of "~/fda/vignettes".
# When I executed "dir()" outside the *.Rmd file,
# I got the parent directory.
dir()
## [1] "IPUMS-references.bib" "UpdatingUSGDPpresidents.R"
## [3] "UpdatingUSGDPpresidents.Rmd" "UpdatingUSGDPpresidents.html"
## [5] "nuclearArmageddon.R" "nuclearArmageddon.Rmd"
## [7] "nuclearArmageddon.html" "nuclearProliferation.svg"
## [9] "time2Armgeddon.svg" "updateOCC1950.R"
## [11] "updateOCC1950.Rmd" "update_nuclearWeaponStates.R"
## [13] "update_nuclearWeaponStates.Rmd" "update_nuclearWeaponStates.html"
dir(getwd())
## [1] "IPUMS-references.bib" "UpdatingUSGDPpresidents.R"
## [3] "UpdatingUSGDPpresidents.Rmd" "UpdatingUSGDPpresidents.html"
## [5] "nuclearArmageddon.R" "nuclearArmageddon.Rmd"
## [7] "nuclearArmageddon.html" "nuclearProliferation.svg"
## [9] "time2Armgeddon.svg" "updateOCC1950.R"
## [11] "updateOCC1950.Rmd" "update_nuclearWeaponStates.R"
## [13] "update_nuclearWeaponStates.Rmd" "update_nuclearWeaponStates.html"
# copy and paste the following
# from "Accountants-IPUMS.Rmd" in R Studio
# into the Console below:
IPUMSdir <- 'IPUMS'
(ddiXml <- dir(IPUMSdir, pattern="usa_00006.xml",
full.names = TRUE))
## character(0)
# OR execute the following inside
# "Accountants-IPUMS.Rmd" in R Studio:
if(length(ddiXml)!=1){
# print(ddiXml <- file.path('..', '..', 'IPUMS'))
print(ddiXml <- dir(pattern="usa_00006.xml",
full.names = TRUE))
# NOTE: This worked under macOS 10.15.3
# with R 3.6.3 and RStudio 1.2.5033.
# It has not been tested on other platforms.
}
## character(0)
Some of the computations below are fairly long, because this dataset is so large. This vignette is structured so nearly all the R code in this vignette will be skipped in routine package testing and would only be executed if a user arranged so a file containing the desired IPUMS data extract can be found. We do this here by creating a variable readAndCompute
that is FALSE
by default and is set to TRUE
only when the the desired data are available for manual processing.
readAndCompute <- FALSE
if((length(ddiXml)==1) && (!fda::CRAN())){
readAndCompute <- TRUE
ddiDat <- read_ipums_ddi(ddiXml)
(readDatTime <- system.time(
IPUMSdata <- read_ipums_micro(ddiDat)
))
}
On 2020-03-19 this took roughly 40 seconds on a MacBook Pro with a 2.8 GHz quad-core Intel core i7 with 16 GB RAM.
IPUMSdata
is an object with a huge number of rows and 8 columns:
if(readAndCompute){
str(IPUMSdata)
nrow(IPUMSdata)/1e6
}
IPUMSdata
is an object of classes tbl_df
, tbl
and data.frame
with over 114 million rows for dat00001.xml
.
That's too few rows to have one row for each person in the most recent census and certainly too few to have one row for each household in all the census since 1850:
if(readAndCompute){
print(etYr <- system.time(
tbl_year <- table(IPUMSdata$YEAR)
))
plot(tbl_year)
tbl_year
}
The key point from the the print of tbl_year
and this plot is that this dataset includes data from every census except 1890 plus for each year between 2000 and 2016.
Before proceeding, let's check IPUMSdata
for missing values:
if(readAndCompute){
print(etNA <- system.time(
nNA <- sapply(IPUMSdata, function(x)sum(is.na(x)))
))
print(nNA)
}
No missing values!
Let's look at var_desc
for HHWT
:
if(readAndCompute){
attributes(IPUMSdata$HHWT)
}
Let's look at the distribution of HHWT
:
if(readAndCompute){
print(etQ <- system.time({
rngHHWT <- range(IPUMSdata$HHWT)
qtleHHWT <- quantile(IPUMSdata$HHWT)
}))
print(rngHHWT)
qtleHHWT
}
Let's also examine the the attributes of OCC1950
:
if(readAndCompute){
print(etCodes <- system.time(
OCC50codes <- attributes(IPUMSdata$OCC1950)
))
str(OCC50codes)
}
We're especially interested in “labels”:
if(readAndCompute){
# OCCcodes$labels
print(OCC50codes$var_desc)
print(head(OCC50codes$labels))
tail(OCC50codes$labels)
}
The “labels” attribute from OCC1950
provided a translate table giving English-language names to the numeric codes.
Are all these codes used?
if(readAndCompute){
print(etOcc1 <- system.time(
Occ1 <- table(IPUMSdata$OCC1950)
))
str(Occ1)
}
Two codes are not used. What are they?
if(readAndCompute){
OCC50codes$labels[!(
OCC50codes$labels %in% names(Occ1))]
}
Let's sum HHWT
within YEAR
and OCC1950
:
if(readAndCompute){
print(etOccYr <- system.time(
OccYr <- tapply(IPUMSdata$HHWT,
IPUMSdata[c("OCC1950", "YEAR")], sum)
))
str(OccYr)
}
This is an array of OCC1950
by YEAR
. We will convert this into a matrix of the proportion of the workforce in each OCC1950
code with two attributes:
codes
= OCC50codes
workforce
= colSums(YrOcc)
We need the codes
to allow us to make any use of these data, and workforce
gives us an estimate of the size of the workforce by year.
To confirm the latter, let's compute it:
if(readAndCompute){
(totWts <- colSums(OccYr))
}
All NA
s. One explanation for this is no year has seen the use of all occupation codes. For example, we should not expect to see many “"Airplane pilots and navigators” in the nineteenth century![[[w:Airline# The first airlines|The world's first airline company]] was German using airships, founded in 1909.]
To check this, we will table(OCC1950", YEAR)
:
if(readAndCompute){
print(etOY <- system.time(
OY <- with(IPUMSdata, table(OCC1950, YEAR))
))
print(str(OY))
sum(is.na(OccYr) - (OY==0))
}
Wonderful: This says that all NA
s in OccYr
should be 0:
if(readAndCompute){
OccYr[is.na(OccYr)] <- 0
}
Now let's repeat the sums by year, comparing with USGDPpresidents$population.K
:
if(readAndCompute){
(totWts <- colSums(OccYr))
library(Ecdat)
selGDP <- (USGDPpresidents$Year %in% names(totWts))
USpops <- USGDPpresidents[selGDP, ]
ylim <- range(totWts/1e6, USpops$population.K/1e3)
# png('IPUMS HHWT and US Population.png')
plot(names(totWts), totWts/1e6, xlab='',
ylab="millions",
main='sum(HHWT) vs. US Population', las=1)
with(USpops, lines(Year, population.K/1e3))
# dev.off()
}
This plot suggests that HHWT
attempts to weight the observations so the total matches the US population but has problems for 1970 and every year since 2000. A simple fix is to rescale all the numbers so they match. Let's first check by plotting the ratio:
if(readAndCompute){
tot_pop <- (totWts / (USpops$population.K*1000))
plot(USpops$Year, tot_pop, type='b', las=1)
abline(h=1, lty='dotted', col='red')
}
We can make these numbers match by dividing OccYr
by tot_pop
:
if(readAndCompute){
nOcc <- nrow(OccYr)
Occ1950 <- (OccYr / rep(tot_pop, e=nOcc))
(revTots <- colSums(Occ1950))
plot(USpops$Year, revTots/1e6, type='b', las=1)
}
Great. Now let's create OCC1950
= proportion of population:
if(readAndCompute){
OCC1950 <- (Occ1950 / rep(revTots, e=nOcc))
quantile(chkTots <- colSums(OCC1950))
}
Let's make rownames
= occupation names rather than the codes:
if(readAndCompute){
rownames(OCC1950) <- names(Occ1)
}
Let's save this:
if(readAndCompute){
save(OCC1950, file='OCC1950.rda')
dir(full=TRUE)
}
After my initial data extraction, I tried adding OCC
= “Occupation”. Then “Data cart: Your data extract” said, “2 variable, 32 samples”. Then clicking “View Cart” listed 9 variables, being YEAR
, DATANUM
, SERIAL
, HHWT
, GQ
, PERNUM
, PERWT
, OCC
, and OCC1950
. Then I clicked, “Create data extract”. This allowed me to further “select data quality flags” for GQ
, OCC
, and OCC1950
.
The first time I did this, I ignored the data quality flags. The second time, I requested them.
However, with both the additional OCC
and the data quality flags, the data set was bigger than I could read into my computer. So I split that extract into two for seq(1850, 2000, 10) and for 2001:2016. When I read the 2001:2016 extract, I found that I could not understand OCC
nor the data quality flags. I decided that the answer I already had was probably good enough, and it wasn't clear if I could learn enough to justify the work of further study of OCC
and the data quality flags.
In any event, after each data selection, I was given an “estimated size” for the extract (5340 MB for one trial). I entered something to “Describe your extract”. Then I clicked, “Submit extract”. The IPUMS web site responded saying, “Your extract request has been submitted. You will be notified by email at (the email address I had given them) when it has been created.” Under “Data”, it said, “Processing…”.
After a while (47 minutes for one extract on 2018-09-03) I got an email saying my extract was ready for download. I returned to the web site and clicked something under “Data”.
To help with questions about the format, etc., I studied help(pac=ipumsr)
. I found that the ipumsr
package included seven vignettes. One of those is titled “Introduction to ipumsr
- IPUMS Data in R”. From that, I learned that I needed to right-click (ctrl-click
on a Mac) on DDI
under Codebook
and then select “Save link as…”. Moreover, I should NOT do this in Safari. Google Chrome worked for me for this on 2018-09-01 and Firefox worked when I repeated it with a slightly different extract on 2018-09-03.
When I repeated this 2020-03-18, I downloaded "usa_00003.dat"
and "use_00005.dat"
consuming 2.3 and 3.8 GB. Codebooks are also available.
Then I followed the instructions in the “Command File” for R. With one extract, I got, “Error: Error in read_tokens_(data, tokenizer, col_specs, col_names, locale_
, : Evaluation error: vector memory exhausted (limit reached?).” After that, I redid the extract to select less data to file(s) I could read.