The heuristica R package implements heuristic decision models, such as Take The Best (TTB) and a unit-weighted linear model. The models are designed for two-alternative choice tasks, such as which of two schools has a higher drop-out rate. The package also wraps more well-known models like regression and logistic regression into the two-alternative choice framework so all these models can be assessed side-by-side. It provides functions to measure accuracy, such as an overall percentCorrect and, for advanced users, some confusion matrix functions. These measures can be applied in-sample or out-of-sample.
The goal is to make it easy to explore the range of conditions in which simple heuristics are better than more complex models. Optimizing is not always better!
This package is focused on two-alternative choice tasks, e.g. given two schools, which has a higher drop-out rate. The output is categorical, not quantitative.
Here is a subset of data on Chicago public high school drop-out rates. The criterion to predict is the Dropout_Rate, which is in column 2.
schools <- data.frame(Name=c("Bowen", "Collins", "Fenger", "Juarez", "Young"), Dropout_Rate=c(25.5, 11.8, 28.7, 21.6, 4.5), Low_Income_Students=c(82.5, 88.8, 63.2, 84.5, 30.3), Limited_English_Students=c(11.4, 0.1, 0, 28.3, 0.1))
schools
#>      Name Dropout_Rate Low_Income_Students Limited_English_Students
#> 1   Bowen         25.5                82.5                     11.4
#> 2 Collins         11.8                88.8                      0.1
#> 3  Fenger         28.7                63.2                      0.0
#> 4  Juarez         21.6                84.5                     28.3
#> 5   Young          4.5                30.3                      0.1
To fit a model, we give it the data set and the columns to use. In this case, the 2nd column, Dropout_Rate, is the criterion to be predicted. The cues are the following columns, percent of Low_Income_Students and percent of Limited_English_Students. They are at indexes 3 and 4.
Let’s fit two models:
* ttbModel, Take The Best, which uses the highest-validity cue that discriminates (more details below).
* regModel, a version of R’s “lm” function for linear regression wrapped to fit into heuristica’s interface.
library(heuristica)
criterion_col <- 2
ttb <- ttbModel(schools, criterion_col, c(3:4))
reg <- regModel(schools, criterion_col, c(3:4))
What do the fits look like? We can examine Take The Best’s cue validities and the regression coefficients.
ttb$cue_validities
coef(reg)
Both Take The Best and regression give a higher weight to Low_Income_Students than Limited_English_Students, although of course how they use the weights differs. Take The Best will use a lexicographic order, making its prediction based solely on Low_Income_Students as long as the schools have differing values, which they do for all 5 schools in this data set. That means it will ignore Limited_English_Students when predicting on this data set. In contrast, regression will use a weighted sum of both cues, but with the most important cues weighted more.
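To make the idea of cue validity concrete, here is a minimal base-R sketch that does not use heuristica at all. The helper name cue_validity_sketch is hypothetical, and the sketch assumes that validity is the share of discriminating row pairs in which the higher cue value goes with the higher criterion value, and that a cue whose validity falls below 0.5 is used in the reversed direction. That second assumption is consistent with the predictions reported for these schools in this document, but the package's own computation may differ in details.

```r
# Data copied from the example above.
schools <- data.frame(
  Name = c("Bowen", "Collins", "Fenger", "Juarez", "Young"),
  Dropout_Rate = c(25.5, 11.8, 28.7, 21.6, 4.5),
  Low_Income_Students = c(82.5, 88.8, 63.2, 84.5, 30.3),
  Limited_English_Students = c(11.4, 0.1, 0, 28.3, 0.1)
)

# Hypothetical helper (not a heuristica function): fraction of discriminating
# row pairs where the higher cue value goes with the higher criterion value.
cue_validity_sketch <- function(cue, criterion) {
  pairs <- combn(length(criterion), 2)  # each column holds one row pair
  crit_sign <- sign(criterion[pairs[1, ]] - criterion[pairs[2, ]])
  cue_sign <- sign(cue[pairs[1, ]] - cue[pairs[2, ]])
  decided <- cue_sign != 0 & crit_sign != 0  # skip pairs the cue cannot decide
  sum(cue_sign[decided] == crit_sign[decided]) / sum(decided)
}

raw_validity <- sapply(schools[, 3:4], function(cue) {
  cue_validity_sketch(cue, schools$Dropout_Rate)
})

# Assumption: a cue with validity below 0.5 is used "upside down".
direction <- ifelse(raw_validity >= 0.5, 1, -1)
reversed_validity <- pmax(raw_validity, 1 - raw_validity)

print(raw_validity)
print(direction)
print(reversed_validity)
```

Under these assumptions both cues point the "wrong" way in this tiny sample, and after reversal Low_Income_Students has the higher validity (0.6 versus about 0.56), which is why a lexicographic rule would consult it first.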
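To see a model’s predictions, we use the predictPair function. It takes two rows of data, which together comprise a “row pair”, and the fitted model. predictPair outputs three possible values:
* 1 when it predicts the first row has the higher criterion value
* -1 when it predicts the second row has the higher criterion value
* 0 when it guesses or treats the two rows as tied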
In Bowen vs. Collins, it outputs 1, meaning it predicts Bowen has a higher dropout rate. In Bowen vs. Fenger, it outputs -1, meaning it predicts Fenger has a higher dropout rate.
predictPair(subset(schools, Name=="Bowen"), subset(schools, Name=="Collins"), ttb)
predictPair(subset(schools, Name=="Bowen"), subset(schools, Name=="Fenger"), ttb)
Note that the output depends on the order of the rows. In the reversed pair of Collins vs. Bowen, the output is -1. This is consistent because it still picks Bowen, regardless of order.
predictPair(subset(schools, Name=="Collins"), subset(schools, Name=="Bowen"), ttb)
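The sign convention and its order dependence can be illustrated with a small stand-alone sketch of a lexicographic pair rule. The function ttb_pair_sketch is hypothetical and is not how heuristica implements ttbModel; the cue order and the reversed cue directions are hard-coded assumptions chosen to match the predictions described in the text.

```r
schools <- data.frame(
  Name = c("Bowen", "Collins", "Fenger", "Juarez", "Young"),
  Dropout_Rate = c(25.5, 11.8, 28.7, 21.6, 4.5),
  Low_Income_Students = c(82.5, 88.8, 63.2, 84.5, 30.3),
  Limited_English_Students = c(11.4, 0.1, 0, 28.3, 0.1)
)

# Assumed for illustration: cues are consulted in this order, and both are
# used in the reversed direction (a lower cue value suggests higher drop-out).
cue_cols <- c("Low_Income_Students", "Limited_English_Students")
directions <- c(-1, -1)

# Hypothetical lexicographic rule: use the first cue that discriminates.
ttb_pair_sketch <- function(row1, row2) {
  for (i in seq_along(cue_cols)) {
    diff_sign <- sign(row1[[cue_cols[i]]] - row2[[cue_cols[i]]])
    if (diff_sign != 0) {
      return(directions[i] * diff_sign)  # 1 picks row1, -1 picks row2
    }
  }
  0  # no cue discriminates: tie or guess
}

bowen <- subset(schools, Name == "Bowen")
collins <- subset(schools, Name == "Collins")
fenger <- subset(schools, Name == "Fenger")

print(ttb_pair_sketch(bowen, collins))  # 1: Bowen picked
print(ttb_pair_sketch(bowen, fenger))   # -1: Fenger picked
print(ttb_pair_sketch(collins, bowen))  # -1: still Bowen, now the second row
```

Swapping the two rows flips the sign of the output but not the school that is chosen, which is the consistency property noted above.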
It is tedious to predict one row pair at a time, so let’s use heuristica’s predictPairSummary function instead. We simply pass it the data and the heuristics whose predictions we are interested in. It produces a matrix with one row for each row pair, which in this case is 10 rows (5 * 4 / 2).
out <- predictPairSummary(schools, ttb, reg)
# See the first row: It has row indexes.
out[1,]
# Convert indexes to school names for easier interpretation
out_df <- data.frame(out)
out_df$Row1 <- schools$Name[out_df$Row1]
out_df$Row2 <- schools$Name[out_df$Row2]
out_df
The first row shows the Bowen vs. Collins example we considered above. CorrectGreater is 1 and TTB also output 1, so TTB predicted this pair correctly: Bowen really does have a higher drop-out rate. But regression predicted -1 for this row pair, which is incorrect.
predictPairSummary is for beginners. heuristica offers full flexibility in output with the rowPairApply function. After passing it the data, you can pass it any number of generators to make the columns you want. Some examples are below, where we print only the first row.
# Same as predictPairSummary.
out_same <- rowPairApply(schools, rowIndexes(), correctGreater(criterion_col), heuristics(ttb, reg))
out_same[1,]
# Show first the heuristic predictions, then CorrectGreater. No row indexes.
out_simple <- rowPairApply(schools, heuristics(ttb, reg), correctGreater(criterion_col))
out_simple[1,]
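The general pattern of "one output column per function, one output row per row pair" can be sketched in base R. This is only an illustration of the idea: row_pair_apply_sketch and the two column functions are hypothetical names, and heuristica's rowPairApply and its generators are implemented differently.

```r
schools <- data.frame(
  Name = c("Bowen", "Collins", "Fenger", "Juarez", "Young"),
  Dropout_Rate = c(25.5, 11.8, 28.7, 21.6, 4.5),
  Low_Income_Students = c(82.5, 88.8, 63.2, 84.5, 30.3),
  Limited_English_Students = c(11.4, 0.1, 0, 28.3, 0.1)
)

# Hypothetical helper: apply each named function to every row pair and
# collect the results as columns, next to the row indexes of the pair.
row_pair_apply_sketch <- function(data, ...) {
  column_fns <- list(...)
  pairs <- combn(nrow(data), 2)  # 5 rows give 5 * 4 / 2 = 10 pairs
  out <- sapply(column_fns, function(f) {
    apply(pairs, 2, function(p) f(data[p[1], ], data[p[2], ]))
  })
  cbind(Row1 = pairs[1, ], Row2 = pairs[2, ], out)
}

# Which row really has the higher drop-out rate (1 = first, -1 = second).
correct_greater <- function(r1, r2) sign(r1$Dropout_Rate - r2$Dropout_Rate)

# A simple stand-in rule: pick the row with FEWER low-income students.
low_income_rule <- function(r1, r2) {
  -sign(r1$Low_Income_Students - r2$Low_Income_Students)
}

out_sketch <- row_pair_apply_sketch(
  schools,
  CorrectGreater = correct_greater,
  LowIncomeRule = low_income_rule
)
print(out_sketch[1, ])
```

Reordering or dropping the function arguments changes which columns appear, which mirrors how the generator arguments shape the output in the examples above.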
For an overall measure of performance, we can measure the percent of correct inferences for all pairs of schools in the data with percentCorrect, namely the number of correct predictions divided by the total number of predictions. We give the function the data to be predicted (in this case the same as what was fit) and the fitted models to assess.
percentCorrect(schools, ttb, reg)
Take The Best got 60% correct and regression got 50% correct, which is the same as chance.
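The 60% figure can be reproduced by hand with a short base-R sketch. It assumes, as the predictions described earlier suggest, that on this data Take The Best amounts to picking the school with fewer low-income students; percent_correct_sketch is a hypothetical helper, not the package's percentCorrect, and it ignores how ties or guesses would be scored because none occur here.

```r
schools <- data.frame(
  Name = c("Bowen", "Collins", "Fenger", "Juarez", "Young"),
  Dropout_Rate = c(25.5, 11.8, 28.7, 21.6, 4.5),
  Low_Income_Students = c(82.5, 88.8, 63.2, 84.5, 30.3)
)

pairs <- combn(nrow(schools), 2)  # all 10 row pairs, one per column

# 1 if the first row of the pair truly has the higher drop-out rate, else -1.
actual <- sign(schools$Dropout_Rate[pairs[1, ]] -
               schools$Dropout_Rate[pairs[2, ]])

# Assumed rule: predict the row with the LOWER low-income percentage.
predicted <- -sign(schools$Low_Income_Students[pairs[1, ]] -
                   schools$Low_Income_Students[pairs[2, ]])

# Hypothetical helper: correct predictions divided by total predictions.
percent_correct_sketch <- function(predicted, actual) {
  100 * mean(predicted == actual)
}

pc <- percent_correct_sketch(predicted, actual)
print(pc)
```

Six of the ten pairs are predicted correctly under this rule, matching the Take The Best result reported above.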
Regression is the best linear unbiased model for the data. But this data had a very small sample size of just 5 schools, and good estimates require more data.
This is an unusual case where TTB actually beat regression in a fitting task. Usually TTB only wins in out-of-sample performance, e.g. fitting 5 schools and then predicting on other schools not used in the fit.
For a more realistic example, see the vignette with cross-validated out-of-sample performance on a complete data set.
Uncomment and execute the line below to get the CRAN version:
# install.packages("heuristica")
Uncomment and execute the line below to get the development version.
# Uncomment and execute the line below if you do not have devtools.
# install.packages("devtools")
# devtools::install_github("jeanimal/heuristica")
# library("heuristica")
The package comes with the following models that you can call with predictPair.
* logRegModel: A logistic regression model, a wrapper around R’s glm. This estimates the probability that one school’s drop-out rate is greater than the other and chooses the school with probability greater than 50%.
* minModel: It searches cues in a random order, making a decision based on the first cue that discriminates (has differing values on the two items / schools).
* regModel: A regression model, a wrapper around R’s lm to make it easier to compare with heuristics. It fits a regression based on the column indices. For predictPair, it predicts the criterion for each item in the pair (e.g. estimates the drop-out rate of each school) and then predicts the item with the higher estimate (higher drop-out rate). (A variant that fits with an intercept, regInterceptModel, is provided in order to confirm prior research results, but it is not recommended for future research.)
* singleCueModel: In the fitting stage, this selects the cue with the highest cue validity. It only uses that cue, and if the cue does not discriminate, it guesses.
* ttbModel: An implementation of Take The Best. In the fitting stage, it sorts cues in order of cue validity. When predicting between two items, it finds the highest-validity cue that discriminates (has differing values on the two items) and bases its prediction on that cue, ignoring other cues. The cue used can vary based on the cue values of the two items.
* ttbGreedyModel: Take The Best using conditional cue validity (rather than cue validity).
* unitWeightModel: A unit-weighted linear model that uses weights of +1 or -1 only. An exception is that a cue with no variance (every value is the same) gets a weight of 0. Inspired by psychologist Robyn Dawes; see citation below.
* validityWeightModel: A cue-validity-weighted linear model. (In some publications, this was called franklinModel after Ben Franklin.)
You can add your own models by also implementing a function related to predictPair, as described in a vignette.
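As a rough illustration of how a unit-weighted linear model differs from a lexicographic one, the base-R sketch below sums the cues with weights of +1, -1, or 0. It is not the package's unitWeightModel: the cue directions are hard-coded assumptions for the five-school example (both cues treated as reversed), the cues are used on their raw scale, and the helper names are hypothetical.

```r
schools <- data.frame(
  Name = c("Bowen", "Collins", "Fenger", "Juarez", "Young"),
  Dropout_Rate = c(25.5, 11.8, 28.7, 21.6, 4.5),
  Low_Income_Students = c(82.5, 88.8, 63.2, 84.5, 30.3),
  Limited_English_Students = c(11.4, 0.1, 0, 28.3, 0.1)
)
cues <- schools[, 3:4]

# Assumed cue directions for this sample (not computed by heuristica here).
directions <- c(-1, -1)

# A cue with no variance (every value the same) gets a weight of 0.
has_variance <- sapply(cues, function(x) length(unique(x)) > 1)
unit_weights <- directions * has_variance

# Score each school with the unit weights; every cue counts equally.
scores <- drop(as.matrix(cues) %*% unit_weights)
names(scores) <- schools$Name

# Hypothetical pair rule: 1 picks the first school, -1 the second, 0 a tie.
unit_weight_pair_sketch <- function(name1, name2) {
  sign(scores[[name1]] - scores[[name2]])
}

print(unit_weights)
print(unit_weight_pair_sketch("Bowen", "Fenger"))
```

Unlike the lexicographic rule, every cue with variance contributes to every prediction, so a large difference on a low-validity cue can outweigh a small difference on a high-validity one.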
The package comes with two data sets used by many heuristic researchers.
* city_population: The 83 German cities with more than 100,000 inhabitants in 1993. All cues are binary. (There is another version called city_population_original that has some transcription errors from the almanac source but exactly matches the data set in Simple Heuristics That Make Us Smart.)
* highschool_dropout: Drop-out rates for all 63 Chicago public high schools plus associated variables like average students per teacher and percent low income students. The data is from 1995. All cues are real-valued but some have N/A values. (This data set does not exactly match that used in Simple Heuristics That Make Us Smart.)
Take The Best was first described in: Gigerenzer, G. & Goldstein, D. G. (1996). “Reasoning the fast and frugal way: Models of bounded rationality”. Psychological Review, 103, 650-669.
All of these heuristics were run on many data sets and analyzed in: Gigerenzer, G., Todd, P. M., & the ABC Group (1999). Simple heuristics that make us smart. New York: Oxford University Press.
The research was also inspired by: Dawes, R. M. (1979). “The robust beauty of improper linear models in decision making”. American Psychologist, 34, 571-582.
Thanks for coding advice and beta testing go to Marcus Buckmann, Daniel G. Goldstein, and Özgür Simsek.