One of the principal strengths of linked plots is the ease with which one can form complex logical queries on the data.
Begin with a classic data set in R, mtcars.
For the sake of illustration, some enrichment of the variables and their values will be made:
data(mtcars, package = "datasets")
mtcars$country <- c("Japan", "Japan", "Japan", "USA", "USA", "USA", "USA",
                    "Germany", "Germany", "Germany", "Germany", "Germany",
                    "Germany", "Germany", "USA", "USA", "USA", "Italy",
                    "Japan", "Japan", "Japan", "USA", "USA", "USA", "USA",
                    "Italy", "Germany", "UK", "USA", "Italy", "Italy", "Sweden")
mtcars$continent <- c("Asia", "Asia", "Asia", "North America", "North America",
"North America", "North America", "Europe", "Europe",
"Europe", "Europe", "Europe", "Europe", "Europe",
"North America", "North America", "North America",
"Europe", "Asia", "Asia", "Asia", "North America",
"North America", "North America", "North America",
"Europe", "Europe", "Europe", "North America",
"Europe", "Europe", "Europe" )
mtcars$company <- c("Mazda", "Mazda", "Nissan", "AMC", "AMC", "Chrysler",
"Chrysler", "Mercedes", "Mercedes", "Mercedes", "Mercedes",
"Mercedes", "Mercedes", "Mercedes", "GM", "Ford",
"Chrysler", "Fiat", "Honda", "Toyota", "Toyota", "Chrysler",
"AMC", "GM", "GM", "Fiat", "Porsche", "Lotus", "Ford",
"Ferrari", "Maserati", "Volvo")
mtcars$Engine <- factor(c("V-shaped", "Straight")[mtcars$vs +1],
levels = c("V-shaped", "Straight"))
mtcars$Transmission <- factor(c("automatic", "manual")[mtcars$am +1],
levels = c("automatic", "manual"))
mtcars$vs <- NULL # These are redundant now
mtcars$am <- NULL #
For this illustration, it will be convenient to separate categorical from continuous data.
varTypes <- split(names(mtcars),
                  sapply(mtcars,
                         FUN = function(x) {
                           if (is.factor(x) || is.character(x)) {
                             "categorical"
                           } else {
                             "numeric"
                           }
                         }))
varTypes is a list with two named components: categorical and numeric.
To explore the data, several interactive plots will likely have been constructed. Typically, these will have been constructed one at a time and assigned to the same linking group (perhaps via the inspector).
Below, a histogram/barplot is constructed for each categorical variable and assigned to an R variable named after it, prefixed by h_ for “histogram”.
for (varName in varTypes$categorical) {
assign(paste0("h_", varName),
l_hist(mtcars[ , varName], showFactors = TRUE,
xlabel = varName, title = varName,
linkingGroup = "Motor Trend"))
}
These are not evaluated in this vignette. Note that all are in the same linkingGroup.
Other linked plots might exist as well, for example a scatterplot of cyl (the number of cylinders) versus disp (the engine displacement in cubic inches).
p <- with(mtcars, l_plot(disp, cyl,
xlabel = "engine displacement", ylabel = "number of cylinders",
title = "1974 Motor Trend cars",
linkingGroup = "Motor Trend",
size = 10, showScales = TRUE,
itemLabel = rownames(mtcars), showItemLabels = TRUE
))
Note that
- each car’s name appears as the itemLabel for that point in the plot (to be revealed as a “tooltip” style pop up), and
- the plot p is in the same linking group as the histograms.
Through a combination of selection, inversion, deactivation, and reactivation, logical queries may be made interactively on the data.
For simplicity, the basic logical operators are illustrated below using only the histograms. More generally, these apply to any interactive loon graphic.
Five logical conditions/operations illustrated here are the basic ones:
1. A is TRUE
2. (NOT A) is TRUE
3. (A OR B) is TRUE (one or the other or both)
4. (A AND B) are both TRUE
5. (A XOR B), meaning (A is TRUE) or (B is TRUE) but (A AND B) is FALSE

Each of these corresponds to a sequence of actions on the plots and/or inspector. Whatever is highlighted in the end corresponds to the result.
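To check an interactive result against the data, the five conditions can also be written as logical vectors in base R. The following is a minimal sketch and not part of loon; it assumes that A is "manual transmission" and B is "straight engine", both taken from the original am and vs columns of datasets::mtcars (that is, before the enrichment above removes them).

```r
# Sketch only: base R counterparts of the five conditions.
# Assumption: A and B are built from the original am and vs columns.
cars <- datasets::mtcars

A <- cars$am == 1   # manual transmission
B <- cars$vs == 1   # straight engine

not_A   <- !A
A_or_B  <- A | B
A_and_B <- A & B
A_xor_B <- (A | B) & !(A & B)   # same result as xor(A, B)

# Names of the cars satisfying the exclusive or
rownames(cars)[A_xor_B]
```

Comparing such a vector with what ends up highlighted in the linked plots is one way to confirm that a sequence of interactive steps did what was intended.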
Again, for simplicity, all operations are illustrated by interacting with values of categorical variates in the various histograms. Any of the logical elements could equally be a numerical constraint, selected by undertaking the corresponding actions on a scatterplot (or a histogram of continuous values).
Each logical operator is illustrated in turn:
1. A (\(= A\))
   - on the plot select A,
     e.g., click on the "manual" bar from the Transmission histogram
   - highlighted \(\iff\) Transmission == "manual" is TRUE
2. NOT A (\(= \overline{A}~~\) or \(~~\neg A\))
   - on a plot select A,
   - from the inspector click invert,
     e.g., click on the "North America" bar from the continent histogram, then invert
   - highlighted \(\iff\) continent == "North America" is FALSE
   - all that is highlighted is not "North America", namely "Asia" or "Europe"
3. A OR B (\(= A \cup B~~\) or \(~~A \lor B\))
   - on a plot select A,
   - on the same (or a different but linked) plot <SHIFT>-select B,
     e.g., click on the "manual" bar from the Transmission histogram, then while holding down the <SHIFT> key, click on the Mercedes bar in the company histogram
   - highlighted \(\iff\) Transmission == "manual" is TRUE OR company == "Mercedes" is TRUE (or both)
4. A AND B (\(= A \cap B\) or \(A \land B\))
   - there are many solutions; here is one that always works
   - on a plot select A,
   - from the inspector, invert then deactivate (only A remains),
   - from a plot of the remaining select B,
   - from the inspector reactivate all
   - elements are highlighted \(\iff A \cap B\)
   - e.g. try highlighting all European cars with manual transmissions.
5. A XOR B (\(= (A \cup B) \cap (\overline{A \cap B})\) or \((A \lor B) \land \neg(A \land B)\))
   - following steps in 4, select A AND B,
   - from the inspector invert then deactivate (only \(\neg(A \land B)\) remains)
   - following steps in 3, select A OR B,
   - from the inspector reactivate (only A XOR B is highlighted)
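The select, invert, deactivate, reactivate recipe for A AND B can be traced with two logical vectors standing in for the plot states. This is only an illustrative sketch in base R: no loon plot is created, the names selected and active are used here simply as bookkeeping vectors, and the assumption that invert acts only on the currently active points is a simplification made for the example.

```r
# Sketch only: mimic the interactive steps for "A AND B".
# Assumption: A and B come from the original am and vs columns.
cars <- datasets::mtcars
A <- cars$am == 1   # e.g. the "manual" bar
B <- cars$vs == 1   # e.g. the "Straight" bar

n        <- nrow(cars)
selected <- rep(FALSE, n)   # what is highlighted
active   <- rep(TRUE, n)    # what is currently displayed

selected <- A                    # select A
selected <- !selected & active   # invert: everything that is not A
active[selected] <- FALSE        # deactivate: only A remains displayed
selected <- B & active           # select B among what remains
active[] <- TRUE                 # reactivate all

# The highlighted points are exactly those satisfying A AND B
identical(selected, A & B)
```

The same bookkeeping extends to the exclusive or: deactivate the complement of A AND B first, then select A OR B among what remains.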
Other logical conditions (including numerical ones such as disp > 300 on the scatterplot p) are constructed as a combination of the above (as in exclusive or).
These can be quite complex and it may help, after some number of steps, to mark intermediary results by colour (or also glyph in scatterplots).
Note that because of possibly missing data, not all linked plots may share the same set of observations.
The mtcars data is an example of a complete data set.
Had there been missing values, then these would not appear in loon plots that require them.
For example, suppose data has four variables A, B, C, and D, and
data <- data.frame(A = sample(c(rnorm(10), NA), 10, replace = FALSE),
B = sample(c(rnorm(10), NA), 10, replace = FALSE),
C = sample(c("firebrick", "steelblue", NA), 10, replace = TRUE),
D = sample(c(1:10, NA), 10, replace = FALSE))
p_test <- l_plot(x = data$A, y = data$B, color = data$C, linkingGroup = "test missing")
h_test <- l_hist(x = data$D, color = data$C, linkingGroup = "test missing")
Then
- wherever an NA appears in any of A, B, or C, that point will be missing from p_test.
  Note that it is generally not a good idea to use C for any simple display characteristic like color if indeed C has missing values, since this will remove non-missing x and y values from the plot. Not all values of x and y would then be accessible from the plot for logical queries.
- wherever an NA appears in either of C or D, that point will be missing from h_test.
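The rows each plot would retain can be worked out from the data alone. The sketch below uses only base R and creates no loon plot; it assumes, as described above, that a point is dropped whenever any variable used by the plot is NA. The seed is arbitrary and only makes the example reproducible.

```r
# Sketch only: which rows of data each plot would be expected to keep.
set.seed(314)  # arbitrary seed, for reproducibility only
data <- data.frame(A = sample(c(rnorm(10), NA), 10, replace = FALSE),
                   B = sample(c(rnorm(10), NA), 10, replace = FALSE),
                   C = sample(c("firebrick", "steelblue", NA), 10, replace = TRUE),
                   D = sample(c(1:10, NA), 10, replace = FALSE))

# p_test uses A (x), B (y) and C (color); h_test uses D (x) and C (color)
in_p_test <- complete.cases(data[, c("A", "B", "C")])
in_h_test <- complete.cases(data[, c("C", "D")])

which(in_p_test)   # rows expected to appear in p_test
which(in_h_test)   # rows expected to appear in h_test
```

Because the two plots depend on different variables, the two sets of rows need not agree.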
Using logical operations on the original data to change plot properties (e.g. select values) can be challenging when data values are missing in the plot (since what is missing depends on what was missing at the time of its construction).
For example, using data$A > 0 directly to select points in p_test may not work!
The result of the logical operation on the data (data$A > 0) will typically be longer than the corresponding x values p_test["x"] in the plot and so will not work.
Even if the logical vector is of the right length (and contains no NAs itself), the values may not correctly match the data points.
There are two general approaches to logical queries when data contains NAs.
1. Using complete data
   If, like mtcars, the data being used contains no NAs then conducting logical queries on the plot will be identical to conducting them on the data.
   If the data is not complete (contains one or more NA), it can be made complete by removing all observations (rows) that contain an NA. E.g. replacing data by c_data <- na.omit(data).
   - Any logic on c_data will match that on plots made from c_data.
   - Depending on the amount and pattern of missing data, this could critically reduce the amount of data in the analysis.
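A small illustration of the complete-data approach follows. The data frame toy and its values are hypothetical, chosen only so that the effect of na.omit() is easy to see.

```r
# Sketch only: toy is a hypothetical data frame with two incomplete rows.
toy <- data.frame(A = c(0.5, NA, -1.2, 2.0),
                  C = c("firebrick", "steelblue", NA, "firebrick"))

c_toy <- na.omit(toy)        # keeps only the complete rows

nrow(toy) - nrow(c_toy)      # number of observations lost
attr(c_toy, "na.action")     # which rows were removed

# A logical built on c_toy has one entry per remaining row
keep <- c_toy$A > 0
length(keep) == nrow(c_toy)
```

Here half of the observations are lost, which shows how quickly this approach can shrink the data when NAs are spread over different rows.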
2. Using the information in the loon plots. This is the recommended approach when data is missing.
   Logical queries can then be made
   - directly on the plots, e.g. p_test["x"] > 0 in place of data$A > 0, or
   - directly on the data and applied to the plots.
     To help manage this, the linkingKey of each plot can be used.
     - The default value for each plot is a character vector with entries from "0" to "n-1" where n = nrow(data).
     - These are easily turned into the row numbers for the original data. E.g. in p_test the row numbers of data that correspond to the points are 1 + as.numeric(p_test["linkingKey"]).
     - Logical values for the rows of data can then select points in p_test by indexing them with these row numbers, e.g. (data$A > 0)[1 + as.numeric(p_test["linkingKey"])] has one value per point in the plot.
     - Similarly for h_test. E.g., compare p_test["linkingKey"] and h_test["linkingKey"].
Note: the user can always provide their own character vector linkingKey for their plots, e.g. linkingKey = rownames(data).
If so, then more care may be needed to use these to identify rows in a logical vector.
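The mapping from linking keys back to rows of the data can be sketched without creating a plot. In the example below, the data frame and the vector linkingKey are hypothetical stand-ins: linkingKey plays the role of p_test["linkingKey"] under the assumption stated above, namely default keys "0" to "n-1" with the keys of dropped (NA) points absent.

```r
# Sketch only; no loon plot is created.
data <- data.frame(A = c(0.5, NA, -1.2, 2.0, 0.1),
                   B = c(1.0, 0.3, 0.7, NA, -0.4))

kept       <- complete.cases(data)            # rows the plot would show
linkingKey <- as.character(which(kept) - 1)   # stand-in for p_test["linkingKey"]

rows <- 1 + as.numeric(linkingKey)   # back to row numbers of data
sel  <- (data$A > 0)[rows]           # one logical value per plotted point

# With a real plot, sel could then be assigned to a logical state of p_test
data.frame(row = rows, A = data$A[rows], selected = sel)
```

With user-supplied keys such as rownames(data), something like match(linkingKey, rownames(data)) would take the place of the arithmetic on the keys.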
Loon’s linking model has the following three parts:
- linkingGroup, which identifies which plots are linked,
- linkingKey, a character vector where each element is a key uniquely identifying a single observation in the plot (no two observations in the same plot can have the same value in the linking key), and
- the linked states of each plot (see l_getLinkedStates()).

Observations in different plots (in the same linking group) are linked (in that their linked states change together) if and only if they have the same linking key.
Points appearing in different plots (in the same linkingGroup) which match on the value of their linkingKey will share the same value for their linked states.
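The key-matching rule can be pictured with two small data frames. This is a conceptual sketch in base R and not loon's implementation; plot1, plot2 and their contents are hypothetical, with selected standing in for a linked state.

```r
# Sketch only: two hypothetical "plots" in the same linking group.
plot1 <- data.frame(linkingKey = c("0", "1", "2", "3"),
                    selected   = c(TRUE, FALSE, TRUE, FALSE))
plot2 <- data.frame(linkingKey = c("3", "2", "7"),
                    selected   = FALSE)

# Copy the linked state from plot1 to plot2 wherever the keys match
idx    <- match(plot2$linkingKey, plot1$linkingKey)  # NA where there is no match
shared <- !is.na(idx)
plot2$selected[shared] <- plot1$selected[idx[shared]]

plot2
```

The point with key "7" has no counterpart in plot1, so its state is left alone; only points with matching keys change together.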