rPref enables the efficient computation of Pareto frontiers (also known as Skylines in the context of databases) and slight generalizations thereof (database preferences). This vignette explains how to compose Skyline queries and preferences, which are finally evaluated on a data set, i.e., how the optimal objects are selected from the data set.
A classical Skyline query optimizes two dimensions of a data set simultaneously. Usually these dimensions tend to anticorrelate. Consider the `mtcars` data set and the dimensions `mpg` (miles per gallon, i.e., inverse fuel consumption) and `hp` (horsepower). To get those cars with low fuel consumption (i.e., a high `mpg` value) and high power, we create the preference and evaluate it on `mtcars`. Using `select` from the dplyr package we restrict our attention to the relevant columns:
```r
p <- high(mpg) * high(hp)
res <- psel(mtcars, p)
knitr::kable(select(res, mpg, hp))
```
|                |  mpg |  hp |
|----------------|-----:|----:|
| Merc 450SL     | 17.3 | 180 |
| Fiat 128       | 32.4 |  66 |
| Toyota Corolla | 33.9 |  65 |
| Lotus Europa   | 30.4 | 113 |
| Ford Pantera L | 15.8 | 264 |
| Ferrari Dino   | 19.7 | 175 |
| Maserati Bora  | 15.0 | 335 |
The `*` operator is the Pareto composition. The result contains all cars from `mtcars` which are not Pareto-dominated according to this preference. This means we are not interested in those cars which are strictly worse in at least one dimension and worse or equal in the other dimension (i.e., which are dominated).
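For illustration only (this helper is not part of rPref), the two-dimensional domination criterion for two `high` preferences can be sketched as a small function:

```r
# Pareto domination for two maximization goals (illustrative sketch):
# x dominates y iff x is at least as good in both dimensions
# and strictly better in at least one of them.
dominates <- function(x, y) {
  all(x >= y) && any(x > y)
}

# Merc 450SL (mpg = 17.3, hp = 180) does not dominate Valiant (18.1, 105),
# since the Valiant has the higher mpg value:
dominates(c(17.3, 180), c(18.1, 105))  # FALSE
# Ford Pantera L (15.8, 264) dominates Duster 360 (14.3, 245):
dominates(c(15.8, 264), c(14.3, 245))  # TRUE
```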
We can add a third optimization goal like minimizing the 1/4 mile time of a car. In addition to the preference selection via `psel`, preference objects can be associated with data sets and then processed via `peval` (preference evaluation). For example,
```r
p <- high(mpg, df = mtcars) * high(hp) * low(qsec)
p
## [Preference] high(mpg) * high(hp) * low(qsec)
##   * associated data source: data.frame "mtcars" [32 x 11]
```
creates a 3-dimensional Pareto preference which is associated with `mtcars`. We can evaluate this preference using `peval(p)`, which returns the Pareto optima:
```r
res <- peval(p)
knitr::kable(select(res, mpg, hp, qsec))
```
|                |  mpg |  hp |  qsec |
|----------------|-----:|----:|------:|
| Mazda RX4      | 21.0 | 110 | 16.46 |
| Merc 450SE     | 16.4 | 180 | 17.40 |
| Merc 450SL     | 17.3 | 180 | 17.60 |
| Fiat 128       | 32.4 |  66 | 19.47 |
| Toyota Corolla | 33.9 |  65 | 19.90 |
| Porsche 914-2  | 26.0 |  91 | 16.70 |
| Lotus Europa   | 30.4 | 113 | 16.90 |
| Ford Pantera L | 15.8 | 264 | 14.50 |
| Ferrari Dino   | 19.7 | 175 | 15.50 |
| Maserati Bora  | 15.0 | 335 | 14.60 |
Using `psel` instead of `peval` we can evaluate the preference on another data set (which does not change the association of `p`). Using the `filter` function from dplyr we first pick all cars with automatic transmission (`am == 0`) and then get the Pareto optima:
```r
res <- mtcars %>% filter(am == 0) %>% psel(p)
knitr::kable(select(res, am, mpg, hp, qsec))
```
|                   | am |  mpg |  hp |  qsec |
|-------------------|---:|-----:|----:|------:|
| Hornet 4 Drive    |  0 | 21.4 | 110 | 19.44 |
| Hornet Sportabout |  0 | 18.7 | 175 | 17.02 |
| Duster 360        |  0 | 14.3 | 245 | 15.84 |
| Merc 240D         |  0 | 24.4 |  62 | 20.00 |
| Merc 230          |  0 | 22.8 |  95 | 22.90 |
| Merc 450SE        |  0 | 16.4 | 180 | 17.40 |
| Merc 450SL        |  0 | 17.3 | 180 | 17.60 |
| Chrysler Imperial |  0 | 14.7 | 230 | 17.42 |
| Toyota Corona     |  0 | 21.5 |  97 | 20.01 |
| Dodge Challenger  |  0 | 15.5 | 150 | 16.87 |
| Camaro Z28        |  0 | 13.3 | 245 | 15.41 |
| Pontiac Firebird  |  0 | 19.2 | 175 | 17.05 |
Database preferences allow some generalizations of Skyline queries, like combining the Pareto order with the lexicographical order. Assume we prefer cars with manual transmission (`am == 1`). If two cars are equivalent according to this criterion, then the higher number of gears should be the decisive criterion. This is known as the lexicographical order and can be realized with
```r
p <- true(am == 1) & high(gear)
```
where `true` is a Boolean preference, under which those tuples are preferred that fulfill the logical condition. The `&` is a non-commutative operator creating a lexicographical order, also called Prioritization in the context of database preferences.
We evaluate this preference on the `mtcars` data set:
```r
res <- psel(mtcars, p)
knitr::kable(select(res, am, gear, hp, cyl))
```
|                | am | gear |  hp | cyl |
|----------------|---:|-----:|----:|----:|
| Porsche 914-2  |  1 |    5 |  91 |   4 |
| Lotus Europa   |  1 |    5 | 113 |   4 |
| Ford Pantera L |  1 |    5 | 264 |   8 |
| Ferrari Dino   |  1 |    5 | 175 |   6 |
| Maserati Bora  |  1 |    5 | 335 |   8 |
The constructs `high`, `low` and `true` are the three base preferences. They also accept arbitrary arithmetic expressions (and, accordingly, logical expressions for `true`). For example, we can Pareto-combine the lexicographical order from above with a wish for a high power per cylinder ratio:
```r
p <- (true(am == 1) & high(gear)) * high(hp/cyl)
res <- psel(mtcars, p)
knitr::kable(select(res, am, gear, hp, cyl))
```
|               | am | gear |  hp | cyl |
|---------------|---:|-----:|----:|----:|
| Maserati Bora |  1 |    5 | 335 |   8 |
According to this preference there is only one Pareto-optimal car.
In the above preference selection we have just one Pareto-optimal tuple for the data set `mtcars`. Probably we are also interested in tuples which are slightly worse than the optimum. rPref offers a top-k preference selection, iterating the preference selection on the remainder of the data set until k tuples are returned. To get the 3 best tuples we use:
```r
res <- psel(mtcars, p, top = 3)
knitr::kable(select(res, am, gear, hp, cyl, .level))
```
|                | am | gear |  hp | cyl | .level |
|----------------|---:|-----:|----:|----:|-------:|
| Maserati Bora  |  1 |    5 | 335 |   8 |      1 |
| Ford Pantera L |  1 |    5 | 264 |   8 |      2 |
| Duster 360     |  0 |    3 | 245 |   8 |      3 |
The column `.level` is additionally added to the result when `psel` is called with the `top` parameter. It counts the number of iterations needed to get this tuple. The k-th level of a Skyline is also called the k-th stratum. We see that the first three tuples have levels {1, 2, 3}. The top-k parameter produces a nondeterministic cut, i.e., there could be more tuples in the third level. To avoid the cut, we use the `at_least` parameter, which returns all tuples from the last level:
```r
res <- psel(mtcars, p, at_least = 3)
knitr::kable(select(res, am, gear, hp, cyl, .level))
```
|                | am | gear |  hp | cyl | .level |
|----------------|---:|-----:|----:|----:|-------:|
| Maserati Bora  |  1 |    5 | 335 |   8 |      1 |
| Ford Pantera L |  1 |    5 | 264 |   8 |      2 |
| Duster 360     |  0 |    3 | 245 |   8 |      3 |
| Camaro Z28     |  0 |    3 | 245 |   8 |      3 |
| Ferrari Dino   |  1 |    5 | 175 |   6 |      3 |
Additionally there is a `top_level` parameter which allows us to state the number of iterations explicitly. The preference selection with `top_level = 3` is identical to the statement above in this case, because just one tuple resides in each of the levels 1 and 2.
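Spelled out, the `top_level` variant reads as follows (assuming the preference `p` from above is still in scope; in this case the result coincides with the `at_least = 3` selection):

```r
# Return all tuples whose level is at most 3:
res <- psel(mtcars, p, top_level = 3)
knitr::kable(select(res, am, gear, hp, cyl, .level))
```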
Using the grouping functionality from the dplyr package, we can perform a preference selection on each group separately. For example, we search for the cars maximizing `mpg` and `hp` within each group of cars with the same number of cylinders. This is done by:
```r
grouped_df <- group_by(mtcars, cyl)
res <- psel(grouped_df, high(hp) * high(mpg))
knitr::kable(select(res, cyl, hp, mpg))
```
| cyl |  hp |  mpg |
|----:|----:|-----:|
|   4 |  66 | 32.4 |
|   4 |  65 | 33.9 |
|   4 | 113 | 30.4 |
|   6 | 110 | 21.4 |
|   6 | 175 | 19.7 |
|   8 | 180 | 17.3 |
|   8 | 175 | 19.2 |
|   8 | 264 | 15.8 |
|   8 | 335 | 15.0 |
The first line is the grouping operation from dplyr and the second line is the preference selection from rPref, which respects the grouping. The result is again a grouped data frame, containing the Pareto optima for each group of cylinders.
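As a quick check that the grouping was respected, we can count the optima per group with dplyr's `summarise` (a small illustration, assuming `res` from the selection above):

```r
# Number of Pareto optima per cylinder group; the counts match the
# table above: 3 optima for cyl == 4, 2 for cyl == 6, 4 for cyl == 8.
res %>% summarise(n_optima = n())
```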