rPref enables the efficient computation of Pareto frontiers (also known as Skylines in the context of databases) and slight generalizations thereof (database preferences). This vignette explains how to compose Skyline queries and preferences, which are finally evaluated on a data set, i.e., how the optimal objects are selected from the data set.
A classical Skyline query optimizes two dimensions of a data set simultaneously. Usually these dimensions tend to anticorrelate. Consider the `mtcars` data set and the dimensions `mpg` (miles per gallon, i.e., inverse fuel consumption) and `hp` (horsepower). To get those cars with low fuel consumption (i.e., a high `mpg` value) and high power, we create the preference and evaluate it on `mtcars`. Using `select` from the dplyr package we restrict our attention to the relevant columns:
```r
p <- high(mpg) * high(hp)
res <- psel(mtcars, p)
knitr::kable(select(res, mpg, hp))
```
|                |  mpg |  hp |
|----------------|-----:|----:|
| Merc 450SL     | 17.3 | 180 |
| Fiat 128       | 32.4 |  66 |
| Toyota Corolla | 33.9 |  65 |
| Lotus Europa   | 30.4 | 113 |
| Ford Pantera L | 15.8 | 264 |
| Ferrari Dino   | 19.7 | 175 |
| Maserati Bora  | 15.0 | 335 |
The `*` operator is the Pareto composition. The result contains all cars from `mtcars` which are not Pareto-dominated according to this preference. This means we are not interested in those cars which are strictly worse in at least one dimension and worse or equal in the other dimension (i.e., which are dominated).
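For illustration only (this helper is not part of rPref), the two-dimensional domination criterion for two `high` preferences can be sketched as a small function:

```r
# Pareto domination for two maximization goals (illustrative sketch):
# x dominates y iff x is at least as good in both dimensions
# and strictly better in at least one of them.
dominates <- function(x, y) {
  all(x >= y) && any(x > y)
}

# Merc 450SL (mpg = 17.3, hp = 180) does not dominate Valiant (18.1, 105),
# since the Valiant has the higher mpg value:
dominates(c(17.3, 180), c(18.1, 105))  # FALSE
# Ford Pantera L (15.8, 264) dominates Duster 360 (14.3, 245):
dominates(c(15.8, 264), c(14.3, 245))  # TRUE
```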
We can add a third optimization goal like minimizing the 1/4 mile time of a car. In addition to the preference selection via `psel`, preference objects can be associated with data sets and then processed via `peval` (preference evaluation). For example,
```r
p <- high(mpg, df = mtcars) * high(hp) * low(qsec)
p
## [Preference] high(mpg) * high(hp) * low(qsec)
##   * associated data source: data.frame "mtcars" [32 x 11]
```
creates a 3-dimensional Pareto preference which is associated with `mtcars`. We can evaluate this preference using `peval(p)`, which returns the Pareto optima:
```r
res <- peval(p)
knitr::kable(select(res, mpg, hp, qsec))
```
|                |  mpg |  hp |  qsec |
|----------------|-----:|----:|------:|
| Mazda RX4      | 21.0 | 110 | 16.46 |
| Merc 450SE     | 16.4 | 180 | 17.40 |
| Merc 450SL     | 17.3 | 180 | 17.60 |
| Fiat 128       | 32.4 |  66 | 19.47 |
| Toyota Corolla | 33.9 |  65 | 19.90 |
| Porsche 914-2  | 26.0 |  91 | 16.70 |
| Lotus Europa   | 30.4 | 113 | 16.90 |
| Ford Pantera L | 15.8 | 264 | 14.50 |
| Ferrari Dino   | 19.7 | 175 | 15.50 |
| Maserati Bora  | 15.0 | 335 | 14.60 |
Using `psel` instead of `peval` we can evaluate the preference on another data set (which does not change the association of `p`). Using the `filter` function from dplyr we first pick all cars with automatic transmission (`am == 0`) and then get the Pareto optima:
```r
res <- mtcars %>% filter(am == 0) %>% psel(p)
knitr::kable(select(res, am, mpg, hp, qsec))
```
|                   | am |  mpg |  hp |  qsec |
|-------------------|---:|-----:|----:|------:|
| Hornet 4 Drive    |  0 | 21.4 | 110 | 19.44 |
| Hornet Sportabout |  0 | 18.7 | 175 | 17.02 |
| Duster 360        |  0 | 14.3 | 245 | 15.84 |
| Merc 240D         |  0 | 24.4 |  62 | 20.00 |
| Merc 230          |  0 | 22.8 |  95 | 22.90 |
| Merc 450SE        |  0 | 16.4 | 180 | 17.40 |
| Merc 450SL        |  0 | 17.3 | 180 | 17.60 |
| Chrysler Imperial |  0 | 14.7 | 230 | 17.42 |
| Toyota Corona     |  0 | 21.5 |  97 | 20.01 |
| Dodge Challenger  |  0 | 15.5 | 150 | 16.87 |
| Camaro Z28        |  0 | 13.3 | 245 | 15.41 |
| Pontiac Firebird  |  0 | 19.2 | 175 | 17.05 |
Database preferences allow some generalizations of Skyline queries, like combining the Pareto order with the lexicographical order. Assume we prefer cars with manual transmission (`am == 1`). If two cars are equivalent according to this criterion, then the higher number of gears should be the decisive criterion. This is known as the lexicographical order and can be realized with
```r
p <- true(am == 1) & high(gear)
```
where `true` is a Boolean preference, under which those tuples are preferred that fulfill the logical condition. The `&` is a non-commutative operator creating a lexicographical order, also called Prioritization in the context of database preferences.
We evaluate this preference on the `mtcars` data set:
```r
res <- psel(mtcars, p)
knitr::kable(select(res, am, gear, hp, cyl))
```
|                | am | gear |  hp | cyl |
|----------------|---:|-----:|----:|----:|
| Porsche 914-2  |  1 |    5 |  91 |   4 |
| Lotus Europa   |  1 |    5 | 113 |   4 |
| Ford Pantera L |  1 |    5 | 264 |   8 |
| Ferrari Dino   |  1 |    5 | 175 |   6 |
| Maserati Bora  |  1 |    5 | 335 |   8 |
The constructs `high`, `low` and `true` are the three base preferences. They also accept arbitrary arithmetic expressions (and, accordingly, logical expressions for `true`). For example, we can Pareto-combine the lexicographical order from above with a wish for a high power per cylinder ratio:
```r
p <- (true(am == 1) & high(gear)) * high(hp/cyl)
res <- psel(mtcars, p)
knitr::kable(select(res, am, gear, hp, cyl))
```
|               | am | gear |  hp | cyl |
|---------------|---:|-----:|----:|----:|
| Maserati Bora |  1 |    5 | 335 |   8 |
According to this preference there is only one Pareto-optimal car.
In the above preference selection we have just one Pareto-optimal tuple for the data set `mtcars`. Probably we are also interested in tuples which are slightly worse than the optimum. rPref offers a top-k preference selection, iterating the preference selection on the remainder of the data set until k tuples are returned. To get the 3 best tuples we use:
```r
res <- psel(mtcars, p, top = 3)
knitr::kable(select(res, am, gear, hp, cyl, .level))
```
|                | am | gear |  hp | cyl | .level |
|----------------|---:|-----:|----:|----:|-------:|
| Maserati Bora  |  1 |    5 | 335 |   8 |      1 |
| Ford Pantera L |  1 |    5 | 264 |   8 |      2 |
| Duster 360     |  0 |    3 | 245 |   8 |      3 |
The column `.level` is additionally added to the result when `psel` is called with the `top` parameter. It counts the number of iterations needed to get this tuple. The k-th level of a Skyline is also called the k-th stratum. We see that the first three tuples have levels {1, 2, 3}. The top-k parameter produces a nondeterministic cut, i.e., there could be more tuples in the third level. To avoid the cut, we use the `at_least` parameter, which returns all tuples from the last level:
```r
res <- psel(mtcars, p, at_least = 3)
knitr::kable(select(res, am, gear, hp, cyl, .level))
```
|                | am | gear |  hp | cyl | .level |
|----------------|---:|-----:|----:|----:|-------:|
| Maserati Bora  |  1 |    5 | 335 |   8 |      1 |
| Ford Pantera L |  1 |    5 | 264 |   8 |      2 |
| Duster 360     |  0 |    3 | 245 |   8 |      3 |
| Camaro Z28     |  0 |    3 | 245 |   8 |      3 |
| Ferrari Dino   |  1 |    5 | 175 |   6 |      3 |
Additionally there is a `top_level` parameter which allows us to state the number of iterations explicitly. The preference selection with `top_level = 3` is identical to the statement above in this case, because just one tuple resides in each of the levels 1 and 2.
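Spelled out, the `top_level` variant reads as follows (assuming the preference `p` from above is still in scope; in this case the result coincides with the `at_least = 3` selection):

```r
# Return all tuples whose level is at most 3:
res <- psel(mtcars, p, top_level = 3)
knitr::kable(select(res, am, gear, hp, cyl, .level))
```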
Using the grouping functionality from the dplyr package, we can perform a preference selection on each group separately. For example, we search for the cars maximizing `mpg` and `hp` within each group of cars with the same number of cylinders. This is done by:
```r
grouped_df <- group_by(mtcars, cyl)
res <- psel(grouped_df, high(hp) * high(mpg))
knitr::kable(select(res, cyl, hp, mpg))
```
| cyl |  hp |  mpg |
|----:|----:|-----:|
|   4 |  66 | 32.4 |
|   4 |  65 | 33.9 |
|   4 | 113 | 30.4 |
|   6 | 110 | 21.4 |
|   6 | 175 | 19.7 |
|   8 | 180 | 17.3 |
|   8 | 175 | 19.2 |
|   8 | 264 | 15.8 |
|   8 | 335 | 15.0 |
The first line is the grouping operation from dplyr and the second line is the preference selection from rPref, which respects the grouping. The result is again a grouped data frame, containing the Pareto optima for each group of cylinders.
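As a quick check that the grouping was respected, we can count the optima per group with dplyr's `summarise` (a small illustration, assuming `res` from the selection above):

```r
# Number of Pareto optima per cylinder group; the counts match the
# table above: 3 optima for cyl == 4, 2 for cyl == 6, 4 for cyl == 8.
res %>% summarise(n_optima = n())
```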