library(ivs)
library(clock)
library(dplyr, warn.conflicts = FALSE)
library(tidyr, warn.conflicts = FALSE)
ivs (pronounced “eye-vees”) is a package dedicated to working with intervals in a generic way. It introduces a new type, the interval vector, which is generally referred to as an iv. An iv is typically created from two parallel vectors representing the starts and ends of the intervals, like this:
# Interval vector of integers
iv(1:5, 7:11)
#> <iv<integer>[5]>
#> [1] [1, 7) [2, 8) [3, 9) [4, 10) [5, 11)
# Interval vector of dates
starts <- as.Date("2019-01-01") + 0:2
ends <- starts + c(2, 5, 10)
iv(starts, ends)
#> <iv<date>[3]>
#> [1] [2019-01-01, 2019-01-03) [2019-01-02, 2019-01-07) [2019-01-03, 2019-01-13)
The neat thing about interval vectors is that they are generic, so you can create them from any comparable type that is supported by vctrs. For example, the integer64 type from the bit64 package:
start <- bit64::as.integer64("900000000000")
end <- start + 1234
iv(start, end)
#> <iv<integer64>[1]>
#> [1] [900000000000, 900000001234)
Or the year-month type from clock:
start <- year_month_day(c(2019, 2020), c(1, 3))
end <- year_month_day(c(2020, 2020), c(2, 6))
iv(start, end)
#> <iv<year_month_day<month>>[2]>
#> [1] [2019-01, 2020-02) [2020-03, 2020-06)
The rest of this vignette explores some of the useful things that you can do with ivs.
As mentioned above, ivs are created from two parallel vectors representing the starts and ends of the intervals.
x <- iv(1:3, 4:6)
x
#> <iv<integer>[3]>
#> [1] [1, 4) [2, 5) [3, 6)
You can access the starts with iv_start() and the ends with iv_end():
iv_start(x)
#> [1] 1 2 3
iv_end(x)
#> [1] 4 5 6
You can use an iv as a column in a data frame or tibble and it’ll work just fine!
tibble(x = x)
#> # A tibble: 3 × 1
#> x
#> <iv<int>>
#> 1 [1, 4)
#> 2 [2, 5)
#> 3 [3, 6)
The only interval type that is supported by ivs is a right-open interval, i.e. [a, b). While this might seem restrictive, it rarely ends up being a problem in practice, and often it aligns with the easiest way to express a particular interval.
For example, consider an interval that spans the entire day of 2019-01-02. If you wanted to represent this interval with second precision with a right-open interval, you’d do [2019-01-02 00:00:00, 2019-01-03 00:00:00). This nicely captures the exclusive “end” of the interval as the start of the next day. This also means it exactly aligns with the start of the next day’s interval, [2019-01-03 00:00:00, 2019-01-04 00:00:00).
If you wanted to represent this with a closed interval, you might do [2019-01-02 00:00:00, 2019-01-02 23:59:59]. Not only is this a bit awkward, it can also cause issues if the precision changes! Say you wanted to increase the precision of this interval from second precision to millisecond precision. The right-open interval wouldn’t have to change at all since the end of that interval is set to 2019-01-03 00:00:00 and anything before that is fair game. But the closed interval can’t be naively changed from 2019-01-02 23:59:59 to 2019-01-02 23:59:59.000, as you’d lose the 999 milliseconds in that last second. Extra care would have to be taken to set the end to 2019-01-02 23:59:59.999.
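The precision argument can be checked with a small base R sketch. The helper in_right_open() is a hypothetical name used only for illustration and is not part of ivs; the sketch tests a timestamp with sub-second precision against both representations of the day:

```r
# Hypothetical helper (not part of ivs): membership in a right-open interval
in_right_open <- function(x, start, end) {
  start <= x & x < end
}

day_start  <- as.POSIXct("2019-01-02 00:00:00", tz = "UTC")
day_end    <- as.POSIXct("2019-01-03 00:00:00", tz = "UTC")  # exclusive end
closed_end <- as.POSIXct("2019-01-02 23:59:59", tz = "UTC")  # inclusive end

# A timestamp half a second before midnight, i.e. with sub-second precision
x <- closed_end + 0.5

# The right-open interval contains it without any adjustment
right_open_hit <- in_right_open(x, day_start, day_end)

# The closed interval with a second-precision end misses it
closed_hit <- day_start <= x & x <= closed_end

c(right_open = right_open_hit, closed = closed_hit)
```

The right-open check needs no knowledge of the precision of x, which is the property the paragraph above relies on.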
If you still aren’t convinced, I’d encourage you to take a look at these resources that also advocate for right-open intervals:
In ivs, it is required that start < end when generating an interval vector. This means that intervals like [5, 2) are invalid, but it also means that an “empty” interval of [5, 5) is also invalid. Practically, I’ve found that attempting to allow these ends up resulting in more implementation headaches than anything else, and they don’t end up having very many uses.
One of the most compelling reasons to use this package is that it tries to make finding overlapping intervals as easy as possible. iv_locate_overlaps() takes two ivs and returns a data frame containing information about where they overlap. It works somewhat like base::match() in that for each element of needles, it looks for a match in all of haystack. Unlike match(), it actually returns all of the overlaps rather than just the first.
# iv_pairs() is a useful way to create small ivs from individual intervals
needles <- iv_pairs(c(1, 5), c(3, 7), c(10, 12))
needles
#> <iv<double>[3]>
#> [1] [1, 5) [3, 7) [10, 12)
haystack <- iv_pairs(c(0, 6), c(13, 15), c(0, 2), c(7, 8), c(4, 5))
haystack
#> <iv<double>[5]>
#> [1] [0, 6) [13, 15) [0, 2) [7, 8) [4, 5)
locations <- iv_locate_overlaps(needles, haystack)
locations
#> needles haystack
#> 1 1 1
#> 2 1 3
#> 3 1 5
#> 4 2 1
#> 5 2 5
#> 6 3 NA
The $needles column of the result is an integer vector showing where to slice needles to generate the intervals that overlap the intervals in haystack described by the $haystack column. When a needle doesn’t overlap with any intervals in the haystack, an NA location is returned. An easy way to align both needles and haystack using this information is to pass everything to iv_align(), which will automatically perform the slicing and store the results in another data frame:
iv_align(needles, haystack, locations = locations)
#> needles haystack
#> 1 [1, 5) [0, 6)
#> 2 [1, 5) [0, 2)
#> 3 [1, 5) [4, 5)
#> 4 [3, 7) [0, 6)
#> 5 [3, 7) [4, 5)
#> 6 [10, 12) [NA, NA)
If you just wanted to know if an interval in needles overlapped any interval in haystack, then you can use iv_overlaps(), which returns a logical vector.
iv_overlaps(needles, haystack)
#> [1] TRUE TRUE FALSE
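The overlap rule behind the results above can be sketched in base R. This is only an illustration of the logic, not how ivs implements it: two right-open intervals overlap when each one starts before the other ends. The variable names below are chosen for this sketch.

```r
# Starts and ends of the `needles` and `haystack` intervals used above
n_start <- c(1, 3, 10)
n_end   <- c(5, 7, 12)
h_start <- c(0, 13, 0, 7, 4)
h_end   <- c(6, 15, 2, 8, 5)

# overlaps[i, j] is TRUE when needle i overlaps haystack interval j:
# needle i starts before haystack j ends, and ends after haystack j starts
overlaps <- outer(n_start, h_end, `<`) & outer(n_end, h_start, `>`)

# (needle, haystack) index pairs, comparable to iv_locate_overlaps()
# without the NA row for unmatched needles
which(overlaps, arr.ind = TRUE)

# One logical per needle, comparable to iv_overlaps()
any_overlap <- rowSums(overlaps) > 0
any_overlap
```

Because the comparisons are strict, intervals that merely touch, such as [3, 7) and [7, 8), are not reported as overlapping.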
By default, iv_locate_overlaps() will detect if there is any kind of overlap between the two inputs, but there are various other types of overlaps that you can detect. For example, you can check if needles “contains” haystack:
locations <- iv_locate_overlaps(
  needles,
  haystack,
  type = "contains",
  no_match = "drop"
)
iv_align(needles, haystack, locations = locations)
#> needles haystack
#> 1 [1, 5) [4, 5)
#> 2 [3, 7) [4, 5)
I’ve also used no_match = "drop" to drop all of the needles that don’t have any matching overlaps.
You can also check for the reverse, i.e. if needles is “within” the haystack:
locations <- iv_locate_overlaps(
  needles,
  haystack,
  type = "within",
  no_match = "drop"
)
iv_align(needles, haystack, locations = locations)
#> needles haystack
#> 1 [1, 5) [0, 6)
Two other functions that are related to iv_locate_overlaps() are iv_locate_precedes() and iv_locate_follows().
# Where does `needles` precede `haystack`?
locations <- iv_locate_precedes(needles, haystack)
locations
#> needles haystack
#> 1 1 2
#> 2 1 4
#> 3 2 2
#> 4 2 4
#> 5 3 2
This returns a data frame of the same structure as iv_locate_overlaps(), so you can use it with iv_align().
iv_align(needles, haystack, locations = locations)
#> needles haystack
#> 1 [1, 5) [13, 15)
#> 2 [1, 5) [7, 8)
#> 3 [3, 7) [13, 15)
#> 4 [3, 7) [7, 8)
#> 5 [10, 12) [13, 15)
# Where does `needles` follow `haystack`?
locations <- iv_locate_follows(needles, haystack)
iv_align(needles, haystack, locations = locations)
#> needles haystack
#> 1 [1, 5) [NA, NA)
#> 2 [3, 7) [0, 2)
#> 3 [10, 12) [0, 6)
#> 4 [10, 12) [0, 2)
#> 5 [10, 12) [7, 8)
#> 6 [10, 12) [4, 5)
If you are only interested in the closest interval in haystack that the needle precedes or follows, set closest = TRUE.
locations <- iv_locate_follows(
  needles = needles,
  haystack = haystack,
  closest = TRUE,
  no_match = "drop"
)
iv_align(needles, haystack, locations = locations)
#> needles haystack
#> 1 [3, 7) [0, 2)
#> 2 [10, 12) [7, 8)
Maintaining Knowledge about Temporal Intervals is a great paper by James Allen that outlines an interval algebra that completely describes how any two intervals are related to each other (i.e. if one interval precedes, overlaps, or is met-by another interval). The paper describes 13 relations that make up this algebra, which are faithfully implemented in iv_locate_relates() and iv_relates(). These relations are extremely useful because they are distinct (i.e. two intervals can only be related by exactly 1 of the 13 relations), but they are a bit too restrictive to be practically useful. iv_locate_overlaps(), iv_locate_precedes(), and iv_locate_follows() combine several of the individual relations into three broad ideas that I find most useful. If you want to learn more about this, I’d encourage you to read the help documentation for iv_locate_relates().
Often you just want to know if a vector of values falls between the bounds of an interval. This is particularly common with dates, where you might want to know if a sale you made corresponded to an interval range when any commercial was being run.
sales <- as.Date(c("2019-01-01", "2020-05-10", "2020-06-10"))
commercial_starts <- as.Date(c(
  "2019-10-12", "2020-04-01", "2020-06-01", "2021-05-10"
))
commercial_ends <- commercial_starts + 90
commercials <- iv(commercial_starts, commercial_ends)
sales
#> [1] "2019-01-01" "2020-05-10" "2020-06-10"
commercials
#> <iv<date>[4]>
#> [1] [2019-10-12, 2020-01-10) [2020-04-01, 2020-06-30) [2020-06-01, 2020-08-30)
#> [4] [2021-05-10, 2021-08-08)
You can check if a sale was made while any commercial was being run with iv_between(), which works like %in% and is similar to iv_overlaps():
tibble(sales = sales) %>%
mutate(commercial_running = iv_between(sales, commercials))
#> # A tibble: 3 × 2
#> sales commercial_running
#> <date> <lgl>
#> 1 2019-01-01 FALSE
#> 2 2020-05-10 TRUE
#> 3 2020-06-10 TRUE
You can find the commercials that were airing when the sale was made with iv_locate_between() and iv_align():
iv_align(sales, commercials, locations = iv_locate_between(sales, commercials))
#> needles haystack
#> 1 2019-01-01 [NA, NA)
#> 2 2020-05-10 [2020-04-01, 2020-06-30)
#> 3 2020-06-10 [2020-04-01, 2020-06-30)
#> 4 2020-06-10 [2020-06-01, 2020-08-30)
If you aren’t looking for the %in%-like behavior of iv_between(), and instead want to pairwise detect whether one value falls between an interval or not, you can use iv_pairwise_between():
x <- c(1, 5, 10, 12)
x
#> [1] 1 5 10 12
y <- iv_pairs(c(0, 6), c(7, 9), c(10, 12), c(10, 12))
y
#> <iv<double>[4]>
#> [1] [0, 6) [7, 9) [10, 12) [10, 12)
iv_pairwise_between(x, y)
#> [1] TRUE FALSE TRUE FALSE
Keep in mind that the intervals are half-open, so 12 doesn’t fall between the interval of [10, 12)! This is different from dplyr::between().
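The difference comes down to the comparison used at the upper bound. As a minimal base R sketch (the start and end vectors below simply restate the intervals in y, and nothing here calls ivs):

```r
x     <- c(1, 5, 10, 12)
start <- c(0, 7, 10, 10)
end   <- c(6, 9, 12, 12)

# Half-open check: the upper bound is exclusive, as in iv_pairwise_between()
half_open <- x >= start & x < end
half_open

# Closed check: the upper bound is inclusive, which is the convention
# that dplyr::between() follows
closed <- x >= start & x <= end
closed
```

The two results differ only in the last element, where 12 sits exactly on the end of [10, 12).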
Sometimes you just need the counts of the number of overlaps rather than the actual locations of them. For example, say your business has a subscription service and you’d like to compute a rolling monthly count of the total number of subscriptions that are active (i.e. in January 2019, how many subscriptions were active?). Customers are only allowed to have one subscription active at once, but they may cancel it and reactivate it at any time. If a customer was active at any point during the month, then they are counted in that month.
enrollments <- tribble(
  ~name, ~start, ~end,
"Amy", "1, Jan, 2017", "30, Jul, 2018",
"Franklin", "1, Jan, 2017", "19, Feb, 2017",
"Franklin", "5, Jun, 2017", "4, Feb, 2018",
"Franklin", "21, Oct, 2018", "9, Mar, 2019",
"Samir", "1, Jan, 2017", "4, Feb, 2017",
"Samir", "5, Apr, 2017", "12, Jun, 2018"
)
# Parse these into "day" precision year-month-day objects
enrollments <- enrollments %>%
  mutate(
    start = year_month_day_parse(start, format = "%d, %b, %Y"),
    end = year_month_day_parse(end, format = "%d, %b, %Y"),
  )
enrollments
#> # A tibble: 6 × 3
#> name start end
#> <chr> <ymd<day>> <ymd<day>>
#> 1 Amy 2017-01-01 2018-07-30
#> 2 Franklin 2017-01-01 2017-02-19
#> 3 Franklin 2017-06-05 2018-02-04
#> 4 Franklin 2018-10-21 2019-03-09
#> 5 Samir 2017-01-01 2017-02-04
#> 6 Samir 2017-04-05 2018-06-12
Even though we have day precision information, we only actually need month precision intervals to answer this question. We’ll use calendar_narrow() from clock to convert our "day" precision dates to "month" precision ones. We’ll also add 1 month to the end intervals to reflect the fact that the end month is open (remember, ivs are half-open).
enrollments <- enrollments %>%
  mutate(
    start = calendar_narrow(start, "month"),
    end = calendar_narrow(end, "month") + 1L
  )
enrollments
#> # A tibble: 6 × 3
#> name start end
#> <chr> <ymd<month>> <ymd<month>>
#> 1 Amy 2017-01 2018-08
#> 2 Franklin 2017-01 2017-03
#> 3 Franklin 2017-06 2018-03
#> 4 Franklin 2018-10 2019-04
#> 5 Samir 2017-01 2017-03
#> 6 Samir 2017-04 2018-07
enrollments <- enrollments %>%
  mutate(active = iv(start, end), .keep = "unused")
enrollments
#> # A tibble: 6 × 2
#> name active
#> <chr> <iv<ymd<month>>>
#> 1 Amy [2017-01, 2018-08)
#> 2 Franklin [2017-01, 2017-03)
#> 3 Franklin [2017-06, 2018-03)
#> 4 Franklin [2018-10, 2019-04)
#> 5 Samir [2017-01, 2017-03)
#> 6 Samir [2017-04, 2018-07)
To answer this question, we are going to need to create a sequential vector of months that span the entire range of intervals. This starts at the smallest start and goes to the largest end. Because the end is half-open, there won’t be any hits for that month, so we won’t include it.
bounds <- range(enrollments$active)
lower <- iv_start(bounds[[1]])
upper <- iv_end(bounds[[2]]) - 1L
months <- tibble(month = seq(lower, upper, by = 1))
months
#> # A tibble: 27 × 1
#> month
#> <ymd<month>>
#> 1 2017-01
#> 2 2017-02
#> 3 2017-03
#> 4 2017-04
#> 5 2017-05
#> 6 2017-06
#> 7 2017-07
#> 8 2017-08
#> 9 2017-09
#> 10 2017-10
#> # … with 17 more rows
Now we need to add a column to months to represent the number of subscriptions that were active in that month. To do this we can use iv_count_between(). It works like iv_between() and iv_locate_between() but returns an integer vector corresponding to the number of times the i-th “needle” value fell between any of the values in the “haystack”.
months %>%
  mutate(count = iv_count_between(month, enrollments$active)) %>%
  print(n = Inf)
#> # A tibble: 27 × 2
#> month count
#> <ymd<month>> <int>
#> 1 2017-01 3
#> 2 2017-02 3
#> 3 2017-03 1
#> 4 2017-04 2
#> 5 2017-05 2
#> 6 2017-06 3
#> 7 2017-07 3
#> 8 2017-08 3
#> 9 2017-09 3
#> 10 2017-10 3
#> 11 2017-11 3
#> 12 2017-12 3
#> 13 2018-01 3
#> 14 2018-02 3
#> 15 2018-03 2
#> 16 2018-04 2
#> 17 2018-05 2
#> 18 2018-06 2
#> 19 2018-07 1
#> 20 2018-08 0
#> 21 2018-09 0
#> 22 2018-10 1
#> 23 2018-11 1
#> 24 2018-12 1
#> 25 2019-01 1
#> 26 2019-02 1
#> 27 2019-03 1
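To see what is being counted, here is a base R sketch that reproduces the column above without ivs. As an assumption made purely for illustration, each month is encoded as the number of months since 2017-01, so [2017-01, 2018-08) becomes [0, 19):

```r
# Active intervals from `enrollments`, as month offsets from 2017-01
starts <- c(0, 0, 5, 21, 0, 3)
ends   <- c(19, 2, 14, 27, 2, 18)

# 2017-01 through 2019-03
months <- 0:26

# For each month, count the right-open intervals that contain it
counts <- vapply(
  months,
  function(m) sum(starts <= m & m < ends),
  integer(1)
)
counts
```

This loops over every month and scans every interval, which is fine for a handful of rows but is only meant to show the counting rule.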
There are also iv_count_overlaps(), iv_count_precedes(), and iv_count_follows() for working with two ivs at once.
One common operation when working with interval vectors is merging all the overlapping intervals within a single interval vector. This removes all the redundant information, while still maintaining the full range covered by the iv. For this, you can use iv_groups() which computes the minimal set of interval “groups” that contain all of the intervals in x.
x <- iv_pairs(c(1, 5), c(5, 7), c(9, 11), c(10, 13), c(12, 13))
x
#> <iv<double>[5]>
#> [1] [1, 5) [5, 7) [9, 11) [10, 13) [12, 13)
iv_groups(x)
#> <iv<double>[2]>
#> [1] [1, 7) [9, 13)
By default, this also groups abutting intervals, which aren’t considered to overlap but don’t have any values between them. If you don’t want this, use the abutting argument.
iv_groups(x, abutting = FALSE)
#> <iv<double>[3]>
#> [1] [1, 5) [5, 7) [9, 13)
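The merging idea can be sketched in base R as a single pass over the sorted intervals. merge_intervals() is a hypothetical helper written for this illustration, not the ivs implementation, and it assumes a non-empty input with no missing values:

```r
# Hypothetical helper (not part of ivs): merge right-open intervals
merge_intervals <- function(start, end, abutting = TRUE) {
  ord <- order(start, end)
  start <- start[ord]
  end <- end[ord]

  out_start <- start[1]
  out_end <- end[1]

  for (i in seq_along(start)[-1]) {
    k <- length(out_end)
    # With `abutting = TRUE`, an interval that starts exactly where the
    # current group ends is merged into that group as well
    joins <- if (abutting) start[i] <= out_end[k] else start[i] < out_end[k]
    if (joins) {
      out_end[k] <- max(out_end[k], end[i])
    } else {
      out_start <- c(out_start, start[i])
      out_end <- c(out_end, end[i])
    }
  }

  data.frame(start = out_start, end = out_end)
}

starts <- c(1, 5, 9, 10, 12)
ends <- c(5, 7, 11, 13, 13)

groups <- merge_intervals(starts, ends)
groups_strict <- merge_intervals(starts, ends, abutting = FALSE)
groups
groups_strict
```

The only difference between the two modes is whether the comparison against the end of the current group is inclusive or strict.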
.by
Grouping overlapping intervals is often a useful way to create a new variable to group on with dplyr’s .by argument. For example, consider the following problem where you have multiple users racking up costs across multiple systems. The date ranges represent the range when the corresponding cost was accrued over, and the ranges don’t overlap for a given (user, system) pair.
costs <- tribble(
  ~user, ~system, ~from, ~to, ~cost,
  1L, "a", "2019-01-01", "2019-01-05", 200.5,
  1L, "a", "2019-01-12", "2019-01-13", 15.6,
  1L, "b", "2019-01-03", "2019-01-10", 500.3,
  2L, "a", "2019-01-02", "2019-01-03", 25.6,
  2L, "c", "2019-01-03", "2019-01-04", 30,
  2L, "c", "2019-01-05", "2019-01-07", 66.2
)
costs <- costs %>%
  mutate(
    from = as.Date(from),
    to = as.Date(to)
  ) %>%
  mutate(range = iv(from, to), .keep = "unused")
costs
#> # A tibble: 6 × 4
#> user system cost range
#> <int> <chr> <dbl> <iv<date>>
#> 1 1 a 200. [2019-01-01, 2019-01-05)
#> 2 1 a 15.6 [2019-01-12, 2019-01-13)
#> 3 1 b 500. [2019-01-03, 2019-01-10)
#> 4 2 a 25.6 [2019-01-02, 2019-01-03)
#> 5 2 c 30 [2019-01-03, 2019-01-04)
#> 6 2 c 66.2 [2019-01-05, 2019-01-07)
Now let’s say you don’t care about the system anymore, and instead want to sum up the costs for any overlapping date ranges for a particular user. iv_groups() can give us an idea of what the non-overlapping ranges would be for each user:
costs %>%
  reframe(range = iv_groups(range), .by = user)
#> # A tibble: 4 × 2
#> user range
#> <int> <iv<date>>
#> 1 1 [2019-01-01, 2019-01-10)
#> 2 1 [2019-01-12, 2019-01-13)
#> 3 2 [2019-01-02, 2019-01-04)
#> 4 2 [2019-01-05, 2019-01-07)
But how can we sum up the costs? For this, we need to turn to iv_identify_group() which allows us to identify the group that each range falls in. This will give us something to group on so we can sum up the costs.
costs2 <- costs %>%
  mutate(range = iv_identify_group(range), .by = user)
# `range` has been updated with the corresponding group
costs2
#> # A tibble: 6 × 4
#> user system cost range
#> <int> <chr> <dbl> <iv<date>>
#> 1 1 a 200. [2019-01-01, 2019-01-10)
#> 2 1 a 15.6 [2019-01-12, 2019-01-13)
#> 3 1 b 500. [2019-01-01, 2019-01-10)
#> 4 2 a 25.6 [2019-01-02, 2019-01-04)
#> 5 2 c 30 [2019-01-02, 2019-01-04)
#> 6 2 c 66.2 [2019-01-05, 2019-01-07)
# So now we can group on that to summarise the cost
costs2 %>%
  summarise(cost = sum(cost), .by = c(user, range))
#> # A tibble: 4 × 3
#> user range cost
#> <int> <iv<date>> <dbl>
#> 1 1 [2019-01-01, 2019-01-10) 701.
#> 2 1 [2019-01-12, 2019-01-13) 15.6
#> 3 2 [2019-01-02, 2019-01-04) 55.6
#> 4 2 [2019-01-05, 2019-01-07) 66.2
iv_groups() is a critical function in this package because its defaults also produce what is known as a minimal iv. A minimal interval vector:
Has no overlapping intervals
Has no abutting intervals
Is ordered on both iv_start(x) and iv_end(x)
Minimal interval vectors are nice because they cover the range of an interval vector in the most compact form possible. They are also nice to know about because the set operations described in the set operations section below all return minimal interval vectors.
While iv_groups() generates fewer intervals than you began with, it is sometimes useful to go the other way and generate more intervals by splitting on all the overlapping endpoints. This is what iv_splits() does. Both operations end up generating a result that contains completely disjoint intervals, but they go about it in very different ways.
Let’s look back at our first iv_groups() example:
x <- iv_pairs(c(1, 5), c(5, 7), c(9, 11), c(10, 13), c(12, 13))
x
#> <iv<double>[5]>
#> [1] [1, 5) [5, 7) [9, 11) [10, 13) [12, 13)
Notice that [9, 11) overlaps [10, 13) which in turn overlaps [12, 13). If we looked at the sorted unique values of the endpoints (i.e. c(9, 10, 11, 12, 13)) and then paired these up like [9, 10), [10, 11), [11, 12), [12, 13), then we will have nicely split on the endpoints, generating a disjoint set of intervals that we refer to as “splits”. iv_splits() returns these intervals.
iv_splits(x)
#> <iv<double>[6]>
#> [1] [1, 5) [5, 7) [9, 10) [10, 11) [11, 12) [12, 13)
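The endpoint-pairing idea described above can be sketched in base R. split_intervals() is a hypothetical helper for illustration only, not the ivs implementation; it pairs up adjacent sorted endpoints and keeps only the pieces that are covered by at least one original interval, which is why the gap [7, 9) does not appear:

```r
# Hypothetical helper (not part of ivs): split right-open intervals
# on every start and end point
split_intervals <- function(start, end) {
  pts <- sort(unique(c(start, end)))
  lo <- head(pts, -1)
  hi <- tail(pts, -1)

  # Keep a candidate piece [lo, hi) only if some input interval contains it
  covered <- vapply(
    seq_along(lo),
    function(i) any(start <= lo[i] & hi[i] <= end),
    logical(1)
  )

  data.frame(start = lo[covered], end = hi[covered])
}

splits <- split_intervals(
  start = c(1, 5, 9, 10, 12),
  end = c(5, 7, 11, 13, 13)
)
splits
```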
.by
Splitting an iv into its disjoint pieces is another operation that works nicely with .by. Consider this data set containing details about a number of guests that arrived at your party. You’ve been meticulous, so you’ve got their arrival and departure times logged (don’t ask me why, maybe it’s for COVID-19 Contact Tracing purposes).
guests <- tibble(
  arrive = as.POSIXct(
    c("2008-05-20 19:30:00", "2008-05-20 20:10:00", "2008-05-20 22:15:00"),
    tz = "UTC"
  ),
  depart = as.POSIXct(
    c("2008-05-20 23:00:00", "2008-05-21 00:00:00", "2008-05-21 00:30:00"),
    tz = "UTC"
  ),
  name = list(
    c("Mary", "Harry"),
    c("Diana", "Susan"),
    "Peter"
  )
)
guests <- unnest(guests, name) %>%
  mutate(iv = iv(arrive, depart), .keep = "unused")
guests
#> # A tibble: 5 × 2
#> name iv
#> <chr> <iv<dttm>>
#> 1 Mary [2008-05-20 19:30:00, 2008-05-20 23:00:00)
#> 2 Harry [2008-05-20 19:30:00, 2008-05-20 23:00:00)
#> 3 Diana [2008-05-20 20:10:00, 2008-05-21 00:00:00)
#> 4 Susan [2008-05-20 20:10:00, 2008-05-21 00:00:00)
#> 5 Peter [2008-05-20 22:15:00, 2008-05-21 00:30:00)
Let’s figure out who was at your party at any given point throughout the night. To do this, we’ll need to break iv up into all possible disjoint intervals that mark either an arrival or departure. Like with iv_groups(), iv_splits() can show us those disjoint intervals, but this doesn’t help us map them back to each guest.
iv_splits(guests$iv)
#> <iv<datetime<UTC>>[5]>
#> [1] [2008-05-20 19:30:00, 2008-05-20 20:10:00)
#> [2] [2008-05-20 20:10:00, 2008-05-20 22:15:00)
#> [3] [2008-05-20 22:15:00, 2008-05-20 23:00:00)
#> [4] [2008-05-20 23:00:00, 2008-05-21 00:00:00)
#> [5] [2008-05-21 00:00:00, 2008-05-21 00:30:00)
Instead, we’ll need iv_identify_splits(), which identifies which of the splits overlap with each of the original intervals and returns a list of the results which works nicely as a list-column. This is a little easier to understand if we first look at a single guest:
# Mary's arrival/departure times
guests$iv[[1]]
#> <iv<datetime<UTC>>[1]>
#> [1] [2008-05-20 19:30:00, 2008-05-20 23:00:00)
# The first start and last end correspond to Mary's original times,
# but we've also broken her stay up by the departures/arrivals of
# everyone else
iv_identify_splits(guests$iv)[[1]]
#> <iv<datetime<UTC>>[3]>
#> [1] [2008-05-20 19:30:00, 2008-05-20 20:10:00)
#> [2] [2008-05-20 20:10:00, 2008-05-20 22:15:00)
#> [3] [2008-05-20 22:15:00, 2008-05-20 23:00:00)
Since this generates a list-column, we’ll also immediately use tidyr::unnest() to expand it out.
guests2 <- guests %>%
  mutate(iv = iv_identify_splits(iv)) %>%
  unnest(iv) %>%
  arrange(iv)
guests2
#> # A tibble: 15 × 2
#> name iv
#> <chr> <iv<dttm>>
#> 1 Mary [2008-05-20 19:30:00, 2008-05-20 20:10:00)
#> 2 Harry [2008-05-20 19:30:00, 2008-05-20 20:10:00)
#> 3 Mary [2008-05-20 20:10:00, 2008-05-20 22:15:00)
#> 4 Harry [2008-05-20 20:10:00, 2008-05-20 22:15:00)
#> 5 Diana [2008-05-20 20:10:00, 2008-05-20 22:15:00)
#> 6 Susan [2008-05-20 20:10:00, 2008-05-20 22:15:00)
#> 7 Mary [2008-05-20 22:15:00, 2008-05-20 23:00:00)
#> 8 Harry [2008-05-20 22:15:00, 2008-05-20 23:00:00)
#> 9 Diana [2008-05-20 22:15:00, 2008-05-20 23:00:00)
#> 10 Susan [2008-05-20 22:15:00, 2008-05-20 23:00:00)
#> 11 Peter [2008-05-20 22:15:00, 2008-05-20 23:00:00)
#> 12 Diana [2008-05-20 23:00:00, 2008-05-21 00:00:00)
#> 13 Susan [2008-05-20 23:00:00, 2008-05-21 00:00:00)
#> 14 Peter [2008-05-20 23:00:00, 2008-05-21 00:00:00)
#> 15 Peter [2008-05-21 00:00:00, 2008-05-21 00:30:00)
Now that we have the splits for each guest, we can group by iv and summarize to figure out who was at the party at any point throughout the night.
guests2 %>%
  summarise(n = n(), who = list(name), .by = iv)
#> # A tibble: 5 × 3
#> iv n who
#> <iv<dttm>> <int> <list>
#> 1 [2008-05-20 19:30:00, 2008-05-20 20:10:00) 2 <chr [2]>
#> 2 [2008-05-20 20:10:00, 2008-05-20 22:15:00) 4 <chr [4]>
#> 3 [2008-05-20 22:15:00, 2008-05-20 23:00:00) 5 <chr [5]>
#> 4 [2008-05-20 23:00:00, 2008-05-21 00:00:00) 3 <chr [3]>
#> 5 [2008-05-21 00:00:00, 2008-05-21 00:30:00) 1 <chr [1]>
There are a number of set theoretical operations that you can use on ivs. These are:
iv_set_complement()
iv_set_union()
iv_set_intersect()
iv_set_difference()
iv_set_symmetric_difference()
iv_set_complement() works on a single iv, while all the others work on two ivs at a time. All of these functions return a minimal interval vector. The easiest way to think about these functions is to imagine iv_groups() being called on each of the inputs first (to reduce them down to their minimal form) before applying the operation.
iv_set_complement() computes the set complement of the intervals in a single iv.
x <- iv_pairs(c(1, 3), c(2, 5), c(10, 12), c(13, 15))
x
#> <iv<double>[4]>
#> [1] [1, 3) [2, 5) [10, 12) [13, 15)
iv_set_complement(x)
#> <iv<double>[2]>
#> [1] [5, 10) [12, 13)
By default, iv_set_complement() uses the smallest/largest values of its input as the bounds to compute the complement over, but you can supply bounds explicitly with lower and upper:
iv_set_complement(x, lower = 0, upper = Inf)
#> <iv<double>[4]>
#> [1] [0, 1) [5, 10) [12, 13) [15, Inf)
iv_set_union() takes the union of two ivs. It is essentially a call to vctrs::vec_c() followed by iv_groups(). It answers the question, “Which intervals are in x or y?”
y <- iv_pairs(c(-5, 0), c(1, 4), c(8, 10), c(15, 16))
x
#> <iv<double>[4]>
#> [1] [1, 3) [2, 5) [10, 12) [13, 15)
y
#> <iv<double>[4]>
#> [1] [-5, 0) [1, 4) [8, 10) [15, 16)
iv_set_union(x, y)
#> <iv<double>[4]>
#> [1] [-5, 0) [1, 5) [8, 12) [13, 16)
iv_set_intersect() takes the intersection of two ivs. It answers the question, “Which intervals are in x and y?”
iv_set_intersect(x, y)
#> <iv<double>[1]>
#> [1] [1, 4)
iv_set_difference() takes the asymmetrical difference of two ivs. It answers the question, “Which intervals are in x but not y?”
iv_set_difference(x, y)
#> <iv<double>[3]>
#> [1] [4, 5) [10, 12) [13, 15)
The set operations described above all treat x and y as two complete “sets” of intervals and operate on the intervals as a group. Occasionally it is useful to have pairwise equivalents of these operations that, say, take the intersection of the i-th interval of x and the i-th interval of y.
One case in particular comes from combining iv_locate_overlaps() with iv_pairwise_set_intersect(). Here you might want to know not only where two ivs overlap, but also what that intersection was for each value in x.
starts <- as.Date(c("2019-01-05", "2019-01-20", "2019-01-25", "2019-02-01"))
ends <- starts + c(5, 10, 3, 5)
x <- iv(starts, ends)
starts <- as.Date(c("2019-01-02", "2019-01-23"))
ends <- starts + c(5, 6)
y <- iv(starts, ends)
x
#> <iv<date>[4]>
#> [1] [2019-01-05, 2019-01-10) [2019-01-20, 2019-01-30) [2019-01-25, 2019-01-28)
#> [4] [2019-02-01, 2019-02-06)
y
#> <iv<date>[2]>
#> [1] [2019-01-02, 2019-01-07) [2019-01-23, 2019-01-29)
iv_set_intersect() isn’t very useful to answer this particular question, because it first merges all overlapping intervals in each input.
iv_set_intersect(x, y)
#> <iv<date>[2]>
#> [1] [2019-01-05, 2019-01-07) [2019-01-23, 2019-01-29)
Instead, we can find the overlaps and align them, and then pairwise intersect the results:
locations <- iv_locate_overlaps(x, y, no_match = "drop")
overlaps <- iv_align(x, y, locations = locations)
overlaps %>%
  mutate(intersect = iv_pairwise_set_intersect(needles, haystack))
#> needles haystack intersect
#> 1 [2019-01-05, 2019-01-10) [2019-01-02, 2019-01-07) [2019-01-05, 2019-01-07)
#> 2 [2019-01-20, 2019-01-30) [2019-01-23, 2019-01-29) [2019-01-23, 2019-01-29)
#> 3 [2019-01-25, 2019-01-28) [2019-01-23, 2019-01-29) [2019-01-25, 2019-01-28)
Note that the pairwise set operations come with a number of restrictions that limit their usage in many cases. For example, iv_pairwise_set_intersect() requires that x[i] and y[i] overlap, otherwise they would result in an empty interval, which isn’t allowed.
iv_pairwise_set_intersect(iv(1, 5), iv(6, 9))
#> Error in `iv_pairwise_set_intersect()`:
#> ! Can't take the intersection of non-overlapping intervals.
#> ℹ This would result in an empty interval.
#> ℹ Location 1 contains non-overlapping intervals.
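The reason for this restriction is easy to see in a base R sketch of a pairwise intersection. pairwise_intersect() is a hypothetical helper written for this illustration, not the ivs implementation: the intersection runs from the later start to the earlier end, and when the inputs don’t overlap that end is not after the start. The dates from the aligned example above are encoded as day-of-January numbers to keep the sketch short.

```r
# Hypothetical helper (not part of ivs): pairwise intersection of
# right-open intervals given as parallel start/end vectors
pairwise_intersect <- function(x_start, x_end, y_start, y_end) {
  start <- pmax(x_start, y_start)
  end <- pmin(x_end, y_end)

  # Non-overlapping pairs would produce an empty or reversed interval
  if (any(start >= end)) {
    stop("Can't take the intersection of non-overlapping intervals.")
  }

  data.frame(start = start, end = end)
}

# The three aligned pairs from above, as days in January 2019
res <- pairwise_intersect(
  x_start = c(5, 20, 25), x_end = c(10, 30, 28),
  y_start = c(2, 23, 23), y_end = c(7, 29, 29)
)
res

# [1, 5) and [6, 9) don't overlap, so this signals an error
err <- tryCatch(pairwise_intersect(1, 5, 6, 9), error = function(e) e)
```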
See the documentation page of iv_pairwise_set_intersect() for a complete list of restrictions for all of the pairwise set operations.
Missing intervals are allowed in ivs. You can generate them by supplying vectors to iv() or iv_pairs() that contain missing values in either input.
x <- iv_pairs(c(1, 5), c(3, NA), c(NA, 3))
x
#> <iv<double>[3]>
#> [1] [1, 5) [NA, NA) [NA, NA)
The defaults of all functions in ivs treat missing intervals in one of two ways:
Match-like operations treat missing intervals as overlapping other missing intervals, but they won’t overlap any other interval. These include iv_locate_overlaps(), iv_set_intersect(), and iv_splits().
Pairwise operations treat missing intervals as infectious, meaning that if the i-th interval of x is missing and the i-th interval of y is not missing (or vice versa), then the result is forced to be a missing interval. These include any operations prefixed with iv_pairwise_*().
y <- iv_pairs(c(NA, NA), c(0, 2))
y
#> <iv<double>[2]>
#> [1] [NA, NA) [0, 2)
# Match-like operations treat missing intervals as overlapping
iv_locate_overlaps(x, y)
#> needles haystack
#> 1 1 2
#> 2 2 1
#> 3 3 1
iv_set_intersect(x, y)
#> <iv<double>[2]>
#> [1] [1, 2) [NA, NA)
# Pairwise operations treat missing intervals as infectious
z <- iv_pairs(c(1, 2), c(1, 4))
iv_pairwise_set_intersect(y, z)
#> <iv<double>[2]>
#> [1] [NA, NA) [1, 2)