Alternative Thresholds

Robert W Schlegel

2021-10-26

Overview

The heatwaveR package was designed to include many different methods for the creation of climatologies and thresholds for detecting extremes events (e.g. heatwaves and cold-spells) in time series data. To this end we made a very large change in the event detection pipeline when we moved from the RmarineHeatWaves package to heatwaveR. This change may primarily be seen with the inclusion of the ts2clm() function and the removal of the climatology performed in RmarineHeatWaves::detect() in favour of detect_event(), which does not calculate climatologies. In this way we have allowed for the introduction of a multitude of more complex climatology calculation and event detection/filtering methods. It is our overarching goal to provide one package that allows climate scientists to calculate these events in both the atmosphere and oceans. But rather than talking about it, let’s walk through some case studies on how this package can be used for diverse applications.

Double thresholds

Brought to our attention by Mr. Haouari from the Hydrometeorological Institute of Training and Research (IHFR) in Algeria was the concept of using a flat (e.g. 19\(^\circ\)C) tMin bottom boundary when calculating events from tMax with the standard 90th percentile upper threshold. As the authors of the heatwaveR package are admittedly marine oriented, we tend to work with daily time series that have only one mean value per day (e.g. tMean, temp, sst). This is why there are not arguments in the heatwaveR suite of functions that call on tMin and tMax explicitly, but that does not mean that one cannot do so. Below we will work through the steps one would take to calculate (atmospheric) heatwaves, as per their definition in Perkins and Alexander (2013) (but excluding the calculation of EHF), and with the additional step proposed by Mr. Haouari.

The following sub-sections will show the step-by-step process one may use to calculate atmospheric heatwaves using a 90th percentile threshold created from the tMax time series for a location, but will only use the days when the corresponding tMin also exceeds a pre-determined flat bottom boundary on the same days to quantify the heatwave metrics. We will finish by visualising the results with the built-in heatwaveR graphing functions of event_line() and lolli_plot as well as a bubble plot. The data we will use for these examples will be a 45 year time series of daily atmospheric tMin and tMax temperatures over Algiers, Algeria contributed by Mr. Haouari.

Data prep

The first step with any analysis in R should be the loading of the packages to be used.

library(dplyr)
library(ggpubr)
library(heatwaveR)

With our libraries loaded, we will now go about explicitly calling the Algiers data into our environment. These data are automatically loaded for us when we load the heatwaveR library, but we perform this step here just for clarity. Anyone following along should feel free to use whatever data they would like. As long as the data have a date (t), tMin, and tMax column this code will function as designed.

Algiers <- Algiers

Calculating thresholds

With our libraries and data loaded, we will now calculate the two thresholds we need to correctly detect the heatwaves and accurately quantify their metrics. The first is the 90th percentile threshold based on the tMax time series. The second is the flat exceedance of 19\(^\circ\)C based on the tMin data. We use 19\(^\circ\)C as the bottom threshold as this is roughly the mean tMin for summer temperatures.

# The tMax threshold
# The current WMO standard climatology period is 1981-01-01 to 2010-12-31 and should be used when possible
# We rather use 1961-01-01 to 1990-01-01 as this is the oldest 30 year period available in the data
tMax_clim <- ts2clm(data = Algiers, y = tMax, climatologyPeriod = c("1961-01-01", "1990-12-31"), pctile = 90)

# The tMin exceedance
# Note the use here of 'minDuration = 3' and 'maxGap = 1' as the default atmospheric arguments
# The default marine arguments are 'minDuration = 5' and 'maxGap = 2'
tMin_exc <- exceedance(data = Algiers, y = tMin, threshold = 19, minDuration = 3, maxGap = 1)$threshold

Calculating events

Now that the two thresholds have been calculated we use the detect_event() function as usual, but provide the second threshold to the argument threshClim2 that would normally lay dormant.

# Note that because we calculated our 90th percentile threshold on a column named 'tMax' 
# and not the default column name 'temp', we must specify this below with 'y = tMax'
events <- detect_event(data = tMax_clim, y = tMax, # The 90th percentile threshold
                       threshClim2 = tMin_exc$exceedance) # The flat exceedance threshold

Please note that even though the use of the second threshold does allow for the resultant event metrics to differ, the values themselves are still being calculated against the seasonal climatology and daily temperatures for the time series given to the 90th percentile threshold calculation (in this case tMax) and so using a second threshold (in this case tMin) won’t generally have much of an effect on the event metrics. Rather it mostly screens out smaller or larger events depending on how one chooses to set the threshold. In the case when an exceedance threshold is chosen for a temperature that would typically only occur in summer (e.g. 19\(^\circ\)C, as used here), one is also effectively screening events by season. There are many use cases where this would be desirable. For example, if one is only interested in events that would occur during a season in which night time heat stress becomes an issue for the young and elderly.

Creating visuals

Even though we have used two thresholds to calculate our events, the results are output the same as though only one threshold (default) were used. This means that we may use the visualisation functions that come with heatwaveR without any extra fuss.

# The code to create a bubble plot for the heatwave results
bubble_plot <- ggplot(data = events$event, aes(x = date_peak, y = intensity_max)) +
  geom_point(aes(size = intensity_cumulative), shape = 21, fill = "salmon", alpha = 0.8) +
  labs(x = NULL, y = "Maximum Intensity [°C] ", size = "Cumulative Intensity [°C x days]") +
  scale_size_continuous(range = c(1, 10), 
                        guide = guide_legend(title.position = "top", direction = "horizontal")) +
  theme_bw() +
  theme(legend.position = c(0.3, 0.12),
        legend.box.background = element_rect(colour = "black"))

# Don't forget to set 'event_line(y = tMax)'
ggarrange(event_line(events, y = tMax, metric = "intensity_max"),
          event_line(events, y = tMax, metric = "intensity_max", category = T),
          lolli_plot(events),
          bubble_plot,
          ncol = 2, nrow = 2, align = "hv")

Alternative second thresholds

Using a percentile based second threshold is not much different than using a static second threshold. Rather than using exceedance() to get our second threshold we can use ts2clm nested within detect_event(). It must also be pointed out that in addition to using multiple thresholds, we can adjust the minimum duration (minDuration) and maximum gap (maxGap) arguments for our multiple thresholds, too. This allows us to provide different ‘flavours’ of criteria for our events. For example, let’s say we are interested in night-time events (tMin) when the temperatures remain above the 80th percentile threshold (pctile = 80) for 10 or more days (minDuration = 10) without dipping below that threshold for more than 2 consecutive days (maxGap = 2). But on top of that, we are also only interested in those parts of the event when the daytime temperatures exceed the 90th percentile threshold (pctile = 90) for 3 or more days (minDuration = 3) non-stop (maxGap = 0).

Below we will look at how to detect/calculate events that meet these rather specific criteria. We will also calculate events with just the first threshold and compare the results visually. It must be noted here that whichever criteria is the most strict, in this case minDuration = 3 and maxGap = 0, will be the predominant filter through which the event metrics are quantified.

# Note that because we are not using the standard column name 'temp' we must
# specify the chosen column name twice, once for ts2clm() and again for detect_event()

# First threshold based on tMin
thresh_tMin <- ts2clm(data = Algiers, y = tMin, pctile = 80, 
                      climatologyPeriod = c("1961-01-01", "1990-12-31"))

# Second threshold based on tMax
# Be careful here that you put the arguments within the correct brackets
thresh_tMax <- detect_event(ts2clm(data = Algiers, y = tMax, pctile = 90, 
                                   climatologyPeriod = c("1961-01-01", "1990-12-31")),
                            # These arguments are passed to detect_event(), not ts2clm()
                            minDuration = 3, maxGap = 0, y = tMax, protoEvents = T)

# Detect/calculate events using the two precalculated thresholds
# Because detect_event() is not able to deduce which arguments we used above,
# we must again tell it explicitly here
events_two_thresh <- detect_event(data = thresh_tMin, y = tMin, minDuration = 10, maxGap = 2,
                                  threshClim2 = thresh_tMax$event, minDuration2 = 3, maxGap2 = 0)

# Or to simply use one threshold
events_one_thresh <- detect_event(data = thresh_tMin, y = tMin, minDuration = 10, maxGap = 2)

Here are the differences in lolliplot format:

ggarrange(lolli_plot(events_one_thresh), lolli_plot(events_two_thresh), labels = c("One threshold", "Two thresholds"))

Here is a brief display of the top few events from each method:

head(events_one_thresh$event)
## # A tibble: 5 × 22
##   event_no index_start index_peak index_end duration date_start date_peak 
##      <int>       <int>      <int>     <int>    <dbl> <date>     <date>    
## 1        1        6135       6137      6144       10 1977-10-18 1977-10-20
## 2        2       10844      10845     10856       13 1990-09-09 1990-09-10
## 3        3       11191      11200     11201       11 1991-08-22 1991-08-31
## 4        4       13292      13300     13301       10 1997-05-23 1997-05-31
## 5        5       15534      15542     15546       13 2003-07-13 2003-07-21
## # … with 15 more variables: date_end <date>, intensity_mean <dbl>,
## #   intensity_max <dbl>, intensity_var <dbl>, intensity_cumulative <dbl>,
## #   intensity_mean_relThresh <dbl>, intensity_max_relThresh <dbl>,
## #   intensity_var_relThresh <dbl>, intensity_cumulative_relThresh <dbl>,
## #   intensity_mean_abs <dbl>, intensity_max_abs <dbl>, intensity_var_abs <dbl>,
## #   intensity_cumulative_abs <dbl>, rate_onset <dbl>, rate_decline <dbl>
head(events_two_thresh$event)
## # A tibble: 5 × 22
##   event_no index_start index_peak index_end duration date_start date_peak 
##      <int>       <int>      <int>     <int>    <dbl> <date>     <date>    
## 1        1        6135       6137      6137        3 1977-10-18 1977-10-20
## 2        2       10851      10851     10856        6 1990-09-16 1990-09-16
## 3        3       11191      11200     11201       11 1991-08-22 1991-08-31
## 4        4       13292      13294     13294        3 1997-05-23 1997-05-25
## 5        5       15540      15542     15545        6 2003-07-19 2003-07-21
## # … with 15 more variables: date_end <date>, intensity_mean <dbl>,
## #   intensity_max <dbl>, intensity_var <dbl>, intensity_cumulative <dbl>,
## #   intensity_mean_relThresh <dbl>, intensity_max_relThresh <dbl>,
## #   intensity_var_relThresh <dbl>, intensity_cumulative_relThresh <dbl>,
## #   intensity_mean_abs <dbl>, intensity_max_abs <dbl>, intensity_var_abs <dbl>,
## #   intensity_cumulative_abs <dbl>, rate_onset <dbl>, rate_decline <dbl>

If we look at these results we see that the use of two thresholds did not detected fewer events than the use of one threshold. This is because even though the second threshold was much more ‘difficult’ for the time series to surpass than the first, the heatwaves in the time series are so pronounced that they emerge regardless. This method allows for a lot of flexibility, but users should also be cautious that they understand what exactly they are asking their machines to do. In the case above, it may be that we would actually prefer to calculate our event metrics based entirely on the first threshold, but filter out the events that didn’t meet our second threshold criteria. We will see how to do this in the following section.

Filtering with a second threshold

The methodology outlined below for the detection and filtering of events with two thresholds is somewhat cumbersome. A potential issue with this technique is that the multiple filters do not affect the calculation of the event metrics (e.g. intensity_cumulative), as only the primary threshold given to detect_event() is used when calculating event metrics. This may however be the desired case if one is still interested in knowing the cumulative intensity above the given percentile threshold, but only wants to filter the full event based on some other threshold criteria. I can imagine real-world use cases for both scenarios, which is why this seemingly less sophisticated approach is detailed below.

Filtering events

Because we have already calculated our single threshold events (events_one_thresh) and our second threshold (thresh_tMax) we may directly begin filtering the results. Before we do so, let’s pull out the list components of our results into dataframes for easier use down the line.

# Pull out each data.frame as their own object for easier use
events_one_event <- events_one_thresh$event
events_one_climatology <- events_one_thresh$climatology

This is where things may get tricky for some users, and where the default use of the functions in the heatwaveR package ends. We are now going ‘off-road’ so to speak. But do not despair! The tidyverse suite of packages makes data wrangling like this much more user friendly than it was in the dark days of Base R coding.

In order to make the filtering of events easier, we will combine the two different dataframes that we are using as threshold/filtering guides to chose the events that meet all of our selection criteria.

# Join the two threshold dataframes
two_thresh <- left_join(events_one_climatology, thresh_tMax, by = c("t"))

# Remove all days that did not qualify as events in both thresholds
two_thresh_filtered <- two_thresh %>%
  filter(event.x == TRUE,
         event.y == TRUE)

With our filtering guide created, we may now apply it to events_one_thresh to get our filtered results.

# Copy data with a new name
events_one_thresh_filtered <- events_one_thresh

# Then filter
events_one_thresh_filtered$event <- events_one_thresh_filtered$event %>% 
  filter(event_no %in% two_thresh_filtered$event_no.x)

# Compare results
head(events_one_thresh_filtered$event)
## # A tibble: 5 × 22
##   event_no index_start index_peak index_end duration date_start date_peak 
##      <int>       <int>      <int>     <int>    <dbl> <date>     <date>    
## 1        1        6135       6137      6144       10 1977-10-18 1977-10-20
## 2        2       10844      10845     10856       13 1990-09-09 1990-09-10
## 3        3       11191      11200     11201       11 1991-08-22 1991-08-31
## 4        4       13292      13300     13301       10 1997-05-23 1997-05-31
## 5        5       15534      15542     15546       13 2003-07-13 2003-07-21
## # … with 15 more variables: date_end <date>, intensity_mean <dbl>,
## #   intensity_max <dbl>, intensity_var <dbl>, intensity_cumulative <dbl>,
## #   intensity_mean_relThresh <dbl>, intensity_max_relThresh <dbl>,
## #   intensity_var_relThresh <dbl>, intensity_cumulative_relThresh <dbl>,
## #   intensity_mean_abs <dbl>, intensity_max_abs <dbl>, intensity_var_abs <dbl>,
## #   intensity_cumulative_abs <dbl>, rate_onset <dbl>, rate_decline <dbl>
head(events_two_thresh$event)
## # A tibble: 5 × 22
##   event_no index_start index_peak index_end duration date_start date_peak 
##      <int>       <int>      <int>     <int>    <dbl> <date>     <date>    
## 1        1        6135       6137      6137        3 1977-10-18 1977-10-20
## 2        2       10851      10851     10856        6 1990-09-16 1990-09-16
## 3        3       11191      11200     11201       11 1991-08-22 1991-08-31
## 4        4       13292      13294     13294        3 1997-05-23 1997-05-25
## 5        5       15540      15542     15545        6 2003-07-19 2003-07-21
## # … with 15 more variables: date_end <date>, intensity_mean <dbl>,
## #   intensity_max <dbl>, intensity_var <dbl>, intensity_cumulative <dbl>,
## #   intensity_mean_relThresh <dbl>, intensity_max_relThresh <dbl>,
## #   intensity_var_relThresh <dbl>, intensity_cumulative_relThresh <dbl>,
## #   intensity_mean_abs <dbl>, intensity_max_abs <dbl>, intensity_var_abs <dbl>,
## #   intensity_cumulative_abs <dbl>, rate_onset <dbl>, rate_decline <dbl>

The event numbers found in events_one_thresh_filtered are the same as the event numbers found in events_two_thresh with the important difference that the event metrics in events_two_thresh were calculated only on the days that exceeded both thresholds, while the events in events_one_thresh_filtered have had their metrics calculated from all of the days that exceeded only the first threshold.

Visualising filtered events

To better understand how different the results from these two different techniques may be we will use lolliplots to visualise them.

ggarrange(lolli_plot(events_two_thresh, metric = "duration"), 
          lolli_plot(events_one_thresh_filtered, metric = "duration"), 
          labels = c("Double threshold", "Filter threshold"))
Difference in duration (days) of events given different applications of thresholds. Note the difference in the y-axes.

Difference in duration (days) of events given different applications of thresholds. Note the difference in the y-axes.

ggarrange(lolli_plot(events_two_thresh, metric = "intensity_cumulative"), 
          lolli_plot(events_one_thresh_filtered, metric = "intensity_cumulative"), 
          labels = c("Double threshold", "Filter threshold"))
Difference in cumulative intensity (°C x days) of events given different applications of thresholds. Note the difference in the y-axes.

Difference in cumulative intensity (°C x days) of events given different applications of thresholds. Note the difference in the y-axes.

One may of course visualise the outputs from the events calculated here with geom_flame() and geom_lolli() as well, but this will not differ from the default method of using these functions as outlined in their help files so we will not go into that here.

Summary

This vignette serves as a guideline for how to implement multiple methodologies for using two thresholds (tMin and tMax) with atmospheric data. We also showed in this vignette a more straight forward approach to using a second threshold through the built-in arguments in detect_event(). The use of a second threshold in this way, whether it be based on a static threshold or one derived from a percentile, is useful for the consideration of events that may be more specifically relevant to a given season or organism.

I hope the techniques shown in this vignette will be useful both technically and theoretically. The authors of heatwaveR are very happy to receive any further input on the development of the package as well as other potential methods for calculating heatwaves and cold-spells in air or sea.

References

Perkins, Sarah E., and Lisa V. Alexander. 2013. On the measurement of heat waves.” Journal of Climate 26 (13): 4500–4517.