theft
enables the standardised calculation of
time-series features from multiple existing feature sets, and any
user-supplied features.
To explore package functionality, we are going to use a dataset that
comes standard with theft
called simData
. This
dataset contains a collection of randomly generated time series for six
different types of processes. The dataset can be accessed via:
The data follows the following structure:
head(simData)
#> values timepoint id process
#> Gaussian Noise.1 -0.6264538 1 Gaussian Noise_1 Gaussian Noise
#> Gaussian Noise.2 0.1836433 2 Gaussian Noise_1 Gaussian Noise
#> Gaussian Noise.3 -0.8356286 3 Gaussian Noise_1 Gaussian Noise
#> Gaussian Noise.4 1.5952808 4 Gaussian Noise_1 Gaussian Noise
#> Gaussian Noise.5 0.3295078 5 Gaussian Noise_1 Gaussian Noise
#> Gaussian Noise.6 -0.8204684 6 Gaussian Noise_1 Gaussian Noise
The core function that automates the calculation of the feature
statistics at once is calculate_features
. You can choose
which subset of features to calculate with the feature_set
argument. The choices are currently "catch22"
,
"feasts"
, "Kats"
, "tsfeatures"
,
"tsfresh"
, and/or "TSFEL"
.
Note that Kats
, tsfresh
and
TSFEL
are Python packages. The R package
reticulate
is used to call Python code that uses these
packages and applies it within the broader tidy data philosophy
embodied by theft
. At present, depending on the input
time-series, theft
provides access to \(>1200\) features.
However, as discussed in the functionality demonstrations below, you can also supply your own list of features too! But more on that later…
Prior to using theft
(only if you want to use the
Kats
, tsfresh
or TSFEL
feature
sets; the R-based sets will run fine) you should have a working Python
3.9 installation and run the function
install_python_pkgs(venv)
after first installing
theft
, where the venv
argument is the name of
the virtual environment you want to create.
For example, if you wanted to install the Python libraries to the
default virtual environment folder used by reticulate
, you
would run the following after first having installed theft
(here I am just creating a new virtual environment called
"theft-package"
—you can call it whatever you like!):
You can then run the following to activate the virtual environment:
You are now ready to commit theft using all six potential factory feature sets!
However, you do not necessarily have to use these convenience
functions. If you have another method for pointing R to the correct
Python (such as reticulate
or findpython
), you
can use those in your workflow instead and make sure you install
Kats
, tsfresh
or TSFEL
as
required
NOTE 1: You only need to call
init_theft
or your other solution once per
session.
NOTE 2: If you have issues installing
Kats
with
install_python_pkgs
, try
install_python_pkgs("theft-package", standard_kats = FALSE)
.
You are then ready to use the rest of the package’s functionality,
beginning with the extraction of time-series features. Here is an
example with the catch22
set:
feature_matrix <- calculate_features(data = simData,
id_var = "id",
time_var = "timepoint",
values_var = "values",
group_var = "process",
feature_set = "catch22",
seed = 123)
head(feature_matrix)
#> id group names values
#> 1 Gaussian Noise_1 Gaussian Noise DN_HistogramMode_5 -0.01408452
#> 2 Gaussian Noise_1 Gaussian Noise DN_HistogramMode_10 -0.27031413
#> 3 Gaussian Noise_1 Gaussian Noise CO_f1ecac 1.60843425
#> 4 Gaussian Noise_1 Gaussian Noise CO_FirstMin_ac 3.00000000
#> 5 Gaussian Noise_1 Gaussian Noise CO_HistogramAMI_even_2_5 0.09661403
#> 6 Gaussian Noise_1 Gaussian Noise CO_trev_1_num 1.51953865
#> feature_set
#> 1 catch22
#> 2 catch22
#> 3 catch22
#> 4 catch22
#> 5 catch22
#> 6 catch22
Note that for the catch22
set you can set the additional
catch24
argument to calculate the mean and standard
deviation in addition to the standard 22 features:
feature_matrix <- calculate_features(data = simData,
id_var = "id",
time_var = "timepoint",
values_var = "values",
group_var = "process",
feature_set = "catch22",
catch24 = TRUE,
seed = 123)
NOTE: If using the tsfresh
feature set, you might want
to consider the tsfresh_cleanup
argument to
calculate_features
. This argument defaults to
FALSE
and specifies whether to use the in-built
tsfresh
relevant feature filter or not.
You can also supply your own named list of functions to compute as
time-series features. Below is an example with mean and standard
deviation. Note that the list must be named as
theft
uses the list element names to label the time-series
features internally. Note that if you don’t want to use any of the
existing feature sets in theft
and only calculate the
features you supply to features
, just set
feature_set = NULL
.
feature_matrix2 <- calculate_features(data = simData,
group_var = "process",
feature_set = NULL,
features = list("mean" = mean, "sd" = sd))
head(feature_matrix2)
#> id group names values feature_set
#> 1 Gaussian Noise_1 Gaussian Noise mean 0.106146509 User
#> 2 Gaussian Noise_1 Gaussian Noise sd 0.900816596 User
#> 3 Gaussian Noise_2 Gaussian Noise mean 0.029534984 User
#> 4 Gaussian Noise_2 Gaussian Noise sd 1.129814821 User
#> 5 Gaussian Noise_3 Gaussian Noise mean -0.009571088 User
#> 6 Gaussian Noise_3 Gaussian Noise sd 0.872840204 User
For a detailed comparison of the six feature sets, see this paper for a detailed review1.
As theft
is based on the foundations laid by hctsa
, there
is also functionality for reading in hctsa
-formatted Matlab
files and automatically processing them into tidy dataframes ready for
feature extraction in theft
. The
process_hctsa_file
function takes a string filepath to the
Matlab file and does all the work for you, returning a dataframe with
naming conventions consistent with other theft
functionality. As per hctsa
specifications for Input
File Format 1, this file should have 3 variables with the following
exact names: timeSeriesData
, labels
, and
keywords
. The filepath can be a local drive path or a
URL.
Please see the companion package theftdlc
(‘theft
downloadable content’) for a large suite of
functions.
T. Henderson and B. D. Fulcher, “An Empirical Evaluation of Time-Series Feature Sets,” 2021 International Conference on Data Mining Workshops (ICDMW), 2021, pp. 1032-1038, doi: 10.1109/ICDMW53433.2021.00134.↩︎