After briefly describing the problem that fplyr tries to solve, this
vignette will go through all the functions in the package, explaining
their usage. In order to make the most of this package, a certain degree
of familiarity with the data.table package is suggested. Often, if one
has trouble understanding an option, it will be possible to find
detailed help in the manual of data.table's fread() function.
Furthermore, basic acquaintance with the *ply family of functions in R,
especially lapply(), will also be helpful. You are encouraged to run the
code of this vignette on your own and explore the output of the commands.
A very common operation when analyzing data is that of splitting the
observations into groups and applying a function to each group
separately. So common is this operation that in R there are at least
two functions that implement it: by() and aggregate(). However, using
these functions requires that the data be loaded into RAM, and often
the files are too big to fit in memory. fplyr was born to solve this
problem: it allows one to perform split-apply-combine operations on very
big files; by reading the files chunk by chunk, only a limited number of
rows is stored in memory at any given time.
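For data that already fit in memory, the split-apply-combine pattern can be sketched with the two base R functions mentioned above. The example uses the built-in iris data frame and does not involve fplyr; it is only meant to show the operation that fplyr extends to files.

```r
# In-memory split-apply-combine with base R (no fplyr involved).
data(iris)

# aggregate(): mean sepal length for each species, returned as a data.frame
mean_by_species <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(mean_by_species)

# by(): apply an arbitrary function to each group; here, count the rows
rows_by_species <- by(iris, iris$Species, nrow)
print(rows_by_species)
```

Both calls need the whole data set in memory, which is exactly the limitation discussed above.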
fplyr combines the strengths of two other packages: iotools and
data.table. While iotools has some functions, such as chunk.apply(), to
apply a function to chunks of files, the chunks may not reflect the
actual groups in which the data are partitioned. In particular, a
‘chunk’ may contain observations pertaining to several different groups,
and the task of further splitting them is left to the user. In fplyr, on
the other hand, the further splitting is done automatically (thanks also
to the data.table package), so the user need not worry about it.
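The mismatch between chunks and groups can be illustrated on in-memory data. The sketch below does not call iotools; it only emulates fixed-size chunking on the built-in iris data, with a hypothetical chunk size of 60 rows chosen so that chunks straddle the species boundaries.

```r
# Emulate fixed-size chunking on the in-memory iris data (150 rows, 50 per species).
data(iris)
chunk_size <- 60  # hypothetical chunk size, chosen to straddle group boundaries
chunk_id <- ceiling(seq_len(nrow(iris)) / chunk_size)
chunks <- split(iris, chunk_id)

# Species found in each chunk: the first chunk mixes two species,
# and 'versicolor' is spread over the first and the second chunk
species_per_chunk <- lapply(chunks, function(ch) unique(as.character(ch$Species)))
print(species_per_chunk)

# The extra step that would be left to the user: split a chunk again by group
first_chunk_blocks <- split(chunks[[1]], droplevels(chunks[[1]]$Species))
print(sapply(first_chunk_blocks, nrow))
```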
Before using fplyr you need to ensure that the input file is in the
correct format. First and foremost, the data must be amenable to the
split-apply-combine paradigm, so the observations must be grouped
according to the value of a certain field. We refer to the values of the
‘groupby’ field as the subjects. Thus, for instance, in the famous iris
data set, each species would be a different subject. All the
observations pertaining to the same subject constitute a block.
In fplyr the input file must be formatted in such a way that the first
field contains all the subject IDs. If the IDs are not in the first
field, it won’t work. Moreover, all the observations referring to the
same subject must be consecutive; in other words, the file must be
sorted on the first field, the reason being that the file is read block
by block. Indeed, the subject ID of one line is compared with that of
the previous line, and the reading goes on for as long as the IDs are
the same.[1]
Note that fplyr always ensures that all the rows with the same subject
ID are read together in the same batch, but only if the rows are
consecutive. To make sure that a file complies with these
specifications, it is possible to use *nix command-line tools such as
awk and sort.
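For tables small enough to inspect in memory, the same check can be sketched in R. The helper is_grouped() below is hypothetical and not part of fplyr; it tests whether every subject ID occupies a single run of consecutive rows. The second half shows one way to repair a small table by moving the ID column first and sorting on it.

```r
# `is_grouped` is a hypothetical helper, not part of fplyr.
is_grouped <- function(ids) {
  runs <- rle(as.character(ids))
  # If an ID starts more than one run, its rows are not consecutive
  anyDuplicated(runs$values) == 0
}

ok  <- c("setosa", "setosa", "versicolor", "virginica", "virginica")
bad <- c("setosa", "versicolor", "setosa")
is_grouped(ok)   # TRUE
is_grouped(bad)  # FALSE

# Repair a small table: put the ID column first and sort on it
data(iris)
shuffled <- iris[order(iris$Sepal.Length), ]                # species are now interleaved
fixed <- shuffled[order(shuffled$Species), c(5, 1:4)]       # Species first, rows grouped
is_grouped(shuffled$Species)  # FALSE
is_grouped(fixed$Species)     # TRUE
```

For files that do not fit in memory, the command-line tools mentioned above remain the more practical choice.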
As an example file, in this vignette we will use a modified version of
the iris dataset where the species has been relocated to the first
column. This file is very small and would probably be accommodated even
in the RAM of old hardware, so fplyr would not be necessary.
Nevertheless, this file is attached to the package, meaning that it will
be immediately available to all users, and despite its having only three
blocks, it will still illustrate the most important features of fplyr.
We begin by storing the path to this file into the variable f:
library(fplyr) # Also loads data.table, which provides fread()
f <- system.file("extdata", "dt_iris.csv", package = "fplyr")
# Let's have a look at the first four lines of the file
fread(f, nrows = 4)
#>    Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1:  setosa          5.1         3.5          1.4         0.2
#> 2:  setosa          4.9         3.0          1.4         0.2
#> 3:  setosa          4.7         3.2          1.3         0.2
#> 4:  setosa          4.6         3.1          1.5         0.2
Use flply() when you want to obtain a list where each element corresponds to a subject and contains the result of the processing of the corresponding block. In our examples, the output of flply() will contain three elements, one for each Iris species. The elements of the list will be conveniently named after the subject IDs.
fplyr allows you to apply a function to each block of the file. For the
sake of distinguishing the user-specified function to be applied to each
block from other functions, we shall refer to it as FUN. In the first
example we will obtain the summary() of each species. In general, all
the functions in the package support two fundamental arguments: the path
to the input file, and FUN.
species_summ <- flply(input = f, FUN = summary)
# Now `species_summ` is a list of three elements; let's show the 'versicolor' element
species_summ$versicolor
#>    Species           Sepal.Length    Sepal.Width     Petal.Length
#>  Length:50          Min.   :4.900   Min.   :2.000   Min.   :3.00
#>  Class :character   1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00
#>  Mode  :character   Median :5.900   Median :2.800   Median :4.35
#>                     Mean   :5.936   Mean   :2.770   Mean   :4.26
#>                     3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60
#>                     Max.   :7.000   Max.   :3.400   Max.   :5.10
#>   Petal.Width
#>  Min.   :1.000
#>  1st Qu.:1.200
#>  Median :1.300
#>  Mean   :1.326
#>  3rd Qu.:1.500
#>  Max.   :1.800
For flply(), FUN can be any function that takes as input a
“data.frame”; summary() was just an example, but other appropriate
functions are str(), as.matrix(), and so on. Of course, if you cannot
find a function that does what you want, you can write your own FUN, as
we shall see in the next example, where we’ll perform hierarchical
clustering within each species.[2] Note that this is also how functions
like lapply() work.
clusters <- flply(f, FUN = function(d) {
  dm <- dist(d[, -1]) # Compute the distance matrix, excluding the first field
  hclust(dm) # Perform the clustering and return the object
})
# The `clusters` variable contains one "hclust" object for each species.
# Let's plot the 'setosa' dendrogram
plot(clusters$setosa)
If FUN takes more than one argument, it is possible to pass any
additional argument directly to flply(): they will be passed, in turn,
to FUN. For instance, suppose that we want to use kmeans() instead of
hclust(), and we want to specify the number of centroids as an
additional parameter. In the next example we will also define FUN as a
separate function before using it, rather than writing an anonymous
function as in the previous example. The output will be a “kmeans”
object for each species.
kmeans_FUN <- function(d, my_centers) {
  kmeans(d[, -1], centers = my_centers)
}
my_centers <- 2
# We pass `my_centers` to flply(), and flply() passes it to kmeans_FUN
clusters <- flply(f, FUN = kmeans_FUN, my_centers)
# Let's display the centers of the 'setosa' clusters
clusters$setosa$centers
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     4.818182    3.236364     1.433333   0.2303030
#> 2     5.370588    3.800000     1.517647   0.2764706
# Now let's do the same thing, but with three centers for each species
my_centers <- 3
clusters <- flply(f, FUN = kmeans_FUN, my_centers)
clusters$setosa$centers
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     5.512500    4.000000     1.475000    0.275000
#> 2     5.100000    3.513043     1.526087    0.273913
#> 3     4.678947    3.084211     1.378947    0.200000
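Forwarding additional arguments in this way is the usual R idiom based on `...`. The sketch below mimics it on in-memory data with a hypothetical helper, apply_blocks(), which is not part of fplyr and is not meant to describe how flply() is implemented; it assumes, as in the example file, that the first column holds the subject IDs.

```r
# `apply_blocks` is a hypothetical in-memory stand-in for flply():
# it splits `df` on its first column and forwards `...` to FUN.
apply_blocks <- function(df, FUN, ...) {
  blocks <- split(df, as.character(df[[1]]))
  lapply(blocks, FUN, ...)
}

data(iris)
iris_first <- iris[, c(5, 1:4)]  # species moved to the first column

# A user-defined FUN with one extra argument, `n`
first_rows <- function(d, n) head(d, n)

# `n = 2` is not an argument of apply_blocks(): it travels through `...`
res <- apply_blocks(iris_first, first_rows, n = 2)
names(res)
sapply(res, nrow)
```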
The last example of this section may be a bit surprising. Since, in R,
[[ is a function, nothing prevents us from using it as FUN to select
only, say, the second column of each block. Admittedly, however, in this
case it would be better to use the select option (see ?flply and ?fread,
or wait for the Other options subsection).
sepal_length <- flply(f, `[[`, 2)
# Now `sepal_length` contains all the sepal lengths, divided by species
sepal_length
#> $setosa
#> [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7
#> [20] 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9
#> [39] 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0
#>
#> $versicolor
#> [1] 7.0 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2
#> [20] 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3
#> [39] 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7
#>
#> $virginica
#> [1] 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
#> [20] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4
#> [39] 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
We followed the same naming convention as the plyr package. The name of
each function consists of two letters followed by ‘ply’: the first
letter represents the type of input, whereas the second letter
characterizes the type of output, and the final ‘ply’ marks the relation
with the existing ‘apply’ family of functions. The first letter is
usually ‘f’, because the input is the path to a file. The second letter
is ‘l’ if the output is a list, as in flply(); it is ‘t’ if the output
is a “data.table”, ‘f’ if the output is another file, and ‘m’ if the
output can be multiple things.
Use ftply() to return a “data.table”; the rows corresponding to the
different subjects will be bound together with rbind. Needless to say,
in this case FUN must return a “data.frame” or a “data.table”, while in
flply() there was no such restriction. (When fplyr is loaded, the
data.table package is loaded as well.) Moreover, in this case FUN has to
take at least two arguments: the first one being a “data.table”
corresponding to the current block being processed, and the second one
being a character vector containing the subject ID. This is best
explained with an example:
selected_flowers <- ftply(f, function(d, by) {
  if (by == "setosa")
    return(NULL)
  else
    return(d)
})
#> Warning in ftply(f, function(d, by) {: Block setosa returned an empty
#> data.table.
# Let's have a look at the first few lines of the output; note that it starts
# directly with 'versicolor', because all the 'setosa' flowers have been omitted
head(selected_flowers, 4)
#>       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1: versicolor          7.0         3.2          4.7         1.4
#> 2: versicolor          6.4         3.2          4.5         1.5
#> 3: versicolor          6.9         3.1          4.9         1.5
#> 4: versicolor          5.5         2.3          4.0         1.3
Here, we are skipping the ‘setosa’ species. The result will be equal to
the input, except that the rows corresponding to the setosa flowers will
be omitted. Notice also that fplyr warns us that one block didn’t return
any output. In general, the behavior of ftply() is equivalent to flply()
followed by rbind on the resulting list.
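The equivalence just stated can be sketched on in-memory data with base R alone. The code below is an illustration under that assumption, not fplyr's implementation: it first builds a list with one result per block, returning NULL for 'setosa', and then binds the list with rbind().

```r
data(iris)
iris_first <- iris[, c(5, 1:4)]  # species moved to the first column
blocks <- split(iris_first, as.character(iris_first$Species))

# Step 1 (list output, as with flply()): one result per block, NULL for 'setosa'
res_list <- lapply(names(blocks), function(by) {
  if (by == "setosa") NULL else blocks[[by]]
})

# Step 2: bind the per-block results; rbind() drops the NULL entry
combined <- do.call(rbind, res_list)
nrow(combined)
unique(as.character(combined$Species))
```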
Importantly, the d argument to FUN contains a “data.table” of the
current block being processed, but without the first field. This is
just for efficiency reasons; the first field will be added back to the
output of FUN. In fact, the following example will show that inside FUN
the d data set has only four columns, whereas normally it would have
five.
count_cols <- function(d, by) {
  ncol(d)
}
ftply(f, count_cols)
#>       Species V1
#> 1:     setosa  4
#> 2: versicolor  4
#> 3:  virginica  4
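The handling of the first field can be mimicked on in-memory data. The helper emulate_ftply() below is hypothetical and is not how fplyr is implemented; it only reproduces the behavior described above, namely that FUN receives the block without the first field and that the subject ID is added back to whatever FUN returns. Unlike the example above, FUN is assumed here to return a “data.frame”.

```r
# `emulate_ftply` is a hypothetical in-memory helper, not fplyr's implementation.
emulate_ftply <- function(df, FUN) {
  key_name <- names(df)[1]
  blocks <- split(df, as.character(df[[1]]))
  results <- lapply(names(blocks), function(by) {
    d <- blocks[[by]][, -1, drop = FALSE]  # FUN never sees the first field
    out <- FUN(d, by)
    if (is.null(out) || nrow(out) == 0) return(NULL)
    key <- data.frame(rep(by, nrow(out)), stringsAsFactors = FALSE)
    names(key) <- key_name
    cbind(key, out)                        # the first field is added back here
  })
  do.call(rbind, results)
}

data(iris)
iris_first <- iris[, c(5, 1:4)]
col_counts <- emulate_ftply(iris_first, function(d, by) data.frame(n_cols = ncol(d)))
print(col_counts)
```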
nblocks option
ftply() can also be used to quickly glance at the data, much like one
would use the head() function. Indeed, we can specify the nblocks option
to select only the first block; thus, we can see what the data look like
without loading the whole file into memory. By default, in ftply() FUN
returns the data without modifying them, so in this case we can avoid
specifying FUN. Incidentally, all the other functions support the
nblocks option as well; it is intended to be the analogue of nrows in
read.table() and fread().
flowers_head <- ftply(f, nblocks = 1)
# Now `flowers_head` has 50 observations, while the original data set had 150.
# Let's have a look at the first ones.
head(flowers_head, 4)
#>    Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1:  setosa          5.1         3.5          1.4         0.2
#> 2:  setosa          4.9         3.0          1.4         0.2
#> 3:  setosa          4.7         3.2          1.3         0.2
#> 4:  setosa          4.6         3.1          1.5         0.2
Another useful option is parallel, with which it is possible to specify
the number of threads that fplyr can use. Like nblocks, the parallel
option is also supported by all the functions. It is not necessary to
initialize any cluster, but this option has an effect only on Unix-like
systems, not on Windows. In the following example we will select, for
each block, a random sample of 10 observations.
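A per-block random sample can be sketched as follows on in-memory data, using mclapply() from the base parallel package, which likewise runs in parallel only on Unix-like systems. This is a stand-in under stated assumptions, not fplyr code: the names blocks, n_cores and sample_block are hypothetical. With fplyr itself, one would presumably pass a FUN of the same kind together with the parallel option, for example ftply(f, FUN, parallel = 2); that call is an assumption based on the option name given above.

```r
library(parallel)  # part of base R

data(iris)
iris_first <- iris[, c(5, 1:4)]  # species moved to the first column
blocks <- split(iris_first, as.character(iris_first$Species))

# Forking is unavailable on Windows, so fall back to a single core there
n_cores <- if (.Platform$OS.type == "unix") 2L else 1L

# FUN for one block: a random sample of 10 observations
sample_block <- function(d) d[sample(nrow(d), 10), ]

samples <- mclapply(blocks, sample_block, mc.cores = n_cores)
sampled <- do.call(rbind, samples)
table(sampled$Species)
```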
This package was born to deal with files that are too big to fit into
the available RAM. With fplyr, it is easy to process such files, but
what if even the output of the processing is too big for the memory? One
solution could be to write the output to a file as soon as it is
generated, without ever returning it. This solution is implemented in
the ffply() function, but it works only if FUN returns a “data.table” or
“data.frame”. It is equivalent to calling ftply() followed by
write.table() or fwrite(). This function supports one additional
argument with respect to the previously described functions: the path to
the output file. In the example, we will replace the original
observations with their principal components, block by block.
out <- tempfile() # Create temporary output file
ffply(f, out, function(d, by) {
  # Here, `d` does not contain the subject IDs; they will be automatically added back later
  x <- prcomp(d)$x
  as.data.table(x)
})
# Let's check the result. Note in particular that the subject IDs are present
fread(out, nrows = 4)
#>    Species        PC1         PC2         PC3          PC4
#> 1:  setosa -0.1068424 -0.02489398  0.08216974 -0.034541755
#> 2:  setosa  0.3940472  0.16586593  0.13148092 -0.017551195
#> 3:  setosa  0.3906877 -0.12685112  0.07181182  0.009744303
#> 4:  setosa  0.5117016 -0.02656106 -0.11121361 -0.032673214
For ffply(), FUN must take two arguments, as in ftply(). The return
value of ffply() is the number of processed blocks.
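The idea of writing each block's result as soon as it is produced can be sketched with base R on in-memory data. The helper emulate_ffply() is hypothetical and does not reflect how fplyr is implemented; it appends every result to the output file, writes the header only once, and returns the number of processed blocks.

```r
# `emulate_ffply` is a hypothetical in-memory helper, not fplyr's implementation.
emulate_ffply <- function(df, output, FUN) {
  key_name <- names(df)[1]
  blocks <- split(df, as.character(df[[1]]))
  first <- TRUE
  for (by in names(blocks)) {
    res <- FUN(blocks[[by]][, -1, drop = FALSE], by)
    key <- data.frame(rep(by, nrow(res)), stringsAsFactors = FALSE)
    names(key) <- key_name
    # Append to the file right away; write the header only once
    write.table(cbind(key, res), output, sep = "\t", quote = FALSE,
                row.names = FALSE, col.names = first, append = !first)
    first <- FALSE
  }
  length(blocks)  # number of processed blocks
}

data(iris)
iris_first <- iris[, c(5, 1:4)]
out_file <- tempfile()
n_blocks <- emulate_ffply(iris_first, out_file, function(d, by) {
  as.data.frame(prcomp(d)$x)
})
written <- read.delim(out_file)
dim(written)
```

Only one block's result is held in memory at a time, which is the point of writing to a file instead of returning the combined table.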
Besides the options we have already discussed, such as nblocks and
parallel, all the functions in the package support a set of core options
that modify how the file is read. These options are as follows.
key.sep: The character that delimits the first field from the rest [default: “\t”].
sep: The field delimiter (often equal to key.sep) [default: “\t”].
skip: Number of lines to skip at the beginning of the file [default: 0].
header: Whether the file has a header [default: TRUE].
nblocks: The number of blocks to read [default: Inf].
stringsAsFactors: Whether to convert strings into factors [default: FALSE].
colClasses: Vector or list specifying the class of each field [default: NULL].
select: The columns (names or numbers) to be read [default: NULL].
drop: The columns (names or numbers) not to be read [default: NULL].
col.names: Names of the columns [default: NULL].
With the exception of key.sep, all these options are comprehensively
documented in the help page of data.table’s fread() function (?fread).
For key.sep, see the help page of iotools’ read.chunk() (?read.chunk).
For the last function, suppose that the analysis of each block produces
several output files; for instance, we may want to compute the principal
components as well as a nonlinear transformation of the variables, for
each block, and save them to two separate output files. In this case, we
can use fmply(). Like ffply(), it too supports the output option, but
this time it can be a vector of many paths. Accordingly, FUN should now
return a list of “data.table”s, one for each of the output files.
out <- c(pca = tempfile(), transf = tempfile())
# Note that the vector need not be named; we use these names just for convenience
analyze_block <- function(d) {
  # Here, `d` does contain the subject IDs, so we have to remove them...
  x <- prcomp(d[, -1])$x
  # ...and add them back manually
  x <- cbind(d[, 1], x)
  # Transform each number 'z' into e^(-z)
  y <- cbind(d[, 1], exp(-d[, -1]))
  # Return a list of two "data.table"s
  list(x, y)
}
fmply(f, out, analyze_block)
Notice that, contrary to ffply(), FUN takes only one argument, and it is
the full block, including the first field. Therefore, we had to remove
this field when we computed the principal components, and add it back at
the end. (In ffply() and ftply() this is done automatically.) Moreover,
FUN should now return two values, the first of which is printed to the
first output file, and the second of which is printed to the second
output file. There is no limit to the number of output files, but the
order of the output files and of the values returned by FUN must match
(named vectors and lists are not taken into account at the moment).
Sometimes it is also necessary to return objects that are not printable
as “data.table”s. For instance, suppose that, besides printing the
principal components to the output file, we also wanted to return the
"prcomp" object. In these cases, fmply() is still helpful, because it
allows FUN to return one more element, which in turn will be returned by
fmply(). For example, consider the following modification of
analyze_block():
analyze_block2 <- function(d) {
  pca <- prcomp(d[, -1])
  x <- cbind(d[, 1], pca$x)
  y <- cbind(d[, 1], exp(-d[, -1]))
  # 'x' and 'y' are the same as before, but now we add the 'pca' object
  list(x, y, pca)
}
iris_pca <- fmply(f, out, analyze_block2)
# Let's have a look at the screeplot of the 'versicolor' PCA
screeplot(iris_pca$versicolor)
Here, FUN returns three values, but there are only two output files. The
third value returned by FUN, then, is returned at the end by fmply(). In
particular, the variable iris_pca will be a list of three "prcomp"
objects, one for each species.
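The routing of the returned values can be sketched on in-memory data with base R. The helper emulate_fmply() is hypothetical and is not fplyr's implementation; it only mirrors the behavior described above: the i-th value returned by FUN is appended to the i-th output file, and one value beyond the number of output files is collected and returned, named after the subject IDs.

```r
# `emulate_fmply` is a hypothetical in-memory helper, not fplyr's implementation.
emulate_fmply <- function(df, output, FUN) {
  blocks <- split(df, as.character(df[[1]]))
  extras <- list()
  first <- TRUE
  for (by in names(blocks)) {
    res <- FUN(blocks[[by]])  # FUN receives the full block, first field included
    # The i-th returned value goes to the i-th output file
    for (i in seq_along(output)) {
      write.table(res[[i]], output[[i]], sep = "\t", quote = FALSE,
                  row.names = FALSE, col.names = first, append = !first)
    }
    # One value beyond the output files, if present, is kept and returned
    if (length(res) > length(output)) extras[[by]] <- res[[length(output) + 1]]
    first <- FALSE
  }
  extras
}

data(iris)
iris_first <- iris[, c(5, 1:4)]
out_files <- c(pca = tempfile(), transf = tempfile())
pca_list <- emulate_fmply(iris_first, out_files, function(d) {
  pca <- prcomp(d[, -1])
  list(cbind(d[, 1, drop = FALSE], pca$x),
       cbind(d[, 1, drop = FALSE], exp(-d[, -1])),
       pca)
})
sapply(pca_list, class)
```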
[1] Actually, it is a bit more complicated than that: the iotools
package takes care of the reading, so the file is read chunk by chunk,
not block by block, and then each chunk is split into its constituent
blocks; you can read more about how iotools reads files in the help page
of the chunk.reader() function.
[2] Yes, I know this clustering is pointless, but the example is just meant to illustrate the kind of things that one can do, provided that one has access to more appropriate data sets.