{disk.frame} has been soft-deprecated in favor of {arrow}. As of the {arrow} 6.0.0 release, {arrow} handles larger-than-RAM data analysis quite well (see the release notes). Hence, there is no strong reason to prefer {disk.frame} unless you have very specific feature needs.
For the above reason, I’ve decided to soft-deprecate {disk.frame}, which means I will no longer actively develop new features for it, but it will remain on CRAN in maintenance mode.
To help with the transition, I’ve created the function disk.frame::disk.frame_to_parquet(df, outdir) to help you convert existing {disk.frame}s to the parquet format so you can use {arrow} with them.
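For example, a minimal sketch of the conversion workflow (`df` and the output folder name here are placeholders for your own disk.frame and path):

``` r
library(disk.frame)
library(arrow)

# convert an existing disk.frame to a folder of parquet files
# (df and "my_data_parquet" are placeholders)
disk.frame_to_parquet(df, outdir = "my_data_parquet")

# the converted data can then be analysed out-of-core with {arrow}
ds <- arrow::open_dataset("my_data_parquet")
```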
I am working on a reincarnation of {disk.frame} in Julia, so {disk.frame} will live on!
Thank you for supporting {disk.frame}. I’ve learnt a lot along the way, but the time has come to move on!
How do I manipulate tabular data that doesn’t fit into Random Access Memory (RAM)?

Use {disk.frame}!
In a nutshell, {disk.frame} makes use of two simple ideas:

1. split up a larger-than-RAM dataset into chunks and store each chunk in a separate file inside a folder, and
2. provide a convenient API to manipulate these chunks.
{disk.frame} performs a similar role to distributed systems such as Apache Spark, Python’s Dask, and Julia’s JuliaDB.jl for medium data: datasets that are too large for RAM but not quite large enough to qualify as big data.
You can install the released version of {disk.frame} from CRAN with:

``` r
install.packages("disk.frame")
```
And the development version from GitHub with:

``` r
# install.packages("devtools")
devtools::install_github("DiskFrame/disk.frame")
```
On some platforms, such as SageMaker, you may need to explicitly specify a repository, like this:

``` r
install.packages("disk.frame", repos = "https://cran.rstudio.com")
```
Please see these vignettes and articles about {disk.frame}:

- Quick start: {disk.frame}, which replicates the sparklyr vignette for manipulating the nycflights13 flights data
- Ingesting data into {disk.frame}, which lists some common ways of creating disk.frames
- {disk.frame} can be more epic!, which shows some ways of loading large CSVs and the importance of srckeep
- The dfglm function for fitting generalized linear models

## What is {disk.frame} and why create it?

{disk.frame}
is an R package that provides a framework
for manipulating larger-than-RAM structured tabular data on disk
efficiently. The reason one would want to manipulate data on disk is
that it allows arbitrarily large datasets to be processed by R. In other
words, we go from “R can only deal with data that fits in RAM” to “R can
deal with any data that fits on disk”. See the next section.
## How is it different to data.frame and data.table?

A data.frame in R is an in-memory data structure, which means that R must load the data in its entirety into RAM. A corollary of this is that only data that can fit into RAM can be processed using data.frames. This places significant restrictions on what R can process with minimal hassle.
In contrast, {disk.frame} provides a framework to store and manipulate data on the hard drive. It does this by loading only a small part of the data, called a chunk, into RAM, processing the chunk, writing out the results, and repeating with the next chunk. This chunking strategy is widely applied in other packages to enable processing of large amounts of data in R; for example, see chunked, arkdb, and iotools.
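To make the chunking idea concrete, here is a minimal, illustrative sketch in base R of processing a CSV one chunk at a time (the helper `process_in_chunks` is hypothetical and not part of {disk.frame}, which implements chunking far more efficiently using {fst} and {future}):

``` r
# Hypothetical helper illustrating the chunking strategy: read a CSV one
# chunk at a time, apply fn to each chunk, then combine the results.
process_in_chunks <- function(path, fn, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # naive header parsing (assumes no quoted commas); fine for a sketch
  header <- strsplit(readLines(con, n = 1L), ",")[[1]]
  results <- list()
  repeat {
    chunk <- tryCatch(
      read.csv(con, header = FALSE, col.names = header, nrows = chunk_size),
      error = function(e) NULL  # read.csv errors once no lines remain
    )
    if (is.null(chunk)) break
    results[[length(results) + 1L]] <- fn(chunk)  # only one chunk is in RAM
  }
  do.call(rbind, results)  # combine the per-chunk results
}

# e.g. count rows per chunk without ever loading the whole file:
# process_in_chunks("big.csv", function(d) data.frame(n = nrow(d)))
```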
Furthermore, there is a row-limit of 2^31 for data.frames in R; hence an alternative approach is needed to apply R to these large datasets. The chunking mechanism in {disk.frame} provides such an avenue to enable data manipulation beyond the 2^31 row limit.
## How is {disk.frame} different to previous “big” data solutions for R?

R has many packages that can deal with larger-than-RAM datasets, including ff and bigmemory. However, ff and bigmemory restrict the user to primitive data types such as double, which means they do not support character (string) and factor types. In contrast, {disk.frame} makes use of data.table::data.table and data.frame directly, so all data types are supported. Also, {disk.frame} strives to provide an API that is as similar as possible to data.frame’s. {disk.frame} supports many dplyr verbs for manipulating disk.frames.
Additionally, {disk.frame} supports parallel data operations using infrastructure provided by the excellent future package to take advantage of multi-core CPUs. Further, {disk.frame} uses state-of-the-art data storage techniques, such as fast data compression and random access to rows and columns, provided by the fst package to deliver superior data manipulation speeds.
## How does {disk.frame} work?

{disk.frame} works by breaking large datasets into smaller individual chunks and storing the chunks in fst files inside a folder. Each chunk is a fst file containing a data.frame/data.table. One can reconstruct the original large dataset by loading all the chunks into RAM and row-binding them into one large data.frame. Of course, in practice this isn’t always possible; hence why we store them as smaller individual chunks.
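For example, a brief sketch of the on-disk layout (the object name and output folder here are placeholder choices):

``` r
library(disk.frame)
setup_disk.frame()

# split the flights data into chunks; each chunk is one fst file in the folder
flights_chunks.df <- as.disk.frame(nycflights13::flights, outdir = "flights_chunks.df")

nchunks(flights_chunks.df)       # how many chunks the data was split into
list.files("flights_chunks.df")  # the individual fst chunk files
```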
{disk.frame} makes it easy to manipulate the underlying chunks by implementing dplyr functions/verbs and other convenient functions (e.g. the cmap(a.disk.frame, fn, lazy = F) function, which applies the function fn to each chunk of a.disk.frame in parallel), so that a {disk.frame} can be manipulated in a similar fashion to in-memory data.frames.
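Continuing the sketch above, a hedged example of chunk-wise mapping; it assumes the row-binding variant cmap_dfr, which applies a function to each chunk like cmap but row-binds the per-chunk results into a single in-memory data.frame:

``` r
# apply a user-defined function to each chunk in parallel and row-bind
# the per-chunk results into one data.frame
per_chunk_rows <- cmap_dfr(flights_chunks.df, function(chunk) {
  data.frame(rows_in_chunk = nrow(chunk))
})
per_chunk_rows
```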
## How is {disk.frame} different from Spark, Dask, and JuliaDB.jl?

Spark is primarily a distributed system that also works on a single machine. Dask is a Python package that is most similar to {disk.frame}, and JuliaDB.jl is a Julia package. All three can distribute work over a cluster of computers. However, {disk.frame} currently cannot distribute data processing over many computers, and is, therefore, single-machine focused.
In R, one can access Spark via sparklyr, but that requires a Spark cluster to be set up. On the other hand, {disk.frame} requires zero set-up apart from running install.packages("disk.frame") or devtools::install_github("DiskFrame/disk.frame").
Finally, Spark can only apply functions that are implemented for
Spark, whereas {disk.frame}
can use any function in R
including user-defined functions.
## Set-up {disk.frame}

{disk.frame} works best if it can process multiple data chunks in parallel. The best way to set up {disk.frame} so that each CPU core runs a background worker is by using:

``` r
setup_disk.frame()

# this allows large datasets to be transferred between sessions
options(future.globals.maxSize = Inf)
```

The setup_disk.frame() call sets up background workers equal to the number of CPU cores; please note that, by default, hyper-threaded cores are counted as one, not two. Alternatively, one may specify the number of workers using setup_disk.frame(workers = n).
``` r
suppressPackageStartupMessages(library(disk.frame))
library(nycflights13)

# this will set up disk.frame's parallel backend with a number of workers equal
# to the number of CPU cores (hyper-threaded cores are counted as one, not two)
setup_disk.frame()
#> The number of workers available for disk.frame is 6

# this allows large datasets to be transferred between sessions
options(future.globals.maxSize = Inf)

# convert the flights data.frame to a disk.frame
# optionally, you may specify an outdir; otherwise, the disk.frame is created
# in a temporary folder
flights.df <- as.disk.frame(nycflights13::flights)
```
{disk.frame} aims to support as many dplyr verbs as possible. For example:

``` r
flights.df %>%
  filter(year == 2013) %>%
  mutate(origin_dest = paste0(origin, dest)) %>%
  head(2)
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
#> 1: 2013 1 1 517 515 2 830 819 11
#> 2: 2013 1 1 533 529 4 850 830 20
#> carrier flight tailnum origin dest air_time distance hour minute time_hour
#> 1: UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00
#> 2: UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00
#> origin_dest
#> 1: EWRIAH
#> 2: LGAIAH
```
Starting from {disk.frame} v0.3.0, there is group_by support for a limited set of functions. For example:

``` r
result_from_disk.frame = iris %>%
  as.disk.frame %>%
  group_by(Species) %>%
  summarize(
    mean(Petal.Length),
    sumx = sum(Petal.Length/Sepal.Width),
    sd(Sepal.Width/Petal.Length),
    var(Sepal.Width/Sepal.Width),
    l = length(Sepal.Width/Sepal.Width + 2),
    max(Sepal.Width),
    min(Sepal.Width),
    median(Sepal.Width)
  ) %>%
  collect
```
The results should be exactly the same as if applying the same group-by operations on a data.frame. If not, please report a bug.
If a function you like is missing, please make a feature request here. It is a limitation that functions that depend on the ordering of a column can only be computed using estimation methods.
| Function | Exact/Estimate | Notes |
|---|---|---|
| min | Exact | |
| max | Exact | |
| mean | Exact | |
| sum | Exact | |
| length | Exact | |
| n | Exact | |
| n_distinct | Exact | |
| sd | Exact | |
| var | Exact | var(x) only; cor, cov support planned |
| any | Exact | |
| all | Exact | |
| median | Estimate | |
| quantile | Estimate | One quantile only |
| IQR | Estimate | |
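As a hedged illustration using flights.df from above: per the table, a grouped median is computed by an estimation method rather than exactly:

``` r
# grouped median: an estimate, since an exact median would require a full sort
# (the na.rm argument is assumed to pass through as it does for exact functions)
flights.df %>%
  group_by(origin) %>%
  summarize(median_dep_delay = median(dep_delay, na.rm = TRUE)) %>%
  collect()
```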
To find out where the disk.frame is stored on disk:
``` r
# where is the disk.frame stored?
attr(flights.df, "path")
#> [1] "C:\\Users\\RTX2080\\AppData\\Local\\Temp\\Rtmp2txRzU\\file236422fb5a83.df"
```
A number of data.frame functions are implemented for disk.frame:

``` r
# get first few rows
head(flights.df, 1)
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
#> 1: 2013 1 1 517 515 2 830 819 11
#> carrier flight tailnum origin dest air_time distance hour minute time_hour
#> 1: UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00

# get last few rows
tail(flights.df, 1)
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
#> 1: 2013 9 30 NA 840 NA NA 1020 NA
#> carrier flight tailnum origin dest air_time distance hour minute time_hour
#> 1: MQ 3531 N839MQ LGA RDU NA 431 8 40 2013-09-30 08:00:00

# number of rows
nrow(flights.df)
#> [1] 336776

# number of columns
ncol(flights.df)
#> [1] 19
```
This project exists thanks to all the people who contribute.
The work priorities at this stage are
| Title | Language | Author | Date | Description |
|---|---|---|---|---|
| https://www.researchgate.net/post/What-is-the-Maximum-size-of-data-that-is-supported-by-R-datamining | English | Knut Jägersberg | 2019-11-11 | Great answer on using disk.frame |
| {disk.frame} is epic | English | Bruno Rodriguez | 2019-09-03 | It’s about loading a 30G file into {disk.frame} |
| My top 10 R packages for data analytics | English | Jacky Poon | 2019-09-03 | {disk.frame} was number 3 |
| useR! 2019 presentation video | English | Dai ZJ | 2019-08-03 | |
| useR! 2019 presentation slides | English | Dai ZJ | 2019-08-03 | |
| Split-apply-combine for Maximum Likelihood Estimation of a linear model | English | Bruno Rodriguez | 2019-10-06 | {disk.frame} used in helping to create a maximum likelihood estimation program for linear models |
| Emma goes to useR! 2019 | English | Emma Vestesson | 2019-07-16 | The first mention of {disk.frame} in a blog post |
| 深入对比数据科学工具箱：Python3 和 R 之争（2020版） (An in-depth comparison of data science toolboxes: Python 3 vs R, 2020 edition) | Chinese | Harry Zhu | 2020-02-16 | Mentions disk.frame |
## Interested in learning {disk.frame} in a structured course?

Please register your interest at: https://leanpub.com/c/taminglarger-than-ramwithdiskframe
If you like {disk.frame} and want to speed up its development, or perhaps you have a feature request, please consider sponsoring {disk.frame} on Open Collective.
Thank you to all our backers!

Support {disk.frame} development by becoming a sponsor. Your logo will show up here with a link to your website.
Do you need help with machine learning and data science in R, Python, or Julia? I am available for Machine Learning/Data Science/R/Python/Julia consulting! Email me
Do you wish to give back to the open-source community in non-financial ways? Here are some ways you can contribute:

- Blog about your {disk.frame} usage or experience. I would love to learn more about how {disk.frame} has helped you
- Share {disk.frame} to help promote it
- Star the {disk.frame} Github repo
- Star the packages that {disk.frame} depends on, e.g. {fst} and {future}
Related repos:

- https://github.com/DiskFrame/disk.frame-fannie-mae-example
- https://github.com/DiskFrame/disk.frame-vs
- https://github.com/DiskFrame/disk.frame.ml