santoku is a versatile cutting tool for R. It provides chop(), a replacement for base::cut().
Install from r-universe:
install.packages("santoku", repos = c("https://hughjonesd.r-universe.dev",
"https://cloud.r-project.org"))
Or from CRAN:
install.packages("santoku")
Or get the development version from GitHub:
# install.packages("remotes")
remotes::install_github("hughjonesd/santoku")
Here are some advantages of santoku:
By default, chop() always covers the whole range of the data, so you won’t get unexpected NA values.
chop() can handle single values as well as intervals. For example, chop(x, breaks = c(1, 2, 2, 3)) will create a separate factor level for values exactly equal to 2.
chop() can handle many kinds of data, including numbers, dates and times, and units.
chop_* functions create intervals in many ways, using quantiles of the data, standard deviations, fixed-width intervals, equal-sized groups, or pretty intervals for use in graphs.
It’s easy to label intervals: use names for your breaks vector, or use a lbl_* function to create interval notation like [1, 2), dash notation like 1-2, or arbitrary styles using glue::glue().
tab_* functions quickly chop data, then tabulate it.
These advantages make santoku especially useful for exploratory analysis, where you may not know the range of your data in advance.
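The point about covering the whole range is easiest to see next to base::cut(). The following is a minimal sketch, assuming santoku is installed; the variable names are made up for illustration.

```r
library(santoku)

x <- 1:5
breaks <- c(2, 4)

# base::cut() only covers the interval (2, 4], so values outside it become NA
from_cut <- cut(x, breaks)

# chop() extends the breaks to cover the whole range of x by default
from_chop <- chop(x, breaks)

sum(is.na(from_cut))   # how many values cut() could not place
sum(is.na(from_chop))  # how many values chop() could not place
```

With unfamiliar data, comparing the two NA counts is a quick way to check whether a set of breaks silently drops observations.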
library(santoku)
chop() returns a factor:
chop(1:5, c(2, 4))
#> [1] [1, 2) [2, 4) [2, 4) [4, 5] [4, 5]
#> Levels: [1, 2) [2, 4) [4, 5]
Include a number twice to match it exactly:
chop(1:5, c(2, 2, 4))
#> [1] [1, 2) {2} (2, 4) [4, 5] [4, 5]
#> Levels: [1, 2) {2} (2, 4) [4, 5]
Use names in breaks for labels:
chop(1:5, c(Low = 1, Mid = 2, High = 4))
#> [1] Low Mid Mid High High
#> Levels: Low Mid High
Or use lbl_* functions:
chop(1:5, c(2, 4), labels = lbl_dash())
#> [1] 1—2 2—4 2—4 4—5 4—5
#> Levels: 1—2 2—4 4—5
Chop into fixed-width intervals:
chop_width(runif(10), 0.1)
#> [1] [0.1068, 0.2068) [0.6068, 0.7068) [0.9068, 1.007] [0.006763, 0.1068)
#> [5] [0.9068, 1.007] [0.3068, 0.4068) [0.6068, 0.7068) [0.1068, 0.2068)
#> [9] [0.4068, 0.5068) [0.5068, 0.6068)
#> 7 Levels: [0.006763, 0.1068) [0.1068, 0.2068) ... [0.9068, 1.007]
Or into fixed-size groups:
chop_n(1:10, 5)
#> [1] [1, 6) [1, 6) [1, 6) [1, 6) [1, 6) [6, 10] [6, 10] [6, 10] [6, 10]
#> [10] [6, 10]
#> Levels: [1, 6) [6, 10]
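The quantile-based and pretty-interval variants mentioned earlier follow the same pattern. Here is a brief sketch, assuming a santoku version that provides chop_quantiles() and chop_pretty(); the input vector is invented for illustration, and output is omitted because the exact labels may differ between versions.

```r
library(santoku)

x <- 1:100

# Break at the quartiles of x
by_quartile <- chop_quantiles(x, c(0.25, 0.5, 0.75))

# Break at round numbers, which suits axis labels in graphs
by_pretty <- chop_pretty(x)

# Count the observations in each interval
table(by_quartile)
table(by_pretty)
```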
Chop dates by calendar month, then tabulate:
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
dates <- as.Date("2021-12-31") + 1:90
tab_width(dates, months(1), labels = lbl_discrete(fmt = "%d %b"))
#> 01 Jan—31 Jan 01 Feb—28 Feb 01 Mar—31 Mar
#>            31            28            31
For more information, see the vignette.