Perturbation parameters for count variables
The next task is to define the parameters used to perturb count variables, which can be done with ck_params_cnts(). This function requires as input the result of either pt_create_pParams(), pt_create_pTable() or create_cnt_ptable() from the ptable package. Please also refer to the documentation of that package for information on the required parameters. In this example we use, amongst others, exemplary ptables that are provided by the ptable package for demonstration purposes:
# two different perturbation parameter sets from the ptable-pkg
# an example ptable provided directly
ptab1 <- ptable::pt_ex_cnts()
# creating a ptable by specifying parameters
para2 <- ptable::create_cnt_ptable(
  D = 8, V = 3, js = 2, pstay = 0.5,
  optim = 1, mono = TRUE)
We then need to create the required inputs for the cellKey package.
p_cnts1 <- ck_params_cnts(ptab = ptab1)
p_cnts2 <- ck_params_cnts(ptab = para2)
ck_params_cnts() returns objects that can be used as inputs in method params_cnts_set(). In argument v one may specify count variables for which the supplied perturbation parameters should be used. If v is not specified, the perturbation parameters are used for all count variables.
# use `p_cnts1` for variable "total" (which always exists)
tab$params_cnts_set(val = p_cnts1, v = "total")
## --> setting perturbation parameters for variable 'total'
# use `p_cnts2` for "cnt_highincome"
tab$params_cnts_set(val = p_cnts2, v = "cnt_highincome")
## --> setting perturbation parameters for variable 'cnt_highincome'
It is therefore entirely possible to use different parameter sets for different variables. Modifying perturbation parameters for some variables is easy, too: it is only required to apply the params_cnts_set() method again, which replaces any previously defined parameters.
Perturbation parameters for continuous variables
Setting and defining perturbation parameters for continuous variables works similarly. The required functions are ck_params_nums(), which creates the input objects, and the params_nums_set() method, which sets them. Please note that by specifying the path argument in both ck_params_nums() and ck_params_cnts(), it is possible to additionally save the parameters as a yaml file. Using ck_read_yaml(), these files can later be imported again. This feature is useful for re-using parameter settings.
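A minimal sketch of such a round trip is given here, assuming that the cellKey and ptable packages are installed. The temporary file name is chosen purely for illustration, and the calls are guarded so that they are skipped when the packages are not available.

```r
# hypothetical target file; any writable path should work
yaml_file <- tempfile(fileext = ".yaml")

if (requireNamespace("cellKey", quietly = TRUE) &&
    requireNamespace("ptable", quietly = TRUE)) {
  # create count parameters and additionally write them to `yaml_file`
  p_saved <- cellKey::ck_params_cnts(
    ptab = ptable::pt_ex_cnts(),
    path = yaml_file)

  # import the stored parameters again, e.g. in a later session
  p_restored <- cellKey::ck_read_yaml(path = yaml_file)
}
```

The restored object should then be usable wherever the result of ck_params_cnts() is expected, although this is not verified here.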
The underlying framework for perturbing continuous tables differs from the method proposed by the ABS. One possible approach is based on a “flex function”. This approach (which is described in deliverable D4.2 of the project on perturbative confidentiality methods) allows different magnitudes of noise to be applied to larger and smaller cells. Users can define the required parameters for the flex approach with the function ck_flexparams(). The required inputs are:
fp: the flexpoint defining the point at which the underlying noise coefficient function should reach its desired maximum (which is defined by the first element of p)
p: numeric vector of length 2 with p[1] > p[2], where both elements specify a percentage. The first value refers to the desired maximum perturbation percentage for small cells (depending on fp) while the second element refers to the desired maximum perturbation percentage for large cells.
epsilon: a numeric vector in descending order with all values in [0; 1] and with the first element forced to equal 1. The length of this parameter must correspond to the number of top_k specified in ck_params_nums() (which will be discussed later).
# parameters for the flex-function
p_flex <- ck_flexparams(
  fp = 1000,
  p = c(0.3, 0.03),
  epsilon = c(1, 0.5, 0.2))
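The constraints on p and epsilon stated above can be checked before the parameters are passed on. The helper below is a hypothetical base R sketch and not part of cellKey; it only encodes the rules as they are described in this section.

```r
# Hypothetical helper (not part of cellKey): returns TRUE if `p` and
# `epsilon` satisfy the constraints described above for a given `top_k`.
flex_inputs_ok <- function(p, epsilon, top_k) {
  # p: two percentages with p[1] > p[2]
  p_ok <- is.numeric(p) && length(p) == 2 && p[1] > p[2]
  # epsilon: one value per top contributor, starting at 1,
  # all within [0, 1] and in descending order
  eps_ok <- is.numeric(epsilon) &&
    length(epsilon) == top_k &&
    epsilon[1] == 1 &&
    all(epsilon >= 0 & epsilon <= 1) &&
    !is.unsorted(rev(epsilon))
  p_ok && eps_ok
}

# the values used for p_flex, combined with top_k = 3
flex_inputs_ok(p = c(0.3, 0.03), epsilon = c(1, 0.5, 0.2), top_k = 3)
# first element of epsilon is not 1, so the check fails
flex_inputs_ok(p = c(0.3, 0.03), epsilon = c(0.5, 1), top_k = 2)
```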
The cellKey package offers several ways to select the underlying data that form the base for the perturbation. In ck_params_nums() the specific approach can be selected in argument type. The valid choices for this argument are:
"top_contr": the k largest contributions to each cell are used in the perturbation procedure, with the number k specified in argument top_k
"mean": weighted cell means are used as starting points
"range": the difference between the largest and smallest unweighted contributions of each cell is used as base for the perturbation procedure
"sum": weighted cell values are used as starting points for the perturbation
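To make the four choices more concrete, the following base R sketch computes each base for a single toy cell. The values are made up, and the code reflects only the descriptions above, not the internals of cellKey. In particular, ranking unweighted contributions for "top_contr" is an assumption.

```r
# toy cell: unweighted contributions and sampling weights (made-up values)
x <- c(120, 40, 300, 15, 80)
w <- c(1, 2, 1, 3, 2)
top_k <- 3

# "top_contr": the k largest contributions (unweighted ranking assumed)
base_top_contr <- sort(x, decreasing = TRUE)[seq_len(top_k)]
# "mean": weighted cell mean
base_mean <- weighted.mean(x, w)
# "range": largest minus smallest unweighted contribution
base_range <- max(x) - min(x)
# "sum": weighted cell value
base_sum <- sum(x * w)
```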
Another, more basic approach is to use a constant perturbation magnitude for all cells, independent of their (weighted) values. The required parameters can be defined with ck_simpleparams() as shown below:
# parameters for the simple approach
p_simple <- ck_simpleparams(
  p = 0.05,
  epsilon = 1)
In this approach it is only required to specify a single percentage value p and, as in the case of the flex function, a vector of epsilons that are used when top_k > 1.
Further important parameters for ck_params_nums() are:
mu_c: an extra amount of perturbation applied to sensitive cells (restricted to the first of the top_k noise components). In a following example we demonstrate how to identify sensitive cells for numeric variables.
same_key: a logical value specifying whether the original cell key (TRUE) should be used for the lookup of the largest contributor of a cell or whether a perturbation of the cell key itself (FALSE) should take place.
use_zero_rkeys: a logical value defining whether record keys of units not contributing to a specific numeric variable should be used (TRUE) or ignored (FALSE) when cell keys are computed.
A very important parameter is ptab, which holds the perturbation tables in which perturbation values are looked up. This input can be specified in different ways when numeric variables should be perturbed. In the simplest case it is an object derived from ptable::pt_create_pTable(..., table = "nums"). A more advanced option is to supply a named list, where the allowed names are shown below and each element must be the output of ptable::pt_create_pTable(..., table = "nums").
"all": this ptable will be used for all cells; if specified, list elements named "even" or "odd" are ignored
"even": this perturbation table will be used to look up perturbation values for cells with an even number of contributors
"odd": will be used to look up perturbation values for cells with an odd number of contributors
"small_cells": if specified, this ptable will be used to extract perturbation values for very small cells
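The precedence between these names can be sketched in a few lines of base R. The helper is hypothetical and not part of cellKey. It also ignores "small_cells", because the criterion for a very small cell is not described in this section.

```r
# Hypothetical helper (not part of cellKey): returns the name of the list
# element that would apply to a cell with `n_contrib` contributors.
pick_ptab_name <- function(available, n_contrib) {
  if ("all" %in% available) {
    return("all")  # "all" takes precedence over "even" and "odd"
  }
  # without "all", both parity-specific tables need to be present
  stopifnot(all(c("even", "odd") %in% available))
  if (n_contrib %% 2 == 0) "even" else "odd"
}

pick_ptab_name(c("all", "small_cells"), n_contrib = 7)
pick_ptab_name(c("even", "odd"), n_contrib = 7)
pick_ptab_name(c("even", "odd"), n_contrib = 4)
```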
Please note that if the goal is to have different perturbation tables for cells with an even/odd number of contributors, both "even" and "odd" must be available in the input list. In the chunk below we create four different perturbation tables. For details on the parameters, please look at the documentation of the ptable package, especially ptable::create_num_ptable().
# same ptable for all cells except for very small ones
ex_ptab1 <- ptable::pt_ex_nums(parity = TRUE, separation = TRUE)
We can now use these tables to finally create objects containing all the required information to create perturbed magnitude tables using ck_params_nums(). In the first case we want the same perturbation table (ptab_all) for cells with an even/odd number of contributors but want to use ptab_sc for very small cells.
p_nums1 <- ck_params_nums(
  type = "top_contr",
  top_k = 3,
  ptab = ex_ptab1,
  mult_params = p_flex,
  mu_c = 2,
  same_key = FALSE,
  use_zero_rkeys = TRUE)
The second input we generate should use different ptables for cells with an even/odd number of contributing units (ptab_even and ptab_odd) but should not use a specific perturbation table for very small cells.
ex_ptab2 <- ptable::pt_ex_nums(parity = FALSE, separation = FALSE)
As above, we need to use ck_params_nums() to compute suitable inputs.
p_nums2 <- ck_params_nums(
  type = "mean",
  ptab = ex_ptab2,
  mult_params = p_simple,
  mu_c = 1.5,
  same_key = FALSE,
  use_zero_rkeys = TRUE)
The package internally computes the separation point that is used for very small cells in case this is required. Details on this can also be found in deliverable D4.2.
Now we can attach the results from ck_params_nums() to numeric variables using the params_nums_set() method as shown below:
tab$params_nums_set(v = "income", val = p_nums1)
## --> setting perturbation parameters for variable 'income'
tab$params_nums_set(v = "savings", val = p_nums1)
## --> setting perturbation parameters for variable 'savings'
In order to make use of parameter mu_c, which allows adding an extra amount of protection to sensitive cells, one may identify sensitive cells according to some rules. The following methods to identify sensitive cells are implemented:
supp_p(): identify sensitive cells based on the p%-rule
supp_pq(): identify sensitive cells based on the pq%-rule
supp_nk(): identify sensitive cells based on the nk-dominance rule
supp_freq(): identify sensitive cells based on minimal frequencies for the (weighted) number of contributors
supp_val(): identify sensitive cells based on (weighted) cell values
supp_cells(): identify sensitive cells based on their “names”
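For readers unfamiliar with the dominance-type rules, the following base R sketch shows the textbook form of the p%-rule and the nk-dominance rule for a single cell with non-negative, unweighted contributions. These functions are illustrations only; the arguments and exact behaviour of the corresponding cellKey methods may differ.

```r
# textbook p%-rule: a cell is sensitive if the remainder after removing
# the two largest contributions is less than p percent of the largest one
is_sensitive_p <- function(x, p) {
  x <- sort(x, decreasing = TRUE)
  remainder <- sum(x) - sum(x[seq_len(min(2, length(x)))])
  remainder < p / 100 * x[1]
}

# textbook (n, k)-dominance rule: a cell is sensitive if the n largest
# contributions account for more than k percent of the cell total
is_sensitive_nk <- function(x, n, k) {
  x <- sort(x, decreasing = TRUE)
  sum(x[seq_len(min(n, length(x)))]) > k / 100 * sum(x)
}

contrib <- c(900, 60, 25, 15)            # made-up contributions to one cell
is_sensitive_p(contrib, p = 10)          # compares 40 with 90
is_sensitive_nk(contrib, n = 2, k = 90)  # compares 960 with 900
```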
We now want to mark as sensitive all cells of variable income to which fewer than 15 units contribute.
tab$supp_freq(v = "income", n = 15, weighted = FALSE)
## freq-rule: 3 new sensitive cells (incl. duplicates) found (total: 3)
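The idea behind an unweighted frequency rule can be reproduced on made-up microdata with base R. This sketch counts contributors per innermost cell only (no margins or totals) and uses a hypothetical threshold n_min; it is not how cellKey computes the rule internally.

```r
# made-up microdata: one row per contributing unit
units <- data.frame(
  sex = c("female", "female", "male", "male", "male", "male"),
  age = c("age_group1", "age_group1", "age_group1",
          "age_group3", "age_group3", "age_group3"))

n_min <- 3  # hypothetical minimal number of contributors

# number of contributors per cell, flagged if below the threshold
counts <- as.data.frame(
  table(sex = units$sex, age = units$age),
  responseName = "n_contrib")
counts$sensitive <- counts$n_contrib < n_min
counts
```

Note that in this simplified version empty cells are flagged as well, since their count of zero is below the threshold.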
To mark specific cells as sensitive based on their names rather than their values, one may use the $supp_cells() method. This method requires a data.frame as input that contains a column for each dimensional variable specified. Each row of this input is considered a cell, where NAs are used as placeholders that match any characteristic of the relevant variable. Using the data.frame inp shown below, the program would suppress the following cells:
female x age_group1
male x age_group3
male x any age group available in the data
inp <- data.frame(
  "sex" = c("female", "male", "male"),
  "age" = c("age_group1", "age_group3", NA)
)
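How NA acts as a wildcard can be illustrated with a small base R sketch. The set of cells and the helper matches_row() are made up for illustration and are not part of cellKey; the matching logic simply follows the description above.

```r
# pattern rows, identical to `inp` above
inp <- data.frame(
  sex = c("female", "male", "male"),
  age = c("age_group1", "age_group3", NA),
  stringsAsFactors = FALSE)

# made-up set of table cells to match against
cells <- expand.grid(
  sex = c("female", "male"),
  age = c("age_group1", "age_group2", "age_group3"),
  stringsAsFactors = FALSE)

# a cell matches a pattern row if every non-NA entry of the row agrees
matches_row <- function(cells, pattern) {
  hit <- rep(TRUE, nrow(cells))
  for (col in names(pattern)) {
    if (!is.na(pattern[[col]])) {
      hit <- hit & cells[[col]] == pattern[[col]]
    }
  }
  hit
}

# a cell is selected if it matches at least one pattern row
cells$suppressed <- Reduce(`|`, lapply(seq_len(nrow(inp)), function(i) {
  matches_row(cells, inp[i, , drop = FALSE])
}))
cells[cells$suppressed, c("sex", "age")]
```

In this toy example the third pattern row already covers the second one, since NA matches every age group for male.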