Path to Automation

The initial release of {sdtm.oak} provides a framework for modular programming of SDTM in R and sets the stage for potential automation of SDTM creation following the standardized SDTM specification. In the future, the automation workflow could involve preparing specifications and then making automated function calls to generate SDTM domains.

The future workflow for automation could look like:

  1. Prepare the SDTM specification: Users define the raw data source, target SDTM domain, target SDTM variables, and the algorithms used for automation. A template is still under development; the draft specification is described in this article.
  2. Prepare the SDTM controlled terminology: Users define the SDTM controlled terms applicable to the study. A template is still under development.
  3. Run an automated process that reads the specification and makes {sdtm.oak} function calls, producing either the code required to generate the SDTM datasets or the datasets themselves.
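
To make steps 1 and 3 concrete, here is a minimal, language-neutral sketch in Python of a specification-driven process. It is not the {sdtm.oak} implementation: the handler functions are hypothetical stand-ins named after two of the mapping algorithms, the raw dataset `MH1` and the raw variables `SYS_BP` and `ALZ_YN` are invented, and only a few of the draft specification columns are used. Step 2 (controlled terminology) is left out for brevity.

```python
# Illustrative sketch: these handlers are hypothetical Python stand-ins for
# mapping algorithms; they are not the {sdtm.oak} R functions.

def assign_no_ct(raw_rows, spec_row):
    """Copy the raw variable to the target as collected."""
    return [row.get(spec_row["raw_variable"]) for row in raw_rows]


def hardcode_no_ct(raw_rows, spec_row):
    """Write a fixed value wherever the raw variable is populated."""
    raw_var = spec_row["raw_variable"]
    value = spec_row["target_hardcoded_value"]
    return [value if row.get(raw_var) is not None else None for row in raw_rows]


# Dispatch table: mapping_algorithm value -> handler.
ALGORITHMS = {"assign_no_ct": assign_no_ct, "hardcode_no_ct": hardcode_no_ct}


def run_spec(spec_rows, raw_datasets):
    """Apply each specification row and collect target variables per domain."""
    ordered = sorted(
        spec_rows,
        key=lambda r: (r["target_domain"], r["target_sdtm_variable_ordinal"]),
    )
    domains = {}
    for spec_row in ordered:
        handler = ALGORITHMS[spec_row["mapping_algorithm"]]
        raw_rows = raw_datasets[spec_row["raw_dataset"]]
        target = domains.setdefault(spec_row["target_domain"], {})
        target[spec_row["target_sdtm_variable"]] = handler(raw_rows, spec_row)
    return domains


# Step 1: a two-row specification. Column names follow the draft spec;
# dataset names, raw variable names and values are invented for illustration.
spec = [
    {"raw_dataset": "VTLS1", "raw_variable": "SYS_BP", "target_domain": "VS",
     "target_sdtm_variable": "VSORRES", "target_sdtm_variable_ordinal": 1,
     "mapping_algorithm": "assign_no_ct", "target_hardcoded_value": None},
    {"raw_dataset": "MH1", "raw_variable": "ALZ_YN", "target_domain": "MH",
     "target_sdtm_variable": "MHTERM", "target_sdtm_variable_ordinal": 1,
     "mapping_algorithm": "hardcode_no_ct",
     "target_hardcoded_value": "ALZHEIMER'S DISEASE HISTORY"},
]
raw = {
    "VTLS1": [{"SYS_BP": "120"}, {"SYS_BP": None}],
    "MH1": [{"ALZ_YN": "Y"}],
}

# Step 3: read the specification and make the calls.
sdtm = run_spec(spec, raw)
```

The dispatch table is the key design choice: supporting another mapping algorithm means registering one more handler, while the loop that reads the specification stays unchanged.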

This article provides an overview of metadata and a draft version of the standard SDTM specification. We plan to demonstrate the creation of standard SDTM specs from the CDISC Library in collaboration with CDISC COSA. Sponsors may need to establish the necessary tools to generate this SDTM specification from their MDR to utilize the automation features of {sdtm.oak}. This concept draws inspiration from Roche’s existing implementation of the SDTM automation process using OAK, and it requires further development.

Throughout this article, the term “metadata” is used several times. In this context, “metadata” refers to the specific metadata used by {sdtm.oak}. This article aims to provide users with a more detailed understanding of the {sdtm.oak} metadata.

In general, metadata can be defined as “data about data.” It does not include any patient-level data. Instead, the metadata provides a blueprint of the data that needs to be collected during a study.

Standards Metadata

The standards metadata used in {sdtm.oak} is sourced from the CDISC Library or sponsor MDR or any other form of documentation where standards are maintained. This metadata provides information on the following:

In upcoming releases of {sdtm.oak}, we plan to use the standards metadata and customize it to meet study requirements.

Study Definition Metadata

Study Definition Metadata is also referred to as Study Metadata. Study Definition Metadata provides information about the eCRF and eDT data collected in the study.

eCRF Metadata

The eCRF Design Metadata is fetched from the EDC system. This Metadata includes

eDT Metadata

eDT Metadata is the blueprint metadata that describes the data collected as part of an external data transfer (from clinical sites to the sponsor). This includes

Study SDTM Mappings Metadata (specifications)

Study SDTM mappings metadata is the study SDTM specification. To develop the SDTM domains, {sdtm.oak} requires the user to prepare the Study SDTM mappings metadata. A conventional SDTM specification has one tab per domain and maps each target (SDTM domain, variables) back to its source (raw dataset, raw variables). The SDTM spec for {sdtm.oak} instead defines the relationship from source to target: for each source, the SDTM mapping, algorithms, and associated metadata are defined. The table below presents the columns in the SDTM mapping specification and their explanations.

| Variable name | Description of the variable | Example values | Association with mapping algorithms |
|---|---|---|---|
| study_number | Study Number | test_study | Generic Use |
| raw_source_model | Data Collection model | e-CRF or eDT | Generic Use |
| raw_dataset | Name of the raw or source dataset | VTLS1, DEM | Required for all mapping algorithms |
| raw_dataset_ordinal | Ordinal of the raw dataset as defined in EDC or eDT specification | 1, 2, 3, etc | Generic Use |
| raw_dataset_label | Label of the raw or source dataset | Vital Signs, Demographics | Generic Use |
| raw_variable | Name of the raw variable | SEX_001, BRTHDD | Generic Use |
| raw_variable_label | Label of the raw variable | Systolic Blood Pressure, Birth Day | Generic Use |
| raw_variable_ordinal | Ordinal of the variable as defined in the eCRF or eDT specification | 1, 2, 3, etc | Generic Use |
| raw_variable_type | Type of the Raw Variable | Text Box, Date control | Required for all mapping algorithms |
| raw_data_format | Data format of the raw variable | $200, dd MON YYYY | Required for all mapping algorithms |
| study_specific | TRUE indicates that the source is study specific. FALSE indicates that the raw variable is part of data standards | TRUE, FALSE | Generic Use |
| annotation_ordinal | Ordinal of the SDTM mappings for the particular raw source | 1, 2, 3, etc | Required for all mapping algorithms |
| mapping_is_dataset | Indicates if the SDTM mapping is at the dataset level. TRUE indicates that it is dataset level mapping. | TRUE, FALSE | Required for all mapping algorithms |
| annotation_text | SDTM mapping text or annotation text | VS.VSORRES when VSTESTCD = ‘SYSBP’ | Generic Use |
| target_domain | Name of the target domain. | VS, MH | Required for all mapping algorithms |
| target_sdtm_variable | Name of the target SDTM variable | VSORRES, MHSTDTC | Required for all mapping algorithms |
| target_sdtm_variable_role | CDISC Role for the SDTM target variable defined in the annotation. | Topic Variable, Grouping Qualifier, Identifier Variable | Required for all mapping algorithms |
| target_sdtm_variable_codelist_code | NCI or sponsor code of the codelist assigned to the SDTM target variable defined in the annotation. | C66742, C66790 | Required for all mapping algorithms |
| target_sdtm_variable_controlled_terms_or_format | Controlled terms or format for the target variable defined in the annotation (as defined per CDISC). target_sdtm_variable_controlled_terms_or_format is required for SDTM Define.xml | (AGEU), ISO 8601, (SEX) | Generic Use |
| target_sdtm_variable_ordinal | Ordinal of the target SDTM variable | 1, 2, 3 | Required for all mapping algorithms |
| origin | Origin of metadata source, values are subject to controlled terminology | Derived, Assigned, Collected, Predecessor | Used for define.xml |
| mapping_algorithm | Mapping Algorithm | condition_add, assign_ct, ae_aerel, hardcode_ct | Required for all mapping algorithms |
| sub_algorithm | The sub-algorithm (scenario) of the source-to-target mapping | assign_no_ct, hardcode_ct | Only when Mapping Algorithm is condition_add, dataset_level |
| target_hardcoded_value | Text (Hardcoded value) that applies to the target. | ALZHEIMER’S DISEASE HISTORY | assign_no_ct, hardcode_no_ct |
| target_term_value | CDISC Submission value or sponsor value which represents a hardcoded text | Y, beats/min, INFORMED CONSENT OBTAINED | hardcode_ct |
| condition_add_raw_dat | Condition that has to be applied at a raw dataset before applying a mapping. Can be a valid R filter statement. | Map qualifier CMSTRTPT. Annotation text is: If MDPRIOR == 1 then CM.CMSTRTPT = ‘BEFORE’. raw_dat parameter as condition_add(cm_raw, MDPRIOR == 1) | condition_add |
| condition_add_tgt_dat | Condition that has to be applied at a target dataset before applying a mapping. Can be a valid R filter statement. | Map qualifier CMDOSFRQ. Annotation text is: If CMTRT is not null then map the collected value in raw dataset cm_raw and raw variable MDFRQ to CMDOSFRQ. tgt_dat parameter as condition_add(., !is.na(CMTRT)) | condition_add |
| merge_type | Specifies the type of join | left_join, right_join, full_join, visit_join, subject_join | MERGE |
| merge_left | Specifies the left component of the merge | VTLS1 | MERGE |
| merge_right | Specifies the right component of the merge | VACREC | MERGE |
| merge_condition | Specify the condition of the join (e.g. a specific variable that should match in the components of the merge) | VTLS1.SUBJECT = VACREC.SUBJECT, MD1.MDNUM = VACREC.MDNUM | MERGE |
| unduplicate_keys | Raw variables that should be used to determine whether an observation in the source data is a duplicate record and subject to being removed | VTLS1.SUBJECT, VTLS1.DATAPAGEID | REMOVE_DUP |
| groupby_keys | Raw Variables or aggregation functions (i.e. earliest, latest) to group source data records before mapping to SDTM | TXINF1.DATAPGID, Earliest | GROUP_BY |
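
The two conditional columns, condition_add_raw_dat and condition_add_tgt_dat, differ only in which dataset the filter is evaluated against. The Python sketch below illustrates that difference using the CMSTRTPT and CMDOSFRQ examples from the table. It is an approximation under stated assumptions, not the {sdtm.oak} `condition_add()` function: `map_with_conditions` is a hypothetical helper, the raw variable `MDRAW` and all data values are invented, and raw and target records are assumed to line up by position rather than being joined on identifier variables.

```python
# Hypothetical raw CM records. MDPRIOR and MDFRQ come from the table examples;
# MDRAW and all values are invented for illustration.
cm_raw = [
    {"SUBJECT": "001", "MDRAW": "ASPIRIN", "MDPRIOR": 1, "MDFRQ": "QD"},
    {"SUBJECT": "002", "MDRAW": None, "MDPRIOR": 0, "MDFRQ": "BID"},
    {"SUBJECT": "003", "MDRAW": "IBUPROFEN", "MDPRIOR": 0, "MDFRQ": None},
]


def map_with_conditions(raw_rows, tgt_rows, tgt_var, value_of,
                        raw_cond=None, tgt_cond=None):
    """Set tgt_var only on records that pass the raw-side and target-side conditions.

    Assumes raw_rows and tgt_rows are aligned by position (a simplification).
    """
    result = []
    for raw, tgt in zip(raw_rows, tgt_rows):
        keep = (raw_cond is None or raw_cond(raw)) and \
               (tgt_cond is None or tgt_cond(tgt))
        result.append({**tgt, tgt_var: value_of(raw) if keep else None})
    return result


# Start the target dataset with the topic variable CMTRT.
cm = [{"SUBJECT": r["SUBJECT"], "CMTRT": r["MDRAW"]} for r in cm_raw]

# condition_add_raw_dat: If MDPRIOR == 1 then CM.CMSTRTPT = 'BEFORE'
cm = map_with_conditions(
    cm_raw, cm, "CMSTRTPT",
    value_of=lambda raw: "BEFORE",
    raw_cond=lambda raw: raw["MDPRIOR"] == 1,
)

# condition_add_tgt_dat: If CMTRT is not null then map MDFRQ to CMDOSFRQ
cm = map_with_conditions(
    cm_raw, cm, "CMDOSFRQ",
    value_of=lambda raw: raw["MDFRQ"],
    tgt_cond=lambda tgt: tgt["CMTRT"] is not None,
)
```

In this sketch, subject 002 has a collected frequency but no CMTRT, so CMDOSFRQ stays empty for that record. That is the behaviour the target-side condition is meant to express.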