YAML Configuration

This section is continually being expanded as the configuration feature set is modified. For the time being, see existing dataset configuration files in this directory for examples.

Top-level YAML Sections

| Section | Description |
| --- | --- |
| tag | Dataset tag name (e.g. “HW”); used minimally, but not yet deprecated so that comparison between two datasets remains supported |
| globals | Contains global settings that are applied across all variables |
| min_age_for_inclusion | Required. Under globals; subjects will be excluded from all histograms etc. in the report and from the cleaned output if their age falls below this threshold |
| max_invalid_datatypes_per_subject | Required. Under globals; subjects will be excluded from the cleaned output if they have more than this number of variables that cannot be converted to the expected datatypes |
| consent_inclusion_file | Under globals; name of file containing subject IDs with confirmed consent approval. Format is: plaintext file, no header, one ID per line; can be null (~) |
| consent_exclusion_file | Under globals; name of file containing subject IDs without valid consent. Format is: plaintext file, no header, one ID per line; can be null (~) |
| variables | Contains one block for each variable in the dataset, with a variety of other configuration settings described in the next section |
| derived | Defines variables to be derived from existing variables |
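As a rough illustration of how the top-level sections described above might fit together, here is a minimal sketch. The numeric thresholds, the variable header, and the exact nesting are assumptions made for illustration only; the existing dataset configuration files in this directory remain the authoritative examples.

```yaml
# Hypothetical skeleton of a dataset configuration file.
# All values below are illustrative assumptions, not package defaults.
tag: HW
globals:
  min_age_for_inclusion: 18              # assumed threshold
  max_invalid_datatypes_per_subject: 10  # assumed limit
  consent_inclusion_file: ~              # null: no inclusion list supplied
  consent_exclusion_file: ~              # null: no exclusion list supplied
variables:
  HW00001:                               # normalized encoded variable name
    name: "subject id"                   # hypothetical header in the input dataset
    type: string
    subject_id: true
# Blocks for derived variables would follow under a top-level "derived" key.
```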

Variables YAML Section

Each variable in the dataset is assigned a normalized encoded value (e.g. HW00001, HW00002, etc.). Under each variable block, there are a variety of other possible configuration settings:

| Section | Description |
| --- | --- |
| name | The header of the variable in the input dataset |
| type | Either this or shared_model is required. Expected variable type; one of: string, numeric, ordinal, categorical, blood_pressure |
| shared_model | Either this or type is required. Expected variable type as defined in yaml-configuration/shared-models.yaml |
| canonical_name | If desired, a string with a more descriptive variable name than what’s present in name |
| bounds | Accepts numeric values for tags min, max, and/or sd (standard deviation) to apply bounds for a numeric variable |
| suppress_reporting | A boolean to turn off printing a table of unique values and counts in the html report; useful for variables with PII or with many expected unique values like phone numbers |
| suppress_output | A boolean to override cleaned output for a variable: all values will be set to NA. Should be used to remove the most problematic variables from results |
| linked_date | For age variables, this optionally points to a corresponding date variable for cross-comparison; should also include flags indicating whether the variable is the reported_year (standardized name of variable containing corresponding year variable) or the reference_year (which year the age was collected) |
| subject_age | Required once per dataset. Boolean flag to mark which variable is the accepted age of the subjects |
| subject_id | Required once per dataset. Boolean flag to mark which variable is the accepted unique subject ID |
| na-values | Any non-canonical values to be treated as NA (e.g. nil, not specified, etc.) |
| multimodal | Used to define another variable for plotting overlaid histograms, e.g. overlaid plots of BMI by sex |
| allow_undelimited_bp | Only for variables of type bp (blood pressure): enables recognition of systolic and diastolic blood pressure specified exactly as: ^\d{4}\d?\d?$, where systolic will use three digits preferentially if 5 or 6 digits are specified. This behavior is imperfect given the lack of a delimiter, and is not recommended in most circumstances. |
| dependencies | Test for expected relationships between variables; can also include contingency tables to compare two variables and instructions for setting values to NA if certain dependency tests fail |
| levels | For categorical, ordinal, and binary type variables, levels must be defined under this tag. See the vignettes for more details and examples. |
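To make the settings above more concrete, the following sketch shows what a few variable blocks might look like. The headers, bounds, and NA strings are hypothetical, and the exact shape of each setting (for example, na-values as a list) is an assumption; consult the existing configuration files and the vignettes for the supported forms.

```yaml
# Hypothetical variable blocks; names and values are illustrative assumptions.
variables:
  HW00001:
    name: "participant id"        # hypothetical header in the input dataset
    type: string
    subject_id: true              # marks the accepted unique subject ID
    suppress_reporting: true      # many unique values, so skip the counts table
  HW00002:
    name: "age"                   # hypothetical header
    canonical_name: "age at enrollment (years)"
    type: numeric
    subject_age: true             # marks the accepted subject age
    bounds:
      min: 18                     # assumed lower bound
      max: 120                    # assumed upper bound
    na-values:                    # assumed to be given as a list of strings
      - "nil"
      - "not specified"
```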

Derived YAML Section

Derived variables are calculated from existing data, e.g. calculating BMI from reported weight and height measurements. This section allows the user to define arbitrary new variables to derive.

  • Most settings here are described above; code is where the logic that creates the derived variable is injected, written in R syntax with access to the normalized variable names
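A derived block might look something like the sketch below. The block key, the referenced variables (HW00004 for weight in kg, HW00005 for height in m), and the use of a YAML block scalar for code are all assumptions made for illustration; the existing configuration files show the naming conventions the package actually expects.

```yaml
# Hypothetical derived variable; the key name and source variables are assumptions.
derived:
  HW00004_derived:
    name: "BMI"
    type: numeric
    # R expression evaluated with access to the normalized variable names;
    # assumes HW00004 holds weight in kg and HW00005 holds height in m.
    code: |
      HW00004 / HW00005^2
```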

YAML Validation

Prior to running this tool, you should validate the YAML configurations you’ve set up as follows:

# Locate the validation schemas bundled with the package
dataset.schema <- system.file("validator", 
                              "schema.datasets.yaml", 
                              package = "process.phenotypes")
shared.models.schema <- system.file("validator", 
                                    "schema.shared-models.yaml", 
                                    package = "process.phenotypes")
# Check the dataset and shared-models configurations against their schemas
process.phenotypes::config.validation("/path/to/your.dataset.yaml",
                                      "/path/to/your.shared-models.yaml",
                                      dataset.schema,
                                      shared.models.schema)

This command will compare your configuration files to the set of guidelines and restrictions we’ve specified for the package. If your configuration settings are valid, you’ll get a confirmation message to that effect; otherwise, the function will emit a summary of the restriction that wasn’t met.