YAML Configuration

This section is continually being expanded as the configuration feature set is modified. For the time being, see existing dataset configuration files in this directory for examples.

Top-level YAML Sections

Section	Description
`tag`	Dataset tag name (e.g. “HW”); used minimally but not quite deprecated to continue to support comparison between two datasets
`globals`	Contains global settings that are applied across all variables
`min_age_for_inclusion`	Required Under `global`; subjects will be excluded from all histograms etc. in the report and from the cleaned output if their age falls below this threshold
`max_invalid_datatypes_per_subject`	Required Under `global`; subjects will be excluded from the cleaned output if they have more than this number of variables that can not be converted to the expected datatypes
`consent_inclusion_file`	Under `global`; name of file containing subject IDs with confirmed consent approval. Format is: plaintext file, no header, one ID per line; can be null (`~`)
`consent_exclusion_file`	Under `global`; name of file containing subject IDs without valid consent. Format is: plaintext file, no header, one ID per line; can be null (`~`)
`variables`	This section contains one block for each variable in the dataset, with a variety of other configuration settings described in the next section
`derived`	This section defines variables to be derived from existing variables

Variables YAML Section

Each variable in the dataset is assigned a normalized encoded value (e.g. HW00001, HW00002, etc.). Under each variable block, there are a variety of other possible configuration settings:

Section	Description
`name`	This is the header of the variable in the input dataset
`type`	Either this or shared_model are required Expected variable type; one of: `string`, `numeric`, `ordinal`, `categorical`, `blood_pressure
`shared_model`	Either this or type are required Expected variable type as defined in `yaml-configuration/shared-models.yaml`
`canonical_name`	If desired, a string with a more descriptive variable name than what’s present in `name`
`bounds`	Accepts numeric values for tags `min`, `max`, and/or `sd` (standard deviation) to apply bounds for a numeric variable
`suppress_reporting`	A boolean to turn off printing a table of unique values and counts in the html report; useful for variables with PII or with many expected unique values like phone numbers
`suppress_output`	A boolean to override cleaned output for a variable: all values will be set to `NA`. Should be used to remove the most problematic variables from results
`linked_date`	For `age` variables, this optionally points to a corresponding date variable for cross-comparison; should also include flags indicating whether the variable is the `reported_year` (standardized name of variable containing corresponding year variable) or the `reference_year` (which year the age was collected)
`subject_age`	Required once per dataset Boolean flag to mark which variable is the accepted age of the subjects
`subject_id`	Required once per dataset Boolean flag to mark which variable is the accepted unique subject ID
`na-values`	Any non-canonical values to be treated as NA (e.g. nil, not specified, etc.)
`multimodal`	Used to define another variable for plotting overlayed histograms, e.g. overlayed plots of BMI by sex
`allow_undelimited_bp`	Only for variables of type `bp` (blood pressure): enable recognition of systolic and diastolic blood pressure specified exactly as: `^\d{4}\d?\d?$`, where systolic will use three digits preferentially if 5 or 6 digits are specified. this behavior is imperfect given the lack of delimiter, and is not recommended in most circumstances.
`dependencies`	Test for expected relationships between variables; can also include contingency tables to compare two variables and instructions for setting values to NA if certain dependency tests fail
`levels`	For `categorical`, `ordinal`, and `binary` type variables, you will need to define levels under the `levels` tag. See the vignettes for more details and examples.

Derived YAML Section

Derived variables are calculated from existing data, e.g. calculating BMI from reported waist and height measurements. This section allows the user to define arbitrary new variables to derive.

Most sections here have been previously described, but code is where the logic is injected to create the derived variable, written in R syntax with access to the normalized variable names

YAML Validation

Prior to running this tool, you should validate the YAML configurations you’ve set up as follows:

dataset.schema <- system.file("validator", 
                              "schema.datasets.yaml", 
                              package = "process.phenotypes")
shared.models.schema <- system.file("validator", 
                                    "schema.shared-models.yaml", 
                                    package = "process.phenotypes")
process.phenotypes::config.validation("/path/to/your.dataset.yaml",
                                      "/path/to/your.shared-models.yaml",
                                      dataset.schema,
                                      shared.models.schema)

This command will compare your configuration files to the set of guidelines and restrictions we’ve specified for the package. If your configuration settings are valid, you’ll get a confirmation message to that effect; otherwise, the function will emit a summary of the restriction that wasn’t met.