| Title: | Testing framework for hubverse hub validations |
|---|---|
| Description: | This package aims at providing a simple interface to run validations on data and metadata submitted to a hubverse modeling hub. Validation tests can be run at different levels (single file, single folder, whole repository) and locally as well as part of a continuous integration workflow. |
| Authors: | Anna Krystalli [aut, cre] (ORCID: <https://orcid.org/0000-0002-2378-4915>), Evan Ray [aut], Hugo Gruson [aut] (ORCID: <https://orcid.org/0000-0002-4094-1476>), Zhian N. Kamvar [ctb] (ORCID: <https://orcid.org/0000-0003-1458-7108>), Consortium of Infectious Disease Modeling Hubs [cph] |
| Maintainer: | Anna Krystalli <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 2.1.0 |
| Built: | 2026-05-15 14:15:38 UTC |
| Source: | https://github.com/hubverse-org/hubValidations |
Capture the result of a validation check as a condition object.
capture_check_cnd(check, file_path, msg_subject, msg_attribute, msg_verbs = c("is", "must be"), error = FALSE, details = NULL, ...)
check |
logical, the result of a validation check. If |
file_path |
character string. Path to the file being validated. Must be
the relative path to the hub's |
msg_subject |
character string. The subject of the validation. |
msg_attribute |
character string. The attribute of subject being validated. |
msg_verbs |
character vector of length 2. The verbs describing the state of the attribute in relation to the validation subject. The first element describes the state when validation succeeds, the second element, when validation fails. |
error |
logical. In the case of validation failure, whether the function
should return an object of class |
details |
further details to be appended to the output message. |
... |
<dynamic> Named data fields stored inside the condition object. |
Arguments msg_subject, msg_attribute, msg_verbs and details
accept text that can be interpreted and formatted by cli::format_inline().
Depending on whether validation has succeeded and the value
of the error argument, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
capture_check_cnd(
  check = TRUE,
  file_path = "test/file.csv",
  msg_subject = "{.var round_id}",
  msg_attribute = "valid.",
  error = FALSE
)
capture_check_cnd(
  check = FALSE,
  file_path = "test/file.csv",
  msg_subject = "{.var round_id}",
  msg_attribute = "valid.",
  error = FALSE,
  details = "Must be one of 'A' or 'B', not 'C'"
)
capture_check_cnd(
  check = FALSE,
  file_path = "test/file.csv",
  msg_subject = "{.var round_id}",
  msg_attribute = "valid.",
  error = TRUE,
  details = "Must be one of {.val {c('A', 'B')}}, not {.val C}"
)
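As a rough illustration of how msg_verbs relates to the check outcome, the following base R sketch picks the verb according to the value of check and assembles a plain-text message. compose_check_msg() is a hypothetical helper written for this example and is not part of hubValidations. The sketch does not apply cli::format_inline() formatting, so it uses plain strings rather than cli markup.

```r
# Hypothetical helper (not part of hubValidations): choose the verb from
# msg_verbs according to the check outcome and build a plain-text message.
compose_check_msg <- function(check, msg_subject, msg_attribute,
                              msg_verbs = c("is", "must be")) {
  stopifnot(is.logical(check), length(check) == 1L, length(msg_verbs) == 2L)
  # First verb describes success, second verb describes failure.
  verb <- if (isTRUE(check)) msg_verbs[1] else msg_verbs[2]
  paste(msg_subject, verb, msg_attribute)
}

msg_pass <- compose_check_msg(TRUE, "round_id", "valid.")
msg_fail <- compose_check_msg(FALSE, "round_id", "valid.")
print(msg_pass)
print(msg_fail)
```

The default verbs read naturally in both directions, which is presumably why a length-2 vector is used rather than two separate arguments.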
Capture a simple info message condition. Useful for communicating when a check is ignored or skipped.
capture_check_info(file_path, msg, call = rlang::caller_call())
file_path |
character string. Path to the file being validated. Must be
the relative path to the hub's |
msg |
Character string. Accepts text that can be interpreted and
formatted by |
call |
The defused call of the function that generated the message. Use to override default which uses the caller call. See rlang::stack for more details. |
A <message/check_info> condition class object. Returned object also
inherits from subclass <hub_check>.
Capture an execution error condition. Useful for communicating when a check
execution has failed. Usually used in conjunction with try.
capture_exec_error(file_path, msg, call = NULL)
file_path |
character string. Path to the file being validated. Must be
the relative path to the hub's |
msg |
Character string. |
call |
Character string. Name of the parent call that failed to execute.
If |
A <error/check_exec_error> condition class object. Returned object also
inherits from subclass <hub_check>.
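The pairing with try mentioned above might look roughly like the following base R sketch. read_step() is a hypothetical stand-in for a check that fails while executing, and the capture_exec_error() call is left as a comment so the snippet runs without hubValidations installed. The file path in that comment is an illustrative assumption.

```r
# Hypothetical check step that fails during execution.
read_step <- function() stop("could not read file")

# try() returns an object of class "try-error" instead of halting.
res <- try(read_step(), silent = TRUE)
exec_failed <- inherits(res, "try-error")

if (exec_failed) {
  # The original error message is available from the attached condition.
  err_msg <- conditionMessage(attr(res, "condition"))
  # With hubValidations loaded, this is where one might call, for example:
  # capture_exec_error(file_path = "test/file.csv", msg = err_msg)
}
```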
Capture an execution warning condition. Useful for communicating when a check
execution has failed. Usually used in conjunction with try.
capture_exec_warning(file_path, msg, call = NULL)
file_path |
character string. Path to the file being validated. Must be
the relative path to the hub's |
msg |
Character string. |
call |
Character string. Name of the parent call that failed to execute.
If |
A <warning/check_exec_warn> condition class object. Returned object also
inherits from subclass <hub_check>.
Capture a warning about the validation process. Unlike check results
(success/failure/error), validation warnings are informational messages
about the validation process itself rather than validation outcomes.
They do not cause validation to fail — check_for_errors() only
aborts on check_failure, check_error, check_exec_error or
check_exec_warn objects.
capture_validation_warning(msg, where = NULL, call = rlang::caller_call(), ...)
msg |
Character string. The warning message. Accepts text that can be
interpreted and formatted by |
where |
Optional. Character string indicating the location or context
of the warning (e.g., file path, |
call |
The defused call of the function that generated the warning. Use to override default which uses the caller call. See rlang::stack for more details. |
... |
Additional named fields to include in the warning condition object.
Useful for attaching structured data (e.g., |
Validation warnings can be attached at two levels:
Validation-level: Stored as an attribute on hub_validations objects,
printed prominently by default.
Check-level: Stored in a warnings field on individual check results,
printed only when verbose = TRUE.
A <warning/validation_warning> condition class object.
# Simple warning
capture_validation_warning(
  msg = "Configuration files were modified"
)
# Warning with location and additional structured data
config_files <- c("tasks.json", "admin.json")
capture_validation_warning(
  msg = "Config files modified: {.path {config_files}}",
  where = "hub-config",
  config_files = config_files
)
Checks that admin and tasks configuration files in directory hub-config
are valid.
check_config_hub_valid(hub_path)
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
Check file exists at the file path specified
check_file_exists(file_path, hub_path = ".", subdir = c("model-output", "model-metadata", "hub-config", "target-data"))
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
subdir |
subdirectory within the hub |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
Check file format is accepted by hub.
check_file_format(file_path, hub_path, round_id)
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
round_id |
character string. The round identifier. |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
Checks that the model_id metadata in the file name matches the directory name
the file is being submitted to.
check_file_location(file_path)
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
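The comparison between file name and directory name can be sketched in base R as follows. This assumes, for illustration only, that paths take the form "<model_id>/<round_id>-<model_id>.<ext>". file_in_model_dir() is a hypothetical helper and the example paths are made up. The package's own parsing may differ.

```r
# Hypothetical helper (not part of hubValidations).
# Assumption: file_path has the form "<model_id>/<round_id>-<model_id>.<ext>".
file_in_model_dir <- function(file_path) {
  dir_name <- basename(dirname(file_path))
  file_stem <- tools::file_path_sans_ext(basename(file_path))
  # The file name should end with the model_id given by the directory.
  endsWith(file_stem, paste0("-", dir_name))
}

ok <- file_in_model_dir("team1-goodmodel/2022-10-08-team1-goodmodel.csv")
bad <- file_in_model_dir("team1-goodmodel/2022-10-08-team2-othermodel.csv")
print(c(ok = ok, bad = bad))
```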
Check number of files submitted per round does not exceed the allowed number of submissions per team.
check_file_n(file_path, hub_path, allowed_n = 1L)
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
allowed_n |
integer(1). The maximum number of files allowed per round. |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
Check a model output file name can be correctly parsed.
check_file_name(file_path)
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
Check file can be read successfully
check_file_read(file_path, hub_path = ".")
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
Checks validation objects for errors and raises conditions if any are found.
Works with hub_validations and hub_validations_collection objects, as
well as their subclasses (target_validations and
target_validations_collection). Can be used in CI workflows to signal
validation failures, or locally to summarise validation results.
check_for_errors(x, verbose = FALSE, show_warnings = FALSE)
x |
A |
verbose |
Logical. If |
show_warnings |
Logical. If |
For more details on these classes, see the
article on <hub_validations> S3 class objects.
An error if one of the elements of x is of class check_failure,
check_error, check_exec_error or check_exec_warn.
TRUE invisibly otherwise.
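To make the class-based rule concrete, the base R sketch below flags list elements that inherit from any of the four classes named above. mock_check() and the element names are hypothetical. The mock objects only mimic the class structure and carry fewer classes than real hubValidations output.

```r
# Hypothetical constructor for mock condition objects (illustration only).
mock_check <- function(subclass, type) {
  structure(
    list(message = "mock", call = NULL),
    class = c(subclass, "hub_check", type, "condition")
  )
}

checks <- list(
  file_exists = mock_check("check_success", "message"),
  round_id_valid = mock_check("check_failure", "error"),
  skipped_note = mock_check("check_info", "message")
)

# Classes that cause check_for_errors() to raise an error, per the text above.
aborting_classes <- c(
  "check_failure", "check_error", "check_exec_error", "check_exec_warn"
)

# inherits() returns TRUE if an object has any of the classes in `what`.
flagged <- vapply(checks, inherits, logical(1), what = aborting_classes)
print(names(flagged)[flagged])
```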
validate_submission(), validate_pr(),
validate_target_submission(), validate_target_pr()
Check whether a metadata file exists
check_metadata_file_exists(hub_path = ".", file_path)
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
file_path |
character string. Path to the file being validated relative to the hub's model-metadata directory. |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
Checks that the metadata file has a valid file extension.
check_metadata_file_ext(file_path)
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
Check that the metadata file is being submitted to the correct folder
check_metadata_file_location(file_path)
file_path |
character string. Path to the file being validated relative to the hub's model-metadata directory. |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
Check whether the file name of a metadata file matches the model_id or combination of team_abbr and model_abbr specified within the metadata file
check_metadata_file_name(file_path, hub_path = ".")
file_path |
character string. Path to the file being validated relative to the hub's model-metadata directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
Check whether a metadata file matches the schema provided by the hub
check_metadata_matches_schema(file_path, hub_path = ".")
file_path |
character string. Path to the file being validated relative to the hub's model-metadata directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
Check whether a metadata schema file exists
check_metadata_schema_exists(hub_path = ".", file_path)
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
file_path |
Path to the model metadata file being validated. Used as
the |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
Check whether a metadata file for the given model exists
check_submission_metadata_file_exists(file_path, hub_path = ".")
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
Checks submission is within the valid submission window for a given round.
check_submission_time(hub_path, file_path, ref_date_from = c("file", "file_path"))
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
ref_date_from |
whether to get the reference date around
which relative submission windows will be determined from the file's
|
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
Check target dataset can be detected for a given target type
check_target_dataset_exists(hub_path, target_type = c("time-series", "oracle-output"))
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
target_type |
Type of target data to retrieve matching files. One of "time-series" or "oracle-output". Defaults to "time-series". |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
Check that all files of a given target type share a single unique file format
check_target_dataset_file_ext_unique(hub_path, target_type = c("time-series", "oracle-output"))
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
target_type |
Type of target data to retrieve matching files. One of "time-series" or "oracle-output". Defaults to "time-series". |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
Check that there are no duplicate rows in a target dataset. Function designed to be used as part of overall target data integrity check.
check_target_dataset_rows_unique(target_type = c("time-series", "oracle-output"), na = c("NA", ""), date_col = NULL, output_type_id_datatype = c("from_config", "auto", "character", "double", "integer", "logical", "Date"), hub_path)
target_type |
Type of target data to retrieve matching files. One of "time-series" or "oracle-output". Defaults to "time-series". |
na |
A character vector of strings to interpret as missing values. Only
applies to CSV files. The default is |
date_col |
Optional column name to be interpreted as date for dataset
connection. Useful when the date column does not correspond to a valid task ID
(e.g., calculated from other task IDs like |
output_type_id_datatype |
character string. One of |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
If datasets are versioned, multiple observations are allowed in time-series
target data, so long as they have different as_of values. The as_of column
is therefore included when determining duplicates.
In oracle-output data, there should be only a single observation,
regardless of the as_of value, so the column is not included when
determining duplicates.
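The effect of including or excluding as_of can be seen in a small base R sketch. The table, its column names other than as_of, and the has_duplicates() helper are hypothetical. The same toy table is checked under both rules purely to contrast them.

```r
# Toy table: the first two rows differ only in as_of (and the value).
target <- data.frame(
  location = c("US", "US", "US"),
  target_end_date = c("2024-01-06", "2024-01-06", "2024-01-13"),
  as_of = c("2024-01-08", "2024-01-15", "2024-01-15"),
  observation = c(10, 12, 20),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if any combination of key columns repeats.
has_duplicates <- function(tbl, key_cols) {
  anyDuplicated(tbl[key_cols]) > 0
}

# Time-series rule: as_of is part of the key, so versions are not duplicates.
ts_dup <- has_duplicates(target, c("location", "target_end_date", "as_of"))

# Oracle-output rule: as_of is ignored, so the first two rows collide.
oracle_dup <- has_duplicates(target, c("location", "target_end_date"))

print(c(time_series = ts_dup, oracle_output = oracle_dup))
```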
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
Check that a single unique target dataset exists for a given target type.
check_target_dataset_unique(hub_path, target_type = c("time-series", "oracle-output"))
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
target_type |
Type of target data to retrieve matching files. One of "time-series" or "oracle-output". Defaults to "time-series". |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
Note that files that are part of a hive-partitioned dataset must use the parquet file extension.
check_target_file_ext_valid(file_path)
file_path |
A character string representing the path to the target data file
relative to the |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
Check that a hive-partitioned target data file name can be correctly parsed.
check_target_file_name(file_path)
file_path |
A character string representing the path to the target data file
relative to the |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
Check target file can be read successfully
check_target_file_read(file_path, hub_path = ".")
file_path |
A character string representing the path to the target data file
relative to the |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
Check that a target data file has the correct column names according to target type
check_target_tbl_colnames(target_tbl, target_type = c("time-series", "oracle-output"), file_path, hub_path, config_target_data = NULL, date_col = NULL)
target_tbl |
A tibble/data.frame of the contents of the target data file being validated. |
target_type |
Type of target data to retrieve matching files. One of "time-series" or "oracle-output". Defaults to "time-series". |
file_path |
A character string representing the path to the target data file
relative to the |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
config_target_data |
Optional. A |
date_col |
Optional. Name of the date column in target data (e.g.,
|
Column name validation depends on whether a target-data.json configuration
file is provided:
With target-data.json config:
Expected columns are determined directly from the configuration. The target
table must contain exactly the columns defined in the config.
Without target-data.json config (inference mode):
Expected columns are inferred from the task ID configuration in tasks.json,
allowed columns according to the target type, and expectations based on the
detected output types in the target data. Additional optional columns
(e.g., as_of) are allowed for time-series data.
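For the config-driven mode, the "exactly the columns defined in the config" rule amounts to a two-way set comparison. The base R sketch below shows the idea. The expected column names are hypothetical stand-ins for what a target-data.json config might define, and the sketch does not read any config file.

```r
# Hypothetical expected columns, standing in for a target-data.json config.
expected_cols <- c("target_end_date", "location", "observation")

# Column names of the table being validated (made up for this example).
actual_cols <- c("target_end_date", "location", "observation", "as_of")

# Columns required by the config but absent from the table.
missing_cols <- setdiff(expected_cols, actual_cols)
# Columns present in the table but not defined in the config.
extra_cols <- setdiff(actual_cols, expected_cols)

# An exact match requires both differences to be empty.
colnames_ok <- length(missing_cols) == 0 && length(extra_cols) == 0
print(list(missing = missing_cols, extra = extra_cols, ok = colnames_ok))
```

Reporting the two differences separately, rather than a single pass/fail, makes it easier to tell a missing column from an unexpected one.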
Note on date columns: Target data always contains a date column (e.g.,
target_end_date) representing when observations occurred. However, in
horizon-based forecast hubs, task IDs may only define origin_date
and horizon (with target dates calculated from these). In such cases,
provide date_col to enable deterministic validation of the date column
when it is not a valid task ID. Validation of date column existence and
type is performed by check_target_tbl_coltypes().
Inference mode validation for time-series data is limited. For robust
validation, create a target-data.json config file. See the
target-data.json schema
for more information on the JSON schema specifics.
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
Check that a target data file has the correct column types according to target type
check_target_tbl_coltypes(target_tbl, target_type = c("time-series", "oracle-output"), date_col = NULL, na = c("NA", ""), output_type_id_datatype = c("from_config", "auto", "character", "double", "integer", "logical", "Date"), file_path, hub_path)
target_tbl |
A tibble/data.frame of the contents of the target data file being validated. |
target_type |
Type of target data to retrieve matching files. One of "time-series" or "oracle-output". Defaults to "time-series". |
date_col |
Optional column name to be interpreted as date for schema
creation. Useful when the date column does not correspond to a valid task ID
(e.g., calculated from other task IDs like |
na |
A character vector of strings to interpret as missing values. Only
applies to CSV files. The default is |
output_type_id_datatype |
character string. One of |
file_path |
A character string representing the path to the target data file
relative to the |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
Column type validation depends on whether a target-data.json configuration
file is provided:
With target-data.json config:
Expected column types are determined directly from the schema defined in
the configuration. Validation is performed against the schema specifications
in target-data.json.
Without target-data.json config:
Expected column types are determined from the dataset itself and validated
for internal consistency across files, which mainly applies to partitioned
datasets.
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
This check is only performed when the target data file contains an
output_type_id column and cdf or pmf output types.
It verifies that distributional output type (cdf and pmf) oracle
values meet the following criteria:
oracle_value values are either 0 or 1.
pmf oracle values sum to 1 for each observation unit.
cdf oracle values are non-decreasing for each observation unit when
sorted by the output_type_id set defined in the hub config.
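The three criteria above can be sketched in base R on a toy table. The data, the use of location as the sole observation unit, and the cdf_order vector (standing in for the output_type_id set from the hub config) are all assumptions made for this example. This is not the package's implementation.

```r
# Toy oracle data with a single observation unit (location "US").
oracle <- data.frame(
  location = rep("US", 6),
  output_type = c("pmf", "pmf", "pmf", "cdf", "cdf", "cdf"),
  output_type_id = c("low", "mid", "high", "10", "20", "30"),
  oracle_value = c(0, 1, 0, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Criterion 1: every oracle value is 0 or 1.
values_binary <- all(oracle$oracle_value %in% c(0, 1))

# Criterion 2: pmf values sum to 1 within each observation unit.
pmf <- oracle[oracle$output_type == "pmf", ]
pmf_ok <- all(tapply(pmf$oracle_value, pmf$location, sum) == 1)

# Criterion 3: cdf values are non-decreasing when sorted by the configured
# output_type_id order (hypothetical order, assumed to come from the config).
cdf_order <- c("10", "20", "30")
cdf <- oracle[oracle$output_type == "cdf", ]
cdf <- cdf[order(cdf$location, match(cdf$output_type_id, cdf_order)), ]
cdf_ok <- all(tapply(
  cdf$oracle_value, cdf$location,
  function(v) all(diff(v) >= 0)
))

print(c(binary = values_binary, pmf = pmf_ok, cdf = cdf_ok))
```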
check_target_tbl_oracle_value(target_tbl, target_type = c("oracle-output", "time-series"), file_path, hub_path, config_target_data = NULL)
target_tbl |
A tibble/data.frame of the contents of the target data file being validated. |
target_type |
Type of target data to validate. One of |
file_path |
A character string representing the path to the target data file
relative to the |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
config_target_data |
Target data configuration object from
|
When validating oracle values, data is grouped by observation unit to check PMF sums and CDF monotonicity within each unit.
With target-data.json config:
Observable unit is determined from the config's observable_unit specification.
Without target-data.json config:
Observable unit is inferred from task ID columns present in the data.
The as_of column is NOT included in the grouping. Oracle data is designed to
contain a single version per observable unit with a one-to-one mapping to model
output data.
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
This check is only performed when the target data file contains an
output_type_id column. It verifies that non-distributional
output types have all NA output type IDs, and that distributional output types
(cdf, pmf) include the complete output_type_id set defined in the hub config.
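A base R sketch of the two rules described above follows. The toy table, the choice of location as the observation unit, and expected_ids (standing in for the output_type_id set from the hub config) are assumptions for illustration and do not reflect how the package reads the config.

```r
# Toy table with all columns as character, mirroring target_tbl_chr.
target_tbl_chr <- data.frame(
  location = rep("US", 4),
  output_type = c("mean", "pmf", "pmf", "pmf"),
  output_type_id = c(NA, "low", "mid", "high"),
  oracle_value = c("3.2", "0", "1", "0"),
  stringsAsFactors = FALSE
)

# Rule 1: non-distributional output types must have NA output_type_id.
is_dist <- target_tbl_chr$output_type %in% c("cdf", "pmf")
non_dist_ok <- all(is.na(target_tbl_chr$output_type_id[!is_dist]))

# Rule 2: each observation unit has the complete set of pmf output_type_ids.
expected_ids <- c("low", "mid", "high") # hypothetical config values
pmf_rows <- target_tbl_chr[target_tbl_chr$output_type == "pmf", ]
dist_ok <- all(tapply(
  pmf_rows$output_type_id, pmf_rows$location,
  function(ids) setequal(ids, expected_ids)
))

print(c(non_distributional = non_dist_ok, distributional = dist_ok))
```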
check_target_tbl_output_type_ids(target_tbl_chr, target_type = c("oracle-output", "time-series"), file_path, hub_path, config_target_data = NULL)
target_tbl_chr |
A tibble/data.frame of the contents of the target data file being validated. All columns should be coerced to character. |
target_type |
Type of target data to validate. One of |
file_path |
A character string representing the path to the target data file
relative to the |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
config_target_data |
Target data configuration object from
|
When checking for completeness of distributional output types, data is grouped by observation unit to verify each unit has the complete set of output_type_id values.
With target-data.json config:
Observable unit is determined from the config's observable_unit specification.
Without target-data.json config:
Observable unit is inferred from task ID columns present in the data.
The as_of column is NOT included in the grouping. Oracle data is designed to
contain a single version per observable unit with a one-to-one mapping to model
output data.
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
Check that there are no duplicate rows in target data files being validated.
check_target_tbl_rows_unique(target_tbl, target_type = c("time-series", "oracle-output"), file_path, hub_path, config_target_data = NULL)
target_tbl |
A tibble/data.frame of the contents of the target data file being validated. |
target_type |
Type of target data to retrieve matching files. One of "time-series" or "oracle-output". Defaults to "time-series". |
file_path |
A character string representing the path to the target data file
relative to the |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
config_target_data |
Optional. A |
Row uniqueness is determined by checking for duplicate combinations of key columns (excluding value columns).
With target-data.json config:
Columns to check are determined from the config's observable_unit
specification. For oracle-output data with output type IDs, the
output_type and output_type_id columns are also included in the
uniqueness check.
Without target-data.json config:
For time-series data, if versioned, multiple observations are allowed
so long as they have different as_of values. The as_of column is
therefore included when determining duplicates.
For oracle-output data, there should be only a single observation,
regardless of the as_of value, so the column is not included when
determining duplicates.
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
Check is only performed when the target data type is time-series.
When the target task ID is not specified in the config (i.e. hub has single
target and target_keys = NULL), the validity of the target is only checked
through the config file. Otherwise, the values in the target task ID column
of target_tbl are checked. Note that valid time-series targets must be
step ahead and their target type must be one of "continuous", "discrete",
"binary" or "compositional". If the hub contains no valid time-series
targets, no time-series target data should be present and validation of
such data will be skipped.
check_target_tbl_ts_targets( target_tbl, target_type = c("time-series", "oracle-output"), file_path, hub_path )
target_tbl |
A tibble/data.frame of the contents of the target data file being validated. |
target_type |
Type of target data to retrieve matching files. One of "time-series" or "oracle-output". Defaults to "time-series". |
file_path |
A character string representing the path to the target data file
relative to the |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
Check is only performed when the target data file contains columns that map onto task IDs or output types defined in the hub configuration.
check_target_tbl_values( target_tbl_chr, target_type = c("time-series", "oracle-output"), file_path, hub_path, date_col = NULL, allow_extra_dates = FALSE, config_target_data = NULL )
target_tbl_chr |
A tibble/data.frame of the contents of the target data file being validated. All columns should be coerced to character. |
target_type |
Type of target data to retrieve matching files. One of "time-series" or "oracle-output". Defaults to "time-series". |
file_path |
A character string representing the path to the target data file
relative to the |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
date_col |
Optional. Name of the date column (e.g., "target_end_date"). Only used when target-data.json config does not exist. When target-data.json exists, the date column is extracted from the config (this parameter is ignored). If the date column cannot be determined, date relaxation is skipped. |
allow_extra_dates |
Logical. If TRUE and target_type is "time-series", allows date values not in tasks.json. Other task ID columns are still strictly validated. Ignored for oracle-output (always strict). |
config_target_data |
Target data configuration object from
|
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
Check that model output data column datatypes conform to those defined in the hub config.
check_tbl_col_types( tbl, file_path, hub_path, output_type_id_datatype = c("from_config", "auto", "character", "double", "integer", "logical", "Date") )
tbl |
a tibble/data.frame of the contents of the file being validated. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
output_type_id_datatype |
character string. One of |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
Checks that a tibble/data.frame of data read in from the file being validated contains the expected task ID and standard column names according to the round configuration being validated against.
check_tbl_colnames(tbl, round_id, file_path, hub_path = ".")
tbl |
a tibble/data.frame of the contents of the file being validated. |
round_id |
character string. The round identifier. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
This check is used to validate that values in any derived task ID columns match the accepted values for each derived task ID in the config. Given the dependence of derived task IDs on the values of other task IDs, it ignores the combinations of derived task ID values with those of other task IDs and focuses only on identifying values that do not match the accepted values.
check_tbl_derived_task_id_vals( tbl, round_id, file_path, hub_path, derived_task_ids = get_hub_derived_task_ids(hub_path, round_id) )
tbl |
a tibble/data.frame of the contents of the file being validated. Column types must all be character. |
round_id |
character string. The round identifier. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
derived_task_ids |
Character vector of derived task ID names (task IDs whose
values depend on other task IDs) to ignore. Columns for such task ids will
contain |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
If no derived_task_ids are specified, the check is skipped and a
<message/check_info> condition class object is returned.
Returned object also inherits from subclass <hub_check>.
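The idea behind check_tbl_derived_task_id_vals(), comparing each derived task ID column against its accepted values in isolation, can be illustrated with a short base R sketch. The helper invalid_derived_vals() and the accepted list are assumptions made for this example and are not part of the package.

```r
# Illustrative sketch only; not the hubValidations implementation.
# invalid_derived_vals() is a hypothetical helper: for each derived task ID
# column it returns the values that are not in the accepted set.
invalid_derived_vals <- function(tbl, accepted) {
  out <- lapply(names(accepted), function(col) {
    setdiff(unique(tbl[[col]]), accepted[[col]])
  })
  names(out) <- names(accepted)
  # Keep only columns that actually contain invalid values.
  Filter(length, out)
}

tbl <- data.frame(
  origin_date = c("2022-10-01", "2022-10-01"),
  horizon = c("1", "2"),
  target_end_date = c("2022-10-08", "2022-10-16"),
  stringsAsFactors = FALSE
)

# Assumed accepted values for the derived task ID target_end_date.
accepted <- list(target_end_date = c("2022-10-08", "2022-10-15"))

invalid <- invalid_derived_vals(tbl, accepted)
```

Only membership in the accepted set is tested here; whether "2022-10-08" is the right date for horizon "1" is deliberately not examined, mirroring the description above.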
Check model output data tbl round ID matches submission round ID.
check_tbl_match_round_id(tbl, file_path, hub_path, round_id_col = NULL)
tbl |
a tibble/data.frame of the contents of the file being validated. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
round_id_col |
Character string. The name of the column containing
|
This check only applies to files being submitted to rounds where
round_id_from_variable: true or where a round_id_col name is explicitly
provided. Skipped otherwise.
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
If round_id_from_variable: false and no round_id_col name is provided,
check is skipped and a <message/check_info> condition class object is
returned. If no valid round_id_col name is provided or can be extracted from
config (check through check_valid_round_id_col), a <message/check_error>
condition class object is returned and the rest of the check skipped.
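As a rough illustration of what matching the round ID column against the submission round ID involves, the comparison can be written in a few lines of base R. The column name origin_date and the submission round ID below are assumptions for the example; the package derives both from the hub config and the file name.

```r
# Illustrative sketch only; not the hubValidations implementation.
tbl <- data.frame(
  origin_date = c("2022-10-08", "2022-10-08", "2022-10-15"),
  value = c("1", "2", "3"),
  stringsAsFactors = FALSE
)
round_id_col <- "origin_date"        # assumed round ID column
submission_round_id <- "2022-10-08"  # assumed round ID of the submission

# Values in the round ID column that differ from the submission round ID.
mismatched <- setdiff(unique(tbl[[round_id_col]]), submission_round_id)
matches_round_id <- length(mismatched) == 0L
```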
Checks that combinations of task ID, output type and output type ID values
are unique, by checking that there are no duplicate rows across
all tbl columns excluding the value column.
check_tbl_rows_unique(tbl, file_path, hub_path)
tbl |
a tibble/data.frame of the contents of the file being validated. Column types must all be character. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
This check detects the compound task ID sets of samples, implied by the output_type_id
and task ID values, and checks them for internal consistency and compliance with
the compound_taskid_set defined for each round modeling task in the tasks.json config.
check_tbl_spl_compound_taskid_set( tbl, round_id, file_path, hub_path, derived_task_ids = get_hub_derived_task_ids(hub_path) )
tbl |
a tibble/data.frame of the contents of the file being validated. Column types must all be character. |
round_id |
character string. The round identifier. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
derived_task_ids |
Character vector of derived task ID names (task IDs whose
values depend on other task IDs) to ignore. Columns for such task ids will
contain |
If the check fails, the output of the check includes an errors element,
a list of items, one for each modeling task failing validation.
The structure depends on the reason the check failed.
If the check failed because more than a single unique compound_taskid_set was found
for a given model task, the errors object will be a list with one element for each
compound_taskid_set detected and will have the following structure:
tbl_comp_tids: a compound task id set detected in the tbl.
output_type_ids: The output type ID of the sample that does not contain a
single, unique value for each compound task ID.
If the check failed because task IDs that are not allowed in the config were identified
as compound task IDs (i.e. samples describe "finer" compound modeling tasks)
for a given model task, the errors object will be a list with the structure
described above as well as the following additional elements:
config_comp_tids: the allowed compound_taskid_set defined in the modeling
task config.
invalid_tbl_comp_tids: the names of invalid compound task IDs.
The name of each element is the index (mt_id) identifying the config modeling task the sample is associated with.
See hubverse documentation on samples
for more details.
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
Check model output data tbl samples contain single unique values for each compound task ID within individual samples
check_tbl_spl_compound_tid( tbl, round_id, file_path, hub_path, compound_taskid_set = NULL, derived_task_ids = get_hub_derived_task_ids(hub_path, round_id) )
tbl |
a tibble/data.frame of the contents of the file being validated. Column types must all be character. |
round_id |
character string. The round identifier. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
compound_taskid_set |
a list of |
derived_task_ids |
Character vector of derived task ID names (task IDs whose
values depend on other task IDs) to ignore. Columns for such task ids will
contain |
Output of the check includes an errors element, a list of items,
one for each sample failing validation, with the following structure:
mt_id: Index identifying the config modeling task the sample is associated with.
output_type_id: The output type ID of the sample that does not contain a
single, unique value for each compound task ID.
values: The unique values of each compound task ID.
See hubverse documentation on samples
for more details.
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
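The condition tested by check_tbl_spl_compound_tid(), that every sample holds a single value for each compound task ID, can be sketched in base R as follows. The helper samples_with_mixed_values() and the choice of location as the compound task ID are assumptions for illustration only.

```r
# Illustrative sketch only; not the hubValidations implementation.
# samples_with_mixed_values() is a hypothetical helper returning the
# output_type_ids of samples with more than one value in any compound task ID.
samples_with_mixed_values <- function(tbl, compound_taskids) {
  by_sample <- split(tbl[, compound_taskids, drop = FALSE], tbl$output_type_id)
  mixed <- vapply(by_sample, function(x) nrow(unique(x)) > 1L, logical(1))
  names(mixed)[mixed]
}

spl <- data.frame(
  location = c("US", "US", "US", "CA"),
  horizon = c("1", "2", "1", "2"),
  output_type_id = c("s1", "s1", "s2", "s2"),
  stringsAsFactors = FALSE
)

# Assumed compound task ID set: each sample should cover a single location.
failing <- samples_with_mixed_values(spl, compound_taskids = "location")
```

Sample s1 spans two horizons within one location and passes; sample s2 mixes two locations and is reported.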
Check that individual sample output_type_ids do not span multiple model tasks
check_tbl_spl_mt_unique( tbl, round_id, file_path, hub_path, derived_task_ids = get_hub_derived_task_ids(hub_path, round_id) )
tbl |
a tibble/data.frame of the contents of the file being validated. Column types must all be character. |
round_id |
character string. The round identifier. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
derived_task_ids |
Character vector of derived task ID names (task IDs whose
values depend on other task IDs) to ignore during validation. Defaults to
extracting derived task IDs from hub |
Different model tasks can have different sample configurations
(compound_taskid_set, min/max_samples_per_task, etc.), so samples should
be entirely independent across model tasks. This check verifies that no
sample output_type_id appears in more than one model task.
Output of the check includes an errors element, a list with the following
structure:
mt_ids: Integer vector of model task indices the overlapping samples span.
output_type_ids: Character vector of sample output_type_ids that appear
in multiple model tasks.
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
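To make the errors structure of check_tbl_spl_mt_unique() more concrete, the sketch below builds a comparable list in base R. It assumes each row has already been matched to a model task through a hypothetical mt_id column; how the package performs that matching is not shown here.

```r
# Illustrative sketch only; not the hubValidations implementation.
# The mt_id column is an assumption: it labels the model task of each row.
spl <- data.frame(
  mt_id = c(1L, 1L, 2L, 2L),
  output_type_id = c("s1", "s2", "s2", "s3"),
  stringsAsFactors = FALSE
)

# Number of distinct model tasks each sample output_type_id appears in.
tasks_per_sample <- tapply(
  spl$mt_id, spl$output_type_id,
  function(x) length(unique(x))
)
overlapping <- names(tasks_per_sample)[tasks_per_sample > 1]

errors <- list(
  mt_ids = sort(unique(spl$mt_id[spl$output_type_id %in% overlapping])),
  output_type_ids = overlapping
)
```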
Check model output data tbl samples contain the appropriate number of samples for a given compound idx.
check_tbl_spl_n( tbl, round_id, file_path, hub_path, compound_taskid_set = NULL, derived_task_ids = get_hub_derived_task_ids(hub_path, round_id) )
tbl |
a tibble/data.frame of the contents of the file being validated. Column types must all be character. |
round_id |
character string. The round identifier. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
compound_taskid_set |
a list of |
derived_task_ids |
Character vector of derived task ID names (task IDs whose
values depend on other task IDs) to ignore. Columns for such task ids will
contain |
Output of the check includes an errors element, a list of items,
one for each compound_idx failing validation, with the following structure:
compound_idx: the compound idx that failed validation of number of samples.
n: the number of samples counted for the compound idx.
min_samples_per_task: the minimum number of samples required for the compound idx.
max_samples_per_task: the maximum number of samples allowed for the compound idx.
compound_idx_tbl: a tibble of the expected structure for samples belonging to
the compound idx.
See hubverse documentation on samples
for more details.
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
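The sample count comparison behind check_tbl_spl_n() can be illustrated with base R. The sketch assumes a precomputed, hypothetical compound_idx column and fixed min/max limits; in the package these come from the tasks.json config.

```r
# Illustrative sketch only; not the hubValidations implementation.
# compound_idx is assumed to have been assigned to each row beforehand.
spl <- data.frame(
  compound_idx = c("1", "1", "1", "2"),
  output_type_id = c("s1", "s2", "s3", "s4"),
  stringsAsFactors = FALSE
)
min_samples_per_task <- 2L  # assumed config value
max_samples_per_task <- 3L  # assumed config value

# Count distinct samples for each compound idx.
n <- tapply(
  spl$output_type_id, spl$compound_idx,
  function(x) length(unique(x))
)
out_of_range <- n < min_samples_per_task | n > max_samples_per_task
failing_idx <- names(n)[out_of_range]
```

Compound idx "1" has three samples and is within range; compound idx "2" has one sample, below the assumed minimum.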
Check model output data tbl samples contain a single unique combination of non-compound task ID values across all samples
check_tbl_spl_non_compound_tid( tbl, round_id, file_path, hub_path, compound_taskid_set = NULL, derived_task_ids = get_hub_derived_task_ids(hub_path, round_id) )
tbl |
a tibble/data.frame of the contents of the file being validated. Column types must all be character. |
round_id |
character string. The round identifier. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
compound_taskid_set |
a list of |
derived_task_ids |
Character vector of derived task ID names (task IDs whose
values depend on other task IDs) to ignore. Columns for such task ids will
contain |
Output of the check includes an errors element, a list of items,
one for each modeling task containing samples failing validation,
with the following structure:
mt_id: Index identifying the config modeling task the samples are associated with.
output_type_ids: The output type IDs of samples that do not match the most frequent
non-compound task ID value combination across all
samples in the modeling task.
frequent: The most frequent non-compound task ID value combination
across all samples in the modeling task to which all samples were compared.
See hubverse documentation on samples
for more details.
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
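The comparison against the most frequent non-compound task ID value combination can be sketched as follows. This is a simplified illustration with a single assumed non-compound task ID (horizon) and a single model task; it is not the package's algorithm.

```r
# Illustrative sketch only; not the hubValidations implementation.
spl <- data.frame(
  output_type_id = c("s1", "s1", "s2", "s2", "s3"),
  horizon = c("1", "2", "1", "2", "1"),
  stringsAsFactors = FALSE
)

# Summarise the non-compound task ID values covered by each sample.
signature <- vapply(
  split(spl$horizon, spl$output_type_id),
  function(x) paste(sort(unique(x)), collapse = ","),
  character(1)
)

# The most frequent combination is the reference all samples are compared to.
frequent <- names(which.max(table(signature)))
mismatched <- names(signature)[signature != frequent]
```

Samples s1 and s2 both cover horizons 1 and 2, so that combination is the reference; s3 covers only horizon 1 and is flagged.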
Check model output data tbl contains a single unique round ID.
check_tbl_unique_round_id(tbl, file_path, hub_path, round_id_col = NULL)
tbl |
a tibble/data.frame of the contents of the file being validated. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
round_id_col |
Character string. The name of the column containing
|
This check only applies to files being submitted to rounds where
round_id_from_variable: true or where a round_id_col name is explicitly
provided. Skipped otherwise.
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
If round_id_from_variable: false and no round_id_col name is provided,
check is skipped and a <message/check_info> condition class object is
returned. If no valid round_id_col name is provided or can be extracted from
config (check through check_valid_round_id_col), a <message/check_error>
condition class object is returned and the rest of the check skipped.
Checks that values in the value column of a tibble/data.frame of data read
in from the file being validated conform to the configuration for each output
type of the appropriate model task.
check_tbl_value_col( tbl, round_id, file_path, hub_path, derived_task_ids = get_hub_derived_task_ids(hub_path, round_id) )
tbl |
a tibble/data.frame of the contents of the file being validated. |
round_id |
character string. The round identifier. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
derived_task_ids |
Character vector of derived task ID names (task IDs whose
values depend on other task IDs) to ignore. Columns for such task ids will
contain |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
Check that quantile and cdf output type values of model output data are non-descending.
Checks that values in the value column for quantile and cdf output type
data for each unique task ID/output type combination
are non-descending when arranged by increasing output_type_id order.
Check only performed if tbl contains quantile or cdf output type data.
If not, the check is skipped and a <message/check_info> condition class
object is returned.
check_tbl_value_col_ascending( tbl, file_path, hub_path, round_id, derived_task_ids = get_hub_derived_task_ids(hub_path) )
tbl |
a tibble/data.frame of the contents of the file being validated. Column types must all be character. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
round_id |
character string. The round identifier. |
derived_task_ids |
Character vector of derived task ID names (task IDs whose
values depend on other task IDs) to ignore. Columns for such task ids will
contain |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
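The non-descending condition tested by check_tbl_value_col_ascending() can be illustrated in base R. The sketch assumes numeric quantile levels as output type IDs and a single task ID (location); is_non_descending() is a hypothetical helper, and the package's handling of non-numeric cdf output type IDs is not covered.

```r
# Illustrative sketch only; not the hubValidations implementation.
# is_non_descending() is a hypothetical helper: it orders values by
# increasing output_type_id and checks that no value decreases.
is_non_descending <- function(output_type_id, value) {
  ord <- order(as.numeric(output_type_id))
  all(diff(as.numeric(value)[ord]) >= 0)
}

q <- data.frame(
  location = c("US", "US", "US", "CA", "CA", "CA"),
  output_type_id = c("0.25", "0.5", "0.75", "0.25", "0.5", "0.75"),
  value = c("10", "20", "30", "10", "30", "20"),
  stringsAsFactors = FALSE
)

# Apply the check to each unique task ID combination (here: each location).
ok <- vapply(
  split(q, q$location),
  function(g) is_non_descending(g$output_type_id, g$value),
  logical(1)
)
```

Equal neighbouring values pass because the condition is non-descending rather than strictly increasing.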
Check that pmf output type values of model output data sum to 1.
Checks that values in the value column of pmf output type
data for each unique task ID combination sum to 1.
Check only performed if tbl contains pmf output type data.
If not, the check is skipped and a <message/check_info> condition class
object is returned.
check_tbl_value_col_sum1(tbl, file_path)
tbl |
a tibble/data.frame of the contents of the file being validated. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
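A minimal base R sketch of the sum-to-1 condition is shown below. The numeric tolerance and the single task ID column (location) are assumptions made for the example; the tolerance used by check_tbl_value_col_sum1() itself is not stated in this section.

```r
# Illustrative sketch only; not the hubValidations implementation.
pmf <- data.frame(
  location = c("US", "US", "US", "CA", "CA", "CA"),
  output_type_id = rep(c("low", "moderate", "high"), 2),
  value = c(0.2, 0.3, 0.5, 0.3, 0.3, 0.3),
  stringsAsFactors = FALSE
)

# Sum pmf values within each unique task ID combination (here: each location).
sums <- tapply(pmf$value, pmf$location, sum)

# Compare against 1 with an assumed floating point tolerance.
tol <- sqrt(.Machine$double.eps)
failing <- names(sums)[abs(sums - 1) > tol]
```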
Check model output data tbl contains valid value combinations
check_tbl_values( tbl, round_id, file_path, hub_path, derived_task_ids = get_hub_derived_task_ids(hub_path, round_id) )
tbl |
a tibble/data.frame of the contents of the file being validated. Column types must all be character. |
round_id |
character string. The round identifier. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
derived_task_ids |
Character vector of derived task ID names (task IDs whose
values depend on other task IDs) to ignore. Columns for such task ids will
contain |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
Check that all required task ID/output type/output type ID value combinations are present in model data.
check_tbl_values_required( tbl, round_id, file_path, hub_path, derived_task_ids = get_hub_derived_task_ids(hub_path) )
tbl |
a tibble/data.frame of the contents of the file being validated. Column types must all be character. |
round_id |
character string. The round identifier. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
derived_task_ids |
Character vector of derived task ID names (task IDs whose
values depend on other task IDs) to ignore. Columns for such task ids will
contain |
Note that it is necessary for derived_task_ids to be specified if any of
the task IDs with required values have dependent derived task IDs. If this is the
case and derived task IDs are not specified, the dependent nature of derived
task ID values will result in false validation errors when validating
required values.
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
Check whether the round_id determined for the submission is valid.
check_valid_round_id(round_id, file_path, hub_path = ".")
round_id |
character string. The round identifier. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_error> condition class object.
Returned object also inherits from subclass <hub_check>.
Check that any round_id_col name provided or extracted from the hub config is valid.
check_valid_round_id_col(tbl, file_path, hub_path, round_id_col = NULL)
tbl |
a tibble/data.frame of the contents of the file being validated. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
round_id_col |
Character string. The name of the column containing
|
This check only applies to files being submitted to rounds where
round_id_from_variable: true or where a round_id_col name is explicitly
provided. Skipped otherwise.
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
If round_id_from_variable: false and no round_id_col name is provided,
check is skipped and a <message/check_info> condition class object is
returned.
Returned object also inherits from subclass <hub_check>.
Combines multiple validation objects of the same class into one. Works with
both single-file validation objects (hub_validations, target_validations)
and multi-file collection objects (hub_validations_collection,
target_validations_collection). For more details on these classes,
see article on <hub_validations> S3 class objects.
combine(...)
## S3 method for class 'hub_validations_collection'
combine(...)
... |
Validation objects to be concatenated. All objects must be of the same class. NULL values are ignored. Empty objects are filtered out when combining multiple inputs, but a single empty input is returned as-is. |
For hub_validations objects, all inputs must share the same where
attribute (i.e., be validations for the same subject).
For hub_validations_collection objects, the individual hub_validations
objects from all collections are extracted and grouped by their where
attribute, combining validation results for the same subject.
Subclasses (e.g., target_validations, target_validations_collection) are
preserved.
An object of the same class as the inputs, or NULL if no valid inputs are provided.
new_hub_validations(), new_hub_validations_collection(),
new_target_validations(), new_target_validations_collection()
Create a custom validation check function template file.
create_custom_check( name, hub_path = ".", r_dir = "src/validations/R", error = FALSE, conditional = FALSE, error_object = FALSE, config = FALSE, extra_args = FALSE, overwrite = FALSE )
name |
Character string. Name of the custom check function. We recommend following the hubValidations package naming convention. For more details, consult the article on writing custom check functions. |
hub_path |
Character string. Path to the hub directory. Default is the current working directory. |
r_dir |
Character string. Path (relative to |
error |
Logical. Defaults to |
conditional |
Logical. If |
error_object |
Logical. If |
config |
Logical. If |
extra_args |
Logical. If |
overwrite |
Logical. If |
See the article on writing custom check functions for more.
Invisible TRUE if the custom check function file is created successfully.
withr::with_tempdir({
  # Create the custom check file with default settings.
  create_custom_check("check_default")
  cat(readLines("src/validations/R/check_default.R"), sep = "\n")

  # Create fully featured custom check file.
  create_custom_check("check_full",
    error = TRUE, conditional = TRUE,
    error_object = TRUE, config = TRUE,
    extra_args = TRUE
  )
  cat(readLines("src/validations/R/check_full.R"), sep = "\n")
})
Create expanded grid of valid task ID and output type value combinations
expand_model_out_grid( config_tasks, round_id, required_vals_only = FALSE, force_output_types = FALSE, all_character = FALSE, output_type_id_datatype = c("from_config", "auto", "character", "double", "integer", "logical", "Date"), as_arrow_table = FALSE, bind_model_tasks = TRUE, include_sample_ids = FALSE, compound_taskid_set = NULL, output_types = NULL, derived_task_ids = get_config_derived_task_ids(config_tasks, round_id) )
config_tasks |
a list version of the contents of a hub's |
round_id |
Character string. Round identifier. If the round is set to
|
required_vals_only |
Logical. Whether to return only combinations of Task ID and related output type ID required values. |
force_output_types |
Logical. Whether to force all output types to be required.
If |
all_character |
Logical. Whether to return all columns as character. |
output_type_id_datatype |
character string. One of |
as_arrow_table |
Logical. Whether to return an arrow table. Defaults to |
bind_model_tasks |
Logical. Whether to bind expanded grids of values from multiple modeling tasks into a single tibble/arrow table or return a list. |
include_sample_ids |
Logical. Whether to include sample identifiers in
the |
compound_taskid_set |
List of character vectors, one for each modeling task
in the round. Can be used to override the compound task ID set defined in the
config. If |
output_types |
Character vector of output type names to include. Use to subset grids for specific output types. |
derived_task_ids |
Character vector of derived task ID names (task IDs whose
values depend on other task IDs) to ignore. Columns for such task ids will
contain |
When a round is set to round_id_from_variable: true,
the value of the task ID from which round IDs are derived (i.e. the task ID
specified in round_id property of config_tasks) is set to the value of the
round_id argument in the returned output.
When sample output types are included in the output and include_sample_ids = TRUE,
the output_type_id column contains example sample indexes which are useful
for identifying the compound task ID structure of multivariate sampling
distributions in particular, i.e. which combinations of task ID values
represent individual samples.
If bind_model_tasks = TRUE (default) a tibble or arrow table
containing all possible task ID and related output type ID
value combinations. If bind_model_tasks = FALSE, a list containing a
tibble or arrow table for each round modeling task.
Columns are coerced to data types according to the hub schema,
unless all_character = TRUE. If all_character = TRUE, all columns are returned as
character which can be faster when large expanded grids are expected.
If required_vals_only = TRUE, values are limited to the combinations of required
values only.
Note that if required_vals_only = TRUE and an optional output type is
requested through output_types, a zero row grid will be returned.
If all output types are requested however (i.e. when output_types = NULL) and
they are all optional, a grid of required task ID values only will be returned.
However, whenever force_output_types = TRUE, all output types are treated as
required.
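The effect of required_vals_only on the size of the grid can be illustrated with a base R sketch. The `task_ids` list below is a hypothetical, heavily simplified stand-in for one modeling task's task ID values, and `expand_task_grid()` is an illustrative helper, not part of hubValidations. It only shows the idea of expanding value combinations, not the package's implementation.

```r
# Hypothetical, simplified stand-in for one modeling task's task ID values.
# Real hub configs are nested tasks.json structures; this is only a sketch.
task_ids <- list(
  location = list(required = "US", optional = c("01", "02")),
  horizon  = list(required = 1L, optional = 2L)
)

# Expand every combination of task ID values, optionally keeping only the
# required values (the idea behind `required_vals_only`).
expand_task_grid <- function(task_ids, required_vals_only = FALSE) {
  vals <- lapply(task_ids, function(x) {
    if (required_vals_only) x$required else c(x$required, x$optional)
  })
  expand.grid(vals, stringsAsFactors = FALSE, KEEP.OUT.ATTRS = FALSE)
}

full_grid <- expand_task_grid(task_ids)
required_grid <- expand_task_grid(task_ids, required_vals_only = TRUE)
```

Here `full_grid` holds all 3 x 2 combinations while `required_grid` is limited to the single combination of required values.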
hub_con <- hubData::connect_hub(
  system.file("testhubs/flusight", package = "hubUtils")
)
config_tasks <- attr(hub_con, "config_tasks")
expand_model_out_grid(config_tasks, round_id = "2023-01-02")
expand_model_out_grid(
  config_tasks,
  round_id = "2023-01-02",
  required_vals_only = TRUE
)
# Specifying a round in a hub with multiple round configurations.
hub_con <- hubData::connect_hub(
  system.file("testhubs/simple", package = "hubUtils")
)
config_tasks <- attr(hub_con, "config_tasks")
expand_model_out_grid(config_tasks, round_id = "2022-10-01")
# Later round_id maps to round config that includes additional task ID 'age_group'.
expand_model_out_grid(config_tasks, round_id = "2022-10-29")
# Coerce all columns to character
expand_model_out_grid(config_tasks,
  round_id = "2022-10-29",
  all_character = TRUE
)
# Return arrow table
expand_model_out_grid(config_tasks,
  round_id = "2022-10-29",
  all_character = TRUE,
  as_arrow_table = TRUE
)
# Hub with sample output type
config_tasks <- read_config_file(system.file("config", "tasks.json",
  package = "hubValidations"
))
expand_model_out_grid(config_tasks, round_id = "2022-12-26")
# Include sample IDs
expand_model_out_grid(config_tasks,
  round_id = "2022-12-26",
  include_sample_ids = TRUE
)
# Hub with sample output type and compound task ID structure
config_tasks <- read_config_file(
  system.file("config", "tasks-comp-tid.json", package = "hubValidations")
)
expand_model_out_grid(config_tasks,
  round_id = "2022-12-26",
  include_sample_ids = TRUE
)
# Override config compound task ID set
# Create coarser compound task ID set for the first modeling task which contains
# samples
expand_model_out_grid(config_tasks,
  round_id = "2022-12-26",
  include_sample_ids = TRUE,
  compound_taskid_set = list(c("forecast_date", "target"), NULL)
)
expand_model_out_grid(config_tasks,
  round_id = "2022-12-26",
  include_sample_ids = TRUE,
  compound_taskid_set = list(NULL, NULL)
)
# Subset output types
config_tasks <- read_config(
  system.file("testhubs", "samples", package = "hubValidations")
)
expand_model_out_grid(config_tasks,
  round_id = "2022-10-29",
  include_sample_ids = TRUE,
  bind_model_tasks = FALSE,
  output_types = c("sample", "pmf")
)
expand_model_out_grid(config_tasks,
  round_id = "2022-10-29",
  include_sample_ids = TRUE,
  bind_model_tasks = TRUE,
  output_types = "sample"
)
# Ignore derived task IDs
expand_model_out_grid(config_tasks,
  round_id = "2022-10-29",
  include_sample_ids = TRUE,
  bind_model_tasks = FALSE,
  output_types = "sample",
  derived_task_ids = "target_end_date"
)
# Return only required values
hub_path <- system.file("testhubs", "v4", "simple", package = "hubUtils")
config_tasks <- read_config(hub_path)
# Return required output types and output_types_ids only
expand_model_out_grid(
  config_tasks = config_tasks,
  round_id = "2022-10-22",
  required_vals_only = TRUE
)
# Force all output types to be required
expand_model_out_grid(
  config_tasks = config_tasks,
  round_id = "2022-10-22",
  required_vals_only = TRUE,
  force_output_types = TRUE
)
# Sub-setting for an optional output type returns an empty data frame
expand_model_out_grid(
  config_tasks = config_tasks,
  round_id = "2022-10-22",
  output_types = "mean",
  required_vals_only = TRUE
)
# force_output_types on an optional output type forces all output_type_id values
# to be required
expand_model_out_grid(
  config_tasks = config_tasks,
  round_id = "2022-10-22",
  output_types = "mean",
  required_vals_only = TRUE,
  force_output_types = TRUE
)
# Ignore derived task IDs
hub_path <- system.file("testhubs", "v4", "flusight", package = "hubUtils")
config_tasks <- read_config(hub_path)
# Defaults to using derived_task_ids from config
expand_model_out_grid(config_tasks, round_id = "2023-05-08")
# Can be overridden by argument derived_task_ids
expand_model_out_grid(config_tasks,
  round_id = "2023-05-08",
  derived_task_ids = NULL
)
Get hub configuration fields from a <config> class object
get_config_derived_task_ids(config_tasks, round_id = NULL)
config_tasks |
a list version of the contents of a hub's |
round_id |
Character string. Round identifier. If the round is set to
|
get_config_derived_task_ids: character vector of hub or round level derived
task ID names. If round_id is NULL or the round does not have a round level
derived_task_ids setting, returns the hub level derived_task_ids setting.
get_config_derived_task_ids(): Get the hub or round level derived_task_ids
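The fallback from round level to hub level described above can be sketched in base R. The `config` list is a hypothetical, pared-down stand-in for a parsed tasks config, and `derived_ids_sketch()` is an illustrative helper rather than the package function. The field names used here are assumptions made for the illustration.

```r
# Hypothetical, pared-down config list; real tasks.json configs hold more fields.
config <- list(
  derived_task_ids = "target_end_date",
  rounds = list(
    list(round_id = "2023-05-08",
         derived_task_ids = c("target_end_date", "horizon")),
    list(round_id = "2023-05-15")
  )
)

# Return the round level setting when it exists, otherwise the hub level one.
derived_ids_sketch <- function(config, round_id = NULL) {
  hub_level <- config$derived_task_ids
  if (is.null(round_id)) {
    return(hub_level)
  }
  for (rnd in config$rounds) {
    if (identical(rnd$round_id, round_id) && !is.null(rnd$derived_task_ids)) {
      return(rnd$derived_task_ids)
    }
  }
  hub_level
}

hub_ids <- derived_ids_sketch(config)
round_ids <- derived_ids_sketch(config, round_id = "2023-05-08")
fallback_ids <- derived_ids_sketch(config, round_id = "2023-05-15")
```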
hub_path <- system.file("testhubs/v4/flusight", package = "hubUtils")
config_tasks <- read_config(hub_path)
get_config_derived_task_ids(config_tasks)
get_config_derived_task_ids(config_tasks, round_id = "2023-05-08")
Retrieves the unique target task ID by extracting all target metadata
and extracting the names of their target_keys. For valid config files
these should be the same across all rounds and model tasks.
get_target_task_id(config_tasks)
config_tasks |
a list representation of the |
A character vector of unique target task ID names. Post v5.0.0 this should be a single task ID name.
Detect the compound_taskid_set for a tbl for each modeling task in a given round.
get_tbl_compound_taskid_set(
  tbl,
  config_tasks,
  round_id,
  compact = TRUE,
  error = TRUE,
  derived_task_ids = get_config_derived_task_ids(config_tasks, round_id)
)
tbl |
a tibble/data.frame of the contents of the file being validated. Column types must all be character. |
config_tasks |
a list representation of the |
round_id |
Character string. The round ID. |
compact |
Logical. If TRUE, the output will be compacted to remove NULL elements. |
error |
Logical. If TRUE, an error will be thrown if the compound task ID set is not valid. If FALSE and an error is detected, the detected compound task ID set will be returned with error attributes attached. |
derived_task_ids |
Character vector of derived task ID names (task IDs whose
values depend on other task IDs) to ignore. Columns for such task ids will
contain |
A list of vectors of compound task IDs detected in the tbl, one for each
modeling task in the round. If compact is TRUE, modeling tasks returning NULL
elements will be removed.
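One way to picture compound task ID detection is to look for task ID columns that take a single value within each sample. The base R sketch below does this on a toy table. It assumes that a sample is identified by its output_type_id value, and `detect_compound_taskids()` is a hypothetical helper, not the package's algorithm.

```r
# Toy sample rows: each sample index (output_type_id) spans both horizons
# for a single location. Column names are illustrative only.
tbl <- data.frame(
  location = c("US", "US", "01", "01"),
  horizon = c("1", "2", "1", "2"),
  output_type_id = c("s1", "s1", "s2", "s2"),
  stringsAsFactors = FALSE
)

# A task ID is treated as part of the compound set when it is constant
# within every sample.
detect_compound_taskids <- function(tbl, task_id_cols,
                                    sample_col = "output_type_id") {
  constant_within_sample <- vapply(task_id_cols, function(col) {
    n_unique <- tapply(tbl[[col]], tbl[[sample_col]],
                       function(v) length(unique(v)))
    all(n_unique == 1L)
  }, logical(1))
  task_id_cols[constant_within_sample]
}

compound_set <- detect_compound_taskids(tbl, c("location", "horizon"))
```

In this toy table only `location` is constant within each sample, so it is the only task ID placed in the detected set.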
hub_path <- system.file("testhubs/samples", package = "hubValidations")
file_path <- "flu-base/2022-10-22-flu-base.csv"
round_id <- "2022-10-22"
tbl <- read_model_out_file(
  file_path = file_path,
  hub_path = hub_path,
  coerce_types = "chr"
)
config_tasks <- read_config(hub_path, "tasks")
get_tbl_compound_taskid_set(tbl, config_tasks, round_id)
get_tbl_compound_taskid_set(tbl, config_tasks, round_id,
  compact = FALSE
)
Get status of a hub check
is_success(x)
is_failure(x)
is_error(x)
is_info(x)
not_pass(x)
is_exec_error(x)
is_exec_warn(x)
is_any_error(x)
x |
an object that inherits from class |
Logical. Whether the check has the given status.
is_success(): Is check success?
is_failure(): Is check failure?
is_error(): Is check error?
is_info(): Is check info?
not_pass(): Did check not pass?
is_exec_error(): Is exec error?
is_exec_warn(): Is exec warning?
is_any_error(): Is error or exec error?
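Status predicates of this kind can be thought of as class tests on condition objects. The sketch below builds stand-in conditions in base R using the check_success, check_failure and hub_check class names mentioned elsewhere in this manual. `make_check()` and the `*_sketch()` helpers are hypothetical, and the real functions may be implemented differently.

```r
# Build a stand-in condition object carrying a status class.
make_check <- function(status, base = c("message", "error")) {
  base <- match.arg(base)
  structure(
    list(message = "example", call = NULL),
    class = c(paste0("check_", status), "hub_check", base, "condition")
  )
}

ok <- make_check("success", "message")
bad <- make_check("failure", "error")

# Class-based predicates, analogous in spirit to is_success() and is_failure().
is_success_sketch <- function(x) inherits(x, "check_success")
is_failure_sketch <- function(x) inherits(x, "check_failure")

statuses <- c(
  ok_success = is_success_sketch(ok),
  bad_success = is_success_sketch(bad),
  bad_failure = is_failure_sketch(bad)
)
```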
tbl data to their model tasks in config_tasks.
Split and match model output tbl data to their corresponding model tasks in
config_tasks. Useful for performing model task specific checks on model output.
For v3 samples, the output_type_id column is set to NA for sample outputs.
match_tbl_to_model_task(
  tbl,
  config_tasks,
  round_id,
  output_types = NULL,
  derived_task_ids = get_config_derived_task_ids(config_tasks, round_id),
  all_character = TRUE
)
tbl |
a tibble/data.frame of the contents of the file being validated. |
config_tasks |
a list version of the contents of a hub's |
round_id |
Character string. Round identifier. If the round is set to
|
output_types |
Character vector of output type names to include. Use to subset for grids for specific output types. |
derived_task_ids |
Character vector of derived task ID names (task IDs whose
values depend on other task IDs) to ignore. Columns for such task ids will
contain |
all_character |
Logical. Whether to return all columns as character. |
A list containing a tbl_df of model output data matched to a model
task with one element per round model task.
hub_path <- system.file("testhubs/samples", package = "hubValidations")
tbl <- read_model_out_file(
  file_path = "flu-base/2022-10-22-flu-base.csv",
  hub_path,
  coerce_types = "chr"
)
config_tasks <- read_config(hub_path, "tasks")
match_tbl_to_model_task(tbl, config_tasks, round_id = "2022-10-22")
match_tbl_to_model_task(tbl, config_tasks,
  round_id = "2022-10-22",
  output_types = "sample"
)
hub_validations S3 class object
A hub_validations object contains validation results for a single
validation subject. Depending on context, this could be a file, a
configuration directory (hub-config), or a target dataset type
(time-series, oracle-output).
All checks must have the same $where value, which is extracted and stored
as the where attribute.
new_hub_validations(...)
as_hub_validations(x)
... |
named elements to be included. Each element must be an object which
inherits from class |
x |
a list of named elements. Each element must be an object which
inherits from class |
an S3 object of class <hub_validations> with a where attribute.
new_hub_validations(): Create new <hub_validations> S3 class object
as_hub_validations(): Convert list to <hub_validations> S3 class object
new_hub_validations()
hub_path <- system.file("testhubs/simple", package = "hubValidations")
file_path <- "team1-goodmodel/2022-10-08-team1-goodmodel.csv"
new_hub_validations(
  file_exists = check_file_exists(file_path, hub_path),
  file_name = check_file_name(file_path)
)
x <- list(
  file_exists = check_file_exists(file_path, hub_path),
  file_name = check_file_name(file_path)
)
as_hub_validations(x)
hub_validations_collection S3 class object
A hub_validations_collection is a container for validation results from
multiple validation subjects. It is a named list where each element is a
hub_validations object. Names are automatically extracted from the
where attribute of each hub_validations object. If multiple
hub_validations objects have the same where value, they are merged using
combine(). Empty hub_validations objects are ignored.
new_hub_validations_collection(...)
as_hub_validations_collection(x)
... |
|
x |
a list where each element is a |
an S3 object of class <hub_validations_collection>. Elements are
named by their where attribute
(e.g., collection[["path/to/file.csv"]]).
new_hub_validations_collection(): Create new
<hub_validations_collection> S3 class object
as_hub_validations_collection(): Convert list to
<hub_validations_collection> S3 class object
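The naming and merging behaviour described for collections can be approximated in base R. In this sketch, plain named lists with a `where` attribute stand in for hub_validations objects, and `collect_by_where()` is a hypothetical helper that groups them by that attribute. The package itself merges objects with combine(), which this sketch does not reproduce.

```r
# Stand-ins for hub_validations objects: named lists with a `where` attribute.
make_validations <- function(where, ...) structure(list(...), where = where)

v1 <- make_validations("a.csv", file_exists = "ok")
v2 <- make_validations("b.csv", file_exists = "ok")
v3 <- make_validations("a.csv", file_name = "ok")
empty <- make_validations(NULL)

# Group objects by `where`, merging those that share a value and
# dropping empty objects.
collect_by_where <- function(...) {
  items <- Filter(function(v) length(v) > 0, list(...))
  out <- list()
  for (v in items) {
    key <- attr(v, "where")
    merged <- c(unclass(out[[key]]), unclass(v))
    out[[key]] <- structure(merged, where = key)
  }
  out
}

collection <- collect_by_where(v1, v2, v3, empty)
names(collection)
names(collection[["a.csv"]])
```

The two objects that share the `where` value "a.csv" end up as one element holding both checks, and the empty object leaves no trace in the result.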
new_hub_validations_collection()
hub_path <- system.file("testhubs/simple", package = "hubValidations")
# Create validations for two different files
file_path_1 <- "team1-goodmodel/2022-10-08-team1-goodmodel.csv"
validations_1 <- new_hub_validations(
  file_exists = check_file_exists(file_path_1, hub_path),
  file_name = check_file_name(file_path_1)
)
file_path_2 <- "team1-goodmodel/2022-10-15-team1-goodmodel.csv"
validations_2 <- new_hub_validations(
  file_exists = check_file_exists(file_path_2, hub_path),
  file_name = check_file_name(file_path_2)
)
# Combine into a collection
collection <- new_hub_validations_collection(validations_1, validations_2)
# Print the collection
collection
# Get file paths (element names)
names(collection)
# Access validations for a specific file
collection[[file_path_1]]
# Access validations for a specific file and check
collection$`team1-goodmodel/2022-10-08-team1-goodmodel.csv`$file_exists
target_validations S3 class object
A target_validations object contains validation results for a single
validation subject. Depending on context, this could be a target data file,
the hub configuration directory (hub-config), or a target dataset type
(time-series, oracle-output).
All checks must have the same $where value, which is extracted and stored
as the where attribute.
new_target_validations(...)
as_target_validations(x)
... |
named elements to be included. Each element must be an object which
inherits from class |
x |
a list of named elements. Each element must be an object which
inherits from class |
an S3 object of class <target_validations> with a where attribute.
new_target_validations(): Create new <target_validations> S3 class object
as_target_validations(): Convert list to <target_validations> S3 class object
new_target_validations()
hub_path <- system.file("testhubs/v5/target_file", package = "hubUtils")
file_path <- "time-series.csv"
new_target_validations(
  target_file_name = check_target_file_name(file_path),
  target_file_ext_valid = check_target_file_ext_valid(file_path)
)
x <- list(
  target_file_name = check_target_file_name(file_path),
  target_file_ext_valid = check_target_file_ext_valid(file_path)
)
as_target_validations(x)
file_path <- "time-series/target=flu_hosp_rate/part-0.parquet"
new_target_validations(
  target_file_name = check_target_file_name(file_path),
  target_file_ext_valid = check_target_file_ext_valid(file_path)
)
target_validations_collection S3 class object
A target_validations_collection is a container for target validation results
from multiple validation subjects. It is a named list where each element is a
target_validations object. Names are automatically extracted from the
where attribute of each target_validations object. If multiple
target_validations objects have the same where value, they are merged using
combine(). Empty target_validations objects are ignored.
new_target_validations_collection(...)
as_target_validations_collection(x)
... |
|
x |
a list where each element is a |
an S3 object of class <target_validations_collection>. Elements are
named by their where attribute
(e.g., collection[["path/to/file.csv"]]).
new_target_validations_collection(): Create new
<target_validations_collection> S3 class object
as_target_validations_collection(): Convert list to
<target_validations_collection> S3 class object
new_target_validations_collection()
# Create validations for two different files
file_path_1 <- "time-series.csv"
validations_1 <- new_target_validations(
  target_file_name = check_target_file_name(file_path_1),
  target_file_ext_valid = check_target_file_ext_valid(file_path_1)
)
file_path_2 <- "other-data.csv"
validations_2 <- new_target_validations(
  target_file_name = check_target_file_name(file_path_2),
  target_file_ext_valid = check_target_file_ext_valid(file_path_2)
)
# Combine into a collection
collection <- new_target_validations_collection(validations_1, validations_2)
# Print the collection
collection
# Get file paths (element names)
names(collection)
# Access validations for a specific file
collection[[file_path_1]]
# Access a specific check within a file's validations
collection[["time-series.csv"]]$target_file_name
Check that submitting team does not exceed maximum number of allowed models per team
opt_check_metadata_team_max_model_n(file_path, hub_path, n_max = 2L)
file_path |
character string. Path to the file being validated relative to the hub's model-metadata directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
n_max |
Integer. Maximum number of models allowed per team. |
Should be deployed as part of validate_model_metadata optional checks.
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
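The idea behind a per-team model limit can be sketched in base R by counting metadata files per team. The file names below are hypothetical and are assumed to follow a <team_abbr>-<model_abbr> naming pattern. This is an illustration of the counting step only, not the check's implementation.

```r
# Hypothetical model metadata file names, assumed to be named
# <team_abbr>-<model_abbr>.<ext>.
metadata_files <- c(
  "teamA-model1.yml", "teamA-model2.yml", "teamA-model3.yml",
  "teamB-base.yaml"
)
n_max <- 2L

# The team abbreviation is everything before the first hyphen.
team_abbr <- sub("-.*$", "", metadata_files)
counts <- table(team_abbr)

# Teams with more models than allowed.
over_limit <- names(counts)[counts > n_max]
```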
Check that the time difference between values in two date columns equals a defined period.
opt_check_tbl_col_timediff(
  tbl,
  file_path,
  hub_path,
  t0_colname,
  t1_colname,
  timediff = lubridate::weeks(2),
  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer", "logical", "Date")
)
tbl |
a tibble/data.frame of the contents of the file being validated. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
t0_colname |
Character string. The name of the time zero date column. |
t1_colname |
Character string. The name of the time zero + 1 time step date column. |
timediff |
an object of class |
output_type_id_datatype |
character string. One of |
Should be deployed as part of validate_model_data optional checks.
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
Check that predicted values per location are less than total location population.
opt_check_tbl_counts_lt_popn(
  tbl,
  file_path,
  hub_path,
  targets = NULL,
  popn_file_path = "auxiliary-data/locations.csv",
  popn_col = "population",
  location_col = "location"
)
tbl |
a tibble/data.frame of the contents of the file being validated. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
targets |
Either a single target key list or a list of multiple target key lists. |
popn_file_path |
Character string.
Path to population data relative to the hub root.
Defaults to |
popn_col |
Character string. The name of the population size column in the population data set. |
location_col |
Character string. The name of the location column. Used to join population data to submission file data. Must be shared by both files. |
Should only be applied to rows containing count predictions. Use argument
targets to filter tbl data to appropriate count target rows.
Should be deployed as part of validate_model_data optional checks.
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
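The comparison behind this check can be sketched in base R as a join on the location column followed by a row-wise test. The two data frames are made-up stand-ins for submission data and the population file. The sketch assumes that a value equal to the population also counts as a violation, since the check requires values to be less than the population.

```r
# Made-up submission rows and population data sharing a `location` column.
tbl <- data.frame(
  location = c("US", "01", "02"),
  value = c(500, 120, 90),
  stringsAsFactors = FALSE
)
popn <- data.frame(
  location = c("US", "01", "02"),
  population = c(1000, 100, 300),
  stringsAsFactors = FALSE
)

# Join population onto the submission data and flag offending locations.
joined <- merge(tbl, popn, by = "location")
exceeds <- joined[joined$value >= joined$population, "location"]
```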
hub_path <- system.file("testhubs/flusight", package = "hubValidations")
file_path <- "hub-ensemble/2023-05-08-hub-ensemble.parquet"
tbl <- hubValidations::read_model_out_file(file_path, hub_path)
# Single target key list
targets <- list("target" = "wk ahead inc flu hosp")
opt_check_tbl_counts_lt_popn(tbl, file_path, hub_path, targets = targets)
Check that the time difference between values in two date columns equals a time period determined by the values in a horizon column
opt_check_tbl_horizon_timediff(
  tbl,
  file_path,
  hub_path,
  t0_colname,
  t1_colname,
  horizon_colname = "horizon",
  timediff = lubridate::weeks(),
  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer", "logical", "Date")
)
tbl |
a tibble/data.frame of the contents of the file being validated. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
t0_colname |
Character string. The name of the time zero date column. |
t1_colname |
Character string. The name of the time zero + 1 time step date column. |
horizon_colname |
Character string. The name of the horizon column.
Defaults to |
timediff |
an object of class |
output_type_id_datatype |
character string. One of |
Should be deployed as part of validate_model_data optional checks.
Depending on whether validation has succeeded, one of:
<message/check_success> condition class object.
<error/check_failure> condition class object.
Returned object also inherits from subclass <hub_check>.
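The arithmetic behind a horizon-based time difference check can be sketched in base R. The column names and dates below are made up, and a fixed 7 day step stands in for the default lubridate::weeks() period. The sketch only shows the comparison of t1 against t0 plus horizon steps.

```r
# Made-up rows: origin date (t0), target end date (t1) and horizon in weeks.
tbl <- data.frame(
  origin_date = as.Date(rep("2023-01-02", 3)),
  target_end_date = as.Date(c("2023-01-09", "2023-01-16", "2023-01-30")),
  horizon = c(1L, 2L, 3L)
)

# Expected t1 is t0 plus `horizon` steps of 7 days each.
expected <- tbl$origin_date + 7L * tbl$horizon
mismatch <- tbl$target_end_date != expected

# Row indices that would cause such a check to fail.
invalid_rows <- which(mismatch)
```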
Parse model output file metadata from file name
parse_file_name(file_path, file_type = c("model_output", "model_metadata"))
file_path |
Character string. A model output file name. Can include parent directories which are ignored. |
file_type |
Character string. Type of file name being parsed. One of |
File names are allowed to contain the following compression extension prefixes:
.snappy, .gzip, .gz, .brotli, .zstd, .lz4, .lzo, .bz2.
These extension prefixes are now extracted when parsing the file name
and returned as compression_ext element if present.
A list with the following elements:
round_id: The round ID the model output is associated with (NA for
model metadata files.)
team_abbr: The team responsible for the model.
model_abbr: The name of the model.
model_id: The unique model ID derived from the concatenation of
<team_abbr>-<model_abbr>.
ext: The file extension.
compression_ext: optional. The compression extension if present.
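How a file name breaks down into these elements can be sketched with base R string functions. `parse_name_sketch()` is a hypothetical helper that assumes an ISO date round ID and hyphen-free team and model abbreviations. It is not the package's parser and it handles model output file names only.

```r
parse_name_sketch <- function(file_path) {
  compression_exts <- c("snappy", "gzip", "gz", "brotli", "zstd",
                        "lz4", "lzo", "bz2")
  # Parent directories are ignored; split the base name on dots.
  parts <- strsplit(basename(file_path), ".", fixed = TRUE)[[1]]
  stem <- parts[1]
  ext <- parts[length(parts)]
  middle <- parts[-c(1, length(parts))]
  compression_ext <- intersect(middle, compression_exts)

  # Assumes <YYYY-MM-DD>-<team_abbr>-<model_abbr> with no hyphens in the
  # team and model abbreviations.
  pattern <- "^([0-9]{4}-[0-9]{2}-[0-9]{2})-([^-]+)-([^-]+)$"
  m <- regmatches(stem, regexec(pattern, stem))[[1]]
  out <- list(
    round_id = m[2],
    team_abbr = m[3],
    model_abbr = m[4],
    model_id = paste(m[3], m[4], sep = "-"),
    ext = ext
  )
  # Only include the compression extension when one is present.
  if (length(compression_ext) > 0) out$compression_ext <- compression_ext
  out
}

plain <- parse_name_sketch("hub-baseline/2022-10-15-hub-baseline.csv")
compressed <- parse_name_sketch(
  "hub-baseline/2022-10-15-hub-baseline.gzip.parquet"
)
```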
parse_file_name("hub-baseline/2022-10-15-hub-baseline.csv")
parse_file_name("hub-baseline/2022-10-15-hub-baseline.gzip.parquet")
validate_...() function as a bullet list
Prints a formatted summary of validation results. Validation-level warnings
(attached to the hub_validations object) are always displayed prominently
in a box at the top. Check-level warnings (attached to individual checks)
are only shown when show_check_warnings = TRUE.
## S3 method for class 'hub_validations'
print(x, show_check_warnings = FALSE, ...)
x |
An object of class |
show_check_warnings |
Logical. If |
... |
Unused argument present for class consistency |
Returns x invisibly.
    hub_path <- system.file("testhubs/simple", package = "hubValidations")
    v <- validate_submission(
      hub_path,
      file_path = "team1-goodmodel/2022-10-08-team1-goodmodel.csv"
    )
    # Default print
    print(v)
    # Show check-level warnings (if any)
    print(v, show_check_warnings = TRUE)
    # Example with validation-level warning
    v_with_warning <- v
    attr(v_with_warning, "warnings") <- list(
      capture_validation_warning(
        msg = "Example validation-level warning message.",
        where = "example"
      )
    )
    print(v_with_warning)
Prints a formatted summary of validation results from multiple files. Each file's validations are printed under a header showing the file path. Validation-level warnings (attached to the collection object) are displayed prominently in a box at the top.
## S3 method for class 'hub_validations_collection'
print(x, show_check_warnings = FALSE, ...)
x |
An object of class |
show_check_warnings |
Logical. If |
... |
Unused argument present for class consistency |
Returns x invisibly.
Read a model output file
read_model_out_file( file_path, hub_path = ".", coerce_types = c("hub", "chr", "none"), output_type_id_datatype = c("from_config", "auto", "character", "double", "integer", "logical", "Date") )
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
coerce_types |
character. What to coerce column types to on read.
|
output_type_id_datatype |
character string. One of |
a tibble of contents of the model output file.
Read a single target data file
read_target_file( target_file_path, hub_path, coerce_types = c("target", "chr", "none"), date_col = NULL, na = c("NA", "") )
target_file_path |
Character string. Path to the target data file being validated
relative to the hub's |
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
coerce_types |
character string. What to coerce column types to on read.
|
date_col |
Optional column name to be interpreted as date. Default is |
na |
A character vector of strings to interpret as missing values. Only
applies to CSV files. The default is |
a tibble of contents of the target data file.
    # download example hub
    hub_path <- system.file("testhubs/v5/target_file",
      package = "hubUtils"
    )
    # read in time-series file
    read_target_file("time-series.csv", hub_path)
    read_target_file("time-series.csv", hub_path, coerce_types = "chr")
    # read in oracle-output file
    read_target_file("oracle-output.csv", hub_path)
    read_target_file("oracle-output.csv", hub_path, coerce_types = "chr")
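The effect of the `na` argument on CSV input can be illustrated with base R alone. This is only an analogy, assuming `utils::read.csv()` and its `na.strings` argument as a stand-in; it does not show how `read_target_file()` reads files internally, and the column names and values are made up for the example.

```r
# Toy CSV with one empty field and one literal "NA" (invented data)
csv_text <- paste0(
  "date,location,observation\n",
  "2022-10-01,US,10\n",
  "2022-10-08,US,\n",
  "2022-10-15,US,NA\n"
)

# Treat both "NA" and the empty string as missing, reading all columns
# as character (loosely comparable to coerce_types = "chr")
target <- read.csv(
  text = csv_text,
  na.strings = c("NA", ""),
  colClasses = "character"
)

is.na(target$observation)
```

With only `"NA"` in the missing-value set, the empty field in a character column would be kept as an empty string rather than a missing value, which is why the choice of strings matters.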
Create a model output submission file template
submission_tmpl( path, round_id, required_vals_only = FALSE, force_output_types = FALSE, complete_cases_only = TRUE, compound_taskid_set = NULL, output_types = NULL, derived_task_ids = NULL, hub_con = deprecated(), config_tasks = deprecated() )
path |
Character string. Can be one of:
See examples for more details. |
round_id |
Character string. Round identifier. If the round is set to
|
required_vals_only |
Logical. Whether to return only combinations of Task ID and related output type ID required values. |
force_output_types |
Logical. Whether to force all output types to be required.
If |
complete_cases_only |
Logical. If |
compound_taskid_set |
List of character vectors, one for each modeling task
in the round. Can be used to override the compound task ID set defined in the
config. If |
output_types |
Character vector of output type names to include. Use to subset for grids for specific output types. |
derived_task_ids |
Character vector of derived task ID names (task IDs whose
values depend on other task IDs) to ignore. Columns for such task ids will
contain |
hub_con |
|
config_tasks |
|
For task IDs where all values are optional, by default, columns
are created as columns of NAs when required_vals_only = TRUE.
When such columns exist, the function returns a tibble with zero rows, as no
complete cases of required value combinations exist.
(Note that determination of complete cases excludes valid NA
output_type_id values in "mean" and "median" output types).
To return a template of incomplete required cases, which includes NA columns, use
complete_cases_only = FALSE.
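The zero-row behaviour follows from how complete cases work in general. The toy grid below uses base R rather than `submission_tmpl()`, with invented task ID names and values, to show why a column made up entirely of NAs removes every row.

```r
# Toy grid (invented values): `location` has required values, while
# `horizon` has only optional values and is therefore filled with NA
required_grid <- expand.grid(
  location = c("US", "CA"),
  horizon = NA_integer_,
  stringsAsFactors = FALSE
)

# Analogous to complete_cases_only = TRUE: no row is complete
complete_only <- required_grid[complete.cases(required_grid), , drop = FALSE]
nrow(complete_only)

# Analogous to complete_cases_only = FALSE: rows with NA columns are kept
nrow(required_grid)
```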
To include output types that are optional in the submission template
when required_vals_only = TRUE and complete_cases_only = FALSE, use
force_output_types = TRUE. Use this in combination with sub-setting for
output types you plan to submit via argument output_types to create a
submission template customised to your submission plans.
Tip: to ensure you create a template with all required output types, it's
a good idea to first run the function without subsetting or forcing output
types and examine the unique values in output_type to check which output
types are required.
When sample output types are included in the output, the output_type_id
column contains example sample indexes which are useful for identifying the
compound task ID structure of multivariate sampling distributions in particular,
i.e. which combinations of task ID values represent individual samples.
When a round is set to round_id_from_variable: true,
the value of the task ID from which round IDs are derived (i.e. the task ID
specified in round_id property of config_tasks) is set to the value of the
round_id argument in the returned output.
a tibble template containing an expanded grid of valid task ID and
output type ID value combinations for a given submission round
and output type.
If required_vals_only = TRUE, values are limited to the combination of required
values only.
    hub_path <- system.file("testhubs/flusight", package = "hubUtils")
    submission_tmpl(hub_path, round_id = "2023-01-02")
    # Return required values only
    submission_tmpl(
      hub_path,
      round_id = "2023-01-02",
      required_vals_only = TRUE
    )
    submission_tmpl(
      hub_path,
      round_id = "2023-01-02",
      required_vals_only = TRUE,
      complete_cases_only = FALSE
    )
    # Specify a round in a hub with multiple rounds
    hub_path <- system.file("testhubs/simple", package = "hubUtils")
    submission_tmpl(hub_path, round_id = "2022-10-01")
    submission_tmpl(hub_path, round_id = "2022-10-29")
    # Subset for a specific output type
    hub_path <- system.file("testhubs", "samples", package = "hubValidations")
    submission_tmpl(
      hub_path,
      round_id = "2022-12-17",
      output_types = "sample"
    )
    # Create a template from the path to a tasks config file
    config_path <- system.file("config", "tasks.json",
      package = "hubValidations"
    )
    submission_tmpl(
      config_path,
      round_id = "2022-12-26"
    )
    # Hub with sample output type and compound task ID structure
    config_path <- system.file("config", "tasks-comp-tid.json",
      package = "hubValidations"
    )
    submission_tmpl(
      config_path,
      round_id = "2022-12-26",
      output_types = "sample"
    )
    # Override config compound task ID set
    # Create coarser compound task ID set for the first modeling task which contains
    # samples
    submission_tmpl(
      config_path,
      round_id = "2022-12-26",
      output_types = "sample",
      compound_taskid_set = list(
        c("forecast_date", "target"),
        NULL
      )
    )
    # Derive a template with ignored derived task ID. Useful to avoid creating
    # a template with invalid derived task ID value combinations.
    hub_path <- system.file("testhubs", "flusight", package = "hubValidations")
    submission_tmpl(
      hub_path,
      round_id = "2022-12-12",
      output_types = "pmf",
      derived_task_ids = "target_end_date",
      complete_cases_only = FALSE
    )
    # Force optional output type, in this case "mean".
    submission_tmpl(
      hub_path,
      round_id = "2022-12-12",
      required_vals_only = TRUE,
      output_types = c("pmf", "quantile", "mean"),
      force_output_types = TRUE,
      derived_task_ids = "target_end_date",
      complete_cases_only = FALSE
    )
    # Create a template from a URL to fully configured hub repository on GitHub
    submission_tmpl(
      path = "https://github.com/hubverse-org/example-simple-forecast-hub",
      round_id = "2022-11-28",
      output_types = "quantile"
    )
    # Create a template from a URL to the raw contents of a tasks.json file on
    # GitHub
    config_raw_url <- paste0(
      "https://raw.githubusercontent.com/hubverse-org/",
      "example-simple-forecast-hub/refs/heads/main/hub-config/tasks.json"
    )
    submission_tmpl(
      path = config_raw_url,
      round_id = "2022-11-28",
      output_types = "quantile"
    )
    # Create submission file using config file from AWS S3 bucket hub
    # Use `s3_bucket()` to create a path to the hub's root directory
    s3_hub_path <- arrow::s3_bucket("hubverse/hubutils/testhubs/simple/")
    submission_tmpl(
      path = s3_hub_path,
      round_id = "2022-10-01",
      output_types = "quantile"
    )
    # Use `path()` method to create a path to the tasks.json file relative to
    # the S3 cloud hub's root directory
    s3_config_path <- s3_hub_path$path("hub-config/tasks.json")
    submission_tmpl(
      path = s3_config_path,
      round_id = "2022-10-01",
      output_types = "quantile"
    )
Wrap check expression in try to capture check execution errors
try_check(expr, file_path)
expr |
check function expression to run. |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
If expr executes correctly, the output of expr is returned. If
execution fails, an object of class <error/check_exec_error> is returned.
The execution error message is attached as attribute msg.
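The wrapping pattern described above can be sketched with base R's `tryCatch()`. `sketch_try_check()` is a hypothetical stand-in and is not the package's implementation: it omits `file_path`, and the summary message it stores is a placeholder.

```r
# Hypothetical stand-in for illustration only.
sketch_try_check <- function(expr) {
  tryCatch(
    expr, # lazily evaluated here, so errors are caught by the handler
    error = function(e) {
      structure(
        list(message = "Check execution failed.", call = NULL),
        class = c("check_exec_error", "hub_check", "error", "condition"),
        # Keep the original error message as an attribute
        msg = conditionMessage(e)
      )
    }
  )
}

# A check expression that runs normally: its value is passed through
passed <- sketch_try_check(1 + 1)

# A check expression that errors: a condition object is returned instead
failed <- sketch_try_check(stop("column `value` not found"))
class(failed)
attr(failed, "msg")
```

Returning a condition object instead of raising the error lets a sequence of checks continue to be collected and reported, rather than aborting the whole validation run on the first unexpected failure.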
Validate the contents of a submitted model data file
validate_model_data( hub_path, file_path, round_id_col = NULL, output_type_id_datatype = c("from_config", "auto", "character", "double", "integer", "logical", "Date"), validations_cfg_path = NULL, derived_task_ids = NULL )
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
round_id_col |
Character string. The name of the column containing
|
output_type_id_datatype |
character string. One of |
validations_cfg_path |
Path to |
derived_task_ids |
Character vector of derived task ID names (task IDs whose
values depend on other task IDs) to ignore. Columns for such task ids will
contain |
Note that it is necessary for derived_task_ids to be specified if any
task IDs with required values have dependent derived task IDs. If this
is the case and derived task IDs are not specified, the dependent nature of
derived task ID values will result in false validation errors when
validating required values.
Details of checks performed by validate_model_data()
| Name | Check | Early return | Fail output | Extra info |
|---|---|---|---|---|
| file_read | File can be read without errors | TRUE | check_error | |
| valid_round_id_col | Round ID var from config exists in data column names. Skipped if `round_id_from_variable` is FALSE in config. | FALSE | check_failure | |
| unique_round_id | Round ID column contains a single unique round ID. Skipped if `round_id_from_variable` is FALSE in config. | TRUE | check_error | |
| match_round_id | Round ID from file contents matches round ID from file name. Skipped if `round_id_from_variable` is FALSE in config. | TRUE | check_error | |
| colnames | File column names match expected column names for round (i.e. task ID names + hub standard column names) | TRUE | check_error | |
| col_types | File column types match expected column types from config. Mainly applicable to parquet & arrow files. | FALSE | check_failure | |
| valid_vals | Columns (excluding the `value` and any derived task ID columns) contain valid combinations of task ID / output type / output type ID values | TRUE | check_error | error_tbl: table of invalid task ID/output type/output type ID value combinations |
| derived_task_id_vals | Derived task ID columns contain valid values. | FALSE | check_failure | errors: named list of derived task ID values. Each element contains the invalid values for each derived task ID that failed the check. |
| rows_unique | Columns (excluding the `value` and any derived task ID columns) contain unique combinations of task ID / output type / output type ID values | FALSE | check_failure | |
| req_vals | Columns (excluding the `value` and any derived task ID columns) contain all required combinations of task ID / output type / output type ID values | FALSE | check_failure | missing_df: table of missing task ID/output type/output type ID value combinations |
| value_col_valid | Values in `value` column are coercible to data type configured for each output type | FALSE | check_failure | |
| value_col_non_desc | Values in `value` column are non-decreasing as output_type_ids increase for all unique task ID /output type value combinations. Applies to `quantile` or `cdf` output types only | FALSE | check_failure | error_tbl: table of rows affected |
| value_col_sum1 | Values in the `value` column of `pmf` output type data for each unique task ID combination sum to 1. | FALSE | check_failure | error_tbl: table of rows affected |
| spl_mt_unique | Individual sample output_type_ids do not span multiple model tasks. | TRUE | check_error | errors: list with mt_ids (model task indices the overlapping samples span) and output_type_ids (sample IDs appearing in multiple model tasks). |
| spl_compound_taskid_set | Sample compound task id sets for each modeling task match or are coarser than the expected set defined in tasks.json config. | TRUE | check_error | errors: list containing item for each failing modeling task. Exact structure dependent on type of validation failure. See check function documentation for more details. |
| spl_compound_tid | Samples contain single unique values for each compound task ID within individual samples (v3 and above schema only). | TRUE | check_error | errors: list containing item for each sample failing validation with breakdown of unique values for each compound task ID. |
| spl_non_compound_tid | Samples contain single unique combination of non-compound task ID values across all samples (v3 and above schema only). | TRUE | check_error | errors: list containing item for each modeling task with vectors of output type ids of samples failing validation and example table of most frequent non-compound task ID value combination across all samples in the modeling task |
| spl_n | Number of samples for a given compound idx falls within accepted compound task range (v3 and above schema only). | FALSE | check_failure | errors: list containing item for each compound_idx failing validation with sample count, metadata on expected samples and example table of expected structure for samples belonging to the compound idx in question. |
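Two of the rules in the table above, `value_col_non_desc` and `value_col_sum1`, are easy to illustrate on toy data. The base R sketch below uses invented values and a hypothetical helper `is_non_decreasing()`; it shows the rules themselves, not the package's check functions, which may apply different grouping and numeric tolerance.

```r
# Toy quantile data (invented): values for CA decrease at the 0.75 quantile
quantiles <- data.frame(
  location = rep(c("US", "CA"), each = 3),
  output_type_id = rep(c(0.25, 0.5, 0.75), times = 2),
  value = c(10, 20, 30, 15, 25, 5)
)

# Hypothetical helper: are values non-decreasing as output_type_id increases?
is_non_decreasing <- function(df) {
  df <- df[order(df$output_type_id), ]
  all(diff(df$value) >= 0)
}

# Apply the rule separately to each task ID combination (here: location)
non_desc <- vapply(
  split(quantiles, quantiles$location),
  is_non_decreasing,
  logical(1)
)
non_desc

# Toy pmf data (invented): probabilities for one task ID combination
pmf_values <- c(0.2, 0.3, 0.5)
# Compare with a tolerance rather than `==` to allow for floating point error
pmf_sums_to_1 <- isTRUE(all.equal(sum(pmf_values), 1))
pmf_sums_to_1
```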
An object of class hub_validations. Each named element contains
a hub_check class object reflecting the result of a given check. Function
will return early if a check returns an error. The where attribute is set
to file_path.
For more details on the structure of <hub_validations> objects, including
how to access more information on individual checks,
see article on <hub_validations> S3 class objects.
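The relationship between the "Early return" column and the returned object can be sketched as a simple loop. `sketch_run_checks()` and `mk_result()` are hypothetical helpers written in base R; they only imitate the documented behaviour and return a plain named list rather than a real `hub_validations` object.

```r
# Hypothetical result constructor: a bare condition-like object
mk_result <- function(cls) {
  structure(
    list(message = cls, call = NULL),
    class = c(cls, "hub_check", "condition")
  )
}

# Hypothetical runner: stop as soon as a check returns a check_error
sketch_run_checks <- function(checks) {
  results <- list()
  for (name in names(checks)) {
    res <- checks[[name]]()
    results[[name]] <- res
    if (inherits(res, "check_error")) {
      break
    }
  }
  results
}

checks <- list(
  file_exists = function() mk_result("check_success"),
  file_name = function() mk_result("check_error"),
  file_location = function() mk_result("check_failure")
)

results <- sketch_run_checks(checks)
names(results) # file_location is never run
```

A `check_failure`, by contrast, would be recorded and the loop would carry on, which is why failures and errors are distinguished in the "Fail output" column.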
    hub_path <- system.file("testhubs/simple", package = "hubValidations")
    file_path <- "team1-goodmodel/2022-10-08-team1-goodmodel.csv"
    validate_model_data(hub_path, file_path)
Validate file-level properties of a submitted model output file.
validate_model_file(hub_path, file_path, validations_cfg_path = NULL)
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
validations_cfg_path |
Path to |
Details of checks performed by validate_model_file()
| Name | Check | Early return | Fail output | Extra info |
|---|---|---|---|---|
| file_exists | File exists at `file_path` provided | TRUE | check_error | |
| file_name | File name valid | TRUE | check_error | |
| file_location | File located in correct team directory | FALSE | check_failure | |
| round_id_valid | File round ID is a valid hub round ID | TRUE | check_error | |
| file_format | File format is accepted hub/round format | TRUE | check_error | |
| file_n | Number of submission files per round per team does not exceed allowed number | FALSE | check_failure | |
| metadata_exists | Model metadata file exists in expected location | FALSE | check_failure | |
An object of class hub_validations. Each named element contains
a hub_check class object reflecting the result of a given check. Function
will return early if a check returns an error. The where attribute is set
to file_path.
For more details on the structure of <hub_validations> objects, including
how to access more information on individual checks,
see article on <hub_validations> S3 class objects.
    hub_path <- system.file("testhubs/simple", package = "hubValidations")
    validate_model_file(hub_path,
      file_path = "team1-goodmodel/2022-10-08-team1-goodmodel.csv"
    )
    validate_model_file(hub_path,
      file_path = "team1-goodmodel/2022-10-15-team1-goodmodel.csv"
    )
Validate properties of a metadata file.
validate_model_metadata( hub_path, file_path, round_id = "default", validations_cfg_path = NULL )
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
round_id |
character string. The round identifier. Used primarily to indicate whether the "default" or a round specific configuration should be used for custom validations. |
validations_cfg_path |
Path to |
Details of checks performed by validate_model_metadata()
| Name | Check | Early return | Fail output | Extra info |
|---|---|---|---|---|
| metadata_schema_exists | A model metadata schema file exists in `hub-config` directory. | TRUE | check_error | |
| metadata_file_exists | A file with name provided to argument `file_path` exists at the expected location (the `model-metadata` directory). | TRUE | check_error | |
| metadata_file_ext | The metadata file has correct extension (yaml or yml). | TRUE | check_error | |
| metadata_file_location | The metadata file has been saved to correct location. | TRUE | check_failure | |
| metadata_matches_schema | The contents of the metadata file match the hub's model metadata schema | TRUE | check_error | |
| metadata_file_name | The metadata filename matches the model ID specified in the contents of the file. | TRUE | check_error | |
An object of class hub_validations. Each named element contains
a hub_check class object reflecting the result of a given check. Function
will return early if a check returns an error. The where attribute is set
to file_path.
For more details on the structure of <hub_validations> objects, including
how to access more information on individual checks,
see article on <hub_validations> S3 class objects.
    hub_path <- system.file("testhubs/simple", package = "hubValidations")
    validate_model_metadata(hub_path,
      file_path = "hub-baseline.yml"
    )
    validate_model_metadata(hub_path,
      file_path = "team1-goodmodel.yaml"
    )
Validates model output and model metadata files in a Pull Request.
validate_pr( hub_path = ".", gh_repo, pr_number, round_id_col = NULL, output_type_id_datatype = c("from_config", "auto", "character", "double", "integer", "logical", "Date"), validations_cfg_path = NULL, skip_submit_window_check = FALSE, file_modification_check = c("error", "failure", "warn", "message", "none"), allow_submit_window_mods = TRUE, submit_window_ref_date_from = c("file", "file_path"), derived_task_ids = NULL )
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
gh_repo |
GitHub repository address in the format |
pr_number |
Number of the pull request to validate |
round_id_col |
Character string. The name of the column containing
|
output_type_id_datatype |
character string. One of |
validations_cfg_path |
Path to |
skip_submit_window_check |
Logical. Whether to skip the submission window check. |
file_modification_check |
Character string. Whether to perform check and what to return when modification/deletion of a previously submitted model output file or deletion of a previously submitted model metadata file is detected in PR:
|
allow_submit_window_mods |
Logical. Whether to allow modifications/deletions
of model output files within their submission windows. Defaults to |
submit_window_ref_date_from |
whether to get the reference date around
which relative submission windows will be determined from the file's
|
derived_task_ids |
Character vector of derived task ID names (task IDs whose
values depend on other task IDs) to ignore. Columns for such task ids will
contain |
Only model output and model metadata files are individually validated using
validate_submission() or validate_model_metadata() respectively although
as part of checks, hub config files are also validated.
Any other files included in the PR are ignored but flagged in a message.
By default, modifications (which include renaming) and deletions of
previously submitted model output files and deletions or renaming of
previously submitted model metadata files are not allowed
and return an <error/check_error> condition class object for each
applicable modified/deleted file. This behaviour can be modified through
arguments file_modification_check, which controls whether modification/deletion
checks are performed and what is returned if modifications/deletions are detected,
and allow_submit_window_mods, which controls whether modifications/deletions
of model output files are allowed within their submission windows.
When modification/deletion checks are enabled, each affected file creates an
entry in the returned collection named by the file's path. The check within
each entry is named valid_file_status (reflecting that we validate the
file's git status).
For example, to access the check for a deleted file:
collection[["team1-goodmodel/2022-10-15-team1-goodmodel.csv"]][["valid_file_status"]].
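The nested access pattern can be tried out on a mock object. The list below is a hand-built stand-in with a placeholder message; it is not what `validate_pr()` returns, only a sketch of the "entry per file path, check named `valid_file_status`" layout described above.

```r
# Mock collection (invented content): one entry per affected file path
collection <- list(
  "team1-goodmodel/2022-10-15-team1-goodmodel.csv" = list(
    valid_file_status = structure(
      list(message = "Placeholder message for a deleted file.", call = NULL),
      class = c("check_error", "hub_check", "error", "condition")
    )
  )
)

# Access the check for the deleted file by path, then by check name
check <- collection[[
  "team1-goodmodel/2022-10-15-team1-goodmodel.csv"
]][["valid_file_status"]]
inherits(check, "check_error")

# List every file whose valid_file_status check did not pass
flagged <- names(Filter(
  function(file_checks) inherits(file_checks[["valid_file_status"]], "error"),
  collection
))
flagged
```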
Note that to establish relative submission windows when performing
modification/deletion checks and allow_submit_window_mods
is TRUE, the reference date is taken as the round_id extracted from
the file path (i.e. submit_window_ref_date_from is always set to "file_path").
This is because we cannot extract dates from columns of deleted
files. If hub submission window reference dates do not match round IDs in file paths,
currently allow_submit_window_mods will not work correctly and is best set
to FALSE. This only relates to hubs/rounds where submission windows are
determined relative to a reference date and not when explicit submission
window start and end dates are provided in the config.
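As a rough illustration of a relative submission window derived from the file path, the base R sketch below takes the leading date of the file name as the reference date and applies day offsets to it. The offsets (6 days before to 1 day after) and the helper `in_window()` are assumptions made for the example; actual windows are defined in each hub's config, and this is not the package's implementation.

```r
file_path <- "team1-goodmodel/2022-10-15-team1-goodmodel.csv"

# Assumption: the round ID is a YYYY-MM-DD date at the start of the file name
ref_date <- as.Date(substr(basename(file_path), 1, 10))

# Hypothetical offsets, in days, relative to the reference date
window_start <- ref_date - 6
window_end <- ref_date + 1

# Hypothetical helper: does a given day fall inside the window?
in_window <- function(day) {
  day >= window_start && day <= window_end
}

in_window(as.Date("2022-10-12")) # inside the window
in_window(as.Date("2022-10-20")) # after the window has closed
```

If a hub's reference dates differ from the round IDs in its file paths, a calculation like this would produce the wrong window, which is the situation the note above warns about.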
Finally, note that it is necessary for derived_task_ids to be specified if any
task IDs with required values have dependent derived task IDs. If this
is the case and derived task IDs are not specified, the dependent nature of
derived task ID values will result in false validation errors when
validating required values.
Details of checks performed by validate_submission()
| Name | Check | Early return | Fail output | Extra info |
|---|---|---|---|---|
| valid_config | Hub config valid | TRUE | check_error | |
| submission_time | Current time within file submission window | FALSE | check_failure | |
| file_exists | File exists at `file_path` provided | TRUE | check_error | |
| file_name | File name valid | TRUE | check_error | |
| file_location | File located in correct team directory | FALSE | check_failure | |
| round_id_valid | File round ID is a valid hub round ID | TRUE | check_error | |
| file_format | File format is accepted hub/round format | TRUE | check_error | |
| file_n | Number of submission files per round per team does not exceed allowed number | FALSE | check_failure | |
| metadata_exists | Model metadata file exists in expected location | FALSE | check_failure | |
| file_read | File can be read without errors | TRUE | check_error | |
| valid_round_id_col | Round ID var from config exists in data column names. Skipped if `round_id_from_variable` is FALSE in config. | FALSE | check_failure | |
| unique_round_id | Round ID column contains a single unique round ID. Skipped if `round_id_from_variable` is FALSE in config. | TRUE | check_error | |
| match_round_id | Round ID from file contents matches round ID from file name. Skipped if `round_id_from_variable` is FALSE in config. | TRUE | check_error | |
| colnames | File column names match expected column names for round (i.e. task ID names + hub standard column names) | TRUE | check_error | |
| col_types | File column types match expected column types from config. Mainly applicable to parquet & arrow files. | FALSE | check_failure | |
| valid_vals | Columns (excluding the `value` and any derived task ID columns) contain valid combinations of task ID / output type / output type ID values | TRUE | check_error | error_tbl: table of invalid task ID/output type/output type ID value combinations |
| derived_task_id_vals | Derived task ID columns contain valid values. | FALSE | check_failure | errors: named list of derived task ID values. Each element contains the invalid values for each derived task ID that failed the check. |
| rows_unique | Columns (excluding the `value` and any derived task ID columns) contain unique combinations of task ID / output type / output type ID values | FALSE | check_failure | |
| req_vals | Columns (excluding the `value` and any derived task ID columns) contain all required combinations of task ID / output type / output type ID values | FALSE | check_failure | missing_df: table of missing task ID/output type/output type ID value combinations |
| value_col_valid | Values in `value` column are coercible to data type configured for each output type | FALSE | check_failure | |
| value_col_non_desc | Values in `value` column are non-decreasing as output_type_ids increase for all unique task ID /output type value combinations. Applies to `quantile` or `cdf` output types only | FALSE | check_failure | error_tbl: table of rows affected |
| value_col_sum1 | Values in the `value` column of `pmf` output type data for each unique task ID combination sum to 1. | FALSE | check_failure | error_tbl: table of rows affected |
| spl_mt_unique | Individual sample output_type_ids do not span multiple model tasks. | TRUE | check_error | errors: list with mt_ids (model task indices the overlapping samples span) and output_type_ids (sample IDs appearing in multiple model tasks). |
| spl_compound_taskid_set | Sample compound task id sets for each modeling task match or are coarser than the expected set defined in tasks.json config. | TRUE | check_error | errors: list containing item for each failing modeling task. Exact structure dependent on type of validation failure. See check function documentation for more details. |
| spl_compound_tid | Samples contain single unique values for each compound task ID within individual samples (v3 and above schema only). | TRUE | check_error | errors: list containing item for each sample failing validation with breakdown of unique values for each compound task ID. |
| spl_non_compound_tid | Samples contain single unique combination of non-compound task ID values across all samples (v3 and above schema only). | TRUE | check_error | errors: list containing item for each modeling task with vectors of output type ids of samples failing validation and example table of most frequent non-compound task ID value combination across all samples in the modeling task |
| spl_n | Number of samples for a given compound idx falls within accepted compound task range (v3 and above schema only). | FALSE | check_failure | errors: list containing item for each compound_idx failing validation with sample count, metadata on expected samples and example table of expected structure for samples belonging to the compound idx in question. |
Details of checks performed by validate_model_metadata()
| Name | Check | Early return | Fail output | Extra info | Optional |
|---|---|---|---|---|---|
| metadata_schema_exists | A model metadata schema file exists in `hub-config` directory. | TRUE | check_error | | FALSE |
| metadata_file_exists | A file with name provided to argument `file_path` exists at the expected location (the `model-metadata` directory). | TRUE | check_error | | FALSE |
| metadata_file_ext | The metadata file has correct extension (yaml or yml). | TRUE | check_error | | FALSE |
| metadata_file_location | The metadata file has been saved to correct location. | TRUE | check_failure | | FALSE |
| metadata_matches_schema | The contents of the metadata file match the hub's model metadata schema | TRUE | check_error | | FALSE |
| metadata_file_name | The metadata filename matches the model ID specified in the contents of the file. | TRUE | check_error | | FALSE |
| NA | The number of metadata files submitted by a single team does not exceed the maximum number allowed. | FALSE | check_failure | | TRUE |
An object of class hub_validations_collection, a collection of
validation results. The collection includes entries for hub config
validation ("hub-config") and file-specific validations (named by
file path).
## Not run:
validate_pr(
  hub_path = ".",
  gh_repo = "hubverse-org/ci-testhub-simple",
  pr_number = 3
)
## End(Not run)
Checks both file-level properties (file name, extension, location, etc.) and the model output data, i.e. the contents of the file.
validate_submission(
  hub_path,
  file_path,
  round_id_col = NULL,
  validations_cfg_path = NULL,
  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer", "logical", "Date"),
  skip_submit_window_check = FALSE,
  skip_check_config = FALSE,
  submit_window_ref_date_from = c("file", "file_path"),
  derived_task_ids = NULL
)
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
round_id_col |
Character string. The name of the column containing
|
validations_cfg_path |
Path to |
output_type_id_datatype |
character string. One of |
skip_submit_window_check |
Logical. Whether to skip the submission window check. |
skip_check_config |
Logical. Whether to skip the hub config validation check. |
submit_window_ref_date_from |
whether to get the reference date around
which relative submission windows will be determined from the file's
|
derived_task_ids |
Character vector of derived task ID names (task IDs whose
values depend on other task IDs) to ignore. Columns for such task ids will
contain |
Note that it is necessary for derived_task_ids to be specified if any
task IDs with required values have dependent derived task IDs. If this
is the case and derived task IDs are not specified, the dependent nature of
derived task ID values will result in false validation errors when
validating required values.
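The effect described in this note can be seen in a small base R sketch. It does not use the package; the task ID names `origin_date`, `horizon` and `target_end_date`, and the rule that `target_end_date` equals `origin_date` plus seven days per horizon, are assumptions chosen for illustration.

```r
# Hypothetical task IDs: target_end_date is derived from origin_date and horizon.
origin_date <- as.Date("2022-10-08")
horizon <- 1:2
target_end_date <- origin_date + 7 * horizon

# Treating the derived task ID as independent expands to every combination,
# including pairs that can never occur in a real submission.
naive <- expand.grid(horizon = horizon, target_end_date = target_end_date)

# Only rows consistent with the derivation rule are actually possible.
consistent <- naive[naive$target_end_date == origin_date + 7 * naive$horizon, ]

nrow(naive)       # combinations a naive required-values grid would expect
nrow(consistent)  # combinations a submission can actually contain
```

The rows present in `naive` but absent from `consistent` are the kind of combination that would be reported as missing if the derived task ID were not declared through `derived_task_ids`.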
Details of checks performed by validate_submission()
| Name | Check | Early return | Fail output | Extra info |
|---|---|---|---|---|
| valid_config | Hub config valid | TRUE | check_error | |
| submission_time | Current time within file submission window | FALSE | check_failure | |
| file_exists | File exists at `file_path` provided | TRUE | check_error | |
| file_name | File name valid | TRUE | check_error | |
| file_location | File located in correct team directory | FALSE | check_failure | |
| round_id_valid | File round ID is a valid hub round ID | TRUE | check_error | |
| file_format | File format is accepted hub/round format | TRUE | check_error | |
| file_n | Number of submission files per round per team does not exceed allowed number | FALSE | check_failure | |
| metadata_exists | Model metadata file exists in expected location | FALSE | check_failure | |
| file_read | File can be read without errors | TRUE | check_error | |
| valid_round_id_col | Round ID var from config exists in data column names. Skipped if `round_id_from_var` is FALSE in config. | FALSE | check_failure | |
| unique_round_id | Round ID column contains a single unique round ID. Skipped if `round_id_from_var` is FALSE in config. | TRUE | check_error | |
| match_round_id | Round ID from file contents matches round ID from file name. Skipped if `round_id_from_var` is FALSE in config. | TRUE | check_error | |
| colnames | File column names match expected column names for round (i.e. task ID names + hub standard column names) | TRUE | check_error | |
| col_types | File column types match expected column types from config. Mainly applicable to parquet & arrow files. | FALSE | check_failure | |
| valid_vals | Columns (excluding the `value` and any derived task ID columns) contain valid combinations of task ID / output type / output type ID values | TRUE | check_error | error_tbl: table of invalid task ID/output type/output type ID value combinations |
| derived_task_id_vals | Derived task ID columns contain valid values. | FALSE | check_failure | errors: named list of derived task ID values. Each element contains the invalid values for each derived task ID that failed the check. |
| rows_unique | Columns (excluding the `value` and any derived task ID columns) contain unique combinations of task ID / output type / output type ID values | FALSE | check_failure | |
| req_vals | Columns (excluding the `value` and any derived task ID columns) contain all required combinations of task ID / output type / output type ID values | FALSE | check_failure | missing_df: table of missing task ID/output type/output type ID value combinations |
| value_col_valid | Values in `value` column are coercible to data type configured for each output type | FALSE | check_failure | |
| value_col_non_desc | Values in `value` column are non-decreasing as output_type_ids increase for all unique task ID / output type value combinations. Applies to `quantile` or `cdf` output types only. | FALSE | check_failure | error_tbl: table of rows affected |
| value_col_sum1 | Values in the `value` column of `pmf` output type data for each unique task ID combination sum to 1. | FALSE | check_failure | error_tbl: table of rows affected |
| spl_mt_unique | Individual sample output_type_ids do not span multiple model tasks. | TRUE | check_error | errors: list with mt_ids (model task indices the overlapping samples span) and output_type_ids (sample IDs appearing in multiple model tasks). |
| spl_compound_taskid_set | Sample compound task id sets for each modeling task match or are coarser than the expected set defined in tasks.json config. | TRUE | check_error | errors: list containing item for each failing modeling task. Exact structure dependent on type of validation failure. See check function documentation for more details. |
| spl_compound_tid | Samples contain single unique values for each compound task ID within individual samples (v3 and above schema only). | TRUE | check_error | errors: list containing item for each sample failing validation with breakdown of unique values for each compound task ID. |
| spl_non_compound_tid | Samples contain single unique combination of non-compound task ID values across all samples (v3 and above schema only). | TRUE | check_error | errors: list containing item for each modeling task with vectors of output type ids of samples failing validation and example table of most frequent non-compound task ID value combination across all samples in the modeling task |
| spl_n | Number of samples for a given compound idx falls within accepted compound task range (v3 and above schema only). | FALSE | check_failure | errors: list containing item for each compound_idx failing validation with sample count, metadata on expected samples and example table of expected structure for samples belonging to the compound idx in question. |
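To illustrate what the `value_col_sum1` and `value_col_non_desc` rows above test for, the following base R sketch applies the same two rules to toy data. It is not the package's implementation: the data frames, the grouping by a single `location` column and the `1e-8` tolerance are assumptions made for the example.

```r
# Toy model output rows (values invented for illustration).
pmf <- data.frame(
  location = c("US", "US", "US"),
  output_type = "pmf",
  output_type_id = c("low", "moderate", "high"),
  value = c(0.2, 0.5, 0.3)
)
quantile_tbl <- data.frame(
  location = "US",
  output_type = "quantile",
  output_type_id = c(0.25, 0.5, 0.75),
  value = c(10, 12, 11)
)

# pmf rule: values within each task ID combination should sum to 1.
# A tolerance absorbs floating point error.
pmf_sums <- tapply(pmf$value, pmf$location, sum)
pmf_ok <- abs(pmf_sums - 1) < 1e-8

# quantile/cdf rule: once ordered by output_type_id, values must never decrease.
ordered_vals <- quantile_tbl$value[order(quantile_tbl$output_type_id)]
quantile_ok <- all(diff(ordered_vals) >= 0)
```

Here the `pmf` rows satisfy the first rule, while the quantile rows break the second because the value at 0.75 is lower than the value at 0.5.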
A hub_validations_collection object containing validation results
organized by file. The collection includes separate entries for hub config
validation (keyed by "hub-config") and file-specific validations (keyed by
file path).
hub_path <- system.file("testhubs/simple", package = "hubValidations")
file_path <- "team1-goodmodel/2022-10-08-team1-goodmodel.csv"
validate_submission(hub_path, file_path)
Validate the submission time of a submitted model data file.
validate_submission_time(
  hub_path,
  file_path,
  ref_date_from = c("file_path", "file")
)
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
file_path |
character string. Path to the file being validated relative to the hub's model-output directory. |
ref_date_from |
whether to get the reference date around
which relative submission windows will be determined from the file's
|
An object of class hub_validations. Each named element contains
a hub_check class object reflecting the result of a given check. Function
will return early if a check returns an error. The where attribute is set
to file_path.
For more details on the structure of <hub_validations> objects, including
how to access more information on individual checks,
see article on <hub_validations> S3 class objects.
hub_path <- system.file("testhubs/simple", package = "hubValidations")
file_path <- "team1-goodmodel/2022-10-08-team1-goodmodel.csv"
validate_submission_time(hub_path, file_path)
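The idea of a submission window defined relative to a reference date can be sketched in base R as follows. This is only an illustration under stated assumptions: the window offsets (`start = -6`, `end = 1`, in days) are hypothetical, the reference date is read from the first ten characters of the file name, and the package's own check may work with date-times and time zones rather than plain dates.

```r
file_path <- "team1-goodmodel/2022-10-08-team1-goodmodel.csv"

# Reference date taken from the file name (assumes a leading YYYY-MM-DD round ID).
ref_date <- as.Date(substr(basename(file_path), 1, 10))

# Hypothetical window: opens 6 days before and closes 1 day after the reference date.
window <- list(start = -6, end = 1)
window_open <- ref_date + window$start
window_close <- ref_date + window$end

# TRUE when a date falls inside the window (bounds inclusive).
in_window <- function(d) d >= window_open & d <= window_close

in_window(as.Date("2022-10-05"))
in_window(as.Date("2022-10-12"))
```

With these offsets the first date lies inside the window and the second lies after it has closed.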
Validate the contents of a submitted target data file.
validate_target_data(
  hub_path,
  file_path,
  target_type = c("time-series", "oracle-output"),
  date_col = NULL,
  allow_extra_dates = FALSE,
  na = c("NA", ""),
  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer", "logical", "Date"),
  validations_cfg_path = NULL,
  round_id = "default"
)
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
file_path |
A character string representing the path to the target data file
relative to the |
target_type |
Type of target data to retrieve matching files. One of "time-series" or "oracle-output". Defaults to "time-series". |
date_col |
Optional name of the column containing the date observations
actually occurred (e.g., |
allow_extra_dates |
Logical. If TRUE and target_type is "time-series", allows date values not in tasks.json. Other task ID columns are still strictly validated. Ignored for oracle-output (always strict). |
na |
A character vector of strings to interpret as missing values. Only
applies to CSV files. The default is |
output_type_id_datatype |
character string. One of |
validations_cfg_path |
Path to YAML file configuring custom validation checks.
If |
round_id |
Character string. Not generally relevant to target datasets
but can be used to specify a specific block of custom validation checks.
Otherwise best set to |
Details of checks performed by validate_target_data()
| Name | Check | Early return | Fail output | Extra info |
|---|---|---|---|---|
| target_file_read | Target data file can be read successfully. | TRUE | check_error | |
| target_tbl_colnames | Target data file has the correct column names according to target type. | TRUE | check_error | |
| target_tbl_coltypes | Target data file has the correct column types according to target type. | TRUE | check_error | |
| target_tbl_ts_targets | Targets in a time-series target data file are valid. Only performed on `time-series` data files. | TRUE | check_error | |
| target_tbl_rows_unique | Target data file rows are all unique. | FALSE | check_failure | |
| target_tbl_values | Task ID columns in a target data file have valid task ID values. | TRUE | check_error | |
| target_tbl_output_type_ids | Output type ID values in a target data file are valid and complete. Only performed when the target data file contains an `output_type_id` column. | TRUE | check_error | |
| target_tbl_oracle_value | Oracle values in a target data file are valid. Only performed on `oracle-output` data files. | FALSE | check_failure | |
An object of class hub_validations. Each named element contains
a hub_check class object reflecting the result of a given check. Function
will return early if a check returns an error. The where attribute is set
to file_path.
For more details on the structure of <hub_validations> objects, including
how to access more information on individual checks,
see article on <hub_validations> S3 class objects.
hub_path <- system.file("testhubs/v5/target_file", package = "hubUtils")
validate_target_data(hub_path,
  file_path = "time-series.csv",
  target_type = "time-series"
)
validate_target_data(hub_path,
  file_path = "oracle-output.csv",
  target_type = "oracle-output"
)
hub_path <- system.file("testhubs/v5/target_dir", package = "hubUtils")
validate_target_data(hub_path,
  file_path = "time-series/target=flu_hosp_rate/part-0.parquet",
  target_type = "time-series"
)
validate_target_data(hub_path,
  file_path = "oracle-output/output_type=pmf/part-0.parquet",
  target_type = "oracle-output"
)
Validate dataset level properties of a given target type
validate_target_dataset(
  hub_path,
  target_type = c("time-series", "oracle-output"),
  validations_cfg_path = NULL,
  round_id = "default"
)
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
target_type |
Type of target data to retrieve matching files. One of "time-series" or "oracle-output". Defaults to "time-series". |
validations_cfg_path |
Path to YAML file configuring custom validation checks.
If |
round_id |
Character string. Not generally relevant to target datasets
but can be used to specify a specific block of custom validation checks.
Otherwise best set to |
Details of checks performed by validate_target_dataset()
| Name | Check | Early return | Fail output | Extra info |
|---|---|---|---|---|
| target_dataset_exists | Target dataset can be successfully detected for a given target type. | TRUE | check_error | |
| target_dataset_unique | A single unique target dataset exists for a given target type. | TRUE | check_error | |
| target_dataset_file_ext_unique | All files of a given target type share a single unique file format. | TRUE | check_error | |
| target_dataset_rows_unique | Target dataset rows are all unique. | FALSE | check_failure | |
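Two of the dataset-level rules in the table above, a single shared file format and unique rows, can be illustrated with base R. This is a sketch, not the package's code: the file paths (including the `flu_hosp_death` partition), the toy table and its column names are invented, and the sketch compares whole rows, which may differ from how the package defines a duplicate.

```r
# Hypothetical files belonging to one target type, with mixed formats.
files <- c(
  "time-series/target=flu_hosp_rate/part-0.parquet",
  "time-series/target=flu_hosp_death/part-0.csv"
)
exts <- unique(tools::file_ext(files))
single_format <- length(exts) == 1L

# Toy target table in which the first two rows are identical.
target_tbl <- data.frame(
  target_end_date = as.Date(c("2022-10-08", "2022-10-08", "2022-10-15")),
  location = c("US", "US", "US"),
  observation = c(10, 10, 12)
)
dup_rows <- target_tbl[duplicated(target_tbl), ]
rows_unique <- nrow(dup_rows) == 0L
```

Both rules fail on this toy input: two file extensions are present, and one row is an exact repeat of another.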
An object of class hub_validations. Each named element contains
a hub_check class object reflecting the result of a given check. Function
will return early if a check returns an error. The where attribute is set
to target_type (e.g. "oracle-output", "time-series").
For more details on the structure of <hub_validations> objects, including
how to access more information on individual checks,
see article on <hub_validations> S3 class objects.
# Validate single file target datasets
hub_path <- system.file("testhubs/v5/target_file", package = "hubUtils")
validate_target_dataset(hub_path,
  target_type = "time-series"
)
validate_target_dataset(hub_path,
  target_type = "oracle-output"
)
# Validate multi-file partitioned target datasets
hub_path <- system.file("testhubs/v5/target_dir", package = "hubUtils")
validate_target_dataset(hub_path,
  target_type = "time-series"
)
validate_target_dataset(hub_path,
  target_type = "oracle-output"
)
Validate file level properties of a target data file.
validate_target_file(
  hub_path,
  file_path,
  validations_cfg_path = NULL,
  round_id = "default"
)
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
file_path |
A character string representing the path to the target data
file relative to the |
validations_cfg_path |
Path to YAML file configuring custom validation checks.
If |
round_id |
Character string. Not generally relevant to target datasets
but can be used to specify a specific block of custom validation checks.
Otherwise best set to |
Details of checks performed by validate_target_file()
| Name | Check | Early return | Fail output | Extra info |
|---|---|---|---|---|
| target_file_exists | File exists at `file_path` provided. | TRUE | check_error | |
| target_partition_file_name | Hive-style partition file path segments are valid and can be parsed successfully. Skipped if target dataset not hive-partitioned. | TRUE | check_error | |
| target_file_ext | Target data file extension is valid. | TRUE | check_error | |
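For readers unfamiliar with hive-style partitioning, the sketch below shows how `key=value` path segments, such as those checked by `target_partition_file_name`, can be pulled out of a file path in base R. The helper `parse_hive_segments()` is hypothetical and is not part of the package; it also ignores details such as URL-encoded values.

```r
# Hypothetical helper: extract hive-style key=value segments from a path.
parse_hive_segments <- function(file_path) {
  segments <- strsplit(file_path, "/", fixed = TRUE)[[1]]
  # Keep only the segments that look like key=value.
  partitions <- segments[grepl("=", segments, fixed = TRUE)]
  keys <- sub("=.*$", "", partitions)
  values <- sub("^[^=]*=", "", partitions)
  stats::setNames(as.list(values), keys)
}

parse_hive_segments("time-series/target=flu_hosp_rate/part-0.parquet")
parse_hive_segments("oracle-output/output_type=pmf/part-0.parquet")
parse_hive_segments("time-series.csv")
```

The first two calls return a one-element list named after the partition key, and the last returns an empty list because a single-file dataset has no partition segments.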
An object of class hub_validations. Each named element contains
a hub_check class object reflecting the result of a given check. Function
will return early if a check returns an error. The where attribute is set
to file_path.
For more details on the structure of <hub_validations> objects, including
how to access more information on individual checks,
see article on <hub_validations> S3 class objects.
hub_path <- system.file("testhubs/v5/target_file", package = "hubUtils")
validate_target_file(hub_path,
  file_path = "time-series.csv"
)
validate_target_file(hub_path,
  file_path = "oracle-output.csv"
)
hub_path <- system.file("testhubs/v5/target_dir", package = "hubUtils")
validate_target_file(hub_path,
  file_path = "time-series/target=flu_hosp_rate/part-0.parquet"
)
validate_target_file(hub_path,
  file_path = "oracle-output/output_type=pmf/part-0.parquet"
)
Validates target data files in a Pull Request.
validate_target_pr(
  hub_path = ".",
  gh_repo,
  pr_number,
  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer", "logical", "Date"),
  date_col = NULL,
  allow_extra_dates = FALSE,
  na = c("NA", ""),
  round_id = "default",
  validations_cfg_path = NULL,
  file_modification_check = c("none", "message", "failure", "error"),
  allow_target_type_deletion = FALSE
)
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
gh_repo |
GitHub repository address in the format |
pr_number |
Number of the pull request to validate |
output_type_id_datatype |
character string. One of |
date_col |
Optional name of the column containing the date observations
actually occurred (e.g., |
allow_extra_dates |
Logical. If TRUE and target_type is "time-series", allows date values not in tasks.json. Other task ID columns are still strictly validated. Ignored for oracle-output (always strict). |
na |
Character vector of strings to interpret as missing values when reading data files. Passed to the underlying file reader. |
round_id |
Character string. Not generally relevant to target datasets
but can be used to specify a specific block of custom validation checks.
Otherwise best set to |
validations_cfg_path |
Path to YAML file configuring custom validation checks.
If |
file_modification_check |
Character string. Whether to perform check and what to return when modification/deletion of a previously submitted target data file is detected in PR:
Defaults to |
allow_target_type_deletion |
Logical. Whether to allow deletion of an entire
target type dataset (i.e. all files of a target type) in the PR. Defaults to |
Only target data files are individually validated, using
validate_target_submission(). As part of these checks, the hub config files and
any affected target type datasets as a whole are also validated, the latter via
validate_target_dataset().
Any other files included in the PR are ignored but flagged in a message.
By default, modifications (which include renaming) and deletions of
previously submitted target data files are allowed.
This behaviour can be changed through the
file_modification_check argument, which controls whether modification/deletion
checks are performed and what is returned if modifications/deletions are detected.
When modification/deletion checks are enabled, each affected file creates an
entry in the returned collection named by the file's path. The check within
each entry is named valid_file_status (reflecting that we validate the
file's git status).
For example, to access the check for a deleted file:
collection[["oracle-output/output_type=sample/part-0.parquet"]][["valid_file_status"]].
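The access pattern above can be tried out on a stand-in object. In the sketch below a plain nested list plays the role of the returned collection; the entry contents are placeholder strings, not real check objects, and nothing here calls the package.

```r
# Stand-in for a returned collection: a named list of named lists.
collection <- list(
  "hub-config" = list(valid_config = "placeholder check result"),
  "oracle-output/output_type=sample/part-0.parquet" = list(
    valid_file_status = "placeholder check result for a deleted file"
  )
)

# Find every entry that carries a valid_file_status check.
has_status <- vapply(
  collection,
  function(entry) "valid_file_status" %in% names(entry),
  logical(1)
)
flagged <- names(collection)[has_status]

# Pull the check out of each flagged entry, keyed by file path.
status_checks <- lapply(collection[flagged], `[[`, "valid_file_status")
```

Looping over entry names in this way avoids hard-coding file paths when several files in a PR were modified or deleted.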
Details of checks performed by validate_target_dataset()
| Name | Check | Early return | Fail output | Extra info |
|---|---|---|---|---|
| valid_config | Hub config valid | TRUE | check_error | |
| target_dataset_exists | Target dataset can be successfully detected for a given target type. | TRUE | check_error | |
| target_dataset_unique | A single unique target dataset exists for a given target type. | TRUE | check_error | |
| target_dataset_file_ext_unique | All files of a given target type share a single unique file format. | TRUE | check_error | |
| target_dataset_rows_unique | Target dataset rows are all unique. | FALSE | check_failure | |
Details of checks performed by validate_target_submission()
| Name | Check | Early return | Fail output | Extra info | Optional |
|---|---|---|---|---|---|
| target_file_exists | File exists at `file_path` provided. | TRUE | check_error | | FALSE |
| target_partition_file_name | Hive-style partition file path segments are valid and can be parsed successfully. Skipped if target dataset not hive-partitioned. | TRUE | check_error | | FALSE |
| target_file_ext | Target data file extension is valid. | TRUE | check_error | | FALSE |
| target_file_read | Target data file can be read successfully. | TRUE | check_error | | FALSE |
| target_tbl_colnames | Target data file has the correct column names according to target type. | TRUE | check_error | | FALSE |
| target_tbl_coltypes | Target data file has the correct column types according to target type. | TRUE | check_error | | FALSE |
| target_tbl_ts_targets | Targets in a time-series target data file are valid. Only performed on `time-series` data files. | TRUE | check_error | | FALSE |
| target_tbl_rows_unique | Target data file rows are all unique. | FALSE | check_failure | | FALSE |
| target_tbl_values | Task ID columns in a target data file have valid task ID values. | TRUE | check_error | | FALSE |
| target_tbl_output_type_ids | Output type ID values in a target data file are valid and complete. Only performed when the target data file contains an `output_type_id` column. | TRUE | check_error | | FALSE |
| target_tbl_oracle_value | Oracle values in a target data file are valid. Only performed on `oracle-output` data files. | FALSE | check_failure | | FALSE |
An object of class target_validations_collection, a collection of
validation results. The collection includes entries for hub config
validation ("hub-config"), target dataset type validation ("time-series",
"oracle-output"), and individual file validations (named by file path).
## Not run:
tmp_dir <- withr::local_tempdir()
ci_target_hub_path <- fs::path(tmp_dir, "target")
gert::git_clone(
  url = "https://github.com/hubverse-org/ci-testhub-target.git",
  path = ci_target_hub_path
)
# Validate addition of single file in single file target dataset
gert::git_branch_checkout(
  "add-file-oracle-output",
  repo = ci_target_hub_path
)
validate_target_pr(
  hub_path = ci_target_hub_path,
  gh_repo = "hubverse-org/ci-testhub-target",
  pr_number = 1
)
# Validate addition of multiple files in partitioned target dataset
gert::git_branch_checkout(
  "add-target-dir-files-v5",
  repo = ci_target_hub_path
)
validate_target_pr(
  hub_path = ci_target_hub_path,
  gh_repo = "hubverse-org/ci-testhub-target",
  pr_number = 2
)
## End(Not run)
Checks both file-level properties (file name, extension, location, etc.) and the target data, i.e. the contents of the file.
validate_target_submission(
  hub_path,
  file_path,
  target_type = c("time-series", "oracle-output"),
  date_col = NULL,
  allow_extra_dates = FALSE,
  round_id = "default",
  na = c("NA", ""),
  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer", "logical", "Date"),
  validations_cfg_path = NULL,
  skip_check_config = FALSE
)
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
file_path |
A character string representing the path to the target data
file relative to the |
target_type |
Character string. The type of target data, either
|
date_col |
Optional name of the column containing the date observations
actually occurred (e.g., |
allow_extra_dates |
Logical. If TRUE and target_type is "time-series", allows date values not in tasks.json. Other task ID columns are still strictly validated. Ignored for oracle-output (always strict). |
round_id |
Character string. Not generally relevant to target datasets
but can be used to specify a specific block of custom validation checks.
Otherwise best set to |
na |
Character vector of strings to interpret as missing values when reading data files. Passed to the underlying file reader. |
output_type_id_datatype |
character string. One of |
validations_cfg_path |
Path to YAML file configuring custom validation checks.
If |
skip_check_config |
Logical. Whether to skip the hub config validation check. |
Details of checks performed by validate_target_submission()
| Name | Check | Early return | Fail output | Extra info |
|---|---|---|---|---|
| valid_config | Hub config valid | TRUE | check_error | |
| target_file_exists | File exists at `file_path` provided. | TRUE | check_error | |
| target_partition_file_name | Hive-style partition file path segments are valid and can be parsed successfully. Skipped if target dataset not hive-partitioned. | TRUE | check_error | |
| target_file_ext | Target data file extension is valid. | TRUE | check_error | |
| target_file_read | Target data file can be read successfully. | TRUE | check_error | |
| target_tbl_colnames | Target data file has the correct column names according to target type. | TRUE | check_error | |
| target_tbl_coltypes | Target data file has the correct column types according to target type. | TRUE | check_error | |
| target_tbl_ts_targets | Targets in a time-series target data file are valid. Only performed on `time-series` data files. | TRUE | check_error | |
| target_tbl_rows_unique | Target data file rows are all unique. | FALSE | check_failure | |
| target_tbl_values | Task ID columns in a target data file have valid task ID values. | TRUE | check_error | |
| target_tbl_output_type_ids | Output type ID values in a target data file are valid and complete. Only performed when the target data file contains an `output_type_id` column. | TRUE | check_error | |
| target_tbl_oracle_value | Oracle values in a target data file are valid. Only performed on `oracle-output` data files. | FALSE | check_failure | |
A target_validations_collection object containing validation results
organized by file. The collection includes separate entries for hub config
validation (keyed by "hub-config") and file-specific validations (keyed by
file path).
hub_path <- system.file("testhubs/v5/target_file", package = "hubUtils")
validate_target_submission(
  hub_path,
  file_path = "time-series.csv",
  target_type = "time-series"
)
# Example with partitioned data
hub_path <- system.file("testhubs/v5/target_dir", package = "hubUtils")
validate_target_submission(
  hub_path,
  file_path = "time-series/target=flu_hosp_rate/part-0.parquet",
  target_type = "time-series"
)