Package 'hubValidations'

Title: Testing framework for hubverse hub validations
Description: Provides a simple interface for running validations on data and metadata submitted to a hubverse modeling hub. Validation tests can be run at different levels (single file, single folder, whole repository), either locally or as part of a continuous integration workflow.
Authors: Anna Krystalli [aut, cre], Evan Ray [aut], Hugo Gruson [aut], Zhian N. Kamvar [ctb], Consortium of Infectious Disease Modeling Hubs [cph]
Maintainer: Anna Krystalli <[email protected]>
License: MIT + file LICENSE
Version: 0.11.0
Built: 2025-03-12 18:24:22 UTC

Help Index

Capture a condition of the result of validation check.


Capture a condition of the result of validation check.


  msg_verbs = c("is", "must be"),
  error = FALSE,
  details = NULL,



logical, the result of a validation check. If check is FALSE, validation has failed. If check is TRUE, validation has succeeded.


character string. Path to the file being validated. Must be the relative path to the hub's model-output (or equivalent) directory.


character string. The subject of the validation.


character string. The attribute of subject being validated.


character vector of length 2. The verbs describing the state of the attribute in relation to the validation subject. The first element describes the state when validation succeeds, the second element, when validation fails.


logical. In the case of validation failure, whether the function should return an object of class ⁠<error/check_error>⁠ (TRUE) or ⁠<error/check_failure>⁠ (FALSE, default).


further details to be appended to the output message.


<dynamic> Named data fields stored inside the condition object.


Arguments msg_subject, msg_attribute, msg_verbs and details accept text that can be interpreted and formatted by cli::format_inline().


Depending on whether validation has succeeded and the value of the error argument, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

  • ⁠<error/check_error>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.


  check = TRUE, file_path = "test/file.csv",
  msg_subject = "{.var round_id}", msg_attribute = "valid.", error = FALSE
  check = FALSE, file_path = "test/file.csv",
  msg_subject = "{.var round_id}", msg_attribute = "valid.", error = FALSE,
  details = "Must be one of 'A' or 'B', not 'C'"
  check = FALSE, file_path = "test/file.csv",
  msg_subject = "{.var round_id}", msg_attribute = "valid.", error = TRUE,
  details = "Must be one of {.val {c('A', 'B')}}, not {.val C}"
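The mapping from the check and error arguments to the class of the returned condition can be illustrated with a small base-R sketch. sketch_check_cnd() is a hypothetical stand-in written for this illustration, not the package's implementation; the order of classes in the class vector is an assumption and message formatting is omitted.

```r
# Hypothetical stand-in for illustration only (not part of hubValidations).
# It mirrors the documented mapping from `check` and `error` to condition classes.
sketch_check_cnd <- function(check, error = FALSE) {
  stopifnot(is.logical(check), length(check) == 1L, !is.na(check))
  cnd_class <- if (check) {
    c("check_success", "hub_check", "message", "condition")
  } else if (error) {
    c("check_error", "hub_check", "error", "condition")
  } else {
    c("check_failure", "hub_check", "error", "condition")
  }
  # A base-R condition is a list with `message` and `call` elements
  structure(
    list(message = "illustrative message", call = NULL),
    class = cnd_class
  )
}

ok <- sketch_check_cnd(TRUE)
fail <- sketch_check_cnd(FALSE)
err <- sketch_check_cnd(FALSE, error = TRUE)

inherits(ok, "check_success")   # TRUE
inherits(fail, "check_failure") # TRUE
inherits(err, "check_error")    # TRUE
inherits(err, "hub_check")      # TRUE
```

Testing with inherits() rather than comparing class(x)[1] keeps downstream code independent of the exact class order.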

Capture a simple info message condition


Capture a simple info message condition. Useful for communicating when a check is ignored or skipped.


capture_check_info(file_path, msg, call = rlang::caller_call())



character string. Path to the file being validated. Must be the relative path to the hub's model-output (or equivalent) directory.


Character string. Accepts text that can be interpreted and formatted by cli::format_inline().


The defused call of the function that generated the message. Use to override the default, which is the caller call. See rlang::stack for more details.


A ⁠<message/check_info>⁠ condition class object. Returned object also inherits from subclass ⁠<hub_check>⁠.

Capture an execution error condition


Capture an execution error condition. Useful for communicating when a check execution has failed. Usually used in conjunction with try.


capture_exec_error(file_path, msg, call = NULL)



character string. Path to the file being validated. Must be the relative path to the hub's model-output (or equivalent) directory.


Character string.


Character string. Name of the parent call that failed to execute. If NULL (default), the caller's call name is captured.


A ⁠<error/check_exec_error>⁠ condition class object. Returned object also inherits from subclass ⁠<hub_check>⁠.
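The "in conjunction with try" pattern can be sketched in base R as follows. run_check_safely() and broken_check() are hypothetical helpers invented for this illustration; in the package, capture_exec_error() would presumably take the place of the structure() call, and the class order shown is an assumption.

```r
# Hypothetical helper for illustration (not part of hubValidations): run a
# check function and convert any execution failure into a condition object.
run_check_safely <- function(check_fn, ...) {
  res <- try(check_fn(...), silent = TRUE)
  if (inherits(res, "try-error")) {
    # Keep the original error message so the failure can be reported later
    msg <- conditionMessage(attr(res, "condition"))
    res <- structure(
      list(message = msg, call = NULL),
      class = c("check_exec_error", "hub_check", "error", "condition")
    )
  }
  res
}

# A check that always fails to execute (illustrative)
broken_check <- function(file_path) stop("could not read ", file_path)

out <- run_check_safely(broken_check, file_path = "test/file.csv")
inherits(out, "check_exec_error") # TRUE
conditionMessage(out)             # "could not read test/file.csv"
```

Returning a condition object instead of letting the error propagate allows the remaining checks to run and be reported together.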

Capture an execution warning condition


Capture an execution warning condition. Useful for communicating when a check execution has failed. Usually used in conjunction with try.


capture_exec_warning(file_path, msg, call = NULL)



character string. Path to the file being validated. Must be the relative path to the hub's model-output (or equivalent) directory.


Character string.


Character string. Name of the parent call that failed to execute. If NULL (default), the caller's call name is captured.


A ⁠<warning/check_exec_warn>⁠ condition class object. Returned object also inherits from subclass ⁠<hub_check>⁠.

Check hub correctly configured


Checks that admin and tasks configuration files in directory hub-config are valid.





Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_error>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.

Check file exists at the file path specified


Check file exists at the file path specified


  hub_path = ".",
  subdir = c("model-output", "model-metadata", "hub-config")



character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


subdirectory within the hub


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_error>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.

Check file format is accepted by hub.


Check file format is accepted by hub.


check_file_format(file_path, hub_path, round_id)



character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


character string. The round identifier.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_error>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.

Check file is being submitted to the correct folder


Checks that the model_id metadata in the file name matches the directory name the file is being submitted to.





character string. Path to the file being validated relative to the hub's model-output directory.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.
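The comparison between directory name and file name can be sketched in base R. This is an illustration only, not the package implementation; it assumes file paths of the form "<model_id>/<round_id>-<model_id>.<ext>", and file_in_matching_dir() is a hypothetical helper.

```r
# Hypothetical sketch (not the package implementation). Assumes file paths of
# the form "<model_id>/<round_id>-<model_id>.<ext>".
file_in_matching_dir <- function(file_path) {
  dir_name <- basename(dirname(file_path))
  file_stem <- tools::file_path_sans_ext(basename(file_path))
  # The file name should end in "-<model_id>", i.e. the directory name
  endsWith(file_stem, paste0("-", dir_name))
}

file_in_matching_dir("team1-goodmodel/2022-10-08-team1-goodmodel.csv")  # TRUE
file_in_matching_dir("team1-goodmodel/2022-10-08-team2-othermodel.csv") # FALSE
```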

Check number of files submitted per round does not exceed the allowed number of submissions per team.


Check number of files submitted per round does not exceed the allowed number of submissions per team.


check_file_n(file_path, hub_path, allowed_n = 1L)



character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


integer(1). The maximum number of files allowed per round.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.
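The counting logic behind allowed_n can be sketched as below. The file names and the assumption that file names start with the round ID followed by a hyphen are illustrative; the package may identify round files differently.

```r
# Files found in a team's model-output folder (illustrative names)
files <- c(
  "2022-10-08-team1-goodmodel.csv",
  "2022-10-08-team1-goodmodel.parquet",
  "2022-10-15-team1-goodmodel.csv"
)
round_id <- "2022-10-08"
allowed_n <- 1L

# Assumption: file names start with "<round_id>-"
n_round <- sum(startsWith(files, paste0(round_id, "-")))
n_round              # 2
n_round <= allowed_n # FALSE: two files were found for the round
```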

Check a model output file name can be correctly parsed.


Check a model output file name can be correctly parsed.





character string. Path to the file being validated relative to the hub's model-output directory.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_error>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.

Check file can be read successfully


Check file can be read successfully


check_file_read(file_path, hub_path = ".")



character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_error>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.

Raise conditions stored in a hub_validations S3 object


This is meant to be used in CI workflows to raise conditions from hub_validations objects but can also be useful locally to summarise the results of checks contained in a hub_validations S3 object.


check_for_errors(x, verbose = FALSE)



A hub_validations object


Logical. If TRUE, print the results of all checks prior to raising condition and summarising hub_validations S3 object check results.


An error if one of the elements of x is of class check_failure, check_error, check_exec_error or check_exec_warn. TRUE invisibly otherwise.
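The return behaviour described above can be sketched in base R. raise_if_flagged() and new_cnd() are hypothetical helpers for illustration, not the package implementation; the class names follow the return values documented for the capture functions.

```r
# Hypothetical sketch (not the package implementation): raise an error if any
# element of a list of condition objects carries one of the flagged classes.
flagged_classes <- c(
  "check_failure", "check_error", "check_exec_error", "check_exec_warn"
)

raise_if_flagged <- function(x) {
  flagged <- vapply(x, inherits, logical(1), what = flagged_classes)
  if (any(flagged)) {
    stop(
      sum(flagged), " check(s) did not pass: ",
      paste(names(x)[flagged], collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Minimal condition objects to exercise the sketch
new_cnd <- function(cls, type) {
  structure(
    list(message = "", call = NULL),
    class = c(cls, "hub_check", type, "condition")
  )
}
checks <- list(
  file_exists = new_cnd("check_success", "message"),
  file_name = new_cnd("check_failure", "error")
)

raise_if_flagged(checks["file_exists"]) # returns TRUE invisibly
res <- try(raise_if_flagged(checks), silent = TRUE)
inherits(res, "try-error")              # TRUE
```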

Check whether a metadata file exists


Check whether a metadata file exists


check_metadata_file_exists(hub_path = ".", file_path)



Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


character string. Path to the file being validated relative to the hub's model-metadata directory.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_error>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.

Check file is being submitted to the correct folder


Checks that the model_id metadata in the file name matches the directory name the file is being submitted to.





character string. Path to the file being validated relative to the hub's model-output directory.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.

Check that the metadata file is being submitted to the correct folder


Check that the metadata file is being submitted to the correct folder





character string. Path to the file being validated relative to the hub's model-metadata directory.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.

Check whether the file name of a metadata file matches the model_id or combination of team_abbr and model_abbr specified within the metadata file


Check whether the file name of a metadata file matches the model_id or combination of team_abbr and model_abbr specified within the metadata file


check_metadata_file_name(file_path, hub_path = ".")



character string. Path to the file being validated relative to the hub's model-metadata directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.

Check whether a metadata file matches the schema provided by the hub


Check whether a metadata file matches the schema provided by the hub


check_metadata_matches_schema(file_path, hub_path = ".")



character string. Path to the file being validated relative to the hub's model-metadata directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.

Check whether a metadata schema file exists


Check whether a metadata schema file exists


check_metadata_schema_exists(hub_path = ".")



Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.

Check whether a metadata file for the given model exists


Check whether a metadata file for the given model exists


check_submission_metadata_file_exists(file_path, hub_path = ".")



character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_error>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.

Checks submission is within the valid submission window for a given round.


Checks submission is within the valid submission window for a given round.


  ref_date_from = c("file", "file_path")



Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


character string. Path to the file being validated relative to the hub's model-output directory.


Whether to get the reference date, around which relative submission windows are determined, from the round ID in the file's path ("file_path") or from the file contents themselves ("file"). "file" requires that the file can be read. Only applicable when a round is configured to determine the submission windows relative to the value in a date column in model output files. Not applicable when explicit submission window start and end dates are provided in the hub's config.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.
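A relative submission window can be illustrated with plain date arithmetic. The offsets and dates below are assumed values for illustration, not taken from any hub config, and the package may well work with date-times and time zones rather than whole days.

```r
ref_date <- as.Date("2022-10-08") # reference date (from the file or its path)
start_offset <- -6L               # assumed window, in days relative to ref_date
end_offset <- 1L

# Window bounds derived from the reference date
window <- ref_date + c(start_offset, end_offset)

submitted_on <- as.Date("2022-10-05")
within_window <- submitted_on >= window[1] && submitted_on <= window[2]
within_window # TRUE
```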

Check model data column data types


Check that model output data column data types conform to those defined in the hub config.


  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer",
    "logical", "Date")



a tibble/data.frame of the contents of the file being validated.


character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


character string. One of "from_config", "auto", "character", "double", "integer", "logical", "Date". Defaults to "from_config", which uses the setting in the output_type_id_datatype property in the tasks.json config file if available. If the property is not set in the config, the argument falls back to "auto", which determines the output_type_id data type automatically from the tasks.json config file as the simplest data type required to represent all output type ID values across all output types in the hub. When only point estimate output types (where output_type_ids are NA) are being collected by a hub, the output_type_id column is assigned a character data type when auto-determined. Other data type values can be used to override automatic determination. Note that attempting to coerce output_type_id to a data type that is not valid for the data (e.g. trying to coerce "character" values to "double") will likely result in an error or potentially unexpected behaviour, so use with care.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.

Check column names of model output data


Checks that a tibble/data.frame of data read in from the file being validated contains the expected task ID and standard column names according to the round configuration being validated against.


check_tbl_colnames(tbl, round_id, file_path, hub_path = ".")



a tibble/data.frame of the contents of the file being validated.


character string. The round identifier.


character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_error>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.

Check derived task ID columns contain valid values


This check validates that values in any derived task ID columns match the accepted values for each derived task ID in the config. Given the dependence of derived task IDs on the values of other task IDs, it ignores the combinations of derived task ID values with those of other task IDs and focuses only on identifying values that do not match the accepted values.


  derived_task_ids = get_hub_derived_task_ids(hub_path, round_id)



a tibble/data.frame of the contents of the file being validated. Column types must all be character.


character string. The round identifier.


character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Character vector of derived task ID names (task IDs whose values depend on other task IDs) to ignore. Columns for such task IDs will contain NAs. Defaults to extracting derived task IDs from hub tasks.json. See get_hub_derived_task_ids() for more details.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

If no derived_task_ids are specified, the check is skipped and a <message/check_info> condition class object is returned.

Returned object also inherits from subclass ⁠<hub_check>⁠.
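The core comparison, values in a derived task ID column against the accepted values, can be sketched in base R. The column name, data and accepted values are assumptions for illustration; the package reads accepted values from the hub config.

```r
tbl <- data.frame(
  target_end_date = c("2022-10-15", "2022-10-22", "2022-11-31"),
  stringsAsFactors = FALSE
)
# Assumed accepted values for the derived task ID, as if read from the config
accepted <- list(target_end_date = c("2022-10-15", "2022-10-22"))

# For each derived task ID, collect the values that are not accepted
invalid_vals <- lapply(names(accepted), function(id) {
  setdiff(unique(tbl[[id]]), accepted[[id]])
})
names(invalid_vals) <- names(accepted)

invalid_vals$target_end_date # "2022-11-31"
```

Each column is checked on its own, which matches the description above: combinations with other task IDs are deliberately not considered.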

Check model output data tbl round ID matches submission round ID.


Check model output data tbl round ID matches submission round ID.


check_tbl_match_round_id(tbl, file_path, hub_path, round_id_col = NULL)



a tibble/data.frame of the contents of the file being validated.


character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Character string. The name of the column containing round IDs. Usually the value of the round_id property of the round in the hub's tasks.json config file. Defaults to NULL, in which case it is determined from the config if applicable.


This check only applies to files being submitted to rounds where round_id_from_variable: true or where a round_id_col name is explicitly provided. Skipped otherwise.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_error>⁠ condition class object.

If round_id_from_variable: false and no round_id_col name is provided, the check is skipped and a <message/check_info> condition class object is returned. If no valid round_id_col name is provided or can be extracted from the config (checked through check_valid_round_id_col), an <error/check_error> condition class object is returned and the rest of the check is skipped.
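The comparison at the heart of this check can be sketched in base R. The column name and values are illustrative assumptions, not taken from a real hub.

```r
tbl <- data.frame(
  origin_date = c("2022-10-08", "2022-10-08", "2022-10-15"),
  stringsAsFactors = FALSE
)
round_id_col <- "origin_date" # assumed name of the round ID column
round_id <- "2022-10-08"      # round the file is being submitted to

# Values in the round ID column that differ from the submission round ID
mismatched <- setdiff(unique(tbl[[round_id_col]]), round_id)
mismatched                # "2022-10-15"
length(mismatched) == 0L  # FALSE: the table contains another round ID
```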

Check model data rows are all unique


Checks that combinations of task ID, output type and output type ID values are unique, by checking that there are no duplicate rows across all tbl columns excluding the value column.


check_tbl_rows_unique(tbl, file_path, hub_path)



a tibble/data.frame of the contents of the file being validated. Column types must all be character.


character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.
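The idea of ignoring the value column when looking for duplicates can be sketched with base R's duplicated(). The data are illustrative and this is not the package implementation.

```r
tbl <- data.frame(
  location = c("US", "US", "US"),
  output_type = c("quantile", "quantile", "quantile"),
  output_type_id = c("0.5", "0.5", "0.9"),
  value = c("10", "12", "15"),
  stringsAsFactors = FALSE
)

# Compare rows on every column except `value`
key_cols <- setdiff(names(tbl), "value")
dup_rows <- which(duplicated(tbl[, key_cols, drop = FALSE]))
dup_rows # 2
```

Row 2 repeats the task ID, output type and output type ID combination of row 1 with a different value, which is exactly the situation the check is meant to catch.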

Check model output data tbl sample compound task id sets for each modeling task match or are coarser than the expected set defined in the config.


This check detects the compound task ID sets of samples, implied by the output_type_id and task ID values, and checks them for internal consistency and compliance with the compound_taskid_set defined for each round modeling task in the tasks.json config.


  derived_task_ids = get_hub_derived_task_ids(hub_path)



a tibble/data.frame of the contents of the file being validated. Column types must all be character.


character string. The round identifier.


character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Character vector of derived task ID names (task IDs whose values depend on other task IDs) to ignore. Columns for such task IDs will contain NAs. Defaults to extracting derived task IDs from hub tasks.json. See get_hub_derived_task_ids() for more details.


If the check fails, the output of the check includes an errors element, a list of items, one for each modeling task failing validation. The structure depends on the reason the check failed.

If the check failed because more than a single unique compound_taskid_set was found for a given model task, the errors object will be a list with one element for each compound_taskid_set detected and will have the following structure:

  • tbl_comp_tids: a compound task ID set detected in the tbl.

  • output_type_ids: The output type ID of the sample that does not contain a single, unique value for each compound task ID.

If the check failed because task IDs that are not allowed in the config were identified as compound task IDs (i.e. samples describe "finer" compound modeling tasks) for a given model task, the errors object will be a list with the structure described above as well as the following additional elements:

  • config_comp_tids: the allowed compound_taskid_set defined in the modeling task config.

  • invalid_tbl_comp_tids: the names of invalid compound task IDs.

The name of each element is the index (mt_id) identifying the config modeling task the sample is associated with. See hubverse documentation on samples for more details.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_error>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.

Check model output data tbl samples contain single unique values for each compound task ID within individual samples


Check model output data tbl samples contain single unique values for each compound task ID within individual samples


  compound_taskid_set = NULL,
  derived_task_ids = get_hub_derived_task_ids(hub_path, round_id)



a tibble/data.frame of the contents of the file being validated. Column types must all be character.


character string. The round identifier.


character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


a list of compound_taskid_sets (character vectors of compound task IDs), one for each modeling task. Used to override the compound task ID set in the config file, for example, when validating coarser samples.


Character vector of derived task ID names (task IDs whose values depend on other task IDs) to ignore. Columns for such task IDs will contain NAs. Defaults to extracting derived task IDs from hub tasks.json. See get_hub_derived_task_ids() for more details.


Output of the check includes an errors element, a list of items, one for each sample failing validation, with the following structure:

  • mt_id: Index identifying the config modeling task the sample is associated with.

  • output_type_id: The output type ID of the sample that does not contain a single, unique value for each compound task ID.

  • values: The unique values of each compound task ID.

See hubverse documentation on samples for more details.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_error>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.
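The per-sample condition, a single unique value for each compound task ID, can be sketched in base R. The data and the choice of location as the only compound task ID are assumptions for illustration; this is not the package implementation.

```r
tbl <- data.frame(
  output_type_id = c("s1", "s1", "s2", "s2"),
  location = c("US", "US", "US", "CA"),
  horizon = c("1", "2", "1", "2"),
  stringsAsFactors = FALSE
)
compound_taskids <- "location" # assumed compound task ID set

# Split the compound task ID columns by sample (output_type_id)
per_sample <- split(tbl[, compound_taskids, drop = FALSE], tbl$output_type_id)

# A sample is valid if every compound task ID column holds one unique value
is_valid <- vapply(
  per_sample,
  function(spl) {
    all(vapply(spl, function(col) length(unique(col)) == 1L, logical(1)))
  },
  logical(1)
)

names(is_valid)[!is_valid] # "s2"
```

Sample s2 spans two locations, so it does not describe a single compound modeling task.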

Check model output data tbl samples contain the appropriate number of samples for a given compound idx.


Check model output data tbl samples contain the appropriate number of samples for a given compound idx.


  compound_taskid_set = NULL,
  derived_task_ids = get_hub_derived_task_ids(hub_path, round_id)



a tibble/data.frame of the contents of the file being validated. Column types must all be character.


character string. The round identifier.


character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


a list of compound_taskid_sets (character vectors of compound task IDs), one for each modeling task. Used to override the compound task ID set in the config file, for example, when validating coarser samples.


Character vector of derived task ID names (task IDs whose values depend on other task IDs) to ignore. Columns for such task IDs will contain NAs. Defaults to extracting derived task IDs from the hub's tasks.json. See get_hub_derived_task_ids() for more details.


Output of the check includes an errors element, a list of items, one for each compound_idx failing validation, with the following structure:

  • compound_idx: the compound idx that failed validation of number of samples.

  • n: the number of samples counted for the compound idx.

  • min_samples_per_task: the minimum number of samples required for the compound idx.

  • max_samples_per_task: the maximum number of samples allowed for the compound idx.

  • compound_idx_tbl: a tibble of the expected structure for samples belonging to the compound idx. See hubverse documentation on samples for more details.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.
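
The counting rule described above can be sketched in base R. This is a minimal illustration only, not the package's implementation: the toy table, the use of `location` as the sole compound task ID, and the limits `min_samples_per_task = 2` and `max_samples_per_task = 3` are all assumptions made for the example.

```r
# Toy sample data: `location` stands in for the compound task ID set
# (assumption), so each distinct location is one compound idx.
tbl <- data.frame(
  location = c("US", "US", "US", "01"),
  output_type = "sample",
  output_type_id = c("s1", "s2", "s3", "s4"),
  value = c("10", "12", "11", "9"),
  stringsAsFactors = FALSE
)

# Hypothetical limits, standing in for values read from the hub config.
min_samples_per_task <- 2L
max_samples_per_task <- 3L

# Count distinct sample IDs within each compound idx.
n <- tapply(tbl$output_type_id, tbl$location, function(x) length(unique(x)))

# Keep the compound idxs whose sample count falls outside the allowed range.
failing <- n[n < min_samples_per_task | n > max_samples_per_task]
failing
```

Here only location "01" is flagged, because it has a single sample where at least two are expected.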

Check model output data tbl samples contain a single unique combination of non-compound task ID values across all samples.


Check model output data tbl samples contain a single unique combination of non-compound task ID values across all samples.


check_tbl_spl_non_compound_tid(
  tbl,
  round_id,
  file_path,
  hub_path,
  compound_taskid_set = NULL,
  derived_task_ids = get_hub_derived_task_ids(hub_path, round_id)
)



a tibble/data.frame of the contents of the file being validated. Column types must all be character.


character string. The round identifier.


character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


a list of compound_taskid_sets (character vectors of compound task IDs), one for each modeling task. Used to override the compound task ID set in the config file, for example, when validating coarser samples.


Character vector of derived task ID names (task IDs whose values depend on other task IDs) to ignore. Columns for such task IDs will contain NAs. Defaults to extracting derived task IDs from the hub's tasks.json. See get_hub_derived_task_ids() for more details.


Output of the check includes an errors element, a list of items, one for each modeling task containing samples failing validation, with the following structure:

  • mt_id: Index identifying the config modeling task the samples are associated with.

  • output_type_ids: The output type IDs of samples that do not match the most frequent non-compound task ID value combination across all samples in the modeling task.

  • frequent: The most frequent non-compound task ID value combination across all samples in the modeling task to which all samples were compared. See hubverse documentation on samples for more details.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_error>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.

Check model output data tbl contains a single unique round ID.


Check model output data tbl contains a single unique round ID.


check_tbl_unique_round_id(tbl, file_path, hub_path, round_id_col = NULL)



a tibble/data.frame of the contents of the file being validated.


character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Character string. The name of the column containing round_ids. Usually, the value of round property round_id in hub tasks.json config file. Defaults to NULL and determined from the config if applicable.


This check only applies to files being submitted to rounds where round_id_from_variable: true or where a round_id_col name is explicitly provided. Skipped otherwise.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_error>⁠ condition class object.

If round_id_from_variable: false and no round_id_col name is provided, the check is skipped and a <message/check_info> condition class object is returned. If no valid round_id_col name is provided or can be extracted from the config (checked through check_valid_round_id_col), an <error/check_error> condition class object is returned and the rest of the check is skipped.
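
The core test behind this check can be sketched in a few lines of base R. This is an illustrative sketch rather than the package's code; the toy table and the column name `origin_date` are assumptions.

```r
# Toy model output where the round ID column holds two distinct values.
tbl <- data.frame(
  origin_date = c("2022-10-08", "2022-10-08", "2022-10-15"),
  value = c("1", "2", "3"),
  stringsAsFactors = FALSE
)

# Assumed name of the column holding round IDs.
round_id_col <- "origin_date"

# The file is valid only if the column holds exactly one distinct value.
round_ids <- unique(tbl[[round_id_col]])
has_single_round_id <- length(round_ids) == 1L
has_single_round_id
```

In this toy case the test fails because two round IDs appear in the same file.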

Check output type values of model output data against config


Checks that values in the value column of a tibble/data.frame of data read in from the file being validated conform to the configuration for each output type of the appropriate model task.


check_tbl_value_col(
  tbl,
  round_id,
  file_path,
  hub_path,
  derived_task_ids = get_hub_derived_task_ids(hub_path, round_id)
)



a tibble/data.frame of the contents of the file being validated.


character string. The round identifier.


character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Character vector of derived task ID names (task IDs whose values depend on other task IDs) to ignore. Columns for such task IDs will contain NAs. Defaults to extracting derived task IDs from the hub's tasks.json. See get_hub_derived_task_ids() for more details.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.

Check that quantile and cdf output type values of model output data are non-descending


Checks that values in the value column for quantile and cdf output type data for each unique task ID/output type combination are non-descending when arranged by increasing output_type_id order. Check only performed if tbl contains quantile or cdf output type data. If not, the check is skipped and a ⁠<message/check_info>⁠ condition class object is returned.


check_tbl_value_col_ascending(
  tbl,
  file_path,
  hub_path,
  round_id,
  derived_task_ids = get_hub_derived_task_ids(hub_path)
)



a tibble/data.frame of the contents of the file being validated. Column types must all be character.


character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


character string. The round identifier.


Character vector of derived task ID names (task IDs whose values depend on other task IDs) to ignore. Columns for such task IDs will contain NAs. Defaults to extracting derived task IDs from the hub's tasks.json. See get_hub_derived_task_ids() for more details.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.
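
A minimal base R sketch of the non-descending rule follows. It is not the package's implementation; the toy table, the single task ID `location`, and the helper name `is_non_descending()` are assumptions introduced for illustration, and only quantile data with numeric output type IDs is covered.

```r
# Toy quantile output for two task ID combinations (here just `location`).
tbl <- data.frame(
  location = rep(c("US", "01"), each = 3),
  output_type = "quantile",
  output_type_id = rep(c("0.25", "0.5", "0.75"), times = 2),
  value = c(10, 20, 30, 10, 40, 30),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when values never decrease once rows are ordered
# by increasing numeric output_type_id.
is_non_descending <- function(output_type_id, value) {
  ord <- order(as.numeric(output_type_id))
  all(diff(value[ord]) >= 0)
}

# Apply the rule separately to each task ID combination.
groups <- split(tbl, tbl$location)
result <- vapply(
  groups,
  function(g) is_non_descending(g$output_type_id, g$value),
  logical(1)
)
result
```

Location "01" fails because its value drops from 40 to 30 between the 0.5 and 0.75 quantiles.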

Check that pmf output type values of model output data sum to 1.


Checks that values in the value column of pmf output type data for each unique task ID combination sum to 1. Check only performed if tbl contains pmf output type data. If not, the check is skipped and a ⁠<message/check_info>⁠ condition class object is returned.


check_tbl_value_col_sum1(tbl, file_path)



a tibble/data.frame of the contents of the file being validated.


character string. Path to the file being validated relative to the hub's model-output directory.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.
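
The sum-to-1 rule can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the toy table, the single task ID `location`, and the tolerance of `1e-8` are all chosen for the example.

```r
# Toy pmf output for two task ID combinations (here just `location`).
tbl <- data.frame(
  location = c("US", "US", "US", "01", "01", "01"),
  output_type = "pmf",
  output_type_id = rep(c("low", "moderate", "high"), times = 2),
  value = c(0.2, 0.3, 0.5, 0.2, 0.3, 0.4),
  stringsAsFactors = FALSE
)

pmf <- tbl[tbl$output_type == "pmf", ]

# Sum the probabilities within each task ID combination.
sums <- tapply(pmf$value, pmf$location, sum)

# Compare with a tolerance (assumed value) rather than `==` to allow for
# floating point error.
sums_to_one <- abs(sums - 1) < 1e-8
failing <- names(sums_to_one)[!sums_to_one]
failing
```

Only location "01" is flagged, since its probabilities sum to 0.9.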

Check model output data tbl contains valid value combinations


Check model output data tbl contains valid value combinations


check_tbl_values(
  tbl,
  round_id,
  file_path,
  hub_path,
  derived_task_ids = get_hub_derived_task_ids(hub_path, round_id)
)



a tibble/data.frame of the contents of the file being validated. Column types must all be character.


character string. The round identifier.


character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Character vector of derived task ID names (task IDs whose values depend on other task IDs) to ignore. Columns for such task IDs will contain NAs. Defaults to extracting derived task IDs from the hub's tasks.json. See get_hub_derived_task_ids() for more details.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_error>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.

Check all required task ID/output type/output type ID value combinations present in model data.


Check all required task ID/output type/output type ID value combinations present in model data.


check_tbl_values_required(
  tbl,
  round_id,
  file_path,
  hub_path,
  derived_task_ids = get_hub_derived_task_ids(hub_path)
)



a tibble/data.frame of the contents of the file being validated. Column types must all be character.


character string. The round identifier.


character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Character vector of derived task ID names (task IDs whose values depend on other task IDs) to ignore. Columns for such task IDs will contain NAs. Defaults to extracting derived task IDs from the hub's tasks.json. See get_hub_derived_task_ids() for more details.


Note that it is necessary for derived_task_ids to be specified if any of the task IDs with required values have dependent derived task IDs. If this is the case and derived task IDs are not specified, the dependent nature of derived task ID values will result in false validation errors when validating required values.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.
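
The idea of checking for missing required combinations can be sketched in base R. This is a simplified illustration, not the package's algorithm: the toy required values, the two columns used, and the helper name `key()` are assumptions.

```r
# Hypothetical required value combinations, standing in for those derived
# from the hub config.
required <- expand.grid(
  location = c("US", "01"),
  output_type_id = c("0.25", "0.75"),
  stringsAsFactors = FALSE
)

# Toy submission missing one required combination.
submitted <- data.frame(
  location = c("US", "US", "01"),
  output_type_id = c("0.25", "0.75", "0.25"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse each row into a single comparable key.
key <- function(df) paste(df$location, df$output_type_id, sep = "|")

# Required rows that have no match in the submission.
missing <- required[!key(required) %in% key(submitted), ]
missing
```

The single missing row is location "01" with output type ID "0.75".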

Check whether the round_id determined for the submission is valid


Check whether the round_id determined for the submission is valid


check_valid_round_id(round_id, file_path, hub_path = ".")



character string. The round identifier.


character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_error>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.

Check that any round_id_col name provided or extracted from the hub config is valid.


Check that any round_id_col name provided or extracted from the hub config is valid.


check_valid_round_id_col(tbl, file_path, hub_path, round_id_col = NULL)



a tibble/data.frame of the contents of the file being validated.


character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Character string. The name of the column containing round_ids. Usually, the value of round property round_id in hub tasks.json config file. Defaults to NULL and determined from the config if applicable.


This check only applies to files being submitted to rounds where round_id_from_variable: true or where a round_id_col name is explicitly provided. Skipped otherwise.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

If round_id_from_variable: false and no round_id_col name is provided, check is skipped and a ⁠<message/check_info>⁠ condition class object is returned. Returned object also inherits from subclass ⁠<hub_check>⁠.

Concatenate hub_validations S3 class objects


Concatenate hub_validations S3 class objects





hub_validations S3 class objects to be concatenated.


a hub_validations S3 class object.

Create a custom validation check function template file.


Create a custom validation check function template file.


create_custom_check(
  name,
  hub_path = ".",
  r_dir = "src/validations/R",
  error = FALSE,
  conditional = FALSE,
  error_object = FALSE,
  config = FALSE,
  extra_args = FALSE,
  overwrite = FALSE
)



Character string. Name of the custom check function. We recommend following the hubValidations package naming convention. For more details, consult the article on writing custom check functions.


Character string. Path to the hub directory. Default is the current working directory.


Character string. Path (relative to hub_path) to the directory the custom check function file will be written to. Default is src/validations/R which is the recommended directory for storing custom check functions.


Logical. Defaults to FALSE, which will return a <error/check_failure> class object in the case of a failed check. Set this to TRUE if your custom check function is required to pass for other custom checks to be performed; in the case of a failed check, the custom check function will then return an <error/check_error> class object and cause custom validations to return early. Note that in the case of custom validations, execution errors in custom functions will also result in custom validations returning early.


Logical. If TRUE, the custom check function template will include a block of code to check a condition before running the check. This is useful when a check may need to be skipped based on a condition.


Logical. If TRUE, the custom check function template will include an error object that can be used to store additional information about the properties of the object being checked that caused check failure. For example, it could store the index of rows in a tbl that caused a check failure.


Logical. If TRUE, the custom check function template will include hub_path as a function argument and a block of code for reading in the hub tasks.json config file.


Logical. If TRUE, the custom check function template will include an extra_arg template function argument and template block of code to check the input arguments of the custom check function.


Logical. If TRUE, the function will overwrite an existing custom check function file.


See the article on writing custom check functions for more.


Invisible TRUE if the custom check function file is created successfully.


  # Create the custom check file with default settings.
  create_custom_check("check_default")
  cat(readLines("src/validations/R/check_default.R"), sep = "\n")

  # Create fully featured custom check file.
  create_custom_check("check_full",
    error = TRUE, conditional = TRUE,
    error_object = TRUE, config = TRUE,
    extra_args = TRUE
  )
  cat(readLines("src/validations/R/check_full.R"), sep = "\n")

Create expanded grid of valid task ID and output type value combinations


Create expanded grid of valid task ID and output type value combinations


expand_model_out_grid(
  config_tasks,
  round_id,
  required_vals_only = FALSE,
  force_output_types = FALSE,
  all_character = FALSE,
  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer",
    "logical", "Date"),
  as_arrow_table = FALSE,
  bind_model_tasks = TRUE,
  include_sample_ids = FALSE,
  compound_taskid_set = NULL,
  output_types = NULL,
  derived_task_ids = get_config_derived_task_ids(config_tasks, round_id)
)



a list version of the contents of a hub's tasks.json config file, accessed through the "config_tasks" attribute of a <hub_connection> object or function hubUtils::read_config().


Character string. Round identifier. If the round is set to round_id_from_variable: true, IDs are values of the task ID defined in the round's round_id property of config_tasks. Otherwise should match round's round_id value in config. Ignored if hub contains only a single round.


Logical. Whether to return only combinations of Task ID and related output type ID required values.


Logical. Whether to force all output types to be required. If TRUE, all output type ID values are treated as required regardless of the value of the is_required property. Useful for creating grids of required values for optional output types.


Logical. Whether to return all columns as character.


character string. One of "from_config", "auto", "character", "double", "integer", "logical", "Date". Defaults to "from_config", which uses the setting in the output_type_id_datatype property in the tasks.json config file if available. If the property is not set in the config, the argument falls back to "auto", which determines the output_type_id data type automatically from the tasks.json config file as the simplest data type required to represent all output type ID values across all output types in the hub. When only point estimate output types (where output_type_ids are NA) are being collected by a hub, the output_type_id column is assigned a character data type when auto-determined. Other data type values can be used to override automatic determination. Note that attempting to coerce output_type_id to a data type that is not valid for the data (e.g. trying to coerce "character" values to "double") will likely result in an error or potentially unexpected behaviour, so use with care.


Logical. Whether to return an arrow table. Defaults to FALSE.


Logical. Whether to bind expanded grids of values from multiple modeling tasks into a single tibble/arrow table or return a list.


Logical. Whether to include sample identifiers in the output_type_id column.


List of character vectors, one for each modeling task in the round. Can be used to override the compound task ID set defined in the config. If NULL is provided for a given modeling task, a compound task ID set of all task IDs is used.


Character vector of output type names to include. Use to subset for grids for specific output types.


Character vector of derived task ID names (task IDs whose values depend on other task IDs) to ignore. Columns for such task IDs will contain NAs. Defaults to extracting derived task IDs from config_tasks. See get_config_derived_task_ids() for more details.


When a round is set to round_id_from_variable: true, the value of the task ID from which round IDs are derived (i.e. the task ID specified in round_id property of config_tasks) is set to the value of the round_id argument in the returned output.

When sample output types are included in the output and include_sample_ids = TRUE, the output_type_id column contains example sample indexes which are useful for identifying the compound task ID structure of multivariate sampling distributions in particular, i.e. which combinations of task ID values represent individual samples.


If bind_model_tasks = TRUE (default) a tibble or arrow table containing all possible task ID and related output type ID value combinations. If bind_model_tasks = FALSE, a list containing a tibble or arrow table for each round modeling task.

Columns are coerced to data types according to the hub schema, unless all_character = TRUE. If all_character = TRUE, all columns are returned as character which can be faster when large expanded grids are expected. If required_vals_only = TRUE, values are limited to the combinations of required values only.

Note that if required_vals_only = TRUE and an optional output type is requested through output_types, a zero row grid will be returned. If all output types are requested however (i.e. when output_types = NULL) and they are all optional, a grid of required task ID values only will be returned. However, whenever force_output_types = TRUE, all output types are treated as required.
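
The notion of an expanded grid can be illustrated with base R's expand.grid(). This is a conceptual sketch only, not how expand_model_out_grid() is implemented; the task IDs, their values, and the quantile levels are assumptions made for the example.

```r
# Hypothetical task ID values for one modeling task.
task_ids <- list(
  origin_date = "2022-10-29",
  location = c("US", "01"),
  horizon = c("1", "2")
)

# Hypothetical output type IDs for a quantile output type.
quantile_ids <- c("0.25", "0.5", "0.75")

# Every combination of task ID values and output type ID values,
# returned with all columns as character.
grid <- expand.grid(
  c(task_ids, list(output_type = "quantile", output_type_id = quantile_ids)),
  stringsAsFactors = FALSE
)
nrow(grid)
```

With one origin date, two locations, two horizons and three quantile levels, the grid has 12 rows.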


hub_con <- hubData::connect_hub(
  system.file("testhubs/flusight", package = "hubUtils")
)
config_tasks <- attr(hub_con, "config_tasks")
expand_model_out_grid(config_tasks, round_id = "2023-01-02")
expand_model_out_grid(
  config_tasks,
  round_id = "2023-01-02",
  required_vals_only = TRUE
)
# Specifying a round in a hub with multiple round configurations.
hub_con <- hubData::connect_hub(
  system.file("testhubs/simple", package = "hubUtils")
)
config_tasks <- attr(hub_con, "config_tasks")
expand_model_out_grid(config_tasks, round_id = "2022-10-01")
# Later round_id maps to round config that includes additional task ID 'age_group'.
expand_model_out_grid(config_tasks, round_id = "2022-10-29")
# Coerce all columns to character
expand_model_out_grid(
  config_tasks,
  round_id = "2022-10-29",
  all_character = TRUE
)
# Return arrow table
expand_model_out_grid(
  config_tasks,
  round_id = "2022-10-29",
  all_character = TRUE,
  as_arrow_table = TRUE
)
# Hub with sample output type
config_tasks <- read_config_file(system.file("config", "tasks.json",
  package = "hubValidations"
))
expand_model_out_grid(
  config_tasks,
  round_id = "2022-12-26"
)
# Include sample IDs
expand_model_out_grid(
  config_tasks,
  round_id = "2022-12-26",
  include_sample_ids = TRUE
)
# Hub with sample output type and compound task ID structure
config_tasks <- read_config_file(
  system.file("config", "tasks-comp-tid.json", package = "hubValidations")
)
expand_model_out_grid(
  config_tasks,
  round_id = "2022-12-26",
  include_sample_ids = TRUE
)
# Override config compound task ID set
# Create coarser compound task ID set for the first modeling task which contains
# samples
expand_model_out_grid(
  config_tasks,
  round_id = "2022-12-26",
  include_sample_ids = TRUE,
  compound_taskid_set = list(
    c("forecast_date", "target"),
    NULL
  )
)
expand_model_out_grid(
  config_tasks,
  round_id = "2022-12-26",
  include_sample_ids = TRUE,
  compound_taskid_set = list(NULL, NULL)
)
# Subset output types
config_tasks <- read_config(
  system.file("testhubs", "samples", package = "hubValidations")
)
expand_model_out_grid(
  config_tasks,
  round_id = "2022-10-29",
  include_sample_ids = TRUE,
  bind_model_tasks = FALSE,
  output_types = c("sample", "pmf")
)
expand_model_out_grid(
  config_tasks,
  round_id = "2022-10-29",
  include_sample_ids = TRUE,
  bind_model_tasks = TRUE,
  output_types = "sample"
)
# Ignore derived task IDs
expand_model_out_grid(
  config_tasks,
  round_id = "2022-10-29",
  include_sample_ids = TRUE,
  bind_model_tasks = FALSE,
  output_types = "sample",
  derived_task_ids = "target_end_date"
)
# Return only required values
hub_path <- system.file("testhubs", "v4", "simple", package = "hubUtils")
config_tasks <- read_config(hub_path)
# Return required output types and output type IDs only
expand_model_out_grid(
  config_tasks = config_tasks,
  round_id = "2022-10-22",
  required_vals_only = TRUE
)
# Force all output types to be required
expand_model_out_grid(
  config_tasks = config_tasks,
  round_id = "2022-10-22",
  required_vals_only = TRUE,
  force_output_types = TRUE
)
# Sub-setting for an optional output type returns an empty data frame
expand_model_out_grid(
  config_tasks = config_tasks,
  round_id = "2022-10-22",
  output_types = "mean",
  required_vals_only = TRUE
)
# force_output_types on an optional output type forces all output_type_id values
# to be required
expand_model_out_grid(
  config_tasks = config_tasks,
  round_id = "2022-10-22",
  output_types = "mean",
  required_vals_only = TRUE,
  force_output_types = TRUE
)
# Ignore derived task IDs
hub_path <- system.file("testhubs", "v4", "flusight", package = "hubUtils")
config_tasks <- read_config(hub_path)
# Defaults to using derived_task_ids from config
expand_model_out_grid(config_tasks, round_id = "2023-05-08")
# Can be overridden by argument derived_task_ids
expand_model_out_grid(
  config_tasks,
  round_id = "2023-05-08",
  derived_task_ids = NULL
)

Get hub configuration fields from a ⁠<config>⁠ class object


Get hub configuration fields from a ⁠<config>⁠ class object


get_config_derived_task_ids(config_tasks, round_id = NULL)



a list version of the contents of a hub's tasks.json config file, accessed through the "config_tasks" attribute of a <hub_connection> object or function hubUtils::read_config().


Character string. Round identifier. If the round is set to round_id_from_variable: true, IDs are values of the task ID defined in the round's round_id property of config_tasks. Otherwise should match round's round_id value in config. Ignored if hub contains only a single round.


  • get_config_derived_task_ids: character vector of hub or round level derived task ID names. If round_id is NULL or the round does not have a round level derived_task_ids setting, returns the hub level derived_task_ids setting.


  • get_config_derived_task_ids(): Get the hub or round level derived_task_ids.


hub_path <- system.file("testhubs/v4/flusight", package = "hubUtils")
config_tasks <- read_config(hub_path)
get_config_derived_task_ids(config_tasks, round_id = "2023-05-08")

Detect the compound_taskid_set for a tbl for each modeling task in a given round.


Detect the compound_taskid_set for a tbl for each modeling task in a given round.


get_tbl_compound_taskid_set(
  tbl,
  config_tasks,
  round_id,
  compact = TRUE,
  error = TRUE,
  derived_task_ids = get_config_derived_task_ids(config_tasks, round_id)
)



a tibble/data.frame of the contents of the file being validated. Column types must all be character.


a list representation of the tasks.json config file.


Character string. The round ID.


Logical. If TRUE, the output will be compacted to remove NULL elements.


Logical. If TRUE, an error will be thrown if the compound task ID set is not valid. If FALSE and an error is detected, the detected compound task ID set will be returned with error attributes attached.


Character vector of derived task ID names (task IDs whose values depend on other task IDs) to ignore. Columns for such task IDs will contain NAs. Defaults to extracting derived task IDs from config_tasks. See get_config_derived_task_ids() for more details.


A list of vectors of compound task IDs detected in the tbl, one for each modeling task in the round. If compact is TRUE, modeling tasks returning NULL elements will be removed.


hub_path <- system.file("testhubs/samples", package = "hubValidations")
file_path <- "flu-base/2022-10-22-flu-base.csv"
round_id <- "2022-10-22"
tbl <- read_model_out_file(
  file_path = file_path,
  hub_path = hub_path,
  coerce_types = "chr"
)
config_tasks <- read_config(hub_path, "tasks")
get_tbl_compound_taskid_set(tbl, config_tasks, round_id)
get_tbl_compound_taskid_set(tbl, config_tasks, round_id,
  compact = FALSE
)

Get status of a hub check


Get status of a hub check












an object that inherits from class ⁠<hub_check>⁠ to test.


Logical. Is given status of check TRUE?


  • is_success(): Is check success?

  • is_failure(): Is check failure?

  • is_error(): Is check error?

  • is_info(): Is check info?

  • not_pass(): Did check not pass?

  • is_exec_error(): Is exec error?

  • is_exec_warn(): Is exec warning?

  • is_any_error(): Is error or exec error?
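
How class-based status predicates of this kind can work is sketched below in base R. The constructor `toy_check()` and the `toy_*` predicates are hypothetical stand-ins written for this example; they are not the package's functions, and the rule used for `toy_not_pass()` (anything other than success or info) is an assumption.

```r
# Hypothetical constructor: a bare condition object carrying the class names
# used in this manual.
toy_check <- function(type = c(
                        "check_success", "check_failure",
                        "check_error", "check_info"
                      )) {
  type <- match.arg(type)
  base <- if (type %in% c("check_failure", "check_error")) "error" else "message"
  structure(
    list(message = type, call = NULL),
    class = c(type, "hub_check", base, "condition")
  )
}

# Hypothetical predicates: each simply tests class membership.
toy_is_success <- function(x) inherits(x, "check_success")
toy_is_failure <- function(x) inherits(x, "check_failure")
toy_is_error <- function(x) inherits(x, "check_error")
toy_is_info <- function(x) inherits(x, "check_info")
# Assumed rule: a check did not pass unless it is a success or an info.
toy_not_pass <- function(x) !inherits(x, c("check_success", "check_info"))

x <- toy_check("check_failure")
toy_is_failure(x)
toy_not_pass(x)
```

Because the status is encoded in the class vector, the same object can be tested for several statuses without inspecting its message.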

Match model output tbl data to their model tasks in config_tasks.


Split and match model output tbl data to their corresponding model tasks in config_tasks. Useful for performing model task specific checks on model output. For v3 samples, the output_type_id column is set to NA for sample outputs.


match_tbl_to_model_task(
  tbl,
  config_tasks,
  round_id,
  output_types = NULL,
  derived_task_ids = get_config_derived_task_ids(config_tasks, round_id),
  all_character = TRUE
)



a tibble/data.frame of the contents of the file being validated.


a list version of the contents of a hub's tasks.json config file, accessed through the "config_tasks" attribute of a <hub_connection> object or function hubUtils::read_config().


Character string. Round identifier. If the round is set to round_id_from_variable: true, IDs are values of the task ID defined in the round's round_id property of config_tasks. Otherwise should match round's round_id value in config. Ignored if hub contains only a single round.


Character vector of output type names to include. Use to subset for grids for specific output types.


Character vector of derived task ID names (task IDs whose values depend on other task IDs) to ignore. Columns for such task IDs will contain NAs. Defaults to extracting derived task IDs from config_tasks. See get_config_derived_task_ids() for more details.


Logical. Whether to return all columns as character.


A list containing a tbl_df of model output data matched to a model task with one element per round model task.


hub_path <- system.file("testhubs/samples", package = "hubValidations")
tbl <- read_model_out_file(
  file_path = "flu-base/2022-10-22-flu-base.csv",
  hub_path, coerce_types = "chr"
)
config_tasks <- read_config(hub_path, "tasks")
match_tbl_to_model_task(tbl, config_tasks, round_id = "2022-10-22")
match_tbl_to_model_task(tbl, config_tasks,
  round_id = "2022-10-22",
  output_types = "sample"
)

Create new or convert list to hub_validations S3 class object


Create new or convert list to hub_validations S3 class object






named elements to be included. Each element must be an object which inherits from class ⁠<hub_check>⁠.


a list of named elements. Each element must be an object which inherits from class ⁠<hub_check>⁠.


an S3 object of class ⁠<hub_validations>⁠.


  • new_hub_validations(): Create new ⁠<hub_validations>⁠ S3 class object

  • as_hub_validations(): Convert list to ⁠<hub_validations>⁠ S3 class object



hub_path <- system.file("testhubs/simple", package = "hubValidations")
file_path <- "team1-goodmodel/2022-10-08-team1-goodmodel.csv"
new_hub_validations(
  file_exists = check_file_exists(file_path, hub_path),
  file_name = check_file_name(file_path)
)
x <- list(
  file_exists = check_file_exists(file_path, hub_path),
  file_name = check_file_name(file_path)
)
as_hub_validations(x)

Check that submitting team does not exceed maximum number of allowed models per team


Check that submitting team does not exceed maximum number of allowed models per team


opt_check_metadata_team_max_model_n(file_path, hub_path, n_max = 2L)



character string. Path to the file being validated relative to the hub's model-metadata directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Integer. Maximum number of models allowed per team.


Should be deployed as part of validate_model_metadata optional checks.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.
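
The idea behind this check can be illustrated with a minimal base R sketch. It is not the package's implementation: the file names below are hypothetical, and the sketch assumes metadata files are named <team_abbr>-<model_abbr> with a yml or yaml extension.

```r
# Hypothetical metadata file names found in a hub's model-metadata directory.
metadata_files <- c(
  "team1-goodmodel.yaml",
  "team1-othermodel.yaml",
  "team1-third.yml",
  "hub-baseline.yml"
)

# Assumption: the team abbreviation is everything before the first hyphen.
team_abbr <- sub("-.*$", "", metadata_files)

# Count models per team and flag teams exceeding the limit.
models_per_team <- table(team_abbr)
n_max <- 2L
over_limit <- names(models_per_team)[models_per_team > n_max]
over_limit
```

Here team1 has three metadata files, so it would be flagged when n_max = 2L.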

Check time difference between values in two date columns equals a defined period.


Check time difference between values in two date columns equals a defined period.


  timediff = lubridate::weeks(2),
  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer",
    "logical", "Date")



a tibble/data.frame of the contents of the file being validated.


character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Character string. The name of the time zero date column.


Character string. The name of the time zero + 1 time step date column.


an object of class lubridate Period and length 1.


character string. One of "from_config", "auto", "character", "double", "integer", "logical", "Date". Defaults to "from_config" which uses the setting in the output_type_id_datatype property in the tasks.json config file if available. If the property is not set in the config, the argument falls back to "auto" which determines the output_type_id data type automatically from the tasks.json config file as the simplest data type required to represent all output type ID values across all output types in the hub. When only point estimate output types (where output_type_ids are NA) are being collected by a hub, the output_type_id column is assigned a character data type when auto-determined. Other data type values can be used to override automatic determination. Note that attempting to coerce output_type_id to a data type that is not valid for the data (e.g. trying to coerce "character" values to "double") will likely result in an error or potentially unexpected behaviour, so use with care.


Should be deployed as part of validate_model_data optional checks.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.
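
The logic of a fixed time difference check can be sketched in base R as follows. This is an illustration only, not the package's code: the column names forecast_date and target_end_date are assumptions, and the two-week period is expressed in days rather than as a lubridate Period.

```r
# Hypothetical submission data; column names are assumptions for illustration.
tbl <- data.frame(
  forecast_date = as.Date(c("2022-10-08", "2022-10-08", "2022-10-15")),
  target_end_date = as.Date(c("2022-10-22", "2022-10-22", "2022-10-28"))
)

# Expected difference of two weeks, expressed in days.
expected_days <- 14

# Difference between the two date columns, in days.
diff_days <- as.numeric(tbl$target_end_date - tbl$forecast_date)

# Rows that do not match the expected period.
invalid_rows <- which(diff_days != expected_days)
check_passed <- length(invalid_rows) == 0
```

In this example the third row differs by 13 days, so the check would fail and report that row.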

Check that predicted values per location are less than total location population.


Check that predicted values per location are less than total location population.


opt_check_tbl_counts_lt_popn(tbl, file_path, hub_path,
  targets = NULL,
  popn_file_path = "auxiliary-data/locations.csv",
  popn_col = "population",
  location_col = "location"
)



a tibble/data.frame of the contents of the file being validated.


character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Either a single target key list or a list of multiple target key lists.


Character string. Path to population data relative to the hub root. Defaults to auxiliary-data/locations.csv.


Character string. The name of the population size column in the population data set.


Character string. The name of the location column. Used to join population data to submission file data. Must be shared by both files.


Should only be applied to rows containing count predictions. Use argument targets to filter tbl data to appropriate count target rows.

Should be deployed as part of validate_model_data optional checks.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.


hub_path <- system.file("testhubs/flusight", package = "hubValidations")
file_path <- "hub-ensemble/2023-05-08-hub-ensemble.parquet"
tbl <- hubValidations::read_model_out_file(file_path, hub_path)
# Single target key list
targets <- list("target" = "wk ahead inc flu hosp")
opt_check_tbl_counts_lt_popn(tbl, file_path, hub_path, targets = targets)
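
As a rough illustration of what the check compares, the base R sketch below joins hypothetical count predictions to hypothetical population data on a shared location column and flags values that are not below the population. The data and the use of merge() are assumptions for illustration, not the package's implementation.

```r
# Hypothetical count predictions and population data sharing a location column.
tbl <- data.frame(
  location = c("01", "01", "02"),
  value = c(120, 6000, 40)
)
popn <- data.frame(
  location = c("01", "02"),
  population = c(5000, 100)
)

# Join population sizes onto the predictions by location.
joined <- merge(tbl, popn, by = "location", all.x = TRUE)

# Predicted counts must be less than the location population.
exceeds <- joined[which(joined$value >= joined$population), ]
check_passed <- nrow(exceeds) == 0
```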

Check time difference between values in two date columns equals a time period defined by values in a horizon column


Check time difference between values in two date columns equals a time period defined by values in a horizon column


  horizon_colname = "horizon",
  timediff = lubridate::weeks(),
  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer",
    "logical", "Date")



a tibble/data.frame of the contents of the file being validated.


character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


Character string. The name of the time zero date column.


Character string. The name of the time zero + 1 time step date column.


Character string. The name of the horizon column. Defaults to "horizon".


an object of class lubridate Period and length 1. The period of a single horizon. Defaults to 1 week.


character string. One of "from_config", "auto", "character", "double", "integer", "logical", "Date". Defaults to "from_config" which uses the setting in the output_type_id_datatype property in the tasks.json config file if available. If the property is not set in the config, the argument falls back to "auto" which determines the output_type_id data type automatically from the tasks.json config file as the simplest data type required to represent all output type ID values across all output types in the hub. When only point estimate output types (where output_type_ids are NA) are being collected by a hub, the output_type_id column is assigned a character data type when auto-determined. Other data type values can be used to override automatic determination. Note that attempting to coerce output_type_id to a data type that is not valid for the data (e.g. trying to coerce "character" values to "double") will likely result in an error or potentially unexpected behaviour, so use with care.


Should be deployed as part of validate_model_data optional checks.


Depending on whether validation has succeeded, one of:

  • ⁠<message/check_success>⁠ condition class object.

  • ⁠<error/check_failure>⁠ condition class object.

Returned object also inherits from subclass ⁠<hub_check>⁠.
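
A horizon-based time difference check can be sketched in base R as below. The column names and data are hypothetical, and the period of a single horizon is assumed to be one week (7 days); the package itself works with lubridate Period objects.

```r
# Hypothetical data: time zero date, horizon and target date columns.
tbl <- data.frame(
  forecast_date = as.Date(c("2023-05-08", "2023-05-08", "2023-05-08")),
  horizon = c(1, 2, 3),
  target_end_date = as.Date(c("2023-05-15", "2023-05-22", "2023-05-30"))
)

# Assumed period of a single horizon: one week, in days.
days_per_horizon <- 7

# Expected target date is time zero plus horizon multiplied by the period.
expected_end <- tbl$forecast_date + tbl$horizon * days_per_horizon

# Rows where the target date does not match the expected date.
invalid_rows <- which(tbl$target_end_date != expected_end)
```

The third row is one day later than forecast_date plus three weeks, so it would be reported as invalid.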

Parse model output file metadata from file name


Parse model output file metadata from file name


parse_file_name(file_path, file_type = c("model_output", "model_metadata"))



Character string. A model output file name. Can include parent directories which are ignored.


Character string. Type of file name being parsed. One of "model_output" or "model_metadata".


File names are allowed to contain the following compression extension prefixes: .snappy, .gzip, .gz, .brotli, .zstd, .lz4, .lzo, .bz2. These extension prefixes are now extracted when parsing the file name and returned as compression_ext element if present.


A list with the following elements:

  • round_id: The round ID the model output is associated with (NA for model metadata files).

  • team_abbr: The team responsible for the model.

  • model_abbr: The name of the model.

  • model_id: The unique model ID derived from the concatenation of ⁠<team_abbr>-<model_abbr>⁠.

  • ext: The file extension.

  • compression_ext: optional. The compression extension if present.
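
To make the returned elements concrete, here is a simplified base R sketch of this kind of parsing. The function parse_name_sketch() is hypothetical and is not the package's implementation; it assumes model output file names of the form <round_id>-<team_abbr>-<model_abbr>.<ext>, with a date-like round ID and no hyphens inside the team or model abbreviations.

```r
# Hypothetical helper illustrating file name parsing; not part of hubValidations.
parse_name_sketch <- function(file_path) {
  # Parent directories are ignored.
  file_name <- basename(file_path)
  compression <- c("snappy", "gzip", "gz", "brotli", "zstd", "lz4", "lzo", "bz2")

  parts <- strsplit(file_name, ".", fixed = TRUE)[[1]]
  ext <- parts[length(parts)]
  stem_parts <- parts[-length(parts)]

  # Detect an optional compression extension prefix, e.g. ".gz" in ".gz.parquet".
  compression_ext <- NULL
  if (length(stem_parts) > 1 && stem_parts[length(stem_parts)] %in% compression) {
    compression_ext <- stem_parts[length(stem_parts)]
    stem_parts <- stem_parts[-length(stem_parts)]
  }
  stem <- paste(stem_parts, collapse = ".")

  # Assumption: <YYYY-MM-DD>-<team_abbr>-<model_abbr>
  pattern <- "^([0-9]{4}-[0-9]{2}-[0-9]{2})-([^-]+)-([^-]+)$"
  m <- regmatches(stem, regexec(pattern, stem))[[1]]
  if (length(m) == 0) stop("Unexpected file name format: ", file_name)

  out <- list(
    round_id = m[2],
    team_abbr = m[3],
    model_abbr = m[4],
    model_id = paste(m[3], m[4], sep = "-"),
    ext = ext
  )
  if (!is.null(compression_ext)) out$compression_ext <- compression_ext
  out
}

parsed <- parse_name_sketch("team1-goodmodel/2022-10-08-team1-goodmodel.csv")
parsed_gz <- parse_name_sketch("2022-10-08-team1-goodmodel.gz.parquet")
```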



Print results of validate_...() function as a bullet list


Print results of validate_...() function as a bullet list


## S3 method for class 'hub_validations'
print(x, ...)



An object of class hub_validations


Unused argument present for class consistency

Print results of validate_pr() function as a bullet list


Print results of validate_pr() function as a bullet list


## S3 method for class 'pr_hub_validations'
print(x, ...)



An object of class pr_hub_validations


Unused argument present for class consistency

Read a model output file


Read a model output file


read_model_out_file(file_path,
  hub_path = ".",
  coerce_types = c("hub", "chr", "none"),
  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer",
    "logical", "Date")
)



character string. Path to the file being validated relative to the hub's model-output directory.


Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


character. What to coerce column types to on read.

  • hub: (default) read in (csv) or coerce (parquet, arrow) to hub schema. When coercing data types using the hub schema, the output_type_id_datatype can also be used to set the output_type_id column data type manually.

  • chr: read in (csv) or coerce (parquet, arrow) all columns to character.

  • none: No coercion. Use arrow ⁠read_*⁠ function defaults.


character string. One of "from_config", "auto", "character", "double", "integer", "logical", "Date". Defaults to "from_config" which uses the setting in the output_type_id_datatype property in the tasks.json config file if available. If the property is not set in the config, the argument falls back to "auto" which determines the output_type_id data type automatically from the tasks.json config file as the simplest data type required to represent all output type ID values across all output types in the hub. When only point estimate output types (where output_type_ids are NA) are being collected by a hub, the output_type_id column is assigned a character data type when auto-determined. Other data type values can be used to override automatic determination. Note that attempting to coerce output_type_id to a data type that is not valid for the data (e.g. trying to coerce "character" values to "double") will likely result in an error or potentially unexpected behaviour, so use with care.


a tibble of contents of the model output file.
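
The effect of reading all columns as character (the "chr" option) versus relying on type guessing can be illustrated with base R alone. This sketch uses read.csv() on hypothetical inline data; the package itself relies on arrow readers, so this is only an analogy for why the choice of coercion matters.

```r
# Hypothetical CSV content with a location code that has a leading zero.
csv_text <- "location,horizon,value\n01,1,10.5\n02,2,3"

# All columns read as character: leading zeros are preserved.
tbl_chr <- read.csv(text = csv_text, colClasses = "character")

# Default type guessing: location codes are converted to integers.
tbl_auto <- read.csv(text = csv_text)

tbl_chr$location
tbl_auto$location
```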

Create a model output submission file template


Create a model output submission file template


submission_tmpl(path, round_id,
  required_vals_only = FALSE,
  force_output_types = FALSE,
  complete_cases_only = TRUE,
  compound_taskid_set = NULL,
  output_types = NULL,
  derived_task_ids = NULL,
  hub_con = deprecated(),
  config_tasks = deprecated()
)



Character string. Can be one of:

  • a path to a local fully configured hub directory

  • a path to a local tasks.json file.

  • a URL to the repository of a fully configured hub on GitHub.

  • a URL to the raw contents of a tasks.json file on GitHub.

  • a ⁠<SubTreeFileSystem>⁠ class object pointing to the root of an S3 cloud hub.

  • a ⁠<SubTreeFileSystem>⁠ class object pointing to a tasks.json config file in an S3 cloud hub, relative to the hub's root directory.

See examples for more details.


Character string. Round identifier. If the round is set to round_id_from_variable: true, IDs are values of the task ID defined in the round's round_id property of config_tasks. Otherwise should match round's round_id value in config. Ignored if hub contains only a single round.


Logical. Whether to return only combinations of Task ID and related output type ID required values.


Logical. Whether to force all output types to be required. If TRUE, all output type ID values are treated as required regardless of the value of the is_required property. Useful for creating grids of required values for optional output types.


Logical. If TRUE (default) and required_vals_only = TRUE, only rows with complete cases of combinations of required values are returned. If FALSE, rows with incomplete cases of combinations of required values are included in the output.


List of character vectors, one for each modeling task in the round. Can be used to override the compound task ID set defined in the config. If NULL is provided for a given modeling task, a compound task ID set of all task IDs is used.


Character vector of output type names to include. Use to subset for grids for specific output types.


Character vector of derived task ID names (task IDs whose values depend on other task IDs) to ignore. Columns for such task ids will contain NAs. If NULL, defaults to extracting derived task IDs from config_tasks or the config_tasks attribute of hub_con. See get_config_derived_task_ids() for more details.


[Deprecated] Use path instead. A ⁠⁠<hub_connection>⁠⁠ class object.


[Deprecated] Use path instead. A list version of the contents of a hub's tasks.json config file, accessed through the "config_tasks" attribute of a ⁠<hub_connection>⁠ object or function read_config().


For task IDs where all values are optional, by default, columns are created as columns of NAs when required_vals_only = TRUE. When such columns exist, the function returns a tibble with zero rows, as no complete cases of required value combinations exist. (Note that determination of complete cases excludes valid NA output_type_id values in "mean" and "median" output types.) To return a template of incomplete required cases, which includes NA columns, use complete_cases_only = FALSE.

To include output types that are optional in the submission template when required_vals_only = TRUE and complete_cases_only = FALSE, use force_output_types = TRUE. Use this in combination with subsetting for the output types you plan to submit via argument output_types to create a submission template customised to your submission plans. Tip: to ensure you create a template with all required output types, it's a good idea to first run the function without subsetting or forcing output types and examine the unique values in output_type to check which output types are required.

When sample output types are included in the output, the output_type_id column contains example sample indexes which are useful for identifying the compound task ID structure of multivariate sampling distributions in particular, i.e. which combinations of task ID values represent individual samples.

When a round is set to round_id_from_variable: true, the value of the task ID from which round IDs are derived (i.e. the task ID specified in round_id property of config_tasks) is set to the value of the round_id argument in the returned output.


a tibble template containing an expanded grid of valid task ID and output type ID value combinations for a given submission round and output type. If required_vals_only = TRUE, values are limited to the combination of required values only.
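
The notion of an "expanded grid" of task ID and output type ID value combinations can be sketched with base R's expand.grid(). The task IDs, output types and values below are hypothetical and hard-coded; the real function derives them from the hub's tasks.json config, so treat this only as an illustration of the shape of the result.

```r
# Hypothetical task ID values for a single modeling task.
task_ids <- list(
  location = c("US", "01"),
  horizon = c(1, 2)
)

# Hypothetical output type ID values per output type (NA for "mean").
output_type_ids <- list(
  quantile = c(0.25, 0.5, 0.75),
  mean = NA_real_
)

# Build one grid per output type, then stack the grids.
grids <- lapply(names(output_type_ids), function(type) {
  expand.grid(
    c(task_ids, list(output_type = type, output_type_id = output_type_ids[[type]])),
    stringsAsFactors = FALSE
  )
})
tmpl <- do.call(rbind, grids)

# Empty value column for submitting teams to fill in.
tmpl$value <- NA_real_
nrow(tmpl)
```

With two locations, two horizons, three quantile levels and one mean row per task ID combination, the sketch yields 2 x 2 x 3 + 2 x 2 x 1 = 16 rows.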


hub_path <- system.file("testhubs/flusight", package = "hubUtils")
submission_tmpl(hub_path, round_id = "2023-01-02")
# Return required values only
submission_tmpl(hub_path,
  round_id = "2023-01-02",
  required_vals_only = TRUE
)
submission_tmpl(hub_path,
  round_id = "2023-01-02",
  required_vals_only = TRUE,
  complete_cases_only = FALSE
)
# Specify a round in a hub with multiple rounds
hub_path <- system.file("testhubs/simple", package = "hubUtils")
submission_tmpl(hub_path, round_id = "2022-10-01")
submission_tmpl(hub_path, round_id = "2022-10-29")
# Subset for a specific output type
hub_path <- system.file("testhubs", "samples", package = "hubValidations")
submission_tmpl(hub_path,
  round_id = "2022-12-17",
  output_types = "sample"
)
# Create a template from the path to a tasks config file
config_path <- system.file("config", "tasks.json",
  package = "hubValidations"
)
submission_tmpl(config_path,
  round_id = "2022-12-26"
)
# Hub with sample output type and compound task ID structure
config_path <- system.file("config", "tasks-comp-tid.json",
  package = "hubValidations"
)
submission_tmpl(config_path,
  round_id = "2022-12-26",
  output_types = "sample"
)
# Override config compound task ID set
# Create coarser compound task ID set for the first modeling task which contains
# samples
submission_tmpl(config_path,
  round_id = "2022-12-26",
  output_types = "sample",
  compound_taskid_set = list(
    c("forecast_date", "target"),
    NULL # assumed: NULL for the remaining modeling task uses all task IDs
  )
)
# Derive a template with ignored derived task ID. Useful to avoid creating
# a template with invalid derived task ID value combinations.
hub_path <- system.file("testhubs", "flusight", package = "hubValidations")
submission_tmpl(hub_path,
  round_id = "2022-12-12",
  output_types = "pmf",
  derived_task_ids = "target_end_date",
  complete_cases_only = FALSE
)
# Force optional output type, in this case "mean".
submission_tmpl(hub_path,
  round_id = "2022-12-12",
  required_vals_only = TRUE,
  output_types = c("pmf", "quantile", "mean"),
  force_output_types = TRUE,
  derived_task_ids = "target_end_date",
  complete_cases_only = FALSE
)
# Create a template from a URL to a fully configured hub repository on GitHub
submission_tmpl(
  path = "",
  round_id = "2022-11-28",
  output_types = "quantile"
)
# Create a template from a URL to the raw contents of a tasks.json file on
# GitHub
config_raw_url <- paste0(
  path = config_raw_url,
  round_id = "2022-11-28",
  output_types = "quantile"

# Create submission file using config file from AWS S3 bucket hub
# Use `s3_bucket()` to create a path to the hub's root directory
s3_hub_path <- arrow::s3_bucket("hubverse/hubutils/testhubs/simple/")
submission_tmpl(
  path = s3_hub_path,
  round_id = "2022-10-01",
  output_types = "quantile"
)
# Use the `path()` method to create a path to the tasks.json file relative to
# the S3 cloud hub's root directory
s3_config_path <- s3_hub_path$path("hub-config/tasks.json")
submission_tmpl(
  path = s3_config_path,
  round_id = "2022-10-01",
  output_types = "quantile"
)

Wrap check expression in try to capture check execution errors


Wrap check expression in try to capture check execution errors


try_check(expr, file_path)



check function expression to run.


character string. Path to the file being validated relative to the hub's model-output directory.


If expr executes correctly, the output of expr is returned. If execution fails, an object of class ⁠<error/check_exec_error>⁠ is returned. The execution error message is attached as attribute msg.
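
The general pattern of capturing an execution error and returning it as an object, rather than letting it halt a validation run, can be sketched with base R's tryCatch(). The function try_check_sketch() and the class name check_exec_error_sketch are hypothetical; they only mimic the behaviour described above and are not the package's implementation.

```r
# Hypothetical wrapper illustrating the pattern; not part of hubValidations.
try_check_sketch <- function(expr) {
  tryCatch(
    # expr is evaluated lazily here, so errors it raises are caught below.
    expr,
    error = function(e) {
      structure(
        list(message = "Check could not be executed.", call = NULL),
        class = c("check_exec_error_sketch", "error", "condition"),
        # Attach the original error message as attribute msg.
        msg = conditionMessage(e)
      )
    }
  )
}

ok <- try_check_sketch(1 + 1)
failed <- try_check_sketch(stop("file not found"))
attr(failed, "msg")
```

Returning the error as a condition object lets the caller store it alongside the results of other checks instead of aborting.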

Validate the contents of a submitted model data file


Validate the contents of a submitted model data file


validate_model_data(hub_path, file_path,
  round_id_col = NULL,
  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer",
    "logical", "Date"),
  validations_cfg_path = NULL,
  derived_task_ids = NULL
)



Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


character string. Path to the file being validated relative to the hub's model-output directory.


Character string. The name of the column containing round_ids. Usually, the value of round property round_id in hub tasks.json config file. Defaults to NULL and determined from the config if applicable.


character string. One of "from_config", "auto", "character", "double", "integer", "logical", "Date". Defaults to "from_config" which uses the setting in the output_type_id_datatype property in the tasks.json config file if available. If the property is not set in the config, the argument falls back to "auto" which determines the output_type_id data type automatically from the tasks.json config file as the simplest data type required to represent all output type ID values across all output types in the hub. When only point estimate output types (where output_type_ids are NA) are being collected by a hub, the output_type_id column is assigned a character data type when auto-determined. Other data type values can be used to override automatic determination. Note that attempting to coerce output_type_id to a data type that is not valid for the data (e.g. trying to coerce "character" values to "double") will likely result in an error or potentially unexpected behaviour, so use with care.


Path to validations.yml file. If NULL defaults to hub-config/validations.yml.


Character vector of derived task ID names (task IDs whose values depend on other task IDs) to ignore. Columns for such task IDs will contain NAs. If NULL, defaults to extracting derived task IDs from the hub's tasks.json. See get_hub_derived_task_ids() for more details.


Note that it is necessary for derived_task_ids to be specified if any task IDs with required values have dependent derived task IDs. If this is the case and derived task IDs are not specified, the dependent nature of derived task ID values will result in false validation errors when validating required values.

Details of checks performed by validate_model_data()

Name | Check | Early return | Fail output | Extra info
file_read | File can be read without errors | TRUE | check_error |
valid_round_id_col | Round ID var from config exists in data column names. Skipped if `round_id_from_var` is FALSE in config. | FALSE | check_failure |
unique_round_id | Round ID column contains a single unique round ID. Skipped if `round_id_from_var` is FALSE in config. | TRUE | check_error |
match_round_id | Round ID from file contents matches round ID from file name. Skipped if `round_id_from_var` is FALSE in config. | TRUE | check_error |
colnames | File column names match expected column names for round (i.e. task ID names + hub standard column names) | TRUE | check_error |
col_types | File column types match expected column types from config. Mainly applicable to parquet & arrow files. | FALSE | check_failure |
valid_vals | Columns (excluding the `value` and any derived task ID columns) contain valid combinations of task ID / output type / output type ID values | TRUE | check_error | error_tbl: table of invalid task ID/output type/output type ID value combinations
derived_task_id_vals | Derived task ID columns contain valid values. | FALSE | check_failure | errors: named list of derived task ID values. Each element contains the invalid values for each derived task ID that failed the check.
rows_unique | Columns (excluding the `value` and any derived task ID columns) contain unique combinations of task ID / output type / output type ID values | FALSE | check_failure |
req_vals | Columns (excluding the `value` and any derived task ID columns) contain all required combinations of task ID / output type / output type ID values | FALSE | check_failure | missing_df: table of missing task ID/output type/output type ID value combinations
value_col_valid | Values in `value` column are coercible to data type configured for each output type | FALSE | check_failure |
value_col_non_desc | Values in `value` column are non-decreasing as output_type_ids increase for all unique task ID / output type value combinations. Applies to `quantile` or `cdf` output types only | FALSE | check_failure | error_tbl: table of rows affected
value_col_sum1 | Values in the `value` column of `pmf` output type data for each unique task ID combination sum to 1. | FALSE | check_failure | error_tbl: table of rows affected
spl_compound_taskid_set | Sample compound task ID sets for each modeling task match or are coarser than the expected set defined in tasks.json config. | TRUE | check_error | errors: list containing item for each failing modeling task. Exact structure dependent on type of validation failure. See check function documentation for more details.
spl_compound_tid | Samples contain single unique values for each compound task ID within individual samples (v3 and above schema only). | TRUE | check_error | errors: list containing item for each sample failing validation with breakdown of unique values for each compound task ID.
spl_non_compound_tid | Samples contain single unique combination of non-compound task ID values across all samples (v3 and above schema only). | TRUE | check_error | errors: list containing item for each modeling task with vectors of output type IDs of samples failing validation and example table of most frequent non-compound task ID value combination across all samples in the modeling task.
spl_n | Number of samples for a given compound idx falls within accepted compound task range (v3 and above schema only). | FALSE | check_failure | errors: list containing item for each compound_idx failing validation with sample count, metadata on expected samples and example table of expected structure for samples belonging to the compound idx in question.


An object of class hub_validations. Each named element contains a hub_check class object reflecting the result of a given check. Function will return early if a check returns an error.

For more details on the structure of ⁠<hub_validations>⁠ objects, including how to access more information on individual checks, see article on ⁠<hub_validations>⁠ S3 class objects.


hub_path <- system.file("testhubs/simple", package = "hubValidations")
file_path <- "team1-goodmodel/2022-10-08-team1-goodmodel.csv"
validate_model_data(hub_path, file_path)
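
Two of the checks listed in the table above, value_col_non_desc and value_col_sum1, lend themselves to a short base R illustration. The data below is hypothetical and uses a single task ID (location) for brevity; this is a sketch of the idea, not the package's implementation.

```r
# Hypothetical quantile rows: values should not decrease as quantile levels increase.
quantiles <- data.frame(
  location = c("US", "US", "US", "01", "01", "01"),
  output_type_id = c(0.25, 0.5, 0.75, 0.25, 0.5, 0.75),
  value = c(10, 20, 30, 5, 4, 8)
)
ordered <- quantiles[order(quantiles$location, quantiles$output_type_id), ]
non_desc <- vapply(
  split(ordered$value, ordered$location),
  function(v) all(diff(v) >= 0),
  logical(1)
)

# Hypothetical pmf rows: values for each task ID combination should sum to 1.
pmf <- data.frame(
  location = c("US", "US", "01", "01"),
  value = c(0.4, 0.6, 0.5, 0.4)
)
sums <- vapply(split(pmf$value, pmf$location), sum, numeric(1))

# Compare with a small tolerance to allow for floating point error.
sums_to_1 <- abs(sums - 1) < 1e-8

non_desc
sums_to_1
```

In both cases location "01" fails: its quantile values dip from 5 to 4, and its pmf values sum to 0.9.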

Validate file level properties of a submitted model output file.


Validate file level properties of a submitted model output file.


validate_model_file(hub_path, file_path, validations_cfg_path = NULL)



Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


character string. Path to the file being validated relative to the hub's model-output directory.


Path to validations.yml file. If NULL defaults to hub-config/validations.yml.


Details of checks performed by validate_model_file()

Name | Check | Early return | Fail output | Extra info
file_exists | File exists at `file_path` provided | TRUE | check_error |
file_name | File name valid | TRUE | check_error |
file_location | File located in correct team directory | FALSE | check_failure |
round_id_valid | File round ID is a valid hub round ID | TRUE | check_error |
file_format | File format is accepted hub/round format | TRUE | check_error |
file_n | Number of submission files per round per team does not exceed allowed number | FALSE | check_failure |
metadata_exists | Model metadata file exists in expected location | FALSE | check_failure |


An object of class hub_validations. Each named element contains a hub_check class object reflecting the result of a given check. Function will return early if a check returns an error.

For more details on the structure of ⁠<hub_validations>⁠ objects, including how to access more information on individual checks, see article on ⁠<hub_validations>⁠ S3 class objects.


hub_path <- system.file("testhubs/simple", package = "hubValidations")
validate_model_file(hub_path,
  file_path = "team1-goodmodel/2022-10-08-team1-goodmodel.csv"
)
validate_model_file(hub_path,
  file_path = "team1-goodmodel/2022-10-15-team1-goodmodel.csv"
)

Validate properties of a metadata file.


Validate properties of a metadata file.


validate_model_metadata(hub_path, file_path,
  round_id = "default",
  validations_cfg_path = NULL
)



Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


character string. Path to the file being validated relative to the hub's model-metadata directory.


character string. The round identifier. Used primarily to indicate whether the "default" or a round specific configuration should be used for custom validations.


Path to validations.yml file. If NULL defaults to hub-config/validations.yml.


Details of checks performed by validate_model_metadata()

Name | Check | Early return | Fail output | Extra info
metadata_schema_exists | A model metadata schema file exists in `hub-config` directory. | TRUE | check_error |
metadata_file_exists | A file with name provided to argument `file_path` exists at the expected location (the `model-metadata` directory). | TRUE | check_error |
metadata_file_ext | The metadata file has correct extension (yaml or yml). | TRUE | check_error |
metadata_file_location | The metadata file has been saved to correct location. | TRUE | check_failure |
metadata_matches_schema | The contents of the metadata file match the hub's model metadata schema | TRUE | check_error |
metadata_file_name | The metadata filename matches the model ID specified in the contents of the file. | TRUE | check_error |


An object of class hub_validations. Each named element contains a hub_check class object reflecting the result of a given check. Function will return early if a check returns an error.


hub_path <- system.file("testhubs/simple", package = "hubValidations")
validate_model_metadata(hub_path,
  file_path = "hub-baseline.yml"
)
validate_model_metadata(hub_path,
  file_path = "team1-goodmodel.yaml"
)

Validate Pull Request


Validates model output and model metadata files in a Pull Request.


  hub_path = ".",
  round_id_col = NULL,
  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer",
    "logical", "Date"),
  validations_cfg_path = NULL,
  skip_submit_window_check = FALSE,
  file_modification_check = c("error", "failure", "warn", "message", "none"),
  allow_submit_window_mods = TRUE,
  submit_window_ref_date_from = c("file", "file_path"),
  derived_task_ids = NULL



Either a character string path to a local Modeling Hub directory or an object of class ⁠<SubTreeFileSystem>⁠ created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details consult the Using cloud storage (S3, GCS) in the arrow package. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


GitHub repository address in the format username/repo


Number of the pull request to validate


Character string. The name of the column containing round_ids. Only required if files contain a column that contains round_id details but has not been configured via round_id_from_variable: true and ⁠round_id:⁠ in the hub tasks.json config file.


character string. One of "from_config", "auto", "character", "double", "integer", "logical", "Date". Defaults to "from_config", which uses the setting in the output_type_id_datatype property in the tasks.json config file if available. If the property is not set in the config, the argument falls back to "auto", which determines the output_type_id data type automatically from the tasks.json config file as the simplest data type required to represent all output type ID values across all output types in the hub. When only point estimate output types (where output_type_ids are NA) are being collected by a hub, the output_type_id column is assigned a character data type when auto-determined. Other data type values can be used to override automatic determination. Note that attempting to coerce output_type_id to a data type that is not valid for the data (e.g. trying to coerce "character" values to "double") will likely result in an error or potentially unexpected behaviour, so use with care.


Path to validations.yml file. If NULL defaults to hub-config/validations.yml.


Logical. Whether to skip the submission window check.


Character string. Whether to perform check and what to return when modification/deletion of a previously submitted model output file or deletion of a previously submitted model metadata file is detected in PR:

  • "error": Appends a ⁠<error/check_error>⁠ condition class object for each applicable modified/deleted file.

  • "failure": Appends a <error/check_failure> condition class object for each applicable modified/deleted file.

  • "message": Appends a ⁠<message/check_info>⁠ condition class object for each applicable modified/deleted file.

  • "none": No modification/deletion checks performed.


Logical. Whether to allow modifications/deletions of model output files within their submission windows. Defaults to TRUE.


Whether to get the reference date around which relative submission windows will be determined from the round ID in the file's file_path ("file_path") or from the file contents themselves ("file"). "file" requires that the file can be read. Only applicable when a round is configured to determine the submission windows relative to the value in a date column in model output files. Not applicable when explicit submission window start and end dates are provided in the hub's config.


Character vector of derived task ID names (task IDs whose values depend on other task IDs) to ignore. Columns for such task IDs will contain NAs. If NULL, defaults to extracting derived task IDs from the hub tasks.json. See get_hub_derived_task_ids() for more details.
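The caution attached to output_type_id_datatype can be illustrated with a minimal base R sketch. It does not use hubValidations or arrow, and the example values (including "large_increase") are made up for illustration; it only shows why forcing a numeric type onto non-numeric output type IDs is unsafe.

```r
# Hypothetical output type IDs: two quantile levels and one categorical level.
output_type_ids <- c("0.25", "0.5", "large_increase")

# Base R coercion turns the non-numeric ID into NA (with a warning),
# which is one form the "unexpected behaviour" can take.
coerced <- suppressWarnings(as.double(output_type_ids))
print(coerced)

# Identify the IDs that were lost by the coercion.
lost <- is.na(coerced) & !is.na(output_type_ids)
print(output_type_ids[lost])
```

Other readers may raise an error instead of returning NA, so the exact symptom depends on how the file is read.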


Only model output and model metadata files are individually validated, using validate_submission() and validate_model_metadata() respectively, although hub config files are also validated as part of the checks. Any other files included in the PR are ignored but flagged in a message.

By default, modifications (which include renaming) and deletions of previously submitted model output files and deletions or renaming of previously submitted model metadata files are not allowed and return a ⁠<error/check_error>⁠ condition class object for each applicable modified/deleted file. This behaviour can be modified through arguments file_modification_check, which controls whether modification/deletion checks are performed and what is returned if modifications/deletions are detected, and allow_submit_window_mods, which controls whether modifications/deletions of model output files are allowed within their submission windows.

Note that when modification/deletion checks are performed and allow_submit_window_mods is TRUE, relative submission windows are established using the round_id extracted from the file path as the reference date (i.e. submit_window_ref_date_from is always set to "file_path"). This is because dates cannot be extracted from the columns of deleted files. If a hub's submission window reference dates do not match the round IDs in file paths, allow_submit_window_mods will currently not work correctly and is best set to FALSE. This only applies to hubs/rounds where submission windows are determined relative to a reference date, not where explicit submission window start and end dates are provided in the config.
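As a rough base R sketch of what using the file path as the reference date involves (not the package's implementation): the helper round_id_from_path() and the day offsets below are hypothetical, and the sketch assumes round IDs are ISO dates at the start of the file name, as in the example file used elsewhere in this manual.

```r
# Hypothetical helper: pull a leading YYYY-MM-DD round ID from a file name.
round_id_from_path <- function(file_path) {
  file_name <- basename(file_path)
  hit <- regmatches(file_name, regexpr("^[0-9]{4}-[0-9]{2}-[0-9]{2}", file_name))
  if (length(hit) == 0L) NA_character_ else hit
}

# Assumed offsets in days relative to the reference date;
# real values come from the hub config.
window_start_offset <- -6L
window_end_offset <- 1L

ref_date <- as.Date(round_id_from_path("team1-goodmodel/2022-10-08-team1-goodmodel.csv"))
submit_window <- ref_date + c(window_start_offset, window_end_offset)

# TRUE when a date falls inside the window (inclusive at both ends).
in_window <- function(date, window) date >= window[1] & date <= window[2]
print(submit_window)
print(in_window(as.Date("2022-10-05"), submit_window))
```

If the round ID in the path is not the date the hub uses as its reference, the window computed this way is wrong, which is why allow_submit_window_mods is best set to FALSE in that case.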

Finally, note that it is necessary for derived_task_ids to be specified if any task IDs with required values have dependent derived task IDs. If this is the case and derived task IDs are not specified, the dependent nature of derived task ID values will result in false validation errors when validating required values.

Checks on model output files

Details of checks performed by validate_submission()

Name | Check | Early return | Fail output | Extra info
valid_config | Hub config valid | TRUE | check_error |
submission_time | Current time within file submission window | FALSE | check_failure |
file_exists | File exists at `file_path` provided | TRUE | check_error |
file_name | File name valid | TRUE | check_error |
file_location | File located in correct team directory | FALSE | check_failure |
round_id_valid | File round ID is a valid hub round ID | TRUE | check_error |
file_format | File format is accepted hub/round format | TRUE | check_error |
file_n | Number of submission files per round per team does not exceed allowed number | FALSE | check_failure |
metadata_exists | Model metadata file exists in expected location | FALSE | check_failure |
file_read | File can be read without errors | TRUE | check_error |
valid_round_id_col | Round ID var from config exists in data column names. Skipped if `round_id_from_variable` is FALSE in config. | FALSE | check_failure |
unique_round_id | Round ID column contains a single unique round ID. Skipped if `round_id_from_variable` is FALSE in config. | TRUE | check_error |
match_round_id | Round ID from file contents matches round ID from file name. Skipped if `round_id_from_variable` is FALSE in config. | TRUE | check_error |
colnames | File column names match expected column names for round (i.e. task ID names + hub standard column names) | TRUE | check_error |
col_types | File column types match expected column types from config. Mainly applicable to parquet & arrow files. | FALSE | check_failure |
valid_vals | Columns (excluding the `value` and any derived task ID columns) contain valid combinations of task ID / output type / output type ID values | TRUE | check_error | error_tbl: table of invalid task ID/output type/output type ID value combinations
derived_task_id_vals | Derived task ID columns contain valid values. | FALSE | check_failure | errors: named list of derived task ID values. Each element contains the invalid values for each derived task ID that failed the check.
rows_unique | Columns (excluding the `value` and any derived task ID columns) contain unique combinations of task ID / output type / output type ID values | FALSE | check_failure |
req_vals | Columns (excluding the `value` and any derived task ID columns) contain all required combinations of task ID / output type / output type ID values | FALSE | check_failure | missing_df: table of missing task ID/output type/output type ID value combinations
value_col_valid | Values in `value` column are coercible to data type configured for each output type | FALSE | check_failure |
value_col_non_desc | Values in `value` column are non-decreasing as output_type_ids increase for all unique task ID / output type value combinations. Applies to `quantile` or `cdf` output types only | FALSE | check_failure | error_tbl: table of rows affected
value_col_sum1 | Values in the `value` column of `pmf` output type data for each unique task ID combination sum to 1. | FALSE | check_failure | error_tbl: table of rows affected
spl_compound_taskid_set | Sample compound task id sets for each modeling task match or are coarser than the expected set defined in tasks.json config. | TRUE | check_error | errors: list containing item for each failing modeling task. Exact structure dependent on type of validation failure. See check function documentation for more details.
spl_compound_tid | Samples contain single unique values for each compound task ID within individual samples (v3 and above schema only). | TRUE | check_error | errors: list containing item for each sample failing validation with breakdown of unique values for each compound task ID.
spl_non_compound_tid | Samples contain single unique combination of non-compound task ID values across all samples (v3 and above schema only). | TRUE | check_error | errors: list containing item for each modeling task with vectors of output type ids of samples failing validation and example table of most frequent non-compound task ID value combination across all samples in the modeling task.
spl_n | Number of samples for a given compound idx falls within accepted compound task range (v3 and above schema only). | FALSE | check_failure | errors: list containing item for each compound_idx failing validation with sample count, metadata on expected samples and example table of expected structure for samples belonging to the compound idx in question.
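To make the value_col_non_desc and value_col_sum1 rows concrete, here is a minimal base R sketch of the properties they assert. It is an illustration under assumptions, not the package's code: the data frames, the single location task ID and the numerical tolerance are all made up for the example.

```r
# Hypothetical quantile rows for two locations (one task ID for simplicity).
quantile_rows <- data.frame(
  location = c("US", "US", "US", "CA", "CA", "CA"),
  output_type_id = c(0.25, 0.5, 0.75, 0.25, 0.5, 0.75),
  value = c(10, 12, 11, 5, 6, 7)
)

# Values must not decrease as the quantile level increases.
is_non_decreasing <- function(df) {
  df <- df[order(df$output_type_id), ]
  all(diff(df$value) >= 0)
}
non_desc <- vapply(
  split(quantile_rows, quantile_rows$location),
  is_non_decreasing, logical(1)
)
print(non_desc)

# Hypothetical pmf rows: probabilities per location should sum to 1.
pmf_rows <- data.frame(
  location = c("US", "US", "US", "CA", "CA", "CA"),
  output_type_id = rep(c("low", "moderate", "high"), 2),
  value = c(0.2, 0.5, 0.3, 0.2, 0.5, 0.2)
)
sums <- tapply(pmf_rows$value, pmf_rows$location, sum)
# Assumed tolerance to absorb floating point error.
sums_to_one <- abs(sums - 1) < 1e-8
print(sums_to_one)
```

In this sketch the US quantiles fail the first property (12 is followed by 11) and the CA probabilities fail the second (they sum to 0.9).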

Checks on model metadata files

Details of checks performed by validate_model_metadata()

Name | Check | Early return | Fail output | Extra info | optional
metadata_schema_exists | A model metadata schema file exists in `hub-config` directory. | TRUE | check_error | | FALSE
metadata_file_exists | A file with name provided to argument `file_path` exists at the expected location (the `model-metadata` directory). | TRUE | check_error | | FALSE
metadata_file_ext | The metadata file has correct extension (yaml or yml). | TRUE | check_error | | FALSE
metadata_file_location | The metadata file has been saved to correct location. | TRUE | check_failure | | FALSE
metadata_matches_schema | The contents of the metadata file match the hub's model metadata schema | TRUE | check_error | | FALSE
metadata_file_name | The metadata filename matches the model ID specified in the contents of the file. | TRUE | check_error | | FALSE
NA | The number of metadata files submitted by a single team does not exceed the maximum number allowed. | FALSE | check_failure | | TRUE


An object of class hub_validations.


## Not run: 
validate_pr(
  hub_path = ".",
  gh_repo = "hubverse-org/ci-testhub-simple",
  pr_number = 3
)

## End(Not run)

Validate a submitted model data file.


Checks both file-level properties (file name, extension, location, etc.) and the model output data, i.e. the contents of the file.


validate_submission(
  hub_path,
  file_path,
  round_id_col = NULL,
  validations_cfg_path = NULL,
  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer",
    "logical", "Date"),
  skip_submit_window_check = FALSE,
  skip_check_config = FALSE,
  submit_window_ref_date_from = c("file", "file_path"),
  derived_task_ids = NULL
)



Either a character string path to a local Modeling Hub directory or an object of class <SubTreeFileSystem> created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details, consult the article "Using cloud storage (S3, GCS)" in the arrow package documentation. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


character string. Path to the file being validated relative to the hub's model-output directory.


Character string. The name of the column containing round_ids. Usually the value of the round_id property of the round in the hub tasks.json config file. Defaults to NULL, in which case it is determined from the config if applicable.


Path to validations.yml file. If NULL defaults to hub-config/validations.yml.


character string. One of "from_config", "auto", "character", "double", "integer", "logical", "Date". Defaults to "from_config", which uses the setting in the output_type_id_datatype property in the tasks.json config file if available. If the property is not set in the config, the argument falls back to "auto", which determines the output_type_id data type automatically from the tasks.json config file as the simplest data type required to represent all output type ID values across all output types in the hub. When only point estimate output types (where output_type_ids are NA) are being collected by a hub, the output_type_id column is assigned a character data type when auto-determined. Other data type values can be used to override automatic determination. Note that attempting to coerce output_type_id to a data type that is not valid for the data (e.g. trying to coerce "character" values to "double") will likely result in an error or potentially unexpected behaviour, so use with care.


Logical. Whether to skip the submission window check.


Logical. Whether to skip the hub config validation check.


Whether to get the reference date around which relative submission windows will be determined from the round ID in the file's file_path ("file_path") or from the file contents themselves ("file"). "file" requires that the file can be read. Only applicable when a round is configured to determine the submission windows relative to the value in a date column in model output files. Not applicable when explicit submission window start and end dates are provided in the hub's config.


Character vector of derived task ID names (task IDs whose values depend on other task IDs) to ignore. Columns for such task IDs will contain NAs. If NULL, defaults to extracting derived task IDs from the hub tasks.json. See get_hub_derived_task_ids() for more details.


Note that it is necessary for derived_task_ids to be specified if any task IDs with required values have dependent derived task IDs. If this is the case and derived task IDs are not specified, the dependent nature of derived task ID values will result in false validation errors when validating required values.
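The following base R sketch illustrates why an unspecified derived task ID can produce false "missing required value" results. It is a toy model of the problem rather than the package's logic, and the task IDs (origin_date, horizon, target_end_date) and the weekly relationship between them are assumptions made for the example.

```r
origin_dates <- as.Date(c("2022-10-08", "2022-10-15"))
horizons <- 1:2

# What a model can actually submit: target_end_date is derived
# from origin_date and horizon (assumed weekly horizons).
submitted <- expand.grid(origin_date = origin_dates, horizon = horizons)
submitted$target_end_date <- submitted$origin_date + 7 * submitted$horizon

# Naive "required" grid that treats target_end_date as independent.
naive_required <- expand.grid(
  origin_date = origin_dates,
  horizon = horizons,
  target_end_date = unique(submitted$target_end_date)
)

# Combinations the naive grid demands but that can never validly exist.
key <- function(df) paste(df$origin_date, df$horizon, df$target_end_date)
falsely_missing <- naive_required[!key(naive_required) %in% key(submitted), ]

print(nrow(submitted))
print(nrow(naive_required))
print(nrow(falsely_missing))
```

Ignoring the derived column removes the impossible combinations from the required grid, which is what specifying derived_task_ids is meant to achieve.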

Details of checks performed by validate_submission()

Name | Check | Early return | Fail output | Extra info
valid_config | Hub config valid | TRUE | check_error |
submission_time | Current time within file submission window | FALSE | check_failure |
file_exists | File exists at `file_path` provided | TRUE | check_error |
file_name | File name valid | TRUE | check_error |
file_location | File located in correct team directory | FALSE | check_failure |
round_id_valid | File round ID is a valid hub round ID | TRUE | check_error |
file_format | File format is accepted hub/round format | TRUE | check_error |
file_n | Number of submission files per round per team does not exceed allowed number | FALSE | check_failure |
metadata_exists | Model metadata file exists in expected location | FALSE | check_failure |
file_read | File can be read without errors | TRUE | check_error |
valid_round_id_col | Round ID var from config exists in data column names. Skipped if `round_id_from_variable` is FALSE in config. | FALSE | check_failure |
unique_round_id | Round ID column contains a single unique round ID. Skipped if `round_id_from_variable` is FALSE in config. | TRUE | check_error |
match_round_id | Round ID from file contents matches round ID from file name. Skipped if `round_id_from_variable` is FALSE in config. | TRUE | check_error |
colnames | File column names match expected column names for round (i.e. task ID names + hub standard column names) | TRUE | check_error |
col_types | File column types match expected column types from config. Mainly applicable to parquet & arrow files. | FALSE | check_failure |
valid_vals | Columns (excluding the `value` and any derived task ID columns) contain valid combinations of task ID / output type / output type ID values | TRUE | check_error | error_tbl: table of invalid task ID/output type/output type ID value combinations
derived_task_id_vals | Derived task ID columns contain valid values. | FALSE | check_failure | errors: named list of derived task ID values. Each element contains the invalid values for each derived task ID that failed the check.
rows_unique | Columns (excluding the `value` and any derived task ID columns) contain unique combinations of task ID / output type / output type ID values | FALSE | check_failure |
req_vals | Columns (excluding the `value` and any derived task ID columns) contain all required combinations of task ID / output type / output type ID values | FALSE | check_failure | missing_df: table of missing task ID/output type/output type ID value combinations
value_col_valid | Values in `value` column are coercible to data type configured for each output type | FALSE | check_failure |
value_col_non_desc | Values in `value` column are non-decreasing as output_type_ids increase for all unique task ID / output type value combinations. Applies to `quantile` or `cdf` output types only | FALSE | check_failure | error_tbl: table of rows affected
value_col_sum1 | Values in the `value` column of `pmf` output type data for each unique task ID combination sum to 1. | FALSE | check_failure | error_tbl: table of rows affected
spl_compound_taskid_set | Sample compound task id sets for each modeling task match or are coarser than the expected set defined in tasks.json config. | TRUE | check_error | errors: list containing item for each failing modeling task. Exact structure dependent on type of validation failure. See check function documentation for more details.
spl_compound_tid | Samples contain single unique values for each compound task ID within individual samples (v3 and above schema only). | TRUE | check_error | errors: list containing item for each sample failing validation with breakdown of unique values for each compound task ID.
spl_non_compound_tid | Samples contain single unique combination of non-compound task ID values across all samples (v3 and above schema only). | TRUE | check_error | errors: list containing item for each modeling task with vectors of output type ids of samples failing validation and example table of most frequent non-compound task ID value combination across all samples in the modeling task.
spl_n | Number of samples for a given compound idx falls within accepted compound task range (v3 and above schema only). | FALSE | check_failure | errors: list containing item for each compound_idx failing validation with sample count, metadata on expected samples and example table of expected structure for samples belonging to the compound idx in question.


An object of class hub_validations. Each named element contains a hub_check class object reflecting the result of a given check. Function will return early if a check returns an error.

For more details on the structure of ⁠<hub_validations>⁠ objects, including how to access more information on individual checks, see article on ⁠<hub_validations>⁠ S3 class objects.


hub_path <- system.file("testhubs/simple", package = "hubValidations")
file_path <- "team1-goodmodel/2022-10-08-team1-goodmodel.csv"
validate_submission(hub_path, file_path)
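As a further sketch of working with the returned object, the snippet below lists the checks that ran and flags those that did not return a check_success condition. It relies only on the object being a named list of hub_check condition objects, as described above. The requireNamespace() guard and the choice to skip the submission window check are conveniences for the example, not requirements.

```r
passed <- NULL
if (requireNamespace("hubValidations", quietly = TRUE)) {
  hub_path <- system.file("testhubs/simple", package = "hubValidations")
  file_path <- "team1-goodmodel/2022-10-08-team1-goodmodel.csv"
  v <- hubValidations::validate_submission(
    hub_path, file_path,
    skip_submit_window_check = TRUE
  )

  # Names of the checks that were run (e.g. "file_exists", "file_name").
  print(names(v))

  # TRUE where a check returned a <message/check_success> condition.
  passed <- vapply(v, function(check) inherits(check, "check_success"), logical(1))

  # Checks that returned anything else (failures, errors or info messages).
  print(passed[!passed])
}
```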

Validate a submitted model data file submission time.


Validate a submitted model data file submission time.


validate_submission_time(hub_path, file_path, ref_date_from = c("file_path", "file"))



Either a character string path to a local Modeling Hub directory or an object of class <SubTreeFileSystem> created using functions s3_bucket() or gs_bucket() by providing a string S3 or GCS bucket name or path to a Modeling Hub directory stored in the cloud. For more details, consult the article "Using cloud storage (S3, GCS)" in the arrow package documentation. The hub must be fully configured with valid admin.json and tasks.json files within the hub-config directory.


character string. Path to the file being validated relative to the hub's model-output directory.


Whether to get the reference date around which relative submission windows will be determined from the round ID in the file's file_path ("file_path") or from the file contents themselves ("file"). "file" requires that the file can be read. Only applicable when a round is configured to determine the submission windows relative to the value in a date column in model output files. Not applicable when explicit submission window start and end dates are provided in the hub's config.


An object of class hub_validations. Each named element contains a hub_check class object reflecting the result of a given check. Function will return early if a check returns an error.

For more details on the structure of ⁠<hub_validations>⁠ objects, including how to access more information on individual checks, see article on ⁠<hub_validations>⁠ S3 class objects.


hub_path <- system.file("testhubs/simple", package = "hubValidations")
file_path <- "team1-goodmodel/2022-10-08-team1-goodmodel.csv"
validate_submission_time(hub_path, file_path)