| Title: | Tools for accessing and working with hubverse data |
|---|---|
| Description: | A set of utility functions for accessing and working with forecast and target data from Infectious Disease Modeling Hubs. |
| Authors: | Anna Krystalli [aut, cre], Li Shandross [ctb], Nicholas G. Reich [ctb], Evan L. Ray [ctb], Consortium of Infectious Disease Modeling Hubs [cph] |
| Maintainer: | Anna Krystalli <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.3.0 |
| Built: | 2024-11-25 10:23:50 UTC |
| Source: | https://github.com/hubverse-org/hubData |
Coerce data.frame/tibble column data types to hub schema data types or character.
```r
coerce_to_hub_schema(
  tbl,
  config_tasks,
  skip_date_coercion = FALSE,
  as_arrow_table = FALSE,
  output_type_id_datatype = c("from_config", "auto", "character", "double",
                              "integer", "logical", "Date")
)

coerce_to_character(tbl, as_arrow_table = FALSE)
```
| Argument | Description |
|---|---|
| `tbl` | A model output data.frame/tibble. |
| `config_tasks` | A list version of the contents of a hub's `tasks.json` config file. |
| `skip_date_coercion` | Logical. Whether to skip coercing dates. This can be faster, especially for larger datasets. |
| `as_arrow_table` | Logical. Whether to return an arrow table. Defaults to `FALSE`. |
| `output_type_id_datatype` | Character string. One of `"from_config"`, `"auto"`, `"character"`, `"double"`, `"integer"`, `"logical"`, `"Date"`. |
`tbl` with column data types coerced to hub schema data types or character. If `as_arrow_table = TRUE`, the output is also converted to an arrow table.
- `coerce_to_hub_schema()`: coerce columns to hub schema data types.
- `coerce_to_character()`: coerce all columns to character.
`collect_hub()` retrieves data from a `<hub_connection>`/`<mod_out_connection>`, after executing any `<arrow_dplyr_query>`, into a local tibble. The function also attempts to convert the output to a `model_out_tbl` class object before returning.
```r
collect_hub(x, silent = FALSE, ...)
```
| Argument | Description |
|---|---|
| `x` | A `<hub_connection>`, `<mod_out_connection>` or `<arrow_dplyr_query>` object. |
| `silent` | Logical. Whether to suppress the message generated if conversion to `model_out_tbl` fails. |
| `...` | Further arguments passed on to `as_model_out_tbl()`. |
A `model_out_tbl`, unless conversion to `model_out_tbl` fails, in which case a `tibble` is returned.
```r
hub_path <- system.file("testhubs/simple", package = "hubUtils")
hub_con <- connect_hub(hub_path)

# Collect all data in a hub
hub_con %>% collect_hub()

# Filter data before collecting
hub_con %>%
  dplyr::filter(is.na(output_type_id)) %>%
  collect_hub()

# Pass arguments to as_model_out_tbl()
dplyr::filter(hub_con, is.na(output_type_id)) %>%
  collect_hub(remove_empty = TRUE)
```
`collect_zoltar()` retrieves data from a zoltardata.com project and transforms it from Zoltar's native download format into a hubverse one. Zoltar (documentation here) is a pre-hubverse research project that implements a repository of model forecast results, including tools to administer, query, and visualize uploaded data, along with R and Python APIs to access data programmatically (zoltr and zoltpy, respectively). (This hubData function is itself implemented using the zoltr package.)
```r
collect_zoltar(
  project_name,
  models = NULL,
  timezeros = NULL,
  units = NULL,
  targets = NULL,
  types = NULL,
  as_of = NULL,
  point_output_type = "median"
)
```
| Argument | Description |
|---|---|
| `project_name` | A string naming the Zoltar project to load forecasts from. Assumes the host is zoltardata.com. |
| `models` | A character vector that specifies the models to query. Must be model abbreviations. Defaults to NULL, which queries all models in the project. |
| `timezeros` | A character vector that specifies the timezeros to query. Must be in yyyy-mm-dd format. Defaults to NULL, which queries all timezeros in the project. |
| `units` | A character vector that specifies the units to query. Must be unit abbreviations. Defaults to NULL, which queries all units in the project. |
| `targets` | A character vector that specifies the targets to query. Must be target names. Defaults to NULL, which queries all targets in the project. |
| `types` | A character vector that specifies the forecast types to query. Choices are "bin", "point", "sample", "quantile", "mean", and "median". Defaults to NULL, which queries all types in the project. Note: while Zoltar supports "named" and "mode" forecasts, this function ignores them. |
| `as_of` | A datetime string that specifies the forecast version. The datetime must include timezone information for disambiguation, without which the query will fail. Defaults to NULL. |
| `point_output_type` | A string that specifies how to convert Zoltar "point" forecasts to hubverse output types. One of "median" (the default) or "mean". |
Zoltar's data model differs from that of the hubverse in a few important ways. While Zoltar's model has the concepts of unit, target, and timezero, hubverse projects have hub-configurable columns, which makes the mapping from the former to the latter imperfect. In particular, Zoltar units translate roughly to hubverse task IDs, Zoltar targets include both the target outcome and numeric horizon in the target name, and Zoltar timezeros map to round IDs. Finally, Zoltar's forecast types differ from those of the hubverse. Whereas Zoltar has eight types (bin, named, point, sample, quantile, mean, median, and mode), the hubverse has six (cdf, mean, median, pmf, quantile, sample), only some of which overlap.
Additional notes:

- Requires the user to have a Zoltar account (use the Zoltar contact page to request one).
- Requires `Z_USERNAME` and `Z_PASSWORD` environment variables to be set to those of the user's Zoltar account (see the sketch after this list).
- While Zoltar supports "named" and "mode" forecasts, this function ignores them.
- Rows with non-numeric values are ignored.
- This function removes numeric_horizon mentions from Zoltar target names. Target names can contain a maximum of one numeric_horizon. Example: "1 wk ahead inc case" -> "wk ahead inc case".
- Querying a large number of rows may cause errors, so we recommend providing one or more filtering arguments (e.g., models, timezeros, etc.) to limit the result.
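A minimal sketch of setting those environment variables for the current R session (the values shown are placeholders, not real credentials):

```r
# Set Zoltar credentials for the current R session only.
# Replace the placeholder values with your own account details,
# or set these variables in ~/.Renviron instead.
Sys.setenv(
  Z_USERNAME = "your-zoltar-username",
  Z_PASSWORD = "your-zoltar-password"
)
```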
A hubverse model_out_tbl containing the following columns: "model_id", "timezero", "season", "unit", "horizon", "target", "output_type", "output_type_id", and "value".
```r
## Not run:
df <- collect_zoltar("Docs Example Project")

df <- collect_zoltar("Docs Example Project",
  models = c("docs_mod"),
  timezeros = c("2011-10-16"),
  units = c("loc1", "loc3"),
  targets = c("pct next week", "cases next week"),
  types = c("point"),
  as_of = NULL,
  point_output_type = "mean"
)
## End(Not run)
```
Connect to data in a model output directory through a Modeling Hub or directly. Data can be stored in a local directory or in the cloud on AWS or GCS.
```r
connect_hub(
  hub_path,
  file_format = c("csv", "parquet", "arrow"),
  output_type_id_datatype = c("from_config", "auto", "character", "double",
                              "integer", "logical", "Date"),
  partitions = list(model_id = arrow::utf8()),
  skip_checks = FALSE
)

connect_model_output(
  model_output_dir,
  file_format = c("csv", "parquet", "arrow"),
  partition_names = "model_id",
  schema = NULL,
  skip_checks = FALSE
)
```
| Argument | Description |
|---|---|
| `hub_path` | Either a character string path to a local Modeling Hub directory or an object of class `SubTreeFileSystem` created using `s3_bucket()` or `gs_bucket()`. |
| `file_format` | The file format model output files are stored in. For a connection to a fully configured hub, this is read from the hub's configuration by default. |
| `output_type_id_datatype` | Character string. One of `"from_config"`, `"auto"`, `"character"`, `"double"`, `"integer"`, `"logical"`, `"Date"`. |
| `partitions` | A named list specifying the arrow data types of any partitioning column. |
| `skip_checks` | Logical. Whether to skip checks on model output files when connecting, which can speed up connection. Defaults to `FALSE`. |
| `model_output_dir` | Either a character string path to a local directory containing model output data or an object of class `SubTreeFileSystem` created using `s3_bucket()` or `gs_bucket()`. |
| `partition_names` | Character vector that defines the field names to which recursive directory names correspond. Defaults to `"model_id"`. |
| `schema` | An arrow::Schema object for the Dataset. If NULL (the default), the schema will be inferred from the data sources. |
`connect_hub()` returns an S3 object of class `<hub_connection>`. `connect_model_output()` returns an S3 object of class `<mod_out_connection>`. Both objects are connected to the data in the model-output directory via an Apache arrow `FileSystemDataset` connection. The connection can be used to extract data using `dplyr` custom queries. The `<hub_connection>` class also contains modeling hub metadata.
- `connect_hub()`: connect to a fully configured Modeling Hub directory.
- `connect_model_output()`: connect directly to a model-output directory. This function can be used to access data directly from an appropriately set up model output directory which is not part of a fully configured hub.
```r
# Connect to a local simple forecasting Hub.
hub_path <- system.file("testhubs/simple", package = "hubUtils")
hub_con <- connect_hub(hub_path)
hub_con
hub_con <- connect_hub(hub_path, output_type_id_datatype = "character")
hub_con

# Connect directly to a local `model-output` directory
mod_out_path <- system.file("testhubs/simple/model-output", package = "hubUtils")
mod_out_con <- connect_model_output(mod_out_path)
mod_out_con

# Query hub_connection for data
library(dplyr)
hub_con %>%
  filter(origin_date == "2022-10-08", horizon == 2) %>%
  collect_hub()
mod_out_con %>%
  filter(origin_date == "2022-10-08", horizon == 2) %>%
  collect_hub()

# Connect to a simple forecasting Hub stored in an AWS S3 bucket.
## Not run:
hub_path <- s3_bucket("hubverse/hubutils/testhubs/simple/")
hub_con <- connect_hub(hub_path)
hub_con
## End(Not run)
```
Create an arrow schema from a `tasks.json` config file. For use when opening an arrow dataset.
```r
create_hub_schema(
  config_tasks,
  partitions = list(model_id = arrow::utf8()),
  output_type_id_datatype = c("from_config", "auto", "character", "double",
                              "integer", "logical", "Date"),
  r_schema = FALSE
)
```
| Argument | Description |
|---|---|
| `config_tasks` | A list version of the contents of a hub's `tasks.json` config file. |
| `partitions` | A named list specifying the arrow data types of any partitioning column. |
| `output_type_id_datatype` | Character string. One of `"from_config"`, `"auto"`, `"character"`, `"double"`, `"integer"`, `"logical"`, `"Date"`. |
| `r_schema` | Logical. If `TRUE`, return a character vector of R data types instead of an arrow schema. Defaults to `FALSE`. |
An arrow schema object that can be used to define column data types when opening model output data. If `r_schema = TRUE`, a character vector of R data types.
```r
hub_path <- system.file("testhubs/simple", package = "hubUtils")
config_tasks <- hubUtils::read_config(hub_path, "tasks")
schema <- create_hub_schema(config_tasks)
```
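Not part of the bundled example, but a plausible follow-on sketch: passing the resulting schema to `connect_model_output()` (documented above), which accepts it via its `schema` argument.

```r
hub_path <- system.file("testhubs/simple", package = "hubUtils")
mod_out_path <- system.file("testhubs/simple/model-output", package = "hubUtils")

# Build a schema from the hub's tasks.json config ...
schema <- create_hub_schema(hubUtils::read_config(hub_path, "tasks"))

# ... and use it so connect_model_output() does not have to infer
# column data types from the files themselves.
mod_out_con <- connect_model_output(mod_out_path, schema = schema)
```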
This function has been moved to the hubValidations package and renamed to `submission_tmpl()`.
```r
create_model_out_submit_tmpl(
  hub_con,
  config_tasks,
  round_id,
  required_vals_only = FALSE,
  complete_cases_only = TRUE
)
```
| Argument | Description |
|---|---|
| `hub_con` | A `<hub_connection>` class object. |
| `config_tasks` | A list version of the contents of a hub's `tasks.json` config file. |
| `round_id` | Character string. Round identifier. If the round is set to `round_id_from_variable: true`, the task ID from which round IDs are derived is set to this value in the output (see Details). |
| `required_vals_only` | Logical. Whether to return only combinations of Task ID and related output type ID required values. |
| `complete_cases_only` | Logical. If `TRUE` (the default), only complete cases of required value combinations are returned. If `FALSE`, incomplete cases (rows containing `NA` columns) are also returned (see Details). |
For task IDs or output_type_ids where all values are optional, by default, columns are included as columns of `NA`s when `required_vals_only = TRUE`. When such columns exist, the function returns a tibble with zero rows, as no complete cases of required value combinations exist. (Note that the determination of complete cases excludes valid `NA` `output_type_id` values in `"mean"` and `"median"` output types.) To return a template of incomplete required cases, which includes `NA` columns, use `complete_cases_only = FALSE`.

When sample output types are included in the output, the `output_type_id` column contains example sample indexes which are useful for identifying the compound task ID structure of multivariate sampling distributions in particular, i.e. which combinations of task ID values represent individual samples.

When a round is set to `round_id_from_variable: true`, the value of the task ID from which round IDs are derived (i.e. the task ID specified in the `round_id` property of `config_tasks`) is set to the value of the `round_id` argument in the returned output.
A tibble template containing an expanded grid of valid task ID and output type ID value combinations for a given submission round and output type. If `required_vals_only = TRUE`, values are limited to the combination of required values only.
This function has been moved to the hubValidations package and renamed to `expand_model_out_grid()`.
```r
expand_model_out_val_grid(
  config_tasks,
  round_id,
  required_vals_only = FALSE,
  all_character = FALSE,
  as_arrow_table = FALSE,
  bind_model_tasks = TRUE,
  include_sample_ids = FALSE
)
```
| Argument | Description |
|---|---|
| `config_tasks` | A list version of the contents of a hub's `tasks.json` config file. |
| `round_id` | Character string. Round identifier. If the round is set to `round_id_from_variable: true`, the task ID from which round IDs are derived is set to this value in the output (see Details). |
| `required_vals_only` | Logical. Whether to return only combinations of Task ID and related output type ID required values. |
| `all_character` | Logical. Whether to return all columns as character. |
| `as_arrow_table` | Logical. Whether to return an arrow table. Defaults to `FALSE`. |
| `bind_model_tasks` | Logical. Whether to bind expanded grids of values from multiple modeling tasks into a single tibble/arrow table or return a list. |
| `include_sample_ids` | Logical. Whether to include sample identifiers in the `output_type_id` column of sample output types. |
When a round is set to `round_id_from_variable: true`, the value of the task ID from which round IDs are derived (i.e. the task ID specified in the `round_id` property of `config_tasks`) is set to the value of the `round_id` argument in the returned output.

When sample output types are included in the output and `include_sample_ids = TRUE`, the `output_type_id` column contains example sample indexes which are useful for identifying the compound task ID structure of multivariate sampling distributions in particular, i.e. which combinations of task ID values represent individual samples.
If `bind_model_tasks = TRUE` (the default), a tibble or arrow table containing all possible task ID and related output type ID value combinations. If `bind_model_tasks = FALSE`, a list containing a tibble or arrow table for each round modeling task.

Columns are coerced to data types according to the hub schema, unless `all_character = TRUE`. If `all_character = TRUE`, all columns are returned as character, which can be faster when large expanded grids are expected.

If `required_vals_only = TRUE`, values are limited to the combinations of required values only.
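No example is bundled for this (now superseded) function; a minimal sketch, assuming the simple test hub and assuming that "2022-10-08" (the `origin_date` used in the `connect_hub()` examples above) is a valid round ID for it:

```r
hub_path <- system.file("testhubs/simple", package = "hubUtils")
config_tasks <- hubUtils::read_config(hub_path, "tasks")

# Expand the full grid of valid value combinations for one round
expand_model_out_val_grid(config_tasks, round_id = "2022-10-08")

# Only required values, returned as all-character columns
expand_model_out_val_grid(
  config_tasks,
  round_id = "2022-10-08",
  required_vals_only = TRUE,
  all_character = TRUE
)
```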
See `arrow::gs_bucket()` for details.
A `SubTreeFileSystem` containing a `GcsFileSystem` and the bucket's relative path. Note that this function's success does not guarantee that you are authorized to access the bucket's contents.
```r
bucket <- gs_bucket("voltrondata-labs-datasets")
```
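The returned bucket can be passed as `hub_path` to `connect_hub()` (documented above). A minimal sketch, assuming a hypothetical GCS bucket and prefix that hosts a hub (the path shown is a placeholder):

```r
## Not run:
# Placeholder bucket/prefix: substitute the GCS location of a real hub
hub_path <- gs_bucket("example-bucket/path/to/hub")
hub_con <- connect_hub(hub_path)
collect_hub(hub_con)
## End(Not run)
```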
Loads in hub model metadata for all models or a specified subset of models and compiles it into a tibble with one row per model.
```r
load_model_metadata(hub_path, model_ids = NULL)
```
| Argument | Description |
|---|---|
| `hub_path` | Either a character string path to a local Modeling Hub directory or an object of class `SubTreeFileSystem` created using `s3_bucket()` or `gs_bucket()`. |
| `model_ids` | A vector of character strings of models for which to load metadata. Defaults to NULL, in which case metadata for all models is loaded. |
A `tibble` with model metadata: one row for each model and one column for each top-level field in the metadata file. For metadata files with nested structures, this tibble may contain list-columns where the entries are lists containing the nested metadata values.
```r
# Load in model metadata from local hub
hub_path <- system.file("testhubs/simple", package = "hubUtils")
load_model_metadata(hub_path)
load_model_metadata(hub_path, model_ids = c("hub-baseline"))
```
Print a `<hub_connection>` or `<mod_out_connection>` S3 class object.
```r
## S3 method for class 'hub_connection'
print(x, verbose = FALSE, ...)

## S3 method for class 'mod_out_connection'
print(x, verbose = FALSE, ...)
```
| Argument | Description |
|---|---|
| `x` | A `<hub_connection>` or `<mod_out_connection>` S3 class object. |
| `verbose` | Logical. Whether to print the full structure of the object. Defaults to `FALSE`. |
| `...` | Further arguments passed to or from other methods. |
- `print(hub_connection)`: print a `<hub_connection>` object.
- `print(mod_out_connection)`: print a `<mod_out_connection>` object.
```r
hub_path <- system.file("testhubs/simple", package = "hubUtils")
hub_con <- connect_hub(hub_path)
hub_con
print(hub_con)
print(hub_con, verbose = TRUE)

mod_out_path <- system.file("testhubs/simple/model-output", package = "hubUtils")
mod_out_con <- connect_model_output(mod_out_path)
print(mod_out_con)
```
See `arrow::s3_bucket()` for details.
A `SubTreeFileSystem` containing an `S3FileSystem` and the bucket's relative path. Note that this function's success does not guarantee that you are authorized to access the bucket's contents.
```r
bucket <- s3_bucket("voltrondata-labs-datasets")

# Turn on debug logging. The following line of code should be run in a fresh
# R session prior to any calls to `s3_bucket()` (or other S3 functions)
Sys.setenv("ARROW_S3_LOG_LEVEL" = "DEBUG")
bucket <- s3_bucket("voltrondata-labs-datasets")
```