| Title: | Tools for accessing and working with hubverse data |
|---|---|
| Description: | A set of utility functions for accessing and working with forecast and target data from Infectious Disease Modeling Hubs. |
| Authors: | Anna Krystalli [aut, cre] (ORCID: <https://orcid.org/0000-0002-2378-4915>), Li Shandross [ctb], Nicholas G. Reich [ctb] (ORCID: <https://orcid.org/0000-0003-3503-9899>), Evan L. Ray [ctb], Becky Sweger [ctb], Dylan H. Morris [ctb] (ORCID: <https://orcid.org/0000-0002-3655-406X>), Consortium of Infectious Disease Modeling Hubs [cph] |
| Maintainer: | Anna Krystalli <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 2.2.1 |
| Built: | 2026-05-15 14:04:42 UTC |
| Source: | https://github.com/hubverse-org/hubData |
A named character vector mapping common arrow::Schema field types (as strings) to their corresponding base R types. This mapping is used to translate or validate column types when working with Parquet files or Arrow datasets, especially for schema inference and compatibility checks.
arrow_to_r_datatypes
A named character vector with 8 entries.
Only the safest and most portable Arrow types are supported in the hubverse. Types not present in this mapping should be treated as unsupported.
| Arrow type | R type | Notes |
|---|---|---|
| bool | logical | |
| int32 | integer | |
| int64 | integer | R supports via Arrow |
| float | double | Promoted to double in R |
| double | double | |
| string | character | |
| date32[day] | Date | |
| timestamp[ms] | POSIXct | Safest timestamp format |
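As a rough illustration of how such a mapping can be used, the sketch below re-creates the table as a base R named character vector and looks up the R types for a few Arrow type strings. The object name `arrow_to_r_sketch` and the example column names are hypothetical; the packaged `arrow_to_r_datatypes` object remains the source of truth.

```r
# Hand-built copy of the table above (not the packaged object itself)
arrow_to_r_sketch <- c(
  bool = "logical",
  int32 = "integer",
  int64 = "integer",
  float = "double",
  double = "double",
  string = "character",
  "date32[day]" = "Date",
  "timestamp[ms]" = "POSIXct"
)

# Hypothetical Arrow type strings for three columns
field_types <- c(location = "string", horizon = "int32", value = "float")

# Translate Arrow type strings to R types, keeping the column names
r_types <- setNames(unname(arrow_to_r_sketch[field_types]), names(field_types))
r_types
```

A type string that is absent from the vector would come back as `NA` in this lookup, which is one simple way to spot unsupported types.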
as_r_schema(), arrow_schema_to_string()
These functions help convert or validate an arrow::Schema object (typically from a Parquet file or Arrow dataset) by translating Arrow types to R equivalents, extracting type strings, or checking for compatibility.
as_r_schema(arrow_schema, call = rlang::caller_env())
arrow_schema_to_string(arrow_schema)
is_supported_arrow_type(arrow_schema)
validate_arrow_schema(arrow_schema, call = rlang::caller_env())
arrow_schema |
An arrow::Schema object, such as one returned by
|
call |
The calling environment, used for error reporting in |
as_r_schema() maps Arrow types to base R types (e.g., "int32" → "integer").
It throws an error if unsupported column types are present.
arrow_schema_to_string() returns a named character vector of raw Arrow
type strings (e.g., "int64", "date32[day]") for each schema field.
is_supported_arrow_type() returns a named logical vector indicating whether
each schema field type is supported.
validate_arrow_schema() throws an error if any field has an unsupported
Arrow type.
For a full list of supported types and their R mappings, see arrow_to_r_datatypes().
as_r_schema(): A named character vector mapping column names to base R type strings
(e.g., "integer", "double", "logical").
arrow_schema_to_string(): A named character vector mapping column names to Arrow type strings.
is_supported_arrow_type(): A named logical vector indicating whether each column is supported.
validate_arrow_schema(): Returns the original schema (invisibly) if all
column types are supported; otherwise throws an error.
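The check-then-validate behaviour described above can be mimicked in base R on a named character vector of Arrow type strings (the shape `arrow_schema_to_string()` is described as returning). This is only a sketch: `is_supported_sketch()` and `validate_sketch()` are hypothetical stand-ins, not the package's implementation, and the supported set is copied by hand from the types listed for `arrow_to_r_datatypes`.

```r
# Arrow type strings listed as supported in the hubverse
supported_types <- c(
  "bool", "int32", "int64", "float", "double",
  "string", "date32[day]", "timestamp[ms]"
)

# Hypothetical stand-in for is_supported_arrow_type(), working on type strings
is_supported_sketch <- function(type_strings) {
  setNames(type_strings %in% supported_types, names(type_strings))
}

# Hypothetical stand-in for validate_arrow_schema(): error on unsupported
# types, otherwise return the input invisibly
validate_sketch <- function(type_strings) {
  ok <- is_supported_sketch(type_strings)
  if (!all(ok)) {
    stop(
      "Unsupported Arrow type(s) in column(s): ",
      paste(names(ok)[!ok], collapse = ", "),
      call. = FALSE
    )
  }
  invisible(type_strings)
}

good <- c(location = "string", value = "double")
bad <- c(location = "string", payload = "list<item: int32>")

is_supported_sketch(bad)
validate_sketch(good)

# Capture the error message instead of stopping the script
result <- tryCatch(validate_sketch(bad), error = function(e) conditionMessage(e))
result
```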
# Path to a single Parquet file
file_path <- system.file(
  "testhubs/parquet/model-output/hub-baseline/2022-10-01-hub-baseline.parquet",
  package = "hubUtils"
)
# Get schema from the file
file_schema <- arrow::read_parquet(file_path, as_data_frame = FALSE)$schema
# Convert to R types
as_r_schema(file_schema)
# Get raw Arrow type strings
arrow_schema_to_string(file_schema)
# Check which columns are supported
is_supported_arrow_type(file_schema)
# Validate schema (throws error if any unsupported types are present)
validate_arrow_schema(file_schema)
# From a multi-file dataset
dataset_path <- system.file(
  "testhubs/parquet/model-output/hub-baseline",
  package = "hubUtils"
)
ds <- arrow::open_dataset(dataset_path)
as_r_schema(ds$schema)
arrow_schema_to_string(ds$schema)
is_supported_arrow_type(ds$schema)
validate_arrow_schema(ds$schema)
Coerce data.frame/tibble column data types to hub schema data types or character.
coerce_to_hub_schema(
  tbl,
  config_tasks,
  skip_date_coercion = FALSE,
  as_arrow_table = FALSE,
  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer", "logical", "Date")
)
coerce_to_character(tbl, as_arrow_table = FALSE)
tbl |
a model output data.frame/tibble |
config_tasks |
a list version of the contents of a hub's |
skip_date_coercion |
Logical. Whether to skip coercing dates. This can be faster,
especially for larger |
as_arrow_table |
Logical. Whether to return an arrow table. Defaults to |
output_type_id_datatype |
character string. One of |
tbl with column data types coerced to hub schema data types or character.
If as_arrow_table = TRUE, the output is also converted to an arrow table.
coerce_to_hub_schema(): coerce columns to hub schema data types.
coerce_to_character(): coerce all columns to character.
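To make the idea of schema-driven coercion concrete, here is a minimal base R sketch. It assumes the target schema is available as a named character vector of R types (similar in shape to what `create_hub_schema(..., r_schema = TRUE)` is described as returning). The helper `coerce_cols_sketch()`, the column names, and the values are hypothetical, and the package's own implementation may differ.

```r
# Hypothetical target schema: column name -> R type
target_types <- c(origin_date = "Date", horizon = "integer", value = "double")

# Model output as it might look after reading every column as character
tbl <- data.frame(
  origin_date = c("2022-10-08", "2022-10-08"),
  horizon = c("1", "2"),
  value = c("135.5", "140"),
  stringsAsFactors = FALSE
)

# Coerce each column named in `target_types`; dates can be left untouched
coerce_cols_sketch <- function(tbl, target_types, skip_date_coercion = FALSE) {
  for (col in intersect(names(tbl), names(target_types))) {
    type <- target_types[[col]]
    tbl[[col]] <- switch(type,
      integer = as.integer(tbl[[col]]),
      double = as.double(tbl[[col]]),
      logical = as.logical(tbl[[col]]),
      character = as.character(tbl[[col]]),
      Date = if (skip_date_coercion) tbl[[col]] else as.Date(tbl[[col]]),
      stop("Unsupported type: ", type)
    )
  }
  tbl
}

coerced <- coerce_cols_sketch(tbl, target_types)
vapply(coerced, function(x) class(x)[1], character(1))

# Skipping date coercion leaves the date column as character
skipped <- coerce_cols_sketch(tbl, target_types, skip_date_coercion = TRUE)
class(skipped$origin_date)
```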
collect_hub retrieves data from a <hub_connection>/<mod_out_connection>
after executing any <arrow_dplyr_query> into a local tibble.
The function also attempts to convert the output to a model_out_tbl class
object before returning.
collect_hub(x, silent = FALSE, ...)
x |
a |
silent |
Logical. Whether to suppress message generated if conversion to |
... |
Further arguments passed on to |
A model_out_tbl, unless conversion to model_out_tbl fails in which case a tibble is returned.
hub_path <- system.file("testhubs/simple", package = "hubUtils")
hub_con <- connect_hub(hub_path)
# Collect all data in a hub
hub_con |> collect_hub()
# Filter data before collecting
hub_con |>
  dplyr::filter(is.na(output_type_id)) |>
  collect_hub()
# Pass arguments to as_model_out_tbl()
dplyr::filter(hub_con, is.na(output_type_id)) |>
  collect_hub(remove_empty = TRUE)
collect_zoltar retrieves data from a zoltardata.com project and
transforms it from Zoltar's native download format into a hubverse one.
Zoltar is a pre-hubverse research project that implements a repository of model
forecast results. It includes tools to administer, query, and visualize uploaded data,
along with R and Python APIs (zoltr and zoltpy, respectively) for accessing data
programmatically. This hubData function is itself implemented using
the zoltr package.
collect_zoltar(
  project_name,
  models = NULL,
  timezeros = NULL,
  units = NULL,
  targets = NULL,
  types = NULL,
  as_of = NULL,
  point_output_type = "median"
)
project_name |
A string naming the Zoltar project to load forecasts from. Assumes the host is zoltardata.com. |
models |
A character vector that specifies the models to query. Must be model abbreviations. Defaults to NULL, which queries all models in the project. |
timezeros |
A character vector that specifies the timezeros to query. Must be yyyy-mm-dd format. Defaults to NULL, which queries all timezeros in the project. |
units |
A character vector that specifies the units to query. Must be unit abbreviations. Defaults to NULL, which queries all units in the project. |
targets |
A character vector that specifies the targets to query. Must be target names. Defaults to NULL, which queries all targets in the project. |
types |
A character vector that specifies the forecast types to query. Choices are "bin", "point", "sample", "quantile", "mean", and "median". Defaults to NULL, which queries all types in the project. Note: While Zoltar supports "named" and "mode" forecasts, this function ignores them. |
as_of |
A datetime string that specifies the forecast version. The datetime must include timezone information
for disambiguation, without which the query will fail. The datetime parsing function used below ( |
point_output_type |
A string that specifies how to convert zoltar |
Zoltar's data model differs from that of the hubverse in a few important ways. While Zoltar's model has the concepts of unit, target, and timezero, hubverse projects have hub-configurable columns, which makes the mapping from the former to the latter imperfect. In particular, Zoltar units translate roughly to hubverse task IDs, Zoltar targets include both the target outcome and numeric horizon in the target name, and Zoltar timezeros map to round ids. Finally, Zoltar's forecast types differ from those of the hubverse. Whereas Zoltar has eight types (bin, named, point, sample, quantile, mean, median, and mode), the hubverse has six (cdf, mean, median, pmf, quantile, sample), only some of which overlap.
Additional notes:
Requires the user to have a Zoltar account (use the Zoltar contact page to request one).
Requires Z_USERNAME and Z_PASSWORD environment vars to be set to those of the user's Zoltar account.
While Zoltar supports "named" and "mode" forecasts, this function ignores them.
Rows with non-numeric values are ignored.
This function removes numeric_horizon mentions from zoltar target names. Target names can contain a maximum of one numeric_horizon. Example: "1 wk ahead inc case" -> "wk ahead inc case".
Querying a large number of rows may cause errors, so we recommend providing one or more filtering arguments (e.g., models, timezeros, etc.) to limit the result.
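The numeric-horizon handling mentioned in the notes can be pictured with a small base R sketch. It assumes the horizon is the first run of digits in the target name, and `split_target_sketch()` is a hypothetical helper written for illustration; it is not the function the package uses internally.

```r
# Hypothetical helper: pull the first number out of each Zoltar target name
# and return the cleaned name alongside the extracted horizon
split_target_sketch <- function(targets) {
  pos <- regexpr("[0-9]+", targets)
  has_num <- as.vector(pos) != -1

  # Targets without a number keep an NA horizon
  horizon <- rep(NA_integer_, length(targets))
  horizon[has_num] <- as.integer(regmatches(targets, pos))

  # Drop the first number, then tidy up any leftover whitespace
  name <- trimws(gsub("\\s+", " ", sub("[0-9]+", "", targets)))

  data.frame(target = name, horizon = horizon, stringsAsFactors = FALSE)
}

parsed <- split_target_sketch(c("1 wk ahead inc case", "pct next week"))
parsed
```

Targets without a number, such as "pct next week", pass through unchanged with a missing horizon in this sketch.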
A hubverse model_out_tbl containing the following columns: "model_id", "timezero", "season", "unit", "horizon", "target", "output_type", "output_type_id", and "value".
## Not run: 
df <- collect_zoltar("Docs Example Project")
df <- collect_zoltar("Docs Example Project",
  models = c("docs_mod"),
  timezeros = c("2011-10-16"),
  units = c("loc1", "loc3"),
  targets = c("pct next week", "cases next week"),
  types = c("point"),
  as_of = NULL,
  point_output_type = "mean"
)
## End(Not run)
Connect to data in a model output directory through a Modeling Hub or directly. Data can be stored in a local directory or in the cloud on AWS or GCS.
connect_hub(
  hub_path,
  file_format = c("csv", "parquet", "arrow"),
  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer", "logical", "Date"),
  partitions = list(model_id = arrow::utf8()),
  skip_checks = FALSE,
  na = c("NA", ""),
  ignore_files = NULL
)
connect_model_output(
  model_output_dir,
  file_format = c("csv", "parquet", "arrow"),
  partition_names = "model_id",
  schema = NULL,
  skip_checks = FALSE,
  na = c("NA", ""),
  ignore_files = NULL
)
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
file_format |
The file format model output files are stored in.
For connection to a fully configured hub, accessed through |
output_type_id_datatype |
character string. One of |
partitions |
a named list specifying the arrow data types of any partitioning column. |
skip_checks |
Logical. If |
na |
A character vector of strings to interpret as missing values. Only
applies to CSV files. The default is |
ignore_files |
A character vector of file names (not paths) or
file prefixes to ignore when discovering model output files to
include in dataset connections.
Parent directory names should not be included.
Common non-data files such as |
model_output_dir |
Either a character string path to a local directory
containing model output data
or an object of class |
partition_names |
character vector that defines the field names to which
recursive directory names correspond. Defaults to a single |
schema |
An arrow::Schema object for the Dataset. If NULL (the default), the schema will be inferred from the data sources. |
By default, common non-data files that may be present in model output directories
(e.g. "README", ".DS_Store") are excluded automatically to prevent errors
when connecting via Arrow. Additional files can be excluded using the ignore_files
parameter.
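The name-or-prefix matching described for ignore_files can be pictured with a short base R sketch. The file names and the ignore list below are made up for illustration, and the package's actual file discovery may apply additional rules.

```r
# Hypothetical file names found in a model output directory
files <- c(
  "README",
  ".DS_Store",
  "2022-10-08-team1-goodmodel.csv",
  "2022-10-15-team1-goodmodel.csv"
)

# Entries are treated as exact file names or as prefixes
ignore <- c("README", ".DS_Store", "2022-10-08")

# A file is ignored when it starts with any of the ignore entries
is_ignored <- vapply(
  files,
  function(f) any(startsWith(f, ignore)),
  logical(1)
)

kept <- files[!is_ignored]
kept
```

Because an exact file name is also a prefix of itself, a single `startsWith()` check covers both cases.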
connect_hub returns an S3 object of class <hub_connection>.
connect_model_output returns an S3 object of class <mod_out_connection>.
Both objects are connected to the data in the model-output directory via an
Apache Arrow FileSystemDataset connection.
The connection can be used to extract data using dplyr custom queries. The
<hub_connection> class also contains modeling hub metadata.
connect_hub(): connect to a fully configured Modeling Hub directory.
connect_model_output(): connect directly to a model-output directory. This
function can be used to access data directly from an appropriately set up
model output directory which is not part of a fully configured hub.
# Connect to a local simple forecasting Hub.
hub_path <- system.file("testhubs/simple", package = "hubUtils")
hub_con <- connect_hub(hub_path)
hub_con
hub_con <- connect_hub(hub_path, output_type_id_datatype = "character")
hub_con
# Connect directly to a local `model-output` directory
mod_out_path <- system.file("testhubs/simple/model-output", package = "hubUtils")
mod_out_con <- connect_model_output(mod_out_path)
mod_out_con
# Query hub_connection for data
library(dplyr)
hub_con |>
  filter(
    origin_date == "2022-10-08",
    horizon == 2
  ) |>
  collect_hub()
mod_out_con |>
  filter(
    origin_date == "2022-10-08",
    horizon == 2
  ) |>
  collect_hub()
# Ignore a file
connect_hub(hub_path, ignore_files = c("README", "2022-10-08-team1-goodmodel.csv"))
# Connect to a simple forecasting Hub stored in an AWS S3 bucket.
## Not run: 
hub_path <- s3_bucket("hubverse/hubutils/testhubs/simple/")
hub_con <- connect_hub(hub_path)
hub_con
## End(Not run)
Open the oracle-output target data file(s)
in a hub as an arrow dataset.
connect_target_oracle_output(
  hub_path = ".",
  date_col = NULL,
  na = c("NA", ""),
  ignore_files = NULL,
  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer", "logical", "Date")
)
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
date_col |
Optional column name to be interpreted as date. Default is |
na |
A character vector of strings to interpret as missing values. Only
applies to CSV files. The default is |
ignore_files |
A character vector of file names (not paths) or
file prefixes to ignore when discovering model output files to
include in dataset connections.
Parent directory names should not be included.
Common non-data files such as |
output_type_id_datatype |
character string. One of |
If the target data is split across multiple files in an oracle-output directory,
all files must share the same file format, either csv or parquet.
No other types of files are currently allowed in an oracle-output directory.
This function uses different methods to create the Arrow schema depending on the hub configuration version:
v6+ hubs (with target-data.json): Schema is created directly from the
target-data.json configuration file using create_oracle_output_schema().
This config-based approach is fast and deterministic, requiring no filesystem
I/O to scan data files. It's especially beneficial for cloud storage where
file scanning can be slow.
Pre-v6 hubs (without target-data.json): Schema is inferred by scanning
the actual data files. This inference-based approach examines file structure
and content to determine column types.
The function automatically detects which method to use based on the presence
of target-data.json in the hub configuration.
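The detection step can be sketched in base R as a simple file-existence check. The sketch assumes that hub configuration files live in a `hub-config` directory inside the hub, and `schema_method_sketch()` is a hypothetical helper rather than part of the package.

```r
# Hypothetical helper: choose the schema strategy from the hub configuration.
# Assumes config files sit in a "hub-config" directory inside the hub.
schema_method_sketch <- function(hub_path) {
  config_file <- file.path(hub_path, "hub-config", "target-data.json")
  if (file.exists(config_file)) "config" else "inference"
}

# Build two throwaway hub directories to exercise both branches
v6_hub <- file.path(tempdir(), "v6-hub-sketch")
old_hub <- file.path(tempdir(), "older-hub-sketch")
dir.create(file.path(v6_hub, "hub-config"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(old_hub, "hub-config"), recursive = TRUE, showWarnings = FALSE)

# Only the first hub gets a target-data.json file
invisible(file.create(file.path(v6_hub, "hub-config", "target-data.json")))

schema_method_sketch(v6_hub)
schema_method_sketch(old_hub)
```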
Column ordering in the resulting dataset depends on configuration version and file format:
v6+ hubs (with target-data.json):
Parquet: Columns are reordered to the standard hubverse convention (see get_target_data_colnames()).
Parquet's column-by-name matching enables safe reordering.
CSV: Original file ordering is preserved to avoid column name/position mismatches during collection.
Pre-v6 hubs (without target-data.json): Original file ordering is preserved regardless of format.
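Why reordering is safe when columns are matched by name but risky when they are matched by position can be shown with a toy base R example. The column names and values here are hypothetical and are not taken from any hub.

```r
# Hypothetical column names, in the order a CSV file happens to use
csv_file_order <- c("location", "target_end_date", "observation")

# A different, convention-style order for the same columns
convention_order <- c("target_end_date", "location", "observation")

# One row of raw values, in file order
row <- c("US", "2022-12-31", "10")

# Position-based labelling with reordered names attaches names to the wrong values
by_position <- setNames(row, convention_order)

# Name-based matching labels values first and reorders afterwards,
# so every value stays with its own column name
by_name <- setNames(row, csv_file_order)[convention_order]

by_position[["target_end_date"]]
by_name[["target_end_date"]]
```

In the position-based case the location value ends up under the date column, which is the kind of mismatch that preserving the original CSV order avoids.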
An arrow dataset object of subclass <target_oracle_output>.
# Column Ordering: CSV vs Parquet in v6+ hubs
# For v6+ hubs with target-data.json, ordering differs by file format
# Example 1: CSV format (single file) - preserves original file ordering
hub_path_csv <- system.file("testhubs/v6/target_file", package = "hubUtils")
oo_con_csv <- connect_target_oracle_output(hub_path_csv)
# CSV columns are in their original file order
names(oo_con_csv)
# Collect and filter as usual
oo_con_csv |> dplyr::collect()
oo_con_csv |>
  dplyr::filter(location == "US") |>
  dplyr::collect()
# Example 2: Parquet format (directory) - reordered to hubverse convention
hub_path_parquet <- system.file("testhubs/v6/target_dir", package = "hubUtils")
oo_con_parquet <- connect_target_oracle_output(hub_path_parquet)
# Parquet columns follow hubverse convention (date first, then alphabetical)
names(oo_con_parquet)
# Reordering is safe for Parquet because it matches columns by name
# rather than position during collection
oo_con_parquet |> dplyr::collect()
# Both formats support the same filtering operations
oo_con_parquet |>
  dplyr::filter(target_end_date == "2022-12-31") |>
  dplyr::collect()
# Get distinct target_end_date values
oo_con_parquet |>
  dplyr::distinct(target_end_date) |>
  dplyr::pull(as_vector = TRUE)
## Not run: 
# Access Target oracle-output data from a cloud hub
s3_hub_path <- s3_bucket("example-complex-forecast-hub")
s3_con <- connect_target_oracle_output(s3_hub_path)
s3_con
s3_con |> dplyr::collect()
## End(Not run)
Open the time-series target data file(s)
in a hub as an arrow dataset.
connect_target_timeseries(
  hub_path = ".",
  date_col = NULL,
  na = c("NA", ""),
  ignore_files = NULL
)
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
date_col |
Optional column name to be interpreted as date. Default is |
na |
A character vector of strings to interpret as missing values. Only
applies to CSV files. The default is |
ignore_files |
A character vector of file names (not paths) or
file prefixes to ignore when discovering model output files to
include in dataset connections.
Parent directory names should not be included.
Common non-data files such as |
If the target data is split across multiple files in a time-series directory,
all files must share the same file format, either csv or parquet.
No other types of files are currently allowed in a time-series directory.
This function uses different methods to create the Arrow schema depending on the hub configuration version:
v6+ hubs (with target-data.json): Schema is created directly from the
target-data.json configuration file using create_timeseries_schema().
This config-based approach is fast and deterministic, requiring no filesystem
I/O to scan data files. It's especially beneficial for cloud storage where
file scanning can be slow.
Pre-v6 hubs (without target-data.json): Schema is inferred by scanning
the actual data files. This inference-based approach examines file structure
and content to determine column types.
The function automatically detects which method to use based on the presence
of target-data.json in the hub configuration.
Column ordering in the resulting dataset depends on configuration version and file format:
v6+ hubs (with target-data.json):
Parquet: Columns are reordered to the standard hubverse convention (see get_target_data_colnames()).
Parquet's column-by-name matching enables safe reordering.
CSV: Original file ordering is preserved to avoid column name/position mismatches during collection.
Pre-v6 hubs (without target-data.json): Original file ordering is preserved regardless of format.
An arrow dataset object of subclass <target_timeseries>.
# Column Ordering: CSV vs Parquet in v6+ hubs
# For v6+ hubs with target-data.json, ordering differs by file format
# Example 1: CSV format (single file) - preserves original file ordering
hub_path_csv <- system.file("testhubs/v6/target_file", package = "hubUtils")
ts_con_csv <- connect_target_timeseries(hub_path_csv)
# CSV columns are in their original file order
names(ts_con_csv)
# Note: columns appear in the order they are in the CSV file
# Collect and filter as usual
ts_con_csv |> dplyr::collect()
ts_con_csv |>
  dplyr::filter(location == "US") |>
  dplyr::collect()
# Example 2: Parquet format (directory) - reordered to hubverse convention
hub_path_parquet <- system.file("testhubs/v6/target_dir", package = "hubUtils")
ts_con_parquet <- connect_target_timeseries(hub_path_parquet)
# Parquet columns follow hubverse convention
names(ts_con_parquet)
# Reordering is safe for Parquet because it matches columns by name
# rather than position during collection
ts_con_parquet |> dplyr::collect()
# Both formats support the same filtering operations
ts_con_parquet |>
  dplyr::filter(target_end_date == "2022-12-31") |>
  dplyr::collect()
## Not run: 
# Access Target time-series data from a cloud hub
s3_hub_path <- s3_bucket("example-complex-forecast-hub")
s3_con <- connect_target_timeseries(s3_hub_path)
s3_con
s3_con |> dplyr::collect()
## End(Not run)
Create an arrow schema from a tasks.json config file. For use when
opening an arrow dataset.
create_hub_schema(
  config_tasks,
  partitions = list(model_id = arrow::utf8()),
  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer", "logical", "Date"),
  r_schema = FALSE
)
config_tasks |
a list version of the contents of a hub's |
partitions |
a named list specifying the arrow data types of any partitioning column. |
output_type_id_datatype |
character string. One of |
r_schema |
Logical. If |
an arrow schema object that can be used to define column datatypes when
opening model output data. If r_schema = TRUE, a character vector of R data types.
hub_path <- system.file("testhubs/simple", package = "hubUtils")
config_tasks <- hubUtils::read_config(hub_path, "tasks")
schema <- create_hub_schema(config_tasks)
This function has been moved to the
hubValidations
package and renamed to submission_tmpl().
create_model_out_submit_tmpl(
  hub_con,
  config_tasks,
  round_id,
  required_vals_only = FALSE,
  complete_cases_only = TRUE
)
hub_con |
A |
config_tasks |
a list version of the contents of a hub's |
round_id |
Character string. Round identifier. If the round is set to
|
required_vals_only |
Logical. Whether to return only combinations of Task ID and related output type ID required values. |
complete_cases_only |
Logical. If |
For task IDs or output_type_ids where all values are optional, by default, columns
are included as columns of NAs when required_vals_only = TRUE.
When such columns exist, the function returns a tibble with zero rows, as no
complete cases of required value combinations exist.
(Note that determination of complete cases excludes valid NA
output_type_id values in "mean" and "median" output types).
To return a template of incomplete required cases, which includes NA columns, use
complete_cases_only = FALSE.
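The interaction between all-optional columns and complete cases can be reproduced with a toy grid in base R. The task IDs and values below are hypothetical, and the sketch only mirrors the logic described here rather than calling the package.

```r
# Hypothetical required values per task ID; `location` has no required
# values, so it is represented by a single NA
required_vals <- list(
  origin_date = "2022-10-08",
  horizon = c(1L, 2L),
  location = NA_character_
)

# Expanded grid of required value combinations (one row per combination)
grid <- expand.grid(required_vals, stringsAsFactors = FALSE)
nrow(grid)

# Every row contains an NA, so no complete cases remain
complete <- grid[stats::complete.cases(grid), ]
nrow(complete)
```

Keeping `grid` rather than `complete` corresponds to asking for incomplete required cases, NA columns included.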
When sample output types are included in the output, the output_type_id
column contains example sample indexes which are useful for identifying the
compound task ID structure of multivariate sampling distributions in particular,
i.e. which combinations of task ID values represent individual samples.
When a round is set to round_id_from_variable: true,
the value of the task ID from which round IDs are derived (i.e. the task ID
specified in round_id property of config_tasks) is set to the value of the
round_id argument in the returned output.
a tibble template containing an expanded grid of valid task ID and
output type ID value combinations for a given submission round
and output type.
If required_vals_only = TRUE, values are limited to the combination of required
values only.
Create oracle-output target data file schema
create_oracle_output_schema(
  hub_path,
  date_col = NULL,
  na = c("NA", ""),
  ignore_files = NULL,
  r_schema = FALSE,
  output_type_id_datatype = c("from_config", "auto", "character", "double", "integer", "logical", "Date")
)
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
date_col |
Optional column name to be interpreted as date. Default is |
na |
A character vector of strings to interpret as missing values. Only
applies to CSV files. The default is |
ignore_files |
A character vector of file names (not paths) or
file prefixes to ignore when discovering model output files to
include in dataset connections.
Parent directory names should not be included.
Common non-data files such as |
r_schema |
Logical. If |
output_type_id_datatype |
character string. One of |
When target-data.json (v6.0.0+) is present, schema is created directly from config
without reading target data files. Otherwise, schema is inferred by reading the dataset.
Config-based approach avoids file I/O (especially beneficial for cloud storage) and
provides deterministic schema creation.
an arrow <schema> class object
hub_path <- system.file("testhubs/v5/target_file", package = "hubUtils")
# Create target oracle-output schema
create_oracle_output_schema(hub_path)
# target oracle-output schema from a cloud hub
s3_hub_path <- s3_bucket("example-complex-forecast-hub")
create_oracle_output_schema(s3_hub_path)
Create time-series target data file schema
create_timeseries_schema(
  hub_path,
  date_col = NULL,
  na = c("NA", ""),
  ignore_files = NULL,
  r_schema = FALSE
)
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
date_col |
Optional column name to be interpreted as date. Default is |
na |
A character vector of strings to interpret as missing values. Only
applies to CSV files. The default is |
ignore_files |
A character vector of file names (not paths) or
file prefixes to ignore when discovering model output files to
include in dataset connections.
Parent directory names should not be included.
Common non-data files such as |
r_schema |
Logical. If |
When target-data.json (v6.0.0+) is present, schema is created directly from config
without reading target data files. Otherwise, schema is inferred by reading the dataset.
Config-based approach avoids file I/O (especially beneficial for cloud storage) and
provides deterministic schema creation.
an arrow <schema> class object
hub_path <- system.file("testhubs/v5/target_file", package = "hubUtils")
# Create target time-series schema
create_timeseries_schema(hub_path)
# target time-series schema from a cloud hub
s3_hub_path <- s3_bucket("example-complex-forecast-hub")
create_timeseries_schema(s3_hub_path)
This function has been moved to the
hubValidations
package and renamed to expand_model_out_grid().
expand_model_out_val_grid(
  config_tasks,
  round_id,
  required_vals_only = FALSE,
  all_character = FALSE,
  as_arrow_table = FALSE,
  bind_model_tasks = TRUE,
  include_sample_ids = FALSE
)
config_tasks |
a list version of the contents of a hub's |
round_id |
Character string. Round identifier. If the round is set to
|
required_vals_only |
Logical. Whether to return only combinations of Task ID and related output type ID required values. |
all_character |
Logical. Whether to return all columns as character. |
as_arrow_table |
Logical. Whether to return an arrow table. Defaults to |
bind_model_tasks |
Logical. Whether to bind expanded grids of values from multiple modeling tasks into a single tibble/arrow table or return a list. |
include_sample_ids |
Logical. Whether to include sample identifiers in
the |
When a round is set to round_id_from_variable: true,
the value of the task ID from which round IDs are derived (i.e. the task ID
specified in round_id property of config_tasks) is set to the value of the
round_id argument in the returned output.
When sample output types are included in the output and include_sample_ids = TRUE,
the output_type_id column contains example sample indexes which are useful
for identifying the compound task ID structure of multivariate sampling
distributions in particular, i.e. which combinations of task ID values
represent individual samples.
If bind_model_tasks = TRUE (default) a tibble or arrow table
containing all possible task ID and related output type ID
value combinations. If bind_model_tasks = FALSE, a list containing a
tibble or arrow table for each round modeling task.
Columns are coerced to data types according to the hub schema,
unless all_character = TRUE. If all_character = TRUE, all columns are returned as
character which can be faster when large expanded grids are expected.
If required_vals_only = TRUE, values are limited to the combinations of required
values only.
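The idea of an expanded grid can be sketched with base R alone. The snippet below is only an illustration of "all possible task ID and related output type ID value combinations"; the task IDs and values are invented for the example, and it does not reproduce the package's handling of hub configs, required values, or sample IDs.

```r
# Hypothetical task ID and output type values (not taken from any real hub)
task_ids <- list(
  location = c("US", "01"),
  horizon = 1:2
)
output_types <- list(
  output_type = "quantile",
  output_type_id = c(0.25, 0.5, 0.75)
)

# Every combination of the values above: 2 * 2 * 1 * 3 rows
grid <- expand.grid(
  c(task_ids, output_types),
  stringsAsFactors = FALSE,
  KEEP.OUT.ATTRS = FALSE
)

# Rough analogue of all_character = TRUE: coerce every column to character
grid_chr <- grid
grid_chr[] <- lapply(grid_chr, as.character)

nrow(grid)
str(grid_chr)
```

Coercing everything to character avoids per-column type handling, which is one plausible reason the all-character path can be faster for very large grids.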
Given a filesystem path, this function extracts Hive-style partition
key-value pairs (i.e., path components formatted as key=value). It supports
decoding URL-encoded values (e.g., "wk%20flu" → "wk flu"), and handles
empty values (e.g., "key=") as NA, consistent with Hive and Arrow semantics.
extract_hive_partitions(path, strict = TRUE)
path |
A character string of length 1: the path to a file or directory. |
strict |
Logical. If |
If strict = TRUE, the function will abort with a detailed error message
if any malformed partition-like segments are found.
A named character vector where the names are partition keys and the values
are decoded values. Returns NULL if no valid partitions are found.
extract_hive_partitions("data/country=US/year=2024/file.parquet")
extract_hive_partitions("data/country=/year=2024/", strict = TRUE)
# extract_hive_partitions("data/=US/year=2024/", strict = TRUE) # This will error
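To make the parsing rules above concrete, here is a minimal base-R sketch of the same idea. `sketch_hive_partitions()` is a hypothetical helper written for this illustration, not the package's implementation, and it omits the `strict` validation entirely.

```r
# Hypothetical helper (not part of hubData): parse key=value path segments
sketch_hive_partitions <- function(path) {
  segments <- strsplit(path, "/", fixed = TRUE)[[1]]
  # Keep only segments with a non-empty key followed by "="
  parts <- segments[grepl("^[^=]+=", segments)]
  if (length(parts) == 0L) {
    return(NULL)
  }
  keys <- sub("=.*$", "", parts)
  raw_values <- sub("^[^=]+=", "", parts)
  # Decode URL-encoded values, then treat empty values as NA
  values <- vapply(raw_values, utils::URLdecode, character(1), USE.NAMES = FALSE)
  values[values == ""] <- NA_character_
  stats::setNames(values, keys)
}

sketch_hive_partitions("data/country=US/year=2024/file.parquet")
sketch_hive_partitions("data/target=wk%20flu/season=/")
sketch_hive_partitions("data/file.parquet")
```

The three calls cover a plain partitioned path, a URL-encoded value alongside an empty value, and a path with no partitions.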
Get the bucket name for the cloud storage location.
get_s3_bucket_name(hub_path = ".")
hub_path |
Path to a hub directory. |
The bucket name for the cloud storage location.
hub_path <- system.file("testhubs/v5/target_file", package = "hubUtils")
get_s3_bucket_name(hub_path)
# Get config info from GitHub
get_s3_bucket_name(
  "https://github.com/hubverse-org/example-complex-forecast-hub"
)
Extracts the expected column names for target data from a hub's configuration files in the correct order. This is useful for validation and schema generation without needing to inspect the actual dataset.
get_target_data_colnames(
  config_target_data,
  target_type = c("time-series", "oracle-output")
)
config_target_data |
A |
target_type |
Character string specifying the target data type.
Must be either |
The function builds the column name vector directly from the configuration objects without requiring dataset inspection. This makes it lightweight, efficient, and suitable for validation purposes.
For time-series data, columns are ordered as:
Task ID columns from observable_unit
Date column (if not in observable_unit)
Non-task ID columns from target-data.json (if present)
observation column (target value)
as_of column (if versioned = TRUE)
For oracle-output data, columns are ordered as:
Task ID columns from observable_unit
Date column (if not in observable_unit)
output_type and output_type_id columns (if has_output_type_ids = TRUE)
oracle_value column (target value)
as_of column (if versioned = TRUE)
A character vector of expected column names in the correct order:
Date column
Task ID columns (from observable_unit)
Non-task ID columns (time-series only, if specified in config)
Output type columns (output_type and output_type_id, oracle-output only if specified in config)
Target value column (observation for time-series, oracle_value for oracle-output)
as_of column (if data is versioned)
# Note: These examples require test data
hub_path <- system.file("testhubs/v6/target_file", package = "hubUtils")
config_target_data <- hubUtils::read_config(hub_path, "target-data")
# Get time-series column names
get_target_data_colnames(
  config_target_data,
  target_type = "time-series"
)
# Get oracle-output column names
get_target_data_colnames(
  config_target_data,
  target_type = "oracle-output"
)
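The ordering rules listed in the details can be sketched without any hub data. The list `cfg` below is a simplified, hypothetical stand-in for a target data config: its field names (`date_col`, `non_task_id_cols`, and so on) are assumptions made for this illustration and may not match the real `target-data.json` structure. `sketch_target_colnames()` is likewise a hypothetical helper.

```r
# Hypothetical, simplified config; field names are illustrative assumptions
cfg <- list(
  observable_unit = c("target_end_date", "target", "location"),
  date_col = "target_end_date",
  non_task_id_cols = character(0),
  has_output_type_ids = TRUE,
  versioned = FALSE
)

# Hypothetical helper following the column order described in the details
sketch_target_colnames <- function(cfg,
                                   target_type = c("time-series", "oracle-output")) {
  target_type <- match.arg(target_type)
  cols <- cfg$observable_unit
  # Append the date column only if it is not already a task ID
  if (!cfg$date_col %in% cols) cols <- c(cols, cfg$date_col)
  if (target_type == "time-series") {
    cols <- c(cols, cfg$non_task_id_cols, "observation")
  } else {
    if (isTRUE(cfg$has_output_type_ids)) {
      cols <- c(cols, "output_type", "output_type_id")
    }
    cols <- c(cols, "oracle_value")
  }
  if (isTRUE(cfg$versioned)) cols <- c(cols, "as_of")
  cols
}

sketch_target_colnames(cfg, "time-series")
sketch_target_colnames(cfg, "oracle-output")
```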
Get the unique file extension(s) of the target data file(s) in target_path.
If target_path is a directory, the function will return the unique file
extensions of all files in the directory. If target_path is a file,
the function will return the file extension of that file.
get_target_file_ext(hub_path = NULL, target_path)
hub_path |
If not |
target_path |
character string. The path to the target data
file or directory. Usually the output of |
hub_path <- system.file("testhubs/v5/target_file", package = "hubUtils")
target_path <- get_target_path(hub_path, "time-series")
get_target_file_ext(hub_path, target_path)
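The file-versus-directory behaviour described above can be approximated with base R. `sketch_file_ext()` is a hypothetical helper for local paths only; searching sub-directories recursively is an assumption made here, and cloud paths are not handled.

```r
# Hypothetical helper (not part of hubData): unique extensions under a path
sketch_file_ext <- function(target_path) {
  files <- if (dir.exists(target_path)) {
    # Assumption: look inside sub-directories too (e.g. partitioned data)
    list.files(target_path, recursive = TRUE)
  } else {
    target_path
  }
  unique(tools::file_ext(files))
}

# Build a throwaway directory with two files to try it on
demo_dir <- tempfile("time-series-demo")
dir.create(demo_dir)
invisible(file.create(file.path(demo_dir, c("part-1.parquet", "part-2.parquet"))))

sketch_file_ext(demo_dir)
sketch_file_ext(file.path(demo_dir, "part-1.parquet"))
```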
Get the path(s) to the target data file(s) in the hub directory.
get_target_path(hub_path, target_type = c("time-series", "oracle-output"))
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
target_type |
Type of target data to retrieve matching files. One of "time-series" or "oracle-output". Defaults to "time-series". |
a character vector of path(s) to target data file(s) (in the target-data directory) that match the
requested target_type.
hub_path <- system.file("testhubs/v5/target_file", package = "hubUtils")
get_target_path(hub_path)
get_target_path(hub_path, "time-series")
get_target_path(hub_path, "oracle-output")
# Access cloud data
s3_bucket_name <- get_s3_bucket_name(hub_path)
s3_hub_path <- s3_bucket(s3_bucket_name)
get_target_path(s3_hub_path)
get_target_path(s3_hub_path, "oracle-output")
See arrow::gs_bucket() for details.
A SubTreeFileSystem containing a GcsFileSystem and the bucket's
relative path. Note that this function's success does not guarantee that you
are authorized to access the bucket's contents.
bucket <- gs_bucket("voltrondata-labs-datasets")
This function checks if a given file or directory path includes one or more
Hive-style partition segments (i.e., subdirectories formatted as key=value).
This function can operate in a strict or lenient mode, depending on whether
you want to catch malformed partition-like segments.
is_hive_partitioned_path(path, strict = TRUE)
path |
Character string. Path to a file or directory. |
strict |
Logical. If |
A valid partition segment must:
Contain an equals sign (=)
Have a non-empty key before the equals sign
May have an empty value (interpreted as NA in most Hive/Arrow contexts)
In strict mode, the function validates that all key=value segments are well-formed
and will abort if any are not.
A logical value: TRUE if the path contains one or more valid Hive-style
partition segments, FALSE otherwise.
extract_hive_partitions() to extract key-value pairs from Hive-style paths.
is_hive_partitioned_path("data/country=US/year=2024/file.parquet")
is_hive_partitioned_path("data/country=/year=2024/", strict = TRUE)
# is_hive_partitioned_path("data/=US/year=2024/", strict = TRUE) # This will error
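The difference between strict and lenient checking can be sketched in base R as follows. `sketch_is_hive_path()` is a hypothetical helper, not the package's code. In particular, the lenient behaviour shown here (silently ignoring malformed segments and reporting on the remaining ones) is an assumption.

```r
# Hypothetical helper (not part of hubData): detect key=value path segments
sketch_is_hive_path <- function(path, strict = TRUE) {
  segments <- strsplit(path, "/", fixed = TRUE)[[1]]
  has_equals <- grepl("=", segments, fixed = TRUE)
  # Valid: non-empty key before the first "="; the value may be empty
  valid <- grepl("^[^=]+=", segments)
  malformed <- segments[has_equals & !valid]
  if (strict && length(malformed) > 0L) {
    stop("Malformed partition segment(s): ", paste(malformed, collapse = ", "))
  }
  any(valid)
}

sketch_is_hive_path("data/country=US/year=2024/file.parquet")
sketch_is_hive_path("data/country=/year=2024/")
sketch_is_hive_path("data/=US/year=2024/", strict = FALSE)
# sketch_is_hive_path("data/=US/year=2024/", strict = TRUE) # This will error
```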
Loads in hub model metadata for all models or a specified subset of models and compiles it into a tibble with one row per model.
load_model_metadata(hub_path, model_ids = NULL)
hub_path |
Either a character string path to a local Modeling Hub directory
or an object of class |
model_ids |
A vector of character strings of models for which to load metadata. Defaults to NULL, in which case metadata for all models is loaded. |
tibble with model metadata. One row for each model, one column for
each top-level field in the metadata file. For metadata files with nested structures,
this tibble may contain list-columns where the entries are lists containing the nested metadata values.
# Load in model metadata from local hub
hub_path <- system.file("testhubs/simple", package = "hubUtils")
load_model_metadata(hub_path)
load_model_metadata(hub_path, model_ids = c("hub-baseline"))
Print a <hub_connection> or <mod_out_connection> S3 class object
## S3 method for class 'hub_connection'
print(x, verbose = FALSE, ...)

## S3 method for class 'mod_out_connection'
print(x, verbose = FALSE, ...)
x |
A |
verbose |
Logical. Whether to print the full structure of the object.
Defaults to |
... |
Further arguments passed to or from other methods. |
print(hub_connection): print a <hub_connection> object.
print(mod_out_connection): print a <mod_out_connection> object.
hub_path <- system.file("testhubs/simple", package = "hubUtils")
hub_con <- connect_hub(hub_path)
hub_con
print(hub_con)
print(hub_con, verbose = TRUE)
mod_out_path <- system.file("testhubs/simple/model-output", package = "hubUtils")
mod_out_con <- connect_model_output(mod_out_path)
print(mod_out_con)
Returns a named list mapping base R type strings (e.g., "character", "integer")
to their corresponding Arrow arrow::DataType objects. This is the inverse
of arrow_to_r_datatypes and is useful when creating Arrow schemas
programmatically from R type specifications.
r_to_arrow_datatypes()
This function generates the mapping dynamically. The R type
strings match those used in the non_task_id_schema field of target-data.json
configuration files.
This is particularly useful for:
Creating custom Arrow schemas from R type specifications
Converting configuration-based type information to Arrow schemas
Programmatic schema generation
A named list with 6 entries mapping R types to Arrow DataType objects:
logical: arrow::bool()
integer: arrow::int32() (uses int32 as default)
double: arrow::float64()
character: arrow::utf8()
Date: arrow::date32()
POSIXct: arrow::timestamp(unit = "ms")
arrow_to_r_datatypes, create_timeseries_schema(), create_oracle_output_schema()
# Get the mapping
type_map <- r_to_arrow_datatypes()
# Use it to create Arrow types from R type strings
r_types <- c("character", "integer", "double")
arrow_types <- type_map[r_types]
# Create a simple Arrow schema
my_schema <- arrow::schema(
  name = type_map[["character"]],
  age = type_map[["integer"]],
  score = type_map[["double"]]
)
my_schema
See arrow::s3_bucket() for details.
A SubTreeFileSystem containing an S3FileSystem and the bucket's
relative path. Note that this function's success does not guarantee that you
are authorized to access the bucket's contents.
bucket <- s3_bucket("voltrondata-labs-datasets")
# Turn on debug logging. The following line of code should be run in a fresh
# R session prior to any calls to `s3_bucket()` (or other S3 functions)
Sys.setenv("ARROW_S3_LOG_LEVEL" = "DEBUG")
bucket <- s3_bucket("voltrondata-labs-datasets")