| Title: | Basic tools for scoring hubverse forecasts |
|---|---|
| Description: | Using functionality from the scoringutils package, this software provides basic tools for scoring hubverse forecasts. |
| Authors: | Nicholas Reich [aut, cre] (ORCID: <https://orcid.org/0000-0003-3503-9899>), Evan Ray [aut], Nikos Bosse [aut] (ORCID: <https://orcid.org/0000-0002-7750-5280>), Matthew Cornell [aut], Zhian Kamvar [ctb] (ORCID: <https://orcid.org/0000-0003-1458-7108>), Li Shandross [ctb] (ORCID: <https://orcid.org/0009-0008-1348-1954>), Becky Sweger [aut], Kimberlyn Roosa [aut], Anna Krystalli [aut] (ORCID: <https://orcid.org/0000-0002-2378-4915>) |
| Maintainer: | Nicholas Reich <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.2.0 |
| Built: | 2026-06-07 07:15:03 UTC |
| Source: | https://github.com/hubverse-org/hubEvals |
Scores model outputs with a single output_type against observed data.
    score_model_out(
      model_out_tbl,
      oracle_output,
      metrics = NULL,
      relative_metrics = NULL,
      baseline = NULL,
      summarize = TRUE,
      by = "model_id",
      output_type_id_order = NULL,
      compound_taskid_set = NULL,
      transform = NULL,
      transform_append = FALSE,
      transform_label = NULL,
      ...
    )
| Argument | Description |
|---|---|
| model_out_tbl | Model output tibble with predictions |
| oracle_output | Predictions that would have been generated by an oracle model that knew the observed target data values in advance |
| metrics | Character vector of scoring metrics to compute. If |
| relative_metrics | Character vector of scoring metrics for which to compute relative skill scores. The |
| baseline | String with the name of a model to use as a baseline for relative skill scores. If a baseline is given, then a scaled relative skill with respect to the baseline will be returned. By default ( |
| summarize | Boolean indicator of whether summaries of forecast scores should be computed. Defaults to |
| by | Character vector naming columns to summarize by. For example, specifying |
| output_type_id_order | For ordinal variables in pmf format, this is a vector of levels for pmf forecasts, in increasing order of the levels. The order of the values for the output_type_id can be found by referencing the hub's tasks.json configuration file. For all output types other than pmf, this is ignored. |
| compound_taskid_set | When |
| transform | A function to apply as a scale transformation to both predictions and observations before scoring. Common choices include |
| transform_append | Logical. If |
| transform_label | A character string label for the transformation (e.g., "log"). If |
| ... | Additional arguments passed to the |
See the hubverse documentation for the expected format of the oracle output data.
Default metrics are provided by the scoringutils package. You can select
metrics by passing in a character vector of metric names to the metrics
argument.
The following metrics can be selected (all are used by default) for the
different output_types:
Quantile forecasts: (output_type == "quantile")
wis
overprediction
underprediction
dispersion
bias
ae_median
"interval_coverage_XX": interval coverage at the "XX" level. For example, "interval_coverage_95" is the 95% interval coverage rate, which would be calculated based on quantiles at the probability levels 0.025 and 0.975.
See scoringutils::get_metrics.forecast_quantile for details.
Nominal forecasts: (output_type == "pmf" and output_type_id_order is NULL)
log_score
See scoringutils::get_metrics.forecast_nominal for details.
Ordinal forecasts: (output_type == "pmf" and output_type_id_order is a vector)
log_score
rps
See scoringutils::get_metrics.forecast_ordinal for details.
Median forecasts: (output_type == "median")
ae_point: absolute error of the point forecast (recommended for the median, see Gneiting (2011))
See scoringutils::get_metrics.forecast_point for details.
Mean forecasts: (output_type == "mean")
se_point: squared error of the point forecast (recommended for the mean, see Gneiting (2011))
Sample forecasts (marginal): (output_type == "sample", compound_taskid_set = NULL)
bias
dss
crps
overprediction
underprediction
dispersion
log_score
mad
ae_median
se_mean
Note: log_score uses kernel density estimation, which may not be
appropriate for integer-valued forecasts. scoringutils will warn when
this is detected.
See scoringutils::get_metrics.forecast_sample for details.
Sample forecasts (compound): (output_type == "sample", compound_taskid_set provided)
energy_score
variogram_score
See scoringutils::get_metrics.forecast_sample_multivariate for details.
The output includes a .mv_group_id column assigned by scoringutils to
identify the multivariate groups used for scoring (equivalent to the
compound_idx concept in the hubverse sample output type documentation).
Correct scoring depends on providing the right compound_taskid_set from
the hub's tasks.json configuration. If the specified grouping does not
match the actual dependence structure of the submitted samples (e.g.,
because some models submitted coarser samples than configured),
.mv_group_id may not correspond to the original sample draws as
indicated by their output_type_id values.
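To make joint scoring of a compound group more concrete, here is a minimal base R sketch of the textbook sample-based energy score for a single multivariate group. It is an illustration under assumptions, not the scoringutils implementation: the helper name `energy_score_sketch` and the toy values are hypothetical, and the package may differ in details such as weighting.

```r
# Hypothetical helper: textbook sample-based energy score for one group.
#   samples:  m x d matrix, one row per sample draw (e.g. one trajectory
#             across d horizons)
#   observed: length-d vector of observed values
energy_score_sketch <- function(samples, observed) {
  m <- nrow(samples)
  # Mean Euclidean distance between each draw and the observation.
  diff_obs <- sweep(samples, 2, observed)
  term1 <- mean(sqrt(rowSums(diff_obs^2)))
  # Mean pairwise Euclidean distance between draws, halved.
  pair_dist <- as.matrix(dist(samples))
  term2 <- sum(pair_dist) / (2 * m^2)
  term1 - term2
}

# Toy values (made up): two draws over two horizons.
toy_samples <- rbind(c(0, 0), c(2, 0))
toy_observed <- c(1, 0)
es_toy <- energy_score_sketch(toy_samples, toy_observed)

# A forecast whose draws all equal the observation scores zero.
es_perfect <- energy_score_sketch(rbind(c(1, 2, 3), c(1, 2, 3)), c(1, 2, 3))
```

Because each row is treated as one draw, the score depends on which values are grouped into the same row. This is why a compound_taskid_set that does not match how the samples were generated can give misleading results.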
See scoringutils::add_relative_skill for details on relative skill scores.
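As an illustration of the "interval_coverage_XX" naming described for quantile forecasts, the following base R sketch maps a coverage level to the two quantile probability levels that bound the central interval and computes an empirical coverage rate. The helper `interval_levels` and the toy data are hypothetical, and the inclusive treatment of interval endpoints is an assumption rather than a statement about scoringutils.

```r
# Hypothetical helper: probability levels bounding the central XX% interval.
interval_levels <- function(xx) {
  alpha <- 1 - xx / 100
  c(lower = alpha / 2, upper = 1 - alpha / 2)
}

levels_95 <- interval_levels(95)
levels_95

# Toy data (made up): three forecasts, each with its 0.025 and 0.975
# quantiles and the value that was later observed.
toy <- data.frame(
  q_lower  = c(10, 20, 30),
  q_upper  = c(50, 60, 70),
  observed = c(25, 65, 30)
)

# Assumption: an observation on the boundary counts as covered.
covered <- toy$observed >= toy$q_lower & toy$observed <= toy$q_upper
coverage_95 <- mean(covered)
coverage_95
```

A well-calibrated set of forecasts would be expected to have a coverage rate close to the nominal level over many forecasts. Three forecasts are far too few to judge calibration and are used here only to show the arithmetic.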
A data.table with scores
Gneiting, Tilmann. 2011. "Making and Evaluating Point Forecasts." Journal of the American Statistical Association 106 (494): 746–62. doi:10.1198/jasa.2011.r10138.
    # Compute WIS and interval coverage rates at the 80% and 90% levels based on
    # quantile forecasts, summarized by the mean score for each model.
    quantile_scores <- score_model_out(
      model_out_tbl = hubExamples::forecast_outputs |>
        dplyr::filter(.data[["output_type"]] == "quantile"),
      oracle_output = hubExamples::forecast_oracle_output,
      metrics = c("wis", "interval_coverage_80", "interval_coverage_90"),
      relative_metrics = "wis",
      by = "model_id"
    )
    quantile_scores

    # Compute log scores and ranked probability scores based on pmf predictions
    # for an ordinal categorical target, summarized by the mean score for each
    # combination of model, location, and horizon.
    # Note: if the model_out_tbl had forecasts for multiple targets using a
    # pmf output_type with different bins, it would be necessary to score the
    # predictions for those targets separately.
    pmf_scores <- score_model_out(
      model_out_tbl = hubExamples::forecast_outputs |>
        dplyr::filter(.data[["output_type"]] == "pmf"),
      oracle_output = hubExamples::forecast_oracle_output,
      metrics = c("log_score", "rps"),
      by = c("model_id", "location", "horizon"),
      output_type_id_order = c("low", "moderate", "high", "very high")
    )
    head(pmf_scores)

    # Score sample forecasts marginally (each modeling task scored independently).
    # Note: this data has compound structure (samples span horizons), but marginal
    # scoring is still valid: it evaluates each horizon independently.
    sample_scores <- score_model_out(
      model_out_tbl = hubExamples::forecast_outputs |>
        dplyr::filter(.data[["output_type"]] == "sample"),
      oracle_output = hubExamples::forecast_oracle_output,
      metrics = "crps",
      by = "model_id"
    )
    sample_scores

    # Score compound sample forecasts jointly using the default multivariate
    # metrics. compound_taskid_set specifies which task IDs stay constant within
    # a sample group. Here, each sample draw spans all horizons for a given
    # reference_date and location (i.e., a trajectory over time).
    compound_scores <- score_model_out(
      model_out_tbl = hubExamples::forecast_outputs |>
        dplyr::filter(.data[["output_type"]] == "sample"),
      oracle_output = hubExamples::forecast_oracle_output,
      compound_taskid_set = c("reference_date", "location"),
      by = "model_id"
    )
    compound_scores
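The examples above do not cover point forecasts. The sketch below scores median forecasts with the ae_point metric listed earlier. It assumes that hubExamples::forecast_outputs contains rows with output_type "median", which is not confirmed by this page. The package check is only there so the sketch does nothing when the packages are not installed.

```r
# Guard so the sketch is a no-op when the packages are not installed.
pkgs <- c("hubEvals", "hubExamples", "dplyr")
have_pkgs <- all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))

point_scores <- NULL
if (have_pkgs) {
  # Assumption: the example data includes "median" predictions.
  median_out <- hubExamples::forecast_outputs |>
    dplyr::filter(.data[["output_type"]] == "median")

  # Absolute error of the median, averaged within each model.
  point_scores <- hubEvals::score_model_out(
    model_out_tbl = median_out,
    oracle_output = hubExamples::forecast_oracle_output,
    metrics = "ae_point",
    by = "model_id"
  )
  print(point_scores)
}
```

The same pattern should apply to mean forecasts by filtering on output_type "mean" and requesting se_point instead, again assuming the example data contains such rows.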
Transform pmf model output into a forecast object
    transform_pmf_model_out(
      model_out_tbl,
      oracle_output,
      output_type_id_order = NULL
    )
| Argument | Description |
|---|---|
| model_out_tbl | Model output tibble with predictions |
| oracle_output | Predictions that would have been generated by an oracle model that knew the observed target data values in advance |
| output_type_id_order | For ordinal variables in pmf format, this is a vector of levels for pmf forecasts, in increasing order of the levels. The order of the values for the output_type_id can be found by referencing the hub's tasks.json configuration file. For all output types other than pmf, this is ignored. |
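A forecast_nominal object (when output_type_id_order is NULL) or a forecast_ordinal object (when output_type_id_order is provided).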
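The role of output_type_id_order can be illustrated with a small base R sketch. It uses the textbook definitions of the log score and the ranked probability score (RPS) for a single pmf forecast. The helper names and the toy probabilities are hypothetical, and scoringutils may scale or compute these scores differently.

```r
# Hypothetical helpers using textbook definitions.
log_score_sketch <- function(probs, observed_level) {
  -log(probs[[observed_level]])
}

rps_sketch <- function(probs, observed_level, level_order) {
  p <- probs[level_order]                         # probabilities in the given order
  o <- as.numeric(level_order == observed_level)  # one-hot observation
  sum((cumsum(p) - cumsum(o))^2)
}

# Toy forecast (made up) for an ordinal target; the observed level is "high".
probs <- c(low = 0.1, moderate = 0.2, high = 0.6, "very high" = 0.1)

correct_order  <- c("low", "moderate", "high", "very high")
shuffled_order <- c("high", "low", "very high", "moderate")

rps_correct  <- rps_sketch(probs, "high", correct_order)
rps_shuffled <- rps_sketch(probs, "high", shuffled_order)
log_score    <- log_score_sketch(probs, "high")
```

The log score uses only the probability assigned to the observed level, so it is the same under any ordering. The RPS is built from cumulative probabilities and changes when the levels are reordered, which is why the ordinal case needs the levels in increasing order.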
Transform either mean or median model output into a point forecast object
    transform_point_model_out(model_out_tbl, oracle_output, output_type)
| Argument | Description |
|---|---|
| model_out_tbl | Model output tibble with predictions |
| oracle_output | Predictions that would have been generated by an oracle model that knew the observed target data values in advance |
| output_type | Forecast output type: "mean" or "median" |
This function transforms a model output tibble in the hubverse format (with either the "mean" or "median" output type) into a scoringutils "point" forecast object.
forecast_point
Transform quantile model output into a forecast object
    transform_quantile_model_out(model_out_tbl, oracle_output)
| Argument | Description |
|---|---|
| model_out_tbl | Model output tibble with predictions |
| oracle_output | Predictions that would have been generated by an oracle model that knew the observed target data values in advance |
forecast_quantile
Transform sample model output into a forecast object
    transform_sample_model_out(
      model_out_tbl,
      oracle_output,
      compound_taskid_set = NULL
    )
| Argument | Description |
|---|---|
| model_out_tbl | Model output tibble with predictions |
| oracle_output | Predictions that would have been generated by an oracle model that knew the observed target data values in advance |
| compound_taskid_set | Character vector of task ID column names that stay constant within each sample draw (i.e., define the compound modeling task grouping). When |
A forecast_sample object (when compound_taskid_set is NULL)
or a forecast_sample_multivariate object (when compound_taskid_set is
provided).
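To illustrate what "stay constant within each sample draw" means, the base R sketch below groups a toy sample table into trajectories. Everything in it is hypothetical: the data are made up, and the grouping shown is a conceptual picture rather than the internal logic of hubEvals or scoringutils.

```r
# Toy sample output (made up): 2 locations x 3 horizons x 2 sample draws.
toy <- expand.grid(
  reference_date = "2024-01-06",
  location = c("A", "B"),
  horizon = 1:3,
  output_type_id = c("s1", "s2"),
  stringsAsFactors = FALSE
)
toy$value <- seq_len(nrow(toy))

# Task IDs that stay constant within a draw; horizon varies within it.
compound_taskid_set <- c("reference_date", "location")

# Conceptually, one trajectory is the set of rows that share the compound
# task IDs and the same sample index (output_type_id).
group_cols <- c(compound_taskid_set, "output_type_id")
trajectories <- split(toy, toy[group_cols], drop = TRUE)

n_trajectories <- length(trajectories)
rows_per_trajectory <- vapply(trajectories, nrow, integer(1))
```

In this toy table each trajectory has one row per horizon, so joint scoring would compare a three-dimensional draw with a three-dimensional observation. With compound_taskid_set left as NULL, each horizon would instead be scored on its own.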