--- title: "Getting Started with ssdtools" author: "ssdtools Team" date: "`r Sys.Date()`" bibliography: references.bib csl: my-style.csl latex_engine: MathJax mainfont: Arial mathfont: Courier output: rmarkdown::html_vignette #output: rmarkdown::pdf_document vignette: > %\VignetteIndexEntry{Getting Started with ssdtools} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 4, fig.height = 4 ) ``` ## Introduction `ssdtools` is an R package to fit Species Sensitivity Distributions (SSDs) using Maximum Likelihood and model averaging. SSDs are cumulative probability distributions that are used to estimate the percent of species that are affected and/or protected by a given concentration of a chemical. The concentration that affects 5% of the species is referred to as the 5% Hazard Concentration (*HC5*). This is equivalent to a 95% protection value (*PC~95~*). For more information on SSDs the reader is referred to @posthuma_species_2001. In order to use `ssdtools` you need to install R (see below) or use the Shiny [app](https://bcgov-env.shinyapps.io/ssdtools/). The shiny app includes a user guide. This vignette is a user manual for the R package. ## Philosophy `ssdtools` provides the key functionality required to fit SSDs using Maximum Likelihood and model averaging in R. It is intended to be used in conjunction with [tidyverse](https://www.tidyverse.org) packages such as `readr` to input data, `tidyr` and `dplyr` to group and manipulate data and `ggplot2` [@ggplot2] to plot data. As such it endeavors to fulfill the tidyverse [manifesto](https://tidyverse.tidyverse.org/articles/manifesto.html). ## Installing In order to install R [@r] the appropriate binary for the users operating system should be downloaded from [CRAN](https://cran.r-project.org) and then installed. Once R is installed, the `ssdtools` package can be installed (together with the tidyverse) by executing the following code at the R console ```{r, eval = FALSE} install.packages(c("ssdtools", "tidyverse")) ``` The `ssdtools` package (and ggplot2 package) can then be loaded into the current session using ```{r, message = FALSE} library(ssdtools) library(ggplot2) ``` ## Getting Help To get additional information on a particular function just type `?` followed by the name of the function at the R console. For example `?ssd_gof` brings up the R documentation for the `ssdtools` goodness of fit function. For more information on using R the reader is referred to [R for Data Science](https://r4ds.had.co.nz) [@wickham_r_2016]. If you discover a bug in `ssdtools` please file an issue with a [reprex](https://reprex.tidyverse.org/articles/reprex-dos-and-donts.html) (repeatable example) at . ## Inputting Data Once the `ssdtools` package has been loaded the next task is to input some data. An easy way to do this is to save the concentration data for a *single* chemical as a column called `Conc` in a comma separated file (`.csv`). Each row should be the sensitivity concentration for a separate species. If species and/or group information is available then this can be saved as `Species` and `Group` columns. The `.csv` file can then be read into R using the following ```{r, eval = FALSE} data <- read_csv(file = "path/to/file.csv") ``` For the purposes of this manual we use the CCME dataset for boron. ```{r} ccme_boron <- ssddata::ccme_boron print(ccme_boron) ``` ## Fitting Distributions The function `ssd_fit_dists()` inputs a data frame and fits one or more distributions. The user can specify a subset of the following `r length(ssd_dists_all())` distributions. Please see the [Distributions](https://bcgov.github.io/ssdtools/articles/B_distributions.html) and [Model averaging](https://bcgov.github.io/ssdtools/articles/A_model_averaging.html) vignettes for more information regarding appropriate use of distributions and the use of model-averaged SSDs. ```{r} ssd_dists_all() ``` using the `dists` argument. ```{r} fits <- ssd_fit_dists(ccme_boron, dists = c("llogis", "lnorm", "gamma")) ``` ## Coefficients The estimates for the various terms can be extracted using the tidyverse generic `tidy` function (or the base R generic `coef` function). ```{r} tidy(fits) ``` ## Plots It is generally more informative to plot the fits using the `autoplot` generic function (a wrapper on `ssd_plot_cdf()`). As `autoplot` returns a `ggplot` object it can be modified prior to plotting. ```{r, fig.width = 7, fig.height = 4} theme_set(theme_bw()) # set plot theme autoplot(fits) + ggtitle("Species Sensitivity Distributions for Boron") + scale_colour_ssd() ``` ## Selecting One Distribution Given multiple distributions the user is faced with choosing the "best" distribution (or as discussed below averaging the results weighted by the fit). ```{r} ssd_gof(fits) ``` The `ssd_gof()` function returns three test statistics that can be used to evaluate the fit of the various distributions to the data. - [Anderson-Darling](https://en.wikipedia.org/wiki/Anderson–Darling_test) (`ad`) statistic, - [Kolmogorov-Smirnov](https://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test) (`ks`) statistic and - [Cramer-von Mises](https://en.wikipedia.org/wiki/Cramér–von_Mises_criterion) (`cvm`) statistic and three information criteria - Akaike's Information Criterion (`AIC`), - Akaike's Information Criterion corrected for sample size (`AICc`) and - Bayesian Information Criterion (`BIC`) Note if `ssd_gof()` is called with `pvalue = TRUE` then the p-values rather than the statistics are returned for the ad, ks and cvm tests. Following @burnham_model_2002 we recommend the `AICc` for model selection. The best predictive model is that with the lowest `AICc` (indicated by the model with a `delta` value of 0.000 in the goodness of fit table). In the current example the best predictive model is the gamma distribution but the lnorm distribution has some support. For further information on the advantages of an information theoretic approach in the context of selecting SSDs the reader is referred to @fox_recent_2021. ## Averaging Multiple Distributions Often other distributions will fit the data almost as well as the best distribution as evidenced by `delta` values < 2 [@burnham_model_2002]. In this situation the recommended approach is to estimate the average fit based on the relative weights of the distributions [@burnham_model_2002]. The `AICc` based weights are indicated by the `weight` column in the goodness of fit table. In the current example, the gamma and log-normal distributions have `delta` values < 2. A detailed introduction to model averaging can be found in the [Model averaging](https://bcgov.github.io/ssdtools/articles/A_model_averaging.html) vignette. A discussion on the recommended set of default distributions can be found in the [Distributions](https://bcgov.github.io/ssdtools/articles/B_distributions.html) vignette. ## Estimating the Fit The `predict` function can be used to generate model-averaged (or if `average = FALSE` individual) estimates by parametric bootstrapping. Model averaging is based on `AICc` unless the data censored is which case `AICc` in undefined. In this situation model averaging is only possible if the distributions have the same number of parameters. Parametric bootstrapping is computationally intensive. To bootstrap for each distribution in parallel register the future back-end and then select the evaluation strategy. ```{r, eval = FALSE, cache=TRUE} doFuture::registerDoFuture() future::plan(future::multisession) set.seed(99) boron_pred <- predict(fits, ci = TRUE) ``` The resultant object is a data frame of the estimated concentration (`est`) with standard error (`se`) and lower (`lcl`) and upper (`ucl`) 95% confidence limits (CLs) by percent of species affected (`percent`). The object includes the number of bootstraps (`nboot`) data sets generated as well as the proportion of the data sets that successfully fitted (`pboot`). There is no requirement for the bootstrap samples to converge. ```{r} boron_pred ``` The data frame of the estimates can then be plotted together with the original data using the `ssd_plot()` function to summarize an analysis. Once again the returned object is a `ggplot` object which can be customized prior to plotting. ```{r, fig.width = 7, fig.height = 4} ssd_plot(ccme_boron, boron_pred, color = "Group", label = "Species", xlab = "Concentration (mg/L)", ribbon = TRUE ) + expand_limits(x = 5000) + # to ensure the species labels fit ggtitle("Species Sensitivity for Boron") + scale_colour_ssd() ``` In the above plot the model-averaged 95% confidence interval is indicated by the shaded band and the model-averaged 5%/95% Hazard/Protection Concentration (*HC5*/ *PC~95~*) by the dotted line. Hazard/Protection concentrations are discussed below. ## Hazard/Protection Concentrations The 5% hazard concentration (*HC5*) is the concentration that affects 5% of the species tested. This is equivalent to the 95% protection concentration which protects 95% of species (*PC~95~*). The hazard and protection concentrations are directly interchangeable, and terminology depends simply on user preference. The hazard/protection concentrations can be obtained using the ssd_hc function, which can be used to obtain any desired percentage value. The fitted SSD can also be used to determine the percentage of species protected at a given concentration using ssd_hp. ```{r, cache=TRUE} set.seed(99) boron_hc5 <- ssd_hc(fits, proportion = 0.05, ci = TRUE) print(boron_hc5) boron_pc <- ssd_hp(fits, conc = boron_hc5$est, ci = TRUE) print(boron_pc) ``` ### Censored Data Censored data is that for which only a lower and/or upper limit is known for a particular species. If the `right` argument in `ssd_fit_dists()` is different to the `left` argument then the data are considered to be censored. Let's make some example censored data. ```{r} example_dat <- ssddata::ccme_boron |> dplyr::mutate(left=Conc, right=Conc) left_censored_example <- example_dat left_censored_example$left[c(3,6,8)] <- NA ``` There are less goodness-of-fit statistics available for fits to censored data (currently just `AIC` and `BIC`). The `delta` values are calculated using AIC`. As the sample size `n` is undefined for censored data, `AICc` cannot be calculated. However, if all the models have the same number of parameters, the `AIC` `delta` values are identical to those for `AICc`. For this reason, `ssdtools` only permits the analysis of censored data using two-parameter models. We can call only the default two parameter models using `ssd_dists_bcanz(n = 2)`. ```{r, eval = TRUE} left_censored_dists <- ssd_fit_dists(left_censored_example, dists = ssd_dists_bcanz(n = 2), left = "left", right = "right") ssd_hc(left_censored_dists, average = FALSE) ssd_hc(left_censored_dists) ssd_gof(left_censored_dists) ``` The model-averaged predictions (and hazard concentrations complete with 95% confidence limits) can be calculated using `AIC` and the results plotted complete with arrows indicating the censorship. ```{r, eval = TRUE} set.seed(99) left_censored_pred <- predict(left_censored_dists, ci = TRUE) ssd_plot(left_censored_example, left_censored_pred, left = "left", right = "right", xlab = "Concentration (mg/L)" ) ``` Note that `ssdtools` doesn't currently support right censored data: ```{r, eval = TRUE} right_censored_example <- example_dat right_censored_example$right[c(3,6,8)] <- NA right_censored_dists <- try(ssd_fit_dists(right_censored_example, dists = ssd_dists_bcanz(n = 2), left = "left", right = "right")) ``` ## References
```{r, results = "asis", echo = FALSE} cat(licensing_md()) ```