---
title: "Model Averaged SSDs"
author: "ssdtools Team"
date: "`r Sys.Date()`"
bibliography: references.bib
csl: my-style.csl
latex_engine: MathJax
mainfont: Arial
mathfont: Courier
output: rmarkdown::html_vignette
#output: rmarkdown::pdf_document
vignette: >
%\VignetteIndexEntry{Model Averaged SSDs}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{css, echo=FALSE}
pre {
max-height: 300px;
overflow-y: auto;
}
pre[class] {
max-height: 200px;
}
```
```{css, echo=FALSE}
.scroll-200 {
max-height: 200px;
overflow-y: auto;
background-color: inherit;
}
```
```{r, include = FALSE}
library(tinytex)
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.height = 4,
fig.width = 6
)
```
```{r include=FALSE}
# this just loads the package and suppresses the load package message
library(ssdtools)
require(ggplot2)
require(patchwork)
```
## Background
*"Many authors have noted that there is no guiding theory in ecotoxicology to justify any particular distributional form for the SSD other than that its domain be restricted to the positive real line
[@newman_2000], [@Zajdlik_2005], [@chapman_2007], [@fox_2016]. Indeed, [@chapman_2007] described the identification of a suitable probability model as one of the most
important and difficult choices in the use of SSDs. Compounding this lack of clarity about the functional
form of the SSD is the omnipresent, and equally vexatious issue of small sample size, meaning that
any plausible candidate model is unlikely to be rejected [@fox_recent_2021]. The ssdtools R package
uses a model averaging procedure to avoid the need to a-priori select a candidate distribution and
instead uses a measure of ‘fit’ for each model to compute weights to be applied to an initial set of
candidate distributions. The method, as applied in the SSD context is described in detail in [@fox_recent_2021], and potentially provides a level of flexibility and parsimony that is difficult to achieve with a
single SSD distribution".* [@fox_methodologies_2021]
## Preliminaries
Before we jump into model averaging and in particular, SSD Model Averaging, let's backup a little and consider why we average and the advantages and disadvantages of averaging.
Now, the correct method of determining the *HCx* is to work with the proper model-averaged *cdf* $G\left( x \right)$. This means finding the **inverse** function ${G^{ - 1}}\left( p \right)$. We'll address how we do this in a moment.
The reason why $H\left( p \right)$ does **not** return the correct result is because of the implicit assumption that the inverse of $G\left( x \right)$ is equivalent to $H\left( p \right)$. This is akin to stating the inverse of a *sum* is equal to the sum of the inverses i.e. $$\sum\limits_{i = 1}^n {\frac{1}{{{X_i}}}} = \frac{1}{{\sum\limits_{i = 1}^n {{X_i}} }}{\text{ ???}}$$For example, consider two complex numbers $${\text{a = }}\frac{{\left( {5 - i} \right)}}{2}{\text{ and }}b = - 1.683 - 1.915i$$ It can be shown that $$\frac{1}{{a + b}} = \frac{1}{a} + \frac{1}{b} = 0.126 + 0.372i$$
Back to the issue at hand, and since we're not dealing with complex numbers, it's safe to say:$${G^{ - 1}}\left( p \right) \ne H\left( p \right)$$
If you need a visual demonstration, we can plot $G\left( x \right)$ and the *inverse* of $H\left( p \right)$ both as functions of *x* (a quantile) for our two-component lognormal distribution above. ```{r echo=FALSE, fig.width=8,fig.height=6} t<-seq(0.01,0.99,by=0.001) F<-0.4419*qlnorm(t,-1.067,0.414) + 0.5581*qlnorm(t,-0.387,0.617) plot(xx,(0.4419*plnorm(xx,meanlog=mean(log(samp[5:15])),sd=sd(log(samp[5:15])))+ 0.5581*plnorm(xx,meanlog=mean(log(samp[1:10])),sd=sd(log(samp[1:10])))) , col="#d40830",type="l",xlab="Concentration",ylab="Probability") lines(F,t,col="#51c157",lwd=1.75) segments(-1,0.2,0.34,0.2,col="black",lty=21,lwd=2) segments(0.28,0.2,0.28,-1,col="red",lty=21,lwd=2) segments(0.34,0.2,0.34,-1,col="#51c157",lty=21,lwd=2) segments(1.12,-1,1.12,0.9,col="grey",lty=21,lwd=1.7) text(0.25,0.6,"Correct MA-SSD",col="red",cex=0.75) text(0.75,0.4,"Erroneous MA-SSD",col="#51c157",cex=0.75) mtext("1.12",side=1,at=1.12,cex=0.8,col="grey") ``` Clearly, the two functions are **not** the same and thus *HCx* values derived from each will nearly always be different (as indicated by the positions of the vertical red and green dashed lines in the Figure above corresponding to the 2 values of the *HC20*). (Note: The two curves do cross over at a concentration of about 1.12 corresponding to the 90^th^ percentile, but in the region of ecotoxicological interest, there is no such cross-over and so the two approaches will **always** yield different *HCx* values with this difference → 0 as x → 0).WE next discuss the use of a model-averaged SSD to obtain the *correct* model-averaged *HCx*. ## Computing a model-averaged *HCx* A proper *HCx* needs to satisfy what David Fox refers to as **the inversion principle**.
More formally, the inversion principle states that an *HCx* (denoted as ${\varphi _x}$) must satisfy the following:
$$df\left( {{\varphi _x}} \right) = x\quad \quad and\quad \quad qf\left( x \right) = {\varphi _x}$$
where $df\left( \cdot \right)$ is a model-averaged *distribution function* (i.e. SSD) and $qf\left( \cdot \right)$ is a model-averaged *quantile function*. For this equality to hold, it is necessary that $qf\left( p \right) = d{f^{ - 1}}\left( p \right)$.
In `ssdtools` the following is a check that the inversion principle holds:
The *inversion principle* ensures that we only use a **single** distribution function to compute both the *HCx* *and* the fraction affected. Referring to the figure below, the *HCx* is obtained from the MA-SSD (red curve) by following the → arrows while the fraction affected is obtained by following the ← arrows. ![](images/Figure2.jpg){width=100% align="center"}
Finally, we'll briefly discuss how the *HCx* is computed in `R` using the same method as has been implemented in `ssdtools`.
### Computing the *HCx* in `R`/`ssdtools` Recall, our MA-SSD was given as $$G\left( x \right) = \sum\limits_{i = 1}^k {{w_i}} {F_i}\left( x \right)$$ and an *HCx* is obtained from the MA-SSD by essentially working 'in reverse' by starting at a value of $x$ on the **vertical** scale in the Figure above and following the → arrows and reading off the corresponding value on the horizontal scale.Obviously, we need to be able to 'codify' this process in `R` (or any other computer language).
Mathematically this is equivalent to seeking a solution to the following equation:$${x:G\left( x \right) = p}$$
or, equivalently:$$x:G\left( x \right) - p = 0$$
for some fraction affected, $p$.
Finding the solution to this last equation is referred to as *finding the root(s)* of the function $G\left( x \right)$ or, as is made clear in the figure below, *finding the zero-crossing* of the function $G\left( x \right)$ for the case $p=0.2$.
![](images/uniroot.jpg){width=100% align="center"}
In `R` finding the roots of $x:G\left( x \right) - p = 0$ is achieved using the `uniroot()` function.
Help on the `uniroot` function can be found [here](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/uniroot.html) ## Where do the model-averaged weights come from? This is a little more complex, although we'll try to provide a non-mathematical explanation. For those interested in going deeper, a more comprehensive treatment can be found in [@model_averaging] and [@fletcher].This time, we'll look at fitting a gamma, lognormal, and pareto distribution to our sample data:
```{r echo=FALSE,warning=FALSE, results="markup",message=FALSE,class.output="scroll-100"}
samp<-c(1.73,0.57,0.33,0.28,0.3,0.29,2.15,0.8,0.76,0.54,0.42,0.83,0.21,0.18,0.59)
print(samp)
# knitr::kable(samp,caption="Some toxicity data (concentrations)")
```
The adequacy (or otherwise) of a fitted model can be assessed using a variety of numerical measures known as **goodness-of-fit** or GoF statistics. These are invariably based on a measure of discrepancy between the emprical data and the hypothesized model. Common GoF statistics used to test whether the hypothesis of some specified theoretical probability distribution is plausible for a given data set include: *Kolmogorov-Smirnov test; Anderson-Darling test; Shapiro-Wilk test;and Cramer-von Mises test*.
[The Cramer-von Mises](https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93von_Mises_criterion) test is a good choice and is readily performed using the `cvm.test()` function in the `goftest` package in `R` as follows:
```{r, echo=TRUE,results='hide',warning=FALSE,message=FALSE}
dat<-data.frame(Conc=c(1.73,0.57,0.33,0.28,0.3,0.29,2.15,0.8,0.76,0.54,0.42,0.83,0.21,0.18,0.59))
library(goftest)
library(EnvStats) # this is required for the Pareto cdf (ppareto)
# Examine the fit for the gamma distribution (NB: parameters estimated from the data)
cvm.test(dat$Conc,null = "pgamma",shape = 2.0591977,scale = 0.3231032,estimated = TRUE)
# Examine the fit for the lognormal distribution (NB: parameters estimated from the data)
cvm.test(dat$Conc,null = "plnorm",meanlog=-0.6695120,sd=0.7199573,estimated = TRUE)
# Examine the fit for the Pareto distribution (NB: parameters estimated from the data)
cvm.test(dat$Conc,null = "ppareto",location = 0.1800000,shape = 0.9566756,estimated = TRUE)
```
```
Cramer-von Mises test of goodness-of-fit
Braun's adjustment using 4 groups
Null hypothesis: Gamma distribution
with parameters shape = 2.0591977, scale = 0.3231032
Parameters assumed to have been estimated from data
data: dat$Conc
omega2max = 0.34389, p-value = 0.3404
Cramer-von Mises test of goodness-of-fit
Braun's adjustment using 4 groups
Null hypothesis: log-normal distribution
with parameter meanlog = -0.669512
Parameters assumed to have been estimated from data
data: dat$Conc
omega2max = 0.32845, p-value = 0.3719
Cramer-von Mises test of goodness-of-fit
Braun's adjustment using 4 groups
Null hypothesis: distribution ‘ppareto’
with parameters location = 0.18, shape = 0.9566756
Parameters assumed to have been estimated from data
data: dat$Conc
omega2max = 0.31391, p-value = 0.4015
```
From this output and using a level of significance of $p = 0.05$, we see that none of the distributions is implausible. However, if *forced* to choose just one distribution, we would choose the *Pareto* distribution (smaller values of the `omega2max` statistic are better). However, this does not mean that the gamma and lognormal distributions are of no value in describing the data. We can see from the plot below, that in fact both the gamma and lognormal distributions do a reasonable job over the range of toxicity values. The use of the Pareto may be a questionable choice given it is truncated at 0.18 (which is the minimum value of our toxicity data).
```{r echo=FALSE,fig.cap="Emprirical cdf (black); lognormal (green); gamma (blue); and Pareeto (red)", fig.width=7,fig.height=4.5}
library(EnvStats)
samp<-c(1.73,0.57,0.33,0.28,0.3,0.29,2.15,0.8,0.76,0.54,0.42,0.83,0.21,0.18,0.59)
samp<-sort(samp)
plot(ecdf(samp),main="Empirical and fitted SSDs",xlab="Concentration",ylab="Probability")
xx<-seq(0.01,3,by=0.01)
lines(xx,plnorm(xx,meanlog=-0.6695120,sd=0.7199573),col= "#77d408" )
lines(xx,pgamma(xx,shape = 2.0591977,scale = 0.3231032),col="#08afd4")
lines(xx,ppareto(xx, location = 0.1800000,shape = 0.9566756),col="red")
#lines(xx,(0.4419*plnorm(xx,meanlog=mean(log(samp[5:15])),sd=sd(log(samp[5:15])))+
# 0.5581*plnorm(xx,meanlog=mean(log(samp[1:10])),sd=sd(log(samp[1:10])))) ,col="#d40830")
```
As in the earlier example, we might expect to find a better fitting distribution by combining *all three distributions* using a *weighted SSD*.
The issue we face now is *how do we choose the weights* to reflect the relative fits of the three distributions? Like all tests of statistical significance, a *p-value* is computed from the value of the relevant *test statistic* - in this case, the value of the `omega2max` test statistic. For this particular test, it's a case of the smaller the better. From the output above we see that the `omega2max` values are $0.344$ for the gamma distribution, $0.328$ for the lognormal distribution, and $0.314$ for the Pareto distribution.
We might somewhat naively compute the relative weights as:
${w_1} = \frac{{{{0.344}^{ - 1}}}}{{\left( {{{0.344}^{ - 1}} + {{0.328}^{ - 1}} + {{0.314}^{ - 1}}} \right)}} = 0.318$ ${w_2} = \frac{{{{0.328}^{ - 1}}}}{{\left( {{{0.344}^{ - 1}} + {{0.328}^{ - 1}} + {{0.314}^{ - 1}}} \right)}} = 0.333$ and ${w_3} = \frac{{{{0.314}^{ - 1}}}}{{\left( {{{0.344}^{ - 1}} + {{0.328}^{ - 1}} + {{0.314}^{ - 1}}} \right)}} = 0.349$
(we use *reciprocals* since smaller values of `omega2max` represent better fits). As will be seen shortly - these are incorrect.
However, being based on a simplistic measure of discrepancy between the *observed* and *hypothesized* distributions, the `omega2max` statistic is a fairly 'blunt instrument' and has no grounding in information theory which *is* the basis for determining the weights that we seek.
The likelihood for our three distributions can be computed in `R` as follows: ```{r, echo=TRUE} sum(log(dgamma(dat$Conc,shape = 2.0591977,scale = 0.3231032))) sum(log(dlnorm(dat$Conc, meanlog = -0.6695120,sdlog = 0.7199573))) sum(log(EnvStats::dpareto(dat$Conc,location = 0.1800000, shape=0.9566756))) ``` From which the *AIC* values readily follow: ```{r echo=FALSE,results='markup'} dat<-c(1.73,0.57,0.33,0.28,0.3,0.29,2.15,0.8,0.76,0.54,0.42,0.83,0.21,0.18,0.59) k<-2 # number of parameters for each of the distributions # Gamma distribution aic1<-2*k-2*sum(log(dgamma(dat,shape = 2.0591977,scale = 0.3231032))) cat("AIC for gamma distribution =",aic1,"\n") # lognormal distribution aic2<-2*k-2*sum(log(dlnorm(dat, meanlog = -0.6695120,sdlog = 0.7199573))) cat("AIC for lognormal distribution =",aic2,"\n") # Pareto distribution aic3<-2*k-2*sum(log(EnvStats::dpareto(dat,location = 0.1800000, shape=0.9566756))) cat("AIC for Pareto distribution =",aic3,"\n") ```
As with the `omega2max` statistic, **smaller** values of *AIC* are better. Thus, a comparison of the AIC values above gives the ranking of distributional fits (best to worst) as: *Pareto > lognormal > gamma*
### Computing model weights from the `AIC` We will simply provide a formula for computing the model weights from the `AIC` values. More detailed information can be found [here](https://training.visionanalytix.com/ssd-model-averaging/). The *AIC* for the *i^th^* distribution fitted to the data is $$AI{C_i} = 2{k_i} - 2\ln \left( {{L_i}} \right) $$ where ${L_i}$ is the *i^th^ likelihood* and ${k_i}$ is the *number of parameters* for the *i^th^ distribution*. Next, we form the differences:$${\Delta _i} = AI{C_i} - AI{C_0}$$ where $AI{C_0}$ is the *AIC* for the **best-fitting** model (i.e.$AI{C_0} = \mathop {\min }\limits_i \left\{ {AI{C_i}} \right\}$ ). The *model-averaged weights* ${w_i}$ are then computed as:The model-averaged weights for the gamma, lognormal, and Pareto distributions used in the previous example can be computed 'manually' in `R` as follows: ```{r echo=TRUE,results='hide'} dat<-c(1.73,0.57,0.33,0.28,0.3,0.29,2.15,0.8,0.76,0.54,0.42,0.83,0.21,0.18,0.59) aic<-NULL k<-2 # number of parameters for each of the distributions aic[1]<-2*k-2*sum(log(dgamma(dat,shape = 2.0591977,scale = 0.3231032))) # Gamma distribution aic[2]<-2*k-2*sum(log(dlnorm(dat, meanlog = -0.6695120,sdlog = 0.7199573))) # lognormal distribution aic[3]<-2*k-2*sum(log(EnvStats::dpareto(dat,location = 0.1800000, shape=0.9566756))) # Pareto distribution delta<-aic-min(aic) # compute the delta values aic.w<-exp(-0.5*delta); aic.w<-round(aic.w/sum(aic.w),4) cat(" AIC weight for gamma distribution =",aic.w[1],"\n", "AIC weight for lognormal distribution =",aic.w[2],"\n", "AIC weight for pareto distribution =",aic.w[3],"\n") ``` ``` AIC weight for gamma distribution = 0.1191 AIC weight for lognormal distribution = 0.3985 AIC weight for pareto distribution = 0.4824 ```
Finally, let's look at the fitted *model-averaged SSD*: ```{r echo=FALSE,fig.cap="Empirical cdf (black) and model-averaged fit (magenta)",fig.width=8,fig.height=5} samp<-c(1.73,0.57,0.33,0.28,0.3,0.29,2.15,0.8,0.76,0.54,0.42,0.83,0.21,0.18,0.59) samp<-sort(samp) plot(ecdf(samp),main="Empirical and fitted SSDs",xlab="Concentration",ylab="Probability") xx<-seq(0.01,3,by=0.005) lines(xx,plnorm(xx,meanlog=-0.6695120,sd=0.7199573),col= "#959495",lty=2 ) lines(xx,pgamma(xx,shape = 2.0591977,scale = 0.3231032),col="#959495",lty=3) lines(xx,ppareto(xx, location = 0.1800000,shape = 0.9566756),col="#959495",lty=4) lines(xx,0.1191*pgamma(xx,shape = 2.0591977,scale = 0.3231032) + 0.3985*plnorm(xx,meanlog=-0.6695120,sd=0.7199573) + 0.4824*ppareto(xx, location = 0.1800000,shape= 0.9566756),col="#FF33D5",lwd=1.5) ``` As can be seen from the figure above, the model-averaged fit provides a very good fit to the empirical data. ### Correcting for distributions having differing numbers of parameters In deriving the AIC, Akaike had to make certain, strong assumptions. In addition, the bias factor (the $2k$ term) was derived from theoretical considerations (such as mathematical *expectation*) that relate to *infinite* sample sizes. For small sample sizes, the AIC is likely to select models having too many parameters (i.e models which *over-fit*) In 1978, Sugiura proposed a modification to the AIC to address this problem, although it too relied on a number of assumptions. This 'correction' to the AIC for small samples (referred to as $AI{{C}_{c}}$) is: