Sample Size Calculations for Evaluating Mediation
Eric Vittinghoff, Saunak Sen, and Charles E. McCulloch
University of California, San Francisco
Contact Eric Vittinghoff with questions. Last updated June 19, 2008.
This page makes available R functions for computing sample size for regression problems where the goal is to assess mediation of the effects of a primary predictor by an intermediate variable or mediator. The rationale, approach, and derivation of formulas for these functions will be described in our forthcoming paper, "Sample size calculations for evaluating mediation." Tabulations of the results for Figures 1-4 of the paper, as well as the results of a sensitivity analysis for the Cox model, are available here.
Mediation has been thought of in terms of the proportion of effect explained, or the relative attenuation of b1, the coefficient for the primary predictor X1, when the mediator, X2, is added to the model. The goal is to show that b1*, the coefficient for X1 in the reduced model (i.e., the model with only X1), differs from b1, its coefficient in the full model (i.e., the model with both X1 and the mediator X2). Our paper shows that if X1 and X2 are correlated, then showing that b2, the coefficient for X2, differs from zero is equivalent to showing b1* differs from b1. Thus the problem reduces to detecting an effect of X2, controlling for X1. In short, it amounts to the more familiar problem of inflating sample size to account for loss of precision due to adjustment for X1.
Our approach, following many earlier papers, is to approximate the expected information matrix from the regression model including both X1 and X2, to obtain the expected standard error of the estimate of b2, evaluated at the MLE. The sample size follows from comparing the Wald test statistic (i.e., the ratio of the estimate of b2 to its SE) to the standard normal distribution, with the expected value of the numerator and denominator of the statistic computed under the alternative hypothesis. This reflects the Wald test for the statistical significance of a coefficient implemented in most regression packages.
Our paper provides methods to calculate sample sizes for the mediation problem for the linear, logistic, Poisson, and Cox regression models. We consider four cases for each model:
- CpCm: continuous primary predictor, continuous mediator
- BpCm: binary primary predictor, continuous mediator
- CpBm: continuous primary predictor, binary mediator
- BpBm: binary primary predictor, binary mediator
For the linear model, the analytic solution for all four cases, based on a 1998 Statistics in Medicine paper by Hsieh et al., is to inflate sample size by a variance inflation factor, 1/(1-rho2), where rho is the correlation of X1 and X2. This also turns out to be the analytic solution in cases CpCm and BpCm for the Poisson model, and underlies approximate solutions for the logistic and Cox models. Our paper also gives an analytic solution for cases CpBm and BpBm for the Poisson model. Since analytic solutions are not available for the logistic and Cox models, we used a simulation approach to obtain the expected information matrix instead.
The file mediation.R includes R functions to calculate sample sizes for all four cases for these models, implementing approximate, analytic, and simulation approaches. The various functions are described in more detail below.
Disclaimer: The software distributed in this web page has no implied warranty. Use at your own risk.
ARGUMENTS AND DEFAULT VALUES
The arguments to the functions in mediation.R variously include
- b1, the regression coefficient for the primary predictor X1
- b2, the regression coefficient for the mediator X2
- rho, the correlation between X1 and X2
- sdx1, sdx2, the standard deviations (SDs) of X1 and X2
- f1, f2, the prevalence of binary X1 and X2
- sdy, the residual SD of the outcome for the linear model
- p, the marginal prevalence of the binary outcome in the logistic model
- m, the marginal mean of the count outcome in a Poisson model
- f, the proportion of uncensored observations for the Cox model
- fc, the proportion of observations censored early
- alpha, the one-sided type-I error rate
- gamma, the type-II error rate
- ns, the number of observations to be simulated
The arguments sdx1 and sdx2 are set by default to 1. This can be over-ridden for binary X2 (cases CpBm and BpBm) by setting sdx2 = sqrt(f2*(1-f2)). The default for sdy is also 1. In addition, the defaults for alpha and gamma are 0.025 and 0.2 respectively, resulting in two-sided 5% tests and 80% power. The default for fc is 0.0. Finally, the default for ns is 10,000. If you have a fast computer, it would make sense to increase this to 50,000 or 100,000 to reduce simulation error. Any of these defaults can be over-ridden by supplying arguments in the usual way.
LINEAR MODEL
For the linear model, a single function, getn.linear, implements the analytic solution for all four cases. The arguments are b2, rho, sdx2, sdy, alpha, and gamma. For cases CpBm and BpBm, set sdx2 = sqrt(f2*(1-f2)).
Example of function calls:
# CpCm, b2=0.5, rho=0.3, all other parameters at their defaults
> getn.linear(0.5, 0.3)
[1] 35
# BpBm, b2=0.75, f2=0.25, rho=0.3, sdy=3, gamma=0.1 (90% power)
> getn.linear(0.75, 0.3, sdx2=sqrt(0.25*0.75), sdy=3, gamma=0.1)
[1] 985
Three alternative functions are included for the linear model: getn.lineara, getn.linearb, and getn.linearc. These functions make it possible to supply other combinations of input parameters affecting mediation:
- b1*, the coefficient for the primary predictor in the reduced model excluding the mediator (called b1star in the code)
- b1, the coefficient for the primary predictor in the full model including the mediator
- PTE, the proportion of the effect of the primary predictor explained by the mediator, defined as (b1*-b1)/b1*
These alternative functions for the linear model require specification of an extra parameter, but are provided for convenience, along with two utility files for computing PTE and b1* from the other parameters. The required arguments are explained in comments within the R code.
LOGISTIC MODEL
For the logistic model, the approximate solution due to Hsieh is implemented in the function getn.logistic.approx, and can be used for all four cases. Arguments are p, b2, rho, sdx2, alpha, and gamma. For a binary mediator with prevalence f2, sdx2 should be reset to sqrt(f2*(1-f2)).
Our paper shows that simulating the information matrix of the logistic model provides somewhat more accurate sample size estimates than the Hsieh approximation. The functions for cases CpCm, BpCm, CpBm, and BpBm are respectively getn.logistic.ccs, getn.logistic.bcs, getn.logistic.cbs, and getn.logistic.bbs, following a naming convention we also use for the Poisson and Cox models. Arguments for these functions include p, b1, sdx1 or f1, b2, sdx2 or f2, rho, alpha, gamma, and ns. As in other functions, sdx1, sdx2, alpha, and gamma are set to the defaults listed above. These four functions call two utility functions, getb0 (to calculate the intercept parameter from the others) and antilogit, which are supplied in mediation.R.
Examples of function calls:
# CpBm, p=0.25, b1=log(1.5), sdx1=4.5, b2=log(.5), f2=0.5, rho = 0.5
> getn.logistic.approx(0.25, log(0.5), 0.5, sdx2=0.5)
[1] 465
> getn.logistic.cbs(0.5, log(1.5), log(0.5), 0.5, 0.5, sdx1=4.5)
[1] 514
> getn.logistic.cbs(0.5, log(1.5), log(0.5), 0.5, 0.5, sdx1=4.5)
[1] 521
> getn.logistic.cbs(0.5, log(1.5), log(0.5), 0.5, 0.5, sdx1=4.5, ns=50000)
[1] 518
POISSON MODEL
The function implementing the approximate solution based on the variance inflation factor is getn.poisson.approx, and can be used for all four cases. Arguments are EY (the marginal mean of the Poisson outcome), b2, sdx2, rho, alpha and gamma, with sdx2, alpha and gamma set to the usual defaults; use sdx2=sqrt(f2*(1-f2)) for a binary mediator with prevalence f2 (cases CpBm and BpBm).
For cases CpCm and BpCm (continuous mediators), the approximate formula is also the analytic solution. For these cases, we supply redundant functions getn.poisson.cc and getn.poisson.bc, with the same arguments and defaults as for getn.poisson.approx (it's the same function). For the two cases with binary mediators, the functions are getn.poisson.cb and getn.poisson.bb. In addition to m, b2, f2, rho, alpha, and gamma, b1 and sdx1 or f1 must be specified. Defaults are as usual.
For completeness we also include four functions using simulation for the Poisson model: getn.poisson.ccs, getn.poisson.bcs, getn.poisson.cbs, and getn.poisson.bbs. As in the logistic case, these require arguments b1 and sdx1 or f1. For this case, however, the analytic functions are faster, avoid simulation error, and should be used. We include these functions as templates that could be adapted to other joint predictor distributions.
Examples of function calls:
# BpBm, m=0.5, b1=log(1.4), f1=0.25, b2=log(1.25), f2=0.25, rho=0.3
> getn.poisson.approx(0.5, log(1.25), 0.3, sdx2=sqrt(0.25*0.75))
[1] 1848
> getn.poisson.bb(0.5, log(1.4), 0.25, log(1.25), 0.25, 0.3)
[1] 1630
> getn.poisson.bbs(0.5, log(1.4), 0.25, log(1.25), 0.25, 0.3)
[1] 1630
COX MODEL
The function implementing the approximate solution, using the variance inflation factor and derived in a 2000 Statistics in Medicine paper by Schmoor et al., is getn.cox.approx, and can be used for all four cases. Arguments are b2, sdx2, rho, alpha, gamma, and f. For binary X2 set sdx2 = sqrt(f2*(1-f2)).
Simulation studies in our paper show that the approximation works very well for cases CpCm and BpCm (continuous mediators), but is a bit less accurate for cases CpBm and BpBm (binary mediators). We get some improvement for those cases using the simulation approach. This approach is implemented for all four, as functions getn.cox.ccs, getn.cox.bcs, getn.cox.cbs, and getn.cox.bbs. Arguments are b1, sdx1 or f1, b2, sdx2 or f2, rho, alpha, gamma, f, and ns, with defaults as described above. Slight variants of these functions, getn.cox.ccs2, getn.cox.bcs2, getn.cox.cbs2, and getn.cox.bbs2, make it possible to allow for early censoring of a fraction fc of observations; but in our experience this has virtually no effect, even with values of fc of 0.5. The default for fc is 0.
Examples of function calls:
# BpBm, b1=log(2), f1=0.5, b2=log(1.5), f2=0.25, rho=0.45, f=0.2
> getn.cox.approx(log(1.5), 0.45, 0.2, sdx2=sqrt(0.25*0.75))
$d
[1] 319
$n
[1] 1596
> getn.cox.bbs(log(2), 0.5, log(1.5), 0.25, 0.45, 0.2)
$d
[1] 248
$n
[1] 1242
> getn.cox.bbs(log(2), 0.5, log(1.5), 0.25, 0.45, 0.2, ns=50000)
$d
[1] 246
$n
[1] 1228
SAMPLE SIZES FOR A PRIMARY PREDICTOR IN THE PRESENCE OF CONFOUNDING
Our functions are also generally applicable to the analogous problem of calculating sample size adequate to detect the effect of a primary predictor in the presence of confounding. Simply treat X2 as the primary predictor and consider X1 the confounder.

