http://biostat.ucsf.edu/mediation.html
UCSF
Department of Epidemiology & Biostatistics

Division of Biostatistics

Sample Size Calculations for Evaluating Mediation

Eric Vittinghoff, Saunak Sen, and Charles E. McCulloch
University of California, San Francisco

Contact Eric Vittinghoff with questions. Last updated June 19, 2008.

R functions

This page makes available R functions for computing sample size for regression problems where the goal is to assess mediation of the effects of a primary predictor by an intermediate variable or mediator. The rationale, approach, and derivation of formulas for these functions will be described in our forthcoming paper, "Sample size calculations for evaluating mediation." Tabulations of the results for Figures 1-4 of the paper, as well as the results of a sensitivity analysis for the Cox model, are available here.

Tables for Figures 1-4

Sensitivity Analysis

Mediation has been thought of in terms of the proportion of effect explained, or the relative attenuation of b1, the coefficient for the primary predictor X1, when the mediator, X2, is added to the model. The goal is to show that b1*, the coefficient for X1 in the reduced model (i.e., the model with only X1), differs from b1, its coefficient in the full model (i.e., the model with both X1 and the mediator X2). Our paper shows that if X1 and X2 are correlated, then showing that b2, the coefficient for X2, differs from zero is equivalent to showing b1* differs from b1. Thus the problem reduces to detecting an effect of X2, controlling for X1. In short, it amounts to the more familiar problem of inflating sample size to account for loss of precision due to adjustment for X1.

Our approach, following many earlier papers, is to approximate the expected information matrix from the regression model including both X1 and X2, to obtain the expected standard error of the estimate of b2, evaluated at the MLE. The sample size follows from comparing the Wald test statistic (i.e., the ratio of the estimate of b2 to its SE) to the standard normal distribution, with the expected value of the numerator and denominator of the statistic computed under the alternative hypothesis. This reflects the Wald test for the statistical significance of a coefficient implemented in most regression packages.

Our paper provides methods to calculate sample sizes for the mediation problem for the linear, logistic, Poisson, and Cox regression models. We consider four cases for each model:

For the linear model, the analytic solution for all four cases, based on a 1998 Statistics in Medicine paper by Hsieh et al., is to inflate sample size by a variance inflation factor, 1/(1-rho2), where rho is the correlation of X1 and X2. This also turns out to be the analytic solution in cases CpCm and BpCm for the Poisson model, and underlies approximate solutions for the logistic and Cox models. Our paper also gives an analytic solution for cases CpBm and BpBm for the Poisson model. Since analytic solutions are not available for the logistic and Cox models, we used a simulation approach to obtain the expected information matrix instead.

The file mediation.R includes R functions to calculate sample sizes for all four cases for these models, implementing approximate, analytic, and simulation approaches. The various functions are described in more detail below.

Disclaimer: The software distributed in this web page has no implied warranty. Use at your own risk.

ARGUMENTS AND DEFAULT VALUES

The arguments to the functions in mediation.R variously include

The arguments sdx1 and sdx2 are set by default to 1. This can be over-ridden for binary X2 (cases CpBm and BpBm) by setting sdx2 = sqrt(f2*(1-f2)). The default for sdy is also 1. In addition, the defaults for alpha and gamma are 0.025 and 0.2 respectively, resulting in two-sided 5% tests and 80% power. The default for fc is 0.0. Finally, the default for ns is 10,000. If you have a fast computer, it would make sense to increase this to 50,000 or 100,000 to reduce simulation error. Any of these defaults can be over-ridden by supplying arguments in the usual way.

LINEAR MODEL

For the linear model, a single function, getn.linear, implements the analytic solution for all four cases. The arguments are b2, rho, sdx2, sdy, alpha, and gamma. For cases CpBm and BpBm, set sdx2 = sqrt(f2*(1-f2)).

Example of function calls:

# CpCm, b2=0.5, rho=0.3, all other parameters at their defaults
> getn.linear(0.5, 0.3)
[1] 35

# BpBm, b2=0.75, f2=0.25, rho=0.3, sdy=3, gamma=0.1 (90% power)
> getn.linear(0.75, 0.3, sdx2=sqrt(0.25*0.75), sdy=3, gamma=0.1)
[1] 985

Three alternative functions are included for the linear model: getn.lineara, getn.linearb, and getn.linearc. These functions make it possible to supply other combinations of input parameters affecting mediation:

These alternative functions for the linear model require specification of an extra parameter, but are provided for convenience, along with two utility files for computing PTE and b1* from the other parameters. The required arguments are explained in comments within the R code.

LOGISTIC MODEL

For the logistic model, the approximate solution due to Hsieh is implemented in the function getn.logistic.approx, and can be used for all four cases. Arguments are p, b2, rho, sdx2, alpha, and gamma. For a binary mediator with prevalence f2, sdx2 should be reset to sqrt(f2*(1-f2)).

Our paper shows that simulating the information matrix of the logistic model provides somewhat more accurate sample size estimates than the Hsieh approximation. The functions for cases CpCm, BpCm, CpBm, and BpBm are respectively getn.logistic.ccs, getn.logistic.bcs, getn.logistic.cbs, and getn.logistic.bbs, following a naming convention we also use for the Poisson and Cox models. Arguments for these functions include p, b1, sdx1 or f1, b2, sdx2 or f2, rho, alpha, gamma, and ns. As in other functions, sdx1, sdx2, alpha, and gamma are set to the defaults listed above. These four functions call two utility functions, getb0 (to calculate the intercept parameter from the others) and antilogit, which are supplied in mediation.R.

Examples of function calls:

# CpBm, p=0.25, b1=log(1.5), sdx1=4.5, b2=log(.5), f2=0.5, rho = 0.5
> getn.logistic.approx(0.25, log(0.5), 0.5, sdx2=0.5)
[1] 465
> getn.logistic.cbs(0.5, log(1.5), log(0.5), 0.5, 0.5, sdx1=4.5)
[1] 514
> getn.logistic.cbs(0.5, log(1.5), log(0.5), 0.5, 0.5, sdx1=4.5)
[1] 521
> getn.logistic.cbs(0.5, log(1.5), log(0.5), 0.5, 0.5, sdx1=4.5, ns=50000)
[1] 518

POISSON MODEL

The function implementing the approximate solution based on the variance inflation factor is getn.poisson.approx, and can be used for all four cases. Arguments are EY (the marginal mean of the Poisson outcome), b2, sdx2, rho, alpha and gamma, with sdx2, alpha and gamma set to the usual defaults; use sdx2=sqrt(f2*(1-f2)) for a binary mediator with prevalence f2 (cases CpBm and BpBm).

For cases CpCm and BpCm (continuous mediators), the approximate formula is also the analytic solution. For these cases, we supply redundant functions getn.poisson.cc and getn.poisson.bc, with the same arguments and defaults as for getn.poisson.approx (it's the same function). For the two cases with binary mediators, the functions are getn.poisson.cb and getn.poisson.bb. In addition to m, b2, f2, rho, alpha, and gamma, b1 and sdx1 or f1 must be specified. Defaults are as usual.

For completeness we also include four functions using simulation for the Poisson model: getn.poisson.ccs, getn.poisson.bcs, getn.poisson.cbs, and getn.poisson.bbs. As in the logistic case, these require arguments b1 and sdx1 or f1. For this case, however, the analytic functions are faster, avoid simulation error, and should be used. We include these functions as templates that could be adapted to other joint predictor distributions.

Examples of function calls:

# BpBm, m=0.5, b1=log(1.4), f1=0.25, b2=log(1.25), f2=0.25, rho=0.3
> getn.poisson.approx(0.5, log(1.25), 0.3, sdx2=sqrt(0.25*0.75))
[1] 1848
> getn.poisson.bb(0.5, log(1.4), 0.25, log(1.25), 0.25, 0.3)
[1] 1630
> getn.poisson.bbs(0.5, log(1.4), 0.25, log(1.25), 0.25, 0.3)
[1] 1630

COX MODEL

The function implementing the approximate solution, using the variance inflation factor and derived in a 2000 Statistics in Medicine paper by Schmoor et al., is getn.cox.approx, and can be used for all four cases. Arguments are b2, sdx2, rho, alpha, gamma, and f. For binary X2 set sdx2 = sqrt(f2*(1-f2)).

Simulation studies in our paper show that the approximation works very well for cases CpCm and BpCm (continuous mediators), but is a bit less accurate for cases CpBm and BpBm (binary mediators). We get some improvement for those cases using the simulation approach. This approach is implemented for all four, as functions getn.cox.ccs, getn.cox.bcs, getn.cox.cbs, and getn.cox.bbs. Arguments are b1, sdx1 or f1, b2, sdx2 or f2, rho, alpha, gamma, f, and ns, with defaults as described above. Slight variants of these functions, getn.cox.ccs2, getn.cox.bcs2, getn.cox.cbs2, and getn.cox.bbs2, make it possible to allow for early censoring of a fraction fc of observations; but in our experience this has virtually no effect, even with values of fc of 0.5. The default for fc is 0.

Examples of function calls:

# BpBm, b1=log(2), f1=0.5, b2=log(1.5), f2=0.25, rho=0.45, f=0.2
> getn.cox.approx(log(1.5), 0.45, 0.2, sdx2=sqrt(0.25*0.75))
$d
[1] 319
$n
[1] 1596

> getn.cox.bbs(log(2), 0.5, log(1.5), 0.25, 0.45, 0.2)
$d
[1] 248
$n
[1] 1242

> getn.cox.bbs(log(2), 0.5, log(1.5), 0.25, 0.45, 0.2, ns=50000)
$d
[1] 246
$n
[1] 1228

SAMPLE SIZES FOR A PRIMARY PREDICTOR IN THE PRESENCE OF CONFOUNDING

Our functions are also generally applicable to the analogous problem of calculating sample size adequate to detect the effect of a primary predictor in the presence of confounding. Simply treat X2 as the primary predictor and consider X1 the confounder.