Introduction
Permutation tests are a class of widely-applicable non-parametric tests. They use random shuffles of the data to get the correct distribution of a test statistic under a null hypothesis. They provide valid tests (that is, tests with the advertised Type I error), although they are more computationally intensive than standard statistical tests. In this note, we will examine how they work using a few examples.
Permutation tests are widely used in genetics and genomics. They are most useful when we have insufficient information about the distribution of the data, are uncomfortable making assumptions about the distribution, or if the distribution of the test statistic is not easily computed. They are used in candidate-gene and genomewide association studies, as well as in family-based association tests. Below, we show how a permutation test works using hypertension data in a mouse backcross. Next we analyze a gene expression dataset where a standard permutation would be inappropriate, and a stratified test should be used. We follow with an example from a diabetes study and conclude with a summary.
Hypertension data
Systolic blood pressure was measured in 250 progeny from a backcross between two mouse strains. For simplicity, we focus on 50 (randomly chosen) mice genotyped at the D4Mit214 marker (although more markers were genotyped). We want to detect association between the D4Mit214 marker genotype and blood pressure. The table below shows the systolic blood pressure (in mm of Hg) by the marker genotype, BA (heterozygous) or BB (homozygous) arranged in increasing order.
BA: 86 88 89 89 92 93 94 94 94 95 95 96 96 97 97 98 98 99 99 101 106 107 110 113 116 118 BB: 89 90 92 93 93 96 99 99 99 102 103 104 105 106 106 107 108 108 110 110 112 114 116 116
A cursory look suggests that mice with the BB genotype have higher blood pressure compared to those with the BA genotype. We can make a more objective assessment using a t-test testing the null hypothesis that the mean blood pressure in the two genotype groups is equal.
To do this, we first read in the data in Stata.
/* read data */ . insheet using hyper.csv (22 vars, 50 obs)
Next we perform a t-test.
/* perform t-test allowing for unequal variances */
. ttest bp, by(d4mit214) unequal
Two-sample t test with unequal variances
+------------------------------------------------------------------------------
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
+---------+--------------------------------------------------------------------
BA | 26 98.48077 1.657301 8.45061 95.06749 101.894
BB | 24 103.1833 1.637388 8.021529 99.79614 106.5705
+---------+--------------------------------------------------------------------
combined | 50 100.738 1.202249 8.501185 98.32199 103.154
+---------+--------------------------------------------------------------------
diff | -4.702564 2.329739 -9.386925 -.0182032
+------------------------------------------------------------------------------
diff = mean(BA) - mean(BB) t = -2.0185
Ho: diff = 0 Satterthwaite's degrees of freedom = 47.958
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.0246 Pr(|T| > |t|) = 0.0492 Pr(T > t) = 0.9754
The t-test indicates that the mean blood pressure in the BA genotype group is 4.7 units lower than that BB genotype group. The p-value for the two-sided test is 0.049 — just “significant.”
Note, however, that the t-test makes two important assumptions — (a) the measurements are independent and (b) the data is approximately normally distributed. The test is robust to deviations the normality assumption if the underlying distribution is symmetric, unimodal, and continuous. There are no a priori reasons to expect that blood pressure will be normally distributed. The sample size is too small to check the assumption reliably (low power), so we resort to a visual exploration using a stem and leaf plot.
There is reason to suspect that the data may not have a symmetric unimodal distribution conditional on the genotype. That would raise questions about the validity of the t-test. A generally applicable solution in this setting is to use a permutation test.
/* sort data by d4mit214 marker, and make a stem and leaf plot rounding data to the nearest integer */ . by d4mit214, sort: stem bp, round(1) +------------------------------------------------------------------------------- -> d4mit214 = BA Stem-and-leaf plot for bp bp rounded to integers 8. | 6899 9* | 23444 9. | 5566778899 10* | 1 10. | 67 11* | 03 11. | 68 +------------------------------------------------------------------------------- -> d4mit214 = BB Stem-and-leaf plot for bp bp rounded to integers 8. | 9 9* | 0233 9. | 6999 10* | 234 10. | 566788 11* | 0024 11. | 66
Shuffling the data
If we shuffle the blood pressures randomly and assign them to genotype groups, then we will destroy any association between the genotype and blood pressure (the null hypothesis). Therefore, the distribution of the test statistic obtained from repeated random shuffling, can be used to approximate the null distribution of the test statistic.
A particular shuffle the blood pressure values may look like this (labeling a value in the original BA group by a, and that in the BB group by b.):
BA: 101a 104b 98a 94a 92a 99b 116b 103b 86a 90b 88a 93b 112b 106b
107b 97a 93a 113a 110a 118a 99b 95a 96a 99a 110b 89a
BB: 105b 108b 116a 94a 106b 95a 94a 106a 108b 114b 99b 89b 99a 96b
102b 110b 97a 98a 93b 96a 89a 92b 116b 107a
The mean blood pressure is now 100.3 and 101.2 in the BA and BB genotype groups respectively. The difference is now smaller (0.9) relative to the actual observed difference, -4.7. We can now ask the question that any test would, if there was no association between blood pressure and genotype, what would the difference in blood pressures be? How does that compare with what we saw in the sample? To answer that question, we repeat the shuffles and record the difference between the means in the two genotype groups. After 400 shuffles the distribution of the difference between means in the BA and BB groups looked like this.
/* make a new variable that is 1 if d4mit214 is BA and 0 otherwise */
generate geno=0 if d4mit214=="BB"
replace geno=1 if d4mit214=="BA"
/* we can get the difference between the means in the two genotype
groups using the regress command */
. regress bp geno
Source | SS df MS Number of obs = 50
+-------------+------------------------------ F( 1, 48) = 4.06
Model | 275.984052 1 275.984052 Prob > F = 0.0496
Residual | 3265.25335 48 68.0261114 R-squared = 0.0779
+-------------+------------------------------ Adj R-squared = 0.0587
Total | 3541.2374 49 72.270151 Root MSE = 8.2478
+------------------------------------------------------------------------------
bp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
+-------------+----------------------------------------------------------------
geno | -4.702564 2.334697 -2.01 0.050 -9.396787 -.0083409
_cons | 103.1833 1.683574 61.29 0.000 99.79828 106.5684
+------------------------------------------------------------------------------
. permute bp "regress bp geno" _b, saving(perm400) reps(400)
. clear
. use perm400
. stem b_geno, line(1)
Stem-and-leaf plot for b_geno (_b[geno])
b_geno rounded to nearest multiple of .1
plot in units of .1
-7* | 7
-6* | 0
-5* | 533
-4* | 988776643332200
-3* | 777776655555544444332221111000
-2* | 99887775544443332222211111111100000
-1* | 999987777766655554444433332222221111100000000
-0* | 99999998888888888877777777666555555444333332222222111
0* | 0000000000011112222222233333334444445555555555667777788888899999999999
1* | 0000011111112222222333334444445555555556666667777888999999
2* | 0000000122222233344445555556666667777889999
3* | 0000001123333344566679
4* | 000112223333566789
5* | 1456
6* | 35
And after a whole lot more (100,000 shuffles), we get this from Stata.
. permute bp "regress bp geno" _b, reps(100000)
command: regress bp geno
statistics: b_geno = _b[geno]
b_cons = _b[_cons]
permute var: bp
Monte Carlo permutation statistics Number of obs = 50
Replications = 100000
+------------------------------------------------------------------------------
T | T(obs) c n p=c/n SE(p) [95% Conf. Interval]
+-------------+----------------------------------------------------------------
b_geno | -4.702564 4986 1.0e+05 0.0499 0.0007 .0485191 .0512271
b_cons | 103.1833 2485 1.0e+05 0.0249 0.0005 .023894 .0258337
+------------------------------------------------------------------------------
Note: confidence intervals are with respect to p=c/n
Note: c = #{|T| >= |T(obs)|}
It tells us that about 4.99% of the shuffled differences were greater in magnitude than the observed difference of -4.7. This implies a p-value of 0.0499, but notice that there is a confidence interval for this p-value. Why? Because this p-value is estimated from the random shuffles. In general, the smaller the p-value, the more shuffles you will need to estimate it accurately (the rarer the event, the longer you have to look to find it).
In this case the p-value of the permutation test is almost identical to that obtained from the t-test. However, that need not be the case (and is an indication of the robustness of the t-test).
Let us now take a moment to examine the three choices we made in performing the permutation test. First, we decided what to permute, which depended on the null hypothesis being tested — mean blood pressure in the two genotype groups is the same. Second, we chose a test statistic to evaluate the null hypothesis; this has an impact on power. Third, we chose the number of permutations; this determines the accuracy of the estimated p-value.
Exercise: Wilcoxon ranksum test
We could have used the rank sum test (see below). What assumptions (if any) does it make? How would you evaluate the evidence it provides?
. ranksum bp, by(geno)
Two-sample Wilcoxon rank-sum (Mann-Whitney) test
geno | obs rank sum expected
-------------+---------------------------------
0 | 24 717 612
1 | 26 558 663
-------------+---------------------------------
combined | 50 1275 1275
unadjusted variance 2652.00
adjustment for ties -0.25
----------
adjusted variance 2651.75
Ho: bp(geno==0) = bp(geno==1)
z = 2.039
Prob > |z| = 0.0414
A regression example
Insulin area under the curve (IAUC) is an index of the magnitude of insulinemia. The table below gives the age in years and IAUC for 17 subjects. We want to know if younger patients have a greater IAUC.
/* read data */
. insheet using obesity.csv
/* display data */
. list
+----------------+
| age iauc |
|----------------|
1. | 43 12.51976 |
2. | 36 13.07431 |
3. | 38 13.43339 |
4. | 28 13.67507 |
5. | 18 13.67342 |
|----------------|
6. | 41 13.04644 |
7. | 49 12.2841 |
8. | 41 13.00492 |
9. | 22 14.98371 |
10. | 45 13.96624 |
|----------------|
11. | 42 14.69479 |
12. | 51 13.08556 |
13. | 33 12.73661 |
14. | 23 14.2175 |
15. | 43 14.96251 |
+----------------+
We can measure the association using the correlation coefficient, or by using linear regression. To attach a significance value to the correlation, we can use a permutation test.
. permute age "corr age iauc" r(rho), reps(100000)
command: corr age iauc
statistic: _pm_1 = r(rho)
permute var: age
Monte Carlo permutation statistics Number of obs = 15
Replications = 100000
+------------------------------------------------------------------------------
T | T(obs) c n p=c/n SE(p) [95% Conf. Interval]
+-------------+----------------------------------------------------------------
_pm_1 | -.3740956 16874 1.0e+05 0.1687 0.0012 .1664234 .1710759
+------------------------------------------------------------------------------
Note: confidence interval is with respect to p=c/n
Note: c = #{|T| >= |T(obs)|}
In this case we find the correlation coefficient to be not significant with a p-value of 0.17. If we had dichotomized both variables at the median, we could have constructed a contingency table and used a $\chi^2$ test.
. generate old=1 if age>=41
. replace old=0 if age<41
. gen iauc2 = 1 if iauc >= 13.43 & iauc ~= .
. replace iauc2 = 0 if iauc < 13.43
. tab old iauc2, chi2
| iauc2
old | 0 1 | Total
+-----------+----------------------+----------
0 | 2 5 | 7
1 | 5 3 | 8
+-----------+----------------------+----------
Total | 7 8 | 15
Pearson chi2(1) = 1.7267 Pr = 0.189
With such small cell counts, we cannot trust the $\chi^2$ approximation. So we can just use the permutation test on the $\chi^2$ statistics of the 2x2 table counts. The p-value is 0.32, which indicates that the association is not significant.
. permute old "tab old iauc2, chi2 " r(chi2), reps(100000)
command: tab old iauc2 , chi2
statistic: _pm_1 = r(chi2)
permute var: old
Monte Carlo permutation statistics Number of obs = 15
Replications = 100000
+------------------------------------------------------------------------------
T | T(obs) c n p=c/n SE(p) [95% Conf. Interval]
+-------------+----------------------------------------------------------------
_pm_1 | 1.726722 31818 1.0e+05 0.3182 0.0015 .3152935 .3210771
+------------------------------------------------------------------------------
Note: confidence interval is with respect to p=c/n
Note: c = #{|T| >= |T(obs)|}
Exercise: Fisher’s exact test:
We could have used Fisher’s exact test. How is it connected to a permutation test?
. tab old iauc2, exact
| iauc2
old | 0 1 | Total
+-----------+----------------------+----------
0 | 2 5 | 7
1 | 5 3 | 8
+-----------+----------------------+----------
Total | 7 8 | 15
Fisher's exact = 0.315
1-sided Fisher's exact = 0.214
Stratified shuffles
Despite the apparent ease and automatic nature of permutation testing, we have to avoid inappropriate use. A key element is recognizing what is permutable (exchangeable) under the null hypothesis. To examine this issue further, we consider an example where the usual permutation procedure would not be applicable.
A genomewide survey of gene expression in mouse spinal cord tissue was conducted using a microarray with approximately 48K probes. The study extracted mRNA from six mice from each of two strains. The mRNA from three mice from each strain was hybridized onto each of two chips. For simplicity, we focus on the gene expression measured by one particular probe (we will revisit the genomewide data in a future lecture).
The experimental design of the study and the hybridization intensity from one probe is shown below.
/* set the memory to be higher than usual */
. set memory 1G
/* read in data */
. insheet using gene-expr.csv
(3 vars, 12 obs)
/* multiply gene expression values by 10 for display */
. replace y = 10*y
(12 real changes made)
/* display data with separator every 3 rows */
. list, separator(3)
+------------------------------+
| y strainid chipid |
|------------------------------|
1. | 95.70958 b6 chip1 |
2. | 95.09264 b6 chip1 |
3. | 95.38004 b6 chip1 |
|------------------------------|
4. | 95.48625 lp chip1 |
5. | 95.89638 lp chip1 |
6. | 97.81987 lp chip1 |
|------------------------------|
7. | 95.98891 b6 chip2 |
8. | 95.38393 b6 chip2 |
9. | 95.49183 b6 chip2 |
|------------------------------|
10. | 97.56024 lp chip2 |
11. | 98.54643 lp chip2 |
12. | 97.60174 lp chip2 |
+------------------------------+
We want to see if the gene expression, as measured by this probe, is the same in the two strains (B6, and LP). However, the measurements were assayed in two different chips (Chip1 and Chip2), and it is possible that there are some chip-specific effects.
Here is a stem and leaf plot of the data which suggests that the data may be skewed and bimodal, and hence a non-parametric test may be desirable.
/* stem and leaf plot rounding to the first decimal place */ . stem y, round(0.1) Stem-and-leaf plot for y y rounded to nearest multiple of 0.1 plot in units of 0.1 95* | 144 95. | 5579 96* | 0 96. | 97* | 97. | 668 98* | 98. | 5
Under the null hypothesis of no strain effect, the gene expression levels are exchangable within chips, but not between chips. To be exchangable between chips as well, we have to assume that the chip effect is absent as well. Thus, the correct permutation procedure to test the null hypothesis of no strain effect (in the presence of a possible chip effect) would permute expression values within chip only. If we permute the gene expression values among all 12 rows (chips and strains), then we would be making the additional assumption that there is no chip effect. The null hypothesis of no strain effect does not imply no chip effect. For this reason, the usual permutation is invalid.
To perform a stratified permutation test we have to create some dummy variables.
/* create dummy variables */
. xi i.strainid i.chipid
i.strainid _Istrainid_1-2 (_Istrainid_1 for strainid==b6 omitted)
i.chipid _Ichipid_1-2 (_Ichipid_1 for chipid==chip1 omitted)
/* show them */
. list, separator(3)
+----------------------------------------------------+
| y strainid chipid _Istra~2 _Ichip~2 |
|----------------------------------------------------|
1. | 95.70958 b6 chip1 0 0 |
2. | 95.09264 b6 chip1 0 0 |
3. | 95.38004 b6 chip1 0 0 |
|----------------------------------------------------|
4. | 95.48625 lp chip1 1 0 |
5. | 95.89638 lp chip1 1 0 |
6. | 97.81987 lp chip1 1 0 |
|----------------------------------------------------|
7. | 95.98891 b6 chip2 0 1 |
8. | 95.38393 b6 chip2 0 1 |
9. | 95.49183 b6 chip2 0 1 |
|----------------------------------------------------|
10. | 97.56024 lp chip2 1 1 |
11. | 98.54643 lp chip2 1 1 |
12. | 97.60174 lp chip2 1 1 |
+----------------------------------------------------+
Now we perform a stratified permutation test by restricting the permutations by strata defined by the chipid variable.
. permute y "regress y _Istrainid _Ichipid" _b, reps(1000000) strata(chipid)
command: regress y _Istrainid _Ichipid
statistics: b__Istra~2 = _b[_Istrainid_2]
b__Ichip~2 = _b[_Ichipid_2]
b_cons = _b[_cons]
permute var: y
Monte Carlo permutation statistics Number of obs = 12
Number of strata = 2
Replications = 1000000
+------------------------------------------------------------------------------
T | T(obs) c n p=c/n SE(p) [95% Conf. Interval]
+-------------+----------------------------------------------------------------
b__Istra~2 | 1.644 10264 1.0e+06 0.0103 0.0001 .0100674 .0104635
b__Ichip~2 | .8647206 1.0e+06 1.0e+06 1.0000 0.0000 .9999963 1
b_cons | 95.07545 1.0e+06 1.0e+06 0.9975 0.0001 .9973818 .9975793
+------------------------------------------------------------------------------
Note: confidence intervals are with respect to p=c/n
Note: c = #{|T| >= |T(obs)|}
We find that the p-value corresponding to the null hypothesis of no strain effect is 0.01. We used a million permutations here, although that is not necessary. It is possible to enumerate all possible permutations of the data which is ${6 \choose 3} \times {6 \choose 3} = 20 \times 20 = 400$.
More complex settings
So far we have considered permutation tests in settings where there is a single primary outcome variable and a single predictor variable. The stratified shuffle example showed us how we can design permutation tests when there is a categorical confounding variable. The common theme is that under the null hypothesis, we could regard certain individuals as exchangable, and we could permute the outcome and predictor appropriately (within strata, if appropriate). This approach no longer works when (a) individuals in a population are correlated (as in a genetically structured population) and (b) when we have non-categorical covariates. Below, we outline what may be done in such circumstances.
When we have covariates, two possibilities can be explored. For binary ourcomes, one can use biased permutations where the outcomes are scrambled with respect to the primary predictor conditional on the covariates. For quantitative outcomes, we can permute the residuals conditional on the covariates (excluding the primary predictor of interest).
If individuals are correlated, as would be the case in family-based studies, two clases of approaches can be considered in genetic studies. One approach is to simulate genotypes conditional on the founder genotypes. Another approach is to permute residuals, such that the correlated nature of the data is preserved.
Questions
What are some examples in genetics that use permutation tests?
Association studies, where several polymorphisms are examined for association with a phenotype, may use permutation tests to correct for multiple comparisons. Single-marker family-based association tests may also use permutation to obtain the null distribution of the test statistic.
We get a p-value from permutation tests. Is that all we need?
P-values are a useful objective statistical quantity, however, they do not give a complete picture of the weight of evidence. Except for some selected settings (such as for goodness-of-fit tests), p-values are almost always better used in conjunction with confidence intervals.
At first glance, there does not appear to be a simple way of generating confidence intervals from permutation tests. However, a trick, called inverting a test, allows us to get confidence intervals from permutation tests (or any other test, for that matter).
To invert a test, we perform a sequence of statistical tests with different parameter values. The hypotheses that are accepted at the $\alpha$ level of significance, will fall in the $(1{-}\alpha)$ confidence interval. We illustrate this with an example.]
Consider our blood pressure example. We performed 400 permutations (initially), to test the hypothesis that there was no difference in blood pressures between the two genotype groups. We can test the hypothesis that the difference in blood pressure is 2 (mm of Hg), and perform a permutation test. Then we can pick 3 (mm of Hg) and so on. Hypotheses that are accepted with a 5% level of significance are contained in a 95% confidence interval. Following this process, it can be seen that (left as an exercise) that an approximate 95% confidence interval would be ${-}4.7 \pm 0.5$ (mm of Hg).
How many permutations do we need?
As many as needed to estimate the p-value to the desired precision. When you perform permutations in Stata, the output contains the confidence interval for the p-value in addition to the estimate. This confidence interval is a confidence interval for a binomial proportion (since that’s what we are estimating).
The approximate 95% confidence interval for a proportion estimated to be $\hat{p}$ from $m$ samples (permutations) is
for small $\hat{p}$. Thus, to estimate a p-value of the order of 0.01, to two decimal places, we must have
which implies that
or $m>1600.$ Since we usually do not care about p-values smaller than 0.01, a good rule of thumb is to have between 1,000 and 10,000 permutations.
Are permutation tests the same as bootstrapping?
No. They have a similar in that they are both Monte Carlo methods (any procedure that uses simulation). More specifically, they are both resampling methods, that repeatedly reuse the data by (re)sampling from it. Bootstrap methods also have very general applicability, but their validity is not generally guaranteed and must be used with some caution. For example, our research has shown that confidence intervals for genetic loci obtained using bootstrap methods may not be reliable.
Summary
Permutation tests are a class of non-parametric tests requiring few assumptions that can be used when the assumptions of off-the-shelf tests do not hold (or there is insufficient information about their applicability).
They have three ingredients: (1) the null hypothesis (which determines what we permute), (2) the test statistic (which affects power), and (3) the number of permutations (which affects precision of the estimated p-value).
A permutation test is guaranteed to have the correct desired false positive rate (Type I error) regardless of the distributional characteristics of the data at hand. This flexibility comes at the cost of increased computational complexity.