## Introduction

Permutation tests are a class of widely applicable non-parametric tests. They use random shuffles of the data to obtain the distribution of a test statistic under the null hypothesis. They provide valid tests (that is, tests with the advertised Type I error), although they are more computationally intensive than standard statistical tests. In this note, we examine how they work using a few examples.

Permutation tests are widely used in genetics and genomics. They are most useful when we have insufficient information about the distribution of the data, are uncomfortable making assumptions about it, or when the distribution of the test statistic is not easily computed. They are used in candidate-gene and genomewide association studies, as well as in family-based association tests. Below, we show how a permutation test works using hypertension data in a mouse backcross. Next we analyze a gene expression dataset where a standard permutation would be inappropriate and a stratified test should be used. We follow with an example from a diabetes study and conclude with a summary.

## Hypertension data

Systolic blood pressure was measured in 250 progeny from a backcross between two mouse strains. For simplicity, we focus on 50 (randomly chosen) mice genotyped at the D4Mit214 marker (although more markers were genotyped). We want to detect association between the D4Mit214 marker genotype and blood pressure. The table below shows the systolic blood pressure (in mm of Hg) by the marker genotype, BA (heterozygous) or BB (homozygous), arranged in increasing order.

BA: 86  88  89  89  92  93  94  94  94  95  95  96  96  97  97  98  98  99  99
101 106 107 110 113 116 118
BB: 89  90  92  93  93  96  99  99  99 102 103 104 105 106 106 107 108 108 110
110 112 114 116 116

A cursory look suggests that mice with the BB genotype have higher blood pressure compared to those with the BA genotype. We can make a more objective assessment using a t-test testing the null hypothesis that the mean blood pressure in the two genotype groups is equal.

To do this, we first read in the data in Stata.

/* read data */
. insheet using hyper.csv
(22 vars, 50 obs)

Next we perform a t-test.

/* perform t-test allowing for unequal variances */
. ttest bp, by(d4mit214) unequal

Two-sample t test with unequal variances
+------------------------------------------------------------------------------
Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
+---------+--------------------------------------------------------------------
BA |      26    98.48077    1.657301     8.45061    95.06749     101.894
BB |      24    103.1833    1.637388    8.021529    99.79614    106.5705
+---------+--------------------------------------------------------------------
combined |      50     100.738    1.202249    8.501185    98.32199     103.154
+---------+--------------------------------------------------------------------
diff |           -4.702564    2.329739               -9.386925   -.0182032
+------------------------------------------------------------------------------
diff = mean(BA) - mean(BB)                                    t =  -2.0185
Ho: diff = 0                     Satterthwaite's degrees of freedom =   47.958

Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
Pr(T < t) = 0.0246         Pr(|T| > |t|) = 0.0492          Pr(T > t) = 0.9754

The t-test indicates that the mean blood pressure in the BA genotype group is 4.7 units lower than that in the BB genotype group. The p-value for the two-sided test is 0.049, just barely "significant" at the conventional 5% level.

Note, however, that the t-test makes two important assumptions: (a) the measurements are independent, and (b) the data are approximately normally distributed. The test is robust to deviations from the normality assumption if the underlying distribution is symmetric, unimodal, and continuous. There is no a priori reason to expect blood pressure to be normally distributed, and the sample size is too small to check the assumption reliably (low power), so we resort to a visual check using stem-and-leaf plots.

The stem-and-leaf plots below give reason to suspect that the data may not have a symmetric unimodal distribution conditional on the genotype, which raises questions about the validity of the t-test. A generally applicable solution in this setting is a permutation test.

/* sort data by d4mit214 marker, and make a stem and leaf plot rounding
data to the nearest integer */

. by d4mit214, sort: stem bp, round(1)

+-------------------------------------------------------------------------------
-> d4mit214 = BA

Stem-and-leaf plot for bp

bp rounded to integers

8. | 6899
9* | 23444
9. | 5566778899
10* | 1
10. | 67
11* | 03
11. | 68

+-------------------------------------------------------------------------------
-> d4mit214 = BB

Stem-and-leaf plot for bp

bp rounded to integers

8. | 9
9* | 0233
9. | 6999
10* | 234
10. | 566788
11* | 0024
11. | 66

## Shuffling the data

If we shuffle the blood pressures randomly and assign them to genotype groups, we destroy any association between genotype and blood pressure, which is exactly the null hypothesis. The distribution of the test statistic over repeated random shuffles can therefore be used to approximate its null distribution.

A particular shuffle of the blood pressure values may look like this (values from the original BA group are labeled a, and those from the BB group b):

BA: 101a 104b  98a  94a  92a  99b 116b 103b  86a  90b  88a  93b 112b 106b
107b  97a  93a 113a 110a 118a  99b  95a  96a  99a 110b  89a
BB: 105b 108b 116a  94a 106b  95a  94a 106a 108b 114b  99b  89b  99a  96b
102b 110b  97a  98a  93b  96a  89a  92b 116b 107a

The mean blood pressure is now 100.3 in the BA group and 101.2 in the BB group. The difference, -0.9, is much smaller in magnitude than the observed difference, -4.7. We can now ask the question that any test asks: if there were no association between blood pressure and genotype, what differences would we expect to see, and how does the observed difference compare? To answer it, we repeat the shuffles and record the difference between the means in the two genotype groups each time. After 400 shuffles, the distribution of the difference between the BA and BB means looked like this.

/* make a new variable that is 1 if d4mit214 is BA and 0 otherwise */

generate geno=0 if d4mit214=="BB"
replace geno=1 if d4mit214=="BA"

/* we can get the difference between the means in the two genotype
groups using the regress command */

. regress bp geno

Source |       SS       df       MS              Number of obs =      50
+-------------+------------------------------           F(  1,    48) =    4.06
Model |  275.984052     1  275.984052           Prob > F      =  0.0496
Residual |  3265.25335    48  68.0261114           R-squared     =  0.0779
Total |   3541.2374    49   72.270151           Root MSE      =  8.2478

+------------------------------------------------------------------------------
bp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
+-------------+----------------------------------------------------------------
geno |  -4.702564   2.334697    -2.01   0.050    -9.396787   -.0083409
_cons |   103.1833   1.683574    61.29   0.000     99.79828    106.5684
+------------------------------------------------------------------------------

. permute bp "regress bp geno" _b, saving(perm400) reps(400)
. clear
. use perm400
. stem b_geno, line(1)

Stem-and-leaf plot for b_geno (_b[geno])

b_geno rounded to nearest multiple of .1
plot in units of .1

-7* | 7
-6* | 0
-5* | 533
-4* | 988776643332200
-3* | 777776655555544444332221111000
-2* | 99887775544443332222211111111100000
-1* | 999987777766655554444433332222221111100000000
-0* | 99999998888888888877777777666555555444333332222222111
0* | 0000000000011112222222233333334444445555555555667777788888899999999999
1* | 0000011111112222222333334444445555555556666667777888999999
2* | 0000000122222233344445555556666667777889999
3* | 0000001123333344566679
4* | 000112223333566789
5* | 1456
6* | 35

And after a whole lot more (100,000 shuffles), we get this from Stata.

. permute bp "regress bp geno" _b, reps(100000)

command:      regress bp geno
statistics:   b_geno     = _b[geno]
b_cons     = _b[_cons]
permute var:  bp

Monte Carlo permutation statistics                Number of obs    =        50
Replications     =    100000

+------------------------------------------------------------------------------
T            |     T(obs)       c       n   p=c/n   SE(p) [95% Conf. Interval]
+-------------+----------------------------------------------------------------
b_geno       |  -4.702564    4986  1.0e+05  0.0499  0.0007  .0485191   .0512271
b_cons       |   103.1833    2485  1.0e+05  0.0249  0.0005   .023894   .0258337
+------------------------------------------------------------------------------
Note:  confidence intervals are with respect to p=c/n
Note:  c = #{|T| >= |T(obs)|}

It tells us that about 4.99% of the shuffled differences were at least as large in magnitude as the observed difference of -4.7, giving a p-value of 0.0499. Notice that the p-value comes with a confidence interval. Why? Because it is estimated from a finite number of random shuffles, it has sampling error of its own. In general, the smaller the p-value, the more shuffles are needed to estimate it accurately (the rarer the event, the longer you have to look to find it).

In this case the p-value of the permutation test is almost identical to that obtained from the t-test. That need not be the case in general; the close agreement here is an indication of the robustness of the t-test.

Let us now take a moment to examine the three choices we made in performing the permutation test. First, we decided what to permute, which depended on the null hypothesis being tested — mean blood pressure in the two genotype groups is the same. Second, we chose a test statistic to evaluate the null hypothesis; this has an impact on power. Third, we chose the number of permutations; this determines the accuracy of the estimated p-value.
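These three choices can be made concrete in a short sketch. The note's analyses use Stata; the following illustrative Python uses the rounded blood-pressure values from the table above, so the observed difference comes out close to, but not exactly, -4.7.

```python
import random

# Rounded blood-pressure values from the table above (26 BA, 24 BB mice).
ba = [86, 88, 89, 89, 92, 93, 94, 94, 94, 95, 95, 96, 96, 97,
      97, 98, 98, 99, 99, 101, 106, 107, 110, 113, 116, 118]
bb = [89, 90, 92, 93, 93, 96, 99, 99, 99, 102, 103, 104, 105,
      106, 106, 107, 108, 108, 110, 110, 112, 114, 116, 116]

def mean(xs):
    return sum(xs) / len(xs)

# Choice 1: what to permute -- the pooled values, because under the null
# hypothesis the genotype labels carry no information.
pooled = ba + bb

# Choice 2: the test statistic -- the difference in group means.
t_obs = mean(ba) - mean(bb)

# Choice 3: the number of shuffles -- controls the precision of the p-value.
random.seed(1)
reps = 10_000
c = 0
for _ in range(reps):
    s = random.sample(pooled, len(pooled))          # one random shuffle
    t = mean(s[:len(ba)]) - mean(s[len(ba):])
    if abs(t) >= abs(t_obs):                        # two-sided count
        c += 1
p = c / reps
print(f"observed difference = {t_obs:.2f}, permutation p = {p:.3f}")
```

The estimated p-value should land near the 0.05 obtained from Stata; the seed and number of shuffles are arbitrary choices.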

### Exercise: Wilcoxon ranksum test

We could have used the rank sum test (see below). What assumptions (if any) does it make? How would you evaluate the evidence it provides?

. ranksum bp, by(geno)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

geno |      obs    rank sum    expected
-------------+---------------------------------
0 |       24         717         612
1 |       26         558         663
-------------+---------------------------------
combined |       50        1275        1275

----------

Ho: bp(geno==0) = bp(geno==1)
z =   2.039
Prob > |z| =   0.0414

## A regression example

Insulin area under the curve (IAUC) is an index of the magnitude of insulinemia. The table below gives the age in years and IAUC for 15 subjects. We want to know if younger patients have a greater IAUC.

/* read data */
. insheet using obesity.csv

/* display data */
. list

+----------------+
| age       iauc |
|----------------|
1. |  43   12.51976 |
2. |  36   13.07431 |
3. |  38   13.43339 |
4. |  28   13.67507 |
5. |  18   13.67342 |
|----------------|
6. |  41   13.04644 |
7. |  49    12.2841 |
8. |  41   13.00492 |
9. |  22   14.98371 |
10. |  45   13.96624 |
|----------------|
11. |  42   14.69479 |
12. |  51   13.08556 |
13. |  33   12.73661 |
14. |  23    14.2175 |
15. |  43   14.96251 |
+----------------+

We can measure the association using the correlation coefficient or with linear regression. To attach a significance level to the correlation, we can use a permutation test.
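What Stata's `permute` command below estimates can be sketched directly (illustrative Python only; the values are taken from the listing above):

```python
import random

# Age and IAUC values from the listing above (15 subjects).
age = [43, 36, 38, 28, 18, 41, 49, 41, 22, 45, 42, 51, 33, 23, 43]
iauc = [12.51976, 13.07431, 13.43339, 13.67507, 13.67342,
        13.04644, 12.28410, 13.00492, 14.98371, 13.96624,
        14.69479, 13.08556, 12.73661, 14.21750, 14.96251]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

r_obs = pearson_r(age, iauc)   # about -0.374

# Shuffle age against iauc; under the null the pairing is arbitrary.
random.seed(1)
reps = 20_000
c = sum(1 for _ in range(reps)
        if abs(pearson_r(random.sample(age, len(age)), iauc)) >= abs(r_obs))
p = c / reps
print(f"r = {r_obs:.4f}, permutation p = {p:.3f}")
```

Note that shuffling either variable against the other is equivalent; only the pairing matters.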

. permute age "corr age iauc" r(rho), reps(100000)

command:      corr age iauc
statistic:    _pm_1      = r(rho)
permute var:  age

Monte Carlo permutation statistics                Number of obs    =        15
Replications     =    100000

+------------------------------------------------------------------------------
T            |     T(obs)       c       n   p=c/n   SE(p) [95% Conf. Interval]
+-------------+----------------------------------------------------------------
_pm_1        |  -.3740956   16874  1.0e+05  0.1687  0.0012  .1664234   .1710759
+------------------------------------------------------------------------------
Note:  confidence interval is with respect to p=c/n
Note:  c = #{|T| >= |T(obs)|}

In this case we find the correlation coefficient to be non-significant, with a p-value of 0.17. If we had dichotomized both variables at the median, we could have constructed a contingency table and used a $\chi^2$ test.

. generate old=1 if age>=41
. replace old=0 if age<41
. gen iauc2 = 1 if iauc >= 13.43 & iauc ~= .
. replace iauc2 = 0 if iauc < 13.43
. tab old iauc2, chi2

|         iauc2
old |         0          1 |     Total
+-----------+----------------------+----------
0 |         2          5 |         7
1 |         5          3 |         8
+-----------+----------------------+----------
Total |         7          8 |        15

Pearson chi2(1) =   1.7267   Pr = 0.189

With such small cell counts, we cannot trust the $\chi^2$ approximation, but we can use a permutation test with the $\chi^2$ statistic of the 2x2 table as the test statistic. The resulting p-value, 0.32, again indicates no significant association.

. permute old "tab old iauc2, chi2 " r(chi2), reps(100000)

command:      tab old iauc2 , chi2
statistic:    _pm_1      = r(chi2)
permute var:  old

Monte Carlo permutation statistics                Number of obs    =        15
Replications     =    100000

+------------------------------------------------------------------------------
T            |     T(obs)       c       n   p=c/n   SE(p) [95% Conf. Interval]
+-------------+----------------------------------------------------------------
_pm_1        |   1.726722   31818  1.0e+05  0.3182  0.0015  .3152935   .3210771
+------------------------------------------------------------------------------
Note:  confidence interval is with respect to p=c/n
Note:  c = #{|T| >= |T(obs)|}
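The same Monte Carlo estimate can be sketched in Python (illustrative only; the cutoffs mirror the Stata code above):

```python
import random

# Same subjects as above, dichotomized as in the Stata code.
age = [43, 36, 38, 28, 18, 41, 49, 41, 22, 45, 42, 51, 33, 23, 43]
iauc = [12.51976, 13.07431, 13.43339, 13.67507, 13.67342,
        13.04644, 12.28410, 13.00492, 14.98371, 13.96624,
        14.69479, 13.08556, 12.73661, 14.21750, 14.96251]
old = [1 if a >= 41 else 0 for a in age]
iauc2 = [1 if v >= 13.43 else 0 for v in iauc]

def chi2_2x2(x, y):
    """Pearson chi-square statistic for the 2x2 table of two 0/1 vectors."""
    n = len(x)
    a = sum(1 for u, v in zip(x, y) if (u, v) == (0, 0))
    b = sum(1 for u, v in zip(x, y) if (u, v) == (0, 1))
    c = sum(1 for u, v in zip(x, y) if (u, v) == (1, 0))
    d = n - a - b - c
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

t_obs = chi2_2x2(old, iauc2)   # about 1.73

# Permuting `old` keeps both sets of margins fixed, so the statistic
# is always well defined.
random.seed(1)
reps = 20_000
hits = sum(1 for _ in range(reps)
           if chi2_2x2(random.sample(old, len(old)), iauc2) >= t_obs)
p = hits / reps
print(f"chi2 = {t_obs:.4f}, permutation p = {p:.3f}")
```

Since the $\chi^2$ statistic is non-negative and large values indicate departure from independence, a one-sided count of the shuffled statistics gives the two-sided test of association.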

### Exercise: Fisher’s exact test

We could have used Fisher’s exact test. How is it connected to a permutation test?

. tab old iauc2, exact

|         iauc2
old |         0          1 |     Total
+-----------+----------------------+----------
0 |         2          5 |         7
1 |         5          3 |         8
+-----------+----------------------+----------
Total |         7          8 |        15

Fisher's exact =                 0.315
1-sided Fisher's exact =                 0.214

## Stratified shuffles

Despite the apparent ease and automatic nature of permutation testing, we must guard against inappropriate use. The key is recognizing what is permutable (exchangeable) under the null hypothesis. To examine this issue further, we consider an example where the usual permutation procedure is not applicable.

A genomewide survey of gene expression in mouse spinal cord tissue was conducted using a microarray with approximately 48K probes. The study extracted mRNA from six mice from each of two strains. The mRNA from three mice from each strain was hybridized onto each of two chips. For simplicity, we focus on the gene expression measured by one particular probe (we will revisit the genomewide data in a future lecture).

The experimental design of the study and the hybridization intensity from one probe is shown below.

/* set the memory to be higher than usual */
. set memory 1G

. insheet using gene-expr.csv
(3 vars, 12 obs)

/* multiply gene expression values by 10 for display */
. replace y = 10*y

/* display data with separator every 3 rows */
. list, separator(3)

+------------------------------+
|        y   strainid   chipid |
|------------------------------|
1. | 95.70958         b6    chip1 |
2. | 95.09264         b6    chip1 |
3. | 95.38004         b6    chip1 |
|------------------------------|
4. | 95.48625         lp    chip1 |
5. | 95.89638         lp    chip1 |
6. | 97.81987         lp    chip1 |
|------------------------------|
7. | 95.98891         b6    chip2 |
8. | 95.38393         b6    chip2 |
9. | 95.49183         b6    chip2 |
|------------------------------|
10. | 97.56024         lp    chip2 |
11. | 98.54643         lp    chip2 |
12. | 97.60174         lp    chip2 |
+------------------------------+

We want to see if the gene expression, as measured by this probe, is the same in the two strains (B6, and LP). However, the measurements were assayed in two different chips (Chip1 and Chip2), and it is possible that there are some chip-specific effects.

Here is a stem-and-leaf plot of the data; it suggests that the data may be skewed and bimodal, so a non-parametric test may be desirable.

/* stem and leaf plot rounding to the first decimal place */
. stem y, round(0.1)

Stem-and-leaf plot for y

y rounded to nearest multiple of 0.1
plot in units of 0.1

95* | 144
95. | 5579
96* | 0
96. |
97* |
97. | 668
98* |
98. | 5

Under the null hypothesis of no strain effect, the gene expression levels are exchangeable within chips, but not between chips; for them to be exchangeable between chips as well, we would have to assume that the chip effect is also absent. Thus, the correct permutation procedure for testing the null hypothesis of no strain effect (in the presence of a possible chip effect) permutes expression values within chips only. If we permuted the gene expression values among all 12 rows (across chips and strains), we would be making the additional assumption that there is no chip effect. Since the null hypothesis of no strain effect does not imply the absence of a chip effect, the usual (unstratified) permutation is invalid.
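Because the dataset is tiny, the stratified test can in fact be done exactly, by enumerating every within-chip assignment of strain labels. Here is an illustrative Python sketch; the statistic is the difference in strain means, which in this balanced design equals the strain coefficient from the regression used later in this section.

```python
from itertools import combinations

# Expression values from the listing above; in each chip the first
# three mice are strain b6, the last three strain lp.
chip1 = [95.70958, 95.09264, 95.38004, 95.48625, 95.89638, 97.81987]
chip2 = [95.98891, 95.38393, 95.49183, 97.56024, 98.54643, 97.60174]

def strain_diff(b6_1, b6_2):
    """lp mean minus b6 mean for a given choice of b6 positions per chip."""
    b6 = [chip1[i] for i in b6_1] + [chip2[i] for i in b6_2]
    lp = [chip1[i] for i in range(6) if i not in b6_1] + \
         [chip2[i] for i in range(6) if i not in b6_2]
    return sum(lp) / 6 - sum(b6) / 6

t_obs = strain_diff((0, 1, 2), (0, 1, 2))   # about 1.644

# Stratified permutation: labels are reassigned only WITHIN each chip,
# so every admissible assignment picks 3 b6 positions per chip.
assignments = [(i1, i2)
               for i1 in combinations(range(6), 3)
               for i2 in combinations(range(6), 3)]
hits = sum(1 for i1, i2 in assignments
           if abs(strain_diff(i1, i2)) >= abs(t_obs) - 1e-12)
p = hits / len(assignments)
print(f"strain effect = {t_obs:.3f}, exact stratified p = {p:.4f}")
```

The small tolerance guards against floating-point ties when comparing the observed assignment against itself.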

To perform a stratified permutation test we have to create some dummy variables.

/* create dummy variables */
.  xi i.strainid i.chipid
i.strainid        _Istrainid_1-2      (_Istrainid_1 for strainid==b6 omitted)
i.chipid          _Ichipid_1-2        (_Ichipid_1 for chipid==chip1 omitted)

/* show them */

. list, separator(3)

+----------------------------------------------------+
|        y   strainid   chipid   _Istra~2   _Ichip~2 |
|----------------------------------------------------|
1. | 95.70958         b6    chip1          0          0 |
2. | 95.09264         b6    chip1          0          0 |
3. | 95.38004         b6    chip1          0          0 |
|----------------------------------------------------|
4. | 95.48625         lp    chip1          1          0 |
5. | 95.89638         lp    chip1          1          0 |
6. | 97.81987         lp    chip1          1          0 |
|----------------------------------------------------|
7. | 95.98891         b6    chip2          0          1 |
8. | 95.38393         b6    chip2          0          1 |
9. | 95.49183         b6    chip2          0          1 |
|----------------------------------------------------|
10. | 97.56024         lp    chip2          1          1 |
11. | 98.54643         lp    chip2          1          1 |
12. | 97.60174         lp    chip2          1          1 |
+----------------------------------------------------+

Now we perform a stratified permutation test by restricting the permutations by strata defined by the chipid variable.

. permute y "regress y _Istrainid _Ichipid" _b, reps(1000000) strata(chipid)

command:      regress y _Istrainid _Ichipid
statistics:   b__Istra~2 = _b[_Istrainid_2]
b__Ichip~2 = _b[_Ichipid_2]
b_cons     = _b[_cons]
permute var:  y

Monte Carlo permutation statistics                Number of obs    =        12
Number of strata =         2
Replications     =   1000000

+------------------------------------------------------------------------------
T            |     T(obs)       c       n   p=c/n   SE(p) [95% Conf. Interval]
+-------------+----------------------------------------------------------------
b__Istra~2   |      1.644    10264  1.0e+06  0.0103  0.0001  .0100674  .0104635
b__Ichip~2   |   .8647206  1.0e+06  1.0e+06  1.0000  0.0000  .9999963  1
b_cons       |   95.07545  1.0e+06  1.0e+06  0.9975  0.0001  .9973818  .9975793
+------------------------------------------------------------------------------
Note:  confidence intervals are with respect to p=c/n
Note:  c = #{|T| >= |T(obs)|}

We find that the p-value corresponding to the null hypothesis of no strain effect is 0.01. We used a million shuffles here, although that is not necessary: it is possible to enumerate all distinct within-chip label assignments, of which there are ${6 \choose 3} \times {6 \choose 3} = 20 \times 20 = 400$, and compute the p-value exactly.

## More complex settings

So far we have considered permutation tests in settings where there is a single primary outcome variable and a single predictor variable. The stratified shuffle example showed how we can design permutation tests when there is a categorical confounding variable. The common theme is that under the null hypothesis we can regard certain individuals as exchangeable, and permute the outcome and predictor appropriately (within strata, if necessary). This approach no longer works when (a) individuals in a population are correlated (as in a genetically structured population) or (b) we have non-categorical covariates. Below, we outline what may be done in such circumstances.

When we have covariates, two possibilities can be explored. For binary outcomes, one can use biased permutations, where the outcomes are scrambled with respect to the primary predictor conditional on the covariates. For quantitative outcomes, we can permute the residuals conditional on the covariates (excluding the primary predictor of interest).

If individuals are correlated, as would be the case in family-based studies, two classes of approaches can be considered in genetic studies. One is to simulate genotypes conditional on the founder genotypes. Another is to permute residuals in a way that preserves the correlation structure of the data.

## Questions

### What are some examples in genetics that use permutation tests?

Association studies, where several polymorphisms are examined for association with a phenotype, may use permutation tests to correct for multiple comparisons. Single-marker family-based association tests may also use permutation to obtain the null distribution of the test statistic.

### We get a p-value from permutation tests. Is that all we need?

P-values are a useful, objective statistical quantity; however, they do not give a complete picture of the weight of evidence. Except for some selected settings (such as goodness-of-fit tests), p-values are almost always better used in conjunction with confidence intervals.

At first glance, there does not appear to be a simple way of generating confidence intervals from permutation tests. However, a trick called inverting a test allows us to get confidence intervals from permutation tests (or any other test, for that matter).

To invert a test, we perform a sequence of statistical tests with different parameter values. The parameter values accepted at the $\alpha$ level of significance fall in the $(1{-}\alpha)$ confidence interval. We illustrate this with an example.

Consider our blood pressure example. We initially performed 400 permutations to test the hypothesis that there was no difference in blood pressure between the two genotype groups. We can likewise test the hypothesis that the difference in blood pressure is 2 (mm of Hg) with a permutation test, then 3 (mm of Hg), and so on. Hypotheses accepted at the 5% level of significance are contained in a 95% confidence interval. Following this process, it can be seen (left as an exercise) that an approximate 95% confidence interval is ${-}4.7 \pm 4.7$ (mm of Hg), consistent with the interval from the t-test.
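A sketch of the inversion in Python (illustrative only; the grid, step size, and number of shuffles are arbitrary choices): for each hypothesized difference d0 we shift one group by d0, which makes the groups exchangeable under that null, and keep the values of d0 whose permutation test accepts at the 5% level.

```python
import random

# Rounded blood-pressure values from the table in the hypertension section.
ba = [86, 88, 89, 89, 92, 93, 94, 94, 94, 95, 95, 96, 96, 97,
      97, 98, 98, 99, 99, 101, 106, 107, 110, 113, 116, 118]
bb = [89, 90, 92, 93, 93, 96, 99, 99, 99, 102, 103, 104, 105,
      106, 106, 107, 108, 108, 110, 110, 112, 114, 116, 116]

def mean(xs):
    return sum(xs) / len(xs)

def perm_p(group1, group2, reps=2000):
    """Two-sided permutation p-value for H0: equal means."""
    t_obs = mean(group1) - mean(group2)
    pooled = group1 + group2
    c = 0
    for _ in range(reps):
        s = random.sample(pooled, len(pooled))
        if abs(mean(s[:len(group1)]) - mean(s[len(group1):])) >= abs(t_obs):
            c += 1
    return c / reps

# Invert the test: H0 "difference = d0" becomes H0 "equal means"
# after shifting the BA group by -d0.
random.seed(1)
accepted = []
for d0 in [x / 2 for x in range(-20, 1)]:   # grid: -10.0, -9.5, ..., 0.0
    shifted_ba = [v - d0 for v in ba]
    if perm_p(shifted_ba, bb) >= 0.05:
        accepted.append(d0)
print(f"approximate 95% CI: [{min(accepted)}, {max(accepted)}]")
```

A finer grid and more shuffles per grid point would pin down the endpoints more precisely.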

### How many permutations do we need?

As many as needed to estimate the p-value to the desired precision. When you perform permutations in Stata, the output contains the confidence interval for the p-value in addition to the estimate. This confidence interval is a confidence interval for a binomial proportion (since that’s what we are estimating).

The approximate 95% confidence interval for a proportion estimated to be $\hat{p}$ from $m$ samples (permutations) is

$$\hat{p} \pm 2\sqrt{\frac{\hat{p}(1-\hat{p})}{m}} \approx \hat{p} \pm 2\sqrt{\frac{\hat{p}}{m}}$$

for small $\hat{p}$. Thus, to estimate a p-value of the order of 0.01 to two decimal places (half-width at most 0.005), we must have

$$2\sqrt{\frac{0.01}{m}} < 0.005,$$

which implies that

$$m > \frac{4 \times 0.01}{0.005^2},$$

or $m > 1600$. Since we usually do not care about p-values smaller than 0.01, a good rule of thumb is to use between 1,000 and 10,000 permutations.
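As a quick sanity check of this arithmetic (illustrative Python):

```python
# Half-width of the approximate 95% CI for a p-value estimated
# from m permutations.
def halfwidth(p_hat, m):
    return 2 * (p_hat * (1 - p_hat) / m) ** 0.5

print(round(halfwidth(0.01, 1600), 5))   # close to 0.005
```

With $m = 1600$ the half-width is just under 0.005, matching the bound above.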

### Are permutation tests the same as bootstrapping?

No, though they are similar in that both are Monte Carlo methods (procedures that use simulation). More specifically, both are resampling methods, which repeatedly reuse the data by (re)sampling from it. Bootstrap methods also have very general applicability, but their validity is not guaranteed in general, and they must be used with some caution. For example, our research has shown that confidence intervals for genetic loci obtained using bootstrap methods may not be reliable.

## Summary

Permutation tests are a class of non-parametric tests requiring few assumptions that can be used when the assumptions of off-the-shelf tests do not hold (or there is insufficient information about their applicability).

They have three ingredients: (1) the null hypothesis (which determines what we permute), (2) the test statistic (which affects power), and (3) the number of permutations (which affects precision of the estimated p-value).

A permutation test is guaranteed to have the advertised false positive rate (Type I error) regardless of the distributional characteristics of the data at hand. This flexibility comes at the cost of increased computational effort.