A quick guide to using multi-model inference for model selection in R

Use computing power to choose the best fit model.
Share:


Model selection is a longstanding issue in statistical analysis. How should we choose the best model? Kenneth Burnham and David Anderson argue we should be using insight from information theory to select models. Their approach is called Multi-Model Inference.


Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach


How does multi-model inference work?


Very simply, multi-model inference works by testing all possible combinations of variables in a model and ranking them by their fit. Ranking is done using the Aikaike Information Criterion (AIC). AIC is a measure of model fit, taking into account model simplicity. The aim of multi-model inference and AIC is to determine the best fit model using the fewest variables.


Using multi-model inference in R


The package MuMIn has everything needed to perform multi-model inference. To test it out, we will use data from the University of Wisconsin County Health Project. This dataset contains health data from six states in the U.S.


packages <- c("readr", "MuMIn", "nlme", "car")                      # packages you will need
#lapply(packages, install.packages, character.only = TRUE)          # install packages
lapply(packages, library, character.only = TRUE)                    # load packages
dat <- as.data.frame(read_csv(url("https://raw.githubusercontent.com/waltscience/mumin-r/master/healthdata.csv")))


We will build models testing whether diabetes prevalence in counties is related to the following factors:

  • Smoking
  • Food environment index
  • Physical inactivity
  • Access to exercise opportunities
  • Excessive drinking prevalence

We begin by building a model with these fixed effects and their two-way interactions, using county and state as random effects. We can think of this model as the “global” model.


mod <- lme(
          diabetesprevalence ~ 
            smoking +
            foodenvironment +
            physicalinactivity +
            accesstoexercise +
            excessivedrinking +
              smoking:foodenvironment +
              smoking:physicalinactivity +
              smoking:accesstoexercise +
              smoking:excessivedrinking +
              foodenvironment:physicalinactivity +
              foodenvironment:accesstoexercise +
              foodenvironment:excessivedrinking +
              physicalinactivity:accesstoexercise +
              physicalinactivity:excessivedrinking +
              accesstoexercise:excessivedrinking,
          random = ~ 1 | state/county, data = dat, method = "ML"
          )


From the global model, we will use the dredge function to test every model with all possible combinations of the fixed effects.


moddredge <- dredge(mod, extra = c("R^2", F = function(x)summary(x)$fstatistic[[1]]))


Model averaging


We can create a plot showing model variable how often variables appear in the top candidate models to get a sense of what factors are important. Then create a model average from the models with the lowest AIC. Typically, models are not thought to be different if they are within two AIC units, so the model.avg function creates an average model from all models within that range.


mods2 <- moddredge[which(moddredge$AICc <= min(moddredge$AICc) + 2), ]
plot(mods2, labAsExpr = TRUE)


Occurrence of model variables across all models within two AIC units


modav <- model.avg(moddredge, subset = delta < 2)
summary(modav)
## 
## Call:
## model.avg(object = moddredge, subset = delta < 2)
## 
## Component model call: 
## lme.formula(fixed = diabetesprevalence ~ <25 unique rhs>, data = 
##      dat, random = ~1 | state/county, method = ML)
## 
## Component models: 
##                          df  logLik    AICc delta weight
## 1/2/3/4/5/8/12/13/14     13 -785.61 1597.99  0.00   0.08
## 1/2/3/4/5/8/12/14        12 -786.71 1598.10  0.10   0.07
## 1/2/3/4/5/8/11/14        12 -787.04 1598.75  0.75   0.05
## 1/2/3/4/5/6/8/12/14      13 -786.05 1598.88  0.89   0.05
## 1/2/3/4/5/8/12/13/14/15  14 -785.00 1598.90  0.90   0.05
## 1/2/3/4/5/6/8/12/13/14   14 -785.06 1599.03  1.04   0.05
## 1/2/3/4/5/6/8/9/12/13/14 15 -784.04 1599.12  1.12   0.04
## 1/2/3/4/5/8/9/12/13/14   14 -785.12 1599.14  1.14   0.04
## 1/2/3/4/5/8/10/11/14     13 -786.27 1599.31  1.32   0.04
## 1/2/3/4/5/8/11/13/14     13 -786.28 1599.34  1.34   0.04
## 1/2/3/4/5/6/8/10/14      13 -786.32 1599.42  1.43   0.04
## 1/2/3/4/5/6/8/9/10/13/14 15 -784.20 1599.44  1.45   0.04
## 1/2/3/4/5/6/8/11/14      13 -786.33 1599.45  1.46   0.04
## 1/2/3/4/5/8/11/12/14     13 -786.37 1599.52  1.53   0.04
## 1/2/3/4/5/6/8/10/13/14   14 -785.38 1599.66  1.67   0.03
## 1/2/3/4/5/8/10/12/14     13 -786.46 1599.70  1.70   0.03
## 1/2/3/4/5/6/8/9/13/14    14 -785.43 1599.77  1.78   0.03
## 1/2/3/4/5/6/8/9/12/14    14 -785.44 1599.78  1.78   0.03
## 1/2/3/4/5/8/9/12/14      13 -786.51 1599.81  1.81   0.03
## 1/2/3/4/5/6/8/13/14      13 -786.51 1599.81  1.81   0.03
## 1/2/3/4/5/8/10/12/13/14  14 -785.47 1599.85  1.85   0.03
## 1/2/3/4/5/6/8/10/11/14   14 -785.48 1599.86  1.87   0.03
## 1/2/3/4/5/7/8/12/13/14   14 -785.49 1599.89  1.90   0.03
## 1/2/3/4/5/8/11/12/13/14  14 -785.53 1599.96  1.96   0.03
## 1/2/3/4/5/8/10/13/14     13 -786.59 1599.96  1.97   0.03
## 
## Term codes: 
##                     accesstoexercise                    excessivedrinking 
##                                    1                                    2 
##                      foodenvironment                   physicalinactivity 
##                                    3                                    4 
##                              smoking   accesstoexercise:excessivedrinking 
##                                    5                                    6 
##     accesstoexercise:foodenvironment  accesstoexercise:physicalinactivity 
##                                    7                                    8 
##             accesstoexercise:smoking    excessivedrinking:foodenvironment 
##                                    9                                   10 
## excessivedrinking:physicalinactivity            excessivedrinking:smoking 
##                                   11                                   12 
##   foodenvironment:physicalinactivity              foodenvironment:smoking 
##                                   13                                   14 
##           physicalinactivity:smoking 
##                                   15 
## 
## Model-averaged coefficients:  
## (full average) 
##                                        Estimate Std. Error Adjusted SE
## (Intercept)                           2.676e+01  7.456e+00   7.465e+00
## accesstoexercise                     -9.752e-02  3.605e-02   3.610e-02
## excessivedrinking                    -2.371e-01  3.763e-01   3.766e-01
## foodenvironment                      -1.305e+00  7.534e-01   7.547e-01
## physicalinactivity                    2.564e-01  2.222e-01   2.225e-01
## smoking                              -5.276e-01  2.635e-01   2.640e-01
## accesstoexercise:physicalinactivity   2.814e-03  7.665e-04   7.681e-04
## excessivedrinking:smoking            -9.976e-03  1.067e-02   1.069e-02
## foodenvironment:physicalinactivity   -2.060e-02  2.619e-02   2.622e-02
## foodenvironment:smoking               7.378e-02  3.010e-02   3.016e-02
## excessivedrinking:physicalinactivity -2.435e-03  5.428e-03   5.433e-03
## accesstoexercise:excessivedrinking    7.602e-04  1.252e-03   1.254e-03
## physicalinactivity:smoking           -3.179e-04  1.923e-03   1.926e-03
## accesstoexercise:smoking              2.922e-04  7.744e-04   7.753e-04
## excessivedrinking:foodenvironment     1.146e-02  2.647e-02   2.650e-02
## accesstoexercise:foodenvironment     -4.431e-05  6.023e-04   6.036e-04
##                                      z value Pr(>|z|)    
## (Intercept)                            3.584 0.000338 ***
## accesstoexercise                       2.701 0.006904 ** 
## excessivedrinking                      0.630 0.528998    
## foodenvironment                        1.730 0.083685 .  
## physicalinactivity                     1.152 0.249150    
## smoking                                1.999 0.045636 *  
## accesstoexercise:physicalinactivity    3.663 0.000249 ***
## excessivedrinking:smoking              0.934 0.350523    
## foodenvironment:physicalinactivity     0.785 0.432199    
## foodenvironment:smoking                2.446 0.014426 *  
## excessivedrinking:physicalinactivity   0.448 0.653979    
## accesstoexercise:excessivedrinking     0.606 0.544200    
## physicalinactivity:smoking             0.165 0.868882    
## accesstoexercise:smoking               0.377 0.706281    
## excessivedrinking:foodenvironment      0.433 0.665246    
## accesstoexercise:foodenvironment       0.073 0.941481    
##  
## (conditional average) 
##                                        Estimate Std. Error Adjusted SE
## (Intercept)                          26.7569183  7.4557483   7.4652449
## accesstoexercise                     -0.0975216  0.0360478   0.0360999
## excessivedrinking                    -0.2370969  0.3762760   0.3766217
## foodenvironment                      -1.3053319  0.7534044   0.7546605
## physicalinactivity                    0.2564386  0.2222194   0.2225226
## smoking                              -0.5275941  0.2635284   0.2639631
## accesstoexercise:physicalinactivity   0.0028139  0.0007665   0.0007681
## excessivedrinking:smoking            -0.0165921  0.0089292   0.0089511
## foodenvironment:physicalinactivity   -0.0375073  0.0248004   0.0248618
## foodenvironment:smoking               0.0737757  0.0300976   0.0301558
## excessivedrinking:physicalinactivity -0.0092381  0.0069943   0.0070095
## accesstoexercise:excessivedrinking    0.0018604  0.0013379   0.0013411
## physicalinactivity:smoking           -0.0064996  0.0059547   0.0059703
## accesstoexercise:smoking              0.0013360  0.0011610   0.0011637
## excessivedrinking:foodenvironment     0.0424335  0.0357544   0.0358378
## accesstoexercise:foodenvironment     -0.0014878  0.0031674   0.0031757
##                                      z value Pr(>|z|)    
## (Intercept)                            3.584 0.000338 ***
## accesstoexercise                       2.701 0.006904 ** 
## excessivedrinking                      0.630 0.528998    
## foodenvironment                        1.730 0.083685 .  
## physicalinactivity                     1.152 0.249150    
## smoking                                1.999 0.045636 *  
## accesstoexercise:physicalinactivity    3.663 0.000249 ***
## excessivedrinking:smoking              1.854 0.063792 .  
## foodenvironment:physicalinactivity     1.509 0.131392    
## foodenvironment:smoking                2.446 0.014426 *  
## excessivedrinking:physicalinactivity   1.318 0.187524    
## accesstoexercise:excessivedrinking     1.387 0.165375    
## physicalinactivity:smoking             1.089 0.276305    
## accesstoexercise:smoking               1.148 0.250938    
## excessivedrinking:foodenvironment      1.184 0.236396    
## accesstoexercise:foodenvironment       0.468 0.639430    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The model averaging computes two models - full average and conditional average. The full average model averages coefficients whether they were included in the candidate model or not. In models without a coefficient, the average will include zeroes. The conditional average model averages coefficients only when they are included in the model.

In both instances, food environment increases diabetes prevalence and the effect is enhaced by smoking. Also, access to exercise opportunities interacts with physical inactivity to increase diabetes prevalence in counties in MD, WV, VA, KY, and OH.


How can we estimate R2 of an averaged model?


There are a variety of pseudo-R2 values proposed. However, their use is still debated. An alternative practice is to report the range of R2 values for the candidate models. Since we are using mixed effects models here, the R2 that is reported is the marginal R2 - the variance in diabetes prevalence explained by the fixed effects.


range(mods2$'R^2')
## [1] 0.6771998 0.6812153


What if I don’t want to average models?


Sometimes it isn’t appropriate to average models. In those cases, we can build a final model using the variables that were included in the average models. This model still includes all variables within 2 AIC, however the coefficients will not be averaged.

In this case, we can calculate Type III error sums of squares, and get estimates for both R2marginal (fixed effects) and R2conditional (fixed + random effects).


vrs <- data.frame("nam" = as.character("name"))
vrs$nam <- capture.output(cat(names(coefficients(modav)[-1]), sep = " + "))
vrs$nam <- paste(vrs$nam, ", random = ~ 1 | state/county, data = dat, method='ML'", sep = "")
bestmod <- eval(parse(text = paste("lme(diabetesprevalence ~ ", vrs$nam, ")")))
r.squaredGLMM(bestmod)
##            R2m       R2c
## [1,] 0.6774568 0.9615326
cbind(coef(summary(bestmod)), Anova(bestmod, type = "III"))
##                                             Value    Std.Error  DF
## (Intercept)                          23.884512106 16.786285669 458
## accesstoexercise                     -0.112250650  0.052228331 458
## excessivedrinking                    -0.460061952  0.572103493 458
## foodenvironment                      -0.820038967  1.328468064 458
## physicalinactivity                    0.548863088  0.387838029 458
## smoking                              -0.586348212  0.443317594 458
## accesstoexercise:physicalinactivity   0.002298371  0.000998868 458
## excessivedrinking:smoking            -0.013225712  0.013755645 458
## foodenvironment:physicalinactivity   -0.052711163  0.032270927 458
## foodenvironment:smoking               0.089424581  0.033684736 458
## excessivedrinking:physicalinactivity  0.001175111  0.009701954 458
## accesstoexercise:excessivedrinking    0.002114319  0.001542461 458
## physicalinactivity:smoking           -0.003667962  0.007087200 458
## accesstoexercise:smoking              0.001413389  0.001374463 458
## excessivedrinking:foodenvironment     0.023166055  0.042320257 458
## accesstoexercise:foodenvironment     -0.001873276  0.003696808 458
##                                         t-value     p-value      Chisq Df
## (Intercept)                           1.4228587 0.155457886 2.09448884  1
## accesstoexercise                     -2.1492291 0.032139562 4.77881201  1
## excessivedrinking                    -0.8041586 0.421722572 0.66901824  1
## foodenvironment                      -0.6172817 0.537355723 0.39420422  1
## physicalinactivity                    1.4151864 0.157693429 2.07196202  1
## smoking                              -1.3226369 0.186616196 1.80982173  1
## accesstoexercise:physicalinactivity   2.3009756 0.021841314 5.47745158  1
## excessivedrinking:smoking            -0.9614752 0.336820603 0.95638051  1
## foodenvironment:physicalinactivity   -1.6333948 0.103073208 2.76017639  1
## foodenvironment:smoking               2.6547509 0.008213113 7.29125118  1
## excessivedrinking:physicalinactivity  0.1211211 0.903648237 0.01517729  1
## accesstoexercise:excessivedrinking    1.3707440 0.171126151 1.94386997  1
## physicalinactivity:smoking           -0.5175474 0.605023966 0.27711165  1
## accesstoexercise:smoking              1.0283206 0.304341824 1.09398549  1
## excessivedrinking:foodenvironment     0.5473987 0.584371462 0.31000029  1
## accesstoexercise:foodenvironment     -0.5067281 0.612589445 0.26564678  1
##                                      Pr(>Chisq)
## (Intercept)                          0.14783115
## accesstoexercise                     0.02881199
## excessivedrinking                    0.41339410
## foodenvironment                      0.53009772
## physicalinactivity                   0.15002841
## smoking                              0.17852962
## accesstoexercise:physicalinactivity  0.01926332
## excessivedrinking:smoking            0.32810049
## foodenvironment:physicalinactivity   0.09663758
## foodenvironment:smoking              0.00692912
## excessivedrinking:physicalinactivity 0.90195179
## accesstoexercise:excessivedrinking   0.16324895
## physicalinactivity:smoking           0.59860063
## accesstoexercise:smoking             0.29558984
## excessivedrinking:foodenvironment    0.57768001
## accesstoexercise:foodenvironment     0.60626701


When should I use multi-model inference?


Multi-model inference is an excellent tool for model selection. It relies on computing power and information theory to select best fit models from hundeds to thousands of candidate models. And, it is rapidly becoming the standard for model selection in terrestrial ecology. We can use it any time we need to select the best-fit regression model.


All the limits to linear models still apply


In this example we didn’t test for independence of variables, multicollinearity, variance inflation, or normality of residuals. These practices all still apply in multi-model inference, but were left out of this guide for consideration of the reader’s time.

Finally, it is always good practice to select global models based on pre-defined hypotheses.

The next quick guide will focus on plotting linear mixed effects models.


County health data source: University of Wisconsin County Health Rankings Project

Data and R code available on GitHub

Written on July 16, 2020

Share:


Articles by category


Articles by technology


Article archive