
Multi-response (regression)
Source:vignettes/articles/Multi-response (regression).Rmd
Multi-response (regression).Rmd
mrIML
is an R package that allows users to generate and
interpret multi-response models (i.e., joint species distribution
models) leveraging advances in data science and machine learning.
mrIML
couples the tidymodels
infrastructure, developed by Max Kuhn and colleagues, with
model-agnostic interpretable machine learning tools to gain insights
into multiple response data. mrIML
is flexible and easily
extendable, allowing users to construct everything from simple linear
models to tree-based methods for each response using the same syntax and
compare them under the same predictive performance criteria.
In this vignette, we will guide you through how to apply this package
to ecological genomics problems using the regression functionality of
the package. The data set we’ll use comes from Fitzpatrick et al. 2014,
who examined adaptive genetic variation in relation to geography and
climate adaptation (current and future) in balsam poplar (Populus
balsamifera). See Ecology Letters, (2014) doi: 10.1111/ele.12376. In
this paper, they used the similar gradient forests routine (see Ellis et
al. 2012 Ecology), and we show that mrIML
can not only
provide more flexible model choice and interpretive capabilities but can
derive new insights into the relationship between climate and genetic
variation. Further, we show that linear models of each loci have
slightly greater predictive performance.
# Read in data file with minor allele freqs & env/space variables
load("gfData.RData")
# Get climate & MEM variables for predictors
X <- gfData %>%
select(
contains("bio_"),
"elevation",
contains("MEM.")
) %>%
mutate_if(is.integer, as.numeric)
Y <- gfData %>%
select(contains("GI5")) # GIGANTEA-5 (GI5)
We focus on the adaptive SNP loci from GIGANTEA-5 (GI5) gene that has known links to stem development, plant circadian clock, and light perception pathway. The data represents the proportion of individuals in each population with that SNP loci.
Parallel processing
mrIML
uses the flexible future.apply::future_lapply()
to set up multi-core processing. This can greatly speed up the more
computationally expensive mrIML
functions. In the example
below, we set up a cluster using 4 cores. If you don’t set up a cluster,
the default settings will be used and the analysis will run
sequentially.
future::plan("multisession", workers = 4)
Building and comparing models
Performing the analysis is very similar to our classification
example. Let’s start by constructing a linear model for this data set.
We set Model 1 to a linear regression. See https://www.tidymodels.org/find/ for other regression
model options. Note that ‘mode’ must be ‘regression’ and in
mrIML
predicts, model has to be set to ‘regression’.
model_lm <- linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
yhats_lm <- mrIMLpredicts(
X = X,
Y = Y,
Model = model_lm,
prop = 0.7,
tune_grid_size = 10,
k = 10,
racing = FALSE
)
Model performance can be examined the same way as in the classification
example, however the reported performance metrics are different.
Running mrIMLperformance()
on a regression model provides
the root mean square error (rmse) and R².
ModelPerf_lm <- mrIMLperformance(yhats_lm)
ModelPerf_lm$model_performance
#> # A tibble: 20 × 4
#> response model_name rmse rsquared
#> <chr> <chr> <dbl> <dbl>
#> 1 CANDIDATE_GI5_108 linear_reg 0.0440 0.290
#> 2 CANDIDATE_GI5_198 linear_reg 0.140 0.392
#> 3 CANDIDATE_GI5_268 linear_reg 0.0847 0.182
#> 4 CANDIDATE_GI5_92 linear_reg 0.0965 0.116
#> 5 CANDIDATE_GI5_1950 linear_reg 0.243 0.000340
#> 6 CANDIDATE_GI5_2382 linear_reg 0.137 0.00896
#> 7 CANDIDATE_GI5_2405 linear_reg 0.145 0.398
#> 8 CANDIDATE_GI5_2612 linear_reg 0.120 0.585
#> 9 CANDIDATE_GI5_2641 linear_reg 0.0669 0.120
#> 10 CANDIDATE_GI5_33 linear_reg 0.167 0.618
#> 11 CANDIDATE_GI5_3966 linear_reg 0.178 0.0244
#> 12 CANDIDATE_GI5_5033 linear_reg 0.0725 0.00277
#> 13 CANDIDATE_GI5_5090 linear_reg 0.136 0.262
#> 14 CANDIDATE_GI5_5119 linear_reg 0.121 0.0118
#> 15 CANDIDATE_GI5_8997 linear_reg 0.194 0.233
#> 16 CANDIDATE_GI5_9287 linear_reg 0.0632 0.0150
#> 17 CANDIDATE_GI5_9447 linear_reg 0.0816 NA
#> 18 CANDIDATE_GI5_9551 linear_reg 0.151 0.192
#> 19 CANDIDATE_GI5_9585 linear_reg 0.177 0.321
#> 20 CANDIDATE_GI5_9659 linear_reg 0.151 0.637
ModelPerf_lm$global_performance_summary
#> [1] 0.1284043
You can see that the overall R² is 0.13 but there is substantial variation in predictive performance across loci.
Let’s compare the performance of the linear model to that of a random
forest. Random forest is the computational engine in gradient forests.
Notice for random forests we have two hyperparameters to tune:
mtry
(number of features to randomly include at each split)
and min_n
(the minimum number of data points in a node that
are required for the node to be split further). The syntax tune()
acts as a placeholder to
tell mrIML
to tune those hyperparameters across a grid of
values (defined in mrIMLpredicts
tune_grid_size
argument). Different algorithms will have
different hyperparameters; see https://www.tidymodels.org/find/parsnip/ for parameter
details. Note that large grid sizes (>10) for algorithms with lots of
hyperparameters (such as extreme gradient boosting) will be
computationally demanding. In this case we choose a grid size of 5.
model_rf <- rand_forest(
trees = 100,
mode = "regression",
mtry = tune(),
min_n = tune()
) %>%
set_engine("randomForest")
yhats_rf <- mrIMLpredicts(
X = X,
Y = Y,
Model = model_rf,
tune_grid_size = 5
)
#> | | | 0% | |==== | 5% | |======= | 10% | |========== | 15% | |============== | 20% | |================== | 25% | |===================== | 30% | |======================== | 35% | |============================ | 40% | |================================ | 45% | |=================================== | 50% | |====================================== | 55% | |========================================== | 60% | |============================================== | 65% | |================================================= | 70% | |==================================================== | 75% | |======================================================== | 80% | |============================================================ | 85% | |=============================================================== | 90% | |================================================================== | 95% | |======================================================================| 100%
ModelPerf_rf <- mrIMLperformance(yhats_rf)
ModelPerf_rf$model_performance
#> # A tibble: 20 × 4
#> response model_name rmse rsquared
#> <chr> <chr> <dbl> <dbl>
#> 1 CANDIDATE_GI5_108 rand_forest 0.0309 0.0738
#> 2 CANDIDATE_GI5_198 rand_forest 0.0804 0.508
#> 3 CANDIDATE_GI5_268 rand_forest 0.0757 0.658
#> 4 CANDIDATE_GI5_92 rand_forest 0.160 0.352
#> 5 CANDIDATE_GI5_1950 rand_forest 0.141 0.222
#> 6 CANDIDATE_GI5_2382 rand_forest 0.0650 0.0148
#> 7 CANDIDATE_GI5_2405 rand_forest 0.109 0.556
#> 8 CANDIDATE_GI5_2612 rand_forest 0.145 0.0809
#> 9 CANDIDATE_GI5_2641 rand_forest 0.0825 0.00293
#> 10 CANDIDATE_GI5_33 rand_forest 0.154 0.285
#> 11 CANDIDATE_GI5_3966 rand_forest 0.116 0.206
#> 12 CANDIDATE_GI5_5033 rand_forest 0.0438 0.418
#> 13 CANDIDATE_GI5_5090 rand_forest 0.101 0.474
#> 14 CANDIDATE_GI5_5119 rand_forest 0.124 0.0933
#> 15 CANDIDATE_GI5_8997 rand_forest 0.168 0.449
#> 16 CANDIDATE_GI5_9287 rand_forest 0.0528 0.0517
#> 17 CANDIDATE_GI5_9447 rand_forest 0.144 0.00788
#> 18 CANDIDATE_GI5_9551 rand_forest 0.116 0.456
#> 19 CANDIDATE_GI5_9585 rand_forest 0.140 0.235
#> 20 CANDIDATE_GI5_9659 rand_forest 0.126 0.371
ModelPerf_rf$global_performance_summary
#> [1] 0.1086905
#easier to see with plots
plots <- mrPerformancePlot(
ModelPerf1 = ModelPerf_lm,
ModelPerf2 = ModelPerf_rf,
mode = "regression"
)
plots[[1]] /
plots[[2]]
You can see that predictive performance is actually slightly less
using RF (overall R² = 0.11) but for some loci RF does better than our
LM and sometimes worse. Which to choose? Generally, the simpler the
better, the linear model in this case, but it depends on how important
you think non-linear responses are. In future versions of
mrIML
we will implement ensemble models that will overcome
this issue. However, for the time being, we’ll need to interrogate the
model and make our own informed decision.
Interpreting your models
A first step is to look at variable importance. We do this below for
the RF model using mrVip()
.
VI <- mrVip(
yhats_rf,
mrBootstrap_obj = NULL,
threshold = 0.1,
global_top_var = 10,
local_top_var = 5,
taxa = "CANDIDATE_GI5_9585"
)
VI[[3]]
In the above figure, the top left plot shows the ranked most
important variables across all models collectively, while the subplots
in the top right show the most important variables in the top few models
(the number of variables and models to visualize can be controlled using
the global_top_var
, local_top_var
, and
threshold
arguments of mrVip()
, see
?mrVip()
). The bottom plot shows the most important
variables in a specific response model, which we defined by
taxa = "CANDIDATE_GI5_9585"
. Generally, bio_10
(mean summer temperature), bio_1
(mean annual temperature),
and bio_18
(summer precipitation) are the most influential
predictor variables, however this varies quite a lot across the loci.
Summer precipitation was not as important in Fitzpatrick et al. but
otherwise these results are similar.
We can also perform PCA of the variable importance scores in the different models to group loci that behave similarly and to identify outliers.
# PCA
VI_PCA <- VI %>%
mrVipPCA()
VI_PCA$eigenvalues
#> [1] 3.35524390 2.35918680 1.80331596 0.97732841 0.71118104 0.64171481
#> [7] 0.49911497 0.26273404 0.19623715 0.14511906 0.04882385
VI_PCA[[1]]
#> Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
#> increasing max.overlaps
The right plot of the first two PCs shows that candidate 5119, 9287, 5033, 92, and 108 are shaped similarly by the features we included and may, for example, be products of linked selection. The left plot shows that most of the variability in the variable importance data is captured by the first three principal components.
Note that you can also supply bootstraps for importance scores in
mrVip()
, but this functionality is
still under development for regression models.
Next, we can explore the model further by plotting the relationships
between our SNPs and a feature in our set. Let’s choose
bio_1
(mean annual temperature) and plot the individual and
global (average of all SNPs) partial dependency (PD) plots.
PD_bio1 <- mrCovar(
yhats_rf,
var = "bio_1",
sdthresh = 0.01
)
(PD_bio1[[1]] + ylim(0, 0.4)) /
(PD_bio1[[2]] + ylim(0, 0.4)) /
(PD_bio1[[3]] + ylim(0, NA))
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
The top plot in the above figure shows partial dependency curves for
SNPs that respond to mean annual temperature. What we mean by “respond”
here is that the prediction surface (the line) deviates across the Y
axis of the PD plots. We measure this deviation by calculating the
standard deviation and use that as a threshold
(sdthresh = 0.01
in this case, and this will differ by data
set). The middle plot is the average partial dependency curve of all
SNPs across an annual temperature gradient. The individual PDs are shown
as grey silhouettes in the background. This is very similar to the
pattern observed by Fitzpatrick et al., except
with a slight decline in SNP turnover with mean annual temperatures >
0. Combined, you can see here only a few candidate SNPs are
driving this pattern and these may warrant further interrogation.
The bottom plot in the figure shows the rate of change over the SNP PD curves. There is a general rise from -80 to -20. This kind of plot can help to identify general thresholds.
Let’s compare the PDs to accumulated local effect (ALE) plots that
are less sensitive to correlations among features (see Molnar 2019). We
generate ALE plots by supplying type = "ale"
to
mrCovar()
.
ALE_bio1 <- mrCovar(
yhats_rf,
var = "bio_1",
sdthresh = 0.01,
type = "ale"
)
(ALE_bio1[[1]] + ylim(0, 0.4)) /
(ALE_bio1[[2]] + ylim(0, 0.4)) /
(ALE_bio1[[3]] + ylim(0, NA))
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
The effect of mean annual temperature on SNP turnover is not as distinct in the global ALE plot. This may mean that correlations between features may be important for the predictions.
Other interpretation methods
mrIML
has easy-to-use
functionality that can quantify interactions between features. Note that
this can take a while to compute and will be the topic of future
work.
This is touching only the surface of what is possible in terms of interrogating this model. Both Flashlight and IML packages have a wide variety of tools that can offer novel insights into how these models perform. See https://cran.r-project.org/web/packages/flashlight/vignettes/flashlight.html and https://cran.r-project.org/web/packages/iml/vignettes/intro.html for other options.