
Multi-response (classification)
Source: vignettes/articles/Multi-response (classification).Rmd
mrIML is an R package that allows users to generate and interpret multi-response models (i.e., joint species distribution models) leveraging advances in data science and machine learning. mrIML couples the tidymodels infrastructure developed by Max Kuhn and colleagues with model-agnostic interpretable machine learning tools to gain insights into multiple-response data. As such, mrIML is flexible and easily extendable, allowing users to construct everything from simple linear models to tree-based methods for each response using the same syntax, and to compare their predictive performance. In this vignette, we will guide you through how to apply this package to landscape genetics problems, but keep in mind that any multiple-response data can be interrogated using this package (e.g., species or OTU presence/absence data).
Let’s start by loading the required packages and data. The data we are going to use is from Fountain-Jones et al. (2017) and consists of single nucleotide polymorphism (SNP) data from a virus (feline immunodeficiency virus, FIV) infecting bobcats in Southern California.
library(mrIML)
library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
#> ✔ broom 1.0.8 ✔ recipes 1.3.1
#> ✔ dials 1.4.0 ✔ rsample 1.3.0
#> ✔ dplyr 1.1.4 ✔ tibble 3.3.0
#> ✔ ggplot2 3.5.2 ✔ tidyr 1.3.1
#> ✔ infer 1.0.9 ✔ tune 1.3.0
#> ✔ modeldata 1.4.0 ✔ workflows 1.2.0
#> ✔ parsnip 1.3.2 ✔ workflowsets 1.1.1
#> ✔ purrr 1.1.0 ✔ yardstick 1.3.2
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ recipes::step() masks stats::step()
library(flashlight)
set.seed(7007)
load("Features.RData")
load("Responsedata.RData")
# Print first 6 cols
head(Features)[1:6]
#> Grassland Shrub.Scrub Forest HighlyDev Urban Suburban
#> X21 0.07001059 0.5569084 0.010719958 0.000000000 0.000000000 0.003573319
#> X22 0.06766419 0.7673464 0.030587924 0.000000000 0.000000000 0.132680085
#> X23 0.18449304 0.5236581 0.008614977 0.002253148 0.001590457 0.013253810
#> X24 0.09806776 0.7858655 0.000661726 0.000000000 0.000000000 0.001191106
#> X27 0.15776500 0.8416187 0.000616270 0.000000000 0.000000000 0.000000000
#> X31 0.08730321 0.6391331 0.008791658 0.000000000 0.000000000 0.007360458
head(Responsedata)[1:6]
#> env_1 env_10 env_100 env_104 env_107 env_108
#> X21 0 0 1 0 0 0
#> X22 0 0 0 0 0 0
#> X23 0 0 0 0 1 0
#> X24 0 0 0 1 0 0
#> X27 0 0 0 0 1 0
#> X31 1 0 0 1 0 0
When constructing these types of models, SNPs that are common (occur in >80% of samples) or rare (<10% of samples) are difficult to model. This can be a problem for potentially adaptive loci, which tend to be rarer. We provide down-sampling or up-sampling strategies for dealing with unbalanced data (see below), but filtering out common and rare loci is another option. The more response variables (SNPs) you include, the longer the computational time, so for this example, we will drop SNPs occurring in <40% or >70% of individuals. This leaves 29 SNPs remaining. In practice, these filtering steps are stringent, so <20% and >80% may be more appropriate.
# Define the responses
Y <- filterRareCommon(
  Responsedata,
  lower = 0.4,
  higher = 0.7
)

# Define the predictors
X <- Features %>%
  select(where(~ n_distinct(.) > 1))
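For example, the less stringent thresholds suggested above would look like the following (a sketch only; Y_relaxed is an illustrative name, not part of the workflow below):

# Less stringent filter: keep loci occurring in 20-80% of individuals
Y_relaxed <- filterRareCommon(
  Responsedata,
  lower = 0.2,
  higher = 0.8
)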
If your feature data consists of landscape resistance data generated from Circuitscape or similar, we provide functionality that can extract each surface from a folder and generate resistance components that can be used in these models (using principal coordinate analysis). This data can be easily merged with other non-matrix data (e.g., age/sex) if needed. There were no resistance surfaces here, but if there were, we could load them as follows:
R <- resist_components(
  filename = "FILE_PATH",
  p_val = 0.01
)

# Add to predictors
X <- cbind(R, X)
Now that all the data is loaded and ready to go, we can formulate the model using tidymodels syntax. In this case we have binary data (SNP presence/absence at each locus), but the data could also be counts or continuous (the mode argument would be "regression" instead of "classification"). The user can specify any model from the tidymodels universe (see https://www.tidymodels.org/find/ for details). However, we have done most of our testing on random forests (RF), XGBoost and GLMs (generalized linear models). Here we will specify an RF classification model to fit each response.
model_rf <- rand_forest(
  trees = 100,
  mode = "classification",
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("randomForest")
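As noted above, switching mode to "regression" handles count or continuous responses. A minimal sketch reusing the same engine and tuning placeholders (model_rf_reg is our illustrative name):

# Sketch: the equivalent specification for continuous responses;
# only the mode changes
model_rf_reg <- rand_forest(
  trees = 100,
  mode = "regression",
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("randomForest")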
Now we can run the model. Hyperparameter tuning (for algorithms that have hyperparameters) is done automatically by testing how model performance changes across a random grid, and the best-performing combination is kept. The only choice to make is whether to down/up-sample the data or leave it as is. As our viral data set is small, we will not do any up- or down-sampling (i.e., balance_data = "no"). We can also implement parallel processing for larger data sets to speed up the mrIML workflow.
future::plan("multisession", workers = 4)

yhats_rf <- mrIMLpredicts(
  X = X,
  Y = Y,
  Model = model_rf,
  tune_grid_size = 5,
  k = 5,
  racing = FALSE
)
It takes a couple of minutes to run on a laptop, and there are warnings that you can ignore (this data set is small). The above code constructs a tuned random forest model for each response you have selected (29 viral SNPs in this case), using the features in X. We can then check the performance of each model separately and overall. The performance metrics we provide for classification models are area under the ROC curve (roc_AUC), Matthews correlation coefficient (mcc; a good metric when data is imbalanced, i.e., unequal numbers of 0s and 1s in each response), positive predictive value (ppv), specificity (the proportion correctly classified as not having the mutation at that locus) and sensitivity (the proportion correctly classified as having the mutation at that locus). We also provide "prevalence", which tells you how common that SNP was in the population sampled.
ModelPerf_rf <- mrIMLperformance(yhats_rf)
ModelPerf_rf$model_performance
#> # A tibble: 29 × 8
#> response model_name roc_AUC mcc sensitivity ppv specificity prevalence
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 env_131 rand_fore… 0.625 0.25 0.75 0.75 0.5 0.421
#> 2 env_163 rand_fore… 1 0 0 NaN 1 0.632
#> 3 env_164 rand_fore… 0.444 0 1 0.5 0 0.421
#> 4 env_167 rand_fore… 0 -0.632 0.2 0.5 0 0.421
#> 5 env_169 rand_fore… 0.875 0 1 0.333 0 0.421
#> 6 env_212 rand_fore… 0 0 0 NaN 1 0.684
#> 7 env_23 rand_fore… 0.5 0.316 0.25 1 1 0.632
#> 8 env_24 rand_fore… 0.5 0 1 0.333 0 0.421
#> 9 env_41 rand_fore… 0.625 0 0.5 0.667 0.5 0.474
#> 10 env_47 rand_fore… 1 1 1 1 1 0.474
#> # ℹ 19 more rows
ModelPerf_rf$global_performance_summary
#> [1] 0.1128229
You can compare the predictive performance of random forests with any other technique simply by altering the model definition. Let's try a logistic regression. Note that for logistic regression you may need to scale/transform the data or create dummy variables if you include categorical variables. These steps can easily be added to the pipeline following tidymodels syntax.
# Define the logistic regression model
model_glm <- logistic_reg() %>%
  set_engine("glm")

# Fit the mrIML model
yhats_glm <- mrIMLpredicts(
  X = X,
  Y = Y,
  Model = model_glm,
  tune_grid_size = 5,
  k = 5,
  racing = FALSE
)
# Get performance
ModelPerf_glm <- mrIMLperformance(yhats_glm)
ModelPerf_glm$global_performance_summary
#> [1] 0.1907751
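As a sketch of the preprocessing steps mentioned above, a tidymodels recipe could look like the following (illustrative only; glm_recipe is our name, and how a custom recipe is wired into the mrIML workflow may vary between package versions):

# Illustrative recipe: scale numeric predictors and dummy-code any
# categorical ones (not part of the workflow above)
glm_recipe <- recipe(~ ., data = X) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())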
You can see that the random forest model outperforms the logistic regression in this case (a global performance summary of 0.1128229 versus 0.1907751 for the logistic regression). Once we are happy with the underlying model, we can dive in to see how it predicts the SNPs. First we can check variable importance.
glm_impVI <- mrVip(
  yhats_glm,
  threshold = 0.5,
  global_top_var = 10,
  local_top_var = 5,
  model_perf = ModelPerf_glm
)

glm_impVI[[3]]
The output of mrVip() contains a figure summarising variable importance. The main plot on the left shows the global importance (which features shape genetic change overall) and the subplots on the right show the individual models (with an option to filter by the mcc value using threshold). We also provide functionality to assess outlier SNPs utilizing a PCA on the importance scores.
glm_impVI_PCA <- glm_impVI %>%
  mrVipPCA()

glm_impVI_PCA[[1]]
#> Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
#> increasing max.overlaps
We can see only one outlier was detected on PC3 (env_24), indicating that this SNP was best predicted by a different set of predictors. Looking at the individual importance plots shows that, unlike the other SNPs, forest cover was the best predictor for env_24. See the regression vignette for more details.
If your features can be grouped together this can make it easier to interpret variable importance plots. The order of the groups should match the order of the columns in the X data, for example:
groupCov <- c(
  rep("Host_characteristics", 1),
  rep("Urbanisation", 3),
  rep("Vegetation", 2),
  rep("Urbanisation", 1),
  rep("Spatial", 2),
  rep("Host_relatedness", 6),
  rep("Host_characteristics", 1),
  rep("Vegetation", 2),
  rep("Urbanisation", 1)
)
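A quick sanity check along these lines (our own illustration, plain base R) confirms that the labels line up with the columns of X:

# Illustrative check: one group label per predictor column, in order
stopifnot(length(groupCov) == ncol(X))
data.frame(feature = colnames(X), group = groupCov)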
mrIML allows a full suite of model-agnostic interpretable machine learning tools to be applied to individual models or the "global" model through mrFlashlight(). We utilize the R package flashlight, which offers an exciting and ever-growing tool set to interpret these kinds of models. For example, once we create the flashlight objects, we can plot the predicted values of a feature for each response.
fl <- mrFlashlight(
  yhats_rf,
  response = "single",
  index = 1
)

fl %>%
  light_scatter(
    v = "Grassland",
    type = "predicted"
  ) %>%
  plot()
mrIML also provides a wrapper function for some flashlight functionality to visualize the marginal (i.e., partial dependence) or conditional (accumulated local effects) effect of a feature on the responses. Partial dependencies take longer to calculate and are more sensitive to correlated features, so the results of the two can vary. For example, in the two plots below generated with mrCovar(), the results differ slightly but not substantially. ALE plots are a better option if your feature set is even moderately impacted by collinearity (e.g., rho = 0.6). The second plot is the overall smoothed genetic-turnover function.
# Partial dependencies
profilePlot_pd <- mrCovar(
  yhats_rf,
  var = "Grassland",
  sdthresh = 0.05
)

profilePlot_pd[[1]] +
  xlim(0, NA)

# Accumulated local effects
profilePlot_ale <- mrCovar(
  yhats_rf,
  var = "Grassland",
  sdthresh = 0.05,
  type = "ale"
)

profilePlot_ale[[1]] +
  xlim(0, NA)
Finally, we can assess how features interact overall to shape genetic change using mrInteractions(). Be warned: this is memory intensive. Future updates to this package will enable users to visualize these interactions and explore them in more detail, for example using 2D ALE plots.
interactions <- mrInteractions(
  yhats_rf,
  feature = "Forest",
  num_bootstrap = 1,
  top_int = 2
)

interactions[[1]] /
  interactions[[2]] /
  interactions[[3]]
You could easily compare the plots generated above from our random forest model with the logistic regression model we fit earlier.
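For example, a sketch of the same call with the logistic regression fits swapped in (only the model object changes; interactions_glm is an illustrative name):

# Same interaction analysis run on the logistic regression fits
interactions_glm <- mrInteractions(
  yhats_glm,
  feature = "Forest",
  num_bootstrap = 1,
  top_int = 2
)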
You can visualize detected interactions by converting the mrIML object into a flashlight object and using the flashlight::light_profile2d() function. In effect, the resulting heatmap shows the average prediction of the interacting features across loci/response variables. Both ALEs and partial dependencies can be used, but we highly recommend using ALEs for the most robust results (see Molnar (2019) for more detail).
fl <- mrFlashlight(
  yhats_rf,
  response = "single",
  index = 1
)

fl %>%
  light_profile2d(
    c("Forest", "Grassland"),
    cbind(Y, X)
  ) %>%
  plot() +
  theme_bw()
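Following the recommendation above, the ALE version is a one-argument change. This sketch assumes the installed flashlight version accepts type = "ale" in light_profile2d(), as light_profile() does:

# Assumed: light_profile2d() accepts type = "ale" like light_profile()
fl %>%
  light_profile2d(
    c("Forest", "Grassland"),
    cbind(Y, X),
    type = "ale"
  ) %>%
  plot() +
  theme_bw()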
This is just an example of how these tools can be applied to better interpret multi-response models and, in turn, gain insights into how landscape impacts genetic change. We've demonstrated that this package is modular and flexible, allowing future work on 'tidymodels' and interpretable machine learning to be easily incorporated. The predictions from these models, for example, can be used to identify individual SNPs that may be adaptive, to assess gene flow across space (future updates), or to help better formulate probabilistic models such as generalized dissimilarity models (GDMs).