Overview
mrIML lets users build and interpret multivariate machine learning models using the tidyverse (tidymodels syntax in particular). It builds on ideas from Gradient Forests (Ellis et al., 2012), ecological genomic approaches (Fitzpatrick & Keller, 2015), and multi-response stacking algorithms (Xing et al., 2020).
The package can be applied to any multi-response machine learning problem, but it was designed for data common in community ecology (site-by-species data) and ecological genomics (individual- or population-by-SNP-loci data).
How to Install
You can install the released version of mrIML with install.packages(), or the development version from GitHub using devtools:
# Install released version
install.packages("mrIML")
# Install development version
devtools::install_github('nickfountainjones/mrIML')
Using mrIML
To get started, load mrIML and tidymodels:
library(mrIML)
library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
#> ✔ broom 1.0.8 ✔ recipes 1.3.0
#> ✔ dials 1.4.0 ✔ rsample 1.3.0
#> ✔ dplyr 1.1.4 ✔ tibble 3.2.1
#> ✔ ggplot2 3.5.2 ✔ tidyr 1.3.1
#> ✔ infer 1.0.8 ✔ tune 1.3.0
#> ✔ modeldata 1.4.0 ✔ workflows 1.2.0
#> ✔ parsnip 1.3.1 ✔ workflowsets 1.1.0
#> ✔ purrr 1.0.4 ✔ yardstick 1.3.2
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ recipes::step() masks stats::step()
Many functions in mrIML benefit from parallel processing, which can be enabled by setting a plan with the future package:
future::plan("multisession", workers = 2)
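The worker count above is fixed at 2. As a rough sketch (not something mrIML requires), the count can instead be derived from the machine, and the previous plan restored once the parallel work is done. The variable names `n_cores`, `n_workers`, and `old_plan` are illustrative, and the block is guarded so it does nothing if future is not installed.

```r
# Sketch: pick a worker count from the available cores, leaving one free.
# parallel::detectCores() can return NA on some platforms, so fall back to 1.
n_cores <- parallel::detectCores()
n_workers <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)

if (requireNamespace("future", quietly = TRUE)) {
  # plan() returns the previously active plan, so it can be restored later
  old_plan <- future::plan("multisession", workers = n_workers)

  # ... run mrIML functions that benefit from parallel processing here ...

  # Restore the earlier plan when finished
  future::plan(old_plan)
}
```

Leaving one core free is a common convention rather than a rule; on shared machines a smaller fixed number may be the safer choice.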
The core function of mrIML is mrIMLpredicts(), which is a wrapper around the tidymodels workflow that fits a provided model to each response variable in a multi-response data set.
# Load example multi-response data
data <- MRFcov::Bird.parasites
# Split into response and predictor data
Y <- data %>%
  select(-c("scale.prop.zos"))
X <- data %>%
  select(scale.prop.zos)
# Define tidymodel
model <- rand_forest(
  trees = 100,
  mode = "classification",
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("randomForest")
# Fit multi-response model
mrIML_model <- mrIMLpredicts(
  X = X,
  Y = Y,
  Model = model,
  prop = 0.7,
  k = 5
)
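To make the idea of fitting one model per response variable concrete, here is a minimal base R sketch. It is not mrIML's implementation: it uses simulated data, a plain logistic regression in place of a tuned tidymodels workflow, and a single 70/30 split chosen only for illustration. All object and column names (`X_sim`, `Y_sim`, `sp_a`, `fits`, and so on) are hypothetical.

```r
set.seed(42)

# Simulated stand-in data: one predictor and three binary responses
n <- 200
X_sim <- data.frame(x1 = rnorm(n))
Y_sim <- data.frame(
  sp_a = rbinom(n, 1, plogis(1.5 * X_sim$x1)),
  sp_b = rbinom(n, 1, plogis(-1.0 * X_sim$x1)),
  sp_c = rbinom(n, 1, 0.5)
)

# Keep 70% of rows for training and hold out the rest for testing
train_idx <- sample(n, size = round(0.7 * n))

# Fit one model per response column and score it on the held-out rows
fits <- lapply(names(Y_sim), function(resp) {
  dat <- cbind(y = Y_sim[[resp]], X_sim)
  fit <- glm(y ~ ., data = dat[train_idx, , drop = FALSE], family = binomial())
  pred <- predict(fit, newdata = dat[-train_idx, , drop = FALSE], type = "response")
  list(
    response = resp,
    model = fit,
    accuracy = mean((pred > 0.5) == dat$y[-train_idx])
  )
})
names(fits) <- names(Y_sim)

# Held-out accuracy for each response
accuracies <- sapply(fits, function(f) f$accuracy)
accuracies
```

The result is a named list with one fitted model per response, which is the general shape of output that per-response summaries (performance, variable importance) can then be computed from.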
The object mrIML_model can be investigated using:
- mrIMLperformance() to get performance metrics for each response variable,
- mrvip() to get variable importance for each response variable,
- mrFlashlight() to get partial dependence plots for each response variable,
- mrCovar() to get covariate importance for each predictor variable, and
- mrInteractions() to get interaction importance for each predictor variable in the response models.
Two multi-response models can be compared using mrPerformance().
Bootstrapping can be implemented using mrBootstrap(), which can then be used to quantify uncertainty around partial dependence plots with mrPdPlotBootstrap() and variable importance with mrvipBootstrap(), as well as to build co-occurrence networks using mrCoOccurNet().
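For readers unfamiliar with bootstrapping, the general idea behind this kind of uncertainty estimate can be sketched in base R. This is an illustration of the resampling principle only, not a description of how mrBootstrap() works internally: it refits a simple logistic regression on rows resampled with replacement and summarises the spread of one coefficient. The data are simulated and all names (`dat_sim`, `boot_slopes`, `ci`) are hypothetical.

```r
set.seed(1)

# Simulated stand-in data: one predictor and one binary response
n <- 150
dat_sim <- data.frame(x1 = rnorm(n))
dat_sim$y <- rbinom(n, 1, plogis(1.2 * dat_sim$x1))

# Refit the model on rows resampled with replacement, keeping the slope
n_boot <- 200
boot_slopes <- vapply(seq_len(n_boot), function(i) {
  idx <- sample(n, replace = TRUE)
  fit <- glm(y ~ x1, data = dat_sim[idx, ], family = binomial())
  unname(coef(fit)["x1"])
}, numeric(1))

# 95% percentile interval for the slope across bootstrap refits
ci <- quantile(boot_slopes, c(0.025, 0.975))
ci
```

The same pattern applies to any quantity derived from a fitted model, such as a variable importance score or a point on a partial dependence curve: compute it on each refit and summarise the distribution.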
Recent mrIML publications
Fountain-Jones, N. M., Kozakiewicz, C. P., Forester, B. R., Landguth, E. L., Carver, S., Charleston, M., Gagne, R. B., Greenwell, B., Kraberger, S., Trumbo, D. R., Mayer, M., Clark, N. J., & Machado, G. (2021). MrIML: Multi-response interpretable machine learning to model genomic landscapes. Molecular Ecology Resources, 21, 2766–2781. https://doi.org/10.1111/1755-0998.13495
Sykes, A. L., Silva, G. S., Holtkamp, D. J., Mauch, B. W., Osemeke, O., Linhares, D. C. L., & Machado, G. (2021). Interpretable machine learning applied to on-farm biosecurity and porcine reproductive and respiratory syndrome virus. Transboundary and Emerging Diseases, 00, 1–15. https://doi.org/10.1111/tbed.14369
Fountain-Jones, N. M., Appaw, R., Alkhamis, M., Baker, S., Clark, N., Powell-Romero, F., Mayer, M., Machado, G., & Videvall, E. (2024). Advancing ecological community analysis with MrIML 2.0: Unravelling taxa associations through interpretable machine learning. Authorea [preprint]. https://doi.org/10.22541/au.172676147.77148600/v1
References
Ellis, N., Smith, S. J., & Pitcher, C. R. (2012). Gradient forests: Calculating importance gradients on physical predictors. Ecology, 93, 156–168. https://doi.org/10.1890/11-0252.1
Fitzpatrick, M. C., & Keller, S. R. (2015). Ecological genomics meets community-level modelling of biodiversity: Mapping the genomic landscape of current and future environmental adaptation. Ecology Letters, 18, 1–16. https://doi.org/10.1111/ele.12376
Xing, L., Lesperance, M. L., & Zhang, X. (2020). Simultaneous prediction of multiple outcomes using revised stacking algorithms. Bioinformatics, 36, 65–72. https://doi.org/10.1093/bioinformatics/btz531