R package for exhaustive selection of variable subsets below a correlation threshold.


corrselect


Fast and Flexible Predictor Pruning for Data Analysis and Modeling

The corrselect package provides simple, high-level functions for predictor pruning using association-based and model-based approaches. Whether you need to reduce multicollinearity before modeling or remove correlated predictors from a dataset, corrselect offers fast, deterministic solutions with minimal code.

Quick Start

library(corrselect)
data(mtcars)

# Association-based pruning (model-free)
pruned <- corrPrune(mtcars, threshold = 0.7)
names(pruned)

# Model-based pruning (VIF)
pruned <- modelPrune(mpg ~ ., data = mtcars, limit = 5)
attr(pruned, "selected_vars")

Statement of Need

Variable selection is a central task in statistics and machine learning, particularly when working with high-dimensional or collinear data. In many applications, users aim to retain sets of variables that are weakly associated with one another to avoid redundancy and reduce overfitting. Common approaches such as greedy filtering or regularized regression either discard useful features or do not guarantee bounded pairwise associations.

This package addresses the admissible set problem: selecting all maximal subsets of variables such that no pair exceeds a user-defined threshold. It generalizes to mixed-type data, supports multiple association metrics, and allows constrained subset selection via force_in (e.g. always include key predictors).
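
A minimal sketch of the admissible-set idea, using the corrSelect() interface shown later in this README (mtcars is purely illustrative):

library(corrselect)
data(mtcars)

# Enumerate every maximal subset of columns in which no pairwise
# association exceeds the 0.7 threshold
res <- corrSelect(mtcars, threshold = 0.7)
show(res)            # all maximal admissible subsets
as.data.frame(res)   # the same subsets as a data frame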

These features make the package useful in domains like:

  • ecological and bioclimatic modeling,
  • trait-based species selection,
  • interpretable machine learning pipelines.

Features

High-Level Pruning Functions

  • corrPrune(): Association-based predictor pruning

    • Model-free, works on raw data
    • Automatic correlation/association measure selection
    • Exact mode for guaranteed optimal solutions (recommended for p ≤ 100)
    • Fast greedy mode for large datasets (p > 100)
    • Protect important variables with force_in
  • modelPrune(): Model-based predictor pruning

    • VIF-based iterative removal
    • Supports lm, glm, lme4, glmmTMB engines
    • Custom engine support for any modeling package (INLA, mgcv, brms, etc.)
    • Prunes fixed effects in mixed models
    • Returns fitted model with pruned predictors

Advanced Subset Enumeration

  • Exhaustive exact subset search using graph algorithms:

    • Eppstein–Löffler–Strash (ELS)
    • Bron–Kerbosch (with optional pivoting)
    • Used internally by corrPrune(mode = "exact")
  • Multiple association metrics:

    • "pearson", "spearman", "kendall"
    • "bicor" (WGCNA), "distance" (energy), "maximal" (minerva)
    • "eta", "cramersv" for mixed-type data
  • force_in: protect variables from removal (demonstrated in the sketch after this list)

  • Deterministic tie-breaking for reproducibility
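
The force_in constraint also applies to the enumeration functions. A minimal sketch, assuming corrSelect() exposes force_in in the same way corrPrune() does (check ?corrSelect for the exact argument):

# Enumerate all maximal admissible subsets that always contain "wt"
# (force_in for corrSelect() is assumed here to mirror corrPrune())
res <- corrSelect(mtcars, threshold = 0.7, force_in = "wt")
show(res)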

Installation

# Install from CRAN
install.packages("corrselect")

# Or install development version from GitHub
# install.packages("pak")
pak::pak("gcol33/corrselect")

Usage Examples

Association-Based Pruning (corrPrune)

library(corrselect)
data(mtcars)

# Basic: Remove correlated predictors
pruned <- corrPrune(mtcars, threshold = 0.7)
names(pruned)

# Protect important variables
pruned <- corrPrune(mtcars, threshold = 0.7, force_in = "mpg")

# Use exact mode (slower, guaranteed optimal)
pruned <- corrPrune(mtcars, threshold = 0.7, mode = "exact")

# Use greedy mode (faster for large datasets)
pruned <- corrPrune(mtcars, threshold = 0.7, mode = "greedy")

# Check what was retained
attr(pruned, "selected_vars")

Model-Based Pruning (modelPrune)

# Linear model with VIF threshold
pruned <- modelPrune(mpg ~ cyl + disp + hp + wt, data = mtcars, limit = 5)
attr(pruned, "removed_vars")

# GLM with binomial family
mtcars_glm <- mtcars
mtcars_glm$am_binary <- as.factor(mtcars_glm$am)
pruned <- modelPrune(am_binary ~ cyl + disp + hp,
                     data = mtcars_glm, engine = "glm",
                     family = binomial(), limit = 5)

# Mixed model (requires lme4)
if (requireNamespace("lme4", quietly = TRUE)) {
  # Use built-in sleepstudy data with polynomial terms
  sleep <- lme4::sleepstudy
  sleep$Days2 <- sleep$Days^2
  suppressWarnings(
    pruned <- modelPrune(Reaction ~ Days + Days2 + (1|Subject),
                         data = sleep, engine = "lme4", limit = 5)
  )
  attr(pruned, "selected_vars")
}

# Custom engine (advanced: works with any modeling package)
# Example: INLA-based pruning
if (requireNamespace("INLA", quietly = TRUE)) {
  inla_engine <- list(
    name = "inla",
    fit = function(formula, data, ...) {
      INLA::inla(formula = formula, data = data,
                 family = "gaussian", ...)
    },
    diagnostics = function(model, fixed_effects) {
      # Use posterior SD as badness metric
      scores <- model$summary.fixed[, "sd"]
      names(scores) <- rownames(model$summary.fixed)
      scores[fixed_effects]
    }
  )

  # Simulated data for illustration (x1 and x2 are strongly correlated)
  set.seed(1)
  df <- data.frame(x1 = rnorm(50))
  df$x2 <- df$x1 + rnorm(50, sd = 0.1)
  df$y  <- df$x1 + rnorm(50)

  pruned <- modelPrune(y ~ x1 + x2, data = df,
                       engine = inla_engine, limit = 0.5)
}
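
As the sketch above suggests, a custom engine is simply a list with a fit() function that returns a fitted model and a diagnostics() function that returns one score per fixed effect; modelPrune() compares those scores against limit (here 0.5), in the same spirit as the VIF workflow above. The INLA call and the choice of posterior SDs as the badness metric are illustrative, not a recommendation.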

Exact Subset Enumeration (Advanced)

# Find ALL maximal subsets
res <- corrSelect(mtcars, threshold = 0.7)
show(res)

# Extract a specific subset
subset1 <- corrSubset(res, mtcars, which = 1)

# Convert to data frame
as.data.frame(res)

Choosing Between corrPrune and modelPrune

Feature                          corrPrune()                           modelPrune()
Requires model specification?    No                                    Yes
Based on                         Pairwise correlations/associations    Model diagnostics (VIF)
Speed                            Fast (greedy mode)                    Moderate (refits models)
Works without response?          Yes                                   No
Supports mixed models?           No                                    Yes (lme4, glmmTMB)
Best for                         Exploratory analysis, large p         Regression workflows, VIF reduction

Tip: Use corrPrune() first to reduce dimensionality, then modelPrune() for final cleanup within a modeling framework.
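
A minimal sketch of that two-step workflow on mtcars (the threshold and VIF limit are illustrative):

# Step 1: association-based pruning, keeping the response column
pruned_df <- corrPrune(mtcars, threshold = 0.7, force_in = "mpg")

# Step 2: VIF-based cleanup within the reduced predictor set
fit <- modelPrune(mpg ~ ., data = pruned_df, limit = 5)
attr(fit, "selected_vars")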

Advanced Features

Mixed-Type Data

Use assocSelect() for exact enumeration with mixed data types:

df <- data.frame(
  height = rnorm(30, 170, 10),
  weight = rnorm(30, 70, 12),
  group  = factor(sample(c("A","B"), 30, TRUE)),
  rating = ordered(sample(c("low","med","high"), 30, TRUE),
                   levels = c("low","med","high"))
)

res <- assocSelect(df, threshold = 0.6)
show(res)

Precomputed Correlation Matrices

Work directly with correlation matrices:

mat <- cor(mtcars[, sapply(mtcars, is.numeric)])
res <- MatSelect(mat, threshold = 0.7, method = "els")

Support

"Software is like sex: it's better when it's free." — Linus Torvalds

I'm a PhD student who builds R packages in my free time because I believe good tools should be free and open. I started these projects for my own work and figured others might find them useful too.

If this package saved you some time, buying me a coffee is a nice way to say thanks. It helps with my coffee addiction.

Buy Me A Coffee

License

MIT (see the LICENSE.md file)

Citation

@software{corrselect,
  author = {Colling, Gilles},
  title = {corrselect: Correlation-Based and Model-Based Predictor Pruning},
  year = {2025},
  url = {https://CRAN.R-project.org/package=corrselect},
  doi = {10.32614/CRAN.package.corrselect}
}