Skip to content

Add Joint-RPCA to mia: methods, tests, examples, vignette#789

Merged
TuomasBorman merged 17 commits intomicrobiome:rpcafrom
aituar17:feature/joint-rpca
Jan 20, 2026
Merged

Add Joint-RPCA to mia: methods, tests, examples, vignette#789
TuomasBorman merged 17 commits intomicrobiome:rpcafrom
aituar17:feature/joint-rpca

Conversation

@aituar17
Copy link

@aituar17 aituar17 commented Nov 7, 2025

Summary

This PR adds a complete Joint-RPCA (OptSpace) module for multi-omic integration in mia.
It introduces the full backend implementation, preprocessing utilities, and example workflows for replicating iHMP/IBDMDB analyses.

The update includes:

  • Core functions (jointRPCA, jointRPCAuniversal, and supporting OptSpace solvers) for single- and multi-omic RPCA ordination.
  • Comprehensive preprocessing, projection, and masking utilities ensuring stable compositional handling.
  • Unit tests validating structure, reproducibility, and projection consistency on synthetic data.
  • Quarto example notebooks demonstrating two- and three-omic IBDMDB analyses, benchmarking, and replication of published results.
  • Example IBDMDB data subsets for reproducible testing (with instructions for optional MTX data download).

Together, these additions enable full Joint-RPCA analysis and validation within R, reproducing the same behavior as the original Python implementation on IBD multi-omic datasets.

What's included

Core Joint-RPCA Implementation

  • R/jointRPCA.R — Implements the core Joint Robust Principal Component Analysis (Joint-RPCA) using the OptSpace algorithm across one or more compositional tables. Handles preprocessing, filtering, shared sample alignment, and factorization for multi-omic inputs.
  • R/jointRPCAuniversal.R — Provides a universal user-facing wrapper for jointRPCA(), automatically detecting input type (matrix, list, SummarizedExperiment, or MultiAssayExperiment) and dispatching accordingly.
  • R/jointRPCAutils.R — Utility functions supporting the main pipeline, including data validation, helper wrappers, and structure checks used throughout the Joint-RPCA modules.

OptSpace Optimization Backend

  • R/jointOptspaceHelper.R — Internal engine that orchestrates the Joint OptSpace factorization: splitting data into train/test subsets, calling the solver, constructing ordination results, and projecting held-out samples.
  • R/jointOptspaceSolve.R — Low-level OptSpace solver implementation for joint factorization across multiple paired matrices, producing shared sample embeddings and per-view feature loadings.
  • R/optspaceHelper.R — Helper for single-view (non-joint) OptSpace-based RPCA. Used for initialization and simple validation steps when only one table is provided.

Preprocessing and Data Handling

  • R/rpcaTableProcessing.R — Handles filtering, normalization, and zero-count removal for compositional tables prior to RPCA. Ensures numeric stability and removes low-prevalence or low-abundance features/samples.
  • R/maskValueOnly.R — Safely converts numeric inputs into masked matrices by replacing non-finite values with NA while preserving dimensionality and a logical mask for missing positions.
  • R/transform.R — Implements .transform(), an internal projection function for aligning new compositional data (matrices or per-view lists) to an existing Joint-RPCA ordination space.
  • R/transformHelper.R — Supports .transform() with .transform_helper(), handling both single-view and multi-view projections, optional sample deduplication, and alignment to training feature sets.

Testing

  • tests/testthat/test-jointRPCA.R — Synthetic unit tests verifying output structure, behavior, and robustness of jointRPCA() and .transform() across single- and multi-omic data.
    Covers shared-sample alignment, projection correctness, duplicate-ID errors, and consistency of component dimensions.

Reproducible Examples and Benchmarks

Located under inst/examples/, these Quarto notebooks demonstrate reproducible analysis pipelines and validation of the Joint-RPCA implementation.

  • inst/examples/joint_rpca_example.qmd — Minimal working example showing how to run Joint-RPCA on synthetic data and inspect ordination results.
  • inst/examples/ihmp_ibd_replication.qmd — Replication of the iHMP IBD dataset single-omic (16S) analysis to validate consistency with published Python-based results.
  • inst/examples/ibdmdb_2omic_jointrpca.qmd — Two-omic (MGX + MTX) Joint-RPCA integration on the IBDMDB dataset, demonstrating successful cross-omic ordination.
  • inst/examples/ibdmdb_3omic_jointrpca.qmd — Three-omic (16S + MGX + MTX) integration example; demonstrates full pipeline behavior and visualization when shared samples are limited.
  • inst/examples/ibdmdb_benchmarking.qmd — Performance benchmarking and variance-explained analysis comparing different omic combinations to assess numerical stability and efficiency of the R implementation.

Example Data (IBDMDB)

Stored under inst/examples/data_ibdmdb_raw/ — small real-world data subset for demonstration and testing.

  • taxonomic_profiles_mtx_new.tsv — Metatranscriptomic taxonomic profiles (subset of HMP2 IBDMDB).
  • taxonomic_profiles_mgx_new.tsv, taxonomic_profiles_mgx.tsv — Metagenomic taxonomic profiles.
  • taxonomic_profiles_16s_new.tsv, taxonomic_profiles_16s.tsv — 16S rRNA sequencing taxonomic profiles.
  • hmp2_metadata_2018-08-20.csv — Subject-level and sample-level metadata, including diagnosis and visit information.

Note: The metatranscriptomic functional table ecs_relab.tsv (taxonomic profiles for MTX from HMP2) is not included due to size limits.
To reproduce full analyses, download it manually from the official iHMP IBDMDB data portal and place it in the same folder (inst/examples/data_ibdmdb_raw/).

@antagomir
Copy link
Member

Looks very clear. Some overall comments on the PR:

  1. Example data

Is now in inst/examples subfolder and takes around 18Mb (maybe less with compression).

We could include this as a readily prepared demo data set in data/ folder (like the other ones there). Then we can easily use it in examples and vignettes. Data preparation scripts would be stored in inst/scripts/ (as in inst/scripts/Tito2024QMP.R).

Alternatively, if the data is too large for an R pkg we can store it externally in github.microbiome.io/data/ but then it cannot be directly used in examples. Then we would move the Quarto examples in that github.microbiome.io/data/ subfolder instead. If the package passes Bioc checks then I assume it is OK for the package but you can check what Bioc guidelines say about allowed data size.

  1. Quarto examples

I would put the lightweight Quarto files under vignettes/ - then we can have them visible on the project website under articles tab.

For heavier Quarto workflows it is necessary to see if they can be precomputed and shared this way, or if we need to host them otherwise. One option for this is to add them under github.microbiome.io/data/ in a readily calculated final form and link from a lightweight vignette.

  1. vegan optspace

vegan also has optspace functions that we contributed. Can we use those instead of replicating the same functions here?

  1. internal functions

I suggest to collect the internal functions to utils.R where we have also other internal functions; or alternatively they could be located in the end of the R file using them.

  1. references

We should cite the reference but this is not yet available, so let's keep in mind.

  1. Files

We might like to collect at least some of the jointRPCA and OptSpace functions into one or two R files.

So overall good, just some housekeeping for standard organization.

@TuomasBorman may have some more feedback.

Copy link
Member

@antagomir antagomir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here some essential formatting suggestions.

Copy link
Member

@antagomir antagomir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more of the readily existing machinery could be used, see suggestions.

Copy link
Contributor

@TuomasBorman TuomasBorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Looks very promising and useful!!

General points:

The structure of getJoinRPCA could be improved:

getJoinRPCA <- function(SingleCellExperiment, ...){
a. Create a list from input
b. Call list method
}

getJoinRPCA <- function(MultiAssayExperiment, ...){
a. Create a list from input
b. Call list method
}

getJoinRPCA <- function(list, ...){
a. Check that the input is a list of matrices
b. run jointrpca
}

Also, the structure would benefit if more internal functions would be used. Generic rule is that function should not exceed 50 lines. so they should be rather short. It is easier to maintain well-structured functions.

By using generic methods, improving structure and other ways, I believe these functions could be simplified. The function should be as simple as possible. It should do the minimum that is needed for the functionality. This means that data transformations, for instance, should be done before calling the function. This helps in maintenance and improves transparency.

This is PR is very large. It seems that there are also functions in utils.R that are related to vignettes? It would be good idea to split large PRs into multiple ones. For instance,

  1. data + basic functionality
  2. Vignettes

@antagomir
Copy link
Member

The method's original name is "Joint-RPCA", should one change the name getJoinRPCA -> getJointRPCA ?

@antagomir
Copy link
Member

@aituar17 can you confirm when you have added all critical points in this PR, and made notes for the less critical that can be added in a separate PR? We can then check and hopefully merge immediately

@TuomasBorman
Copy link
Contributor

  1. Keep only necessary features, functions and files (critical for merging)

Keep only the data and getJointRCPA() related functionality.

a. Add the data documentation to mia.R
b. Reduce the size of the data to <5MB (larger cannot be pushed to Bioconductor)
c. Rename the function to getJointRCPA()
d. Add getJointRCPA() related functionality to getJointRCPA.R
e. vignette-related stuff can be added later (or I think, e.g., https://github.com/microbiome/workflows could be more suitable)
f. Remove Rd files for internal functions (see comment) (this PR must have only changes in mia.R, getJointRCPA.R, data files (including the script), unit test files, (and utils.R if there is common functionality that is used in multiple R/*.R files)

  1. Polish the code (can be done later)

a. Simplify the code and logic
b. The code must be polished to match the style of the rest of the package

@antagomir
Copy link
Member

@aituar17 can you add the one remaining suggestion and confirm whether the feedback from above has been taken into account (or if not, provide a short justification)? Then we can merge.

@antagomir
Copy link
Member

@TuomasBorman I like to merge this one if critical things OK - More can be elaborated_in further PRs

Copy link
Contributor

@TuomasBorman TuomasBorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code is fine to merge, we can polish it later (it should not be too hard as the main functionality is very nicely done!). However, the code should be in correct place and no changes to other files.

These remaining tasks should not take too much time. Hopefully, @aituar17 can find time for this. Sorry for this delay and caused time challenges.

Remember to add your name to DESCRIPTION as contributor.

@antagomir
Copy link
Member

Checks still failing..?

@antagomir
Copy link
Member

At least this - did you run Bioconductor build/check before push?

  • checking for missing documentation entries ... WARNING
    Warning: Undocumented data sets:
    ‘mae2’ ‘se_mgx’ ‘se_mtx’
    All user-level objects in a package should have documentation entries.
    See chapter ‘Writing R documentation files’ in the ‘Writing R
    Extensions’ manual.

@antagomir
Copy link
Member

Hmm stll tfails on this..?

❯ checking for missing documentation entries ... WARNING
Undocumented data sets:
‘mae2’ ‘se_mgx’ ‘se_mtx’
All user-level objects in a package should have documentation entries.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.

-> Is the roxygen documentation (manpages) added?

@TuomasBorman
Copy link
Contributor

Hmm stll tfails on this..?

❯ checking for missing documentation entries ... WARNING Undocumented data sets: ‘mae2’ ‘se_mgx’ ‘se_mtx’ All user-level objects in a package should have documentation entries. See chapter ‘Writing R documentation files’ in the ‘Writing R Extensions’ manual.

-> Is the roxygen documentation (manpages) added?

Related to this, only MUltiAssayExperiment needs to be returned with data(name_of_dataset). The returned dataset must be in variable called "name_of_dataset"

@antagomir
Copy link
Member

Almost there.. just one failing test in Ubuntu any more..

── Building function reference ─────────────────────────────────────────────────
Error in build_reference_index():
! In pkgdown/_pkgdown.yml, 2 topics missing from index: "getJointRPCA"
and "ibdmdb_2omic_demo".
ℹ Either add to the reference index, or use @keywords internal to drop from
the index.
Backtrace:

  1. └─pkgdown::build_site_github_pages(new_process = FALSE, install = FALSE)
  2. └─pkgdown::build_site(...)
  3. └─pkgdown:::build_site_local(...)
    
  4.   └─pkgdown::build_reference(...)
    
  5.     └─pkgdown::build_reference_index(pkg)
    
  6.       ├─pkgdown::render_page(...)
    
  7.       │ └─pkgdown:::render_page_html(pkg, name = name, data = data, depth = depth)
    
  8.       │   └─pkgdown:::modify_list(data_template(pkg, depth = depth), data)
    
  9.       └─pkgdown:::data_reference_index(pkg)
    
  10.         └─pkgdown:::check_missing_topics(rows, pkg, error_call = error_call)
    
  11.           └─pkgdown:::config_abort(...)
    
  12.             └─cli::cli_abort(message, ..., call = call, .envir = .envir)
    
  13.               └─rlang::abort(...)
    

Execution halted
Error: Process completed with exit code 1.

Copy link
Contributor

@TuomasBorman TuomasBorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These things should be fixed, maybe not in this PR, but hopefully soon:

  1. The documentation and code style should be similar to rest of the package
  • E.g., all methods should have examples
  1. The name is currently get* even though the function is adding results. Thus it should be add*

  2. The methods should be using generic methods

  3. There are room for simplifying the code. There are overlapping checks and the code is unnecessary complex in some places. Also jointRPCAuniversal and getJointRPCA are duplicated; they do essentially the same thing which might be confusing.

The structure could be something like this:

addJointRPCA.SingleCellExperiment <- function(x, name, ...){
    res <- getJointPCA.SingleCellExperiment(x, ...)
    x <- .add_results_to_sce(x, res, name)
    return(x)
}

addJointRPCA.MultiAssayExperiment <- function(x, name, ...){
    res <- getJointPCA.MultiAssayExperiment(x, ...)
    x <- .add_results_to_mae(x, res, name)
    return(x)
}

getJointRPCA.SingleCellExperiment <- function(x, ...){
    # Extract data from SCE and do input checks
    dat <- .extract_data_from_sce(x)
    # Run jointPCA
    res <- .run_joint_rpca(dat, ...)
    return(res)
}

getJointRPCA.MultiAssayExperiment <- function(x, ...){
    # Extract data from MAE and do input checks
    dat <- .extract_data_from_mae(x)
    # Run jointPCA
    res <- .run_joint_rpca(dat, ...)
    return(res)
}

.run_joint_rpca <- function(tables, ...){
    # Method-specific data processing
    tables <- .rpca_table_processing(tables, ...)
    # Divide into train and test set
    tables <- .spit_into_test(tables, ...)
    # Stack tables. It is easier to work with single table than list.
    # Run RPCA, get "raw" results.
    res <- .calculate_rpca(stacked_matrix, ...)
    # Project test samples
    # Process results into polished format and calculate additional info
    return(res)
}

.spit_into_test <- function(tables, ...){
    if( training_samples_not_specified ){
        res <- .calculate_rpca(tables[[1L]], ...)
    }
    ...
    return(splitted_tables)
}

# It seems that joint-RPCA and RPCA are essentially the same? The joint-one is with stacked table. *If I read the code correctly*
.calculate_rpca <- function(matrix){
    lower_representation <- .get_lower_rank_representation()
    # Do PCA
}

.get_lower_rank_representation <- function(matrix){
    # Do optspace
    # Calculate errors
}
  1. We should think about the output type again, e.g., all other ordination methods add detailed results to reducedDim(tse) |> metadata()

  2. Should we have also "robust PCA" for single omics? Could be supported easily after the code is restructured.

Comment on lines +1 to +9
## Joint RPCA front-end helpers
##
## Exported:
## - jointRPCAuniversal()
## - getJointRPCA()
##

#' Run Joint-RPCA and store embedding in reducedDim
#' @name getJointRPCA
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check the documentation format of other functions. Use same style.

#' @param scale Logical; whether to scale the reconstructed matrix prior to
#' SVD/PCA steps. Defaults to \code{FALSE}.
#' @param ... Additional arguments passed to \code{jointRPCAuniversal()} and then
#' to the internal \code{.joint_rpca()} engine (e.g. \code{n.components},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Internal functions are not visible for user so no need to describe them. It is more confusing.

Comment on lines +50 to +57
experiments = NULL,
altexp = NULL,
name = "JointRPCA",
transform = c("rclr", "none"),
optspace.tol = 1e-5,
center = TRUE,
scale = FALSE,
...) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many arguments are missing input tests.

Comment on lines +128 to +132
# Store embedding in reducedDim only if supported (SCE / TreeSE / mia-specific)
cls <- class(x)
if (any(cls %in% c("SingleCellExperiment", "TreeSummarizedExperiment"))) {
reducedDim(x, name) <- emb
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is get* function. We have decided to have get* and add* functions get returns raw results, add adds them to the object.

Comment on lines +96 to +102
if (is.null(target_cols)) {
stop("Cannot store reducedDim: 'x' has no colnames().", call. = FALSE)
}

if (is.null(rownames(emb))) {
stop("Cannot store reducedDim: embedding has no rownames().", call. = FALSE)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These could be caught earlier. Also, the function could add sample names if needed in the beginning of the function if needed for internal use.

Comment on lines +365 to +383
# Determine train/test split
if (!is.null(sample.metadata) && !is.null(train.test.column)) {
md <- as.data.frame(sample.metadata)
md <- md[shared.all.samples, , drop = FALSE]
train.samples <- rownames(md)[md[[train.test.column]] == "train"]
test.samples <- rownames(md)[md[[train.test.column]] == "test"]
} else {
ord.tmp <- .optspace_helper(
rclr.table = t(rclr_tables[[1]]),
feature.ids = rownames(rclr_tables[[1]]),
sample.ids = colnames(rclr_tables[[1]]),
n.components = n.components,
max.iterations = max.iterations,
tol = optspace.tol,
center = center,
scale = scale
)$ord_res
sorted.ids <- rownames(ord.tmp$samples[order(ord.tmp$samples[, 1]), ])
idx <- round(seq(1, length(sorted.ids), length.out = n.test.samples))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be explained why this is done; it is not obvious from the code: to get as different/representative as possible samples for testing.

if (transform == "rclr") {
mat[!is.finite(mat)] <- 0
mat[mat < 0] <- 0
out <- vegan::decostand(mat, method = "rclr", MARGIN = 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data transformations should be applied outside of the function for transparency, especially as user do not have full control on which table the transformation is applied.


# Intersect views by name, preserve training order
views <- intersect(names(Vobj), names(tables))
if (!length(views)) stop("[.transform] No overlapping view names between ordination and new tables.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the training and test data sets are constructed from the same table, this should be never a problem?

Comment on lines +766 to +774
# Sample dedup (after combining views)
if (dedup.samples) {
sid <- sub("_\\d+$", "", rownames(Usum))
if (any(duplicated(sid))) {
Usum <- rowsum(Usum, group = sid, reorder = FALSE) / as.vector(table(sid))
} else {
rownames(Usum) <- sid
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can cause problematic results. Usually we cannot assume that "similarly"-named samples are actually duplicated

Comment on lines +937 to +940
#' @keywords internal
#' @noRd
.extract_mae_tables <- function(x, experiments = NULL) {
exps <- experiments(x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

User cannot decide tables, i.e., assays from experiments

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We created here helper function that could be re-used: https://github.com/bioFAM/MOFA2/pull/144/files

@TuomasBorman
Copy link
Contributor

The current GHA error is ok, it is related to deployment (forks do not have all the permissions which is correct behavior). The code runs correctly.

@antagomir
Copy link
Member

Ok - merging is not possible before the comments have been resolved. I wonder how we should do this..

@aituar17 - do you have a chance to:

  1. immediately handle any simple cases so we can just close them
  2. open issue/s so we can keep track on the remaining things to update after this PR is done

With this we can resolve the comments and hopefully merge really soon.

I am waiting this to be merged so I can demonstrate the performance through the pkg.

@TuomasBorman
Copy link
Contributor

We can push this to microbiome/mia to new branch (not devel) and work there. It might be easiest.

@antagomir
Copy link
Member

Yes, great. Who will take the necessary steps?

@TuomasBorman TuomasBorman changed the base branch from devel to rpca January 20, 2026 13:55
@TuomasBorman
Copy link
Contributor

Could @raivo-otus have a look? @aituar17 did huge work and the key functionality is already there; only finishing touch needed.

@TuomasBorman TuomasBorman merged commit d0b55c5 into microbiome:rpca Jan 20, 2026
2 of 3 checks passed
@raivo-otus
Copy link
Contributor

raivo-otus commented Jan 27, 2026

I'll get to work.

EDIT: .. this might take a while.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants