Add Joint-RPCA to mia: methods, tests, examples, vignette by aituar17 · Pull Request #789 · microbiome/mia

aituar17 · 2025-11-07T16:17:51Z

Summary

This PR adds a complete Joint-RPCA (OptSpace) module for multi-omic integration in mia.
It introduces the full backend implementation, preprocessing utilities, and example workflows for replicating iHMP/IBDMDB analyses.

The update includes:

Core functions (jointRPCA, jointRPCAuniversal, and supporting OptSpace solvers) for single- and multi-omic RPCA ordination.
Comprehensive preprocessing, projection, and masking utilities ensuring stable compositional handling.
Unit tests validating structure, reproducibility, and projection consistency on synthetic data.
Quarto example notebooks demonstrating two- and three-omic IBDMDB analyses, benchmarking, and replication of published results.
Example IBDMDB data subsets for reproducible testing (with instructions for optional MTX data download).

Together, these additions enable full Joint-RPCA analysis and validation within R, reproducing the same behavior as the original Python implementation on IBD multi-omic datasets.

What's included

Core Joint-RPCA Implementation

R/jointRPCA.R — Implements the core Joint Robust Principal Component Analysis (Joint-RPCA) using the OptSpace algorithm across one or more compositional tables. Handles preprocessing, filtering, shared sample alignment, and factorization for multi-omic inputs.
R/jointRPCAuniversal.R — Provides a universal user-facing wrapper for jointRPCA(), automatically detecting input type (matrix, list, SummarizedExperiment, or MultiAssayExperiment) and dispatching accordingly.
R/jointRPCAutils.R — Utility functions supporting the main pipeline, including data validation, helper wrappers, and structure checks used throughout the Joint-RPCA modules.

OptSpace Optimization Backend

R/jointOptspaceHelper.R — Internal engine that orchestrates the Joint OptSpace factorization: splitting data into train/test subsets, calling the solver, constructing ordination results, and projecting held-out samples.
R/jointOptspaceSolve.R — Low-level OptSpace solver implementation for joint factorization across multiple paired matrices, producing shared sample embeddings and per-view feature loadings.
R/optspaceHelper.R — Helper for single-view (non-joint) OptSpace-based RPCA. Used for initialization and simple validation steps when only one table is provided.

Preprocessing and Data Handling

R/rpcaTableProcessing.R — Handles filtering, normalization, and zero-count removal for compositional tables prior to RPCA. Ensures numeric stability and removes low-prevalence or low-abundance features/samples.
R/maskValueOnly.R — Safely converts numeric inputs into masked matrices by replacing non-finite values with NA while preserving dimensionality and a logical mask for missing positions.
R/transform.R — Implements .transform(), an internal projection function for aligning new compositional data (matrices or per-view lists) to an existing Joint-RPCA ordination space.
R/transformHelper.R — Supports .transform() with .transform_helper(), handling both single-view and multi-view projections, optional sample deduplication, and alignment to training feature sets.
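As a rough illustration of the masking step described for R/maskValueOnly.R, here is a minimal base-R sketch. The function name `mask_non_finite` and the returned list layout are assumptions made for this example, not the PR's actual implementation.

```r
# Hypothetical helper: replace non-finite entries with NA and keep a mask.
mask_non_finite <- function(x) {
    x <- as.matrix(x)
    storage.mode(x) <- "double"
    # TRUE wherever the input was NA, NaN, Inf or -Inf
    missing <- !is.finite(x)
    x[missing] <- NA_real_
    # Dimensions are unchanged; the mask has the same shape as the values
    list(values = x, mask = missing)
}

m <- mask_non_finite(matrix(c(1, Inf, NaN, 4), nrow = 2))
```

Returning the mask next to the values lets later steps tell originally missing positions apart from observed ones without re-scanning the matrix.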

Testing

tests/testthat/test-jointRPCA.R — Synthetic unit tests verifying output structure, behavior, and robustness of jointRPCA() and .transform() across single- and multi-omic data.
Covers shared-sample alignment, projection correctness, duplicate-ID errors, and consistency of component dimensions.

Reproducible Examples and Benchmarks

Located under inst/examples/, these Quarto notebooks demonstrate reproducible analysis pipelines and validation of the Joint-RPCA implementation.

inst/examples/joint_rpca_example.qmd — Minimal working example showing how to run Joint-RPCA on synthetic data and inspect ordination results.
inst/examples/ihmp_ibd_replication.qmd — Replication of the iHMP IBD dataset single-omic (16S) analysis to validate consistency with published Python-based results.
inst/examples/ibdmdb_2omic_jointrpca.qmd — Two-omic (MGX + MTX) Joint-RPCA integration on the IBDMDB dataset, demonstrating successful cross-omic ordination.
inst/examples/ibdmdb_3omic_jointrpca.qmd — Three-omic (16S + MGX + MTX) integration example; demonstrates full pipeline behavior and visualization when shared samples are limited.
inst/examples/ibdmdb_benchmarking.qmd — Performance benchmarking and variance-explained analysis comparing different omic combinations to assess numerical stability and efficiency of the R implementation.

Example Data (IBDMDB)

Stored under inst/examples/data_ibdmdb_raw/ — small real-world data subset for demonstration and testing.

taxonomic_profiles_mtx_new.tsv — Metatranscriptomic taxonomic profiles (subset of HMP2 IBDMDB).
taxonomic_profiles_mgx_new.tsv, taxonomic_profiles_mgx.tsv — Metagenomic taxonomic profiles.
taxonomic_profiles_16s_new.tsv, taxonomic_profiles_16s.tsv — 16S rRNA sequencing taxonomic profiles.
hmp2_metadata_2018-08-20.csv — Subject-level and sample-level metadata, including diagnosis and visit information.

Note: The metatranscriptomic functional table ecs_relab.tsv (functional profiles for MTX from HMP2) is not included due to size limits.
To reproduce full analyses, download it manually from the official iHMP IBDMDB data portal and place it in the same folder (inst/examples/data_ibdmdb_raw/).

R/jointRPCA.R

R/jointRPCAutils.R

antagomir · 2025-11-10T11:35:17Z

Looks very clear. Some overall comments on the PR:

Example data

The example data is now in the inst/examples subfolder and takes around 18 MB (maybe less with compression).

We could include this as a readily prepared demo data set in the data/ folder (like the other ones there). Then we can easily use it in examples and vignettes. Data preparation scripts would be stored in inst/scripts/ (as in inst/scripts/Tito2024QMP.R).

Alternatively, if the data is too large for an R pkg, we can store it externally in github.microbiome.io/data/, but then it cannot be used directly in examples. In that case we would move the Quarto examples to that github.microbiome.io/data/ subfolder instead. If the package passes the Bioc checks, then I assume the size is OK for the package, but you can check what the Bioc guidelines say about allowed data size.

Quarto examples

I would put the lightweight Quarto files under vignettes/ - then we can have them visible on the project website under articles tab.

For heavier Quarto workflows it is necessary to see if they can be precomputed and shared this way, or if we need to host them otherwise. One option for this is to add them under github.microbiome.io/data/ in a readily calculated final form and link from a lightweight vignette.

vegan optspace

vegan also has optspace functions that we contributed. Can we use those instead of replicating the same functions here?

internal functions

I suggest collecting the internal functions in utils.R, where we also have other internal functions; alternatively, they could be placed at the end of the R file that uses them.

references

We should cite the reference, but it is not yet available, so let's keep this in mind.

Files

We might like to collect at least some of the jointRPCA and OptSpace functions into one or two R files.

So overall good, just some housekeeping for standard organization.

@TuomasBorman may have some more feedback.

antagomir

Here are some essential formatting suggestions.

R/jointRPCA.R

antagomir

Some more of the readily existing machinery could be used, see suggestions.

R/dataIBDMDBdemo.R

vignettes/ibdmdb_2omic_jointrpca.qmd

vignettes/ibdmdb_3omic_jointrpca.qmd

vignettes/ihmp_ibd_replication.qmd

TuomasBorman

Thank you! Looks very promising and useful!!

General points:

The structure of getJoinRPCA could be improved:

getJoinRPCA <- function(SingleCellExperiment, ...){
    # a. Create a list from input
    # b. Call list method
}

getJoinRPCA <- function(MultiAssayExperiment, ...){
    # a. Create a list from input
    # b. Call list method
}

getJoinRPCA <- function(list, ...){
    # a. Check that the input is a list of matrices
    # b. Run jointrpca
}
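The dispatch pattern outlined above could be sketched with S4 generics roughly as follows. This is a minimal base-R illustration only: the generic name `getJointRPCADemo` is hypothetical, a plain matrix stands in for the SingleCellExperiment/MultiAssayExperiment containers, and the list method returns view dimensions as a placeholder for the actual Joint-RPCA call.

```r
library(methods)

# Hypothetical generic; name and signature are illustrative only.
setGeneric("getJointRPCADemo",
           function(x, ...) standardGeneric("getJointRPCADemo"))

# List method: the only place where input checks and computation live.
setMethod("getJointRPCADemo", "list", function(x, ...) {
    ok <- vapply(x, function(m) is.matrix(m) && is.numeric(m), logical(1))
    if (length(x) == 0L || !all(ok)) {
        stop("'x' must be a non-empty list of numeric matrices.", call. = FALSE)
    }
    # Placeholder for the real computation: report the columns per view.
    vapply(x, ncol, integer(1))
})

# Container method (a matrix stands in for SCE/MAE here):
# a. create a list from the input, b. call the list method.
setMethod("getJointRPCADemo", "matrix", function(x, ...) {
    getJointRPCADemo(list(view1 = x), ...)
})

res <- getJointRPCADemo(matrix(1:6, nrow = 2))
```

With this layout each container method stays a few lines long, and all validation is done once in the list method.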

Also, the structure would benefit from using more internal functions. A general rule is that a function should not exceed 50 lines, so each one should be rather short. Well-structured functions are easier to maintain.

By using generic methods, improving the structure, and other means, I believe these functions could be simplified. The function should be as simple as possible and do the minimum needed for the functionality. This means that data transformations, for instance, should be done before calling the function. This helps maintenance and improves transparency.

This PR is very large. It seems that there are also functions in utils.R that are related to the vignettes? It would be a good idea to split large PRs into multiple ones. For instance,

data + basic functionality
Vignettes

DESCRIPTION

R/dataIBDMDBdemo.R

R/utils.R

vignettes/ibdmdb_benchmarking.qmd

antagomir · 2025-12-14T22:25:31Z

The method's original name is "Joint-RPCA", should one change the name getJoinRPCA -> getJointRPCA ?

antagomir · 2025-12-16T22:55:39Z

@aituar17 can you confirm when you have added all critical points in this PR, and made notes for the less critical that can be added in a separate PR? We can then check and hopefully merge immediately

TuomasBorman · 2025-12-17T09:20:39Z

Keep only necessary features, functions and files (critical for merging)

Keep only the data and getJointRPCA() related functionality.

a. Add the data documentation to mia.R
b. Reduce the size of the data to <5MB (larger cannot be pushed to Bioconductor)
c. Rename the function to getJointRPCA()
d. Add getJointRPCA() related functionality to getJointRPCA.R
e. Vignette-related stuff can be added later (or I think, e.g., https://github.com/microbiome/workflows could be more suitable)
f. Remove Rd files for internal functions (see comment). This PR must only have changes in mia.R, getJointRPCA.R, data files (including the script), unit test files, and utils.R if there is common functionality that is used in multiple R/*.R files.

Polish the code (can be done later)

a. Simplify the code and logic
b. The code must be polished to match the style of the rest of the package

R/dataIBDMDBdemo.R

antagomir · 2026-01-13T19:44:28Z

@aituar17 can you add the one remaining suggestion and confirm whether the feedback from above has been taken into account (or if not, provide a short justification)? Then we can merge.

antagomir · 2026-01-14T23:00:46Z

@TuomasBorman I'd like to merge this one if the critical things are OK - more can be elaborated in further PRs

tests/testthat/test-5prevalence.R

R/dataIBDMDBdemo.R

DESCRIPTION

R/utils.R

man/jointRPCAuniversal.Rd

tests/testthat/test-jointRPCA.R

tests/testthat/test-mediate.R

inst/extdata/split.csv

TuomasBorman

Code is fine to merge, we can polish it later (it should not be too hard as the main functionality is very nicely done!). However, the code should be in correct place and no changes to other files.

These remaining tasks should not take too much time. Hopefully, @aituar17 can find time for this. Sorry for this delay and caused time challenges.

Remember to add your name to DESCRIPTION as contributor.

antagomir · 2026-01-16T13:51:31Z

Checks still failing..?

antagomir · 2026-01-16T13:53:11Z

At least this - did you run Bioconductor build/check before push?

checking for missing documentation entries ... WARNING
Warning: Undocumented data sets:
‘mae2’ ‘se_mgx’ ‘se_mtx’
All user-level objects in a package should have documentation entries.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.

antagomir · 2026-01-17T16:22:12Z

Hmm, still fails on this..?

❯ checking for missing documentation entries ... WARNING
Undocumented data sets:
‘mae2’ ‘se_mgx’ ‘se_mtx’
All user-level objects in a package should have documentation entries.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.

-> Is the roxygen documentation (manpages) added?

TuomasBorman · 2026-01-18T08:46:45Z

Hmm, still fails on this..?

❯ checking for missing documentation entries ... WARNING Undocumented data sets: ‘mae2’ ‘se_mgx’ ‘se_mtx’ All user-level objects in a package should have documentation entries. See chapter ‘Writing R documentation files’ in the ‘Writing R Extensions’ manual.

-> Is the roxygen documentation (manpages) added?

Related to this, only the MultiAssayExperiment needs to be returned with data(name_of_dataset). The returned dataset must be in a variable called "name_of_dataset".
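A hedged sketch of what this could look like: one roxygen block documenting the data set, and one .rda file holding a single object whose name matches the data set name. The topic name `ibdmdb_2omic_demo` is taken from the pkgdown message later in this thread; the title text, the stand-in list object (instead of a real MultiAssayExperiment), and the temporary file path are assumptions for illustration.

```r
#' IBDMDB two-omic demo data (placeholder title)
#'
#' @name ibdmdb_2omic_demo
#' @docType data
#' @usage data(ibdmdb_2omic_demo)
#' @keywords datasets
NULL

# Stand-in object; the real data set would be a MultiAssayExperiment.
ibdmdb_2omic_demo <- list(mgx = matrix(0, 2, 2), mtx = matrix(0, 2, 2))

# Save exactly one object, named after the data set, with strong compression.
path <- file.path(tempdir(), "ibdmdb_2omic_demo.rda")
save(ibdmdb_2omic_demo, file = path, compress = "xz")

# Check which objects the file provides and how large it is on disk.
env <- new.env()
loaded <- load(path, envir = env)
size_mb <- file.size(path) / 1024^2
```

Loading into a fresh environment shows exactly which variable names the file would create, which is a quick way to confirm that no extra objects such as 'se_mgx' or 'se_mtx' are shipped alongside.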

antagomir · 2026-01-18T21:16:46Z

Almost there.. just one failing test in Ubuntu any more..

```
── Building function reference ─────────────────────────────────────────────────
Error in build_reference_index():
! In pkgdown/_pkgdown.yml, 2 topics missing from index: "getJointRPCA"
and "ibdmdb_2omic_demo".
ℹ Either add to the reference index, or use @keywords internal to drop from
the index.
Backtrace:
▆
└─pkgdown::build_site_github_pages(new_process = FALSE, install = FALSE)
  └─pkgdown::build_site(...)
    └─pkgdown:::build_site_local(...)
      └─pkgdown::build_reference(...)
        └─pkgdown::build_reference_index(pkg)
          ├─pkgdown::render_page(...)
          │ └─pkgdown:::render_page_html(pkg, name = name, data = data, depth = depth)
          │   └─pkgdown:::modify_list(data_template(pkg, depth = depth), data)
          └─pkgdown:::data_reference_index(pkg)
            └─pkgdown:::check_missing_topics(rows, pkg, error_call = error_call)
              └─pkgdown:::config_abort(...)
                └─cli::cli_abort(message, ..., call = call, .envir = .envir)
                  └─rlang::abort(...)
Execution halted
Error: Process completed with exit code 1.
```

TuomasBorman

These things should be fixed, maybe not in this PR, but hopefully soon:

The documentation and code style should be similar to the rest of the package

E.g., all methods should have examples

The name is currently get* even though the function adds results to the object. Thus it should be add*
The methods should use generic methods
There is room for simplifying the code. There are overlapping checks, and the code is unnecessarily complex in some places. Also, jointRPCAuniversal and getJointRPCA are duplicated; they do essentially the same thing, which might be confusing.

The structure could be something like this:

addJointRPCA.SingleCellExperiment <- function(x, name, ...){
    res <- getJointRPCA.SingleCellExperiment(x, ...)
    x <- .add_results_to_sce(x, res, name)
    return(x)
}

addJointRPCA.MultiAssayExperiment <- function(x, name, ...){
    res <- getJointRPCA.MultiAssayExperiment(x, ...)
    x <- .add_results_to_mae(x, res, name)
    return(x)
}

getJointRPCA.SingleCellExperiment <- function(x, ...){
    # Extract data from SCE and do input checks
    dat <- .extract_data_from_sce(x)
    # Run Joint-RPCA
    res <- .run_joint_rpca(dat, ...)
    return(res)
}

getJointRPCA.MultiAssayExperiment <- function(x, ...){
    # Extract data from MAE and do input checks
    dat <- .extract_data_from_mae(x)
    # Run Joint-RPCA
    res <- .run_joint_rpca(dat, ...)
    return(res)
}

.run_joint_rpca <- function(tables, ...){
    # Method-specific data processing
    tables <- .rpca_table_processing(tables, ...)
    # Divide into train and test set
    tables <- .split_into_test(tables, ...)
    # Stack tables. It is easier to work with a single table than a list.
    # (stacked_matrix below is a placeholder for the stacked table)
    # Run RPCA, get "raw" results.
    res <- .calculate_rpca(stacked_matrix, ...)
    # Project test samples
    # Process results into polished format and calculate additional info
    return(res)
}

.split_into_test <- function(tables, ...){
    if( training_samples_not_specified ){
        res <- .calculate_rpca(tables[[1L]], ...)
    }
    ...
    return(split_tables)
}

# It seems that joint-RPCA and RPCA are essentially the same? The joint-one is with stacked table. *If I read the code correctly*
.calculate_rpca <- function(matrix, ...){
    lower_representation <- .get_lower_rank_representation(matrix)
    # Do PCA
}

.get_lower_rank_representation <- function(matrix){
    # Do optspace
    # Calculate errors
}

We should think about the output type again, e.g., all other ordination methods add detailed results to reducedDim(tse) |> metadata()
Should we also have "robust PCA" for single omics? It could be supported easily after the code is restructured.
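To make the "stacked table" reading above concrete, here is a small base-R sketch. It is an illustration under stated assumptions, not the PR's algorithm: a truncated SVD is swapped in for the OptSpace low-rank step, the toy views and their names are invented, and samples are assumed to be in rows and already aligned across views.

```r
# Toy views with shared samples in rows (assumed layout for this sketch).
set.seed(1)
samples <- paste0("s", 1:6)
views <- list(
    mgx = matrix(rnorm(6 * 4), nrow = 6,
                 dimnames = list(samples, paste0("g", 1:4))),
    mtx = matrix(rnorm(6 * 3), nrow = 6,
                 dimnames = list(samples, paste0("t", 1:3)))
)

# Stack the views column-wise so that all features share the sample rows.
stacked <- do.call(cbind, views)

# Stand-in for the low-rank step: truncated SVD (NOT OptSpace).
k <- 2L
dec <- svd(scale(stacked, center = TRUE, scale = FALSE), nu = k, nv = k)

# Shared sample embedding.
sample_scores <- dec$u %*% diag(dec$d[seq_len(k)], nrow = k)
rownames(sample_scores) <- rownames(stacked)

# Split the feature loadings back into one block per view.
view_id <- rep(names(views), vapply(views, ncol, integer(1)))
idx <- split(seq_len(ncol(stacked)), view_id)
loadings <- lapply(idx, function(i) dec$v[i, , drop = FALSE])
```

If the joint method really reduces to a factorization of the stacked table, single-omic robust PCA would be the special case of a list with one view.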

TuomasBorman · 2026-01-18T08:52:47Z

R/getJointRPCA.R

+## Joint RPCA front-end helpers
+##
+## Exported:
+##   - jointRPCAuniversal()
+##   - getJointRPCA()
+##
+
+#' Run Joint-RPCA and store embedding in reducedDim
+#' @name getJointRPCA


Check the documentation format of other functions. Use same style.

TuomasBorman · 2026-01-18T08:53:30Z

R/getJointRPCA.R

+#' @param scale Logical; whether to scale the reconstructed matrix prior to
+#'   SVD/PCA steps. Defaults to \code{FALSE}.
+#' @param ... Additional arguments passed to \code{jointRPCAuniversal()} and then
+#'   to the internal \code{.joint_rpca()} engine (e.g. \code{n.components},


Internal functions are not visible to the user, so there is no need to describe them; doing so only adds confusion.

TuomasBorman · 2026-01-18T08:54:07Z

R/getJointRPCA.R

+                         experiments = NULL,
+                         altexp = NULL,
+                         name = "JointRPCA",
+                         transform = c("rclr", "none"),
+                         optspace.tol = 1e-5,
+                         center = TRUE,
+                         scale = FALSE,
+                         ...) {


Many arguments are missing input tests.
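A sketch of what such input tests might look like for the arguments visible in the snippet above, written in base R. The helper name `.check_joint_rpca_args` and the error messages are hypothetical; the package may already have its own checking utilities that should be preferred.

```r
# Hypothetical validator for a few of the arguments shown in the diff.
.check_joint_rpca_args <- function(name = "JointRPCA",
                                   transform = c("rclr", "none"),
                                   optspace.tol = 1e-5,
                                   center = TRUE,
                                   scale = FALSE) {
    is_flag <- function(v) is.logical(v) && length(v) == 1L && !is.na(v)
    # match.arg() errors on anything other than "rclr" or "none"
    transform <- match.arg(transform)
    if (!(is.character(name) && length(name) == 1L &&
          !is.na(name) && nzchar(name))) {
        stop("'name' must be a single non-empty character value.",
             call. = FALSE)
    }
    if (!(is.numeric(optspace.tol) && length(optspace.tol) == 1L &&
          is.finite(optspace.tol) && optspace.tol > 0)) {
        stop("'optspace.tol' must be a single positive number.",
             call. = FALSE)
    }
    if (!is_flag(center)) stop("'center' must be TRUE or FALSE.", call. = FALSE)
    if (!is_flag(scale)) stop("'scale' must be TRUE or FALSE.", call. = FALSE)
    list(name = name, transform = transform, optspace.tol = optspace.tol,
         center = center, scale = scale)
}

args <- .check_joint_rpca_args(transform = "none")
bad <- tryCatch(.check_joint_rpca_args(center = "yes"),
                error = conditionMessage)
```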

TuomasBorman · 2026-01-18T08:55:51Z

R/getJointRPCA.R

+    # Store embedding in reducedDim only if supported (SCE / TreeSE / mia-specific)
+    cls <- class(x)
+    if (any(cls %in% c("SingleCellExperiment", "TreeSummarizedExperiment"))) {
+        reducedDim(x, name) <- emb
+    }


This is a get* function. We have decided to have both get* and add* functions: get* returns the raw results, and add* adds them to the object.

TuomasBorman · 2026-01-18T08:57:48Z

R/getJointRPCA.R

+    if (is.null(target_cols)) {
+        stop("Cannot store reducedDim: 'x' has no colnames().", call. = FALSE)
+    }
+
+    if (is.null(rownames(emb))) {
+        stop("Cannot store reducedDim: embedding has no rownames().", call. = FALSE)
+    }


These could be caught earlier. Also, the function could add sample names at the beginning if they are needed for internal use.
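One way to do this up front, sketched in base R. The helper name `.ensure_sample_names` and the `"sample"` prefix are assumptions for illustration.

```r
# Hypothetical helper, called once at the start of the function:
# fill in missing sample names and fail early on duplicates.
.ensure_sample_names <- function(mat, prefix = "sample") {
    stopifnot(is.matrix(mat))
    if (is.null(colnames(mat))) {
        colnames(mat) <- paste0(prefix, seq_len(ncol(mat)))
    }
    if (anyDuplicated(colnames(mat)) > 0L) {
        stop("Sample names must be unique.", call. = FALSE)
    }
    mat
}

named <- .ensure_sample_names(matrix(1:6, nrow = 2))
```

Later steps can then rely on names being present, so the checks right before storing the embedding become unnecessary.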

TuomasBorman · 2026-01-19T09:13:27Z

R/getJointRPCA.R

+    # Determine train/test split
+    if (!is.null(sample.metadata) && !is.null(train.test.column)) {
+        md <- as.data.frame(sample.metadata)
+        md <- md[shared.all.samples, , drop = FALSE]
+        train.samples <- rownames(md)[md[[train.test.column]] == "train"]
+        test.samples <- rownames(md)[md[[train.test.column]] == "test"]
+    } else {
+        ord.tmp <- .optspace_helper(
+            rclr.table      = t(rclr_tables[[1]]),
+            feature.ids     = rownames(rclr_tables[[1]]),
+            sample.ids      = colnames(rclr_tables[[1]]),
+            n.components    = n.components,
+            max.iterations  = max.iterations,
+            tol             = optspace.tol,
+            center          = center,
+            scale           = scale
+        )$ord_res
+        sorted.ids <- rownames(ord.tmp$samples[order(ord.tmp$samples[, 1]), ])
+        idx <- round(seq(1, length(sorted.ids), length.out = n.test.samples))


It could be explained why this is done, as it is not obvious from the code: to get test samples that are as different/representative as possible.
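The selection logic in the snippet can be illustrated on toy data: samples are sorted along the first ordination axis, and test samples are then picked at evenly spaced positions so that they span the whole gradient. The scores below are invented, and the last two lines (how `idx` is used) are an assumption, since the diff is truncated at that point.

```r
# Invented first-axis scores for ten samples.
scores <- c(s1 = 0.9, s2 = -1.2, s3 = 0.1, s4 = 2.0, s5 = -0.4,
            s6 = 0.5, s7 = 1.3, s8 = -2.1, s9 = 0.0, s10 = -0.8)
n.test.samples <- 4

# Order the samples along the axis, then take evenly spaced positions.
sorted.ids <- names(sort(scores))
idx <- round(seq(1, length(sorted.ids), length.out = n.test.samples))

# Assumed continuation: indexed samples become the test set.
test.samples <- sorted.ids[idx]
train.samples <- setdiff(sorted.ids, test.samples)
```

Here the positions are 1, 4, 7 and 10, so the test set contains both extremes of the axis plus two intermediate samples.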

TuomasBorman · 2026-01-19T10:08:37Z

R/getJointRPCA.R

+        if (transform == "rclr") {
+            mat[!is.finite(mat)] <- 0
+            mat[mat < 0] <- 0
+            out <- vegan::decostand(mat, method = "rclr", MARGIN = 2)


Data transformations should be applied outside of the function for transparency, especially as users do not have full control over which table the transformation is applied to.
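A sketch of applying the transformation explicitly before the call. The function `rclr_by_sample` is a simplified base-R stand-in written for this example (features in rows, samples in columns, zeros left as NA); it is not claimed to match the output of `vegan::decostand(method = "rclr")` in every detail.

```r
# Simplified robust CLR: log of positive entries, centred per sample
# on the mean of that sample's observed (non-zero) log values.
rclr_by_sample <- function(counts) {
    stopifnot(is.matrix(counts), is.numeric(counts),
              all(counts >= 0, na.rm = TRUE))
    logx <- log(counts)
    logx[!is.finite(logx)] <- NA_real_   # zeros become missing
    sweep(logx, 2, colMeans(logx, na.rm = TRUE), "-")
}

# The user transforms the table they choose, then passes it on with
# the internal transformation switched off (transform = "none").
counts <- matrix(c(1, 0, 4,
                   2, 8, 0), nrow = 3)
transformed <- rclr_by_sample(counts)
```

Done this way, it is visible in the user's script which table was transformed and how.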

TuomasBorman · 2026-01-19T10:11:25Z

R/getJointRPCA.R

+
+    # Intersect views by name, preserve training order
+    views <- intersect(names(Vobj), names(tables))
+    if (!length(views)) stop("[.transform] No overlapping view names between ordination and new tables.")


As the training and test data sets are constructed from the same table, this should never be a problem?

TuomasBorman · 2026-01-19T10:16:24Z

R/getJointRPCA.R

+    # Sample dedup (after combining views)
+    if (dedup.samples) {
+        sid <- sub("_\\d+$", "", rownames(Usum))
+        if (any(duplicated(sid))) {
+            Usum <- rowsum(Usum, group = sid, reorder = FALSE) / as.vector(table(sid))
+        } else {
+            rownames(Usum) <- sid
+        }
+    }


This can cause problematic results. Usually we cannot assume that "similarly"-named samples are actually duplicated
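A small base-R demonstration of the concern, using invented sample IDs. It shows that stripping a numeric suffix merges IDs that may belong to distinct samples. It also shows a second, separate issue with the averaging as written in the snippet: `rowsum(..., reorder = FALSE)` keeps groups in order of first appearance while `table()` sorts them, so the divisor can be misaligned.

```r
# Invented IDs: subjB_1 and subjB_2 may well be different samples.
ids <- c("subjB_1", "subjB_2", "subjA_1")
sid <- sub("_\\d+$", "", ids)   # "subjB" "subjB" "subjA"

U <- matrix(1, nrow = 3, ncol = 1, dimnames = list(ids, "PC1"))
sums <- rowsum(U, group = sid, reorder = FALSE)  # rows: subjB, subjA

# Divisor as in the snippet: table() is sorted (subjA, subjB).
naive <- sums / as.vector(table(sid))

# Divisor aligned to the row order of 'sums'.
aligned <- sums / as.vector(table(sid)[rownames(sums)])
```

Every input row equals 1, so each group mean should be 1; only the aligned version gives that result.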

TuomasBorman · 2026-01-20T07:30:47Z

R/getJointRPCA.R

+#' @keywords internal
+#' @noRd
+.extract_mae_tables <- function(x, experiments = NULL) {
+    exps <- experiments(x)


The user cannot choose the tables, i.e., the assays from the experiments.

We created here helper function that could be re-used: https://github.com/bioFAM/MOFA2/pull/144/files

TuomasBorman · 2026-01-20T07:43:14Z

The current GHA error is ok, it is related to deployment (forks do not have all the permissions which is correct behavior). The code runs correctly.

antagomir · 2026-01-20T08:49:32Z

Ok - merging is not possible before the comments have been resolved. I wonder how we should do this..

@aituar17 - do you have a chance to:

immediately handle any simple cases so we can just close them
open issue/s so we can keep track of the remaining things to update after this PR is done

With this we can resolve the comments and hopefully merge really soon.

I am waiting for this to be merged so I can demonstrate the performance through the pkg.

TuomasBorman · 2026-01-20T08:51:54Z

We can push this to microbiome/mia to new branch (not devel) and work there. It might be easiest.

antagomir · 2026-01-20T08:57:25Z

Yes, great. Who will take the necessary steps?

TuomasBorman · 2026-01-20T13:59:03Z

Could @raivo-otus have a look? @aituar17 did a huge amount of work and the key functionality is already there; only the finishing touches are needed.

raivo-otus · 2026-01-27T08:28:16Z

I'll get to work.

EDIT: .. this might take a while.

antagomir requested changes Nov 10, 2025

antagomir requested changes Nov 19, 2025

antagomir requested changes Nov 19, 2025

TuomasBorman force-pushed the devel branch from c57393f to 26eec84 on December 8, 2025 18:27

TuomasBorman requested changes Dec 10, 2025

antagomir approved these changes Jan 13, 2026
