Add Joint-RPCA to mia: methods, tests, examples, vignette#789
Add Joint-RPCA to mia: methods, tests, examples, vignette#789TuomasBorman merged 17 commits intomicrobiome:rpcafrom
Conversation
|
Looks very clear. Some overall comments on the PR:
Is now in inst/examples subfolder and takes around 18Mb (maybe less with compression). We could include this as a readily prepared demo data set in data/ folder (like the other ones there). Then we can easily use it in examples and vignettes. Data preparation scripts would be stored in inst/scripts/ (as in inst/scripts/Tito2024QMP.R). Alternatively, if the data is too large for an R pkg we can store it externally in github.microbiome.io/data/ but then it cannot be directly used in examples. Then we would move the Quarto examples in that github.microbiome.io/data/ subfolder instead. If the package passes Bioc checks then I assume it is OK for the package but you can check what Bioc guidelines say about allowed data size.
I would put the lightweight Quarto files under vignettes/ - then we can have them visible on the project website under articles tab. For heavier Quarto workflows it is necessary to see if they can be precomputed and shared this way, or if we need to host them otherwise. One option for this is to add them under github.microbiome.io/data/ in a readily calculated final form and link from a lightweight vignette.
vegan also has optspace functions that we contributed. Can we use those instead of replicating the same functions here?
I suggest to collect the internal functions to utils.R where we have also other internal functions; or alternatively they could be located in the end of the R file using them.
We should cite the reference but this is not yet available, so let's keep in mind.
We might like to collect at least some of the jointRPCA and OptSpace functions into one or two R files. So overall good, just some housekeeping for standard organization. @TuomasBorman may have some more feedback. |
antagomir
left a comment
There was a problem hiding this comment.
Here some essential formatting suggestions.
antagomir
left a comment
There was a problem hiding this comment.
Some more of the readily existing machinery could be used, see suggestions.
TuomasBorman
left a comment
There was a problem hiding this comment.
Thank you! Looks very promising and useful!!
General points:
The structure of getJoinRPCA could be improved:
getJoinRPCA <- function(SingleCellExperiment, ...){
a. Create a list from input
b. Call list method
}
getJoinRPCA <- function(MultiAssayExperiment, ...){
a. Create a list from input
b. Call list method
}
getJoinRPCA <- function(list, ...){
a. Check that the input is a list of matrices
b. run jointrpca
}
Also, the structure would benefit if more internal functions would be used. Generic rule is that function should not exceed 50 lines. so they should be rather short. It is easier to maintain well-structured functions.
By using generic methods, improving structure and other ways, I believe these functions could be simplified. The function should be as simple as possible. It should do the minimum that is needed for the functionality. This means that data transformations, for instance, should be done before calling the function. This helps in maintenance and improves transparency.
This is PR is very large. It seems that there are also functions in utils.R that are related to vignettes? It would be good idea to split large PRs into multiple ones. For instance,
- data + basic functionality
- Vignettes
|
The method's original name is "Joint-RPCA", should one change the name getJoinRPCA -> getJointRPCA ? |
|
@aituar17 can you confirm when you have added all critical points in this PR, and made notes for the less critical that can be added in a separate PR? We can then check and hopefully merge immediately |
Keep only the data and a. Add the data documentation to
a. Simplify the code and logic |
|
@aituar17 can you add the one remaining suggestion and confirm whether the feedback from above has been taken into account (or if not, provide a short justification)? Then we can merge. |
|
@TuomasBorman I like to merge this one if critical things OK - More can be elaborated_in further PRs |
TuomasBorman
left a comment
There was a problem hiding this comment.
Code is fine to merge, we can polish it later (it should not be too hard as the main functionality is very nicely done!). However, the code should be in correct place and no changes to other files.
These remaining tasks should not take too much time. Hopefully, @aituar17 can find time for this. Sorry for this delay and caused time challenges.
Remember to add your name to DESCRIPTION as contributor.
…ION, and clean build artifacts
… internal Rd and vignettes
a9ced38 to
2209f3a
Compare
|
Checks still failing..? |
|
At least this - did you run Bioconductor build/check before push?
|
|
Hmm stll tfails on this..? ❯ checking for missing documentation entries ... WARNING -> Is the roxygen documentation (manpages) added? |
Related to this, only MUltiAssayExperiment needs to be returned with |
|
Almost there.. just one failing test in Ubuntu any more.. ── Building function reference ─────────────────────────────────────────────────
Execution halted |
TuomasBorman
left a comment
There was a problem hiding this comment.
These things should be fixed, maybe not in this PR, but hopefully soon:
- The documentation and code style should be similar to rest of the package
- E.g., all methods should have examples
-
The name is currently
get*even though the function is adding results. Thus it should beadd* -
The methods should be using generic methods
-
There are room for simplifying the code. There are overlapping checks and the code is unnecessary complex in some places. Also jointRPCAuniversal and getJointRPCA are duplicated; they do essentially the same thing which might be confusing.
The structure could be something like this:
addJointRPCA.SingleCellExperiment <- function(x, name, ...){
res <- getJointPCA.SingleCellExperiment(x, ...)
x <- .add_results_to_sce(x, res, name)
return(x)
}
addJointRPCA.MultiAssayExperiment <- function(x, name, ...){
res <- getJointPCA.MultiAssayExperiment(x, ...)
x <- .add_results_to_mae(x, res, name)
return(x)
}
getJointRPCA.SingleCellExperiment <- function(x, ...){
# Extract data from SCE and do input checks
dat <- .extract_data_from_sce(x)
# Run jointPCA
res <- .run_joint_rpca(dat, ...)
return(res)
}
getJointRPCA.MultiAssayExperiment <- function(x, ...){
# Extract data from MAE and do input checks
dat <- .extract_data_from_mae(x)
# Run jointPCA
res <- .run_joint_rpca(dat, ...)
return(res)
}
.run_joint_rpca <- function(tables, ...){
# Method-specific data processing
tables <- .rpca_table_processing(tables, ...)
# Divide into train and test set
tables <- .spit_into_test(tables, ...)
# Stack tables. It is easier to work with single table than list.
# Run RPCA, get "raw" results.
res <- .calculate_rpca(stacked_matrix, ...)
# Project test samples
# Process results into polished format and calculate additional info
return(res)
}
.spit_into_test <- function(tables, ...){
if( training_samples_not_specified ){
res <- .calculate_rpca(tables[[1L]], ...)
}
...
return(splitted_tables)
}
# It seems that joint-RPCA and RPCA are essentially the same? The joint-one is with stacked table. *If I read the code correctly*
.calculate_rpca <- function(matrix){
lower_representation <- .get_lower_rank_representation()
# Do PCA
}
.get_lower_rank_representation <- function(matrix){
# Do optspace
# Calculate errors
}
-
We should think about the output type again, e.g., all other ordination methods add detailed results to
reducedDim(tse) |> metadata() -
Should we have also "robust PCA" for single omics? Could be supported easily after the code is restructured.
| ## Joint RPCA front-end helpers | ||
| ## | ||
| ## Exported: | ||
| ## - jointRPCAuniversal() | ||
| ## - getJointRPCA() | ||
| ## | ||
|
|
||
| #' Run Joint-RPCA and store embedding in reducedDim | ||
| #' @name getJointRPCA |
There was a problem hiding this comment.
Check the documentation format of other functions. Use same style.
| #' @param scale Logical; whether to scale the reconstructed matrix prior to | ||
| #' SVD/PCA steps. Defaults to \code{FALSE}. | ||
| #' @param ... Additional arguments passed to \code{jointRPCAuniversal()} and then | ||
| #' to the internal \code{.joint_rpca()} engine (e.g. \code{n.components}, |
There was a problem hiding this comment.
Internal functions are not visible for user so no need to describe them. It is more confusing.
| experiments = NULL, | ||
| altexp = NULL, | ||
| name = "JointRPCA", | ||
| transform = c("rclr", "none"), | ||
| optspace.tol = 1e-5, | ||
| center = TRUE, | ||
| scale = FALSE, | ||
| ...) { |
There was a problem hiding this comment.
Many arguments are missing input tests.
| # Store embedding in reducedDim only if supported (SCE / TreeSE / mia-specific) | ||
| cls <- class(x) | ||
| if (any(cls %in% c("SingleCellExperiment", "TreeSummarizedExperiment"))) { | ||
| reducedDim(x, name) <- emb | ||
| } |
There was a problem hiding this comment.
This is get* function. We have decided to have get* and add* functions get returns raw results, add adds them to the object.
| if (is.null(target_cols)) { | ||
| stop("Cannot store reducedDim: 'x' has no colnames().", call. = FALSE) | ||
| } | ||
|
|
||
| if (is.null(rownames(emb))) { | ||
| stop("Cannot store reducedDim: embedding has no rownames().", call. = FALSE) | ||
| } |
There was a problem hiding this comment.
These could be caught earlier. Also, the function could add sample names if needed in the beginning of the function if needed for internal use.
| # Determine train/test split | ||
| if (!is.null(sample.metadata) && !is.null(train.test.column)) { | ||
| md <- as.data.frame(sample.metadata) | ||
| md <- md[shared.all.samples, , drop = FALSE] | ||
| train.samples <- rownames(md)[md[[train.test.column]] == "train"] | ||
| test.samples <- rownames(md)[md[[train.test.column]] == "test"] | ||
| } else { | ||
| ord.tmp <- .optspace_helper( | ||
| rclr.table = t(rclr_tables[[1]]), | ||
| feature.ids = rownames(rclr_tables[[1]]), | ||
| sample.ids = colnames(rclr_tables[[1]]), | ||
| n.components = n.components, | ||
| max.iterations = max.iterations, | ||
| tol = optspace.tol, | ||
| center = center, | ||
| scale = scale | ||
| )$ord_res | ||
| sorted.ids <- rownames(ord.tmp$samples[order(ord.tmp$samples[, 1]), ]) | ||
| idx <- round(seq(1, length(sorted.ids), length.out = n.test.samples)) |
There was a problem hiding this comment.
Could be explained why this is done; it is not obvious from the code: to get as different/representative as possible samples for testing.
| if (transform == "rclr") { | ||
| mat[!is.finite(mat)] <- 0 | ||
| mat[mat < 0] <- 0 | ||
| out <- vegan::decostand(mat, method = "rclr", MARGIN = 2) |
There was a problem hiding this comment.
Data transformations should be applied outside of the function for transparency, especially as user do not have full control on which table the transformation is applied.
|
|
||
| # Intersect views by name, preserve training order | ||
| views <- intersect(names(Vobj), names(tables)) | ||
| if (!length(views)) stop("[.transform] No overlapping view names between ordination and new tables.") |
There was a problem hiding this comment.
As the training and test data sets are constructed from the same table, this should be never a problem?
| # Sample dedup (after combining views) | ||
| if (dedup.samples) { | ||
| sid <- sub("_\\d+$", "", rownames(Usum)) | ||
| if (any(duplicated(sid))) { | ||
| Usum <- rowsum(Usum, group = sid, reorder = FALSE) / as.vector(table(sid)) | ||
| } else { | ||
| rownames(Usum) <- sid | ||
| } | ||
| } |
There was a problem hiding this comment.
This can cause problematic results. Usually we cannot assume that "similarly"-named samples are actually duplicated
| #' @keywords internal | ||
| #' @noRd | ||
| .extract_mae_tables <- function(x, experiments = NULL) { | ||
| exps <- experiments(x) |
There was a problem hiding this comment.
User cannot decide tables, i.e., assays from experiments
There was a problem hiding this comment.
We created here helper function that could be re-used: https://github.com/bioFAM/MOFA2/pull/144/files
|
The current GHA error is ok, it is related to deployment (forks do not have all the permissions which is correct behavior). The code runs correctly. |
|
Ok - merging is not possible before the comments have been resolved. I wonder how we should do this.. @aituar17 - do you have a chance to:
With this we can resolve the comments and hopefully merge really soon. I am waiting this to be merged so I can demonstrate the performance through the pkg. |
|
We can push this to microbiome/mia to new branch (not devel) and work there. It might be easiest. |
|
Yes, great. Who will take the necessary steps? |
|
Could @raivo-otus have a look? @aituar17 did huge work and the key functionality is already there; only finishing touch needed. |
|
I'll get to work. EDIT: .. this might take a while. |
Summary
This PR adds a complete Joint-RPCA (OptSpace) module for multi-omic integration in
mia.It introduces the full backend implementation, preprocessing utilities, and example workflows for replicating iHMP/IBDMDB analyses.
The update includes:
jointRPCA,jointRPCAuniversal, and supporting OptSpace solvers) for single- and multi-omic RPCA ordination.Together, these additions enable full Joint-RPCA analysis and validation within R, reproducing the same behavior as the original Python implementation on IBD multi-omic datasets.
What's included
Core Joint-RPCA Implementation
R/jointRPCA.R— Implements the core Joint Robust Principal Component Analysis (Joint-RPCA) using the OptSpace algorithm across one or more compositional tables. Handles preprocessing, filtering, shared sample alignment, and factorization for multi-omic inputs.R/jointRPCAuniversal.R— Provides a universal user-facing wrapper forjointRPCA(), automatically detecting input type (matrix,list,SummarizedExperiment, orMultiAssayExperiment) and dispatching accordingly.R/jointRPCAutils.R— Utility functions supporting the main pipeline, including data validation, helper wrappers, and structure checks used throughout the Joint-RPCA modules.OptSpace Optimization Backend
R/jointOptspaceHelper.R— Internal engine that orchestrates the Joint OptSpace factorization: splitting data into train/test subsets, calling the solver, constructing ordination results, and projecting held-out samples.R/jointOptspaceSolve.R— Low-level OptSpace solver implementation for joint factorization across multiple paired matrices, producing shared sample embeddings and per-view feature loadings.R/optspaceHelper.R— Helper for single-view (non-joint) OptSpace-based RPCA. Used for initialization and simple validation steps when only one table is provided.Preprocessing and Data Handling
R/rpcaTableProcessing.R— Handles filtering, normalization, and zero-count removal for compositional tables prior to RPCA. Ensures numeric stability and removes low-prevalence or low-abundance features/samples.R/maskValueOnly.R— Safely converts numeric inputs into masked matrices by replacing non-finite values withNAwhile preserving dimensionality and a logical mask for missing positions.R/transform.R— Implements.transform(), an internal projection function for aligning new compositional data (matrices or per-view lists) to an existing Joint-RPCA ordination space.R/transformHelper.R— Supports.transform()with.transform_helper(), handling both single-view and multi-view projections, optional sample deduplication, and alignment to training feature sets.Testing
tests/testthat/test-jointRPCA.R— Synthetic unit tests verifying output structure, behavior, and robustness ofjointRPCA()and.transform()across single- and multi-omic data.Covers shared-sample alignment, projection correctness, duplicate-ID errors, and consistency of component dimensions.
Reproducible Examples and Benchmarks
Located under
inst/examples/, these Quarto notebooks demonstrate reproducible analysis pipelines and validation of the Joint-RPCA implementation.inst/examples/joint_rpca_example.qmd— Minimal working example showing how to run Joint-RPCA on synthetic data and inspect ordination results.inst/examples/ihmp_ibd_replication.qmd— Replication of the iHMP IBD dataset single-omic (16S) analysis to validate consistency with published Python-based results.inst/examples/ibdmdb_2omic_jointrpca.qmd— Two-omic (MGX + MTX) Joint-RPCA integration on the IBDMDB dataset, demonstrating successful cross-omic ordination.inst/examples/ibdmdb_3omic_jointrpca.qmd— Three-omic (16S + MGX + MTX) integration example; demonstrates full pipeline behavior and visualization when shared samples are limited.inst/examples/ibdmdb_benchmarking.qmd— Performance benchmarking and variance-explained analysis comparing different omic combinations to assess numerical stability and efficiency of the R implementation.Example Data (IBDMDB)
Stored under
inst/examples/data_ibdmdb_raw/— small real-world data subset for demonstration and testing.taxonomic_profiles_mtx_new.tsv— Metatranscriptomic taxonomic profiles (subset of HMP2 IBDMDB).taxonomic_profiles_mgx_new.tsv,taxonomic_profiles_mgx.tsv— Metagenomic taxonomic profiles.taxonomic_profiles_16s_new.tsv,taxonomic_profiles_16s.tsv— 16S rRNA sequencing taxonomic profiles.hmp2_metadata_2018-08-20.csv— Subject-level and sample-level metadata, including diagnosis and visit information.Note: The metatranscriptomic functional table
ecs_relab.tsv(taxonomic profiles for MTX from HMP2) is not included due to size limits.To reproduce full analyses, download it manually from the official iHMP IBDMDB data portal and place it in the same folder (
inst/examples/data_ibdmdb_raw/).