lmutils.r

Installation

lmutils is not currently available on CRAN, but it can be installed on Linux with the following command. This also installs the Rust programming language, which lmutils requires.

curl https://raw.githubusercontent.com/GMELab/lmutils.r/refs/heads/master/install.sh | sh

Important Information

Terms

  • Matrix convertible object - a data frame, matrix, file name (to read from), a numeric column vector, or a Mat object.
  • List of matrix convertible objects - a list of matrix convertible objects, a character vector of file names (to read from), or a single matrix convertible object.
  • Standard output file - a character vector of file names matching the length of the inputs, or NULL to return the output. If a single input, not in a list, was provided, the output will not be in a list.
  • Join - an inner join keeps only the rows that match in both matrices, a left join keeps all rows of the left matrix, a right join keeps all rows of the right matrix, and a full join keeps all rows of both matrices.
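
As a rough illustration of these join types, the following base R sketch uses merge on two small data frames. The data and the eid key column are made up for the example, lmutils itself is not called, and its handling of unmatched rows may differ.

```r
# Hypothetical data: two tables sharing an "eid" key column.
left  <- data.frame(eid = c(1, 2, 3), x = c(10, 20, 30))
right <- data.frame(eid = c(2, 3, 4), y = c(200, 300, 400))

inner_rows <- merge(left, right, by = "eid")                # eids 2 and 3
left_rows  <- merge(left, right, by = "eid", all.x = TRUE)  # eids 1, 2, 3
right_rows <- merge(left, right, by = "eid", all.y = TRUE)  # eids 2, 3, 4
full_rows  <- merge(left, right, by = "eid", all = TRUE)    # eids 1, 2, 3, 4
```

Rows without a partner are filled with NA by merge. The join parameter descriptions further down state that lmutils raises an error for unmatched rows in left and right joins instead.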

File Types

All files can optionally be compressed with gzip. RData files are assumed to be compressed without looking for a .gz file extension (as is the standard in R).

  • .mat (recommended, custom binary format designed for matrices)
  • .csv (requires column headers)
  • .tsv (requires column headers)
  • .txt (requires column headers)
  • .json
  • .cbor
  • .rkyv
  • .rdata
  • .rds
  • .bed (requires a corresponding .bim and .fam file, lmutils does not support writing to .bed files)

Introduction

lmutils is an R package that provides utilities for working with matrices and data frames. It is built on top of the Rust programming language for performance and safety. The package provides a way to store matrices in memory and perform operations on them, as well as functions for working with data frames.

lmutils is built primarily around the Mat object. These are designed to be used to perform operations on matrices without loading them into memory until necessary. This can be useful for working with lots of large matrices, like hundreds of gene blocks.

To get started with your first Mat object, you can use the following code:

mat <- lmutils::Mat$new("matrix1.csv")

This will create a new Mat object from a file. You can then perform operations on this object, like combining it with other matrices, removing columns, or standardizing the columns. If you want this matrix to be loaded into R, you can use the r method:

mat$combine_columns("matrix2.csv")
mat$remove_columns(c(1, 2, 3))
mat$standardize_columns()
m <- mat$r()

You can also pass the object directly into functions that accept a matrix convertible object. It is then loaded automatically (with all the stored operations applied) only when needed.

lmutils::calculate_r2(
    mat,
    "outcomes1.RData"
)

Example

outcomes <- lmutils::Mat$new("outcomes.RData")
geneBlocks <- lapply(c(
    "geneBlock1.csv",
    "geneBlock2.csv",
    "geneBlock3.csv",
    "geneBlock4.csv",
    "geneBlock5.csv"
), function(mat) {
    mat <- lmutils::Mat$new(mat)
    mat$match_to_by_name(outcomes$col("eid"), "IID", 0)
    mat$remove_column("IID")
    mat$min_column_sum(2)
    mat$na_to_column_mean()
    mat$standardize_columns()
    mat
})
outcomes$remove_column("eid")
results <- lmutils::calculate_r2(geneBlocks, outcomes)

Mat Objects

lmutils::Mat objects are a way to store matrices in memory and perform operations on them. They can be used to store operations or chain operations together for later execution. This can be useful if, for example, you wish to load a hundred large matrices from files and standardize them all before using lmutils::calculate_r2. Using Mat objects, you can store the operations you wish to perform, and Mat will execute them only when the matrix is loaded.
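
The idea of storing operations and running them only on load can be pictured with a small base R sketch. This is a toy illustration under assumed names (lazy_mat, add_op, load are hypothetical), not how Mat is implemented.

```r
# A toy "lazy matrix": keeps a loader function and a queue of operations,
# and only runs them when load() is called. All names are hypothetical.
lazy_mat <- function(loader) {
  ops <- list()
  list(
    add_op = function(f) {
      ops[[length(ops) + 1]] <<- f  # queue the operation, do not run it yet
      invisible(NULL)
    },
    load = function() {
      m <- loader()                 # the data is only read here
      for (f in ops) m <- f(m)      # apply queued operations in order
      m
    }
  )
}

lazy <- lazy_mat(function() matrix(c(1, 2, 3, 4, 5, 6), nrow = 3))
lazy$add_op(function(m) m[, -1, drop = FALSE])  # drop the first column
lazy$add_op(function(m) scale(m))               # standardize what remains
m <- lazy$load()                                # operations run only now
```

Nothing touches the data until load() is called, which is what makes it cheap to prepare many such objects up front.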

Passing the same Mat object multiple times in a single function call may cause undefined behavior. For example, the following code may not work as expected:

mat <- lmutils::Mat$new("matrix1.csv")
lmutils::calculate_r2(list(mat, mat), mat)

lmutils::Mat$new

Creates a new Mat object.

  • data is a matrix convertible object.
mat <- lmutils::Mat$new("matrix1.csv")

lmutils::Mat$r

Loads the matrix from the Mat object.

m <- mat$r()

lmutils::Mat$col

Get a column by name or index.

col <- mat$col("eid")
col <- mat$col(1)

lmutils::Mat$colnames

Get the column names of the matrix or NULL if there are none.

colnames <- mat$colnames()

lmutils::Mat$save

Saves the matrix to a file.

  • file is the file name to write to.
mat$save("matrix1.mat.gz")

lmutils::Mat$combine_columns

Combines this matrix with other matrices by columns. (cbind)

  • data is a list of matrix convertible objects.
mat$combine_columns("matrix2.csv")

lmutils::Mat$combine_rows

Combines this matrix with other matrices by rows. (rbind)

  • data is a list of matrix convertible objects.
mat$combine_rows("matrix2.csv")

lmutils::Mat$remove_columns

Removes columns from the matrix.

  • columns is a vector of column indices (1-based) to remove.
mat$remove_columns(c(1, 2, 3))

lmutils::Mat$remove_column

Removes a column from the matrix by name.

  • column is the column name to remove.
mat$remove_column("eid")

lmutils::Mat$remove_column_if_exists

Removes a column from the matrix by name if it exists.

  • column is the column name to remove.
mat$remove_column_if_exists("eid")

lmutils::Mat$remove_rows

Removes rows from the matrix.

  • rows is a vector of row indices (1-based) to remove.
mat$remove_rows(c(1, 2, 3))

lmutils::Mat$transpose

Transposes the matrix.

mat$transpose()

lmutils::Mat$sort

Sort by the column at the given index.

  • by is the column index (1-based) to sort by.
mat$sort(1)

lmutils::Mat$sort_by_name

Sort by the column with the given name.

  • by is the column name to sort by.
mat$sort_by_name("eid")

lmutils::Mat$sort_by_order

Sort by the given order of rows.

  • order is a vector of row indices (1-based) to sort by.
mat$sort_by_order(c(3, 2, 1))

lmutils::Mat$dedup

Deduplicate the matrix by a column.

  • by is the column index (1-based) to deduplicate by.
mat$dedup(1)

lmutils::Mat$dedup_by_name

Deduplicate the matrix by a column name.

  • by is the column name to deduplicate by.
mat$dedup_by_name("eid")

lmutils::Mat$match_to

Match the rows of the matrix to the values in a vector by a column.

  • with is a numeric vector to match the rows to.
  • by is the column index (1-based) to match the rows by.
  • join is the type of join to perform. 0 is inner, 1 is left, 2 is right, and 3 is full. If a row is not matched for a left or right join, it will error.
mat$match_to(c(1, 2, 3), 1, 0)

lmutils::Mat$match_to_by_name

Match the rows of the matrix to the values in a vector by a column name.

  • with is a numeric vector to match the rows to.
  • by is the column name to match the rows by.
  • join is the type of join to perform. 0 is inner, 1 is left, 2 is right, and 3 is full. If a row is not matched for a left or right join, it will error.
mat$match_to_by_name(c(1, 2, 3), "eid", 0)

lmutils::Mat$join

Join the matrix with another matrix by a column.

  • other is a matrix convertible object.
  • by is the column index (1-based) to join by.
  • join is the type of join to perform. 0 is inner, 1 is left, 2 is right, and 3 is full. If a row is not matched for a left or right join, it will error.
mat$join("matrix2.csv", 1, 0)

lmutils::Mat$join_by_name

Join the matrix with another matrix by a column name.

  • other is a matrix convertible object.
  • by is the column name to join by.
  • join is the type of join to perform. 0 is inner, 1 is left, 2 is right, and 3 is full. If a row is not matched for a left or right join, it will error.
mat$join_by_name("matrix2.csv", "eid", 0)

lmutils::Mat$standardize_columns

Standardize the columns of the matrix to have a mean of 0 and a standard deviation of 1.

mat$standardize_columns()
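
For reference, column standardization can be written in base R as follows. This sketch assumes the sample standard deviation (n - 1 denominator), which is what base R's sd and scale use. Which denominator lmutils uses is not stated here.

```r
m <- matrix(c(1, 2, 3, 10, 20, 60), nrow = 3)

# Subtract each column's mean, then divide by each column's standard deviation.
centered <- sweep(m, 2, colMeans(m), "-")
standardized <- sweep(centered, 2, apply(m, 2, sd), "/")
```

The result matches scale(m) up to attributes.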

lmutils::Mat$standardize_rows

Standardize the rows of the matrix to have a mean of 0 and a standard deviation of 1.

mat$standardize_rows()

lmutils::Mat$remove_na_rows

Remove rows with any NA values.

mat$remove_na_rows()

lmutils::Mat$remove_na_columns

Remove columns with any NA values.

mat$remove_na_columns()

lmutils::Mat$na_to_value

Replace all NA values with a given value.

mat$na_to_value(0)

lmutils::Mat$na_to_column_mean

Replace all NA values with the mean of the column.

mat$na_to_column_mean()
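
A base R sketch of the same idea, assuming the column mean is computed over the non-NA entries only:

```r
m <- matrix(c(1, NA, 3, 4, 5, NA), nrow = 3)

col_means <- colMeans(m, na.rm = TRUE)   # means over the non-NA values
idx <- which(is.na(m), arr.ind = TRUE)   # (row, col) positions of the NAs
m[idx] <- col_means[idx[, "col"]]        # fill each NA with its column's mean
```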

lmutils::Mat$na_to_row_mean

Replace all NA values with the mean of the row.

mat$na_to_row_mean()

lmutils::Mat$min_column_sum

Remove columns with a sum less than a given value.

mat$min_column_sum(10)

lmutils::Mat$max_column_sum

Remove columns with a sum greater than a given value.

mat$max_column_sum(10)

lmutils::Mat$min_row_sum

Remove rows with a sum less than a given value.

mat$min_row_sum(10)

lmutils::Mat$max_row_sum

Remove rows with a sum greater than a given value.

mat$max_row_sum(10)

lmutils::Mat$rename_column

Rename a column by name.

mat$rename_column("IID", "eid")

lmutils::Mat$rename_column_if_exists

Rename a column by name if it exists.

mat$rename_column_if_exists("IID", "eid")

lmutils::Mat$remove_duplicate_columns

Remove columns that are duplicates of other columns. The first column is kept.

mat$remove_duplicate_columns()

lmutils::Mat$remove_identical_columns

Remove columns with all identical entries.

mat$remove_identical_columns()

lmutils::Mat$eigen

Compute the eigenvalues and eigenvectors of the matrix. The matrix must be square.

eigen <- mat$eigen()
# a vector of real or complex eigenvalues
eigen$values
# an n by n matrix of real or complex eigenvectors
eigen$vectors

lmutils::Mat$subset_columns

Subset the matrix to only include the given columns (1-based indices or names).

mat$subset_columns(c(1, 2, 3))

lmutils::Mat$rename_columns_with_regex

Rename columns with a regex and a replacement string.

mat$rename_columns_with_regex("[0-9]", "X")

lmutils::Mat$scale_columns

Scale the columns of a matrix by a given scalar or vector. The vector must be the same length as the number of columns in the matrix.

mat$scale_columns(2)
mat$scale_columns(c(1, 2, 3))

lmutils::Mat$scale_rows

Scale the rows of a matrix by a given scalar or vector. The vector must be the same length as the number of rows in the matrix.

mat$scale_rows(2)
mat$scale_rows(c(1, 2, 3))

Matrix Functions

lmutils::save

Saves a list of matrix convertible objects to files.

  • from is a list of matrix convertible objects.
  • to is a character vector of file names to write to.
lmutils::save(
    list("file1.csv", matrix(1:9, nrow=3), 1:3, data.frame(a=1:3, b=4:6)),
    c("file1.json", "file2.mat.gz", "file3.csv", "file4.rdata")
)

lmutils::save_dir

Recursively converts a directory of files to the selected file type.

  • from is a string directory name to read the files from.
  • to is a string directory name to write the files to or NULL to write to from.
  • file_type is a string file extension to write the files as.
lmutils::save_dir(
    "data",
    "converted_data", # or NULL
    "mat.gz"
)

lmutils::calculate_r2

Calculates the R^2 and adjusted R^2 values for blocks and outcomes.

  • data is a list of matrix convertible objects.
  • outcomes is a single matrix convertible object.

Returns a data frame with columns r2, adj_r2, data, outcome, n, m, predicted, and betas.

results <- lmutils::calculate_r2(
    c("block1.csv", "block2.mat.gz"),
    "outcomes1.RData"
)
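
To make the two statistics concrete, here is a base R sketch for one block and one outcome. It assumes an ordinary least squares fit with an intercept and the usual adjusted R^2 formula. The data are simulated, and whether lmutils uses exactly these conventions is an assumption.

```r
set.seed(1)
n <- 50
block <- matrix(rnorm(n * 3), nrow = n)   # hypothetical block with m = 3 columns
outcome <- drop(block %*% c(0.5, -0.2, 0)) + rnorm(n)

fit <- lm(outcome ~ block)
ss_res <- sum((outcome - fitted(fit))^2)      # residual sum of squares
ss_tot <- sum((outcome - mean(outcome))^2)    # total sum of squares

m <- ncol(block)
r2 <- 1 - ss_res / ss_tot
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - m - 1)
```

Both values agree with summary(fit)$r.squared and summary(fit)$adj.r.squared.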

lmutils::column_p_values

Compute the p value of a linear regression between each pair of columns in data and outcomes.

  • data is a list of matrix convertible objects.
  • outcomes is a single matrix convertible object.

The function returns a data frame with columns p_value, beta, intercept, data, data_column, and outcome.

results <- lmutils::column_p_values(
    c("block1.csv", "block2.mat.gz"),
    "outcomes1.RData"
)
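
The per-pair regression can be sketched in base R with lm. The column names (snp1, y1, and so on) and the simulated data are hypothetical, and the output only mimics some of the columns listed above.

```r
set.seed(2)
data <- matrix(rnorm(40), nrow = 20, dimnames = list(NULL, c("snp1", "snp2")))
outcomes <- matrix(rnorm(40), nrow = 20, dimnames = list(NULL, c("y1", "y2")))

# One simple regression per (data column, outcome column) pair.
pairs <- expand.grid(data_column = colnames(data), outcome = colnames(outcomes),
                     stringsAsFactors = FALSE)
results <- do.call(rbind, lapply(seq_len(nrow(pairs)), function(i) {
  x <- data[, pairs$data_column[i]]
  y <- outcomes[, pairs$outcome[i]]
  coefs <- summary(lm(y ~ x))$coefficients
  data.frame(p_value = coefs["x", "Pr(>|t|)"],
             beta = coefs["x", "Estimate"],
             intercept = coefs["(Intercept)", "Estimate"],
             data_column = pairs$data_column[i],
             outcome = pairs$outcome[i])
}))
```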

lmutils::linear_regression

Perform a linear regression between each data element and each outcome column.

  • data is a list of matrix convertible objects.
  • outcomes is a single matrix convertible object.

The function returns a list of data frames with columns slopes, intercept, r2, adj_r2, data, outcome, n, m, and predicted (if enabled).

results <- lmutils::linear_regression(
    c("block1.csv", "block2.mat.gz"),
    "outcomes1.RData"
)

lmutils::logistic_regression

Perform a logistic regression between each data element and each outcome column.

  • data is a list of matrix convertible objects.
  • outcomes is a single matrix convertible object.

The function returns a data frame with columns slopes, intercept, r2, adj_r2, data, outcome, n, m, predicted (if enabled), and coefs. Each model (row) holds a list in the coefs column that contains the coefficients of the model. Each item in that list is another list with fields label, coef, se, t, and p.

results <- lmutils::logistic_regression(
    c("block1.csv", "block2.mat.gz"),
    "outcomes1.RData"
)
coefs <- results$coefs[[1]] # results for block1.csv
coefs[[1]]$label # first column label
coefs[[1]]$coef # coefficient for the first column
coefs[[1]]$se # standard error for the first column
coefs[[1]]$t # t value for the first column
coefs[[1]]$p # p value for the first column
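
For comparison, a similar per-coefficient structure can be built from base R's glm. This is only a sketch on simulated data with hypothetical column names. glm reports a z statistic, which is stored under t here to mirror the field names above, and lmutils may compute its statistics differently.

```r
set.seed(3)
df <- data.frame(a = rnorm(100), b = rnorm(100))
df$y <- rbinom(100, 1, plogis(0.8 * df$a))

fit <- glm(y ~ a + b, data = df, family = binomial())
tab <- summary(fit)$coefficients

# One list per coefficient, mimicking the label/coef/se/t/p fields.
coefs <- lapply(rownames(tab), function(label) {
  list(label = label,
       coef = tab[label, "Estimate"],
       se = tab[label, "Std. Error"],
       t = tab[label, "z value"],
       p = tab[label, "Pr(>|z|)"])
})
```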

lmutils::logistic_regression_firth

Perform a logistic regression with Firth's penalization between each data element and each outcome column.

  • data is a list of matrix convertible objects.
  • outcomes is a single matrix convertible object.

The function returns a list of data frames with columns slopes, intercept, r2, adj_r2, r2_tjur, data, outcome, n, m, predicted (if enabled), and coefs (see the description under lmutils::logistic_regression).

results <- lmutils::logistic_regression_firth(
    c("block1.csv", "block2.mat.gz"),
    "outcomes1.RData"
)

lmutils::cv_elnet

Performs cross-validated elastic net regression between each data element and each outcome column.

  • data is a list of matrix convertible objects.
  • outcomes is a single matrix convertible object.
  • alpha is a numeric value between 0 and 1, where 0 is ridge regression, 1 is lasso regression, and values in between are elastic net.
  • nfolds is the number of folds to use for cross-validation.

The function returns a data frame with columns slopes, intercept, lambda, r2, mse, data, and outcome.

results <- lmutils::cv_elnet(
    c("block1.csv", "block2.mat.gz"),
    "outcomes1.RData",
    1, # alpha, 0 for ridge, 1 for lasso, in between for elastic net
    5  # number of folds
)

lmutils::cv_elnet_foldids

Performs cross-validated elastic net regression between each data element and each outcome column, using pre-defined fold IDs.

  • data is a list of matrix convertible objects.
  • outcomes is a single matrix convertible object.
  • alpha is a numeric value between 0 and 1, where 0 is ridge regression, 1 is lasso regression, and values in between are elastic net.
  • nfolds is the number of folds to use for cross-validation.
  • foldids is a numeric vector of fold IDs, where each element is which fold the corresponding row belongs to.
results <- lmutils::cv_elnet_foldids(
    c("block1.csv", "block2.mat.gz"),
    "outcomes1.RData",
    1, # alpha, 0 for ridge, 1 for lasso, in between for elastic net
    5, # number of folds
    c(5, 3, 1, 2, 4, 2, 1, 3, 4, 5) # fold IDs for each row
)
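
If you do not already have fold assignments, one common way to build a balanced random foldids vector in base R is shown below. The values of n and nfolds are placeholders, and n should equal the number of rows in your data.

```r
set.seed(42)   # make the assignment reproducible
n <- 10        # number of rows in the data (hypothetical)
nfolds <- 5

# Repeat 1..nfolds to length n, then shuffle so folds are assigned at random.
foldids <- sample(rep(seq_len(nfolds), length.out = n))
```

Reusing the same foldids across runs keeps the cross-validation splits comparable between models.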

lmutils::step_aic

Performs stepwise feature selection of a logistic regression model by AIC.

  • data is a matrix convertible object. It must have column names.
  • outcomes is a numeric vector of binary outcomes (0 or 1).
  • from is a string indicating the starting model. It can be "full" (all columns) or "null" (intercept only).
  • direction is a string indicating the direction of the stepwise selection. It can be "both", "backward", or "forward".

Returns a list object with fields slopes, intercept, r2, adj_r2, aic, and coefs.

results <- lmutils::step_aic(
    "matrix1.csv",
    c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1),
    "null", # "full" or "null"
    "both"  # "both", "backward", or "forward"
)
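
The same kind of selection can be sketched with base R's glm and step, starting from the intercept-only model and searching in both directions. The data and column names are simulated for illustration, and lmutils may differ in details such as tie handling.

```r
set.seed(4)
df <- data.frame(a = rnorm(200), b = rnorm(200), c = rnorm(200))
df$y <- rbinom(200, 1, plogis(1.5 * df$a))   # only "a" truly affects y

# Start from the null model, allow terms to be added or dropped by AIC.
null_model <- glm(y ~ 1, data = df, family = binomial())
selected <- step(null_model,
                 scope = list(lower = ~ 1, upper = ~ a + b + c),
                 direction = "both", trace = 0)
```

coef(selected) then lists the terms that survived selection.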

lmutils::ld_prune

Perform LD pruning on the provided bed file. The results are 100% compatible with PLINK 1.9.

  • bed is the path to the bed file.
  • window_size is the size of the window in base pairs.
  • step_size is the number of variants to step between windows.
  • threshold is the R^2 threshold above which variants will be pruned.

This function returns a list object with three fields:

  • pruned: the number of variants pruned.
  • prune_in: a vector of variant IDs that were kept.
  • prune_out: a vector of variant IDs that were pruned.
results <- lmutils::ld_prune(
    "genotypes.bed",
    50000, # window size in base pairs
    1000,  # step size in variants
    0.01   # R^2 threshold
)

lmutils::combine_vectors

Combine a list of double vectors into a single matrix using the vectors as columns.

  • data is a list of double vectors.
  • out is an output file name or NULL to return the matrix.
lmutils::combine_vectors(
    list(1:3, 4:6),
    "combined_matrix.csv"
)

lmutils::combine_rows

Combine a potentially nested list of rows (double vectors) into a matrix.

  • data is a list of double vectors.
  • out is an output file name or NULL to return the matrix.
lmutils::combine_rows(
    list(list(c(1, 2, 3)), c(4, 5, 6)),
    "combined_matrix.csv"
)

lmutils::remove_rows

Removes rows from a matrix.

  • data is a list of matrix convertible objects.
  • rows is a vector of row indices (1-based) to remove.
  • out is a standard output file.
lmutils::remove_rows(
    "matrix1.csv",
    c(1, 2, 3),
    "matrix1_removed_rows.csv"
)

lmutils::crossprod

Calculates the cross product of a matrix with itself. Equivalent to t(data) %*% data.

  • data is a list of matrix convertible objects.
  • out is a standard output file.
lmutils::crossprod(
    "matrix1.csv",
    "crossprod_matrix1.csv"
)

lmutils::mul

Multiplies two matrices. Equivalent to a %*% b.

  • a is a list of matrix convertible objects.
  • b is a list of matrix convertible objects.
  • out is a standard output file.
lmutils::mul(
    "matrix1.csv",
    "matrix2.mat.gz",
    "mul_matrix1_matrix2.csv"
)

lmutils::load

Loads a matrix convertible object into R.

  • obj is a list of matrix convertible objects. If a single object is provided, the function returns the matrix directly. Otherwise it returns a list of matrices.
lmutils::load("matrix1.csv")

lmutils::match_rows

Matches rows of a matrix by the values of a vector.

  • data is a list of matrix convertible objects.
  • with is a numeric vector.
  • by is the column name to match the rows by.
  • out is a standard output file.
lmutils::match_rows(
    "matrix1.csv",
    c(1, 2, 3),
    "eid",
    "matched_matrix1.csv"
)

lmutils::match_rows_dir

Matches rows of all matrices in a directory to the values in a vector by a column.

  • from is a string directory name to read the files from.
  • to is a string directory name to write the files to or NULL to write to from.
  • with is a numeric vector to match the rows to.
  • by is the column name to match the rows by.
lmutils::match_rows_dir(
    "matrices",
    "matched_matrices",
    c(1, 2, 3),
    "eid"
)

lmutils::dedup

Deduplicate a matrix by a column. The first occurrence of each value is kept.

  • data is a list of matrix convertible objects.
  • by is the column name to deduplicate by.
  • out is a standard output file.
lmutils::dedup(
    "matrix1.csv",
    "eid",
    "matrix1_dedup.csv"
)
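
The "first occurrence is kept" rule corresponds to the following base R sketch on a small made-up matrix:

```r
m <- cbind(eid = c(1, 2, 2, 3, 1), value = c(10, 20, 21, 30, 11))

# duplicated() flags every repeat after the first occurrence of each eid.
deduped <- m[!duplicated(m[, "eid"]), , drop = FALSE]
```

Here the rows with values 21 and 11 are dropped because their eid already appeared earlier.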

Data Frame Functions

lmutils::new_column_from_regex

Compute a new column for a data frame from a Rust-flavored regex and an existing column.

  • df is a data frame.
  • column is the column name to match.
  • regex is the regex to match. The first capture group is used.
  • new_column is the new column name.
lmutils::new_column_from_regex(
    data.frame(a=c("a1", "b2", "c3")),
    "a",
    "([a-z])",
    "b"
)
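
A comparable result can be obtained in base R with regexec and regmatches, as sketched below. Note that base R uses its own regex engines rather than Rust's regex syntax, so patterns are not always interchangeable, and filling non-matching rows with NA is an assumption of this sketch.

```r
df <- data.frame(a = c("a1", "b2", "c3"), stringsAsFactors = FALSE)

# Each match is a character vector: full match first, then capture groups.
matches <- regmatches(df$a, regexec("([a-z])", df$a))
df$b <- vapply(matches, function(m) {
  if (length(m) >= 2) m[2] else NA_character_  # first capture group, or NA
}, character(1))
```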

lmutils::map_from_pairs

Converts two character vectors into a named list, where the first vector is the names and the second vector is the values. Only the first occurrence of each name is used, essentially creating a map.

  • names is a character vector of names.
  • values is a character vector of values.
lmutils::map_from_pairs(
    c("a", "b", "c"),
    c("1", "2", "3")
)
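
The "first occurrence wins" behavior can be sketched in base R as follows. The helper name map_from_pairs_sketch is hypothetical and only mirrors the description above.

```r
map_from_pairs_sketch <- function(names, values) {
  keep <- !duplicated(names)                    # keep the first use of each name
  as.list(setNames(values[keep], names[keep]))  # named list: name -> value
}

# "a" appears twice, so only its first value ("1") is kept.
m <- map_from_pairs_sketch(c("a", "b", "a"), c("1", "2", "3"))
```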

lmutils::new_column_from_map

Compute a new column for a data frame from a list of values and an existing column, matching by the names of the values.

  • df is a data frame.
  • column is the column name to match.
  • values is a named list of values.
  • new_column is the new column name.
lmutils::new_column_from_map(
    data.frame(a=c("a", "b", "c")),
    "a",
    lmutils::map_from_pairs(
        c("a", "b", "c"),
        c("1", "2", "3")
    ),
    "b"
)

lmutils::new_column_from_map_pairs

Compute a new column for a data frame from two character vectors of names and values, matching by the names.

  • df is a data frame.
  • column is the column name to match.
  • names is a character vector of names.
  • values is a character vector of values.
  • new_column is the new column name.
lmutils::new_column_from_map_pairs(
    data.frame(a=c("a", "b", "c")),
    "a",
    c("a", "b", "c"),
    c("1", "2", "3"),
    "b"
)

lmutils::df_sort_asc

Sorts a data frame in place, in ascending order, by multiple columns. All columns must be numeric (double or integer), character, or logical vectors.

  • df is a data frame.
  • columns is a character vector of column names to sort by.
df <- data.frame(a=c(3, 3, 2, 2, 1, 1), b=c("b", "a", "b", "a", "b", "a"))
lmutils::df_sort_asc(
    df,
    c("a", "b")
)

lmutils::df_split

Splits a data frame into multiple data frames by a column. This function will mutably sort the data frame by the column before splitting.

  • df is a data frame.
  • by is the column name to split by.
df <- data.frame(a=c(1, 2, 3), b=c("a", "b", "c"))
lmutils::df_split(
    df,
    "b"
)
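
The sort-then-split behavior resembles the following base R sketch. Unlike lmutils, base R works on a copy rather than sorting df in place, and whether lmutils returns a named list is not specified here.

```r
df <- data.frame(a = c(1, 2, 3, 4), b = c("x", "y", "x", "y"),
                 stringsAsFactors = FALSE)

sorted <- df[order(df$b), ]         # sort by the split column first
parts <- split(sorted, sorted$b)    # one data frame per distinct value of b
```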

lmutils::df_combine

Combines a potentially nested list of data frames into a single data frame. The data frames must have the same columns.

  • data is a list of data frames.
lmutils::df_combine(
    list(data.frame(a=1:3), data.frame(a=4:6))
)

Other Functions

lmutils::compute_r2

Compute the R^2 value for given actual and predicted vectors.

lmutils::compute_r2(
    c(1, 2, 3),
    c(1, 2, 3)
)
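
One common definition of R^2 is one minus the ratio of the residual to the total sum of squares. The sketch below assumes that definition. The helper name r2_sketch is hypothetical, and lmutils may use a different but related formula.

```r
r2_sketch <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)     # residual sum of squares
  ss_tot <- sum((actual - mean(actual))^2)  # total sum of squares
  1 - ss_res / ss_tot
}

perfect <- r2_sketch(c(1, 2, 3), c(1, 2, 3))        # predictions match exactly
close   <- r2_sketch(c(1, 2, 3), c(1.1, 1.9, 3.2))  # small prediction errors
```

Under this definition, perfect predictions give 1 and the second call gives roughly 0.97.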

lmutils::compute_r2_tjur

Compute the Tjur R^2 value for given actual and predicted vectors.

lmutils::compute_r2_tjur(
    c(1, 0, 1),
    c(0.8, 0.2, 0.9)
)
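
Tjur's R^2 (the coefficient of discrimination) is the mean predicted probability among the actual 1s minus the mean predicted probability among the actual 0s. A minimal base R sketch, with the hypothetical helper name tjur_sketch:

```r
tjur_sketch <- function(actual, predicted) {
  mean(predicted[actual == 1]) - mean(predicted[actual == 0])
}

# Mean prediction for the 1s is 0.85, for the 0s it is 0.2.
value <- tjur_sketch(c(1, 0, 1), c(0.8, 0.2, 0.9))
```

For these inputs the difference is approximately 0.65.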

lmutils::mean

Computes the mean of a vector.

lmutils::mean(
    c(1, 2, 3)
)

lmutils::median

Computes the median of a vector.

lmutils::median(
    c(1, 2, 3)
)

lmutils::sd

Computes the standard deviation of a vector.

lmutils::sd(
    c(1, 2, 3)
)

lmutils::var

Computes the variance of a vector.

lmutils::var(
    c(1, 2, 3)
)

lmutils::num_cores

Returns the number of cores available on the system. This can be used to determine the number of cores to use for parallel operations.

lmutils::num_cores()

Configuration

lmutils exposes a number of global configuration options that can be set using environment variables or the lmutils package functions:

  • LMUTILS_LOG/lmutils::set_log_level to set the log level (default: info). Available log levels in order of increasing verbosity are off, error, warn, info, debug, and trace.
  • LMUTILS_CORE_PARALLELISM/lmutils::set_core_parallelism to set the core parallelism (default: 16). This is the number of primary operations to run in parallel.
  • LMUTILS_NUM_WORKER_THREADS/lmutils::set_num_worker_threads to set the number of worker threads to use (default: num_cpus::get() / 2). This is the number of threads to use for parallel operations. Once an operation has been run, this value cannot be changed.
  • LMUTILS_ENABLE_PREDICTED/lmutils::disable_predicted/lmutils::enable_predicted to enable or disable the calculation of the predicted values in lmutils::calculate_r2.
  • LMUTILS_IGNORE_CORE_PARALLEL_ERRORS/lmutils::ignore_core_parallel_errors/lmutils::dont_ignore_core_parallel_errors to ignore errors in core parallel operations. By default, if an error occurs in a core parallel operation, it is retried. If it exhausts its allowed number of retries, the error is logged and the next operation is attempted. If this option is disabled, Rust panics after the allowed number of retries and the operation fails.
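
As a sketch, the environment variables can be set from within R using Sys.setenv before lmutils is used. The values below are arbitrary examples. Passing numbers as strings, and lmutils picking the values up after they are set this way, are assumptions rather than documented behavior.

```r
# Set configuration through environment variables (example values only).
Sys.setenv(
  LMUTILS_LOG = "debug",              # one of the log levels listed above
  LMUTILS_CORE_PARALLELISM = "8",     # primary operations run in parallel
  LMUTILS_NUM_WORKER_THREADS = "4"    # must be set before any operation runs
)

Sys.getenv("LMUTILS_LOG")             # check what the current session sees
```

The variables can also be exported in the shell before starting R, which avoids any question about when the package reads them.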

About

Tools for blazingly fast statistical analysis in R
