- Installation
- Important Information
- Introduction
- Mat Objects
  - lmutils::Mat$new
  - lmutils::Mat$r
  - lmutils::Mat$col
  - lmutils::Mat$colnames
  - lmutils::Mat$save
  - lmutils::Mat$combine_columns
  - lmutils::Mat$combine_rows
  - lmutils::Mat$remove_columns
  - lmutils::Mat$remove_column
  - lmutils::Mat$remove_column_if_exists
  - lmutils::Mat$remove_rows
  - lmutils::Mat$transpose
  - lmutils::Mat$sort
  - lmutils::Mat$sort_by_name
  - lmutils::Mat$sort_by_order
  - lmutils::Mat$dedup
  - lmutils::Mat$dedup_by_name
  - lmutils::Mat$match_to
  - lmutils::Mat$match_to_by_name
  - lmutils::Mat$join
  - lmutils::Mat$join_by_name
  - lmutils::Mat$standardize_columns
  - lmutils::Mat$standardize_rows
  - lmutils::Mat$remove_na_rows
  - lmutils::Mat$remove_na_columns
  - lmutils::Mat$na_to_value
  - lmutils::Mat$na_to_column_mean
  - lmutils::Mat$na_to_row_mean
  - lmutils::Mat$min_column_sum
  - lmutils::Mat$max_column_sum
  - lmutils::Mat$min_row_sum
  - lmutils::Mat$max_row_sum
  - lmutils::Mat$rename_column
  - lmutils::Mat$rename_column_if_exists
  - lmutils::Mat$remove_duplicate_columns
  - lmutils::Mat$remove_identical_columns
  - lmutils::Mat$eigen
  - lmutils::Mat$subset_columns
  - lmutils::Mat$rename_columns_with_regex
  - lmutils::Mat$scale_columns
- Matrix Functions
  - lmutils::save
  - lmutils::save_dir
  - lmutils::calculate_r2
  - lmutils::column_p_values
  - lmutils::linear_regression
  - lmutils::logistic_regression
  - lmutils::logistic_regression_firth
  - lmutils::cv_elnet
  - lmutils::cv_elnet_foldids
  - lmutils::step_aic
  - lmutils::ld_prune
  - lmutils::combine_vectors
  - lmutils::combine_rows
  - lmutils::remove_rows
  - lmutils::crossprod
  - lmutils::mul
  - lmutils::load
  - lmutils::match_rows
  - lmutils::match_rows_dir
  - lmutils::dedup
- Data Frame Functions
- Other Functions
- Configuration
lmutils is not currently available on CRAN, but it can be installed on Linux with the following command. This will also install the Rust programming language which is required for lmutils.
curl https://raw.githubusercontent.com/GMELab/lmutils.r/refs/heads/master/install.sh | sh

- Matrix convertible object - a data frame, matrix, file name (to read from), a numeric column vector, or a `Mat` object.
- List of matrix convertible objects - a list of matrix convertible objects, a character vector of file names (to read from), or a single matrix convertible object.
- Standard output file - a character vector of file names matching the length of the inputs, or `NULL` to return the output. If a single input, not in a list, was provided, the output will not be in a list.
- Join - an inner join means only rows that match in both matrices are kept, a left join means all rows in the left matrix are kept, a right join means all rows in the right matrix are kept.
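The three join types defined above can be illustrated with base R's `merge`. This is only a sketch of the terminology on two made-up data frames (`left` and `right` are hypothetical names) and does not call lmutils. Note one difference: `merge` fills unmatched rows with `NA`, whereas the `join` argument of the `Mat` methods is documented to error when a row is not matched in a left or right join.

```r
# Base R illustration of join terminology only; lmutils is not used here.
left  <- data.frame(eid = c(1, 2, 3), x = c(10, 20, 30))
right <- data.frame(eid = c(2, 3, 4), y = c(200, 300, 400))

inner_join <- merge(left, right, by = "eid")                # rows present in both
left_join  <- merge(left, right, by = "eid", all.x = TRUE)  # every row of `left`
right_join <- merge(left, right, by = "eid", all.y = TRUE)  # every row of `right`

inner_join$eid  # 2 3
left_join$eid   # 1 2 3
right_join$eid  # 2 3 4
```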
All files can optionally be compressed with gzip. `.rdata` files are assumed to be compressed without looking for a `.gz` file extension (as is the standard in R).

- `.mat` (recommended, custom binary format designed for matrices)
- `.csv` (requires column headers)
- `.tsv` (requires column headers)
- `.txt` (requires column headers)
- `.json`
- `.cbor`
- `.rkyv`
- `.rdata`
- `.rds`
- `.bed` (requires a corresponding `.bim` and `.fam` file; `lmutils` does not support writing to `.bed` files)
lmutils is an R package that provides utilities for working with matrices and data frames. It is built on top of the Rust programming language for performance and safety. The package provides a way to store matrices in memory and perform operations on them, as well as functions for working with data frames.
lmutils is built primarily around the Mat object. These are designed to be used to perform operations on matrices without loading them into memory until necessary. This can be useful for working with lots of large matrices, like hundreds of gene blocks.
To get started with your first `Mat` object, you can use the following code:

mat <- lmutils::Mat$new("matrix1.csv")

This will create a new `Mat` object from a file. You can then perform operations on this object, like combining it with other matrices, removing columns, or standardizing the columns. If you want this matrix to be loaded into R, you can use the `r` method:

mat$combine_columns("matrix2.csv")
mat$remove_columns(c(1, 2, 3))
mat$standardize_columns()
m <- mat$r()

You can also pass the object directly into functions that accept a matrix convertible object. It will then be loaded automatically (with all the stored operations applied) only when needed.

lmutils::calculate_r2(
  mat,
  "outcomes1.RData"
)

outcomes <- lmutils::Mat$new("outcomes.RData")
geneBlocks <- lapply(c(
  "geneBlock1.csv",
  "geneBlock2.csv",
  "geneBlock3.csv",
  "geneBlock4.csv",
  "geneBlock5.csv"
), function(mat) {
  # Queue the operations for each block; they run when the matrix is loaded.
  mat <- lmutils::Mat$new(mat)
  mat$match_to_by_name(outcomes$col("eid"), "IID", 0)
  mat$remove_column("IID")
  mat$min_column_sum(2)
  mat$na_to_column_mean()
  mat$standardize_columns()
  mat
})
outcomes$remove_column("eid")
results <- lmutils::calculate_r2(geneBlocks, outcomes)

`lmutils::Mat` objects are a way to store matrices in memory and perform operations on them. They can be used to store operations or chain operations together for later execution. This can be useful if, for example, you wish to load a hundred large matrices from files and standardize them all before using `lmutils::calculate_r2`. Using `Mat` objects, you can store the operations you wish to perform and `Mat` will execute them only when the matrix is loaded.
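The idea of storing operations and running them only on load can be sketched in plain R with closures. This is a conceptual illustration, not how lmutils is implemented: `lazy_matrix`, `add`, and the `loads` counter are hypothetical names invented for the example.

```r
# Hypothetical helper, not part of lmutils: a tiny deferred-operation wrapper.
lazy_matrix <- function(loader) {
  ops <- list()
  list(
    # Queue an operation; nothing is computed yet.
    add = function(op) {
      ops[[length(ops) + 1]] <<- op
      invisible(NULL)
    },
    # Load the data, then apply every queued operation in order.
    r = function() {
      Reduce(function(m, op) op(m), ops, loader())
    }
  )
}

loads <- 0
lazy <- lazy_matrix(function() {
  loads <<- loads + 1  # count how often the "file" is actually read
  matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)
})
lazy$add(function(m) m[, -1, drop = FALSE])  # drop the first column
lazy$add(function(m) m * 2)                  # scale what is left

loads_before <- loads  # still 0: queuing did not trigger a load
result <- lazy$r()     # the load and both operations happen here
```

Queuing is cheap, so many such objects can be prepared up front and only materialized one at a time when a function needs the data.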
Passing the same `Mat` object multiple times in a single function call may cause undefined behavior. For example, the following code may not work as expected:

mat <- lmutils::Mat$new("matrix1.csv")
lmutils::calculate_r2(list(mat, mat), mat)

Creates a new `Mat` object.

- `data` is a matrix convertible object.

mat <- lmutils::Mat$new("matrix1.csv")

Loads the matrix from the `Mat` object.

m <- mat$r()

Get a column by name or index.

col <- mat$col("eid")
col <- mat$col(1)

Get the column names of the matrix or `NULL` if there are none.

colnames <- mat$colnames()

Saves the matrix to a file.

- `file` is the file name to write to.

mat$save("matrix1.mat.gz")

Combines this matrix with other matrices by columns. (cbind)
- `data` is a list of matrix convertible objects.

mat$combine_columns("matrix2.csv")

Combines this matrix with other matrices by rows. (rbind)

- `data` is a list of matrix convertible objects.

mat$combine_rows("matrix2.csv")

Removes columns from the matrix.

- `columns` is a vector of column indices (1-based) to remove.

mat$remove_columns(c(1, 2, 3))

Removes a column from the matrix by name.

- `column` is the column name to remove.

mat$remove_column("eid")

Removes a column from the matrix by name if it exists.

- `column` is the column name to remove.

mat$remove_column_if_exists("eid")

Removes rows from the matrix.

- `rows` is a vector of row indices (1-based) to remove.

mat$remove_rows(c(1, 2, 3))

Transposes the matrix.
mat$transpose()

Sort by the column at the given index.

- `by` is the column index (1-based) to sort by.

mat$sort(1)

Sort by the column with the given name.

- `by` is the column name to sort by.

mat$sort_by_name("eid")

Sort by the given order of rows.

- `order` is a vector of row indices (1-based) to sort by.

mat$sort_by_order(c(3, 2, 1))

Deduplicate the matrix by a column.

- `by` is the column index (1-based) to deduplicate by.

mat$dedup(1)

Deduplicate the matrix by a column name.

- `by` is the column name to deduplicate by.

mat$dedup_by_name("eid")

Match the rows of the matrix to the values in a vector by a column.
- `with` is a numeric vector to match the rows to.
- `by` is the column index (1-based) to match the rows by.
- `join` is the type of join to perform. 0 is inner, 1 is left, and 2 is right. If a row is not matched for a left or right join, it will error.

mat$match_to(c(1, 2, 3), 1, 0)

Match the rows of the matrix to the values in a vector by a column name.

- `with` is a numeric vector to match the rows to.
- `by` is the column name to match the rows by.
- `join` is the type of join to perform. 0 is inner, 1 is left, and 2 is right. If a row is not matched for a left or right join, it will error.

mat$match_to_by_name(c(1, 2, 3), "eid", 0)

Join the matrix with another matrix by a column.

- `other` is a matrix convertible object.
- `by` is the column index (1-based) to join by.
- `join` is the type of join to perform. 0 is inner, 1 is left, and 2 is right. If a row is not matched for a left or right join, it will error.

mat$join("matrix2.csv", 1, 0)

Join the matrix with another matrix by a column name.

- `other` is a matrix convertible object.
- `by` is the column name to join by.
- `join` is the type of join to perform. 0 is inner, 1 is left, and 2 is right. If a row is not matched for a left or right join, it will error.

mat$join_by_name("matrix2.csv", "eid", 0)

Standardize the columns of the matrix to have a mean of 0 and a standard deviation of 1.
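For intuition, column standardization can be reproduced on a small in-memory matrix with base R's `scale`. This sketch assumes the sample standard deviation (dividing by n - 1), which is what `scale` uses; the text above does not say which convention lmutils follows, so treat the exact values as illustrative.

```r
# Base R illustration; lmutils is not called here.
m <- cbind(a = c(1, 2, 3), b = c(10, 20, 60))

# scale() centers each column and divides by its sample standard deviation.
standardized <- scale(m)

col_means <- colMeans(standardized)      # both approximately 0
col_sds   <- apply(standardized, 2, sd)  # both approximately 1
```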
mat$standardize_columns()

Standardize the rows of the matrix to have a mean of 0 and a standard deviation of 1.

mat$standardize_rows()

Remove rows with any NA values.

mat$remove_na_rows()

Remove columns with any NA values.

mat$remove_na_columns()

Replace all NA values with a given value.

mat$na_to_value(0)

Replace all NA values with the mean of the column.
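The column-mean replacement described above corresponds to the following base R steps on an ordinary matrix. It is a sketch for intuition that assumes the mean is taken over the non-missing entries of each column; lmutils itself is not called.

```r
# Base R illustration of NA -> column mean imputation.
m <- cbind(a = c(1, NA, 3), b = c(NA, 4, 8))

col_means <- colMeans(m, na.rm = TRUE)      # a = 2, b = 6
missing <- which(is.na(m), arr.ind = TRUE)  # (row, col) index of every NA
m[missing] <- col_means[missing[, "col"]]   # fill each NA from its column
```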
mat$na_to_column_mean()

Replace all NA values with the mean of the row.

mat$na_to_row_mean()

Remove columns with a sum less than a given value.

mat$min_column_sum(10)

Remove columns with a sum greater than a given value.

mat$max_column_sum(10)

Remove rows with a sum less than a given value.

mat$min_row_sum(10)

Remove rows with a sum greater than a given value.

mat$max_row_sum(10)

Rename a column by name.

mat$rename_column("IID", "eid")

Rename a column by name if it exists.

mat$rename_column_if_exists("IID", "eid")

Remove columns that are duplicates of other columns. The first column is kept.

mat$remove_duplicate_columns()

Remove columns with all identical entries.

mat$remove_identical_columns()

Compute the eigenvalues and eigenvectors of the matrix. The matrix must be square.
eigen <- mat$eigen()
# a vector of real or complex eigenvalues
eigen$values
# an n by n matrix of real or complex eigenvectors
eigen$vectors

Subset the matrix to only include the given columns (1-based indices or names).

mat$subset_columns(c(1, 2, 3))

Rename columns with a regex and a replacement string.

mat$rename_columns_with_regex("[0-9]", "X")

Scale the columns of a matrix by a given scalar or vector. The vector must be the same length as the number of columns in the matrix.

mat$scale_columns(2)
mat$scale_columns(c(1, 2, 3))

Scale the rows of a matrix by a given scalar or vector. The vector must be the same length as the number of rows in the matrix.

mat$scale_rows(2)
mat$scale_rows(c(1, 2, 3))

Saves a list of matrix convertible objects to files.
- `from` is a list of matrix convertible objects.
- `to` is a character vector of file names to write to.

lmutils::save(
  list("file1.csv", matrix(1:9, nrow=3), 1:3, data.frame(a=1:3, b=4:6)),
  c("file1.json", "file2.mat.gz", "file3.csv", "file4.rdata")
)

Recursively converts a directory of files to the selected file type.

- `from` is a string directory name to read the files from.
- `to` is a string directory name to write the files to or `NULL` to write to `from`.
- `file_type` is a string file extension to write the files as.

lmutils::save_dir(
  "data",
  "converted_data", # or NULL
  "mat.gz"
)

Calculates the R^2 and adjusted R^2 values for blocks and outcomes.

- `data` is a list of matrix convertible objects.
- `outcomes` is a single matrix convertible object.

Returns a data frame with columns `r2`, `adj_r2`, `data`, `outcome`, `n`, `m`, `predicted`, and `betas`.
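To make the `r2` and `adj_r2` columns concrete, the sketch below computes both quantities for one simulated block and one outcome using base R's `lm`. It assumes an ordinary least squares fit with an intercept, `n` rows, and `m` predictor columns; the simulated data and variable names are invented, and lmutils is not called, so this shows the standard definitions rather than the package's internals.

```r
# Base R illustration of R^2 and adjusted R^2 for one block and one outcome.
set.seed(1)
n <- 50
block <- matrix(rnorm(n * 3), nrow = n)  # n rows, m = 3 predictors
outcome <- as.vector(block %*% c(1, -2, 0.5)) + rnorm(n)

fit <- lm(outcome ~ block)  # least squares fit with an intercept
m <- ncol(block)

ss_res <- sum(residuals(fit)^2)             # unexplained variation
ss_tot <- sum((outcome - mean(outcome))^2)  # total variation
r2 <- 1 - ss_res / ss_tot
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - m - 1)  # penalize extra predictors
```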
results <- lmutils::calculate_r2(
  c("block1.csv", "block2.mat.gz"),
  "outcomes1.RData"
)

Compute the p value of a linear regression between each pair of columns in `data` and `outcomes`.

- `data` is a list of matrix convertible objects.
- `outcomes` is a single matrix convertible object.

The function returns a data frame with columns `p_value`, `beta`, `intercept`, `data`, `data_column`, and `outcome`.

results <- lmutils::column_p_values(
  c("block1.csv", "block2.mat.gz"),
  "outcomes1.RData"
)

Perform a linear regression between each data element and each outcome column.

- `data` is a list of matrix convertible objects.
- `outcomes` is a single matrix convertible object.

The function returns a list of data frames with columns `slopes`, `intercept`, `r2`, `adj_r2`, `data`, `outcome`, `n`, `m`, and `predicted` (if enabled).
results <- lmutils::linear_regression(
  c("block1.csv", "block2.mat.gz"),
  "outcomes1.RData"
)

Perform a logistic regression between each data element and each outcome column.

- `data` is a list of matrix convertible objects.
- `outcomes` is a single matrix convertible object.

The function returns a data frame with columns `slopes`, `intercept`, `r2`, `adj_r2`, `data`, `outcome`, `n`, `m`, `predicted` (if enabled), and `coefs`. Each model (row) has a list in the `coefs` column that contains the coefficients of the model. Each item in the list is another list with fields `label`, `coef`, `se`, `t`, and `p`.

results <- lmutils::logistic_regression(
  c("block1.csv", "block2.mat.gz"),
  "outcomes1.RData"
)
coefs <- results$coefs[[1]] # results for block1.csv
coefs[[1]]$label # first column label
coefs[[1]]$coef # coefficient for the first column
coefs[[1]]$se # standard error for the first column
coefs[[1]]$t # t value for the first column
coefs[[1]]$p # p value for the first column

Perform a logistic regression with Firth's penalization between each data element and each outcome column.

- `data` is a list of matrix convertible objects.
- `outcomes` is a single matrix convertible object.

The function returns a list of data frames with columns `slopes`, `intercept`, `r2`, `adj_r2`, `r2_tjur`, `data`, `outcome`, `n`, `m`, `predicted` (if enabled), and `coefs` (see description above).
results <- lmutils::logistic_regression_firth(
  c("block1.csv", "block2.mat.gz"),
  "outcomes1.RData"
)

Performs cross-validated elastic net regression between each data element and each outcome column.

- `data` is a list of matrix convertible objects.
- `outcomes` is a single matrix convertible object.
- `alpha` is a numeric value between 0 and 1, where 0 is ridge regression, 1 is lasso regression, and values in between are elastic net.
- `nfolds` is the number of folds to use for cross-validation.

The function returns a data frame with columns `slopes`, `intercept`, `lambda`, `r2`, `mse`, `data`, and `outcome`.

results <- lmutils::cv_elnet(
  c("block1.csv", "block2.mat.gz"),
  "outcomes1.RData",
  1, # alpha, 0 for ridge, 1 for lasso, in between for elastic net
  5 # number of folds
)

Performs cross-validated elastic net regression between each data element and each outcome column, using pre-defined fold IDs.

- `data` is a list of matrix convertible objects.
- `outcomes` is a single matrix convertible object.
- `alpha` is a numeric value between 0 and 1, where 0 is ridge regression, 1 is lasso regression, and values in between are elastic net.
- `nfolds` is the number of folds to use for cross-validation.
- `foldids` is a numeric vector of fold IDs, where each element gives the fold the corresponding row belongs to.
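A fold ID vector like the `foldids` parameter described above can be built in base R. The sketch below assumes folds are numbered 1 to `nfolds` and that roughly balanced, randomly assigned folds are wanted; `n_rows` is a hypothetical row count, and lmutils is not called.

```r
# Base R illustration: build a reproducible, balanced fold assignment.
set.seed(42)
n_rows <- 10  # hypothetical number of rows in the data and outcomes
nfolds <- 5

# Repeat 1..nfolds until every row is covered, then shuffle the assignment.
foldids <- sample(rep(seq_len(nfolds), length.out = n_rows))

fold_sizes <- table(foldids)  # rows per fold
```

Fixing the fold IDs this way keeps the cross-validation splits identical across runs and across blocks, which makes results easier to compare.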
results <- lmutils::cv_elnet_foldids(
  c("block1.csv", "block2.mat.gz"),
  "outcomes1.RData",
  1, # alpha, 0 for ridge, 1 for lasso, in between for elastic net
  5, # number of folds
  c(5, 3, 1, 2, 4, 2, 1, 3, 4, 5) # fold IDs for each row
)

Performs stepwise feature selection of a logistic regression model by AIC.

- `data` is a matrix convertible object. It must have column names.
- `outcomes` is a numeric vector of binary outcomes (0 or 1).
- `from` is a string indicating the starting model. It can be "full" (all columns) or "null" (intercept only).
- `direction` is a string indicating the direction of the stepwise selection. It can be "both", "backward", or "forward".

Returns a list object with fields `slopes`, `intercept`, `r2`, `adj_r2`, `aic`, and `coefs`.

results <- lmutils::step_aic(
  "matrix1.csv",
  c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1),
  "null", # "full" or "null"
  "both" # "both", "backward", or "forward"
)

Perform 100% plink 1.9 compatible LD pruning on the provided bed file.

- `bed` is the path to the bed file.
- `window_size` is the size of the window in base pairs.
- `step_size` is the number of variants to step between windows.
- `threshold` is the R^2 threshold above which variants will be pruned.

This function returns a list object with three fields:

- `pruned`: the number of variants pruned.
- `prune_in`: a vector of variant IDs that were kept.
- `prune_out`: a vector of variant IDs that were pruned.
results <- lmutils::ld_prune(
  "genotypes.bed",
  50000, # window size in base pairs
  1000, # step size in variants
  0.01 # R^2 threshold
)

Combine a list of double vectors into a single matrix using the vectors as columns.

- `data` is a list of double vectors.
- `out` is an output file name or `NULL` to return the matrix.

lmutils::combine_vectors(
  list(1:3, 4:6),
  "combined_matrix.csv"
)

Combine a potentially nested list of rows (double vectors) into a matrix.

- `data` is a list of double vectors.
- `out` is an output file name or `NULL` to return the matrix.
lmutils::combine_rows(
  list(list(c(1, 2, 3)), c(4, 5, 6)),
  "combined_matrix.csv"
)

Removes rows from a matrix.

- `data` is a list of matrix convertible objects.
- `rows` is a vector of row indices (1-based) to remove.
- `out` is a standard output file.

lmutils::remove_rows(
  "matrix1.csv",
  c(1, 2, 3),
  "matrix1_removed_rows.csv"
)

Calculates the cross product of a matrix with itself. Equivalent to `t(data) %*% data`.

- `data` is a list of matrix convertible objects.
- `out` is a standard output file.
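The equivalence to `t(data) %*% data` stated above can be checked on a small in-memory matrix with base R alone. The example below uses base R's own `crossprod` for comparison and does not call lmutils.

```r
# Base R illustration of the cross product t(data) %*% data.
data <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)  # 3 x 2 example matrix

explicit <- t(data) %*% data       # the definition given above
builtin  <- base::crossprod(data)  # base R's equivalent

explicit
# 2 x 2 result: rows (14, 32) and (32, 77)
```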
lmutils::crossprod(
  "matrix1.csv",
  "crossprod_matrix1.csv"
)

Multiplies two matrices. Equivalent to `a %*% b`.

- `a` is a list of matrix convertible objects.
- `b` is a list of matrix convertible objects.
- `out` is a standard output file.

lmutils::mul(
  "matrix1.csv",
  "matrix2.mat.gz",
  "mul_matrix1_matrix2.csv"
)

Loads a matrix convertible object into R.

- `obj` is a list of matrix convertible objects. If a single object is provided, the function will return the matrix directly, otherwise it will return a list of matrices.

lmutils::load("matrix1.csv")

Matches rows of a matrix by the values of a vector.

- `data` is a list of matrix convertible objects.
- `with` is a numeric vector.
- `by` is the column name to match the rows by.
- `out` is a standard output file.
lmutils::match_rows(
  "matrix1.csv",
  c(1, 2, 3),
  "eid",
  "matched_matrix1.csv"
)

Matches rows of all matrices in a directory to the values in a vector by a column.

- `from` is a string directory name to read the files from.
- `to` is a string directory name to write the files to or `NULL` to write to `from`.
- `with` is a numeric vector to match the rows to.
- `by` is the column name to match the rows by.

lmutils::match_rows_dir(
  "matrices",
  "matched_matrices",
  c(1, 2, 3),
  "eid"
)

Deduplicate a matrix by a column. The first occurrence of each value is kept.

- `data` is a list of matrix convertible objects.
- `by` is the column name to deduplicate by.
- `out` is a standard output file.
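The "first occurrence is kept" rule can be pictured with base R's `duplicated` on an ordinary matrix. This is a sketch of the behavior described above using made-up values; lmutils is not called.

```r
# Base R illustration of deduplicating rows by one column.
m <- cbind(eid = c(1, 2, 2, 3, 1), x = c(10, 20, 21, 30, 11))

# duplicated() flags every repeat after the first occurrence of a value.
deduped <- m[!duplicated(m[, "eid"]), , drop = FALSE]

deduped
# keeps the rows (1, 10), (2, 20), and (3, 30)
```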
lmutils::dedup(
  "matrix1.csv",
  "eid",
  "matrix1_dedup.csv"
)

Compute a new column for a data frame from a Rust-flavored regex and an existing column.

- `df` is a data frame.
- `column` is the column name to match.
- `regex` is the regex to match. The first capture group is used.
- `new_column` is the new column name.
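To show what "the first capture group is used" means, here is a base R approximation on the same kind of data. R's regex engine differs from Rust's regex syntax in some details, so this only illustrates the capture-group idea for a simple pattern; it does not call lmutils, and how unmatched values are handled there is not covered.

```r
# Base R illustration: derive a new column from the first capture group.
df <- data.frame(a = c("a1", "b2", "c3"))

# regexec() returns the full match followed by each capture group.
matches <- regmatches(df$a, regexec("([a-z])", df$a))
df$b <- vapply(matches, function(m) m[2], character(1))  # element 2 = group 1

df$b  # "a" "b" "c"
```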
lmutils::new_column_from_regex(
  data.frame(a=c("a1", "b2", "c3")),
  "a",
  "([a-z])",
  "b"
)

Converts two character vectors into a named list, where the first vector is the names and the second vector is the values. Only the first occurrence of each name is used, essentially creating a map.

- `names` is a character vector of names.
- `values` is a character vector of values.
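The resulting structure can be pictured with a base R equivalent that keeps only the first occurrence of each name. This is a sketch of the described behavior with made-up inputs (`names_vec` and `values_vec` are hypothetical), not the lmutils implementation.

```r
# Base R illustration of building a name -> value map from two vectors.
names_vec  <- c("a", "b", "a")  # "a" appears twice
values_vec <- c("1", "2", "3")

keep <- !duplicated(names_vec)  # first occurrence of each name only
map <- setNames(as.list(values_vec[keep]), names_vec[keep])

map$a  # "1", the later value "3" is ignored
map$b  # "2"
```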
lmutils::map_from_pairs(
  c("a", "b", "c"),
  c("1", "2", "3")
)

Compute a new column for a data frame from a list of values and an existing column, matching by the names of the values.

- `df` is a data frame.
- `column` is the column name to match.
- `values` is a named list of values.
- `new_column` is the new column name.

lmutils::new_column_from_map(
  data.frame(a=c("a", "b", "c")),
  "a",
  lmutils::map_from_pairs(
    c("a", "b", "c"),
    c("1", "2", "3")
  ),
  "b"
)

Compute a new column for a data frame from two character vectors of names and values, matching by the names.

- `df` is a data frame.
- `column` is the column name to match.
- `names` is a character vector of names.
- `values` is a character vector of values.
- `new_column` is the new column name.
lmutils::new_column_from_map_pairs(
  data.frame(a=c("a", "b", "c")),
  "a",
  c("a", "b", "c"),
  c("1", "2", "3"),
  "b"
)

Mutably sorts a data frame by multiple columns in ascending order. All columns must be numeric (double or integer), character, or logical vectors.

- `df` is a data frame.
- `columns` is a character vector of column names to sort by.

df <- data.frame(a=c(3, 3, 2, 2, 1, 1), b=c("b", "a", "b", "a", "b", "a"))
lmutils::df_sort_asc(
  df,
  c("a", "b")
)

Splits a data frame into multiple data frames by a column. This function will mutably sort the data frame by the column before splitting.

- `df` is a data frame.
- `by` is the column name to split by.
df <- data.frame(a=c(1, 2, 3), b=c("a", "b", "c"))
lmutils::df_split(
  df,
  "b"
)

Combines a potentially nested list of data frames into a single data frame. The data frames must have the same columns.

- `data` is a list of data frames.

lmutils::df_combine(
  list(data.frame(a=1:3), data.frame(a=4:6))
)

Compute the R^2 value for given actual and predicted vectors.
lmutils::compute_r2(
  c(1, 2, 3),
  c(1, 2, 3)
)

Compute the Tjur R^2 value for given actual and predicted vectors.
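Tjur's R^2 (the coefficient of discrimination) is the mean predicted probability among the observed 1s minus the mean predicted probability among the observed 0s. The base R sketch below applies that definition to a small made-up example; it does not call lmutils, so it shows the standard formula rather than the package's code.

```r
# Base R illustration of Tjur's R^2 for binary outcomes.
actual    <- c(1, 0, 1)
predicted <- c(0.8, 0.2, 0.9)

mean_pred_ones  <- mean(predicted[actual == 1])  # (0.8 + 0.9) / 2 = 0.85
mean_pred_zeros <- mean(predicted[actual == 0])  # 0.2
tjur_r2 <- mean_pred_ones - mean_pred_zeros      # approximately 0.65
```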
lmutils::compute_r2_tjur(
  c(1, 0, 1),
  c(0.8, 0.2, 0.9)
)

Computes the mean of a vector.

lmutils::mean(
  c(1, 2, 3)
)

Computes the median of a vector.

lmutils::median(
  c(1, 2, 3)
)

Computes the standard deviation of a vector.

lmutils::sd(
  c(1, 2, 3)
)

Computes the variance of a vector.

lmutils::var(
  c(1, 2, 3)
)

Returns the number of cores available on the system. This can be used to determine the number of cores to use for parallel operations.
lmutils::num_cores()

lmutils exposes a number of global config options that can be set using environment variables or the lmutils package functions:

- `LMUTILS_LOG`/`lmutils::set_log_level` to set the log level (default: `info`). Available log levels in order of increasing verbosity are `off`, `error`, `warn`, `info`, `debug`, and `trace`.
- `LMUTILS_CORE_PARALLELISM`/`lmutils::set_core_parallelism` to set the core parallelism (default: `16`). This is the number of primary operations to run in parallel.
- `LMUTILS_NUM_WORKER_THREADS`/`lmutils::set_num_worker_threads` to set the number of worker threads to use (default: `num_cpus::get() / 2`). This is the number of threads to use for parallel operations. Once an operation has been run, this value cannot be changed.
- `LMUTILS_ENABLE_PREDICTED`/`lmutils::disable_predicted`/`lmutils::enable_predicted` to enable the calculation of the predicted values in `lmutils::calculate_r2`.
- `LMUTILS_IGNORE_CORE_PARALLEL_ERRORS`/`lmutils::ignore_core_parallel_errors`/`lmutils::dont_ignore_core_parallel_errors` to ignore errors in core parallel operations. By default, if an error occurs in a core parallel operation, it will be retried. If it fails its allowed number of retries, the error will be logged and the next operation will be attempted. If this option is disabled, Rust will panic after the allowed number of retries and the operation will fail.
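The environment variables listed above can also be set from inside an R session with base R's `Sys.setenv`. The sketch below uses only variable names and values from the list; it assumes they are set before lmutils reads them, and the text does not say exactly when that happens, so setting them at the top of a script is the cautious choice.

```r
# Assumption: these are set before lmutils reads its configuration.
Sys.setenv(
  LMUTILS_LOG = "debug",          # one of the documented log levels
  LMUTILS_CORE_PARALLELISM = "8"  # environment values are always strings
)

log_level <- Sys.getenv("LMUTILS_LOG")  # confirm the value is visible to R
```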