MetaTCR is a computational framework designed to standardize disparate T-cell Receptor (TCR) repertoires and systematically correct for batch effects to enable robust downstream analysis. The framework transforms variable-length TCR repertoire data into fixed-dimensional meta-vectors by projecting individual repertoires onto a standardized reference space, facilitating large-scale integration and batch correction.
The MetaTCR framework consists of four main stages:
- Stage 1: Constructing a universal TCR space. A reference database is curated from multiple TCR repertoires and hierarchical clustering is performed to establish functional TCR centroids.
- Stage 2: Projecting repertoires into the universal space. Individual repertoires are encoded by mapping their clonotypes to reference centroids, generating standardized meta-feature matrices.
- Stage 3: Framework evaluation. The framework's performance is evaluated using simulated and real-world data to assess metric accuracy and batch correction efficacy.
- Stage 4: Application for biological discovery. The framework is applied to downstream tasks including batch effect identification, dataset integration, and biological discovery from corrected data.
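To make Stage 2 concrete, the projection can be thought of as a nearest-centroid assignment followed by frequency pooling. Below is a minimal NumPy sketch of that idea, not the actual MetaTCR implementation; the function name, the Euclidean distance metric, and the frequency normalization are illustrative assumptions:

```python
import numpy as np

def encode_repertoire(clonotype_embeddings, clone_freqs, centroids):
    """Hypothetical helper: project one repertoire onto K reference centroids.

    clonotype_embeddings: (n_clones, d) embeddings of the repertoire's clonotypes
    clone_freqs:          (n_clones,)   clone frequencies or counts
    centroids:            (K, d)        reference centroids of the universal TCR space
    Returns a fixed-length (K,) meta-vector regardless of repertoire size.
    """
    # Assign each clonotype to its nearest reference centroid (Euclidean distance assumed).
    dists = np.linalg.norm(clonotype_embeddings[:, None, :] - centroids[None, :, :], axis=-1)
    assignment = dists.argmin(axis=1)

    # Pool clone frequencies per centroid to obtain a fixed-dimensional representation.
    meta_vector = np.zeros(centroids.shape[0])
    np.add.at(meta_vector, assignment, clone_freqs)

    # Normalize so repertoires of different sequencing depths remain comparable (assumption).
    return meta_vector / meta_vector.sum()
```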
Before installing MetaTCR, ensure you have the following dependencies installed:
- Python >= 3.8
- tqdm
- scipy
- biopython
- matplotlib
- torch >= 1.1.0 (tested on 1.7.1)
- pandas
- numpy
- scikit-learn (sklearn)
- tape_proteins
- faiss-gpu
- Install Cython (required for building extensions):

  ```bash
  pip install cython==3.1.5
  ```

- Install the MetaTCR package:

  ```bash
  cd /path/to/metatcr_code
  pip install .
  ```

  Alternatively, you can install in editable (development) mode from the source:

  ```bash
  pip install -e .
  ```

The processed metadata and MetaTCR-encoded intermediate results are available at:
This repository includes:
- Processed TCR reference database
- Meta-vector representations of repertoires
- Other intermediate analysis results
For downloaded or generated data files, we recommend the following directory structure:
- Dataset files (`.pk` files from Zenodo or generated via `step2`): Store in `./data/processed_data/datasets_mtx_1024/`
  - Example: `./data/processed_data/datasets_mtx_1024/Huth2019.pk`
- Centroid files (from Zenodo or generated via `step1`): Store in `./data/processed_data/`
  - Example: `./data/processed_data/spectral_centroids/centroid_mapping_spectral_k96.pk`
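Once a dataset file is in place, it can be inspected directly in Python. A minimal sketch, assuming the `.pk` files are standard Python pickles (their exact contents depend on how the encoding step was run):

```python
import pickle

# Example dataset file following the recommended layout above.
path = "./data/processed_data/datasets_mtx_1024/Huth2019.pk"

with open(path, "rb") as f:
    dataset = pickle.load(f)

# Inspect the loaded object before feeding it into downstream steps.
print(type(dataset))
```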
For detailed information about data files, please refer to the Zenodo repository.
Before encoding repertoires with MetaTCR, raw TCR repertoire data must undergo quality control and preprocessing:
- Quality Control (a minimal filtering sketch is shown after this list):
  - Filter out entries with CDR3β chain lengths shorter than 10 amino acids
  - Remove sequences containing stop codons
  - Retain only amino acid sequences beginning with cysteine (C) and ending with phenylalanine (F)
  - Select the most abundant clones (up to 10,000 per repertoire)
- Required Data Fields: Each repertoire file must contain the following information:
  - CDR3 sequence (amino acid)
  - V gene annotation
  - J gene annotation
  - Clone frequency/count
- Full-length Sequence Reconstruction: For repertoires containing only CDR3+V+J information, full-length TCR sequences can be reconstructed using the provided script:
  - Script: `pre_process_scripts/cdr3_to_full_seq_mod.py` (original code reference)
  - Usage example: See `pre_process_scripts/demo_generate_TCR_fullseq.sh`

  The script reconstructs full TCR sequences by aligning CDR3 sequences with V and J gene segments from IMGT reference sequences.
- Input Data Format: Processed repertoire files should be in TSV format with columns including `aminoAcid` (CDR3), `vMaxResolved` (V gene), `jMaxResolved` (J gene), `frequencyCount`, and `full_seq` (full-length sequence). An example of the processed data format can be found in `demo_data/Emerson2017_demo/`, which serves as the input format for MetaTCR repertoire encoding.
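As a rough illustration of the quality-control rules listed above, here is a minimal pandas sketch over the demo column names. The file name is hypothetical, and the assumption that stop codons appear as `*` in the CDR3 string is ours, not taken from the MetaTCR scripts:

```python
import pandas as pd

# Hypothetical file name inside the demo directory; point this at your own repertoire TSV.
rep = pd.read_csv("demo_data/Emerson2017_demo/sample1.tsv", sep="\t")

cdr3 = rep["aminoAcid"].astype(str)
keep = (
    (cdr3.str.len() >= 10)          # drop CDR3beta shorter than 10 amino acids
    & ~cdr3.str.contains(r"\*")     # drop sequences containing stop codons (assumed marked by '*')
    & cdr3.str.startswith("C")      # keep sequences starting with cysteine ...
    & cdr3.str.endswith("F")        # ... and ending with phenylalanine
)

# Keep the most abundant clones, up to 10,000 per repertoire.
filtered = rep[keep].sort_values("frequencyCount", ascending=False).head(10000)
```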
MetaTCR uses the TCR2vec model to encode TCR clonotypes into numerical vectors. The pre-trained TCR2vec model should be placed in `pretrained_models/TCR2vec_120/`.
Model Download: The pre-trained TCR2vec model can be downloaded from:
Model Source: The TCR2vec model is based on the work from:
- GitHub Repository: https://github.com/jiangdada1221/TCR2vec
After downloading, extract the model files to `pretrained_models/TCR2vec_120/`. The directory should contain:
- `pytorch_model.bin`
- `args.json`
- `config.json`

For more details about the TCR2vec model, please refer to the README in `pretrained_models/TCR2vec_120/Readme.md`.
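A quick sanity check that the extracted model directory is complete can be done in Python; this is only a convenience snippet, not part of the MetaTCR scripts:

```python
from pathlib import Path

model_dir = Path("pretrained_models/TCR2vec_120")
expected = ["pytorch_model.bin", "args.json", "config.json"]

# Report any of the expected TCR2vec files that are missing after extraction.
missing = [name for name in expected if not (model_dir / name).exists()]
if missing:
    raise FileNotFoundError(f"Missing TCR2vec model files in {model_dir}: {missing}")
print("TCR2vec model directory looks complete.")
```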
- Generate TCR Functional Clusters (if not using pre-computed centroids):

  ```bash
  python step1.generate_TCR_functional_clusters.py
  ```

- Encode Repertoires to Meta-vectors:

  For unlabeled data (no positive/negative labels):

  ```bash
  python step2.0.dataset_to_meta_matrix.py --unlabeled_dir data/repertoire_data/Martinez2025 --dataset_name Martinez2025 --tcr2vec_path ./pretrained_models/TCR2vec_120
  ```

  For labeled data (with separate positive and negative sample paths):

  ```bash
  python step2.0.dataset_to_meta_matrix.py --pos_dir data/repertoire_data/Liu2019/SLE --neg_dir data/repertoire_data/Liu2019/Control --dataset_name Liu2019 --tcr2vec_path ./pretrained_models/TCR2vec_120
  ```
For detailed usage instructions and parameter descriptions, please refer to the individual script help messages:

```bash
python step2.rawdata_to_meta_encoding.py --help
```

- Measure Quantitative Metrics:

  ```bash
  python step3.measure_quantitative_metrics.py
  ```

- Correct Batch Effects:

  ```bash
  python step4.correct_batch_effect.py
  ```

If you use MetaTCR in your research, please cite:
[Citation information to be added]
This project is licensed under the GPL-3.0 License.
For questions and issues, please contact:
- Author: Miaozhe Huo
- Email: miaozhhuo2-c@my.cityu.edu.hk
MetaTCR builds upon the TCR2vec model for TCR sequence encoding. We acknowledge the original TCR2vec authors for their valuable contribution.
