
🧬 EmbedDiff: A modular machine learning pipeline combining ESM2 embeddings, latent diffusion, and transformer-based decoding for de novo protein design


🧬 EmbedDiff: Latent Diffusion Pipeline for De Novo Protein Sequence Generation


EmbedDiff is a modular pipeline for de novo protein sequence generation that combines pretrained ESM2 embeddings, a latent diffusion model, and Transformer-based decoding. It enables efficient exploration of the protein sequence landscape, generating novel sequences that preserve evolutionary plausibility, functional diversity, and foldability, without requiring structural supervision.


🚀 Quick Start (1-liner)

To run the entire EmbedDiff pipeline from end to end:

python run_embeddiff_pipeline.py

πŸ” What Is EmbedDiff?

EmbedDiff is a generative machine learning pipeline for de novo protein design that combines powerful pretrained embeddings with a latent diffusion model and Transformer-based decoding.

It starts by embedding natural protein sequences using ESM2, which maps each sequence into a high-dimensional vector that encodes rich evolutionary, functional, and structural priors. These embeddings serve as a biologically grounded latent space. A denoising diffusion model is then trained directly on these ESM2 embeddings. During training, Gaussian noise is added to the embeddings across a series of timesteps, and the model learns to reverse this corruption, effectively modeling the distribution of natural protein embeddings. This enables EmbedDiff to sample entirely new latent vectors from noise that remain within the manifold of plausible protein sequences.

These synthetic embeddings are decoded into amino acid sequences using a Transformer model, which supports both stochastic sampling and optional reference-guided decoding. The resulting sequences are novel yet biologically grounded. The pipeline concludes with comprehensive validation and visualization, including:

  • Shannon entropy filtering to assess compositional diversity
  • BLAST alignment against SwissProt to measure sequence novelty and identity
  • Cosine similarity comparisons in latent space
  • t-SNE and MDS plots for embedding visualization
  • Optional structural assessment using ESMFold to predict 3D folds and per-residue confidence (pLDDT)

All results are compiled into an interactive HTML summary report for easy inspection and sharing.


📌 Pipeline Overview

The full EmbedDiff pipeline is modular and proceeds through the following stages:

Step 1: Input Dataset

  • Format: A curated FASTA file of real protein sequences (e.g., Thioredoxin reductases).
  • Used as the basis for learning a latent protein representation and decoder training.

Step 2a: ESM2 Embedding

  • The curated sequences are embedded using the esm2_t33_650M_UR50D model.
  • This transforms each protein into a 1280-dimensional latent vector.
  • These embeddings capture functional and evolutionary constraints without any structural input.
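For reference, a minimal sketch of this embedding step using the fair-esm package is shown below; the mean-pooling choice and file paths are illustrative, and the pipeline's actual implementation lives in scripts/esm_embedder.py.

```python
# Minimal sketch: mean-pooled ESM2 embeddings with the fair-esm package.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

sequences = [("example_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(sequences)

with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=False)

# Average over residue positions (excluding BOS/EOS tokens) -> one 1280-d vector per protein
reps = out["representations"][33]
embedding = reps[0, 1 : len(sequences[0][1]) + 1].mean(dim=0)
print(embedding.shape)  # torch.Size([1280])
```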

Step 2b: t-SNE of Real Embeddings

  • t-SNE is applied to the real ESM2 embeddings to visualize the structure of protein space.
  • Serves as a baseline to later compare generated (synthetic) embeddings.
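A minimal version of this visualization with scikit-learn might look like the following (the output filename is illustrative; the pipeline's script is scripts/first_tsne_embedding.py):

```python
# Sketch: 2-D t-SNE projection of the real ESM2 embeddings.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.load("embeddings/esm2_embeddings.npy")   # (n_sequences, 1280)
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], s=10)
plt.title("t-SNE of real ESM2 embeddings")
plt.savefig("figures/tsne_real_embeddings.png", dpi=300)  # illustrative filename
```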

Step 3: Train EmbedDiff Latent Diffusion Model

  • A denoising MLP learns to reverse the process of adding Gaussian noise to real ESM2 embeddings.
  • Trained over a fixed number of timesteps (e.g., 30), the model learns to gradually denoise noisy embeddings back toward the real manifold.
  • This enables sampling realistic embeddings from noise.
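Conceptually, one training step looks like the sketch below. The epsilon-prediction objective, linear beta schedule, and the model(x_t, t) call signature are assumptions; the real model and training loop are in models/diffusion_mlp.py and scripts/train_emeddiff.py.

```python
# Conceptual sketch of one diffusion training step on ESM2 embeddings.
import torch
import torch.nn.functional as F

T = 30                                         # number of diffusion timesteps
betas = torch.linspace(1e-4, 0.02, T)          # assumed linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, x0, optimizer):
    """x0: batch of real ESM2 embeddings, shape (B, 1280)."""
    t = torch.randint(0, T, (x0.size(0),))                     # random timestep per sample
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].unsqueeze(1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise     # forward (noising) process
    pred_noise = model(x_t, t)                                 # denoising MLP predicts the noise
    loss = F.mse_loss(pred_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```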

Step 4: Sample Synthetic Embeddings

  • Starting from pure Gaussian noise, the trained diffusion model is used to generate new latent vectors that resemble real protein embeddings.
  • These latent samples are biologically plausible but unseen, representing de novo candidates.
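A DDPM-style ancestral sampling loop consistent with the training sketch above might look like this (again, the schedule and model signature are assumptions; see scripts/sample_embeddings.py for the actual code):

```python
# Sketch: generating synthetic embeddings by reversing the diffusion process.
import torch

T = 30
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def sample_embeddings(model, n_samples, dim=1280):
    x = torch.randn(n_samples, dim)                    # start from pure Gaussian noise
    for t in reversed(range(T)):
        t_batch = torch.full((n_samples,), t, dtype=torch.long)
        eps = model(x, t_batch)                        # predicted noise at step t
        alpha_t = 1.0 - betas[t]
        # Posterior mean of x_{t-1} given x_t and the predicted noise
        x = (x - (1.0 - alpha_t) / (1.0 - alpha_bars[t]).sqrt() * eps) / alpha_t.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x                                           # synthetic ESM2-like embeddings
```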

Step 5a: Build Decoder Dataset

  • Real ESM2 embeddings are paired with their corresponding amino acid sequences.
  • This dataset is used to train a decoder to translate from embedding → sequence.
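In code, the pairing can be as simple as the sketch below (class and variable names are illustrative; the pipeline builds its dataset in scripts/build_decoder_dataset.py and keeps its residue mappings in utils/amino_acid_utils.py):

```python
# Sketch: pair each real ESM2 embedding with its tokenized amino acid sequence.
import numpy as np
import torch
from torch.utils.data import Dataset

AA = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_IDX = {aa: i for i, aa in enumerate(AA)}

class EmbeddingSequenceDataset(Dataset):
    def __init__(self, embeddings_path, sequences):
        self.embeddings = np.load(embeddings_path)   # (N, 1280), row-aligned with sequences
        self.sequences = sequences                   # list of N amino acid strings

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        emb = torch.from_numpy(self.embeddings[idx]).float()
        tokens = torch.tensor([AA_TO_IDX[a] for a in self.sequences[idx]], dtype=torch.long)
        return emb, tokens
```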

Step 5b: Train Transformer-based Decoder

  • A Transformer model is trained to autoregressively generate amino acid sequences from input embeddings.
  • Label smoothing and entropy filtering are used to improve sequence diversity and biological plausibility.
  • Optionally, ESM2 logit distillation is applied to align predictions with natural residue distributions.
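As a rough illustration of the label-smoothing part, a teacher-forced training step could look like the following (the decoder's call signature is an assumption; the actual training code is scripts/train_transformer.py):

```python
# Sketch: one teacher-forced decoder update with label smoothing.
import torch.nn.functional as F

def decoder_step(decoder, embedding, target_tokens, optimizer, smoothing=0.1):
    """embedding: (B, 1280) conditioning vectors; target_tokens: (B, L) residue indices."""
    logits = decoder(embedding, target_tokens[:, :-1])   # predict the next residue at each position
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_tokens[:, 1:].reshape(-1),
        label_smoothing=smoothing,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```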

🔄 Step 6: Decode Synthetic Sequences

The synthetic embeddings from Step 4 are decoded into amino acid sequences using a hybrid decoding strategy that balances biological realism with diversity.

By default:

  • 40% of amino acid positions are generated stochastically, sampled from the decoder's output distribution.
  • 60% are reference-guided, biased toward residues from the closest matching natural sequence.

This configuration is empirically tuned to produce sequences with approximately 50–60% sequence identity to known proteins, striking a practical balance between novelty and plausibility.

💡 Modular and Adjustable

This decoding step is fully configurable:

  • Setting the stochastic ratio to 100% yields fully de novo sequences, maximizing novelty but potentially reducing identity.
  • Lower stochastic ratios (e.g., 20–30%) increase similarity to natural proteins.
  • The ratio can be adjusted using a configuration flag in the decoding script.

The output is a final FASTA file of decoded protein sequences, suitable for downstream validation or structural modeling.
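The per-position decision behind this hybrid strategy can be summarized by the sketch below (function and argument names are illustrative; the full decoding logic is in scripts/transformer_decode.py):

```python
# Sketch: hybrid decoding rule for a single sequence position.
import torch

def hybrid_decode_position(logits, reference_residue, stochastic_ratio=0.4):
    """logits: (vocab_size,) decoder output for one position;
    reference_residue: residue index from the closest matching natural sequence."""
    if torch.rand(1).item() < stochastic_ratio:
        probs = torch.softmax(logits, dim=-1)
        return torch.multinomial(probs, num_samples=1).item()   # stochastic sampling
    return reference_residue                                     # reference-guided copy
```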


Step 7a: t-SNE Overlay

  • A combined t-SNE plot compares the distribution of real and generated embeddings.
  • Useful for assessing whether synthetic proteins fall within plausible latent regions.

Step 7b: Cosine Similarity Histogram

  • Pairwise cosine similarities are computed between:
    • Natural vs. natural embeddings
    • Natural vs. generated embeddings
    • Generated vs. generated embeddings
  • This helps evaluate diversity and proximity to known protein embeddings.
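A minimal version of this comparison (the pipeline's plotting code is scripts/cosine_simlar_histo.py):

```python
# Sketch: pairwise cosine similarities within and between the real and generated embedding sets.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

real = np.load("embeddings/esm2_embeddings.npy")
generated = np.load("embeddings/sampled_embeddings.npy")

real_vs_real = cosine_similarity(real)[np.triu_indices(len(real), k=1)]
real_vs_gen = cosine_similarity(real, generated).ravel()
gen_vs_gen = cosine_similarity(generated)[np.triu_indices(len(generated), k=1)]
# Each array can then be plotted as a histogram to compare the three distributions.
```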

πŸ” Step 7c: Entropy vs. Identity Filtering

Each decoded protein sequence is evaluated using two key metrics:

  • Shannon Entropy: Quantifies amino acid diversity across the sequence.

    • Values typically range from ~1.0 (low diversity) to ~4.3 (maximum diversity).
    • Higher entropy values (≥ 3.5) suggest diverse, non-repetitive sequences.
    • Lower values (< 2.5) may indicate low-complexity or biologically implausible repeats.
  • Sequence Identity (via BLAST): Measures similarity to known natural proteins.

    • This helps ensure the generated sequences remain evolutionarily grounded while still being novel.

Sequences are filtered based on configurable entropy and identity thresholds to retain those with balanced novelty and biological relevance. Only sequences within the target range are included in downstream analysis and structural validation.
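For reference, the entropy metric can be computed as in the short sketch below (the pipeline's own metric functions live in utils/metrics.py and scripts/plot_entropy_identity.py):

```python
# Sketch: Shannon entropy of a protein sequence over its amino acid composition.
import math
from collections import Counter

def shannon_entropy(sequence: str) -> float:
    counts = Counter(sequence)
    n = len(sequence)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A single-residue repeat scores 0 bits, while a uniform mix of all 20 residues
# scores log2(20) ~ 4.32 bits, matching the ~4.3 maximum quoted above.
```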


πŸ” Step 7d: Local BLAST Validation

Generated sequences are validated by aligning them against a locally downloaded SwissProt database using the blastp tool from NCBI BLAST+.

  • Uses: blastp from the BLAST+ suite
  • Target database: SwissProt (downloaded locally in FASTA format)
  • Input: Decoded sequences (decoded_embeddiff.fasta)
  • Output: A CSV summary with:
    • Percent identity
    • E-value
    • Alignment length
    • Matched SwissProt accession/description

This step confirms that generated sequences are evolutionarily meaningful by evaluating their similarity to curated natural proteins.

πŸ“ Output example: data/blast_results/blast_summary_local.csv


Step 8: HTML Summary Report

  • All visualizations, metrics, and links to output files are compiled into an interactive HTML report.
  • Includes cosine plots, entropy scatter, identity histograms, and t-SNE/MDS projections.
  • Allows easy inspection and sharing of results.

🧪 Optional: Structural Validation with ESMFold or AlphaFold2

Although not part of the core EmbedDiff pipeline, the generated sequences can optionally be assessed for structural plausibility using modern protein structure prediction tools:

🔬 ESMFold

  • A fast, accurate structure predictor from Meta AI, built on the ESM2 language model.
  • Accepts a FASTA file of protein sequences as input and returns predicted 3D structures with per-residue confidence scores (pLDDT).
  • Ideal for rapid, large-scale folding of EmbedDiff-generated sequences.
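A minimal folding example using the fair-esm package (requires the ESMFold extras and ideally a GPU; the output filename is illustrative):

```python
# Sketch: fold one generated sequence with ESMFold and save the predicted structure.
import torch
import esm

model = esm.pretrained.esmfold_v1()
model = model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)   # PDB text; per-residue pLDDT is stored in the B-factor column

with open("example_fold.pdb", "w") as fh:
    fh.write(pdb_string)
```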

🧬 AlphaFold2

  • The state-of-the-art method from DeepMind for protein structure prediction.
  • Provides highly accurate structural models and can be run locally or via cloud platforms.
  • More computationally intensive, but offers best-in-class accuracy.

🧯 Output from Structural Prediction Tools

  • 3D Models (.pdb) for each sequence.
  • Confidence Scores per residue (e.g., pLDDT) or per residue pair (e.g., PAE).
  • Optional Visualizations using external molecular structure viewers.

📌 These tools provide additional confidence that the generated sequences are likely to fold into stable and ordered protein structures.


📂 Project Structure

EmbedDiff/
├── README.md                      # 📘 Project overview and documentation
├── .gitignore                     # 🛑 Files/folders to exclude from version control
├── run_embeddiff_pipeline.py      # 🧠 Master pipeline script to run all steps
├── requirements.txt               # 📦 Python dependencies for setting up environment
├── environment.yml                # (Optional) Conda environment file (if using Conda)
│
├── data/                          # 📁 Input and output biological data
│   ├── curated_thioredoxin_reductase.fasta
│   ├── decoded_embeddiff.fasta
│   └── blast_results/
│       └── blast_summary_local.csv
│
├── embeddings/                    # 📁 Latent vector representations
│   ├── esm2_embeddings.npy
│   └── sampled_embeddings.npy
│
├── figures/                       # 📁 All generated plots and report
│   ├── fig2b_loss_train_val.png
│   ├── fig3a_generated_tsne.png
│   ├── fig5a_decoder_loss.png
│   ├── fig5b_identity_histogram.png
│   ├── fig5c_entropy_scatter.png
│   ├── fig5d_all_histograms.png
│   ├── fig_tsne_by_domain.png
│   ├── fig5f_tsne_domain_overlay.png
│   ├── fig5b_identity_scores.csv
│   └── embeddiff_summary_report.html
│
├── scripts/                       # 📁 Core processing scripts
│   ├── esm_embedder.py            # Step 2a: Embed sequences with ESM2
│   ├── first_tsne_embedding.py    # Step 2b: t-SNE of real embeddings
│   ├── train_emeddiff.py          # Step 3: Train latent diffusion model
│   ├── sample_embeddings.py       # Step 4: Sample new embeddings
│   ├── build_decoder_dataset.py   # Step 5a: Build decoder training set
│   ├── train_transformer.py       # Step 5b: Train decoder
│   ├── transformer_decode.py      # Step 6: Decode embeddings to sequences
│   ├── plot_tsne_class_overlay.py # Step 7a: t-SNE comparison
│   ├── cosine_simlar_histo.py     # Step 7b: Cosine similarity plots
│   ├── plot_entropy_identity.py   # Step 7c: Entropy vs. identity filter
│   ├── blastlocal.py              # Step 7d: Local BLAST alignment
│   └── generate_html_report.py    # Step 8: Generate final HTML report
│
├── models/                        # 📁 ML model architectures
│   ├── diffusion_mlp.py           # EmbedDiff diffusion model
│   └── decoder_transformer.py     # Transformer-based decoder
│
├── utils/                         # 📁 Utility and helper functions
│   ├── amino_acid_utils.py        # Mapping functions for sequences
│   └── metrics.py                 # Functions for loss, entropy, identity, etc.
│
└── checkpoints/                   # 📁 Model checkpoints (excluded via .gitignore)
    ├── embeddiff_mlp.pth
    └── decoder_transformer_best.pth

πŸ™ Citation & Acknowledgment

If you use EmbedDiff in your research or development, please consider starring the repo ⭐ and linking back to it. Citations and backlinks help others find and trust this work.