EmbedDiff is a modular pipeline for de novo protein sequence generation that combines pretrained ESM2 embeddings, a latent diffusion model, and Transformer-based decoding. It enables efficient exploration of the protein sequence landscape, generating novel sequences that preserve evolutionary plausibility, functional diversity, and foldability, without requiring structural supervision.
To run the entire EmbedDiff pipeline from end to end:
```bash
python run_embeddiff_pipeline.py
```
It starts by embedding natural protein sequences using ESM2, which maps each sequence into a high-dimensional vector that encodes rich evolutionary, functional, and structural priors. These embeddings serve as a biologically grounded latent space.

A denoising diffusion model is then trained directly on these ESM2 embeddings. During training, Gaussian noise is added to the embeddings across a series of timesteps, and the model learns to reverse this corruption, effectively modeling the distribution of natural protein embeddings. This enables EmbedDiff to sample entirely new latent vectors from noise that remain within the manifold of plausible protein sequences.

These synthetic embeddings are decoded into amino acid sequences using a Transformer model, which supports both stochastic sampling and optional reference-guided decoding. The resulting sequences are novel yet biologically grounded. The pipeline concludes with comprehensive validation and visualization, including:
- Shannon entropy filtering to assess compositional diversity
- BLAST alignment against SwissProt to measure sequence novelty and identity
- Cosine similarity comparisons in latent space
- t-SNE and MDS plots for embedding visualization
- Optional structural assessment using ESMFold to predict 3D folds and per-residue confidence (pLDDT)
All results are compiled into an interactive HTML summary report for easy inspection and sharing.
The full EmbedDiff pipeline is modular and proceeds through the following stages:
### Step 1: Curated Input Data
- Format: a curated FASTA file of real protein sequences (e.g., thioredoxin reductases).
- Used as the basis for learning a latent protein representation and for training the decoder.
### Step 2a: ESM2 Embedding
- The curated sequences are embedded using the `esm2_t33_650M_UR50D` model, which transforms each protein into a 1280-dimensional latent vector.
- These embeddings capture functional and evolutionary constraints without any structural input.
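A minimal sketch of this step with the fair-esm package (the repo's own `esm_embedder.py` may differ in details such as pooling):

```python
# Sketch: embed one sequence with ESM2 (esm2_t33_650M_UR50D) via fair-esm.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("seq1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
reps = out["representations"][33]                # (batch, length + 2, 1280)

# Mean-pool per-residue embeddings, skipping BOS/EOS tokens,
# to obtain one 1280-dimensional vector per sequence.
seq_len = len(data[0][1])
embedding = reps[0, 1:seq_len + 1].mean(dim=0)   # shape: (1280,)
```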
### Step 2b: t-SNE of Real Embeddings
- t-SNE is applied to the real ESM2 embeddings to visualize the structure of protein space.
- Serves as a baseline against which generated (synthetic) embeddings are later compared.
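A bare-bones version of this projection with scikit-learn (the output filename is illustrative):

```python
# Sketch: project real ESM2 embeddings to 2-D with t-SNE and save the plot.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

emb = np.load("embeddings/esm2_embeddings.npy")   # (N, 1280)
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(emb)

plt.scatter(coords[:, 0], coords[:, 1], s=8)
plt.title("t-SNE of real ESM2 embeddings")
plt.savefig("figures/fig_tsne_real.png", dpi=200)  # illustrative path
```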
### Step 3: Train the Latent Diffusion Model
- A denoising MLP learns to reverse the process of adding Gaussian noise to real ESM2 embeddings.
- Trained over a sequence of timesteps (e.g., 30), the model gradually denoises corrupted embeddings back toward the real manifold.
- This enables sampling realistic embeddings from noise, as in the training sketch below.
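A hedged sketch of this noise-prediction objective; the beta schedule, network width, and timestep conditioning are illustrative assumptions, with T = 30 taken from the text:

```python
# Sketch: DDPM-style training of an MLP denoiser on 1280-dim embeddings.
import torch
import torch.nn as nn

T = 30
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(nn.Linear(1280 + 1, 1024), nn.SiLU(), nn.Linear(1024, 1280))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

def train_step(x0):                                   # x0: (batch, 1280) real embeddings
    t = torch.randint(0, T, (x0.size(0),))
    noise = torch.randn_like(x0)
    a = alphas_bar[t].unsqueeze(1)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise       # forward corruption at step t
    t_feat = (t.float() / T).unsqueeze(1)             # simple scalar timestep conditioning
    pred = denoiser(torch.cat([xt, t_feat], dim=1))
    loss = nn.functional.mse_loss(pred, noise)        # learn to predict the added noise
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```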
### Step 4: Sample New Embeddings
- Starting from pure Gaussian noise, the trained diffusion model generates new latent vectors that resemble real protein embeddings.
- These latent samples are biologically plausible but unseen, representing de novo candidates.
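Continuing the training sketch above (reusing its `T`, `betas`, `alphas_bar`, and trained `denoiser`), ancestral DDPM sampling from pure noise looks roughly like:

```python
# Sketch: reverse diffusion from Gaussian noise to synthetic embeddings.
@torch.no_grad()
def sample(denoiser, n, dim=1280):
    x = torch.randn(n, dim)                           # start from pure noise
    for t in reversed(range(T)):
        t_feat = torch.full((n, 1), t / T)
        eps = denoiser(torch.cat([x, t_feat], dim=1))
        a_t, ab_t = 1.0 - betas[t], alphas_bar[t]
        # standard DDPM posterior mean update
        x = (x - (1 - a_t) / (1 - ab_t).sqrt() * eps) / a_t.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x                                          # (n, 1280) synthetic embeddings
```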
### Step 5a: Build the Decoder Dataset
- Real ESM2 embeddings are paired with their corresponding amino acid sequences.
- This dataset is used to train a decoder that translates embedding → sequence.
### Step 5b: Train the Transformer Decoder
- A Transformer model is trained to autoregressively generate amino acid sequences from input embeddings.
- Label smoothing and entropy filtering are used to improve sequence diversity and biological plausibility (see the loss sketch below).
- Optionally, ESM2 logit distillation is applied to align predictions with natural residue distributions.
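For instance, label smoothing can be applied directly through PyTorch's cross-entropy loss; the padding index and tensor shapes here are illustrative, and the distillation term is omitted:

```python
# Sketch: smoothed per-residue cross-entropy for the decoder.
import torch
import torch.nn as nn

PAD_IDX = 0                                    # hypothetical padding index
criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=PAD_IDX)

# Toy shapes: logits (batch, vocab, seq_len) from the decoder,
# targets (batch, seq_len) of ground-truth residue indices.
logits = torch.randn(4, 21, 128)
targets = torch.randint(1, 21, (4, 128))
loss = criterion(logits, targets)
```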
### Step 6: Decode Embeddings into Sequences
The synthetic embeddings from Step 4 are decoded into amino acid sequences using a hybrid decoding strategy that balances biological realism with diversity.
By default:
- 40% of amino acid positions are generated stochastically, sampled from the decoder's output distribution.
- 60% are reference-guided, biased toward residues from the closest matching natural sequence.
This configuration is empirically tuned to produce sequences with approximately 50–60% sequence identity to known proteins, striking a practical balance between novelty and plausibility.
This decoding step is fully configurable:
- Setting the stochastic ratio to 100% yields fully de novo sequences, maximizing novelty but potentially reducing identity.
- Lower stochastic ratios (e.g., 20–30%) increase similarity to natural proteins.
- The ratio can be adjusted using a configuration flag in the decoding script.
The output is a final FASTA file of decoded protein sequences, suitable for downstream validation or structural modeling.
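For illustration, the per-position mixing rule described above might look like the following sketch; the function and parameter names are hypothetical, and the actual logic lives in `transformer_decode.py`:

```python
# Sketch: hybrid stochastic / reference-guided residue selection.
import torch

def hybrid_decode_step(probs: torch.Tensor, ref_residue: int,
                       stochastic_ratio: float = 0.4) -> int:
    """Pick one residue index for a single position.

    probs: (vocab,) decoder output distribution for this position.
    ref_residue: residue index copied from the closest natural sequence.
    """
    if torch.rand(()).item() < stochastic_ratio:
        return int(torch.multinomial(probs, 1))   # stochastic sampling
    return ref_residue                            # reference-guided choice
```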
### Step 7a: t-SNE Comparison
- A combined t-SNE plot compares the distribution of real and generated embeddings.
- Useful for assessing whether synthetic proteins fall within plausible latent regions.
### Step 7b: Cosine Similarity Analysis
- Pairwise cosine distances are computed between:
  - natural vs. natural sequences
  - natural vs. generated sequences
  - generated vs. generated sequences
- This helps evaluate diversity and proximity to known protein embeddings.
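A sketch of these comparisons using scikit-learn and the repo's embedding files:

```python
# Sketch: pairwise cosine similarity between the three embedding groups.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

real = np.load("embeddings/esm2_embeddings.npy")
gen = np.load("embeddings/sampled_embeddings.npy")

nat_vs_nat = cosine_similarity(real, real)
nat_vs_gen = cosine_similarity(real, gen)
gen_vs_gen = cosine_similarity(gen, gen)
print(nat_vs_gen.mean(), nat_vs_gen.std())   # summary of proximity to real proteins
```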
### Step 7c: Entropy and Identity Filtering
Each decoded protein sequence is evaluated using two key metrics:

- **Shannon entropy**: quantifies amino acid diversity across the sequence.
  - Values typically range from ~1.0 (low diversity) to ~4.3 (maximum diversity).
  - Higher values (≥ 3.5) suggest diverse, non-repetitive sequences.
  - Lower values (< 2.5) may indicate low-complexity or biologically implausible repeats.
- **Sequence identity (via BLAST)**: measures similarity to known natural proteins.
  - This helps ensure the generated sequences remain evolutionarily grounded while still being novel.
Sequences are filtered based on configurable entropy and identity thresholds to retain those with balanced novelty and biological relevance. Only sequences within the target range are included in downstream analysis and structural validation.
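A minimal sketch of the entropy half of this filter (in bits); the 3.5 cutoff mirrors the "diverse" threshold quoted above, and `sequences` stands in for the decoded FASTA entries:

```python
# Sketch: Shannon entropy of a sequence's amino acid composition.
import math
from collections import Counter

def shannon_entropy(seq: str) -> float:
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())

# Max entropy is log2(20) ≈ 4.32 for a uniform amino acid distribution.
sequences = ["ACDEFGHIKLMNPQRSTVWY", "AAAAAAAAAAAAAAAAAAAA"]
kept = [s for s in sequences if shannon_entropy(s) >= 3.5]   # drops the poly-A repeat
```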
### Step 7d: Local BLAST Validation
Generated sequences are validated by aligning them against a locally downloaded SwissProt database using the `blastp` tool from NCBI BLAST+.
- Tool: `blastp` from the BLAST+ suite
- Target database: SwissProt (downloaded locally in FASTA format)
- Input: decoded sequences (`decoded_embeddiff.fasta`)
- Output: a CSV summary with:
  - Percent identity
  - E-value
  - Alignment length
  - Matched SwissProt accession/description
This step confirms that generated sequences are evolutionarily meaningful by evaluating their similarity to curated natural proteins.
Output example: `data/blast_results/blast_summary_local.csv`
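The repo script `scripts/blastlocal.py` drives this step; a roughly equivalent manual invocation with BLAST+ (the SwissProt FASTA filename and output path are assumptions) would be:

```bash
# Build a local protein BLAST database from the SwissProt FASTA,
# then align the decoded sequences against it in tabular format.
makeblastdb -in swissprot.fasta -dbtype prot -out swissprot_db
blastp -query data/decoded_embeddiff.fasta -db swissprot_db \
       -outfmt "6 qseqid sseqid pident length evalue stitle" \
       -max_target_seqs 1 -out data/blast_results/blast_hits.tsv
```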
### Step 8: HTML Summary Report
- All visualizations, metrics, and links to output files are compiled into an interactive HTML report.
- Includes cosine plots, entropy scatter, identity histograms, and t-SNE/MDS projections.
- Allows easy inspection and sharing of results.
### Optional: Structural Validation
Although not part of the core EmbedDiff pipeline, the generated sequences can optionally be assessed for structural plausibility using modern protein structure prediction tools:
**ESMFold**
- A fast, accurate structure predictor from Meta AI, built on the ESM2 language model.
- Accepts a FASTA file of protein sequences as input and returns predicted 3D structures with per-residue confidence scores (pLDDT).
- Ideal for rapid, large-scale folding of EmbedDiff-generated sequences.
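As a sketch of the ESMFold route (assuming the fair-esm package with its ESMFold extras is installed), one decoded sequence could be folded like this:

```python
# Sketch: fold one generated sequence with ESMFold via fair-esm.
# ESMFold writes per-residue pLDDT into the PDB B-factor column.
import torch
import esm

model = esm.pretrained.esmfold_v1().eval()         # move to .cuda() if a GPU is available
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"     # e.g., one entry from decoded_embeddiff.fasta

with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)

with open("generated_0.pdb", "w") as f:            # illustrative output path
    f.write(pdb_string)
```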
**AlphaFold2**
- The state-of-the-art method from DeepMind for protein structure prediction.
- Provides highly accurate structural models and can be run locally or via cloud platforms.
- More computationally intensive, but offers best-in-class accuracy.
Typical outputs include:
- 3D models (`.pdb`) for each sequence.
- Per-residue confidence scores (e.g., pLDDT or PAE).
- Optional visualizations using standard structure-viewing tools.
These tools provide additional confidence that the generated sequences are likely to fold into stable and ordered protein structures.
### Project Structure

```
EmbedDiff/
├── README.md                      # Project overview and documentation
├── .gitignore                     # Files/folders to exclude from version control
├── run_embeddiff_pipeline.py      # Master pipeline script to run all steps
├── requirements.txt               # Python dependencies for setting up environment
├── environment.yml                # (Optional) Conda environment file (if using Conda)
│
├── data/                          # Input and output biological data
│   ├── curated_thioredoxin_reductase.fasta
│   ├── decoded_embeddiff.fasta
│   └── blast_results/
│       └── blast_summary_local.csv
│
├── embeddings/                    # Latent vector representations
│   ├── esm2_embeddings.npy
│   └── sampled_embeddings.npy
│
├── figures/                       # All generated plots and report
│   ├── fig2b_loss_train_val.png
│   ├── fig3a_generated_tsne.png
│   ├── fig5a_decoder_loss.png
│   ├── fig5b_identity_histogram.png
│   ├── fig5c_entropy_scatter.png
│   ├── fig5d_all_histograms.png
│   ├── fig_tsne_by_domain.png
│   ├── fig5f_tsne_domain_overlay.png
│   ├── fig5b_identity_scores.csv
│   └── embeddiff_summary_report.html
│
├── scripts/                       # Core processing scripts
│   ├── esm_embedder.py            # Step 2a: Embed sequences with ESM2
│   ├── first_tsne_embedding.py    # Step 2b: t-SNE of real embeddings
│   ├── train_emeddiff.py          # Step 3: Train latent diffusion model
│   ├── sample_embeddings.py       # Step 4: Sample new embeddings
│   ├── build_decoder_dataset.py   # Step 5a: Build decoder training set
│   ├── train_transformer.py       # Step 5b: Train decoder
│   ├── transformer_decode.py      # Step 6: Decode embeddings to sequences
│   ├── plot_tsne_class_overlay.py # Step 7a: t-SNE comparison
│   ├── cosine_simlar_histo.py     # Step 7b: Cosine similarity plots
│   ├── plot_entropy_identity.py   # Step 7c: Entropy vs. identity filter
│   ├── blastlocal.py              # Step 7d: Local BLAST alignment
│   └── generate_html_report.py    # Step 8: Generate final HTML report
│
├── models/                        # ML model architectures
│   ├── diffusion_mlp.py           # EmbedDiff diffusion model
│   └── decoder_transformer.py     # Transformer-based decoder
│
├── utils/                         # Utility and helper functions
│   ├── amino_acid_utils.py        # Mapping functions for sequences
│   └── metrics.py                 # Functions for loss, entropy, identity, etc.
│
└── checkpoints/                   # Model checkpoints (excluded via .gitignore)
    ├── embeddiff_mlp.pth
    └── decoder_transformer_best.pth
```
If you use EmbedDiff in your research or development, please consider starring the repo and linking back to it. Citations and backlinks help others find and trust this work.