Description
Hello, thank you for developing Evo2. Creating DNA foundation models trained on such an extensive dataset is truly impressive.
I recently tried to use your 7B parameter model (evo2_7b) for inference on the first chromosome of Arabidopsis thaliana (~30 Mb). I have my own code that feeds the DNA into Evo2 in smaller blocks and extracts the embeddings from the layer 'blocks.28.mlp.l3'. I'm currently only interested in the embeddings, not the final output.
When I test different sequence lengths, for example 2560 bp vs. 92160 bp, the embeddings produced with the two block sizes differ, regardless of whether I use StripedHyena's stateless_forward or stateful_forward. For the stateful function I would have expected the embeddings for the entire chromosome to be nearly identical no matter which block size is used. I measured the similarity with Pearson's correlation, and instead of values around 0.99 I get values ranging from 0.5 to 0.99 (mean: 0.91).
I'm initializing the inference parameters like so:
from evo2 import Evo2
model = Evo2('evo2_7b')
inference_params = model.model.initialize_inference_params(max_seqlen=1048576)
After the data processing I use:
# StripedHyena forward function
model.forward(input_ids, inference_params_dict=inference_params)
If I overlooked how statefulness can be achieved in this repository or vortex, I apologize and would kindly ask you to point me to the right tutorial or code.
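For reference, here is a minimal sketch of how a comparison like the one described above could be computed. It assumes the embeddings from both block sizes have already been concatenated into NumPy arrays of shape (positions, hidden_dim) and that the correlation is taken position by position; the function name, array names, and the synthetic data are illustrative only and are not my exact pipeline.

```python
import numpy as np


def pearson_per_position(emb_a, emb_b):
    """Row-wise Pearson correlation between two (positions, hidden_dim) arrays."""
    if emb_a.shape != emb_b.shape:
        raise ValueError(f"shape mismatch: {emb_a.shape} vs {emb_b.shape}")
    # Center each position's embedding vector around its own mean.
    a = emb_a - emb_a.mean(axis=1, keepdims=True)
    b = emb_b - emb_b.mean(axis=1, keepdims=True)
    # Pearson r = covariance / (std_a * std_b), computed per row.
    num = (a * b).sum(axis=1)
    den = np.sqrt((a * a).sum(axis=1) * (b * b).sum(axis=1))
    return num / den


# Synthetic stand-ins for the embeddings from the two block sizes
# (hypothetical data; real arrays would come from the model).
rng = np.random.default_rng(0)
emb_short_blocks = rng.standard_normal((1000, 64))
emb_long_blocks = emb_short_blocks + 0.5 * rng.standard_normal((1000, 64))

r = pearson_per_position(emb_short_blocks, emb_long_blocks)
print(f"min={r.min():.2f} mean={r.mean():.2f} max={r.max():.2f}")
```

With real data, positions near block boundaries could be inspected separately, since that is where a missing carried-over state would be expected to matter most.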