Stateful forward does not result in identical embeddings with different sequence lengths #196

@felicitas215

Hello, thank you for developing Evo2. Creating DNA foundation models trained on such an extensive dataset is truly impressive.

I recently tried to use your 7B-parameter model (evo2_7b) for inference on the first chromosome of Arabidopsis thaliana (~30 Mb). I have my own code that feeds the DNA into Evo2 in smaller blocks and extracts the embeddings from the layer 'blocks.28.mlp.l3'. I'm currently only interested in the embeddings, not the final output. When I test different sequence lengths, for example 2560 bp vs. 92160 bp, the embeddings produced with the two block sizes differ, regardless of whether I use StripedHyena's stateless_forward or stateful_forward. For the stateful function, I would have expected the embeddings for the entire chromosome to be nearly identical no matter which sequence length is used. I measured the similarity with Pearson's correlation, and instead of values around 0.99 I find values ranging from 0.5 to 0.99 (mean: 0.91). I'm initializing the inference parameters like so:

from evo2 import Evo2
model = Evo2('evo2_7b')
inference_params = model.model.initialize_inference_params(max_seqlen=1048576)

After the data processing, I call the forward function on each block:

# StripedHyena forward function
model.forward(input_ids, inference_params_dict=inference_params)
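The expectation behind this report, that carrying state across blocks should make the block size irrelevant, can be illustrated with a toy model. The sketch below is not Evo2 or StripedHyena code: `toy_stateful_forward`, `run_in_blocks`, and `pearson` are hypothetical helpers, and a simple linear recurrence stands in for the real network. It only shows the property being tested: with the state carried over, two block sizes give identical outputs, while resetting the state at every block boundary does not.

```python
import math

def toy_stateful_forward(block, state, decay=0.9):
    """Toy recurrence h_t = decay * h_{t-1} + x_t (a stand-in, not the real model).

    Returns the per-position outputs and the final state.
    """
    outputs = []
    for x in block:
        state = decay * state + x
        outputs.append(state)
    return outputs, state

def run_in_blocks(sequence, block_size, carry_state):
    """Process `sequence` block by block, optionally carrying state across blocks."""
    state = 0.0
    outputs = []
    for start in range(0, len(sequence), block_size):
        if not carry_state:
            state = 0.0  # stateless: every block starts from scratch
        block = sequence[start:start + block_size]
        block_out, state = toy_stateful_forward(block, state)
        outputs.extend(block_out)
    return outputs

def pearson(a, b):
    """Pearson correlation of two equal-length lists of floats."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Deterministic toy input standing in for a tokenized sequence.
sequence = [float((i * 7) % 5) for i in range(200)]

# Same input, two block sizes, with and without carried state.
small_stateful = run_in_blocks(sequence, block_size=10, carry_state=True)
large_stateful = run_in_blocks(sequence, block_size=50, carry_state=True)
small_stateless = run_in_blocks(sequence, block_size=10, carry_state=False)
large_stateless = run_in_blocks(sequence, block_size=50, carry_state=False)

print("stateful identical:", small_stateful == large_stateful)
print("stateful Pearson:  ", pearson(small_stateful, large_stateful))
print("stateless Pearson: ", pearson(small_stateless, large_stateless))
```

In this toy setting, a correlation clearly below 1 between block sizes indicates that state is being reset (or not fully carried) at block boundaries. Whether the same reasoning applies to every layer of the real model is an assumption.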

If I overlooked how statefulness can be achieved in this repository or vortex, I apologize and would kindly ask you to point me to the right tutorial or code.
