model architecture and model input #3

@Cauliflower61

Description

Thanks for releasing this awesome project and the codebase!
I would like to ask about some details of the model.

  1. Did you drop or randomly select conditions during training? For example, when training with cond 1, cond 2, and cond 3, is cond 1 dropped at a certain ratio?

  2. During training, are all conditions compressed by the same VAE as the video? And when performing self-attention, are they concatenated along the sequence dimension? If so, the computational cost grows quadratically with the total number of tokens, and hence with the number of conditions. Have any corresponding GPU memory optimizations been implemented?

  3. In each DiT block, are cross-attention and the FFN performed separately per stream? For example, with distinct modules like cross_attention_video, cross_attention_cond1, and so on.
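To make questions 1 and 2 concrete, here is a minimal sketch of the two mechanisms being asked about: per-condition random dropout (a common classifier-free-guidance-style training trick) and sequence-dimension concatenation before self-attention. This is only an illustration of the asker's assumptions, not the project's actual implementation; the function names and shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_conditions(cond_tokens, drop_prob=0.1, rng=rng):
    # cond_tokens: list of (L_i, D) token arrays, one per condition stream.
    # Hypothetical sketch: each condition is independently dropped with
    # probability drop_prob during training (question 1 asks whether the
    # authors do something like this).
    return [c for c in cond_tokens if rng.random() >= drop_prob]

def concat_for_self_attention(video_tokens, cond_tokens):
    # Concatenating along the sequence axis means self-attention runs over
    # L_video + sum(L_i) tokens, so its cost is quadratic in that total
    # (the memory concern raised in question 2).
    return np.concatenate([video_tokens] + cond_tokens, axis=0)

video = np.zeros((16, 8))                      # 16 video tokens, dim 8
conds = [np.zeros((4, 8)), np.zeros((4, 8))]   # two condition streams
seq = concat_for_self_attention(video, drop_conditions(conds, drop_prob=0.0))
print(seq.shape)  # (24, 8)
```

With all conditions kept, attention over the joint sequence scales as O((16 + 4 + 4)^2) rather than O(16^2), which is why the question about memory optimizations arises.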
