model architecture and model input #3

@Cauliflower61

Description

Thanks for releasing this awesome project and the codebase!
I would like to ask about some details of the model.

  1. Did you drop or randomly select conditions during training? For example, when training with cond 1, cond 2, and cond 3, is cond 1 dropped at a certain ratio?

  2. During training, are all conditions compressed by the same VAE as the video? And when performing self-attention, are they concatenated along the sequence dimension? If so, the computational cost grows quadratically with the total number of tokens, and hence with the number of conditions. Have any corresponding GPU memory optimizations been implemented?

  3. In each DiT block, are cross-attention and the FFN performed separately per stream? For example, with distinct modules like cross_attention_video, cross_attention_cond1, and so on.
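To make questions 1 and 2 concrete, here is a minimal sketch of the two mechanisms being asked about: per-condition random dropout (a common classifier-free-guidance-style training trick) and sequence-dimension concatenation before self-attention. This is only an illustration of the asker's assumptions, not the project's actual implementation; the function names and shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_conditions(cond_tokens, drop_prob=0.1, rng=rng):
    # cond_tokens: list of (L_i, D) token arrays, one per condition stream.
    # Hypothetical sketch: each condition is independently dropped with
    # probability drop_prob during training (question 1 asks whether the
    # authors do something like this).
    return [c for c in cond_tokens if rng.random() >= drop_prob]

def concat_for_self_attention(video_tokens, cond_tokens):
    # Concatenating along the sequence axis means self-attention runs over
    # L_video + sum(L_i) tokens, so its cost is quadratic in that total
    # (the memory concern raised in question 2).
    return np.concatenate([video_tokens] + cond_tokens, axis=0)

video = np.zeros((16, 8))                      # 16 video tokens, dim 8
conds = [np.zeros((4, 8)), np.zeros((4, 8))]   # two condition streams
seq = concat_for_self_attention(video, drop_conditions(conds, drop_prob=0.0))
print(seq.shape)  # (24, 8)
```

With all conditions kept, attention over the joint sequence scales as O((16 + 4 + 4)^2) rather than O(16^2), which is why the question about memory optimizations arises.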
