- Load an image from a dataset.
- Encode it using the VAE encoder. This gives a latent tensor $z_{0}$.
  - If our image was 512x512 RGB, $z_{0}$ would have the shape (4,64,64).
- Sample a timestep $t$ and Gaussian noise $\epsilon$, then apply noise to $z_{0}$ using the noise scheduler: $x_{t} = \sqrt{\alpha_t}z_{0} + \sqrt{1 - \alpha_{t}}\epsilon$, where $\alpha_t$ is the cumulative noise-schedule coefficient at step $t$ (often written $\bar{\alpha}_t$).
- Feed $x_t$ at timestep $t$, together with the text embedding $c$, into the U-Net to predict $\epsilon$.
- Compute the loss (typically the mean squared error between the predicted and the true $\epsilon$).
- Backpropagate (update the U-Net weights).
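
The training steps above can be sketched without any deep learning framework. This is a minimal illustration, not the code from the notebook: the linear beta schedule, the helper names (`latent_shape`, `cumulative_alphas`, `add_noise`, `mse`), and the 8x downsampling factor are assumptions chosen to match the (4,64,64) example, and the all-zero `eps_pred` is only a placeholder for a real U-Net call.

```python
import math
import random

def latent_shape(height, width, channels=4, downsample=8):
    """Latent shape for an image, assuming a VAE with 8x spatial downsampling."""
    return (channels, height // downsample, width // downsample)

def cumulative_alphas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alphas, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alphas.append(running)
    return alphas

def add_noise(z0, eps, alpha_t):
    """x_t = sqrt(alpha_t) * z0 + sqrt(1 - alpha_t) * eps, element by element."""
    a, b = math.sqrt(alpha_t), math.sqrt(1.0 - alpha_t)
    return [a * z + b * e for z, e in zip(z0, eps)]

def mse(pred, target):
    """Mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

rng = random.Random(0)
alphas = cumulative_alphas()

z0 = [rng.gauss(0.0, 1.0) for _ in range(16)]   # stand-in for a flattened latent
eps = [rng.gauss(0.0, 1.0) for _ in range(16)]  # the noise the U-Net should predict
t = rng.randrange(len(alphas))                  # random training timestep
x_t = add_noise(z0, eps, alphas[t])             # noisy latent fed to the U-Net

eps_pred = [0.0] * len(eps)  # placeholder: a real model would compute unet(x_t, t, c)
loss = mse(eps_pred, eps)    # the quantity that backpropagation would minimize

print(latent_shape(512, 512), t, loss)
```

In a real training loop the latent would be a tensor, `eps_pred` would come from the U-Net, and an optimizer would update the weights from the gradient of `loss`.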
- At inference, the pipeline samples random Gaussian noise $z_{T}$ in the latent space.
- The U-Net, with its learned weights, denoises step by step: at each timestep it predicts the noise, and the scheduler computes the next latent, $z_{t-1} = \text{Scheduler.step}()$.
- After $T$ steps, as $t$ approaches 0, you get a clean latent $z_{0}$.
- The VAE decoder then converts $z_{0}$ into an RGB image $x_{0}$.
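
The denoising loop can be illustrated with a toy example. The sketch below uses a deterministic DDIM-style update as one possible form of the scheduler step; the pipeline in the notebook may use a different scheduler. The function `oracle_eps` is a hypothetical stand-in for the trained U-Net: it cheats by knowing the target latent, which makes it easy to check that the loop ends at a clean $z_{0}$. The linear beta schedule and the 20-step spacing are also assumptions.

```python
import math
import random

def cumulative_alphas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alphas, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alphas.append(running)
    return alphas

def ddim_step(z_t, eps_pred, alpha_t, alpha_prev):
    """One deterministic DDIM-style update from z_t to the previous latent."""
    # Estimate the clean latent implied by the current noise prediction.
    z0_pred = [(z - math.sqrt(1.0 - alpha_t) * e) / math.sqrt(alpha_t)
               for z, e in zip(z_t, eps_pred)]
    # Re-noise that estimate to the (lower) noise level of the previous timestep.
    return [math.sqrt(alpha_prev) * z0 + math.sqrt(1.0 - alpha_prev) * e
            for z0, e in zip(z0_pred, eps_pred)]

def oracle_eps(z_t, alpha_t, z0_true):
    """Hypothetical U-Net stand-in: returns the exact noise for a known z0."""
    return [(z - math.sqrt(alpha_t) * z0) / math.sqrt(1.0 - alpha_t)
            for z, z0 in zip(z_t, z0_true)]

rng = random.Random(0)
alphas = cumulative_alphas()

target = [rng.gauss(0.0, 1.0) for _ in range(16)]  # latent the "model" encodes
z = [rng.gauss(0.0, 1.0) for _ in range(16)]       # start from pure Gaussian noise

timesteps = list(range(len(alphas) - 1, -1, -50))  # 20 steps: 999, 949, ..., 49
for i, t in enumerate(timesteps):
    # After the last step there is no noise left, so alpha_prev is 1.
    alpha_prev = alphas[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
    eps_pred = oracle_eps(z, alphas[t], target)
    z = ddim_step(z, eps_pred, alphas[t], alpha_prev)

max_err = max(abs(a - b) for a, b in zip(z, target))
print(len(timesteps), max_err)
```

With a real U-Net the noise prediction is only approximate, so the result depends on the number of steps and on how the timesteps are spaced. The final latent would then be passed to the VAE decoder to obtain the image.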
Notebook: https://colab.research.google.com/drive/1cMkft2zsIJSDG_yn09G03TMd7qXU6SZh?usp=sharing
AYS EXAMPLE: https://colab.research.google.com/drive/1cIwbbO4HRP1aUQ8WcbQBaT8p3868k7BC?usp=sharing