Diffusion models have become the backbone of modern generative AI, powering everything from Stable Diffusion to DALL-E 3. Their core idea is elegant: systematically destroy data by adding noise, then learn to reverse the process.
The Forward Process
Given a data point x₀ sampled from the real distribution, the forward process produces increasingly noisy versions by adding Gaussian noise:
q(x_t | x_{t-1}) = N(x_t; sqrt(1 - β_t) * x_{t-1}, β_t * I)

A key property is that we can sample any noisy version directly without iterating through all previous steps:

x_t = sqrt(ā_t) * x_0 + sqrt(1 - ā_t) * ε

where ε ~ N(0, I) and ā_t = ∏_{s=1}^{t} (1 - β_s) is the cumulative signal-retention factor. As t → T (typically T = 1000), the signal is completely destroyed and x_T is pure Gaussian noise.
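The closed-form sampling property can be sketched directly in code. This is a minimal sketch: the linear β schedule, the tensor shapes, and the function name `q_sample` are illustrative assumptions, not taken from the text.

```python
import torch

T = 1000
beta = torch.linspace(1e-4, 0.02, T)          # assumed linear noise schedule
alpha_bar = torch.cumprod(1.0 - beta, dim=0)  # ā_t = ∏_{s<=t} (1 - β_s)

def q_sample(x_0, t, eps):
    """Sample x_t ~ q(x_t | x_0) in one shot, without iterating t steps."""
    # Broadcast the per-sample scalar ā_t over the remaining dimensions
    a = alpha_bar[t].view(-1, *([1] * (x_0.dim() - 1)))
    return torch.sqrt(a) * x_0 + torch.sqrt(1.0 - a) * eps

x_0 = torch.randn(4, 3, 32, 32)               # a batch of "images"
t = torch.randint(0, T, (4,))                 # a random timestep per sample
eps = torch.randn_like(x_0)
x_t = q_sample(x_0, t, eps)                   # same shape as x_0
```

Note that with this schedule ā_T is vanishingly small, which is exactly the "signal completely destroyed" statement above.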
The Reverse Process
The generative model learns to reverse this corruption. A neural network (typically a U-Net) predicts the noise that was added at each step:
```python
import torch

def reverse_step(model, x_t, t, alpha_bar, alpha, beta):
    """Single reverse diffusion step: sample x_{t-1} given x_t."""
    eps_pred = model(x_t, t)                        # predicted noise ε_θ(x_t, t)
    coeff1 = 1.0 / torch.sqrt(alpha[t])
    coeff2 = beta[t] / torch.sqrt(1.0 - alpha_bar[t])
    mu = coeff1 * (x_t - coeff2 * eps_pred)         # posterior mean
    if t > 0:
        noise = torch.randn_like(x_t)
        sigma = torch.sqrt(beta[t])                 # σ_t² = β_t variance choice
        return mu + sigma * noise
    return mu                                       # final step is deterministic
```

The Training Objective
The simplified training objective is remarkably elegant:
L = E[||ε - ε_θ(sqrt(ā_t) * x_0 + sqrt(1 - ā_t) * ε, t)||²]

The training loop:
```python
for x_0 in dataloader:
    t = torch.randint(0, T, (batch_size,))   # random timestep per sample
    eps = torch.randn_like(x_0)              # random noise
    # Broadcast the per-sample coefficients over the image dimensions
    x_t = sqrt_alpha_bar[t].view(-1, 1, 1, 1) * x_0 \
        + sqrt_one_minus_alpha_bar[t].view(-1, 1, 1, 1) * eps
    eps_pred = model(x_t, t)
    loss = F.mse_loss(eps_pred, eps)
    optimizer.zero_grad()                    # clear stale gradients before backward
    loss.backward()
    optimizer.step()
```

Connection to Score Matching
The noise prediction network is implicitly learning the score function — the gradient of the log probability density. Song et al. (2021) showed:
∇_{x_t} log q(x_t) ≈ -ε_θ(x_t, t) / sqrt(1 - ā_t)

This unifies diffusion models with Langevin dynamics. The continuous-time formulation describes diffusion as a stochastic differential equation (SDE), and generation as the reverse SDE.
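The score relation above is a one-line conversion in code. This is a sketch under assumptions: `model` stands for any trained noise-prediction network with the `model(x_t, t)` interface used earlier, and `alpha_bar` is the cumulative product ā.

```python
import torch

def score_from_eps(model, x_t, t, alpha_bar):
    """Estimate the score ∇_{x_t} log q(x_t) from the noise prediction ε_θ."""
    eps_pred = model(x_t, t)
    return -eps_pred / torch.sqrt(1.0 - alpha_bar[t])
```

Because the score points toward higher data density, following it (plus noise) is exactly a Langevin-style update, which is what the reverse SDE formalizes.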
Classifier-Free Guidance
For conditional generation (e.g., text-to-image), classifier-free guidance interpolates between conditional and unconditional predictions:
ε_guided = ε_uncond + w * (ε_cond - ε_uncond)

where w is the guidance scale (typically 7-15). During training, the conditioning is randomly dropped with ~10% probability so the model learns both modes. Higher guidance scales produce outputs that match the condition more strongly, at the cost of diversity.
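The guidance formula translates to a few lines at sampling time. A minimal sketch: the interface `model(x_t, t, cond)`, with `cond=None` selecting the unconditional branch, is an assumed convention rather than anything specified in the text.

```python
import torch

def guided_eps(model, x_t, t, cond, w):
    """Classifier-free guidance: blend unconditional and conditional predictions."""
    eps_uncond = model(x_t, t, None)       # conditioning dropped
    eps_cond = model(x_t, t, cond)         # conditioning applied
    # w = 0 gives unconditional sampling, w = 1 plain conditional sampling,
    # w > 1 extrapolates past the conditional prediction
    return eps_uncond + w * (eps_cond - eps_uncond)
```

In practice the two predictions are often computed in one batched forward pass by stacking the conditional and unconditional inputs.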
Why Diffusion Models Work
- Mode coverage: Unlike GANs, diffusion models optimize a likelihood-based objective and are far less prone to mode collapse
- Training stability: Simple regression loss — no adversarial min-max game
- Controllability: Classifier-free guidance and ControlNet provide fine-grained control
- Trade-off: Inference requires hundreds of sequential steps. Distillation techniques (consistency models) reduce this to 1-4 steps
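The inference cost in the trade-off above comes from the sequential denoising loop: one network forward pass per step. The following sketch inlines the same update used in `reverse_step` earlier; the schedule and shapes are illustrative assumptions.

```python
import torch

T = 1000
beta = torch.linspace(1e-4, 0.02, T)          # assumed linear schedule
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)

@torch.no_grad()
def sample(model, shape):
    """Ancestral sampling: start from pure noise and denoise for T steps."""
    x = torch.randn(shape)                    # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):            # t = T-1, ..., 0: one model call each
        eps_pred = model(x, t)
        coeff1 = 1.0 / torch.sqrt(alpha[t])
        coeff2 = beta[t] / torch.sqrt(1.0 - alpha_bar[t])
        mu = coeff1 * (x - coeff2 * eps_pred)
        if t > 0:
            x = mu + torch.sqrt(beta[t]) * torch.randn_like(x)
        else:
            x = mu                            # final step is deterministic
    return x
```

Each of the T iterations requires a full forward pass through the network, which is precisely the cost that distillation and consistency models attack.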