Goal

The main goal behind diffusion models is to accurately estimate a data distribution. If we have an accurate data distribution, we can simply sample from this distribution to generate new, quality data. This blog will go over the fundamental concepts behind diffusion models that allow them to model data distributions accurately and efficiently.

Energy-based Model (The Past Attempt)

An energy-based model is an attempt to model a learnable probability density function (pdf) with an energy function $f_\theta(\bold{x})$. The formula below shows the relationship: the probability grows as the corresponding energy decreases, and vice versa. This method attempts to model the pdf directly, but one issue is that the normalizing factor $Z_\theta$, which is needed to keep the total area under the distribution equal to 1 ($\int p_\theta(\bold{x})d\bold{x}=1$), is intractable.

$$ p_{\theta}(\bold{x}) = \frac{e^{-f_{\theta}(\bold{x})}}{Z_\theta} $$
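To make the intractability concrete, here is a rough sketch, assuming a toy 1-D quadratic energy in place of a real neural network: $Z_\theta$ is an integral of $e^{-f_\theta(\bold{x})}$ over all of $\bold{x}$, which can be brute-forced in one dimension but is hopeless in high dimensions.

```python
# Minimal sketch: toy 1-D energy-based model (hypothetical quadratic energy,
# standing in for a neural network) to show why Z_theta is the problem.
import torch

def f_theta(x):
    # Hypothetical energy function; in practice this would be a learned network.
    return 0.5 * (x - 1.0) ** 2

# In 1-D we can brute-force Z_theta with a Riemann sum over a grid...
xs = torch.linspace(-10.0, 10.0, 10_000)
dx = xs[1] - xs[0]
unnormalized = torch.exp(-f_theta(xs))      # e^{-f_theta(x)}
Z_theta = (unnormalized * dx).sum()         # ≈ ∫ e^{-f_theta(x)} dx

p_theta = unnormalized / Z_theta            # now ∫ p_theta(x) dx ≈ 1
print(float((p_theta * dx).sum()))          # ~1.0

# ...but for an image with hundreds of thousands of dimensions, a grid-based
# integral like this grows exponentially in the dimension: intractable.
```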

Score-based Model (The Solution)

A clever workaround to the issue of the intractable normalizing constant is to use what is called a score. The score is the gradient of the log of a distribution with respect to the data, and this is what will be learned in a diffusion model:

$$ \bold{s}_\theta(\bold{x}) = \nabla_\bold{x}\log{p(\bold{x})} $$
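To make the definition concrete, here is a small sketch for a toy case where the density is known in closed form: for a standard Gaussian, the score works out to $-\bold{x}$, a vector field pointing back toward the mode. (This is only an illustration; for real data the density, and hence the score, is unknown.)

```python
import torch

# Score of a standard Gaussian N(0, I): log p(x) = -0.5 * ||x||^2 + const,
# so grad_x log p(x) = -x, a vector field pointing toward the mode.
def log_p(x):
    return -0.5 * (x ** 2).sum()

x = torch.tensor([1.5, -0.3], requires_grad=True)
(score,) = torch.autograd.grad(log_p(x), x)

print(score)    # tensor([-1.5000,  0.3000]), i.e. exactly -x
```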

Simplified Visualization of a Score Function from Yang Song’s blog: Generative Modeling by Estimating Gradients of the Data Distribution

So now, we’re trying to learn a vector field instead of directly learning the pdf itself. The reason this works as a workaround is quite clever and simple:

$$ \bold{s}_\theta(\bold{x}) = \nabla_\bold{x}\log{p_\theta(\bold{x})} = \nabla_\bold{x}\log\left(\frac{e^{-f_{\theta}(\bold{x})}}{Z_\theta}\right) = -\nabla_\bold{x}f_\theta(\bold{x}) - \nabla_\bold{x}\log{Z_\theta} $$

Since $Z_\theta$ doesn’t depend on $\bold{x}$, $\nabla_\bold{x}\log{Z_\theta} = 0$:

$$ \bold{s}_\theta(\bold{x}) = -\nabla_\bold{x}f_\theta(\bold{x}) $$

This result not only shows that the score function is independent of the intractable normalizing constant, but also that the score is the negative gradient of the energy.
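This identity is also what makes the score easy to compute in practice: given any energy network $f_\theta$, automatic differentiation gives us $\bold{s}_\theta(\bold{x}) = -\nabla_\bold{x}f_\theta(\bold{x})$ without ever evaluating $Z_\theta$. A rough sketch, assuming a hypothetical tiny MLP as the energy:

```python
import torch
import torch.nn as nn

# Hypothetical tiny energy network f_theta: R^2 -> R (unnormalized).
f_theta = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))

def score(x):
    # s_theta(x) = -grad_x f_theta(x); Z_theta never appears because its
    # gradient with respect to x is zero.
    x = x.detach().requires_grad_(True)
    energy = f_theta(x).sum()
    (grad_x,) = torch.autograd.grad(energy, x, create_graph=True)
    return -grad_x

x = torch.randn(8, 2)      # a batch of 8 two-dimensional points
print(score(x).shape)      # torch.Size([8, 2]): one score vector per point
```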

Now that we know the score is a very convenient way to indirectly model a data distribution, two questions arise:

  1. How do we learn the score function?
  2. How do we use the score to sample/generate new data?

Score Matching

To learn the score function, we try to minimize what is called the Fisher divergence between the true score and our model:

$$ \frac{1}{2}\mathbb{E}_{p(\bold{x})}\left[||\nabla_\bold{x}\log{p(\bold{x})} - \bold{s}_\theta(\bold{x})||_2^2\right] $$
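As a toy illustration of this objective, the sketch below fits a small score network to a 1-D standard Gaussian, where the true score $-\bold{x}$ happens to be known in closed form. (For real data the true score is unknown, which is exactly the problem score matching has to work around.)

```python
import torch
import torch.nn as nn

# Toy illustration of the Fisher divergence objective. We cheat and pick a
# distribution whose true score is known in closed form: for p(x) = N(0, 1),
# grad_x log p(x) = -x.
s_theta = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(s_theta.parameters(), lr=1e-3)

for step in range(2000):
    x = torch.randn(256, 1)             # samples from p(x) = N(0, 1)
    true_score = -x                     # grad_x log p(x) for this toy p
    # Monte-Carlo estimate of (1/2) E_p ||true_score - s_theta(x)||^2
    loss = 0.5 * ((true_score - s_theta(x)) ** 2).sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, s_theta(x) ≈ -x on typical samples.
print(s_theta(torch.tensor([[1.0]])))   # roughly tensor([[-1.0]])
```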