
MACHINE LEARNING

The Physics Principle That Inspired Modern AI Art

By ANIL ANANTHASWAMY

January 5, 2023

Diffusion models generate incredible images by learning to reverse the process that, among other things, causes ink to
spread through water.

Samuel Velasco/Quanta Magazine; source: Shutterstock

Ask DALL·E 2, an image generation system created by OpenAI, to paint a picture of “goldfish
slurping Coca-Cola on a beach,” and it will spit out surreal images of exactly that. The
program would have encountered images of beaches, goldfish and Coca-Cola during training,
but it’s highly unlikely it would have seen one in which all three came together. Yet DALL·E 2 can
assemble the concepts into something that might have made Dalí proud.

DALL·E 2 is a type of generative model — a system that attempts to use training data to generate
something new that’s comparable to the data in terms of quality and variety. This is one of the hardest
problems in machine learning, and getting to this point has been a difficult journey.
The first important generative models for images used an approach to artificial intelligence called a
neural network — a program composed of many layers of computational units called artificial neurons.
But even as the quality of their images got better, the models proved unreliable and hard to train.
Meanwhile, a powerful generative model — created by a postdoctoral researcher with a passion for
physics — lay dormant, until two graduate students made technical breakthroughs that brought the
beast to life.

DALL·E 2 is such a beast. The key insight that makes DALL·E 2’s images possible — as well as those of
its competitors Stable Diffusion and Imagen — comes from the world of physics. The system that
underpins them, known as a diffusion model, is heavily inspired by nonequilibrium thermodynamics,
which governs phenomena like the spread of fluids and gases. “There are a lot of techniques that were
initially invented by physicists and now are very important in machine learning,” said Yang Song, a
machine learning researcher at OpenAI.

The power of these models has rocked industry and users alike. “This is an exciting time for generative
models,” said Anima Anandkumar, a computer scientist at the California Institute of Technology and
senior director of machine learning research at Nvidia. And while the realistic-looking images created
by diffusion models can sometimes perpetuate social and cultural biases, she said, “we have
demonstrated that generative models are useful for downstream tasks [that] improve the fairness of
predictive AI models.”

High Probabilities

To understand how generating image data works, let’s start with a simple image made of just two
adjacent grayscale pixels. We can fully describe this image with two values, based on each pixel’s shade
(from zero being completely black to 255 being completely white). You can use these two values to plot
the image as a point in 2D space.

If we plot multiple images as points, clusters may emerge — certain images and their corresponding
pixel values that occur more frequently than others. Now imagine a surface above the plane, where the
height of the surface corresponds to how dense the clusters are. This surface maps out a probability
distribution. You’re most likely to find individual data points underneath the highest parts of the
surface, and least likely to find them where the surface is lowest.

DALL·E 2 produced these images of “goldfish slurping Coca-Cola on a beach.” The program, created by
OpenAI, had likely never encountered similar images, but could still generate them on its own.

DALL·E 2

Now you can use this probability distribution to generate new images. All you need to do is randomly
generate new data points while adhering to the restriction that you generate more probable data more
often — a process called “sampling” the distribution. Each new point is a new image.
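
To make this concrete, here is a minimal sketch, assuming a made-up collection of two-pixel images and using SciPy's kernel density estimator as a stand-in for the probability surface:

```python
# Toy sketch: two-pixel images as 2D points, a density fit, and sampling.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Pretend training set: two clusters of two-pixel images, values in [0, 255].
cluster_a = rng.normal(loc=[40, 60], scale=10, size=(500, 2))
cluster_b = rng.normal(loc=[200, 180], scale=15, size=(500, 2))
images = np.clip(np.vstack([cluster_a, cluster_b]), 0, 255)

# The "surface above the plane": a kernel density estimate over the points.
density = gaussian_kde(images.T)

# "Sampling" the distribution: probable pixel pairs come up more often.
new_images = density.resample(5, seed=1).T
print(np.round(new_images))  # each row is a new two-pixel image
```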

The same analysis holds for more realistic grayscale photographs with, say, a million pixels each. Only
now, plotting each image requires not two axes, but a million. The probability distribution over such
images will be some complex million-plus-one-dimensional surface. If you sample that distribution,
you’ll produce a million pixel values. Print those pixels on a sheet of paper, and the image will likely
look like a photo from the original data set.

The challenge of generative modeling is to learn this complicated probability distribution for some set
of images that constitute training data. The distribution is useful partly because it captures extensive
information about the data, and partly because researchers can combine probability distributions over
different types of data (such as text and images) to compose surreal outputs, such as a goldfish
slurping Coca-Cola on a beach. “You can mix and match different concepts … to create entirely new
scenarios that were never seen in training data,” said Anandkumar.

In 2014, a model called a generative adversarial network (GAN) became the first to produce realistic
images. “There was so much excitement,” said Anandkumar. But GANs are hard to train: They may not
learn the full probability distribution and can get locked into producing images from only a subset of
the distribution. For example, a GAN trained on images of a variety of animals may generate only
pictures of dogs.

Machine learning needed a more robust model. Jascha Sohl-Dickstein, whose work was inspired by
physics, would provide one.

Blobs of Excitement

Around the time GANs were invented, Sohl-Dickstein was a postdoc at Stanford University working on
generative models, with a side interest in nonequilibrium thermodynamics. This branch of physics
studies systems not in thermal equilibrium — those that exchange matter and energy internally and
with their environment.

An illustrative example is a drop of blue ink diffusing through a container of water. At first, it forms a
dark blob in one spot. At this point, if you want to calculate the probability of finding a molecule of ink
in some small volume of the container, you need a probability distribution that cleanly models the
initial state, before the ink begins spreading. But this distribution is complex and thus hard to sample
from.

Eventually, however, the ink diffuses throughout the water, making it pale blue. This leads to a much
simpler, more uniform probability distribution of molecules that can be described with a
straightforward mathematical expression. Nonequilibrium thermodynamics describes the probability
distribution at each step in the diffusion process. Crucially, each step is reversible — with small enough
steps, you can go from a simple distribution back to a complex one.
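
The spreading half of that story is easy to simulate. The sketch below is a toy illustration rather than a physical model of ink in water: it gives a tight blob of points many small random kicks and watches it smear into a broad, featureless cloud.

```python
# Toy illustration of the ink analogy: a tight blob of points spreads out
# under many small random kicks and ends up as a broad, featureless cloud.
import numpy as np

rng = np.random.default_rng(0)
molecules = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(10_000, 2))  # the dark blob

for _ in range(1_000):                       # many small diffusion steps
    molecules += rng.normal(scale=0.05, size=molecules.shape)

# The spread-out cloud is easy to describe: roughly Gaussian around (5, 5).
print(molecules.mean(axis=0), molecules.std(axis=0))
```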

Jascha Sohl-Dickstein created a new approach for generative modeling based on the principles of
diffusion.

Asako Miyakawa

Sohl-Dickstein used the principles of diffusion to develop an algorithm for generative modeling. The
idea is simple: The algorithm first turns complex images in the training data set into simple noise —
akin to going from a blob of ink to diffuse light blue water — and then teaches the system how to
reverse the process, turning noise into images.

Here’s how it works. First, the algorithm takes an image from the training set. As before, let’s say that
each of the million pixels has some value, and we can plot the image as a dot in million-dimensional
space. The algorithm adds some noise to each pixel at every time step, equivalent to the diffusion of ink
after one small time step. As this process continues, the values of the pixels bear less of a relationship
to their values in the original image, and the pixels look more like a simple noise distribution. (The
algorithm also nudges each pixel value a smidgen toward the origin, the zero value on all those axes, at
each time step. This nudge prevents pixel values from growing too large for computers to easily work
with.)

Do this for all images in the data set, and an initial complex distribution of dots in million-dimensional
space (which cannot be described and sampled from easily) turns into a simple, normal distribution of
dots around the origin.

“The sequence of transformations very slowly turns your data distribution into just a big noise ball,”
said Sohl-Dickstein. This “forward process” leaves you with a distribution you can sample from with
ease.
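
In code, the forward process can be sketched roughly as follows, assuming pixel values scaled to about [-1, 1] and a single fixed noise level per step (real systems vary the noise level over time):

```python
# Sketch of the forward (noising) process, assuming pixel values scaled to
# roughly [-1, 1]. Each step shrinks values slightly toward zero and adds
# a small amount of Gaussian noise, keeping the overall variance bounded.
import numpy as np

rng = np.random.default_rng(0)

def forward_diffuse(image, n_steps=1000, beta=0.005):
    x = image.astype(np.float64)
    for _ in range(n_steps):
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.normal(size=x.shape)
    return x

image = rng.uniform(-1, 1, size=(64, 64))   # stand-in for a training image
noise_ball = forward_diffuse(image)
print(noise_ball.mean(), noise_ball.std())  # close to 0 and 1: a "big noise ball"
```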

Yang Song helped come up with a novel technique to generate images by training a network to
effectively unscramble noisy images.

Courtesy of Yang Song

Next is the machine learning part: Give a neural network the noisy images obtained from a forward
pass and train it to predict the less noisy images that came one step earlier. It’ll make mistakes at first,
so you tweak the parameters of the network so it does better. Eventually, the neural network can
reliably turn a noisy image, which is representative of a sample from the simple distribution, all the
way into an image representative of a sample from the complex distribution.
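
Here is a hedged sketch of that training loop, on stand-in data and with a deliberately tiny network (real diffusion models also tell the network which step it is looking at, and typically predict the added noise rather than the previous image):

```python
# Hedged sketch of the training step: the network sees a noisy image and is
# scored on how well it predicts the slightly less noisy image one step earlier.
import torch
from torch import nn

torch.manual_seed(0)
dim = 64 * 64
model = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, dim))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
beta = 0.005

images = torch.rand(128, dim) * 2 - 1            # stand-in training images
for _ in range(200):
    x = images.clone()
    t = int(torch.randint(1, 200, (1,)))         # random depth of noising
    for _ in range(t):
        x_prev = x
        x = (1 - beta) ** 0.5 * x + beta ** 0.5 * torch.randn_like(x)
    pred = model(x)                              # guess the less noisy image
    loss = ((pred - x_prev) ** 2).mean()         # how wrong was the guess?
    opt.zero_grad()
    loss.backward()                              # tweak the network's parameters
    opt.step()
```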

The trained network is a full-blown generative model. Now you don’t even need an original image on
which to do a forward pass: You have a full mathematical description of the simple distribution, so you
can sample from it directly. The neural network can turn this sample — essentially just static — into a
final image that resembles an image in the training data set.
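
Generation then needs no starting image at all. Continuing the toy sketch above (and reusing its `model` and `dim`), a deliberately simplified sampling loop looks like this:

```python
# Continuing the toy sketch: generate from pure static by repeatedly asking
# the trained network for the previous, less noisy image.
with torch.no_grad():
    x = torch.randn(1, dim)          # a sample from the simple distribution
    for _ in range(200):             # walk it back toward the data distribution
        x = model(x)
    generated = x.clamp(-1, 1).reshape(64, 64)
```

This loop is deliberately bare-bones: practical samplers re-inject a small amount of fresh noise at each step and tell the network how far along the reversal it is.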

Sohl-Dickstein recalls the first outputs of his diffusion model. “You’d squint and be like, ‘I think that
colored blob looks like a truck,’” he said. “I’d spent so many months of my life staring at different
patterns of pixels and trying to see structure that I was like, ‘This is way more structured than I’d ever
gotten before.’ I was very excited.”

Envisioning the Future

Sohl-Dickstein published his diffusion model algorithm in 2015, but it was still far behind what GANs
could do. While diffusion models could sample over the entire distribution and never get stuck spitting
out only a subset of images, the images looked worse, and the process was much too slow. “I don’t
think at the time this was seen as exciting,” said Sohl-Dickstein.

It would take two students, neither of whom knew Sohl-Dickstein or each other, to connect the dots
from this initial work to modern-day diffusion models like DALL·E 2. The first was Song, a doctoral
student at Stanford at the time. In 2019, he and his adviser published a novel method for building
generative models that didn’t estimate the probability distribution of the data (the high-dimensional
surface). Instead, it estimated the gradient of the distribution (think of it as the slope of the high-
dimensional surface).

Song found his technique worked best if he first perturbed each image in the training data set with
increasing levels of noise, then asked his neural network to predict the original image using gradients
of the distribution, effectively denoising it. Once trained, his neural network could take a noisy image
sampled from a simple distribution and progressively turn that back into an image representative of
the training data set. The image quality was great, but his machine learning model was painfully slow
to sample. And he did this with no knowledge of Sohl-Dickstein’s work. “I was not aware of diffusion
models at all,” said Song. “After our 2019 paper was published, I received an email from Jascha. He
pointed out to me that [our models] have very strong connections.”
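
In compressed form, and with only a single noise level rather than the increasing levels Song used, the idea looks roughly like this: train a network whose output approximates the gradient of the log-density of noised data, then generate by taking small gradient steps with a little added noise (Langevin dynamics). The data and network below are toy stand-ins:

```python
# Compressed sketch of score-based modeling with a single noise level
# (Song used a whole range of increasing noise levels). The network is
# trained so its output approximates the gradient of the log-density.
import torch
from torch import nn

torch.manual_seed(0)
sigma = 0.3
score_net = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(score_net.parameters(), lr=1e-3)

data = torch.randn(2000, 2) * 0.2 + torch.tensor([2.0, -1.0])   # toy "images"
for _ in range(2000):
    eps = torch.randn_like(data)
    noisy = data + sigma * eps
    # Denoising score matching: the ideal output at `noisy` is -eps / sigma.
    loss = ((score_net(noisy) + eps / sigma) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Langevin dynamics: follow the learned gradient, adding a little noise each step.
x = torch.randn(500, 2)
step = 0.01
with torch.no_grad():
    for _ in range(500):
        x = x + 0.5 * step * score_net(x) + step ** 0.5 * torch.randn_like(x)
print(x.mean(dim=0))   # if training went well, samples cluster near (2, -1)
```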

In 2020, the second student saw those connections and realized that Song’s work could improve Sohl-
Dickstein’s diffusion models. Jonathan Ho had recently finished his doctoral work on generative
modeling at the University of California, Berkeley, but he continued working on it. “I thought it was the
most mathematically beautiful subdiscipline of machine learning,” he said.

Ho redesigned and updated Sohl-Dickstein’s diffusion model with some of Song’s ideas and other
advances from the world of neural networks. “I knew that in order to get the community’s attention, I
needed to make the model generate great-looking samples,” he said. “I was convinced that this was
the most important thing I could do at the time.”

His intuition was spot on. Ho and his colleagues announced this new and improved diffusion model in
2020, in a paper titled “Denoising Diffusion Probabilistic Models.” It quickly became such a landmark
that researchers now refer to it simply as DDPM. According to one benchmark of image quality —
which compares the distribution of generated images to the distribution of training images — these
models matched or surpassed all competing generative models, including GANs. It wasn’t long before
the big players took notice. Now, DALL·E 2, Stable Diffusion, Imagen and other commercial models all
use some variation of DDPM.

Jonathan Ho and his colleagues combined Sohl-Dickstein and Song’s methods to make possible
modern diffusion models, such as DALL·E 2.

Courtesy of Jonathan Ho

Modern diffusion models have one more key ingredient: large language models (LLMs), such as GPT-3.
These are generative models trained on text from the internet to learn probability distributions over
words instead of images. In 2021, Ho — now a research scientist at a stealth company — and his
colleague Tim Salimans at Google Research, along with other teams elsewhere, showed how to
combine information from an LLM and an image-generating diffusion model to use text (say,
“goldfish slurping Coca-Cola on a beach”) to guide the process of diffusion and hence image
generation. This process of “guided diffusion” is behind the success of text-to-image models, such as
DALL·E 2.
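
One widely used guidance recipe, classifier-free guidance (introduced by Ho and Salimans), blends the model's predictions with and without the text condition at every denoising step. The sketch below assumes a hypothetical denoiser and text encoder:

```python
# Hedged sketch of one denoising step with classifier-free guidance.
# `denoiser` and `text_embedding` are hypothetical stand-ins for a trained
# diffusion model and the output of a language model's text encoder.

def guided_noise_estimate(denoiser, x_t, t, text_embedding, guidance_scale=7.5):
    """Blend the model's predictions with and without the text condition,
    pushing the sample toward images that match the prompt."""
    eps_uncond = denoiser(x_t, t, condition=None)
    eps_cond = denoiser(x_t, t, condition=text_embedding)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Hypothetical usage at every step of the reverse process:
#   eps = guided_noise_estimate(denoiser, x_t, t,
#                               embed_text("goldfish slurping Coca-Cola on a beach"))
#   ...then take one reverse-diffusion step using eps, and repeat.
```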

“They are way beyond my wildest expectations,” said Ho. “I’m not going to pretend I saw all this
coming.”

Generating Problems

As successful as these models have been, images from DALL·E 2 and its ilk are still far from perfect.
Large language models can reflect cultural and societal biases, such as racism and sexism, in the text
they generate. That’s because they are trained on text taken off the internet, and often such texts
contain racist and sexist language. LLMs that learn a probability distribution over such text become
imbued with the same biases. Diffusion models are also trained on un-curated images taken off the
internet, which can contain similarly biased data. It’s no wonder that combining LLMs with today’s
diffusion models can sometimes result in images reflective of society’s ills.

Anandkumar has firsthand experience. When she tried to generate stylized avatars of herself using a
diffusion model–based app, she was shocked. “So [many] of the images were highly sexualized,” she
said, “whereas the things that it was presenting to men weren’t.” She’s not alone.

These biases can be lessened by curating and filtering the data (an extremely difficult task, given the
immensity of the data set), or by putting checks on both the input prompts and the outputs of these
models. “Of course, nothing is a substitute for carefully and extensively safety-testing” a model, Ho
said. “This is an important challenge for the field.”

Despite such concerns, Anandkumar believes in the power of generative modeling. “I really like
Richard Feynman’s quote: ‘What I cannot create, I do not understand,’” she said. An increased
understanding has enabled her team to develop generative models to produce, for example, synthetic
training data of under-represented classes for predictive tasks, such as darker skin tones for facial
recognition, helping improve fairness. Generative models may also give us insights into how our brains
deal with noisy inputs, or how they conjure up mental imagery and contemplate future action. And
building more sophisticated models could endow AIs with similar capabilities.

“I think we are just at the beginning of the possibilities of what we can do with generative AI,” said
Anandkumar.
