💻 Technical
Series: AI Crash Course

Intro to AI - Part 3: Stable Diffusion for image generation

Amol Kapoor

Module Description:
Ubiquity Extended Team member Amol Kapoor explains how the ML model Stable Diffusion works: a model is trained to successively remove noise from an image, so that when you feed in random noise along with a text prompt, you get a polished image as output.

Full Transcript:
Stable Diffusion is a model that was released in August of 2022. It immediately became incredibly popular for image generation tasks. In fact, I think the entire image generation space has basically been taken over by diffusion models. So it's worth talking a little bit about how they work.



The core idea is that you take an image and you add noise to it multiple times in succession. So you start with a clear, polished image, and then you literally just add random noise to it, and then you train a model to predict how to remove each step of noise. So for example, if I start with this noisy image in the top right and pass it through the model, I get a slightly less noisy image. I then pass that back into the same model and I get a slightly less noisy image again, and I keep doing that until eventually I end up back at my polished image. So the same model gets fed the image over and over, successively removing noise until it has the final output. This is all happening during training. So the model's weights are learning how to traverse an embedding space where images and noisy images are next to each other. Now, when I feed in random noise, I get polished images out.
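To make that loop concrete, here is a minimal, heavily simplified sketch of the train-then-denoise idea in PyTorch. It is a toy DDPM-style model over flattened 32x32 images, not the actual Stable Diffusion architecture (which runs a U-Net over compressed latents); the network size, noise schedule, and all names are illustrative assumptions.

```python
# Toy sketch of the diffusion idea from the transcript (not real Stable Diffusion code):
# add noise to images, train a model to predict that noise, then iteratively denoise.
import torch
import torch.nn as nn

T = 1000                                          # number of noise steps
betas = torch.linspace(1e-4, 0.02, T)             # noise schedule (assumed)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative signal fraction

# Toy denoiser over flattened 32x32 images; real latent diffusion uses a U-Net over latents.
denoiser = nn.Sequential(
    nn.Linear(32 * 32 + 1, 256), nn.ReLU(), nn.Linear(256, 32 * 32))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

def train_step(clean_images):                     # clean_images: (batch, 32*32)
    t = torch.randint(0, T, (clean_images.shape[0],))
    eps = torch.randn_like(clean_images)                      # the noise we add
    a = alphas_bar[t].unsqueeze(1)
    noisy = a.sqrt() * clean_images + (1 - a).sqrt() * eps    # "add random noise"
    inp = torch.cat([noisy, t.unsqueeze(1).float() / T], dim=1)
    loss = ((denoiser(inp) - eps) ** 2).mean()                # learn to predict the noise
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def sample(n=4):
    # Start from pure random noise and repeatedly ask the same model to remove
    # a little bit of noise, ending with a "polished" image.
    x = torch.randn(n, 32 * 32)
    for t in reversed(range(T)):
        t_col = torch.full((n, 1), t / T)
        eps_hat = denoiser(torch.cat([x, t_col], dim=1))
        a, a_bar = 1.0 - betas[t], alphas_bar[t]
        x = (x - (1 - a) / (1 - a_bar).sqrt() * eps_hat) / a.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)      # keep a bit of noise until the end
    return x
```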



One way to think about all of this is, again, to think about the geometry. Remember, I said that one way to build intuition about ML models is to think about what things we want to be neighbors. Well, latent diffusion models create an embedding space where noisy images are neighbors to their cleaned-up counterparts. So what happens when you're actually running a latent diffusion model is that the model is simply walking through the space, trying to find the next cleaner image.

I mentioned earlier that people use CLIP as a way to guide these models, as a way to describe something in text and then generate an image. Well, it turns out that adding text conditioning to these models is as simple as plugging CLIP embeddings into the training process. So now during training, in addition to the input noise, the model also gets some text. The neurons in the model use the text as additional correlation information. As long as the text is descriptive of the final output, you can train the model to use the text to make sure we're going in the right direction. What does this actually look like? Well, at inference, the text serves to guide the output. So one way you can imagine this is by changing the direction of the arrows as the model walks through the embedding geometry. If we have the red description, we end up with a picture of Lenna. If we have the orange description, we end up with a picture of a vase full of flowers. Even though we start in the exact same place, the model traverses the embedding space in a different way because of the text conditioning.
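Here is an equally rough sketch of how text conditioning can be wired in, under the same toy assumptions as before: the denoiser simply receives a text embedding (for example, from CLIP) as extra input, so the same starting noise gets pushed in different directions by different prompts. Real latent diffusion models inject the text through cross-attention inside a U-Net; the shapes, class name, and random stand-in embeddings below are purely illustrative.

```python
# Sketch of text conditioning (illustrative only): the denoiser gets a text
# embedding alongside the noisy image, so different prompts steer the same
# starting noise toward different images.
import torch
import torch.nn as nn

IMG, TXT = 32 * 32, 512   # flattened-image size and text-embedding size (e.g. CLIP-like)

class ConditionalDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG + 1 + TXT, 256), nn.ReLU(), nn.Linear(256, IMG))

    def forward(self, noisy, t, text_emb):
        # The text embedding is just extra input the network can correlate with
        # the target image; real models inject it via cross-attention instead.
        return self.net(torch.cat([noisy, t, text_emb], dim=1))

model = ConditionalDenoiser()
noise = torch.randn(1, IMG)            # the same starting noise...
t = torch.full((1, 1), 0.999)
prompt_a = torch.randn(1, TXT)         # stand-in for an embedding of the "red" description
prompt_b = torch.randn(1, TXT)         # stand-in for an embedding of the "orange" description
step_a = model(noise, t, prompt_a)     # ...gets denoised in different directions
step_b = model(noise, t, prompt_b)     # depending on the conditioning text
```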

Duration:
2 minutes
Series:
AI Crash Course
Startup Stage:
Pre-seed, Seed, Series A
Upload Date:
9/18/2023