Module Description:
Ubiquity Extended Team member Amol Kapoor continues this Intro to AI Series with an explanation of the CLIP model for guided image generation. The model learns an embedding space where images and their descriptive text are near one another.
Full Transcript:
The first one I'd like to start with is a model called CLIP. This model was released in January of 2021 by OpenAI. The paper it is based on is called Learning Transferable Visual Models from Natural Language Supervision.
The core idea behind CLIP is to create an embedding space where images and the text that describes those images are neighbors. The way we do this is, during training, we iterate over many image-text pairs, and the model tries to put those pairs close together, thereby learning the geometry. As an example, if I had a picture of my dog and I fed that into CLIP, I would get a point in some space, and if I fed in a description of that picture, "Photo of Lego, Amol's Dog," I would get a different point in space. What the model tries to do is adjust its internal weights so those two points are really close together. Once we have the embedding space, in other words, once we've trained our model, we can do things like classify images. We simply embed a bunch of different labels and some source image and see which label is closest. So for example, if I feed in this clip art image and I pass in a few labels like House, Dog, Pizza, and Lamp, all of these become points in some geometry. And then we can calculate the distance. We can do embedding math. So I can say, oh, well, this green dot is closest to the red dot, so this must be a picture of a house.
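To make that classification example concrete, here is a minimal sketch using OpenAI's open-source clip package. The image file name and label strings are placeholders, and "ViT-B/32" is just one of the released model sizes; this is an illustration of the idea, not code from the talk.

```python
# Zero-shot classification sketch with the open-source `clip` package
# (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and labels, standing in for the clip art example.
image = preprocess(Image.open("clipart.png")).unsqueeze(0).to(device)
labels = ["a photo of a house", "a photo of a dog",
          "a photo of a pizza", "a photo of a lamp"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    # Embed the image and each label as points in the shared space.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Normalize so the dot product is cosine similarity ("embedding math").
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    similarity = image_features @ text_features.T  # shape: (1, num_labels)

# The label whose point is closest to the image's point wins.
print("Predicted label:", labels[similarity.argmax().item()])
```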
Another cool thing we can do with this is guided generation. Let's say we have a way to generate images deterministically from a set of numbers; it doesn't matter how. If you know what GANs are, this is a very popular thing to do with GANs. But for all intents and purposes, we just have a random input vector that produces a random image. As we already know, CLIP can take a piece of text and convert it into a point in space, it can take an image and convert that into a point in space, and we can calculate the distance between those two points. What's really cool is that we can then take that distance and use it to calculate a change in the input vector. So the farther apart these two points are, the more we need to update the input. And if you do this repeatedly, so the points get closer and closer together, you get guided image generation. You start with a description, Photo of Tom Cruise, and the model iteratively traverses the embedding space, going through all of the different images, until eventually it gets to a point where the source image is really close to the text description, a Photo of Tom Cruise.

This whole field is called Representation Learning. The reason it's called representation learning is because, well, we are learning representations of data. The Holy Grail for this field is to discover an embedding space that works for a huge array of tasks across all data types, and I think CLIP is a critical step in that direction. The process of learning image-text pairs leads to the model understanding something more fundamental about information. And as a result, people have successfully used CLIP embeddings as hooks to all kinds of ML-powered interfaces, including Stable Diffusion.
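Here is a toy sketch of the guided-generation loop described above. The "generator" is a deliberately simple stand-in (a single linear layer mapping a latent vector to pixels); in practice you would plug in a pretrained GAN or diffusion model. The prompt, latent size, learning rate, and step count are illustrative assumptions. The point is the shape of the loop: embed the text once, then repeatedly adjust the input vector so the generated image's embedding moves toward the text's embedding.

```python
import torch
import clip

device = "cpu"  # CPU keeps CLIP in fp32, matching the fp32 toy generator
model, _ = clip.load("ViT-B/32", device=device)

# Stand-in differentiable "generator": latent vector -> 3x224x224 image.
latent_dim = 128
generator = torch.nn.Sequential(
    torch.nn.Linear(latent_dim, 3 * 224 * 224),
    torch.nn.Sigmoid(),
    torch.nn.Unflatten(1, (3, 224, 224)),
).to(device)

# Embed the text description once; this point in space is the target.
text = clip.tokenize(["Photo of Tom Cruise"]).to(device)
with torch.no_grad():
    text_features = model.encode_text(text)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Start from a random input vector and repeatedly nudge it so the
# generated image's embedding moves toward the text's embedding.
z = torch.randn(1, latent_dim, device=device, requires_grad=True)
optimizer = torch.optim.Adam([z], lr=0.05)

for step in range(200):
    image = generator(z)  # random latent -> image
    # (A real pipeline would also apply CLIP's pixel normalization here.)
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    # Loss is the distance between the image point and the text point;
    # the farther apart they are, the larger the update to the input vector.
    loss = 1.0 - (image_features * text_features).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Only the latent vector z is optimized here; swapping the toy generator for a strong pretrained one is what makes this loop produce recognizable images in practice.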