Module Description:
Ubiquity Extended Team member Amol Kapoor explains how GPT and Large Language Models (LLMs) work: using the concept of "attention," they output a probability distribution over the most likely next word.
Full Transcript:
The last model I wanna talk about is GPT, which I'm sure most of the people in this conversation have heard about. You've probably heard about ChatGPT. The GPT-3 paper was released in June of 2020 by OpenAI, and it really took the AI world by storm. So I wanna talk a little bit about how it works. Most of this stuff directly applies to ChatGPT and GPT-4.
GPT is a language generator, and in particular, it is a large language model, or an LLM. Remember I said at the very beginning that every ML model has three parts: the features, the embeddings, and the losses. Well, for GPT, the input features were just sentences scraped from the internet. The embeddings were transformers; we'll talk about what those are in a second. And the loss was to predict the next word. So if I have some input like "not all heroes wear ___", GPT will actually produce a probability distribution over what the most likely next word is, and it might put, say, a 90% probability on "capes." Now, one caveat here: like I said early on, ML models understand vector representations of data, so we actually have to convert all of these words into numbers first. The way that works is GPT has a really big lookup table of about 50,000 different tokens, and each token maps to one value from 0 to about 50,000. Those token IDs are what get fed into GPT, over and over again.
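To make that concrete, here is a minimal Python sketch of the lookup-table-and-probability idea described above. The vocabulary, token IDs, and probabilities are invented for illustration; a real GPT uses a learned tokenizer with roughly 50,000 entries, and the probability distribution comes out of the model itself.

```python
# Toy illustration: text becomes integer token IDs, and the model's output
# is a probability distribution over the next token. All values here are
# made up; a real GPT vocabulary has ~50,000 tokens.

toy_vocab = {"not": 1012, "all": 77, "heroes": 20341, "wear": 5433,
             "capes": 31007, "hats": 19822, "shoes": 24310}

def tokenize(text):
    """Look up each word in the table to get its integer token ID."""
    return [toy_vocab[word] for word in text.lower().split()]

prompt = "not all heroes wear"
token_ids = tokenize(prompt)              # [1012, 77, 20341, 5433]

# A real model takes the token IDs and returns a probability for every
# token in the vocabulary; here we hard-code a plausible-looking
# distribution over a few candidates.
next_token_probs = {"capes": 0.90, "hats": 0.04, "shoes": 0.01}

most_likely = max(next_token_probs, key=next_token_probs.get)
print(token_ids, "->", most_likely)       # -> capes
```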
We've had language generators in the past; actually, if you've ever used autocorrect or auto-fill on your phone, you know what a language generator looks like. GPT is way bigger and was fed way more data, with an architecture that can handle that data efficiently, so it is really, really, really good. The transformer is the main building block of GPT. The key innovation of transformers is that each layer of the model learns how to pay attention to information in the previous layers. Remember what I said earlier about embedding math? Embedding math is powerful because it allows us to do mathematical operations on concepts. It allows us to add, subtract, filter, and remove information based on what's important. Transformers allow an ML model to do this at every single layer of the model.

A little bit of technical implementation for those of you who are interested. The way a transformer works is you take N embeddings, one for each word, and you pass them through three separate learned weight matrices. You multiply the first two outputs together to get an N-by-N weight matrix, and you normalize those values to sum to one. The normalization is actually the most important part, because if everything in the weight matrix sums to one, then putting more weight in one area of the matrix means there has to be less weight somewhere else. This forces the model to pick and choose, or pay attention to, different pieces of information. When you multiply the weight matrix by your third matrix, you get an attention-weighted output.
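Here is a minimal sketch of that attention step, assuming the standard scaled dot-product formulation (queries, keys, and values). The variable names and the scaling by the square root of the dimension are conventions from the Transformer literature, not anything specific to this talk.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """One attention step over N word embeddings.

    x:             N x d matrix of embeddings, one row per word.
    w_q, w_k, w_v: the three separate learned weight matrices.
    """
    q = x @ w_q                      # first learned projection
    k = x @ w_k                      # second learned projection
    v = x @ w_v                      # third learned projection

    # Multiply the first two outputs to get an N x N weight matrix.
    scores = q @ k.T / np.sqrt(k.shape[-1])

    # Normalize each row to sum to one (a softmax). Putting more weight in
    # one place forces less weight everywhere else.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Multiply the weight matrix by the third matrix to get the
    # attention-weighted output.
    return weights @ v

# Toy usage: 4 words, 8-dimensional embeddings, random weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q = rng.normal(size=(8, 8))
w_k = rng.normal(size=(8, 8))
w_v = rng.normal(size=(8, 8))
out = self_attention(x, w_q, w_k, w_v)   # shape (4, 8)
```

Stacking many of these layers, each with its own learned weight matrices, is what gives the model the layer-by-layer attention described above.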
Now, all of this is conceptually very fuzzy. Attention as a concept is kind of made up. But the reason this works is because during backpropagation, during training, it gives the model flexibility to learn its own transforms without any human intervention. The model is deciding what to pay attention to, not the human. So let's put it all together. You take some text, you convert it into vectors, you pass those vectors through a whole bunch of transformer layers, and you convert the output vectors back into text. Now you have a language generator, and if you make it really, really big and train it over the entire internet, you have GPT.
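As a rough illustration of that end-to-end loop, here is a toy sketch. The vocabulary, tokenize, detokenize, and model_forward are hypothetical stand-ins for the real components (a learned ~50,000-token vocabulary and a deep stack of transformer layers); only the shape of the loop reflects the description above.

```python
import numpy as np

# Schematic of the full loop: text -> token IDs -> model -> text.
# Everything below is a toy stand-in, just to show how the pieces connect.

VOCAB = ["not", "all", "heroes", "wear", "capes", "."]

def tokenize(text):
    return [VOCAB.index(w) for w in text.split()]

def detokenize(ids):
    return " ".join(VOCAB[i] for i in ids)

def model_forward(token_ids):
    # Stand-in for the transformer stack: returns a probability
    # distribution over the whole vocabulary for the next token.
    probs = np.full(len(VOCAB), 0.01)
    if VOCAB[token_ids[-1]] == "wear":
        probs[VOCAB.index("capes")] = 0.90   # "not all heroes wear ___"
    else:
        probs[VOCAB.index(".")] = 0.90       # otherwise, end the sentence
    return probs / probs.sum()

def generate(prompt, n_tokens):
    token_ids = tokenize(prompt)             # text -> token IDs (numbers)
    for _ in range(n_tokens):
        probs = model_forward(token_ids)     # pass through the model
        next_id = int(np.argmax(probs))      # pick the most likely token
        token_ids.append(next_id)            # feed it back in, over and over
    return detokenize(token_ids)             # token IDs -> text

print(generate("not all heroes wear", 2))    # -> "not all heroes wear capes ."
```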