💻 Technical
Series: AI Crash Course

Dissecting Generative AI Agents: Voyager

Amol Kapoor

Module Description:
Ubiquity Extended Team member Amol Kapoor explains how LLMs (large language models) can be thought of as a higher-level programming language (vs. C, C++, etc.). Voyager is an agent that uses multiple LLMs to play Minecraft given plain-English direction.

Full Transcript:
So I'm gonna skip some of the background 'cause there's a lot of things I wanna cover. Unlike previous talks of mine that you might've seen or heard, this one is structured more like a journal club presentation. For those of you who aren't familiar, journal club is a pretty popular practice in academia: a member of a lab reviews a single paper in order to keep everybody else up to date on what's happening in the field. And I have been obsessed with this one paper that came out about six, seven months ago. So I just wanted to dive into it with you guys and talk about why I think it's so cool and what I think it's gonna do to the future of AI.



So the paper is called "Voyager." The full title is "Voyager: An Open-Ended Embodied Agent with Large Language Models." In order to understand this paper, I wanna go high level. I wanna talk a little bit about what programming even is. And I actually went to the dictionary for this. A computer program is a set of instructions that a computer can execute. Computers understand Bytecode. Bytecode is a small set of commands that perform basic mathematical operations and manipulate the storage of numbers. Originally, programmers would write programs in Bytecode, and that was pretty miserable, like that was terrible. So over time, people developed programming languages: ways of expressing computation that can be turned into Bytecode while being easier to read. So in the modern day, in 2023, you end up with a stack that looks like this, where you have high-level programming languages that are generally slower but easier to understand, compared to lower-level languages that operate closer to the bare metal, closer to where the computer is actually able to understand things. And these languages compile into other languages, making it easier over time for us to program. The trade-off between computer speed and developer speed generally makes a lot of sense in a world where computation is pretty cheap but developers are very expensive.

So I wanna quickly go through the stack, just so everybody can get an intuition for what it means to program at each of these different levels. This is Bytecode. Like I said, pretty terrible, right? This is miserable. Nobody wants to program like this. There was a time when people actually did program just like this, and there were maybe a handful of people in the entire world who could actually do it, like real experts. Pretty soon after Bytecode we get a language called Assembly. And Assembly is really important, because it's one of the first languages that actually looks like a programming language. It's still dealing with the manipulation and storage of numbers, but you can start to see the semblance of modern programming. On top of Assembly you get a language called C. Hopefully some folks in the room actually know how to program in C. C is really cool 'cause it's the first thing that actually looks like a modern language, one that some people in the room might actually use today. And pretty soon after C, you get a language called C++. Now what's fascinating about C++ is that the earliest C++ compilers actually just spat out C. So you would write something in C++ and compile it into C, which would compile into Assembly, which would compile into Bytecode, right? So you're starting to get this stack, and you can start to see how the safety guarantees of C or C++ might be stronger than what you got in Assembly or Bytecode, in exchange for being slower (but much more readable). On top of C++, you get this language called JavaScript. Now hopefully everybody in the room knows JavaScript, right? This is the language that runs in everybody's browsers. It's probably the most popular language in the world. There's not that many good things to say about JavaScript, but one thing is that it's very close to English. console.log("Hello World"): you don't have to know how to program to know what this thing does. It prints Hello World to some console.
So JavaScript, if you're running it in Chrome, is running in the V8 engine, which is entirely programmed in C++. So you have a JavaScript runtime that is implemented in C++, which is compiled into C, compiled into Assembly, which eventually turns into Bytecode. And of course you don't have to stop here, right? You can use something like TypeScript, which is what we use at Soot. TypeScript is a flavor of JavaScript. It compiles into JavaScript, which compiles into everything else. And you can do something even crazier, like WASM, which I'm not really gonna get into, but on top of WASM you could build a language like C. So you could have C, which compiles into WebAssembly, which compiles into JavaScript, which compiles into C++, and so on and so forth. Why do I care about this?



Why have I taken you on this journey? Well, for two reasons. The first is that I wanna highlight how little the syntax actually matters here. The important thing is not the syntax. What's important is that you have a route to convert back to Bytecode. As long as you have a way to get there, it doesn't matter what you're writing on top; you can write anything. There just has to be a consistent way to get back to Bytecode in order for you to be able to automate it. That is a programming language. The second thing I wanna highlight is the arc of history here. As we develop newer and newer programming languages, we get languages that are closer to English; we reduce the gap between what it takes to understand something and write it down, and what the computer actually needs to compute. I think that's really, really important when we start talking about large language models.



So what is a large language model? An LLM is an AI that is trained to take text and predict the next most likely word in a sequence. And as it turns out, many problems can actually be phrased as inputting and outputting text, so large language models end up acting as fairly generic problem solvers. My favorite example of this is chess. If you put in chess notation, you will get out chess notation, because if you're feeding in chess notation, the next most likely word in the sequence is more chess. And this actually allows you to play chess against a large language model, which to me is absolutely fascinating, because the model itself has no concept of chess, much less 2D space or the rules of chess or winning or losing or any of that. It's just trying to output what it thinks the next most likely word is. And what's really crazy is large language models are capable of playing chess at a level that's much better than I am. I'm about 1,200 Elo; large language models can basically play at 1,800 Elo, which is pretty cool. More modern large language models such as ChatGPT, which hopefully everybody in the room is familiar with, are specifically trained to respond in a question-answer format. And I think this allows us to do a pretty wide variety of interesting things. I like to think of large language models as another abstraction layer, a higher-level programming language. In all of the other examples, what we were doing was printing Hello World. It's a very, very common program; it's the first thing you do just to make sure that your whole compilation chain works. To me, this is the print Hello World program in a large language model: "Return the following text exactly: 'Hello World,'" and ChatGPT goes, "Hello World." And you can already start to see what the abstraction stack looks like, right? I start with English, which then compiles into PyTorch weights, which compiles into Python, which compiles into C, which compiles into Assembly, which compiles into Bytecode. Remember what I said earlier: the only property of a computer programming language is that it has a way to compile into Bytecode.
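To make that concrete, here's a minimal sketch of what this "Hello World" of programming in English might look like in code, assuming the OpenAI Python client; the model name and the `ask_llm` helper are illustrative, not from the talk:

```python
# A minimal sketch of the "Hello World" of programming in English.
# Assumes the OpenAI Python client; the model name is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_llm(prompt: str) -> str:
    # Send an English "program" to a chat model and return its text reply.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(ask_llm('Return the following text exactly: "Hello World"'))
# -> Hello World
```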



Large language models act as a compiler for natural language, at least I think so. When we say something like "programming in Python is easier than programming in C," what we generally mean is that it takes less effort to express ideas in, and understand, the former rather than the latter. And large language models are a natural extension of this concept: instead of writing in code, you write in English. So let's say I wanted to program a Fibonacci sequence generator. The Fibonacci sequence is the sequence of numbers where every number is the sum of the previous two. I could type this into ChatGPT completely in English: the Fibonacci sequence is calculated by the following formula, and then I can ask it to calculate the eighth value in the sequence. And ChatGPT of course gives me the right answer. It goes, F of eight is equal to 21. And it even politely says, "let's calculate it step by step," which I think is great. So why do I care, why do I think this is cool, and why am I even defining this as a programming paradigm? Well, one thing that large language models excel at, something that other programming languages are really bad at, is that you don't need to know the solution to the problem you're trying to solve. Every other programming language that we've talked about isn't really doing anything new. What I mean by this is that you, as a developer, already have to know the answer to the problem you're solving before you can write code. The code is an artifact of the internal state of the developer, not the other way around. The amazing thing about large language models is that the developer doesn't actually need to know the solution to the problem they're trying to solve. You just need to be able to describe the problem and give examples of correct answers. And it turns out that there are a lot of really generic problems where it's easy to come up with a correct answer in a specific situation, but hard to come up with a generic solution. So let's go back to that Fibonacci sequence example again. This time, what if I don't know the Fibonacci sequence, but I just know a few examples of what's in it? I say F of three is equal to two, F of four is equal to three, and so on and so forth. I can ask ChatGPT, "What's the output of F of eight?" And ChatGPT correctly surmises that I'm trying to calculate the Fibonacci sequence and goes, "Hey, this is the pattern that you're creating. Therefore F of eight is equal to 21," which is, again, amazing and fantastic. Okay, I bet some people in the audience are maybe a little bit skeptical. Maybe you'll say, "Look, the Fibonacci sequence is a super well-known pattern. It's in every introductory textbook; it's probably all over the training data for this thing." Where I think large language models really excel is in places where not only is there no known right answer, the right answer might actually be totally subjective, where if you ask two different people, you'll get two different answers on what it means to do the correct thing.
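Here's a rough sketch of the contrast: the traditional approach, where the developer already knows the recurrence and encodes it directly, versus the LLM approach, where you only hand over examples and a question. The prompt is paraphrased from the talk, and `ask_llm` is the hypothetical helper from the previous sketch:

```python
# Traditional programming: you must already know the solution (the recurrence).
def fib(n: int) -> int:
    a, b = 1, 1  # F(1) = F(2) = 1, so F(3) = 2, F(4) = 3, F(8) = 21, as in the talk
    for _ in range(n - 1):
        a, b = b, a + b
    return a

assert fib(8) == 21

# LLM "programming": you only describe the problem with examples.
# `ask_llm` is the hypothetical helper from the earlier sketch.
print(ask_llm("F(3) = 2, F(4) = 3, F(5) = 5, F(6) = 8. What is F(8)?"))
# -> the model infers the Fibonacci pattern and answers 21
```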



I think language models are perfect for this as a programming paradigm. So with that in mind, I'm gonna switch gears a little bit. Who here has played Minecraft? I can't see anybody raising their hands, so I'm just gonna assume that all of you have played Minecraft or have some sense of it. This is Minecraft's Steve; he's the main character of Minecraft, and for the rest of the talk, he's also gonna be our main character. I find Minecraft to be a really fascinating game. It turns out it's a really hard game to play, not for humans, right? My 3-year-old cousins can play Minecraft, but it's actually really, really hard for AI. And it's interesting to think about why. There's no predefined goal. There's no story. The agent is embodied in 3D, so it can walk around, move around, look around; it has a full continuous range of motion. There are lots of long-running tasks that are composites of other tasks: if I wanna create a house, I have to get wood; if I wanna get wood, I have to build an axe; if I wanna build an axe, I have to get stone, and so on and so forth. And Minecraft is dynamically generated, so if I learn how to do something in one place, I have to be able to apply that knowledge in a completely different place, somewhere totally random. Somebody who's good at Minecraft displays all the properties of lifelong learning, with continuous advancement and improvement over time. And up until this point, no AI has ever been able to solve Minecraft. I mean, yes, you can create an AI that can do a specific thing, like mine wood or something like that, but you've never had an AI that can do all of the things that Minecraft has to offer, whether that's mining diamonds or gathering cactus or building a house or hunting pigs or whatever it might be.
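Just to make that composite-task point concrete before we get to Voyager, here's a tiny hypothetical sketch of the house-to-wood-to-axe-to-stone chain; the dependency map and names are mine, not from the paper:

```python
# Hypothetical task-dependency map illustrating composite Minecraft tasks.
# The specific dependencies are illustrative, not from the paper.
DEPENDS_ON = {
    "build_house": ["get_wood"],
    "get_wood": ["build_axe"],
    "build_axe": ["get_stone"],
    "get_stone": [],
}

def expand(task: str) -> list[str]:
    """Return the tasks that must be done, prerequisites first."""
    plan = []
    for dep in DEPENDS_ON.get(task, []):
        plan += expand(dep)
    plan.append(task)
    return plan

print(expand("build_house"))
# ['get_stone', 'build_axe', 'get_wood', 'build_house']
```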



Voyager is an agent that can do this. Voyager is an embodied agent that uses multiple large language models to solve Minecraft. Each large language model is programmed using a different prompt. The large language models take in the game state and, sometimes, one or more outputs of other large language models. Then a final LLM writes code that is used to move the agent in game, and the whole process repeats. The paper defines three different modules: a curriculum generator, a skill library, and an iterative code generator. And I wanna quickly go through each one, just so you can all get a sense of what a system of large language models looks like when they're working in collaboration like this. So we'll start with the curriculum module. The goal of the curriculum module is to propose a suitable next task based on the agent state. The curriculum module itself is composed of four different large language models. I'm not gonna get super deep into that, just to respect time. But the core idea is that if you keep running this in a loop, you will eventually generate a steady list of new, relevant tasks. So I might start with a high-level task like "mine a diamond," and the curriculum module will go through and give me a list of subtasks that I need to complete in order to eventually mine a diamond. Now I wanna take a moment here and show you how insane it is to be able to program by prompt. I just wanna take you through an example of what it means to program a large language model. I know this isn't the best format for a slide, but I couldn't think of anything better, because it's just so easy to read, 'cause it's all just natural language. So I'm just gonna take a second to read this. "You are a helpful assistant that tells me the next immediate task to do in Minecraft. My ultimate goal is to discover as many diverse things as possible, accomplish as many diverse tasks as possible, and become the best Minecraft player in the world."



Wait, what? What the heck, right? To call this programming is kind of insane, but what's great about LLMs is you now have a way to turn this text, this English, into Bytecode. So a computer can actually take this and execute on it in a way that makes sense to the computer. The rest of the prompt is also sort of interesting. You have this I/O format, so you say what information you're gonna provide. You include information about the game state. You include information about tasks that have already been completed. You give the AI a bunch of criteria. You say, "Hey, the AI should act as a mentor. It should be really specific; it should follow a concise format when it gives these tasks back." And then there's an output format, right? It says, "Okay, you should provide the reasoning and you should provide the task." If you tried to do this in Python, it would take thousands of lines of code, and here it's like 600 lines of English, max. The reason it would be so hard to do in a traditional programming language is that you have to know the answer. You have to know what it means to be the best Minecraft player in the world and then program that. And that isn't a real thing, right? That's a subjective thing. It's not something you can program in a computational way, but the large language model is able to, nonetheless, turn it into Bytecode. So here's an example of the curriculum module at work. Let's say you pass in an inventory: the player agent has a wooden pickaxe and some stone. GPT-4, again entirely in English, will take that, combine it with the prompt we just saw, and go, "Okay, your task is to craft one stone pickaxe," and it gives you the reasoning: "since you have a wooden pickaxe and some stones, it would be beneficial to upgrade to a stone pickaxe for better efficiency." Super cool. So the auto curriculum, this first module that I talked about, when run in a loop, will propose increasingly complex tasks in order to achieve this goal of being the best Minecraft player. In order to achieve those tasks, we need some way to store and recall past skills. Humans obviously do this instinctively, right? We don't forget how to make a pickaxe and then have to derive it from first principles every single time we have a new task in front of us. But with an AI agent, you have to think about not only how you store these past skills, but also how you recall them when you're in a relevant situation. I'll talk about how the skills are written in a second; that's actually the third module.
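Before moving on, here's a rough sketch of what the curriculum module boils down to in code: one LLM call that turns game state into the next task. The prompt is condensed from the talk, and the function and field names (plus the `ask_llm` helper from the earlier sketches) are my own, not the paper's actual code:

```python
# A condensed sketch of the curriculum module: one LLM call that turns game
# state into the next task. Prompt paraphrased from the talk; `ask_llm` is
# the hypothetical helper from the earlier sketches.
CURRICULUM_PROMPT = """You are a helpful assistant that tells me the next
immediate task to do in Minecraft. My ultimate goal is to discover as many
diverse things as possible and become the best Minecraft player in the world.

Act as a mentor. Be specific. Respond in this format:
Reasoning: <why this task makes sense right now>
Task: <the next immediate task>
"""

def propose_next_task(inventory: dict, completed_tasks: list[str]) -> str:
    state = f"Inventory: {inventory}\nCompleted tasks: {completed_tasks}"
    return ask_llm(CURRICULUM_PROMPT + "\n" + state)

# propose_next_task({"wooden_pickaxe": 1, "stone": 3}, [])
# -> "Reasoning: ...upgrade for better efficiency...\nTask: Craft 1 stone pickaxe"
```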



But for now, let's talk about how you retrieve skills once you have them. For folks who've worked with embedding databases, this process should look pretty familiar, so I'm just gonna quickly run through it. Let's say you have some code that's about combating a zombie. You pass that through a large language model and you ask it to provide a description of that code: "this function is about equipping a stone sword to combat a zombie." You then embed that text. I'm not gonna go into what embeddings are right now, but I love embeddings, I talk about them a lot, so feel free to look at any of my other talks. You embed that text and then you add it to the skill library as, like, a key-value store. Later on, you get some feedback from the environment. Based on the environment, you ask a few questions like, "How do I craft an iron pickaxe in Minecraft?" and the large language model will give you an answer. You embed that and then use it as a query against the same skill library. So then you get back a bunch of skills that you've previously learned that are relevant to your current situation. That's how you retrieve skills. Now let's talk about how you actually write code. This is the final module, and it follows basically the same pattern we already know: a large language model generates code. It takes in the agent state, a set of pre-built programs from the skill library that we've already talked about, and a prior critique if necessary, and then it produces more code. The agent then runs that generated code. And then finally a different large language model takes in the final agent state and runs a verification and critique step to determine if the AI has actually accomplished the task at hand. If the task succeeds, the program is fed back into the skill library. If not, the critique is sent back to the AI on the next loop. So what does that look like? Well, let's say my task was to mine five coal ore. I pass in my inventory and GPT-4 goes, "Okay, well, mining coal ore in Minecraft will generate coal. You have five coal in your inventory, so you must have accomplished the task." What about an example where the agent didn't accomplish the task? Let's say the task was to hunt three sheep. Well, GPT-4 will go, "Okay, you have two white wool and six mutton in your inventory, which indicates that you only killed two sheep." So it will say, "Hey, you did not succeed in this task," and it'll give a critique: "Find and kill one more sheep to complete the task," which then goes back into the code generator. The code generator will modify its code and then try again.
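To ground the retrieval half of that, here's a minimal sketch of the skill library as an embedding key-value store, ranked by cosine similarity. The `embed` helper assumes the OpenAI embeddings API (reusing the `client` from the earlier sketch), and the model name, stored skills, and descriptions are illustrative, not the paper's code:

```python
import numpy as np

# A minimal sketch of the skill library as an embedding key-value store.
# `embed` assumes the OpenAI embeddings API via the `client` from earlier;
# the model name is illustrative.
def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(response.data[0].embedding)

skill_library: list[tuple[np.ndarray, str]] = []  # (description embedding, skill code)

def add_skill(description: str, code: str) -> None:
    # Key: the embedded, LLM-written description. Value: the skill's code.
    skill_library.append((embed(description), code))

def retrieve_skills(query: str, k: int = 5) -> list[str]:
    # Embed a question about the current situation, then rank stored skills
    # by cosine similarity to it.
    q = embed(query)
    def cosine(v: np.ndarray) -> float:
        return float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))
    ranked = sorted(skill_library, key=lambda kv: cosine(kv[0]), reverse=True)
    return [code for _, code in ranked[:k]]

# add_skill("Equips a stone sword to combat a zombie.", "<generated skill code>")
# retrieve_skills("How do I craft an iron pickaxe in Minecraft?")
```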




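Putting the three modules together, a rough paraphrase of the whole loop might look like this. This is my own sketch, not the paper's actual implementation; `env`, `generate_code`, `verify`, and `describe` are hypothetical stand-ins for the game environment, the code generator, the critic, and the skill-description step:

```python
# My own paraphrase of the Voyager loop, not the paper's actual code.
# `env`, `generate_code`, `verify`, and `describe` are hypothetical
# stand-ins for the modules sketched above.
def voyager_loop(env, max_iterations: int = 1000) -> None:
    critique = None
    for _ in range(max_iterations):
        task = propose_next_task(env.inventory(), env.completed())  # curriculum
        skills = retrieve_skills(task)                # skill library lookup
        for _attempt in range(4):                     # iterative prompting
            code = generate_code(task, env.state(), skills, critique)
            env.run(code)                             # act in the game
            ok, critique = verify(task, env.state())  # verification + critique
            if ok:
                add_skill(describe(code), code)       # grow the skill library
                critique = None
                break
```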
So let's bring this all together. You have this automatic curriculum generator that's creating new tasks in a hierarchical way. You have the environment, which feeds into the skill library and into the automatic curriculum. You have this iterative prompting mechanism that takes all of these different pieces and writes code. You then run that code in the environment, and then you verify: you make sure that the thing you were trying to do actually succeeded. If it did, you now have a new skill that you can call upon later. And if it didn't, then you go back, refine the program, and try again. There are a lot of really cool results in the paper, but for time, I just wanna stick to this one; I think it's the most evocative. Voyager, on its own, discovers new Minecraft items and skills pretty much continually. It goes from wooden tools to stone tools to iron tools, and eventually reaches diamond tools. And the whole time it's exploring the map and learning new things. Some of the other comparison methods never even get beyond mining a few basic things. So why do I love this paper? Why am I obsessed with this paper? Well, Voyager is an AI that played Minecraft, okay? That alone is cool, right? I played Minecraft growing up; I think that's awesome. But I think the main headline is how they did this. The core of this tool is just described in natural language. I mean, even if you don't program, you should really go look at the GitHub for this thing. It's literally like 600 lines of English and 400 lines of Python. That's it. For all of the most complicated pieces, including task generation, code generation, and feedback and critique, at no point did the creators of the Voyager agent need to know how to solve these problems. They don't have generic solutions to any of them. They just provided some examples to the AI and then asked nicely. Going back to the beginning of this talk: programming languages are about making tasks easier, and each higher-level programming language bridges the gap between how we think about a problem and how we describe it to the computer. That gap, from when we started with Bytecode to when we eventually ended up at Python and JavaScript, has definitely gotten a lot smaller over time. With large language models, I think that gap is basically at zero, and Voyager shows you how. So I wanna end with this last thing. This is actually a different paper; I'm just gonna leave it as a teaser. It is, I think, the natural end state of something like Voyager. I'm just gonna read a little blurb from the abstract: "CHATDEV, our virtual chat-powered company for software development, automatically crafts comprehensive software solutions that encompass source codes, environment dependencies, and user manuals." A couple of researchers in China basically put a bunch of Voyager-style agents together in a room and made a company, and it can make games from scratch. You just go, "Develop a gomoku game," and it does it. So I think that's really awesome. And yeah, I wanted to share all that with you today. I hope you guys found it interesting.

Duration:
19 minutes
Series:
AI Crash Course
Startup Stage:
Pre-seed, Seed, Series A
Upload Date:
10/6/2023