Module Description:
Ubiquity Extended Team member Amol Kapoor explains how the machine learning model NeRF is able to predict new views of a 3-dimensional scene given a set of 2-dimensional views of a scene or object.
Full Transcript:
This model is called NeRF. So far we've talked about text and images; NeRF is all about representing 3D scenes. NeRF is a model that aims to predict new views of a 3D scene, given one or a few previous views of that scene. Some of you who are on TikTok may have seen examples of this, where people take a video and then create these cool flybys; that's all happening with NeRF. The idea here is that any model that can generate new views must have somewhere in its weights, in its understanding of the world, a representation of the actual underlying 3D scene. So if you can generate novel views, you can generate the 3D model. Unlike some of the other models we've talked about so far, NeRFs don't have separate training and inference steps. Instead, the output of the training step is the final result.

The NeRF model takes in five inputs, an xyz point location and a camera viewing direction, and outputs a color and a density for that specific point. So during training we take in a few different image angles of a scene, we run the model for sampled xyz coordinates in an LxWxH cube, and we render images of that cube at the same viewpoints as the input images. The renderings are then compared against the inputs, and the model updates based on the differences. Just to walk through a more visual example, let's say we had a 3D model of a Lego truck. You take an input photo of that truck from the front and maybe one from the side, and then we create rays that project through those images. The model tries to fill in the space along those rays, and then we project that back out onto the 2D plane to see what the difference is based on the underlying representation. If you have a bunch of different views, the model should eventually create an accurate representation of the actual scene.

Embeddings represent points in some geometry. Again, think about our RGB cube: it's just a point in space. Normally we use that embedding geometry as a vehicle to calculate something else, like classifying an image or detecting if two things are similar. With NeRFs, the cool thing is that the embedding geometry is the actual output. Our loss function allows us to directly calculate the shape of the data. So one way to imagine how this works is that the model is slowly molding a blob into the right shape based on picture renderings of the output. If I start with this input representation of this Lego truck, the model starts with a totally wild guess, everything is everywhere, and then over multiple steps it slowly molds that down into a more accurate point cloud representation. Can we generate these models directly from text? Oh yeah.
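To make that five-dimensional input and (color, density) output concrete, here is a minimal PyTorch sketch of the kind of MLP a NeRF uses. It is an illustration, not the paper's exact architecture: the real model applies positional encoding to its inputs and feeds the viewing direction in at a later layer, and the direction (two angles in the paper) is represented here as a 3D unit vector.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Minimal NeRF-style MLP: a 3D point plus a viewing direction in,
    an RGB color plus a volume density out (simplified sketch)."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        # xyz (3) + viewing direction as a unit vector (3); the paper's
        # "five inputs" are xyz plus two viewing angles.
        self.net = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),          # r, g, b, density
        )

    def forward(self, xyz: torch.Tensor, view_dir: torch.Tensor):
        out = self.net(torch.cat([xyz, view_dir], dim=-1))
        rgb = torch.sigmoid(out[..., :3])  # colors in [0, 1]
        sigma = torch.relu(out[..., 3:])   # non-negative density
        return rgb, sigma
```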
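And here is a rough sketch, under the same assumptions, of how those per-point predictions become a rendered pixel that can be compared against an input photo: sample points along each camera ray, query the model at every sample, and alpha-composite the colors using the densities. The ray origins, ray directions, and ground-truth pixels are assumed to come from the known camera poses; the stratified and hierarchical sampling of the actual paper are omitted.

```python
def render_rays(model, origins, dirs, near=2.0, far=6.0, n_samples=64):
    """Alpha-composite the model's colors/densities along each ray
    into one pixel color (simplified volume rendering)."""
    t = torch.linspace(near, far, n_samples)                       # depths along the ray
    pts = origins[..., None, :] + dirs[..., None, :] * t[:, None]  # [rays, samples, 3]
    rgb, sigma = model(pts, dirs[..., None, :].expand_as(pts))
    delta = t[1] - t[0]                                            # spacing between samples
    alpha = 1.0 - torch.exp(-sigma[..., 0] * delta)                # per-sample opacity
    # Transmittance: how much light survives to reach each sample.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[..., :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[..., :-1]
    weights = alpha * trans
    return (weights[..., None] * rgb).sum(dim=-2)                  # composited RGB per ray

# Training loop (sketch): render at the known camera viewpoints and
# compare against the real photos' pixels.
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# pred = render_rays(model, ray_origins, ray_dirs)
# loss = ((pred - true_pixels) ** 2).mean()
# loss.backward(); optimizer.step()
```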
Lemme stop you for a second, on the prior slide. So is it correct that NeRF is one of the more inspectable AI models, where you can actually look into it? I know there's a lot of work around transparency and observability and monitoring and ethics, but it strikes me that with NeRF you can actually just look at the output directly and see what it's like.
Yeah, yeah, it's sort of interesting. I mean, transparency and how a model actually works is an interesting sub-problem in its own right. What's interesting about NeRFs is that they're very small models; they tend to be very small. You train a single NeRF model on a single scene, they don't generalize across many scenes, and that in turn means it's actually pretty easy to understand what's happening. I also think this formulation, this loss function idea where you cast a ray, create a volume, and try to identify what's inside that volume, does lend itself to a fair amount of transparency. But I will say it's hard to compare whether or not this model is more or less transparent than other models, 'cause they're all kind of different.
Thanks.
Okay, so DreamFields. The original NeRF paper was published in 2020. In 2022, Google came out with this other paper called DreamFields, and the question being asked was, can we generate these models directly from text? It turns out the answer is yes. The beauty of CLIP, the model we talked about earlier, is that so many things can be represented as text or images. We already know that if I have some text and I have an image, I can calculate where those things live in some embedding space and then move those inputs so that they get closer to each other. Well, it turns out that if I just take a picture of a 3D model during training and pass it through CLIP to figure out how close that 3D model is to a text prompt, I can backpropagate that information to the original model and eventually create a model that matches the description, which I think is incredibly cool.
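A hedged sketch of that DreamFields-style idea, assuming a differentiable NeRF renderer (here a placeholder `nerf_render` function) and a CLIP-style model exposing `encode_image`: render from some camera, embed the rendering, and push it toward the text prompt's embedding. The actual paper also uses image augmentations and a transmittance/sparsity regularizer, which are left out here.

```python
import torch

def clip_guidance_step(nerf_render, clip_model, text_features, camera_pose, optimizer):
    """One text-guided optimization step: make the NeRF's rendering look
    more like the text prompt, as judged by CLIP.

    nerf_render and camera_pose are placeholders for whatever differentiable
    renderer is in use; clip_model is assumed to expose a CLIP-style
    encode_image(); text_features is the (normalized) prompt embedding."""
    image = nerf_render(camera_pose)                         # [1, 3, H, W], differentiable
    image_features = clip_model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    # Maximize cosine similarity between the rendering and the text prompt.
    loss = -(image_features * text_features).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()   # gradients flow through the rendering into the NeRF weights
    optimizer.step()
    return loss.item()
```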