9. LLMs and Robotics | How can academia engage in expensive LLM research?

Watch it on YouTube, or listen on your favorite podcast app.

Show note

Large language models and GPT are trending these days. In this episode, Henrique and I sit down with David Watkins from the Boston Dynamics AI Institute. We talk about how LLMs are transforming robotics research. We dive into the cutting-edge research happening around generative models and foundation models like GPT, and how these powerful AI systems can be applied to robotics to enable new capabilities.

David provides unique insights into the challenges and opportunities of training robots to function more intelligently in dynamic real-world environments.

Since we all work closely with academia, we also discuss why top universities are currently absent from the LLM race. The rich universities seem to have endowment funds at a level comparable to the top tech firms, so why are they not engaging in LLM training?

If you haven't subscribed already, please like and subscribe now.

Rich universities have endowment funds comparable to tech firms: tweet

Leave a comment on the YouTube video or email me at podcast@halfmaker.com

Full transcript text

Ding (00m00s): Large language models and GPT are trending these days. In this episode, Henrique and I sit down with David Watkins from Boston Dynamics AI Institute. We talk about how LLMs are transforming robotics research. We dive into the cutting-edge research happening around generative models and foundation models like GPT, and how these powerful AI systems can be applied to robotics and enable new capabilities. David provides unique insights into the challenges and opportunities of training robots to function more intelligently in dynamic real-world environments. Since we all work closely with academia, we also discuss why top universities are currently absent from the LLM race. It seems like the rich universities, they do have a lot of endowment funds at a level comparable to the top tech firms. What are the reasons they're not engaging in LLM training? If you haven't subscribed already, please like and subscribe now. We're available on YouTube as well as all podcast platforms. Link to podcast feed is in the description. Hello, welcome back to another episode of Not Just Research podcast. In this episode, we have David from Boston Dynamics AI Institute with us to talk about robotics, generative AI, and LLMs. Hi, David.

David (01m05s): Hi, Ding. Thanks so much for having me on the podcast.

Ding (01m08s): Yeah. Pleasure to have you on. Tell us, what do you do?

David (01m11s): I work at the Boston Dynamics AI Institute. I am a research scientist there, and I am also the foundation model lead. So I am bringing all of the wonderful things that are happening in the world of foundation models, large language models, et cetera, to the world of robotics. And we are trying to put together a really good team of people to work on this problem and to figure out what the intersection is for these new tools that we have in the field and bring them into what has been largely dominated by control systems in the past in robotics. Before this, I was at Columbia University. I was getting my PhD in mobile manipulation. I was specifically trying to have a robot navigate throughout an environment, recognize an object sitting on a table, capture multiple views of the object, and then reconstruct the object without having to register those views together or localize the robot in the environment. That was a lot of fun to work on. It was a lot of very different papers that I somehow put together into one coherent narrative.

Henrique (02m12s): Yeah, you fit right in. I think if you look around the room, that's our PhDs kind of rollercoasters in that sense.

David (02m17s): Yeah, and it's a very difficult thing to do, but robotics is a multidisciplinary field. And so when I started at the Institute, I was really excited about where foundation models could take us. And so while it wasn't necessarily something I was working on in the past, I feel like a lot of the software engineering skills I developed over the course of the PhD lends itself well to the very highly demanding work that's required to train a foundation model from scratch.

Ding (02m44s): Can you talk a little bit more specifically about foundation models? What do you mean?

David (02m48s): Yeah, so a foundation model is any very large model trained on one or more modalities that is used for multiple different downstream tasks without necessarily requiring any additional training. And the best example of this is GPT-4, where it is a general-purpose large language model that can take in a series of text tokens and output the next token in the sequence. This is what's called an autoregressive model. And we're seeing zero-shot performance in test taking, in question answering, in search querying, in a variety of different contexts. This is not something that we really realized models could do until we got these very large models. I think GPT-3 was the first time that we really realized that this was a possibility, but ChatGPT made it even more widespread starting December last year.
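The autoregressive loop David describes (take the tokens so far, output the next token, repeat) can be sketched in a few lines. This is only a toy illustration: the lookup table `NEXT_TOKEN_PROBS` and the `generate` function are hypothetical and invented here, and the table conditions on just the last token, whereas a real LLM computes next-token probabilities with a neural network conditioned on the whole preceding sequence.

```python
# Toy stand-in for a language model: maps the most recent token to a
# probability distribution over the next token. All numbers are made up.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.9, "a": 0.1},
    "a": {"robot": 1.0},
    "the": {"robot": 0.7, "apple": 0.3},
    "robot": {"grasps": 0.8, "</s>": 0.2},
    "grasps": {"an": 0.6, "</s>": 0.4},
    "an": {"apple": 1.0},
    "apple": {"</s>": 1.0},
}


def generate(max_tokens=8):
    """Greedy autoregressive decoding: repeatedly append the most likely next token."""
    tokens = ["<s>"]  # start-of-sequence marker
    for _ in range(max_tokens):
        probs = NEXT_TOKEN_PROBS[tokens[-1]]
        next_token = max(probs, key=probs.get)  # greedy choice, no sampling
        if next_token == "</s>":  # end-of-sequence marker stops generation
            break
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens[1:]  # drop the start marker


if __name__ == "__main__":
    print(" ".join(generate()))
```

The key point is the feedback: each predicted token is appended to the sequence and becomes input for the next prediction.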

Henrique (03m44s): Now it's like a household name, like I see friends and parents and whatnot talking about just putting things into it and seeing what happens. But it's kind of like you said, it just sort of started now. And so all of us who have been working sort of at the leading edge don't have much experience in sort of working with it. It's not like anyone would because a lot of this just came around just now. And so a lot of it is being able to sort of adapt quickly and really get up to speed. And I think that's why you're such an interesting person to talk to because you're sort of turning it in another direction. You're pointing it at robots, whereas everyone is just sort of thinking of like search queries and like text to text. Like a more interesting thing is we have this like super advanced system able to do, sort of like you said, zero-shot learning. And now we want to use it to manipulate things, to grab things, to analyze a room, things that are completely different dimensions to what we're used to.

David (04m33s): And one of the biggest problems with a large language model, and it's sort of the problem that we saw with Bing and this Sydney debacle, is that they are not physically grounded. They don't have anything that really brings them to the real world to understand context. It's also very easy to personify these things when ultimately they are just a feed-forward model that's just predicting the next token in the sequence. And so if I'm saying things like thinking or processing or words that we would typically use to describe human actions, it's out of convenience. I'm not trying to ascribe some level of sentience to these things yet.

Henrique (05m12s): So we should not be worried? Not yet, not yet.

David (05m18s): Coming soon. I am generally of the mind that that is not likely, that the reason why we work really well in the real world is because this is just how sentience works in the real world. And if we were to train something like ourselves, we would probably see very similar behaviors. But a lot of sort of the philosophy that I have around this is sort of obtained from innovative philosophers like Daniel Dennett or Hofstadter in their seminal works.

Henrique (05m46s): So wait, so these seminal works, are these guys pointing at robotics as well or you're just saying from the traditional AI?

David (05m53s): So a great example of this is The Mind's I, which is a series of vignettes over time from a variety of different authors. And then it's an analysis from either Daniel Dennett or Hofstadter describing sort of their position versus that particular selected work. They describe everything from just the philosophy of artificial intelligence to robotics and the embodiment being required for physically grounding these things. And so a lot of the models that we're seeing are just purely based on data collected from the internet as opposed to data that is designed

Henrique (06m33s): Is that something to as much as you can speak towards it that you guys are doing over in Boston, sort of getting data, not from the internet per se, but like recording it live, seeing how things like sort of evolve or that, or how do you see it applying for robotics?

David (06m48s): The great thing about the Institute is that it is a place where we can do lots of different experiments of a variety of different, we can make a lot of different moonshots simultaneously. And so I would say that anything that you can think of, we're probably trying it.

Ding (07m03s): It's very exciting. So for LLMs, like, you know, like at Adobe, we're also trying to experiment with document understanding, video understanding, this type of task. So can you, and I don't know robotics as a research field very well. How does an LLM fit into robotics? Is it to help robots think or giving them instructions? Or how does that work?

David (07m24s): So I think, you know, a great example of this at the beginning was SayCan. SayCan was a really cool paper coming out of Everyday Robots that was taking an LLM and it would query the LLM saying something along the lines of, help, I've made a mess. Should I blank? And then it would try, let's say 30 different queries. Should I pick up an apple? Should I walk to the fridge? Should I pick up the sponge? And then it would figure out the highest probability item that should come after that prompt. They then trained a value function in a reinforcement learning style methodology to then figure out probabilistically, given the context that the robot is in, what action is most likely to succeed? And these are all the skills that they defined. This work had bespoke skills for picking up apples, picking up sponges, navigating to specific locations. And they would then take these two probabilities, multiply them together, and whichever one was the highest is the action that they would actually follow through with. And so they were leveraging the LLM as a form of high-level planning. And they were using value functions to physically ground the LLM in feasibility space. This method sort of fell short because it required you to have every single skill predefined. It didn't have a general pick-up object skill. It had a pick-up apple skill. And so that made it very difficult for this method to scale and generalize, where some of that could have been parameterized away. And then unfortunately, Everyday Robots was closed late last year. And so a lot of that work has now fallen to Robotics at Google. But this work is sort of emblematic of how people are approaching LLMs. We have this very cool tool that can help us predict what is most likely to be useful in the future. The biggest difficulty is that it is trained with no physical grounding whatsoever.
And so you have other works like PaLM-E, like RT-2, that are trying to actually physically ground it by training these models in semantic space using text tokens or image tokens, and then replacing the last three layers with something like robotic actions coming out. It's sort of a fine-tuning training methodology. And this is also showing promising results. But when you train it in this end-to-end fashion, it can be very difficult to debug. And so finding the balance between something like SayCan, where I treat the LLM and the value function as completely separate entities, and I only bring them together at the end, and training something completely end-to-end, there is probably some middle ground here that allows for some auditability until we can get enough training data where the end-to-end approach is more feasible.
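The selection rule David describes for this method (multiply the LLM's probability for each predefined skill by the value function's estimate that the skill would succeed, then take the highest product) can be sketched as below. This is a minimal illustration rather than the paper's implementation: the function name `choose_skill`, the skill strings, and every number are assumptions made up for the example.

```python
def choose_skill(llm_probs, value_probs):
    """Pick the skill with the highest product of usefulness and feasibility.

    llm_probs:   skill -> probability the LLM assigns to that skill as the
                 next step for the instruction (how useful it sounds).
    value_probs: skill -> value-function estimate that the skill succeeds
                 from the robot's current state (how feasible it is).
    """
    scores = {skill: llm_probs[skill] * value_probs[skill] for skill in llm_probs}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Hypothetical numbers for the prompt "help, I've made a mess".
    llm_probs = {
        "pick up the sponge": 0.5,   # sounds most useful to the LLM
        "walk to the fridge": 0.4,
        "pick up the apple": 0.1,
    }
    value_probs = {
        "pick up the sponge": 0.2,   # sponge is out of reach right now
        "walk to the fridge": 0.8,
        "pick up the apple": 0.9,
    }
    best, scores = choose_skill(llm_probs, value_probs)
    print(best, scores)
```

With these made-up numbers the sponge is the LLM's favorite, but its low feasibility drops its combined score below walking to the fridge, which is the kind of physical grounding the value function is meant to supply. The sketch also shows the scaling limit mentioned above: every skill has to appear as its own predefined entry.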

Henrique (10m04s): So it has to sort of learn, in a sense, these habits or actions, and then also have a sort of second part of its brain where it's like, what should I actually be doing here? And I guess these reinforcement learning techniques are sort of a replacement for the traditional controls because they're stateful, they're easier to train policies on, given certain bounds. But LLMs are sort of very useful at aggregating lots and lots of data and sort of making it, what's the next best thing to do here assessment? So we're just missing this sort of link, or we can sort of continuously train different versions to tackle different problems and just get them better at sort of communicating between.

David (10m40s): And I think the robotics community in general has not decided on what skills are the most useful, right? Picking something up is only really useful in service of something downstream. And so how do we parameterize these skills such that we can have them be generalizable and composable, but also enough of a skill that it is substantial, it's actually making meaningful progress on the world.

Henrique (11m05s): So that brings up an interesting question, which is, I guess, with the emergence of these LLMs and this general AI, what becomes the Holy Grail in the world of robotics? Like before that, it's still sort of like being able to grasp or enter a room and do all these things robustly. It's sort of what you imagined for robotics. I want something to help pick up something for me or assist in the case of an emergency or whatnot. But now is it something that's just general action? Is it now a limitation on the hardware? Like we can get things that can do a lot of fancy things, but the hardware is holding us back. Where do you think it falls or where it should be?

David (11m38s): So I think generally the hardware is very capable. We're probably using a very small percentage of what the hardware is fully capable of. What we're missing is something that can do the high-level planning, that can build up these skills as symbols abstractly without requiring us to hard-code them. Papers like Voyager, like Code as Policies are getting us closer to that. But again, they are sort of missing that physical grounding element. So Code as Policies, which is very similar to what Voyager was trying to do in Minecraft, it is able to write and generate code that it can then compose together in the future with other pieces of code that it itself generates. And so this allows you to sort of have a library of primitive skills that you then compose up automatically, using something like a prompt in GPT-4 or some other more sophisticated model. If we can have these things be physically grounded, I'd say that that is the holy grail, but that is hand-waving a lot of work away. We, ultimately, we want to be able to produce these models very quickly and expediently for useful downstream tasks. And it's very difficult for me to say if we are 50 steps away from general purpose AI, what are those 50 steps? Are we anywhere along that path yet? Is the LLM step one of 50 or step 20 of 50?

Henrique (13m05s): I'm not exactly sure yet. Yeah, it's a tricky thing to talk about, and it's a tricky thing to test and prod. Like, it's funny, you get to industry and you think, okay, now we have all the resources on our side, but it still takes weeks to train these things. And just gathering the data and having the right labels and tagging, formatting, all that stuff, that's one part of it. That's like the research part of it. That's the fun, like building up the topology and the architecture. But then you take weeks and then you're really pointing it at a very specific task. And then if it works, that's great. If it doesn't, you're not sure if it would work on another task or if you have to give up and whatnot. And it's crazy because we used to do this on GPUs that there was like maybe one GPU on your machine and you have to make a paper out of that. And so now with all these other resources, you're still coming up against like the same wall, which is things just take a long time and you're requesting resources just to like scratch off an idea from your list.

David (13m55s): And I think, you know, a lot of my thesis was done on my 3090. That was a holy grail, by the way.

Henrique (14m01s): Yeah, that's amazing.

David (14m03s): You had a 3090. That was so helpful to have 24 gigabytes of addressable VRAM. Yeah. And now you can't even get by doing inference on a single model with anything less than 240 gigabytes with these LLMs, right? Llama 2 came out and you need multiple A100 80-gigabyte cards to run inference on it.
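A rough back-of-the-envelope calculation shows why a single 24-gigabyte card no longer suffices. The sketch below counts only the memory needed to hold the weights, assuming the largest Llama 2 model (70 billion parameters) stored in 16-bit precision (2 bytes per parameter). Real inference also needs room for activations and the attention cache, so actual requirements are higher than these numbers. The function names are hypothetical.

```python
import math


def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed just to store the model weights, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_cards(num_params, bytes_per_param, card_gb):
    """Smallest number of cards whose combined memory holds the weights alone."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / card_gb)


if __name__ == "__main__":
    params = 70e9        # assumption: 70-billion-parameter model
    fp16_bytes = 2       # assumption: 16-bit weights
    print(weight_memory_gb(params, fp16_bytes))   # weights-only footprint in GB
    print(min_cards(params, fp16_bytes, 80))      # 80 GB cards
    print(min_cards(params, fp16_bytes, 24))      # 24 GB cards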

Henrique (14m25s): Hopefully in the future, we sort of squeeze these things out and make them fit smaller or the hardware gets better. But yeah, even for me, I'm just working in 2D most of the time. I think then maybe similarly, I don't know. And yet now in 3D for robotics or even 4D for a lot of the other stuff, that's gonna get harder and harder.

David (14m44s): I was doing a 500 million parameter model in 2016 for shape completion. It was all 3D convolutions. And 3D convolutions are probably one of the most expensive operations you can do. It doesn't, there was no infrastructure for training this thing, except for like the Titan card that we had in that one machine in the lab. And that was a very precious machine. If that machine ever went down, we were in big trouble. And it's so much worse. For academic institutions right now that are trying to participate in this space, how do you justify these expenses? How do you know that if you're gonna run and train this model for the next two months on whatever meager GPU supply that you have, that it's going to manifest into something that isn't completely outdated by one of the bigger players in a week from now?

Ding (15m35s): Yeah. It's about the financial states of universities versus industry. I just posted a link in a Zoom chat. I saw this this morning and it's, basically what it says is that the wealthiest private college has as much money as many big tech. I think the comparison is a bit hard to make directly because the cash market cap, like endowment funds, those are words I don't fully understand. But I think what it's trying to say is pure cash wise, it's not that far behind. It's at least in the same order of magnitudes. If we look at the list, like Alphabet, Microsoft, Apple, Amazon, if they have the budget, then in theory, all the Ivy Leagues, plus like Stanford and MIT would have the money, but it seems like they're not in this race. And it seems that they're not in any previous race. They're not in deep learning, they're not in ImageNet. Do you think the current ubiquitous situation of LLM will change their behavior? Will the university finally use their whatever cash endowment fund to kind of, I wouldn't say compete, but just start to engage in this- Join the game.

David (16m46s): Yeah, join the game. So endowments are not cash. Endowments are checks written by people, alumni, for specific purposes. And legally a university is not allowed to spend that for anything other than the purpose specified by the donor. So Harvard may have $49 billion. Most of that is probably earmarked for tuition or buildings and legally cannot be used to compete in this space. The only way Harvard likely could compete in this space is if they got a large cash endowment from somebody specifically for building a compute cluster. Even if they got a check that was a blank check, use it for whatever they want, they probably would not use it for this because they usually have other things that are higher priority that they want to spend on first.

Henrique (17m34s): I would just add to that, that the universities are often competing internally. Like even within our departments, the computer science department, it's like, do we hire another roboticist or do we hire someone in theory, someone who does languages, someone who does security, someone who does graphics. So you're competing in personnel, each one of those competing in like how much money they're allowed to use for funding students, for funding their lab and whatnot. And so, and that at a larger scale, like you were mentioning, the university has to provide a service, it has to provide this education, the facilities, the food, the sports, everything has to sort of be part of the ambience. Whereas an Alphabet, an Apple, a Meta, they can make these huge bets, right? Like they can go, okay, we're gonna, like Microsoft is going all in on a thing or Alphabet is gonna really focus on these LLM models. And so those are sort of riskier plays with more funding to back them that a university has trouble making, I guess.

David (18m30s): And a company like ByteDance is investing a billion dollars into building a super compute cluster. The cash from any one of these companies is enough to justify that level of expense. When Google and Amazon and Microsoft already have very large clusters at their fingertips, they could build an H100 cluster and get at the front of the line because of the cash that they can bring to the table. They can pay at a premium. I think what we probably wanna see for academic institutions is more funding to support this kind of AI research. And you can see this in the British government with what is unfortunately named BritGPT. They're investing 11 billion pounds into a GPT model trained specifically for use by the government. We could see very similar projects being done locally and in the United States as a national effort. And the US government could also, they have super compute clusters. They could offer access to many academic institutions, which they do, or they could build even more of them to compete in the global space, both through the academic efforts, but also through the national interest labs like the Army Research Lab.

Henrique (19m47s): It's tricky because it's for the purpose of research, it's hard to justify that much funding and that much resources building up a whole new supercomputer. But then if you spin it, it's still research, it still doesn't really work, but you spin it as a product. You say, we need the search engine to be smarter. There's like a product tied to it. Now you're competing with these other companies. And now there's a certain like hurry and urgency and rush thing. If we don't build the supercomputer, they will and they will get to market first and people will go to their product and we will lose money in the end and the stock may fall or whatever it is. Like that is an urgency that when you're doing research in academia, you're thinking things in the scale of like one, two, three years and like longer projects because you're just sort of poking at an idea and seeing what happens. In this other case, you're sort of, I need to beat my competitors to market and then get this best possible product out. Just the whole sense of urgency is why they've justified themselves by getting so big. It could have been the other way around where the universities have all these supercomputers and all the companies are really small and scrappy, but as they've made more and more money, they've learned to sort of just get bigger and bigger and apply more resources to keep their sort of majority state in the game.

David (20m58s): I think in the past that there was, it was much easier for universities to have a mainframe because many different companies sold mainframes. Right now, the only person in the field that's offering GPUs at the level required for this kind of training is NVIDIA. I can't use Intel GPUs, I can't use Apple GPUs, I can't use TPUs because they're not publicly available. And so it makes it so that the price is driven up so much. Yeah, I know you can. The price is driven up a lot because of the demand for this singular product that nobody else can really have access to. And it's not really the market's fault. I don't think anybody really could have foreseen these things being so useful in this way. It's just an unfortunate side effect of this is a company that has a fantastic product and it's very difficult to make.

Henrique (21m52s): We should make a new podcast on how to fix the markets. At least a recording. The blind leading the blind. How do we fix the market?

Ding (22m01s): Yeah, I think overall, academic institutions, they can't compete in terms of the infrastructure. I also see that in some of the companies, like not all companies are great at building infrastructure. Like some, especially smaller ones, they just don't have the incentive to build infrastructure. And they would just say, we're going to use AWS or we're going to use whatever that's out there and we create an application layer on top of that. And it seems like academia, at least in the realm of LLMs and Gen AI, is also heading that way. They're coming up with cool add-ons on top of GPT or Llama or whatever that's provided to them. And I don't know if that's a good thing or a bad thing. In the deep learning era, we've already seen that the industry kind of dominates all the big models and they productize it really well. And the user experience is good and in the short term, it's a win-win-win for everyone, like the customer, the company, universities, they get funding from companies and professors, they get double hired, they make a lot of money. So it's win-win-win. But medium to longer term, I'm not sure if that model is the best and yeah.

David (23m07s): I think that when you are, if you can democratize access to this level of compute in any way, it's a net win. If Amazon is offering free compute credits to a lab so that they can participate in this space, I don't really see a downside to that. It's just-

Ding (23m24s): What's the incentive for Amazon to do that then?

David (23m27s): Well, they already are doing that.

Ding (23m30s): But not at the scale, like I don't know about the specific numbers, but I would be surprised if Amazon provided enough compute to train, to iterate on a GPT-4 level model. It may, they might provide like $10,000, like $100,000, but those will burn like in an hour.

David (23m45s): So they will, if you're using all of those GPUs to train a model from scratch, for fine tuning, that is more than enough compute. Fine tuning can happen at a much smaller scale. And the incentive for them to do that is one, they get to put their name on more publications, which helps build their brand, but also shows that they are using their compute for good. I think it is a social good if you're collaborating with universities. And I think third also, it allows them to hire very talented people who are already trained to use the infrastructure that they'll use when working at Amazon. So the more of this that we see from all of the big companies with this level of compute, I think it's a net win. These are highly trained researchers who can easily slot into these very large companies to help build more interesting models later on. I want to definitely see more of that.

Henrique (24m36s): I see. As long as we are still working towards the same goals and not sort of just computing for competition's sake or for product's sake or for the shareholder, I feel like it's like, that's where research sort of thrives. You get access to more resources and you get to ask more interesting questions. But with the thought of sort of pushing science forward, not necessarily the bottom.

David (24m56s): It allows Amazon to have essentially pure researchers working for them without requiring them to do all of the additional work that would be employing this person and getting them closer to a PhD. It would not be in their best interest. And this is true for any company. I'm just using Amazon as an example. It would not be in their interest to fund this person and then have them work exclusively on products because then they're never going to graduate. And all of us know how long this thing takes already. We don't need more distractions.

Henrique (25m29s): Yeah, more obstacles to the defense is not what we need.

David (25m34s): No. And I don't think anybody would want to work with somebody who was just essentially getting cheaper labor out of them, which I have not seen from any of the companies.

Henrique (25m44s): I think we've come a long way from robotics and Gen AI and LLMs. I think if there's any final comments on the outlook of robotics sort of moving forward, or is this an exciting time for the next couple of years? Is this a more stressful time because there's so many competitors in the game? Or what's the pulse check right now?

David (26m04s): So I can sort of just read the mission statement of my company. Our mission is to solve the most important and fundamental challenges in AI and robotics to enable future generations of intelligent machines that will help us all live better lives.

Henrique (26m19s): I think- Sounds good to me.

David (26m22s): We need companies like ours that are solely invested in pushing robotics forward because robotics is such a difficult problem. It requires so much R&D to get it to a place where we have really good products. Like I said earlier, we're not using 100% of the hardware's capability right now. I don't even think we're using many tens of percent of the hardware's capability. And the more of us participating in this space, the better I think it will be. I'm not interested in being the first to do foundation models for robotics. If I were, that would be great, but I already am not. I'm interested to see how they can be used in this space. And anybody working on it is a friend of mine.

Henrique (27m03s): Awesome. Well, I hope we get to hear some comments from friends of yours then, in that case, what they have to say. And I'm really glad you came on and enlightened us so much on what's happening in the world of robotics. We really had no clue. So we thank you for that.

David (27m17s): Thanks for having me.

Ding (27m18s): Thanks for joining for another episode. We hope you enjoyed it and learned something new. If you have any feedback, questions or comments, we'd love to hear from you. You can leave us a comment or send us an email at podcast@halfmaker.com. Don't forget to subscribe to get notified about new episodes. We release new shows every other week where we dive deep into the tech industry and the latest research topics. Thanks so much for tuning in today.
