Alberto Romero · Sep 17

Google's PaLM-SayCan: The First of the Next Generation of Robots

Google has entered a new path: merging AI and robotics.

PaLM-SayCan picking up an apple. Credit: Google Research

Despite what Google Search says, historically speaking, AI has had very little to do with shiny metallic robots with a human form. This doesn't seem to be the case anymore. In the last couple of years, tech companies have bet hard on AI-powered robots. And not just any type (Roomba is a useful tool but nowhere near the archetype of a robot). No. Companies are building humanoid robots.

Boston Dynamics, the eldest of the group in terms of experience in robotics, presented the latest version of Atlas in 2021. After three decades, they've got a model with somewhat decent motor and proprioceptive skills (it can do somersaults). Agility Robotics, now backed by Amazon, produces Digit, a general-purpose robot designed to work in human environments.

The fact that many high-profile tech and robotics companies are betting on humanoid robots is interesting in and of itself. I've previously argued there's good reason to make robots with human traits: the world is adapted for us in terms of height, shape, movements... These projects reveal the industry's interest in creating robots that could, as Musk said during Tesla's 2021 AI Day, "eliminate dangerous, repetitive, and boring tasks," or help us at home.

But this article isn't about humanoid robots. At least, not only. It's about a novel approach to robotics that none of the examples I mentioned above follows (yet). I'm talking about merging state-of-the-art AI systems — in particular, language models — with full-body robots that can navigate the world. The brain and the body. Some AI companies focus on building the next uber-large language model while robotics companies want the most dexterous robots, but there seems to be no overlap between the two efforts — despite it being the obvious path forward.

Moravec's Paradox and the complexity of merging AI and robotics

There are good reasons why most AI companies don't go into robotics (OpenAI dismantled its robotics branch last year) and why most robotics companies constrain their robots' scope to simple tasks or simple environments (or both). One of the main reasons is what's known as Moravec's Paradox. It says that, counterintuitively, it's very hard to make robots perform sensorimotor and perceptual tasks (e.g., pick up an apple) well enough, whereas creating AIs that can solve hard cognitive problems (e.g., play board games or pass IQ tests) is relatively easy.

To humans, it seems obvious that calculus is harder than catching a ball in the air. But that's only because calculus is recent, evolutionarily speaking; we haven't had time to master it yet. As Marvin Minsky — one of AI's founding fathers — said: "We're more aware of simple processes that don't work well than of complex ones that work flawlessly." In short, making robots that can move around and interact with their environment flawlessly is extremely hard, and very little progress has been achieved in the last decades.

That's why Google's partnership with Everyday Robots may very well be the next breakthrough in robotics: PaLM-SayCan (PSC), a (not so much) humanoid robot that has a mix of abilities the others above can only dream of.
I'm particularly interested in Google's approach because I'm an advocate for the merging of virtual AI systems and real-world robots. Regardless of whether we want to build an artificial general intelligence, this is the natural path for both disciplines. Some researchers and companies believe the scaling hypothesis holds the key to human-level intelligent AIs. I, on the contrary, believe it's critical to ground AI in the real world, both to solve current shortcomings (like AI's pervasive ignorance of how the world works, or internet datasets' biases) and to take it to the next level (reasoning and understanding require the tacit knowledge that's only acquired by exploring the world).

(Note: If you want to know more about this topic, I recommend my mostly forgotten post "Artificial Intelligence and Robotics Will Inevitably Merge.")

Google's PSC reveals that the company has finally accepted this is the way forward and has decided, not to abandon pure AI, but to give renewed interest to AI + robotics as a means to achieve more capable intelligent systems. In the end, this isn't different from training multimodal models (generally accepted as the natural next step for deep learning models). In the same way that AIs that can "see" and "read" are more powerful than those that can only perceive one mode of information, AIs — or robots — that can act, as well as perceive, will fare better in our physical world.

Let's see what Google's PSC is capable of and how it manages to combine the power of large language models with the dexterity and action capabilities of a physical robot.

PaLM-SayCan: The first of a new generation of robots

At a high level, we can understand PSC as a system that combines PaLM's mastery of natural language (PaLM is a language model, pretty much like GPT-3 or LaMDA — although slightly better) with the robot's ability to interact with and navigate the world.

A request that seems trivial to us becomes a hard problem for a robot (again, Moravec's Paradox: we needed millions of years of evolutionary progress to do these things correctly). For instance, "bring me a snack," although a seemingly simple task, comprises many different elemental actions (and the expression itself involves some degree of ellipsis and implicitness: which snack?).

PaLM provides the robot with task-grounding: it can transform a natural language request into a precise — albeit complex — task and break it down into elemental actions that are useful to complete it. Robots like Atlas or Digit can do simple tasks very well, but they can't solve 15-step requests without explicit programming. PSC can.

In return, the robot provides PaLM with contextual knowledge about the environment and itself. It gives world-grounding information that tells the language model which of the elemental actions are possible — what it can afford to do — given the external, real-world conditions.

PaLM states what's useful and the robot states what's possible. This is the key to Google's innovative design and what puts the company on top with this approach (although not necessarily in terms of accomplishments — PSC is still a research prototype, whereas Atlas and Digit are complete products). PSC combines task-grounding (what makes sense given the request) and world-grounding (what makes sense given the environment). Neither PaLM nor the robot could solve these problems by itself.
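To make that division of labor concrete, here's a minimal sketch, in Python, of how the two scores can be combined. The skill list, the numeric scores, and the helpers llm_usefulness, robot_affordance, and select_next_skill are illustrative assumptions, not Google's implementation; in the real system, the usefulness score comes from PaLM's likelihoods over skill descriptions and the affordance score from value functions the robot has learned.

```python
# A minimal sketch of SayCan-style scoring. All skill names, numbers,
# and helper functions are illustrative stand-ins, not values from the paper.

SKILLS = ["find an apple", "go to the store", "pick up an empty glass"]

def llm_usefulness(instruction: str, skill: str) -> float:
    """Task-grounding: how likely the language model thinks `skill` is
    a useful next step toward `instruction` (stubbed for illustration)."""
    scores = {
        "find an apple": 0.7,           # useful: it's a snack
        "go to the store": 0.6,         # also useful in principle
        "pick up an empty glass": 0.1,  # useless: the person wants water, not a glass
    }
    return scores[skill]

def robot_affordance(skill: str) -> float:
    """World-grounding: the value the robot assigns to `skill` succeeding
    from its current physical state (stubbed for illustration)."""
    scores = {
        "find an apple": 0.8,           # there's an apple in the kitchen
        "go to the store": 0.05,        # impossible: the robot can't take the stairs
        "pick up an empty glass": 0.9,  # perfectly feasible, just not helpful
    }
    return scores[skill]

def select_next_skill(instruction: str) -> str:
    # A step must be both useful AND possible, so the combined
    # score is the product of the two signals.
    return max(SKILLS, key=lambda s: llm_usefulness(instruction, s) * robot_affordance(s))

print(select_next_skill("bring me a snack and a drink"))  # -> "find an apple"
```

Multiplying the two scores means a step that is either useless or impossible ends up near zero no matter how high the other signal is, which is exactly the behavior the snack example below illustrates.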
Now, let's see an example of what PSC can do, how it does it, and how much better it is than the alternatives (you can read more on Google's blog).

PaLM-SayCan in action: Leveraging NLP to navigate the world

One of the examples Google researchers use in their experiments (published in the paper "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances") starts with a human request, expressed naturally: "I just worked out, please bring me a snack and a drink to recover."

This is an easy task for a person, but a traditionally designed robot wouldn't have a clue where to start. PaLM interprets the request and turns it into a high-level task: "I'll bring the person an apple and a water bottle." PaLM acts as an intermediary between the subtlety and implicitness of human language and the precise, rigid language a robot can understand.

Now that PaLM has defined a task to fulfill the user's request, it can come up with a series of useful steps to accomplish it. However, because PaLM is a virtual AI that has no contact with the world, it won't necessarily propose the best approach, only ideas that make sense for the task — without taking into account the actual setting. That's where the robot's affordances come into play. The robot, which is trained to "know" what's feasible and what isn't in its current state within the physical world, collaborates with PaLM by giving a higher value to those actions that are possible, in contrast to those that are harder or impossible.

While PaLM gives high scores to the useful actions, the robot gives high scores to the possible actions. This approach allows PSC to eventually find the best plan of action given the task and the environment. PSC takes the best of both worlds.

Going back to the example of the snack: PaLM has already decided that it should "bring the person an apple and a water bottle." It may then propose going to the store to buy an apple (useful). However, the robot would score that step very low because it can't take the stairs (impossible). On the other hand, the robot may propose taking an empty glass (possible), to which PaLM would say it's of no use to accomplish the task because the person wants the water, not the glass (useless). By taking the highest combined score from the useful and the possible proposals, PSC finally decides to go find an apple and the water in the kitchen (useful and possible).

Once the step is done, the process repeats and PSC decides the next elemental action to take from the new state — getting closer to the completion of the task at each subsequent step.
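Here's a sketch of that repeat-until-done loop. Again, everything in it is an illustrative assumption: the fixed plan inside select_next_skill (the real system re-queries the language model and the robot's value functions from the new state after every completed step), the "done" pseudo-skill that signals completion, the execute stub, and the max_steps cap.

```python
# A sketch of the step-by-step loop described above. The fixed plan,
# the "done" pseudo-skill, and `execute` are stand-ins, not Google's code.

def select_next_skill(instruction: str, steps_done: list[str]) -> str:
    """Rescore every skill given the request and the steps already taken.
    Stubbed with a fixed plan; PSC rescores from the NEW state each time."""
    plan = ["find an apple", "pick up the apple",
            "bring the apple to the person", "done"]
    return plan[len(steps_done)]

def execute(skill: str) -> None:
    print(f"robot executes: {skill}")  # stand-in for the physical robot

def run_task(instruction: str, max_steps: int = 15) -> list[str]:
    steps_done: list[str] = []
    for _ in range(max_steps):          # the robot has an upper limit on steps
        skill = select_next_skill(instruction, steps_done)
        if skill == "done":             # the model decides the request is fulfilled
            break
        execute(skill)
        steps_done.append(skill)        # completed steps condition the next decision
    return steps_done

run_task("I just worked out, please bring me a snack")
```

Note that nothing in this loop checks whether a step actually succeeded; as we'll see below, PSC has no feedback mechanism to recover when one doesn't.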
Google researchers tested PSC against two alternatives over 101 instruction tasks. One used a smaller language model fine-tuned explicitly on instruction answering (FLAN) and the other didn't use the robot affordances necessary to ground the language model in the real world. According to the paper, PSC chose the right sequence of steps 84% of the time and executed them successfully 74% of the time, performing significantly better than both alternatives — especially the one without robotic grounding.

The results reveal a promising approach to combining state-of-the-art language AI models with robots toward more complete systems that can better understand us and better navigate the world — at the same time. Still, there are shortcomings to the approach. Some are already obvious with PSC and others will become apparent once companies explore the entire scope of the problem.

What PaLM-SayCan can't do: It's hard to beat evolution

I'm going to ignore here the effectiveness of the peripheral modules (e.g., speech recognition, speech-to-text, vision sensors to detect and recognize objects, etc.), although those have to work perfectly for the robot to function (e.g., a change of lighting could render the object detection software useless and thus make PSC incapable of completing the task).

The first problem that comes to mind — and one that I've written about repeatedly in my articles — is language models' inability to understand in the human sense. I used the example of a human asking for a snack and a drink and PaLM correctly interpreting that an apple and a water bottle could do just fine. However, there's an implicit problem here that not even the best language models, like PaLM, may be able to solve in more complex scenarios.

PaLM is a very powerful autocomplete program. It's trained to accurately predict the next token given a history of tokens. Although this training objective has proved very useful for solving a broad number of language tasks satisfactorily, it doesn't provide the AI with the ability to understand humans or generate utterances with intention. PaLM outputs words, but it doesn't know why, what they mean, or the consequences they can produce. PaLM could correctly interpret the request and instruct the robot to bring an apple and water, but it would be a mindless interpretation. If it guessed incorrectly, it would have no way to know and no way to correct course.

Another problem that PSC is highly likely to encounter is an error in the robot's actions. Let's say PaLM has correctly interpreted the person's request and has come up with a sensible task. PSC has decided on a series of useful and possible steps and is acting accordingly. What if one of those actions is completed incorrectly, or the robot makes a mistake? Say it goes to pick up the apple and the apple falls to the ground and rolls into a corner. Does PSC have a feedback mechanism to reassess its state and the state of the world and come up with a new set of actions that would solve the request given the new circumstances? The answer is no.

Google has run the experiments in a very constrained lab environment. If PSC went out into the world, it would encounter a myriad of constantly changing conditions (moving objects and people, irregularities in the ground, unexpected events, shadows, wind, etc.). It would barely be able to do anything. The number of variables in the real world is virtually infinite, but PSC is trained on a dataset of other robots acting in a controlled environment. Of course, PSC is a proof of concept, so this isn't the fairest lens through which to judge its performance, but Google should keep in mind that the leap from this to a real-world working robot isn't merely a quantitative one.

These are the main language and action problems. But there are many others somewhat related to these. The task that PaLM comes up with could require more steps than the robot's upper limit, and the probability of failure increases exponentially with the number of steps needed to finish the task. The robot could encounter terrain or objects it isn't familiar with. It could find itself in a novel situation with no possibility of improvising, because of its lack of common sense.

The final shortcoming, to which I'll dedicate a whole paragraph, is that PaLM, like all other language models, is prone to perpetuating the biases it has seen during training. Interestingly, researchers from Johns Hopkins University recently analyzed a robot's behavior after it was enhanced with internet data and found that biases perpetuate beyond language: the robot's actions were racist and sexist.

Finally — and this is always an addendum to Google's AI blog posts — the company prides itself on prioritizing safety and responsible development.
PSC comes with a series of mechanisms to ensure the procedure is safe: PaLM shouldn't generate unsafe or biased proposals, and the robot shouldn't take potentially dangerous actions. Still, these problems are ubiquitous and companies don't have a standard solution. Although PSC seems to be the first of a new generation of state-of-the-art AI-powered robots, it's no different in this regard.

Subscribe to The Algorithmic Bridge. Bridging the gap between algorithms and people. A newsletter about the AI that matters to your life.

You can also support my work on Medium directly and get unlimited access by becoming a member using my referral link here! :)

Thanks to Ben Huberman
