Alberto Romero · Sep 17
Google’s PaLM-SayCan: The First of the Next
Generation of Robots
Google has entered a new path: merging AI and robotics.
PaLM-SayCan picking up an apple. Credit: Google Research
Despite what Google Search says, historically speaking, AI has had very little to do with
shiny metallic robots with a human form. This doesn’t seem to be the case anymore. In
the last couple of years, tech companies have bet hard on AI-powered robots. And not
just any type (Roomba is a useful tool but nowhere near the archetype of a robot). No.
Companies are building humanoid robots.
Boston Dynamics, the eldest of the group in terms of experience in robotics, presented
the latest version of Atlas in 2021. After three decades, they've got a model with
somewhat decent motor and proprioceptive skills (it can do backflips). Agility
Robotics, now backed by Amazon, produces Digit, a general-purpose robot that can do
a variety of tasks in human environments.

The fact that many high-profile tech and robotics companies are betting on humanoid
robots is interesting in and of itself. I've previously argued there's good reason to make
robots with human traits: the world is adapted for us in terms of height, shape,
movements... These projects reveal the industry's interest in creating robots that could,
as Musk said last year during the 2021 AI day, “eliminate dangerous, repetitive, and
boring tasks,” or help us at home.
But this article isn't about humanoid robots. At least not only. It’s about a novel
approach to robotics that none of the examples I mentioned above follows (yet). I'm
talking about merging state-of-the-art AI systems — in particular language models —
with full-body robots that can navigate the world. The brain and the body. Some AI
companies focus on building the next uber-large language model while robotics
companies want the most dexterous robots, but there seems to be no overlap —
despite it being the obvious path forward.
Moravec's Paradox and the complexity of merging AI and robotics
There are good reasons why most AI companies don't go into robotics (OpenAI
dismantled its robotics branch last year) and why most robotics companies constrain
their robots’ scope to simple tasks or simple environments (or both). One of the main
reasons is what's known as Moravec's Paradox. It says that, counterintuitively, it's very
hard to make robots perform sensorimotor and perceptual tasks (e.g., pick up an
apple) well enough, whereas creating AIs that can solve hard cognitive problems (e.g.,
play board games or pass IQ tests) is relatively easy.
To humans, it seems obvious that calculus is harder than catching a ball in the air. But
that's only because calculus is relatively recent evolutionarily speaking. We haven't had
time to master it yet. As Marvin Minsky — one of AI's founding fathers — says: “We're
more aware of simple processes that don’t work well than of complex ones that work
flawlessly.” In short, making robots that can move around and interact with their
environment flawlessly is extremely hard (and very little progress has been achieved in
the last few decades).

That's why Google's latest partnership may very well be the next breakthrough in
robotics: PaLM-SayCan (PSC), a (not quite) humanoid robot that has a mix of abilities
the others above can only dream of.
I'm particularly interested in Google's approach because I'm an advocate for the
merging of AI virtual systems and real-world robots. Regardless of whether we want to
build an artificial general intelligence, this is the natural path for both disciplines.
Some researchers and companies believe that the scaling hypothesis holds the key to
human-level intelligent AIs. I, on the contrary, believe it's critical to ground AI in the
real world both to solve current shortcomings (like AI's pervasive ignorance of how the
world works or internet datasets’ biases) and take it to the next level (reasoning and
understanding require the tacit knowledge that’s only acquired by exploring the
world).
(Note: If you want to know more about this topic, I recommend my mostly forgotten
post “Artificial Intelligence and Robotics Will Inevitably Merge.”)
Google's PSC reveals that the company has finally accepted this is the way forward and
has decided not to abandon pure AI, but to give renewed attention to AI + robotics as a
means to achieve more capable intelligent systems. In the end, this isn't different from
training multimodal models (generally accepted as the natural next step for deep
learning models). In the same way that AIs that can “see” and “read” are more powerful
than those that can perceive only one mode of information, AIs — or robots — that
can act, as well as perceive, will fare better in our physical world.
Let's see what Google’s PSC is capable of and how it manages to combine the power of
large language models with the dexterity and action capabilities of a physical robot.
PaLM-SayCan: The first of a new generation of robots
At a high level, we can understand PSC as a system that combines PaLM's mastery of
natural language (PaLM is a language model, pretty much like GPT-3 or LaMDA —
although slightly better) with the robot's ability to interact with and navigate the world.

Simple physical tasks are hard for a robot (we needed millions of years of
evolutionary progress to do them correctly). For instance, “bring me a snack,” although
a seemingly simple task, comprises many different elemental actions (and the
expression itself involves some degree of ellipsis and implicitness; “which snack?”).
PaLM provides the robot with task-grounding: it can transform a natural language
request into a precise — albeit complex — task and break it down into elemental
actions that are useful to complete it. Robots like Atlas or Digit can do simple tasks very
well, but they can’t solve 15-step requests without explicit programming. PSC can.
In return, the robot provides PaLM with contextual knowledge about the environment
and itself. It gives world-grounding information that can tell the language model which
of the elemental actions are possible — what it can afford to do — given the external,
real-world conditions.
PaLM states what's useful and the robot states what's possible. This is the key to
Google's innovative design and what puts the company on top with this approach
(although not necessarily in terms of accomplishments — PSC is still a research
prototype whereas Atlas and Digit are complete products). PSC combines task-
grounding (what makes sense given the request) and world-grounding (what makes
sense given the environment). Neither PaLM nor the robot could solve these problems
by themselves.
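The division of labor just described, with PaLM scoring usefulness and the robot scoring feasibility, can be sketched in a few lines of Python. This is an illustrative sketch, not Google's code: the function names and score values are hypothetical stand-ins for PaLM's log-likelihoods and the robot's learned affordance model.

```python
def select_skill(instruction, history, skills, llm_score, affordance_score):
    """Pick the skill that is both useful (per the language model)
    and possible (per the robot's affordance model)."""
    best_skill, best_score = None, float("-inf")
    for skill in skills:
        usefulness = llm_score(instruction, history, skill)  # task-grounding
        feasibility = affordance_score(skill)                # world-grounding
        combined = usefulness * feasibility
        if combined > best_score:
            best_skill, best_score = skill, combined
    return best_skill
```

Multiplying the two scores means that a step proposed as useful but physically impossible (or possible but useless) ends up with a low combined score, so neither side of the system can dominate the decision on its own.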
Now, let's see an example of what PSC can do, how it does it, and how much better it
is compared to alternatives (read more on Google's blog).
PaLM-SayCan in action: Leveraging NLP to navigate the world
One of the examples Google researchers use in their experiments (published in the
paper “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances”) starts
with a human request, expressed naturally: “I just worked out, please bring me a snack
and a drink to recover.”
This is an easy task for a person, but a traditionally designed robot wouldn't have a
clue where to start. PaLM can interpret the request and turn it into a high-level task: “I'll bring
the person an apple and a water bottle.”
PaLM acts as an intermediary between the subtlety and implicitness of human
language and the precise, rigid language a robot can understand. Now that PaLM has
defined a task to fulfill the user’s request, it can come up with a series of useful steps to
accomplish the task. However, because PaLM is a virtual AI that has no contact with
the world, it won't necessarily propose the best approach, only ideas that make sense
for the task — without taking into account the actual setting.
That’s where the robot affordances come into play. The robot, which is trained to
“know” what's feasible and what isn’t in its current state within the physical world, can
collaborate with PaLM by giving a higher value to those actions that are possible in
contrast to those that are harder or impossible. While PaLM gives high scores to the
useful actions, the robot gives high scores to the possible actions. This approach allows
PSC to eventually find the best plan of action given the task and the environment. PSC
takes the best of both worlds.
Going back to the example of the snack. PaLM has already decided that it should
“bring the person an apple and a water bottle.” It then may propose going to the store
to buy an apple (useful). However, the robot would score that step very low because it
can’t take the stairs (impossible). On the other hand, the robot may propose taking an
empty glass (possible), to which PaLM would say it’s of no use to accomplish the task
because the person wants the water, not the glass (useless). By taking the highest score
from both the useful and the possible proposals, PSC would finally decide to go find an
apple and the water in the kitchen (useful and possible). Once the step is done, the
process repeats and PSC decides what's the next elemental action it should take from
the new state — getting closer to the completion of the task at each subsequent step.
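The step-by-step decision process in the example above amounts to a greedy loop that re-scores every candidate action after each completed step. Below is a minimal sketch under assumed interfaces: `llm_score`, `affordance_score`, and `execute` are hypothetical stand-ins for PaLM, the robot's affordance model, and the robot's controller, and the skill names are invented for illustration.

```python
def plan_and_execute(instruction, skills, llm_score, affordance_score,
                     execute, max_steps=15):
    """Greedily pick and run one elemental action at a time, re-scoring
    all candidate skills after each step so the plan adapts to the
    new state of the world."""
    history = []  # steps completed so far, fed back to the language model
    for _ in range(max_steps):
        # Combine usefulness (language model) with feasibility (robot).
        scored = {s: llm_score(instruction, history, s) * affordance_score(s)
                  for s in skills}
        step = max(scored, key=scored.get)
        if step == "done":  # the language model signals task completion
            break
        execute(step)       # the robot carries out the elemental action
        history.append(step)
    return history
```

Feeding the growing `history` back into the language model is what lets each new choice account for what has already been accomplished, instead of planning the whole sequence blindly up front.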
Google researchers tested PSC against two alternatives over 101 instruction tasks. One
used a smaller model fine-tuned explicitly on instruction answering (FLAN) and the
other didn't use the robot affordances necessary to ground the language model in the
real world. PSC planned the correct sequence of steps 84% of the time and executed it
successfully 74% of the time, reducing errors by half compared to the alternative
without robotic grounding.
The results reveal a promising approach to combining state-of-the-art language Al
models with robots toward more complete systems that can better understand us and
better navigate the world — at the same time.
Still, there are shortcomings to the approach. Some are obvious with PSC and others
will be once companies explore the entire scope of the problem.
What PaLM-SayCan can’t do: It’s hard to beat evolution
I'm going to ignore here the effectiveness of the peripheral modules (e.g., speech
recognition, speech-to-text, vision sensors to detect and recognize objects, etc.)
although those have to work perfectly for the robot to function (e.g., a change of
lighting could render the object-detection software useless and thus make PSC
incapable of completing the task).
The first problem that comes to mind — and one that I’ve written about repeatedly in
my articles — is language models’ inability to understand in the human sense. I used
the example of a human asking for a snack and drink and PaLM correctly interpreting
that an apple and a water bottle could do just fine. However, there’s an implicit
problem here that not even the best language models, like PaLM, may be able to solve
in more complex scenarios.
PaLM is a very powerful autocomplete program. It's trained to accurately predict the
next token given a history of tokens. Although this training objective has proved very
useful for solving a broad range of language tasks satisfactorily, it doesn't provide the AI
with the ability to understand humans or generate utterances with intention. PaLM
outputs words but it doesn’t know why, what they mean, or the consequences they can
produce.
PaLM could correctly interpret the request and give the robot the instruction to bring
an apple and water, but it would be a mindless interpretation. If it guessed incorrectly,
PSC would fail the request and have no way to realize it.
Another problem that PSC is highly likely to encounter is an error in the robot's
actions. Let's say PaLM has correctly interpreted the person's request and has come up
with a sensible task. PSC has decided upon a series of useful and possible steps and is
acting accordingly. What if one of those actions is completed incorrectly or the robot
makes a mistake? Say it goes to pick up the apple, which falls to the ground and
rolls to the corner. Does PSC have a feedback mechanism to reassess its state and the
state of the world to come up with a new set of actions that would solve the request
given the new circumstances? The answer is no.
Google has conducted the experiments in a very constrained lab environment. If PSC went
out into the world, it would encounter a myriad of constantly changing conditions
(moving objects and people, irregularities in the ground, unexpected events, shadows,
wind, etc.). It would barely be able to do anything. The number of variables in the
real world is virtually infinite, but PSC is trained on a dataset of other robots acting in
a controlled environment. Of course, PSC is a proof of concept, so this isn't the fairest
lens to judge its performance, but Google should keep in mind that the leap from this
to a real-world working robot isn't merely a quantitative one.
These are the main language and action problems. But there are many others
somewhat related to them: The task that PaLM comes up with could require more
steps than the robot's upper limit. The probability of failure increases
exponentially with the number of steps needed to finish the task. The robot could find
terrain or objects it isn't familiar with. It could find itself in a novel
situation without the possibility of improvising because of its lack of common sense.
The final shortcoming, to which I'll dedicate a whole paragraph, is that PaLM, like
all other language models, is prone to perpetuating the biases it has seen during
training. Interestingly, researchers from Johns Hopkins University recently analyzed a
robot's behavior after it was enhanced with internet data and found that biases
perpetuate beyond language: the robot's actions were racist and sexist.
Finally, and this is a constant in Google's AI blog posts, the company prides
itself on prioritizing safety and responsible development. PSC comes with a series of
mechanisms to ensure the procedure is safe: PaLM shouldn't generate unsafe or biased
proposals and the robot shouldn't take potentially dangerous actions. Still, these
problems are ubiquitous and companies don't have a standard solution. Although PSC
seems to be the first of a new generation of state-of-the-art AI-powered robots, it's no
different in this regard.
Subscribe to The Algorithmic Bridge. Bridging the gap between algorithms and people. A
newsletter about the AI that matters to your life.
You can also support my work on Medium directly and get unlimited access by becoming a
member using my referral link here! :)
Thanks to Ben Huberman