
GPT-4 can improve itself by reflecting on its mistakes and learning from them. Even if the world does pause AI development, GPT-4 will keep getting smarter. Drawing upon the stunning Reflexion paper and three other papers released only in the last 72 hours, I will show you not only how GPT-4 is breaking its own records but also how it's helping AI researchers develop better models. I will also cover the groundbreaking HuggingGPT model, which, like a centralized brain, can draw upon thousands of other AI models to combine tasks like text-to-image, text-to-video, and question answering.

The Reflexion paper and the follow-up Substack post that caught global attention were released only a week ago, and yes, I did read both, but I also reached out to the lead author, Noah Shinn, and discussed their significance at length. Others picked up on the results, with the legendary Andrej Karpathy, of Tesla and OpenAI fame, saying that this metacognition strategy revealed that we haven't yet seen the max capacity of GPT-4. So what exactly was found?
Here is the headline result. I'm going to explain and demonstrate what was tested in a moment, but look how they used GPT-4 itself to beat past GPT-4 standards using this Reflexion technique. This isn't any random challenge: this is HumanEval, a coding benchmark designed by some of the most senior AI researchers just two years ago. The designers included Ilya Sutskever of OpenAI fame and Dario Amodei, who went on to found Anthropic. These are realistic, handwritten programming tasks that assess language comprehension, reasoning, algorithms, and mathematics. So how exactly did GPT-4 improve itself and beat its own record? Because remember, in the distant past of two weeks ago, the GPT-4 technical report had it scoring 67 percent, not 88 percent. Well, here is an example from page 9 of the Reflexion paper.
As you can read in the caption, this was a HotpotQA trial, designed specifically so that models needed to find multiple documents and analyze the data in each of them to come up with the correct answer. Notice how initially a mistake was made by the model on the left, and then, at the bottom, the model reflected on how it had gone wrong in a self-contained loop. It then came up with a better strategy and got it right. The authors put it like this: we hypothesize that LLMs (large language models) possess an emergent property of self-reflection, meaning that earlier models couldn't do this, or couldn't do it as well. It's a bit like GPT models are learning how to learn.

In case you think it was a model blindly trying again and again until it was successful: no, it wasn't. This was another challenge, called ALFWorld, and look at the difference between success without reflection and success with reflection. I discussed this, of course, with the lead author, and the goal was to distinguish genuine learning curves (self-improvement) from simple probabilistic success over time. If you're wondering about ALFWorld, by the way, it's about aligning text and embodied environments for interactive learning: for example, in a simulated environment, the model had the task of putting a pan on the dining table, and it had to understand and act on that prompt.
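The loop I've just described (generate, evaluate, reflect, retry with the reflection in memory) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `generate`, `evaluate`, and `reflect` stand in for language-model calls, so they are passed in as arguments and stubbed with toy functions below.

```python
def reflexion_loop(task, generate, evaluate, reflect, max_trials=3):
    """Generate an answer, and on failure feed a self-reflection back in.

    `generate(task, memory)` proposes an answer, `evaluate(task, answer)`
    returns (passed, feedback), and `reflect(task, answer, feedback)`
    produces a verbal self-reflection stored in `memory` for the next trial.
    In the real system these would all be LLM calls; here they are injected
    so the loop itself can be run with stubs.
    """
    memory = []  # episodic memory of past self-reflections
    answer = None
    for _ in range(max_trials):
        answer = generate(task, memory)
        passed, feedback = evaluate(task, answer)
        if passed:
            return answer
        memory.append(reflect(task, answer, feedback))
    return answer  # best effort after exhausting trials


if __name__ == "__main__":
    # Toy stubs: the "model" only succeeds once a reflection is in memory.
    gen = lambda task, mem: "right" if mem else "wrong"
    ev = lambda task, ans: (ans == "right", "answer was incorrect")
    refl = lambda task, ans, fb: f"Last time I answered '{ans}': {fb}"
    print(reflexion_loop("toy task", gen, ev, refl))  # → right
```

The key design point is that the reflection is written in plain language and carried forward as context, rather than any weights being updated.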
So, as you can see, this ability to reflect doesn't just help with coding; it helps with a variety of tasks. At this point I want to quickly mention something. I know there will be a couple of well-versed insiders who say: didn't GPT-4 actually get 82 percent on HumanEval in the Sparks of AGI paper? Of course, I did a video on that paper too, and asked the author of Reflexion about this point. There are a few possibilities, such as prompting changes and the Sparks authors having access to the raw GPT-4 model, but either way it is the relative performance gain that matters: whichever baseline you start with, GPT-4 can improve on it with reflection. And the 88 percent figure is not a cap; the author has observed results in the last few hours as high as 91 percent.
But before I go on, I can't resist showing you the examples I found through experimentation and also shared with the author. Take this prompt that I gave GPT-4: write a poem in which every word begins with "e". Now, as you can see, it did a good job, but it didn't fully get it right; look at the word "ascent", for example. Without mentioning anything specific, I then just wrote: did the poem meet the assignment? Not even a particularly leading question, because of course it could have just said yes. GPT-4 then said: apologies, it appears the poem I provided did not meet the assignment requirements; not every word begins with the letter "e". Here is a revised poem with every word beginning with the letter "e". Remember, I didn't help it at all, and look at the results: every word begins with "e". How far can we take this? For the next example, I chose mathematics.
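The pattern I'm using here is just a two-turn conversation: the model's first answer goes back into the context together with a generic critique question. A minimal sketch of how those messages would be assembled (this uses the common chat-API convention of role/content dictionaries; the model call itself is omitted, and the wording is just my example):

```python
def reflection_followup(prompt, first_answer,
                        critique="Did the answer meet the assignment?"):
    """Build the message list for a generic self-reflection follow-up turn."""
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": first_answer},
        {"role": "user", "content": critique},  # deliberately non-leading
    ]

msgs = reflection_followup(
    "Write a poem in which every word begins with 'e'.",
    "Eagles eye every ascent...",  # flawed first attempt, as in the video
)
print(msgs[-1]["content"])  # → Did the answer meet the assignment?
```

The point of keeping the critique generic is exactly the one made above: the model is free to answer "yes", so any correction it makes is genuine self-reflection.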
I asked: write me a five-question multiple-choice quiz to test my knowledge of probability, with correct answers and explanations at the bottom; there should only be one correct answer per question. It comes up with a decent quiz, but notice a problem in question three, for example: the probability of drawing an ace or a king is indeed 8 out of 52, but that simplifies down to 2 out of 13. So two of the answers are correct, and I explicitly asked for it not to do this in the prompt. So can the model self-reflect with mathematics? Kind of, almost. Look what happens. First, I give a vague response, saying: did the quiz meet the assignment? GPT-4 fumbles this and says yes, the quiz did meet the assignment. Hmm. So I tried: did the quiz meet all of the requirements? And GPT-4 says yes. So I did have to help it a bit, and said: did the quiz meet the requirement that there should only be one correct answer per question? That was just enough to get GPT-4 to self-reflect properly, and it corrected the mistake. I must say, it didn't self-correct perfectly; notice it identified C and D as being correct and equivalent when it was B and D. But despite making that mistake, it was able to correct the quiz.

In case you're wondering, the original ChatGPT, running GPT-3.5, can't self-reflect as well. I went back to the poem example, and not only was the generated poem full of words that didn't begin with "e", the self-reflection was also lacking: I said, did the poem meet the assignment, and it said yes, the poem meets the assignment.

As the lead author Noah Shinn put it: with GPT-4, we are shifting the accuracy bottleneck from correct syntactic and semantic generation to correct syntactic and semantic test generation. In other words, if a model knows how to test its outputs accurately, that might be enough even if its initial generations don't work; it just needs to be smart enough to know where it went wrong.
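Shinn's point about test generation suggests a simple harness: have the model produce several candidate solutions plus some tests, then keep only a candidate that passes. This is my own illustrative sketch, not code from the paper; the model calls are replaced by plain Python values, whereas in practice both the candidates and the tests would come from the LLM.

```python
def pick_passing(candidates, tests):
    """Return the first candidate function that passes every generated test.

    `candidates` are callables (stand-ins for model-written code) and
    `tests` are (input, expected) pairs (stand-ins for model-written tests).
    """
    for fn in candidates:
        try:
            if all(fn(x) == want for x, want in tests):
                return fn
        except Exception:
            continue  # a crashing candidate simply fails
    return None

# Toy example: two model "attempts" at squaring a number.
buggy = lambda x: x * 2      # wrong: doubles instead of squares
fixed = lambda x: x * x
best = pick_passing([buggy, fixed], tests=[(3, 9), (4, 16)])
print(best(5))  # → 25
```

Notice that the harness never needs to generate a correct answer itself; it only needs the tests to be accurate, which is exactly where Shinn says the bottleneck has moved.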
Others are discovering similar breakthroughs. This paper, from just three days ago, comes up with its own self-improvement technique: they get GPT-4 to frame its dialogue as a discussion between two agent types, a researcher and a decider, a bit like a split personality, one identifying crucial problem components and the other deciding how to integrate that information. Here is an example, with GPT-4's initial medical care plan being insufficient in crucial regards. The model then talks to itself, as a researcher and as a decider, and then, lo and behold, it comes up with a better final care plan. The points in bold were added by GPT-4 to its initial care plan after discussions with itself. And the results are incredible: physicians chose the final summary produced by this DERA dialogue over the initial GPT-4-generated summary 90 to 10. That's the dark red versus the pink; I'm colorblind, but even I can see there's a pretty big difference. The authors also introduced hallucinations at different levels (low, medium, and high) and wanted to see whether this dialogue model would reduce those hallucinations. These are different medical gradings, and you can see that pretty much every time, it did improve quite dramatically.
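The researcher/decider split described above is easy to sketch: two prompts alternate over a shared working draft, one surfacing issues and the other deciding which to fold in. Here is a minimal illustration with the two model calls stubbed out as injected functions; the names `researcher` and `decider` echo the paper's agent roles, but the code and toy medical example are entirely my own invention.

```python
def dera_dialogue(draft, researcher, decider, rounds=3):
    """Alternate a critic ('researcher') and an editor ('decider') over a draft.

    `researcher(draft)` returns a list of issues (or [] when satisfied);
    `decider(draft, issues)` returns a revised draft. Both would be LLM
    calls in the real system; here they are injected so the loop can run.
    """
    for _ in range(rounds):
        issues = researcher(draft)
        if not issues:
            break  # the researcher has nothing left to flag
        draft = decider(draft, issues)
    return draft

# Toy stubs: the researcher flags a missing point once; the decider adds it.
res = lambda d: [] if "follow-up" in d else ["no follow-up appointment"]
dec = lambda d, issues: d + " Schedule a follow-up appointment."
print(dera_dialogue("Care plan: rest and fluids.", res, dec))
```

The bold additions to the care plan in the paper's figure correspond to what the decider folds in here: material the researcher flagged as missing from the first draft.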
They get a model to recursively criticize and improve its own output, and find that this process of reflection outperforms chain-of-thought prompting. They tested their model on MiniWoB++, which is a challenging suite of web-browser-based tasks for computer control, ranging from simple button clicking to complex form filling. Here it is deleting files, clicking on like buttons, and switching between tabs. A bit like my earlier experiments, they gave it a math problem and said: review your previous answer and find problems with your answer. This was a slightly more leading prompt, but it worked. They then said: based on the problems you found, improve your answer. And then the model got it right. Even if you take nothing else from this video, just deploying this technique will massively improve your outputs from GPT-4. But we can go much further, which is what the rest of the video is about.

Before I move on, though, I found it very interesting that the authors say this technique can be viewed as using the LLM's output to write to an external memory, which is later retrieved to choose an action. Going back to Karpathy: remember that this critique-retry metacognition strategy isn't the only way that GPT-4 will beat its own records. The use of tools, as he says, will also be critical.
Less than 72 hours ago, this paper was released, and arguably it is as significant as the Reflexion paper. It's called HuggingGPT, and, as the authors put it, it achieves impressive results in language, vision, speech, and other challenging tasks, which "paves a new way towards AGI". Essentially, what the paper did is use language as an interface to connect numerous AI models for solving complicated AI tasks. It's a little bit like a brain deciding which muscle to use to complete an action. Take this example. The prompt was: can you describe what this picture depicts and count how many objects are in the picture? The model, which was actually ChatGPT, not even GPT-4, used two different tools to execute the task: one model to describe the image and one model to count the objects within it.

And if you didn't think that was impressive, what about six different models? The task was this: please generate an image where a girl is reading a book, and her pose is the same as the boy in the image given; then please describe the new image with your voice. The central language model, or brain, which was ChatGPT, had to delegate appropriately. All of these models, by the way, are freely available on Hugging Face. The first model was used to analyze the pose of the boy; the next transposed that pose into an image; then it generated an image, detected an object in that image, broke that down into text, and turned that text into speech. It did all of this, and notice how the girl is in the same pose as the boy, same head position and arm position. And then, as a cherry on top, the model read out loud what it had accomplished. This example actually comes from another paper.
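The controller pattern at work here, in which the language model plans sub-tasks and dispatches each to a specialist model, can be sketched as a tiny dispatcher. This is my own illustrative skeleton, not the paper's code: the registry names and the hard-coded plan are invented stand-ins for LLM-driven task planning and real Hugging Face model calls.

```python
def run_pipeline(plan, registry, data):
    """Dispatch a planned list of sub-tasks to specialist 'models' in order.

    `plan` is the controller LLM's task decomposition (here, hard-coded),
    `registry` maps task names to models (here, plain functions), and each
    stage's output feeds the next, like HuggingGPT chaining models.
    """
    for task in plan:
        data = registry[task](data)
    return data

# Invented stand-ins for the specialist models in the six-model example.
registry = {
    "pose-detection": lambda img: f"pose({img})",
    "pose-to-image":  lambda pose: f"girl_image[{pose}]",
    "image-to-text":  lambda img: f"caption of {img}",
    "text-to-speech": lambda txt: f"audio({txt})",
}
plan = ["pose-detection", "pose-to-image", "image-to-text", "text-to-speech"]
print(run_pipeline(plan, registry, "boy.jpg"))
```

The interesting part in the real system is not the dispatch loop but the planning step: the controller reads the model descriptions on Hugging Face in natural language and decides which to call, which is why the authors describe language as the interface.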
Released four days ago, it's called TaskMatrix.AI. Remember how the original Toolformer paper used only five APIs? This paper proposes that we could soon use millions. In this example, the model is calling different APIs to answer questions about the image, caption the image, and do outpainting from the image, extending it from a simple single flower to this 4K image. Going back to HuggingGPT, we can see how it deciphers these inscrutable invoices and reads them out loud, and it can even perform text-to-video, with an astronaut walking in space.
At this point, I can't resist showing you what CGI video editing might soon be possible with AI. Here's Wonder Studio, which is backed by Steven Spielberg: "Welcome to Wonder Studio, where making movies with CGI is as simple as selecting your actor and assigning a character. The system uses AI to track the actor's performance across cuts and automatically animates, lights, and composes the CG character directly into the scene. Whether it's one shot or a full sequence, Wonder Studio analyzes and captures everything, from body motion, lighting, compositing, and camera motion, and it even tracks the actor's facial performance."
These advancements do seem to be accelerating, and they require fewer and fewer humans. This paper showed, back in the before-times of October, that models didn't need carefully labeled human datasets and could generate their own. Going back to the "language models can solve computer tasks" paper, the authors seem to concur. They said that previously, significant amounts of expert demonstration data were still required to fine-tune large language models; on the contrary, the agent they suggest needs fewer than two demonstrations per task on average and doesn't necessitate any fine-tuning. This reminded me of the Alpaca model, which fine-tuned its answers based on the outputs of another language model: human experts were needed briefly at the start, but far less than before, a bit like a child no longer needing a parent. Except maybe GPT-4 is on growth steroids. Ilya Sutskever of OpenAI put it like this: "I mean, already, mostly, data for reinforcement learning is coming from AIs. The humans are being used to train the reward function, but then the reward function, and its interaction with the model, is automatic, and all the data that's generated during the process of reinforcement learning is created by AI."

Before I end, I should point out that these recursive self-improvements are not limited to algorithms and APIs; even hardware is advancing more rapidly due to AI. This week we had this from Reuters: Nvidia on Monday showed new research that explains how AI can be used to improve chip design, and by the way, this includes the new H100 GPU. They say that the Nvidia research took reinforcement learning and added a second layer of AI on top of it to get even better results. And to go back to where we started: the GPT-4 technical report showed that even with compute alone, not self-learning, we can predict with a high degree of specificity the future performance of models like GPT-5 on tasks such as HumanEval.

These accelerations of AI are even giving the CEO of Google whiplash, and I can't help feeling that there is one more feedback loop to point out: as one company, like OpenAI, makes breakthroughs, it puts pressure on other companies, like Google, to catch up. Apparently Bard, which has been powered by LaMDA, will soon be upgraded to the more powerful PaLM model. With self-improvement, tool use, hardware advances, and now commercial pressure, it is hard to see how AI will slow down. And of course, as always, I will be here to discuss it all. Thank you for watching to the end, and have a wonderful day.
