
ChatGPT is not all you need.

A State of the Art Review of large Generative AI models
Authors: Roberto Gozalo-Brizuela, Eduardo C. Garrido-Merchán
Quantitative Methods Department, Universidad Pontificia Comillas, Madrid, Spain
Presenter: Diwash Shrestha

Date: 13-06-2023

1. Introduction
● Generative AI models are possible thanks to architectures such as transformers, generative adversarial networks, and variational autoencoders
● The paper proposes a taxonomy of the main generative models in industry and analyzes them by category
● It reviews the applications of these models and the content they generate

2. Taxonomy of Generative AI
● Input-to-output format of the model

● Timeline of model releases

● Developer of the model

2.1 Input-to-Output Format of the Model

2.2 Timeline of Model Releases

2.3 Developer of the Model
● Building generative AI models requires huge computational resources and large datasets
● They are therefore developed mostly by large companies, often in collaboration with academia

3. Generative AI Model Categories

1. Text-to-Image models
2. Text-to-3D models
3. Image-to-Text models
4. Text-to-Video models
5. Text-to-Audio models
6. Text-to-Text models
7. Text-to-Code models
8. Text-to-Science models
9. Other models

3.1 Text to Image Models
DALL·E 2

● Generates images and art from a natural-language text prompt
● Uses the Contrastive Language-Image Pre-training (CLIP) neural network
● CLIP learns the relation between textual semantics and their visual representations
● A prior model maps the CLIP text embedding to an image embedding, and a GLIDE-based diffusion decoder generates the image from it
● Applications: synthetic data generation and image editing

Image generated from the prompt "A cat wearing a beret and black turtleneck". Image from https://www.assemblyai.com/blog/how-dall-e-2-actually-works/
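A minimal conceptual sketch of this two-stage pipeline, with stand-in linear layers for the real CLIP text tower, prior, and GLIDE-style diffusion decoder (all names and sizes here are illustrative assumptions, not DALL·E 2's actual implementation):

```python
import torch
import torch.nn as nn

class TextToImagePipeline(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.text_encoder = nn.Linear(1000, dim)    # stand-in for CLIP's text tower
        self.prior = nn.Linear(dim, dim)            # maps text embedding -> image embedding
        self.decoder = nn.Linear(dim, 3 * 64 * 64)  # stand-in for the GLIDE-style decoder

    def forward(self, text_features):
        text_emb = self.text_encoder(text_features)  # CLIP text embedding
        image_emb = self.prior(text_emb)             # prior predicts a CLIP image embedding
        pixels = self.decoder(image_emb)             # decoder renders an image from it
        return pixels.view(-1, 3, 64, 64)

pipe = TextToImagePipeline()
fake_text = torch.randn(1, 1000)  # placeholder for an encoded prompt
image = pipe(fake_text)           # (1, 3, 64, 64)
```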

3.1 Text to Image Models

Imagen

● Based on a pretrained text encoder that maps text to a sequence of word embeddings
● A cascade of conditional diffusion models maps the embeddings to images of increasing resolution
● Found that large language models pretrained on text-only corpora are very effective at encoding text for image synthesis
● Increasing the size of the language model boosts both sample fidelity and image-text alignment more than increasing the size of the image diffusion model

Image generated from the prompt "A Golden Retriever dog wearing a blue checkered beret and red dotted turtleneck". https://imagen.research.google/
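A sketch of the text-conditioning step, using a small T5 encoder from Hugging Face transformers as a stand-in for Imagen's frozen text encoder (the "t5-small" model id and the cascade comments are illustrative assumptions, not Imagen's released code):

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small").eval()  # frozen, text-only encoder

prompt = "A Golden Retriever dog wearing a blue checkered beret"
tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    text_embeddings = encoder(**tokens).last_hidden_state  # (1, seq_len, d_model)

# The cascade then consumes these embeddings (indicated here only by comments):
# base model:  text_embeddings -> 64x64 image     (conditional diffusion)
# upsampler 1: 64x64  -> 256x256                  (conditional diffusion)
# upsampler 2: 256x256 -> 1024x1024               (conditional diffusion)
```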

3.1 Text to Image Models

Stable Diffusion

● Open-source model developed by the CompVis group at LMU Munich


● Uses a latent diffusion model, which performs image-generation and modification operations in latent space
● Working in latent space makes Stable Diffusion faster than earlier models that operated in pixel space

Image generated from the prompt "A cat wearing a beret and black turtleneck".


https://huggingface.co/spaces/stabilityai/stable-diffusion-1
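Because the weights are open source, generation fits in a few lines with the Hugging Face diffusers library; a minimal sketch (the model id matches the CompVis v1.4 release, all other settings are the library's defaults):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # latent-space diffusion is faster, but still wants a GPU

image = pipe("A cat wearing a beret and black turtleneck").images[0]
image.save("cat.png")
```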

3.1 Text to Image Models

Muse

● Muse is a fast model that uses a transformer architecture instead of latent diffusion


● More efficient due to the use of discrete tokens, requiring fewer sampling iterations (see the sketch below)
● At inference time, Muse is 10x faster than Imagen-3B and 3x faster than Stable Diffusion v1.4

https://blog.metaphysic.ai/muse-googles-super-fast-text-to-image-model-abandons-latent-diffusion-for-transformers/
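A toy sketch of the parallel, iterative decoding idea behind Muse (MaskGIT-style): start from fully masked discrete image tokens and unmask the most confident predictions each round. The toy_transformer below returns random logits and merely stands in for the real masked-token predictor:

```python
import torch

VOCAB, SEQ, MASK_ID = 8192, 256, 8191

def toy_transformer(tokens):
    # stand-in for the masked token predictor: random logits
    return torch.randn(tokens.shape[0], tokens.shape[1], VOCAB)

tokens = torch.full((1, SEQ), MASK_ID)   # start fully masked
for step in range(8):                    # a handful of rounds vs. ~1000 diffusion steps
    logits = toy_transformer(tokens)
    probs, preds = logits.softmax(-1).max(-1)
    still_masked = tokens == MASK_ID
    # unmask the most confident predictions this round
    k = max(1, int(still_masked.sum() * 0.5))
    conf = torch.where(still_masked, probs, torch.tensor(-1.0))
    idx = conf.topk(k, dim=-1).indices
    tokens[0, idx[0]] = preds[0, idx[0]]
```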
3.2 Text to 3D Models

DreamFusion

● Uses a pretrained 2D text-to-image diffusion model to perform text-to-3D synthesis


● Uses score distillation sampling (SDS) to generate samples from the diffusion model by optimizing a loss function (sketched below)
● Uses a differentiable generator that focuses on creating 3D models that look like good images when rendered from random angles
● Applications: fast asset generation for games and movie development

3D model created with the prompt "cat wearing virtual reality headset in renaissance oil painting high detail caravaggio"

https://dreamfusion3d.github.io/gallery.html
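A toy version of the SDS update, assuming stand-ins for the NeRF renderer and the frozen diffusion model. The key trick, treating the denoiser's error as a constant gradient on the render so the loss never backpropagates through the U-Net, is real; everything else below is illustrative:

```python
import torch

scene_params = torch.randn(3, 64, 64, requires_grad=True)  # stand-in for NeRF weights

def render(params):          # differentiable renderer (identity here)
    return params

def diffusion_eps(x_t, t):   # frozen text-conditioned diffusion model (stand-in)
    return torch.randn_like(x_t)

opt = torch.optim.Adam([scene_params], lr=1e-2)
for step in range(10):
    x = render(scene_params)
    t = torch.rand(())                   # random diffusion time
    eps = torch.randn_like(x)
    x_t = x + t * eps                    # simplified forward noising
    with torch.no_grad():
        eps_hat = diffusion_eps(x_t, t)
    # SDS: push the render toward lower predicted noise error, skipping the U-Net
    grad = eps_hat - eps
    loss = (grad.detach() * x).sum()
    opt.zero_grad(); loss.backward(); opt.step()
```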
3.2 Text to 3D Models

Magic3D

● Shorter processing time and higher-quality generated models than DreamFusion


● Uses a two-stage optimization framework: Magic3D first builds a low-resolution diffusion prior
● The textured 3D mesh model is then further optimized with an efficient differentiable renderer
● In human evaluation, 61.7% of raters preferred Magic3D's models over DreamFusion's

Textured 3D mesh: 3D model for "a peacock on a surfboard"
https://research.nvidia.com/labs/dir/magic3d/
3.3 Image to Text Models
Flamingo

● 80B-parameter visual language model that works with image and video inputs via few-shot learning
● Combines a pretrained vision model that analyzes visual scenes (bridged by a Perceiver Resampler) with an LLM (Chinchilla) that performs a basic form of reasoning
● Its multimodal features make it useful for applications such as assisting the visually impaired and identifying hateful content

https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model
3.3 Image to Text Models

VisualGPT

● Image captioning model


● VisualGPT builds on a pretrained GPT-2 model
● Requires much less training data than other image-to-text models

3.4 Text to Video Models

Phenaki

● Model capable of video generation


● Trained on a large set of image-text pairs and a smaller set of video-text pairs
● The model has three parts: the C-ViViT encoder, a training transformer, and a video generator
● Phenaki can create videos from a text prompt or from a combination of an image and a text prompt

3.4 Text to Video Models

Soundify

● Generates audio effects for video scenes


● Leverages a library of 90,000 labeled sound effects and extends CLIP's image-classification capabilities
● Works in three stages: classification, synchronization, and mixing
● Classifies each scene for two types of sounds: effects and ambients
● Syncs the sounds by matching sound emitters to the frames in each scene

https://chuanenlin.com/papers/soundify-neurips2021.pdf
3.5 Text to Audio Models

AudioLM

● Generates speech and piano music continuations, taking only audio as input


● Semantic tokens capture local dependencies (e.g., phonetics in speech) and global long-term structure (e.g., rhythm in piano music)
● Acoustic tokens capture the details of the audio waveform
● AudioLM generates in three stages, as sketched below

Two kinds of audio tokens in AudioLM

https://ai.googleblog.com/2022/10/audiolm-language-modeling-approach-to.html
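The three-stage generation can be sketched as a chain of token-to-token models (all three stages below are placeholder functions with made-up vocabulary sizes and token counts; the real stages are decoder-only transformers):

```python
import torch

def semantic_model(prompt_tokens):           # stage 1: long-term structure
    return torch.randint(0, 500, (50,))

def coarse_acoustic_model(semantic_tokens):  # stage 2: coarse waveform detail
    return torch.randint(0, 1024, (200,))

def fine_acoustic_model(coarse_tokens):      # stage 3: fine waveform detail
    return torch.randint(0, 1024, (800,))

semantic = semantic_model(torch.randint(0, 500, (25,)))
coarse = coarse_acoustic_model(semantic)
fine = fine_acoustic_model(coarse)
# the fine acoustic tokens are finally decoded back to a waveform by a neural codec
```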
3.5 Text to Audio Models

Jukebox

● Generates music, including singing, in different genres and artist styles
● Raw audio is high-dimensional, which makes it computationally challenging to model
● Uses a hierarchical VQ-VAE architecture to compress audio into a lower-dimensional space (sketched below)
● Trained on 1.2 million songs along with their lyrics and other metadata
● Can continue a given piece of audio and generate songs for given lyrics

https://cdn.openai.com/papers/jukebox.pdf
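The core VQ-VAE compression step, snapping encoder latents to their nearest codebook vectors to get a compact discrete code, fits in a few lines of PyTorch (toy sizes; Jukebox stacks three such levels at different temporal resolutions):

```python
import torch

codebook = torch.randn(512, 64)         # 512 codes, 64-dim each
latents = torch.randn(100, 64)          # encoder output for 100 audio frames

dists = torch.cdist(latents, codebook)  # (100, 512) pairwise distances
codes = dists.argmin(dim=-1)            # one discrete token per frame
quantized = codebook[codes]             # what the decoder reconstructs audio from
```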
3.5 Text to Audio Models

Whisper

● Model capable of multilingual (99 languages) speech recognition and translation
● Trained on 680,000 hours of labeled audio data collected from the web
● Converts 30-second chunks of audio into log-Mel spectrograms

https://openai.com/research/whisper
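Whisper is open source, so transcription takes a few lines with the openai-whisper package (the audio path is a placeholder):

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")  # internally chunks the audio into
print(result["text"])                   # 30-second log-Mel spectrograms
```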
3.6 Text to Text Models

ChatGPT

● The text-to-text model that brought mainstream attention to generative AI
● Uses a transformer architecture along with reinforcement learning from human feedback (RLHF)
● Generates text, solves simple mathematics problems, and writes code
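The model is reachable through the openai Python package; a minimal sketch (the exact client interface depends on the installed package version):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Explain RLHF in two sentences."}],
)
print(response.choices[0].message.content)
```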

3.6 Text to Text Models

LaMDA

● Model for generating dialogue
● Pretrained on a dataset of 1.56T words collected from public dialogue data
● In fine-tuning, classifiers are trained to predict the Safety and Quality (sensibleness, specificity, interestingness: SSI) ratings of each response
● LaMDA is trained to call an external information retrieval system during its interaction with the user to improve factual correctness

https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html

3.6 Text to Text Models

PEER

● Model developed to cover the entire writing process: Plan, Edit, Explain, Repeat
● Trained on Wikipedia edit history

https://ai.facebook.com/research/publications/peer-a-collaborative-language-model/
3.6 Text to Text Models

Meta AI Speech from Brain

● Decodes speech from noninvasive recordings of brain activity
● Used electroencephalography (EEG) and magnetoencephalography (MEG) datasets with 150+ hours of recordings of 169 healthy volunteers listening to audiobooks
● A model trained with contrastive learning aligns speech and the related brain activity (sketched below)
● Decodes speech segments from three seconds of brain activity with 73% accuracy

https://ai.facebook.com/blog/ai-speech-brain-activity/
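The contrastive alignment can be sketched as a CLIP-style InfoNCE loss between speech embeddings and brain-recording embeddings (random tensors stand in for the two encoders; sizes are illustrative):

```python
import torch
import torch.nn.functional as F

speech_emb = F.normalize(torch.randn(32, 128), dim=-1)  # 32 speech segments
brain_emb = F.normalize(torch.randn(32, 128), dim=-1)   # the matching MEG/EEG segments

logits = speech_emb @ brain_emb.t() / 0.07               # similarity matrix
labels = torch.arange(32)                                # i-th speech matches i-th brain clip
loss = F.cross_entropy(logits, labels)                   # pull matching pairs together
```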
3.7 Text to Code Models

Codex

● Converts natural-language comments into code
● Based on GPT-3, trained on 159 GB of Python code from 54 million GitHub repositories
● Proficient in over a dozen languages, including JavaScript, Go, Perl, PHP, Ruby, Swift, and TypeScript
● Powers GitHub Copilot
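An illustration of the comment-to-code task Codex performs: given the comment and signature as a prompt, the model completes the body (the completion shown here is illustrative, not an actual Codex output):

```python
# Prompt: the comment and signature.
# Completion: the function body.

# Return the n-th Fibonacci number iteratively.
def fib(n: int) -> int:
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```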

3.7 Text to Code Models

AlphaCode

● Pretrained on 715.1 GB of code from GitHub repositories
● Fine-tuned on Codeforces problems; participated in 10 recent contests

https://www.deepmind.com/blog/competitive-programming-with-alphacode
3.8 Text to Science Models

Galactica

● Model that generates citations and helps discover related papers
● Trained on 48 million papers, textbooks, and lecture notes, millions of compounds and proteins, scientific websites, and encyclopedias
● The model is capable of working with scientific terminology, chemical formulas, and code

https://galactica.org/explore/
3.8 Text to Science Models

Minerva

● Combines several techniques, including few-shot prompting, chain-of-thought or scratchpad prompting, and majority voting (sketched below)
● Builds on the Pathways Language Model (PaLM), with further training on a 118 GB dataset of scientific papers from the arXiv preprint server and web pages containing mathematical expressions written in LaTeX, MathJax, or other mathematical typesetting formats
● Generates answers using a mix of natural language and LaTeX mathematical expressions, with no explicit underlying mathematical structure

https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html
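Majority voting is simple to sketch: sample several chains of thought, extract each final answer, and keep the most common one (sample_answer below is a placeholder for the model call):

```python
import random
from collections import Counter

def sample_answer(question: str) -> str:
    # stand-in: a real call would sample a chain of thought from the LLM
    # and parse out the final answer
    return random.choice(["42", "42", "41"])

samples = [sample_answer("What is 6 * 7?") for _ in range(16)]
answer, votes = Counter(samples).most_common(1)[0]
print(answer, f"({votes}/16 votes)")
```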
3.9 Other Models

AlphaTensor

● Builds on AlphaZero, the successor to the system famous for beating the Go world champion
● Discovered efficient algorithms for matrix multiplication
● Converted the problem of finding efficient matrix-multiplication algorithms into a single-player game

https://www.deepmind.com/blog/discovering-novel-algorithms-with-alphatensor
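For context, the kind of object AlphaTensor searches for: Strassen's classic scheme multiplies two 2x2 matrices with 7 scalar multiplications instead of the naive 8 (AlphaTensor found schemes beating the state of the art in some settings, e.g. 47 multiplications for 4x4 matrices over modulo-2 arithmetic):

```python
def strassen_2x2(A, B):
    # Strassen's 7 products for a 2x2 matrix multiplication
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

print(strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```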
3.9 Other Models

Gato

● Inspired by LLMs, a generalist model for multiple tasks
● In the training phase, data from diverse tasks and modalities are converted into a sequential representation of tokens
● The loss is masked so that Gato only predicts action and text targets (sketched below)

https://www.deepmind.com/blog/a-generalist-agent
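The masked loss can be sketched with PyTorch's ignore_index: every modality becomes tokens, but only text and action positions contribute to the loss (toy tensors; -100 marks the non-target positions, e.g. image or observation tokens):

```python
import torch
import torch.nn.functional as F

VOCAB = 1000
logits = torch.randn(1, 6, VOCAB)                    # model predictions per position
targets = torch.tensor([[3, 7, -100, -100, 12, 5]])  # -100 = not a text/action target

loss = F.cross_entropy(
    logits.view(-1, VOCAB), targets.view(-1), ignore_index=-100
)  # masked positions are simply skipped
```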
4. Conclusion
● Generative AI can help optimize both non-creative and creative tasks
● Lack of data and bias in data hinder progress
● Understanding of the ethics of these systems is still lacking
● We are still in the discovery phase of generative AI and its purpose

