
ChatGPT is not all you need.

A State of the Art Review of large Generative AI models
Authors: Roberto Gozalo-Brizuela, Eduardo C. Garrido-Merchán
Quantitative Methods Department, Universidad Pontificia Comillas, Madrid, Spain
Presenter: Diwash Shrestha

Date: 13-06-2023

1. Introduction
● Generative AI models are possible thanks to architectures such as transformers, generative adversarial networks, and variational autoencoders
● The paper proposes a taxonomy of the main generative models in industry and analyzes them by category
● It reviews the applications of these models and the content they generate

2. Taxonomy of Generative AI
● Input-to-output format of the model

● Timeline of model releases

● Developer of the model

2.1 Input-to-Output Format of the Model

2.2 Timeline of Model Releases

2.3 Developer of the Model
● Building generative AI models requires huge computational resources and large datasets
● They are therefore developed mostly by large companies, often in collaboration with academia

3. Generative AI Model Categories

1. Text-to-Image models
2. Text-to-3D models
3. Image-to-Text models
4. Text-to-Video models
5. Text-to-Audio models
6. Text-to-Text models
7. Text-to-Code models
8. Text-to-Science models
9. Other models

3.1 Text to Image Models
DALL·E 2

● Generates images and art from a natural-language text prompt
● Uses the Contrastive Language-Image Pre-training (CLIP) neural network
● CLIP learns the relation between textual semantics and their visual representations
● A prior model maps the CLIP text embedding to an image embedding, and a GLIDE-based diffusion decoder generates the image from it
● Applications: synthetic data generation and image editing

Image generated from the prompt "A cat wearing a beret and black turtleneck". Image from https://www.assemblyai.com/blog/how-dall-e-2-actually-works/
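A minimal conceptual sketch of this two-stage pipeline, with stand-in linear layers for the real CLIP text tower, prior, and GLIDE-style diffusion decoder (all names and sizes here are illustrative assumptions, not DALL·E 2's actual implementation):

```python
import torch
import torch.nn as nn

class TextToImagePipeline(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.text_encoder = nn.Linear(1000, dim)    # stand-in for CLIP's text tower
        self.prior = nn.Linear(dim, dim)            # maps text embedding -> image embedding
        self.decoder = nn.Linear(dim, 3 * 64 * 64)  # stand-in for the GLIDE-style decoder

    def forward(self, text_features):
        text_emb = self.text_encoder(text_features)  # CLIP text embedding
        image_emb = self.prior(text_emb)             # prior predicts a CLIP image embedding
        pixels = self.decoder(image_emb)             # decoder renders an image from it
        return pixels.view(-1, 3, 64, 64)

pipe = TextToImagePipeline()
fake_text = torch.randn(1, 1000)  # placeholder for an encoded prompt
image = pipe(fake_text)           # (1, 3, 64, 64)
```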

3.1 Text to Image Models

Imagen

● Based on a pretrained text encoder that maps text to a sequence of word embeddings
● A cascade of conditional diffusion models maps the embeddings to images of increasing resolution
● Found that large language models pretrained on text-only corpora are very effective at encoding text for image synthesis
● Increasing the size of the language model boosts both sample fidelity and image-text alignment more than increasing the size of the image diffusion model

Image generated from the prompt "A Golden Retriever dog wearing a blue checkered beret and red dotted turtleneck". https://imagen.research.google/
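A sketch of the text-conditioning step, using a small T5 encoder from Hugging Face transformers as a stand-in for Imagen's frozen text encoder (the "t5-small" model id and the cascade comments are illustrative assumptions, not Imagen's released code):

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small").eval()  # frozen, text-only encoder

prompt = "A Golden Retriever dog wearing a blue checkered beret"
tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    text_embeddings = encoder(**tokens).last_hidden_state  # (1, seq_len, d_model)

# The cascade then consumes these embeddings (indicated here only by comments):
# base model:  text_embeddings -> 64x64 image     (conditional diffusion)
# upsampler 1: 64x64  -> 256x256                  (conditional diffusion)
# upsampler 2: 256x256 -> 1024x1024               (conditional diffusion)
```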

3.1 Text to Image Models

Stable Diffusion

● Open-source model developed by the CompVis group at LMU Munich


● Uses a latent diffusion model, which performs image-generation and modification operations in latent space
● Working in latent space makes Stable Diffusion faster than earlier models that operated in pixel space

Image generated from the prompt "A cat wearing a beret and black turtleneck".


https://huggingface.co/spaces/stabilityai/stable-diffusion-1
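Because the weights are open source, generation fits in a few lines with the Hugging Face diffusers library; a minimal sketch (the model id matches the CompVis v1.4 release, all other settings are the library's defaults):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # latent-space diffusion is faster, but still wants a GPU

image = pipe("A cat wearing a beret and black turtleneck").images[0]
image.save("cat.png")
```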

3.1 Text to Image Models

Muse

● Muse is a fast model that uses a transformer architecture instead of latent diffusion


● More efficient due to the use of discrete tokens, requiring fewer sampling iterations (see the sketch below)
● At inference time, Muse is 10x faster than Imagen-3B and 3x faster than Stable Diffusion v1.4

https://blog.metaphysic.ai/muse-googles-super-fast-text-to-image-model-abandons-latent-diffusion-for-transformers/
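A toy sketch of the parallel, iterative decoding idea behind Muse (MaskGIT-style): start from fully masked discrete image tokens and unmask the most confident predictions each round. The toy_transformer below returns random logits and merely stands in for the real masked-token predictor:

```python
import torch

VOCAB, SEQ, MASK_ID = 8192, 256, 8191

def toy_transformer(tokens):
    # stand-in for the masked token predictor: random logits
    return torch.randn(tokens.shape[0], tokens.shape[1], VOCAB)

tokens = torch.full((1, SEQ), MASK_ID)   # start fully masked
for step in range(8):                    # a handful of rounds vs. ~1000 diffusion steps
    logits = toy_transformer(tokens)
    probs, preds = logits.softmax(-1).max(-1)
    still_masked = tokens == MASK_ID
    # unmask the most confident predictions this round
    k = max(1, int(still_masked.sum() * 0.5))
    conf = torch.where(still_masked, probs, torch.tensor(-1.0))
    idx = conf.topk(k, dim=-1).indices
    tokens[0, idx[0]] = preds[0, idx[0]]
```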
3.2 Text to 3D Models

DreamFusion

● Uses a pretrained 2D text-to-image diffusion model to perform text-to-3D synthesis


● Uses score distillation sampling (SDS) to generate samples from the diffusion model by optimizing a loss function (sketched below)
● Uses a differentiable generator that focuses on creating 3D models that look like good images when rendered from random angles
● Applications: fast asset generation for games and movie development

3D model created with the prompt "cat wearing virtual reality headset in renaissance oil painting high detail caravaggio"

https://dreamfusion3d.github.io/gallery.html
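A toy version of the SDS update, assuming stand-ins for the NeRF renderer and the frozen diffusion model. The key trick, treating the denoiser's error as a constant gradient on the render so the loss never backpropagates through the U-Net, is real; everything else below is illustrative:

```python
import torch

scene_params = torch.randn(3, 64, 64, requires_grad=True)  # stand-in for NeRF weights

def render(params):          # differentiable renderer (identity here)
    return params

def diffusion_eps(x_t, t):   # frozen text-conditioned diffusion model (stand-in)
    return torch.randn_like(x_t)

opt = torch.optim.Adam([scene_params], lr=1e-2)
for step in range(10):
    x = render(scene_params)
    t = torch.rand(())                   # random diffusion time
    eps = torch.randn_like(x)
    x_t = x + t * eps                    # simplified forward noising
    with torch.no_grad():
        eps_hat = diffusion_eps(x_t, t)
    # SDS: push the render toward lower predicted noise error, skipping the U-Net
    grad = eps_hat - eps
    loss = (grad.detach() * x).sum()
    opt.zero_grad(); loss.backward(); opt.step()
```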
3.2 Text to 3D Models

Magic3D

● Shorter processing time and higher-quality generated models than DreamFusion


● Uses a two-stage optimization framework: Magic3D first builds a low-resolution diffusion prior
● The textured 3D mesh model is then further optimized with an efficient differentiable renderer
● In human evaluation, 61.7% of raters preferred Magic3D's models over DreamFusion's

Textured 3D mesh: 3D model for "a peacock on a surfboard"
https://research.nvidia.com/labs/dir/magic3d/
3.3 Image to Text Models
Flamingo

● 80B-parameter visual language model that works with image and video inputs via few-shot learning
● Combines a pretrained vision model that analyzes visual scenes (bridged by a Perceiver Resampler) with an LLM (Chinchilla) that performs a basic form of reasoning
● Its multimodal features make it useful for applications such as assisting the visually impaired and identifying hateful content

https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model
3.3 Image to Text Models

VisualGPT

● Image captioning model


● VisualGPT builds on a pretrained GPT-2 model
● Requires much less training data than other image-to-text models

3.4 Text to Video Models

Phenaki

● Model capable of video generation


● Trained on a large set of image-text pairs and a smaller set of video-text pairs
● The model has three parts: the C-ViViT encoder, a training transformer, and a video generator
● Phenaki can create videos from a text prompt or from a combination of an image and a text prompt

3.4 Text to Video Models

Soundify

● Generates audio effects for video scenes


● Leverages a library of 90,000 labeled sound effects and extends CLIP's image-classification capabilities
● Works in three stages: classification, synchronization, and mixing
● Classifies each scene for two types of sounds: effects and ambients
● Syncs the sounds by matching sound emitters to the frames in each scene

https://chuanenlin.com/papers/soundify-neurips2021.pdf
3.5 Text to Audio Models

AudioLM

● Generates speech and piano music continuations, taking only audio as input


● Semantic tokens capture local dependencies (e.g., phonetics in speech) and global long-term structure (e.g., rhythm in piano music)
● Acoustic tokens capture the details of the audio waveform
● AudioLM generates in three stages, as sketched below

Two kinds of audio tokens in AudioLM

https://ai.googleblog.com/2022/10/audiolm-language-modeling-approach-to.html
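The three-stage generation can be sketched as a chain of token-to-token models (all three stages below are placeholder functions with made-up vocabulary sizes and token counts; the real stages are decoder-only transformers):

```python
import torch

def semantic_model(prompt_tokens):           # stage 1: long-term structure
    return torch.randint(0, 500, (50,))

def coarse_acoustic_model(semantic_tokens):  # stage 2: coarse waveform detail
    return torch.randint(0, 1024, (200,))

def fine_acoustic_model(coarse_tokens):      # stage 3: fine waveform detail
    return torch.randint(0, 1024, (800,))

semantic = semantic_model(torch.randint(0, 500, (25,)))
coarse = coarse_acoustic_model(semantic)
fine = fine_acoustic_model(coarse)
# the fine acoustic tokens are finally decoded back to a waveform by a neural codec
```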
3.5 Text to Audio Models

Jukebox

● Generates music, including singing, in different genres and artist styles
● Raw audio is high-dimensional, which makes it computationally challenging to model
● Uses a hierarchical VQ-VAE architecture to compress audio into a lower-dimensional space (sketched below)
● Trained on 1.2 million songs along with their lyrics and other metadata
● Can continue a given piece of audio and generate songs for given lyrics

https://cdn.openai.com/papers/jukebox.pdf
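The core VQ-VAE compression step, snapping encoder latents to their nearest codebook vectors to get a compact discrete code, fits in a few lines of PyTorch (toy sizes; Jukebox stacks three such levels at different temporal resolutions):

```python
import torch

codebook = torch.randn(512, 64)         # 512 codes, 64-dim each
latents = torch.randn(100, 64)          # encoder output for 100 audio frames

dists = torch.cdist(latents, codebook)  # (100, 512) pairwise distances
codes = dists.argmin(dim=-1)            # one discrete token per frame
quantized = codebook[codes]             # what the decoder reconstructs audio from
```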
3.5 Text to Audio Models

Whisper

● Model capable of multilingual (99 languages) speech recognition and translation
● Trained on 680,000 hours of labeled audio data collected from the web
● Converts 30-second chunks of audio into log-Mel spectrograms

https://openai.com/research/whisper
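Whisper is open source, so transcription takes a few lines with the openai-whisper package (the audio path is a placeholder):

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")  # internally chunks the audio into
print(result["text"])                   # 30-second log-Mel spectrograms
```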
3.6 Text to Text Models

ChatGPT

● The text-to-text model that brought mainstream attention to generative AI
● Uses a transformer architecture along with reinforcement learning from human feedback (RLHF)
● Generates text, solves simple mathematics problems, and writes code
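The model is reachable through the openai Python package; a minimal sketch (the exact client interface depends on the installed package version):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Explain RLHF in two sentences."}],
)
print(response.choices[0].message.content)
```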

3.6 Text to Text Models

LaMDA

● Model for generating dialogue
● Pretrained on a dataset of 1.56T words collected from public dialogue data
● In fine-tuning, classifiers are trained to predict the Safety and Quality (sensibleness, specificity, interestingness: SSI) ratings of each response
● LaMDA is trained to call an external information retrieval system during its interaction with the user to improve factual correctness

https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html

3.6 Text to Text Models

PEER

● Model developed to cover the entire writing process: Plan, Edit, Explain, Repeat
● Trained on Wikipedia edit history

https://ai.facebook.com/research/publications/peer-a-collaborative-language-model/
3.6 Text to Text Models

Meta AI Speech from Brain

● Decodes speech from noninvasive recordings of brain activity
● Used electroencephalography (EEG) and magnetoencephalography (MEG) datasets with 150+ hours of recordings of 169 healthy volunteers listening to audiobooks
● A model trained with contrastive learning aligns speech and the related brain activity (sketched below)
● Decodes speech segments from three seconds of brain activity with 73% accuracy

https://ai.facebook.com/blog/ai-speech-brain-activity/
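The contrastive alignment can be sketched as a CLIP-style InfoNCE loss between speech embeddings and brain-recording embeddings (random tensors stand in for the two encoders; sizes are illustrative):

```python
import torch
import torch.nn.functional as F

speech_emb = F.normalize(torch.randn(32, 128), dim=-1)  # 32 speech segments
brain_emb = F.normalize(torch.randn(32, 128), dim=-1)   # the matching MEG/EEG segments

logits = speech_emb @ brain_emb.t() / 0.07               # similarity matrix
labels = torch.arange(32)                                # i-th speech matches i-th brain clip
loss = F.cross_entropy(logits, labels)                   # pull matching pairs together
```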
3.7 Text to Code Models

Codex

● Converts natural-language comments into code
● Based on GPT-3, trained on 159 GB of Python code from 54 million GitHub repositories
● Proficient in over a dozen languages, including JavaScript, Go, Perl, PHP, Ruby, Swift, and TypeScript
● Powers GitHub Copilot
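An illustration of the comment-to-code task Codex performs: given the comment and signature as a prompt, the model completes the body (the completion shown here is illustrative, not an actual Codex output):

```python
# Prompt: the comment and signature.
# Completion: the function body.

# Return the n-th Fibonacci number iteratively.
def fib(n: int) -> int:
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```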

3.7 Text to Code Models

AlphaCode

● Pretrained on 715.1 GB of code from GitHub repositories
● Fine-tuned on Codeforces problems; participated in 10 recent contests

https://www.deepmind.com/blog/competitive-programming-with-alphacode
3.8 Text to Science Models

Galactica

● Model that generates citations and helps discover related papers
● Trained on 48 million papers, textbooks, and lecture notes, millions of compounds and proteins, scientific websites, and encyclopedias
● The model is capable of working with scientific terminology, chemical formulas, and code

https://galactica.org/explore/
3.8 Text to Science Models

Minerva

● Combines several techniques, including few-shot prompting, chain-of-thought or scratchpad prompting, and majority voting (sketched below)
● Builds on the Pathways Language Model (PaLM), with further training on a 118 GB dataset of scientific papers from the arXiv preprint server and web pages containing mathematical expressions written in LaTeX, MathJax, or other mathematical typesetting formats
● Generates answers using a mix of natural language and LaTeX mathematical expressions, with no explicit underlying mathematical structure

https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html
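Majority voting is simple to sketch: sample several chains of thought, extract each final answer, and keep the most common one (sample_answer below is a placeholder for the model call):

```python
import random
from collections import Counter

def sample_answer(question: str) -> str:
    # stand-in: a real call would sample a chain of thought from the LLM
    # and parse out the final answer
    return random.choice(["42", "42", "41"])

samples = [sample_answer("What is 6 * 7?") for _ in range(16)]
answer, votes = Counter(samples).most_common(1)[0]
print(answer, f"({votes}/16 votes)")
```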
3.9 Other Models

AlphaTensor

● Builds on AlphaZero, the successor to the system famous for beating the Go world champion
● Discovered efficient algorithms for matrix multiplication
● Converted the problem of finding efficient matrix-multiplication algorithms into a single-player game

https://www.deepmind.com/blog/discovering-novel-algorithms-with-alphatensor
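For context, the kind of object AlphaTensor searches for: Strassen's classic scheme multiplies two 2x2 matrices with 7 scalar multiplications instead of the naive 8 (AlphaTensor found schemes beating the state of the art in some settings, e.g. 47 multiplications for 4x4 matrices over modulo-2 arithmetic):

```python
def strassen_2x2(A, B):
    # Strassen's 7 products for a 2x2 matrix multiplication
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

print(strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```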
3.9 Other Models

Gato

● Inspired by LLMs, a generalist model for multiple tasks
● In the training phase, data from diverse tasks and modalities are converted into a sequential representation of tokens
● The loss is masked so that Gato only predicts action and text targets (sketched below)

https://www.deepmind.com/blog/a-generalist-agent
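The masked loss can be sketched with PyTorch's ignore_index: every modality becomes tokens, but only text and action positions contribute to the loss (toy tensors; -100 marks the non-target positions, e.g. image or observation tokens):

```python
import torch
import torch.nn.functional as F

VOCAB = 1000
logits = torch.randn(1, 6, VOCAB)                    # model predictions per position
targets = torch.tensor([[3, 7, -100, -100, 12, 5]])  # -100 = not a text/action target

loss = F.cross_entropy(
    logits.view(-1, VOCAB), targets.view(-1), ignore_index=-100
)  # masked positions are simply skipped
```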
4. Conclusion
● Generative AI can help optimize both non-creative and creative tasks
● Lack of data and bias in data hinder progress
● Understanding of the ethics of these systems is still lacking
● We are still in the discovery phase of generative AI and its purpose

