Stanford 2022 AI Index Extract

Artificial Intelligence Index Report 2022
CHAPTER 2: Technical Performance

Chapter Preview

Overview
Chapter Highlights

2.1 COMPUTER VISION–IMAGE
  Image Classification
    ImageNet
    ImageNet: Top-1 Accuracy
    ImageNet: Top-5 Accuracy
  Image Generation
    STL-10: Fréchet Inception Distance (FID) Score
    CIFAR-10: Fréchet Inception Distance (FID) Score
  Deepfake Detection
    FaceForensics++
    Celeb-DF
  Human Pose Estimation
    Leeds Sports Poses: Percentage of Correct Keypoints (PCK)
    Human3.6M: Average Mean Per Joint Position Error (MPJPE)
  Semantic Segmentation
    Cityscapes
  Medical Image Segmentation
    CVC-ClinicDB and Kvasir-SEG
  Face Detection and Recognition
    National Institute of Standards and Technology (NIST) Face Recognition Vendor Test (FRVT)
    Face Detection: Effects of Mask-Wearing
    Face Recognition Vendor Test (FRVT): Face-Mask Effects
    Highlight: Masked Labeled Faces in the Wild (MLFW)
  Visual Reasoning
    Visual Question Answering (VQA) Challenge

2.2 COMPUTER VISION–VIDEO
  Activity Recognition
    Kinetics-400, Kinetics-600, Kinetics-700
    ActivityNet: Temporal Action Localization Task
  Object Detection
    Common Object in Context (COCO)
    You Only Look Once (YOLO)
    Visual Commonsense Reasoning (VCR)

2.3 LANGUAGE
  English Language Understanding
    SuperGLUE
    Stanford Question Answering Dataset (SQuAD)
    Reading Comprehension Dataset Requiring Logical Reasoning (ReClor)
  Text Summarization
    arXiv
    PubMed
  Natural Language Inference
    Stanford Natural Language Inference (SNLI)
    Abductive Natural Language Inference (aNLI)
  Sentiment Analysis
    SemEval 2014 Task 4 Sub Task 2
  Machine Translation (MT)
    WMT 2014, English-German and English-French
    Number of Commercially Available MT Systems

2.4 SPEECH
  Speech Recognition
    Transcribe Speech: LibriSpeech (Test-Clean and Other Datasets)
    VoxCeleb

2.5 RECOMMENDATION
  Commercial Recommendation: MovieLens 20M
  Click-Through Rate Prediction: Criteo

2.6 REINFORCEMENT LEARNING
  Reinforcement Learning Environments
    Arcade Learning Environment: Atari-57
    Procgen
  Human Games: Chess

2.7 HARDWARE
  MLPerf: Training Time
  MLPerf: Number of Accelerators
  ImageNet: Training Cost

2.8 ROBOTICS
  Price Trends in Robotic Arms
  AI Skills Employed by Robotics Professors

2.1 COMPUTER VISION–IMAGE
Computer vision is the subfield of AI that teaches machines to understand images and videos. There is a wide range of computer
vision tasks, such as image classification, object recognition, semantic segmentation, and face detection. As of 2021, computers can
outperform humans on a plethora of computer vision tasks. Computer vision technologies have a variety of important real-world
applications, such as autonomous driving, crowd surveillance, sports analytics, and video-game creation.

IMAGE CLASSIFICATION
ImageNet
[Figure 2.1.1: A demonstration of image classification. Source: Krizhevsky, 2020.]
ImageNet: Top-1 Accuracy
ImageNet: Top-5 Accuracy
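
Top-1 accuracy counts a prediction as correct only when the highest-scoring label matches the true label; top-5 accuracy counts it as correct when the true label is among the five highest-scoring labels. The sketch below illustrates the general top-k idea, assuming each prediction is a list of per-class scores. The function name `top_k_accuracy` and the toy scores are illustrative and are not taken from the report.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy data (hypothetical): 3 samples, 4 classes.
scores = [
    [0.1, 0.7, 0.1, 0.1],    # true class 1, ranked first
    [0.4, 0.3, 0.2, 0.1],    # true class 1, ranked second
    [0.5, 0.3, 0.15, 0.05],  # true class 3, ranked last
]
labels = [1, 1, 3]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

On this toy data one of three samples is correct at k=1 and two of three at k=2, which shows why top-5 accuracy is always at least as high as top-1 accuracy on the same predictions.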

IMAGE GENERATION
[Figure 2.1.4: GAN progress on face generation. Source: Goodfellow et al., 2014; Radford et al., 2016; Liu & Tuzel, 2016; Karras et al., 2018; Karras et al., 2019; Goodfellow, 2019; Karras et al., 2020; AI Index, 2021; Vahdat et al., 2021.]

STL-10: Fréchet Inception Distance (FID) Score
[Figure 2.1.5: STL-10, Fréchet Inception Distance (FID) score by year, 2018–2021. Labeled value: 7.71. Source: Papers with Code, 2021; arXiv, 2021 | Chart: 2022 AI Index Report.]

CIFAR-10: Fréchet Inception Distance (FID) Score
[Figure 2.1.6: CIFAR-10, Fréchet Inception Distance (FID) score by year, 2017–2021. Labeled value: 2.10.]

DEEPFAKE DETECTION
FaceForensics++
[Figure 2.1.7: FaceForensics++, accuracy (%) by year, 2012–2021. Labeled values: 99.98%, Face2Face; 99.47%, DeepFake; 98.27%, FaceSwap; 93.25%, NeuralTextures. Source: arXiv, 2021 | Chart: 2022 AI Index Report.]

Celeb-DF
[Figure 2.1.8: Celeb-DF, area under curve (AUC) score by year, 2018–2021. Labeled value: 76.88.]

HUMAN POSE ESTIMATION
[Figure 2.1.9: A demonstration of human pose estimation. Source: Cao et al., 2019.]

Leeds Sports Poses: Percentage of Correct Keypoints (PCK)
[Figure 2.1.10: Leeds Sports Poses, percentage of correct keypoints (PCK) by year, 2014–2021. Labeled value: 99.50%.]

Human3.6M: Average Mean Per Joint Position Error (MPJPE)
[Figure 2.1.11: Human3.6M, average MPJPE (mm) by year, 2013–2021. Labeled values: 22.70, without extra training data; 18.70, with extra training data.]

SEMANTIC SEGMENTATION
[Figure 2.1.12: A demonstration of semantic segmentation. Source: Visual Object Classes Challenge, 2012.]

Cityscapes
[Figure 2.1.13: Cityscapes Challenge, pixel-level semantic labeling task, mean intersection-over-union (mIoU) by year, 2014–2021. Labeled values: 86.20%, with extra training data; 84.30%, without extra training data. Source: Cityscapes Challenge, 2021 | Chart: 2022 AI Index Report.]

MEDICAL IMAGE SEGMENTATION
CVC-ClinicDB and Kvasir-SEG
[Figure 2.1.14: A demonstration of kidney segmentation. Source: Kidney and Kidney Tumor Segmentation, 2021.]
[Figure 2.1.15a: CVC-ClinicDB, mean Dice by year, 2015–2021. Labeled value: 94.20%. Source: Papers with Code, 2021; arXiv, 2021 | Chart: 2022 AI Index Report.]
[Figure 2.1.15b: Kvasir-SEG, mean Dice by year, 2015–2021. Labeled value: 92.17%. Source: Papers with Code, 2021; arXiv, 2021 | Chart: 2022 AI Index Report.]
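
The segmentation charts report two overlap metrics: intersection-over-union (IoU) for Cityscapes and the Dice coefficient for CVC-ClinicDB and Kvasir-SEG. A minimal sketch of both for a single pair of binary masks follows, assuming masks are flat sequences of 0/1 values. The function name and the toy masks are illustrative; benchmark scores are typically averages of such values over classes or images.

```python
def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two binary masks given as flat sequences of 0/1."""
    intersection = sum(p and t for p, t in zip(pred, truth))
    pred_area = sum(pred)
    truth_area = sum(truth)
    union = pred_area + truth_area - intersection
    if union == 0:
        # Both masks empty: treat as a perfect match by convention.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_area + truth_area)
    iou = intersection / union
    return dice, iou


# Hypothetical 8-pixel masks.
pred = [1, 1, 1, 0, 0, 0, 0, 0]
truth = [0, 1, 1, 1, 0, 0, 0, 0]
dice, iou = dice_and_iou(pred, truth)
```

For a single mask pair the two metrics are related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU; this is one reason scores on the two kinds of benchmark are not directly comparable.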

FACE DETECTION AND RECOGNITION
National Institute of Standards and Technology (NIST) Face Recognition Vendor Test (FRVT)
[Figure 2.1.16: NIST FRVT, verification accuracy by dataset: false non-match rate (FNMR, log scale) by year, 2017–2021. Labeled values:
  0.0297, WILD Photos, FNMR @ FMR 0.00001
  0.0044, BORDER Photos, FNMR @ FMR 0.000001
  0.0023, VISABORDER Photos, FNMR @ FMR 0.000001
  0.0022, MUGSHOT Photos, FNMR @ FMR 0.00001
  0.0021, MUGSHOT Photos, FNMR @ FMR 0.00001, DT>=12 YRS
  0.0013, VISA Photos, FNMR @ FMR 0.000001
Source: National Institute of Standards and Technology, 2021 | Chart: 2022 AI Index Report.]
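
The labels "FNMR @ FMR 0.00001" describe an operating point: the match threshold is set so that the false match rate (FMR, the fraction of different-person comparisons accepted) is at most 1 in 100,000, and the false non-match rate (FNMR, the fraction of same-person comparisons rejected) is then read off at that threshold. A minimal sketch of that procedure, assuming higher similarity scores mean a more likely match; the function name, scores, and the loose target FMR of 0.2 are hypothetical and chosen only so the toy data is small.

```python
def fnmr_at_fmr(genuine, impostor, target_fmr):
    """FNMR at the lowest threshold whose FMR does not exceed target_fmr.

    A pair is declared a match when its similarity score is >= threshold.
    """
    # Candidate thresholds: every observed score, plus one above the maximum.
    candidates = sorted(set(genuine) | set(impostor))
    candidates.append(candidates[-1] + 1.0)
    for threshold in candidates:
        fmr = sum(s >= threshold for s in impostor) / len(impostor)
        if fmr <= target_fmr:
            fnmr = sum(s < threshold for s in genuine) / len(genuine)
            return threshold, fnmr
    raise ValueError("unreachable: the last candidate yields FMR = 0")


# Hypothetical similarity scores.
genuine = [0.9, 0.8, 0.75, 0.6, 0.4]    # same-person comparisons
impostor = [0.1, 0.2, 0.3, 0.5, 0.7]    # different-person comparisons
threshold, fnmr = fnmr_at_fmr(genuine, impostor, target_fmr=0.2)
```

Here the threshold settles at 0.6, where one of five impostor pairs is still accepted and one of five genuine pairs is rejected. Tightening the target FMR pushes the threshold up and the FNMR with it, which is why results at different FMR settings should not be compared directly.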

FACE DETECTION: EFFECTS OF MASK-WEARING
Face Recognition Vendor Test (FRVT): Face-Mask Effects

Although facial recognition technology has existed for several decades, the technical progress in the last few years has been
significant. Some of today’s top-performing facial recognition algorithms have a near 100% success rate on challenging datasets.

[Figure 2.1.17: NIST FRVT face-mask effects: false non-match rate (FNMR; axis labeled log scale) by year, 2019–2021. Labeled values: 0.014, masked; 0.002, non-masked. Source: National Institute of Standards and Technology, 2021 | Chart: 2022 AI Index Report.]

Masked Labeled Faces in the Wild (MLFW)

In 2021, researchers from the Beijing University of Posts and Telecommunications released a facial recognition dataset of 6,000
masked faces in response to the new recognition challenges posed by large-scale mask-wearing.
[Figure 2.1.18: Examples of masked faces in the Masked Labeled Faces in the Wild (MLFW) database. Source: Wang et al., 2021.]
As part of the dataset release, the researchers ran a series of existing state-of-the-art detection
algorithms on a variety of facial recognition datasets, including theirs, to determine how much detection
performance decreased when faces were masked. Their estimates suggest that top methods perform 5
to 16 percentage points worse on masked faces compared to unmasked ones. These findings somewhat
confirm the insights from the FRVT face-mask tests: Performance deteriorates when masks are included,
but not by an overly significant degree.
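
A drop expressed in percentage points is an absolute difference between two accuracies, which is not the same as a relative (percent) decline. The sketch below shows the arithmetic with made-up accuracies; the function name and the 99%/90% figures are hypothetical and are not values from Wang et al.

```python
def accuracy_drop(unmasked_acc, masked_acc):
    """Return (percentage-point drop, relative drop in %) between two accuracies in %."""
    point_drop = unmasked_acc - masked_acc
    relative_drop = 100.0 * point_drop / unmasked_acc
    return point_drop, relative_drop


# Hypothetical accuracies, chosen only to illustrate the arithmetic.
points, relative = accuracy_drop(unmasked_acc=99.0, masked_acc=90.0)
```

With these inputs the drop is 9 percentage points, or about 9.1% in relative terms. The two figures diverge more as the starting accuracy gets lower.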

[Figure 2.1.19: State-of-the-art face detection methods on Masked Labeled Faces in the Wild (MLFW): accuracy (%) by method and dataset. Methods: ArcFace (three variants), CosFace, CurricularFace, SFace. Datasets: CALFW, CPLFW, LFW, SLLFW, MLFW. Bar labels shown: 90%, 91%, 91%, 85%, 83%, 75%. Source: Wang et al., 2021 | Chart: 2022 AI Index Report.]

VISUAL REASONING
[Figure 2.1.20: An example of a visual reasoning task. Source: Goyal et al., 2021.]

Visual Question Answering (VQA) Challenge
[Figure 2.1.21: Sample questions in the Visual Question Answering (VQA) Challenge. Source: Goyal et al., 2017.]
[Figure 2.1.22: Visual Question Answering (VQA) Challenge, accuracy (%) by year, 2015–2021. Labeled values: 79.78%; 80.80%, human baseline. Source: VQA Challenge, 2021 | Chart: 2022 AI Index Report.]

2.2 COMPUTER VISION–VIDEO
Video analysis concerns reasoning or task operation across sequential frames (videos), rather than single frames (images). Video
computer vision has a wide range of use cases, which include assisting criminal surveillance efforts, sports analytics, autonomous
driving, navigation of robots, and crowd monitoring.

ACTIVITY RECOGNITION
Kinetics-400, Kinetics-600, Kinetics-700
[Figure 2.2.1: Example classes from the Kinetics dataset. Source: Kay et al., 2017.]
[Figure 2.2.2: Kinetics-400, Kinetics-600, Kinetics-700, top-1 accuracy (%) by year, 2016–2022. Labeled values: 89.60%, Kinetics-600; 89.10%, Kinetics-400; 82.20%, Kinetics-700. Source: Papers with Code, 2021; arXiv, 2021 | Chart: 2022 AI Index Report.]

ActivityNet: Temporal Action Localization Task
[Figure 2.2.3: ActivityNet, temporal action localization task, mean average precision (mAP) by year, 2016–2021. Labeled value: 44.67%. Source: ActivityNet, 2021 | Chart: 2022 AI Index Report.]

OBJECT DETECTION
[Figure 2.2.4: A demonstration of how object detection appears to AI systems. Source: COCO, 2020.]

Common Object in Context (COCO)
[Figure 2.2.5: COCO test-dev, mean average precision (mAP50) by year, 2015–2021. Labeled values: 79.50%, with extra training data; 77.10%, without extra training data.]
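
The "50" in mAP50 refers to the overlap criterion used when matching detections to ground truth: a predicted box counts as correct only if its intersection-over-union (IoU) with a ground-truth box of the same class is at least 0.5. A minimal sketch of that matching step follows; it is not a full mAP computation, which additionally ranks detections by confidence and averages precision over classes. The function names and boxes are illustrative.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(pred_box, truth_box, threshold=0.5):
    """mAP50 convention: a detection matches when IoU >= 0.5."""
    return box_iou(pred_box, truth_box) >= threshold


# Hypothetical boxes.
truth = (0.0, 0.0, 10.0, 10.0)
close = (1.0, 1.0, 11.0, 11.0)  # heavy overlap with the ground truth
far = (8.0, 8.0, 18.0, 18.0)    # slight overlap with the ground truth
```

The `close` box overlaps the ground truth with IoU 81/119 (about 0.68) and would count as a true positive; the `far` box has IoU 4/196 (about 0.02) and would not.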

You Only Look Once (YOLO)
[Figure 2.2.6: State of the art (SOTA) vs. You Only Look Once (YOLO), mean average precision (mAP50) by year, 2016–2021. Labeled values: 79.50%, SOTA; 72.40%, YOLO. Source: arXiv, 2021; GitHub, 2021 | Chart: 2022 AI Index Report.]

Visual Commonsense Reasoning (VCR)
[Figure 2.2.7: A sample question of the Visual Commonsense Reasoning (VCR) challenge. Source: Zellers et al., 2018.]
[Figure 2.2.8: Visual Commonsense Reasoning (VCR) task, Q->AR score by year, 2018–2021. Labeled values: 72.00; 85.00, human baseline. Source: VCR Leaderboard, 2021 | Chart: 2022 AI Index Report.]

2.3 LANGUAGE
Natural language processing (NLP) is a subfield of AI, with roots that stretch back as far as the 1950s. NLP involves research into systems that
can read, generate, and reason about natural language. NLP has evolved from systems that, in its early years, relied on handwritten rules and
statistical methods to systems that now combine computational linguistics, rule-based modeling, statistical learning, and deep learning.
This section looks at progress in NLP across several language task domains, including: (1) English language understanding; (2) text
summarization; (3) natural language inference; (4) sentiment analysis; and (5) machine translation. In the last decade, technical progress in
NLP has been significant: The adoption of deep neural network–style machine learning methods has meant that many AI systems can now
execute complex language tasks better than many human baselines.

ENGLISH LANGUAGE UNDERSTANDING
SuperGLUE
[Figure 2.3.1: A set of SuperGLUE tasks. Source: Wang et al., 2019.]
[Figure 2.3.2: SuperGLUE, score by year, 2019–2021. Labeled values: 91.00; 89.80, human performance. Source: SuperGLUE Leaderboard, 2021 | Chart: 2022 AI Index Report.]

Stanford Question Answering Dataset (SQuAD)
[Figure 2.3.3: Harder questions added to Stanford Question Answering Dataset (SQuAD) 2.0. Source: Rajpurkar et al., 2018.]
[Figure 2.3.4: SQuAD 1.1 and SQuAD 2.0, F1 score by year, 2016–2021. Labeled values: 95.72, SQuAD 1.1; 93.21, SQuAD 2.0; 91.20, human baseline (v1); 89.50, human baseline (v2). Source: SQuAD 1.1 and SQuAD 2.0, 2021 | Chart: 2022 AI Index Report.]
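
The F1 score on the SQuAD chart measures token overlap between a predicted answer and a reference answer, as the harmonic mean of precision and recall. The sketch below is a simplified version that assumes lowercasing and whitespace tokenization only; the official evaluation applies further answer normalization (such as stripping punctuation and articles) that is omitted here. The function name and example strings are illustrative.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer string."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each shared token counts as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical answer pair.
score = token_f1("the Eiffel Tower", "Eiffel Tower in Paris")
```

Two tokens are shared, giving precision 2/3 and recall 2/4, so the F1 is 4/7 (about 0.57). Partial credit of this kind is why F1 on SQuAD is higher than exact-match accuracy for the same system.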

Reading Comprehension Dataset Requiring Logical Reasoning (ReClor)
[Figure 2.3.5: A sample question in Reading Comprehension Dataset Requiring Logical Reasoning (ReClor). Source: Yu et al., 2020.]

Although AI systems are presently capable of achieving a relatively high level of performance on the easy set of questions, they
struggle on the hard set.

[Figure 2.3.6: Reading Comprehension Dataset Requiring Logical Reasoning (ReClor), accuracy (%) by year, 2020–2021. Labeled values: 91.82%, test easy; 69.29%, test hard. Source: ReClor Leaderboard, 2021 | Chart: 2022 AI Index Report.]

TEXT SUMMARIZATION
arXiv
[Figure 2.3.7: arXiv, ROUGE-1 by year, 2017–2021.]

PubMed
[Figure 2.3.8: PubMed, ROUGE-1 by year, 2017–2021.]

NATURAL LANGUAGE INFERENCE
Stanford Natural Language Inference (SNLI)
[Figure 2.3.9: Questions and labels in Stanford Natural Language Inference (SNLI). Source: Bowman et al., 2015.]
[Figure 2.3.10: Stanford Natural Language Inference (SNLI), accuracy (%) by year, 2014–2021. Labeled value: 93.10%.]

Abductive Natural Language Inference (aNLI)
[Figure 2.3.11: Example questions in Abductive Natural Language Inference (aNLI). Source: Allen Institute for AI, 2021.]
[Figure 2.3.12: Abductive Natural Language Inference (aNLI), accuracy (%) by year, 2019–2021. Labeled values: 91.87%; 92.90%, human baseline. Source: Allen Institute for AI, 2021 | Chart: 2022 AI Index Report.]

SENTIMENT ANALYSIS
SemEval 2014 Task 4 Sub Task 2
[Figure 2.3.13: A sample SemEval task. Source: Pontiki et al., 2014.]
[Figure 2.3.14: SemEval 2014 Task 4 Sub Task 2, accuracy (%) by year, 2015–2021. Labeled value: 88.64%.]

MACHINE TRANSLATION (MT)
WMT 2014, English-German and English-French
[Figure 2.3.15: WMT 2014, English-French and English-German (two panels): BLEU score by year, 2015–2021. Source: Papers with Code, 2021; arXiv, 2021 | Chart: 2022 AI Index Report.]

Number of Commercially Available MT Systems
[Figure 2.3.16: Number of independent machine translation services by category (Commercial; Open Source Pre-trained; Preview) at 05/2017, 07/2017, 11/2017, 03/2018, 07/2018, 12/2018, 06/2019, 11/2019, 07/2020, and 10/2021. Source: Intento, 2021 | Chart: 2022 AI Index Report.]

2.4 SPEECH
Another important domain of AI research is the analysis, recognition, and synthesis of human speech. In this AI subfield, AI systems are
typically rated on two abilities: recognizing speech (identifying the words spoken and converting them into text) and recognizing speakers
(identifying the individuals speaking). Modern home assistance tools, such as Siri, are among the many examples of commercially applied AI
speech technology.
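
Speech recognition in this section is scored by word error rate (WER): the minimum number of word substitutions, deletions, and insertions needed to turn the system transcript into the reference transcript, divided by the number of reference words. A minimal sketch using a standard edit-distance recurrence, assuming whitespace tokenization; the function name and example sentences are illustrative. It returns a fraction, whereas the charts report WER in percent.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words processed so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substitution ("sat" -> "sit") and one deletion ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

The example yields 2 errors over 6 reference words, a WER of about 0.33. Because insertions count as errors, WER can exceed 1.0 for very poor transcripts.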

SPEECH RECOGNITION
Transcribe Speech: LibriSpeech (Test-Clean and Other Datasets)
[Figure 2.4.1: LibriSpeech, test-clean and test-other (two panels): word error rate (WER) by year, 2015–2021. Labeled values: 3.30, without extra training data; 2.50, with extra training data. Source: Papers with Code, 2021; arXiv, 2021 | Chart: 2022 AI Index Report.]

VoxCeleb
[Figure 2.4.2: VoxCeleb, equal error rate (EER, %) by year, 2017–2021. Labeled value: 0.42%. Source: VoxCeleb, 2021 | Chart: 2022 AI Index Report.]

2.5 RECOMMENDATION
Recommendation is the task of suggesting items that might be of interest to a user, such as movies to watch, articles to read, or products
to purchase. Recommendation systems are crucial to businesses, such as Amazon, Netflix, Spotify, and YouTube. For example, one of
the earliest open recommendation competitions in AI was the Netflix Prize; launched in 2006 and awarded in 2009, it challenged computer
scientists to develop algorithms that could accurately predict user ratings for films based on previously submitted ratings.
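
Ranking quality for recommenders is often summarized with normalized discounted cumulative gain (nDCG@k), the metric used in the MovieLens 20M chart with k = 100. It rewards placing relevant items near the top of the list by discounting each item's relevance by the logarithm of its rank, then divides by the score of an ideal ordering. A minimal sketch with binary relevance follows; the function names and toy ranking are illustrative, and the ideal ordering here is built only from the returned items, which is a simplification.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (rank 1 is undiscounted)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance of the items a recommender returned, in ranked order
# (1 = the user interacted with the item, 0 = they did not).
ranking = [0, 1, 1, 0]
score = ndcg_at_k(ranking, k=4)
```

The toy ranking scores about 0.69 because both relevant items sit below an irrelevant one; moving them to ranks 1 and 2 would give exactly 1.0.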

Commercial Recommendation: MovieLens 20M
[Figure 2.5.1: MovieLens 20M, normalized discounted cumulative gain@100 (nDCG@100) by year, 2018–2021. Labeled value: 0.448.]

Click-Through Rate Prediction: Criteo
[Figure 2.5.2: Criteo, area under curve (AUC) score by year, 2016–2021. Labeled value: 0.813.]

2.6 REINFORCEMENT LEARNING
In reinforcement learning, AI systems are trained to maximize performance on a given task by interactively learning from their prior
actions. Researchers train systems to optimize by rewarding them when they achieve a desired goal and penalizing them when they fail.
Systems experiment with different strategy sequences to solve their stated problem (e.g., playing chess or navigating through a maze)
and select the strategies which maximize their rewards.
Reinforcement learning makes the news whenever programs like DeepMind’s AlphaZero demonstrate superhuman performance on
games like Go and chess. However, reinforcement learning is useful in any commercial domain where computer agents need to maximize
a target goal or stand to benefit from learning from previous experiences. Reinforcement learning can help autonomous vehicles change
lanes, robots optimize manufacturing tasks, or time-series models predict future events.
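
The reward-and-penalty loop described in this section can be illustrated with tabular Q-learning, one common reinforcement learning algorithm (chosen here for brevity; the report does not single out a method). In this sketch an agent in a five-cell corridor learns to walk right to reach a goal. The environment, the reward values (+1 at the goal, a small penalty per other step), and all hyperparameters are assumptions made for the example.

```python
import random


def train_corridor_agent(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                         epsilon=0.2, seed=0):
    """Tabular Q-learning on a corridor; the goal is the rightmost state."""
    rng = random.Random(seed)
    goal = n_states - 1
    # q[state][action]: action 0 moves left, action 1 moves right.
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            # Explore with probability epsilon, otherwise act greedily.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 1 if q[state][1] >= q[state][0] else 0
            next_state = max(0, state - 1) if action == 0 else state + 1
            # Reward reaching the goal; penalize every other step.
            reward = 1.0 if next_state == goal else -0.01
            target = reward
            if next_state != goal:
                target += gamma * max(q[next_state])
            # Move the estimate a fraction alpha toward the bootstrapped target.
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if state == goal:
                break
    return q


q_table = train_corridor_agent()
greedy_policy = ["right" if right > left else "left" for left, right in q_table[:-1]]
```

After training, the greedy policy read from the table moves right in every non-goal state: the agent has tried both actions, observed which sequences earn more reward, and kept the better strategy.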

REINFORCEMENT LEARNING ENVIRONMENTS

Creating reinforcement learning models that are both high performing and highly efficient is an important step in the
commercial deployment of reinforcement learning.

Arcade Learning Environment: Atari-57
[Figure 2.6.1: Atari-57, mean human-normalized score (in thousands) by year, 2015–2021. Labeled value: 9.62. Source: Papers with Code, 2021; arXiv, 2021 | Chart: 2022 AI Index Report.]

Procgen
[Figure 2.6.2: A screenshot of the 16 game environments in Procgen. Source: Cobbe et al., 2019.]
[Figure 2.6.3: Procgen, mean-normalized score by year, 2019–2021. Labeled value: 0.64.]

Human Games: Chess
[Figure 2.6.4: Chess software engines, Elo score by year, 1987–2022. Labeled value: 3,581. Reference levels: 2882, Magnus Carlsen; 2300, expert; 1700, intermediate; 800, novice. Source: Swedish Computer Chess Association, 2021 | Chart: 2022 AI Index Report.]
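
An Elo gap can be translated into an expected game score (win = 1, draw = 0.5, loss = 0) with the standard Elo formula. The sketch below applies it to the two ratings labeled on the chess chart, 3,581 for the top engine and 2882 for Magnus Carlsen. Treat the result as indicative only: engine rating lists and human ratings come from different player pools, so placing them on one scale is an assumption.

```python
def elo_expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings labeled on the chess chart.
engine_vs_carlsen = elo_expected_score(3581, 2882)
```

A gap of 699 points gives the engine an expected score of roughly 0.98 per game, while two equally rated players each have an expected score of 0.5.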

2.7 HARDWARE
In evaluating technical progress in AI, it is relevant to consider not only improvements in technical performance but also the speed of
operation. As this section shows, AI systems continue to improve in virtually every skill category. This performance is often realized by
increasing parameters and training systems on greater amounts of data. However, all else being equal, models that use more parameters
and source more data will take longer to train. Longer training times mean slower real-world deployment. Because longer training times
can be offset by stronger and more robust computational infrastructure, it is important to keep track of
progress in the hardware that powers AI systems.

MLPerf: Training Time
[Figure 2.7.1: MLPerf training time of top systems by task, in minutes (log scale), 2018–2021. Labeled values:
  13.57, Reinforcement Learning
  3.24, Object Detection (heavy-weight)
  2.38, Speech Recognition
  1.26, Image Segmentation
  0.63, Recommendation
  0.34, Object Detection (light-weight)
  0.23, Language Processing
  0.23, Image Classification
Source: MLPerf, 2021 | Chart: 2022 AI Index Report.]

Top-performing hardware systems can reach baseline levels of performance in task categories like recommendation,
light-weight object detection, image classification, and language processing in under a minute.

[Figure 2.7.2: MLPerf, scale of improvement across tasks, relative to an improvement baseline. Tasks shown: Reinforcement Learning, Speech Recognition, Language Processing, Recommendation, Segmentation, Object Detection (light-weight), Object Detection (heavy-weight), Image Classification. Bar labels shown: 26.96, 22.25, 16.47, 2.38, 1.70, 1.92, 1.16.]

MLPerf: Number of Accelerators
[Figure 2.7.3: MLPerf hardware, number of accelerators at 12.12.2018, 06.10.2019, 07.29.2020, 06.30.2021, and 12.01.2021. Labeled values: 4,320, maximum number of accelerators used; 1,785, average accelerators used by top system; 337, mean number of accelerators.]

ImageNet: Training Cost
[Figure 2.7.4: ImageNet, training cost (to 93% accuracy) in U.S. dollars (log scale) by year, 2017–2021. Labeled value: $4.59. Source: AI Index and Narayanan, 2021 | Chart: 2022 AI Index Report.]

2.8 ROBOTICS
In 2021, the AI Index developed a survey that asked professors who specialize in robotics at top-ranked universities around the world and
in emerging economies about changes in the pricing of robotic arms as well as the uses of robotic arms in research labs. The survey was
completed by 101 professors and researchers from over 40 universities and collected data on 117 robotic arm purchase events from 2016
to 2022. The survey results suggest that there has been a notable decline in the price of robotic arms since 2016.

Price Trends in Robotic Arms
[Figure 2.8.1: Median price of robotic arms, 2017–21, in thousands of U.S. dollars. Labeled value: $22.60. Source: AI Index, 2022 | Chart: 2022 AI Index Report.]
[Figure 2.8.2: Distribution of robotic arm prices, 2017–21, in thousands of U.S. dollars. Source: AI Index, 2022 | Chart: 2022 AI Index Report.]

AI Skills Employed by Robotics Professors
[Figure 2.8.3: AI skills employed by robotics professors, % of respondents. Labeled values: 67.00%, deep learning; 46.00%, reinforcement learning. Source: AI Index, 2022 | Chart: 2022 AI Index Report.]
