CHAPTER 2: Technical Performance
Artificial Intelligence Index Report 2022
Chapter Preview
Overview
Chapter Highlights
2.1 COMPUTER VISION–IMAGE
  Image Classification
    ImageNet
    ImageNet: Top-1 Accuracy
    ImageNet: Top-5 Accuracy
  Image Generation
    STL-10: Fréchet Inception Distance (FID) Score
    CIFAR-10: Fréchet Inception Distance (FID) Score
  Deepfake Detection
    FaceForensics++
    Celeb-DF
  Human Pose Estimation
    Leeds Sports Poses: Percentage of Correct Keypoints (PCK)
    Human3.6M: Average Mean Per Joint Position Error (MPJPE)
  Semantic Segmentation
    Cityscapes
  Medical Image Segmentation
    CVC-ClinicDB and Kvasir-SEG
  Face Detection and Recognition
    National Institute of Standards and Technology (NIST) Face Recognition Vendor Test (FRVT)
    Face Detection: Effects of Mask-Wearing
    Face Recognition Vendor Test (FRVT): Face-Mask Effects
    Highlight: Masked Labeled Faces in the Wild (MLFW)
  Visual Reasoning
    Visual Question Answering (VQA) Challenge
2.2 COMPUTER VISION–VIDEO
  Activity Recognition
    Kinetics-400, Kinetics-600, Kinetics-700
    ActivityNet: Temporal Action Localization Task
  Object Detection
    Common Object in Context (COCO)
    You Only Look Once (YOLO)
    Visual Commonsense Reasoning (VCR)
2.1 COMPUTER VISION–IMAGE
Computer vision is the subfield of AI that teaches machines to understand images and videos. There is a wide range of computer
vision tasks, such as image classification, object recognition, semantic segmentation, and face detection. As of 2021, computers can
outperform humans on a plethora of computer vision tasks. Computer vision technologies have a variety of important real-world
applications, such as autonomous driving, crowd surveillance, sports analytics, and video-game creation.
ImageNet
Figure 2.1.1
IMAGE GENERATION
[Chart: 2018–2021. Labeled value: 7.71]
Figure 2.1.5
[Chart: Fréchet Inception Distance (FID) Score, 2017–2021. Labeled value: 2.10]
Figure 2.1.6
DEEPFAKE DETECTION
FaceForensics++
FACEFORENSICS++: ACCURACY
Source: arXiv, 2021 | Chart: 2022 AI Index Report
[Chart: Accuracy (%), 2012–2021. Labeled value: 93.25%, NeuralTextures]
Figure 2.1.7
Celeb-DF
[Chart: Area Under Curve Score (AUC), 2018–2021. Labeled value: 76.88]
Figure 2.1.8
HUMAN POSE ESTIMATION
A DEMONSTRATION OF HUMAN POSE ESTIMATION
Source: Cao et al., 2019
Figure 2.1.9
[Chart: Percentage of Correct Keypoints (PCK), 2014–2021. Labeled value: 99.50%]
Figure 2.1.10
[Chart: Average MPJPE (mm)]
Figure 2.1.12
Cityscapes
[Chart: 2014–2021]
Figure 2.1.13
Figure 2.1.14
[Charts: Mean DICE, 2015–2021, two panels. Labeled values: 94.20% and 92.17%]
Figure 2.1.15a Figure 2.1.15b
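The Mean DICE axis on the two segmentation charts is not defined in this extract. As a minimal sketch, assuming it denotes the Dice coefficient (twice the overlap between the predicted and ground-truth masks, divided by their combined size) averaged over test images, the computation could look like the following. The function names and the tiny flattened masks are hypothetical illustrations, not data from either benchmark.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice = 2 * |P & T| / (|P| + |T|) for two flat binary masks."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must contain the same number of pixels")
    overlap = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    size = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if size == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return 2.0 * overlap / size


def mean_dice(mask_pairs):
    """Average the Dice coefficient over (prediction, ground truth) pairs."""
    scores = [dice_coefficient(p, t) for p, t in mask_pairs]
    return sum(scores) / len(scores)


# Hypothetical flattened 2x3 masks (1 = foreground, 0 = background).
example_pairs = [
    ([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1]),  # overlap 2, sizes 3 + 3
    ([0, 1, 1, 0, 0, 0], [0, 1, 1, 0, 0, 0]),  # identical masks
]
print(f"Mean Dice: {mean_dice(example_pairs):.4f}")
```

A score of 1.0 means the predicted and reference regions coincide exactly, which is why values in the low-to-mid 90s on the charts indicate close but imperfect overlap.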
NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY (NIST) FACE RECOGNITION VENDOR TEST (FRVT): VERIFICATION ACCURACY by DATASET
Source: National Institute of Standards and Technology, 2021 | Chart: 2022 AI Index Report
[Chart: False Non-Match Rate: FNMR (Log-Scale). Labeled values:]
Dataset              Operating point                        FNMR
BORDER Photos        FNMR @ FMR = 0.000001                  0.0044
VISABORDER Photos    FNMR @ FMR = 0.000001                  0.0023
MUGSHOT Photos       FNMR @ FMR = 0.00001                   0.0022
MUGSHOT Photos       FNMR @ FMR = 0.00001, DT>=12 YRS       0.0021
VISA Photos          FNMR @ FMR = 0.000001                  0.0013
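The FRVT results are stated as a false non-match rate (FNMR) at a fixed false match rate (FMR): the decision threshold is set so that impostor pairs are wrongly accepted at most at the target rate, and FNMR is the share of genuine pairs rejected at that threshold. The sketch below illustrates only that definition. The function name and the similarity scores are hypothetical, and NIST's actual evaluation protocol is considerably more involved.

```python
def fnmr_at_fmr(genuine_scores, impostor_scores, target_fmr):
    """Return (threshold, FNMR) at the lowest threshold whose FMR <= target_fmr.

    A pair is declared a match when its similarity score is >= threshold.
    """
    if not genuine_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")
    if target_fmr < 0:
        raise ValueError("target_fmr must be non-negative")
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    thresholds.append(float("inf"))  # rejecting every pair always gives FMR = 0
    for threshold in thresholds:
        false_matches = sum(1 for s in impostor_scores if s >= threshold)
        if false_matches / len(impostor_scores) <= target_fmr:
            false_non_matches = sum(1 for s in genuine_scores if s < threshold)
            return threshold, false_non_matches / len(genuine_scores)
    raise AssertionError("unreachable: the infinite threshold has FMR 0")


# Hypothetical similarity scores (higher = more likely the same person).
genuine = [0.91, 0.85, 0.78, 0.60, 0.95]
impostor = [0.10, 0.20, 0.35, 0.65, 0.15, 0.05, 0.30, 0.25, 0.40, 0.12]

for fmr in (0.1, 0.0):
    threshold, fnmr = fnmr_at_fmr(genuine, impostor, fmr)
    print(f"FMR <= {fmr}: threshold {threshold}, FNMR {fnmr}")
```

Tightening the target FMR pushes the threshold up, which can only hold or raise FNMR. This is why each charted value is only meaningful together with the FMR it was measured at.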
FACE DETECTION: EFFECTS OF MASK-WEARING
Face Recognition Vendor Test (FRVT): Face-Mask Effects
Although facial recognition technology has existed for several decades, the technical progress in the last few years has been significant. Some of today’s top-performing facial recognition algorithms have a near 100% success rate on challenging datasets.
[Chart: False Non-Match Rate: FNMR (Log-Scale), 2019–2021. Labeled values: 0.014, Masked; 0.002, Non-masked]
Figure 2.1.17
Figure 2.1.18
As part of the dataset release, the researchers ran a series of existing state-of-the-art detection algorithms on a variety of facial recognition datasets, including their own, to determine how much detection performance decreased when faces were masked. Their estimates suggest that top methods perform 5 to 16 percentage points worse on masked faces than on unmasked ones. These findings are broadly consistent with the FRVT face-mask tests: Performance deteriorates when masks are included, but only to a moderate degree.
STATE-OF-THE-ART FACE DETECTION METHODS on MASKED LABELED FACES IN THE WILD (MLFW): ACCURACY
Source: Wang et al., 2021 | Chart: 2022 AI Index Report
[Chart: accuracy in six groups of bars, each covering the CALFW, CPLFW, LFW, SLLFW, and MLFW datasets]
Figure 2.1.19
VISUAL REASONING
AN EXAMPLE OF A VISUAL REASONING TASK
Source: Goyal et al., 2021
Figure 2.1.20
Figure 2.1.21
[Chart: Accuracy (%), 2015–2021]
Figure 2.1.22
2.2 COMPUTER VISION–VIDEO
Video analysis concerns reasoning or task operation across sequential frames (videos), rather than single frames (images). Video computer vision has a wide range of use cases, which include assisting criminal surveillance efforts, sports analytics, autonomous driving, navigation of robots, and crowd monitoring.
Figure 2.2.1
[Chart: Top-1 Accuracy (%), 2016–2022. Labeled value: 82.20%, Kinetics-700]
Figure 2.2.2
[Chart: Mean Average Precision (mAP). Labeled value: 44.67%]
OBJECT DETECTION
Figure 2.2.4
[Chart: 2015–2021. Labeled value: 79.50%, With Extra Training Data]
Figure 2.2.5
STATE OF THE ART (SOTA) vs. YOU ONLY LOOK ONCE (YOLO): MEAN AVERAGE PRECISION (mAP50)
Source: arXiv, 2021; GitHub, 2021 | Chart: 2022 AI Index Report
[Chart: Mean Average Precision (mAP50), 2016–2021. Labeled values: 79.50%, SOTA; 72.40%, YOLO]
Figure 2.2.6
Figure 2.2.7
[Chart: Q->AR Score, 2018–2021. Labeled values: 85.00, Human Baseline; 72.00]
Figure 2.2.8
2.3 LANGUAGE
Natural language processing (NLP) is a subfield of AI with roots that stretch back as far as the 1950s. NLP involves research into systems that can read, generate, and reason about natural language. The field has evolved from early systems built on handwritten rules and statistical methods into one that now combines computational linguistics, rule-based modeling, statistical learning, and deep learning.
This section looks at progress in NLP across several language task domains, including: (1) English language understanding; (2) text summarization; (3) natural language inference; (4) sentiment analysis; and (5) machine translation. In the last decade, technical progress in NLP has been significant: The adoption of deep neural network–style machine learning methods has meant that many AI systems can now execute complex language tasks better than many human baselines.
ENGLISH LANGUAGE UNDERSTANDING
SuperGLUE
Figure 2.3.1
SUPERGLUE: SCORE
Source: SuperGLUE Leaderboard, 2021 | Chart: 2022 AI Index Report
[Chart: Score, 2019–2021. Labeled values: 91.00; 89.80, Human Performance]
Figure 2.3.2
Figure 2.3.3
Figure 2.3.5
Although AI systems are presently capable of achieving a relatively high level of performance on the easy set of questions, they struggle on the hard set.
[Chart: Accuracy (%), 2020–2021]
Figure 2.3.6
TEXT SUMMARIZATION
arXiv
ARXIV: ROUGE-1
Source: Papers with Code, 2021; arXiv, 2021 | Chart: 2022 AI Index Report
[Chart: ROUGE-1, 2017–2021. Labeled value: 46.74, With Extra Training Data]
Figure 2.3.7
PubMed
PUBMED: ROUGE-1
Source: Papers with Code, 2021; arXiv, 2021 | Chart: 2022 AI Index Report
[Chart: ROUGE-1, 2017–2021]
Figure 2.3.8
Figure 2.3.9
[Chart: Accuracy (%), 2014–2021. Labeled value: 93.10%]
Figure 2.3.10
Figure 2.3.11
[Chart: Accuracy (%). Labeled value: 91.87%]
SENTIMENT ANALYSIS
A SAMPLE SEMEVAL TASK
Source: Pontiki et al., 2014
Figure 2.3.13
[Chart: Accuracy (%), 2015–2021. Labeled value: 88.64%]
Figure 2.3.14
MACHINE TRANSLATION (MT)
[Charts: BLEU score, two panels]
[Chart: Number of Independent Machine Translation Services, 05/2017–10/2021. Legend entry: Commercial]
Figure 2.3.16
2.4 SPEECH
Another important domain of AI research is the analysis, recognition, and synthesis of human speech. In this subfield, AI systems are typically rated on two abilities: speech recognition (identifying spoken words and converting them into text) and speaker recognition (identifying the individual who is speaking). Modern voice assistants, such as Siri, are among the many examples of commercially applied AI speech technology.
SPEECH RECOGNITION
LIBRISPEECH, TEST CLEAN: WORD ERROR RATE (WER)
LIBRISPEECH, TEST OTHER: WORD ERROR RATE (WER)
Source: Papers with Code, 2021; arXiv, 2021 | Chart: 2022 AI Index Report
[Charts: Word Error Rate (WER), two panels. Labeled value: 1.70, Without Extra Training Data]
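Word error rate is conventionally the word-level edit distance between a system transcript and the reference transcript (substitutions, deletions, and insertions) divided by the number of reference words. The following is a minimal sketch of that convention. The function name and the sample sentences are hypothetical, and benchmark scoring typically also applies text normalization that is not shown here.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substitution ("sat" -> "sit") and one deletion.
reference_text = "the cat sat on the mat"
hypothesis_text = "the cat sit on mat"
wer = word_error_rate(reference_text, hypothesis_text)
print(f"WER: {100 * wer:.2f}%")
```

Because insertions count as errors, WER can exceed 100% when a transcript contains many extra words. Lower is better.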
VoxCeleb
[Chart: Equal Error Rate (%), 2017–2021. Labeled value: 0.42%]
Figure 2.4.2
2.5 RECOMMENDATION
Recommendation is the task of suggesting items that might be of interest to a user, such as movies to watch, articles to read, or products to purchase. Recommendation systems are crucial to businesses such as Amazon, Netflix, Spotify, and YouTube. One of the earliest open recommendation competitions in AI was the Netflix Prize; launched in 2006 and awarded in 2009, it challenged computer scientists to develop algorithms that could accurately predict user ratings for films based on previously submitted ratings.
Commercial Recommendation: MovieLens 20M
[Chart: Normalized Discounted Cumulative Gain@100 (nDCG@100), 2018–2021. Labeled value: 0.448]
Figure 2.5.1
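nDCG@k rewards a recommender for placing relevant items near the top of its ranked list: each item's relevance is discounted by the logarithm of its rank, and the total is divided by the score of an ideally ordered list. The sketch below uses the common log2 discount and assumes, for simplicity, that the ranked list already contains every candidate item, so the ideal ordering is just a sort of the same relevances. The function names and relevance labels are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (rank starts at 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_relevances, k):
    """DCG@k divided by the DCG@k of the same items in ideal order."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical ranking: 1 = the user liked the item, 0 = the user did not.
ranking = [0, 1, 1, 0]
print(f"nDCG@4: {ndcg_at_k(ranking, 4):.4f}")
```

A value of 1.0 means the list is ordered as well as possible. The chart's nDCG@100 applies the same idea to the first 100 recommended items.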
[Chart: Area Under Curve Score (AUC), 2016–2021. Labeled value: 0.813]
Figure 2.5.2
2.6 REINFORCEMENT LEARNING
In reinforcement learning, AI systems are trained to maximize performance on a given task by interactively learning from their prior actions. Researchers train these systems by rewarding them when they achieve a desired goal and penalizing them when they fail. Systems experiment with different sequences of strategies to solve their stated problem (e.g., playing chess or navigating through a maze) and select the strategies that maximize their rewards.
Reinforcement learning makes the news whenever programs like DeepMind’s AlphaZero demonstrate superhuman performance on games like Go and chess. However, reinforcement learning is useful in any commercial domain where computer agents need to maximize a target goal or stand to benefit from learning from previous experiences. Reinforcement learning can help autonomous vehicles change lanes, robots optimize manufacturing tasks, or time-series models predict future events.
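The reward-and-penalty loop described above can be illustrated with tabular Q-learning on a toy corridor, a one-dimensional stand-in for the maze example. This is a sketch under stated assumptions: the environment, the reward values, and the hyperparameters are all hypothetical, and it is far simpler than systems such as AlphaZero.

```python
import random

N_STATES = 5        # corridor cells 0..4; cell 0 is a pit, cell 4 is the goal
ACTIONS = (-1, +1)  # step left or step right


def step(state, action):
    """Move one cell; return (next_state, reward, episode_finished)."""
    nxt = state + action
    if nxt == N_STATES - 1:
        return nxt, 1.0, True    # reward for reaching the goal
    if nxt == 0:
        return nxt, -1.0, True   # penalty for falling into the pit
    return nxt, 0.0, False


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values by trial and error (epsilon-greedy Q-learning)."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = rng.choice([1, 2, 3])
        done = False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore a random action
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            target = reward + gamma * best_next
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = nxt
    return q


def greedy_policy(q):
    """Pick the highest-valued action in every non-terminal cell."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(1, N_STATES - 1)}


q_table = train()
print(greedy_policy(q_table))  # +1 means "step right", toward the goal
```

The agent is never told where the goal is. It only observes rewards, and the learned values propagate backward from the goal until stepping right looks best in every cell.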
[Chart: Mean-Human Normalized Score (in thousands), 2015–2021. Labeled value: 9.62]
Figure 2.6.1
Procgen
A SCREENSHOT OF THE 16 GAME ENVIRONMENTS IN PROCGEN
Source: Cobbe et al., 2019
Figure 2.6.2
[Chart: Mean-Normalized Score. Labeled value: 0.64]
[Chart: Elo Score, 1987–2022. Labeled values: 3,581; 2300, Expert; 1700, Intermediate; 800, Novice]
Figure 2.6.4
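To give a sense of what the gaps on the Elo Score axis mean, the standard Elo model converts a rating difference into an expected score (roughly, the probability of winning, with draws counted as half). The sketch below applies that standard formula to two of the labeled values on the chart. The function names and the K-factor of 20 are assumptions for illustration.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def updated_rating(rating, expected, actual, k_factor=20):
    """New rating after one game; actual is 1 (win), 0.5 (draw), or 0 (loss)."""
    return rating + k_factor * (actual - expected)


# Labeled values from the chart: "2300, Expert" and the top value 3,581.
expert, top_system = 2300, 3581
p_expert = expected_score(expert, top_system)
print(f"Expected score of a 2300-rated player against 3581: {p_expert:.5f}")
print(f"Rating after a loss: {updated_rating(expert, p_expert, 0):.2f}")
```

Under this model, every 400-point gap multiplies the odds by ten, so a gap of more than 1,200 points leaves the lower-rated player with an expected score below 0.1%.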
2.7 HARDWARE
In evaluating technical progress in AI, it is relevant to consider not only improvements in technical performance but also the speed of operation. As this section shows, AI systems continue to improve in virtually every skill category. These gains are often realized by increasing the number of parameters and training systems on greater amounts of data. However, all else being equal, models that use more parameters and more data take longer to train, and longer training times mean slower real-world deployment. Because longer training times can be offset by stronger and more robust computational infrastructure, it is important to keep track of progress in the hardware that powers AI systems.
MLPerf: Training Time
[Chart: Training Time (Minutes; Log Scale). Labeled values: 3.24, Object Detection (heavy-weight); 2.38, Speech recognition]
[Chart: Scale of Improvement by task: Reinforcement Learning, Speech Recognition, Language Processing, Recommendation, Segmentation, Object Detection (light-weight), Object Detection (heavy-weight), Image Classification. Legend entry: Improvement Baseline. Labeled values, in extraction order and not matched to tasks: 26.96, 22.25, 16.47, 2.38, 1.70, 1.92, 1.16]
Figure 2.7.2
[Chart: Number of Accelerators. Dates shown: 12.12.2018, 06.10.2019, 07.29.2020, 06.30.2021, 12.01.2021]
Figure 2.7.3
[Chart: Cost (in U.S. Dollars; Log Scale), 2017–2021. Labeled value: $4.59]
Figure 2.7.4
2.8 ROBOTICS
In 2021, the AI Index developed a survey that asked professors who specialize in robotics at top-ranked universities around the world and in emerging economies about changes in the pricing of robotic arms as well as the uses of robotic arms in research labs. The survey was completed by 101 professors and researchers from over 40 universities and collected data on 117 robotic arm purchase events from 2016 to 2022. The survey results suggest that there has been a notable decline in the price of robotic arms since 2016.
Price Trends in Robotic Arms
[Chart: Price (in thousands of U.S. Dollars), 2017–2021. Labeled value: $22.60]
Figure 2.8.1
[Chart: Price (in thousands of U.S. Dollars), 2017–2021]
Figure 2.8.2