Example

Artificial Intelligence
Index Report 2023
CHAPTER 1: Research and Development
CHAPTER 1 PREVIEW:
Research and Development

Overview
Chapter Highlights
1.1 Publications
  Overview
    Total Number of AI Publications
    By Type of Publication
    By Field of Study
    By Sector
    Cross-Country Collaboration
    Cross-Sector Collaboration
  AI Journal Publications
    Overview
    By Region
    By Geographic Area
    Citations
  AI Conference Publications
    Overview
    By Region
    By Geographic Area
    Citations
  AI Repositories
    Overview
    By Region
    By Geographic Area
    Citations
  Narrative Highlight: Top Publishing Institutions
    All Fields
    Computer Vision
    Natural Language Processing
    Speech Recognition
1.2 Trends in Significant Machine Learning Systems
  General Machine Learning Systems
    System Types
    Sector Analysis
    National Affiliation
    Systems
    Authorship
    Parameter Trends
    Compute Trends
  Large Language and Multimodal Models
    National Affiliation
    Parameter Count
    Training Compute
    Training Cost
1.3 AI Conferences
  Conference Attendance
1.4 Open-Source AI Software
  Projects
  Stars
Appendix
Overview
This chapter captures trends in AI R&D. It begins by examining AI publications,
including journal articles, conference papers, and repositories. Next it considers data
on significant machine learning systems, including large language and multimodal
models. Finally, the chapter concludes by looking at AI conference attendance and
open-source AI research. Although the United States and China continue to dominate
AI R&D, research efforts are becoming increasingly geographically dispersed.
Chapter Highlights

The United States and China had the greatest number of cross-country collaborations in AI publications from 2010 to 2021, although the pace of collaboration has since slowed.
The number of AI research collaborations between the United States and China increased roughly 4 times since 2010, and was 2.5 times greater than the collaboration totals of the next nearest country pair, the United Kingdom and China. However, the total number of U.S.-China collaborations only increased by 2.1% from 2020 to 2021, the smallest year-over-year growth rate since 2010.

AI research is on the rise, across the board.
The total number of AI publications has more than doubled since 2010. The specific AI topics that continue to dominate research include pattern recognition, machine learning, and computer vision.

China continues to lead in total AI journal, conference, and repository publications.
The United States is still ahead in terms of AI conference and repository citations, but those leads are slowly eroding. Still, the majority of the world’s large language and multimodal models (54% in 2022) are produced by American institutions.

Industry races ahead of academia.
Until 2014, most significant machine learning models were released by academia. Since then, industry has taken over. In 2022, there were 32 significant industry-produced machine learning models compared to just three produced by academia. Building state-of-the-art AI systems increasingly requires large amounts of data, computer power, and money, resources that industry actors inherently possess in greater amounts compared to nonprofits and academia.

Large language models are getting bigger and more expensive.
GPT-2, released in 2019, considered by many to be the first large language model, had 1.5 billion parameters and cost an estimated $50,000 USD to train. PaLM, one of the flagship large language models launched in 2022, had 540 billion parameters and cost an estimated $8 million USD. PaLM was around 360 times larger than GPT-2 and cost 160 times more. It’s not just PaLM: Across the board, large language and multimodal models are becoming larger and pricier.
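
The size and cost multiples quoted for PaLM relative to GPT-2 follow directly from the parameter counts and estimated training costs given in the highlight. As a minimal sketch (the constant and function names are illustrative, not part of the report), the arithmetic can be checked as follows:

```python
# Figures quoted in the chapter highlight (parameters and estimated USD cost).
GPT2_PARAMS = 1.5e9
PALM_PARAMS = 540e9
GPT2_COST_USD = 50_000
PALM_COST_USD = 8_000_000


def ratio(newer: float, older: float) -> float:
    """Return how many times larger `newer` is than `older`."""
    return newer / older


size_ratio = ratio(PALM_PARAMS, GPT2_PARAMS)      # parameter-count multiple
cost_ratio = ratio(PALM_COST_USD, GPT2_COST_USD)  # training-cost multiple

print(f"PaLM vs. GPT-2: {size_ratio:.0f}x parameters, {cost_ratio:.0f}x cost")
```

Both ratios use only the estimates stated in the text; the cost figures are themselves estimates, so the cost multiple is approximate.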
1.1 Publications
This section draws on data from the Center for Security and Emerging Technology (CSET) at Georgetown University. CSET maintains a
merged corpus of scholarly literature that includes Digital Science’s Dimensions, Clarivate’s Web of Science, Microsoft Academic Graph,
China National Knowledge Infrastructure, arXiv, and Papers With Code. In that corpus, CSET applied a classifier to identify English-
language publications related to the development or application of AI and ML since 2010. For this year’s report, CSET also used select
Chinese AI keywords to identify Chinese-language AI papers; CSET did not deploy this method for previous iterations of the AI Index report.1
In last year’s edition of the report, publication trends were reported up to the year 2021. However, given that there is a significant lag in the
collection of publication metadata, and that in some cases it takes until the middle of any given year to fully capture the previous year’s
publications, in this year’s report, the AI Index team elected to examine publication trends only through 2021, which we, along with CSET,
are confident yields a more fully representative report.
Overview
The figures below capture the total number of English-language and Chinese-language AI publications globally from 2010 to 2021, broken down by type, affiliation, cross-country collaboration, and cross-industry collaboration. The section also breaks down publication and citation data by region for AI journal articles, conference papers, repositories, and patents.

Total Number of AI Publications
Figure 1.1.1 shows the number of AI publications in the world. From 2010 to 2021, the total number of AI publications more than doubled, growing from 200,000 in 2010 to almost 500,000 in 2021.
Figure 1.1.1. Number of AI Publications in the World, 2010–21
Source: Center for Security and Emerging Technology, 2022 | Chart: 2023 AI Index Report
Y-axis: Number of AI Publications (in Thousands). 2021 value: 496.01.
1 See the Appendix for more information on CSET’s methodology. For more on the challenge of defining AI and correctly capturing relevant bibliometric data, see the AI Index team’s
discussion in the paper “Measurement in AI Policy: Opportunities and Challenges.”
By Type of Publication
Figure 1.1.2 shows the types of AI publications released globally over time. In 2021, 60% of all published AI documents were journal articles, 17% were conference papers, and 13% were repository submissions. Books, book chapters, theses, and unknown document types made up the remaining 10% of publications. While journal and repository publications have grown 3 and 26.6 times, respectively, in the past 12 years, the number of conference papers has declined since 2019.
Figure 1.1.2. Number of AI Publications by Type, 2010–21
2021 values (in thousands): Journal 293.48; Conference 85.09; Repository 65.21; Thesis 29.88; Book Chapter 13.77; Unknown 5.82; Book 2.76.
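
The percentage shares by publication type can be approximately recomputed from the 2021 end-of-series values labeled in Figure 1.1.2. The sketch below assumes those labels are in thousands (as in Figure 1.1.1); the variable names are illustrative only.

```python
# 2021 values labeled in Figure 1.1.2, assumed to be in thousands of publications.
counts_2021 = {
    "Journal": 293.48,
    "Conference": 85.09,
    "Repository": 65.21,
    "Thesis": 29.88,
    "Book Chapter": 13.77,
    "Unknown": 5.82,
    "Book": 2.76,
}

# The per-type values should add up to the 2021 total shown in Figure 1.1.1.
total = sum(counts_2021.values())

# Share of each publication type as a percentage of the total.
shares = {kind: 100 * value / total for kind, value in counts_2021.items()}

for kind, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{kind:<13}{share:5.1f}%")
```

The per-type values sum to the 496.01 thousand total reported for 2021. Computed this way, the journal share comes out slightly below the rounded 60% cited in the text, while the conference and repository shares match the cited 17% and 13%.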
By Field of Study
Figure 1.1.3 shows that publications in pattern recognition and machine learning have experienced the sharpest growth in the last half decade. Since 2015, the number of pattern recognition papers has roughly doubled while the number of machine learning papers has roughly quadrupled. Following those two topic areas, in 2021, the next most published AI fields of study were computer vision (30,075), algorithm (21,527), and data mining (19,181).
Figure 1.1.3. Number of AI Publications by Field of Study (Excluding Other AI), 2010–21
2021 values (in thousands): Pattern Recognition 59.36; Machine Learning 42.55; Computer Vision 30.07; Algorithm 21.53; Data Mining 19.18; Natural Language Processing 14.99; Control Theory 11.57; Human–Computer Interaction 10.37; Linguistics 6.74.
By Sector
This section shows the number of AI publications affiliated with education, government, industry, nonprofit, and other sectors, first globally (Figure 1.1.4), then looking at the United States, China, and the European Union plus the United Kingdom (Figure 1.1.5).2 The education sector dominates in each region. The level of industry participation is highest in the United States, then in the European Union. Since 2010, the share of education AI publications has been dropping in each region.
Figure 1.1.4. AI Publications (% of Total) by Sector, 2010–21
2021 values: Education 75.23%; Nonprofit 13.60%; Industry 7.21%; Government 3.74%; Other 0.22%.
2 The categorization is adapted based on the Global Research Identifier Database (GRID). Healthcare, including hospitals and facilities, is included under nonprofit. Publications affiliated with
state-sponsored universities are included in the education sector.
Figure 1.1.5. AI Publications (% of Total) by Sector and Geographic Area, 2021

Sector       United States   European Union and United Kingdom   China
Education    69.17%          69.23%                              77.85%
Nonprofit    14.82%          18.63%                              11.73%
Industry     12.60%          7.90%                               5.47%
Government   3.21%           3.92%                               4.74%
Other        0.20%           0.33%                               0.20%
Cross-Country Collaboration
Cross-border collaborations between academics, researchers, industry experts, and others are a key component of modern STEM (science, technology, engineering, and mathematics) development that accelerates the dissemination of new ideas and the growth of research teams. Figures 1.1.6 and 1.1.7 depict the top cross-country AI collaborations from 2010 to 2021. CSET counted cross-country collaborations as distinct pairs of countries across authors for each publication (e.g., four U.S. and four Chinese-affiliated authors on a single publication are counted as one U.S.-China collaboration; two publications between the same authors count as two collaborations).

By far, the greatest number of collaborations in the past 12 years took place between the United States and China, increasing roughly four times since 2010. However, the total number of U.S.-China collaborations only increased by 2.1% from 2020 to 2021, the smallest year-over-year growth rate since 2010.

The next largest set of collaborations was between the United Kingdom and both China and the United States. In 2021, the number of collaborations between the United States and China was 2.5 times greater than between the United Kingdom and China.
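
The counting rule described for cross-country collaborations (one collaboration per distinct pair of countries per publication, regardless of how many authors each country contributes) can be illustrated with a short sketch. This is an assumption-based illustration of the rule as worded in the text, not CSET's actual pipeline; the function name, country codes, and sample publications are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_collaborations(publications):
    """Count cross-country collaborations as distinct country pairs per publication.

    `publications` is an iterable of lists, each holding the country
    affiliation of every author on one publication.
    """
    pair_counts = Counter()
    for author_countries in publications:
        # Collapse repeated countries: many authors from one country count once.
        distinct = sorted(set(author_countries))
        # Every unordered pair of distinct countries is one collaboration.
        for pair in combinations(distinct, 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical sample data for illustration only.
publications = [
    ["US"] * 4 + ["CN"] * 4,   # four U.S. and four Chinese authors: one US-CN pair
    ["US", "CN"],              # a second paper: a second US-CN collaboration
    ["US", "CN", "GB"],        # three countries: three distinct pairs
    ["US", "US"],              # single-country paper: no collaboration
]

counts = count_collaborations(publications)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Sorting the country codes before pairing keeps each pair in a single canonical order, so a U.S.-China paper and a China-U.S. paper are tallied under the same key.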
Figure 1.1.6. United States and China Collaborations in AI Publications, 2010–21
2021 value as labeled on the chart: 10.47.
Figure 1.1.7. Cross-Country Collaborations in AI Publications (Excluding U.S. and China), 2010–21
2021 values as labeled on the chart: United Kingdom and China 4.13; United States and United Kingdom 4.04; United States and Germany 3.42; China and Australia 2.80; United States and Australia 2.61; United States and France 1.83.
Cross-Sector Collaboration
The increase in AI research outside of academia has broadened and grown collaboration across sectors in general. Figure 1.1.8 shows that in 2021 educational institutions and nonprofits (32,551) had the greatest number of collaborations; followed by industry and educational institutions (12,856); and educational and government institutions (8,913). Collaborations between educational institutions and industry have been among the fastest growing, increasing 4.2 times since 2010.
Figure 1.1.8. Cross-Sector Collaborations in AI Publications, 2010–21
2021 values (in thousands): Education and Nonprofit 32.55; Industry and Education 12.86; Education and Government 8.91; Government and Nonprofit 2.95; Industry and Nonprofit 2.26; Industry and Government 0.63.
AI Journal Publications
Overview
After growing only slightly from 2010 to 2015, the number of AI journal publications grew around 2.3 times since
2015. From 2020 to 2021, they increased 14.8% (Figure 1.1.9).
Figure 1.1.9. Number of AI Journal Publications, 2010–21
Y-axis: Number of AI Journal Publications (in Thousands). 2021 value: 293.48.
By Region3
Figure 1.1.10 shows the share of AI journal publications by region between 2010 and 2021. In 2021, East Asia and the Pacific led with 47.1%, followed by Europe and Central Asia (17.2%), and then North America (11.6%). Since 2019, the shares of publications from East Asia and the Pacific, Europe and Central Asia, and North America have been declining. During that period, there has been an increase in publications from other regions such as South Asia and the Middle East and North Africa.
Figure 1.1.10. AI Journal Publications (% of World Total) by Region, 2010–21
2021 values: East Asia and Pacific 47.14%; Europe and Central Asia 17.20%; North America 11.61%; Unknown 6.93%; South Asia 6.75%; Middle East and North Africa 4.64%; Latin America and the Caribbean 2.66%; Rest of the World 2.30%; Sub-Saharan Africa 0.77%.
3 Regions in this chapter are classified according to the World Bank analytical grouping.
By Geographic Area4
Figure 1.1.11 breaks down the share of AI journal publications over the past 12 years by geographic area. This year’s AI Index included India in recognition of the increasingly important role it plays in the AI ecosystem. China has remained the leader throughout, with 39.8% in 2021, followed by the European Union and the United Kingdom (15.1%), then the United States (10.0%). The share of Indian publications has been steadily increasing, from 1.3% in 2010 to 5.6% in 2021.
Figure 1.1.11. AI Journal Publications (% of World Total) by Geographic Area, 2010–21
2021 values: China 39.78%; European Union and United Kingdom 15.05%; United States 10.03%; Unknown 6.88%; India 5.56%.
4 In this chapter we use “geographic area” based on CSET’s classifications, which are disaggregated not only by country, but also by territory. Further, we count the European Union and the
United Kingdom as a single geographic area to reflect the regions’ strong history of research collaboration.
Citations
China’s share of citations in AI journal publications has gradually increased since 2010, while those of the European Union and the United Kingdom, as well as those of the United States, have decreased (Figure 1.1.12). China, the European Union and the United Kingdom, and the United States accounted for 65.7% of the total citations in the world.
Figure 1.1.12. AI Journal Citations (% of World Total) by Geographic Area, 2010–21
2021 values recoverable from the chart labels: China 29.07%; United States 15.08%; India 6.05%; Unknown 0.92%.
AI Conference Publications
Overview
The number of AI conference publications peaked in 2019, and fell 20.4% below the peak in 2021 (Figure 1.1.13).
The total number of 2021 AI conference publications, 85,094, was marginally greater than the 2010 total of 75,592.
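
The report states the 2021 and 2010 totals and the size of the decline from the 2019 peak, but not the peak itself. As a rough back-of-the-envelope sketch (the variable names are illustrative, and the derived peak is only an approximation because the 20.4% figure is rounded), the implied peak and the net growth since 2010 can be computed as:

```python
# Totals and percentage stated in the text.
PUBS_2021 = 85_094
PUBS_2010 = 75_592
DROP_FROM_PEAK = 0.204  # 2021 total is 20.4% below the 2019 peak

# If 2021 = peak * (1 - drop), then peak = 2021 / (1 - drop).
implied_peak_2019 = PUBS_2021 / (1 - DROP_FROM_PEAK)

# Net change from 2010 to 2021 as a fraction of the 2010 total.
growth_since_2010 = PUBS_2021 / PUBS_2010 - 1

print(f"Implied 2019 peak: about {implied_peak_2019:,.0f} publications")
print(f"Net growth 2010 to 2021: {growth_since_2010:.1%}")
```

The implied peak is an inference from the stated percentages rather than a value reported in the chapter.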
Figure 1.1.13. Number of AI Conference Publications, 2010–21
Y-axis: Number of AI Conference Publications (in Thousands). 2021 value: 85.09.
By Region
Figure 1.1.14 shows the number of AI conference publications by region. As with the trend in journal publications, East Asia and the Pacific; Europe and Central Asia; and North America account for the world’s highest numbers of AI conference publications. Specifically, the share represented by East Asia and the Pacific continues to rise, accounting for 36.7% in 2021, followed by Europe and Central Asia (22.7%), and then North America (19.6%). The percentage of AI conference publications in South Asia saw a noticeable rise in the past 12 years, growing from 3.6% in 2010 to 8.5% in 2021.
Figure 1.1.14. AI Conference Publications (% of World Total) by Region, 2010–21
2021 values recoverable from the chart labels: North America 19.56%; South Asia 8.45%; Latin America and the Caribbean 3.07%; Unknown 2.76%.
By Geographic Area
In 2021, China produced the greatest share of the world’s AI conference publications at 26.2%, having overtaken the European Union and the United Kingdom in 2017. The European Union plus the United Kingdom followed at 20.3%, and the United States came in third at 17.2% (Figure 1.1.15). Mirroring trends seen in other parts of the research and development section, India’s share of AI conference publications is also increasing.
Figure 1.1.15. AI Conference Publications (% of World Total) by Geographic Area, 2010–21
2021 values: China 26.15%; European Union and United Kingdom 20.29%; United States 17.23%; India 6.79%; Unknown 2.70%.
Citations
Despite China producing the most AI conference publications in 2021, Figure 1.1.16 shows that the United States had the greatest share of AI conference citations, with 23.9%, followed by China’s 22.0%. However, the gap between American and Chinese AI conference citations is narrowing.
Figure 1.1.16. AI Conference Citations (% of World Total) by Geographic Area, 2010–21
2021 values recoverable from the chart labels: China 22.02%; India 6.09%; Unknown 0.87%.
AI Repositories
Overview
Publishing pre-peer-reviewed papers on repositories of electronic preprints (such as arXiv and SSRN) has become a popular way for AI researchers to disseminate their work outside traditional avenues for publication. These repositories allow researchers to share their findings before submitting them to journals and conferences, thereby accelerating the cycle of information discovery. The number of AI repository publications grew almost 27 times in the past 12 years (Figure 1.1.17).
Figure 1.1.17. Number of AI Repository Publications, 2010–21
Y-axis: Number of AI Repository Publications (in Thousands). 2021 value: 65.21.
By Region
Figure 1.1.18 shows that North America has maintained a steady lead in the world share of AI repository publications since 2016. Since 2011, the share of repository publications from Europe and Central Asia has declined. The share represented by East Asia and the Pacific has grown significantly since 2010 and continued growing from 2020 to 2021, a period in which the year-over-year share of North American as well as European and Central Asian repository publications declined.
Figure 1.1.18. AI Repository Publications (% of World Total) by Region, 2010–21
2021 values recoverable from the chart labels: North America 26.32%; Unknown 23.99%; South Asia 3.41%; Latin America and the Caribbean 1.80%.
By Geographic Area
While the United States has held the lead in the percentage of global AI repository publications since 2016, China is catching up, while the European Union plus the United Kingdom’s share continues to drop (Figure 1.1.19). In 2021, the United States accounted for 23.5% of the world’s AI repository publications, followed by the European Union plus the United Kingdom (20.5%), and then China (11.9%).
Figure 1.1.19. AI Repository Publications (% of World Total) by Geographic Area, 2010–21
2021 values recoverable from the chart labels: Unknown 23.18%; European Union and United Kingdom 20.54%; China 11.87%; India 2.85%.
Citations
In the citations of AI repository publications, Figure 1.1.20 shows that in 2021 the United States topped the list with 29.2% of overall citations, maintaining a dominant lead over the European Union plus the United Kingdom (21.5%), as well as China (21.0%).
Figure 1.1.20. AI Repository Citations (% of World Total) by Geographic Area, 2010–21
2021 values recoverable from the chart labels: China 20.98%; Unknown 4.59%; India 1.91%.
Top Publishing Institutions
All Fields
Since 2010, the institution producing the greatest number of total AI papers has been the Chinese Academy of Sciences (Figure 1.1.21). The next top four are all Chinese universities: Tsinghua University, the University of the Chinese Academy of Sciences, Shanghai Jiao Tong University, and Zhejiang University.5 The total number of publications released by each of these institutions in 2021 is displayed in Figure 1.1.22.
Figure 1.1.21. Top Ten Institutions in the World in 2021 Ranked by Number of AI Publications in All Fields, 2010–21
2021 rank:
1. Chinese Academy of Sciences
2. Tsinghua University
3. University of Chinese Academy of Sciences
4. Shanghai Jiao Tong University
5. Zhejiang University
6. Harbin Institute of Technology
7. Beihang University
8. University of Electronic Science and Technology of China
9. Peking University
10. Massachusetts Institute of Technology
5 It is important to note that many Chinese research institutions are large, centralized organizations with thousands of researchers. It is therefore not entirely surprising that,
purely by the metric of publication count, they outpublish most non-Chinese institutions.
Figure 1.1.22. Top Ten Institutions in the World by Number of AI Publications in All Fields, 2021

Institution                                                Number of AI Publications
Chinese Academy of Sciences                                5,099
Tsinghua University                                        3,373
University of Chinese Academy of Sciences                  2,904
Shanghai Jiao Tong University                              2,703
Zhejiang University                                        2,590
Harbin Institute of Technology                             2,016
Beihang University                                         1,970
University of Electronic Science and Technology of China   1,951
Peking University                                          1,893
Massachusetts Institute of Technology                      1,745
Computer Vision
In 2021, the top 10 institutions publishing the greatest number of AI computer vision publications were
all Chinese (Figure 1.1.23). The Chinese Academy of Sciences published the largest number of such
publications, with a total of 562.
Figure 1.1.23. Top Ten Institutions in the World by Number of AI Publications in Computer Vision, 2021

Institution                                          Number of AI Publications
Chinese Academy of Sciences                          562
Shanghai Jiao Tong University                        316
[name partly lost in extraction] ... of Sciences     314
Tsinghua University                                  296
Zhejiang University                                  289
Beihang University                                   247
Wuhan University                                     231
Beijing Institute of Technology                      229
Harbin Institute of Technology                       210
Tianjin University                                   182
Natural Language Processing
American institutions are represented to a greater degree in the share of top NLP publishers (Figure 1.1.24). Although the Chinese Academy of Sciences was again the world’s leading institution in 2021 (182 publications), Carnegie Mellon took second place (140 publications), followed by Microsoft (134). In addition, 2021 was the first year Amazon and Alibaba were represented among the top-ten largest publishing NLP institutions.
Figure 1.1.24. Top Ten Institutions in the World by Number of AI Publications in Natural Language Processing, 2021
Entries recoverable from the chart (some labels were lost in extraction):

Institution                                          Number of AI Publications
Carnegie Mellon University                           140
Microsoft (United States)                            134
Carnegie Mellon University Australia                 116
Google (United States)                               116
Peking University                                    113
[name partly lost in extraction] ... of Sciences     112
Alibaba Group (China)                                100
Amazon (United States)                               98
Speech Recognition
In 2021, the greatest number of speech recognition papers came from the Chinese Academy of Sciences
(107), followed by Microsoft (98) and Google (75) (Figure 1.1.25). The Chinese Academy of Sciences
reclaimed the top spot in 2021 from Microsoft, which held first position in 2020.
Figure 1.1.25. Top Ten Institutions in the World by Number of AI Publications in Speech Recognition, 2021
Entries recoverable from the chart (some labels were lost in extraction):

Institution                                          Number of AI Publications
Microsoft (United States)                            98
Google (United States)                               75
[name partly lost in extraction] ... of Sciences     66
University of Science and Technology of China        59
Carnegie Mellon University                           57
Tencent (China)                                      57
Chinese University of Hong Kong                      55
Amazon (United States)                               54
1.2 Trends in Significant Machine Learning Systems
Epoch AI is a collective of researchers investigating and forecasting the development of advanced AI. Epoch curates a database of
significant AI and machine learning systems that have been released since the 1950s. There are different criteria under which the
Epoch team decides to include particular AI systems in their database; for example, the system may have registered a state-of-the-art
improvement, been deemed to have been historically significant, or been highly cited.
This subsection uses the Epoch database to track trends in significant AI and machine learning systems. The latter half of the chapter
includes research done by the AI Index team that reports trends in large language and multimodal models, which are models trained on
large amounts of data and adaptable to a variety of downstream applications.
General Machine Learning Systems
The figures below report trends among all machine learning systems included in the Epoch dataset. For reference, these systems are referred to as significant machine learning systems throughout the subsection.

System Types
Among the significant AI machine learning systems released in 2022, the most common class of system was language (Figure 1.2.1). There were 23 significant AI language systems released in 2022, roughly six times the number of the next most common system type, multimodal systems.
Figure 1.2.1.6 Number of Significant Machine Learning Systems by Domain, 2022
Source: Epoch, 2022 | Chart: 2023 AI Index Report

Domain          Number of Significant Machine Learning Systems
Language        23
Multimodal      4
Drawing         3
Vision          2
Speech          2
Text-to-Video   1
Other           1
Games           1
6 There were 38 total significant AI machine learning systems released in 2022, according to Epoch; however, one of the systems, BaGuaLu, did not have a domain classification
and is therefore omitted from Figure 1.2.1.
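
The domain counts in Figure 1.2.1 can be reconciled with the 38 systems mentioned in the footnote, and the "roughly six times" comparison checked, with a few lines of arithmetic. The sketch below uses only the counts shown in the figure; the variable names are illustrative.

```python
# Counts from Figure 1.2.1 (systems released in 2022, by domain).
systems_by_domain = {
    "Language": 23,
    "Multimodal": 4,
    "Drawing": 3,
    "Vision": 2,
    "Speech": 2,
    "Text-to-Video": 1,
    "Other": 1,
    "Games": 1,
}

# BaGuaLu has no domain classification and is omitted from the figure.
UNCLASSIFIED_SYSTEMS = 1

charted_total = sum(systems_by_domain.values())
overall_total = charted_total + UNCLASSIFIED_SYSTEMS

# Language systems relative to the next most common type, multimodal.
language_to_multimodal = systems_by_domain["Language"] / systems_by_domain["Multimodal"]

print(charted_total, overall_total, language_to_multimodal)
```

The charted domains sum to 37, which together with the one unclassified system gives the 38 systems cited by Epoch.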
Sector Analysis
Which sector among industry, academia, or nonprofit has released the greatest number of significant machine learning systems? Until 2014, most machine learning systems were released by academia. Since then, industry has taken over (Figure 1.2.2). In 2022, there were 32 significant industry-produced machine learning systems compared to just three produced by academia. Producing state-of-the-art AI systems increasingly requires large amounts of data, computing power, and money, resources that industry actors possess in greater amounts compared to nonprofits and academia.
Figure 1.2.2. Number of Significant Machine Learning Systems by Sector, 2002–22
2022 values: Industry 32; Academia 3; Research Collective 2; Industry-Academia Collaboration 1; Nonprofit 0.
National Affiliation
In order to paint a picture of AI’s evolving geopolitical landscape, the AI Index research team identified the nationality of the authors who contributed to the development of each significant machine learning system in the Epoch dataset.7

Systems
Figure 1.2.3 showcases the total number of significant machine learning systems attributed to researchers from particular countries.8 A researcher is considered to have belonged to the country in which their institution, for example a university or AI-research firm, was headquartered. In 2022, the United States produced the greatest number of significant machine learning systems with 16, followed by the United Kingdom (8) and China (3). Moreover, since 2002 the United States has outpaced the United Kingdom and the European Union, as well as China, in terms of the total number of significant machine learning systems produced (Figure 1.2.4). Figure 1.2.5 displays the total number of significant machine learning systems produced by country since 2002 for the entire world.
Figure 1.2.3. Number of Significant Machine Learning Systems by Country, 2022
Source: Epoch and AI Index, 2022 | Chart: 2023 AI Index Report

Country          Number of Significant Machine Learning Systems
United States    16
United Kingdom   8
China            3
Canada           2
Germany          2
France           1
India            1
Israel           1
Russia           1
Singapore        1

Figure 1.2.4. Number of Significant Machine Learning Systems by Select Geographic Area, 2002–22
Source: Epoch and AI Index, 2022 | Chart: 2023 AI Index Report
2022 values: United States 16; European Union and United Kingdom 12; China 3.
7 The methodology by which the AI Index identified authors’ nationality is outlined in greater detail in the Appendix.
8 A machine learning system is considered to be affiliated with a particular country if at least one author involved in creating the model was affiliated with that country.
Consequently, in cases where a system has authors from multiple countries, double counting may occur.
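
The affiliation rule in the footnote above (a system counts toward every country with at least one affiliated author) is what makes double counting possible. A minimal sketch of that rule follows; the function name and the sample systems are hypothetical and are not drawn from the Epoch dataset.

```python
from collections import Counter


def systems_per_country(systems):
    """Attribute each system to every country with at least one affiliated author.

    `systems` maps a system name to the list of its authors' country affiliations.
    """
    counts = Counter()
    for author_countries in systems.values():
        # A country is credited once per system, however many authors it has.
        for country in set(author_countries):
            counts[country] += 1
    return counts


# Hypothetical systems for illustration only.
systems = {
    "system_a": ["United States", "United States", "United Kingdom"],
    "system_b": ["China"],
    "system_c": ["United States"],
}

counts = systems_per_country(systems)
print(dict(counts))
print("Country attributions:", sum(counts.values()), "for", len(systems), "systems")
```

Because `system_a` has authors in two countries, the per-country attributions add up to more than the number of systems, which is the double counting the footnote describes.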
Figure 1.2.5. Number of Machine Learning Systems by Country, 2002–22 (Sum)
Source: AI Index, 2022 | Chart: 2023 AI Index Report
Map legend bins: 0; 1–10; 11–20; 21–60; 61–255.
Authorship
Figures 1.2.6 to 1.2.8 look at the total number of authors, disaggregated by national affiliation, that contributed to the launch of significant machine learning systems. As was the case with total systems, in 2022 the United States had the greatest number of authors producing significant machine learning systems, with 285, more than double that of the United Kingdom and nearly six times that of China (Figure 1.2.6).
Figure 1.2.6. Number of Authors of Significant Machine Learning Systems by Country, 2022
Source: Epoch and AI Index, 2022 | Chart: 2023 AI Index Report

Country          Number of Authors
United States    285
United Kingdom   139
China            49
Canada           21
Israel           13
Sweden           8
Germany          7
Russia           3
India            2
France           1

Figure 1.2.7. Number of Authors of Significant Machine Learning Systems by Select Geographic Area, 2002–22
Source: Epoch and AI Index, 2022 | Chart: 2023 AI Index Report
2022 values: United States 285; European Union and United Kingdom 155; China 49.
Figure 1.2.8. Number of Authors of Machine Learning Systems by Country, 2002–22 (Sum)
Map legend bins: 0; 1–10; 11–20; 21–60; 61–180; 181–370; 371–680; 681–2000.
Parameter Trends
Parameters are numerical values that are learned by machine learning models during training. The value of parameters in machine learning models determines how a model might interpret input data and make predictions. Adjusting parameters is an essential step in ensuring that the performance of a machine learning system is optimized.

Figure 1.2.9 highlights the number of parameters of the machine learning systems included in the Epoch dataset by sector. Over time, there has been a steady increase in the number of parameters, an increase that has become particularly sharp since the early 2010s. The fact that AI systems are rapidly increasing their parameters is reflective of the increased complexity of the tasks they are being asked to perform, the greater availability of data, advancements in underlying hardware, and most importantly, the demonstrated performance of larger models.
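
To make the notion of a parameter count concrete, the sketch below counts the learned weights and biases in a small fully connected network. It is an illustration only: the layer sizes are hypothetical, and the systems in the Epoch dataset use many different architectures whose parameter counts are tallied differently.

```python
def dense_layer_params(n_inputs: int, n_outputs: int, bias: bool = True) -> int:
    """Parameters in one fully connected layer: a weight per input-output
    connection, plus one bias per output unit if biases are used."""
    return n_inputs * n_outputs + (n_outputs if bias else 0)


def mlp_params(layer_sizes: list[int]) -> int:
    """Total parameters in a multilayer perceptron with the given layer widths."""
    return sum(
        dense_layer_params(n_in, n_out)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Hypothetical layer sizes: widening the hidden layer tenfold
# increases the parameter count roughly tenfold.
small = mlp_params([784, 128, 10])
wide = mlp_params([784, 1280, 10])

print(small, wide)
```

Because parameter counts grow multiplicatively with layer widths, modest architectural changes can shift a model by orders of magnitude, which is why Figure 1.2.9 uses a log scale.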
Figure 1.2.9. Number of Parameters of Significant Machine Learning Systems by Sector, 1950–2022
Legend: Academia; Industry; Industry-Academia Collaboration; Nonprofit; Research Collective.
Y-axis: Number of Parameters (Log Scale), from 1.0e+2 to 1.0e+14. X-axis: 1950 to 2022.
Figure 1.2.10 shows the number of parameters of machine learning systems by domain. In recent years, there has been a rise in parameter-rich systems.
Number of Parameters of Significant Machine Learning Systems by Domain, 1950–2022
[Scatter plot. Y-axis: number of parameters (log scale), 1.0e+2 to 1.0e+12. X-axis: 1954–2022. Series: Language; Vision; Games.]
Figure 1.2.10
Compute Trends
The computational power, or “compute,” of AI systems refers to the amount of computational resources needed to train and run a machine learning system. Typically, the more complex a system is, and the larger the dataset on which it is trained, the greater the amount of compute required. The amount of compute used by significant AI machine learning systems has increased exponentially in the last half-decade (Figure 1.2.11).9 The growing demand for compute in AI carries several important implications. For example, more compute-intensive models tend to have greater environmental impacts, and industrial players tend to have easier access to computational resources than others, such as universities.
Training Compute (FLOP) of Significant Machine Learning Systems by Sector, 1950–2022
[Scatter plot. Y-axis: Training Compute (FLOP – Log Scale), 1.0e+0 to 1.0e+24. X-axis: 1950–2022. Series: Academia; Industry; Industry-Academia Collaboration; Nonprofit; Research Collective.]
Figure 1.2.11
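9 FLOP stands for “floating point operations.” As used here, it counts the total number of such operations performed during training, so it measures the amount of computation rather than the speed of a device.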
Since 2010, language models have increasingly demanded the most computational resources of all machine learning systems.
Training Compute (FLOP) of Significant Machine Learning Systems by Domain, 1950–2022
[Scatter plot. Y-axis: training compute in FLOP (log scale), 1.0e+3 to 1.0e+24. X-axis: 1954–2022. Series: Language; Vision; Games.]
Figure 1.2.12
Large Language and Multimodal Models
Large language and multimodal models, sometimes called foundation models, are an emerging and increasingly popular type of AI model that is trained on huge amounts of data and adaptable to a variety of downstream applications. Large language and multimodal models like ChatGPT, DALL-E 2, and Make-A-Video have demonstrated impressive capabilities and are starting to be widely deployed in the real world.

National Affiliation
This year the AI Index conducted an analysis of the national affiliation of the authors responsible for releasing new large language and multimodal models.10 The majority of these researchers were from American institutions (54.2%) (Figure 1.2.13). In 2022, for the first time, researchers from Canada, Germany, and India contributed to the development of large language and multimodal models.
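
The Appendix (Epoch National Affiliation Analysis) describes how authors are attributed to countries: an author with affiliations in several countries is split between them, each paper's country counts are divided by its total number of authors, and the normalized contributions are summed per time period. The following is a minimal Python sketch of those steps, not the AI Index's or Epoch's actual code. The function names and the sample papers are hypothetical, and the even split across an author's countries is an assumption based on the Appendix example of 0.5 researchers per country.

```python
from collections import defaultdict


def paper_country_shares(author_affiliations):
    """Return {country: share} for one paper; shares sum to 1.0.

    author_affiliations has one entry per author, each a list of the
    distinct countries of that author's affiliations.
    """
    counts = defaultdict(float)
    for countries in author_affiliations:
        # Assumption: an author with several countries is split evenly.
        weight = 1.0 / len(countries)
        for country in countries:
            counts[country] += weight
    # Normalize so that every paper carries equal total weight.
    total_authors = len(author_affiliations)
    return {country: n / total_authors for country, n in counts.items()}


def aggregate_by_period(papers):
    """Sum normalized shares per period from (period, author_affiliations) pairs."""
    totals = defaultdict(lambda: defaultdict(float))
    for period, author_affiliations in papers:
        for country, share in paper_country_shares(author_affiliations).items():
            totals[period][country] += share
    return {period: dict(by_country) for period, by_country in totals.items()}


# Hypothetical papers, for illustration only.
papers = [
    (2022, [["United States"], ["United States"]]),
    (2022, [["China", "Canada"], ["United States"]]),
]
contributions = aggregate_by_period(papers)
print(contributions)
```

Dividing each period's totals by the number of papers in that period would turn these contributions into percentage shares of the kind plotted in a chart of authors by country.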
Authors of Select Large Language and Multimodal Models (% of Total) by Country, 2019–22
Source: Epoch and AI Index, 2022 | Chart: 2023 AI Index Report
[Line chart. Y-axis: Authors of Large Language and Multimodal Models (% of Total), 0% to 100%. X-axis: 2019–2022. Labeled 2022 values: United Kingdom 21.88%; China 8.04%; Canada 6.25%; Israel 5.80%; Germany 3.12%; India 0.89%; Korea 0.00%.]
Figure 1.2.13
Figure 1.2.14 offers a timeline view of the large language and multimodal models that have been released since GPT-2, along with the national affiliations of the researchers who produced the models. Some of the notable American large language and multimodal models released in 2022 included OpenAI’s DALL-E 2 and Google’s PaLM (540B). The only Chinese large language and multimodal model released in 2022 was GLM-130B, an impressive bilingual (English and Chinese) model created by researchers at Tsinghua University. BLOOM, also launched in late 2022, was listed as indeterminate given that it was the result of a collaboration of more than 1,000 international researchers.
10 The AI models that were considered to be large language and multimodal models were hand-selected by the AI Index steering committee. It is possible that this selection may have omitted
certain models.
Timeline and National Affiliation of Select Large Language and Multimodal Model Releases
[Timeline chart. Axis: release date, 2019-Jan to 2023-Jan, running from GPT-2 to BLOOM. Legend (national affiliation): United States; United Kingdom; China; United States, United Kingdom, Germany, India; Korea; Canada; Israel; Germany; Indeterminate.]
Figure 1.2.14 (see footnote 11)
11 While we were conducting the analysis to produce Figure 1.2.14, Irene Solaiman published a paper that has a similar analysis. We were not aware of the paper at the time of our research.
Parameter Count
Over time, the number of parameters of newly released large language and multimodal models has massively increased. For example, GPT-2, which was the first large language and multimodal model released in 2019, only had 1.5 billion parameters. PaLM, launched by Google in 2022, had 540 billion, 360 times as many as GPT-2. The median number of parameters in large language and multimodal models is increasing exponentially over time (Figure 1.2.15).
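
The comparison between GPT-2 and PaLM is a simple ratio of the two parameter counts quoted in the text. As a quick check, a short Python sketch (the constant and function names are illustrative, not from the report):

```python
# Parameter counts as stated in the text.
GPT2_PARAMS = 1_500_000_000      # GPT-2: 1.5 billion
PALM_PARAMS = 540_000_000_000    # PaLM: 540 billion


def parameter_ratio(larger, smaller):
    """Return how many times as many parameters the larger model has."""
    return larger / smaller


ratio = parameter_ratio(PALM_PARAMS, GPT2_PARAMS)
print(f"PaLM has {ratio:.0f}x the parameters of GPT-2")
```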
Number of Parameters of Select Large Language and Multimodal Models, 2019–22
[Scatter plot. Y-axis: number of parameters (log scale), 3.2e+8 to 3.2e+12. X-axis: release date, 2019–22. Each point is one labeled model, from ERNIE-GEN (large) and Grover-Mega near the bottom of the scale to Wu Dao 2.0 at the top.]
Figure 1.2.15
Training Compute
The training compute of large language and multimodal models has also steadily increased (Figure 1.2.16). The compute used to train Minerva (540B), a large language and multimodal model released by Google in June 2022 that displayed impressive abilities on quantitative reasoning problems, was roughly nine times greater than that used for OpenAI’s GPT-3, which was released in June 2020, and roughly 1,839 times greater than that used for GPT-2 (released February 2019).
Training Compute (FLOP) of Select Large Language and Multimodal Models, 2019–22
[Scatter plot. Y-axis: training compute in FLOP (log scale), 1.0e+18 to 3.2e+24. X-axis: release date, 2019–22. Each point is one labeled model.]
Figure 1.2.16
Training Cost
A particular theme of the discourse around large language and multimodal models has to do with their hypothesized costs. Although AI companies rarely speak openly about training costs, it is widely speculated that these models cost millions of dollars to train and will become increasingly expensive with scale.

This subsection presents novel analysis in which the AI Index research team generated estimates for the training costs of various large language and multimodal models (Figure 1.2.17). These estimates are based on the hardware and training time disclosed by the models’ authors. In cases where training time was not disclosed, we calculated it from hardware speed, training compute, and hardware utilization efficiency. Given the possible variability of the estimates, we have qualified each estimate with the tag of mid, high, or low: mid where the estimate is thought to be a mid-level estimate, high where it is thought to be an overestimate, and low where it is thought to be an underestimate. In certain cases, there was not enough data to estimate the training cost of particular large language and multimodal models, so these models were omitted from our analysis.

The AI Index estimates validate popular claims that large language and multimodal models are increasingly costing millions of dollars to train. For example, Chinchilla, a large language model launched by DeepMind in May 2022, is estimated to have cost $2.1 million, while BLOOM’s training is thought to have cost $2.3 million.
Estimated Training Cost of Select Large Language and Multimodal Models
[Bar chart. Y-axis: Training Cost (in Millions of U.S. Dollars), 0 to 12. Bars are tagged Mid, High, or Low and grouped by release year, 2019–2022. Bar labels range from 0.01 to 11.35. Models shown, in chart order: GPT-2; T5-11B; Meena; Turing NLG; GPT-3 175B; DALL-E; Wu Dao - Wen Yuan; GPT-Neo; GPT-J-6B; HyperClova; ERNIE 3.0; Codex; Megatron-Turing NLG 530B; Gopher; AlphaCode; GPT-NeoX-20B; Chinchilla; PaLM (540B); Stable Diffusion (LDM-KL-8-G); OPT-175B; Minerva (540B); GLM-130B; BLOOM.]
Figure 1.2.17
12 See Appendix for the complete methodology behind the cost estimates.
There is also a clear relationship between the cost of large language and multimodal models and their size. As evidenced in Figures 1.2.18 and 1.2.19, models with more parameters and models trained with larger amounts of compute tend to be more expensive.
Estimated Training Cost of Select Large Language and Multimodal Models and Number of Parameters
Source: AI Index, 2022 | Chart: 2023 AI Index Report
[Scatter plot of labeled models. X-axis: Training Cost (in U.S. Dollars - Log Scale), 10k to 10M. Y-axis: number of parameters (log scale), from 1.0e+9 to above 5.0e+11.]
Figure 1.2.18

Estimated Training Cost of Select Large Language and Multimodal Models and Training Compute (FLOP)
Source: AI Index, 2022 | Chart: 2023 AI Index Report
[Scatter plot of labeled models. X-axis: Training Cost (in U.S. Dollars - Log Scale), 10k to 10M. Y-axis: training compute in FLOP (log scale), from 1.0e+18 to above 1.0e+24.]
Figure 1.2.19
1.3 AI Conferences
AI conferences are key venues for researchers to share their work and connect with peers and collaborators. Conference attendance is an
indication of broader industrial and academic interest in a scientific field. In the past 20 years, AI conferences have grown in size, number,
and prestige. This section presents data on the trends in attendance at major AI conferences.
Conference Attendance
After a period of increasing attendance, the total attendance at the conferences for which the AI Index collected data dipped in 2021 and again in 2022 (Figure 1.3.1).13 This decline may be attributed to the fact that many conferences returned to hybrid or in-person formats after being fully virtual in 2020 and 2021. For example, the International Joint Conference on Artificial Intelligence (IJCAI) and the International Conference on Principles of Knowledge Representation and Reasoning (KR) were both held strictly in-person.

Neural Information Processing Systems (NeurIPS) continued to be one of the most attended conferences, with around 15,530 attendees (Figure 1.3.2).14 The conference with the greatest one-year increase in attendance was the International Conference on Robotics and Automation (ICRA), from 1,000 in 2021 to 8,008 in 2022.
Number of Attendees at Select AI Conferences, 2010–22
[Line chart. Y-axis: Number of Attendees (in Thousands), 0 to 90. X-axis: 2010–2022. Labeled 2022 value: 59.45.]
Figure 1.3.1
13 This data should be interpreted with caution given that many conferences in the last few years have had virtual or hybrid formats. Conference organizers report that
measuring the exact attendance numbers at virtual conferences is difficult, as virtual conferences allow for higher attendance of researchers from around the world.
14 In 2022, 9,560 of the attendees attended NeurIPS in-person and 5,970 remotely.
Attendance at Large Conferences, 2010–22
[Line chart. Y-axis: attendees (in thousands), 0 to 30. X-axis: 2010–2022. Labeled 2022 values: NeurIPS 15.53; CVPR 10.17; ICRA 8.01; ICML 7.73; ICLR 5.35; IROS 4.32; AAAI 3.56.]
Figure 1.3.2
Attendance at Small Conferences, 2010–22
[Line chart. Y-axis: attendees (in thousands), 0.00 to 3.50. X-axis: 2010–2022. Labeled 2022 values: IJCAI 2.01; FAccT 1.09; UAI 0.66; AAMAS 0.50; ICAPS 0.39; KR 0.12.]
Figure 1.3.3
1.4 Open-Source AI Software
GitHub is a web-based platform where individuals and coding teams can host, review, and collaborate on various code repositories. GitHub is used extensively by software developers to manage and share code, collaborate on various projects, and support open-source software. This subsection uses data provided by GitHub and the OECD.AI policy observatory. These trends can serve as a proxy for some of the broader trends occurring in the world of open-source AI software not captured by academic publication data.

Projects
A GitHub project is a collection of files that can include the source code, documentation, configuration files, and images that constitute a software project. Since 2011, the total number of AI-related GitHub projects has steadily increased, growing from 1,536 in 2011 to 347,934 in 2022.
Number of GitHub AI Projects, 2011–22
Source: GitHub, 2022; OECD.AI, 2022 | Chart: 2023 AI Index Report
[Line chart. Y-axis: Number of AI Projects (in Thousands), 0 to 350. X-axis: 2011–2022. Labeled 2022 value: 348.]
Figure 1.4.1
As of 2022, a large proportion of GitHub AI projects were contributed by software developers in India (24.2%) (Figure 1.4.2). The next most represented geographic area was the European Union and the United Kingdom (17.3%), and then the United States (14.0%). The share of American GitHub AI projects has been declining steadily since 2016.
GitHub AI Projects (% Total) by Geographic Area, 2011–22
[Line chart. Y-axis: AI Projects (% of Total), 0% to 40%. X-axis: 2011–2022. Labeled 2022 values: India 24.19%; China 2.40%.]
Figure 1.4.2
Stars
GitHub users can bookmark or save a repository of interest by “starring” it. A GitHub star is similar to a “like” on a social media platform and indicates support for a particular open-source project. Some of the most starred GitHub repositories include libraries like TensorFlow, OpenCV, Keras, and PyTorch, which are widely used by software developers in the AI coding community.

Figure 1.4.3 shows the cumulative number of stars attributed to projects belonging to owners of various geographic areas. As of 2022, GitHub AI projects from the United States received the most stars, followed by the European Union and the United Kingdom, and then China. In many geographic areas, the total number of new GitHub stars has leveled off in the last few years.
Number of GitHub Stars by Geographic Area, 2011–22
[Line chart. Y-axis: Number of Cumulative GitHub Stars (in Millions), 0.00 to 3.50. X-axis: 2011–2022. Labeled 2022 values: United States 3.44; Rest of the World 2.69; European Union and United Kingdom 2.34; China 1.53; India 0.46.]
Figure 1.4.3
Appendix
Chapter 1: Research and Development
Center for Security and Emerging Technology, Georgetown University
Prepared by Sara Abdulla and James Dunham

The Center for Security and Emerging Technology (CSET) is a policy research organization within Georgetown University’s Walsh School of Foreign Service that produces data-driven research at the intersection of security and technology, providing nonpartisan analysis to the policy community.

For more information about how CSET analyzes bibliometric and patent data, see the Country Activity Tracker (CAT) documentation on the Emerging Technology Observatory’s website.1 Using CAT, users can also interact with country bibliometric, patent, and investment data.2

Publications from CSET Merged Corpus of Scholarly Literature
Source
CSET’s merged corpus of scholarly literature combines distinct publications from Digital Science’s Dimensions, Clarivate’s Web of Science, Microsoft Academic Graph, China National Knowledge Infrastructure, arXiv, and Papers With Code.3

Methodology
To create the merged corpus, CSET deduplicated across the listed sources using publication metadata, and then combined the metadata for linked publications. To identify AI publications, CSET used an English-language subset of this corpus: publications since 2010 that appear AI-relevant.4 CSET researchers developed a classifier for identifying AI-related publications by leveraging the arXiv repository, where authors and editors tag papers by subject. Additionally, CSET uses select Chinese AI keywords to identify Chinese-language AI papers.5

To provide a publication’s field of study, CSET matches each publication in the analytic corpus with predictions from Microsoft Academic Graph’s field-of-study model, which yields hierarchical labels describing the published research field(s) of study and corresponding scores.6 CSET researchers identified the most common fields of study in our corpus of AI-relevant publications since 2010 and recorded publications in all other fields as “Other AI.” English-language AI-relevant publications were then tallied by their top-scoring field and publication year.

CSET also provided year-by-year citations for AI-relevant work associated with each country. A publication is associated with a country if it has at least one author whose organizational affiliation(s) are located in that country. Citation counts aren’t available for all publications; those without counts weren’t included in the citation analysis. Over 70% of English-language AI papers published between 2010 and 2020 have citation data available.

CSET counted cross-country collaborations as distinct pairs of countries across authors for each publication. Collaborations are only counted once: For example, if a publication has two authors from the United States and two authors from China, it is counted as a single United States-China collaboration.

Additionally, publication counts by year and by publication type (e.g., academic journal articles, conference papers) were provided where available. These publication types were disaggregated by affiliation country as described above.

CSET also provided publication affiliation sector(s) where, as in the country attribution analysis, sectors were associated with publications through authors’ affiliations. Not all affiliations were characterized in terms of sectors; CSET researchers relied primarily on GRID from Digital Science for this purpose, and not all organizations can be found in or linked to GRID.7 Where the affiliation sector is available, papers were counted toward these sectors, by year. Cross-sector collaborations on academic publications were calculated using the same method as in the cross-country collaborations analysis. We use HAI’s standard regions mapping for geographic analysis, and the same principles for double-counting apply for regions as they do for countries.

1 https://eto.tech/tool-docs/cat/
2 https://cat.eto.tech/
3 All CNKI content is furnished by East View Information Services, Minneapolis, Minnesota, USA.
4 For more information, see James Dunham, Jennifer Melot, and Dewey Murdick, “Identifying the Development and Application of Artificial Intelligence in Scientific Text,” arXiv [cs.DL], May 28, 2020, https://arxiv.org/abs/2002.07143.
5 This method was not used in CSET’s data analysis for the 2022 HAI Index report.
6 These scores are based on cosine similarities between field-of-study and paper embeddings. See Zhihong Shen, Hao Ma, and Kuansan Wang, “A Web-Scale System for Scientific Knowledge Exploration,” arXiv [cs.CL], May 30, 2018, https://arxiv.org/abs/1805.12216.
7 See https://www.grid.ac/ for more information about the GRID dataset from Digital Science.

Epoch National Affiliation Analysis
The AI forecasting research group Epoch maintains a dataset of landmark AI and ML models, along with accompanying information about their creators and publications, such as the list of their (co)authors, number of citations, type of AI task accomplished, and amount of compute used in training.

The nationalities of the authors of these papers have important implications for geopolitical AI forecasting. As various research institutions and technology companies start producing advanced ML models, the global distribution of future AI development may shift or concentrate in certain places, which in turn affects the geopolitical landscape because AI is expected to become a crucial component of economic and military power in the near future.

To track the distribution of AI research contributions on landmark publications by country, the Epoch dataset is coded according to the following methodology:

1. A snapshot of the dataset was taken on November 14, 2022. This includes papers about landmark models, selected using the inclusion criteria of importance, relevance, and uniqueness, as described in the Compute Trends dataset documentation.8
2. The authors are attributed to countries based on their affiliation credited on the paper. For international organizations, authors are attributed to the country where the organization is headquartered, unless a more specific location is indicated. The number of authors from each represented country is added up and recorded.
8 https://epochai.org/blog/compute-trends; see note on “milestone systems.”
If an author has multiple affiliations in different countries, they are split between those countries proportionately.9
3. Each paper in the dataset is normalized to equal value by dividing the counts on each paper from each country by the total number of authors on that paper.10
4. All of the landmark publications are aggregated within time periods (e.g., monthly or yearly) with the normalized national contributions added up to determine what each country’s contribution to landmark AI research was during each time period.
5. The contributions of different countries are compared over time to identify any trends.

9 For example, an author employed by both a Chinese university and a Canadian technology firm would be counted as 0.5 researchers from China and 0.5 from Canada.
10 This choice is arbitrary. Other plausible alternatives include weighting papers by their number of citations, or assigning greater weight to papers with more authors.

Large Language and Multimodal Models
The following models were identified by members of the AI Index Steering Committee as the large language and multimodal models that would be included as part of the large language and multimodal model analysis:

AlphaCode
BLOOM
Chinchilla
Codex
CogView
DALL-E
DALL-E 2
ERNIE 3.0
ERNIE-GEN (large)
GLM-130B
Gopher
GPT-2
GPT-3 175B (davinci)
GPT-J-6B
GPT-Neo
GPT-NeoX-20B
Grover-Mega
HyperCLOVA
Imagen
InstructGPT
Jurassic-1-Jumbo
Jurassic-X
Meena
Megatron-LM (original, 8.3B)
Megatron-Turing NLG 530B
Minerva (540B)
OPT-175B
PaLM (540B)
PanGu-alpha
Stable Diffusion (LDM-KL-8-G)
T5-3B
T5-11B
Turing NLG
Wu Dao 2.0
Wu Dao – Wen Yuan

Large Language and Multimodal Models Training Cost Analysis
Cost estimates for the models were based directly on the hardware and training time if these were disclosed by the authors; otherwise, the AI Index calculated training time from the hardware speed, training compute, and hardware utilization efficiency.11 Training time was then multiplied by the closest cost rate for the hardware the AI Index could find for the organization that trained the model. If price quotes were available before and after the model’s training, the AI Index interpolated the hardware’s cost rate along an exponential decay curve.

The AI Index classified training cost estimates as high, middle, or low. The AI Index called an estimate high if it was an upper bound or if the true cost was more likely to be lower than higher: For example, PaLM was trained on TPU v4 chips, and the AI Index estimated the cost to train the model on these chips from Google’s public cloud compute prices, but the internal cost to Google is probably lower than what they charge others to rent their hardware. The AI Index called an estimate low if it was a lower bound or if the true cost was likely higher: For example, ERNIE was trained on NVIDIA Tesla V100 chips and published in July 2021; the chips cost $0.55 per hour in January 2023, so the AI Index could get a low estimate of the cost using this rate, but the training hardware was probably more expensive two years earlier. Middle estimates are a best guess, or those that equally well might be lower or higher.

11 Hardware utilization rates: Every paper that reported the hardware utilization efficiency during training provided values between 30% and 50%. The AI Index used the reported numbers when available, or used 40% when values were not provided.

AI Conferences
The AI Index reached out to the organizers of various AI conferences in 2022 and asked them to provide information on total attendance. Some conferences posted their attendance totals online; when this was the case, the AI Index used those reported totals and did not reach out to the conference organizers.

GitHub
The GitHub data was provided to the AI Index through OECD.AI, an organization that partners with GitHub and provides data on open-source AI software. The AI Index reproduces the methodological note that OECD.AI includes on its website for the GitHub data.

Background
Since its creation in 2007, GitHub has become the main provider of internet hosting for software development and version control. Many technology organizations and software developers use GitHub as a primary place for collaboration. To enable collaboration, GitHub is structured into projects, or “repositories,” which contain a project’s files and each file’s revision history. The analysis of GitHub data could shed light on relevant metrics about who is developing AI software, where, and how fast, and who is using which development tools. These metrics could serve as proxies for broader trends in the field of software development and innovation.

Identifying AI Projects
Arguably, a significant portion of AI software development takes place on GitHub. OECD.AI partners with GitHub to identify public AI projects (or “repositories”) following the methodology developed by Gonzalez et al., 2020. Using the 439 topic labels identified by Gonzalez et al., as well as the topics “machine learning,” “deep learning,” and “artificial intelligence,” GitHub provides OECD.AI with a list of public projects containing AI code. GitHub updates the list of public AI projects on a quarterly basis, which allows OECD.AI to capture trends in AI software development over time.

Obtaining AI Projects’ Metadata
OECD.AI uses GitHub’s list of public AI projects to query GitHub’s public API and obtain more information about these projects. Project metadata may include the individual or organization that created the project; the programming language(s) (e.g., Python) and development tool(s) (e.g., Jupyter Notebooks) used in the project; as well as information about the contributions (or “commits”) made to it, which include the commit’s author and a timestamp. In practical terms, a contribution or “commit” is an individual change to a file or set of files. Additionally, GitHub automatically suggests topical tags to each project based on its content. These topical tags need to be confirmed or modified by the project owner(s) to appear in the metadata.
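
To make the kind of project metadata described under "Obtaining AI Projects’ Metadata" concrete, here is a minimal Python sketch that builds the GitHub REST API URL for a repository and pulls a few fields out of the JSON object that endpoint returns. It is an illustration only, not OECD.AI's pipeline: the helper names and the sample payload are hypothetical, and no network request is made. The field names (`full_name`, `owner`, `language`, `topics`, `stargazers_count`, `forks_count`, `created_at`) are those of GitHub's repository object.

```python
def repo_api_url(owner, repo):
    """URL of the GitHub REST API endpoint describing one repository."""
    return f"https://api.github.com/repos/{owner}/{repo}"


def summarize_repo(payload):
    """Extract a few metadata fields from a repository JSON object."""
    owner = payload.get("owner") or {}
    return {
        "full_name": payload.get("full_name"),
        "owner": owner.get("login"),
        "owner_type": owner.get("type"),        # "User" or "Organization"
        "primary_language": payload.get("language"),
        "topics": payload.get("topics", []),    # tags confirmed by the owner
        "stars": payload.get("stargazers_count"),
        "forks": payload.get("forks_count"),
        "created_at": payload.get("created_at"),
    }


# Hypothetical payload shaped like a GitHub repository object.
sample_payload = {
    "full_name": "example-org/example-ai-project",
    "owner": {"login": "example-org", "type": "Organization"},
    "language": "Python",
    "topics": ["machine-learning", "deep-learning"],
    "stargazers_count": 120,
    "forks_count": 15,
    "created_at": "2020-03-01T12:00:00Z",
}

summary = summarize_repo(sample_payload)
print(repo_api_url("example-org", "example-ai-project"))
print(summary)
```

Commit-level information such as author and timestamp would come from a separate commits listing for each repository; that step is left out of this sketch.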
Mapping Contributions to AI Projects to a Country
Contributions to public AI projects are mapped to a country based on location information at the contributor level and at the project level.

a) Location information at the contributor level:
• GitHub’s “Location” field: Contributors can provide their location in their GitHub account. Given that GitHub’s location field accepts free text, the location provided by contributors is not standardized and could belong to different levels (e.g., suburban, urban, regional, or national). To allow cross-country comparisons, Mapbox is used to standardize all available locations to the country level.
• Top level domain: Where the location field is empty or the location is not recognized, a contributor’s location is assigned based on his or her email domain (e.g., .fr, .us, etc.).
b) Location information at the project level:
• Project information: Where no location information is available at the contributor level, information at the repository or project level is exploited. In particular, contributions from contributors with no location information to projects created or owned by a known organization are automatically assigned the organization’s country (i.e., the country where its headquarters are located). For example, contributions from a contributor with no location information to an AI project owned by Microsoft will be assigned to the United States.

If the above fails, a contributor’s location field is left blank.

As of October 2021, 71.2% of the contributions to public AI projects were mapped to a country using this methodology. However, the share of AI projects for which a location can be identified has been decreasing over time, indicating a possible lag in location reporting.

Measuring Contributions to AI Projects
Collaboration on a given public AI project is measured by the number of contributions (or “commits”) made to it.

To obtain a fractional count of contributions by country, an AI project is divided equally by the total number of contributions made to it. A country’s total contributions to AI projects is therefore given by the sum of its contributions, in fractional counts, to each AI project. In relative terms, the share of contributions to public AI projects made by a given country is the ratio of that country’s contributions to each of the AI projects in which it participates over the total contributions to AI projects from all countries.

In future iterations, OECD.AI plans to include additional measures of contribution to AI software development, such as issues raised, comments, and pull requests.

Identifying Programming Languages and Development Tools Used in AI Projects
GitHub uses file extensions contained in a project to automatically tag it with one or more programming languages and/or development tools. This implies that more than one programming language or development tool could be used in a given AI project.
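
The fractional counting described under "Measuring Contributions to AI Projects" can be sketched in Python as follows. This is an interpretation for illustration, not OECD.AI's code: the function names and sample data are hypothetical, and the sketch assumes that only commits already mapped to a country are passed in, since the text does not say how unmapped commits enter the denominator.

```python
from collections import defaultdict


def fractional_contributions(project_commits):
    """Sum each country's fractional contribution across projects.

    project_commits maps a project name to {country: commit_count}.
    Every project carries a total weight of 1, divided equally among
    its commits (assumption: only country-mapped commits are included).
    """
    totals = defaultdict(float)
    for commits_by_country in project_commits.values():
        project_total = sum(commits_by_country.values())
        if project_total == 0:
            continue  # nothing to attribute for an empty project
        for country, commits in commits_by_country.items():
            totals[country] += commits / project_total
    return dict(totals)


def contribution_shares(totals):
    """Each country's share of the fractional contributions of all countries."""
    grand_total = sum(totals.values())
    return {country: value / grand_total for country, value in totals.items()}


# Hypothetical commit counts, for illustration only.
sample_projects = {
    "project-a": {"India": 3, "United States": 1},
    "project-b": {"India": 1, "China": 1},
}
totals = fractional_contributions(sample_projects)
shares = contribution_shares(totals)
print(totals)
print(shares)
```

Weighting every project equally keeps a single very active repository from dominating a country's total, which is the practical effect of dividing by each project's commit count.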
Measuring the Quality of AI Projects

Two quality measures are used to classify public AI projects:
• Project impact: The impact of an AI project is given by the number of managed copies (i.e., “forks”) made of that project.
• Project popularity: The popularity of an AI project is given by the number of followers (i.e., “stars”) received by that project.
Filtering by project impact or popularity could help identify countries that contribute the most to high-quality projects.
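A filter of this kind might look like the sketch below. The report does not give a threshold, so the `fraction` cutoff, the function name, and the dictionary keys are assumptions made for illustration only.

```python
def top_projects(projects, key, fraction=0.1):
    """Return the top `fraction` of projects ranked by `key`.

    projects: list of dicts with integer "forks" and "stars" entries.
    key: "forks" (impact) or "stars" (popularity).
    The cutoff is a hypothetical choice, not taken from the report.
    """
    if not projects:
        return []
    ranked = sorted(projects, key=lambda p: p[key], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


sample = [
    {"name": "a", "forks": 120, "stars": 900},
    {"name": "b", "forks": 5, "stars": 2000},
    {"name": "c", "forks": 40, "stars": 10},
    {"name": "d", "forks": 0, "stars": 3},
]
high_impact = top_projects(sample, "forks", fraction=0.5)
most_popular = top_projects(sample, "stars", fraction=0.5)
```

Country shares can then be recomputed over the filtered subset only.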
Measuring Collaboration
Two countries are said to collaborate on a specific
public AI software development project if there is
at least one contributor from each country with at
least one contribution (i.e., “commit”) to the project.
Domestic collaboration occurs when two contributors
from the same country contribute to a project.
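The collaboration rule above can be sketched as follows. The input format and function name are hypothetical, and domestic collaboration is read here as requiring two distinct contributors from the same country.

```python
from collections import defaultdict
from itertools import combinations


def collaborations(commits):
    """Detect international and domestic collaboration per project.

    commits: iterable of (project, contributor, country) tuples,
    one per commit. Commits with no known country are ignored.
    """
    # project -> country -> set of contributors with at least one commit
    by_project = defaultdict(lambda: defaultdict(set))
    for project, contributor, country in commits:
        if country is not None:
            by_project[project][country].add(contributor)

    result = {}
    for project, by_country in by_project.items():
        # Every unordered pair of distinct countries present in the project.
        pairs = {tuple(sorted(pair)) for pair in combinations(by_country, 2)}
        # Countries with two or more distinct contributors in the project.
        domestic = {c for c, people in by_country.items() if len(people) >= 2}
        result[project] = {"international": pairs, "domestic": domestic}
    return result


sample_commits = [
    ("p1", "alice", "France"),
    ("p1", "bob", "United States"),
    ("p1", "carol", "France"),
    ("p2", "dave", "Japan"),
]
collab = collaborations(sample_commits)
```

Here p1 counts as a France and United States collaboration and also as a domestic collaboration within France, while p2 has a single contributor and records neither.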


