—Alexandr Wang
FOUNDER & CEO, SCALE
AI Year in Review
Vast amounts of data are collected from a variety of sources, primarily data from the internet. A model is trained on GPUs, producing a base Generative AI model that is highly capable, but not aligned to human preferences.

The base model is then fine-tuned via Reinforcement Learning from Human Feedback (RLHF) to align it more closely to human preferences and, in the case of LLMs, interact in a conversational or “chat” format.

Generative models are then able to produce unlimited and infinitely creative imagery, engage in conversations with users, summarize documents, and write code.

Generative AI takes a giant leap
Giant leaps in the capabilities of Generative AI defined 2022. In mid-2022, the image diffusion models Dall-E 2 (OpenAI) and Stable Diffusion (Stability AI) captured headlines. Many new startups rushed to build their businesses on top of these powerful image-creation tools. In late November, OpenAI released the large language model GPT-3.5 and the chat interface ChatGPT, which quickly became one of the most impactful technology launches ever. Key to the impact of this model was the use of Reinforcement Learning from Human Feedback (RLHF), a technique to align model performance with human intent.

In March 2023, OpenAI launched GPT-4, the most capable large language model ever created. Also aligned with RLHF, GPT-4 impressively passed many exams designed to test human capability in the 90th percentile or higher, including several AP exams and the bar exam. In December 2022, Anthropic announced a closed beta of Claude, an LLM with chat capabilities similar to ChatGPT, and an approach to human alignment based on “constitutional AI,” where human oversight is provided through a list of rules or principles. In March 2023, Google released Bard, its conversational AI based on Google’s LaMDA model. Several other companies are also building large language models, including AI21 Labs, Carper AI, Stability AI, and Cohere. Domain-specific models will also be developed over the coming year, like BloombergGPT, a model purpose-built for financial use cases. A steadily growing number of these models will be trained and released over the next year.

Generative models are being integrated into Google Workspace and Microsoft Office, enabling massive productivity gains for business users. These tools help you write the first draft of a document, generate complete presentations from a single prompt, or automatically analyze and visualize financial data in a spreadsheet. Enterprise software platforms like Salesforce are enabling analysts and executives to gain new perspectives on data using Generative AI.
—GREG BROCKMAN,
PRESIDENT & CO-FOUNDER, OPENAI
65% either accelerated their existing strategies or created an AI strategy for the first time.
Most companies don’t have the necessary resources or mandate to create their own generative models, so they must rely on third parties. Of those companies that plan on working with generative models, the vast majority are looking to leverage open-source generative models (41%) or Cloud API generative models (37%), while very few are looking to build their own generative models (22%).

61% of companies are looking to AI to help enhance the customer experience, 56% to improve operational efficiency, and 50% to increase profitability. Focusing on customer-centricity benefits organizations immensely, with improved customer goodwill in the short term and greater profitability in the long term.
Furthermore, 28% are exclusively using open-source models, while 26% use cloud APIs (commercially available models such as Anthropic’s Claude, OpenAI’s GPT-4, and Cohere’s Command), and only 15% are exclusively building their own.

89% of companies adopting AI benefit from the ability to develop new products or services, 78% from enhanced customer experiences, and 76% from better collaboration across business functions. These companies also see improved organizational process efficiency and profitability. Despite the positive outcomes for AI adopters, even greater outcomes are possible as companies accelerate their AI strategies and increase their investments in AI.

There are multiple factors that enterprises must weigh when deciding on their Generative AI infrastructure, including their in-house machine learning expertise, budget, security requirements, and need for domain-specific capabilities. Leveraging a cloud provider is the easiest and fastest path to obtain generative capabilities, but it comes with higher security risk, less control over the underlying models, lower quality performance at domain-specific tasks, and can be expensive. Open-source models provide more control and are cheaper, but they require more in-house expertise to deploy and fine-tune. Companies looking to build their own generative models benefit from greater control but incur greater costs from data collection, compute, and hiring machine learning experts to train and deploy them.

An organization’s goals also shape the effectiveness of its AI implementation. Those companies that list shareholder/investor demand as the top goal for AI adoption also show the poorest results for customer experience, revenue, and profitability. To ensure success with an AI implementation, organizations must avoid implementing AI for the sake of implementing AI, but instead ensure the goals of an AI implementation are aligned with company priorities and that AI is a good solution for a given problem.
Models Are Increasing in Size
Though Reinforcement Learning from Human Feedback (RLHF) is not new to the research community, in 2022 it catapulted in popularity as it was a critical ingredient in the success of ChatGPT.

Instead of attempting to write a loss function with which to train a model, RLHF involves soliciting feedback from human users and training a reward model on that feedback. This human-defined reward model is then used to fine-tune the base model. This also allows training on much more data, since the human feedback is mimicked by the reward model, so the dataset size is now only constrained by how many prompts you can create.

RLHF tuning results in models better aligned to human preferences, producing more detailed and factual responses.

RLHF also defines the “personality” and “mood” of the model, making it more helpful, friendly, and factual than the base model would be otherwise. This means we get responses from the model that feel more human and less like talking to a machine. RLHF is a critical component of the success of recent LLMs and is also critical to ensuring that enterprises using Generative AI get model responses that align with their policies and brands.

ChatGPT is a large language model that has been tuned specifically for the task of conversational text generation. ChatGPT was trained with RLHF and data in dialogue formats to enable it to act as a conversational chatbot.

ChatGPT quickly became one of the most impactful product launches of all time, reaching 1 million users in just five days; it currently sits at over 100 million users.

ChatGPT was initially launched with GPT-3.5 but now also includes GPT-4 for ChatGPT Plus subscribers. These models are highly capable of question answering, content generation, and summarization. While these models provide more robust, informative, and creative responses than their predecessors, the real breakthrough for adoption was their ability to hold conversations with humans. The ability to interact with the models in an intuitive way increased the accessibility of the models so that anyone can use them.

2022 also saw the proliferation of a new role for machine learning teams, the “prompt engineer.” While generative models are competent at generating the desired output for business use cases, the right prompt is required to optimize the model’s effectiveness. Prompt engineers carefully craft natural language inputs to get more consistent and reliable responses from models so that the model outputs can then be used in business applications. Instead of writing an SQL query, these engineers craft finely honed natural language prompts.

For example, when integrating applications with LLMs, the varied responses from the model can break the integration if they are not formatted properly. Say you are creating an application dealing with financial data. A user’s input may be related to the Q4 earnings of a particular company (see graphic, right).

Prompt engineers help model responses to solve business challenges with greater accuracy, efficiency, and quality. They also ensure that responses align with an organization’s brand guidelines and voice. They are also critical in finding vulnerabilities in models by using prompt injection techniques and helping to resolve those vulnerabilities before an employee or customer does. This role is highly valuable in ensuring organizations’ successful implementation of generative models.
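The formatting problem described above is often addressed by pinning the prompt to a fixed output schema and validating the response before it reaches downstream code. Below is a minimal sketch of that pattern; `query_llm`, the template wording, and the field names are illustrative assumptions, not details from the report (in practice `query_llm` would be any provider's chat-completion API call).

```python
import json

def query_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call; returns a canned
    # response here so the example is self-contained and runnable.
    return '{"company": "ExampleCorp", "quarter": "Q4", "metric": "earnings"}'

# A prompt engineer's template: it constrains the output to a fixed JSON
# schema so varied natural-language answers cannot break the integration.
PROMPT_TEMPLATE = (
    "Extract the company, fiscal quarter, and requested metric from the "
    "user's question. Respond with ONLY a JSON object with the keys "
    '"company", "quarter", and "metric".\n\nQuestion: {question}'
)

def parse_financial_query(question: str) -> dict:
    raw = query_llm(PROMPT_TEMPLATE.format(question=question))
    data = json.loads(raw)  # raises ValueError if the model ignored the format
    missing = {"company", "quarter", "metric"} - data.keys()
    if missing:
        raise ValueError(f"model response missing fields: {missing}")
    return data

result = parse_financial_query("What were ExampleCorp's Q4 earnings?")
print(result["quarter"])  # → Q4
```

The validation step is the point: when the model drifts from the requested format, the integration fails loudly at the boundary instead of corrupting the application downstream.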
As Generative AI is now more capable and widely available, companies are quickly incorporating it into their operations. 72% of companies will significantly increase their investment in AI each year for the next three years.

Many organizations are now building their own large generative models. These models are being incorporated into search engines and paired with other tools, including internal knowledge bases, to become powerful business tools. These models will also become multimodal, meaning that they will be able to consume and generate text, images, and video, making them even more useful than they are today. You can upload a product catalog with both images and text to a multimodal model, and it will recognize specific products, write product descriptions, fill in missing attributes, and generate new images to enrich your product catalog automatically.

On their own, base generative models are valuable tools. Paired with a business’s proprietary data, they become strong differentiators, improving the customer experience, product development, and profitability.

“One of the ways that I describe Stable Diffusion is as a generative search engine. You don’t need to use image pipeline at the right place and having human-in-the-loop interactions like the work that you do at Scale AI, understanding how humans do that at scale, and having these engines, it will allow us to have even better experiences that understand what we want.”

—EMAD MOSTAQUE,
FOUNDER & CEO, STABILITY AI
(TransformX October 2022)

“I’m most excited about...the ability for our models to start using tools out in the world. Giving them access to knowledge bases, giving them access to search engines like Google or Bing, and augmenting them with that knowledge as a resource so that instead of them having to memorize all of these facts, they can make reference to a live, updated, knowledge base. I think that’s going to be super impactful.”

—AIDAN GOMEZ,
CEO, COHERE
(TransformX October 2022)

Widespread accessibility of generative models

Much like the cloud, widely accessible generative models represent a paradigm shift for companies. Incorporating this type of AI will quickly become the status quo, and those who are slow to adopt will be left behind.
59% of companies view AI as critical or highly critical to their business in the next year, and 69% in the next three years. The increasing capabilities and availability of Generative AI will accelerate AI adoption.

As companies view AI as more critical to the future success of their business, they are increasing AI investments over the next three years. 72% of companies plan to increase their investment in AI each year for the next three years.
With the recently popularized capabilities of LLMs, companies have rapidly shifted their AI strategies to harness the power of Generative AI. 52% are investing heavily in LLMs, 36% in generative visual models, and 30% in computer vision applications.

“I really believe that we are at a transformative moment today where ML is moving at an incredible speed and problems that were thought to be too complex to solve with computers a few years ago are now being solved by applying machine learning. So we have this great opportunity. If machine learning becomes more accessible, the world will move faster, our economy will move faster, science will move faster.”

—FRANCOIS CHOLLET,
AI ENGINEER AND RESEARCHER, GOOGLE
Companies that view AI as critical to their business indicate they have the executive support, strategy and vision, and budget they need to succeed in implementing their company’s AI strategy. However, these companies generally lack the necessary expertise, software, and tools required to achieve success.

While leaders have identified the need to adopt AI, the execution of these strategies is difficult, nuanced, and heavily dependent on expertise. The field is moving so quickly that it is difficult to keep up with the pace of advancement. Highly talented people with expertise in Generative AI are simply not available to most organizations. Similarly, selecting, standardizing, and updating the software and tools associated with Generative AI, MLOps, and even DevOps is challenging for companies without dedicated teams to keep up with these changes as the requisite tech stacks are constantly evolving.

“As a product of this shortage in AI talent, most businesses are missing out on a huge opportunity to integrate this tech into products and into their developers’ workflows. Consumers are missing out on products that have more magical, intuitive, and smart experiences. The start of the fix comes with the product folks making the decision about what’s prioritized, looking at what they can do, understanding where the technology is today, and where they could insert it, and then building it. I think we need to start making AI a standard piece of every single product. I don’t think consumers are going to tolerate dumb products anymore. We need to make them much, much smarter.”

—AIDAN GOMEZ,
CEO, COHERE
AI Adoption by Industry

Every industry is looking to increase its AI budgets over the next three years. Those that top the list are: Insurance (80%), Healthcare & Life Sciences (75%), and Retail & eCommerce (74%).
ML Lifecycle

As of late 2022, the most widely used foundation models were BERT, GPT-3, Stable Diffusion, BLOOM, and T5/FLAN. However, this landscape is quickly changing as more and more powerful generative models are being developed.*

BERT plays a critical but quieter role in many organizations today, providing natural language understanding capabilities at a significantly reduced cost compared to larger models such as GPT-3.

However, this trend is shifting as BERT is not a generative model, so its use cases are limited compared to state-of-the-art models like GPT-4. The compute required to run these more sophisticated models will become cheaper over time, and companies will have more third-party tools and expertise available to help integrate these larger models into their operations. Open-source models such as LLaMA are already being optimized to run on consumer laptops.

*(ChatGPT/GPT-3.5/GPT-4 were not included in this survey as survey collection began in late 2022, before these models were launched)
64% are fine-tuning foundation models using their proprietary data and knowledge bases. The biggest challenges for fine-tuning foundation models are acquiring training data and the necessary ML infrastructure.

Organizations working with generative models find it challenging and resource-intensive to fine-tune their models in-house. The most effective techniques, like RLHF, require humans to apply feedback to model outputs, custom software, and specialized skill sets to ensure high quality and limit human biases in models.

Data quality remains the most challenging part of acquiring training data, closely followed by data collection. The challenges for acquiring training data are nearly identical to last year’s survey findings.

Curating data and annotation quality are still the top challenges for companies preparing training data for models. The biggest change YOY was that it takes too long to get labeled data (typically greater than one month), with 21% of respondents citing this as their #1 challenge, up from only 13% a year ago. Companies are looking to move faster to label their data, but it is difficult to keep up.
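The human-feedback step at the heart of techniques like RLHF can be made concrete with a toy sketch: humans pick the better of two responses, and a reward model is fit to predict those choices. Everything below is an illustrative simplification (a single numeric feature stands in for a full text response, and a linear score stands in for a neural reward model); it shows the pairwise Bradley-Terry training objective commonly used for reward modeling, not the report's own implementation.

```python
import math
import random

def reward(w: float, x: float) -> float:
    # Toy linear reward model: score = w * feature.
    return w * x

def train_reward_model(preferences, lr=0.1, epochs=200) -> float:
    """Fit w from pairwise human preferences.

    `preferences` is a list of (preferred, rejected) feature pairs --
    the format human feedback is typically collected in. Training does
    gradient ascent on the Bradley-Terry log-likelihood:
    P(preferred > rejected) = sigmoid(r_preferred - r_rejected).
    """
    w = 0.0
    for _ in range(epochs):
        for chosen, rejected in preferences:
            margin = reward(w, chosen) - reward(w, rejected)
            p = 1.0 / (1.0 + math.exp(-margin))
            # Gradient of log sigmoid(margin) w.r.t. w.
            w += lr * (1.0 - p) * (chosen - rejected)
    return w

# Toy data: humans consistently prefer responses with the higher feature.
random.seed(0)
pairs = []
for _ in range(50):
    a, b = random.random(), random.random()
    pairs.append((max(a, b), min(a, b)))

w = train_reward_model(pairs)
print(w > 0, reward(w, 0.9) > reward(w, 0.1))  # → True True
```

Once fit, such a reward model scores unlimited new outputs, which is why, as noted above, the dataset size becomes constrained by prompts rather than by human labeling throughput.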
While most of the report has focused on RLHF and Generative AI, companies apply machine learning in many different ways, from object detection to recommendation systems. One critical component of these production ML systems is how companies evaluate and monitor their performance.

Respondents that only label their data manually most frequently rank labeling quality as their number one challenge in preparing training data, while those that leverage human-in-the-loop and automated labeling rank this challenge less frequently. 60% of respondents leveraging manual labeling in some capacity rank labeling quality as their number one challenge, compared to 47% using automated labeling and only 44% using human-in-the-loop labeling. The combination of automated labeling plus human-in-the-loop is recommended as a best practice, as it nearly always outpaces the accuracy and efficiency of either alone.

Just as we found in 2022, measuring the business impact of models remains a challenge, especially for startups or very small companies (those with fewer than 250 employees). These companies rely more on aggregated model metrics as they are building products and a customer base, so the business impact is difficult to measure. However, small to medium size organizations (500-9,999 employees) are measuring business impact more than they were even one year ago, at about 73% now (compared to 55% in last year’s survey).
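The recommended combination of automated labeling plus human-in-the-loop is commonly implemented as confidence-based routing: a model's label is accepted automatically when it is confident, and the example is escalated to a human annotator otherwise. The sketch below illustrates that routing pattern; the threshold and the toy `auto_label` heuristic are assumptions standing in for a real labeling model.

```python
def auto_label(example: str):
    """Hypothetical automated labeling model.

    Returns (label, confidence); a keyword heuristic stands in for a
    real model so the example is self-contained.
    """
    if "invoice" in example:
        return "finance", 0.95
    return "other", 0.55

def route(examples, threshold=0.8):
    """Accept confident machine labels; queue the rest for human review."""
    accepted, needs_review = [], []
    for ex in examples:
        label, conf = auto_label(ex)
        if conf >= threshold:
            accepted.append((ex, label))  # machine label kept as-is
        else:
            needs_review.append(ex)       # human-in-the-loop queue
    return accepted, needs_review

docs = ["invoice from vendor", "meeting notes", "invoice #42"]
accepted, review = route(docs)
print(len(accepted), len(review))  # → 2 1
```

Human effort is spent only on the uncertain slice, which is why the combined approach tends to beat either purely manual or purely automated labeling on both accuracy and throughput.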
As in last year’s report, smaller organizations can usually identify issues with their models quickly, in less than one week. Larger companies (those with 5,000 employees or more) may be more likely to measure business impact but are more likely than smaller ones to take longer (one month or more) to identify issues with their models.

These larger companies are operating complex systems at scale, and the issues they face are less likely to cause immediate business impact, so they are not caught as quickly as at startups or small companies, which are operating simpler systems and closely monitoring model metrics. Additionally, these larger companies typically have a larger customer base, so their users will hit on more edge cases than the smaller companies. For these reasons, it is important that as companies scale, they continually refine their MLOps practices, use data curation tools that can help them identify edge cases, and monitor their models while continuing to measure business impact.

As we found in last year’s report, ML teams that identify issues fastest with their models are most likely to use A/B testing when deploying models.

Aggregate metrics are a useful baseline, but as enterprises develop a more robust modeling practice, tools such as “shadow deployments,” ensembling, A/B testing, and even scenario tests can help validate models in challenging edge-case scenarios or rare classes.

Although small, agile teams at smaller companies may find failure modes, problems in their models, or problems in their data earlier than teams at large enterprises, their validation, testing, and deployment strategies are typically less sophisticated. Thus, with simpler models solving more uniform problems for customers and clients, it’s easier to spot failures. When the system grows to include a large market or even a large engineering staff, complexity and technical debt begin to grow. At scale, scenario tests become essential, and even then, it may take longer to detect issues in a more complex system.
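A shadow deployment, one of the validation tools mentioned above, can be sketched in a few lines: every live request is served by the production model while a candidate model's prediction is logged but never returned to the user, so the two can be compared offline before any traffic is switched. The two model functions below are illustrative stand-ins, not anything from the report.

```python
import random

def production_model(x: float) -> int:
    # Stand-in for the model currently serving users.
    return 1 if x > 0.5 else 0

def candidate_model(x: float) -> int:
    # Stand-in for the new model under evaluation.
    return 1 if x > 0.4 else 0

shadow_log = []

def serve(x: float) -> int:
    served = production_model(x)   # the user sees only this prediction
    shadowed = candidate_model(x)  # logged for offline analysis, never served
    shadow_log.append((x, served, shadowed))
    return served

random.seed(1)
for _ in range(1000):
    serve(random.random())

# Offline: how often would the candidate have disagreed in production?
disagreements = sum(1 for _, a, b in shadow_log if a != b)
print(f"disagreement rate: {disagreements / len(shadow_log):.1%}")
```

Unlike an A/B test, no user is exposed to the candidate's mistakes, which makes shadowing a low-risk first step; an A/B test on a small traffic slice can then measure real business impact.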
While we have covered best practices for evaluating models, we must note that for the first time, the performance of Generative AI models is nearly impossible to evaluate automatically. This is because there are many ways for the model to respond appropriately, and human judgment is required to evaluate correctness. That means any sound model evaluation, whether done by a generative model builder or an enterprise, will require human-in-the-loop validation and verification on an ongoing basis.