
Testing and Monitoring

LLM Applications
Introduction
Speakers

Anupam Datta, Co-Founder and Chief Scientist, TruEra
Shayak Sen, Co-Founder and CTO, TruEra
Josh Reini, Data Scientist, TruEra
Piotr (Peter) Mardziel, Machine Learning Engineer, TruEra
Rick Shih, Machine Learning Engineer, TruEra

Agenda
◼ Overview and History of Large Language Models (10 minutes)

◼ Building LLM Applications (30 minutes)


■ What kinds of applications can you build?
■ What are the components of an LLM application?
■ Case Study: Building TruBot and its failure modes

◼ Testing LLM Applications (20 minutes; to be continued on May 18)


■ How do you measure LLM application quality?
■ Feedback Functions: Usage, Aggregation and Benchmarking
Foundation models are everywhere
Sample language foundation models:
● BERT, RoBERTa, DistilBERT, ...
● ChatGPT, GPT-3, LaMDA, BLOOM

Adaptation uses:
● your data to
● build a model for
● your specific task

(Image source: Bommasani et al. 2021.)

Workshop focus: Large Language Models (LLMs) and Apps


A Brief History of Foundation Models

◼ 1st generation: Transformers: BERT, DistilBERT, RoBERTa, ...

■ Attention, parallelizable, encoder-decoder architecture, ~1B params

◼ 2nd generation: LLMs and Generative AI:


■ GPT-3, DALL-E, ..., LaMDA: ~100-200B parameters
■ Massive-scale models trained on huge corpora, smarter byte-pair encoding of tokens, prompt engineering
■ InstructGPT (~1-2B), ChatGPT (size not disclosed: 20B?)
■ Reinforcement learning from human feedback (RLHF)
Building LLM applications
Tremendous activity in building LLM apps

Applications

◼ Question Answering
◼ Conversation
◼ Code completion
◼ Creative Generation
◼ Search
◼ Translation
◼ Classification
◼ …

Building by chaining

◼ Multi-modal data sources (public, proprietary, etc.)
◼ System & user inputs
◼ Prompt Templates
◼ Vector databases
◼ LLMs
◼ Links to the real world (plug-ins…)
◼ …

🦜🔗 LangChain
Architecture option 1: LLM only QA

Use LLM to directly answer questions

Example: "How should I measure ML model performance?" --> LLM (Completion) --> "You should measure ML model performance using a performance metric, such as accuracy, AUC, ..."

Challenges: Quality and traceability of answers
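
A minimal sketch of this LLM-only option, assuming the same langchain OpenAI wrapper used in the TruBot code later in this deck; the prompt wording here is illustrative:

from langchain.llms import OpenAI

# LLM-only QA: the question goes straight to the model; no retrieved context.
llm = OpenAI(temperature=0, max_tokens=128)

question = "How should I measure ML model performance?"
completion = llm(f"Question: {question}\nHelpful Answer:")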


Architecture option 2: retrieval-augmented QA (1/2)

Step 1: Create & store embeddings of documents in a vector database (knowledge base)

Example documents: "Performance metrics include classification accuracy, log loss, ..." ; "Drift metrics include Wasserstein, PSI, Difference of Mean, ..." --> Embedding Model --> Vector DB
Architecture option 2: retrieval-augmented QA (2/2)

Step 2: Answer questions by chaining LLM and vector database

"How should I measure performance?" --> Embedding (query embedding) --> Vector DB --> relevant chunks --> Completion (query + relevant chunks) --> "You should measure performance using a performance metric, such as accuracy, AUC, ..."

Better handle on quality and traceability of answers
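
A minimal sketch of this chaining step, assuming a Pinecone index like the one built for TruBot later in this deck; the index name, k, and prompt wording are illustrative:

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import Pinecone

embedding = OpenAIEmbeddings(model='text-embedding-ada-002')
docsearch = Pinecone.from_existing_index(index_name="llmdemo", embedding=embedding)
llm = OpenAI(temperature=0, max_tokens=128)

question = "How should I measure performance?"

# Embed the query and retrieve nearby chunks from the vector DB.
chunks = docsearch.similarity_search(question, k=4)

# Stuff the retrieved chunks and the query into a single completion prompt.
context = "\n\n".join(chunk.page_content for chunk in chunks)
prompt = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    f"{context}\n\nQuestion: {question}\nHelpful Answer:"
)
answer = llm(prompt)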


Building something cool? Share in the community!

https://tinyurl.com/bddkyspv
Case Study: TruBot
Case Study: TruBot
Architecture: QA with context and chat history.

Implementation: LangChain.

"How should I measure performance?" --> Embedding (query embedding) --> Vector DB --> relevant chunks + chat summary --> Completion for Q/A (query + relevant chunks) --> "You should measure performance using a performance metric, such as accuracy, AUC, ..."
Chat history / summary --> Completion for Summarization --> chat summary (fed into the Q/A prompt)
Case Study: TruBot: LLM Completion Prompts
LLM Completion + Prompt --> Pipeline Tasks
● Q/A with context
● Chat history summarization

(Same TruBot architecture diagram as above.)
Case Study: TruBot: LLM Completion Prompts
◼ Completion for Q/A Prompt

Prompt:
Use the following pieces of context to answer the question at the end. If you don't know
the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:
Complete from here

context: """
Upon selecting the Performance test option, the performance test creation wizard is displayed.

Enter a name for the test, an optional description, then select the performance metric on which to measure
the results.

For performance tests, define failure and/or warning conditions using either absolute values (e.g., AUC
between 0.8 to 1.25, as per the four-fifths rule) or relative thresholds (e.g., if the false-positive rate
is lower by X amount relative to the rate by a baseline model).

Once you've set the attributes above, the results are displayed accordingly, as discussed next.

... """

question: 'How can I measure performance?'
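
A sketch of how this prompt can be wired up as a langchain PromptTemplate, with {context} and {question} filled in at query time; the context string is abbreviated:

from langchain.prompts import PromptTemplate

qa_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know, "
        "don't try to make up an answer.\n\n"
        "{context}\n\nQuestion: {question}\nHelpful Answer:"
    ),
)

filled = qa_prompt.format(
    context="Upon selecting the Performance test option, ...",  # abbreviated
    question="How can I measure performance?",
)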


Case Study: TruBot: LLM Completion Prompts
◼ Completion for Chat History Summarization Prompt

Prompt:
Progressively summarize the lines of conversation provided, adding onto the previous summary returning a new summary.

EXAMPLE
Current summary:
The human asks what the AI thinks of artificial intelligence. The AI thinks artificial intelligence is a force for good

New lines of conversation:


Human: Why do you think artificial intelligence is a force for good?
AI: Because artificial intelligence will help humans reach their full potential.

New summary:
The human asks what the AI thinks of artificial intelligence. The AI thinks artificial intelligence is a force for good
because it will help humans reach their full potential.
END OF EXAMPLE

Current summary:
{summary}

New lines of conversation:


{new_lines}

New summary:
Complete from here
Building TruBot
◼ Prepare contexts via LangChain and Pinecone:
■ Scrape truera.com for "documents". Split documents into "chunks".
■ Vector DB of embeddings of chunks via OpenAI embedding.

from langchain.document_loaders import UnstructuredHTMLLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Pinecone
...
# Load the scraped truera.com pages as documents.
docs = []
...
docs += UnstructuredHTMLLoader(scrape_file).load()

# Split documents into chunks and embed them into a Pinecone index.
text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=0)
chunks = text_splitter.split_documents(docs)
embedding = OpenAIEmbeddings(model='text-embedding-ada-002')  # 1536 dims
Pinecone.from_documents(chunks, embedding, index_name="llmdemo")
Building TruBot
◼ Response generation via LangChain, OpenAI, Pinecone:
■ Embed message via OpenAI embedding.
■ Query Pinecone vector DB for nearby chunks.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

# Embedding needed for Pinecone vector db.
embedding = OpenAIEmbeddings(model='text-embedding-ada-002')  # 1536 dims

# Vector DB configuration.
docsearch = Pinecone.from_existing_index(
    index_name="llmdemo", embedding=embedding
)
retriever = docsearch.as_retriever()
Building TruBot
◼ Response generation via LangChain, OpenAI, Pinecone:
■ Create prompt with nearby chunk texts and conversation summary (initially empty).
■ Complete prompt with OpenAI.
■ Send response. Include source of chunks.
■ Summarize conversation so far via LangChain and OpenAI for follow-ups.

from langchain.chains import ConversationalRetrievalChain
from langchain.llms import OpenAI
from langchain.memory import ConversationSummaryBufferMemory

llm = OpenAI(temperature=0, max_tokens=128)

# Conversation memory that keeps a running summary of the chat so far.
memory = ConversationSummaryBufferMemory(max_token_limit=650, llm=llm, memory_key="chat_history", output_key='answer')

# The main logic rolled into this one line:
chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory, max_tokens_limit=4096, get_chat_history=lambda h: h)

question: str = "How can I measure performance?"
response: str = chain(question)['answer']

That's it!?
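
A hypothetical follow-up turn, as a usage sketch: the memory holds a summary of the first exchange, so the second question can refer back to it.

first = chain("How can I measure performance?")['answer']

# The summarized chat history lets a follow-up like this resolve
# "those metrics" against the earlier answer.
followup = chain("Which of those metrics should I use for a classifier?")['answer']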
Failure Modes

◼ More parts --> more things to go wrong. The failure modes below annotate the TruBot architecture diagram:

■ Context modeling: unsuitable documents in DB.
■ Embedding not appropriate.
■ Retrieval: non-relevant contexts.
■ Context chunks lack higher-level context.
■ Completion does not follow prompt instructions.
■ Completion escapes instructions in prompt.
■ Completion misinterprets context.
■ Completion <-> question linguistic mismatch.
Find TruBot failure modes yourself

In the AI Quality Forum, send a message to @TruBot. Each new thread is a new conversation
with separate history. Use the #general channel.

https://tinyurl.com/bddkyspv
Failure modes

◼ Vector DB Failures: context lookup produces bad context.
◼ Completion Failures: language mismatch (e.g., Spanish question, English completion).
Testing LLM applications
Key challenges with LLMs

1. Measure model quality

2. Evaluate, monitor & debug model quality issues


Measure model quality

◼ No obvious ground truth for generative output (e.g., Chatbot response)

◼ First step: Get direct human feedback

◼ Question: Can we scale this up programmatically?


Track user responses.
Measure model quality with feedback

A feedback function takes as input generated text from an LLM (or a downstream model or
application built on it) and some metadata, and returns a score.

Examples

● Prompt Sentiment
● Question Answering Relevance
● Language mismatch
● Transcript length
● Response verbosity
● Fairness substitution
● Toxicity
● …

Illustration: ff = sentiment analysis on prompts; score = 2/10
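
As a concrete sketch, one way to implement the first example (prompt sentiment) is with a separate, off-the-shelf model rather than an LLM; the Hugging Face pipeline and the [0, 1] score range here are illustrative choices (the slide's 2/10 illustration uses a 1-10 scale).

from transformers import pipeline

# Separate-model feedback function: score the sentiment of the *prompt*.
sentiment_model = pipeline("sentiment-analysis")

def prompt_sentiment(prompt: str) -> float:
    """Feedback function: sentiment of the prompt, as a score in [0, 1]."""
    result = sentiment_model(prompt)[0]
    score = result["score"]
    return score if result["label"] == "POSITIVE" else 1.0 - score

prompt_sentiment("This product is terrible and support never answers.")  # low score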
Measure model quality with feedback functions

Analogy

◼ Data labeling functions, e.g. regular expressions
◼ Programmatically scales up human data labeling
◼ Combine labeling functions, e.g. through majority vote
◼ Benchmark against golden sets

Implementing feedback functions

◼ Build a separate model, e.g. BERT sentiment analysis
◼ Use LLMs themselves
◼ Average the score of a collection of feedback functions
◼ …
Addressing TruBot Failures with Feedback Functions

Failure: context lookup produces bad context.

◼ Feedback Function
■ Question/Statement Relevance between question and retrieved context matches. ✅


Addressing TruBot Failures with Feedback Functions

Failure: context lookup produces bad context.

Prompt:
You are a RELEVANCE classifier, providing the relevance of the given statement to the given question.
Provide all responses only as a number from 1 to 10 where 1 is the least relevant and 10 is the most
relevant.
Never elaborate.

STATEMENT: {statement}

QUESTION: {question}

RELEVANCE:

◼ Feedback Function
■ Question/Statement Relevance between question and retrieved context.
■ Example scores for four retrieved chunks: relevance=8, relevance=8, relevance=3, relevance=2.

◼ Actions
■ Apply Q/S Relevance before giving context to the LLM.
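
A sketch of this feedback function as an LLM call using the RELEVANCE prompt above; it assumes the model really does return a bare number, and the threshold of 5 and the chunks / question names are illustrative:

from langchain.llms import OpenAI

llm = OpenAI(temperature=0, max_tokens=8)

RELEVANCE_PROMPT = (
    "You are a RELEVANCE classifier, providing the relevance of the given statement "
    "to the given question.\n"
    "Provide all responses only as a number from 1 to 10 where 1 is the least relevant "
    "and 10 is the most relevant.\n"
    "Never elaborate.\n\n"
    "STATEMENT: {statement}\n\nQUESTION: {question}\n\nRELEVANCE:"
)

def qs_relevance(question: str, statement: str) -> int:
    """Feedback function: 1-10 relevance of a retrieved chunk to the question."""
    completion = llm(RELEVANCE_PROMPT.format(statement=statement, question=question))
    return int(completion.strip())

# Action: apply Q/S Relevance before giving context to the LLM,
# keeping only chunks that score above a threshold.
relevant_chunks = [c for c in chunks if qs_relevance(question, c.page_content) > 5]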
Addressing TruBot Failures with Feedback Functions

Failure: language mismatch (e.g., Spanish question, English answer).

◼ Feedback Function
■ Question / Answer languages (match)
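
One way to implement this language-match feedback function, assuming the langdetect package (TruEra's own implementation may differ):

from langdetect import detect

def language_match(question: str, answer: str) -> float:
    """Feedback function: 1.0 if question and answer share a language, else 0.0."""
    return 1.0 if detect(question) == detect(answer) else 0.0

language_match("¿Cómo puedo medir el rendimiento?",
               "You should measure performance using a metric such as accuracy.")  # 0.0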
Addressing TruBot Failures with Feedback Functions

Failure: language mismatch.

◼ Feedback Function
■ Question / Answer languages (match)

◼ Actions
■ Adjust prompt.
■ Translate context.

Adjusted Prompt:
Use the following pieces of context to answer the question at the end in the same language
as the question. If you don't know the answer, just say that you don't know, don't try to
make up an answer.

{context}

Question: {question}
Helpful Answer:

Context Translation Prompt:
Translate the following statement to the same language as the question.

Statement: {statement}

Question: {question}

Translated statement:

Example: an English context chunk on Quantitative Input Influence ("2.2 Quantitative Input Influence (QII): As stressed in the previous section, at the heart of our framework developed in this paper is an instance-ba...") is translated into Spanish to match a Spanish-language question ("QII (Influencia de Entrada Cuantitativa) es un enfoque de explicación basado en instancias. Esta técnica controla las correlaciones mediante la intervención aleatoria de las entradas y captura las...").
Aggregating feedback functions (Binary Sentiment Example)

Feedback mechanisms: several model-based sentiment feedback functions (*) plus human feedback.

Prompt: Do you like your job?
Response: As an AI language model, I do not have feelings, as I am just a program designed to provide information and assistance. However, I am programmed to perform my job to the best of my abilities in order to provide accurate and useful information to users.

Aggregation options:
◼ Majority Voting: feedback = Positive > Negative
◼ Sensitive Aggregation: feedback = Negative > 0
◼ Weighted Voting: feedback = 2; overweight particular qualities or feedback mechanisms

* Models provided by this vendor are used to generate feedback functions. The vendor does not provide nor endorse the functions used.
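
A minimal sketch of these three aggregation schemes, under the illustrative assumption that each feedback mechanism returns a binary sentiment label (1 = positive, 0 = negative); the scores and weights below are made up.

def majority_vote(scores):
    # Positive wins if positives outnumber negatives.
    return 1 if sum(scores) > len(scores) / 2 else 0

def sensitive_aggregation(scores):
    # Flag the response if any mechanism reports negative sentiment.
    return 0 if any(s == 0 for s in scores) else 1

def weighted_vote(scores, weights):
    # Overweight particular qualities or feedback mechanisms.
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

scores = [1, 1, 0]                       # e.g. two model scores and one human label
majority_vote(scores)                    # 1: positive > negative
sensitive_aggregation(scores)            # 0: at least one negative
weighted_vote(scores, [1.0, 1.0, 2.0])   # 0.5: human feedback counted double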
Benchmarking feedback functions (sentiment)

How do they perform against a “golden set”?

Accuracy against the golden set, imdb (50k labeled movie reviews), for three feedback functions:

◼ Feedback function 1 (*): 66.9%
◼ Feedback function 2 (*#): 96.0%
◼ Feedback function 3 (*#): 77.0%

* Models provided by this vendor are used to generate feedback functions. The vendor does not provide nor endorse the functions used.
# A sample of 100 labeled reviews was used due to rate limitations and time constraints for this session.
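
A sketch of this kind of benchmark, assuming the prompt_sentiment feedback function sketched earlier and a 100-review sample of the IMDB dataset from Hugging Face datasets (mirroring the sample size noted above).

from datasets import load_dataset

# Golden set: a small labeled sample (label 1 = positive, 0 = negative).
imdb = load_dataset("imdb", split="test").shuffle(seed=0).select(range(100))

correct = 0
for example in imdb:
    predicted = 1 if prompt_sentiment(example["text"][:512]) >= 0.5 else 0
    correct += int(predicted == example["label"])

accuracy = correct / len(imdb)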
Summary
◼ Overview and History of Large Language Models

◼ Building LLM Applications


■ What kinds of applications can you build?
■ What are the components of an LLM application?
■ Case Study: Building TruBot and its failure modes

◼ Testing LLM Applications


■ How do you measure LLM application quality?
■ Feedback Functions: Usage, Aggregation and Benchmarking
What’s coming next week?
◼ How to find issues in chained LLM applications

◼ How to add different feedback functions into your application

◼ How to assess feedback function results for your application

◼ And more!
Stay Tuned

Colossal squids are coming to help!
