
Testing and Monitoring

LLM Applications
Introduction
Speakers

Anupam Datta, Co-Founder and Chief Scientist, TruEra
Shayak Sen, Co-Founder and CTO, TruEra
Josh Reini, Data Scientist, TruEra
Piotr (Peter) Mardziel, Machine Learning Engineer, TruEra
Rick Shih, Machine Learning Engineer, TruEra

Agenda
◼ Overview and History of Large Language Models (10 minutes)

◼ Building LLM Applications (30 minutes)


■ What kinds of applications can you build?
■ What are the components of an LLM application?
■ Case Study: Building TruBot and its failure modes

◼ Testing LLM Applications (20 minutes; to be continued on May 18)


■ How do you measure LLM application quality?
■ Feedback Functions: Usage, Aggregation and Benchmarking
Foundation models are everywhere
Sample language foundation models:
● BERT, RoBERTa, DistilBERT, ...
● ChatGPT, GPT-3, LaMDA, BLOOM

Adaptation uses:
● your data to
● build a model for
● your specific task

(Image source: Bommasani et al. 2021.)

Workshop focus: Large Language Models (LLMs) and Apps


A Brief History of Foundation Models

◼ 1st generation: Transformers: BERT, DistilBERT, RoBERTa, ...

■ Attention, parallelizable, encoder-decoder architecture, ~1B params

◼ 2nd generation: LLMs and Generative AI:


■ GPT-3, DALL-E, ..., LaMDA: ~100-200B parameters
■ Massive-scale models trained on huge corpora, smarter byte-pair encoding of tokens, prompt engineering
■ InstructGPT (~1-2B), ChatGPT (size not disclosed: 20B?)
■ Reinforcement learning from human feedback (RLHF)
Building LLM applications
Tremendous activity in building LLM apps

Applications

◼ Question Answering
◼ Conversation
◼ Code completion
◼ Creative Generation
◼ Search
◼ Translation
◼ Classification
◼ …

Building by chaining

◼ Multi-modal data sources (public, proprietary, etc.)
◼ System & user inputs
◼ Prompt Templates
◼ Vector databases
◼ LLMs
◼ Links to the real world (plug-ins…)
◼ …

🦜🔗 LangChain
Architecture option 1: LLM only QA

Use LLM to directly answer questions

Example: "How should I measure ML model performance?" --> LLM (Completion) --> "You should measure ML model performance using a performance metric, such as accuracy, AUC, ..."

Challenges: Quality and traceability of answers
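
A minimal sketch of this LLM-only option, assuming the same langchain OpenAI wrapper used in the TruBot code later in this deck; the prompt wording here is illustrative:

from langchain.llms import OpenAI

# LLM-only QA: the question goes straight to the model; no retrieved context.
llm = OpenAI(temperature=0, max_tokens=128)

question = "How should I measure ML model performance?"
completion = llm(f"Question: {question}\nHelpful Answer:")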


Architecture option 2: retrieval-augmented QA (1/2)

Step 1: Create & store embeddings of documents in a vector database (knowledge base)

Example documents: "Performance metrics include classification accuracy, log loss, ..." ; "Drift metrics include Wasserstein, PSI, Difference of Mean, ..." --> Embedding Model --> Vector DB
Architecture option 2: retrieval-augmented QA (2/2)

Step 2: Answer questions by chaining LLM and vector database

"How should I measure performance?" --> Embedding (query embedding) --> Vector DB --> relevant chunks --> Completion (query + relevant chunks) --> "You should measure performance using a performance metric, such as accuracy, AUC, ..."

Better handle on quality and traceability of answers
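
A minimal sketch of this chaining step, assuming a Pinecone index like the one built for TruBot later in this deck; the index name, k, and prompt wording are illustrative:

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import Pinecone

embedding = OpenAIEmbeddings(model='text-embedding-ada-002')
docsearch = Pinecone.from_existing_index(index_name="llmdemo", embedding=embedding)
llm = OpenAI(temperature=0, max_tokens=128)

question = "How should I measure performance?"

# Embed the query and retrieve nearby chunks from the vector DB.
chunks = docsearch.similarity_search(question, k=4)

# Stuff the retrieved chunks and the query into a single completion prompt.
context = "\n\n".join(chunk.page_content for chunk in chunks)
prompt = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    f"{context}\n\nQuestion: {question}\nHelpful Answer:"
)
answer = llm(prompt)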


Building something cool? Share in the community!

https://tinyurl.com/bddkyspv
Case Study: TruBot
Case Study: TruBot
Architecture: QA with context and chat history.

Implementation: LangChain.

"How should I measure performance?" --> Embedding (query embedding) --> Vector DB --> relevant chunks + chat summary --> Completion for Q/A (query + relevant chunks) --> "You should measure performance using a performance metric, such as accuracy, AUC, ..."
Chat history / summary --> Completion for Summarization --> chat summary (fed into the Q/A prompt)
Case Study: TruBot: LLM Completion Prompts
LLM Completion + Prompt --> Pipeline Tasks
● Q/A with context
● Chat history summarization

(Same TruBot architecture diagram as above.)
Case Study: TruBot: LLM Completion Prompts
◼ Completion for Q/A Prompt

Prompt:
Use the following pieces of context to answer the question at the end. If you don't know
the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:
Complete from here

context: """
Upon selecting the Performance test option, the performance test creation wizard is displayed.

Enter a name for the test, an optional description, then select the performance metric on which to measure
the results.

For performance tests, define failure and/or warning conditions using either absolute values (e.g., AUC
between 0.8 to 1.25, as per the four-fifths rule) or relative thresholds (e.g., if the false-positive rate
is lower by X amount relative to the rate by a baseline model).

Once you've set the attributes above, the results are displayed accordingly, as discussed next.

... """

question: 'How can I measure performance?'
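
A sketch of how this prompt can be wired up as a langchain PromptTemplate, with {context} and {question} filled in at query time; the context string is abbreviated:

from langchain.prompts import PromptTemplate

qa_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know, "
        "don't try to make up an answer.\n\n"
        "{context}\n\nQuestion: {question}\nHelpful Answer:"
    ),
)

filled = qa_prompt.format(
    context="Upon selecting the Performance test option, ...",  # abbreviated
    question="How can I measure performance?",
)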


Case Study: TruBot: LLM Completion Prompts
◼ Completion for Chat History Summarization Prompt

Prompt:
Progressively summarize the lines of conversation provided, adding onto the previous summary returning a new summary.

EXAMPLE
Current summary:
The human asks what the AI thinks of artificial intelligence. The AI thinks artificial intelligence is a force for good

New lines of conversation:


Human: Why do you think artificial intelligence is a force for good?
AI: Because artificial intelligence will help humans reach their full potential.

New summary:
The human asks what the AI thinks of artificial intelligence. The AI thinks artificial intelligence is a force for good
because it will help humans reach their full potential.
END OF EXAMPLE

Current summary:
{summary}

New lines of conversation:


{new_lines}

New summary:
Complete from here
Building TruBot
◼ Prepare contexts via LangChain and Pinecone:
■ Scrape truera.com for "documents". Split documents into "chunks".
■ Vector DB of embeddings of chunks via OpenAI embedding.

from langchain.document_loaders import UnstructuredHTMLLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Pinecone
...
# Load the scraped truera.com pages as documents.
docs = []
...
docs += UnstructuredHTMLLoader(scrape_file).load()

# Split documents into chunks and embed them into a Pinecone index.
text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=0)
chunks = text_splitter.split_documents(docs)
embedding = OpenAIEmbeddings(model='text-embedding-ada-002')  # 1536 dims
Pinecone.from_documents(chunks, embedding, index_name="llmdemo")
Building TruBot
◼ Response generation via LangChain, OpenAI, Pinecone:
■ Embed message via OpenAI embedding.
■ Query Pinecone vector DB for nearby chunks.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

# Embedding needed for Pinecone vector db.
embedding = OpenAIEmbeddings(model='text-embedding-ada-002')  # 1536 dims

# Vector DB configuration.
docsearch = Pinecone.from_existing_index(
    index_name="llmdemo", embedding=embedding
)
retriever = docsearch.as_retriever()
Building TruBot
◼ Response generation via LangChain, OpenAI, Pinecone:
■ Create prompt with nearby chunk texts and conversation summary (initially empty).
■ Complete prompt with OpenAI.
■ Send response. Include source of chunks.
■ Summarize conversation so far via LangChain and OpenAI for follow-ups.

from langchain.chains import ConversationalRetrievalChain
from langchain.llms import OpenAI
from langchain.memory import ConversationSummaryBufferMemory

llm = OpenAI(temperature=0, max_tokens=128)

# Conversation memory that keeps a running summary of the chat so far.
memory = ConversationSummaryBufferMemory(max_token_limit=650, llm=llm, memory_key="chat_history", output_key='answer')

# The main logic rolled into this one line:
chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory, max_tokens_limit=4096, get_chat_history=lambda h: h)

question: str = "How can I measure performance?"
response: str = chain(question)['answer']

That's it!?
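
A hypothetical follow-up turn, as a usage sketch: the memory holds a summary of the first exchange, so the second question can refer back to it.

first = chain("How can I measure performance?")['answer']

# The summarized chat history lets a follow-up like this resolve
# "those metrics" against the earlier answer.
followup = chain("Which of those metrics should I use for a classifier?")['answer']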
Failure Modes

◼ More parts --> more things to go wrong. The failure modes below annotate the TruBot architecture diagram:

■ Context modeling: unsuitable documents in DB.
■ Embedding not appropriate.
■ Retrieval: non-relevant contexts.
■ Context chunks lack higher-level context.
■ Completion does not follow prompt instructions.
■ Completion escapes instructions in prompt.
■ Completion misinterprets context.
■ Completion <-> question linguistic mismatch.
Find TruBot failure modes yourself

In the AI Quality Forum, send a message to @TruBot. Each new thread is a new conversation
with separate history. Use the #general channel.

https://tinyurl.com/bddkyspv
Failure modes

◼ Vector DB Failures: context lookup produces bad context.
◼ Completion Failures: language mismatch (e.g., Spanish question, English completion).
Testing LLM applications
Key challenges with LLMs

1. Measure model quality

2. Evaluate, monitor & debug model quality issues


Measure model quality

◼ No obvious ground truth for generative output (e.g., Chatbot response)

◼ First step: Get direct human feedback

◼ Question: Can we scale this up programmatically?


Track user responses.
Measure model quality with feedback

A feedback function takes as input generated text from an LLM (or a downstream model or
application built on it) and some metadata, and returns a score.

Examples

● Prompt Sentiment
● Question Answering Relevance
● Language mismatch
● Transcript length
● Response verbosity
● Fairness substitution
● Toxicity
● …

Illustration: ff = sentiment analysis on prompts; score = 2/10
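
As a concrete sketch, one way to implement the first example (prompt sentiment) is with a separate, off-the-shelf model rather than an LLM; the Hugging Face pipeline and the [0, 1] score range here are illustrative choices (the slide's 2/10 illustration uses a 1-10 scale).

from transformers import pipeline

# Separate-model feedback function: score the sentiment of the *prompt*.
sentiment_model = pipeline("sentiment-analysis")

def prompt_sentiment(prompt: str) -> float:
    """Feedback function: sentiment of the prompt, as a score in [0, 1]."""
    result = sentiment_model(prompt)[0]
    score = result["score"]
    return score if result["label"] == "POSITIVE" else 1.0 - score

prompt_sentiment("This product is terrible and support never answers.")  # low score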
Measure model quality with feedback functions

Analogy

◼ Data labeling functions, e.g. regular expressions
◼ Programmatically scales up human data labeling
◼ Combine labeling functions, e.g. through majority vote
◼ Benchmark against golden sets

Implementing feedback functions

◼ Build a separate model, e.g. BERT sentiment analysis
◼ Use LLMs themselves
◼ Average the score of a collection of feedback functions
◼ …
Addressing TruBot Failures with Feedback Functions

Failure: context lookup produces bad context.

◼ Feedback Function
■ Question/Statement Relevance between question and retrieved context matches. ✅


Addressing TruBot Failures with Feedback Functions

Failure: context lookup produces bad context.

Prompt:
You are a RELEVANCE classifier, providing the relevance of the given statement to the given question.
Provide all responses only as a number from 1 to 10 where 1 is the least relevant and 10 is the most
relevant.
Never elaborate.

STATEMENT: {statement}

QUESTION: {question}

RELEVANCE:

◼ Feedback Function
■ Question/Statement Relevance between question and retrieved context.
■ Example scores for four retrieved chunks: relevance=8, relevance=8, relevance=3, relevance=2.

◼ Actions
■ Apply Q/S Relevance before giving context to the LLM.
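
A sketch of this feedback function as an LLM call using the RELEVANCE prompt above; it assumes the model really does return a bare number, and the threshold of 5 and the chunks / question names are illustrative:

from langchain.llms import OpenAI

llm = OpenAI(temperature=0, max_tokens=8)

RELEVANCE_PROMPT = (
    "You are a RELEVANCE classifier, providing the relevance of the given statement "
    "to the given question.\n"
    "Provide all responses only as a number from 1 to 10 where 1 is the least relevant "
    "and 10 is the most relevant.\n"
    "Never elaborate.\n\n"
    "STATEMENT: {statement}\n\nQUESTION: {question}\n\nRELEVANCE:"
)

def qs_relevance(question: str, statement: str) -> int:
    """Feedback function: 1-10 relevance of a retrieved chunk to the question."""
    completion = llm(RELEVANCE_PROMPT.format(statement=statement, question=question))
    return int(completion.strip())

# Action: apply Q/S Relevance before giving context to the LLM,
# keeping only chunks that score above a threshold.
relevant_chunks = [c for c in chunks if qs_relevance(question, c.page_content) > 5]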
Addressing TruBot Failures with Feedback Functions

Failure: language mismatch (e.g., Spanish question, English answer).

◼ Feedback Function
■ Question / Answer languages (match)
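
One way to implement this language-match feedback function, assuming the langdetect package (TruEra's own implementation may differ):

from langdetect import detect

def language_match(question: str, answer: str) -> float:
    """Feedback function: 1.0 if question and answer share a language, else 0.0."""
    return 1.0 if detect(question) == detect(answer) else 0.0

language_match("¿Cómo puedo medir el rendimiento?",
               "You should measure performance using a metric such as accuracy.")  # 0.0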
Addressing TruBot Failures with Feedback Functions

Failure: language mismatch.

◼ Feedback Function
■ Question / Answer languages (match)

◼ Actions
■ Adjust prompt.
■ Translate context.

Adjusted Prompt:
Use the following pieces of context to answer the question at the end in the same language
as the question. If you don't know the answer, just say that you don't know, don't try to
make up an answer.

{context}

Question: {question}
Helpful Answer:

Context Translation Prompt:
Translate the following statement to the same language as the question.

Statement: {statement}

Question: {question}

Translated statement:

Example: an English context chunk on Quantitative Input Influence ("2.2 Quantitative Input Influence (QII): As stressed in the previous section, at the heart of our framework developed in this paper is an instance-ba...") is translated into Spanish to match a Spanish-language question ("QII (Influencia de Entrada Cuantitativa) es un enfoque de explicación basado en instancias. Esta técnica controla las correlaciones mediante la intervención aleatoria de las entradas y captura las...").
Aggregating feedback functions (Binary Sentiment Example)

Feedback mechanisms: several model-based sentiment feedback functions (*) plus human feedback.

Prompt: Do you like your job?
Response: As an AI language model, I do not have feelings, as I am just a program designed to provide information and assistance. However, I am programmed to perform my job to the best of my abilities in order to provide accurate and useful information to users.

Aggregation options:
◼ Majority Voting: feedback = Positive > Negative
◼ Sensitive Aggregation: feedback = Negative > 0
◼ Weighted Voting: feedback = 2; overweight particular qualities or feedback mechanisms

* Models provided by this vendor are used to generate feedback functions. The vendor does not provide nor endorse the functions used.
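
A minimal sketch of these three aggregation schemes, under the illustrative assumption that each feedback mechanism returns a binary sentiment label (1 = positive, 0 = negative); the scores and weights below are made up.

def majority_vote(scores):
    # Positive wins if positives outnumber negatives.
    return 1 if sum(scores) > len(scores) / 2 else 0

def sensitive_aggregation(scores):
    # Flag the response if any mechanism reports negative sentiment.
    return 0 if any(s == 0 for s in scores) else 1

def weighted_vote(scores, weights):
    # Overweight particular qualities or feedback mechanisms.
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

scores = [1, 1, 0]                       # e.g. two model scores and one human label
majority_vote(scores)                    # 1: positive > negative
sensitive_aggregation(scores)            # 0: at least one negative
weighted_vote(scores, [1.0, 1.0, 2.0])   # 0.5: human feedback counted double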
Benchmarking feedback functions (sentiment)

How do they perform against a “golden set”?

Accuracy against the golden set, imdb (50k labeled movie reviews), for three feedback functions:

◼ Feedback function 1 (*): 66.9%
◼ Feedback function 2 (*#): 96.0%
◼ Feedback function 3 (*#): 77.0%

* Models provided by this vendor are used to generate feedback functions. The vendor does not provide nor endorse the functions used.
# A sample of 100 labeled reviews was used due to rate limitations and time constraints for this session.
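
A sketch of this kind of benchmark, assuming the prompt_sentiment feedback function sketched earlier and a 100-review sample of the IMDB dataset from Hugging Face datasets (mirroring the sample size noted above).

from datasets import load_dataset

# Golden set: a small labeled sample (label 1 = positive, 0 = negative).
imdb = load_dataset("imdb", split="test").shuffle(seed=0).select(range(100))

correct = 0
for example in imdb:
    predicted = 1 if prompt_sentiment(example["text"][:512]) >= 0.5 else 0
    correct += int(predicted == example["label"])

accuracy = correct / len(imdb)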
Summary
◼ Overview and History of Large Language Models

◼ Building LLM Applications


■ What kinds of applications can you build?
■ What are the components of an LLM application?
■ Case Study: Building TruBot and its failure modes

◼ Testing LLM Applications


■ How do you measure LLM application quality?
■ Feedback Functions: Usage, Aggregation and Benchmarking
What’s coming next week?
◼ How to find issues in chained LLM applications

◼ How to add different feedback functions into your application

◼ How to assess feedback function results for your application

◼ And more!
Stay Tuned

Colossal squids are coming to help!
