LLM Applications
Introduction
Speakers
Agenda
◼ Overview and History of Large Language Models (10 minutes)
Adaptation uses your data to build a model for your specific task. Image source: Bommasani et al. 2021.
Step 1: Create & store embeddings of documents in a vector database (knowledge base).
[Diagram: Query → query embedding → Vector DB → relevant chunks; Query + relevant chunks → Model → Completion. Sample document chunks: "You should measure performance using a performance metric, such as accuracy, AUC, ..."; "Performance metrics include classification accuracy, log loss, ..."]
https://tinyurl.com/bddkyspv
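The retrieval step in the diagram above can be sketched with a toy in-memory vector store. The bag-of-words "embedding" below is a stand-in for a real embedding model (e.g., OpenAI's), and the chunk texts are illustrative; this is a minimal sketch, not a production pipeline.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding': counts of lowercase words.
    A stand-in for a real embedding model such as OpenAI's."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Step 1: create & store embeddings of document chunks (the knowledge base).
chunks = [
    "You should measure performance using a performance metric, such as accuracy, AUC, ...",
    "Performance metrics include classification accuracy, log loss, ...",
    "Fairness tests compare metrics across protected segments.",
]
vector_db = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query, k=2):
    """Step 2: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(vector_db, key=lambda cv: cosine(q, cv[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

relevant = retrieve("how to measure model performance")
```

The query plus the retrieved chunks would then be placed into the completion prompt, as in the diagram.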
Case Study: TruBot
Architecture: QA with context and chat history.
Implementation: Langchain.
[Diagram: Query → query embedding → Vector DB of embeddings → relevant chunks; Query + relevant chunks + chat summary → Completion for Q/A; chat history → Completion for summarization → chat history / summary, fed back into the next turn.]
Case Study: TruBot: LLM Completion Prompts
LLM Completion + Prompt --> Pipeline Tasks
● Q/A with context
● Chat history summarization
Case Study: TruBot: LLM Completion Prompts
◼ Completion for Q/A Prompt
Prompt:
Use the following pieces of context to answer the question at the end. If you don't know
the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}
Helpful Answer:
(The LLM completes from here.)
context: """
Upon selecting the Performance test option, the performance test creation wizard is displayed.
Enter a name for the test and an optional description, then select the performance metric on which to measure
the results.
For performance tests, define failure and/or warning conditions using either absolute values (e.g., AUC
between 0.8 and 1.25, as per the four-fifths rule) or relative thresholds (e.g., if the false-positive rate
is lower by X amount relative to the rate of a baseline model).
Once you've set the attributes above, the results are displayed accordingly, as discussed next.
... """
Prompt:
Progressively summarize the lines of conversation provided, adding onto the previous summary and returning a new summary.
EXAMPLE
Current summary:
The human asks what the AI thinks of artificial intelligence. The AI thinks artificial intelligence is a force for good
New summary:
The human asks what the AI thinks of artificial intelligence. The AI thinks artificial intelligence is a force for good
because it will help humans reach their full potential.
END OF EXAMPLE
Current summary:
{summary}
New summary:
(The LLM completes from here.)
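Both completion prompts above are plain templates with slots. A minimal sketch of filling them with `str.format` (no Langchain needed for illustration); note the `{lines}` slot for new conversation turns is an assumption, since the slide shows only `{summary}`:

```python
QA_PROMPT = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)

SUMMARY_PROMPT = (
    "Progressively summarize the lines of conversation provided, "
    "adding onto the previous summary and returning a new summary.\n\n"
    "Current summary:\n{summary}\n\n"
    # Assumed slot: the new conversation lines to fold into the summary.
    "New lines of conversation:\n{lines}\n\n"
    "New summary:"
)

qa_filled = QA_PROMPT.format(
    context="Upon selecting the Performance test option, ...",
    question="How do I create a performance test?",
)
```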
Building TruBot
◼ Prepare contexts via Langchain and Pinecone:
■ Scrape truera.com for "documents". Split documents into "chunks".
■ Vector DB of embeddings of chunks via OpenAI embedding.
from langchain.vectorstores import Pinecone

# Vector DB configuration: connect to the existing index of chunk embeddings.
docsearch = Pinecone.from_existing_index(
    index_name="llmdemo", embedding=embedding
)
retriever = docsearch.as_retriever()
Building TruBot
◼ Response generation via Langchain, OpenAI, Pinecone:
■ Create prompt with nearby chunk texts and conversation summary (initially empty).
■ Complete prompt with OpenAI.
■ Send response. Include source of chunks.
■ Summarize conversation so far via Langchain and OpenAI for follow-ups.
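The four response-generation steps above can be wired together as one conversation turn. This is a sketch under stated assumptions, not the actual TruBot code: `retriever` and `llm` are stub interfaces standing in for Pinecone retrieval and OpenAI completion.

```python
def trubot_turn(query, retriever, llm, summary):
    """One conversation turn: retrieve context, answer, then re-summarize.

    Assumed interfaces (illustrative):
      retriever(query) -> list of (chunk_text, source) pairs
      llm(prompt)      -> completion string (stands in for an OpenAI call)
    """
    chunks = retriever(query)
    context = "\n".join(text for text, _ in chunks)
    # Create prompt with nearby chunk texts and conversation summary.
    answer = llm(
        "Use the following pieces of context to answer the question.\n"
        f"Conversation summary: {summary}\n{context}\n"
        f"Question: {query}\nHelpful Answer:"
    )
    # Send response; include the source of the chunks.
    sources = sorted({source for _, source in chunks})
    response = f"{answer}\nSources: {', '.join(sources)}"
    # Summarize the conversation so far for follow-up questions.
    new_summary = llm(
        "Progressively summarize the lines of conversation provided.\n"
        f"Current summary: {summary}\n"
        f"New lines:\nHuman: {query}\nAI: {answer}\nNew summary:"
    )
    return response, new_summary

# Usage with stubs in place of Pinecone retrieval and OpenAI completion:
stub_retriever = lambda query: [("Chunk about performance tests.", "truera.com/docs")]
stub_llm = lambda prompt: "stub completion"
response, new_summary = trubot_turn(
    "What is a performance test?", stub_retriever, stub_llm, ""
)
```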
Failure Modes
◼ Completion does not follow prompt instructions.
◼ Retrieval: non-relevant contexts.
Find TruBot failure modes yourself
In the AI Quality Forum, send a message to @TruBot. Each new thread is a new conversation
with separate history. Use the #general channel.
https://tinyurl.com/bddkyspv
Failure modes
◼ Vector DB Failures: context lookup produces bad context.
Failure modes
◼ Completion Failures: language mismatch (e.g., a Spanish question gets an English response).
Testing LLM applications
Key challenges with LLMs
A feedback function takes as input generated text from an LLM (or a downstream model or
application built on it) and some metadata, and returns a score.
Examples
● Prompt sentiment
● Question answering relevance
● Language mismatch
● Transcript length
● Response verbosity
● Fairness substitution
● Toxicity
● …
Example: ff: sentiment analysis on prompts → score: 2/10
Measure model quality with feedback functions.
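Per the definition above, a feedback function maps generated text (plus metadata) to a score. A toy lexicon-based sentiment feedback function on a 1–10 scale; the lexicon and scaling are illustrative assumptions, not how a production implementation would score sentiment:

```python
NEGATIVE = {"hate", "terrible", "awful", "bad", "worst"}
POSITIVE = {"love", "great", "excellent", "good", "best"}

def prompt_sentiment(text, metadata=None):
    """Feedback function: sentiment of a prompt, 1 (negative) to 10 (positive).
    Toy lexicon lookup; a real version would use a sentiment model."""
    words = set(text.lower().split())
    neg = len(words & NEGATIVE)
    pos = len(words & POSITIVE)
    if neg + pos == 0:
        return 5  # neutral when no lexicon words are found
    # Scale the positive fraction onto the 1..10 range.
    return round(1 + 9 * pos / (neg + pos))

score = prompt_sentiment("I hate this terrible product")
```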
Addressing TruBot Failures with Feedback Functions
Prompt:
You are a RELEVANCE classifier, providing the relevance of the given statement to the given question.
Provide all responses only as a number from 1 to 10 where 1 is the least relevant and 10 is the most
relevant.
Never elaborate.
QUESTION: {question}
STATEMENT: {statement}
RELEVANCE:
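The relevance classifier can itself be packaged as a feedback function: format the prompt, call the LLM, and parse the 1–10 score. This is a sketch; `llm` is a stub for a completion call, and the `QUESTION` slot is an assumption (the slide shows only `STATEMENT`):

```python
import re

RELEVANCE_PROMPT = (
    "You are a RELEVANCE classifier, providing the relevance of the given "
    "statement to the given question. Provide all responses only as a number "
    "from 1 to 10 where 1 is the least relevant and 10 is the most relevant. "
    "Never elaborate.\n"
    "QUESTION: {question}\n"   # assumed slot, see lead-in
    "STATEMENT: {statement}\n"
    "RELEVANCE:"
)

def qs_relevance(question, statement, llm):
    """Feedback function: LLM-graded question/statement relevance, 1-10."""
    completion = llm(RELEVANCE_PROMPT.format(question=question, statement=statement))
    match = re.search(r"\d+", completion)
    if match is None:
        raise ValueError(f"No score found in completion: {completion!r}")
    # Clamp to the 1..10 range in case the model disobeys the instructions.
    return max(1, min(10, int(match.group())))

score = qs_relevance(
    "How do I create a performance test?",
    "Upon selecting the Performance test option, the wizard is displayed.",
    lambda prompt: " 8",  # stub LLM returning a fixed grade
)
```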
◼ Feedback Function
■ Question/Statement Relevance: does the retrieved context match the question? (e.g., relevance=8 for matching chunks; relevance=2 or relevance=3 for non-relevant ones)
◼ Actions
■ Apply Q/S Relevance before giving context to the LLM; filter out low-relevance chunks.
Addressing TruBot Failures with Feedback Functions
Language mismatch (e.g., Spanish question, English response).
◼ Feedback Function
■ Question / Answer languages (match)
Addressing TruBot Failures with Feedback Functions
Language mismatch.
◼ Feedback Function
■ Question / Answer languages (match)
Adjusted Prompt:
Use the following pieces of context to answer the question at the end in the same language
as the question. If you don't know the answer, just say that you don't know, don't try to
make up an answer.
{context}
Question: {question}
Helpful Answer:
Context Translation Prompt:
Translate the following statement to the same language as the question.
Statement: {statement}
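The language-match feedback function can be sketched as follows. The word-list detector is a deliberately tiny stand-in for a real language-identification model; the binary 0/1 score is also an illustrative choice:

```python
SPANISH_HINTS = {"hola", "gracias", "qué", "cómo", "dónde", "estás"}

def detect_language(text):
    """Toy language detector: 'es' if any Spanish hint word appears, else 'en'.
    A stand-in for a real language-identification model or library."""
    words = set(text.lower().split())
    return "es" if words & SPANISH_HINTS else "en"

def language_match(question, answer):
    """Feedback function: 1.0 if question and answer languages match, else 0.0."""
    return 1.0 if detect_language(question) == detect_language(answer) else 0.0
```

A low score flags the language-mismatch failure mode, which the adjusted prompt above is designed to prevent.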
Feedback Mechanism: Human Feedback
Prompt: Do you like your job?
Response: As an AI language model, I do not have feelings, as I am just a program designed to provide information and assistance. However, I am programmed to perform my job to the best of my abilities in order to provide accurate and useful information to users.
* Models provided by this vendor are used to generate feedback
functions. The vendor does not provide nor endorse the functions used.
Feedback Mechanism: Majority Voting
Feedback: Negative > 0
Feedback Mechanism: Weighted Voting
Feedback: weighted combination of feedback function scores.
# A sample of 100 labeled reviews was used due to rate limitations and time constraints for this session.
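The two aggregation schemes above can be sketched as follows; the negativity threshold and the weights are illustrative assumptions, not values from the session:

```python
def majority_vote(scores, threshold=0.5):
    """Flag the output if a majority of feedback scores are negative.
    Scores are in [0, 1]; a score below `threshold` counts as negative."""
    negatives = sum(1 for s in scores if s < threshold)
    return negatives > len(scores) / 2

def weighted_vote(scores, weights):
    """Combine feedback scores as a weighted mean, so more trusted
    feedback functions carry more influence."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total
```

For example, `weighted_vote([1.0, 0.0], [2, 1])` weights the first feedback function twice as heavily as the second.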
Summary
◼ Overview and History of Large Language Models
◼ And more!
Stay Tuned