
Open Challenges For

LLM Applications
Chip Huyen (@chipro | bit.ly/chip-mlops-discord)
Jun ‘23
Agenda
1. Inconsistency
2. Hallucination
3. Compliance + privacy
4. Context length
5. Model drift
6. Forward & backward compatibility
7. LLM on the edge
8. LLM for non-English languages
9. Efficiency of chat as a universal interface
10. Data bottleneck

Challenge 1: Consistency
1. How to ensure user experience consistency?
2. How to ensure downstream apps can run without breaking?

Same input, different outputs
Small input changes can cause big output changes

● Setting temperature=0 won’t fix it
● Outputs won’t pass the perturbation test
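A perturbation test can be sketched in a few lines: send several paraphrases of the same question and check whether the answers agree. This is a minimal sketch — `generate` is a stubbed stand-in for any LLM call, not a real API:

```python
# Minimal perturbation-test sketch. `generate` is a stand-in for an LLM
# call; it is stubbed here so the example is self-contained.
def generate(prompt: str) -> str:
    # Stand-in: a real call would hit an LLM API.
    return "4" if "2 + 2" in prompt else "four"

def perturbation_test(prompts: list[str]) -> bool:
    """Return True only if all paraphrases of the same question agree."""
    outputs = {generate(p).strip().lower() for p in prompts}
    return len(outputs) == 1

paraphrases = ["What is 2 + 2?", "What's 2+2?", "Compute two plus two."]
print(perturbation_test(paraphrases))  # an inconsistent model fails this
```
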

5
No output schema guarantee for downstream apps

Challenge 2:
Hallucination
“Half knowledge is worse
than ignorance.”

- Thomas Babington Macaulay

Poor performance on tasks that require factuality
BIRD-SQL Leaderboard

Why do LLMs hallucinate?
● [DeepMind] Models "lack the understanding of the cause and effect of their actions"
● [OpenAI] Mismatch between the LLM’s internal knowledge and the labeler’s internal knowledge, caused by behavior cloning

| | Pretrained LLM | SFT model | Reward model | Final model |
|---|---|---|---|---|
| Data | Low-quality text (e.g. Internet data) | High-quality demonstration data | Human-feedback comparison data | Prompts |
| Training | Language modeling: optimized for text completion | Supervised finetuning: finetuned for dialogue on (prompt, response) pairs | Classification: trained to give a scalar score for (prompt, response) | Reinforcement learning: optimized to generate responses that maximize the reward model’s score |
| Scale (May ‘23) | >1 trillion tokens | 10K - 100K (prompt, response) | 100K - 1M comparisons (prompt, winning_response, losing_response) | 10K - 100K prompts |
| Examples | GPT-x, Gopher, Falcon, LLaMa, Pythia, Bloom, StableLM | Dolly-v2, Falcon-Instruct | | InstructGPT, ChatGPT, Claude, StableVicuna |

Bolded in the original slide: open-sourced models.
See: RLHF: Reinforcement Learning from Human Feedback
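The reward-model stage described above trains on comparisons rather than absolute labels: it learns to score the winning response higher than the losing one via a pairwise loss, -log(sigmoid(r_winning - r_losing)). A minimal sketch — `reward` here is a toy heuristic standing in for a real learned model:

```python
import math

def reward(prompt: str, response: str) -> float:
    # Stand-in scorer: a real reward model is a finetuned LLM with a scalar head.
    return float(len(response))  # toy heuristic, for illustration only

def comparison_loss(prompt: str, winning: str, losing: str) -> float:
    """Pairwise loss used to train reward models:
    -log(sigmoid(reward(winning) - reward(losing)))."""
    diff = reward(prompt, winning) - reward(prompt, losing)
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# Loss is near zero when the scorer already ranks the winner higher,
# and large when the ranking is inverted.
print(comparison_loss("Q", "a detailed answer", "ok"))
print(comparison_loss("Q", "ok", "a detailed answer"))
```
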
Is hallucination a feature or a bug?
A feature for tasks that rely on creativity

A bug for tasks that rely on factuality

Challenge 3: Privacy
1. [Build] If you build a chatbot that lets your users talk to your data, how do you ensure the chatbot doesn’t accidentally reveal sensitive information?
2. [Buy] If you send your user data to external APIs, are those APIs compliant?
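On the [Buy] side, one partial mitigation is redacting obvious PII before user data leaves your system. A deliberately naive sketch — two regexes are nowhere near real compliance, which needs a proper PII/NER pipeline:

```python
import re

# Naive redaction sketch: scrub emails and phone numbers before the text
# is sent to a third-party API. Illustrative only, not a compliance tool.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(redact("Contact jane.doe@example.com or +1 (555) 123-4567."))
# → Contact [EMAIL] or [PHONE].
```
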

Multi-step Jailbreaking Privacy Attacks on ChatGPT (Li et al., 2023)


Challenge 4: Context length

● A significant proportion of information-seeking questions have context-dependent answers (e.g., roughly 16.5% of NQ-Open) (SituatedQA, 2021)
● Use cases:
○ Document processing
○ Summarization
○ Narrative
○ Any task involving genes and proteins
○ etc.

COLT5 (2023)
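The standard workaround for a limited context window is chunking long documents, usually with some overlap so information at chunk boundaries isn't lost. A minimal sketch, using whitespace word counts as a stand-in for real tokenizer counts:

```python
# Split a long document into overlapping chunks that fit a context budget.
# Word counts approximate token counts here; a real system would count
# tokens with the model's own tokenizer.
def chunk(text: str, max_words: int = 100, overlap: int = 20) -> list[str]:
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, max(len(words), 1), step)]

doc = " ".join(f"w{i}" for i in range(250))
pieces = chunk(doc)
print(len(pieces))  # 4 overlapping chunks of <= 100 words each
```
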
Challenge 5: Data drift
● “Existing models, which are trained on data collected in the past, fail to generalize to answering questions asked in the present, even when provided with an updated evidence corpus (a roughly 15 point drop in accuracy).” (SituatedQA, 2021)

Generative AI taught everyone about data drift
Challenge 6: Forward & backward compatibility
● Same model, new data
● New model

How to make sure your prompts still work with newer models?
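A practical answer is a prompt regression suite: pin the prompt/answer pairs that matter and rerun them whenever the underlying model or its version changes. A minimal sketch — `call_model` is a stub standing in for a versioned LLM API call:

```python
# Prompt regression-test sketch. `call_model` is a self-contained stub;
# a real suite would call a pinned, versioned LLM endpoint.
def call_model(model: str, prompt: str) -> str:
    canned = {"What is the capital of France?": "Paris"}
    return canned.get(prompt, "I don't know")

REGRESSION_CASES = [
    ("What is the capital of France?", "Paris"),
]

def run_suite(model: str) -> list[str]:
    """Return the prompts whose answers regressed; empty means safe to upgrade."""
    return [prompt for prompt, expected in REGRESSION_CASES
            if expected.lower() not in call_model(model, prompt).lower()]

print(run_suite("model-v2"))  # [] -> these prompts still work on the new model
```
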

Challenge 7: LLM on the edge
● Healthcare devices
● Autonomous vehicles
● Drive-thru voice bots
● Your personal ChatGPT, trained on your own data, running on your MacBook

Challenge 7: LLM on the edge
1. On-device inference
2. Training
a. On-device training: bottlenecked by compute + memory + tech available
b. If trained on a server:
i. How to incorporate device’s data?
ii. How to send model’s updates to device?

Choose a model size
A 7B-param model can run on a MacBook:
● bfloat16 = 14GB memory
● int8 = 7GB memory

A 7B-param model costs approx*:
● $100 to finetune
● $25,000 to train from scratch

[Chart: cost and performance vs. model size — 5 - 13B param models finetuned for specific tasks vs. larger general models]

* Highly dependent on how much data
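The memory figures above follow from simple arithmetic: weight memory is roughly parameter count times bytes per parameter (ignoring activations and the KV cache):

```python
# Back-of-envelope inference memory for the numbers above:
# memory ≈ parameter_count × bytes_per_parameter (weights only).
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(7e9, 2))  # bfloat16 (2 bytes/param) -> 14.0 GB
print(weight_memory_gb(7e9, 1))  # int8     (1 byte/param)  -> 7.0 GB
```
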


Challenge 8: LLMs for
non-English languages
● Performance (Lai et al., 2023)

Challenge 8: LLMs for non-English languages
● Performance (Lai et al., 2023)
● Tokenization (Yennie Jun, 2023)
○ Latency
○ Cost
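The latency and cost penalties follow directly from token counts: API pricing and generation time scale with tokens, and many non-English languages need several times more tokens for text of the same length. A sketch with illustrative numbers — the price and tokens-per-character figures below are assumptions, not measurements:

```python
# Why tokenization hurts non-English users: cost (and latency) scale with
# token count, and some languages tokenize into far more tokens per
# character. All numbers below are illustrative assumptions.
PRICE_PER_1K_TOKENS = 0.002  # hypothetical API price in USD

def request_cost(n_chars: int, tokens_per_char: float) -> float:
    return n_chars * tokens_per_char * PRICE_PER_1K_TOKENS / 1000

english = request_cost(1000, 0.25)  # ~4 characters per token
other = request_cost(1000, 0.75)    # ~1.3 characters per token
print(other / english)  # same text length, ~3x the cost
```
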

Challenge 9: Efficiency of chat as a universal interface
Poll: Which do you prefer?

1. Search interface
2. Chat interface

Challenge 9: Efficiency of chat as a universal interface
Chat is NOT efficient, but is very robust

Challenge 9: Efficiency of chat as a universal interface
How much you like an interface depends on how much you’ve been exposed to
that interface
● An ongoing discussion for the last decade, since the rise of superapps in Asia

Dan Grover (2015)
Challenge 10: Data bottleneck
● The rate of training dataset size growth is much faster than the rate of new
data being generated (Villalobos et al, 2022)
● The Internet is being rapidly populated with AI-generated text

Data is essential to leverage AI
1. Consolidate existing data across departments and sources
2. Update your data terms of use (see StackOverflow and Reddit)
3. Put guardrails around data quality + governance

Reach out if you want Claypot to help you with your data story!
10 open challenges
1. Inconsistency
2. Hallucination
3. Compliance + privacy
4. Context length
5. Model drift
6. Forward & backward compatibility
7. LLM on the edge
8. LLM for non-English languages
9. Efficiency of chat as a universal interface
10. Data bottleneck

Thank you!
@chipro
linkedin.com/in/chiphuyen
bit.ly/chip-mlops-discord
