
Running LLMs Locally for Faster, Smarter Inference:
Generative AI in 5 Lines of Code
Our Speakers

Ria Cheruvu, Intel AI Evangelist
Paula Ramos, Intel AI Evangelist
Yury Gorbachev, Fellow, OpenVINO Lead Architect
DEVCON Workshop Series 2024


Agenda
• Introduction to Generative AI
• Newest features of OpenVINO
• Running Gen AI on the AI PC
• Running AI at the Edge
• Running AI in the Cloud
• Closing: exciting future highlights on LLMs and Gen AI with OpenVINO
• Q&A


Compelling GenAI Use Cases
• Gaming Experiences
• Room Design
• Novel Illustration
• Vacation-planner chatbot
• Fashion and E-Commerce
• Web Design
LLM Use Cases
• Agent Simulations
• Agents
• Autonomous Agents
• Chatbots
• Classification
• Code Understanding
• Code Writing
• Evaluation
• Extraction
• Interacting with APIs
• Multi-Modal
• QA Over Documents
• Self-Checking
• SQL
• Summarization
• Tagging


AI Predominantly Today

[Diagram] A client module on the machine requesting the AI service (client device) sends request data over the network to an AI service server module, which runs a trained AI model on the machine performing the AI service (cloud server).


Different Types of Compute

Cloud Edge AI PC



Optimized Performance

CPU GPU NPU FPGA



OpenVINO 2024.1
Making it easier to deploy and accelerate Gen AI & LLMs

• Improvements for LLMs
• Enhancements for existing platforms
• Simpler workflow
• More deep learning models


Pain Points of Gen AI
• Slow inference speed
• No flexibility to run workloads on different HW
• Difficulty training + optimizing
• Large model size
• Large memory footprint


Let’s Run the Demo!
Model Workflow Highlights

Model
• More Gen AI coverage (Mixtral, URLNet, Falcon 7b Instruct, and more)

Optimize
• LLM compilation time reduced through additional optimizations
• Better LLM compression and improved performance
• Significant memory reduction for select smaller Gen AI models and iGPU

Deploy
• Portability and performance to run AI at the edge, in the cloud, or locally
• Preview NPU plugin and JavaScript API now available


GenAI Model Workflow with OpenVINO

Optimum-Intel (based on Transformers and Diffusers)

1. Convert MODEL (PyTorch frontend)
• openvino.convert_model
• torch.compile

2. Optimize MODEL (NNCF)
• Weight compression
• PTQ (post-training quantization)
• QAT (quantization-aware training)
• …

3. Deploy MODEL (Runtime backend)
• CPU
• GPU
• NPU
• …

4. Build PIPELINE
• Text-generation
• Text-to-image
• …

A minimal conversion-and-deployment sketch follows.
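As a minimal sketch of steps 1 and 3, assuming a PyTorch module `torch_model` and a matching `example_input` (both hypothetical here, not defined in these slides):

import openvino as ov

# 1. Convert: trace a PyTorch model into OpenVINO IR
# (`torch_model` and `example_input` are assumed to exist)
ov_model = ov.convert_model(torch_model, example_input=example_input)

# 3. Deploy: compile for a target device; "AUTO" lets OpenVINO choose
compiled = ov.compile_model(ov_model, device_name="AUTO")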



Behind the Scenes:
How to enable faster, smarter inference for LLMs with OpenVINO

• Compression
• Quantization
• Stateful transformation


OpenVINO Integration with Optimum

pip install optimum-intel[openvino,nncf]

Combine the convenience of Hugging Face with the efficiency of OpenVINO!


OpenVINO Integration with Optimum
Gen AI in 5 Lines of Code
- from transformers import AutoModelForCausalLM
+ from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "helenai/gpt2-ov"

- model = AutoModelForCausalLM.from_pretrained(model_id)
+ model = OVModelForCausalLM.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

results = pipe("He's a dreadful magician and")



Neural Network Compression Framework (NNCF)



Neural Network Compression Framework (NNCF)

• Post-Training Quantization
• Accuracy-Control Quantization
• Quantization-Aware Training
• Weight Compression
• Activation-Aware Weight Quantization
• Filter pruning, Binarization, Sparsity, ...



pip install nncf



1. Weight Compression for LLMs (NNCF)

| Model | Mode | Perplexity | Perplexity Increase | Model Size (GB) |
| databricks/dolly-v2-3b | fp32 | 5.01 | 0 | 10.3 |
| databricks/dolly-v2-3b | int8 | 5.07 | 0.05 | 2.6 |
| databricks/dolly-v2-3b | int4_asym_g32_r50 | 5.28 | 0.26 | 2.2 |
| databricks/dolly-v2-3b | nf4_g128_r60 | 5.19 | 0.18 | 1.9 |
| meta-llama/Llama-2-7b-chat-hf | fp32 | 3.28 | 0 | 25.1 |
| meta-llama/Llama-2-7b-chat-hf | int8 | 3.29 | 0.01 | 6.3 |
| meta-llama/Llama-2-7b-chat-hf | int4_asym_g128_r80 | 3.41 | 0.14 | 4.0 |
| meta-llama/Llama-2-7b-chat-hf | nf4_g128 | 3.41 | 0.13 | 3.5 |
| togethercomputer/RedPajama-INCITE-7B-Instruct | fp32 | 4.15 | 0 | 25.6 |
| togethercomputer/RedPajama-INCITE-7B-Instruct | int8 | 4.17 | 0.02 | 6.4 |
| togethercomputer/RedPajama-INCITE-7B-Instruct | nf4_ov_g32_r60 | 4.28 | 0.13 | 5.1 |
| togethercomputer/RedPajama-INCITE-7B-Instruct | int4_asym_g128 | 4.17 | 0.02 | 3.6 |

Significant Reduction in RAM usage!
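A hedged sketch of applying this kind of int4 weight compression with NNCF; the model path is hypothetical, and the group size and ratio mirror the int4_asym_g32_r50 row above:

import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("dolly-v2-3b/openvino_model.xml")  # hypothetical IR path

# Compress ~50% of weight matrices to asymmetric int4 with group size 32;
# the remainder stays int8 to limit the perplexity increase.
compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    group_size=32,
    ratio=0.5,
)
ov.save_model(compressed, "dolly-v2-3b-int4/openvino_model.xml")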



1. Weight Compression for LLMs (NNCF)

All rows use 100% int4, symmetric mode.

| Model | Method | lambada_openai Perplexity (↓) | lambada_openai Accuracy (↑) | Gsm8k Exact Match, 5-shot (↑) |
| stablelm-2-zephyr-1_6b | awq | 7.905 | 54.9 | 34.12 |
| stablelm-2-zephyr-1_6b | data-free | 8.584 | 54.28 | 29.95 |
| stable-zephyr-3b-dpo | awq | 8.4099 | 57.27 | 48.6 |
| stable-zephyr-3b-dpo | data-free | 9.3011 | 56.18 | 46.17 |
| zephyr-7b-beta | awq | 3.4564 | 71.05 | - |
| zephyr-7b-beta | data-free | 3.5021 | 70.7 | - |

Better Quality of the Compressed Model!
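A sketch of requesting activation-aware weight quantization (AWQ) through Optimum-Intel. `OVWeightQuantizationConfig` is the config class used earlier in these slides; the `awq` and `dataset` options exist in recent optimum-intel releases, but treat the exact flags as version-dependent:

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

q_config = OVWeightQuantizationConfig(
    bits=4,
    sym=True,              # symmetric int4, as in the table
    ratio=1.0,             # compress 100% of the weight layers
    awq=True,              # activation-aware scaling
    dataset="wikitext2",   # small calibration set needed for AWQ
)
model = OVModelForCausalLM.from_pretrained(
    "stabilityai/stablelm-2-zephyr-1_6b", quantization_config=q_config
)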



2. Dynamic Quantization (CPU)

[Diagram: IR graph vs. execution graph] In the default IR graph, weights stored as INT4/INT8/NF4 are decompressed (offline) to FP32 and the MatMul runs in FP32. In the execution graph with dynamic quantization, activations are quantized to INT8 on the fly and weights are decompressed to INT8 on the fly, so the MatMul runs in INT8 (accumulating in INT32) and the result is dequantized back to FP32.

Weight compression (offline):

ov_model = OVModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=OVWeightQuantizationConfig(bits=4, **model_compression_params))

Dynamic quantization of activations (runtime):

ov_model = OVModelForCausalLM.from_pretrained(
    model_path,
    ov_config={"DYNAMIC_QUANTIZATION_GROUP_SIZE": "32"})


3. Stateful Transformation
What is the KV (Key-Value) Cache?

In autoregressive generation, the decoder emits one token per step, and each step attends to all previous tokens:

1: <sos>
2: <sos> Ich
3: <sos> Ich einen
4: <sos> Ich einen frau
5: <sos> Ich einen frau <eos>

Rather than recomputing attention keys and values for the whole prefix at every step, the KV cache stores them, so each step only computes keys and values for the newest token. OpenVINO's stateful transformation keeps this cache inside the model as internal state instead of passing it in and out as inputs and outputs. A toy sketch of this incremental decoding follows.
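A toy numpy sketch of incremental decoding with a KV cache (single attention head, illustrative names, random weights instead of a real model):

import numpy as np

d = 8                                   # toy head dimension
rng = np.random.default_rng(0)
Wk, Wv = rng.standard_normal((d, d)), rng.standard_normal((d, d))
k_cache, v_cache = [], []

def decode_step(x):
    # x: embedding of the newest token, shape (d,).
    # Only the new token's key/value are computed; earlier ones come from the cache.
    k_cache.append(x @ Wk)
    v_cache.append(x @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = np.exp(x @ K.T / np.sqrt(d))
    return (scores / scores.sum()) @ V  # attention output for this step

for _ in range(5):                      # five steps, as in the slide above
    _ = decode_step(rng.standard_normal(d))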



3. Stateful Transformation
Logic Behind



OpenVINO Integration with Optimum
Gen AI in 5 Lines of Code (now with performance-tuning options)

- from transformers import AutoModelForCausalLM
+ from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "helenai/gpt2-ov"

- model = AutoModelForCausalLM.from_pretrained(model_id)
+ model = OVModelForCausalLM.from_pretrained(
+     model_id,
+     ov_config={"KV_CACHE_PRECISION": "u8",
+                "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32",
+                "PERFORMANCE_HINT": "LATENCY"})

tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

results = pipe("He's a dreadful magician and")



AI Deployment Across Compute

Edge
Industry-specific AI use cases; real-time data processing; wider reach.
Pros: data independence, cost efficiency, increased control, autonomous execution.
Cons: compute is limited by local resources.

AI PC
Wide range of consumer- or PC-specific AI use cases.
Pros: data independence, cost efficiency, increased control, special computing capabilities for optimal performance and energy consumption.
Cons: compute is limited by local resources.

Cloud
Centralization; the edge device connects to the cloud to perform computation and get results back.
Pros: large amounts of data, limitless compute on demand.
Cons: risk to data privacy, high latency, dependency on the connection to the cloud.
Running Gen AI on the Client (AI PC)
Where is the sweet spot for local inference for LLMs?



Example: What if you could take your travel-assistant chatbot with you on vacation?

Three AI Engines in Intel® Core Ultra
The right balance of power and performance for building and deploying AI models with OpenVINO

• NPU: Power Efficiency. Ideal for sustained AI workloads and AI offload for battery life.
• CPU: Fast Response. Ideal for low-latency AI workloads.
• GPU: High Throughput. Ideal for AI-accelerated digital content creation and gaming.

A device-selection sketch follows.
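As a sketch of targeting one of these engines from Optimum-Intel (reusing the model ID from the earlier example; passing `device=` at load time and calling `.to()` afterwards both exist in optimum-intel, though exact behavior is version-dependent):

from optimum.intel import OVModelForCausalLM

# "CPU", "GPU", or "NPU" on Intel Core Ultra; "AUTO" lets OpenVINO choose
model = OVModelForCausalLM.from_pretrained("helenai/gpt2-ov", device="GPU")

# ...or retarget an already-loaded model (triggers recompilation):
model.to("NPU")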



Enterprise Intelligence with LLMs using RAG
Pipeline: Question → Retrieve → Prompt → LLM → Answer (a minimal sketch follows)
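A library-free sketch of the retrieve-then-prompt loop. The `embed` function (any sentence-embedding model) and the `pipe` text-generation pipeline (e.g. the Optimum-Intel pipeline shown earlier) are assumed, not defined here:

import numpy as np

def retrieve(question, chunks, embed, k=3):
    # Rank document chunks by cosine similarity to the question embedding.
    q = embed(question)
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(chunks, key=lambda c: cos(q, embed(c)), reverse=True)[:k]

def rag_answer(question, chunks, embed, pipe):
    # Stuff the top-k chunks into the prompt, then let the LLM answer.
    context = "\n".join(retrieve(question, chunks, embed))
    prompt = (f"Answer using only this context:\n{context}\n\n"
              f"Question: {question}\nAnswer:")
    return pipe(prompt, max_new_tokens=128)[0]["generated_text"]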



Let’s switch to a demo!
• Llama3 (LLM) + RAG
• Latent Consistency Models



AI PC Developer Program

Sign up to receive the latest updates and news from the AI PC Developer Program:
https://www.intel.com/content/www/us/en/developer/topic-technology/ai-pc/overview.html#gs.85rzmn
Related Products

▪ NPU Documentation
▪ Built-in GPU
▪ Intel® Core Ultra processor



AI at the Edge
How can we build LLMs for low-power settings?
When are small LMs useful instead (and what are they)?


Demos on the Edge: Phi-3



OpenVINO Notebooks

100+ demos: LLMs, GenAI, Stable Diffusion, Whisper, GPT, YOLOv5/v8, CLIP, Object Detection and Segmentation, Image Classification, Human Pose Estimation, and much more!



OpenVINO Notebooks

And Music Generation, Text-to-Speech (Bark), Speech-to-Text (Whisper), Diarization...



Running AI Models in the Cloud
“How do we balance executing AI workloads between the edge and the cloud?”



OpenVINO Model Server
Powered by OpenVINO Runtime



Integration with MediaPipe
Python code execution

• Python execution is enabled in OVMS via MediaPipe by the built-in PythonExecutorCalculator
• Supports execution of custom Python code

[Diagram] A “How are you” prompt sent to OVMS flows through a MediaPipe graph and returns text such as “I’m fine”; a “zebra” prompt returns generated images.

https://docs.openvino.ai/2023.3/ovms_docs_python_support_reference.html



Demos: Cloud <-> Edge

OpenVINO Model Server with INT8 Quantization





OpenVINO Model Server
Run Server

docker run -d --rm -p 9000:9000 -v $(pwd)/onnx:/model:ro openvino/model_server \
  --port 9000 \
  --model_name gpt-j-6b \
  --model_path /model \
  --plugin_config '{"PERFORMANCE_HINT":"LATENCY","NUM_STREAMS":1}'



OpenVINO Model Server
Run Client (the REST example below assumes the server also exposes a REST endpoint, e.g. started with --rest_port 8000; the gRPC example targets port 9000)

curl -X POST http://localhost:8000/v1/models/usem:predict \
  -H 'Content-Type: application/json' \
  -d '{"instances": ["dog", "Puppies are nice.", "I enjoy taking long walks along the beach with my dog."]}'

from ovmsclient import make_grpc_client

client = make_grpc_client("localhost:9000")
data = ["dog", "Puppies are nice.", "I enjoy taking long walks along the beach with my dog."]
inputs = {"inputs": data}
results = client.predict(inputs=inputs, model_name="usem")



Try GenAI + LLM Serving with OpenVINO Model Server
Deploy generative pipelines as a service

• OpenVINO Model Server: Text Generation Demo
• OpenVINO Model Server: Stable Diffusion Demo
• OpenVINO Model Server: RAG Pipeline Demo
• vLLM + OpenVINO Integration (coming soon)
Optimizing LLMs with OpenVINO
Download our comprehensive white paper

Download PDF
When working on cloud/edge/PC, what do you suggest?



Exciting Future Highlights:
OpenVINO with LLMs and Gen AI



Contribute to OpenVINO Toolkit

How to start: https://medium.com/openvino-toolkit/how-to-contribute-to-an-ai-open-source-project-c741f48e009e



Google Summer of Code
Learn more: https://github.com/openvinotoolkit/openvino/wiki/Google-Summer-Of-Code
Installation

pip install openvino
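A quick smoke test after installing; `available_devices` is part of the OpenVINO Python API, and the device list varies by machine:

import openvino as ov

core = ov.Core()
print(core.available_devices)  # e.g. ['CPU', 'GPU', 'NPU'] on an AI PC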

www.openvino.ai
AI: The New Age
Solving the World’s Toughest Challenges, Together.

Calling All Developers & Technologists!
From front-end, web, and app devs to back-end, full-stack, database & DevOps, to data scientists, researchers, and more: learn, collaborate, and solve at Intel Innovation, an event for developers by developers.

• Hear from leading industry luminaries, technologists & start-up entrepreneurs in the field of AI.
• Get the latest AI development tools, hands-on experience & join on-site hackathons to optimize your AI code & workflows.
• Learn the breadth of future technology advancements in AI through keynotes, sessions, birds-of-a-feather, and hands-on labs.
• Share unique ideas and perspectives and collaborate with your peers.

Save the Date: September 24-25, 2024, San Jose Convention Center, CA
Opt in for early access when registration opens: www.intel.com/innovation
Poll Question:
“For your next project: will you build LLMs on your PC, for the edge, or in the cloud?”



Notices and Disclaimers
Performance varies by use, configuration and other factors. Learn more at
www.Intel.com/PerformanceIndex.

Performance results are based on testing as of dates shown in configurations and may not reflect all
publicly available updates. See backup for configuration details.

Intel technologies may require enabled hardware, software or service activation.

Your costs and results may vary.

Intel is committed to respecting human rights and avoiding complicity in human rights abuses. See
Intel's Global Human Rights Principles. Intel's products and software are intended only to be used in
applications that do not cause or contribute to a violation of an internationally recognized human right.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its
subsidiaries. Other names and brands may be claimed as the property of others.



Connect With Us



Thank You
