
Running LLMs Locally for Faster, Smarter Inference:
Generative AI in 5 Lines of Code
Our Speakers

Ria Cheruvu, Intel AI Evangelist
Paula Ramos, Intel AI Evangelist
Yury Gorbachev, Fellow, OpenVINO Lead Architect
DEVCON Workshop Series 2024


Agenda
• Introduction to Generative AI
• Newest features of OpenVINO
• Running Gen AI on the AI PC
• Running AI at the Edge
• Running AI in the Cloud
• Closing: exciting future highlights on LLMs and Gen AI with OpenVINO
• Q&A


Compelling GenAI Use Cases
• Gaming Experiences
• Room Design
• Novel Illustration
• Vacation-planner chatbot
• Fashion and E-Commerce
• Web Design
LLM Use Cases
• Agent Simulations
• Agents
• Autonomous Agents
• Chatbots
• Classification
• Code Understanding
• Code Writing
• Evaluation
• Extraction
• Interacting with APIs
• Multi-Modal
• QA Over Documents
• Self-Checking
• SQL
• Summarization
• Tagging


AI Predominantly Today

[Diagram] A client module on the machine requesting the AI service (client device) sends request data over the network to an AI service server module, which runs a trained AI model on the machine performing the AI service (cloud server).


Different Types of Compute

Cloud Edge AI PC



Optimized Performance

CPU GPU NPU FPGA



OpenVINO 2024.1
Making it easier to deploy and accelerate Gen AI & LLMs

• Improvements for LLMs
• Enhancements for existing platforms
• Simpler workflow
• More deep learning models


Pain Points of Gen AI
• Slow inference speed
• No flexibility to run workloads on different HW
• Difficulty training + optimizing
• Large model size
• Large memory footprint


Let’s Run the Demo!
Model Workflow Highlights

Model
• More Gen AI coverage (Mixtral, URLNet, Falcon 7b Instruct, and more)

Optimize
• LLM compilation time reduced through additional optimizations
• Better LLM compression and improved performance
• Significant memory reduction for select smaller Gen AI models and iGPU

Deploy
• Portability and performance to run AI at the edge, in the cloud, or locally
• Preview NPU plugin and JavaScript API now available


GenAI Model Workflow with OpenVINO

Optimum-Intel (based on Transformers and Diffusers)

1. Convert MODEL (PyTorch frontend)
• openvino.convert_model
• torch.compile

2. Optimize MODEL (NNCF)
• Weight compression
• PTQ (post-training quantization)
• QAT (quantization-aware training)
• …

3. Deploy MODEL (Runtime backend)
• CPU
• GPU
• NPU
• …

4. Build PIPELINE
• Text-generation
• Text-to-image
• …

A minimal conversion-and-deployment sketch follows.
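As a minimal sketch of steps 1 and 3, assuming a PyTorch module `torch_model` and a matching `example_input` (both hypothetical here, not defined in these slides):

import openvino as ov

# 1. Convert: trace a PyTorch model into OpenVINO IR
# (`torch_model` and `example_input` are assumed to exist)
ov_model = ov.convert_model(torch_model, example_input=example_input)

# 3. Deploy: compile for a target device; "AUTO" lets OpenVINO choose
compiled = ov.compile_model(ov_model, device_name="AUTO")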



Behind the Scenes:
How to enable faster, smarter inference for LLMs with OpenVINO

• Compression
• Quantization
• Stateful transformation


OpenVINO Integration with Optimum

pip install optimum-intel[openvino,nncf]

Combine the convenience of Hugging Face with the efficiency of OpenVINO!


OpenVINO Integration with Optimum
Gen AI in 5 Lines of Code
- from transformers import AutoModelForCausalLM
+ from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "helenai/gpt2-ov"

- model = AutoModelForCausalLM.from_pretrained(model_id)
+ model = OVModelForCausalLM.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

results = pipe("He's a dreadful magician and")



Neural Network Compression Framework (NNCF)



Neural Network Compression Framework (NNCF)

• Post-Training Quantization
• Accuracy-Control Quantization
• Quantization-Aware Training
• Weight Compression
• Activation-Aware Weight Quantization
• Filter pruning, Binarization, Sparsity, ...



pip install nncf



1. Weight Compression for LLMs (NNCF)

| Model | Mode | Perplexity | Perplexity Increase | Model Size (GB) |
| databricks/dolly-v2-3b | fp32 | 5.01 | 0 | 10.3 |
| databricks/dolly-v2-3b | int8 | 5.07 | 0.05 | 2.6 |
| databricks/dolly-v2-3b | int4_asym_g32_r50 | 5.28 | 0.26 | 2.2 |
| databricks/dolly-v2-3b | nf4_g128_r60 | 5.19 | 0.18 | 1.9 |
| meta-llama/Llama-2-7b-chat-hf | fp32 | 3.28 | 0 | 25.1 |
| meta-llama/Llama-2-7b-chat-hf | int8 | 3.29 | 0.01 | 6.3 |
| meta-llama/Llama-2-7b-chat-hf | int4_asym_g128_r80 | 3.41 | 0.14 | 4.0 |
| meta-llama/Llama-2-7b-chat-hf | nf4_g128 | 3.41 | 0.13 | 3.5 |
| togethercomputer/RedPajama-INCITE-7B-Instruct | fp32 | 4.15 | 0 | 25.6 |
| togethercomputer/RedPajama-INCITE-7B-Instruct | int8 | 4.17 | 0.02 | 6.4 |
| togethercomputer/RedPajama-INCITE-7B-Instruct | nf4_ov_g32_r60 | 4.28 | 0.13 | 5.1 |
| togethercomputer/RedPajama-INCITE-7B-Instruct | int4_asym_g128 | 4.17 | 0.02 | 3.6 |

Significant Reduction in RAM usage!
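A hedged sketch of applying this kind of int4 weight compression with NNCF; the model path is hypothetical, and the group size and ratio mirror the int4_asym_g32_r50 row above:

import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("dolly-v2-3b/openvino_model.xml")  # hypothetical IR path

# Compress ~50% of weight matrices to asymmetric int4 with group size 32;
# the remainder stays int8 to limit the perplexity increase.
compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    group_size=32,
    ratio=0.5,
)
ov.save_model(compressed, "dolly-v2-3b-int4/openvino_model.xml")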



1. Weight Compression for LLMs (NNCF)

All rows use 100% int4, symmetric mode.

| Model | Method | lambada_openai Perplexity (↓) | lambada_openai Accuracy (↑) | Gsm8k Exact Match, 5-shot (↑) |
| stablelm-2-zephyr-1_6b | awq | 7.905 | 54.9 | 34.12 |
| stablelm-2-zephyr-1_6b | data-free | 8.584 | 54.28 | 29.95 |
| stable-zephyr-3b-dpo | awq | 8.4099 | 57.27 | 48.6 |
| stable-zephyr-3b-dpo | data-free | 9.3011 | 56.18 | 46.17 |
| zephyr-7b-beta | awq | 3.4564 | 71.05 | - |
| zephyr-7b-beta | data-free | 3.5021 | 70.7 | - |

Better Quality of the Compressed Model!
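A sketch of requesting activation-aware weight quantization (AWQ) through Optimum-Intel. `OVWeightQuantizationConfig` is the config class used earlier in these slides; the `awq` and `dataset` options exist in recent optimum-intel releases, but treat the exact flags as version-dependent:

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

q_config = OVWeightQuantizationConfig(
    bits=4,
    sym=True,              # symmetric int4, as in the table
    ratio=1.0,             # compress 100% of the weight layers
    awq=True,              # activation-aware scaling
    dataset="wikitext2",   # small calibration set needed for AWQ
)
model = OVModelForCausalLM.from_pretrained(
    "stabilityai/stablelm-2-zephyr-1_6b", quantization_config=q_config
)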



2. Dynamic Quantization (CPU)

[Diagram: IR graph vs. execution graph] In the default IR graph, weights stored as INT4/INT8/NF4 are decompressed (offline) to FP32 and the MatMul runs in FP32. In the execution graph with dynamic quantization, activations are quantized to INT8 on the fly and weights are decompressed to INT8 on the fly, so the MatMul runs in INT8 (accumulating in INT32) and the result is dequantized back to FP32.

Weight compression (offline):

ov_model = OVModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=OVWeightQuantizationConfig(bits=4, **model_compression_params))

Dynamic quantization of activations (runtime):

ov_model = OVModelForCausalLM.from_pretrained(
    model_path,
    ov_config={"DYNAMIC_QUANTIZATION_GROUP_SIZE": "32"})


3. Stateful Transformation
What is the KV (Key-Value) Cache?

In autoregressive generation, the decoder emits one token per step, and each step attends to all previous tokens:

1: <sos>
2: <sos> Ich
3: <sos> Ich einen
4: <sos> Ich einen frau
5: <sos> Ich einen frau <eos>

Rather than recomputing attention keys and values for the whole prefix at every step, the KV cache stores them, so each step only computes keys and values for the newest token. OpenVINO's stateful transformation keeps this cache inside the model as internal state instead of passing it in and out as inputs and outputs. A toy sketch of this incremental decoding follows.
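A toy numpy sketch of incremental decoding with a KV cache (single attention head, illustrative names, random weights instead of a real model):

import numpy as np

d = 8                                   # toy head dimension
rng = np.random.default_rng(0)
Wk, Wv = rng.standard_normal((d, d)), rng.standard_normal((d, d))
k_cache, v_cache = [], []

def decode_step(x):
    # x: embedding of the newest token, shape (d,).
    # Only the new token's key/value are computed; earlier ones come from the cache.
    k_cache.append(x @ Wk)
    v_cache.append(x @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = np.exp(x @ K.T / np.sqrt(d))
    return (scores / scores.sum()) @ V  # attention output for this step

for _ in range(5):                      # five steps, as in the slide above
    _ = decode_step(rng.standard_normal(d))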



3. Stateful Transformation
Logic Behind



OpenVINO Integration with Optimum
Gen AI in 5 Lines of Code (now with performance-tuning options)

- from transformers import AutoModelForCausalLM
+ from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "helenai/gpt2-ov"

- model = AutoModelForCausalLM.from_pretrained(model_id)
+ model = OVModelForCausalLM.from_pretrained(
+     model_id,
+     ov_config={"KV_CACHE_PRECISION": "u8",
+                "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32",
+                "PERFORMANCE_HINT": "LATENCY"})

tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

results = pipe("He's a dreadful magician and")



AI Deployment Across Compute

Edge
Industry-specific AI use cases; real-time data processing; wider reach.
Pros: data independence, cost efficiency, increased control, autonomous execution.
Cons: compute is limited by local resources.

AI PC
Wide range of consumer- or PC-specific AI use cases.
Pros: data independence, cost efficiency, increased control, special computing capabilities for optimal performance and energy consumption.
Cons: compute is limited by local resources.

Cloud
Centralization; the edge device connects to the cloud to perform computation and get results back.
Pros: large amounts of data, limitless compute on demand.
Cons: risk to data privacy, high latency, dependency on the connection to the cloud.
Running Gen AI on the Client (AI PC)
Where is the sweet spot for local inference for LLMs?



Example: What if you could take your travel-assistant chatbot with you on vacation?

Three AI Engines in Intel® Core Ultra
The right balance of power and performance for building and deploying AI models with OpenVINO

• NPU: Power Efficiency. Ideal for sustained AI workloads and AI offload for battery life.
• CPU: Fast Response. Ideal for low-latency AI workloads.
• GPU: High Throughput. Ideal for AI-accelerated digital content creation and gaming.

A device-selection sketch follows.
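As a sketch of targeting one of these engines from Optimum-Intel (reusing the model ID from the earlier example; passing `device=` at load time and calling `.to()` afterwards both exist in optimum-intel, though exact behavior is version-dependent):

from optimum.intel import OVModelForCausalLM

# "CPU", "GPU", or "NPU" on Intel Core Ultra; "AUTO" lets OpenVINO choose
model = OVModelForCausalLM.from_pretrained("helenai/gpt2-ov", device="GPU")

# ...or retarget an already-loaded model (triggers recompilation):
model.to("NPU")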



Enterprise Intelligence with LLMs using RAG
Pipeline: Question → Retrieve → Prompt → LLM → Answer (a minimal sketch follows)
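A library-free sketch of the retrieve-then-prompt loop. The `embed` function (any sentence-embedding model) and the `pipe` text-generation pipeline (e.g. the Optimum-Intel pipeline shown earlier) are assumed, not defined here:

import numpy as np

def retrieve(question, chunks, embed, k=3):
    # Rank document chunks by cosine similarity to the question embedding.
    q = embed(question)
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(chunks, key=lambda c: cos(q, embed(c)), reverse=True)[:k]

def rag_answer(question, chunks, embed, pipe):
    # Stuff the top-k chunks into the prompt, then let the LLM answer.
    context = "\n".join(retrieve(question, chunks, embed))
    prompt = (f"Answer using only this context:\n{context}\n\n"
              f"Question: {question}\nAnswer:")
    return pipe(prompt, max_new_tokens=128)[0]["generated_text"]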



Let’s switch to a demo!
• Llama3 (LLM) + RAG
• Latent Consistency Models



AI PC Developer Program

Sign up to receive the latest updates and news from the AI PC Developer Program:
https://www.intel.com/content/www/us/en/developer/topic-technology/ai-pc/overview.html#gs.85rzmn
Related Products

▪ NPU Documentation
▪ Built-in GPU
▪ Intel® Core Ultra processor



AI at the Edge
How can we build LLMs for low-power settings?
When are small LMs useful instead (and what are they)?


Demos on the Edge: Phi-3



OpenVINO Notebooks

100+ demos: LLMs, GenAI, Stable Diffusion, Whisper, GPT, YOLOv5/v8, CLIP, Object Detection and Segmentation, Image Classification, Human Pose Estimation, and much more!



OpenVINO Notebooks

And Music Generation, Text-to-Speech (Bark), Speech-to-Text (Whisper), Diarization...



Running AI Models in the Cloud
“How do we balance executing AI workloads between the edge and the cloud?”



OpenVINO Model Server
Powered by OpenVINO Runtime



Integration with MediaPipe
Python code execution

• Python execution is enabled in OVMS via MediaPipe by the built-in PythonExecutorCalculator
• Supports execution of custom Python code

[Diagram] A “How are you” prompt sent to OVMS flows through a MediaPipe graph and returns text such as “I’m fine”; a “zebra” prompt returns generated images.

https://docs.openvino.ai/2023.3/ovms_docs_python_support_reference.html



Demos: Cloud <-> Edge

OpenVINO Model Server with INT8 Quantization





OpenVINO Model Server
Run Server

docker run -d --rm -p 9000:9000 -v $(pwd)/onnx:/model:ro openvino/model_server \
  --port 9000 \
  --model_name gpt-j-6b \
  --model_path /model \
  --plugin_config '{"PERFORMANCE_HINT":"LATENCY","NUM_STREAMS":1}'



OpenVINO Model Server
Run Client (the REST example below assumes the server also exposes a REST endpoint, e.g. started with --rest_port 8000; the gRPC example targets port 9000)

curl -X POST http://localhost:8000/v1/models/usem:predict \
  -H 'Content-Type: application/json' \
  -d '{"instances": ["dog", "Puppies are nice.", "I enjoy taking long walks along the beach with my dog."]}'

from ovmsclient import make_grpc_client

client = make_grpc_client("localhost:9000")
data = ["dog", "Puppies are nice.", "I enjoy taking long walks along the beach with my dog."]
inputs = {"inputs": data}
results = client.predict(inputs=inputs, model_name="usem")



Try GenAI + LLM Serving with OpenVINO Model Server
Deploy generative pipelines as a service

• OpenVINO Model Server: Text Generation Demo
• OpenVINO Model Server: Stable Diffusion Demo
• OpenVINO Model Server: RAG Pipeline Demo
• vLLM + OpenVINO Integration (coming soon)
Optimizing LLMs with OpenVINO
Download our comprehensive white paper

Download PDF
When working on cloud/edge/PC, what do you suggest?



Exciting Future Highlights:
OpenVINO with LLMs and Gen AI



Contribute to OpenVINO Toolkit

How to start: https://medium.com/openvino-toolkit/how-to-contribute-to-an-ai-open-source-project-c741f48e009e



Google Summer of Code
Learn more: https://github.com/openvinotoolkit/openvino/wiki/Google-Summer-Of-Code
Installation

pip install openvino
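A quick smoke test after installing; `available_devices` is part of the OpenVINO Python API, and the device list varies by machine:

import openvino as ov

core = ov.Core()
print(core.available_devices)  # e.g. ['CPU', 'GPU', 'NPU'] on an AI PC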

www.openvino.ai
AI: The New Age
Solving the World’s Toughest Challenges, Together.

Calling All Developers & Technologists!
From front-end, web, and app devs to back-end, full-stack, database & DevOps, to data scientists, researchers, and more: learn, collaborate, and solve at Intel Innovation, an event for developers by developers.

• Hear from leading industry luminaries, technologists & start-up entrepreneurs in the field of AI.
• Get the latest AI development tools, hands-on experience & join on-site hackathons to optimize your AI code & workflows.
• Learn the breadth of future technology advancements in AI through keynotes, sessions, birds-of-a-feather, and hands-on labs.
• Share unique ideas and perspectives and collaborate with your peers.

Save the Date: September 24-25, 2024, San Jose Convention Center, CA
Opt in for early access when registration opens: www.intel.com/innovation
Poll Question:
“For your next project: will you build LLMs on your PC, for the edge, or in the cloud?”



Notices and Disclaimers
Performance varies by use, configuration and other factors. Learn more at
www.Intel.com/PerformanceIndex.

Performance results are based on testing as of dates shown in configurations and may not reflect all
publicly available updates. See backup for configuration details.

Intel technologies may require enabled hardware, software or service activation.

Your costs and results may vary.

Intel is committed to respecting human rights and avoiding complicity in human rights abuses. See
Intel's Global Human Rights Principles. Intel's products and software are intended only to be used in
applications that do not cause or contribute to a violation of an internationally recognized human right.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its
subsidiaries. Other names and brands may be claimed as the property of others.



Connect With Us



Thank You
