
Wasm for AI Infra: A Lightweight, Fast, and Secure Alternative Approach

@MileyFu
https://github.com/WasmEdge/WasmEdge
Contents

• A bit about me and WasmEdge

• Challenges with Python + Docker

• Rust: The Optimal Choice for the AGI Era

• WASM Container Benefits

• Self-host LLMs across devices with a single command

• Use Case - Build a serverless LLM App


A bit about me
DevRel and founding member of WasmEdge; Ambassador of the Cloud Native Computing Foundation

Organize dev meetups/conferences
Wasm Meetups, KCD Shenzhen (Dec 12, 2023), Wasm+Rust Meetups

Give talks
KubeCon, Open Source Summit, Japan DevRelCon, CloudDay Italy, AWS User Group Meetup, Global AI

Write technical blogs and documentation; translate content between Chinese and English

(Photos: community events in China, USA, Singapore, Germany, India, and Japan, including the Taipei KCD and Rust Meetup)
About WasmEdge
● A WebAssembly (Wasm) runtime.
● A highly efficient, lightweight, and safe virtualization solution for:
○ Web browsers and IoT devices
○ Cloud data centers
○ Public blockchains (decentralized cloud)
● Vibrant open source community with over 100 active contributors. Contribute with ease!

The No. 1 WebAssembly runtime among CNCF projects. Co-launched the Wasm landscape.
Partners
Current Solution for Running AI: Python + Docker
● Python
○ Easy: easy to learn and use
○ Big: active community, prosperous ecosystem, rich resources
○ Applicable to all kinds of AI tasks, reaching deep into every aspect of the field
● Docker Container
○ Portability: facilitates distribution
○ Isolation: ensures security and stability
○ Scalability: high resource utilization

Python’s Limitations

(Figure: speedups from performance engineering a program that multiplies two 4096-by-4096 matrices.)

● Performance bottlenecks
● Parallelism: the GIL ensures that only one thread executes Python bytecode at a time within a single process
● Memory management
How about Python + C/C++/Rust?
Portability Issues and Complex Integration

● Maintenance cost
● Portability issues: Python interacts with native libraries and system-level dependencies differently across different environments.
● Integration complexity: binding Python to other languages often requires careful management of data types, memory allocation, and error handling. Even though third-party libraries such as Pybind11 can ease the binding task, this “glue” process is still error-prone and demands a deep understanding of both Python and the other language in use, increasing development time and risk.
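To make this “glue” concrete, here is a minimal sketch of a Rust function exposed to Python, assuming the PyO3 crate (the function and module names are illustrative): every boundary crossing needs explicit type mapping and manual error translation.

use pyo3::prelude::*;
use pyo3::exceptions::PyValueError;

// Each crossing maps Rust types to Python types (Vec<f64> <-> list[float])
// and converts Rust errors into Python exceptions by hand.
#[pyfunction]
fn dot(a: Vec<f64>, b: Vec<f64>) -> PyResult<f64> {
    if a.len() != b.len() {
        return Err(PyValueError::new_err("length mismatch"));
    }
    Ok(a.iter().zip(&b).map(|(x, y)| x * y).sum())
}

// The module definition is more glue that must stay in sync with the
// build tooling (e.g., maturin) and the target Python ABI.
#[pymodule]
fn native_math(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(dot, m)?)?;
    Ok(())
}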
LangChain image from https://deepwisdom.feishu.cn/docx/XDYbdhq7ro0KPexuXUkcB6mynmh#VygBdZS9poOgNfxKoFrcm6Bensh
Limitations of Linux Containers/Docker

● Cold start performance: seconds
● Disk space: GBs
● Hardware accelerator support: specific versions required
● Portability: dependent on CPU architecture
● Security: relies on user permissions of the host OS

Some text quoted from https://wasmedge.org/wasm_linux_container/; image from https://medium.com/@shivraj.jadhav82/webassembly-wasm-docker-vs-wasm-275e317324a1
Rust + WebAssembly
● Performance and memory safety
● Concurrency (see the example below)
● Powerful and expressive type system
● Cargo, a modern package management tool
● Rapidly growing ecosystem: ndarray, llm, candle, burn, …
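As a small std-only illustration of that concurrency story (my own example, not from the WasmEdge docs): the borrow checker only lets each thread take a disjoint slice, so data races are rejected at compile time and there is no GIL-style global lock.

use std::thread;

// Each thread gets a disjoint &mut slice; the compiler rejects any overlap.
fn parallel_square(data: &mut [u64]) {
    let mid = data.len() / 2;
    let (left, right) = data.split_at_mut(mid);
    thread::scope(|s| {
        s.spawn(|| left.iter_mut().for_each(|x| *x *= *x));
        s.spawn(|| right.iter_mut().for_each(|x| *x *= *x));
    });
}

fn main() {
    let mut v: Vec<u64> = (1..=8).collect();
    parallel_square(&mut v);
    println!("{v:?}"); // [1, 4, 9, 16, 25, 36, 49, 64]
}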
WebAssembly: Lighter, Faster, and More Secure

Image from https://medium.com/@shivraj.jadhav82/webassembly-wasm-docker-vs-wasm-275e317324a1
https://wasmedge.org/wasm_linux_container/
The technology path of virtualization

● Hypervisor VMs and microVMs (e.g., AWS Firecracker)
● Application containers (e.g., Docker)

What’s next? Another solution:
● High-level language VMs (e.g., V8, WebAssembly)


○ The sandbox mechanism provides a safer production environment
○ Protects user data and system resources
○ Bytecode verification prevents malicious code
○ Isolated execution environments between Wasm modules

Image from https://medium.com/@shivraj.jadhav82/webassembly-wasm-docker-vs-wasm-275e317324a1
Don’t take our word for it

“WebAssembly is the next wave of cloud computing.”

Source: CNCF Annual Survey 2022.
https://www.cncf.io/wp-content/uploads/2023/09/The-State-of-WebAssembly-2023.pdf
Don’t take our word for it

Solomon Hykes was Docker’s founding CTO and the creator of the Docker container.

Docker’s landmark partnership with WasmEdge in 2022 brought “Wasm containers” to 10 million Docker Desktop users.
WasmEdge was started in 2019.

In summary, old container solutions are too heavyweight and complex for many use cases:
browsers, IoT, AI inference, SaaS plugins, microservices.
Run LLMs on your Mac / Across Devices

Find and download the Llama2 series of models and many more.
Run LLMs on a Mac / across devices with Wasm, in only 4 commands

● Automatic GPU detection
● 20.54 tokens per second on an M1 MacBook, with zero Python dependency
● Inference app: a 2 MB cross-platform binary
1. Install WasmEdge with LLM Support

Begin with a single command to install the WasmEdge runtime with LLM support.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugin wasi_nn-ggml

Or you could download and copy the WasmEdge install files manually, following the installation guide.
2. Download an LLM Chat App in Wasm

Next, get the ultra-small 2 MB cross-platform binary: the LLM chat application. It requires no other dependencies and runs unchanged across environments. This small Wasm file is compiled from Rust. To build your own, check out the llama-utils repo.

curl -LO https://github.com/second-state/llama-utils/raw/main/chat/llama-chat.wasm
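To see what that Rust source boils down to, here is a condensed sketch of the core inference calls, assuming the wasmedge-wasi-nn crate used by the llama-utils and WASI-NN examples (prompt handling and buffer sizes are simplified; see the repo for the real app):

use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn main() {
    // "default" matches the --nn-preload default:GGML:AUTO:<model>.gguf flag
    // passed to the wasmedge CLI in step 4.
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache("default")
        .unwrap();
    let mut ctx = graph.init_execution_context().unwrap();

    // The prompt goes in as a UTF-8 byte tensor.
    let prompt = "What is WebAssembly?";
    ctx.set_input(0, TensorType::U8, &[1], prompt.as_bytes()).unwrap();
    ctx.compute().unwrap();

    // Read the generated text back out of the output tensor.
    let mut out = vec![0u8; 4096];
    let n = ctx.get_output(0, &mut out).unwrap();
    println!("{}", String::from_utf8_lossy(&out[..n]));
}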
3. Download the Llama2 7b Chat Model

curl -LO https://huggingface.co/wasmedge/llama2/resolve/main/llama-2-7b-chat-q5_k_m.gguf

WasmEdge also supports Llama2, CodeLlama, Codeshell, Mistral, MistralLite, TinyLlama, Baichuan, BELLE, Alpaca, Vicuna, OpenChat, Starcoder, OpenBuddy, and many more!
4. Chat with the Llama2 7b Model on the CLI

Now that you have everything set up, you can start chatting with your Llama2-7b-chat-powered LLM on the command line.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm

After that, you can ask Llama2 7b chat any question.

That’s it! You can use the same llama-chat.wasm file to run other LLMs, like OpenChat, CodeLlama, Mistral, etc.
https://www.secondstate.io/run-llm/
LLaMa-2 and WasmEdge on RHEL9

1. Install the NVIDIA driver and CUDA tools and libraries (official repo)
2. dnf install gcc-toolset-12 ninja-build
3. git clone https://github.com/WasmEdge/WasmEdge.git
4. cd WasmEdge
5. Drop into a shell with the proper paths for the optional toolchain: scl enable gcc-toolset-12 bash
6. Build: cmake -GNinja -Bbuild -DCMAKE_BUILD_TYPE=Release -DWASMEDGE_PLUGIN_WASI_NN_GGML_LLAMA_CUBLAS="yes" -DWASMEDGE_PLUGIN_WASI_NN_BACKEND="ggml" && cmake --build build
7. Install: sudo cmake --install build
8. Now you can download the model and the chat app, and run them according to the original instructions

https://pushf.substack.com/p/llama-2-and-wasmedge-on-rhel9-aka
Beyond Language AI
Not limited to LLM tasks: vision and audio as well. Besides the ggml backend, the WasmEdge Runtime supports the PyTorch, TensorFlow, and OpenVINO AI frameworks.

AI inference for real-time data streams with WasmEdge and YoMo on edge devices: high performance, small size, and AI support.
Supports popular OpenCV and FFmpeg libraries for image processing.

Discover how you can apply vision and audio AI with projects like mediapipe-rs (see the sketch below).

• Mediapipe-rs: a Rust library for MediaPipe tasks
• Source code: Mediapipe-rs GitHub
• Tutorial: Mediapipe solutions
• More WASI-NN examples: https://github.com/second-state/WasmEdge-WASINN-examples

Wasm support for AI inference
https://github.com/WasmEdge/mediapipe-rs
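As a taste of the mediapipe-rs API, here is an image-classification sketch adapted from the project’s examples (the model and image paths are placeholders, and the exact builder methods may differ across versions):

use mediapipe_rs::tasks::vision::ImageClassifierBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder paths: use a TFLite model from the MediaPipe model zoo.
    let (model_path, img_path) = ("classifier.tflite", "cat.jpg");

    let result = ImageClassifierBuilder::new()
        .model_asset_path(model_path)        // set the model path
        .max_results(3)                      // keep the top 3 categories
        .finalize()?                         // create the image classifier task
        .classify(&image::open(img_path)?)?; // run inference on one image

    println!("{}", result);
    Ok(())
}

Compiled to Wasm, the same binary runs on WasmEdge across CPU architectures.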
Build Serverless AI Apps

- LLM libraries for Rust developers: ChatGPT, Claude, the Llama2 series, and more
- SaaS integrations like GitHub, Discord, Telegram, Slack, etc.

Serverless:
☑ You write only the business logic (see the sketch below)
❎ No manual compilation or deployment of Rust functions
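A minimal sketch of what “business logic only” means in practice. This is not the real flows.network API: llm_chat and on_discord_message are hypothetical stand-ins for platform-provided bindings, which the platform compiles, deploys, and wires to events for you.

// Hypothetical stub standing in for a platform-provided LLM client.
fn llm_chat(prompt: &str) -> String {
    format!("(LLM answer to: {prompt})")
}

// The only code the developer writes: business logic from event to reply.
fn on_discord_message(msg: &str) -> String {
    llm_chat(&format!("Answer as a helpful Rust tutor: {msg}"))
}

fn main() {
    // Local smoke test; in production the platform invokes the handler.
    println!("{}", on_discord_message("What is the borrow checker?"));
}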
Learn Rust Bot (RAG-based)

https://flows.network/learn-rust




1. Create embeddings for your data

1. Prepare data. Split your docs into several chunks, then save them as a .txt file. (There are several algorithms for splitting your docs.)

2. Fork the RAG-embeddings demo repo and deploy its code on flows.network to get a webhook that can embed and store your data in a vector DB.

3. Upload your prepared text chunks to the vector DB and name your data with a single command line.
2. Build the LLM agent with the embeddings

Import the RAG-discord-bot demo repo into flows.network for deployment.

Then configure five environment variables.


Code Review Bot

“What happened to the programmers who had ChatGPT review their code?” (article, in Chinese)



1. Load the code review bot template in flows.network. The template contains the source code for the bot itself. We will clone the source code to your own GitHub account so that you can modify and customize it later. Click on Create and Deploy.

2. Give the bot your OpenAI API key.

3. Authorize the bot’s access to GitHub.

https://www.cncf.io/blog/2023/06/06/a-chatgpt-powered-code-reviewer-bot-for-open-source-projects/
Code Review Bot
Keep in Touch

https://github.com/WasmEdge/WasmEdge

Connect with me on GitHub/Twitter: @mileyfu

Join the WasmEdge Discord: https://discord.com/invite/U4B5sFTkFc
