
Wasm for AI Infra: A Lightweight, Fast, and Secure Alternative Approach

@MileyFu
https://github.com/WasmEdge/WasmEdge
Contents

• A bit about me and WasmEdge

• Challenges with Python + Docker

• Rust: The Optimal Choice for the AGI Era

• WASM Container Benefits

• Self-host LLMs across devices with a single command

• Use Case - Build a serverless LLM App


A bit about me
DevRel and founding member of WasmEdge; Ambassador of the Cloud Native Computing Foundation

Organize dev meetups/conferences
Wasm Meetups, KCD Shenzhen (Dec 12, 2023), Wasm+Rust Meetups

Give talks
KubeCon, Open Source Summit, Japan DevRelCon, CloudDay Italy, AWS User Group Meetup, Global AI

Write technical blogs and documentation; translate content between Chinese and English

(Photos: community events in China, USA, Singapore, Germany, India, and Japan, including the Taipei KCD and Rust Meetup)
About WasmEdge
● A WebAssembly (Wasm) runtime.
● A highly efficient, lightweight, and safe virtualization solution for:
○ Web browsers and IoT devices
○ Cloud data centers
○ Public blockchains (decentralized cloud)
● Vibrant open source community with over 100 active contributors. Contribute with ease!

The No. 1 WebAssembly runtime among CNCF projects. Co-launched the Wasm landscape.
Partners
Current Solution for Running AI: Python + Docker
● Python
○ Easy: easy to learn and use
○ Big: active community, prosperous ecosystem, rich resources
○ Applicable to all kinds of AI tasks, reaching deep into every aspect of the field
● Docker Container
○ Portability: facilitates distribution
○ Isolation: ensures security and stability
○ Scalability: high resource utilization

Python’s Limitations

(Figure: speedups from performance engineering a program that multiplies two 4096-by-4096 matrices.)

● Performance bottlenecks
● Parallelism: the GIL ensures that only one thread executes Python bytecode at a time within a single process
● Memory management
How about Python + C/C++/Rust?
Portability Issues and Complex Integration

● Maintenance cost
● Portability issues: Python interacts with native libraries and system-level dependencies differently across different environments.
● Integration complexity: binding Python to other languages often requires careful management of data types, memory allocation, and error handling. Even though third-party libraries such as Pybind11 can ease the binding task, this “glue” process is still error-prone and demands a deep understanding of both Python and the other language in use, increasing development time and risk.
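To make this “glue” concrete, here is a minimal sketch of a Rust function exposed to Python, assuming the PyO3 crate (the function and module names are illustrative): every boundary crossing needs explicit type mapping and manual error translation.

use pyo3::prelude::*;
use pyo3::exceptions::PyValueError;

// Each crossing maps Rust types to Python types (Vec<f64> <-> list[float])
// and converts Rust errors into Python exceptions by hand.
#[pyfunction]
fn dot(a: Vec<f64>, b: Vec<f64>) -> PyResult<f64> {
    if a.len() != b.len() {
        return Err(PyValueError::new_err("length mismatch"));
    }
    Ok(a.iter().zip(&b).map(|(x, y)| x * y).sum())
}

// The module definition is more glue that must stay in sync with the
// build tooling (e.g., maturin) and the target Python ABI.
#[pymodule]
fn native_math(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(dot, m)?)?;
    Ok(())
}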
LangChain image from https://deepwisdom.feishu.cn/docx/XDYbdhq7ro0KPexuXUkcB6mynmh#VygBdZS9poOgNfxKoFrcm6Bensh
Limitations of Linux Containers/Docker

● Cold start performance: seconds
● Disk space: GBs
● Hardware accelerator support: specific versions required
● Portability: dependent on CPU architecture
● Security: relies on user permissions of the host OS

Some text quoted from https://wasmedge.org/wasm_linux_container/; image from https://medium.com/@shivraj.jadhav82/webassembly-wasm-docker-vs-wasm-275e317324a1
Rust + WebAssembly
● Performance and memory safety
● Concurrency (see the example below)
● Powerful and expressive type system
● Cargo, a modern package management tool
● Rapidly growing ecosystem: ndarray, llm, candle, burn, …
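As a small std-only illustration of that concurrency story (my own example, not from the WasmEdge docs): the borrow checker only lets each thread take a disjoint slice, so data races are rejected at compile time and there is no GIL-style global lock.

use std::thread;

// Each thread gets a disjoint &mut slice; the compiler rejects any overlap.
fn parallel_square(data: &mut [u64]) {
    let mid = data.len() / 2;
    let (left, right) = data.split_at_mut(mid);
    thread::scope(|s| {
        s.spawn(|| left.iter_mut().for_each(|x| *x *= *x));
        s.spawn(|| right.iter_mut().for_each(|x| *x *= *x));
    });
}

fn main() {
    let mut v: Vec<u64> = (1..=8).collect();
    parallel_square(&mut v);
    println!("{v:?}"); // [1, 4, 9, 16, 25, 36, 49, 64]
}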
WebAssembly: Lighter, Faster, and More Secure

Image from https://medium.com/@shivraj.jadhav82/webassembly-wasm-docker-vs-wasm-275e317324a1
https://wasmedge.org/wasm_linux_container/
The technology path of virtualization

● Hypervisor VMs and microVMs (e.g., AWS Firecracker)
● Application containers (e.g., Docker)

What’s next? Another solution:
● High-level language VMs (e.g., V8, WebAssembly)


○ The sandbox mechanism provides a safer production environment
○ Protects user data and system resources
○ Bytecode verification prevents malicious code
○ Isolated execution environments between Wasm modules

Image from https://medium.com/@shivraj.jadhav82/webassembly-wasm-docker-vs-wasm-275e317324a1
Don’t take our word for it

“WebAssembly is the next wave of cloud computing.”

Source: CNCF Annual Survey 2022.
https://www.cncf.io/wp-content/uploads/2023/09/The-State-of-WebAssembly-2023.pdf
Don’t take our word for it

Solomon Hykes was Docker’s founding CTO and the creator of the Docker container.

Docker’s landmark partnership with WasmEdge in 2022 brought “Wasm containers” to 10 million Docker Desktop users.
WasmEdge was started in 2019.

In summary, old container solutions are too heavyweight and complex for many use cases:
browsers, IoT, AI inference, SaaS plugins, microservices.
Run LLMs on your Mac / Across Devices

Find and download the Llama2 series of models and many more.
Run LLMs on a Mac / across devices with Wasm, in only 4 commands

● Automatic GPU detection
● 20.54 tokens per second on an M1 MacBook, with zero Python dependency
● Inference app: a 2 MB cross-platform binary
1. Install WasmEdge with LLM Support

Begin with a single command to install the WasmEdge runtime with LLM support.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugin wasi_nn-ggml

Or you could download and copy the WasmEdge install files manually, following the installation guide.
2. Download an LLM Chat App in Wasm

Next, get the ultra-small 2 MB cross-platform binary: the LLM chat application. It requires no other dependencies and runs unchanged across environments. This small Wasm file is compiled from Rust. To build your own, check out the llama-utils repo.

curl -LO https://github.com/second-state/llama-utils/raw/main/chat/llama-chat.wasm
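To see what that Rust source boils down to, here is a condensed sketch of the core inference calls, assuming the wasmedge-wasi-nn crate used by the llama-utils and WASI-NN examples (prompt handling and buffer sizes are simplified; see the repo for the real app):

use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn main() {
    // "default" matches the --nn-preload default:GGML:AUTO:<model>.gguf flag
    // passed to the wasmedge CLI in step 4.
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache("default")
        .unwrap();
    let mut ctx = graph.init_execution_context().unwrap();

    // The prompt goes in as a UTF-8 byte tensor.
    let prompt = "What is WebAssembly?";
    ctx.set_input(0, TensorType::U8, &[1], prompt.as_bytes()).unwrap();
    ctx.compute().unwrap();

    // Read the generated text back out of the output tensor.
    let mut out = vec![0u8; 4096];
    let n = ctx.get_output(0, &mut out).unwrap();
    println!("{}", String::from_utf8_lossy(&out[..n]));
}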
3. Download the Llama2 7b Chat Model

curl -LO https://huggingface.co/wasmedge/llama2/resolve/main/llama-2-7b-chat-q5_k_m.gguf

WasmEdge also supports Llama2, CodeLlama, Codeshell, Mistral, MistralLite, TinyLlama, Baichuan, BELLE, Alpaca, Vicuna, OpenChat, Starcoder, OpenBuddy, and many more!
4. Chat with the Llama2 7b Model on the CLI

Now that you have everything set up, you can start chatting with your Llama2-7b-chat-powered LLM on the command line.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm

After that, you can ask Llama2 7b chat any question.

That’s it! You can use the same llama-chat.wasm file to run other LLMs, like OpenChat, CodeLlama, Mistral, etc.
https://www.secondstate.io/run-llm/
LLaMa-2 and WasmEdge on RHEL9

1. Install the NVIDIA driver and CUDA tools and libraries (official repo)
2. dnf install gcc-toolset-12 ninja-build
3. git clone https://github.com/WasmEdge/WasmEdge.git
4. cd WasmEdge
5. Drop into a shell with the proper paths for the optional toolchain: scl enable gcc-toolset-12 bash
6. Build: cmake -GNinja -Bbuild -DCMAKE_BUILD_TYPE=Release -DWASMEDGE_PLUGIN_WASI_NN_GGML_LLAMA_CUBLAS="yes" -DWASMEDGE_PLUGIN_WASI_NN_BACKEND="ggml" && cmake --build build
7. Install: sudo cmake --install build
8. Now you can download the model and the chat app, and run them according to the original instructions

https://pushf.substack.com/p/llama-2-and-wasmedge-on-rhel9-aka
Beyond Language AI
Not limited to LLM tasks: vision and audio as well. Besides the ggml backend, the WasmEdge Runtime supports the PyTorch, TensorFlow, and OpenVINO AI frameworks.

AI inference for real-time data streams with WasmEdge and YoMo on edge devices: high performance, small size, and AI support.
Supports popular OpenCV and FFmpeg libraries for image processing.

Discover how you can apply vision and audio AI with projects like mediapipe-rs (see the sketch below).

• Mediapipe-rs: a Rust library for MediaPipe tasks
• Source code: Mediapipe-rs GitHub
• Tutorial: Mediapipe solutions
• More WASI-NN examples: https://github.com/second-state/WasmEdge-WASINN-examples

Wasm support for AI inference
https://github.com/WasmEdge/mediapipe-rs
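As a taste of the mediapipe-rs API, here is an image-classification sketch adapted from the project’s examples (the model and image paths are placeholders, and the exact builder methods may differ across versions):

use mediapipe_rs::tasks::vision::ImageClassifierBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder paths: use a TFLite model from the MediaPipe model zoo.
    let (model_path, img_path) = ("classifier.tflite", "cat.jpg");

    let result = ImageClassifierBuilder::new()
        .model_asset_path(model_path)        // set the model path
        .max_results(3)                      // keep the top 3 categories
        .finalize()?                         // create the image classifier task
        .classify(&image::open(img_path)?)?; // run inference on one image

    println!("{}", result);
    Ok(())
}

Compiled to Wasm, the same binary runs on WasmEdge across CPU architectures.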
Build Serverless AI Apps

- LLM libraries for Rust developers: ChatGPT, Claude, the Llama2 series, and more
- SaaS integrations like GitHub, Discord, Telegram, Slack, etc.

Serverless:
☑ You write only the business logic (see the sketch below)
❎ No manual compilation or deployment of Rust functions
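A minimal sketch of what “business logic only” means in practice. This is not the real flows.network API: llm_chat and on_discord_message are hypothetical stand-ins for platform-provided bindings, which the platform compiles, deploys, and wires to events for you.

// Hypothetical stub standing in for a platform-provided LLM client.
fn llm_chat(prompt: &str) -> String {
    format!("(LLM answer to: {prompt})")
}

// The only code the developer writes: business logic from event to reply.
fn on_discord_message(msg: &str) -> String {
    llm_chat(&format!("Answer as a helpful Rust tutor: {msg}"))
}

fn main() {
    // Local smoke test; in production the platform invokes the handler.
    println!("{}", on_discord_message("What is the borrow checker?"));
}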
Learn Rust Bot (RAG-based)

https://flows.network/learn-rust




1. Create embeddings for your data

1. Prepare data. Split your docs into several chunks, then save them as a .txt file. (There are several algorithms for splitting your docs.)

2. Fork the RAG-embeddings demo repo and deploy its code on flows.network to get a webhook that can embed and store your data in a vector DB.

3. Upload your prepared text chunks to the vector DB and name your data with a single command line.
2. Build the LLM agent with the embeddings

Import the RAG-discord-bot demo repo into flows.network for deployment.

Then configure five environment variables.


Code Review Bot

“What happened to the programmers who had ChatGPT review their code?” (article, in Chinese)



1. Load the code review bot template in flows.network. The template contains the source code for the bot itself. We will clone the source code to your own GitHub account so that you can modify and customize it later. Click on Create and Deploy.

2. Give the bot your OpenAI API key.

3. Authorize the bot’s access to GitHub.

https://www.cncf.io/blog/2023/06/06/a-chatgpt-powered-code-reviewer-bot-for-open-source-projects/
Code Review Bot
Keep in Touch

https://github.com/WasmEdge/WasmEdge

Connect with me on GitHub/Twitter: @mileyfu

Join the WasmEdge Discord: https://discord.com/invite/U4B5sFTkFc
