BentoML

Inference Optimization Engineer


Full-time · Remote · North America, Asia, Europe · $200k - $300k

About this role


Role:
As an Inference Optimization Engineer, you will improve the speed and efficiency
of large language models at the GPU kernel level, through the inference engine,
and across distributed architectures. You will profile real workloads, remove
bottlenecks, and lift each layer of the stack to new performance ceilings. Every
gain you unlock will flow straight into open source code and power fleets of
production models, cutting GPU costs for teams around the world. By publishing
blog posts and giving conference talks you will become a trusted voice on
efficient LLM inference at large scale.

Example projects:
https://bentoml.com/blog/structured-decoding-in-vllm-a-gentle-introduction
https://www.bentoml.com/blog/benchmarking-llm-inference-backends
https://bentoml.com/blog/25x-faster-cold-starts-for-llms-on-kubernetes

Responsibilities:
Latency & throughput: Identify bottlenecks and optimize inference efficiency in
single-GPU, multi-GPU, and multi-node serving setups.
Benchmarking: Build repeatable tests that model production traffic; track and report
performance across vLLM, SGLang, TRT-LLM, and future runtimes (see the sketch after
this list).
Resource efficiency: Reduce memory use and compute cost with mixed precision,
better KV-cache handling, quantization, and speculative decoding.
Serving features: Improve batching, caching, load balancing, and model-parallel
execution.
Knowledge sharing: Write technical posts, contribute code, and present findings to
the open-source community.
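
To make the benchmarking work above concrete, here is a minimal sketch (not part of the
original posting) of a latency/throughput probe against an OpenAI-compatible completions
endpoint such as the one vLLM or SGLang serves; the endpoint URL, model id, and prompts
are illustrative assumptions.

# Minimal latency/throughput probe against an OpenAI-compatible
# /v1/completions endpoint (e.g. a local vLLM or SGLang server).
# The URL, model id, and prompts below are illustrative placeholders.
import time
import requests

ENDPOINT = "http://localhost:8000/v1/completions"   # assumed local server
MODEL = "meta-llama/Llama-3.1-8B-Instruct"           # placeholder model id
PROMPTS = ["Summarize the benefits of KV-cache reuse."] * 8

def run_once(prompt):
    """Send one completion request; return (latency_s, completion_tokens)."""
    start = time.perf_counter()
    resp = requests.post(
        ENDPOINT,
        json={"model": MODEL, "prompt": prompt, "max_tokens": 128},
        timeout=120,
    )
    resp.raise_for_status()
    latency = time.perf_counter() - start
    tokens = resp.json()["usage"]["completion_tokens"]
    return latency, tokens

if __name__ == "__main__":
    results = [run_once(p) for p in PROMPTS]
    total_time = sum(lat for lat, _ in results)
    total_tokens = sum(tok for _, tok in results)
    print(f"avg latency: {total_time / len(results):.2f}s")
    print(f"throughput:  {total_tokens / total_time:.1f} tokens/s (sequential)")

A production harness would add concurrency, warm-up requests, and percentile latency
tracking; this sketch only illustrates the shape of a repeatable test.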

Qualifications:
Deep understanding of transformer architecture and inference engine internals.
Hands-on experience speeding up model serving through batching, caching, and load
balancing.
Experienced with inference engines such as vLLM, SGLang, or TRT-LLM (upstream
contributions are a plus).
Experienced with inference optimization techniques: quantization, distillation,
speculative decoding, or similar (see the sketch after this list).
Proficiency in CUDA and use of profiling tools like Nsight, nvprof, or CUPTI.
Proficiency in Triton and ROCm is a bonus.
Track record of blog posts, conference talks, or open-source projects in ML systems
is a bonus.
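
As an illustration of one optimization technique named above, here is a minimal sketch
(not from the posting) of serving a pre-quantized AWQ checkpoint with vLLM's offline
API; the model id and settings are assumptions, not a prescribed setup.

# Minimal sketch: loading a pre-quantized (AWQ) checkpoint with vLLM's
# offline API to reduce weight memory, leaving more room for the KV cache.
# The model id and sampling settings are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",     # placeholder AWQ checkpoint
    quantization="awq",                  # select the 4-bit AWQ kernels
    gpu_memory_utilization=0.90,         # fraction of GPU memory vLLM may use
)

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Explain speculative decoding in one sentence."], params)
print(outputs[0].outputs[0].text)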

Why join us:


Direct impact – your optimizations ship straight into production fleets and cut real
GPU costs.
Technical scope – operate distributed LLM inference and large GPU clusters worldwide.
Customer reach – support organizations around the globe that rely on BentoML.
Influence – mentor teammates, guide open-source contributors, and become a go-
to voice on efficient inference in the community.
Remote work – work from where you are most productive and collaborate with
teammates in North America and Asia.
Compensation – competitive salary, equity, learning budget, and paid conference
travel.

1+ years of experience
working with inference engines or inference optimization techniques for
transformer-based models

Salary
$200k - $300k

Equity
1.0-2.0%

Remote work policy


Remote from anywhere in the world

Full-time position

Location
North America, Asia, Europe

Report to
https://www.linkedin.com/in/ssheng/

Tech stack
Python, CRUD

About BentoML
BentoML is an enterprise-grade InferenceOps platform for deploying and managing AI
models at scale. It offers full control without the complexity, allowing teams to serve
any model, including LLMs, embeddings, and agentic pipelines, across VPC, on-prem,
or hybrid environments with tailored optimization, advanced orchestration, and fine-
grained performance tuning.

From prototype to production, BentoML covers the full inference lifecycle with instant
model deployments, elastic autoscaling, built-in observability, compliance-ready
features, and mission-critical reliability, freeing your team to deliver AI that drives real
business outcomes faster.

Team size: 15 people
Founded: 2019

Website: www.bentoml.com

Company locations
San Francisco, California

About the team


Chaoyu Yang: Founder & CEO
Sean Sheng: Head of Engineering


Interview process
1 Initial Screen (1 hour)

2 Virtual Onsite (4 hours)
