Keith (Seiji) Eicher

Talks

Ray + vLLM: Efficient Multi-Node Orchestration for Sparse MoE Model Serving

Ray Summit 2025 • November 18, 2025

How Ray Serve and vLLM enable efficient, scalable serving of Mixture-of-Experts models, including optimizations for KV-cache usage and prefill/decode disaggregation.

Building Open AI Agents: Inference, Memory & Coding Systems

Open Source AI Week 2025 • October 23, 2025

Lightning talks on making LLMs efficient and designing scalable agents, covering CLI agent benchmarking, sparse MoE model serving, disaggregated inference, and more.

Writing

vLLM Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP

December 17, 2025

Optimizations enabling DeepSeek models to achieve 2.2k tokens per second per H200 GPU, including Wide Expert Parallelism, Dual-Batch Overlap, and disaggregated prefill/decode serving.

Ray Serve LLM on Anyscale: APIs for Wide-EP and Disaggregated Serving with vLLM

November 26, 2025

New Ray Serve LLM APIs for deploying advanced LLM serving patterns, with support for wide expert parallelism and prefill/decode disaggregation for sparse mixture-of-experts models.

Ray Serve: Reduce LLM Inference Latency by 60% with Custom Request Routing

September 15, 2025

Introduction of PrefixCacheAffinityRouter, a custom request routing mechanism that leverages prefix caching to achieve a 60% reduction in time-to-first-token.

Deploy DeepSeek-R1 with vLLM and Ray Serve on Kubernetes

August 11, 2025

Guide to deploying the DeepSeek-R1 model on Kubernetes using vLLM and Ray Serve, covering managed and self-managed deployment pathways.
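For context, a minimal sketch of what the Serve side of such a deployment can look like, assuming Ray Serve LLM's LLMConfig and build_openai_app APIs. The model IDs, parallelism sizes, and autoscaling values are illustrative placeholders rather than the guide's recommended settings, and on Kubernetes this application would typically be referenced from a KubeRay RayService manifest instead of being run directly.

```python
# Illustrative sketch: serving DeepSeek-R1 with Ray Serve LLM on top of vLLM.
# Parallelism, context length, and autoscaling values below are placeholders.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="deepseek-r1",                  # name exposed via the OpenAI-compatible API
        model_source="deepseek-ai/DeepSeek-R1",  # model weights to download
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
    # vLLM engine arguments; multi-GPU / multi-node sharding is configured here.
    engine_kwargs=dict(
        tensor_parallel_size=8,
        pipeline_parallel_size=2,
        max_model_len=16384,
    ),
)

# Build an OpenAI-compatible Serve application and run it on the Ray cluster.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```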