Keith (Seiji) Eicher
Talks
Ray + vLLM: Efficient Multi-Node Orchestration for Sparse MoE Model Serving
How Ray Serve and vLLM enable efficient, scalable serving of Mixture-of-Experts models, including optimizations for KV-cache usage and prefill/decode disaggregation.
Building Open AI Agents: Inference, Memory & Coding Systems
Lightning talks on making LLMs efficient and designing scalable agents, covering CLI agent benchmarking, sparse MoE model serving, disaggregated inference, and more.
Writing
vLLM Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP
Optimizations enabling DeepSeek models to achieve 2.2k tokens per second per H200 GPU, including Wide Expert Parallelism, Dual-Batch Overlap, and disaggregated prefill/decode serving.
Ray Serve LLM on Anyscale: APIs for Wide-EP and Disaggregated Serving with vLLM
New Ray Serve LLM APIs for deploying advanced serving patterns, with support for wide expert parallelism and prefill/decode disaggregation for sparse Mixture-of-Experts models.
Ray Serve: Reduce LLM Inference Latency by 60% with Custom Request Routing
Introduction of PrefixCacheAffinityRouter, a custom request router that leverages prefix caching to achieve a 60% reduction in time-to-first-token.
Deploy DeepSeek-R1 with vLLM and Ray Serve on Kubernetes
Guide to deploying the DeepSeek-R1 model on Kubernetes using vLLM and Ray Serve, covering managed and self-managed deployment pathways.