01 / Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start
In submission · X Liu*, Y Wu*, Y Yao, D Zhuo, I Stoica, ZM Mao
Persists CUDA graph topology and execution context offline, then reconstructs executable graphs online with negligible overhead.
- Persists CUDA graph topology and execution context offline.
- Reconstructs executable graphs online with negligible overhead (capture/replay sketch below).
- Reduces LLM serving cold-start latency by up to 99%.
#LLM Serving · #CUDA Graphs · #Cold Start · #Context Materialization
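A minimal PyTorch sketch of the CUDA graph capture/replay pattern this work builds on; the model, shapes, and warm-up loop are placeholders, and Foundry's offline persistence and online reconstruction of the captured context are not shown.

```python
import torch

# Illustrative capture/replay only; assumes a CUDA device. Foundry's offline
# persistence and online reconstruction of this captured context are not shown.
model = torch.nn.Linear(4096, 4096).cuda().eval()
static_input = torch.zeros(8, 4096, device="cuda")

# Warm up on a side stream before capture, as CUDA graph capture requires.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side)

# Capture: this step (plus allocator/context setup) dominates cold start.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay: copy new data into the static buffer, then relaunch the whole graph.
static_input.copy_(torch.randn(8, 4096, device="cuda"))
graph.replay()  # one cheap launch instead of re-capturing on every cold start
```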
02 / RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs
NSDI 2026 · Y Wu*, X Liu*, H Zheng, J Gu, B Chen, ZM Mao, A Krishnamurthy, I Stoica
Offloads rollout workloads to fragmented preemptible resources to reduce LLM RL cost and improve utilization.
- Designs a rollout system that adaptively offloads LLM RL workloads to preemptible instances (scheduling sketch below).
- Reduces LLM RL cost by up to 49%.
- Improves utilization of fragmented cloud resources.
#LLM RL · #Spot Instances · #Kubernetes · #Cost Efficiency
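A hypothetical scheduling sketch, not RLBoost's implementation: rollout tasks go to cheap preemptible workers first and fall back to a stable worker whenever a preemption interrupts a task. The worker names, preemption probability, and run_rollout helper are made up for illustration.

```python
import random
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    preemptible: bool

def run_rollout(prompt: str, worker: Worker) -> bool:
    """Pretend to run one rollout; preemptible workers may be reclaimed mid-task."""
    preempted = worker.preemptible and random.random() < 0.2
    return not preempted

def schedule(prompts, spot_pool, stable_pool):
    placements = []
    for prompt in prompts:
        worker = random.choice(spot_pool or stable_pool)  # prefer cheap capacity
        if not run_rollout(prompt, worker):
            worker = random.choice(stable_pool)           # retry so training never stalls
            run_rollout(prompt, worker)
        placements.append((prompt, worker.name))
    return placements

if __name__ == "__main__":
    spot = [Worker("spot-0", True), Worker("spot-1", True)]
    stable = [Worker("a100-0", False)]
    print(schedule([f"prompt-{i}" for i in range(4)], spot, stable))
```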
03 / HeterMoE: Efficient Training of Mixture-of-Experts Models on Heterogeneous GPUs
In submission · Y Wu*, X Liu*, S Jin, C Xu, F Qian, ZM Mao, M Lentz, D Zhuo, I Stoica
Assigns MoE components across mixed GPU generations with zebra parallelism and asymmetric expert placement.
- Disaggregates MoE models across heterogeneous GPU generations.
- Assigns experts to older GPUs such as V100 and T4 while using newer GPUs for attention (placement sketch below).
- Uses zebra parallelism and asymmetric expert assignment for fine-grained load balancing.
#LLM Training · #MoE · #Heterogeneous GPUs · #DeepSpeed
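A placement-only sketch, assuming two visible CUDA devices: attention and routing stay on a newer GPU while expert FFNs run on an older one, with activations shipped between them. The paper's zebra parallelism and load balancing are not modeled here.

```python
import torch
import torch.nn as nn

# Assumes two visible devices; in a real heterogeneous setup cuda:0 would be
# the newer GPU and cuda:1 the older one (e.g. V100/T4).
ATTN_DEV, EXPERT_DEV = "cuda:0", "cuda:1"

attn = nn.MultiheadAttention(embed_dim=1024, num_heads=16, batch_first=True).to(ATTN_DEV)
experts = nn.ModuleList(
    [nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)) for _ in range(4)]
).to(EXPERT_DEV)
router = nn.Linear(1024, len(experts)).to(ATTN_DEV)

def moe_layer(x):                              # x: [batch, seq, 1024] on ATTN_DEV
    h, _ = attn(x, x, x)                       # attention stays on the newer GPU
    expert_idx = router(h).argmax(dim=-1)      # top-1 routing for simplicity
    h_old = h.to(EXPERT_DEV)                   # ship activations to the older GPU
    out = torch.zeros_like(h_old)
    for i, expert in enumerate(experts):       # expert FFNs run on the older GPU
        mask = (expert_idx == i).to(EXPERT_DEV)
        if mask.any():
            out[mask] = expert(h_old[mask])
    return out.to(ATTN_DEV)
```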
04 / Plato: Plan to Efficiently Decode for Large Language Model Inference
COLM 2025 · S Jin*, X Liu*, Y Wu, H Zheng, Q Zhang, M Lentz, ZM Mao, A Prakash, F Qian, D Zhuo
Decomposes complex queries into dependency graphs to accelerate generation through context-aware parallel decoding.
- Decomposes complex queries into a dependency graph.
- Accelerates generation through context-aware parallel decoding and structured decoding (parallel-decoding sketch below).
#LLM Inference · #Parallel Decoding · #Structured Decoding · #KV Cache
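A sketch of the dependency-graph idea with a stand-in generate() instead of a real LLM call: sub-queries whose dependencies are satisfied decode in parallel, and each node's prompt is assembled from its parents' outputs. The function and node names are illustrative, not Plato's API.

```python
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> str:          # placeholder for an LLM decode call
    return f"<answer to: {prompt}>"

def plan_and_decode(nodes: dict[str, str], deps: dict[str, list[str]]) -> dict[str, str]:
    done: dict[str, str] = {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(nodes):
            # Nodes whose parents have all finished can decode concurrently.
            ready = [n for n in nodes if n not in done and all(d in done for d in deps.get(n, []))]
            prompts = [
                "\n".join(done[d] for d in deps.get(n, [])) + "\n" + nodes[n] for n in ready
            ]
            for n, out in zip(ready, pool.map(generate, prompts)):
                done[n] = out
    return done

answers = plan_and_decode(
    nodes={"a": "List the ingredients.", "b": "List the tools.", "c": "Write the recipe steps."},
    deps={"c": ["a", "b"]},  # "c" waits for both; "a" and "b" decode in parallel
)
print(answers)
```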
05 / Compute Or Load KV Cache? Why Not Both? (CAKE)
ICML 2025 · S Jin*, X Liu*, Q Zhang, ZM Mao
Reduces long-context prefill latency by overlapping bidirectional KV-cache computation and I/O.
- Reduces LLM prefill latency on long-context inputs.
- Generates the KV cache bidirectionally, overlapping GPU computation with cache-loading I/O (sketch below).
- Built on top of vLLM and LMCache.
#LLM Inference · #KV Cache · #Long Context · #vLLM · #LMCache
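A simplified threading sketch of the bidirectional overlap: one worker computes KV chunks from the front of the prompt while another loads precomputed chunks from the back, and they stop where they meet. compute_kv and load_kv are placeholders, and the vLLM/LMCache integration is not shown.

```python
import threading

def compute_kv(i):      # stands in for GPU prefill of prompt chunk i
    return f"computed[{i}]"

def load_kv(i):         # stands in for fetching chunk i's KV cache from storage
    return f"loaded[{i}]"

def bidirectional_prefill(num_chunks: int):
    kv = [None] * num_chunks
    front, back = 0, num_chunks - 1
    lock = threading.Lock()

    def compute_worker():               # claims chunks from the front
        nonlocal front
        while True:
            with lock:
                if front > back:
                    return
                i, front = front, front + 1
            kv[i] = compute_kv(i)

    def load_worker():                  # claims chunks from the back
        nonlocal back
        while True:
            with lock:
                if back < front:
                    return
                i, back = back, back - 1
            kv[i] = load_kv(i)

    workers = [threading.Thread(target=compute_worker),
               threading.Thread(target=load_worker)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return kv

print(bidirectional_prefill(8))
```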
06 / Learn to Be Efficient: Build Structured Sparsity in Large Language Models (LTE)
NeurIPS 2024 Spotlight · H Zheng, X Bai, X Liu, ZM Mao, B Chen, F Lai, A Prakash
Trains LLMs to activate fewer neurons while maintaining accuracy, backed by efficient sparse FFN kernels.
- Trains LLMs to activate fewer neurons through structured sparsity while maintaining accuracy.
- Builds an efficient Triton/CUDA gather-scatter MLP kernel (dense-PyTorch sketch of the idea below).
- Achieves speedup that scales near-linearly with the sparsity level.
#LLM Efficiency · #Structured Sparsity · #MoE · #Gather-Scatter · #Triton
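A dense-PyTorch sketch of the gather-scatter FFN idea; the paper's fused Triton/CUDA kernel and learned routing are replaced here by plain indexing and a random choice of active neurons.

```python
import torch

torch.manual_seed(0)
d_model, d_ff, sparsity = 1024, 4096, 0.9
w1 = torch.randn(d_ff, d_model) / d_model**0.5   # up projection
w2 = torch.randn(d_model, d_ff) / d_ff**0.5      # down projection
x = torch.randn(4, d_model)

# A trained predictor would choose the active neurons; random here for illustration.
active = torch.randperm(d_ff)[: int(d_ff * (1 - sparsity))]

h = torch.relu(x @ w1[active].T)   # gather: only the active rows of w1 are used
y = h @ w2[:, active].T            # scatter back through the active columns of w2
print(y.shape)                     # torch.Size([4, 1024]); FLOPs scale with #active neurons
```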
07 / mm2-gb: GPU Accelerated Minimap2 for Long Read DNA Mapping
ACM BCB 2024 Oral · J Dong*, X Liu*, H Sadasivan, S Sitaraman, S Narayanasamy
Extends minimap2 with an AMD GPU-accelerated chaining kernel for irregular ultra-long DNA read workloads.
- Extends minimap2-v2.24 with an AMD GPU-accelerated chaining kernel.
- Uses HIP and persistent kernels to tackle extremely irregular ultra-long DNA read workloads (simplified chaining sketch below).
#GPU · #DNA Mapping · #Minimap2 · #HPC · #Persistent Kernel
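A simplified CPU reference of anchor chaining, with scoring reduced from minimap2's actual gap model: each anchor scans a variable-length predecessor window, which is the irregular, data-dependent work the HIP persistent kernel parallelizes on the GPU.

```python
def chain(anchors, w=15, max_dist=5000):
    """anchors: list of (ref_pos, query_pos), sorted by ref_pos. Simplified scoring."""
    f = [w] * len(anchors)             # best chain score ending at each anchor
    parent = [-1] * len(anchors)
    for i, (ri, qi) in enumerate(anchors):
        for j in range(i - 1, -1, -1):             # predecessor window varies per anchor
            rj, qj = anchors[j]
            dr, dq = ri - rj, qi - qj
            if dr > max_dist:                      # earlier anchors are too far on the reference
                break
            if dq <= 0 or dq > max_dist:           # predecessor must precede on the query too
                continue
            score = f[j] + min(dr, dq, w) - abs(dr - dq) // 2   # simplified gap penalty
            if score > f[i]:
                f[i], parent[i] = score, j
    return f, parent

anchors = [(100, 50), (130, 80), (135, 90), (700, 620)]
print(chain(anchors))
```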