01 / Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start
In submission · X Liu*, Y Wu*, Y Yao, D Zhuo, I Stoica, ZM Mao
Persists CUDA graph topology and execution context offline, then reconstructs executable graphs online with negligible overhead.
- Persists CUDA graph topology and execution context offline.
- Reconstructs executable graphs online with negligible overhead (capture/replay sketch below).
- Reduces LLM serving cold-start latency by up to 99%.
#LLM Serving · #CUDA Graphs · #Cold Start · #Context Materialization
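A minimal PyTorch sketch of the CUDA graph capture/replay pattern this work builds on; the model, shapes, and warm-up loop are placeholders, and Foundry's offline persistence and online reconstruction of the captured context are not shown.

```python
import torch

# Illustrative capture/replay only; assumes a CUDA device. Foundry's offline
# persistence and online reconstruction of this captured context are not shown.
model = torch.nn.Linear(4096, 4096).cuda().eval()
static_input = torch.zeros(8, 4096, device="cuda")

# Warm up on a side stream before capture, as CUDA graph capture requires.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side)

# Capture: this step (plus allocator/context setup) dominates cold start.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay: copy new data into the static buffer, then relaunch the whole graph.
static_input.copy_(torch.randn(8, 4096, device="cuda"))
graph.replay()  # one cheap launch instead of re-capturing on every cold start
```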
02 / RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs
NSDI 2026 · Y Wu*, X Liu*, H Zheng, J Gu, B Chen, ZM Mao, A Krishnamurthy, I Stoica
Offloads rollout workloads to fragmented preemptible resources to reduce LLM RL cost and improve utilization.
- Designs a rollout system that adaptively offloads LLM RL workloads to preemptible instances (scheduling sketch below).
- Reduces LLM RL cost by up to 49%.
- Improves utilization of fragmented cloud resources.
#LLM RL · #Spot Instances · #Kubernetes · #Cost Efficiency
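A hypothetical scheduling sketch, not RLBoost's implementation: rollout tasks go to cheap preemptible workers first and fall back to a stable worker whenever a preemption interrupts a task. The worker names, preemption probability, and run_rollout helper are made up for illustration.

```python
import random
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    preemptible: bool

def run_rollout(prompt: str, worker: Worker) -> bool:
    """Pretend to run one rollout; preemptible workers may be reclaimed mid-task."""
    preempted = worker.preemptible and random.random() < 0.2
    return not preempted

def schedule(prompts, spot_pool, stable_pool):
    placements = []
    for prompt in prompts:
        worker = random.choice(spot_pool or stable_pool)  # prefer cheap capacity
        if not run_rollout(prompt, worker):
            worker = random.choice(stable_pool)           # retry so training never stalls
            run_rollout(prompt, worker)
        placements.append((prompt, worker.name))
    return placements

if __name__ == "__main__":
    spot = [Worker("spot-0", True), Worker("spot-1", True)]
    stable = [Worker("a100-0", False)]
    print(schedule([f"prompt-{i}" for i in range(4)], spot, stable))
```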
03 / HeterMoE: Efficient Training of Mixture-of-Experts Models on Heterogeneous GPUs
In submission · Y Wu*, X Liu*, S Jin, C Xu, F Qian, ZM Mao, M Lentz, D Zhuo, I Stoica
Assigns MoE components across mixed GPU generations with zebra parallelism and asymmetric expert placement.
- Disaggregates MoE models across heterogeneous GPU generations.
- Assigns experts to older GPUs such as V100 and T4 while using newer GPUs for attention (placement sketch below).
- Uses zebra parallelism and asymmetric expert assignment for fine-grained load balancing.
#LLM Training · #MoE · #Heterogeneous GPUs · #DeepSpeed
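A placement-only sketch, assuming two visible CUDA devices: attention and routing stay on a newer GPU while expert FFNs run on an older one, with activations shipped between them. The paper's zebra parallelism and load balancing are not modeled here.

```python
import torch
import torch.nn as nn

# Assumes two visible devices; in a real heterogeneous setup cuda:0 would be
# the newer GPU and cuda:1 the older one (e.g. V100/T4).
ATTN_DEV, EXPERT_DEV = "cuda:0", "cuda:1"

attn = nn.MultiheadAttention(embed_dim=1024, num_heads=16, batch_first=True).to(ATTN_DEV)
experts = nn.ModuleList(
    [nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)) for _ in range(4)]
).to(EXPERT_DEV)
router = nn.Linear(1024, len(experts)).to(ATTN_DEV)

def moe_layer(x):                              # x: [batch, seq, 1024] on ATTN_DEV
    h, _ = attn(x, x, x)                       # attention stays on the newer GPU
    expert_idx = router(h).argmax(dim=-1)      # top-1 routing for simplicity
    h_old = h.to(EXPERT_DEV)                   # ship activations to the older GPU
    out = torch.zeros_like(h_old)
    for i, expert in enumerate(experts):       # expert FFNs run on the older GPU
        mask = (expert_idx == i).to(EXPERT_DEV)
        if mask.any():
            out[mask] = expert(h_old[mask])
    return out.to(ATTN_DEV)
```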
04 / Plato: Plan to Efficiently Decode for Large Language Model Inference
COLM 2025 · S Jin*, X Liu*, Y Wu, H Zheng, Q Zhang, M Lentz, ZM Mao, A Prakash, F Qian, D Zhuo
Decomposes complex queries into dependency graphs to accelerate generation through context-aware parallel decoding.
- Decomposes complex queries into a dependency graph.
- Accelerates generation through context-aware parallel decoding and structured decoding (parallel-decoding sketch below).
#LLM Inference · #Parallel Decoding · #Structured Decoding · #KV Cache
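A sketch of the dependency-graph idea with a stand-in generate() instead of a real LLM call: sub-queries whose dependencies are satisfied decode in parallel, and each node's prompt is assembled from its parents' outputs. The function and node names are illustrative, not Plato's API.

```python
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> str:          # placeholder for an LLM decode call
    return f"<answer to: {prompt}>"

def plan_and_decode(nodes: dict[str, str], deps: dict[str, list[str]]) -> dict[str, str]:
    done: dict[str, str] = {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(nodes):
            # Nodes whose parents have all finished can decode concurrently.
            ready = [n for n in nodes if n not in done and all(d in done for d in deps.get(n, []))]
            prompts = [
                "\n".join(done[d] for d in deps.get(n, [])) + "\n" + nodes[n] for n in ready
            ]
            for n, out in zip(ready, pool.map(generate, prompts)):
                done[n] = out
    return done

answers = plan_and_decode(
    nodes={"a": "List the ingredients.", "b": "List the tools.", "c": "Write the recipe steps."},
    deps={"c": ["a", "b"]},  # "c" waits for both; "a" and "b" decode in parallel
)
print(answers)
```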
05 / Compute Or Load KV Cache? Why Not Both? (CAKE)
ICML 2025 · S Jin*, X Liu*, Q Zhang, ZM Mao
Reduces long-context prefill latency by overlapping bidirectional KV-cache computation and I/O.
- Reduces LLM prefill latency on long-context inputs.
- Generates the KV cache bidirectionally, overlapping GPU computation with cache-loading I/O (sketch below).
- Built on top of vLLM and LMCache.
#LLM Inference · #KV Cache · #Long Context · #vLLM · #LMCache
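A simplified threading sketch of the bidirectional overlap: one worker computes KV chunks from the front of the prompt while another loads precomputed chunks from the back, and they stop where they meet. compute_kv and load_kv are placeholders, and the vLLM/LMCache integration is not shown.

```python
import threading

def compute_kv(i):      # stands in for GPU prefill of prompt chunk i
    return f"computed[{i}]"

def load_kv(i):         # stands in for fetching chunk i's KV cache from storage
    return f"loaded[{i}]"

def bidirectional_prefill(num_chunks: int):
    kv = [None] * num_chunks
    front, back = 0, num_chunks - 1
    lock = threading.Lock()

    def compute_worker():               # claims chunks from the front
        nonlocal front
        while True:
            with lock:
                if front > back:
                    return
                i, front = front, front + 1
            kv[i] = compute_kv(i)

    def load_worker():                  # claims chunks from the back
        nonlocal back
        while True:
            with lock:
                if back < front:
                    return
                i, back = back, back - 1
            kv[i] = load_kv(i)

    workers = [threading.Thread(target=compute_worker),
               threading.Thread(target=load_worker)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return kv

print(bidirectional_prefill(8))
```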
06 / Learn to Be Efficient: Build Structured Sparsity in Large Language Models (LTE)
NeurIPS 2024 Spotlight · H Zheng, X Bai, X Liu, ZM Mao, B Chen, F Lai, A Prakash
Trains LLMs to activate fewer neurons while maintaining accuracy, backed by efficient sparse FFN kernels.
- Trains LLMs to activate fewer neurons through structured sparsity while maintaining accuracy.
- Builds an efficient Triton/CUDA gather-scatter MLP kernel (dense-PyTorch sketch of the idea below).
- Achieves speedup that scales near-linearly with the sparsity level.
#LLM Efficiency · #Structured Sparsity · #MoE · #Gather-Scatter · #Triton
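A dense-PyTorch sketch of the gather-scatter FFN idea; the paper's fused Triton/CUDA kernel and learned routing are replaced here by plain indexing and a random choice of active neurons.

```python
import torch

torch.manual_seed(0)
d_model, d_ff, sparsity = 1024, 4096, 0.9
w1 = torch.randn(d_ff, d_model) / d_model**0.5   # up projection
w2 = torch.randn(d_model, d_ff) / d_ff**0.5      # down projection
x = torch.randn(4, d_model)

# A trained predictor would choose the active neurons; random here for illustration.
active = torch.randperm(d_ff)[: int(d_ff * (1 - sparsity))]

h = torch.relu(x @ w1[active].T)   # gather: only the active rows of w1 are used
y = h @ w2[:, active].T            # scatter back through the active columns of w2
print(y.shape)                     # torch.Size([4, 1024]); FLOPs scale with #active neurons
```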
07 / mm2-gb: GPU Accelerated Minimap2 for Long Read DNA Mapping
ACM BCB 2024 Oral · J Dong*, X Liu*, H Sadasivan, S Sitaraman, S Narayanasamy
Extends minimap2 with an AMD GPU-accelerated chaining kernel for irregular ultra-long DNA read workloads.
- Extends minimap2-v2.24 with an AMD GPU-accelerated chaining kernel.
- Uses HIP and persistent kernels to tackle extremely irregular ultra-long DNA read workloads (simplified chaining sketch below).
#GPU · #DNA Mapping · #Minimap2 · #HPC · #Persistent Kernel
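A simplified CPU reference of anchor chaining, with scoring reduced from minimap2's actual gap model: each anchor scans a variable-length predecessor window, which is the irregular, data-dependent work the HIP persistent kernel parallelizes on the GPU.

```python
def chain(anchors, w=15, max_dist=5000):
    """anchors: list of (ref_pos, query_pos), sorted by ref_pos. Simplified scoring."""
    f = [w] * len(anchors)             # best chain score ending at each anchor
    parent = [-1] * len(anchors)
    for i, (ri, qi) in enumerate(anchors):
        for j in range(i - 1, -1, -1):             # predecessor window varies per anchor
            rj, qj = anchors[j]
            dr, dq = ri - rj, qi - qj
            if dr > max_dist:                      # earlier anchors are too far on the reference
                break
            if dq <= 0 or dq > max_dist:           # predecessor must precede on the query too
                continue
            score = f[j] + min(dr, dq, w) - abs(dr - dq) // 2   # simplified gap penalty
            if score > f[i]:
                f[i], parent[i] = score, j
    return f, parent

anchors = [(100, 50), (130, 80), (135, 90), (700, 620)]
print(chain(anchors))
```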