Research & Projects

Work across LLM serving, GPU kernels, heterogeneous MoE training, elastic RL systems, KV-cache runtime design, and irregular GPU workloads.


01 / Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start

In submission · X Liu*, Y Wu*, Y Yao, D Zhuo, I Stoica, ZM Mao

Persists CUDA graph topology and execution context offline, then reconstructs executable graphs online with negligible overhead.

  • Captures and serializes CUDA graph topology and execution context ahead of time.
  • Rebuilds executable graphs at serving time with negligible overhead.
  • Reduces LLM serving cold-start latency by up to 99% (see the capture/replay sketch below).
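
A minimal sketch of the capture/replay cycle Foundry amortizes, using only the public torch.cuda.CUDAGraph API. Foundry's offline serialization and template materialization are not public, so only the standard capture path is shown; the model and shapes are placeholders.

```python
# Standard CUDA graph capture/replay in PyTorch. Capture is the
# expensive cold-start step that Foundry moves offline.
import torch

assert torch.cuda.is_available(), "CUDA graphs need a GPU"

model = torch.nn.Linear(4096, 4096).cuda().eval()
static_in = torch.zeros(8, 4096, device="cuda")

# Warm up on a side stream so capture sees steady-state allocator state.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture: Foundry performs this once offline and persists the graph
# topology plus execution context.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_out = model(static_in)

# Replay: the cheap steady-state path. Foundry's contribution is
# reaching this point online without re-running the capture above.
static_in.copy_(torch.randn(8, 4096, device="cuda"))
g.replay()
print(static_out.sum().item())
```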

02 / RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs

NSDI 2026 · Y Wu*, X Liu*, H Zheng, J Gu, B Chen, ZM Mao, A Krishnamurthy, I Stoica

Offloads rollout workloads to fragmented preemptible resources to reduce LLM RL cost and improve utilization.

  • Designed a rollout system that adaptively offloads LLM RL workloads to preemptible instances.
  • Reduces LLM RL cost by up to 49%.
  • Improves utilization of fragmented cloud resources (see the scheduling sketch below).
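
A toy sketch of the scheduling idea: stateless rollouts go to cheap preemptible workers and are simply requeued on preemption. SpotWorker, drain, and the random placement policy are illustrative assumptions, not RLBoost's actual design.

```python
# Rollout (generation) work runs on preemptible workers; a preempted
# rollout is safe to retry because it can be regenerated from its prompt.
import random
from collections import deque

class SpotWorker:
    def __init__(self, wid, preempt_prob=0.2):
        self.wid = wid
        self.preempt_prob = preempt_prob

    def run_rollout(self, prompt):
        # A real worker would run LLM generation; preemption arrives
        # asynchronously from the cloud provider.
        if random.random() < self.preempt_prob:
            raise InterruptedError(f"worker {self.wid} preempted")
        return f"rollout({prompt})"

def drain(prompts, workers):
    """Run all rollouts, requeueing any work lost to preemption."""
    queue, results = deque(prompts), []
    while queue:
        prompt = queue.popleft()
        worker = random.choice(workers)
        try:
            results.append(worker.run_rollout(prompt))
        except InterruptedError:
            queue.append(prompt)  # rollout is stateless: safe to retry
    return results

workers = [SpotWorker(i) for i in range(4)]
print(drain([f"p{i}" for i in range(8)], workers))
```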

03 / HeterMoE: Efficient Training of Mixture-of-Experts Models on Heterogeneous GPUs

In submission · Y Wu*, X Liu*, S Jin, C Xu, F Qian, ZM Mao, M Lentz, D Zhuo, I Stoica

Assigns MoE components across mixed GPU generations with zebra parallelism and asymmetric expert placement.

  • Disaggregates MoE models across heterogeneous GPU generations.
  • Assigns experts to older GPUs such as V100 and T4 while using newer GPUs for attention.
  • Uses zebra parallelism and asymmetric expert assignment for fine-grained load balancing (see the placement sketch below).
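
An illustrative placement sketch: attention is kept on newer GPUs while experts are spread across older ones in proportion to throughput. The Gpu record, the relative-flops weights, and the greedy least-loaded rule are assumptions for illustration, not HeterMoE's algorithm.

```python
# Asymmetric assignment: compute-dense attention on new GPUs, smaller
# independent expert FFNs balanced across older GPUs.
from dataclasses import dataclass

@dataclass
class Gpu:
    name: str
    generation: str   # "new" (e.g. A100) or "old" (e.g. V100/T4)
    flops: float      # relative throughput, used for load balancing

def place_moe_layer(num_experts, gpus):
    new = [g for g in gpus if g.generation == "new"]
    old = [g for g in gpus if g.generation == "old"]
    # Attention is compute-dense: keep it on the newest hardware.
    attention_gpus = new
    # Experts are independent FFNs: spread them over older GPUs,
    # greedily picking the least-loaded device per unit of throughput.
    load = {g.name: 0.0 for g in old}
    assignment = {}
    for e in range(num_experts):
        target = min(old, key=lambda gg: load[gg.name] / gg.flops)
        assignment[f"expert_{e}"] = target.name
        load[target.name] += 1.0
    return attention_gpus, assignment

gpus = [Gpu("a100-0", "new", 4.0), Gpu("v100-0", "old", 1.0),
        Gpu("t4-0", "old", 0.7)]
attn, experts = place_moe_layer(8, gpus)
print([g.name for g in attn], experts)
```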

04 / Plato: Plan to Efficiently Decode for Large Language Model Inference

COLM 2025 · S Jin*, X Liu*, Y Wu, H Zheng, Q Zhang, M Lentz, ZM Mao, A Prakash, F Qian, D Zhuo

Decomposes complex queries into dependency graphs to accelerate generation through context-aware parallel decoding.

  • Decomposes complex queries into a dependency graph.
  • Accelerates generation through context-aware parallel decoding and structured decoding (see the wave-scheduling sketch below).
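
A minimal wave-scheduling sketch built on Python's standard graphlib: sub-queries whose dependencies are satisfied decode together in one parallel wave. The dependency graph here is hand-written, whereas Plato derives it from the query.

```python
# Sub-queries with no unmet dependencies form a "wave" that a serving
# engine can decode in parallel (e.g. as one batch).
from graphlib import TopologicalSorter

# Node -> set of nodes it depends on (must decode first).
deps = {
    "outline": set(),
    "section_a": {"outline"},
    "section_b": {"outline"},
    "conclusion": {"section_a", "section_b"},
}

ts = TopologicalSorter(deps)
ts.prepare()
wave = 0
while ts.is_active():
    ready = list(ts.get_ready())   # mutually independent sub-queries
    print(f"wave {wave}: decode {ready} in parallel")
    ts.done(*ready)
    wave += 1
```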

05 / Compute Or Load KV Cache? Why Not Both? (CAKE)

ICML 2025 · S Jin*, X Liu*, Q Zhang, ZM Mao

Reduces long-context prefill latency by overlapping bidirectional KV-cache computation and I/O.

  • Reduces LLM prefill latency on long-context inputs.
  • Generates the KV cache bidirectionally, overlapping GPU computation with cache I/O.
  • Built on top of vLLM and LMCache (see the overlap sketch below).
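
A thread-based sketch of the bidirectional overlap: one worker recomputes KV chunks from the front of the context while another loads chunks from the back, and they meet in the middle. The sleep() calls are stand-ins for GPU compute and storage I/O; the real system overlaps them inside vLLM/LMCache.

```python
# Two workers claim KV chunks from opposite ends of the context until
# every chunk is either recomputed or loaded.
import threading, time

NUM_CHUNKS = 8
done = [None] * NUM_CHUNKS
lock = threading.Lock()

def next_index(direction):
    """Claim the next unclaimed chunk from the given end, or None."""
    with lock:
        order = (range(NUM_CHUNKS) if direction == "front"
                 else range(NUM_CHUNKS - 1, -1, -1))
        for i in order:
            if done[i] is None:
                done[i] = direction  # claim the chunk
                return i
    return None

def worker(direction, cost_s):
    while (i := next_index(direction)) is not None:
        time.sleep(cost_s)  # stand-in for compute or I/O latency
        verb = "computed" if direction == "front" else "loaded"
        print(f"chunk {i}: {verb}")

compute = threading.Thread(target=worker, args=("front", 0.03))
load = threading.Thread(target=worker, args=("back", 0.05))
compute.start(); load.start(); compute.join(); load.join()
```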

06 / Learn To Be Efficient: Build Structured Sparsity in Large Language Models (LTE)

NeurIPS 2024 Spotlight · H Zheng, X Bai, X Liu, ZM Mao, B Chen, F Lai, A Prakash

Trains LLMs to activate fewer neurons while maintaining accuracy, backed by efficient sparse FFN kernels.

  • Trains LLMs to activate fewer neurons through structured sparsity while maintaining accuracy.
  • Builds an efficient Triton/CUDA gather-scatter MLP kernel.
  • Achieves near-linear speedup with sparsity (see the gather-scatter sketch below).
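
A dense-PyTorch sketch of the gather-scatter idea: only router-selected neuron groups are gathered, computed, and scattered back, so FFN cost scales with the active group count rather than the full hidden width. Shapes and the top-k router are toy assumptions; the fused Triton/CUDA kernel is not shown.

```python
# Structured-sparse FFN: compute touches only the k active neuron
# groups, giving near-linear savings in the sparsity ratio.
import torch

d_model, d_ff, group = 64, 256, 32            # 8 neuron groups of 32
num_groups = d_ff // group
W1 = torch.randn(num_groups, group, d_model)  # up-proj, grouped
W2 = torch.randn(num_groups, d_model, group)  # down-proj, grouped

def sparse_ffn(x, active):      # x: (d_model,), active: group ids
    y = torch.zeros_like(x)
    for g in active:            # gather: touch only active groups
        h = torch.relu(W1[g] @ x)   # (group,)
        y += W2[g] @ h              # scatter-add back to d_model
    return y

x = torch.randn(d_model)
router_scores = torch.randn(num_groups)
active = torch.topk(router_scores, k=2).indices.tolist()  # 2/8 groups
print(sparse_ffn(x, active).shape)  # cost scales with k, not d_ff
```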

07 / mm2-gb: GPU Accelerated Minimap2 for Long Read DNA Mapping

ACM BCB 2024 Oral · J Dong*, X Liu*, H Sadasivan, S Sitaraman, S Narayanasamy

Extends minimap2 with an AMD GPU-accelerated chaining kernel for irregular ultra-long DNA read workloads.

  • Extends minimap2-v2.24 with an AMD GPU-accelerated chaining kernel.
  • Uses HIP and persistent kernels to tackle extremely irregular ultra-long DNA read workloads (see the chaining sketch below).
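
A simplified CPU sketch of the anchor-chaining dynamic program that mm2-gb offloads: each anchor scans a variable-length predecessor window, which is exactly the irregularity that makes ultra-long reads hard to parallelize. The gap cost is a toy stand-in for minimap2's scoring function.

```python
# Chaining DP: best chain score ending at each anchor, scanning a
# bounded window of predecessors. Window size varies per anchor,
# which is the source of GPU load imbalance mm2-gb addresses.
def chain(anchors, max_dist=5000):
    """anchors: list of (ref_pos, query_pos), sorted by ref_pos."""
    n = len(anchors)
    f = [1.0] * n          # best chain score ending at anchor i
    parent = [-1] * n
    for i in range(n):
        ri, qi = anchors[i]
        for j in range(i - 1, -1, -1):   # irregular inner loop
            rj, qj = anchors[j]
            if ri - rj > max_dist:
                break                    # predecessors out of range
            if rj < ri and qj < qi:      # j can precede i in a chain
                gap = abs((ri - rj) - (qi - qj))
                score = f[j] + 1 - 0.01 * gap  # toy gap penalty
                if score > f[i]:
                    f[i], parent[i] = score, j
    return f, parent

anchors = [(10, 12), (60, 65), (120, 118), (400, 410)]
print(chain(anchors))
```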