- Incoming Quantitative Researcher Intern for summer 2026.
- Optimize ML infrastructure for quantitative equity analysis.
XUESHEN-SMI 1.0 Thu, May 14, 2026
Research Runtime
Xueshen Liu
刘学深
Driver version PhD-0.4.9
Mode Elastic LLM infra
Namespace UMICH CSE
Location Ann Arbor, MI
I build systems for cost-efficient LLM training, inference, and reinforcement learning, focusing on designing elastic infrastructure to harvest heterogeneous resources.
LLM Large Language Model · Infra Infrastructure · Efficient Efficiency · Elastic Elasticity · Heter Heterogeneity
- Docker/tmux sandbox for long-running coding agents.
- Wrapped up the Systems Research @ Google Student Researcher internship following the RLBoost NSDI'26 acceptance.
- Extend RLBoost to heterogeneous TPU+GPU systems.
- Optimize LLM inference engines and weight-transfer mechanisms on TPU.
- Persists CUDA graph topology and execution context offline.
- Reconstructs executable graphs online with negligible overhead.
- Reduces LLM serving cold-start latency by up to 99%.
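For context, a minimal sketch of the CUDA graph capture/replay pattern this builds on, using PyTorch's public `torch.cuda.CUDAGraph` API; Foundry's offline topology persistence and template-based context materialization are not shown here.

```python
import torch

# Static buffers: CUDA graphs replay into fixed memory addresses.
model = torch.nn.Linear(4096, 4096, device="cuda").eval()
static_x = torch.randn(8, 4096, device="cuda")

# Warm-up on a side stream, as the PyTorch docs recommend before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    static_y = model(static_x)
torch.cuda.current_stream().wait_stream(s)

# Capture: record the kernel-launch topology once.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_y = model(static_x)

# Replay: copy fresh inputs into the static buffer, then relaunch the
# whole graph with near-zero CPU launch overhead.
static_x.copy_(torch.randn(8, 4096, device="cuda"))
g.replay()
print(static_y.sum().item())
```

Replay skips per-kernel CPU launch work, which is what makes persisting the expensive capture step offline worthwhile.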
- Characterized bottlenecks across the LLM RL pipeline and identified rollout as a dominant yet highly elastic component.
- Designed RLBoost on Google Cloud Platform to harvest fragmented spot resources, lower RL training cost, and improve overall utilization.
- Explored heterogeneous compute options across multi-generation GPUs and TPUs under diverse RL workloads.
- Contributed to an NL2SQL agentic training pipeline, optimizing multi-node communication and applying asynchronous tool calling.
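A toy sketch of the asynchronous tool-calling idea, with a hypothetical `call_tool` coroutine standing in for the pipeline's real tool/RPC layer:

```python
import asyncio

async def call_tool(name: str, arg: str) -> str:
    # Hypothetical tool endpoint (e.g., executing generated SQL);
    # stands in for the pipeline's real RPC/tool layer.
    await asyncio.sleep(0.1)  # simulated network/database latency
    return f"{name}({arg}) -> ok"

async def agent_step(question: str) -> list[str]:
    # Independent tool calls are issued concurrently rather than serially,
    # so rollout time is bounded by the slowest call, not the sum.
    return await asyncio.gather(
        call_tool("fetch_schema", "orders"),
        call_tool("run_sql", f"-- for: {question}"),
    )

print(asyncio.run(agent_step("total revenue by month?")))
```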
- Designed a rollout system that adaptively offloads LLM RL workloads to preemptible instances.
- Reduces LLM RL cost by up to 49%.
- Improves utilization of fragmented cloud resources.
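A minimal sketch of the offloading idea, with hypothetical worker pools and a simulated preemption rate; RLBoost's actual scheduler, preemption handling, and GCP integration are far more involved.

```python
import random

# Hypothetical worker pools and preemption model (illustrative only).
spot_workers = ["spot-0", "spot-1"]   # cheap but preemptible
on_demand_worker = "ondemand-0"       # reliable fallback

def run_rollout(worker: str, prompt: str) -> str | None:
    # Simulate preemption: spot workers occasionally vanish mid-rollout.
    if worker.startswith("spot") and random.random() < 0.2:
        return None  # preempted, trajectory lost
    return f"trajectory({prompt})@{worker}"

def elastic_rollout(prompt: str) -> str:
    # Prefer spot capacity to cut cost; fall back to on-demand only after
    # preemption, which bounds tail latency for the RL training loop.
    for worker in spot_workers:
        result = run_rollout(worker, prompt)
        if result is not None:
            return result
    return run_rollout(on_demand_worker, prompt)

print(elastic_rollout("prompt-0"))
```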
- Disaggregates MoE models across heterogeneous GPU generations.
- Assigns experts to older GPUs such as V100 and T4 while using newer GPUs for attention.
- Uses zebra parallelism and asymmetric expert assignment for fine-grained load balancing.
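A toy placement sketch under assumed relative GPU speeds; the real system's zebra parallelism and runtime load balancing are not modeled here.

```python
# Toy placement: the fastest GPU hosts attention, experts spread across
# older GPUs in proportion to (assumed) relative throughput.
gpus = {"A100-0": 10, "V100-0": 4, "V100-1": 4, "T4-0": 2}  # assumed speeds

attention_gpu = max(gpus, key=gpus.get)
expert_gpus = [g for g in gpus if g != attention_gpu]

num_experts = 16
total = sum(gpus[g] for g in expert_gpus)
placement: dict[str, list[int]] = {}

# Asymmetric assignment: faster old GPUs receive proportionally more experts.
start = 0
for i, g in enumerate(expert_gpus):
    if i == len(expert_gpus) - 1:
        share = num_experts - start  # remainder goes to the last GPU
    else:
        share = round(num_experts * gpus[g] / total)
    placement[g] = list(range(start, start + share))
    start += share

print("attention on:", attention_gpu)
print("expert placement:", placement)
```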
- Efficient sparse FFN kernel for structured sparsity.
- Profiling multi-node DeepSpeed training on a cluster.
- Reduces LLM prefill latency on long-context inputs.
- Overlaps bidirectional KV-cache generation with computation and I/O.
- Built on top of vLLM and LMCache.
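A minimal sketch of the underlying overlap pattern, loading a KV chunk on a side CUDA stream while compute runs on the default stream; CAKE's bidirectional scheduling inside vLLM/LMCache is not shown.

```python
import torch

copy_stream = torch.cuda.Stream()  # side stream dedicated to KV-cache I/O

# Pinned host memory makes the host-to-device copy truly asynchronous.
kv_host = torch.randn(1024, 4096, pin_memory=True)
kv_dev = torch.empty_like(kv_host, device="cuda")
hidden = torch.randn(2048, 2048, device="cuda")

with torch.cuda.stream(copy_stream):
    kv_dev.copy_(kv_host, non_blocking=True)  # load a cached KV chunk

out = hidden @ hidden.T  # stand-in prefill compute on the default stream

torch.cuda.current_stream().wait_stream(copy_stream)  # join before using kv_dev
print(out.shape, kv_dev.shape)
```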
- Led in-class discussions and held regular office hours.
- Delivered a guest lecture on distributed software-defined networking.
- Mentored graduate students on research projects, including methodology, implementation, and presentation.
- Custom operator implementation notes.
- DeepSpeed MoE training placeholder note.
- Designed a large-scale latency-tolerant vehicle positioning system on edge/cloud servers.
- Developed a deep factor graph model to handle delayed perception data while maintaining real-time responsiveness.
- Leveraged parallelism and prioritized scheduling to meet tight latency constraints.
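A toy sketch of prioritized processing of delayed updates with a heap; the urgencies and message format are illustrative only, not the deployed system's.

```python
import heapq

# Illustrative (urgency, message) pairs for position updates: delayed but
# high-impact fixes are scheduled ahead of routine fresh ones.
updates = [
    (0.9, "vehicle_7: delayed fix, 350 ms old"),
    (0.2, "vehicle_3: fresh fix, 20 ms old"),
    (0.7, "vehicle_1: delayed fix, 200 ms old"),
]

heap = [(-urgency, msg) for urgency, msg in updates]  # max-heap via negation
heapq.heapify(heap)

while heap:
    _, msg = heapq.heappop(heap)
    print("processing:", msg)
```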
- LLM inference prototype with CUDA graphs.
- Triton gather/scatter matrix multiplication kernel.
- Trains LLMs to activate fewer neurons through structured sparsity while maintaining accuracy.
- Builds an efficient Triton/CUDA gather-scatter MLP kernel.
- Achieves near-linear speedup with sparsity.
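A minimal PyTorch sketch of the gather-scatter idea, with a naive top-k stand-in for the activation predictor; the actual kernel fuses these steps in Triton/CUDA.

```python
import torch

def gather_scatter_mlp(x, w_in, w_out, active):
    # Gather: slice only the active neurons' columns/rows, so both matmuls
    # run at the reduced width k instead of the full FFN dimension.
    w_in_a = w_in[:, active]      # [d_model, k]
    w_out_a = w_out[active, :]    # [k, d_model]
    return torch.relu(x @ w_in_a) @ w_out_a

d_model, d_ff, k = 1024, 4096, 512
x = torch.randn(1, d_model)
w_in = torch.randn(d_model, d_ff)
w_out = torch.randn(d_ff, d_model)

# Naive stand-in predictor: keep the k largest pre-activations.
active = (x @ w_in).topk(k).indices.squeeze(0)

dense = torch.relu(x @ w_in) @ w_out
sparse = gather_scatter_mlp(x, w_in, w_out, active)
# Matches the dense output up to the mass outside the top-k activations.
print((dense - sparse).abs().max().item())
```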
- First CUDA UVM profiling note.
- CUDA UVM experiment log.
- Decomposes complex queries into a dependency graph.
- Accelerates generation through context-aware parallel decoding and structured decoding.
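A toy sketch of dependency-graph-driven parallel decoding, with a hypothetical `decode` coroutine standing in for LLM calls; Plato's context-aware and structured decoding are not modeled.

```python
import asyncio

# Hypothetical sub-query dependency graph: a node decodes once its parents
# are done, and independent siblings decode in parallel.
deps = {
    "plan":   [],
    "step_a": ["plan"],
    "step_b": ["plan"],           # independent of step_a
    "answer": ["step_a", "step_b"],
}

async def decode(node: str, context: dict) -> str:
    await asyncio.sleep(0.05)     # stands in for an LLM decoding call
    return f"<{node}|ctx={len(context)}>"

async def run(graph: dict) -> dict:
    done: dict[str, str] = {}
    pending = dict(graph)
    while pending:
        ready = [n for n, ds in pending.items() if all(d in done for d in ds)]
        if not ready:
            raise ValueError("dependency cycle")
        outs = await asyncio.gather(*(decode(n, done) for n in ready))
        done.update(zip(ready, outs))
        for n in ready:
            del pending[n]
    return done

print(asyncio.run(run(deps)))
```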
- Ph.D. in Computer Science and Engineering at the University of Michigan, advised by Prof. Z. Morley Mao.
- Extends minimap2-v2.24 with an AMD GPU-accelerated chaining kernel.
- Uses HIP and persistent kernels to tackle extremely irregular ultra-long DNA read workloads.
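For intuition, a simplified sequential version of the anchor-chaining recurrence that the GPU kernel parallelizes; scoring here is heavily reduced from minimap2's.

```python
# Simplified chaining DP (toy scoring):
#   f[i] = max over predecessors j of f[j] + 1 - gap_cost(i, j)
# mm2-gb parallelizes the inner max on the GPU; ultra-long reads make the
# per-anchor work extremely irregular, motivating persistent HIP kernels.
anchors = [(10, 12), (25, 28), (40, 41), (60, 66)]  # (ref_pos, read_pos), toy data

def gap_cost(a, b):
    return abs((a[0] - b[0]) - (a[1] - b[1]))

f = [1] * len(anchors)  # best chain score ending at each anchor
for i in range(1, len(anchors)):
    for j in range(max(0, i - 64), i):  # bounded lookback window
        if anchors[j][0] < anchors[i][0] and anchors[j][1] < anchors[i][1]:
            f[i] = max(f[i], f[j] + 1 - gap_cost(anchors[i], anchors[j]))

print(f)  # the best chain ends at the argmax
```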
- B.S. in Computer Science and Engineering at the University of Michigan.
- B.S. in Electrical and Computer Engineering at Shanghai Jiao Tong University.
GPU0 Education
academic milestones
GPU1 Work
internships and industry experiences
GPU2 Research
projects and publications
GPU3 Blogs
sharing my learnings
Host Processes
16 May 2026 TALK Invited talk for Amazon Rufus AI lab Excited to give an invited talk, Towards Instantaneous Elasticity in LLM Infrastructure: From Harvesting Preemptible Resources to a General Cold-Start-Free Serving Stack.
15 May 2026 CONF Present RLBoost at NSDI'26 Excited to present RLBoost at NSDI'26 and meet talented researchers working on networks and systems!
14 Apr 2026 PUB Foundry paper and code release Super excited to share Foundry! It is our recent work on fast LLM serving cold start via template-based CUDA graph context materialization.
13 Feb 2026 ROLE Incoming Citadel Securities QR internship Happy to share that I will join Citadel Securities as a Quantitative Researcher Intern for summer 2026 in Miami.
12 Dec 2025 PUB RLBoost accepted to NSDI'26 Happy to share that RLBoost has been accepted to NSDI'26. Wrapped up Student Researcher work with Systems Research @ Google! Huge thanks to all my collaborators.
11 Jul 2025 PUB Plato accepted to COLM'25 Happy to share that Plato has been accepted to COLM'25. This is our work on planning and parallel decoding for efficient LLM inference.
10 May 2025 PUB CAKE accepted to ICML'25 Excited to share that CAKE has been accepted to ICML'25! It is our work on reducing long-context prefill latency by overlapping KV-cache computation and loading.
09 May 2025 ROLE Systems Research @ Google Student Researcher internship Excited to join Systems Research @ Google as a Student Researcher in Seattle, working on distributed RL systems for LLMs.
08 Dec 2024 CONF LTE NeurIPS spotlight Honored to share that LTE was presented as a Spotlight at NeurIPS'24. This work explores structured sparsity and efficient sparse FFN kernels for LLMs.
07 Oct 2024 CONF mm2-gb ACM BCB oral Happy to share that mm2-gb was selected as an Oral at ACM BCB'24. It is our work on GPU-accelerated minimap2 for long-read DNA mapping.
06 Sep 2024 ROLE CSE 589 Graduate Student Instructor Happy to start as a Graduate Student Instructor for CSE 589 Advanced Computer Networks at the University of Michigan.
05 Aug 2024 TALK Invited talk for General Motors Research Excited to give an invited talk, Scalable & Latency-tolerant Edge/Cloud Computing via Deep Factor Graph.
04 May 2024 ROLE General Motors CAV Lab Research Intern Excited to intern in the Connected Autonomous Vehicle Lab at General Motors, working on latency-tolerant edge/cloud positioning systems.
03 May 2024 TALK Invited talk at AMD HPC Apps Knowledge Sync Excited to give an invited talk on Minimap2-gigabases (mm2-gb) at AMD HPC Apps Knowledge Sync.
02 Aug 2021 AWARD Roger King Scholarship Honored to receive the Roger King Scholarship from the College of Engineering at the University of Michigan.
01 Aug 2019 AWARD Robomaster Final Competition Happy to share that our team won the Runner-up Team award and the Grand Prize at the Robomaster Final Competition.