Blogs

Xueshen Liu

CSE Ph.D. at UMich

These posts log what I learn during my research, for future reference. I hope you enjoy them too!

2024 Dec

Efficient Gather-and-scatter Feed-forward Network Kernel with Triton

13 minute read

In our recent work Learn to be efficient: Build structured sparsity in large language models, we propose a novel method to build structured sparsity in large...

2024 Nov

Use Nsight System to Profile a Model Training with DeepSpeed on Multi-Node Cluster

10 minute read

This post logs how I profiled model training running on multiple nodes in a cluster with DeepSpeed and Nsight Systems. Click here to jump to ...

2024 Jun

Training Custom Mixtral Model with DeepSpeed

less than 1 minute read

Placeholder for the blog logging how I trained a custom Mixtral model with DeepSpeed.

Custom Gather-scatter Operator by CUTLASS

19 minute read

This post logs my experience building an efficient custom operator based on CUTLASS. Jump to the final implementation of gather and scatter matrix multi...

2024 May

Compact Inference with CUDA graph and StaticCache

14 minute read

This post logs a minimal prototype of LLM inference with CUDA graph to eliminate bubbles between kernel launches. Click here to jump to the final implem...

2024 Apr

Efficient Gather-and-scatter Matrix Multiplication Kernel with Triton

26 minute read

This post logs my implementation of a gather-and-scatter matrix multiplication operation with Triton. Click here to jump to the final implementation code....

2024 Mar

Understand CUDA Unified Memory

7 minute read

This post logs my experiments with CUDA unified memory and some innovative and interesting applications of UVM in large language models (LLMs).

Profile CUDA UVM Performance

5 minute read

This post logs my profiling of CUDA unified virtual memory.