Xueshen Liu

Claude Code in a Docker sandbox (kept alive with tmux)

6 minute read

I wanted a YOLO sandbox for Claude Code: isolated dependencies, optional CUDA/PyTorch, and a way to keep the agent running even when my SSH session drops. My...

Efficient Gather-and-scatter Feed-forward Network Kernel with Triton

13 minute read

In our recent work, "Learn to be efficient: Build structured sparsity in large language models", we propose a novel method to build structured sparsity in large...

Use Nsight System to Profile a Model Training with DeepSpeed on Multi-Node Cluster

10 minute read

This post logs how I profiled a model training run across multiple nodes in a cluster with DeepSpeed and Nsight Systems. Click here to jump to ...

Training Custom Mixtral Model with DeepSpeed

less than 1 minute read

Placeholder for the post logging how I trained a custom Mixtral model with DeepSpeed.

Custom Gather-scatter Operator by CUTLASS

19 minute read

This post logs my experience building an efficient custom operator based on CUTLASS. Jump to the final implementation of gather and scatter matrix multi...
