Blogs

Notes on CUDA, Triton, LLM inference, profiling, DeepSpeed, clusters, and practical sandboxing.

CUDA Triton PyTorch Docker

Feb 23, 2026

Claude Code in a Docker sandbox (kept alive with tmux)

I wanted a YOLO sandbox for Claude Code: isolated dependencies, optional CUDA/PyTorch, and a way to keep the agent running even when my SSH session drops. My setup is: Docker for i...

#Docker#tmux#Claude

Dec 12, 2024

Efficient Gather-and-scatter Feed-forward Network Kernel with Triton

In our recent work, "Learn to be efficient: Build structured sparsity in large language models", we propose a novel method to build structured sparsity in large language models. Throu...

#CUDA#Triton#GEMM#Pytorch#Feed-forward Network#Structured Sparsity

Nov 7, 2024

Use Nsight Systems to Profile a Model Training with DeepSpeed on Multi-Node Cluster

This post logs how I profiled a model training run on multiple nodes in a cluster with DeepSpeed and Nsight Systems. Click here to jump to the final implementatio...

#Deepspeed#Training#Nsight#Multi-GPU#Cluster#Multi-Node

Jun 25, 2024

Training Custom Mixtral Model with DeepSpeed

Placeholder for the blog logging how I trained a custom Mixtral model with DeepSpeed.

#Deepspeed#Mixtral#PyTorch#MoE#Training

Jun 19, 2024

Custom Gather-scatter Operator by CUTLASS

This post logs my experience building an efficient custom operator based on CUTLASS. Jump to the final implementation of the gather-and-scatter matrix multiplication operator.

#CUDA#CUTLASS#Python#PyTorch

May 27, 2024

Compact Inference with CUDA graph and StaticCache

This post logs a minimal prototype of LLM inference that uses CUDA graphs to eliminate bubbles between kernel launches. Click here to jump to the final implementation code.

#CUDA#Pytorch#Huggingface#CUDA Graph#LLM

Apr 24, 2024

Efficient Gather-and-scatter Matrix Multiplication Kernel with Triton

This post logs my implementation of a gather-and-scatter matrix multiplication operation with Triton. Click here to jump to the final implementation code. If you are interested ...

#CUDA#Triton#GEMM#Pytorch

Mar 31, 2024

Understand CUDA Unified Memory

This post logs my experiments with CUDA unified memory and some innovative and interesting applications of unified virtual memory (UVM) in large language models (LLMs).

#CUDA

Mar 25, 2024

Profile CUDA UVM Performance

This post logs my profiling of CUDA unified virtual memory (UVM) performance.

#CUDA#UVM#Profiler