Efficient Gather-and-scatter Feed-forward Network Kernel with Triton
This post is the continuation of Efficient Gather-and-scatter Matrix Multiplication Kernel with Triton. We will implement an efficient gather-and-scatter fee...
This post logs how I profiled a model training run on multiple nodes in a cluster with DeepSpeed and Nsight Systems. Click here to jump to ...
Placeholder for the post logging how I trained a custom Mixtral model with DeepSpeed.
This post logs my experience building an efficient custom operator based on CUTLASS. Jump to the final implementation of gather and scatter matrix multi...
This post logs a minimal prototype of LLM inference with CUDA graphs to eliminate bubbles between kernel launches. Click here to jump to the final implem...