You May Also Enjoy
Efficient Gather-and-scatter Feed-forward Network Kernel with Triton
13 minute read
In our recent work Learn to be efficient: Build structured sparsity in large language models, we propose a novel method to build structured sparsity in large...
Use Nsight System to Profile a Model Training with DeepSpeed on Multi-Node Cluster
11 minute read
This post is to log how I managed to profile a model training running on multiple nodes in a cluster with DeepSpeed and Nsight System. Click here to jump to ...
Custom Gather-scatter Operator by CUTLASS
19 minute read
This blog is to log my experience of building efficient custom operator based on CUTLASS. Jump to the final implementation of gather and scatter matrix multi...
Compact Inference with CUDA graph and StaticCache
14 minute read
This post is to log a minimum prototype of LLM inference with CUDA graph to eliminate bubbles between kernel launches. Click here to jump to the final implem...
Comments