Use Nsight Systems to Profile Model Training with DeepSpeed on a Multi-Node Cluster
This post logs how I profiled a model training run across multiple nodes in a cluster with DeepSpeed and Nsight Systems. Click here to jump to ...
These posts log what I learn during my research, for future reference. I hope you enjoy them too!
Placeholder for the blog logging the results of profiling vLLM.
Placeholder for the blog logging the results of profiling different implementations of attention modules.
Placeholder for the blog logging how I built an app with the OpenAI API.
Placeholder for the blog logging how I trained a custom Mixtral model with DeepSpeed.
This post logs my experience of building an efficient custom operator based on CUTLASS. Jump to the final implementation of gather and scatter matrix multi...
This post logs a minimal prototype of LLM inference with CUDA graphs to eliminate bubbles between kernel launches. Click here to jump to the final implem...
This post logs my implementation of a gather-and-scatter matrix multiplication operation in Triton. Click here to jump to the final implementation code.
This post logs my experiments with CUDA unified memory and some interesting applications of UVM in large language models (LLMs).
Placeholder for my blog
Placeholder for my blog