Tags
Attention
- » Profiling of different implementations of attention modules
CUDA
- » Custom Gather-scatter Operator by CUTLASS
- » Compact Inference with CUDA graph and StaticCache
- » Efficient Gather-and-scatter Matrix Multiplication Kernel with Triton
- » Understand CUDA Unified Memory
- » Understand CUDA PTXAS
- » Profile a CUDA program with Nsight
CUDA Graph
CUTLASS
Cluster
Compiler
Deepspeed
- » Use Nsight Systems to Profile a Model Training with DeepSpeed on a Multi-Node Cluster
- » Training Custom Mixtral Model with DeepSpeed
Flash-attn
- » Profiling of different implementations of attention modules
GEMM
GPT
Huggingface
- » Profiling of different implementations of attention modules
- » Compact Inference with CUDA graph and StaticCache
LLM
- » Profiling of different implementations of attention modules
- » Compact Inference with CUDA graph and StaticCache
Mixtral
MoE
Multi-GPU
Multi-Node
Node.js
Nsight
OpenAI
Profile
- » Profiling of different implementations of attention modules
Profiler
Python
PyTorch
- » Profiling of different implementations of attention modules
- » Compact Inference with CUDA graph and StaticCache
- » Efficient Gather-and-scatter Matrix Multiplication Kernel with Triton
Training
- » Use Nsight Systems to Profile a Model Training with DeepSpeed on a Multi-Node Cluster
- » Training Custom Mixtral Model with DeepSpeed