Tags
CUDA
- » Efficient Gather-and-scatter Feed-forward Network Kernel with Triton
- » Custom Gather-scatter Operator by CUTLASS
- » Compact Inference with CUDA graph and StaticCache
- » Efficient Gather-and-scatter Matrix Multiplication Kernel with Triton
- » Understand CUDA Unified Memory
- » Profile CUDA UVM Performance
CUDA Graph
CUTLASS
Cluster
DeepSpeed
- » Use Nsight System to Profile a Model Training with DeepSpeed on Multi-Node Cluster
- » Training Custom Mixtral Model with DeepSpeed
Feed-forward Network
GEMM
- » Efficient Gather-and-scatter Feed-forward Network Kernel with Triton
- » Efficient Gather-and-scatter Matrix Multiplication Kernel with Triton
Hugging Face
LLM
Mixtral
MoE
Multi-GPU
Multi-Node
Nsight
Profiler
Python
PyTorch
- » Efficient Gather-and-scatter Feed-forward Network Kernel with Triton
- » Compact Inference with CUDA graph and StaticCache
- » Efficient Gather-and-scatter Matrix Multiplication Kernel with Triton
Structured Sparsity
Training
- » Use Nsight System to Profile a Model Training with DeepSpeed on Multi-Node Cluster
- » Training Custom Mixtral Model with DeepSpeed
Triton
- » Efficient Gather-and-scatter Feed-forward Network Kernel with Triton
- » Efficient Gather-and-scatter Matrix Multiplication Kernel with Triton