Tags

CUDA

December 12, 2024 » Efficient Gather-and-scatter Feed-forward Network Kernel with Triton
June 19, 2024 » Custom Gather-scatter Operator by CUTLASS
May 27, 2024 » Compact Inference with CUDA graph and StaticCache
April 24, 2024 » Efficient Gather-and-scatter Matrix Multiplication Kernel with Triton
March 31, 2024 » Understand CUDA Unified Memory
March 25, 2024 » Profile CUDA UVM Performance

CUDA Graph

May 27, 2024 » Compact Inference with CUDA graph and StaticCache

CUTLASS

June 19, 2024 » Custom Gather-scatter Operator by CUTLASS

Cluster

November 07, 2024 » Use Nsight System to Profile a Model Training with DeepSpeed on Multi-Node Cluster

DeepSpeed

November 07, 2024 » Use Nsight System to Profile a Model Training with DeepSpeed on Multi-Node Cluster
June 25, 2024 » Training Custom Mixtral Model with DeepSpeed

Feed-forward Network

December 12, 2024 » Efficient Gather-and-scatter Feed-forward Network Kernel with Triton

GEMM

December 12, 2024 » Efficient Gather-and-scatter Feed-forward Network Kernel with Triton
April 24, 2024 » Efficient Gather-and-scatter Matrix Multiplication Kernel with Triton

Hugging Face

May 27, 2024 » Compact Inference with CUDA graph and StaticCache

LLM

May 27, 2024 » Compact Inference with CUDA graph and StaticCache

Mixtral

June 25, 2024 » Training Custom Mixtral Model with DeepSpeed

MoE

June 25, 2024 » Training Custom Mixtral Model with DeepSpeed

Multi-GPU

November 07, 2024 » Use Nsight System to Profile a Model Training with DeepSpeed on Multi-Node Cluster

Multi-Node

November 07, 2024 » Use Nsight System to Profile a Model Training with DeepSpeed on Multi-Node Cluster

Nsight

November 07, 2024 » Use Nsight System to Profile a Model Training with DeepSpeed on Multi-Node Cluster

Profiler

March 25, 2024 » Profile CUDA UVM Performance

PyTorch

December 12, 2024 » Efficient Gather-and-scatter Feed-forward Network Kernel with Triton
June 25, 2024 » Training Custom Mixtral Model with DeepSpeed
June 19, 2024 » Custom Gather-scatter Operator by CUTLASS
May 27, 2024 » Compact Inference with CUDA graph and StaticCache
April 24, 2024 » Efficient Gather-and-scatter Matrix Multiplication Kernel with Triton

Python

June 19, 2024 » Custom Gather-scatter Operator by CUTLASS

Structured Sparsity

December 12, 2024 » Efficient Gather-and-scatter Feed-forward Network Kernel with Triton

Training

November 07, 2024 » Use Nsight System to Profile a Model Training with DeepSpeed on Multi-Node Cluster
June 25, 2024 » Training Custom Mixtral Model with DeepSpeed

Triton

December 12, 2024 » Efficient Gather-and-scatter Feed-forward Network Kernel with Triton
April 24, 2024 » Efficient Gather-and-scatter Matrix Multiplication Kernel with Triton

UVM

March 25, 2024 » Profile CUDA UVM Performance