Xueshen Liu

Efficient Gather-and-scatter Matrix Multiplication Kernel with Triton

26 minute read

This post logs my implementation of a gather-and-scatter matrix multiplication operation with Triton. Click here to jump to the final implementation code....

Understand CUDA Unified Memory

7 minute read

This post logs my experiments with CUDA unified memory and some innovative and interesting applications of UVM in large language models (LLMs).

Profile CUDA UVM Performance

5 minute read

This post logs my profiling of CUDA unified virtual memory (UVM).

