Mar 25, 2024

Profile CUDA UVM Performance

This post logs my profiling of CUDA unified virtual memory (UVM).

Summary of Issues

  1. Manual swapping is extremely slow when export CUDA_VISIBLE_DEVICES=X is run at the beginning. See "Performance drop after specifying CUDA_VISIBLE_DEVICES=0" (CUDA Programming and Performance, NVIDIA Developer Forums). Solved by core pinning or pinned memory.

  2. Compiling with -O3 also affects the baseline performance; once it is used, setting the visible devices no longer changes it much.

Virtual Address Fragmentation

Setup

Running on A100 with 80GB memory, CUDA 12.2, driver 535.161.07.

Assume k requests come in, and every request has its own virtual memory space for a KV cache of size hidden_size x max_embeddings.

Every cycle randomly selects a group of requests and generates a number of new tokens for them to emulate generation progress. It then randomly selects a group of requests to free and adds new ones to simulate queries leaving and arriving.

Compare the theoretical space (sum of KV cache sizes) with the allocated space (GPU memory utilization).
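The procedure above can be modeled on the host before running it on the GPU. The following is a minimal sketch, not the benchmark itself: the 2 MiB page granularity, the fp16 element size, the default hidden_size and max_embeddings, and the group sizes are all assumptions chosen for illustration, and the function name simulate_fragmentation is hypothetical. It only captures the overhead of rounding each request's KV cache up to whole pages.

```python
import random

PAGE_BYTES = 2 * 1024 * 1024   # assumption: physical pages are mapped in 2 MiB units
DTYPE_BYTES = 2                # assumption: fp16 KV cache entries


def round_up(n: int, granularity: int) -> int:
    """Round n up to the next multiple of granularity."""
    return (n + granularity - 1) // granularity * granularity


def simulate_fragmentation(k=8, hidden_size=4096, max_embeddings=2048,
                           cycles=100, seed=0):
    """Return a list of (theoretical_bytes, allocated_bytes), one per cycle."""
    rng = random.Random(seed)
    tokens = [0] * k               # tokens held by each of the k request slots
    history = []
    for _ in range(cycles):
        # Generation: a random group of requests gains a few new tokens.
        for i in rng.sample(range(k), rng.randint(1, k)):
            tokens[i] = min(max_embeddings, tokens[i] + rng.randint(1, 16))
        # Turnover: a random group is freed and replaced by empty new requests.
        for i in rng.sample(range(k), rng.randint(0, k // 2)):
            tokens[i] = 0
        sizes = [t * hidden_size * DTYPE_BYTES for t in tokens]
        theoretical = sum(sizes)                               # sum of KV cache sizes
        allocated = sum(round_up(s, PAGE_BYTES) for s in sizes)  # page-granular backing
        history.append((theoretical, allocated))
    return history


history = simulate_fragmentation()
theoretical, allocated = history[-1]
print(f"theoretical {theoretical / 2**20:.1f} MiB, allocated {allocated / 2**20:.1f} MiB")
```

On real hardware the allocated side would come from the measured GPU memory utilization rather than from this rounding model.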

Results

Oversubscription Performance

Setup

Running on A100 with 80GB memory, CUDA 12.2, driver 535.161.07.

Split GPU memory into k blocks and allocate m chunks of data. The chunk size is equal to one block, and m > k.

For example, if we launch with (k, m) = (4, 8), each chunk takes about 20GB memory and we need to traverse through 8 chunks in total.
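The sizing arithmetic can be written down as a small helper. This is only a sketch: the function name chunk_plan is hypothetical, and it assumes the full 80GB is available for blocks, which the "about 20GB" above suggests is only approximately true.

```python
GPU_MEMORY_GB = 80  # A100 with 80GB, from the setup above


def chunk_plan(k: int, m: int, gpu_memory_gb: float = GPU_MEMORY_GB) -> dict:
    """Chunk size, total data volume, and oversubscription for k blocks and m chunks."""
    if m <= k:
        raise ValueError("oversubscription requires m > k")
    chunk_gb = gpu_memory_gb / k          # one chunk fills exactly one block
    return {
        "chunk_gb": chunk_gb,
        "total_data_gb": chunk_gb * m,    # data that has to pass through the GPU
        "oversubscription": m / k,        # how many times the data exceeds GPU memory
    }


plan = chunk_plan(k=4, m=8)
print(plan)  # {'chunk_gb': 20.0, 'total_data_gb': 160.0, 'oversubscription': 2.0}
```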

The test includes three phases:

  1. Prefill: The first k chunks fill up the GPU memory, which only needs HtoD memory copies without evicting any block.

  2. First cycle: For the next m-k chunks, the system also needs to evict one of the resident chunks to the host with a DtoH copy, then copy the new chunk into the freed block.

  3. Second cycle: Now that all k blocks are filled with chunks, we issue a second cycle that iteratively loads all m chunks. Every iteration includes a sequential DtoH and HtoD copy.
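The three phases can be sketched as a host-side model that only counts copies. The eviction policy is an assumption (oldest resident chunk first), since the text does not say which chunk is evicted, and the function name simulate is hypothetical.

```python
from collections import deque


def simulate(k: int, m: int) -> dict:
    """Count HtoD and DtoH copies in each phase for k blocks and m chunks."""
    resident = deque()  # chunk ids currently on the GPU, oldest first
    counts = {phase: {"HtoD": 0, "DtoH": 0}
              for phase in ("prefill", "first_cycle", "second_cycle")}

    def load(chunk: int, phase: str) -> None:
        if len(resident) == k:            # GPU is full: evict before loading
            resident.popleft()            # assumption: evict the oldest chunk
            counts[phase]["DtoH"] += 1
        resident.append(chunk)
        counts[phase]["HtoD"] += 1

    for chunk in range(k):                # phase 1: fill the empty blocks
        load(chunk, "prefill")
    for chunk in range(k, m):             # phase 2: remaining m-k chunks
        load(chunk, "first_cycle")
    for chunk in range(m):                # phase 3: traverse all m chunks again
        load(chunk, "second_cycle")
    return counts


for phase, c in simulate(k=4, m=8).items():
    print(phase, c)
```

With (k, m) = (4, 8) this gives 4 HtoD copies in prefill, 4 HtoD plus 4 DtoH in the first cycle, and 8 HtoD plus 8 DtoH in the second cycle.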

Experiments

Baseline: Manual swapping

Weird output when exporting the visible device (CUDA_VISIBLE_DEVICES).

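| m    | k    | prefill [s] | 1st cycle [s] | 2nd cycle [s] | total [s] |
|------|------|-------------|---------------|---------------|-----------|
| 8    | 4    | 19.20       | 36.81         | 27.42         | 64.23     |
| 12   | 4    | 19.32       | 57.96         | 40.92         | 98.89     |
| 80   | 40   | 19.31       | 43.33         | 35.63         | 78.97     |
| 120  | 40   | 19.58       | 59.41         | 41.02         | 100.42    |
| 2048 | 1024 | 19.64       | 37.05         | 27.90         | 64.95     |
| 3072 | 1024 | 19.82       | 59.08         | 41.42         | 100.50    |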

Manual swapping with pinned host memory

Weird output when exporting the visible device (CUDA_VISIBLE_DEVICES).

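| m    | k    | prefill [s] | 1st cycle [s] | 2nd cycle [s] | total [s] |
|------|------|-------------|---------------|---------------|-----------|
| 8    | 4    | 3.86306     | 11.5981       |               |           |
| 12   | 4    |             |               |               |           |
| 80   | 40   |             |               |               |           |
| 120  | 40   |             |               |               |           |
| 2048 | 1024 |             |               |               |           |
| 3072 | 1024 |             |               |               |           |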

Using UVM

Naive

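| m    | k    | prefill [s] | 1st cycle [s] | 2nd cycle [s] | total [s] |
|------|------|-------------|---------------|---------------|-----------|
| 8    | 4    | 21.06       | 42.64         | 44.43         | 87.07     |
| 12   | 4    | 21.26       | 67.43         | 66.86         | 134.28    |
| 80   | 40   | 17.13       | 36.26         | 37.98         | 74.24     |
| 120  | 40   | 16.97       | 54.70         | 56.91         | 111.62    |
| 2048 | 1024 | 20.59       | 43.44         | 47.37         | 90.81     |
| 3072 | 1024 | 20.77       | 66.66         | 70.62         | 137.28    |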

Prefetch

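| m    | k    | prefill [s] | 1st cycle [s] | 2nd cycle [s] | total [s] |
|------|------|-------------|---------------|---------------|-----------|
| 8    | 4    | 3.48        | 10.31         | 18.01         | 28.32     |
| 12   | 4    | 3.48        | 17.15         | 26.03         | 43.17     |
| 80   | 40   | 3.54        | 10.80         | 18.16         | 28.96     |
| 120  | 40   | 3.54        | 18.07         | 26.28         | 44.35     |
| 2048 | 1024 | 3.57        | 11.14         | 18.72         | 29.86     |
| 3072 | 1024 | 3.57        | 18.74         | 27.09         | 45.83     |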
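To compare the Naive and Prefetch UVM runs, the ratio of their timings can be computed per configuration. The sketch below copies the numbers from the two UVM tables in this section; the helper name speedup is hypothetical, and the "cycles" ratio uses the sum of the first and second cycle, which is what the total column appears to report.

```python
# Timings in seconds from the Naive and Prefetch tables:
# (m, k) -> (prefill, 1st cycle, 2nd cycle)
NAIVE = {
    (8, 4): (21.06, 42.64, 44.43),
    (12, 4): (21.26, 67.43, 66.86),
    (80, 40): (17.13, 36.26, 37.98),
    (120, 40): (16.97, 54.70, 56.91),
    (2048, 1024): (20.59, 43.44, 47.37),
    (3072, 1024): (20.77, 66.66, 70.62),
}
PREFETCH = {
    (8, 4): (3.48, 10.31, 18.01),
    (12, 4): (3.48, 17.15, 26.03),
    (80, 40): (3.54, 10.80, 18.16),
    (120, 40): (3.54, 18.07, 26.28),
    (2048, 1024): (3.57, 11.14, 18.72),
    (3072, 1024): (3.57, 18.74, 27.09),
}


def speedup(naive: dict, prefetch: dict) -> dict:
    """Ratio naive / prefetch for the prefill phase and for both cycles combined."""
    out = {}
    for key, (n_pre, n_c1, n_c2) in naive.items():
        p_pre, p_c1, p_c2 = prefetch[key]
        out[key] = {
            "prefill": n_pre / p_pre,
            "cycles": (n_c1 + n_c2) / (p_c1 + p_c2),
        }
    return out


speedups = speedup(NAIVE, PREFETCH)
for (m, k), s in speedups.items():
    print(f"m={m:5d} k={k:5d}  prefill x{s['prefill']:.2f}  cycles x{s['cycles']:.2f}")
```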

Advise [TODO]
