Add possibility to offload inactive KV Caches to RAM #33
Labels
enhancement: New feature or request
extras: Not directly related to the thesis, low priority
idea: A new idea, may or may not bring improvements
Since the Llama architecture introduced KV caching, each node has to cache the K and V matrices for every generated sample, which increases memory usage on the device.
Storing the inactive KV caches in RAM instead of VRAM would save GPU memory, especially when the number of samples is high.
Implementation idea: an --offload-kv flag. This could slow down inference, since it requires transferring data from CPU to GPU memory at each local processing step.
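A minimal sketch of what such an offloading cache manager might look like, assuming a PyTorch-based implementation (the class and method names below are hypothetical, not part of the existing codebase):

```python
import torch


class OffloadingKVCache:
    """Keeps inactive per-sample KV caches in (pinned) CPU RAM and moves
    the active one back to the GPU on demand."""

    def __init__(self, device: str = "cuda"):
        self.device = device
        # sample_id -> list of per-layer (K, V) tensor pairs stored on CPU
        self.caches: dict[int, list[tuple[torch.Tensor, torch.Tensor]]] = {}

    def offload(self, sample_id: int, kv: list[tuple[torch.Tensor, torch.Tensor]]) -> None:
        # Move K/V to CPU; pinned memory speeds up later host-to-device copies.
        self.caches[sample_id] = [
            (k.detach().to("cpu").pin_memory(), v.detach().to("cpu").pin_memory())
            for k, v in kv
        ]

    def fetch(self, sample_id: int) -> list[tuple[torch.Tensor, torch.Tensor]]:
        # Bring the requested sample's cache back to the GPU before processing it.
        return [
            (k.to(self.device, non_blocking=True), v.to(self.device, non_blocking=True))
            for k, v in self.caches[sample_id]
        ]
```

The extra host-to-device copy in fetch() is exactly where the inference slowdown mentioned above would come from, so the flag would let users trade throughput for lower VRAM usage.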