Add possibility to offload inactive KV Caches to RAM #33
Labels
enhancement: New feature or request
extras: Not directly related to the thesis, low priority
idea: A new idea, may or may not bring improvements
Since the Llama architecture introduced KV caching, each node has to cache the K and V matrices for every generated sample, which increases memory usage on the device.
Storing the inactive KV caches in RAM instead of VRAM would save GPU memory, especially when the number of samples is high.
Implementation idea: an --offload-kv flag. This could slow down inference, since it requires transferring data from CPU to GPU memory at each local processing step.
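A minimal sketch of what such an offloading cache manager might look like, assuming a PyTorch-based implementation (the class and method names below are hypothetical, not part of the existing codebase):

```python
import torch


class OffloadingKVCache:
    """Keeps inactive per-sample KV caches in (pinned) CPU RAM and moves
    the active one back to the GPU on demand."""

    def __init__(self, device: str = "cuda"):
        self.device = device
        # sample_id -> list of per-layer (K, V) tensor pairs stored on CPU
        self.caches: dict[int, list[tuple[torch.Tensor, torch.Tensor]]] = {}

    def offload(self, sample_id: int, kv: list[tuple[torch.Tensor, torch.Tensor]]) -> None:
        # Move K/V to CPU; pinned memory speeds up later host-to-device copies.
        self.caches[sample_id] = [
            (k.detach().to("cpu").pin_memory(), v.detach().to("cpu").pin_memory())
            for k, v in kv
        ]

    def fetch(self, sample_id: int) -> list[tuple[torch.Tensor, torch.Tensor]]:
        # Bring the requested sample's cache back to the GPU before processing it.
        return [
            (k.to(self.device, non_blocking=True), v.to(self.device, non_blocking=True))
            for k, v in self.caches[sample_id]
        ]
```

The extra host-to-device copy in fetch() is exactly where the inference slowdown mentioned above would come from, so the flag would let users trade throughput for lower VRAM usage.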