Idea for Discussion: Device Memory Management Primitives #780

Open
wacky6 opened this issue Nov 7, 2024 · 6 comments

Comments

wacky6 commented Nov 7, 2024

The idea is perhaps forward-looking, but I'd like to bring it up for discussion.

Motivations

  • Reduce the GPU/NPU memory required to complete a use case (e.g. text-to-image).
  • Reduce the memory-copy overhead of loading model weights (e.g. from disk) into GPU/NPU memory.
  • Enable inference with models whose weights can't fit in main memory as a whole (e.g. via pipelining).

Real World Examples

Case 1: Text to Image

Text-to-image use cases generally involve multiple models, arranged as a three-stage pipeline (consisting of at least three models).

Take FLUX generating a 1024x1024 image as an example:

Stage                    | Example Model(s)                  | Model Weight Size | Resident GPU Memory (Approx.)
1. Text to embedding     | Google T5 XXL and CLIP-L; fp8     | 7GB               | 8-12GB
2. Diffusion / denoising | FLUX UNet; fp8                    | 12GB              | 16GB
3. Image decoding        | FLUX Variational AutoEncoder; fp8 | 200MB             | 16GB

In the ideal case, all three stages fit in GPU memory at once (totaling > 32GB). This exceeds the capacity of every consumer GPU (except Apple M-series chips with large unified memory).

The only practical way to run FLUX is to "load, compute, unload" each model on the GPU in sequence, at the cost of "reinitializing" each stage for every text2image inference.

This reduces the required GPU memory from sum(required_memory_per_stage) to max(required_memory_per_stage), and requires main memory to fit sum(size_of_model_weight).

Note:

  • Stable Diffusion has the same three-stage architecture and uses pipelining on consumer GPUs.
  • Using a larger image size increases the resident memory proportionally for stages 2 and 3.
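
A minimal sketch of the load-compute-unload loop; loadModel(), run() and release() below are hypothetical placeholders, not a real API:

interface Model {
  run(input: unknown): Promise<unknown>;
  release(): void;                        // frees the stage's GPU/NPU memory
}
// Placeholder: builds a stage's graph and uploads its weights to the device.
declare function loadModel(name: string): Promise<Model>;

// Only one stage's weights reside on the GPU at a time: load, compute, unload.
async function textToImage(prompt: string): Promise<unknown> {
  const stages = ['text_encoder', 'denoiser', 'vae_decoder'];
  let intermediate: unknown = prompt;
  for (const name of stages) {
    const model = await loadModel(name);  // copy this stage's weights to the device
    intermediate = await model.run(intermediate);
    model.release();                      // unload before the next stage loads
  }
  return intermediate;                    // the decoded image from the final stage
}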

Case 2: Mixture of Experts (MoE) model

MoE models are what the name suggests: they use multiple small expert models to produce one inference result (e.g. the next token in an LLM).

Only one small model (say, 1/8 the size of the entire model) needs to reside in GPU memory at a time. Each small model sequentially computes its result, then the results are merged into a single output token.

High-level pseudocode:

output_token = None
while output_token != END_OF_TEXT:
  outputs = []
  for small_model in small_models:
    small_model.load_to_gpu()
    out = small_model.predict_next_token()
    outputs.append(out)
    small_model.unload_from_gpu()
  output_token = select_from_outputs(outputs)
  emit_to_caller(output_token)

If the GPU has enough memory, all of the small models can reside in memory (load and unload become no-ops). If not, the small models are repeatedly loaded to and unloaded from the GPU (usually to/from main memory).

Some LLMs adopt an architecture where the number of activated parameters (model weights that have to reside in GPU memory) is much smaller than the total number of parameters (total size of the model weights). I believe they function similarly to MoE at inference time from a memory-usage standpoint.

Examples:

  • Mixtral 8x7b
  • DeepSeek v2 (236B total params, 21B activated params per token)

Case 3: Model Streaming

I observed this technique while playing with Ollama.

If the model weights are too big to fit into main memory, the weights are streamed from disk during inference (e.g. the model will be read from disk N times to predict N tokens).

It's slow (bottlenecked by disk throughput), but it does make large-model inference possible.
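
For illustration, a rough sketch of what this streaming pattern could look like from JavaScript using the File System Access API; layerOffsets and runLayer() are hypothetical placeholders for the model's on-disk layout and per-layer compute step:

// Stream one layer's weights at a time instead of holding the whole file in memory.
declare const layerOffsets: Array<{ start: number; end: number }>;
declare function runLayer(index: number, weights: ArrayBuffer, state: unknown): Promise<unknown>;

async function predictOneToken(handle: FileSystemFileHandle, state: unknown): Promise<unknown> {
  const file = await handle.getFile();              // a File (Blob) backed by disk
  for (let i = 0; i < layerOffsets.length; i++) {
    const { start, end } = layerOffsets[i];
    const weights = await file.slice(start, end).arrayBuffer();  // read only this layer
    state = await runLayer(i, weights, state);      // the buffer can be dropped afterwards
  }
  return state;
}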

Current WebNN API

Cases 1 and 2 are feasible but not efficient. Model pipelining and partitioning involve destroying and rebuilding the compute graph.

For every inference (e.g. generating one image from a text prompt, or predicting one token with an MoE model), we need to:

  • Copy model weights from JavaScript ArrayBuffers into the WebNN service process
  • And call platform-specific APIs to build the graph (with an optional, potentially expensive, fuse/optimize pass)

Case 3 is infeasible because the entire model weights need to be copied into the WebNN service process before a graph can be built. We can fall back to model partitioning, converting the problem to case 2.
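
To make the per-inference rebuild cost for cases 1 and 2 concrete, here is a rough sketch of today's flow; the shapes are made up, and the method signatures approximate the current WebNN spec (they may differ slightly across revisions):

// Every rebuild copies the weights out of JS and re-runs the platform compile.
async function buildStage(weights: Float32Array) {
  const context = await navigator.ml.createContext({ deviceType: 'gpu' });
  const builder = new MLGraphBuilder(context);

  const input = builder.input('input', { dataType: 'float32', shape: [1, 4096] });
  const w = builder.constant(
    { dataType: 'float32', shape: [4096, 4096] },
    weights                              // copied into the WebNN service process
  );
  const output = builder.matmul(input, w);

  return builder.build({ output });      // platform-specific build / optimize pass
}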

API / Spec Implication

The topic for discussion :)

Two primitives?

  1. Swap a built MLGraph between GPU/NPU memory and main memory (e.g. like PyTorch's model.to(device))
  2. Memory-map a file on disk into main memory (a POSIX mmap equivalent?)
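
Purely as a strawman to anchor the discussion (none of these methods exist today), the two primitives might surface roughly like this:

// Strawman only: hypothetical extensions, not part of any spec.
interface HypotheticalMLGraph /* extends MLGraph */ {
  load(): Promise<void>;      // ensure the graph's weights are resident on the device
  unload(): Promise<void>;    // spill the weights to main memory (or disk)
}
interface HypotheticalMLGraphBuilder /* extends MLGraphBuilder */ {
  // Like constant(), but the implementation mmaps / streams the bytes itself
  // instead of receiving a copy of them from JavaScript.
  constantFromFile(
    descriptor: { dataType: string; shape: number[] },
    file: FileSystemFileHandle,
    byteOffset: number,
    byteLength: number
  ): unknown;                 // would return an MLOperand-like handle
}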

Related specs / APIs I can think of:

  • MLBuffer / MLTensor
  • Fetch: obtain an mmap-ed response?
  • File System Access: mmap a file on disk (from a FileSystemFileHandle)?
  • WebGPU: shader compilation cache
  • @reillyeon mentioned to me the idea of introducing a caching mechanism for MLGraph (e.g. save it for later use and avoid repeated graph compilation). Such a mechanism might help here.

reillyeon (Contributor) commented

How much of this is an implementation issue (i.e. implementations can be cleverer than they currently are about memory management when multiple large graphs are active) vs. something that needs to be exposed to developers via the API?

wacky6 commented Nov 8, 2024

I think model swapping between GPU and main memory is feasible in a clever implementation (an LRU cache of some sort). I'm not sure how much overhead that would add, or whether it would make it harder for web applications to get predictable performance characteristics (what if the LRU cache isn't clever enough to adapt to the workload?).
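
For the sake of discussion, the kind of bookkeeping I have in mind, assuming some mechanism (internal or exposed) for moving a graph's weights on and off the device; load() and unload() below are placeholders:

// Sketch of an LRU residency cache over built graphs.
interface Swappable {
  load(): Promise<void>;      // make the graph's weights resident on the device
  unload(): Promise<void>;    // spill the graph's weights back to main memory
  deviceBytes: number;
}

class ResidencyCache {
  private lru: Swappable[] = [];                   // most recently used at the end
  constructor(private budgetBytes: number) {}

  async ensureResident(graph: Swappable): Promise<void> {
    this.lru = this.lru.filter(g => g !== graph);  // move the graph to the MRU slot
    this.lru.push(graph);
    let used = this.lru.reduce((sum, g) => sum + g.deviceBytes, 0);
    // Evict least recently used graphs until the working set fits the budget.
    while (used > this.budgetBytes && this.lru.length > 1) {
      const victim = this.lru.shift()!;
      await victim.unload();
      used -= victim.deviceBytes;
    }
    await graph.load();                            // no-op if already resident
  }
}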

I think the "stream from disk" / mmap most likely require some API change (I don't think ArrayBuffers are mmap-ed.

reillyeon (Contributor) commented

I think the "stream from disk" / mmap most likely require some API change (I don't think ArrayBuffers are mmap-ed.

Once a graph is built it isn't backed by ArrayBuffers; it is opaquely held behind the MLGraph interface, and an implementation could keep the weights on disk and stream / mmap them in as necessary.

Similarly, we've been working on changes to the implementation of MLGraphBuilder.constant() so that the constant value doesn't have to stay in memory after it has been wrapped in an (again opaque) MLOperator instance. This is transparent to developers.

wacky6 commented Nov 11, 2024

So reading into an ArrayBuffer is still required?

Some model files can be >40GB in size, which won't fit in main memory, so the graph-building stage would fail (because building requires the weights to be in ArrayBuffers in memory).

reillyeon (Contributor) commented

With the changes we've made, the ArrayBuffers passed to constant() do not need to be held in memory until build() is called; they could be written to disk incrementally. They only need to exist in memory for the constant() call. While a model file can be >4GB in size, an individual constant is typically much smaller because it only contains the weights for a single layer of the model.
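
A sketch of the loading pattern this enables; readLayerBytes() and the layer list are placeholders, and WebNN types are assumed to be available to the compiler:

// Each layer's ArrayBuffer only needs to be alive for its constant() call; the
// implementation can spill the bytes behind the opaque handle it returns.
declare const layers: Array<{ name: string; dataType: string; shape: number[] }>;
declare function readLayerBytes(name: string): Promise<Float32Array>;

async function addConstants(builder: MLGraphBuilder) {
  const constants: Record<string, unknown> = {};
  for (const layer of layers) {
    const bytes = await readLayerBytes(layer.name);   // one layer in memory at a time
    constants[layer.name] = builder.constant(
      { dataType: layer.dataType, shape: layer.shape },
      bytes
    );
    // `bytes` is unreferenced after this iteration and can be garbage collected.
  }
  return constants;
}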

bbernhar commented

In WebGPU, GPU memory management was made an implementation detail, and in Chromium we did use an LRU to page GPU heaps in and out of system memory per command buffer (a concept called "residency"). ML frameworks offloading to WebGPU use this system, at least on Windows AFAIK. I could see WebNN doing something similar, but on a per-graph basis.

The app is given a GPU budget, provided by the OS, for the WebGPU runtime to stay under (or it will OOM). The GPU driver can provide management in over-budget situations. This is important for interop situations because WebNN and WebGPU must share the same GPU budget (it is per-process), and neither can easily access the other's heaps.
