llama : fix Gemma-2 Query scaling factors (ggerganov#8473)
* 9B: query_pre_attn_scalar = 256, not 224

See google/gemma_pytorch@03e6575

Gemma-2 9B should use 256, not 224 (which is what self.config.hidden_size // self.config.num_attention_heads evaluates to); see the arithmetic sketch below the change summary.

* llama : fix Gemma-2 Query scaling factor

ggml-ci

---------

Co-authored-by: Daniel Han <[email protected]>
ggerganov and danielhanchen authored Jul 14, 2024
1 parent e236528 commit 73cf442
Showing 2 changed files with 6 additions and 6 deletions.
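
For context, here is a minimal sketch of the arithmetic behind the fix. The hidden_size, num_attention_heads, and head_dim values are assumed from the public Gemma-2 9B configuration and are not part of this commit; head_dim corresponds to n_embd_head_k in llama.cpp terms.

    import math

    hidden_size = 3584          # assumed public Gemma-2 9B config value (not part of this commit)
    num_attention_heads = 16    # assumed public Gemma-2 9B config value
    head_dim = 256              # query_pre_attn_scalar; n_embd_head_k in llama.cpp terms

    old_value = hidden_size // num_attention_heads   # 224: what the converter previously assumed
    new_value = head_dim                             # 256: what google/gemma_pytorch uses for 9B

    print(1.0 / math.sqrt(old_value))   # ~0.0668 (previous Query scale)
    print(1.0 / math.sqrt(new_value))   # 0.0625  (corrected Query scale, 1/16)

For Gemma-2 27B the scale stays at 1/sqrt(n_embd / n_head), which is why the src/llama.cpp change below branches on the model type.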
convert_hf_to_gguf.py: 5 changes (0 additions, 5 deletions)

@@ -2504,11 +2504,6 @@ def set_gguf_parameters(self):
         )
         self.gguf_writer.add_sliding_window(self.hparams["sliding_window"])
 
-        # sanity check
-        attn_scalar = self.hparams["query_pre_attn_scalar"]
-        if attn_scalar != hparams["hidden_size"] / hparams["num_attention_heads"]:
-            raise ValueError("query_pre_attn_scalar must be equal to n_embd / n_head")
-
     def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
         del bid  # unused

src/llama.cpp: 7 changes (6 additions, 1 deletion)

@@ -11680,7 +11680,12 @@ struct llm_build_context {
                     ext_factor, attn_factor, beta_fast, beta_slow);
                 cb(Qcur, "Qcur", il);
 
-                Qcur = ggml_scale(ctx0, Qcur, 1.0f / sqrtf(float(n_embd / n_head)));
+                // ref: https://github.com/google/gemma_pytorch/commit/03e657582d17cb5a8617ebf333c1c16f3694670e
+                switch (model.type) {
+                    case e_model::MODEL_9B:  Qcur = ggml_scale(ctx0, Qcur, 1.0f / sqrtf(float(n_embd_head_k)));   break;
+                    case e_model::MODEL_27B: Qcur = ggml_scale(ctx0, Qcur, 1.0f / sqrtf(float(n_embd / n_head))); break;
+                    default: GGML_ASSERT(false);
+                };
                 cb(Qcur, "Qcur_scaled", il);
 
                 Kcur = ggml_rope_ext(
