metal : improve FA + improve MoE (llama/12612)

* ggml : FA with different K, V head sizes (CPU)
ggml-ci
* metal : add FA with HS=192
* metal : extend FA to support different K and V head sizes
ggml-ci
* metal : add FA vector kernels for heads K 192 and V 128
ggml-ci
* ggml : restrict op on other backends to equal head sizes
ggml-ci
* metal : optimize FA-vec kernel
ggml-ci
* metal : FA remove mq registers
* metal : improve MoE mul_mat_id condition
ggml-ci
* metal : fix comments + remove unnecessary addition
ggml-ci
* metal : avoid too much shared memory usage with mul_mat_id
ggml-ci
2025-08-18 07:20:08 +02:00 · 2025-03-28 20:21:59 +02:00
parent 1b81415963
commit 27533e7f63
8 changed files with 875 additions and 670 deletions
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -4369,7 +4369,7 @@ struct ggml_tensor * ggml_flash_attn_ext(
    }

    // permute(0, 2, 1, 3)
-    int64_t ne[4] = { q->ne[0], q->ne[2], q->ne[1], q->ne[3] };
+    int64_t ne[4] = { v->ne[0], q->ne[2], q->ne[1], q->ne[3] };
    struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);

    float params[] = { scale, max_bias, logit_softcap };
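
The hunk above changes only the first dimension of the result: once K and V may have different head sizes, the attention output takes its head size from V (`v->ne[0]`), not from Q. The following is a minimal standalone sketch of that shape rule, not the ggml implementation. `struct shape4` and `fa_out_shape` are hypothetical names introduced here for illustration, and the assertion that Q and K share a head size is an assumption based on the Q·K dot product needing equal lengths.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical stand-in for the ne[] array of a ggml tensor (not the real struct).
struct shape4 {
    int64_t ne[4];
};

// Sketch of the patched shape computation:
//   dim 0       comes from V (the V head size),
//   dims 1, 2   are Q's dims 2 and 1, i.e. the permute(0, 2, 1, 3),
//   dim 3       is copied from Q.
static struct shape4 fa_out_shape(struct shape4 q, struct shape4 k, struct shape4 v) {
    // Assumption: Q and K must agree on head size for the Q*K^T dot product.
    assert(q.ne[0] == k.ne[0]);
    // Assumption: K and V cover the same number of KV positions.
    assert(k.ne[1] == v.ne[1]);

    struct shape4 r = {{ v.ne[0], q.ne[2], q.ne[1], q.ne[3] }};
    return r;
}
```

With the head sizes named in the commit message (K head size 192, V head size 128) and, say, `q.ne = {192, 7, 16, 1}`, `k.ne = {192, 32, 16, 1}`, `v.ne = {128, 32, 16, 1}`, this yields `{128, 16, 7, 1}`. The pre-patch expression would have produced 192 in the first dimension.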