whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-06-03 00:15:40 +02:00

Author	SHA1	Message	Date
Eve	4712f7b663	vulkan: fix warnings (llama/13626) * small fixes * remove ifdef	2025-05-27 18:03:00 +03:00
0cc4m	0b69f74e15	Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (llama/13607)	2025-05-27 18:03:00 +03:00
Jeff Bolz	6d61a09bc4	vulkan: use scalar FA rather than coopmat2 when N==1 (llama/13554)	2025-05-19 14:58:39 +03:00
Jeff Bolz	162bbe8220	vulkan: KHR_coopmat flash attention (llama/13506) This shader uses coopmat1 to do the QK^T multiply. The PV multiply is more difficult for various reasons so I haven't done it. Performance for this shader is around 2.5x better than for the scalar shader when doing prompt processing. Some of the benefit may be from other optimizations like staging through shared memory, or splitting by rows.	2025-05-19 14:58:39 +03:00
Jeff Bolz	a04b329ad1	vulkan: scalar flash attention implementation (llama/13324) * vulkan: scalar flash attention implementation * vulkan: always use fp32 for scalar flash attention * vulkan: use vector loads in scalar flash attention shader * vulkan: remove PV matrix, helps with register usage * vulkan: reduce register usage in scalar FA, but perf may be slightly worse * vulkan: load each Q value once. optimize O reduction. more tuning * vulkan: support q4_0/q8_0 KV in scalar FA * CI: increase timeout to accommodate newly-supported tests * vulkan: for scalar FA, select between 1 and 8 rows * vulkan: avoid using Float16 capability in scalar FA	2025-05-13 13:59:21 +03:00
Jeff Bolz	e46df4850f	vulkan: Allow up to 4096 elements for mul_mat_id row_ids (llama/13326) This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf: GGML_ASSERT(nei0 * nei1 <= 3072); The tensor is 8 x 512. Increase this array size to accommodate.	2025-05-13 13:59:21 +03:00
Jeff Bolz	22ba2e27ce	vulkan: Additional type support for unary, binary, and copy (llama/13266) Support f16->f32 copy. Support f16->f16 and f32->f32 unary ops. Support all combinations of f16/f32 for src0/src1/dst for add/sub/mul/div.	2025-05-07 21:00:32 +03:00
Jeff Bolz	fd1cb9fc12	vulkan: Add bfloat16 support (llama/12554) * vulkan: Add bfloat16 support This adds bfloat16 matrix multiply support based on VK_KHR_shader_bfloat16. The extension is required for coopmat multiply support, but matrix-vector multiply trivially promotes bf16 to fp32 and doesn't require the extension. The copy/get_rows shaders also don't require the extension. It's probably possible to fall back to non-coopmat and promote to fp32 when the extension isn't supported, but this change doesn't do that. The coopmat support also requires a glslc that supports the extension, which currently requires a custom build. * vulkan: Support bf16 tensors without the bf16 extension or coopmat support Compile a variant of the scalar mul_mm shader that will promote the bf16 values to float, and use that when either the bf16 extension or the coopmat extensions aren't available. * vulkan: bfloat16 fixes (really works without bfloat16 support now) * vulkan: fix spirv-val failure and reenable -O	2025-05-07 15:39:32 +03:00
Jeff Bolz	17f6b8225e	vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader (llama/13191) * vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader	2025-05-07 15:39:32 +03:00
Acly	6374ea32ca	vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204) * vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW) * review: remove src_x/y < 0 checks; add performance tests	2025-05-07 15:39:32 +03:00
Eve	cf3eb291ab	vulkan: matmul gcn tuning (llama/13016) * tune matmul for gcn * this one is more power efficient * Update ggml/src/ggml-vulkan/ggml-vulkan.cpp Co-authored-by: 0cc4m <picard12@live.de> * disable this tune for the proprietary driver --------- Co-authored-by: 0cc4m <picard12@live.de>	2025-04-24 20:39:16 +03:00
Jeff Bolz	27a56e7243	vulkan: support noncontiguous rms_norm (llama/13031)	2025-04-24 20:39:16 +03:00
Georgi Gerganov	36019c35a3	graph : make FA compatible with MLA + add initial Metal kernels (llama/12953) * graph : make mla compatible with FA * metal : add exp FA kernels for DeepSeek models ggml-ci * llama : minor naming updates ggml-ci * ggml : disable FA for DS head sizes * tests : add FA tests for MLA shapes ggml-ci	2025-04-24 20:39:16 +03:00
Jeff Bolz	7db8f278f0	vulkan: enable coopmat2 FA gqa and split_k optimizations more often (llama/12931) The grouped query attention optimization doesn't require a power of two ratio, the only thing relying on it was the modulo operation written as bitwise &. split_k need not depend on gqa_ratio - enable it any time there's only one workgroup in the X dimension. The shader gets the split index from the x coord, and multiple workgroups in the X dimension (pre-split) indicates a larger FA operation that wouldn't need splitting.	2025-04-24 20:39:16 +03:00
Jeff Bolz	b9bfe0c693	vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (llama/12833) q4_k and q5_k had a lot of redundant global loads where the same 16B of scale information is repeatedly loaded and decoded during each loop iteration. This change restructures the loops to more explicitly iterate over whole blocks in the outer loop (with unrolled inner loop) and to copy/decode the scale data into shared memory once at the start of each outer loop. The copy is pipelined so the scale load from global memory is relatively cheap. This improves q4_k/q5_k model prompt processing performance by around 5-7%. I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k and hurt for q4_0. The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped variants isn't used as often as it originally was (e.g. due to the padded_N change), so I trimmed it down to offset some of the new complexity of the semi-manual loop unrolling.	2025-04-24 20:39:16 +03:00
Diego Devesa	b9c71fae5a	ggml : add bilinear upscale support (ggml/1185)	2025-04-24 20:39:16 +03:00
Jeff Bolz	d792d2a2dc	vulkan: Use unclamped loads for flash attention mask (llama/12720) nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple of the number of rows in the matrix. The KV dim is a multiple of the number of columns for the aligned shader.	2025-04-24 20:39:16 +03:00
0cc4m	8add58aa5e	Vulkan: Tune Vulkan mmq int dot shader for performance (llama/12767)	2025-04-24 20:39:16 +03:00
Jeff Bolz	76231bda56	vulkan: Hybrid waitForFences/getFenceStatus to reduce fence latency (llama/12630) There seems to be a bubble waking up from waitForFences, which costs a few percent performance and also increased variance in performance. This change inserts an "almost_ready" fence when the graph is about 80% complete and we waitForFences for the almost_ready fence and then spin (with _mm_pauses) waiting for the final fence to be signaled.	2025-04-24 20:39:16 +03:00
Jeff Bolz	b243416918	vulkan: Implement split_k for coopmat2 flash attention. (llama/12627) When using group query attention, we have one workgroup per KV batch and this can be very few workgroups (e.g. just 8 in some models). Enable split_k to spread the work across SMs. This helps a lot when the KV cache is large.	2025-04-24 20:39:16 +03:00
Jeff Bolz	2105b110d3	vulkan: Implement grouped query attention in the coopmat2 FA shader (llama/12559) When adjacent batches of Q share the same batches of K/V, batch them into the same workgroup. For example, when: dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1)) previously we would run 32 workgroups computing 1 result each, now we will run 8 workgroups computing 4 results each. This doesn't directly translate to better performance (at least when you have >=32 SMs), but in a subsequent change I'll enable split_k which will scale much better with 4x fewer workgroups.	2025-04-24 20:39:16 +03:00
Wagner Bruna	801d6bd809	vulkan: fix build when glslc doesn't support coopmat (llama/12683)	2025-04-02 15:51:57 +03:00
0cc4m	0810f02547	Vulkan: Add DP4A MMQ and Q8_1 quantization shader (llama/12135) * Vulkan: Add DP4A MMQ and Q8_1 quantization shader * Add q4_0 x q8_1 matrix matrix multiplication support * Vulkan: Add int8 coopmat MMQ support * Vulkan: Add q4_1, q5_0 and q5_1 quants, improve integer dot code * Add GL_EXT_integer_dot_product check * Remove ggml changes, fix mmq pipeline picker * Remove ggml changes, restore Intel coopmat behaviour * Fix glsl compile attempt when integer vec dot is not supported * Remove redundant code, use non-saturating integer dot, enable all matmul sizes for mmq * Remove redundant comment * Fix integer dot check * Fix compile issue with unsupported int dot glslc * Update Windows build Vulkan SDK version	2025-04-02 15:51:57 +03:00
Georgi Gerganov	27533e7f63	metal : improve FA + improve MoE (llama/12612) * ggml : FA with different K, V head sizes (CPU) ggml-ci * metal : add FA with HS=192 * metal : extend FA to support different K and V head sizes ggml-ci * metal : add FA vector kernels for heads K 192 and V 128 ggml-ci * ggml : restrict op on other backends to equal head sizes ggml-ci * metal : optimize FA-vec kernel ggml-ci * metal : FA remove mq registers * metal : improve MoE mul_mat_id condition ggml-ci * metal : fix comments + remove unnecessary addition ggml-ci * metal : avoid too much shared memory usage with mul_mat_id ggml-ci	2025-03-28 21:47:42 +02:00
Jeff Bolz	cbb88c4050	vulkan: Optimize mul_mat_vec p021 and nc shaders (llama/12505) * tests: add mul_mat perf/functional tests for p021/nc vulkan shaders * vulkan: Optimize mul_mat_vec p021 and nc shaders. These shaders are used in attention calculations, and when the KV cache grows large they start to dominate the run time. For the nc shader (which is called with large 'k' dimension), use unrolling and vector loads. For the p021 shader (which is called with large 'm' and small 'k' dimensions), take advantage of grouped query attention to reuse loads from the A matrix for the whole group, and reduce the number of workgroups (too much overhead from tiny dispatches). Using subgroupAdd in the p021 shader also helps, use that conditionally.	2025-03-27 11:06:03 +02:00
stduhpf	13455c0b5f	Vulkan: RTE rounding for cpy to quant (llama/12480) * Vulkan: RTE rounding for cpy to quant Co-Authored-By: Jeff Bolz <jbolz@nvidia.com> * remove trailing whitespace * avoid duplicating pipeline_cpy_f32_quant * fix copypasting issue * remove duplicated code --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-03-27 11:06:03 +02:00
Jeff Bolz	102af79f63	vulkan: Submit once enough matmul work has been recorded (llama/12406) I've been seeing significantly worse performance for tg with flash attention enabled vs disabled, and it seems to be related to the submit heuristic. Change the heuristic to check how many bytes worth of weight matrix are used and flush every 100MB, and ramp up after the first few submits. This seems to resolve the issue, and also increases perf for non-FA a bit.	2025-03-27 11:06:03 +02:00
0cc4m	fa72479cfb	Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (llama/12434)	2025-03-27 11:06:03 +02:00
Molly Sophia	52c4c03b0a	llama: Add support for RWKV v7 architecture (llama/12412) * ggml: Add op l2_norm Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml: Add op rwkv_wkv7 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: Add support for RWKV7 and ARWKV7 models Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: fix inference with RWKV6Qwen2 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: add more (a)rwkv7 variants in size Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Apply code-format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * fix MUSA build Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: fix shape error with rwkv using llama-parallel Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2025-03-27 11:06:03 +02:00
Jeff Bolz	b3f3779c1b	vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (llama/12312)	2025-03-27 11:06:03 +02:00
Daniele	13eeebb1b2	vulkan: subgroup size tuning (llama/12087) * vulkan: subgroup size test * Vulkan: Add device architecture enum and logic to recognize AMD generations * vulkan: use new architecture logic to specify subgroup size * Initial vulkan subgroup size tuning for RDNA3 * vulkan: commonize RDNA subgroup tuning * vulkan: override subgroup size if required_subgroup_size = 0 * vulkan: disable warp 32 for RDNA3 * vulkan: fine tuned RDNA1 subgroup sizes * vulkan: adjusted subgroup size map * vulkan: fixed RDNA2 subgroup map --------- Co-authored-by: 0cc4m <picard12@live.de>	2025-03-27 11:06:03 +02:00
Jeff Bolz	2cd3061a23	vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking (llama/12273) * vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking	2025-03-27 11:06:03 +02:00
Jeff Bolz	88d59e21b2	vulkan: Adjust coopmat2 tile sizes and selection heuristic (llama/12258)	2025-03-27 11:06:03 +02:00
Georgi Gerganov	54a54faee4	vulkan : sync (llama/0) ggml-ci	2025-03-08 15:13:01 +02:00
William Tambellini	c98681e6d5	ggml : upgrade init_tensor API to return a ggml_status (llama/11854) * Upgrade init_tensor API to return a ggml_status To prepare for an 'abort-free' ggml (ggml not to abort on OOMs but return an OOM status), as agreed with Diego in the ggml repo, upgrade the init_tensor() and view_init() APIs to return a ggml_status. * misc fixes --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-03-08 15:13:01 +02:00
Rémy O	3bab804981	vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (llama/11595) * vulkan: implement specialized MMV kernels for IQ2 quantizations * vulkan: add MMV kernels for IQ3 quants * vulkan: Increase MMV batch size and unroll IQ LUT setup * vulkan: fix init_iq_shmem for WG sizes larger than tables * vulkan: common batch size for all I-quants	2025-03-08 15:13:01 +02:00
Jeff Bolz	a0f76b2da7	vulkan: fix assertion when qy_needs_dequant (llama/12068) Looks like a copy/paste bug from qx_needs_dequant.	2025-03-08 15:13:01 +02:00
cmdr2	6ac8e6b2ce	cuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129) * cuda: restrict SILU_BACK to fp32, since fp16 exceeds the desired test threshold * vulkan: specify fp32-only support for certain ops (that are now tested for fp16 as well) * f32 sigmoid in vulkan supports op * Revert "f32 sigmoid in vulkan supports op" This reverts commit c6f04b3c19bf4504c2776149c6d8cd84e0b48acb.	2025-03-08 15:13:01 +02:00
Rémy O	37a21dd43d	vulkan: implement several ops relevant for ggml_opt (llama/11769) * vulkan: support memset_tensor * vulkan: support GGML_OP_SUM * vulkan: implement GGML_OP_ARGMAX * vulkan: implement GGML_OP_SUB * vulkan: implement GGML_OP_COUNT_EQUAL * vulkan: implement GGML_OP_OPT_STEP_ADAMW * vulkan: fix check_results RWKV_WKV6 crash and memory leaks * vulkan: implement GGML_OP_REPEAT_BACK * tests: remove invalid test-backend-ops REPEAT_BACK tests * vulkan: fix COUNT_EQUAL memset using a fillBuffer command	2025-02-27 08:55:36 +02:00
Jeff Bolz	8a22a8b17f	vulkan: support multi/vision rope, and noncontiguous rope (llama/11902)	2025-02-27 08:55:36 +02:00
Rémy O	1689aaf854	vulkan: initial support for IQ1_S and IQ1_M quantizations (llama/11528) * vulkan: initial support for IQ1_S and IQ1_M quantizations * vulkan: define MMV kernels for IQ1 quantizations * devops: increase timeout of Vulkan tests again * vulkan: simplify ifdef for init_iq_shmem	2025-02-27 08:55:36 +02:00
Eve	e22d69839d	vulkan: linux builds + small subgroup size fixes (llama/11767) * mm subgroup size * upload vulkan x86 builds	2025-02-27 08:55:36 +02:00
Danny Milosavljevic	db6e19188a	vulkan: Make Vulkan optional at runtime (ggml/11493). (llama/11494) Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-02-27 08:55:36 +02:00
Wagner Bruna	b4b063a5c9	vulkan: add environment variable GGML_VK_PREFER_HOST_MEMORY to avoid VRAM allocation (llama/11592)	2025-02-27 08:55:36 +02:00
Jeff Bolz	930b739e7a	vulkan: account for lookup tables when checking shared memory size (llama/11502)	2025-02-27 08:55:36 +02:00
Jeff Bolz	be83f342fb	vulkan: print shared memory size (llama/11719)	2025-02-27 08:55:36 +02:00
Rémy O	6f08b24146	vulkan: initial support for IQ4_XS quantization (llama/11501)	2025-02-27 08:55:36 +02:00
Jeff Bolz	7c165d7fa8	vulkan: use smaller combined allocations to avoid fragmentation (llama/11551)	2025-02-27 08:55:36 +02:00
Johannes Gäßler	bae6bbf487	CUDA: non-contiguous (RMS) norm support (llama/11659) * CUDA: non-contiguous (RMS) norm support --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-27 08:55:36 +02:00
Rémy Oudompheng	80fa576254	vulkan: implement initial support for IQ2 and IQ3 quantizations (llama/11360) * vulkan: initial support for IQ3_S * vulkan: initial support for IQ3_XXS * vulkan: initial support for IQ2_XXS * vulkan: initial support for IQ2_XS * vulkan: optimize Q3_K by removing branches * vulkan: implement dequantize variants for coopmat2 * vulkan: initial support for IQ2_S * vulkan: vertically realign code * port failing dequant callbacks from mul_mm * Fix array length mismatches * vulkan: avoid using workgroup size before it is referenced * tests: increase timeout for Vulkan llvmpipe backend --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-02-03 22:00:57 +02:00

91 Commits