Commit Graph

3042 Commits

Sigbjørn Skjæret
367cd11f5d cuda : fix GGML_CUDA_GRAPHS=OFF (llama/15300)
* fix USE_CUDA_GRAPH=OFF

ggml-ci

* check capture status

* completely disable capturing check instead
2025-08-18 20:30:45 +03:00
Jonathan Graehl
c76ec72d59 finetune: SGD optimizer, more CLI args (llama/13873)
* examples/finetune -opt SGD (stochastic gradient descent) memory opt

add a unit-tested GGML_OPT_OPTIMIZER_SGD to ggml - it avoids allocating the
m, v moment tensors.

support the finetune.cpp arg -opt SGD (or sgd). (default adamw as before)

llama 3.2-1b-F32 result: observed 11 GB GPU RAM (41 sec/epoch)
with SGD instead of 19 GB (55 sec/epoch) with adamw.
(wikipedia 100-line finetune)
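A minimal host-side sketch (not the ggml implementation) of why SGD needs no
extra optimizer state while AdamW has to keep two moment buffers, m and v,
each the same size as the weights:

```cuda
#include <cmath>
#include <cstddef>

// Plain SGD with decoupled weight decay: weights are updated in place,
// so no per-parameter optimizer state is allocated.
void sgd_step(float * w, const float * g, size_t n, float lr, float wd) {
    for (size_t i = 0; i < n; ++i) {
        w[i] = w[i] * (1.0f - lr * wd) - lr * g[i];
    }
}

// AdamW keeps first/second moment buffers m and v per parameter, which is
// where the extra GPU memory goes.
void adamw_step(float * w, const float * g, float * m, float * v, size_t n,
                float lr, float wd, float beta1, float beta2, float eps, int t) {
    const float b1c = 1.0f - std::pow(beta1, (float) t);
    const float b2c = 1.0f - std::pow(beta2, (float) t);
    for (size_t i = 0; i < n; ++i) {
        m[i] = beta1 * m[i] + (1.0f - beta1) * g[i];
        v[i] = beta2 * v[i] + (1.0f - beta2) * g[i] * g[i];
        const float mhat = m[i] / b1c;
        const float vhat = v[i] / b2c;
        w[i] = w[i] * (1.0f - lr * wd) - lr * mhat / (std::sqrt(vhat) + eps);
    }
}
```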

(
with the same GPU memory, adamw can only fit a 512 batch/context
before OOM, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val:   [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00

SGD converges more slowly but comes out ahead, fitting a 1728 batch/context
before OOM (note especially the better validation perf):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val:   [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)

note: when finetuning long enough (or with a high enough -lr),
validation accuracy *eventually* drops ('catastrophic forgetting')

the -lr-half (half-life) option is useful for SGD to avoid oscillation or
very slow underdamped learning (it makes setting -lr more forgiving).
the terminal -lr is currently set via -lr-halvings, i.e. if you want at most
1/8 the initial -lr you set -lr-halvings 3.
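A hypothetical sketch of such a schedule, assuming the half-life is measured
in epochs and the decay is clamped at the terminal value implied by
-lr-halvings (the exact semantics in finetune.cpp may differ):

```cuda
#include <algorithm>
#include <cmath>

// lr0: initial -lr, lr_half: -lr-half in epochs, lr_halvings: -lr-halvings
float scheduled_lr(float lr0, float epoch, float lr_half, int lr_halvings) {
    const float decayed  = lr0 * std::exp2(-epoch / lr_half);     // halves every lr_half epochs
    const float terminal = lr0 / std::exp2((float) lr_halvings);  // halvings=3 -> lr0/8
    return std::max(decayed, terminal);
}
```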

note: the objective loss may not be directly comparable between adamw and
sgd - check perplexity or accuracy, or consider relative improvements, when
judging convergence

new finetune args: -wd 1e-9 enables weight decay in sgd or adamw, and
-epochs N caps the number of epochs (default 2 as before)

caching (1 - wd*alpha) in the 'adamw' opt struct showed no noticeable perf
benefit and is disabled (it is still done for the new SGD, though)

since optimizer memory is pre-allocated, ggml_opt_get_optimizer_params could
probably switch between SGD and AdamW each epoch, but it would need to use
adamw for the first epoch (unconfirmed - there is no cmdline arg to set such
a policy yet)

test-opt checks adamw as before and now sgd (except for a few disabled
tests for sgd only; probably just needs logging values and adding
alternate reference values); tolerance on the 'regression'
test is broader for sgd (so we don't need many more epochs)

* Vulkan: Implement GGML_OP_OPT_STEP_SGD

* tests: Fix OPT_STEP_SGD test-backend-ops

* SGD op param store weight-decay and not 1-alpha*wd

* minor + cosmetic changes

* fix vulkan sgd

* try CI fix

---------

Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-08-18 20:30:45 +03:00
uvos
cbaec6c4ac HIP: bump requirement to rocm 6.1 (llama/15296) 2025-08-18 20:30:45 +03:00
Judd
80ef57f0f0 ggml : update ggml_rope_multi (llama/12665)
* update `rope_multi`:

1. add `ggml_rope_multi_inplace`;
2. use `GGML_MROPE_SECTIONS` instead of 4.
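Illustration of the second point, assuming the constant simply names the
m-rope sections array length (the actual definition lives in ggml.h):

```cuda
// named constant instead of a magic number for the m-rope sections array
#define GGML_MROPE_SECTIONS 4

// example values only - the per-dimension section split is model-specific
int sections[GGML_MROPE_SECTIONS] = { 0, 0, 0, 0 };
```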

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-08-18 20:30:45 +03:00
Georgi Gerganov
0e8b244366 ggml : repack block_iq4_nlx8 (llama/14904)
ggml-ci
2025-08-18 20:30:45 +03:00
Oliver Simons
b8b1b50c47 CUDA: Optimize reduce_rows_f32 kernel, giving up to a 25x kernel-level perf improvement and a 10% perf increase for Gemma3n (llama/15132)
* Factor out `reduce_rows_f32` from common.cuh

This increases iteration cycle speed by not having to recompile
every kernel all the time

* Hide memory-latency by loop unrolling in reduce_rows_f32

* Further optimizations to `reduce_rows_f32`

1. Increase threadblock size to better hide latency of memory requests.
   As a consequence of the bigger threadblocks, do a 2-step summation, using
   shared memory to communicate results between invocations (see the sketch
   after this list)
2. Use a sum_temp array to reduce waits on sum
3. Adjust num_unroll to reflect the bigger threadblock
4. Improve default block_dims, increase support for more block_dims
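A simplified CUDA sketch of the 2-step pattern from point 1 (warp-level
shuffle reduction, then combining per-warp partial sums through shared
memory); the actual reduce_rows_f32 kernel is more involved, and the sketch
assumes blockDim.x is a multiple of 32:

```cuda
__global__ void reduce_rows_f32_sketch(const float * x, float * dst, const int ncols) {
    const int row = blockIdx.x;
    const int tid = threadIdx.x;

    // each thread accumulates a strided partial sum over its row
    float sum = 0.0f;
    for (int col = tid; col < ncols; col += blockDim.x) {
        sum += x[row * ncols + col];
    }

    // step 1: reduce within each warp using shuffles
    for (int offset = 16; offset > 0; offset >>= 1) {
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    }

    // step 2: combine the per-warp results through shared memory
    __shared__ float warp_sums[32];
    const int lane = tid % 32;
    const int warp = tid / 32;
    if (lane == 0) {
        warp_sums[warp] = sum;
    }
    __syncthreads();

    if (warp == 0) {
        sum = (lane < (int) blockDim.x / 32) ? warp_sums[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1) {
            sum += __shfl_down_sync(0xffffffff, sum, offset);
        }
        if (lane == 0) {
            dst[row] = sum;
        }
    }
}
```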

* Add perf tests for `reduce_rows_f32` kernel

* Add heuristic to toggle 128/512 threads based on sm count

The break-even point was the minimum of the following multiples.

| GPU Model                    | Nrow / SM-count multiple |
| ---------------------------- | ------------------------ |
| RTX 4000 SFF ADA             | 2.0x                     |
| RTX 6000 ADA                 | 2.5x                     |
| RTX PRO 6000 Blackwell Max-Q | 3.04x                    |
| RTX PRO 4500 Blackwell       | 3.15x                    |
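A hypothetical sketch of the kind of heuristic described above, assuming the
wider 512-thread blocks pay off when there are few rows relative to the SM
count and the narrower 128-thread blocks win once the row count exceeds
roughly 2x the SM count (the exact thresholds and direction in the real
kernel may differ):

```cuda
#include <cstdint>

int choose_reduce_rows_block_size(int64_t nrows, int sm_count) {
    // few rows -> few blocks, so use wide blocks to keep the SMs busy;
    // many rows already saturate the GPU with 128-thread blocks
    return (nrows < 2 * (int64_t) sm_count) ? 512 : 128;
}
```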

* Ensure perf gains also for small ncols and large nrows

As an alternative, one could also have made the number of unrollings
template-able, but that would require compiling the kernel multiple
times, increasing binary size unnecessarily

* Modify perf and unit-tests

* Apply auto-formatting by clang

* Fix CI build failure

See https://github.com/ggml-org/llama.cpp/actions/runs/16798370266/job/47573716079?pr=15132#step:7:486
Building with VS generator worked though.

* Remove sm_count property from `ggml_backend_cuda_context`

Requested by @JohannesGaessler, and should fix remaining CI issues as a
side-effect

* Add CUB-based implementation for GGML_OP_MEAN

Currently this branch is only executed for nrows==1
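A sketch of what a CUB-based mean for a single row can look like (not the
actual ggml-cuda code; error handling omitted): sum the row with
cub::DeviceReduce and divide by the element count afterwards.

```cuda
#include <cub/cub.cuh>

void mean_row_f32_cub(const float * d_x, float * d_sum, int ncols, cudaStream_t stream) {
    void * d_tmp     = nullptr;
    size_t tmp_bytes = 0;

    // first call only queries the required temporary storage size
    cub::DeviceReduce::Sum(d_tmp, tmp_bytes, d_x, d_sum, ncols, stream);
    cudaMallocAsync(&d_tmp, tmp_bytes, stream);
    cub::DeviceReduce::Sum(d_tmp, tmp_bytes, d_x, d_sum, ncols, stream);
    cudaFreeAsync(d_tmp, stream);

    // d_sum now holds the row sum; a tiny follow-up kernel (or the caller)
    // divides it by ncols to obtain the mean
}
```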

* Add heuristics to execute CUB branch only when it brings perf

Heuristics were determined on the following HW:

* RTX 4000 SFF ADA
* RTX 6000 ADA
* RTX PRO 6000 Blackwell Max-Q
* RTX PRO 4500 Blackwell

* Add unit-test for CUB-based mean

Tests should run with CUDA Graphs enabled by default on NVGPUs

* Rename `USE_CUB` to `GGML_CUDA_USE_CUB`

Suggested by @JohannesGaessler

* Unindent Preprocessor directives

See
https://github.com/ggml-org/llama.cpp/pull/15132#discussion_r2269213506
2025-08-18 20:30:45 +03:00
Tak-RS
4e234ac013 ggml-rpc: chunk send()/recv() to avoid EINVAL for very large tensors over RPC (macOS & others) (llama/15188)
* ggml-rpc: chunk send()/recv() to avoid EINVAL for very large tensors over RPC (macOS & others). Fixes #15055 (a sketch of the pattern follows this list)

* ggml-rpc: rename RPC_IO_CHUNK->MAX_CHUNK_SIZE, use std::min() for cap, switch to GGML_LOG_ERROR, handle 0-length send/recv

* rpc: drop n==0 special case in send_data(); retry in loop per review

* rpc: remove trailing whitespace in send_data()
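An illustrative host-side sketch of the chunking pattern (not the actual
ggml-rpc code): cap each send() at a fixed chunk size and loop until all
bytes are written, so very large tensors never trigger EINVAL; the chunk
size below is a placeholder value.

```cuda
#include <sys/types.h>
#include <sys/socket.h>
#include <algorithm>
#include <cstddef>

static bool send_data_chunked(int sockfd, const void * data, size_t size) {
    constexpr size_t MAX_CHUNK_SIZE = 1u << 23; // 8 MiB per send(), illustrative
    size_t sent = 0;
    while (sent < size) {
        const size_t  n   = std::min(size - sent, MAX_CHUNK_SIZE);
        const ssize_t ret = send(sockfd, (const char *) data + sent, n, 0);
        if (ret < 0) {
            return false; // caller logs the error and aborts the request
        }
        sent += (size_t) ret;
    }
    return true;
}
```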

---------

Co-authored-by: Shinnosuke Takagi <nosuke@nosukenoMacBook-Pro.local>
2025-08-18 20:30:45 +03:00
uvos
8df931b608 HIP: disable sync warp shuffle operators from clr amd_warp_sync_functions.h (llama/15273) 2025-08-18 20:30:45 +03:00
Romain Biessy
1334f434f3 sycl: Fix and disable more configurations of mul_mat (llama/15151)
* sycl: Fix and disable more configurations of mul_mat

* Disable more configurations
2025-08-18 20:30:45 +03:00
rmatif
139110701e opencl: allow mixed f16/f32 add (llama/15140) 2025-08-18 20:30:45 +03:00
Aman Gupta
082c7ba67c CUDA cmake: add -lineinfo for easier debug (llama/15260) 2025-08-18 20:30:45 +03:00
Chenguang Li
0effaad964 CANN: GGML_OP_CPY optimization (llama/15070)
Signed-off-by: noemotiovon <757486878@qq.com>
2025-08-18 20:30:45 +03:00
R0CKSTAR
8e2ddfec31 musa: fix failures in test-backend-ops for mul_mat_id op (llama/15236)
* musa: fix failures in test-backend-ops for mul_mat_id op

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-18 20:30:45 +03:00
hipudding
3e2c262c08 CANN: Add broadcast for softmax and FA (llama/15208)
* refactor softmax

* fix fa

* fix mask shape

* format

* add comments

* Remove whitespace
2025-08-18 20:30:45 +03:00
Charles Xu
30cc11dc94 kleidiai: fix unsigned overflow bug (llama/15150)
* kleidiai: fix unsigned overflow bug

* address review comments
2025-08-18 20:30:45 +03:00
David Zhao
457eadfe6f cuda: refactored ssm_scan and use CUB (llama/13291)
* cuda: refactored ssm_scan to use CUB

* fixed compilation error when not using CUB

* assign L to constant and use size_t instead of int

* deduplicated functions

* change min blocks per mp to 1

* Use cub load and store warp transpose

* suppress clang warning
2025-08-18 20:30:45 +03:00
Aman Gupta
93c7a08019 CUDA: add attention sinks for tile and wmma (llama/15178)
* CUDA: add attention sinks for tile and wmma

* Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma
2025-08-18 20:30:45 +03:00
compilade
62566a5436 gguf-py : add Numpy MXFP4 de/quantization support (llama/15111)
* gguf-py : add MXFP4 de/quantization support

* ggml-quants : handle zero amax for MXFP4
2025-08-18 20:30:45 +03:00
AN Long
573bf9d128 ggml : fix field name when new ggml_backend (llama/14944) 2025-08-18 20:30:45 +03:00
Johannes Gäßler
2baea5e4b3 CUDA: attention sinks for mma FlashAttention (llama/15157) 2025-08-18 20:30:45 +03:00
lhez
8a36cd924a opencl: support sink in soft_max (attn sinks) (llama/15152) 2025-08-18 20:30:45 +03:00
Jeff Bolz
1984530710 vulkan: support fattn sinks (llama/15126) 2025-08-18 20:30:45 +03:00
Jeff Bolz
414e9074e0 vulkan: Add env var to disable host visible vidmem (llama/15109) 2025-08-18 20:30:45 +03:00
uvos
813ceb2a74 HIP: add cmake option to enable compiler output of kernel resource usage metrics (llama/15103) 2025-08-18 20:30:45 +03:00
Christian Kastner
6d7ffea292 ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (llama/15094)
Any available libraries are found and loaded dynamically at runtime.
2025-08-18 20:30:45 +03:00
Johannes Gäßler
5caf8a1ea2 CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16 (llama/15131)
* CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16
2025-08-18 20:30:45 +03:00
rmatif
b405fd88b3 fix profiling crash (llama/15072) 2025-08-18 20:30:45 +03:00
lhez
d153cfb507 opencl: add swiglu_oai and add_id (llama/15121)
* opencl: add `swiglu-oai`

* opencl: add `add_id`

* opencl: add missing `add_id.cl`
2025-08-18 20:30:45 +03:00
Diego Devesa
6fb55d8f7c ggml : fix fallback to CPU for unsupported ops (llama/15118) 2025-08-18 20:30:45 +03:00
Chenguang Li
e809e81e69 CANN: add support for ACL Graph (llama/15065)
* feat(cann): add optional support for ACL Graph execution

This commit adds support for executing ggml computational graphs using
Huawei's ACL graph mode via the USE_CANN_GRAPH flag. The support can be
enabled at compile time using the CMake option:

    -DUSE_CANN_GRAPH=ON

By default, ACL graph execution is **disabled**, and the fallback path
uses node-by-node execution.

Key additions:
- CMake option  to toggle graph mode
- Graph capture and execution logic using
- Tensor property matching to determine whether graph update is required
- Safe fallback and logging if the environment variable LLAMA_SET_ROWS
  is unset or invalid (see the sketch after this description)

This prepares the backend for performance improvements in repetitive graph
execution scenarios on Ascend devices.
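A hedged illustration of that fallback logic (not the actual ggml-cann code;
the accepted value of LLAMA_SET_ROWS is an assumption here): graph capture is
attempted only when the backend was built with the flag and the environment
variable is set, otherwise the backend logs a message and runs node by node.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>

static bool acl_graph_enabled() {
#ifndef USE_ACL_GRAPH
    return false; // built without -DUSE_ACL_GRAPH=ON
#else
    const char * env = std::getenv("LLAMA_SET_ROWS");
    if (env == nullptr || std::strcmp(env, "1") != 0) { // "1" is an assumed value
        std::fprintf(stderr, "CANN: LLAMA_SET_ROWS unset or invalid, falling back to node-by-node execution\n");
        return false;
    }
    return true;
#endif
}
```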

Signed-off-by: noemotiovon <757486878@qq.com>

* Fix review comments

Signed-off-by: noemotiovon <757486878@qq.com>

* rename USE_CANN_GRAPH to USE_ACL_GRAPH

Signed-off-by: noemotiovon <757486878@qq.com>

* fix typo

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-08-18 20:30:45 +03:00
Georgi Gerganov
d3aab3efde llama : add gpt-oss (llama/15091)
* oai moe

* compat with new checkpoint

* add attn sink impl

* add rope scaling yarn

* logits match with latest transformers code

* wip chat template

* rm trailing space

* use ggml_scale_bias

* rm redundant is_swa_all

* convert interleaved gate_up

* graph : fix activation function to match reference (llama/7)

* vocab : handle o200k_harmony special tokens

* ggml : add attention sinks support (llama/1)

* llama : add attn sinks

* ggml : add attn sinks

* cuda : add attn sinks

* vulkan : add support for sinks in softmax

remove unnecessary return

* ggml : add fused swiglu_oai op (llama/11)

* ggml : add fused swiglu_oai op

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update CUDA impl

* cont : metal impl

* add vulkan impl

* test-backend-ops : more test cases, clean up

* llama : remove unfused impl

* remove extra lines

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>

* repack mxfp4 upon conversion

* clean up a bit

* enable thinking

* add quick hack to render only some special tokens

* fix bf16 conversion

* remove vocab hack

* webui ok

* support chat parsing for gpt-oss

* fix webui

* direct mapping mxfp4, FINALLY

* force using mxfp4

* properly use lazy tensor

* ggml : add mxfp4

ggml : use e8m0 conversion instead of powf

Co-authored-by: Diego Devesa <slarengh@gmail.com>

change kvalues_mxfp4 table to match e2m1 (llama/6)

metal : remove quantization for now (not used)

cuda : fix disabled CUDA graphs due to ffn moe bias

vulkan : add support for mxfp4

cont : add cm2 dequant
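A sketch of what the e8m0 conversion mentioned a few lines up can look like,
assuming the bias-127 exponent format from the OCP MX spec (the ggml
implementation may handle the edge cases 0 and 255 differently): the exponent
byte is placed directly into the float32 exponent field instead of calling
powf.

```cuda
#include <cstdint>
#include <cstring>

static float e8m0_to_float(uint8_t e) {
    uint32_t bits = (uint32_t) e << 23; // float32 exponent field
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;                           // 2^(e - 127) for 0 < e < 255
}
```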

* ggml : add ggml_add_id (llama/13)

* ggml : add ggml_add_id

* add cuda impl

* llama : add weight support check for add_id

* perf opt

* add vulkan impl

* rename cuda files

* add metal impl

* allow in-place ggml_add_id

* llama : keep biases on CPU with --cpu-moe

* llama : fix compile error

ggml-ci

* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw

ggml-ci

* cleanup

ggml-ci

* sycl : fix supports_op for MXFP4

ggml-ci

* fix Unknown reasoning format

* ggml-cpu : fix AVX build

ggml-ci

* fix hip build

ggml-ci

* cuda : add mxfp4 dequantization support for cuBLAS

ggml-ci

* ggml-cpu : fix mxfp4 fallback definitions for some architectures

ggml-ci

* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
2025-08-18 20:30:45 +03:00
Romain Biessy
6558022873 sycl: fix mul_mat selection (llama/15092) 2025-08-18 20:30:45 +03:00
Christian Kastner
349b9a2097 cmake: Add GGML_BACKEND_DIR option (llama/15074)
* cmake: Add GGML_BACKEND_DIR option

This can be used by distributions to specify where to look for backends
when ggml is built with GGML_BACKEND_DL=ON.
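A hedged sketch of what such dynamic discovery can look like (illustration
only, not the actual ggml loader; the "libggml-" prefix and ".so" suffix are
assumptions): scan a directory such as the one configured via
GGML_BACKEND_DIR and dlopen every backend library found there.

```cuda
#include <dlfcn.h>
#include <filesystem>
#include <string>
#include <vector>

std::vector<void *> load_backends(const std::string & dir) {
    std::vector<void *> handles;
    for (const auto & entry : std::filesystem::directory_iterator(dir)) {
        const std::string name = entry.path().filename().string();
        if (name.rfind("libggml-", 0) == 0 && entry.path().extension() == ".so") {
            if (void * h = dlopen(entry.path().c_str(), RTLD_NOW | RTLD_LOCAL)) {
                handles.push_back(h); // each backend registers itself on load
            }
        }
    }
    return handles;
}
```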

* Fix phrasing
2025-08-18 20:30:45 +03:00
Jeff Bolz
00ff38376a vulkan: fix build when using glslang that does not support coopmat2 (llama/15062) 2025-08-18 20:30:45 +03:00
Jeff Bolz
abc971e69a vulkan: Use coopmat2 for conv2d (llama/14982) 2025-08-18 20:30:45 +03:00
lhez
53d8c5179f opencl: fix adreno compiler detection logic (llama/15029) 2025-08-18 20:30:45 +03:00
Johannes Gäßler
d6e7315717 CUDA: use mma FA kernel for gqa > 4 on RTX 4000 (llama/15035) 2025-08-18 20:30:45 +03:00
leejet
a3123e105b cuda: make im2col a little faster (llama/15025) 2025-08-18 20:30:45 +03:00
Georgi Gerganov
d119ecf0c1 cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1 (llama/15038)
* cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1

ggml-ci

* cont : fix cont types

ggml-ci

* cont : adopt variable names and comment from the other branch
2025-08-18 20:30:45 +03:00
Jeff Bolz
b374fd6172 vulkan: coopmat2 mul_mat optimizations (llama/14934)
- Increase tile size for k-quants, to match non-k-quants
- Choose more carefully between large and medium tiles, considering how it
  interacts with split_k
- Allow larger/non-power of two split_k, and make the splits a multiple of 256
- Use split_k==3 when >1/2 and <=2/3 of the SMs would have been used
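One possible reading of that split_k==3 rule, as a hypothetical sketch (the
real Vulkan heuristic also weighs tile sizes and other split_k values):

```cuda
#include <cstdint>

uint32_t pick_split_k(uint32_t workgroups, uint32_t shader_core_count) {
    const float used = (float) workgroups / (float) shader_core_count;
    // more than half but at most two thirds of the cores would be busy:
    // split the K dimension three ways to spread the work
    if (used > 0.5f && used <= 2.0f / 3.0f) {
        return 3;
    }
    return 1;
}
```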
2025-08-18 20:30:45 +03:00
Jeff Bolz
97341224b2 vulkan: Support ne[3]>1 in noncontig matrix-vector multiply (llama/15015) 2025-08-18 20:30:45 +03:00
Jeff Bolz
46e9e5b9a7 vulkan: optimizations for direct convolution (llama/14933)
* vulkan: optimizations for direct convolution

- Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill
  the GPU. The new size should be amenable to using coopmat, too.
- Fix shmem bank conflicts. 16B padding should work with coopmat.
- Some explicit loop unrolling.
- Skip math/stores work for parts of the tile that are OOB.
- Apply fastdiv opt.
- Disable shuffles for NV.

* Three tiles sizes for CONV_2D, and a heuristic to choose

* reallow collectives for pre-Turing

* make SHMEM_PAD a spec constant

* fixes for intel perf - no shmem padding, placeholder shader core count

* shader variants with/without unrolling

* 0cc4m's fixes for AMD perf

Co-authored-by: 0cc4m <picard12@live.de>

---------

Co-authored-by: 0cc4m <picard12@live.de>
2025-08-18 20:30:45 +03:00
Johannes Gäßler
7e7557ac50 CUDA: fix MMQ nwarps for AMD with warp_size==32 (llama/15014) 2025-08-18 20:30:45 +03:00
lhez
ba6a81c9c9 opencl: add f16 for add, sub, mul, div (llama/14984) 2025-08-18 20:30:45 +03:00
Srihari-mcw
1c6cb7df47 ggml : Q2k interleaving implementation - x86/x64 SIMD (llama/14373)
* Initial Q2_K Block Interleaving Implementation

* Addressed review comments and clean up of the code

* Post rebase fixes

* Initial CI/CD fixes

* Update declarations in arch-fallback.h

* Changes for GEMV Q2_K in arch-fallback.h

* Enable repacking only on AVX-512 machines

* Update comments in repack.cpp

* Address q2k comments

---------

Co-authored-by: Manogna-Sree <elisetti.manognasree@multicorewareinc.com>
2025-08-18 20:30:45 +03:00
diannao
78668cb8d1 docker : add cann build pipeline (llama/14591)
* docker: add cann build pipeline

* docker: add cann build pipeline

* docker: fix cann devops

* cann : fix multi card hccl

* Update ggml/src/ggml-cann/ggml-cann.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Update ggml-cann.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-08-18 20:30:45 +03:00
Ruben Ortlam
41e161657e Vulkan: Fix minor debug mode issues (llama/14899)
* vulkan: fix debug mode issues

* vulkan: remove broken check_results GGML_OP_SET_ROWS support
2025-08-18 20:30:45 +03:00
hipudding
572152d6af CANN: Improve loading efficiency after converting weights to NZ format. (llama/14985)
* CANN: Improve loading efficiency after converting weights to NZ format.

* CANN: fix typo
2025-08-18 20:30:45 +03:00
lhez
4904bc3bda opencl: add mul_mat_f32_f32_l4_lm and mul_mat_f16_f32_l4_lm (llama/14809) 2025-08-18 20:30:45 +03:00
uvos
8ed27b407d HIP: enable mfma mmq on gfx908 and gfx90a for select datatypes and shapes (llama/14949) 2025-08-18 20:30:45 +03:00