whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-06-19 08:07:09 +02:00

Author	SHA1	Message	Date
Daniel Bevenius	bd1cb0c8e3	whisper : remove redundant assignments (#3178) This commit removes some redundant assignments in the function `whisper_exp_compute_token_level_timestamps`. The motivation for this is that tokens[j] and token are references to the same object, which can be a little confusing when reading the code.	2025-05-21 13:23:20 +02:00
Jugal Haresh Sheth	62dc8f7d7b	whisper : update CMakeLists.txt to handle deprecated gpu Warnings (#3163) * Fix CMakeLists.txt to handle deprecated gpu Warnings * Conditionally apply -Wno-deprecated-gpu-targets only when GGML_CUDA is enabled * Conditionally apply -Wno-deprecated-gpu-targets only when GGML_CUDA is enabled and not MSVC --------- Co-authored-by: Jugal Sheth <jugal.sheth@marineai.co.uk>	2025-05-20 11:58:25 +02:00
Daniel Bevenius	2c4b904596	ruby : add GGML_SYCL_DNN option to ruby bindings (#3172) This commit adds the `GGML_SYCL_DNN` option to the Ruby bindings for the GGML library. This option was added to ggml in commit 5e7e07758a5f3172380500e173ca71f679bbef1e ("sycl: use oneDNN for matrices multiplication"). The motivation for this change is to enable the CI build to pass.	2025-05-19 17:59:43 +02:00
Georgi Gerganov	6b6cf19c65	talk-llama : sync llama.cpp ggml-ci	2025-05-19 14:58:39 +03:00
Georgi Gerganov	05501c218d	sync : ggml ggml-ci	2025-05-19 14:58:39 +03:00
Chenguang Li	9da3fc27be	CANN: Support MOE Model MUL_MAT_ID (llama/13042) Signed-off-by: noemotiovon <757486878@qq.com>	2025-05-19 14:58:39 +03:00
Gilad S.	2c13651e08	cmake: use the current build config for vulkan-shaders-gen (llama/13595) * fix: use the current build config for `vulkan-shaders-gen` * fix: only pass a valid build type to `--config`	2025-05-19 14:58:39 +03:00
Jeff Bolz	13dca86c56	vulkan: move common FA code to flash_attn_base.comp (llama/13556) * vulkan: move common FA code to flash_attn_base.comp * vulkan: move common FA index/stride setup code to flash_attn_base.comp * build fix	2025-05-19 14:58:39 +03:00
Jeff Bolz	6d61a09bc4	vulkan: use scalar FA rather than coopmat2 when N==1 (llama/13554)	2025-05-19 14:58:39 +03:00
Georgi Gerganov	4fedad988b	metal : add FA-vec kernel for head size 64 (llama/13583) ggml-ci	2025-05-19 14:58:39 +03:00
Łukasz Ślusarczyk	a8e17a244d	sycl : fixed compilation warnings (llama/13582)	2025-05-19 14:58:39 +03:00
Diego Devesa	0c76acd08a	gguf : use ggml log system (llama/13571) * gguf : use ggml log system * llama : remove unnecessary new lines in exception messages	2025-05-19 14:58:39 +03:00
Atharva Dubey	27964db1be	sycl: simplify bin_bcast_kernel (llama/13383)	2025-05-19 14:58:39 +03:00
Svetlozar Georgiev	8081e7a23d	sycl: reordered Q4_K MMVQ (llama/13109)	2025-05-19 14:58:39 +03:00
Łukasz Ślusarczyk	d807c497a4	sycl: use oneDNN for matrices multiplication (llama/12972)	2025-05-19 14:58:39 +03:00
Yibo Cai	8e9bf548f4	arm64: optimize q6_k_q8_k kernel with i8mm (llama/13519)	2025-05-19 14:58:39 +03:00
This PR improves q6_k_q8_k gemm kernel with arm64 i8mm instruction. Tested on neoverse-n2 with llama3 8b q6_k quantization model.
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above
Perplexity doesn't change with this PR.
```
// tested on neoverse-n2
$ llama-batched-bench \
    -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
    --no-mmap -fa \
    -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
    -npl 1,2,4,8,16,32 \
    -t 64
---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original | this pr  | original | this pr  |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```
Johannes Gäßler	0dda27bc0b	CUDA: fix crash on large batch size for quant. MoE (llama/13537)	2025-05-19 14:58:39 +03:00
Johannes Gäßler	ffa4720f25	CUDA: faster Deepseek FA, add Turing support (llama/13435)	2025-05-19 14:58:39 +03:00
bandoti	9b8eea28b5	cmake: simplify vulkan shader test logic (llama/13263)	2025-05-19 14:58:39 +03:00
Jeff Bolz	162bbe8220	vulkan: KHR_coopmat flash attention (llama/13506) This shader uses coopmat1 to do the QK^T multiply. The PV multiply is more difficult for various reasons so I haven't done it. Performance for this shader is around 2.5x better than for the scalar shader when doing prompt processing. Some of the benefit may be from other optimizations like staging through shared memory, or splitting by rows.	2025-05-19 14:58:39 +03:00
Jeff Bolz	a221288dc6	vulkan: workaround FA compile failures on macos (llama/13517)	2025-05-19 14:58:39 +03:00
Georgi Gerganov	08436716ae	metal : use FA-vec kernel up to batch size 20 (llama/13496) * batched-bench : fix pp batch contents * metal : optimize multi-sequence FA vec kernel ggml-ci * metal : use FA-vec kernel up to batch size 20 ggml-ci	2025-05-19 14:58:39 +03:00
Georgi Gerganov	e11fc21e6c	metal : optimize multi-sequence FA vec kernel (llama/13493) * batched-bench : fix pp batch contents * metal : optimize multi-sequence FA vec kernel ggml-ci	2025-05-19 14:58:39 +03:00
Dan Johansson	a77a924b20	ggml-cpu: Update KleidiAI to v1.6 and fix include directives (llama/13509) Signed-off-by: Dan Johansson <dan.johansson@arm.com>	2025-05-19 14:58:39 +03:00
Johannes Gäßler	405b9c77ad	mnist: fix segmentation fault (ggml/1227)	2025-05-19 14:58:39 +03:00
Diego Devesa	9c3bfc1499	ggml : fix apple OS check in ggml_print_backtrace (ggml/1229)	2025-05-19 14:58:39 +03:00
Daniel Tang	5b7797f674	ggml : Fix missing backtrace on Linux (ggml/1228) * Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1 * Fixed lldb attach * Simplify by having the child do ggml_print_backtrace_symbols	2025-05-19 14:58:39 +03:00
Daniel Bevenius	82ad275800	examples : add vad-speech-segments to win warns [no ci] (#3170) The commit includes the vad-speech-segments in the disable msvc warnings "list".	2025-05-19 12:17:18 +02:00
Daniel Bevenius	d1f114da61	vad : return early if no vad segments are detected (#3158) This commit adds a check to `whisper_full_with_state` so that the function returns early if no VAD segments are detected. The motivation for this is that if no VAD segments are detected, the function will not have any samples to process, which can happen if an audio sample does not contain any speech. I did not test this previously and only discovered this when updating the stream example.	2025-05-16 08:50:53 +02:00
Daniel Bevenius	bae5d074c7	vad : store VAD context in whisper_state (#3156) * vad : store VAD context in whisper_state This commit stores the VAD context in the whisper_state structure, allowing for better management and reuse of the VAD context across multiple calls to the whisper_vad function. The motivation for this change is that when updating the stream example I noticed that the VAD context was being re-initialized every time the whisper_vad function was called. This involved loading the VAD model, which is expensive and unnecessary if the context can be reused. Storing this in the whisper_state seems to follow a pattern similar to how whisper_coreml_context and whisper_openvino_context are stored. * vad : free vad_context in whisper_free_state	2025-05-16 07:53:26 +02:00
Daniel Bevenius	20a20decd9	whisper : add build_/ to .gitignore [no ci] (#3157) This commit adds `build_/` to `.gitignore` to ignore all build directories that start with `build_`. The motivation for this is that the Go bindings create a directory named build_go, which is not ignored by the current .gitignore. I was not sure if changing this to build-go could affect existing users so I opted to update .gitignore instead.	2025-05-15 14:28:10 +02:00
Daniel Bevenius	f389d7e3e5	examples : add --print-confidence option to cli (#3150) * examples : add --print-confidence option to cli This commit adds a new command-line option `--print-confidence` to the whisper-cli. When enabled, this option prints the confidence level of each token in the transcribed text using ANSI formatting codes. The confidence levels are represented using different styles: ```console main: confidence: highlighted (low confidence), underlined (medium), dim (high confidence) ``` Refs: https://github.com/ggml-org/whisper.cpp/issues/3135	2025-05-14 19:21:48 +02:00
Daniel Bevenius	96d791ae61	vad : add download-vad-model scripts (#3149) * vad : add download-vad-model scripts This commit adds a script to download VAD models. * vad : add vad model download script for windows [no ci] Refs: https://github.com/ggml-org/whisper.cpp/issues/3146	2025-05-14 16:47:18 +02:00
Daniel Bevenius	3882a099e1	server : add --flash-attn usage output (#3152) This commit adds the `--flash-attn` option to the usage output of the server example. The motivation for this change is that while it is possible to set this option, it is not printed in the usage output.	2025-05-14 15:22:05 +02:00
Georgi Gerganov	f890560575	talk-llama : sync llama.cpp ggml-ci	2025-05-13 13:59:21 +03:00
Georgi Gerganov	a14c89aefa	whisper : update to ggml-backend changes (#0) ggml-ci	2025-05-13 13:59:21 +03:00
Georgi Gerganov	a6a956b36d	sync : ggml ggml-ci	2025-05-13 13:59:21 +03:00
Xuan-Son Nguyen	75e9a840c5	ggml : add mrope kernel for metal (llama/13457)	2025-05-13 13:59:21 +03:00
Georgi Gerganov	41ed62bdbc	metal : optimize MoE for large batches (llama/13388)	2025-05-13 13:59:21 +03:00
lhez	029c8837f8	opencl: remove unnecessary assert for `add` (llama/13257)	2025-05-13 13:59:21 +03:00
Johannes Gäßler	5d8b068249	llama/ggml: add LLM training support (llama/10544) * llama/ggml: add LLM training support more compact progress bar llama_save_model_to_file llama_opt_param_filter ggml_graph_dup force_grads refactor ggml_opt, fix test-opt * remove logits_all * refactor CUDA implementation for ACC * reset graph at beginning of opt period	2025-05-13 13:59:21 +03:00
Dan Johansson	93ef22657e	ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (llama/13053) * ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel Signed-off-by: Dan Johansson <dan.johansson@arm.com> * * code review fixes Signed-off-by: Dan Johansson <dan.johansson@arm.com> * * adds a comment that clarifies barrier usage Signed-off-by: Dan Johansson <dan.johansson@arm.com> --------- Signed-off-by: Dan Johansson <dan.johansson@arm.com> Co-authored-by: Charles Xu <charles.xu@arm.com>	2025-05-13 13:59:21 +03:00
Johannes Gäßler	866f685bbc	CUDA: fix misaligned synchronization in FA (llama/13469)	2025-05-13 13:59:21 +03:00
Atharva Dubey	250bcc041a	enable dpcpp nightly builds with libraries (llama/13406)	2025-05-13 13:59:21 +03:00
Johannes Gäßler	90b17a99bf	CUDA: fix crash with partial offloading of MoE (llama/13439)	2025-05-13 13:59:21 +03:00
David Huang	e1b2ace0f8	Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (llama/13386)	2025-05-13 13:59:21 +03:00
Johannes Gäßler	6db0e01db6	CUDA: fix race conditions FlashAttention kernels (llama/13438)	2025-05-13 13:59:21 +03:00
Johannes Gäßler	16f3546f38	CUDA: fix FlashAttention on Turing (llama/13415)	2025-05-13 13:59:21 +03:00
Jeff Bolz	a04b329ad1	vulkan: scalar flash attention implementation (llama/13324) * vulkan: scalar flash attention implementation * vulkan: always use fp32 for scalar flash attention * vulkan: use vector loads in scalar flash attention shader * vulkan: remove PV matrix, helps with register usage * vulkan: reduce register usage in scalar FA, but perf may be slightly worse * vulkan: load each Q value once. optimize O reduction. more tuning * vulkan: support q4_0/q8_0 KV in scalar FA * CI: increase timeout to accommodate newly-supported tests * vulkan: for scalar FA, select between 1 and 8 rows * vulkan: avoid using Float16 capability in scalar FA	2025-05-13 13:59:21 +03:00
Alberto Cabrera Pérez	45d8b2352e	sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (llama/12858) * sycl : Implemented reorder Q4_0 mmvq Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com> * sycl : Fixed mmvq being called when reorder is disabled * sycl : Improved comments in the quants header Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com> * Use static_assert * safe_div -> ceil_div * Clarify qi comment * change the reorder tensor from init to execute OP * dbg * Undo changes to test-backend-ops * Refactor changes on top of q4_0 reorder fix * Missing Reverts * Refactored opt_for_reorder logic to simplify code path * Explicit inlining and unroll * Renamed mul_mat_algo enum for consistency --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com> Co-authored-by: romain.biessy <romain.biessy@codeplay.com>	2025-05-13 13:59:21 +03:00

2725 Commits