2638 Commits

Author SHA1 Message Date
Svetlozar Georgiev
f44b53480f sycl: disable reorder for sycl mulmat (llama/13536) 2025-05-27 18:03:00 +03:00
Georgi Gerganov
e04e8f1c79 metal : fix typo in FA kernel comments (llama/13651) 2025-05-27 18:03:00 +03:00
Nicolò Scipione
ee3f177cba sycl : Overcoming workaround for mmap() allocation on Windows (llama/13482)
* Remove mmap workaround on windows

After some testing I found that mmap is supported on Windows and for
many GPUs on Linux. Therefore I removed the workaround for Windows since
it is not necessary.

* Update llama-bench README

The SYCL backend introduced a workaround that allows llama-bench to
run even without specifying the `--mmp 0` flag
2025-05-27 18:03:00 +03:00
0cc4m
0b69f74e15 Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (llama/13607) 2025-05-27 18:03:00 +03:00
Georgi Gerganov
e415db0ed7 sync : ggml 2025-05-27 18:03:00 +03:00
Daniel Bevenius
2bb7694edb
docs : convert README_sycl.md to utf8 format [no ci] (#3191)
This commit updates the README_sycl.md file to use UTF-8 encoding.

The motivation for this is that while this file displays correctly in
github it will fail to render with tools that expect UTF-8 encoding.
For example this is the case when using `grip` to view the file locally.
2025-05-27 10:53:50 +02:00
Daniel Bevenius
450de0787e
node : enable no_prints to suppress all output (#3189)
This commit enables the node addon to suppress all output, even the
result of the transcription if the no_prints parameter is set to true.

The motivation for this is that for the node addon there is a
fulfillment handler/success callback to process the transcription
result. And it might be useful to be able to disable the printing of
the transcription result to the console, so that the user can handle
the result in their own way.

Refs: https://github.com/ggml-org/whisper.cpp/issues/3176
2025-05-27 05:51:47 +02:00
matteng1
ea9f206f18
talk-llama : fix for swedish umlauts + expose model inference settings in talk-llama.cpp (#3187)
Quick fix for not removing Swedish umlauts.

* Update talk-llama.cpp

Expose model inference settings to user instead of hard coding them. Same defaults as previous defaults.

* Update examples/talk-llama/talk-llama.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-05-26 07:57:39 +02:00
KITAITI Makoto
13d92d08ae
docs : fix VAD section heading levels (#3186) 2025-05-23 10:38:26 +02:00
Daniel Bevenius
aab6976465
ci : use dynamic libopenblas.dll for window-blas (#3177)
* ci : use dynamic libopenblas.dll for window-blas

This commit updates the windows-blas job to use the dynamic (can load
different kernels depending on the CPU arch) libopenblas.dll instead of
the "static" openblas.dll that gets installed by vcpkg.

The motivation for this change is that there have been reports of
performance drops in later versions specifically related to BLAS. Please
see the links below for more details.

Resolves: https://github.com/ggml-org/whisper.cpp/issues/3166
Refs: https://github.com/ggml-org/whisper.cpp/issues/2666#issuecomment-2885978811
2025-05-23 05:48:08 +02:00
Sacha Arbonel
78b31ca782
server : Add k6 Load Testing Script (#3175)
* add load testing script and update README for k6 integration
2025-05-22 10:03:04 +02:00
Daniel Bevenius
cbe557f9b1
docs : add VAD model download instructions [no ci] (#3180) 2025-05-22 07:49:29 +02:00
Alpaim
273af4aab9
docs : replace typo "]" with ")" in README (#3179) 2025-05-22 05:49:44 +02:00
Daniel Bevenius
bd1cb0c8e3
whisper : remove redundant assignments (#3178)
This commit removes some redundant assignments in the function
`whisper_exp_compute_token_level_timestamps`.

The motivation for this is that tokens[j] and token are references to
the same object and this can be a little confusing when reading the
code.
2025-05-21 13:23:20 +02:00
Jugal Haresh Sheth
62dc8f7d7b
whisper : update CMakeLists.txt to handle deprecated gpu Warnings (#3163)
* Fix CMakeLists.txt to handle deprecated gpu Warnings

* Conditionally apply -Wno-deprecated-gpu-targets only when GGML_CUDA is enabled

* Conditionally apply -Wno-deprecated-gpu-targets only when GGML_CUDA is enabled and not MSVC

---------

Co-authored-by: Jugal Sheth <jugal.sheth@marineai.co.uk>
2025-05-20 11:58:25 +02:00
Daniel Bevenius
2c4b904596
ruby : add GGML_SYCL_DNN option to ruby bindings (#3172)
This commit adds the `GGML_SYCL_DNN` option to the Ruby bindings for
the GGML library. This option was added to ggml in
commit 5e7e07758a5f3172380500e173ca71f679bbef1e ("sycl: use oneDNN for
matrices multiplication").

The motivation for this change is to enable the CI build to pass.
2025-05-19 17:59:43 +02:00
Georgi Gerganov
6b6cf19c65 talk-llama : sync llama.cpp
ggml-ci
2025-05-19 14:58:39 +03:00
Georgi Gerganov
05501c218d sync : ggml
ggml-ci
2025-05-19 14:58:39 +03:00
Chenguang Li
9da3fc27be CANN: Support MOE Model MUL_MAT_ID (llama/13042)
Signed-off-by: noemotiovon <757486878@qq.com>
2025-05-19 14:58:39 +03:00
Gilad S.
2c13651e08 cmake: use the current build config for vulkan-shaders-gen (llama/13595)
* fix: use the current build config for `vulkan-shaders-gen`

* fix: only pass a valid build type to `--config`
2025-05-19 14:58:39 +03:00
Jeff Bolz
13dca86c56 vulkan: move common FA code to flash_attn_base.comp (llama/13556)
* vulkan: move common FA code to flash_attn_base.comp

* vulkan: move common FA index/stride setup code to flash_attn_base.comp

* build fix
2025-05-19 14:58:39 +03:00
Jeff Bolz
6d61a09bc4 vulkan: use scalar FA rather than coopmat2 when N==1 (llama/13554) 2025-05-19 14:58:39 +03:00
Georgi Gerganov
4fedad988b metal : add FA-vec kernel for head size 64 (llama/13583)
ggml-ci
2025-05-19 14:58:39 +03:00
Łukasz Ślusarczyk
a8e17a244d sycl : fixed compilation warnings (llama/13582) 2025-05-19 14:58:39 +03:00
Diego Devesa
0c76acd08a gguf : use ggml log system (llama/13571)
* gguf : use ggml log system

* llama : remove unnecessary new lines in exception messages
2025-05-19 14:58:39 +03:00
Atharva Dubey
27964db1be sycl: simplify bin_bcast_kernel (llama/13383) 2025-05-19 14:58:39 +03:00
Svetlozar Georgiev
8081e7a23d sycl: reordered Q4_K MMVQ (llama/13109) 2025-05-19 14:58:39 +03:00
Łukasz Ślusarczyk
d807c497a4 sycl: use oneDNN for matrices multiplication (llama/12972) 2025-05-19 14:58:39 +03:00
Yibo Cai
8e9bf548f4 arm64: optimize q6_k_q8_k kernel with i8mm (llama/13519)
This PR improves q6_k_q8_k gemm kernel with arm64 i8mm instruction.

Tested on neoverse-n2 with llama3 8b q6_k quantization model.
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```
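The uplift figures quoted above can be re-derived from the table. A minimal sketch, assuming a made-up helper named `uplift_percent` (it is not part of llama.cpp or ggml):

```cpp
#include <cassert>

// Relative throughput gain of `updated` over `original`, in percent.
// Hypothetical helper for illustration only.
double uplift_percent(double original, double updated) {
    assert(original > 0.0); // throughput in t/s must be positive
    return (updated / original - 1.0) * 100.0;
}
```

For instance, `uplift_percent(78.52, 109.18)` (S_PP at B=1) comes out at roughly 39%, and `uplift_percent(75.14, 110.47)` (S_TG at B=32) at roughly 47%.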
2025-05-19 14:58:39 +03:00
Johannes Gäßler
0dda27bc0b CUDA: fix crash on large batch size for quant. MoE (llama/13537) 2025-05-19 14:58:39 +03:00
Johannes Gäßler
ffa4720f25 CUDA: faster Deepseek FA, add Turing support (llama/13435) 2025-05-19 14:58:39 +03:00
bandoti
9b8eea28b5 cmake: simplify vulkan shader test logic (llama/13263) 2025-05-19 14:58:39 +03:00
Jeff Bolz
162bbe8220 vulkan: KHR_coopmat flash attention (llama/13506)
This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may be from other optimizations like staging
through shared memory, or splitting by rows.
2025-05-19 14:58:39 +03:00
Jeff Bolz
a221288dc6 vulkan: workaround FA compile failures on macos (llama/13517) 2025-05-19 14:58:39 +03:00
Georgi Gerganov
08436716ae metal : use FA-vec kernel up to batch size 20 (llama/13496)
* batched-bench : fix pp batch contents

* metal : optimize multi-sequence FA vec kernel

ggml-ci

* metal : use FA-vec kernel up to batch size 20

ggml-ci
2025-05-19 14:58:39 +03:00
Georgi Gerganov
e11fc21e6c metal : optimize multi-sequence FA vec kernel (llama/13493)
* batched-bench : fix pp batch contents

* metal : optimize multi-sequence FA vec kernel

ggml-ci
2025-05-19 14:58:39 +03:00
Dan Johansson
a77a924b20 ggml-cpu: Update KleidiAI to v1.6 and fix include directives (llama/13509)
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
2025-05-19 14:58:39 +03:00
Johannes Gäßler
405b9c77ad mnist: fix segmentation fault (ggml/1227) 2025-05-19 14:58:39 +03:00
Diego Devesa
9c3bfc1499 ggml : fix apple OS check in ggml_print_backtrace (ggml/1229) 2025-05-19 14:58:39 +03:00
Daniel Tang
5b7797f674 ggml : Fix missing backtrace on Linux (ggml/1228)
* Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1
* Fixed lldb attach
* Simplify by having the child do ggml_print_backtrace_symbols
2025-05-19 14:58:39 +03:00
Daniel Bevenius
82ad275800
examples : add vad-speech-segments to win warns [no ci] (#3170)
This commit adds vad-speech-segments to the "list" of targets for which
MSVC warnings are disabled.
2025-05-19 12:17:18 +02:00
Daniel Bevenius
d1f114da61
vad : return early if no vad segments are detected (#3158)
This commit adds a check to `whisper_full_with_state` and if no VAD
segments are detected, the function will return early.

The motivation for this is that if no VAD segments are detected, the
function will not have any samples to process which can happen if an
audio sample does not contain any speech. I did not test this previously
and only discovered this when updating the stream example.
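The early-return pattern described above might look roughly like the following. This is only a sketch: `speech_segment` and `samples_to_process` are hypothetical names, and the real check lives inside `whisper_full_with_state`, whose internals are not shown here.

```cpp
#include <vector>

// Hypothetical stand-in for a detected speech segment (sample offsets).
struct speech_segment {
    int start;
    int end;
};

// Returns the number of samples that would be handed on for transcription.
// With no detected segments there is nothing to process, so return early.
int samples_to_process(const std::vector<speech_segment> & segments) {
    if (segments.empty()) {
        return 0; // early return: the audio contained no speech
    }
    int total = 0;
    for (const auto & seg : segments) {
        total += seg.end - seg.start;
    }
    return total;
}
```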
2025-05-16 08:50:53 +02:00
Daniel Bevenius
bae5d074c7
vad : store VAD context in whisper_state (#3156)
* vad : store VAD context in whisper_state

This commit stores the VAD context in the whisper_state structure,
allowing for better management and reuse of the VAD context across
multiple calls to the whisper_vad function.

The motivation for this change is that when updating the stream example
I noticed that the VAD context was being re-initialized every time the
whisper_vad function was called. This involved loading the VAD model
which is expensive and unnecessary if the context can be reused.

Storing this in the whisper_state seems to follow a pattern similar to
how whisper_coreml_context and whisper_openvino_context are stored.

* vad : free vad_context in whisper_free_state
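The idea of keeping the VAD context in the state so it is loaded once and reused could be sketched as below. All type and function names here (`vad_context_sketch`, `state_sketch`, `get_vad_context`) are assumptions for illustration; the actual whisper.cpp structures differ.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for an expensive-to-create VAD context.
struct vad_context_sketch {
    std::string model_path;
};

// Hypothetical stand-in for the per-session state that owns the context.
struct state_sketch {
    std::unique_ptr<vad_context_sketch> vad; // released when the state is destroyed
    int vad_loads = 0;                       // how many times the model was "loaded"
};

// Create the VAD context on first use, then reuse it on later calls.
vad_context_sketch * get_vad_context(state_sketch & state, const std::string & model_path) {
    if (!state.vad) {
        state.vad = std::make_unique<vad_context_sketch>(vad_context_sketch{model_path});
        state.vad_loads++;
    }
    return state.vad.get();
}
```

Tying the context's lifetime to the state mirrors the second bullet above: freeing the state also frees the VAD context.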
2025-05-16 07:53:26 +02:00
Daniel Bevenius
20a20decd9
whisper : add build_*/ to .gitignore [no ci] (#3157)
This commit adds `build_*/` to `.gitignore` to ignore all build
directories that start with `build_`.

The motivation for this is that the Go bindings create a directory
named build_go, which is not ignored by the current .gitignore. I was
not sure if changing this to build-go could affect existing users so I
opted to update .gitignore instead.
2025-05-15 14:28:10 +02:00
Daniel Bevenius
f389d7e3e5
examples : add --print-confidence option to cli (#3150)
* examples : add --print-confidence option to cli

This commit adds a new command-line option `--print-confidence` to the
whisper-cli. When enabled, this option prints the confidence level of each
token in the transcribed text using ANSI formatting codes.

The confidence levels are represented using different styles:
```console
main: confidence: highlighted (low confidence), underlined (medium), dim (high confidence)
```
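A rough sketch of how a token could be styled by confidence with ANSI codes. The 0.33 and 0.66 cut-offs, the choice of reverse video for "highlighted", and the function name `style_token` are all assumptions, not taken from the whisper-cli source.

```cpp
#include <string>

// Wrap `text` in an ANSI style chosen from the token probability p in [0, 1].
// Thresholds and escape codes are illustrative assumptions.
std::string style_token(const std::string & text, float p) {
    const char * style = "\033[2m";     // dim: high confidence
    if (p < 0.33f) {
        style = "\033[7m";              // reverse video ("highlighted"): low confidence
    } else if (p < 0.66f) {
        style = "\033[4m";              // underlined: medium confidence
    }
    return std::string(style) + text + "\033[0m"; // reset attributes afterwards
}
```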

Refs: https://github.com/ggml-org/whisper.cpp/issues/3135
2025-05-14 19:21:48 +02:00
Daniel Bevenius
96d791ae61
vad : add download-vad-model scripts (#3149)
* vad : add download-vad-model scripts

This commit adds a script to download VAD models.

* vad : add vad model download script for windows [no ci]

Refs: https://github.com/ggml-org/whisper.cpp/issues/3146
2025-05-14 16:47:18 +02:00
Daniel Bevenius
3882a099e1
server : add --flash-attn usage output (#3152)
This commit adds the `--flash-attn` option to the usage output of the
server example.

The motivation for this change is that while it is possible to set this
option it is not printed in the usage output.
2025-05-14 15:22:05 +02:00
Georgi Gerganov
f890560575 talk-llama : sync llama.cpp
ggml-ci
2025-05-13 13:59:21 +03:00
Georgi Gerganov
a14c89aefa whisper : update to ggml-backend changes (#0)
ggml-ci
2025-05-13 13:59:21 +03:00
Georgi Gerganov
a6a956b36d sync : ggml
ggml-ci
2025-05-13 13:59:21 +03:00