2657 Commits

Author SHA1 Message Date
Jeff Bolz
474f7be8b6 vulkan: mark IM2COL as supporting non-contig (llama/13783) 2025-05-27 18:03:00 +03:00
Bizhao Shi
e35fecc2a1 CANN: Add basic support for the Flash Attention kernel (llama/13627)
* cann: add the basic FA support

* cann: update the readme

* cann: update the FlashAttention with PSEShift

* cann: update the input parameters in FA

* cann: update the alibi with max_bias

* cann: add the constraints of softcap

* cann: update the docs CANN.md

* cann: update the docs CANN.md

* cann: fix typo of CANN.md

* cann: add some comments and update the CANN.md

* cann: update the CANN.md

* cann: update the inner precise for fusedInferAttention

* cann: update the constraints of flash_attn_ext on ggml-cann.cpp

* cann: clean the whitespace

* cann: clean the whitespace

* cann: add a new endline
2025-05-27 18:03:00 +03:00
Akarshan Biswas
1cd7028428 SYCL: revert "sycl: simplify bin_bcast_kernel (ggml/13383)" (llama/13752)
Temporarily reverted due to failing fp16 DIV operation

This reverts commit 02cdd2d8b092b5a4bb18e013c6887ce49ba20ac5.

ggml-ci
2025-05-27 18:03:00 +03:00
Diego Devesa
99596d6031 ggml-cpu : set openmp wait time if not set (llama/13758) 2025-05-27 18:03:00 +03:00
Xuan-Son Nguyen
2d6c6862f7 ggml : add ggml_gelu_erf() CUDA kernel (llama/13719)
* ggml : add ggml_gelu_erf() CUDA kernel

* missing semicolon
2025-05-27 18:03:00 +03:00
Johannes Gäßler
f1576b2659 CUDA: fix race condition in FA vector kernels (llama/13742) 2025-05-27 18:03:00 +03:00
Chenguang Li
994b4f86ab CANN: Support MUL_MAT_ID for q8_0 and q4_0 (llama/13705)
* [CANN]Support MUL_MAT_ID Q8 && Q4

Signed-off-by: noemotiovon <757486878@qq.com>

* codestyle adjustment

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-05-27 18:03:00 +03:00
Xuan-Son Nguyen
3e7eaccf55 ggml : fix the order of ggml_unary_op (llama/13718) 2025-05-27 18:03:00 +03:00
Jeff Bolz
191f040414 vulkan: support CPY from any type to itself (llama/13695)
Reuse the f16/f32 copy shaders, and just scale the number of elements
according to the type size.
2025-05-27 18:03:00 +03:00
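The idea in the CPY commit above is that a same-type copy is just a byte copy, so an existing fixed-width copy path can be reused with a rescaled element count. A minimal Python sketch of that arithmetic follows. The helper names `scaled_copy_count` and `copy_as_words`, and the preference for 4-byte words over 2-byte words, are assumptions made for illustration and are not taken from the Vulkan backend.

```python
def scaled_copy_count(n_bytes: int) -> tuple:
    """Return (word_size, word_count) for copying n_bytes as fixed-width words.

    Assumption: prefer 4-byte (f32-sized) words and fall back to
    2-byte (f16-sized) words when the byte count is not a multiple of 4.
    """
    for word_size in (4, 2):
        if n_bytes % word_size == 0:
            return word_size, n_bytes // word_size
    raise ValueError("byte count must be a multiple of 2")


def copy_as_words(src: bytes) -> bytes:
    """Copy src word by word, the way a fixed-width copy kernel would."""
    word_size, count = scaled_copy_count(len(src))
    out = bytearray(len(src))
    for i in range(count):
        lo = i * word_size
        out[lo:lo + word_size] = src[lo:lo + word_size]
    return bytes(out)
```

For example, a 34-byte payload is not a multiple of 4, so this sketch would copy it as 17 two-byte words.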
Jeff Bolz
2d49d4a9b5 vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (llama/13696) 2025-05-27 18:03:00 +03:00
Judd
000d65befb use LOG_WARN to replace std::cerr (llama/13657) 2025-05-27 18:03:00 +03:00
Nicolò Scipione
f0803e6646 sycl : Remove waits from function calls (llama/13702)
* removes the waits in async memcpy functions
2025-05-27 18:03:00 +03:00
Ewan Crawford
730a00be8a SYCL: Avoid using SYCL-Graph for unsupported nodes (llama/13587)
Currently, when targeting a CUDA backend through SYCL, running
`GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` hits
two operations that throw an exception from blocking
waits during queue recording.

* `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187
* `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074

We've noticed that `ggml-cuda.cu` has the
[check_node_graph_compatibility_and_refresh_copy_ops](39e73ae0d6/ggml/src/ggml-cuda/ggml-cuda.cu (L2458-L2458))
method for checking if a graph can be used, even if enabled. I've taken a
similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking
if a graph can be used for the operations even if a user has asked for it to be
enabled.
2025-05-27 18:03:00 +03:00
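The check described in the SYCL-Graph commit above can be pictured as a scan over the graph's nodes that rejects graph execution when any node uses an operation known to block during recording. The Python sketch below is only an illustration: the function name `graph_is_usable` and the representation of nodes as op-name strings are hypothetical, and the set of ops is limited to the two named in the commit message.

```python
# Ops the commit message reports as using blocking waits during queue recording.
UNSUPPORTED_GRAPH_OPS = {"CONCAT", "MUL_MAT_ID"}


def graph_is_usable(node_ops, graphs_requested=True):
    """Return True only if graphs were requested and every node op is recordable.

    node_ops is a hypothetical list of op names, one per compute-graph node.
    """
    if not graphs_requested:
        return False
    return not any(op in UNSUPPORTED_GRAPH_OPS for op in node_ops)
```

With a check of this kind, a compute graph containing `MUL_MAT_ID` would fall back to regular queue submission even when the user asked for graphs.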
Henry Linjamäki
316600e8ee opencl: Add support for multiple devices (llama/12622)
* opencl: Add support for multiple devices

... but limited to one platform. A platform with a GPU will be preferred.

Additionally:

* Filter out devices that lack capabilities needed by the backend
  implementation (half support, OpenCL 2.0+, etc).

* Make ggml_backend_opencl_reg() thread-safe.

* fixup: fix an error in sync_with_other_backends

... when there is only one OpenCL device available.
2025-05-27 18:03:00 +03:00
Henry Linjamäki
42f2b3bb65 opencl: fix a couple of crashes (llama/12795)
* opencl: fix a couple of crashes

* fix kernel launches failing on devices that do not support
  non-uniform work-groups. When non-uniform work-groups are not
  supported, set `local_work_size` to NULL (= let the driver choose the
  work-group sizes). This patch does not cover everything, just the
  cases tested by test-backend-ops.

* fix sub-buffer creation failing because `cl_buffer_region::origin` was
  not aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.

* OpenCL: query non-uniform WG sizes only on OpenCL 3.0+
2025-05-27 18:03:00 +03:00
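The two fixes in the OpenCL commit above come down to simple arithmetic, sketched here in Python. This is an illustration under stated assumptions, not the patch itself: the helper names are hypothetical, and rounding the origin down while widening the region is only one possible way to satisfy the alignment rule. What the sketch relies on from the OpenCL specification is that `CL_DEVICE_MEM_BASE_ADDR_ALIGN` is reported in bits and that, without non-uniform work-group support, the global size must be a multiple of the local size.

```python
def aligned_sub_buffer_region(offset_bytes, size_bytes, base_addr_align_bits):
    """Return (origin, size, extra_offset) with origin aligned for a sub-buffer.

    base_addr_align_bits mirrors CL_DEVICE_MEM_BASE_ADDR_ALIGN (a value in bits).
    The origin is rounded down; extra_offset is what a kernel would have to add
    to reach the originally requested offset (hypothetical handling).
    """
    align_bytes = base_addr_align_bits // 8
    origin = (offset_bytes // align_bytes) * align_bytes
    extra = offset_bytes - origin
    return origin, size_bytes + extra, extra


def pick_local_work_size(global_size, preferred_local, non_uniform_supported):
    """Return a local size, or None to let the driver choose (NULL in the C API)."""
    if non_uniform_supported or global_size % preferred_local == 0:
        return preferred_local
    return None
```

For instance, with a 1024-bit (128-byte) alignment requirement, a requested offset of 1000 bytes maps to an origin of 896 bytes plus an extra offset of 104 bytes.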
Xuan-Son Nguyen
dd6ef64060 ggml : add ggml_gelu_erf() (llama/13667)
* ggml : add ggml_gelu_na (not approximated)

* fix naming order

* rename na --> erf

* apply review suggestions

* revert naming order
2025-05-27 18:03:00 +03:00
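For context on the `ggml_gelu_erf()` commit above: the exact ("not approximated") GELU is conventionally defined through the Gaussian error function, while the widely used fast variant replaces it with a tanh expression. The Python sketch below shows both formulas for comparison. It is a reference for the math only and is not ggml's implementation; the function names are chosen here for illustration.

```python
import math


def gelu_erf(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Common tanh approximation of GELU, shown for comparison."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))
```

The two variants agree closely for typical inputs but are not bit-identical, which is presumably why a separate op is useful for models that specify the erf form.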
R0CKSTAR
131ee546ca musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy (llama/13647)
* musa: fix build warning (unused parameter)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: upgrade MUSA SDK version to rc4.0.1

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Update ggml/src/ggml-cuda/cpy.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-05-27 18:03:00 +03:00
Eve
4712f7b663 vulkan: fix warnings (llama/13626)
* small fixes

* remove ifdef
2025-05-27 18:03:00 +03:00
Johannes Gäßler
926fe234e9 CUDA: skip fully masked-out KV in FA vec kernel (llama/13584)
* CUDA: skip fully masked-out KV in FA vec kernel
2025-05-27 18:03:00 +03:00
Svetlozar Georgiev
f44b53480f sycl: disable reorder for sycl mulmat (llama/13536) 2025-05-27 18:03:00 +03:00
Georgi Gerganov
e04e8f1c79 metal : fix typo in FA kernel comments (llama/13651) 2025-05-27 18:03:00 +03:00
Nicolò Scipione
ee3f177cba sycl : Overcoming workaround for mmap() allocation on Windows (llama/13482)
* Remove mmap workaround on Windows

After some testing I found that mmap is supported on Windows and for
many GPUs on Linux. Therefore I removed the workaround for Windows, since
it is not necessary.

* Update llama-bench README

The SYCL backend introduced a workaround that allows llama-bench to run
without specifying the `--mmp 0` flag
2025-05-27 18:03:00 +03:00
0cc4m
0b69f74e15 Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (llama/13607) 2025-05-27 18:03:00 +03:00
Georgi Gerganov
e415db0ed7 sync : ggml 2025-05-27 18:03:00 +03:00
Daniel Bevenius
2bb7694edb
docs : convert README_sycl.md to utf8 format [no ci] (#3191)
This commit updates the README_sycl.md file to use UTF-8 encoding.

The motivation for this is that while this file displays correctly on
GitHub, it fails to render with tools that expect UTF-8 encoding.
For example, this is the case when using `grip` to view the file locally.
2025-05-27 10:53:50 +02:00
Daniel Bevenius
450de0787e
node : enable no_prints to suppress all output (#3189)
This commit enables the node addon to suppress all output, even the
result of the transcription, if the no_prints parameter is set to true.

The motivation for this is that the node addon has a
fulfillment handler/success callback to process the transcription
result. It can be useful to disable printing of
the transcription result to the console, so that the user can handle
the result in their own way.

Refs: https://github.com/ggml-org/whisper.cpp/issues/3176
2025-05-27 05:51:47 +02:00
matteng1
ea9f206f18
talk-llama : fix for Swedish umlauts + expose model inference settings in talk-llama.cpp (#3187)
Quick fix for not removing Swedish umlauts.

* Update talk-llama.cpp

Expose model inference settings to the user instead of hard-coding them. The defaults are unchanged.

* Update examples/talk-llama/talk-llama.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-05-26 07:57:39 +02:00
KITAITI Makoto
13d92d08ae
docs : fix VAD section heading levels (#3186) 2025-05-23 10:38:26 +02:00
Daniel Bevenius
aab6976465
ci : use dynamic libopenblas.dll for windows-blas (#3177)
* ci : use dynamic libopenblas.dll for windows-blas

This commit updates the windows-blas job to use the dynamic (can load
different kernels depending on the CPU arch) libopenblas.dll instead of
the "static" openblas.dll that gets installed by vcpkg.

The motivation for this change is that there have been reports of
performance drops in later versions specifically related to BLAS. Please
see the links below for more details.

Resolves: https://github.com/ggml-org/whisper.cpp/issues/3166
Refs: https://github.com/ggml-org/whisper.cpp/issues/2666#issuecomment-2885978811
2025-05-23 05:48:08 +02:00
Sacha Arbonel
78b31ca782
server : Add k6 Load Testing Script (#3175)
* add load testing script and update README for k6 integration
2025-05-22 10:03:04 +02:00
Daniel Bevenius
cbe557f9b1
docs : add VAD model download instructions [no ci] (#3180) 2025-05-22 07:49:29 +02:00
Alpaim
273af4aab9
docs : replace typo "]" with ")" in README (#3179) 2025-05-22 05:49:44 +02:00
Daniel Bevenius
bd1cb0c8e3
whisper : remove redundant assignments (#3178)
This commit removes some redundant assignments in the function
`whisper_exp_compute_token_level_timestamps`.

The motivation for this is that tokens[j] and token are references to
the same object, which can be a little confusing when reading the
code.
2025-05-21 13:23:20 +02:00
Jugal Haresh Sheth
62dc8f7d7b
whisper : update CMakeLists.txt to handle deprecated gpu Warnings (#3163)
* Fix CMakeLists.txt to handle deprecated gpu Warnings

* Conditionally apply -Wno-deprecated-gpu-targets only when GGML_CUDA is enabled

* Conditionally apply -Wno-deprecated-gpu-targets only when GGML_CUDA is enabled and not MSVC

---------

Co-authored-by: Jugal Sheth <jugal.sheth@marineai.co.uk>
2025-05-20 11:58:25 +02:00
Daniel Bevenius
2c4b904596
ruby : add GGML_SYCL_DNN option to ruby bindings (#3172)
This commit adds the `GGML_SYCL_DNN` option to the Ruby bindings for
the GGML library. This option was added to ggml in
commit 5e7e07758a5f3172380500e173ca71f679bbef1e ("sycl: use oneDNN for
matrices multiplication").

The motivation for this change is to enable the CI build to pass.
2025-05-19 17:59:43 +02:00
Georgi Gerganov
6b6cf19c65 talk-llama : sync llama.cpp
ggml-ci
2025-05-19 14:58:39 +03:00
Georgi Gerganov
05501c218d sync : ggml
ggml-ci
2025-05-19 14:58:39 +03:00
Chenguang Li
9da3fc27be CANN: Support MOE Model MUL_MAT_ID (llama/13042)
Signed-off-by: noemotiovon <757486878@qq.com>
2025-05-19 14:58:39 +03:00
Gilad S.
2c13651e08 cmake: use the current build config for vulkan-shaders-gen (llama/13595)
* fix: use the current build config for `vulkan-shaders-gen`

* fix: only pass a valid build type to `--config`
2025-05-19 14:58:39 +03:00
Jeff Bolz
13dca86c56 vulkan: move common FA code to flash_attn_base.comp (llama/13556)
* vulkan: move common FA code to flash_attn_base.comp

* vulkan: move common FA index/stride setup code to flash_attn_base.comp

* build fix
2025-05-19 14:58:39 +03:00
Jeff Bolz
6d61a09bc4 vulkan: use scalar FA rather than coopmat2 when N==1 (llama/13554) 2025-05-19 14:58:39 +03:00
Georgi Gerganov
4fedad988b metal : add FA-vec kernel for head size 64 (llama/13583)
ggml-ci
2025-05-19 14:58:39 +03:00
Łukasz Ślusarczyk
a8e17a244d sycl : fixed compilation warnings (llama/13582) 2025-05-19 14:58:39 +03:00
Diego Devesa
0c76acd08a gguf : use ggml log system (llama/13571)
* gguf : use ggml log system

* llama : remove unnecessary new lines in exception messages
2025-05-19 14:58:39 +03:00
Atharva Dubey
27964db1be sycl: simplify bin_bcast_kernel (llama/13383) 2025-05-19 14:58:39 +03:00
Svetlozar Georgiev
8081e7a23d sycl: reordered Q4_K MMVQ (llama/13109) 2025-05-19 14:58:39 +03:00
Łukasz Ślusarczyk
d807c497a4 sycl: use oneDNN for matrices multiplication (llama/12972) 2025-05-19 14:58:39 +03:00
Yibo Cai
8e9bf548f4 arm64: optimize q6_k_q8_k kernel with i8mm (llama/13519)
This PR improves the q6_k_q8_k GEMM kernel using the arm64 i8mm instruction.

Tested on neoverse-n2 with llama3 8b q6_k quantization model.
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```
2025-05-19 14:58:39 +03:00
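The uplift ranges quoted in the i8mm commit above can be recomputed from its benchmark table. The Python sketch below copies the table values verbatim and derives the relative speedup per batch size; the variable and function names are chosen here for illustration.

```python
# (batch, S_PP original, S_PP this pr, S_TG original, S_TG this pr), in t/s,
# copied from the llama-batched-bench table in the commit message.
ROWS = [
    (1, 78.52, 109.18, 18.63, 18.88),
    (2, 84.62, 123.94, 34.54, 36.92),
    (4, 84.36, 122.49, 52.65, 61.32),
    (8, 90.52, 138.87, 63.46, 84.41),
    (16, 90.11, 138.56, 71.04, 101.33),
    (32, 89.81, 137.79, 75.14, 110.47),
]


def uplift_pct(original: float, new: float) -> float:
    """Relative throughput gain in percent."""
    return (new / original - 1.0) * 100.0


# Prompt-processing uplift for every batch size.
pp_uplift = [uplift_pct(r[1], r[2]) for r in ROWS]
# Token-generation uplift for batch size 4 and above, as in the commit text.
tg_uplift_b4 = [uplift_pct(r[3], r[4]) for r in ROWS if r[0] >= 4]
```

Rounded to whole percent, these values span about 39 to 54 for S_PP and about 16 to 47 for S_TG at batch size 4 and above, in line with the ranges stated in the commit message.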
Johannes Gäßler
0dda27bc0b CUDA: fix crash on large batch size for quant. MoE (llama/13537) 2025-05-19 14:58:39 +03:00
Johannes Gäßler
ffa4720f25 CUDA: faster Deepseek FA, add Turing support (llama/13435) 2025-05-19 14:58:39 +03:00