Commit Graph

895 Commits

Author SHA1 Message Date
1e16340f4b threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling (llama/12995)
* threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling

We talked about adding LOW priority for GGML threads in the original threadpool PR.
It might be useful for some cases to avoid contention.

Latest Windows ARM64 releases started parking (offlining) the CPU cores
more aggresively which results in suboptimal performance with n_threads > 4.
To deal with that we now disable Power Throttling for our threads for the NORMAL
and higher priorities.

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* threading: disable SetThreadInfo() calls for older Windows versions

* Update tools/llama-bench/llama-bench.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-06-01 15:14:44 +03:00
4a50254998 CUDA: add a prop in ggml_cuda_device_infor for distinguish iGPU or dGPU in cuda (#13856) (llama/13895)
* 1.  add "integrated" in ggml_cuda_device_info for distinguish whether it is Intergrate_gpu or discrete_gpu
2. Adjust the func:"ggml_backend_cuda_device_supports_buft" for this new feature

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted code indentation

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Fixed incorrect setting of variable types

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted the judgment logic

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* add a host_buft assert in case of integrated_cuda_device with func:'evaluate_and_capture_cuda_graph()'

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Add a defensive security assert

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted the support judgment logic.

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* revoke the suggest commit changes due to it's not applicable in jetson_device

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Add parentheses to enforce operator precedence​

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Fix ci bug: add a spaces

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: yangxiao <yang_xl@tju.edu.cn>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: yangxiao <yangxl_zz@qq.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-06-01 15:14:44 +03:00
a5aff28198 CUDA: fix typo in FlashAttention code (llama/13926) 2025-06-01 15:14:44 +03:00
6c0472ab8f sched : avoid changing cur_copy when a graph is already allocated (llama/13922) 2025-06-01 15:14:44 +03:00
b14cee184a cuda : prevent using split buffers with 3d/4d matrices (llama/13919) 2025-06-01 15:14:44 +03:00
f7f92d0aab SYCL: Add mrope kernel (llama/13755)
* SYCL: Add mrope kernel

* feat: Optimize rope operations with vectorization

Uses `sycl::vec` to load and store two elements at a time,
significantly improving performance in `rope_norm`,
`rope_neox`, and `rope_multi`. This reduces the number of memory
accesses and leverages SIMD instructions for faster execution.

* Use ceil_div
2025-06-01 15:14:44 +03:00
1893359cfd cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (llama/13890) 2025-06-01 15:14:44 +03:00
ea643c6ae3 arm64: optimize q4_k_q8_k kernel with i8mm (llama/13886)
This PR improves q4_k_q8_k gemm kernel with arm64 i8mm instruction.

Tested on neoverse-n2 with llama3 8b q4_k_m quantization model.
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
2025-06-01 15:14:44 +03:00
1d7b3c79f4 cmake: Factor out CPU architecture detection (llama/13883)
* cmake: Define function for querying architecture

The tests and results match exactly those of src/CMakeLists.txt

* Switch arch detection over to new function
2025-06-01 15:14:44 +03:00
ccfaac2bb0 ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (llama/13882)
* F32-Mamba-Seq_Scan-SVE

* Fix formatting

* ggml : missing space

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-01 15:14:44 +03:00
1230d37bca ggml: aarch64: Implement SVE F32 kernels for vector functions (llama/13843)
* F32-Mamba-SVE

* F32-Mamba-SVE

* Resolve test errors-1

* Resolve test errors-2

* F32-vec-SVE

* F32-vec-SVE

* F32-vec-SVE
2025-06-01 15:14:44 +03:00
9a500394ad CUDA: fix FA tg at long context for CC >= 8.9 (llama/13852) 2025-06-01 15:14:44 +03:00
0035b8527c CANN: Add SOC TYPE printing in cmake configuration (llama/13837) 2025-06-01 15:14:44 +03:00
3623186312 opencl: add new ops - argsort, div, sub, addrows, sigmoid, group_norm (llama/13787)
* opencl: add `argsort`

* opencl: add `div`

* opencl: add `add_rows`

* opencl: add `sub`

* opencl: add `sigmoid`, both `f16` and `f32`

* opencl: add `group_norm`
2025-06-01 15:14:44 +03:00
67beac47f3 opencl: mark mul_mat f32f32 as supporting non-contiguous tensors (llama/13790) 2025-06-01 15:14:44 +03:00
47a19bae25 vulkan: use timestamp queries for GGML_VULKAN_PERF (llama/13817)
Also change it to be controlled by an env var rather than cmake flag
2025-06-01 15:14:44 +03:00
3d5c7ca4bc SYCL: add gelu_erf kernel (llama/13749)
* SYCL: add gelu_erf kernel

* refactor code

Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>

* Use scope_op_debug_print

---------

Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>
2025-06-01 15:14:44 +03:00
4dfb2c2215 ggml : add ggml_repeat_4d (llama/13824) 2025-06-01 15:14:44 +03:00
ad433403ce vulkan : Remove unexpected ; (ggml/1253) 2025-06-01 15:14:44 +03:00
4064dd6484 cmake : Fix broken CMake error messages (ggml/1252) 2025-06-01 15:14:44 +03:00
fd75c4995b ggml : remove ggml_graph_import and ggml_graph_export declarations (ggml/1247)
The implementation is already deleted with commit 9d0762e.

closes: #1235
2025-06-01 15:14:44 +03:00
4d18e52f55 ggml : Fix backtrace breaking Windows build (#3203) 2025-05-29 13:26:58 +03:00
48dddbbac1 ggml : install dynamic backends (ggml/1240) 2025-05-29 09:56:26 +03:00
5ea2c37a4c ggml : Print backtrace on uncaught C++ exceptions (ggml/1232)
The goal is to have what users call "full logs" contain the backtrace.

This is registered upon ggml_init. Also fixes a minor fd leak on Linux.
2025-05-29 09:56:26 +03:00
5720426d97 whisper : install shared libs when using GGML_BACKEND_DL (#3195) 2025-05-28 10:15:04 +02:00
15ae9dc2a4 ggml : riscv: add xtheadvector support (llama/13720)
* ggml : riscv: add xtheadvector support

* ggml : clean up some macro usage
2025-05-27 18:03:00 +03:00
2e7a1e3e43 ggml-cpu: x86 feature detection is specific to x86 (llama/13811) 2025-05-27 18:03:00 +03:00
b75babebb2 ggml : allow CUDA graphs when using pipeline parallelism (llama/13814) 2025-05-27 18:03:00 +03:00
cc7a0105ef cuda : avoid cuGetErrorString (llama/13791)
ggml-ci
2025-05-27 18:03:00 +03:00
195fde8804 SYCL: Add non contiguous support in RMS_NORM and NORM kernels (llama/13611)
* SYCL: Add non contiguous input support to norm kernel

* refactor and add RMS_NORM non contiguous input support

ggml-ci

* restore subgroup reduction for multi-subgroup thread blocks in norm kernels

* Swap grid dims of nsamples and nrows

ggml-ci

* Revert "Swap grid dims of nsamples and nrows"

This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf.

* restore not required changes
ggml-ci

* address review comments: change it to more like SYCL

* Use a common function to calculate offset

* remove wrap around logic for handling broadcasts

* remove static from calculate_offset fn and use ceil_div
2025-05-27 18:03:00 +03:00
25e27904ca sycl: Add more debug prints (llama/13640) 2025-05-27 18:03:00 +03:00
474f7be8b6 vulkan: mark IM2COL as supporting non-contig (llama/13783) 2025-05-27 18:03:00 +03:00
e35fecc2a1 CANN: Add the basic supports of Flash Attention kernel (llama/13627)
* cann: add the basic FA support

* cann: update the readme

* cann: update the FlashAttention with PSEShift

* cann: update the input parameters in FA

* cann: update the alibi with max_bias

* cann: add the constrints of softcap

* cann: update the docs CANN.md

* cann: update the docs CANN.md

* cann: fix typo of CANN.md

* cann: add some comments and update the CANN.md

* cann: update the CANN.md

* cann: update the inner precise for fusedInferAttention

* cann: update the constraints of flash_attn_ext on ggml-cann.cpp

* cann: clean the whitespace

* cann: clean the whitespace

* cann: add a new endline
2025-05-27 18:03:00 +03:00
1cd7028428 SYCL: revert "sycl: simplify bin_bcast_kernel (ggml/13383)" (llama/13752)
Temporarily reverted due to failing fp16 DIV operation

This reverts commit 02cdd2d8b092b5a4bb18e013c6887ce49ba20ac5.

ggml-ci
2025-05-27 18:03:00 +03:00
99596d6031 ggml-cpu : set openmp wait time if not set (llama/13758) 2025-05-27 18:03:00 +03:00
2d6c6862f7 ggml : add ggml_gelu_erf() CUDA kernel (llama/13719)
* ggml : add ggml_gelu_erf() CUDA kernel

* missing semicolon
2025-05-27 18:03:00 +03:00
f1576b2659 CUDA: fix race condition in FA vector kernels (llama/13742) 2025-05-27 18:03:00 +03:00
994b4f86ab CANN: Support MUL_MAT_ID for q8_0 and q4_0 (llama/13705)
* [CANN]Support MUL_MAT_ID Q8 && Q4

Signed-off-by: noemotiovon <757486878@qq.com>

* codestyle adjustment

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-05-27 18:03:00 +03:00
3e7eaccf55 ggml : fix the order of ggml_unary_op (llama/13718) 2025-05-27 18:03:00 +03:00
191f040414 vulkan: support CPY from any type to itself (llama/13695)
Reuse the f16/f32 copy shaders, and just scale the number of elements
according to the type size.
2025-05-27 18:03:00 +03:00
2d49d4a9b5 vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (llama/13696) 2025-05-27 18:03:00 +03:00
000d65befb use LOG_WARN to replace std::cerr (llama/13657) 2025-05-27 18:03:00 +03:00
f0803e6646 sycl : Remove waits from function calls (llama/13702)
* removes the waits in async memcpy functions
2025-05-27 18:03:00 +03:00
730a00be8a SYCL: Avoid using with SYCL-Graph for unsupported nodes (llama/13587)
Currently on a CUDA backend to SYCL when running
`GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` there
are two operations that throw an exception from the blocking
waits during queue recording.

* `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187
* `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074

We've noticed that `ggml-cuda.cu` has the
[check_node_graph_compatibility_and_refresh_copy_ops](39e73ae0d6/ggml/src/ggml-cuda/ggml-cuda.cu (L2458-L2458))
method for checking if a graph can be used, even if enabled. I've taken a
similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking
if a graph can be used for the operations even if a user has asked for it to be
enabled.
2025-05-27 18:03:00 +03:00
316600e8ee opencl: Add support for multiple devices (llama/12622)
* opencl: Add support for multiple devices

... but limited to one platform. A platform with a GPU will be preferred.

Additionally:

* Filter out devices that lack capabilities needed by the backend
  implementation (half support, OpenCL 2.0+, etc).

* Make ggml_backend_opencl_reg() thread-safe.

* fixup: fix an error in sync_with_other_backends

... when there is only one OpenCL device available.
2025-05-27 18:03:00 +03:00
42f2b3bb65 opencl: fix couple crashes (llama/12795)
* opencl: fix couple crashes

* fix kernel launches failed on devices which do not support
  non-uniform work-groups. When non-uniform work-groups are not
  supported, set `local_work_size` to NULL (= let driver choose the
  work-group sizes). This patch does not cover everything - just the
  cases tested by test-backend-ops.

* fix sub-buffer creation failed due to `cl_buffer_region::origin` not
  being aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.

* OpenCL: query non-uniform WG sizes only on OpenCL 3.0+
2025-05-27 18:03:00 +03:00
dd6ef64060 ggml : add ggml_gelu_erf() (llama/13667)
* ggml : add ggml_gelu_na (not approximated)

* fix naming order

* rename na --> erf

* apply review suggesions

* revert naming order
2025-05-27 18:03:00 +03:00
131ee546ca musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy (llama/13647)
* musa: fix build warning (unused parameter)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: upgrade MUSA SDK version to rc4.0.1

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Update ggml/src/ggml-cuda/cpy.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-05-27 18:03:00 +03:00
Eve
4712f7b663 vulkan: fix warnings (llama/13626)
* small fixes

* remove ifdef
2025-05-27 18:03:00 +03:00
926fe234e9 CUDA: skip fully masked-out KV in FA vec kernel (llama/13584)
* CUDA: skip fully masked-out KV in FA vec kernel
2025-05-27 18:03:00 +03:00