Commit Graph

2983 Commits

Author SHA1 Message Date
Johannes Gäßler
24d3524bfd CUDA: fix pointer incrementation in FA (llama/14916) 2025-08-18 20:30:45 +03:00
Alberto Cabrera Pérez
923619ffd5 sycl: refactor quantization to q8_1 (llama/14815)
* sycl: quantization to q8_1 refactor

* Refactored src1 copy logic in op_mul_mat
2025-08-18 20:30:45 +03:00
Kai Pastor
45784c05ae cmake : Fix BLAS link interface (ggml/1316) 2025-08-18 20:30:45 +03:00
Kai Pastor
01bdc522e0 vulkan : fix 32-bit builds (ggml/1313)
The pipeline member can be cast to VkPipeline.
This is a VkPipeline_T* on 64 bit but a uint64_t on 32 bit.
Cf. VK_DEFINE_NON_DISPATCHABLE_HANDLE documentation.
2025-08-18 20:30:45 +03:00
Georgi Gerganov
9446500b9d scripts : update sync scripts 2025-08-18 20:30:45 +03:00
Daniel Bevenius
040510a132 node : add win platform check for require path (#3363)
This commit adds a check for the platform in use and adjusts the path to
the addon.node shared library.

The motivation for this change is that on Windows the addon.node library is
built into build\bin\Release, while on Linux it is built into build/Release.

Resolves: https://github.com/ggml-org/whisper.cpp/issues/3360
2025-08-15 14:54:23 +02:00
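The path selection described in commit 040510a132 can be sketched roughly as follows. This is an illustrative Python sketch, not the actual Node.js change: the helper name `addon_path` and its `platform` argument are hypothetical, and treating every non-Windows platform like Linux is an assumption. Only the two build directories and the `addon.node` file name come from the commit message.

```python
import ntpath
import posixpath


def addon_path(platform: str) -> str:
    """Return a relative path to addon.node for the given platform string.

    Hypothetical helper: `platform` is assumed to be "win32" on Windows,
    which is the value Node.js reports in `process.platform`.
    """
    if platform == "win32":
        # Windows builds place the library in build\bin\Release.
        return ntpath.join("build", "bin", "Release", "addon.node")
    # Assumption: all other platforms follow the Linux layout, build/Release.
    return posixpath.join("build", "Release", "addon.node")


print(addon_path("win32"))
print(addon_path("linux"))
```

Using `ntpath` and `posixpath` explicitly keeps the separators fixed regardless of the machine the sketch runs on.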
ustas
16c2924cb2 ci : update main-cuda.Dockerfile (#3371)
* Update main-cuda.Dockerfile

Bump CUDA to 13.0.0 and exclude the `compute_50` arch from build because it was deprecated and now throws an error.

* Add quotes in main-cuda.Dockerfile
2025-08-13 19:30:45 +02:00
Dw9
5527454cdb whisper : fixed crash in GPU device selection on multi-GPU systems (#3372) 2025-08-12 13:58:52 +03:00
Georgi Gerganov
b02242d0ad wasm : change ggml model host to HF (#3369) 2025-08-10 13:00:17 +03:00
Adam Debono
4245c77b65 ruby : Add ruby binding for max_len (#3365)
* add ruby binding for max_len

* add test, update param numbers
2025-08-07 11:37:45 +09:00
Daniel Bevenius
0becabc8d6 stream.wasm : add language selection support (#3354)
* stream.wasm : add language selection support

This commit adds support for selecting the language in the stream.wasm
example. This includes adding the model `base`, which supports
multilingual transcription, and allowing the user to select a language
from a dropdown menu in the HTML interface.

The motivation for this is that it allows users to transcribe audio in
various languages.

Refs: https://github.com/ggml-org/whisper.cpp/issues/3347

* squash! stream.wasm : add language selection support

Remove strdup() for language in stream.wasm and update button text for
base (should not be "base.en" but just "base").
2025-08-02 07:03:04 +02:00
Georgi Gerganov
f7502dca87 whisper : reset conv scheduler when CoreML is used (#3350)
ggml-ci
2025-07-30 21:54:58 +03:00
Georgi Gerganov
28b39c624e ggml : remove old kompute, cann (skip) (#3349)
ggml-ci
2025-07-30 16:08:57 +03:00
Georgi Gerganov
d0a9d8c7f8 talk-llama : sync llama.cpp 2025-07-28 13:02:32 +03:00
Georgi Gerganov
5b4646df1a sync : ggml
ggml-ci
2025-07-28 13:02:32 +03:00
Erik Scholz
d96f4d8ea1 vulkan : add fp16 support for the conv_2d kernel (llama/14872)
* add f16 to conv_2d testing
* weaken conv2d test error threshold
2025-07-28 13:02:32 +03:00
Jeff Bolz
5693b857d2 vulkan: skip empty set_rows to avoid invalid API usage (llama/14860) 2025-07-28 13:02:32 +03:00
deepsek
b275e52b46 HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (llama/14624)
This commit adds support for MFMA instructions to MMQ. CDNA1/GFX908, CDNA2/GFX90a and CDNA3/GFX942 are supported by the MFMA-enabled code path added by this commit. The code path and stream-k are only enabled on CDNA3 for now, as they fail to outperform BLAS in all cases on the other devices.
BLAS is currently only consistently outperformed on CDNA3 due to issues in the AMD-provided BLAS libraries.
This commit also improves the awareness of MMQ towards different warp sizes and as a side effect improves the performance of all quant formats besides q4_0 and q4_1, which regress slightly, on GCN gpus.
2025-07-28 13:02:32 +03:00
hipudding
4692558a1f CANN: Implement GLU ops (llama/14884)
Implement REGLU, GEGLU, SWIGLU ops according to #14158
2025-07-28 13:02:32 +03:00
R0CKSTAR
8643960acc musa: fix build warnings (unused variable) (llama/14869)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-28 13:02:32 +03:00
Aaron Teo
6629201471 ggml-cpu : disable GGML_NNPA by default due to instability (llama/14880)
* docs: update s390x document for sentencepiece

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit e086c5e3a7ab3463d8e0906efcfa39352db0a48d)

* docs: update huggingface links + reword

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 8410b085ea8c46e22be38266147a1e94757ef108)

* ggml-cpu: disable ggml-nnpa compile flag by default

fixes #14877

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 412f4c7c88894b8f55846b4719c76892a23cfe09)

* docs: update s390x build docs to reflect nnpa disable

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit c1eeae1d0c2edc74ab9fbeff2707b0d357cf0b4d)

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-28 13:02:32 +03:00
Gabe Goodhart
0b0de0bbf2 metal: SSM_SCAN performance (llama/14743)
* feat: Add s_off as a parameter in the args struct

This may not be necessary, but it more closely mirrors the CUDA kernel

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* perf: Parallelize mamba2 SSM_SCAN metal kernel over d_state

This is a first attempt at optimizing the metal kernel. The changes here
are:

- Launch the kernel with a thread group of size d_state
- Use simd groups and shared memory to do the summation for the y
  computation

When tested with G4 tiny preview, this shows roughly a 3x speedup on
prefill and 15% speedup on decode.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Update logic to correctly do the multi-layer parallel sum

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Correctly size the shared memory buffer and assert expected size relationships

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Compute block offsets once rather than once per token

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Use local variable for state recursion

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Use a secondary simd_sum instead of a for loop

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add assertion and comment about relationship between simd size and num simd groups

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parallelize over d_state for mamba-1

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parallel sum in SSM_CONV

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Revert "feat: Parallel sum in SSM_CONV"

After discussion with @compilade, the size of the parallelism here is
not worth the cost in complexity or overhead of the parallel for.

https://github.com/ggml-org/llama.cpp/pull/14743#discussion_r2223395357

This reverts commit 16bc059660c1c59e566628201c0ca2c20c9f4bc3.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Simplify shared memory sizing

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-Authored-By: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-28 13:02:32 +03:00
lhez
d414c3f6ac opencl: add fused rms_norm_mul (llama/14841)
* opencl: add fused `rms_norm` + `mul`

* opencl: improve workgroup size for `rms_norm_mul`
2025-07-28 13:02:32 +03:00
Oliver Simons
bbf2389919 ggml : remove invalid portPos specifiers from dot files (llama/14838)
Neither "g" nor "x" is a valid portPos specifier per the official
[graphviz documents](https://graphviz.org/docs/attr-types/portPos/):

> If a compass point is used, it must have the form "n","ne","e","se","s","sw","w","nw","c","_".

I tested locally that it falls back to the default portPos specifier if an
invalid portPos is specified. As a consequence, we can remove the associated
code.
2025-07-28 13:02:32 +03:00
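The rule quoted in commit bbf2389919 can be checked with a small membership test. This is a minimal Python sketch for illustration only. The names `VALID_COMPASS_POINTS` and `is_valid_compass_point` are hypothetical and do not come from ggml; the set of values is copied from the quotation in the commit message.

```python
# Compass points as quoted in the commit message from the Graphviz
# portPos documentation.
VALID_COMPASS_POINTS = frozenset(
    {"n", "ne", "e", "se", "s", "sw", "w", "nw", "c", "_"}
)


def is_valid_compass_point(value: str) -> bool:
    """Return True if `value` is one of the quoted compass points."""
    return value in VALID_COMPASS_POINTS


for candidate in ("g", "x", "ne", "_"):
    print(candidate, is_valid_compass_point(candidate))
```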
Chris Rohlf
56350ecc12 rpc : check for null buffers in get/set/copy tensor endpoints (llama/14868) 2025-07-28 13:02:32 +03:00
Diego Devesa
270fa9b25c sched : fix multiple evaluations of the same graph with pipeline parallelism (llama/14855)
ggml-ci
2025-07-28 13:02:32 +03:00
R0CKSTAR
89ae789450 musa: upgrade musa sdk to rc4.2.0 (llama/14498)
* musa: apply mublas API changes

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: update musa version to 4.2.0

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: restore MUSA graph settings in CMakeLists.txt

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: disable mudnnMemcpyAsync by default

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: switch back to non-mudnn images

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* minor changes

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: restore rc in docker image tag

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-28 13:02:32 +03:00
Kai Pastor
5823eabc78 cmake : Indent ggml-config.cmake (ggml/1310) 2025-07-28 13:02:32 +03:00
Alberto Cabrera Pérez
7dc5ae2d6a sycl: fixed semantics of block offset calculation (llama/14814) 2025-07-28 13:02:32 +03:00
Georgi Gerganov
faedce5dcb metal : fix fusion across different encoders (llama/14849)
* metal : fix fusion across different encoders

ggml-ci

* cont : add assertion

ggml-ci
2025-07-28 13:02:32 +03:00
Donghyeon Jeong
e648f9f079 sycl: fix undefined variable in work group size check (llama/14843) 2025-07-28 13:02:32 +03:00
Johannes Gäßler
95efcf011d CUDA: fix overflow in FA, tune performance (llama/14840) 2025-07-28 13:02:32 +03:00
Johannes Gäßler
8272aa9f14 CUDA: fix compilation with GGML_CUDA_F16 (llama/14837) 2025-07-28 13:02:32 +03:00
Johannes Gäßler
a65976fc3c CUDA: fix quantized KV cache + multiple sequences (llama/14822)
* CUDA: fix quantized KV cache + multiple sequences

* Update ggml/src/ggml-cuda/fattn-common.cuh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-28 13:02:32 +03:00
lixing-star
026d8a0c6e ggml: fix loongarch quantize_row_q8_1 error (llama/14827) 2025-07-28 13:02:32 +03:00
chen fan
49d5540206 CANN: weight format to NZ for Ascend310P3 (llama/14407)
* weight format to nz for 310p

* remove quant weight format to nz

* clean code

* fix

* make the conditions for converting weights to NZ format consistent

* clean code
2025-07-28 13:02:32 +03:00
Aman Gupta
f8402d0a95 CUDA: add fused rms norm (llama/14800) 2025-07-28 13:02:32 +03:00
Jeff Bolz
c91361379a vulkan: fix rms_norm_mul to handle broadcasting dim0 (llama/14817) 2025-07-28 13:02:32 +03:00
Sigbjørn Skjæret
810018a63a cuda : implement bf16 cpy ops and enable bf16 cont (llama/14763)
* implement bf16 cpy ops and enable bf16 cont

* deduplicate copy functions

* deduplicate checks
2025-07-28 13:02:32 +03:00
lhez
de49384ab3 opencl: remove unreachable return (llama/14806) 2025-07-28 13:02:32 +03:00
R0CKSTAR
9008410087 cuda: remove linking to cublasLt (llama/14790)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-28 13:02:32 +03:00
Sigbjørn Skjæret
e81e17b048 opencl: fix im2col when KW!=KH (llama/14803) 2025-07-28 13:02:32 +03:00
rmatif
a2a5612402 opencl: add conv2d kernel (llama/14403)
* add conv2d kernel

* fix trailing whitespace

* whitespace fix

* handle f16 input and f16 kernel, more opt

* resolve conflicts

* use enqueue_ndrange_kernel
2025-07-28 13:02:32 +03:00
Romain Biessy
52ad451c8a sycl: Fix im2col (llama/14797) 2025-07-28 13:02:32 +03:00
Charles Xu
fc2ff438fd kleidiai: add support for get_rows (llama/14676)
* kleidiai: add support for get_rows

* apply fixes based on code review

* apply more fixes based on code review
2025-07-28 13:02:32 +03:00
Jeff Bolz
e3f4162a06 vulkan/cuda: Fix im2col when KW!=KH (llama/14789)
The tid is decomposed into "ow + ky*OW + kx*OW*KH". Change "ksize" to match.
2025-07-28 13:02:32 +03:00
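The index decomposition named in commit e3f4162a06 can be illustrated with a short round-trip check. This is a Python sketch under stated assumptions, not the Vulkan or CUDA kernel code. The function name `decompose_tid` is hypothetical, and reading "ksize" as the stride `OW * KH` of the `kx` term is an assumption based on the formula in the commit message.

```python
def decompose_tid(tid: int, OW: int, KH: int) -> tuple[int, int, int]:
    """Split tid = ow + ky*OW + kx*OW*KH into (ow, ky, kx)."""
    # Assumed meaning of "ksize": the stride of kx in the formula above.
    ksize = OW * KH
    kx = tid // ksize
    rem = tid - kx * ksize
    ky = rem // OW
    ow = rem % OW
    return ow, ky, kx


# Round-trip check with a non-square kernel (KW != KH).
OW, KH, KW = 5, 3, 2
for tid in range(OW * KH * KW):
    ow, ky, kx = decompose_tid(tid, OW, KH)
    # Every component stays within its range and recombines to tid.
    assert 0 <= ow < OW and 0 <= ky < KH and 0 <= kx < KW
    assert ow + ky * OW + kx * OW * KH == tid
print("round trip ok")
```

Because the stride of `kx` depends on `KH` and not on `KW`, a divisor built from `KW` would only agree with this formula when the kernel is square, which is consistent with the "KW!=KH" wording in the commit title.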
Ervin Áron Tasnádi
92a9e85d8b ggml: adds CONV_2D op and direct GEMM Vulkan implementation (llama/14316)
* ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan

* ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly
with gemm (no need for im2col)

* test-backend-ops: adds test_case_ref to check the validity/performance of ops
against reference implementations having different graphs, adds tests

* Performance fixes: minimized branch divergence, uses collectives to
  eliminate redundant calculation, macros removed.

* Kernel shared memory size check

* Updates test-backend-ops to support graphs for performance
  measurement.

* Apple/Win32 compile errors fixed

* Subgroup size used to determine tile size -> fixes llvmpipe errors.

* Collectives disabled by default.

* Intel support is disabled as the performance is poor.

* Conv2d enabled for Intel with disabled collectives, disabled for Apple

* test-backend-ops modifications are reverted

* Trailing spaces and missing override fixed.

* Triggering pipeline relaunch.

* Code formatted with .clang-format.
2025-07-28 13:02:32 +03:00
Peter0x44
50f983a17e vulkan: Add logging for bf16 features to ggml_vk_print_gpu_info (#13274) (llama/14707) 2025-07-28 13:02:32 +03:00
0cc4m
b06f314667 Vulkan: Fix fprintf format-security warning (llama/14770) 2025-07-28 13:02:32 +03:00
Kai Pastor
5c3b794c51 cmake : fix usage issues (ggml/1257)
* CMake config: Create target only once

Fix error on repeated find_package(ggml).
For simplicity, check only for the top-level ggml::ggml.

* CMake config: Add CUDA link libs

* CMake config: Add OpenCL link libs

* CMake config: Use canonical find_dependency

Use set and append to control link lib variables.
Apply more $<LINK_ONLY...>.

* CMake config: Wire OpenMP dependency
2025-07-28 13:02:32 +03:00