Commit Graph

2934 Commits

5c3b794c51 cmake : fix usage issues (ggml/1257)
* CMake config: Create target only once

Fix error on repeated find_package(ggml).
For simplicity, check only for the top-level ggml::ggml.

* CMake config: Add CUDA link libs

* CMake config: Add OpenCL link libs

* CMake config: Use canonical find_dependency

Use set and append to control link lib variables.
Apply more $<LINK_ONLY...>.

* CMake config: Wire OpenMP dependency
2025-07-28 13:02:32 +03:00
e238dc1bdd ggml-cpu : remove stdlib include from repack.cpp (ggml/1276)
This commit removes the inclusion of `<cstdlib>`.

The motivation for this change is that this source file does not seem to
use any functions from this header and the comment about `qsort` is a
little misleading/confusing.
2025-07-28 13:02:32 +03:00
e7bf0294ec Support static xcframework packaging in build-xcframework.sh (#3322)
* This commit allows building a static xcframework by adding a
BUILD_STATIC_XCFRAMEWORK option. When enabled, the build-xcframework.sh
script builds a self-contained static whisper.xcframework.

The motivation for this change is to let command-line binaries link
whisper.cpp without forcing users to install the whisper.xcframework
separately.

* Update build-xcframework.sh

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* Address reviewer feedback: remove extra indentation around static xcframework creation.

* squash! Address reviewer feedback: remove extra indentation around static xcframework creation.

Fix whitespaces.

---------

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2025-07-26 12:25:44 +02:00
7de8dd783f examples : add note about WHISPER_WASM_SINGLE_FILE [no ci] (#3332)
This commit adds a note to the README files of the WASM examples
about the `WHISPER_WASM_SINGLE_FILE` option.

The motivation for this is that currently this option is not documented
and might be surprising to users who expect a separate .wasm file to be
generated.

Refs: https://github.com/ggml-org/whisper.cpp/issues/3290
2025-07-24 16:06:48 +02:00
85e474fd55 ci : add paths to build.yml (#3333)
This commit adds specific paths to the GitHub Actions workflow file
`.github/workflows/build.yml`.

The motivation for this is to avoid unnecessary builds when unrelated files
are changed, which saves resources and time during the CI process.

Refs: https://github.com/ggml-org/whisper.cpp/issues/3285
2025-07-24 16:04:21 +02:00
210bbbe4d5 musa: upgrade musa sdk to rc4.2.0 (#3324)
* musa: upgrade musa sdk to 4.2.0

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: restore rc in docker image tag

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-24 13:19:57 +03:00
1f5cf0b288 server : hide language probabilities option behind flag (#3328)
* examples/server: hide language probabilities option behind flag

* code review

* fix
2025-07-21 13:03:54 +02:00
2e6be2f380 go: fix Mac OS X builds (#3310)
This commit fixes the Go bindings build for Mac OS X (15.1), which is currently failing.

Co-authored-by: Chaitanya Bayapuneni <bvk@mini.cinnamon-interval.ts.net>
2025-07-21 08:47:35 +02:00
c0dc391349 sync : ggml
ggml-ci
2025-07-20 00:23:50 +03:00
0ed687c6f1 metal : fuse add, mul + add tests (llama/14596)
ggml-ci
2025-07-20 00:23:50 +03:00
d4a7ea1634 cuda : Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs (llama/14741)
* Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs

Gemma3n uses matrix-matrix addition as part of its input processing,
wrongly triggering CUDA_GRAPH disablement on NVGPUs even when a batch size
of 1 is used.

* Exclude `project_per_layer_input` by matching node names

This ensures that all other graphs which don't exhibit this pattern do
not have their behavior changed.

* Revert unnecessary formatting changes
2025-07-20 00:23:50 +03:00
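A minimal C++ sketch of the name-based exclusion described in the commit above: a batched matrix-matrix ADD would normally disable CUDA graph capture, but the known Gemma3n node is recognized by its name and exempted. The struct, field, and function names here are illustrative only and do not reflect the actual ggml-cuda sources.

```cpp
#include <cstring>

// Illustrative stand-in for a graph node; the real ggml_tensor has more fields.
struct node_t {
    const char * name; // e.g. "project_per_layer_input" for the Gemma3n add node
    int          op;   // operation id
    long         ne1;  // second dimension (batch-like axis)
};

constexpr int OP_ADD = 1; // hypothetical op id for the add operation

// Sketch of the heuristic: a matrix-matrix ADD with ne1 > 1 normally suggests
// batched decoding and disables CUDA graph capture, but the known Gemma3n
// per-layer-input node is excluded by matching its name.
bool disables_cuda_graph(const node_t & node) {
    const bool batched_add = node.op == OP_ADD && node.ne1 > 1;
    const bool is_gemma3n_per_layer_input =
        std::strstr(node.name, "project_per_layer_input") != nullptr;
    return batched_add && !is_gemma3n_per_layer_input;
}
```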
9a07cb064a CUDA: set_rows + cpy.cu refactor (llama/14712) 2025-07-20 00:23:50 +03:00
fed20b0682 use max work group size for device to replace the magic number (llama/14732) 2025-07-20 00:23:50 +03:00
17c5411195 ggml: Add initial WebGPU backend (llama/14521)
* Minimal setup of webgpu backend with dawn. Just prints out the adapter and segfaults

* Initialize webgpu device

* Making progress on setting up the backend

* Finish more boilerplate/utility functions

* Organize file and work on alloc buffer

* Add webgpu_context to prepare for actually running some shaders

* Work on memset and add shader loading

* Work on memset polyfill

* Implement set_tensor as webgpu WriteBuffer, remove host_buffer stubs since webgpu doesn't support it

* Implement get_tensor and buffer_clear

* Finish rest of setup

* Start work on compute graph

* Basic mat mul working

* Work on emscripten build

* Basic WebGPU backend instructions

* Use EMSCRIPTEN flag

* Work on passing ci, implement 4d tensor multiplication

* Pass thread safety test

* Implement permuting for mul_mat and cpy

* minor cleanups

* Address feedback

* Remove division by type size in cpy op

* Fix formatting and add github action workflows for vulkan and metal (m-series) webgpu backends

* Fix name

* Fix macos dawn prefix path
2025-07-20 00:23:50 +03:00
ae1bb2c8ea llama : add high-throughput mode (llama/14363)
* kv-cache : prepare K/V buffers for separation

ggml-ci

* batched-bench : fix oob write

ggml-ci

* llama : add "virtual sequences"

ggml-ci

* llama : use "stream" vs "virtual sequence"

ggml-ci

* graph : fix stream splitting when KV cache is not used

ggml-ci

* kv-cache : add multi-stream save/load support

ggml-ci

* llama : add "--attn-streams" flag

ggml-ci

* kv-cache : fix handling when find_slot fails

ggml-ci

* kv-cache : restore find_slot impl

ggml-ci

* kv-cache : add comments

* kv-cache : add bounds checks for sequence id

ggml-ci

* cont : add n_seq_max to batch allocr

ggml-ci

* kv-cache : perform stream copies lazily after llama_synchronize

ggml-ci

* kv-cache : avoid throwing exceptions across the C boundary

ggml-ci

* CUDA: 4D FlashAttention support (llama/14628)

* CUDA: 4D FlashAttention support

* CUDA: fix WMMA FA kernel

* llama : rename attn_streams -> kv_unified

ggml-ci

* common : rename kv_split -> kv_unified

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-20 00:23:50 +03:00
9cc645fec0 ggml : add asserts (llama/14720)
* ggml : add asserts

ggml-ci

* cont : fix constant type

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-07-20 00:23:50 +03:00
8d1a0485f1 vulkan: fix noncontig check for mat_mul_id splitting (llama/14683)
* vulkan: fix noncontig check for mat_mul_id splitting

Remove supports_op check for > 4096 (splitting fixes this)

* vulkan: fix batched matmul dequant for Q*_K
2025-07-20 00:23:50 +03:00
b33841c453 vulkan: add RTE variants for glu/add/sub/mul/div (llama/14653) 2025-07-20 00:23:50 +03:00
ab79c6c118 cuda: fix build warnings in set-rows.cu (unused variable) (llama/14687)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-20 00:23:50 +03:00
a6b9271c2c sycl: Hotfix for non dnnl codepath (llama/14677) 2025-07-20 00:23:50 +03:00
ded2e3cf6d ggml : refactor llamafile_sgemm PPC code (llama/14673)
Remove unnecessary templates from the class definition and packing functions
Reduce deeply nested conditionals and if-else switching in the mnpack function
Replace repetitive code with inline functions in the packing functions

2-7% improvement in Q8 models
15-50% improvement in Q4 models

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2025-07-20 00:23:50 +03:00
ebb0e9d0ed SYCL: use 1D kernel for set_rows (llama/14618)
* SYCL: Use 1D kernel for set_rows

* Remove dangling comment

* Refactor and use ceil_div
2025-07-20 00:23:50 +03:00
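A small C++ sketch of the sizing logic behind a 1D set_rows launch, assuming a `ceil_div` helper like the one mentioned in the commit above; the names and the example work-group size are illustrative, not the actual SYCL code.

```cpp
#include <cstddef>

// Round-up integer division, as used to size a 1D kernel launch.
constexpr size_t ceil_div(size_t a, size_t b) { return (a + b - 1) / b; }

// Sketch: flatten an (n_rows x n_cols) copy into one 1D range whose size is a
// multiple of the work-group size, so each work-item handles one element.
size_t global_range_1d(size_t n_rows, size_t n_cols, size_t wg_size /* e.g. 256 */) {
    const size_t n_elems = n_rows * n_cols;
    return ceil_div(n_elems, wg_size) * wg_size; // padded to whole work-groups
}
```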
24803d62c6 sycl: Batched mulmat rework for oneDNN dispatch (llama/14617) 2025-07-20 00:23:50 +03:00
0611387d17 cuda : add set rows for bf16 (llama/14664) 2025-07-20 00:23:50 +03:00
fe33572b22 cuda : add ELU support (llama/14657) 2025-07-20 00:23:50 +03:00
21308b4e6e ggml : add build-time message to remind about ggml_set_rows (llama/14661)
ggml-ci
2025-07-20 00:23:50 +03:00
3cad26d807 metal : Add missing unary ops Metal support (llama/14660) 2025-07-20 00:23:50 +03:00
66b3a39bdc CUDA: add set rows for f32 and f16 (llama/14551)
* CUDA: add set rows for f32 and f16

* Review: change kernel params, use strides from host

* Use 1-d kernel

* Review: use int64_t for blockDim.x, rename nb->s for clarity
2025-07-20 00:23:50 +03:00
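For reference, set_rows scatters source rows into destination rows selected by an index tensor. Below is a CPU sketch of that semantics with host-provided row strides, mirroring the "use strides from host" review note; the names and layout are illustrative, not the actual CUDA kernel.

```cpp
#include <cstdint>
#include <cstring>

// Reference (CPU) sketch of set_rows: for each source row i, copy it into
// destination row row_ids[i]. The CUDA kernel performs the same scatter on the
// GPU, with the strides computed on the host and passed as kernel parameters.
void set_rows_f32(const float * src, float * dst,
                  const int64_t * row_ids, int64_t n_rows, int64_t n_cols,
                  size_t src_row_stride, size_t dst_row_stride) { // strides in elements
    for (int64_t i = 0; i < n_rows; ++i) {
        const float * s = src + i * src_row_stride;
        float       * d = dst + row_ids[i] * dst_row_stride;
        std::memcpy(d, s, n_cols * sizeof(float));
    }
}
```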
032697b9a8 whisper: validate get_rows support for cpu extra buffer (#3323) 2025-07-14 15:13:44 +03:00
a16da91365 examples : update links in wasm examples (#3318)
* fix 404 link

* update link in whisper.wasm example

* update example in command.wasm

* update link in bench.wasm example

* update link in stream.wasm example
2025-07-12 23:22:35 +02:00
3775c503d5 sync : resolve conflicts (#0)
ggml-ci
2025-07-12 19:23:56 +03:00
6ddff4d96a talk-llama : sync llama.cpp
ggml-ci
2025-07-12 19:23:56 +03:00
6d64e4abf3 sync : ggml 2025-07-12 19:23:56 +03:00
85dcc74b88 sync : resolve conflicts (ggml/0)
ggml-ci
2025-07-12 19:23:56 +03:00
915fc153a5 vulkan: support SET_ROWS (llama/14587)
* vulkan: support SET_ROWS

Add variants of the copy_to_quant shader that do the SET_ROWS operation.
Change these shaders to spread the work across the workgroup.
The memory access pattern is probably not great (one thread per quant block),
but should be fine for now.

* vulkan: optimize set_rows

Larger workgroups for non-quant types.
Set "norepeat" (there is manual repeat logic).
Use fastmod.
2025-07-12 19:23:56 +03:00
8670a3fd5d vulkan: optimizations for deepseek prompt processing (llama/14555)
* vulkan: allow unclamped loads in coopmat2 mul_mat_id shader

* vulkan: increase coopmat2 mul_mat_id tile size

* vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path

* vulkan: use smaller FA row size when head size is large. Applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)
2025-07-12 19:23:56 +03:00
74f6d47904 model : support LiquidAI LFM2 hybrid family (llama/14620)
**Important**
LFM2 was [merged](https://github.com/huggingface/transformers/pull/39340) into transformers, but has not yet been released.
To convert to GGUF, install transformers from source:
```shell
pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"
```
2025-07-12 19:23:56 +03:00
a4ff4ec9cb HIP : Add HIP 7.0+ compatibility for hipBLAS compute types (llama/14634) 2025-07-12 19:23:56 +03:00
b0754136be opencl: add tiled mul_mat_f16_f32 (llama/14535)
* add tiled mul_mat_f16_f32

* fix trailing whitespace

* add insightful comments
2025-07-12 19:23:56 +03:00
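A scalar C++ sketch of the tiling idea behind the new kernel: the output is processed in tiles so the corresponding A/B sub-blocks are reused while they are hot. The real OpenCL kernel works on f16 A and f32 B with work-group-sized tiles held in local memory; everything below is illustrative.

```cpp
#include <algorithm>
#include <cstddef>

// Scalar reference of a tiled matrix multiply C = A * B (row-major).
// Working on one TILE x TILE block at a time keeps the A/B sub-blocks hot in
// cache (on the GPU: in local/shared memory).
void mul_mat_tiled(const float * A, const float * B, float * C,
                   int M, int N, int K, int TILE = 32) {
    std::fill(C, C + (size_t)M * N, 0.0f);
    for (int i0 = 0; i0 < M; i0 += TILE) {
        for (int j0 = 0; j0 < N; j0 += TILE) {
            for (int k0 = 0; k0 < K; k0 += TILE) {
                for (int i = i0; i < std::min(i0 + TILE, M); ++i) {
                    for (int k = k0; k < std::min(k0 + TILE, K); ++k) {
                        const float a = A[(size_t)i * K + k];
                        for (int j = j0; j < std::min(j0 + TILE, N); ++j) {
                            C[(size_t)i * N + j] += a * B[(size_t)k * N + j];
                        }
                    }
                }
            }
        }
    }
}
```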
6f113cbcaa opencl: add set_rows for f16 and f32 (llama/14547)
* opencl: add `set_rows` for `f16` and `f32`

* opencl: choose a better workgroup size for `set_rows`
2025-07-12 19:23:56 +03:00
3c21cde540 SYCL: Initial set_rows kernel implementation (llama/14562)
* SYCL: Initial set_rows kernel implementation

* Revert max_threads to 256

* Refactor set_rows and address review comments

* Deduplicate conversion function

* Remove guard before kernel launch and refactor

* Fix and add back SFINAE
2025-07-12 19:23:56 +03:00
fb885fa48b cuda : support Falcon-H1 state size for SSM_SCAN (llama/14602) 2025-07-12 19:23:56 +03:00
2021870fb8 ggml : add ggml_scale_bias (llama/14417)
* ggml : add ggml_scale_bias

* ggml_vec_mad1_f32

* add more simd

* add CUDA

* sycl

* vulkan

* cann (placeholder)

* opencl

* will this fix cpu?

* fix cuda

* suggestions from coderabbit

* fix cann compile error

* vDSP_vsmsa

* rm __ARM_FEATURE_SVE

* use memcpy for op params

* make code look more consistent

* use scalar for __ARM_FEATURE_SVE

* add x param to ggml_vec_mad1_f32
2025-07-12 19:23:56 +03:00
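A scalar reference of what the new op computes, assuming `ggml_scale_bias` applies `y = x*s + b` elementwise for scalar `s` and `b`, as the `ggml_vec_mad1_f32` helper and the vDSP_vsmsa mapping suggest; the exact ggml signatures may differ from this sketch.

```cpp
#include <cstddef>

// Scalar sketch of the scale-bias op: y[i] = x[i]*s + b for scalar s, b.
// The ggml version adds SIMD paths (and vDSP_vsmsa on Apple) plus
// CUDA/SYCL/Vulkan/OpenCL/CANN kernels for the same computation.
static void vec_mad1_f32_ref(size_t n, float * y, const float * x, float s, float b) {
    for (size_t i = 0; i < n; ++i) {
        y[i] = x[i] * s + b;
    }
}
```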
48b18f9eb8 ggml : prevent integer overflow in gguf tensor size calculation (llama/14595) 2025-07-12 19:23:56 +03:00
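One common way to guard such a size computation, shown as a hedged C++ sketch: check the multiplication before performing it so a corrupt or malicious GGUF header cannot wrap the computed tensor size around to a small number. The actual ggml code may structure the check differently (and, being C, would report an error rather than throw).

```cpp
#include <cstdint>
#include <stdexcept>

// Multiply two size components with an overflow check, so the product cannot
// silently wrap around in uint64_t arithmetic.
static uint64_t checked_mul_u64(uint64_t a, uint64_t b) {
    if (a != 0 && b > UINT64_MAX / a) {
        throw std::overflow_error("tensor size calculation overflows uint64_t");
    }
    return a * b;
}
```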
fadb3233b6 vulkan: optimize flash attention split_k_reduce (llama/14554)
* vulkan: allow FA split_k with smaller KV values

* vulkan: spread split_k_reduce work across more threads

k_num can get rather large. Use the whole workgroup to reduce the M/L values.

Launch a thread for each element in the HSV dimension of the output. Helps a
lot for large HSV (like deepseek).
2025-07-12 19:23:56 +03:00
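For context, split_k flash attention divides the KV range into k_num chunks; each chunk produces an unnormalized partial output plus a running max (M) and sum (L), and the split_k_reduce step merges them. Below is a scalar C++ sketch of that merge for one output row, with the workgroup-level parallelization from the commit omitted and all names illustrative.

```cpp
#include <cmath>
#include <cstddef>

// Combine k_num partial flash-attention results for one output row.
// Each split k holds: O_k (unnormalized accumulated V rows, length hsv),
// M_k (running max of the attention scores) and L_k (sum of exp(score - M_k)).
void fa_split_k_reduce(const float * O,   // k_num x hsv partial outputs
                       const float * M,   // k_num running maxima
                       const float * L,   // k_num running sums
                       float * out, size_t k_num, size_t hsv) {
    float m = -INFINITY;
    for (size_t k = 0; k < k_num; ++k) m = std::fmax(m, M[k]);

    float l = 0.0f;
    for (size_t j = 0; j < hsv; ++j) out[j] = 0.0f;

    for (size_t k = 0; k < k_num; ++k) {
        const float scale = std::exp(M[k] - m); // rescale each split to the global max
        l += scale * L[k];
        for (size_t j = 0; j < hsv; ++j) out[j] += scale * O[k * hsv + j];
    }
    for (size_t j = 0; j < hsv; ++j) out[j] /= l; // final softmax normalization
}
```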
9750e4c988 vulkan : fix rope with partial rotation and non-cont src (llama/14582) 2025-07-12 19:23:56 +03:00
c3942b3db6 cuda : fix rope with partial rotation and non-cont src (llama/14580)
* cuda : fix rope non-cont

ggml-ci

* cont : fix multi-rope + add test

ggml-ci

* sycl : try fix

ggml-ci

* cont : fix sycl + clean-up cuda

ggml-ci
2025-07-12 19:23:56 +03:00
98e7beac6c CUDA: add bilinear interpolation for upscale (llama/14563) 2025-07-12 19:23:56 +03:00
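A plain C++ reference of bilinear upscaling for one channel: each output pixel blends the four nearest source pixels, and the CUDA kernel computes the same per-pixel arithmetic with one thread per output element. The coordinate convention and names here are illustrative and may differ from the ggml upscale op.

```cpp
#include <algorithm>
#include <cmath>

// Bilinear upscale of a single-channel W x H plane to W2 x H2 (row-major).
void upscale_bilinear(const float * src, int W, int H,
                      float * dst, int W2, int H2) {
    const float sx = (float)W / W2;
    const float sy = (float)H / H2;
    for (int y = 0; y < H2; ++y) {
        for (int x = 0; x < W2; ++x) {
            const float fx = (x + 0.5f) * sx - 0.5f;   // map to source coordinates
            const float fy = (y + 0.5f) * sy - 0.5f;
            const int x0 = std::clamp((int)std::floor(fx), 0, W - 1);
            const int y0 = std::clamp((int)std::floor(fy), 0, H - 1);
            const int x1 = std::min(x0 + 1, W - 1);
            const int y1 = std::min(y0 + 1, H - 1);
            const float dx = std::clamp(fx - x0, 0.0f, 1.0f);
            const float dy = std::clamp(fy - y0, 0.0f, 1.0f);
            const float top = src[y0 * W + x0] * (1 - dx) + src[y0 * W + x1] * dx;
            const float bot = src[y1 * W + x0] * (1 - dx) + src[y1 * W + x1] * dx;
            dst[y * W2 + x] = top * (1 - dy) + bot * dy;
        }
    }
}
```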
7e9c6bbab2 musa: fix build warnings (unused variable) (llama/14561)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-12 19:23:56 +03:00
8e545f466c CUDA: add bf16 and i32 to getrows (llama/14529) 2025-07-12 19:23:56 +03:00