Commit Graph

2952 Commits

SHA1 Message Date
95efcf011d CUDA: fix overflow in FA, tune performance (llama/14840) 2025-07-28 13:02:32 +03:00
8272aa9f14 CUDA: fix compilation with GGML_CUDA_F16 (llama/14837) 2025-07-28 13:02:32 +03:00
a65976fc3c CUDA: fix quantized KV cache + multiple sequences (llama/14822)
* CUDA: fix quantized KV cache + multiple sequences

* Update ggml/src/ggml-cuda/fattn-common.cuh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-28 13:02:32 +03:00
026d8a0c6e ggml: fix loongarch quantize_row_q8_1 error (llama/14827) 2025-07-28 13:02:32 +03:00
49d5540206 CANN: weight format to NZ for Ascend310P3 (llama/14407)
* weight format to nz for 310p

* remove quant weight format to nz

* clean code

* fix

* make the conditions for converting weights to NZ format consistent

* clean code
2025-07-28 13:02:32 +03:00
f8402d0a95 CUDA: add fused rms norm (llama/14800) 2025-07-28 13:02:32 +03:00
c91361379a vulkan: fix rms_norm_mul to handle broadcasting dim0 (llama/14817) 2025-07-28 13:02:32 +03:00
810018a63a cuda : implement bf16 cpy ops and enable bf16 cont (llama/14763)
* implement bf16 cpy ops and enable bf16 cont

* deduplicate copy functions

* deduplicate checks
2025-07-28 13:02:32 +03:00
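
The "deduplicate copy functions" item above amounts to templating the per-element copy over source and destination types instead of keeping one hand-written routine per type pair. A minimal sketch of that pattern, assuming a CUDA-style element copy (the name `cpy_elem` and the exact signature are illustrative, not the actual cpy.cu code):

```cpp
// Hypothetical sketch of deduplicated element copies: one template covers
// f32 <-> bf16 (and other pairs), converting through float.
#include <cuda_bf16.h>

template <typename src_t, typename dst_t>
__device__ __forceinline__ void cpy_elem(const char * cxi, char * cdsti) {
    const src_t * xi   = (const src_t *) cxi;
    dst_t       * dsti = (dst_t *) cdsti;
    *dsti = (dst_t) (float) *xi;   // convert via float
}

// e.g. cpy_elem<float, __nv_bfloat16> and cpy_elem<__nv_bfloat16, float>
// replace two near-identical hand-written functions.
```
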
de49384ab3 opencl: remove unreachable return (llama/14806) 2025-07-28 13:02:32 +03:00
9008410087 cuda: remove linking to cublasLt (llama/14790)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-28 13:02:32 +03:00
e81e17b048 opencl: fix im2col when KW!=KH (llama/14803) 2025-07-28 13:02:32 +03:00
a2a5612402 opencl: add conv2d kernel (llama/14403)
* add conv2d kernel

* fix trailing whitespace

* whitespace fix

* handle f16 input and f16 kernel, more opt

* resolve conflicts

* use enqueue_ndrange_kernel
2025-07-28 13:02:32 +03:00
52ad451c8a sycl: Fix im2col (llama/14797) 2025-07-28 13:02:32 +03:00
fc2ff438fd kleidiai: add support for get_rows (llama/14676)
* kleidiai: add support for get_rows

* apply fixes based on code review

* apply more fixes based on code review
2025-07-28 13:02:32 +03:00
e3f4162a06 vulkan/cuda: Fix im2col when KW!=KH (llama/14789)
The tid is decomposed into "ow + ky*OW + kx*OW*KH". Change "ksize" to match.
2025-07-28 13:02:32 +03:00
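
The decomposition quoted above can be written out explicitly. A small illustrative helper, assuming the layout described in the message (the struct and function names are made up; this is not the shader or kernel source):

```cpp
// Illustrative only: unpack a flat thread id laid out as
// tid = ow + ky*OW + kx*OW*KH, with ow in [0,OW), ky in [0,KH), kx in [0,KW).
#include <cstdint>

struct Im2colIdx { int64_t ow, ky, kx; };

inline Im2colIdx decompose_tid(int64_t tid, int64_t OW, int64_t KH) {
    return {
        tid % OW,             // ow
        (tid / OW) % KH,      // ky
        tid / (OW * KH),      // kx
    };
}

// The span covered this way is OW*KH*KW; deriving "ksize" with the wrong
// kernel dimension only goes unnoticed while KW == KH.
inline int64_t ksize(int64_t OW, int64_t KH, int64_t KW) {
    return OW * KH * KW;
}
```
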
92a9e85d8b ggml: adds CONV_2D op and direct GEMM Vulkan implementation (llama/14316)
* ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan

* ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly
with GEMM (no need for im2col)

* test-backend-ops: adds test_case_ref to check the validity/performance of ops
against reference implementations having different graphs, adds tests

* Performance fixes: minimized branch divergence, uses collectives to
  eliminate redundant calculation, macros removed.

* Kernel shared memory size check

* Updates test-backend-ops to support graphs for performance
  measurement.

* Apple/Win32 compile errors fixed

* Subgroup size used to determine tile size -> fixes llvmpipe errors.

* Collectives disabled by default.

* Intel support is disabled as the performance is poor.

* Conv2d enabled for Intel with disabled collectives, disabled for Apple

* test-backend-ops modifications are reverted

* Trailing spaces and missing override fixed.

* Triggering pipeline relaunch.

* Code formatted with .clang-format.
2025-07-28 13:02:32 +03:00
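
The key idea of the commit above is to compute CONV_2D as a GEMM of A[OC x K] by an implicit B[K x OH*OW] with K = IC*KH*KW, gathering B's elements from the input on the fly instead of materializing an im2col matrix. A scalar C++ reference of that index mapping, assuming NCHW-style layouts (illustrative only, not the Vulkan shader):

```cpp
// Reference "direct GEMM" convolution: the inner loop over k plays the role
// of the GEMM reduction, and the B-matrix element for (k, o) is read
// straight from the padded input instead of an im2col buffer.
#include <vector>
#include <cstdint>

void conv2d_direct_gemm_f32(
        const std::vector<float> & src,    // [IC, IH, IW]
        const std::vector<float> & kernel, // [OC, IC, KH, KW]
        std::vector<float>       & dst,    // [OC, OH, OW]
        int IC, int IH, int IW, int OC, int KH, int KW,
        int stride, int pad) {
    const int OH = (IH + 2*pad - KH) / stride + 1;
    const int OW = (IW + 2*pad - KW) / stride + 1;
    const int K  = IC * KH * KW;
    dst.assign((size_t) OC * OH * OW, 0.0f);

    for (int oc = 0; oc < OC; ++oc) {              // GEMM row
        for (int o = 0; o < OH * OW; ++o) {        // GEMM column
            const int oy = o / OW, ox = o % OW;
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {          // GEMM reduction
                const int ic = k / (KH * KW);
                const int ky = (k / KW) % KH;
                const int kx = k % KW;
                const int iy = oy * stride + ky - pad;
                const int ix = ox * stride + kx - pad;
                if (iy < 0 || iy >= IH || ix < 0 || ix >= IW) continue;
                acc += kernel[((size_t) oc * IC + ic) * KH * KW + ky * KW + kx]
                     * src[((size_t) ic * IH + iy) * IW + ix];
            }
            dst[((size_t) oc * OH + oy) * OW + ox] = acc;
        }
    }
}
```
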
50f983a17e vulkan: Add logging for bf16 features to ggml_vk_print_gpu_info (#13274) (llama/14707) 2025-07-28 13:02:32 +03:00
b06f314667 Vulkan: Fix fprintf format-security warning (llama/14770) 2025-07-28 13:02:32 +03:00
5c3b794c51 cmake : fix usage issues (ggml/1257)
* CMake config: Create target only once

Fix error on repeated find_package(ggml).
For simplicity, check only for the top-level ggml::ggml.

* CMake config: Add CUDA link libs

* CMake config: Add OpenCL link libs

* CMake config: Use canonical find_dependency

Use set and append to control link lib variables.
Apply more $<LINK_ONLY...>.

* CMake config: Wire OpenMP dependency
2025-07-28 13:02:32 +03:00
e238dc1bdd ggml-cpu : remove stdlib include from repack.cpp (ggml/1276)
This commit removes the inclusion of `<cstdlib>`.

The motivation for this change is that this source file does not seem to
use any functions from this header and the comment about `qsort` is a
little misleading/confusing.
2025-07-28 13:02:32 +03:00
e7bf0294ec Support static xcframework packaging in build-xcframework.sh (#3322)
* This commit allows for the building of a static xcframework by adding a
BUILD_STATIC_XCFRAMEWORK option. When enabled, the build-xcframework.sh
script builds a self-contained static whisper.xcframework.

The motivation for this change is so that command line binaries can link
whisper.cpp without forcing users to install the whisper.xcframework
separately.

* Update build-xcframework.sh

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* Address reviewer feedback: remove extra indentation around static xcframework creation.

* squash! Address reviewer feedback: remove extra indentation around static xcframework creation.

Fix whitespaces.

---------

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2025-07-26 12:25:44 +02:00
7de8dd783f examples : add note about WHISPER_WASM_SINGLE_FILE [no ci] (#3332)
This commit adds a note to the README files of the WASM examples
about the `WHISPER_WASM_SINGLE_FILE` option.

The motivation for this is that currently this option is not documented
and might be surprising to users who expect a separate .wasm file to be
generated.

Refs: https://github.com/ggml-org/whisper.cpp/issues/3290
2025-07-24 16:06:48 +02:00
85e474fd55 ci : add paths to build.yml (#3333)
This commit adds specific paths to the GitHub Actions workflow file
`.github/workflows/build.yml`.

The motivation for this is to avoid unnecessary builds when unrelated files
are changed, which can save resources and time during the CI process.

Refs: https://github.com/ggml-org/whisper.cpp/issues/3285
2025-07-24 16:04:21 +02:00
210bbbe4d5 musa: upgrade musa sdk to rc4.2.0 (#3324)
* musa: upgrade musa sdk to 4.2.0

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: restore rc in docker image tag

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-24 13:19:57 +03:00
1f5cf0b288 server : hide language probabilities option behind flag (#3328)
* examples/server: hide language probabilities option behind flag

* code review

* fix
2025-07-21 13:03:54 +02:00
2e6be2f380 go: fix Mac OS X builds (#3310)
This commit fixes the Go bindings build for Mac OS X (15.1), which is currently failing.

Co-authored-by: Chaitanya Bayapuneni <bvk@mini.cinnamon-interval.ts.net>
2025-07-21 08:47:35 +02:00
c0dc391349 sync : ggml
ggml-ci
2025-07-20 00:23:50 +03:00
0ed687c6f1 metal : fuse add, mul + add tests (llama/14596)
ggml-ci
2025-07-20 00:23:50 +03:00
d4a7ea1634 cuda : Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs (llama/14741)
* Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs

Gemma3n uses matrix-matrix addition as part of its input processing,
wrongly triggering CUDA_GRAPH disablement on NVGPUs even when a batch
size of 1 is used.

* Exclude `project_per_layer_input` by matching node names

This ensures that all other graphs which don't exhibit this pattern do
not have their behavior changed.

* Revert unnecessary formatting changes
2025-07-20 00:23:50 +03:00
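
A simplified sketch of the check described above, assuming the public ggml graph accessors; the function name and exact heuristic are illustrative and this is not the actual ggml-cuda code:

```cpp
// Illustration: CUDA graph use is normally abandoned when a matrix-matrix
// ADD suggests a batch size > 1; Gemma3n's per-layer input projection ADD
// is excluded by node name so it no longer forces the fallback at batch
// size 1.
#include <cstring>
#include "ggml.h"

static bool looks_like_batched_decode(struct ggml_cgraph * cgraph) {
    for (int i = 0; i < ggml_graph_n_nodes(cgraph); ++i) {
        const struct ggml_tensor * node = ggml_graph_node(cgraph, i);
        if (node->op == GGML_OP_ADD &&
            node->src[1] != NULL && node->src[1]->ne[1] > 1) {
            // the exclusion added by this commit: match the node name
            if (strstr(node->name, "project_per_layer_input") != NULL) {
                continue;
            }
            return true; // disable CUDA graphs, fall back to normal launches
        }
    }
    return false;
}
```
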
9a07cb064a CUDA: set_rows + cpy.cu refactor (llama/14712) 2025-07-20 00:23:50 +03:00
fed20b0682 use max work group size for device to replace the magic number (llama/14732) 2025-07-20 00:23:50 +03:00
17c5411195 ggml: Add initial WebGPU backend (llama/14521)
* Minimal setup of webgpu backend with dawn. Just prints out the adapter and segfaults

* Initialize webgpu device

* Making progress on setting up the backend

* Finish more boilerplate/utility functions

* Organize file and work on alloc buffer

* Add webgpu_context to prepare for actually running some shaders

* Work on memset and add shader loading

* Work on memset polyfill

* Implement set_tensor as webgpu WriteBuffer, remove host_buffer stubs since webgpu doesn't support it

* Implement get_tensor and buffer_clear

* Finish rest of setup

* Start work on compute graph

* Basic mat mul working

* Work on emscripten build

* Basic WebGPU backend instructions

* Use EMSCRIPTEN flag

* Work on passing ci, implement 4d tensor multiplication

* Pass thread safety test

* Implement permuting for mul_mat and cpy

* minor cleanups

* Address feedback

* Remove division by type size in cpy op

* Fix formatting and add github action workflows for vulkan and metal (m-series) webgpu backends

* Fix name

* Fix macos dawn prefix path
2025-07-20 00:23:50 +03:00
ae1bb2c8ea llama : add high-throughput mode (llama/14363)
* kv-cache : prepare K/V buffers for separation

ggml-ci

* batched-bench : fix oob write

ggml-ci

* llama : add "virtual sequences"

ggml-ci

* llama : use "stream" vs "virtual sequence"

ggml-ci

* graph : fix stream splitting when KV cache is not used

ggml-ci

* kv-cache : add multi-stream save/load support

ggml-ci

* llama : add "--attn-streams" flag

ggml-ci

* kv-cache : fix handling when find_slot fails

ggml-ci

* kv-cache : restore find_slot impl

ggml-ci

* kv-cache : add comments

* kv-cache : add bounds checks for sequence id

ggml-ci

* cont : add n_seq_max to batch allocr

ggml-ci

* kv-cache : perform stream copies lazily after llama_synchronize

ggml-ci

* kv-cache : avoid throwing exceptions across the C boundary

ggml-ci

* CUDA: 4D FlashAttention support (llama/14628)

* CUDA: 4D FlashAttention support

* CUDA: fix WMMA FA kernel

* llama : rename attn_streams -> kv_unified

ggml-ci

* common : rename kv_split -> kv_unified

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-20 00:23:50 +03:00
9cc645fec0 ggml : add asserts (llama/14720)
* ggml : add asserts

ggml-ci

* cont : fix constant type

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-07-20 00:23:50 +03:00
8d1a0485f1 vulkan: fix noncontig check for mat_mul_id splitting (llama/14683)
* vulkan: fix noncontig check for mat_mul_id splitting

Remove supports_op check for > 4096 (splitting fixes this)

* vulkan: fix batched matmul dequant for Q*_K
2025-07-20 00:23:50 +03:00
b33841c453 vulkan: add RTE variants for glu/add/sub/mul/div (llama/14653) 2025-07-20 00:23:50 +03:00
ab79c6c118 cuda: fix build warnings in set-rows.cu (unused variable) (llama/14687)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-20 00:23:50 +03:00
a6b9271c2c sycl: Hotfix for non dnnl codepath (llama/14677) 2025-07-20 00:23:50 +03:00
ded2e3cf6d ggml : refactor llamafile_sgemm PPC code (llama/14673)
Remove unnecessary templates from the class definition and packing functions
Reduce deeply nested conditionals and if-else switching in the mnpack function
Replace repetitive code with inline functions in the packing functions

2 ~ 7% improvement in Q8 Model
15 ~ 50% improvement in Q4 Model

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2025-07-20 00:23:50 +03:00
ebb0e9d0ed SYCL: use 1D kernel for set_rows (llama/14618)
* SYCL: Use 1D kernel for set_rows

* Remove dangling comment

* Refactor and use ceil_div
2025-07-20 00:23:50 +03:00
24803d62c6 sycl: Batched mulmat rework for oneDNN dispatch (llama/14617) 2025-07-20 00:23:50 +03:00
0611387d17 cuda : add set rows for bf16 (llama/14664) 2025-07-20 00:23:50 +03:00
fe33572b22 cuda : add ELU support (llama/14657) 2025-07-20 00:23:50 +03:00
21308b4e6e ggml : add build-time message to remind about ggml_set_rows (llama/14661)
ggml-ci
2025-07-20 00:23:50 +03:00
3cad26d807 metal : Add missing unary ops Metal support (llama/14660) 2025-07-20 00:23:50 +03:00
66b3a39bdc CUDA: add set rows for f32 and f16 (llama/14551)
* CUDA: add set rows for f32 and f16

* Review: change kernel params, use strides from host

* Use 1-d kernel

* Review: use int64_t for blockDim.x, rename nb->s for clarity
2025-07-20 00:23:50 +03:00
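
A stripped-down sketch of what those review points describe: a 1-D launch where each thread copies one element, the destination row comes from an index tensor, and row strides are computed on the host. Kernel and helper names are hypothetical; this is not the ggml-cuda kernel:

```cpp
// Hypothetical sketch: f32 source rows scattered into an f16 destination,
// rows selected by an i64 index tensor, strides (in elements) passed from
// the host, blockDim.x widened to int64_t before the multiply.
#include <cuda_fp16.h>
#include <cstdint>

__global__ void set_rows_f32_f16(const float * src, const int64_t * rows,
                                 half * dst, int64_t ne0, int64_t nrows,
                                 int64_t s_src, int64_t s_dst) {
    const int64_t i = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= ne0 * nrows) {
        return;
    }
    const int64_t r = i / ne0;          // source row
    const int64_t c = i % ne0;          // column within the row
    dst[rows[r] * s_dst + c] = __float2half(src[r * s_src + c]);
}

static void set_rows_cuda(const float * src, const int64_t * rows, half * dst,
                          int64_t ne0, int64_t nrows, int64_t s_src,
                          int64_t s_dst, cudaStream_t stream) {
    const int64_t n      = ne0 * nrows;
    const int     block  = 256;
    const int64_t blocks = (n + block - 1) / block;   // ceil_div
    set_rows_f32_f16<<<(unsigned) blocks, block, 0, stream>>>(
            src, rows, dst, ne0, nrows, s_src, s_dst);
}
```
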
032697b9a8 whisper: validate get_rows support for cpu extra buffer (#3323) 2025-07-14 15:13:44 +03:00
a16da91365 examples : update links in wasm examples (#3318)
* fix 404 link

* update link in whisper.wasm example

* update example in command.wasm

* update link in bench.wasm example

* update link in stream.wasm example
2025-07-12 23:22:35 +02:00
3775c503d5 sync : resolve conflicts (#0)
ggml-ci
2025-07-12 19:23:56 +03:00
6ddff4d96a talk-llama : sync llama.cpp
ggml-ci
2025-07-12 19:23:56 +03:00