2888 Commits

Author SHA1 Message Date
c3942b3db6 cuda : fix rope with partial rotation and non-cont src (llama/14580)
* cuda : fix rope non-cont

ggml-ci

* cont : fix multi-rope + add test

ggml-ci

* sycl : try fix

ggml-ci

* cont : fix sycl + clean-up cuda

ggml-ci
2025-07-12 19:23:56 +03:00
98e7beac6c CUDA: add bilinear interpolation for upscale (llama/14563) 2025-07-12 19:23:56 +03:00
7e9c6bbab2 musa: fix build warnings (unused variable) (llama/14561)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-12 19:23:56 +03:00
8e545f466c CUDA: add bf16 and i32 to getrows (llama/14529) 2025-07-12 19:23:56 +03:00
Eve
e753b9a952 vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (llama/14485)
Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260

Co-authored-by: Rémy Oudompheng <remyoudompheng@gmail.com>
2025-07-12 19:23:56 +03:00
9d0c408260 vulkan: fix rms_norm+mul fusion (llama/14545)
The fused operation was grabbing the epsilon value from the wrong place.

Add an env var to disable fusion.

Add some missing checks for supported shapes/types.

Handle fused rms_norm+mul in check_results.
2025-07-12 19:23:56 +03:00
3aebb8d5d3 vulkan: Handle updated FA dim2/3 definition (llama/14518)
* vulkan: Handle updated FA dim2/3 definition

Pack mask boolean and n_head_log2 into a single dword to keep the push
constant block under the 128B limit.

* handle null mask for gqa

* allow gqa with dim3>1
2025-07-12 19:23:56 +03:00
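
Note on 3aebb8d5d3: the packing mentioned above (a boolean and a small integer sharing one 32-bit word) can be sketched as follows. This is a minimal illustration, not the shader's actual layout: the bit positions, macro names and function names are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout, chosen for illustration only:
 * bits 0..15 hold n_head_log2, bit 16 holds the "mask present" flag. */
#define N_HEAD_LOG2_BITS 16u
#define N_HEAD_LOG2_MASK ((1u << N_HEAD_LOG2_BITS) - 1u)

/* Combine both values into one dword so they occupy 4 bytes instead of 8. */
static uint32_t pack_mask_n_head_log2(bool has_mask, uint32_t n_head_log2) {
    assert(n_head_log2 <= N_HEAD_LOG2_MASK);
    return n_head_log2 | ((uint32_t) has_mask << N_HEAD_LOG2_BITS);
}

/* Recover the boolean from bit 16. */
static bool unpack_has_mask(uint32_t packed) {
    return ((packed >> N_HEAD_LOG2_BITS) & 1u) != 0u;
}

/* Recover the integer from the low 16 bits. */
static uint32_t unpack_n_head_log2(uint32_t packed) {
    return packed & N_HEAD_LOG2_MASK;
}
```

Saving one dword this way is what lets a push constant block stay under a fixed size budget such as the 128B limit the commit refers to.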
df5af1dc75 opencl: add GELU_ERF (llama/14476) 2025-07-12 19:23:56 +03:00
10d0d28f7c metal : disable fast math in all quantize kernels (llama/14528)
ggml-ci
2025-07-12 19:23:56 +03:00
af304ef080 CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (llama/14002)
Co-authored-by: luyuhong <luyuhong@kylinos.cn>
2025-07-12 19:23:56 +03:00
e8138c51d2 ggml : implement GEGLU_ERF and GEGLU_QUICK ops (llama/14445) 2025-07-12 19:23:56 +03:00
7cec4cc83a opencl : broadcast for soft_max (llama/14510) 2025-07-12 19:23:56 +03:00
a432929d58 vulkan: support mixed/deepseekR1 FA head sizes (llama/14509)
* vulkan: better parameterize FA by head sizes

* vulkan: support mixed/deepseekR1 FA head sizes
2025-07-12 19:23:56 +03:00
4aaf8114e7 ggml: backward pass for split swiglu (llama/14483) 2025-07-12 19:23:56 +03:00
0ca760433c Fix conditional enabling following arch checks for ggml-sycl (llama/14504)
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
2025-07-12 19:23:56 +03:00
ed639c7f22 kv-cache : use ggml_set_rows (llama/14285)
* kv-cache : use ggml_set_rows

ggml-ci

* graph : separate k and v indices

ggml-ci

* cont : remove redundant ifs

ggml-ci

* kv-cache : improve find_slot impl

* kv-cache : bounds-check when accessing slot_info indices

* kv-cache : add comments

ggml-ci

* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends

ggml-ci
2025-07-12 19:23:56 +03:00
0abd0660e1 ggml : fix FA mask dim 2 and 3 (llama/14505)
* ggml : fix FA mask dim 2 and 3

ggml-ci

* backends : unsupport batched FA in CUDA and Vulkan

ggml-ci

* vulkan : disable FA for mask->ne[2] != 1
2025-07-12 19:23:56 +03:00
9cde908c0a CUDA: add dynamic shared mem to softmax, refactor general usage (llama/14497) 2025-07-12 19:23:56 +03:00
d2d120c256 llama : initial Mamba-2 support (llama/9126)
* llama : initial Mamba-2 support

* ggml : SIMD ggml_ssm_scan for Mamba-2

* ggml : improve ggml_mul speed when masking recurrent states

* llama : support running Mamba-Codestral-7B-v0.1

* llama : fix Mamba-2 conv state saving

* ggml : make the ggml_mul fast broadcast path more consistently formatted

* llama : remove unused variable

* llama : add missing break

* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.

* llama : avoid redundant state copy for Mamba 1 and 2

* metal : attempt to adapt SSM_SCAN for Mamba-2

* metal : fix SSM_SCAN pipeline scope

* metal : use log and exp instead of log1pf and expf in SSM_SCAN

* metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

* metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

* metal : fix SSM_SCAN state head offset

* metal : fix wrong number of tokens per sequence in SSM_SCAN

* ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.

* ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

* convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models
to avoid some reshapes.

Not sure if it's a good idea,
but it makes the graph slightly cleaner.

* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks

* convert : fix flake8 lint

* metal : fix confusion between ; and ,

* metal : add missing args for nb references in ssm_scan_f32_group

* metal : single-user mamba2 inference works

* kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.

* convert : avoid AutoConfig for Mamba and Mamba2 hparams

* kv-cache : allow context shift for recurrent models

* graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

* ggml : fix mamba2 ssm scan when compiled with SVE

* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches

* cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2

* mamba : fix mismatched new and delete size for llm_build_mamba

Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This would otherwise cause problems when running Mamba-(1|2) inference
when compiled with -DGGML_SANITIZE_ADDRESS=ON

* cuda : graceful fallback for Mamba-1 models with weird embd size
2025-07-12 19:23:56 +03:00
fb5c4095ee CUDA: add softmax broadcast (llama/14475)
* CUDA: add softmax broadcast

* Pass by const ref

* Review: Use blockDims for indexing, remove designated initializers

* Add TODO for noncontiguous input/output
2025-07-12 19:23:56 +03:00
70515ed728 CUDA: broadcasting for FlashAttention mask (llama/14500) 2025-07-12 19:23:56 +03:00
1b3e06a400 vulkan: support softmax/FA batch and broadcast (llama/14449) 2025-07-12 19:23:56 +03:00
d1286cf32b ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (llama/14435) 2025-07-12 19:23:56 +03:00
2e04b81f3e opencl : fix possible buffer overflow in dump_tensor (llama/14490) 2025-07-12 19:23:56 +03:00
cd87a2f7e0 opencl : skip empty nodes on cgraph compute (llama/14491) 2025-07-12 19:23:56 +03:00
e43c38f9f1 opencl : update upscale to support align corners (llama/14488) 2025-07-12 19:23:56 +03:00
ab850d4680 ggml : Callback before abort (llama/14481)
* Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed.

* Return previous callback to allow callback chaining

* style fixes

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-07-12 19:23:56 +03:00
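
Note on ab850d4680: the pattern described above (a setter that installs a callback and returns the previous one so callbacks can be chained) might look like the sketch below. All names here are stand-ins invented for the example, and `report_fatal` only simulates the path that would run just before an abort; it does not terminate the process.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical callback type: receives the error message. */
typedef void (*abort_callback_t)(const char * error_message);

static abort_callback_t g_abort_callback = NULL;

/* Install a new callback and return the one that was set before,
 * so the caller can chain to it. */
static abort_callback_t set_abort_callback(abort_callback_t callback) {
    abort_callback_t previous = g_abort_callback;
    g_abort_callback = callback;
    return previous;
}

/* Stand-in for the code that runs just before aborting. */
static void report_fatal(const char * message) {
    if (g_abort_callback != NULL) {
        g_abort_callback(message);
    } else {
        fprintf(stderr, "%s\n", message);
    }
}

/* Application side: remember the previous callback and forward to it. */
static abort_callback_t g_prev_callback = NULL;
static int g_app_calls = 0;

static void app_on_abort(const char * message) {
    g_app_calls++; /* e.g. show a dialog or save user data here */
    if (g_prev_callback != NULL) {
        g_prev_callback(message);
    }
}
```

An application without a console would call `g_prev_callback = set_abort_callback(app_on_abort);` once at startup.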
cdf5e72163 ci : disable fast-math for Metal GHA CI (llama/14478)
* ci : disable fast-math for Metal GHA CI

ggml-ci

* cont : remove -g flag

ggml-ci
2025-07-12 19:23:56 +03:00
32d7c10766 CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (llama/14411)
* [CANN]update to aclnnGroupedMatmulV2

Signed-off-by: noemotiovon <757486878@qq.com>

* Support MUL_MAT_ID on 310p

Signed-off-by: noemotiovon <757486878@qq.com>

* fix editorconfig

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-07-12 19:23:56 +03:00
3c7939cfe5 vulkan: Split large mul_mat_id to fit in shared memory (llama/14451) 2025-07-12 19:23:56 +03:00
6fc80e8456 add GELU_ERF (llama/14455) 2025-07-12 19:23:56 +03:00
19b9aaf044 vulkan : implement bilinear interpolation for ggml_upscale/ggml_interpolate (ggml/1291)
* supports GGML_SCALE_MODE_BILINEAR and GGML_SCALE_FLAG_ALIGN_CORNERS
2025-07-12 19:23:56 +03:00
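
Note on 19b9aaf044: the difference an align-corners flag usually makes can be shown in one dimension. The sketch below follows the common convention (corner samples map exactly onto each other when aligned, half-pixel centers otherwise). It is an assumption that this matches the backend's exact coordinate math, and the function name is hypothetical.

```c
#include <stdbool.h>

/* Linearly interpolate src (n_src samples) at output index i_dst of n_dst. */
static float sample_linear_1d(const float * src, int n_src, int n_dst,
                              int i_dst, bool align_corners) {
    float x;
    if (align_corners) {
        /* First and last output samples coincide with first and last inputs. */
        x = n_dst > 1 ? (float) i_dst * (float) (n_src - 1) / (float) (n_dst - 1)
                      : 0.0f;
    } else {
        /* Half-pixel centers: map the center of each output sample. */
        x = ((float) i_dst + 0.5f) * (float) n_src / (float) n_dst - 0.5f;
    }

    /* Clamp to the valid range so edge samples are repeated. */
    if (x < 0.0f) {
        x = 0.0f;
    }
    if (x > (float) (n_src - 1)) {
        x = (float) (n_src - 1);
    }

    int x0 = (int) x;                        /* x >= 0, so truncation is floor */
    int x1 = x0 + 1 < n_src ? x0 + 1 : x0;   /* right neighbor, clamped */
    float t = x - (float) x0;                /* fractional position */

    return src[x0] * (1.0f - t) + src[x1] * t;
}
```

The bilinear case applies the same mapping independently along both image axes.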
f98cb6607b vulkan : implement ggml_roll (ggml/1290)
* vulkan : implement ggml_roll

* vulkan : refactor vk_op_unary_push_constants initialization
2025-07-12 19:23:56 +03:00
5ea5c58768 ggml : add version function to get lib version (ggml/1286)
* ggml : add version function to get lib version

This commit adds a function `ggml_version()` to the ggml library that
returns the version of the library as a string.

The motivation for this is that it can be useful to be able to
programmatically check the version of the ggml library being used.

Usage:
```c
printf("GGML version: %s\n", ggml_version());
```
Output:
```console
GGML version: 0.0.2219
```

* ggml : add ggml_commit()

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-12 19:23:56 +03:00
869335f2d5 server : add dtw.params for v3-large-turbo (#3307)
* Add DTW model large-v3-turbo parameters to server.cpp example

DTW support is available in whispercpp and the large-v3-turbo model has already been added to the sources, but the large-v3-turbo model hasn't been added to the server.cpp file to make use of it. This commit hopefully corrects that issue.

* match original linebreak of original server.cpp file after adding large.v3.turbo dtw
2025-07-07 12:51:15 +03:00
d9999d54c8 feat: support vad for addon.node (#3301)
Co-authored-by: linxiaodong <calm.lin@wukongsch.com>
2025-07-02 13:14:29 +03:00
bca021c974 sync : ggml
ggml-ci
2025-07-01 17:54:53 +03:00
1f816de7da talk-llama : sync llama.cpp 2025-07-01 17:54:53 +03:00
c4ea72be9a ggml : remove trailing whitespace (llama/0) 2025-07-01 17:54:53 +03:00
1e930ab1b8 opencl : add GEGLU, REGLU, SWIGLU (llama/14456) 2025-07-01 17:54:53 +03:00
b5b237d49a Add Conv2d for CPU (llama/14388)
* Conv2D: Add CPU version

* Half decent

* Tiled approach for F32

* remove file

* Fix tests

* Support F16 operations

* add assert about size

* Review: further formatting fixes, add assert and use CPU version of fp32->fp16
2025-07-01 17:54:53 +03:00
679f31a9d1 metal : disable fast-math for some cpy kernels (llama/14460)
* metal : disable fast-math for some cpy kernels

ggml-ci

* cont : disable for q4_1

ggml-ci

* cont : disable for iq4_nl

ggml-ci
2025-07-01 17:54:53 +03:00
e29e36aee7 ggml-cpu: sycl: Re-enable exp f16 (llama/14462) 2025-07-01 17:54:53 +03:00
6bb1234a56 cmake : Remove redundant include path in CMakeLists.txt (llama/14452)
* Update docker.yml

Modify docker.yml so that the workflow no longer runs periodically; it can still be started manually when needed.

* Remove redundant include path in CMakeLists.txt

The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths.

* Enable scheduled Docker image builds

Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.
2025-07-01 17:54:53 +03:00
3239359bd1 scripts : make the shell scripts cross-platform (llama/14341) 2025-07-01 17:54:53 +03:00
e81be92931 SYCL: disable faulty fp16 exp kernel (llama/14395)
* SYCL: disable faulty fp16 CPU exponent for now

* Revert "SYCL: disable faulty fp16 CPU exponent for now"

This reverts commit ed0aab1ec31b4eb4b0f275dd7acd41d96a375202.

* SYCL: disable faulty fp16 CPU exponent for now

* Fix logic of disabling exponent kernel
2025-07-01 17:54:53 +03:00
130044f228 ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (llama/14443) 2025-07-01 17:54:53 +03:00
8bc638ee56 ggml : implement REGLU/GEGLU/SWIGLU ops (llama/14158)
* implement unary REGLU/GEGLU/SWIGLU cpu ops

* relax constraints

* duplicate shape of source

* fix ggml_vec_geglu_f16

* special case gated ops

* implement unary REGLU/GEGLU/SWIGLU cuda ops

* tighten constraints again

* refactor into GGML_GLU_OP

* metal : add glu kernels

ggml-ci

* add CUDA_GLU_BLOCK_SIZE [no ci]

* more constraints and use 64bit ints

ggml-ci

* 64bit multiplication [no ci]

* implement swapped variants (cpu/cuda)

* update comment [no ci]

ggml-ci

* Vulkan: Add GLU ops and shaders

* SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate

* ggml : implement GLU for split up/gate (llama/14181)

* implement GLU for split up/gate

* add tests for ggml_glu_split

* Vulkan: Implement glu_split logic and shader support

* add split to logging [no ci]

* SYCL: refactor element_size ops and add split up and gate support to gated kernels

* SYCL: switch GEGLU to use tanh approximation

---------

Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>

* GGML: increase OP count in assertion

* Refactor: Optimize SYCL element-wise operations with unary function inlining

This commit refactors the SYCL element-wise operations to improve performance by:

- Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
- Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic.
- Replacing direct kernel calls with calls to these inlined functions.
- Using `__dpct_inline__` to encourage compiler inlining.
- Minor code cleanup and consistency improvements.

The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.

* vulkan: Increase workgroup size for GLU, for performance (llama/14345)

* vulkan: Increase workgroup size for GLU, for performance

* vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup

* merge fix

* metal : add support for split and swap

ggml-ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-07-01 17:54:53 +03:00
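
Note on 8bc638ee56: a gated linear unit with split inputs is usually defined element-wise as act(gate) * up, with the activation selecting the variant (ReLU for REGLU, GELU for GEGLU, SiLU for SWIGLU). The sketch below shows that definition with only ReLU supplied. The function names and argument order are hypothetical and are not the ggml API.

```c
#include <stddef.h>

/* Activation applied to the gate branch. */
typedef float (*activation_fn)(float);

static float relu(float x) {
    return x > 0.0f ? x : 0.0f;
}

/* Split GLU: gate and up are separate arrays of n values each.
 * dst[i] = act(gate[i]) * up[i] */
static void glu_split(const float * gate, const float * up, float * dst,
                      size_t n, activation_fn act) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] = act(gate[i]) * up[i];
    }
}
```

Passing `relu` gives REGLU; swapping in a SiLU or GELU function would give the other two variants under the same definition.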
00b36237ba vulkan: Add fusion support for RMS_NORM+MUL (llama/14366)
* vulkan: Add fusion support for RMS_NORM+MUL

- Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
- Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
- Add detection logic and basic fusion logic in ggml-vulkan.
- Add some testing support for fusion. Rather than computing one node at a time, allow
for computing the whole graph and just testing one node's results. Add rms_norm_mul tests
and enable a llama test.

* extract some common fusion logic

* fix -Winconsistent-missing-override

* move ggml_can_fuse to a common function

* build fix

* C and C++ versions of can_fuse

* move use count to the graph to avoid data races and double increments when used in multiple threads

* use hash table lookup to find node index

* change use_counts to be indexed by hash table slot

* minimize hash lookups

style fixes

* last node doesn't need single use.
fix type.
handle mul operands being swapped.

* remove redundant parameter

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-07-01 17:54:53 +03:00
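
Note on 00b36237ba: the fusion condition described above (an RMS_NORM result may be folded into the following MUL only if nothing else reads it) can be sketched with per-graph use counts. The node struct, enum and function names below are invented for the example and are far simpler than the real graph types.

```c
#include <stdbool.h>
#include <stddef.h>

enum op { OP_INPUT, OP_RMS_NORM, OP_MUL, OP_ADD };

/* Hypothetical graph node: an op plus up to two source node indices. */
struct node {
    enum op op;
    int src[2]; /* index into the node array, or -1 if unused */
};

/* Count how many times each node is read as a source.
 * Counts live next to the graph, not inside the nodes. */
static void count_uses(const struct node * nodes, size_t n, int * use_counts) {
    for (size_t i = 0; i < n; ++i) {
        use_counts[i] = 0;
    }
    for (size_t i = 0; i < n; ++i) {
        for (int s = 0; s < 2; ++s) {
            if (nodes[i].src[s] >= 0) {
                use_counts[nodes[i].src[s]]++;
            }
        }
    }
}

/* nodes[i] must be RMS_NORM, nodes[i + 1] must be a MUL that reads it
 * (as either operand), and the norm result must have exactly one use.
 * The MUL result itself is not required to be single-use. */
static bool can_fuse_rms_norm_mul(const struct node * nodes, size_t n,
                                  const int * use_counts, size_t i) {
    if (i + 1 >= n) {
        return false;
    }
    if (nodes[i].op != OP_RMS_NORM || nodes[i + 1].op != OP_MUL) {
        return false;
    }
    const struct node * mul = &nodes[i + 1];
    bool reads_norm = mul->src[0] == (int) i || mul->src[1] == (int) i;
    return reads_norm && use_counts[i] == 1;
}
```

Keeping the counts in a separate array mirrors the commit's point about moving the use count out of the tensor to avoid data races between threads.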
b900ee424c CUDA: add bf16 and f32 support to cublas_mul_mat_batched (llama/14361)
* CUDA: add bf16 and f32 support to cublas_mul_mat_batched

* Review: add type traits and make function more generic

* Review: make check more explicit, add back comments, and fix formatting

* Review: fix formatting, remove useless type conversion, fix naming for bools
2025-07-01 17:54:53 +03:00