Commit Graph

3042 Commits

Sigbjørn Skjæret
367cd11f5d cuda : fix GGML_CUDA_GRAPHS=OFF (llama/15300)
* fix USE_CUDA_GRAPH=OFF

ggml-ci

* check capture status

* completely disable capturing check instead
2025-08-18 20:30:45 +03:00
Jonathan Graehl
c76ec72d59 finetune: SGD optimizer, more CLI args (llama/13873)
* examples/finetune -opt SGD (stochastic gradient descent) memory opt

add a unit-tested GGML_OPT_OPTIMIZER_SGD to ggml - it avoids allocating the
m, v moment tensors.

support the finetune.cpp arg -opt SGD (or sgd). (default adamw as before)

llama 3.2-1b-F32 result: observed 11 GB GPU RAM (41 sec/epoch)
with SGD instead of 19 GB (55 sec/epoch) with adamw.
(wikipedia 100-line finetune)
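A minimal host-side sketch (not the ggml implementation) of why SGD needs no
extra optimizer state while AdamW has to keep two moment buffers, m and v,
each the same size as the weights:

```cuda
#include <cmath>
#include <cstddef>

// Plain SGD with decoupled weight decay: weights are updated in place,
// so no per-parameter optimizer state is allocated.
void sgd_step(float * w, const float * g, size_t n, float lr, float wd) {
    for (size_t i = 0; i < n; ++i) {
        w[i] = w[i] * (1.0f - lr * wd) - lr * g[i];
    }
}

// AdamW keeps first/second moment buffers m and v per parameter, which is
// where the extra GPU memory goes.
void adamw_step(float * w, const float * g, float * m, float * v, size_t n,
                float lr, float wd, float beta1, float beta2, float eps, int t) {
    const float b1c = 1.0f - std::pow(beta1, (float) t);
    const float b2c = 1.0f - std::pow(beta2, (float) t);
    for (size_t i = 0; i < n; ++i) {
        m[i] = beta1 * m[i] + (1.0f - beta1) * g[i];
        v[i] = beta2 * v[i] + (1.0f - beta2) * g[i] * g[i];
        const float mhat = m[i] / b1c;
        const float vhat = v[i] / b2c;
        w[i] = w[i] * (1.0f - lr * wd) - lr * mhat / (std::sqrt(vhat) + eps);
    }
}
```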

(
with the same GPU memory, adamw can only fit a 512 batch/context
before OOM, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val:   [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00

SGD converges more slowly but comes out ahead, fitting a 1728 batch/context
before OOM (note especially the better validation perf):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val:   [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)

note: when finetuning long enough (or with a high enough -lr),
validation accuracy *eventually* drops ('catastrophic forgetting')

the -lr-half (half-life) option is useful for SGD to avoid oscillation or
very slow underdamped learning (it makes setting -lr more forgiving).
the terminal -lr is currently set via -lr-halvings, i.e. if you want at most
1/8 the initial -lr you set -lr-halvings 3.
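A hypothetical sketch of such a schedule, assuming the half-life is measured
in epochs and the decay is clamped at the terminal value implied by
-lr-halvings (the exact semantics in finetune.cpp may differ):

```cuda
#include <algorithm>
#include <cmath>

// lr0: initial -lr, lr_half: -lr-half in epochs, lr_halvings: -lr-halvings
float scheduled_lr(float lr0, float epoch, float lr_half, int lr_halvings) {
    const float decayed  = lr0 * std::exp2(-epoch / lr_half);     // halves every lr_half epochs
    const float terminal = lr0 / std::exp2((float) lr_halvings);  // halvings=3 -> lr0/8
    return std::max(decayed, terminal);
}
```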

note: the objective loss may not be directly comparable between adamw and
sgd - check perplexity or accuracy, or consider relative improvements, when
judging convergence

new finetune args: -wd 1e-9 enables weight decay in sgd or adamw, and
-epochs N caps the number of epochs (default 2 as before)

caching (1 - wd*alpha) in the 'adamw' opt struct showed no noticeable perf
benefit and is disabled (it is still done for the new SGD, though)

since optimizer memory is pre-allocated, ggml_opt_get_optimizer_params could
probably switch between SGD and AdamW each epoch, but it would need to use
adamw for the first epoch (unconfirmed - there is no cmdline arg to set such
a policy yet)

test-opt checks adamw as before and now sgd (except for a few disabled
tests for sgd only; probably just needs logging values and adding
alternate reference values); tolerance on the 'regression'
test is broader for sgd (so we don't need many more epochs)

* Vulkan: Implement GGML_OP_OPT_STEP_SGD

* tests: Fix OPT_STEP_SGD test-backend-ops

* SGD op param store weight-decay and not 1-alpha*wd

* minor + cosmetic changes

* fix vulkan sgd

* try CI fix

---------

Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-08-18 20:30:45 +03:00
uvos
cbaec6c4ac HIP: bump requirement to rocm 6.1 (llama/15296) 2025-08-18 20:30:45 +03:00
Judd
80ef57f0f0 ggml : update ggml_rope_multi (llama/12665)
* update `rope_multi`:

1. add `ggml_rope_multi_inplace`;
2. use `GGML_MROPE_SECTIONS` instead of 4.
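Illustration of the second point, assuming the constant simply names the
m-rope sections array length (the actual definition lives in ggml.h):

```cuda
// named constant instead of a magic number for the m-rope sections array
#define GGML_MROPE_SECTIONS 4

// example values only - the per-dimension section split is model-specific
int sections[GGML_MROPE_SECTIONS] = { 0, 0, 0, 0 };
```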

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-08-18 20:30:45 +03:00
Georgi Gerganov
0e8b244366 ggml : repack block_iq4_nlx8 (llama/14904)
ggml-ci
2025-08-18 20:30:45 +03:00
Oliver Simons
b8b1b50c47 CUDA: Optimize reduce_rows_f32 kernel, giving up to a 25x kernel-level perf improvement and a 10% perf increase for Gemma3n (llama/15132)
* Factor out `reduce_rows_f32` from common.cuh

This increases iteration cycle speed by not having to recompile
every kernel all the time

* Hide memory-latency by loop unrolling in reduce_rows_f32

* Further optimizations to `reduce_rows_f32`

1. Increase threadblock size to better hide latency of memory requests.
   As a consequence of the bigger threadblocks, do a 2-step summation, using
   shared memory to communicate results between invocations (see the sketch
   after this list)
2. Use a sum_temp array to reduce waits on sum
3. Adjust num_unroll to reflect the bigger threadblock
4. Improve default block_dims, increase support for more block_dims
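A simplified CUDA sketch of the 2-step pattern from point 1 (warp-level
shuffle reduction, then combining per-warp partial sums through shared
memory); the actual reduce_rows_f32 kernel is more involved, and the sketch
assumes blockDim.x is a multiple of 32:

```cuda
__global__ void reduce_rows_f32_sketch(const float * x, float * dst, const int ncols) {
    const int row = blockIdx.x;
    const int tid = threadIdx.x;

    // each thread accumulates a strided partial sum over its row
    float sum = 0.0f;
    for (int col = tid; col < ncols; col += blockDim.x) {
        sum += x[row * ncols + col];
    }

    // step 1: reduce within each warp using shuffles
    for (int offset = 16; offset > 0; offset >>= 1) {
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    }

    // step 2: combine the per-warp results through shared memory
    __shared__ float warp_sums[32];
    const int lane = tid % 32;
    const int warp = tid / 32;
    if (lane == 0) {
        warp_sums[warp] = sum;
    }
    __syncthreads();

    if (warp == 0) {
        sum = (lane < (int) blockDim.x / 32) ? warp_sums[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1) {
            sum += __shfl_down_sync(0xffffffff, sum, offset);
        }
        if (lane == 0) {
            dst[row] = sum;
        }
    }
}
```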

* Add perf tests for `reduce_rows_f32` kernel

* Add heuristic to toggle 128/512 threads based on sm count

The break-even point was the minimum of the following multiples.

| GPU Model                    | Nrow / SM-count multiple |
| ---------------------------- | ------------------------ |
| RTX 4000 SFF ADA             | 2.0x                     |
| RTX 6000 ADA                 | 2.5x                     |
| RTX PRO 6000 Blackwell Max-Q | 3.04x                    |
| RTX PRO 4500 Blackwell       | 3.15x                    |
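A hypothetical sketch of the kind of heuristic described above, assuming the
wider 512-thread blocks pay off when there are few rows relative to the SM
count and the narrower 128-thread blocks win once the row count exceeds
roughly 2x the SM count (the exact thresholds and direction in the real
kernel may differ):

```cuda
#include <cstdint>

int choose_reduce_rows_block_size(int64_t nrows, int sm_count) {
    // few rows -> few blocks, so use wide blocks to keep the SMs busy;
    // many rows already saturate the GPU with 128-thread blocks
    return (nrows < 2 * (int64_t) sm_count) ? 512 : 128;
}
```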

* Ensure perf gains also for small ncols and large nrows

As an alternative, one could also have made the number of unrollings
template-able, but that would require compiling the kernel multiple
times, increasing binary size unnecessarily

* Modify perf and unit-tests

* Apply auto-formatting by clang

* Fix CI build failure

See https://github.com/ggml-org/llama.cpp/actions/runs/16798370266/job/47573716079?pr=15132#step:7:486
Building with VS generator worked though.

* Remove sm_count property from `ggml_backend_cuda_context`

Requested by @JohannesGaessler, and should fix remaining CI issues as a
side-effect

* Add CUB-based implementation for GGML_OP_MEAN

Currently this branch is only executed for nrows==1
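A sketch of what a CUB-based mean for a single row can look like (not the
actual ggml-cuda code; error handling omitted): sum the row with
cub::DeviceReduce and divide by the element count afterwards.

```cuda
#include <cub/cub.cuh>

void mean_row_f32_cub(const float * d_x, float * d_sum, int ncols, cudaStream_t stream) {
    void * d_tmp     = nullptr;
    size_t tmp_bytes = 0;

    // first call only queries the required temporary storage size
    cub::DeviceReduce::Sum(d_tmp, tmp_bytes, d_x, d_sum, ncols, stream);
    cudaMallocAsync(&d_tmp, tmp_bytes, stream);
    cub::DeviceReduce::Sum(d_tmp, tmp_bytes, d_x, d_sum, ncols, stream);
    cudaFreeAsync(d_tmp, stream);

    // d_sum now holds the row sum; a tiny follow-up kernel (or the caller)
    // divides it by ncols to obtain the mean
}
```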

* Add heuristics to execute CUB branch only when it brings perf

Heuristics were determined on the following HW:

* RTX 4000 SFF ADA
* RTX 6000 ADA
* RTX PRO 6000 Blackwell Max-Q
* RTX PRO 4500 Blackwell

* Add unit-test for CUB-based mean

Tests should run with CUDA Graphs enabled by default on NVGPUs

* Rename `USE_CUB` to `GGML_CUDA_USE_CUB`

Suggested by @JohannesGaessler

* Unindent Preprocessor directives

See
https://github.com/ggml-org/llama.cpp/pull/15132#discussion_r2269213506
2025-08-18 20:30:45 +03:00
Tak-RS
4e234ac013 ggml-rpc: chunk send()/recv() to avoid EINVAL for very large tensors over RPC (macOS & others) (llama/15188)
* ggml-rpc: chunk send()/recv() to avoid EINVAL for very large tensors over RPC (macOS & others). Fixes #15055 (a sketch of the pattern follows this list)

* ggml-rpc: rename RPC_IO_CHUNK->MAX_CHUNK_SIZE, use std::min() for cap, switch to GGML_LOG_ERROR, handle 0-length send/recv

* rpc: drop n==0 special case in send_data(); retry in loop per review

* rpc: remove trailing whitespace in send_data()
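An illustrative host-side sketch of the chunking pattern (not the actual
ggml-rpc code): cap each send() at a fixed chunk size and loop until all
bytes are written, so very large tensors never trigger EINVAL; the chunk
size below is a placeholder value.

```cuda
#include <sys/types.h>
#include <sys/socket.h>
#include <algorithm>
#include <cstddef>

static bool send_data_chunked(int sockfd, const void * data, size_t size) {
    constexpr size_t MAX_CHUNK_SIZE = 1u << 23; // 8 MiB per send(), illustrative
    size_t sent = 0;
    while (sent < size) {
        const size_t  n   = std::min(size - sent, MAX_CHUNK_SIZE);
        const ssize_t ret = send(sockfd, (const char *) data + sent, n, 0);
        if (ret < 0) {
            return false; // caller logs the error and aborts the request
        }
        sent += (size_t) ret;
    }
    return true;
}
```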

---------

Co-authored-by: Shinnosuke Takagi <nosuke@nosukenoMacBook-Pro.local>
2025-08-18 20:30:45 +03:00
uvos
8df931b608 HIP: disable sync warp shuffle operators from clr amd_warp_sync_functions.h (llama/15273) 2025-08-18 20:30:45 +03:00
Romain Biessy
1334f434f3 sycl: Fix and disable more configurations of mul_mat (llama/15151)
* sycl: Fix and disable more configurations of mul_mat

* Disable more configurations
2025-08-18 20:30:45 +03:00
rmatif
139110701e opencl: allow mixed f16/f32 add (llama/15140) 2025-08-18 20:30:45 +03:00
Aman Gupta
082c7ba67c CUDA cmake: add -lineinfo for easier debug (llama/15260) 2025-08-18 20:30:45 +03:00
Chenguang Li
0effaad964 CANN: GGML_OP_CPY optimization (llama/15070)
Signed-off-by: noemotiovon <757486878@qq.com>
2025-08-18 20:30:45 +03:00
R0CKSTAR
8e2ddfec31 musa: fix failures in test-backend-ops for mul_mat_id op (llama/15236)
* musa: fix failures in test-backend-ops for mul_mat_id op

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-18 20:30:45 +03:00
hipudding
3e2c262c08 CANN: Add broadcast for softmax and FA (llama/15208)
* refactor softmax

* fix fa

* fix mask shape

* format

* add comments

* Remove whitespace
2025-08-18 20:30:45 +03:00
Charles Xu
30cc11dc94 kleidiai: fix unsigned overflow bug (llama/15150)
* kleidiai: fix unsigned overflow bug

* address review comments
2025-08-18 20:30:45 +03:00
David Zhao
457eadfe6f cuda: refactored ssm_scan and use CUB (llama/13291)
* cuda: refactored ssm_scan to use CUB

* fixed compilation error when not using CUB

* assign L to constant and use size_t instead of int

* deduplicated functions

* change min blocks per mp to 1

* Use cub load and store warp transpose

* suppress clang warning
2025-08-18 20:30:45 +03:00
Aman Gupta
93c7a08019 CUDA: add attention sinks for tile and wmma (llama/15178)
* CUDA: add attention sinks for tile and wmma

* Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma
2025-08-18 20:30:45 +03:00
compilade
62566a5436 gguf-py : add Numpy MXFP4 de/quantization support (llama/15111)
* gguf-py : add MXFP4 de/quantization support

* ggml-quants : handle zero amax for MXFP4
2025-08-18 20:30:45 +03:00
AN Long
573bf9d128 ggml : fix field name when new ggml_backend (llama/14944) 2025-08-18 20:30:45 +03:00
Johannes Gäßler
2baea5e4b3 CUDA: attention sinks for mma FlashAttention (llama/15157) 2025-08-18 20:30:45 +03:00
lhez
8a36cd924a opencl: support sink in soft_max (attn sinks) (llama/15152) 2025-08-18 20:30:45 +03:00
Jeff Bolz
1984530710 vulkan: support fattn sinks (llama/15126) 2025-08-18 20:30:45 +03:00
Jeff Bolz
414e9074e0 vulkan: Add env var to disable host visible vidmem (llama/15109) 2025-08-18 20:30:45 +03:00
uvos
813ceb2a74 HIP: add cmake option to enable compiler output of kernel resource usage metrics (llama/15103) 2025-08-18 20:30:45 +03:00
Christian Kastner
6d7ffea292 ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (llama/15094)
Any available libraries are found and loaded dynamically at runtime.
2025-08-18 20:30:45 +03:00
Johannes Gäßler
5caf8a1ea2 CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16 (llama/15131)
* CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16
2025-08-18 20:30:45 +03:00
rmatif
b405fd88b3 fix profiling crash (llama/15072) 2025-08-18 20:30:45 +03:00
lhez
d153cfb507 opencl: add swiglu_oai and add_id (llama/15121)
* opencl: add `swiglu-oai`

* opencl: add `add_id`

* opencl: add missing `add_id.cl`
2025-08-18 20:30:45 +03:00
Diego Devesa
6fb55d8f7c ggml : fix fallback to CPU for unsupported ops (llama/15118) 2025-08-18 20:30:45 +03:00
Chenguang Li
e809e81e69 CANN: add support for ACL Graph (llama/15065)
* feat(cann): add optional support for ACL Graph execution

This commit adds support for executing ggml computational graphs using
Huawei's ACL graph mode via the USE_CANN_GRAPH flag. The support can be
enabled at compile time using the CMake option:

    -DUSE_CANN_GRAPH=ON

By default, ACL graph execution is **disabled**, and the fallback path
uses node-by-node execution.

Key additions:
- CMake option  to toggle graph mode
- Graph capture and execution logic using
- Tensor property matching to determine whether graph update is required
- Safe fallback and logging if the environment variable LLAMA_SET_ROWS
  is unset or invalid (see the sketch after this description)

This prepares the backend for performance improvements in repetitive graph
execution scenarios on Ascend devices.
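A hedged illustration of that fallback logic (not the actual ggml-cann code;
the accepted value of LLAMA_SET_ROWS is an assumption here): graph capture is
attempted only when the backend was built with the flag and the environment
variable is set, otherwise the backend logs a message and runs node by node.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>

static bool acl_graph_enabled() {
#ifndef USE_ACL_GRAPH
    return false; // built without -DUSE_ACL_GRAPH=ON
#else
    const char * env = std::getenv("LLAMA_SET_ROWS");
    if (env == nullptr || std::strcmp(env, "1") != 0) { // "1" is an assumed value
        std::fprintf(stderr, "CANN: LLAMA_SET_ROWS unset or invalid, falling back to node-by-node execution\n");
        return false;
    }
    return true;
#endif
}
```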

Signed-off-by: noemotiovon <757486878@qq.com>

* Fix review comments

Signed-off-by: noemotiovon <757486878@qq.com>

* rename USE_CANN_GRAPH to USE_ACL_GRAPH

Signed-off-by: noemotiovon <757486878@qq.com>

* fix typo

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-08-18 20:30:45 +03:00
Georgi Gerganov
d3aab3efde llama : add gpt-oss (llama/15091)
* oai moe

* compat with new checkpoint

* add attn sink impl

* add rope scaling yarn

* logits match with latest transformers code

* wip chat template

* rm trailing space

* use ggml_scale_bias

* rm redundant is_swa_all

* convert interleaved gate_up

* graph : fix activation function to match reference (llama/7)

* vocab : handle o200k_harmony special tokens

* ggml : add attention sinks support (llama/1)

* llama : add attn sinks

* ggml : add attn sinks

* cuda : add attn sinks

* vulkan : add support for sinks in softmax

remove unnecessary return

* ggml : add fused swiglu_oai op (llama/11)

* ggml : add fused swiglu_oai op

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update CUDA impl

* cont : metal impl

* add vulkan impl

* test-backend-ops : more test cases, clean up

* llama : remove unfused impl

* remove extra lines

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>

* repack mxfp4 upon conversion

* clean up a bit

* enable thinking

* add quick hack to render only some special tokens

* fix bf16 conversion

* remove vocab hack

* webui ok

* support chat parsing for gpt-oss

* fix webui

* direct mapping mxfp4, FINALLY

* force using mxfp4

* properly use lazy tensor

* ggml : add mxfp4

ggml : use e8m0 conversion instead of powf

Co-authored-by: Diego Devesa <slarengh@gmail.com>

change kvalues_mxfp4 table to match e2m1 (llama/6)

metal : remove quantization for now (not used)

cuda : fix disabled CUDA graphs due to ffn moe bias

vulkan : add support for mxfp4

cont : add cm2 dequant
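A sketch of what the e8m0 conversion mentioned a few lines up can look like,
assuming the bias-127 exponent format from the OCP MX spec (the ggml
implementation may handle the edge cases 0 and 255 differently): the exponent
byte is placed directly into the float32 exponent field instead of calling
powf.

```cuda
#include <cstdint>
#include <cstring>

static float e8m0_to_float(uint8_t e) {
    uint32_t bits = (uint32_t) e << 23; // float32 exponent field
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;                           // 2^(e - 127) for 0 < e < 255
}
```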

* ggml : add ggml_add_id (llama/13)

* ggml : add ggml_add_id

* add cuda impl

* llama : add weight support check for add_id

* perf opt

* add vulkan impl

* rename cuda files

* add metal impl

* allow in-place ggml_add_id

* llama : keep biases on CPU with --cpu-moe

* llama : fix compile error

ggml-ci

* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw

ggml-ci

* cleanup

ggml-ci

* sycl : fix supports_op for MXFP4

ggml-ci

* fix Unknown reasoning format

* ggml-cpu : fix AVX build

ggml-ci

* fix hip build

ggml-ci

* cuda : add mxfp4 dequantization support for cuBLAS

ggml-ci

* ggml-cpu : fix mxfp4 fallback definitions for some architectures

ggml-ci

* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
2025-08-18 20:30:45 +03:00
Romain Biessy
6558022873 sycl: fix mul_mat selection (llama/15092) 2025-08-18 20:30:45 +03:00
Christian Kastner
349b9a2097 cmake: Add GGML_BACKEND_DIR option (llama/15074)
* cmake: Add GGML_BACKEND_DIR option

This can be used by distributions to specify where to look for backends
when ggml is built with GGML_BACKEND_DL=ON.
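A hedged sketch of what such dynamic discovery can look like (illustration
only, not the actual ggml loader; the "libggml-" prefix and ".so" suffix are
assumptions): scan a directory such as the one configured via
GGML_BACKEND_DIR and dlopen every backend library found there.

```cuda
#include <dlfcn.h>
#include <filesystem>
#include <string>
#include <vector>

std::vector<void *> load_backends(const std::string & dir) {
    std::vector<void *> handles;
    for (const auto & entry : std::filesystem::directory_iterator(dir)) {
        const std::string name = entry.path().filename().string();
        if (name.rfind("libggml-", 0) == 0 && entry.path().extension() == ".so") {
            if (void * h = dlopen(entry.path().c_str(), RTLD_NOW | RTLD_LOCAL)) {
                handles.push_back(h); // each backend registers itself on load
            }
        }
    }
    return handles;
}
```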

* Fix phrasing
2025-08-18 20:30:45 +03:00
Jeff Bolz
00ff38376a vulkan: fix build when using glslang that does not support coopmat2 (llama/15062) 2025-08-18 20:30:45 +03:00
Jeff Bolz
abc971e69a vulkan: Use coopmat2 for conv2d (llama/14982) 2025-08-18 20:30:45 +03:00
lhez
53d8c5179f opencl: fix adreno compiler detection logic (llama/15029) 2025-08-18 20:30:45 +03:00
Johannes Gäßler
d6e7315717 CUDA: use mma FA kernel for gqa > 4 on RTX 4000 (llama/15035) 2025-08-18 20:30:45 +03:00
leejet
a3123e105b cuda: make im2col a little faster (llama/15025) 2025-08-18 20:30:45 +03:00
Georgi Gerganov
d119ecf0c1 cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1 (llama/15038)
* cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1

ggml-ci

* cont : fix cont types

ggml-ci

* cont : adopt variable names and comment from the other branch
2025-08-18 20:30:45 +03:00
Jeff Bolz
b374fd6172 vulkan: coopmat2 mul_mat optimizations (llama/14934)
- Increase tile size for k-quants, to match non-k-quants
- Choose more carefully between large and medium tiles, considering how it
  interacts with split_k
- Allow larger/non-power of two split_k, and make the splits a multiple of 256
- Use split_k==3 when >1/2 and <=2/3 of the SMs would have been used
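One possible reading of that split_k==3 rule, as a hypothetical sketch (the
real Vulkan heuristic also weighs tile sizes and other split_k values):

```cuda
#include <cstdint>

uint32_t pick_split_k(uint32_t workgroups, uint32_t shader_core_count) {
    const float used = (float) workgroups / (float) shader_core_count;
    // more than half but at most two thirds of the cores would be busy:
    // split the K dimension three ways to spread the work
    if (used > 0.5f && used <= 2.0f / 3.0f) {
        return 3;
    }
    return 1;
}
```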
2025-08-18 20:30:45 +03:00
Jeff Bolz
97341224b2 vulkan: Support ne[3]>1 in noncontig matrix-vector multiply (llama/15015) 2025-08-18 20:30:45 +03:00
Jeff Bolz
46e9e5b9a7 vulkan: optimizations for direct convolution (llama/14933)
* vulkan: optimizations for direct convolution

- Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill
  the GPU. The new size should be amenable to using coopmat, too.
- Fix shmem bank conflicts. 16B padding should work with coopmat.
- Some explicit loop unrolling.
- Skip math/stores work for parts of the tile that are OOB.
- Apply fastdiv opt.
- Disable shuffles for NV.

* Three tiles sizes for CONV_2D, and a heuristic to choose

* reallow collectives for pre-Turing

* make SHMEM_PAD a spec constant

* fixes for intel perf - no shmem padding, placeholder shader core count

* shader variants with/without unrolling

* 0cc4m's fixes for AMD perf

Co-authored-by: 0cc4m <picard12@live.de>

---------

Co-authored-by: 0cc4m <picard12@live.de>
2025-08-18 20:30:45 +03:00
Johannes Gäßler
7e7557ac50 CUDA: fix MMQ nwarps for AMD with warp_size==32 (llama/15014) 2025-08-18 20:30:45 +03:00
lhez
ba6a81c9c9 opencl: add f16 for add, sub, mul, div (llama/14984) 2025-08-18 20:30:45 +03:00
Srihari-mcw
1c6cb7df47 ggml : Q2k interleaving implementation - x86/x64 SIMD (llama/14373)
* Initial Q2_K Block Interleaving Implementation

* Addressed review comments and clean up of the code

* Post rebase fixes

* Initial CI/CD fixes

* Update declarations in arch-fallback.h

* Changes for GEMV Q2_K in arch-fallback.h

* Enable repacking only on AVX-512 machines

* Update comments in repack.cpp

* Address q2k comments

---------

Co-authored-by: Manogna-Sree <elisetti.manognasree@multicorewareinc.com>
2025-08-18 20:30:45 +03:00
diannao
78668cb8d1 docker : add cann build pipeline (llama/14591)
* docker: add cann build pipeline

* docker: add cann build pipeline

* docker: fix cann devops

* cann : fix multi card hccl

* Update ggml/src/ggml-cann/ggml-cann.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Update ggml-cann.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-08-18 20:30:45 +03:00
Ruben Ortlam
41e161657e Vulkan: Fix minor debug mode issues (llama/14899)
* vulkan: fix debug mode issues

* vulkan: remove broken check_results GGML_OP_SET_ROWS support
2025-08-18 20:30:45 +03:00
hipudding
572152d6af CANN: Improve loading efficiency after converting weights to NZ format. (llama/14985)
* CANN: Improve loading efficiency after converting weights to NZ format.

* CANN: fix typo
2025-08-18 20:30:45 +03:00
lhez
4904bc3bda opencl: add mul_mat_f32_f32_l4_lm and mul_mat_f16_f32_l4_lm (llama/14809) 2025-08-18 20:30:45 +03:00
uvos
8ed27b407d HIP: enable mfma mmq on gfx908 and gfx90a for select datatypes and shapes (llama/14949) 2025-08-18 20:30:45 +03:00