Commit Graph

2905 Commits

Author SHA1 Message Date
a16da91365 examples : update links in wasm examples (#3318)
* fix 404 link

* update link in whisper.wasm example

* update example in command.wasm

* update link in bench.wasm example

* update link in stream.wasm example
2025-07-12 23:22:35 +02:00
3775c503d5 sync : resolve conflicts (#0)
ggml-ci
2025-07-12 19:23:56 +03:00
6ddff4d96a talk-llama : sync llama.cpp
ggml-ci
2025-07-12 19:23:56 +03:00
6d64e4abf3 sync : ggml 2025-07-12 19:23:56 +03:00
85dcc74b88 sync : resolve conflicts (ggml/0)
ggml-ci
2025-07-12 19:23:56 +03:00
915fc153a5 vulkan: support SET_ROWS (llama/14587)
* vulkan: support SET_ROWS

Add variants of the copy_to_quant shader that do the SET_ROWS operation.
Change these shaders to spread the work across the workgroup.
The memory access pattern is probably not great (one thread per quant block),
but should be fine for now.

* vulkan: optimize set_rows

Larger workgroups for non-quant types.
Set "norepeat" (there is manual repeat logic).
Use fastmod.
2025-07-12 19:23:56 +03:00
8670a3fd5d vulkan: optimizations for deepseek prompt processing (llama/14555)
* vulkan: allow unclamped loads in coopmat2 mul_mat_id shader

* vulkan: increase coopmat2 mul_mat_id tile size

* vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path

* vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)
2025-07-12 19:23:56 +03:00
74f6d47904 model : support LiquidAI LFM2 hybrid family (llama/14620)
**Important**
LFM2 was [merged](https://github.com/huggingface/transformers/pull/39340) into transformers, but has not yet been released.
To convert to gguf, install transformers from source:
```shell
pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"
```
2025-07-12 19:23:56 +03:00
a4ff4ec9cb HIP : Add HIP 7.0+ compatibility for hipBLAS compute types (llama/14634) 2025-07-12 19:23:56 +03:00
b0754136be opencl: add tiled mul_mat_f16_f32 (llama/14535)
* add tiled mul_mat_f16_f32

* fix trailing whitespace

* add insightful comments
2025-07-12 19:23:56 +03:00
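As a side note on the tiled mul_mat_f16_f32 kernel above: tiling means computing the output in blocks so the sub-blocks of A and B being read stay resident in fast memory. A rough scalar C sketch of the idea (the real OpenCL kernel reads A in f16, accumulates in f32, and works on device-local tiles; none of that is reproduced here):

```c
#include <stddef.h>

// Rough scalar sketch of tiled matrix multiplication: walk the output
// in TILE x TILE blocks so the A and B sub-blocks stay hot in cache or
// local memory. The caller must zero C beforehand because each k-tile
// accumulates into it. Illustrative only, not the OpenCL kernel.
#define TILE 32

static void matmul_tiled_f32(float *C, const float *A, const float *B,
                             size_t M, size_t N, size_t K) {
    for (size_t i0 = 0; i0 < M; i0 += TILE) {
        for (size_t j0 = 0; j0 < N; j0 += TILE) {
            for (size_t k0 = 0; k0 < K; k0 += TILE) {
                for (size_t i = i0; i < i0 + TILE && i < M; ++i) {
                    for (size_t j = j0; j < j0 + TILE && j < N; ++j) {
                        float acc = 0.0f;
                        for (size_t k = k0; k < k0 + TILE && k < K; ++k) {
                            acc += A[i*K + k] * B[k*N + j];
                        }
                        C[i*N + j] += acc;
                    }
                }
            }
        }
    }
}
```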
6f113cbcaa opencl: add set_rows for f16 and f32 (llama/14547)
* opencl: add `set_rows` for `f16` and `f32`

* opencl: better choose workgroup size for `set_rows`
2025-07-12 19:23:56 +03:00
3c21cde540 SYCL: Initial set_rows kernel implementation (llama/14562)
* SYCL: Initial set_rows kernel implementation

* Revert max_threads to 256

* Refactor set_rows and address review comments

* Deduplicate conversion function

* Remove guard before kernel launch and refactor

* Fix and add back SFINAE
2025-07-12 19:23:56 +03:00
fb885fa48b cuda : support Falcon-H1 state size for SSM_SCAN (llama/14602) 2025-07-12 19:23:56 +03:00
2021870fb8 ggml : add ggml_scale_bias (llama/14417)
* ggml : add ggml_scale_bias

* ggml_vec_mad1_f32

* add more simd

* add CUDA

* sycl

* vulkan

* cann (placeholder)

* opencl

* will this fix cpu?

* fix cuda

* suggestions from coderabbit

* fix cann compile error

* vDSP_vsmsa

* rm __ARM_FEATURE_SVE

* use memcpy for op params

* make code looks more consistent

* use scalar for __ARM_FEATURE_SVE

* add x param to ggml_vec_mad1_f32
2025-07-12 19:23:56 +03:00
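The ggml_scale_bias commit above adds a fused scale-and-bias op; the vDSP_vsmsa bullet hints at the per-element math (vector times scalar plus scalar). A minimal C sketch of that element-wise operation, assuming nothing about the actual ggml kernel or its SIMD and accelerator paths:

```c
#include <stddef.h>

// Schematic of what a fused scale+bias computes per element:
// y[i] = s * x[i] + b. The real ggml kernels dispatch to the backends
// listed in the commit (CUDA, SYCL, Vulkan, CANN, OpenCL, vDSP_vsmsa),
// but the underlying math is just this loop.
static void scale_bias_f32(float *y, const float *x, size_t n, float s, float b) {
    for (size_t i = 0; i < n; ++i) {
        y[i] = s * x[i] + b;
    }
}
```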
48b18f9eb8 ggml : prevent integer overflow in gguf tensor size calculation (llama/14595) 2025-07-12 19:23:56 +03:00
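That overflow fix is a one-liner in the log; as a generic illustration (not the actual gguf code), a tensor byte-size computation can be guarded against size_t overflow like this:

```c
#include <stdint.h>
#include <stdio.h>

// Hypothetical overflow-checked size computation: refuse to multiply
// when element count times element size would exceed SIZE_MAX.
// Illustrative only; the real gguf loader has its own checks.
static int safe_tensor_nbytes(size_t n_elements, size_t type_size, size_t *out) {
    if (type_size != 0 && n_elements > SIZE_MAX / type_size) {
        return -1; // product would overflow size_t
    }
    *out = n_elements * type_size;
    return 0;
}

int main(void) {
    size_t nbytes;
    if (safe_tensor_nbytes((size_t)1 << 40, 4, &nbytes) == 0) {
        printf("tensor needs %zu bytes\n", nbytes);
    } else {
        printf("tensor size overflows size_t\n");
    }
    return 0;
}
```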
fadb3233b6 vulkan: optimize flash attention split_k_reduce (llama/14554)
* vulkan: allow FA split_k with smaller KV values

* vulkan: spread split_k_reduce work across more threads

k_num can get rather large. Use the whole workgroup to reduce the M/L values.

Launch a thread for each element in the HSV dimension of the output. Helps a
lot for large HSV (like deepseek).
2025-07-12 19:23:56 +03:00
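For context on the split_k_reduce commit: split-k flash attention leaves one partial result per split, each carrying a running max M, a softmax denominator L and an unnormalized accumulator over the HSV dimension. The reduce step merges them with the usual log-sum-exp rescaling. A hedged C sketch of merging two partials for one output row (not the Vulkan shader, which spreads this work across a workgroup):

```c
#include <math.h>
#include <stddef.h>

// Merge two split-k flash-attention partials for a single output row:
// rescale both accumulators to the common running max, then add.
// O1/M1/L1 are updated in place. Illustrative only.
static void fa_merge_partials(float *O1, float *M1, float *L1,
                              const float *O2, float M2, float L2,
                              size_t hsv) {
    const float m  = fmaxf(*M1, M2);
    const float s1 = expf(*M1 - m);
    const float s2 = expf(M2  - m);
    for (size_t i = 0; i < hsv; ++i) {
        O1[i] = O1[i] * s1 + O2[i] * s2;
    }
    *L1 = *L1 * s1 + L2 * s2;
    *M1 = m;
}
```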
9750e4c988 vulkan : fix rope with partial rotation and non-cont src (llama/14582) 2025-07-12 19:23:56 +03:00
c3942b3db6 cuda : fix rope with partial rotation and non-cont src (llama/14580)
* cuda : fix rope non-cont

ggml-ci

* cont : fix multi-rope + add test

ggml-ci

* sycl : try fix

ggml-ci

* cont : fix sycl + clean-up cuda

ggml-ci
2025-07-12 19:23:56 +03:00
98e7beac6c CUDA: add bilinear interpolation for upscale (llama/14563) 2025-07-12 19:23:56 +03:00
7e9c6bbab2 musa: fix build warnings (unused variable) (llama/14561)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-12 19:23:56 +03:00
8e545f466c CUDA: add bf16 and i32 to getrows (llama/14529) 2025-07-12 19:23:56 +03:00
e753b9a952 vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (llama/14485)
Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260

Co-authored-by: Rémy Oudompheng <remyoudompheng@gmail.com>
2025-07-12 19:23:56 +03:00
9d0c408260 vulkan: fix rms_norm+mul fusion (llama/14545)
The fused operation was grabbing the epsilon value from the wrong place.

Add an env var to disable fusion.

Add some missing checks for supported shapes/types.

Handle fused rms_norm+mul in check_results.
2025-07-12 19:23:56 +03:00
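The fusion bug above came down to where epsilon is read from. A hedged C sketch of the fused rms_norm+mul math on one row, with epsilon applied inside the square root as part of the norm (illustrative, not the Vulkan shader):

```c
#include <math.h>
#include <stddef.h>

// Schematic fused rms_norm + mul on one row: normalize x by its RMS
// (epsilon belongs to the rms_norm op and sits inside the sqrt), then
// apply the element-wise multiplier m from the fused mul.
static void rms_norm_mul_row(float *y, const float *x, const float *m,
                             size_t n, float eps) {
    float sumsq = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sumsq += x[i] * x[i];
    }
    const float inv_rms = 1.0f / sqrtf(sumsq / (float)n + eps);
    for (size_t i = 0; i < n; ++i) {
        y[i] = x[i] * inv_rms * m[i];
    }
}
```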
3aebb8d5d3 vulkan: Handle updated FA dim2/3 definition (llama/14518)
* vulkan: Handle updated FA dim2/3 definition

Pack mask boolean and n_head_log2 into a single dword to keep the push
constant block under the 128B limit.

* handle null mask for gqa

* allow gqa with dim3>1
2025-07-12 19:23:56 +03:00
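Packing the mask flag and n_head_log2 into a single dword is what keeps the push-constant block under the 128-byte limit. A hedged C sketch of that kind of bit-packing (the field layout here is illustrative, not the shader's actual encoding):

```c
#include <stdint.h>

// Illustrative packing of a boolean and a small integer into one
// 32-bit push-constant word: bit 31 carries the mask flag, the low
// bits carry n_head_log2. The real shader layout may differ.
static uint32_t pack_mask_nhead(int has_mask, uint32_t n_head_log2) {
    return ((uint32_t)(has_mask != 0) << 31) | (n_head_log2 & 0x7FFFFFFFu);
}

static void unpack_mask_nhead(uint32_t packed, int *has_mask, uint32_t *n_head_log2) {
    *has_mask    = (int)((packed >> 31) & 1u);
    *n_head_log2 = packed & 0x7FFFFFFFu;
}
```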
df5af1dc75 opencl: add GELU_ERF (llama/14476) 2025-07-12 19:23:56 +03:00
10d0d28f7c metal : disable fast math in all quantize kernels (llama/14528)
ggml-ci
2025-07-12 19:23:56 +03:00
af304ef080 CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (llama/14002)
Co-authored-by: luyuhong <luyuhong@kylinos.cn>
2025-07-12 19:23:56 +03:00
e8138c51d2 ggml : implement GEGLU_ERF and GEGLU_QUICK ops (llama/14445) 2025-07-12 19:23:56 +03:00
7cec4cc83a opencl : broadcast for soft_max (llama/14510) 2025-07-12 19:23:56 +03:00
a432929d58 vulkan: support mixed/deepseekR1 FA head sizes (llama/14509)
* vulkan: better parameterize FA by head sizes

* vulkan: support mixed/deepseekR1 FA head sizes
2025-07-12 19:23:56 +03:00
4aaf8114e7 ggml: backward pass for split swiglu (llama/14483) 2025-07-12 19:23:56 +03:00
0ca760433c Fix conditional enabling following arch checks for ggml-sycl (llama/14504)
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
2025-07-12 19:23:56 +03:00
ed639c7f22 kv-cache : use ggml_set_rows (llama/14285)
* kv-cache : use ggml_set_rows

ggml-ci

* graph : separate k and v indices

ggml-ci

* cont : remove redundant ifs

ggml-ci

* kv-cache : improve find_slot impl

* kv-cache : bounds-check when accessing slot_info indices

* kv-cache : add comments

ggml-ci

* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends

ggml-ci
2025-07-12 19:23:56 +03:00
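The point of moving the KV cache onto ggml_set_rows is that new K/V rows can be written to arbitrary cache slots through an index tensor instead of relying on contiguous views. Conceptually the op does something like the following C sketch on plain row-major buffers (the actual ggml signature and tensor layout are not reproduced here):

```c
#include <stdint.h>
#include <string.h>

// Conceptual set-rows: copy each source row i into the destination row
// named by idx[i]. For the KV cache this is what lets freshly computed
// K/V rows land in whichever cache slots find_slot picked.
static void set_rows_f32(float *dst, const float *src,
                         const int64_t *idx, int64_t n_rows, int64_t row_size) {
    for (int64_t i = 0; i < n_rows; ++i) {
        memcpy(dst + idx[i] * row_size,
               src + i * row_size,
               (size_t)row_size * sizeof(float));
    }
}
```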
0abd0660e1 ggml : fix FA mask dim 2 and 3 (llama/14505)
* ggml : fix FA mask dim 2 and 3

ggml-ci

* backends : unsupport batched FA in CUDA and Vulkan

ggml-ci

* vulkan : disable FA for mask->ne[2] != 1
2025-07-12 19:23:56 +03:00
9cde908c0a CUDA: add dynamic shared mem to softmax, refactor general usage (llama/14497) 2025-07-12 19:23:56 +03:00
d2d120c256 llama : initial Mamba-2 support (llama/9126)
* llama : initial Mamba-2 support

* ggml : SIMD ggml_ssm_scan for Mamba-2

* ggml : improve ggml_mul speed when masking recurrent states

* llama : support running Mamba-Codestral-7B-v0.1

* llama : fix Mamba-2 conv state saving

* ggml : make the ggml_mul fast broadcast path more consistently formatted

* llama : remove unused variable

* llama : add missing break

* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.

* llama : avoid redundant state copy for Mamba 1 and 2

* metal : attempt to adapt SSM_SCAN for Mamba-2

* metal : fix SSM_SCAN pipeline scope

* metal : use log and exp instead of log1pf and expf in SSM_SCAN

* metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

* metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

* metal : fix SSM_SCAN state head offset

* metal : fix wrong number of tokens per sequence in SSM_SCAN

* ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.

* ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

* convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models
to avoid some reshapes.

Not sure if it's a good idea,
but it makes the graph slightly cleaner.

* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks

* convert : fix flake8 lint

* metal : fix confusion between ; and ,

* metal : add missing args for nb references in ssm_scan_f32_group

* metal : single-user mamba2 inference works

* kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.

* convert : avoid AutoConfig for Mamba and Mamba2 hparams

* kv-cache : allow context shift for recurrent models

* graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

* ggml : fix mamba2 ssm scan when compiled with SVE

* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches

* cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2

* mamba : fix mismatched new and delete size for llm_build_mamba

Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This would otherwise cause problems when running Mamba-(1|2) inference
when compiled with -DGGML_SANITIZE_ADDRESS=ON.

* cuda : graceful fallback for Mamba-1 models with weird embd size
2025-07-12 19:23:56 +03:00
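For readers following the Mamba-2 work above, the core of GGML_OP_SSM_SCAN is a per-channel state-space recurrence. The heavily simplified, single-channel C sketch below assumes a scalar decay A per channel and leaves the D*x skip term outside the op, as the commit notes; the real kernels run this per head and sequence with different layouts:

```c
#include <math.h>
#include <stddef.h>

// Simplified single-channel SSM scan: discretize the decay with dt,
// update the d_state-wide hidden state, and read it out through C.
// y[t] here excludes the D*x skip connection (applied outside the op).
// Illustrative sketch only, not the ggml/Metal/CUDA kernels.
static void ssm_scan_1ch(float *y, float *h, const float *x, const float *dt,
                         float A, const float *B, const float *C,
                         size_t n_tok, size_t d_state) {
    for (size_t t = 0; t < n_tok; ++t) {
        const float dA = expf(dt[t] * A); // discretized decay for this step
        float acc = 0.0f;
        for (size_t n = 0; n < d_state; ++n) {
            h[n] = dA * h[n] + dt[t] * B[t*d_state + n] * x[t];
            acc += C[t*d_state + n] * h[n];
        }
        y[t] = acc;
    }
}
```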
fb5c4095ee CUDA: add softmax broadcast (llama/14475)
* CUDA: add softmax broadcast

* Pass by const ref

* Review: Use blockDims for indexing, remove designated initializers

* Add TODO for non-contiguous input/output
2025-07-12 19:23:56 +03:00
70515ed728 CUDA: broadcasting for FlashAttention mask (llama/14500) 2025-07-12 19:23:56 +03:00
1b3e06a400 vulkan: support softmax/FA batch and broadcast (llama/14449) 2025-07-12 19:23:56 +03:00
d1286cf32b ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (llama/14435) 2025-07-12 19:23:56 +03:00
2e04b81f3e opencl : fix possible buffer overflow in dump_tensor (llama/14490) 2025-07-12 19:23:56 +03:00
cd87a2f7e0 opencl : skip empty nodes on cgraph compute (llama/14491) 2025-07-12 19:23:56 +03:00
e43c38f9f1 opencl : update upscale to support align corners (llama/14488) 2025-07-12 19:23:56 +03:00
ab850d4680 ggml : Callback before abort (llama/14481)
* Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed.

* Return previous callback to allow callback chaining

* style fixes

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-07-12 19:23:56 +03:00
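The abort-callback change follows a common register-and-chain pattern: set the new callback, get the old one back, and forward to it if desired. A hedged C sketch of that pattern with illustrative names (not the actual ggml API or its signature):

```c
#include <stdio.h>
#include <stdlib.h>

// Illustrative abort-callback registration that returns the previous
// callback so callers can chain handlers; names and signatures are
// assumptions, not the ggml entry points.
typedef void (*abort_callback_t)(const char *msg, void *user_data);

static abort_callback_t g_abort_cb      = NULL;
static void            *g_abort_cb_data = NULL;

static abort_callback_t set_abort_callback(abort_callback_t cb, void *user_data) {
    abort_callback_t prev = g_abort_cb;
    g_abort_cb      = cb;
    g_abort_cb_data = user_data;
    return prev; // caller may store this and forward to it
}

static void fatal_error(const char *msg) {
    if (g_abort_cb) {
        g_abort_cb(msg, g_abort_cb_data); // GUI apps can show a dialog, save state, etc.
    }
    fprintf(stderr, "fatal: %s\n", msg);
    abort();
}
```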
cdf5e72163 ci : disable fast-math for Metal GHA CI (llama/14478)
* ci : disable fast-math for Metal GHA CI

ggml-ci

* cont : remove -g flag

ggml-ci
2025-07-12 19:23:56 +03:00
32d7c10766 CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (llama/14411)
* [CANN] update to aclnnGroupedMatmulV2

Signed-off-by: noemotiovon <757486878@qq.com>

* Support MUL_MAT_ID on 310p

Signed-off-by: noemotiovon <757486878@qq.com>

* fix editorconfig

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-07-12 19:23:56 +03:00
3c7939cfe5 vulkan: Split large mul_mat_id to fit in shared memory (llama/14451) 2025-07-12 19:23:56 +03:00
6fc80e8456 add GELU_ERF (llama/14455) 2025-07-12 19:23:56 +03:00
19b9aaf044 vulkan : implement bilinear interpolation for ggml_upscale/ggml_interpolate (ggml/1291)
* supports GGML_SCALE_MODE_BILINEAR and GGML_SCALE_FLAG_ALIGN_CORNERS
2025-07-12 19:23:56 +03:00
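For the bilinear upscale entries above, the align-corners flag only changes how destination pixels map back to source coordinates; the 2x2 neighborhood blend is the same either way. A hedged C sketch of that coordinate mapping (illustrative, not the ggml or Vulkan implementation):

```c
// Map a destination index back to a (possibly fractional) source
// coordinate for bilinear resampling. With align_corners the grid
// endpoints coincide; otherwise pixels are treated as cell centers.
static float bilinear_src_coord(int dst_i, int dst_size, int src_size, int align_corners) {
    if (align_corners && dst_size > 1) {
        return (float)dst_i * (float)(src_size - 1) / (float)(dst_size - 1);
    }
    const float scale = (float)src_size / (float)dst_size;
    return ((float)dst_i + 0.5f) * scale - 0.5f;
}
```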
f98cb6607b vulkan : implement ggml_roll (ggml/1290)
* vulkan : implement ggml_roll

* vulkan : refactor vk_op_unary_push_constants initialization
2025-07-12 19:23:56 +03:00