Commit Graph

845 Commits

SHA1 Message Date
f44b53480f sycl: disable reorder for sycl mulmat (llama/13536) 2025-05-27 18:03:00 +03:00
e04e8f1c79 metal : fix typo in FA kernel comments (llama/13651) 2025-05-27 18:03:00 +03:00
ee3f177cba sycl : Overcoming workaround for mmap() allocation on Windows (llama/13482)
* Remove mmap workaround on windows

After some testing I found that mmap is supported on Windows and for many
GPUs on Linux. Therefore I removed the Windows workaround, since it is no
longer necessary.

* Update llama-bench README

The SYCL backend had introduced a workaround that allows llama-bench to run
even without specifying the `--mmp 0` flag
2025-05-27 18:03:00 +03:00
0b69f74e15 Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (llama/13607) 2025-05-27 18:03:00 +03:00
9da3fc27be CANN: Support MOE Model MUL_MAT_ID (llama/13042)
Signed-off-by: noemotiovon <757486878@qq.com>
2025-05-19 14:58:39 +03:00
2c13651e08 cmake: use the current build config for vulkan-shaders-gen (llama/13595)
* fix: use the current build config for `vulkan-shaders-gen`

* fix: only pass a valid build type to `--config`
2025-05-19 14:58:39 +03:00
13dca86c56 vulkan: move common FA code to flash_attn_base.comp (llama/13556)
* vulkan: move common FA code to flash_attn_base.comp

* vulkan: move common FA index/stride setup code to flash_attn_base.comp

* build fix
2025-05-19 14:58:39 +03:00
6d61a09bc4 vulkan: use scalar FA rather than coopmat2 when N==1 (llama/13554) 2025-05-19 14:58:39 +03:00
4fedad988b metal : add FA-vec kernel for head size 64 (llama/13583)
ggml-ci
2025-05-19 14:58:39 +03:00
a8e17a244d sycl : fixed compilation warnings (llama/13582) 2025-05-19 14:58:39 +03:00
0c76acd08a gguf : use ggml log system (llama/13571)
* gguf : use ggml log system

* llama : remove unnecessary new lines in exception messages
2025-05-19 14:58:39 +03:00
27964db1be sycl: simplify bin_bcast_kernel (llama/13383) 2025-05-19 14:58:39 +03:00
8081e7a23d sycl: reordered Q4_K MMVQ (llama/13109) 2025-05-19 14:58:39 +03:00
d807c497a4 sycl: use oneDNN for matrices multiplication (llama/12972) 2025-05-19 14:58:39 +03:00
8e9bf548f4 arm64: optimize q6_k_q8_k kernel with i8mm (llama/13519)
This PR improves the q6_k_q8_k GEMM kernel using the arm64 i8mm instruction.

Tested on Neoverse-N2 with a Llama 3 8B Q6_K quantized model.
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```
2025-05-19 14:58:39 +03:00
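
As a rough illustration of what the i8mm path above builds on, here is a minimal, self-contained sketch of the SMMLA operation exposed through the `vmmlaq_s32` NEON intrinsic. It is not the ggml q6_k_q8_k kernel, only the core 2x8 by 8x2 int8 tile multiply that such a kernel is built around:

```cpp
// Minimal sketch only: needs an AArch64 compiler with i8mm enabled,
// e.g. -march=armv8.6-a+i8mm. Not the actual ggml q6_k_q8_k kernel.
#include <arm_neon.h>
#include <cstdio>

int main() {
    // Two 2x8 int8 operand tiles, stored row-major in 16-byte vectors.
    const int8_t a_bytes[16] = {1, 2, 3, 4, 5, 6, 7, 8,
                                1, 1, 1, 1, 1, 1, 1, 1};
    const int8_t b_bytes[16] = {1, 1, 1, 1, 1, 1, 1, 1,
                                2, 2, 2, 2, 2, 2, 2, 2};
    const int8x16_t a = vld1q_s8(a_bytes);
    const int8x16_t b = vld1q_s8(b_bytes);

    // SMMLA: acc (2x2 int32) += a (2x8 int8) * b^T (8x2 int8)
    int32x4_t acc = vdupq_n_s32(0);
    acc = vmmlaq_s32(acc, a, b);

    int32_t out[4];
    vst1q_s32(out, acc);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); // prints 36 72 8 16
    return 0;
}
```
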
0dda27bc0b CUDA: fix crash on large batch size for quant. MoE (llama/13537) 2025-05-19 14:58:39 +03:00
ffa4720f25 CUDA: faster Deepseek FA, add Turing support (llama/13435) 2025-05-19 14:58:39 +03:00
9b8eea28b5 cmake: simplify vulkan shader test logic (llama/13263) 2025-05-19 14:58:39 +03:00
162bbe8220 vulkan: KHR_coopmat flash attention (llama/13506)
This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may be from other optimizations like staging
through shared memory, or splitting by rows.
2025-05-19 14:58:39 +03:00
a221288dc6 vulkan: workaround FA compile failures on macos (llama/13517) 2025-05-19 14:58:39 +03:00
08436716ae metal : use FA-vec kernel up to batch size 20 (llama/13496)
* batched-bench : fix pp batch contents

* metal : optimize multi-sequence FA vec kernel

ggml-ci

* metal : use FA-vec kernel up to batch size 20

ggml-ci
2025-05-19 14:58:39 +03:00
e11fc21e6c metal : optimize multi-sequence FA vec kernel (llama/13493)
* batched-bench : fix pp batch contents

* metal : optimize multi-sequence FA vec kernel

ggml-ci
2025-05-19 14:58:39 +03:00
a77a924b20 ggml-cpu: Update KleidiAI to v1.6 and fix include directives (llama/13509)
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
2025-05-19 14:58:39 +03:00
405b9c77ad mnist: fix segmentation fault (ggml/1227) 2025-05-19 14:58:39 +03:00
9c3bfc1499 ggml : fix apple OS check in ggml_print_backtrace (ggml/1229) 2025-05-19 14:58:39 +03:00
5b7797f674 ggml : Fix missing backtrace on Linux (ggml/1228)
* Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1
* Fixed lldb attach
* Simplify by having the child do ggml_print_backtrace_symbols
2025-05-19 14:58:39 +03:00
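
The fallback path referenced above can be illustrated with a minimal sketch using glibc's `<execinfo.h>` API; this is a simplified assumption-laden sketch, not the actual `ggml_print_backtrace` implementation:

```cpp
// Sketch of a fork-and-print backtrace fallback (Linux/glibc); simplified,
// not the real ggml code. Useful when attaching a debugger is blocked,
// e.g. because /proc/sys/kernel/yama/ptrace_scope is 1.
// Build with -rdynamic to get readable symbol names.
#include <execinfo.h>
#include <sys/wait.h>
#include <unistd.h>

static void print_backtrace_symbols() {
    void * trace[64];
    // After fork() the child sees a copy of the parent's stack, so the
    // captured frames reflect the state of the process at the fork point.
    const int n = backtrace(trace, 64);
    backtrace_symbols_fd(trace, n, STDERR_FILENO);
}

static void print_backtrace() {
    const pid_t pid = fork();
    if (pid == 0) {
        print_backtrace_symbols(); // child prints and exits
        _exit(0);
    } else if (pid > 0) {
        waitpid(pid, nullptr, 0);  // parent waits for the child to finish
    }
}

int main() {
    print_backtrace();
    return 0;
}
```
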
75e9a840c5 ggml : add mrope kernel for metal (llama/13457) 2025-05-13 13:59:21 +03:00
41ed62bdbc metal : optimize MoE for large batches (llama/13388) 2025-05-13 13:59:21 +03:00
029c8837f8 opencl: remove unnecessary assert for add (llama/13257) 2025-05-13 13:59:21 +03:00
5d8b068249 llama/ggml: add LLM training support (llama/10544)
* llama/ggml: add LLM training support

more compact progress bar

llama_save_model_to_file

llama_opt_param_filter

ggml_graph_dup force_grads

refactor ggml_opt, fix test-opt

* remove logits_all

* refactor CUDA implementation for ACC

* reset graph at beginning of opt period
2025-05-13 13:59:21 +03:00
93ef22657e ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (llama/13053)
* ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel

Signed-off-by: Dan Johansson <dan.johansson@arm.com>

* code review fixes

Signed-off-by: Dan Johansson <dan.johansson@arm.com>

* adds a comment that clarifies barrier usage

Signed-off-by: Dan Johansson <dan.johansson@arm.com>

---------

Signed-off-by: Dan Johansson <dan.johansson@arm.com>
Co-authored-by: Charles Xu <charles.xu@arm.com>
2025-05-13 13:59:21 +03:00
866f685bbc CUDA: fix misaligned synchronization in FA (llama/13469) 2025-05-13 13:59:21 +03:00
250bcc041a enable dpcpp nightly builds with libraries (llama/13406) 2025-05-13 13:59:21 +03:00
90b17a99bf CUDA: fix crash with partial offloading of MoE (llama/13439) 2025-05-13 13:59:21 +03:00
e1b2ace0f8 Add --no-op-offload to improve -ot pp perf in MoE models like llama4 400B (llama/13386) 2025-05-13 13:59:21 +03:00
6db0e01db6 CUDA: fix race conditions FlashAttention kernels (llama/13438) 2025-05-13 13:59:21 +03:00
16f3546f38 CUDA: fix FlashAttention on Turing (llama/13415) 2025-05-13 13:59:21 +03:00
a04b329ad1 vulkan: scalar flash attention implementation (llama/13324)
* vulkan: scalar flash attention implementation

* vulkan: always use fp32 for scalar flash attention

* vulkan: use vector loads in scalar flash attention shader

* vulkan: remove PV matrix, helps with register usage

* vulkan: reduce register usage in scalar FA, but perf may be slightly worse

* vulkan: load each Q value once. optimize O reduction. more tuning

* vulkan: support q4_0/q8_0 KV in scalar FA

* CI: increase timeout to accommodate newly-supported tests

* vulkan: for scalar FA, select between 1 and 8 rows

* vulkan: avoid using Float16 capability in scalar FA
2025-05-13 13:59:21 +03:00
45d8b2352e sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (llama/12858)
* sycl : Implemented reorder Q4_0 mmvq

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>

* sycl : Fixed mmvq being called when reorder is disabled

* sycl : Improved comments in the quants header

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>

* Use static_assert

* safe_div -> ceil_div

* Clarify qi comment

* change the reorder tensor from init to execute OP

* dbg

* Undo changes to test-backend-ops

* Refactor changes on top of q4_0 reorder fix

* Missing Reverts

* Refactored opt_for_reorder logic to simplify code path

* Explicit inlining and unroll

* Renamed mul_mat_algo enum for consistency

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
Co-authored-by: romain.biessy <romain.biessy@codeplay.com>
2025-05-13 13:59:21 +03:00
2d436bfbfb CUDA: FA support for Deepseek (Ampere or newer) (llama/13306)
* CUDA: FA support for Deepseek (Ampere or newer)

* do loop unrolling via C++ template
2025-05-13 13:59:21 +03:00
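
The "loop unrolling via C++ template" mentioned above is a general pattern rather than anything Deepseek-specific; below is a generic sketch of the idea (compile-time indices so the compiler can fully unroll the body). It is not taken from the actual CUDA kernel:

```cpp
// Generic compile-time loop unrolling via templates (C++17); a sketch of the
// pattern only, not the FlashAttention kernel code.
#include <cstdio>
#include <utility>

// Invokes f(0), f(1), ..., f(N-1) with each index as a compile-time constant,
// so every iteration can be specialized and unrolled by the compiler.
template <typename F, int... Is>
void unrolled_loop_impl(F && f, std::integer_sequence<int, Is...>) {
    (f(std::integral_constant<int, Is>{}), ...);
}

template <int N, typename F>
void unrolled_loop(F && f) {
    unrolled_loop_impl(std::forward<F>(f), std::make_integer_sequence<int, N>{});
}

int main() {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    // i is an integral_constant, so acc[i] uses a compile-time index.
    unrolled_loop<4>([&](auto i) { acc[i] += static_cast<float>(i); });
    printf("%g %g %g %g\n", acc[0], acc[1], acc[2], acc[3]); // 0 1 2 3
    return 0;
}
```
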
4b7cbb62ef CUDA: fix crash on large batch size for MoE models (llama/13384) 2025-05-13 13:59:21 +03:00
e27c91f6d6 rpc : add rpc_msg_set_tensor_hash_req (llama/13353)
* rpc : add rpc_msg_set_tensor_hash_req

Use a dedicated struct for the request of RPC_CMD_SET_TENSOR_HASH, which
makes the code cleaner.

* fix
2025-05-13 13:59:21 +03:00
e46df4850f vulkan: Allow up to 4096 elements for mul_mat_id row_ids (llama/13326)
This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:

GGML_ASSERT(nei0 * nei1 <= 3072);

The tensor is 8 x 512 (nei0 * nei1 = 4096), so increase this array size to 4096 to accommodate it.
2025-05-13 13:59:21 +03:00
e8a7f1b7bb sycl: addressing non-contiguous src1 mul_mats (nc and batched) (llama/13343)
* sycl: fixed non-contiguous src1 mul_mats (nc and batched)

* Fixed wrong static_cast inside kernel
2025-05-13 13:59:21 +03:00
09e6b66025 cuda : remove nrows_x in mul_mat_q_process_tile (llama/13325)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-05-07 21:00:32 +03:00
d41cf26a0f CUDA: mix virt/real CUDA archs for GGML_NATIVE=OFF (llama/13135) 2025-05-07 21:00:32 +03:00
3c67195be9 SYCL: Disable reorder optimize by default and stop setting tensor extras when optimize is disabled (llama/13254)
* SYCL: Do not set tensor extras when reorder optimize is disabled

* SYCL: Disable reorder optimize by default
2025-05-07 21:00:32 +03:00
f9f78a773f CUDA: fix bad asserts for partial offload (llama/13337) 2025-05-07 21:00:32 +03:00
be55e25cac CUDA: fix --split-mode row for MMQ (llama/13323) 2025-05-07 21:00:32 +03:00
2ffdda99e8 CUDA: fix logic for clearing padding with -ngl 0 (llama/13320) 2025-05-07 21:00:32 +03:00