Commit Graph

977 Commits

71adde9203 ggml-cpu : remove unnecessary arm feature detection (llama/14281)
Support for Arm runtime feature detection has now been added to GGML_CPU_ALL_VARIANTS. This removes the old and not very functional code.
2025-06-21 07:34:17 +03:00
cef59c1e26 build : suppress gcc15 compile warnings (llama/14261)
* Change _contains_any() substrs to std::string_view and fix the find comparison logic.
2025-06-21 07:34:17 +03:00
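A minimal sketch of the kind of helper the commit above describes, under assumed names (`contains_any` and its shape are illustrative, not the actual llama.cpp code):

```
#include <initializer_list>
#include <string_view>

// Illustrative sketch: std::string_view parameters avoid per-call
// std::string allocations, and the result of find() must be compared
// against npos (the "find comparison logic" the commit fixes).
static bool contains_any(std::string_view str,
                         std::initializer_list<std::string_view> substrs) {
    for (std::string_view sub : substrs) {
        if (str.find(sub) != std::string_view::npos) {
            return true;
        }
    }
    return false;
}
```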
a02a2d4240 sycl: Cleanup codepaths in Get Rows in sycl backend (llama/14215)
Addresses unused reorder path
2025-06-21 07:34:17 +03:00
be4ea0826b llamafile : support s390x SIMD instruction set (llama/14273) 2025-06-21 07:34:17 +03:00
1aca7b5c8a Vulkan: Set device max size for host memory to avoid OOM warning and fallback to CPU buffer (llama/14249) 2025-06-21 07:34:17 +03:00
b251d739ad metal : add mean kernel (llama/14267)
* metal : add mean kernel

ggml-ci

* cont : dedup implementation

ggml-ci
2025-06-21 07:34:17 +03:00
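In plain C++ terms, a mean kernel is a sum reduction divided by the row length, which is also why its implementation can be deduplicated with an existing sum kernel; a scalar sketch of the idea (the actual Metal kernel is parallel and structured differently):

```
#include <cstdint>

// Scalar sketch of the per-row reduction a mean kernel performs.
static float row_mean(const float * x, int64_t n) {
    float sum = 0.0f;
    for (int64_t i = 0; i < n; ++i) {
        sum += x[i];
    }
    return sum / (float) n;
}
```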
203451bcba ggml-cpu: reduce asm calls for hsum (llama/14037)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-06-21 07:34:17 +03:00
34940abe53 ggml-cpu: fix uncaught underscore terminators (llama/14023)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-06-21 07:34:17 +03:00
4fc9c34126 ggml: Add Apple support for GGML_CPU_ALL_VARIANTS (llama/14258) 2025-06-21 07:34:17 +03:00
471df139fa Add ggml_roll (ggml/1274)
* ggml : add ggml_roll

* use set/get_op_params & std::min
2025-06-21 07:34:17 +03:00
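For reference, a roll shifts elements along a dimension with wraparound; a minimal 1-D sketch of the semantics (the ggml op generalizes this across tensor dimensions, carrying the shifts in op_params):

```
#include <cstdint>
#include <vector>

// 1-D roll: element i moves to (i + shift) mod n, wrapping negative shifts.
static std::vector<float> roll_1d(const std::vector<float> & src, int64_t shift) {
    const int64_t n = (int64_t) src.size();
    std::vector<float> dst(n);
    if (n == 0) {
        return dst; // avoid modulo by zero on empty input
    }
    for (int64_t i = 0; i < n; ++i) {
        const int64_t j = ((i + shift) % n + n) % n;
        dst[j] = src[i];
    }
    return dst;
}
```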
0e068779c7 cmake: remove shader-gen step-targets from ggml-vulkan (llama/14226)
* Remove step-targets from vulkan-shaders-gen

* Unset DESTDIR when building vulkan-shaders-gen
2025-06-18 12:40:34 +03:00
ac8a303c9a ggml-cpu : remove the weak alias trick (llama/14221) 2025-06-18 12:40:34 +03:00
2a84593960 musa: fix build warning (unused variable) (llama/14231)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-06-18 12:40:34 +03:00
44871c8a3e llama : add thread safety test (llama/14035)
* llama : add thread safety test

* llamafile : remove global state

* llama : better LLAMA_SPLIT_MODE_NONE logic

when main_gpu < 0, GPU devices are not used

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-18 12:40:34 +03:00
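A hedged sketch of the described rule (names and values are illustrative, not the actual llama.cpp internals):

```
enum split_mode { SPLIT_MODE_NONE, SPLIT_MODE_LAYER, SPLIT_MODE_ROW };

// With SPLIT_MODE_NONE, a negative main_gpu now means "use no GPU devices";
// otherwise the single device main_gpu is used.
static int pick_main_device(split_mode mode, int main_gpu) {
    if (mode == SPLIT_MODE_NONE && main_gpu < 0) {
        return -1; // CPU only
    }
    return main_gpu;
}
```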
ad6cd94a3a cmake: clean up external project logic for vulkan-shaders-gen (llama/14179)
* Remove install step for vulkan-shaders-gen

* Add install step to normalize msvc with make

* Regenerate modified shaders at build-time
2025-06-18 12:40:34 +03:00
dbad9d8fba HIP: disable rocwmma on gfx12 by default until rocm 7.0 (llama/14202) 2025-06-18 12:40:34 +03:00
518835ee56 ggml: Add Android support for GGML_CPU_ALL_VARIANTS (llama/14206) 2025-06-18 12:40:34 +03:00
a3d1c55c66 vulkan: mutex around vkQueueSubmit (llama/14127)
This fixes the remaining crash in test-thread-safety on my system.
2025-06-18 12:40:34 +03:00
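The Vulkan spec requires external synchronization on a VkQueue, so submissions from multiple threads must be serialized; a sketch of the pattern (the struct and names are illustrative):

```
#include <mutex>
#include <vulkan/vulkan.h>

// Illustrative: guard the queue so vkQueueSubmit is never called
// concurrently on the same VkQueue from different threads.
struct guarded_queue {
    VkQueue    queue;
    std::mutex mtx;

    VkResult submit(uint32_t submit_count, const VkSubmitInfo * submits, VkFence fence) {
        std::lock_guard<std::mutex> lock(mtx);
        return vkQueueSubmit(queue, submit_count, submits, fence);
    }
};
```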
0c25129d30 ggml-cpu : rework weak alias on apple targets (llama/14146)
* ggml-cpu : rework weak alias on apple targets

* fix powerpc detection

* fix ppc detection

* fix powerpc detection on darwin
2025-06-18 12:40:34 +03:00
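For context, the weak-alias pattern referenced in these commits lets an arch-specific symbol fall back to a generic implementation; a GCC/Clang ELF-style sketch (symbol names are illustrative — Mach-O and win32 lack this exact mechanism, which is why Apple targets needed the rework above and win32 a later fix):

```
// Illustrative: ggml_vec_dot resolves to the generic implementation
// unless a stronger arch-specific definition overrides it at link time.
extern "C" void ggml_vec_dot_generic(int n, float * s, const float * x, const float * y) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    *s = sum;
}

extern "C" void ggml_vec_dot(int n, float * s, const float * x, const float * y)
    __attribute__((weak, alias("ggml_vec_dot_generic")));
```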
a433680a2f CUDA/HIP: fix ssm_scan on devices where warp size is not 32 (llama/14196) 2025-06-18 12:40:34 +03:00
aeaed9806f HIP: Replace usage of deprecated preprocessor macro __AMDGCN_WAVEFRONT_SIZE__ (llama/14183) 2025-06-18 12:40:34 +03:00
4ea599afdf sycl: Adding additional cpy dbg print output (llama/14034) 2025-06-18 12:40:34 +03:00
783cf0309f SYCL: Bump oneMath commit (llama/14152)
Update oneMath commit to merged PR https://github.com/uxlfoundation/oneMath/pull/669
which adds SYCL-Graph support for recording CUDA BLAS commands.

With this change the `MUL_MAT` tests now pass on DPC++ CUDA backends with SYCL-Graph
enabled. Prior to this change, an error would be thrown.

```
$ GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0 -o MUL_MAT -p type_a=f16,type_b=f32,m=16,n=1,k=256,bs=\\[1,1\\],nr=\\[2

UR CUDA ERROR:
        Value:           700
        Name:            CUDA_ERROR_ILLEGAL_ADDRESS
        Description:     an illegal memory access was encountered
        Function:        operator()
        Source Location: $HOME/dpcpp/unified-runtime/source/adapters/cuda/queue.cpp:154

Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)
Exception caught at file:$HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:3598, func:operator()
SYCL error: CHECK_TRY_ERROR((stream)->wait()): Meet error in this line code!
  in function ggml_backend_sycl_synchronize at $HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3598
$HOME/llama.cpp/ggml/src/ggml-sycl/../ggml-sycl/common.hpp:118: SYCL error
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
```
2025-06-18 12:40:34 +03:00
0097eaf839 sycl: Remove not needed copy f16->f32 for dnnl mul mat (llama/14125) 2025-06-18 12:40:34 +03:00
a96a880f7b cmake : handle whitespaces in path during metal build (llama/14126)
* cmake : handle whitespaces in path during metal build

ggml-ci

* cont : proper fix

ggml-ci

---------

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2025-06-18 12:40:34 +03:00
26c16ad6bd Implement GGML_CPU_ALL_VARIANTS for ARM (llama/14080)
* ggml-cpu: Factor out feature detection build from x86

* ggml-cpu: Add ARM feature detection and scoring

This is analogous to cpu-feats-x86.cpp. However, to detect compile-time
activation of features, we rely on GGML_USE_<FEAT> which need to be set
in cmake, instead of GGML_<FEAT> that users would set for x86.

This is because on ARM, users specify features with GGML_CPU_ARM_ARCH,
rather than with individual flags.

* ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for ARM

Like x86; however, to pass arch flags around within cmake, we use
GGML_INTERNAL_<FEAT>, as we don't have GGML_<FEAT>.

Some features are optional, so we may need to build multiple backends
per arch version (armv8.2_1, armv8.2_2, ...), and let the scoring
function sort out which one can be used.

* ggml-cpu: Limit ARM GGML_CPU_ALL_VARIANTS to Linux for now

The other platforms will need their own specific variants.

This also fixes a bug where the variant-building branch was always
executed as the else-branch of GGML_NATIVE=OFF. The branch is moved to
an elseif-branch, which restores the previous behavior.
2025-06-18 12:40:34 +03:00
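A hedged sketch of the detection-and-scoring idea (illustrative; the real cpu-feats code and feature bits differ): each prebuilt variant records the features it was compiled with, and at load time the highest-scoring variant whose requirements the CPU satisfies wins:

```
#include <cstdint>

struct cpu_variant {
    const char * name;      // e.g. "armv8.2_1" (illustrative)
    uint64_t     required;  // bitmask of features the variant was built with
};

// Returns -1 if the CPU lacks a required feature; otherwise scores the
// variant by how specialized it is, so the best supported one is chosen.
static int score_variant(const cpu_variant & v, uint64_t cpu_features) {
    if ((v.required & cpu_features) != v.required) {
        return -1;
    }
    return __builtin_popcountll(v.required);
}
```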
40d0d47cf1 vulkan: Better thread-safety for command pools/buffers (llama/14116)
This change moves the command pool/buffer tracking into a vk_command_pool
structure. There are two instances per context (for compute+transfer) and
two instances per device for operations that don't go through a context.
This should prevent separate contexts from stomping on each other.
2025-06-18 12:40:34 +03:00
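The shape of the described bookkeeping, roughly (field names are illustrative):

```
#include <vector>
#include <vulkan/vulkan.h>

// Illustrative: command buffers are tracked alongside the pool that owns
// them, so separate contexts (or context-less device operations) never
// stomp on a shared pool.
struct vk_command_pool {
    VkCommandPool                pool;
    std::vector<VkCommandBuffer> buffers;
};
```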
40c6525517 vulkan: Track descriptor pools/sets per-context (llama/14109)
Use the same descriptor set layout for all pipelines (MAX_PARAMETER_COUNT == 8)
and move it to the vk_device. Move all the descriptor pool and set tracking to
the context - none of it is specific to pipelines anymore. The context has a single
vector of pools, a single vector of sets, one counter to track requests, and one
counter to track use.
2025-06-18 12:40:34 +03:00
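Roughly, the per-context state described above (illustrative sketch):

```
#include <cstdint>
#include <vector>
#include <vulkan/vulkan.h>

// Illustrative: with one shared descriptor set layout, the context only
// needs flat pools/sets plus two counters (requested vs. handed out).
struct vk_descriptor_state {
    std::vector<VkDescriptorPool> pools;
    std::vector<VkDescriptorSet>  sets;
    uint32_t sets_requested = 0;
    uint32_t sets_used      = 0;
};
```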
74c68067dc opencl: add mul_mv_id_q4_0_f32_8x_flat (llama/14003) 2025-06-18 12:40:34 +03:00
794bf23994 Vulkan: Don't default to CPU device (like llvmpipe), even if no other device is available, to allow fallback to CPU backend (llama/14099) 2025-06-18 12:40:34 +03:00
26dcc196c7 rpc : nicer error messages for RPC server crash (llama/14076) 2025-06-18 12:40:34 +03:00
1b01c0cc4e ggml : remove unused ggml_context_container (ggml/1272)
This commit removes the unused `ggml_context_container` structure from
the ggml library. It looks like the usage of this struct was removed in
Commit 4757fe18d56ec11bf9c07feaca6e9d5b5357e7f4 ("ggml : alloc
ggml_contexts on the heap (whisper/2525)").

The motivation for this change is to improve code clarity/readability.
2025-06-18 12:40:34 +03:00
93d543905e ggml : fix weak alias win32 (#0)
ggml-ci
2025-06-10 12:40:33 +03:00
175e7e4f1a files : remove old sources (part 2) 2025-06-10 12:40:33 +03:00
38347a7dda files : remove old sources 2025-06-10 12:40:33 +03:00
7a675807a2 metal : use less stack memory in FA kernel (llama/14088)
* metal : use less stack memory in FA kernel

ggml-ci

* cont : fix BF16 variant
2025-06-10 12:40:33 +03:00
8cbc889f85 ggml-cpu : split arch-specific implementations (llama/13892)
* move ggml-cpu-aarch64 to repack

* split quantize_row_q8_0/1

* split helper functions

* split ggml_vec_dot_q4_0_q8_0

* split ggml_vec_dot_q4_1_q8_1

* split ggml_vec_dot_q5_0_q8_0

* split ggml_vec_dot_q5_1_q8_1

* split ggml_vec_dot_q8_0_q8_0

* split ggml_vec_dot_tq1_0_q8_K

* split ggml_vec_dot_tq2_0_q8_K

* split ggml_vec_dot_q2_K_q8_K

* split ggml_vec_dot_q3_K_q8_K

* split ggml_vec_dot_q4_K_q8_K

* split ggml_vec_dot_q5_K_q8_K

* split ggml_vec_dot_q6_K_q8_K

* split ggml_vec_dot_iq2_xxs_q8_K

* split ggml_vec_dot_iq2_xs_q8_K

* split ggml_vec_dot_iq2_s_q8_K

* split ggml_vec_dot_iq3_xxs_q8_K

* split ggml_vec_dot_iq3_s_q8_K

* split ggml_vec_dot_iq1_s_q8_K

* split ggml_vec_dot_iq1_m_q8_K

* split ggml_vec_dot_iq4_nl_q8_0

* split ggml_vec_dot_iq4_xs_q8_K

* fix typos

* fix missing prototypes

* rename ggml-cpu-quants.c

* rename ggml-cpu-traits

* rename arm folder

* move cpu-feats-x86.cpp

* rename ggml-cpu-hbm

* update arm detection macro in quants.c

* move iq quant tables

* split ggml_quantize_mat_q8_0/K

* split ggml_gemv_*

* split ggml_gemm_*

* rename namespace aarch64 to repack

* use weak aliases to replace test macros

* rename GGML_CPU_AARCH64 to GGML_CPU_REPACK

* rename more aarch64 to repack

* clean up rebase leftover

* fix compilation errors

* remove trailing spaces

* try to fix clang compilation errors

* try to fix clang compilation errors again

* try to fix clang compilation errors, 3rd attempt

* try to fix clang compilation errors, 4th attempt

* try to fix clang compilation errors, 5th attempt

* try to fix clang compilation errors, 6th attempt

* try to fix clang compilation errors, 7th attempt

* try to fix clang compilation errors, 8th attempt

* try to fix clang compilation errors, 9th attempt

* more cleanup

* fix compilation errors

* fix apple targets

* fix a typo in arm version of ggml_vec_dot_q4_K_q8_K

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-10 12:40:33 +03:00
e16a84cd95 cuda : fix device sync on buffer clear (llama/14033) 2025-06-10 12:40:33 +03:00
26282282fa CANN: Simplify the environment variable setting (#13104)
* Simplify the environment variable setting to specify the memory pool type.

* Adjust the GGML_CANN_ASYNC_MODE setting to accept yes, enable, 1, or on (case-insensitive) as valid options.

* update

* fix CI

* update

* delete whitespace

* fix according to review

* update CANN.md

* update CANN.md
2025-06-10 12:40:33 +03:00
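A minimal sketch of case-insensitive flag parsing as described (assuming a helper like this; the actual CANN code may differ):

```
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

// Treat "yes", "enable", "1" and "on" (any case) as enabled.
static bool env_flag_enabled(const char * name) {
    const char * val = std::getenv(name);
    if (val == nullptr) {
        return false;
    }
    std::string s(val);
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return (char) std::tolower(c); });
    return s == "yes" || s == "enable" || s == "1" || s == "on";
}

// usage (illustrative): bool async = env_flag_enabled("GGML_CANN_ASYNC_MODE");
```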
4737a8c780 sycl: Add reorder to Q6_K mmvq implementation (llama/13885)
* Add Reorder to Q6_K mmvq implementation

* Address PR comments: clean up comments

* Remove unused parameter after refactoring q4_k

* Adding inline to function and removing unnecessary reference to int

---------

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
2025-06-10 12:40:33 +03:00
8a70f4d18b cuda : fix buffer type check with integrated GPUs (llama/14069) 2025-06-10 12:40:33 +03:00
489dc158a6 SYCL: Implement few same quantized type copy kernels (llama/13739)
* SYCL: Implement few same quantized type copy kernels

* Use memcpy for copying contiguous tensors

ggml-ci

* feat(sycl): add contiguous tensor copy support and device checks

Adds a memcpy path for contiguous tensors of the same type to optimize data transfer. Updates device support checks to recognize contiguous tensor operations, improving compatibility and performance.

* refactor: replace specific block copy functions with template

The changes replace multiple redundant block copy functions (e.g., cpy_block_q8_0_q8_0, cpy_block_q5_0_q5_0) with a single templated function cpy_blck_q_q. This reduces code duplication by using a generic template that works for any block type, improving maintainability while preserving the same functionality. The template is instantiated with specific block types (e.g., block_q8_0) where needed.

* Exclude BF16 support for COPY tensors for now
ggml-ci

* perf: adjust SYCL copy kernel block sizes for efficiency

Use ceil_div to ensure full element coverage and update nd_range parameters to better align with SYCL block sizes, improving parallelism and device utilization in copy operations.
2025-06-10 12:40:33 +03:00
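Two of the ideas above in sketch form (illustrative, not the actual SYCL kernels): a single template replaces the per-type same-type block copies, and ceil_div sizes the launch so the last partial block is still covered:

```
#include <cstddef>
#include <cstring>

// Same quantized type on both sides: the copy is a bitwise block copy,
// so one template covers block_q8_0, block_q5_0, and the rest.
template <typename block_t>
static void cpy_blck_q_q(const void * src, void * dst) {
    std::memcpy(dst, src, sizeof(block_t));
}

// Round up so n elements are fully covered by blocks of size d.
static size_t ceil_div(size_t n, size_t d) {
    return (n + d - 1) / d;
}
```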
f0f5a9f7fb vulkan: Enable VK_KHR_cooperative_matrix extension for Intel Xe2 GPUs (llama/14001)
* allowing B580 and U9-288V

* experimenting code to detect Xe2

* allowing coopmat only for Xe2 GPUs

* fixed comment wording

* fixed comment wording

* removed unnecessary driver check
2025-06-10 12:40:33 +03:00
13a03c5d33 llama : allow using mmap without PrefetchVirtualMemory, apply GGML_WIN_VER to llama.cpp sources (llama/14013) 2025-06-10 12:40:33 +03:00
6dd91d4f7e vulkan: automatically deduce size of push constants (llama/13936) 2025-06-10 12:40:33 +03:00
5171b24f70 ggml-vulkan: adds support for op CONV_TRANSPOSE_1D (llama/13813)
* ggml-vulkan: adds op CONV_TRANSPOSE_1D

* test-backend-ops: adds more sophisticated tests for CONV_TRANSPOSE_1D

* Missing barrier added to shader.
Number of additional tests reduced to 108.

* Fixes typo in variable name.

* Removes extra whitespaces.

* Adds int64->int32 casts to prevent possible warnings.

* Problem size reduced in tests to pass tests with llvmpipe.

* supports_op condition moved from unintended position
2025-06-10 12:40:33 +03:00
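For reference, a minimal single-channel sketch of what CONV_TRANSPOSE_1D computes (the ggml op handles multiple channels, and the shader runs this in parallel):

```
#include <vector>

// Transposed 1-D convolution with stride s: each input sample scatters
// kernel-weighted contributions into the output of length (n-1)*s + kn.
// Assumes non-empty input and kernel.
static std::vector<float> conv_transpose_1d(const std::vector<float> & x,
                                            const std::vector<float> & k,
                                            int s) {
    const int n  = (int) x.size();
    const int kn = (int) k.size();
    std::vector<float> y((size_t) ((n - 1) * s + kn), 0.0f);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < kn; ++j) {
            y[(size_t) (i * s + j)] += x[i] * k[j];
        }
    }
    return y;
}
```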
23e2fe0682 releases : use dl backend for linux release, remove arm64 linux release (llama/13996) 2025-06-10 12:40:33 +03:00
7f4d110f53 CUDA: fix FTZ in FA for Gemma 3 (llama/13991) 2025-06-10 12:40:33 +03:00
ee0ef39fee vulkan: fix warnings in perf logger querypool code (llama/13937) 2025-06-10 12:40:33 +03:00
62791ba2e6 opencl: add backend_synchronize (llama/13939)
* This is not needed in normal use, where the result is read
  using `tensor_get`, but it allows the perf mode of `test-backend-ops`
  to properly measure performance.
2025-06-10 12:40:33 +03:00
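A hedged guess at the shape of such a hook (illustrative; `clFinish` is the standard OpenCL call to block until queued work completes):

```
#include <CL/cl.h>

// Illustrative: a synchronize hook for an OpenCL backend blocks until all
// commands queued so far have completed, so timed runs measure finished work.
static void backend_synchronize(cl_command_queue queue) {
    clFinish(queue);
}
```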