Commit Graph

212 Commits

Georgi Gerganov
0e7798677a ggml : add SSM Metal kernels (llama/8546)
* ggml : add ggml_ssm_conv metal impl

* ggml : add ssm_scan metal impl

ggml-ci
2024-08-28 13:22:20 +03:00
slaren
58a36d2e3b metal : gemma2 flash attention support (llama/9159) 2024-08-28 13:22:20 +03:00
Johannes Gäßler
24d8534bd8 CPU/CUDA: Gemma 2 FlashAttention support (llama/8542)
* CPU/CUDA: Gemma 2 FlashAttention support

* apply logit_softcap to scale in kernel

* disable logit softcapping tests on Metal

* remove metal check
2024-08-28 13:22:20 +03:00
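For context, Gemma 2 soft-caps its attention logits, and the "apply logit_softcap to scale in kernel" bullet refers to folding that cap into the FlashAttention kernel. A minimal sketch of the softcapping math, assuming the commonly reported cap of 50.0 for attention logits (illustrative helper, not the actual kernel code):

```cpp
#include <cmath>

// Gemma-2-style logit softcapping: smoothly squash x into (-cap, +cap).
// The cap value (50.0 for attention logits) is an assumption taken from the
// published Gemma 2 configuration, not from this commit.
inline float softcap(float x, float cap = 50.0f) {
    return cap * std::tanh(x / cap);
}
```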
Akarshan Biswas
9b16ddd3a5 Add a space to suppress a cmake warning (llama/9133) 2024-08-28 13:22:20 +03:00
luoyu-intel
32f88af17b Add oneDNN primitive support (llama/9091)
* add onednn

* add sycl_f16

* add dnnl stream

* add engine map

* use dnnl for intel only

* use fp16fp16fp16

* update doc
2024-08-28 13:22:20 +03:00
compilade
9bf7250bf9 llama : simplify Mamba with advanced batch splits (llama/8526)
* llama : advanced batch splits

This includes equal-sequence-length batch splits which are useful
to simplify recurrent model operators.

* llama : always make recurrent state slots contiguous

* ggml : simplify mamba operators

* llama : fix integer signedness mixing

* llama : logits_all has priority over batch->logits

Otherwise, the server embeddings tests failed.
This was likely an existing problem but was only detected here
because of an additional assertion.

* llama : apply suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : fix t5 segfault

* llama : fix Mamba session save and restore

* llama : minor cosmetic changes

* llama : rename llama_reorder_outputs to llama_output_reorder

Also move it closer to llama_output_reserve.

* llama : fix pooled embeddings when using batches with equal_seqs

* minor : add struct members for clarity

ggml-ci

* llama : fix T5 segfault again

* llama : fix Mamba pooled embeddings with multiple sequences

Until the pooled embeddings are refactored to allow splitting
across ubatches for causal embeddings,
recurrent models can only process a single sequence per ubatch
when calculating pooled embeddings.

* llama : add llama_model_is_recurrent to simplify figuring that out

This will make it easier to more cleanly support RWKV-v6 and Mamba-2.

* llama : fix simple splits when the batch contains embeddings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-28 13:22:20 +03:00
Meng, Hengyu
17e49d3ab2 fallback mmvq (llama/9088)
* fallback mmvq to mul_mat

* mmvq in cuda path

* Update ggml/src/ggml-sycl.cpp

Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>

---------

Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>
2024-08-28 13:22:20 +03:00
zhentaoyu
58b725282a Fix SYCL im2col and convert Overflow with Large Dims (llama/9052)
* sycl: fix im2col overflow and sync with cuda

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: fix convert overflow

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: fix convert and dequantize

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: fix ib in dmmv

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl:refine convert

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: move downsample global_range into common

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* test: add im2col and convert test cases

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* test: make new cases only in sycl

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* test: comment new test_cases for only local testing

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

---------

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
2024-08-28 13:22:20 +03:00
Radoslav Gerganov
7e59afa1e0 rpc : print error message when failed to connect endpoint (llama/9042) 2024-08-28 13:22:20 +03:00
Radoslav Gerganov
5ac022140e rpc : prevent crashes on invalid input (llama/9040)
Add more checks to prevent the RPC server from crashing when invalid input
is received from the client
2024-08-28 13:22:20 +03:00
Nico Bosshard
0eaa67280c ggml : dynamic ggml_sched_max_splits based on graph_size (llama/9047)
* ggml : Dynamic ggml_sched_max_splits based on graph_size

* Fixed and readded debug code for causes
2024-08-28 13:22:20 +03:00
Georgi Gerganov
5a62fdb735 cmake : remove unused option GGML_CURL (llama/9011) 2024-08-28 13:22:20 +03:00
Daniel Bevenius
60098d6204 ggml : move rope type enum to ggml.h (llama/8949)
* ggml : move rope type enum to ggml.h

This commit moves the `llama_rope_type` enum from `llama.h` to
`ggml.h` and changes its name to `ggml_rope_type`.

The motivation for this change is to address the TODO in `llama.h` and
use the enum in ggml.

Note: This commit does not change the `mode` parameter to be of type
`enum ggml_rope_type`. The name `mode` and its usage suggest that it
might be more generic and possibly used as a bit field for multiple
flags. Further investigation/discussion may be needed to determine
if `mode` should be restricted to RoPE types.

* squash! ggml : move rope type enum to ggml.h

This commit removes GGML_ROPE_TYPE_NONE and GGML_ROPE_TYPE_GLM from
ggml.h, and adds them back to the llama_rope_type enum.

I've kept the assert for GGML_ROPE_TYPE_GLM as I'm not sure if it is
safe to remove it yet.

* squash! ggml : move rope type enum to ggml.h

This commit removes the enum ggml_rope_type from ggml.h and replaces it
with a define (GGML_ROPE_TYPE_NEOX). This define is used in the code to
check if the mode is set to GPT-NeoX. Also the enum llama_rope_type has
been updated to reflect this change.

* squash! ggml : move rope type enum to ggml.h

This commit contains a suggestion to enable the GGML_ROPE_TYPE_NEOX
macro/define to be passed to the shader compiler.

* squash! ggml : move rope type enum to ggml.h

This commit fixes the editorconfig-checker warnings.

* squash! ggml : move rope type enum to ggml.h

Update comment for ggml_rope function.

* Revert "squash! ggml : move rope type enum to ggml.h"

This reverts commit 6261222bd0dc0efd51f0fb0435ad3f16a5b52fd6.

* squash! ggml : move rope type enum to ggml.h

Add GGML_ROPE_TYPE_NEOX to rope_common.comp.

* remove extra line

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-08-28 13:22:20 +03:00
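After this change, backends test the rope `mode` bit field against a define instead of an enum. A rough sketch of the resulting check, assuming the value 2 used in ggml.h (sketch only, not the exact ggml source):

```cpp
// GGML_ROPE_TYPE_NEOX is a bit flag tested against the rope `mode` parameter;
// the authoritative definition lives in ggml.h.
#define GGML_ROPE_TYPE_NEOX 2

static bool rope_is_neox(int mode) {
    return (mode & GGML_ROPE_TYPE_NEOX) != 0;  // GPT-NeoX style rotation
}
```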
DavidKorczynski
317293e6a7 ggml: fix div-by-zero (llama/9003)
Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70724

In order to access the above bug you need to login using one of the
emails in
https://github.com/google/oss-fuzz/blob/master/projects/llamacpp/project.yaml#L3-L5

Signed-off-by: David Korczynski <david@adalogics.com>
2024-08-28 13:22:20 +03:00
Markus Tavenrath
488a966c07 Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. (llama/8943)
* Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.

- Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
- ggml_vk_sync_buffer introduces a full pipeline sync, which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding barriers only for shader reads/writes and transfers seems to be sufficient, judging by the code, which either launches compute kernels or copies tensors.

* Fix small typo

---------

Co-authored-by: 0cc4m <picard12@live.de>
2024-08-28 13:22:20 +03:00
Johannes Gäßler
8954769aa2 feat: ref. cross entropy, add CUDA, fix grad test (ggml/929) 2024-08-28 13:22:20 +03:00
Johannes Gäßler
df06468d9e ggml: remove bad assert (ggml/928) 2024-08-28 13:22:20 +03:00
Johannes Gäßler
1fbd828a5d examples: add MNIST training + missing ops 2024-08-28 13:22:20 +03:00
Georgi Gerganov
9e3c5345cd sync : ggml vulkan (ggml/0)
ggml-ci
2024-08-21 11:07:13 +03:00
Radoslav Gerganov
b6c05ce82f yolo : add backend support (ggml/924)
* yolo : add backend support

* metal : add sub and sqrt kernels

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-21 11:07:13 +03:00
Daniel Bevenius
52c80cac00 ggml : fix typo in ggml-quants.c comment (ggml/922) 2024-08-21 11:07:13 +03:00
Ronsor
3643120690 feat: add new sin and cos operators (ggml/919)
* ggml : add sin/cos operators

* ggml-cuda : add sin/cos operators

* ggml : add corresponding tests for sin/cos

* ggml : add backward computation for sin/cos operators

* ggml-vulkan : add sin/cos operators

* ggml-vulkan : add sin/cos shader source

* metal : add sin, cos

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-21 11:07:13 +03:00
Salvatore Mesoraca
993f0df419
ggml : support forward pass broadcasting in ggml_sub (ggml/914)
* ggml: support forward pass broadcasting in ggml_sub

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>

* Use assert instead of GGML_ASSERT in ggml_compute_forward_sub_f32

The check is already performed in ggml_sub_impl

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>

---------

Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
2024-08-12 11:58:49 +03:00
slaren
9b1788483c
metal : fix uninitialized abort_callback (llama/8968) 2024-08-12 11:58:49 +03:00
Georgi Gerganov
ad37d26983
rpc : sanitize tensor data + warnings (llama/0)
Co-authored-by: slaren <slarengh@gmail.com>
2024-08-12 11:58:46 +03:00
Mengqing Cao
81c999fe0a
cann : add Ascend NPU support (#2336)
* enable Ascend NPU in src/whisper.cpp
  * sync test-backend-ops with llama.cpp
2024-08-09 15:21:56 +03:00
hipudding
be88ee1d75 ggml : add CANN backend (llama/0)
ggml-ci
2024-08-09 09:58:16 +03:00
slaren
ee14c02365 ggml-backend : fix async copy from CPU (llama/8897)
* ggml-backend : fix async copy from CPU

* cuda : more reliable async copy, fix stream used when the devices are the same
2024-08-08 22:48:46 +03:00
Ouadie EL FAROUKI
ab39dd34e1 Updated SYCL device filtering (llama/8901)
* Updated device filter to depend on default_selector (fixes non-intel device issues)
* Small related update to example/sycl Readme
2024-08-08 22:48:46 +03:00
Johannes Gäßler
b1348d3530 CUDA/HIP: fix tests/test-backend-ops (llama/8896) 2024-08-08 22:48:46 +03:00
Johannes Gäßler
90641b5cf4 CUDA: fix padding logic for FP16/FP32 (llama/8884) 2024-08-08 22:48:46 +03:00
Molly Sophia
4160b930f1 ggml : add epsilon as a parameter for group_norm (llama/8818)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-08-08 22:48:46 +03:00
Justine Tunney
7a96e661e4 ggml : fix overflows in elu function (llama/8866)
It's helpful to use expm1f(x), because expf(x)-1 will result in overflow
for 25% of single-precision floating point numbers.
2024-08-08 22:48:46 +03:00
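The point of the fix is that the x <= 0 branch of ELU should compute exp(x) - 1 via expm1 rather than as two separate steps. A minimal sketch of the idea (illustrative helper, not the actual ggml vector code):

```cpp
#include <cmath>

// ELU(x) = x            for x > 0
//        = exp(x) - 1   for x <= 0
// Using expm1 evaluates exp(x) - 1 directly, avoiding the problematic
// intermediate expf(x) the commit message describes.
inline float elu_f32(float x) {
    return x > 0.0f ? x : std::expm1(x);
}
```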
jdomke
a902fb4ab2 ggml : reading the runtime sve config of the cpu (llama/8709)
* ggml : reading the runtime sve config of the cpu

* change to one time init to prevent performance drop

* prefix variable to avoid possible conflicts

* revert xxhash fix and add brackets

---------

Co-authored-by: domke <673751-domke@users.noreply.gitlab.com>
2024-08-08 22:48:46 +03:00
Sigbjørn Skjæret
6cb38c3673 Fix conversion of unnormalized BF16->BF16 weights (llama/7843)
* add truncate_bf16

* truncate intermediate fp32 if converting bf16 to bf16

* fix masking in __compute_fp32_to_bf16

* np.int16 no longer used

* missing cast and additional numpy 2.x fix

* ggml-impl : do not flush bf16 subnormals to zero

* ggml : add reference fp32 to bf16 conversion

The fast version is no longer equivalent for all platforms
because of the handling of subnormal values.

* gguf-py : remove flush to zero for bf16 subnormals

* gguf-py : remove float32 truncation to bf16

Rounding achieves the same thing in the cases where this was used.

* missed prototype update in merge

* merge cleanup

---------

Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2024-08-08 22:48:46 +03:00
Ouadie EL FAROUKI
9cf14ebcbc Fixing wrong VDR iq4nl value (llama/8812) 2024-08-08 22:48:46 +03:00
matteo
8e39ee171f ggml-cuda: Adding support for unified memory (llama/8035)
* Adding support for unified memory

* adding again the documentation about unified memory

* refactoring: Moved the unified memory code in the correct location.

* Fixed compilation error when using hipblas

* cleaning up the documentation

* Updating the documentation

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* adding one more case where the PR should not be enabled

---------

Co-authored-by: matteo serva <matteo.serva@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-08-08 22:48:46 +03:00
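For background, CUDA unified (managed) memory gives a single pointer that is valid on both host and device and is migrated on demand, which is what allows oversubscribing GPU memory. A self-contained sketch of the allocation call, not the ggml-cuda integration itself:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Allocate a managed buffer: usable from host code directly, paged to the
// GPU by the driver when kernels touch it. Illustrative snippet only.
int main() {
    void * p = nullptr;
    cudaError_t err = cudaMallocManaged(&p, 1024 * sizeof(float), cudaMemAttachGlobal);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    float * buf = static_cast<float *>(p);
    buf[0] = 1.0f;  // host write, no explicit cudaMemcpy needed
    cudaFree(p);
    return 0;
}
```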
Alex O'Connell
d26250f78c Build: Only include execinfo.h on linux systems that support it (llama/8783)
* Only enable backtrace on GLIBC linux systems

* fix missing file from copy

* use glibc macro instead of defining a custom one
2024-08-08 22:48:46 +03:00
slaren
5218ea21b8 cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X (llama/8800)
* cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X

* update asserts

* only use dmmv for supported types

* add test
2024-08-08 22:48:46 +03:00
l3utterfly
e60be821ce added android implementation of ggml_print_backtrace_symbols (llama/8751)
* added android implementation of ggml_print_backtrace_symbols

* Update ggml/src/ggml.c

Co-authored-by: slaren <slarengh@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: slaren <slarengh@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: slaren <slarengh@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: slaren <slarengh@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-08-08 22:48:46 +03:00
wangshuai09
19708df884 cann: update cmake (llama/8765) 2024-08-08 22:48:46 +03:00
zhentaoyu
3f190addda Add TIMESTEP_EMBEDDING OP (llama/8707)
Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
2024-08-08 22:48:46 +03:00
CarterLi999
b355ee7cfa ggml: bugfix: fix inactive elements being agnostic for risc-v vector (llama/8748)
In this code we want inactive elements to retain the value they previously
held when mask[i] is false, so the undisturbed policy should be used. With the
default agnostic policy of the RVV intrinsics, these values may either be
preserved or be overwritten with 1s.

Co-authored-by: carter.li <carter.li@starfivetech.com>
2024-08-08 22:48:46 +03:00
R0CKSTAR
49ac8872b4 cuda : organize vendor-specific headers into vendors directory (llama/8746)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-08-08 22:48:46 +03:00
Meng, Hengyu
8ef98ae7e3 add conv support (llama/8688) 2024-08-08 22:48:46 +03:00
R0CKSTAR
e471adcfa5 feat: Support Moore Threads GPU (llama/8383)
* Update doc for MUSA

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Add GGML_MUSA in Makefile

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Add GGML_MUSA in CMake

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* CUDA => MUSA

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* MUSA adds support for __vsubss4

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Fix CI build failure

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2024-08-08 22:48:46 +03:00
Borislav Stanimirov
aa816c922c ggml : ignore more msvc warnings (ggml/906) 2024-08-08 22:48:46 +03:00
Georgi Gerganov
b3264eb266 metal : fix struct name (ggml/912)
ggml-ci
2024-08-08 22:48:46 +03:00
Conrad Kramer
eb2eb87a58 metal : add abort callback (ggml/905) 2024-08-08 22:48:46 +03:00
0cc4m
83fcb0e486 vulkan : implement Stable Diffusion operators (ggml/904)
* Fix Vulkan repeat op

* Implement Vulkan concat op

* Delete old Vulkan shader generator

* Implement Vulkan im2col op

* Implement Vulkan unary gelu_quick op

* Implement Vulkan group_norm op

* Implement Vulkan timestep_embedding op

* Implement Vulkan upscale op

* Fix Vulkan vk_context tensor extra index issue

* Fix Vulkan matmul shader parameter bug

* Properly fix Vulkan matmul shader parameter bug

* Add Vulkan ADD f16 + f32 -> f16 operator support

* Implement Vulkan tanh op

* Fix Vulkan group count too large Validation error on non-Nvidia GPUs

* Throw error when too much memory is requested

* Fix another Vulkan group count too large Validation error on non-Nvidia GPUs

* Fix matmul MMQ condition

* Implement Vulkan pad op

* Fix Vulkan crash when tensor is used multiple times in a compute graph

* Add Vulkan CONCAT f16 + f16 -> f16 op

* Add Vulkan LEAKY_RELU op
2024-08-08 22:48:46 +03:00
Daniel Bevenius
f7bb412878 ggml : move c parameter comment to ggml_rope_ext (ggml/901)
This commit moves the comment for the c parameter from ggml_rope to
ggml_rope_ext. The comment is currently incorrect as ggml_rope does not
have a c parameter (freq_factors tensor).

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-08-08 22:48:46 +03:00
Georgi Gerganov
ef6dcf0d0c ggml : resolve sync conflicts (ggml/0)
ggml-ci
2024-08-08 22:48:46 +03:00
Dibakar Gope
525f190917 ggml : add ggml-aarch64 (ggml/0) 2024-08-08 22:48:46 +03:00
slaren
dd916a2852 ggml : reduce hash table reset cost (llama/8698)
* ggml : reduce hash table reset cost

* fix unreachable code warnings after GGML_ASSERT(false)

* GGML_ASSERT(false) -> GGML_ABORT("fatal error")

* GGML_ABORT use format string
2024-08-08 22:48:46 +03:00
DavidKorczynski
0620fe00ec ggml: handle ggml_init failure to fix NULL pointer deref (llama/8692)
`ggml_init` can fail if no unused context is found. In that case, a NULL-pointer deref will happen later in the code during a call to `ggml_set_on_alloc`.

This fixes it by bailing out if no context is found.
2024-08-08 22:48:46 +03:00
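The defensive pattern the fix describes is simply checking the pointer returned by ggml_init before using the context. A minimal sketch, with an arbitrary buffer size:

```cpp
#include "ggml.h"
#include <cstdio>

int main() {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);
    if (ctx == nullptr) {  // bail out instead of dereferencing NULL later
        std::fprintf(stderr, "ggml_init() failed\n");
        return 1;
    }
    // ... build tensors and graphs with ctx ...
    ggml_free(ctx);
    return 0;
}
```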
Chen Xi
31d0a9a14f fix multi-gpu issue on sycl (llama/8554)
---------

Signed-off-by: Chen Xi <xi2chen@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
2024-08-08 22:48:46 +03:00
Georgi Gerganov
c06970dd72 ggml : add and use ggml_cpu_has_llamafile() (llama/8664) 2024-08-08 22:48:46 +03:00
Joe Todd
7598acf525 Re-add erroneously removed -fsycl from GGML_EXTRA_LIBS (llama/8667) 2024-08-08 22:48:46 +03:00
Joe Todd
43ddfce969 sycl : Add support for non-release DPC++ & oneMKL (llama/8644)
* Update cmake to support nvidia hardware & open-source compiler
---------
Signed-off-by: Joe Todd <joe.todd@codeplay.com>
2024-08-08 22:48:46 +03:00
0cc4m
a7e6d2cd9c Vulkan IQ4_NL Support (llama/8613)
* Fix Vulkan matmul tests compile errors

* Add Vulkan IQ4_NL support

* Fix Vulkan DeepSeek-Coder-V2-Lite MoE support
2024-08-08 22:48:46 +03:00
Jeroen Mostert
86506b0c5c Allow all RDNA2 archs to use sdot4 intrinsic (llama/8629)
The check gating the use of `__builtin_amdgcn_sdot4` specifically checks for gfx1030. This causes a severe perf regression for anything gfx103? that's not gfx1030 and not using `HSA_OVERRIDE_GFX_VERSION` (if you've built ROCm to support it). We already have a generic RDNA2 define, let's use it.
2024-08-08 22:48:46 +03:00
luoyu-intel
11182fae34 fix scratch size of softmax (llama/8642) 2024-08-08 22:48:46 +03:00
Mark Zhuang
0bc8bffe1d ggml: fix compile error for RISC-V (llama/8623) 2024-08-08 22:48:46 +03:00
Johannes Gäßler
8c4f30497a CUDA: MMQ code deduplication + iquant support (llama/8495)
* CUDA: MMQ code deduplication + iquant support

* 1 less parallel job for CI build
2024-08-08 22:48:46 +03:00
Georgi Gerganov
b1ee3a8444 gguf : handle null name during init (llama/8587) 2024-08-08 22:48:46 +03:00
slaren
be9a16fd3f ggml : fix quant dot product with odd number of blocks (llama/8549)
* ggml : fix iq4_nl dot product with odd number of blocks

* ggml : fix odd blocks for ARM_NEON (llama/8556)

* ggml : fix iq4_nl dot product with odd number of blocks

* ggml : fix q4_1

* ggml : fix q5_0

* ggml : fix q5_1

* ggml : fix iq4_nl metal

ggml-ci

* ggml : fix q4_0

* ggml : fix q8_0

ggml-ci

* ggml : remove special Q4_0 code for first 2 blocks

* ggml : fix sumf redefinition

---------

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-08-08 22:48:46 +03:00
Clint Herron
f4d9a95b0f ggml : add friendlier error message to fopen errors (llama/8575)
* Add additional error information when model files fail to load.

* Adding additional error information to most instances of fopen.
2024-08-08 22:48:46 +03:00
Johannes Gäßler
a8ab3abe09 CUDA: fix partial offloading for ne0 % 256 != 0 (llama/8572) 2024-08-08 22:48:46 +03:00
65a
fb6a835938 cmake : install all ggml public headers (llama/8480)
Co-authored-by: 65a <65a@65a.invalid>
2024-08-08 22:48:46 +03:00
hipudding
8923bb4292 Add Ascend NPU backend (llama/6035)
* [CANN] Add Ascend NPU backend

Ascend is a full-stack AI computing infrastructure for industry
applications and services based on Huawei Ascend processors and
software.

CANN (Compute Architecture of Neural Networks), developed by
Huawei, is a heterogeneous computing architecture for AI.

Co-authored-by: wangshuai09 <391746016@qq.com>

* delete trailing whitespaces

* Modify the code based on review comment

* Rename LLAMA_CANN to GGML_CANN

* Make ggml-common.h private

* add ggml_cann prefix for acl funcs

* Add logging for CANN backend

* Delete Trailing whitespace

---------

Co-authored-by: wangshuai09 <391746016@qq.com>
2024-08-08 22:48:46 +03:00
Johannes Gäßler
fcba6aa352 make/cmake: add missing force MMQ/cuBLAS for HIP (llama/8515) 2024-08-08 22:48:46 +03:00
Xuan Son Nguyen
8807fe608b Refactor lora adapter support (llama/8332)
* lora: load to device buft

* add patch tensor function

* correct tensor patch

* llama_lora_adapter_apply

* correct ggml_backend_tensor_copy

* add llm_build_mm

* fix auto merge

* update based on review comments

* add convert script

* no more transpose A

* add f16 convert

* add metadata check

* add sanity check

* fix ftype

* add requirements

* fix requirements

* fix outfile

* conversion: only allow selected models

* fix types

* cuda : do not use dmmv if the tensor does not have enough cols

* llama : lora fixes

* do not disable mmap with lora

Co-authored-by: slaren <slarengh@gmail.com>

* llm_build_lora_mm_id

* convert_lora : MoE LoRA conversion support

* convert_lora : prefer safetensors, similarly to convert_hf

* convert_hf : simplify modify_tensors for InternLM2

* convert_lora : lazy conversion

* llama : load and use alpha from LoRA adapters

* llama : use llm_build_lora_mm in most model graphs

* auto scale

* Revert "auto scale"

This reverts commit 42415a4874e0f963e4aca6796ea5dfb97cd17464.

* remove redundant params

* Apply suggestions from code review

Co-authored-by: slaren <slarengh@gmail.com>

* change kv metadata

* move add_type to __init__

* convert_hf : move add_type to main()

* convert_lora : use the GGUFWriter from Model instead of overwriting it

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2024-08-08 22:48:46 +03:00
Meng, Hengyu
3e94c7a81d add concat through dim 1/2 (llama/8483)
* add concat through dim 1/2
2024-08-08 22:48:46 +03:00
0cc4m
77af3254e1 Vulkan MMQ Fix (llama/8479)
* Fix incoherence by adding missing LOAD_VEC_A parameter

* Fix Vulkan op result checker build error
2024-08-08 22:48:46 +03:00
bandoti
d4b3cffec4 vulkan : cmake integration (llama/8119)
* Add Vulkan to CMake pkg

* Add Sycl to CMake pkg

* Add OpenMP to CMake pkg

* Split generated shader file into separate translation unit

* Add CMake target for Vulkan shaders

* Update README.md

* Add make target for Vulkan shaders

* Use pkg-config to locate vulkan library

* Add vulkan SDK dep to ubuntu-22-cmake-vulkan workflow

* Clean up tabs

* Move sudo to apt-key invocation

* Forward GGML_EXTRA_LIBS to CMake config pkg

* Update vulkan obj file paths

* Add shaderc to nix pkg

* Add python3 to Vulkan nix build

* Link against ggml in cmake pkg

* Remove Python dependency from Vulkan build

* code review changes

* Remove trailing newline

* Add cflags from pkg-config to fix w64devkit build

* Update README.md

* Remove trailing whitespace

* Update README.md

* Remove trailing whitespace

* Fix doc heading

* Make glslc required Vulkan component

* remove clblast from nix pkg
2024-08-08 22:48:46 +03:00
Georgi Gerganov
b852a4c5ca metal : template-ify some of the kernels (llama/8447)
ggml-ci
2024-08-08 22:48:46 +03:00
Georgi Gerganov
2157abaab4 ggml : minor naming changes (llama/8433)
* ggml : minor naming changes

ggml-ci

* ggml : use PRId64 [no ci]

* ggml : revert FA K/Q names
2024-08-08 22:48:46 +03:00
Chen Xi
68d609a12c fix the mul_mat_id ut issues (llama/8427)
* fix part of mul_mat_id

* skip the bfloat 16 sycl ut

Signed-off-by: Chen Xi <xi2chen@intel.com>

---------

Signed-off-by: Chen Xi <xi2chen@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Chen Xi <xi2chen@intel.com>
2024-08-08 22:48:46 +03:00
Nicholai Tukanov
5a8ae474f0 ggml : add NVPL BLAS support (ggml/8329) (llama/8425)
* ggml : add NVPL BLAS support

* ggml : replace `<BLASLIB>_ENABLE_CBLAS` with `GGML_BLAS_USE_<BLASLIB>`

---------

Co-authored-by: ntukanov <ntukanov@nvidia.com>
2024-08-08 22:48:46 +03:00
Daniel Bevenius
84493d7f3e cuda : suppress 'noreturn' warn in no_device_code (llama/8414)
* cuda : suppress 'noreturn' warn in no_device_code

This commit adds a while(true) loop to the no_device_code function in
common.cuh. This is done to suppress the warning:

```console
/src/ggml-cuda/template-instances/../common.cuh:346:1: warning:
function declared 'noreturn' should not return [-Winvalid-noreturn]
  346 | }
      | ^
```

The motivation for this is to reduce the number of warnings when
compiling with GGML_HIPBLAS=ON.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* squash! cuda : suppress 'noreturn' warn in no_device_code

Update __trap macro instead of using a while loop to suppress the
warning.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-08-08 22:48:46 +03:00
Johannes Gäßler
15d71189e9 CUDA: optimize and refactor MMQ (llama/8416)
* CUDA: optimize and refactor MMQ

* explicit q8_1 memory layouts, add documentation
2024-08-08 22:48:46 +03:00
AidanBeltonS
37e962580f Use multi_ptr to clean up deprecated warnings (llama/8256) 2024-08-08 22:48:46 +03:00
Georgi Gerganov
db0ea7a2f2 ggml : move sgemm sources to llamafile subfolder (llama/8394)
ggml-ci
2024-08-08 22:48:46 +03:00
Dibakar Gope
5498b0e6c0 ggml : add AArch64 optimized GEMV and GEMM Q4 kernels (llama/5780)
* Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add copyright claim only to ggml-aarch64.cpp and ggml-aarch64.h files

* Arm AArch64: minor code refactoring for rebase

* Arm AArch64: minor code refactoring for resolving a build issue with cmake

* Arm AArch64: minor code refactoring to split the Q4_0_AARC64 type into three separate types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code change for resolving a build issue with server-windows

* retrigger checks

* Arm AArch64: minor code changes for rebase

* Arm AArch64: minor changes to skip the pr#7433 vec_dot code for arm cpus with SVE VL not equal to 256 bits

* Arm AArch64: remove stale LLAMA_QKK_64 from CMakeLists.txt and delete build.zig

* Arm AArch64: add reference scalar gemm and gemv, and avoid dynamic memory allocations during quantization for Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: add multithreaded quantization support for the new types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code refactoring

* Arm AArch64: simplify logic for calling gemm and gemv functions in ggml_compute_forward_mul_mat

* Arm AArch64: minimize changes in ggml_compute_forward_mul_mat

* Arm AArch64: minor code refactoring, and add reference scalar code to quantize routines for new quant types

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* rebase on the latest master commit 3fd62a6 and adapt to the new directory structure

* Arm AArch64: remove a redundant comment

* Arm AArch64: add pragma in ggml-aarch64.c to turn -Woverlength-strings warning off

* Arm AArch64: use __aarch64__ check to guard 64-bit neon kernels

* Arm AArch64: update docs/build.md README to include compile-time flags for building the Q4_0_4_4 quant type
2024-08-08 22:48:46 +03:00
Alberto Cabrera Pérez
2af4a52c39 sycl : Reenabled mmvq path for the SYCL Nvidia Backend (llama/8372)
* SYCL : Reenabled mmvq path for the SYCL Nvidia Backend

* Reduced verbosity of comment
2024-08-08 22:48:46 +03:00
Alberto Cabrera Pérez
eee2fe882e sycl : fix powf call in device code (llama/8368) 2024-08-08 22:48:46 +03:00
Mahesh Madhav
0d1a11e5e2 ggml : loop tiling optimizations for scalar path (ggml/898)
Apply a loop tiling technique to the generic path, which provides
performance upside for ISAs with enough registers to take advantage
of it. Also helps the compiler optimize this path.
2024-08-08 22:48:46 +03:00
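Loop tiling in this sense means unrolling the inner loop over a small fixed block and keeping independent partial sums, so ISAs with enough registers can overlap the multiply-adds. A generic dot-product sketch of the technique, not the actual ggml kernel:

```cpp
// Four independent accumulators ("register tiles") let the compiler keep the
// partial sums in registers and pipeline the FMAs. Illustrative only.
float dot_tiled(const float * x, const float * y, int n) {
    float sum[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        sum[0] += x[i + 0] * y[i + 0];
        sum[1] += x[i + 1] * y[i + 1];
        sum[2] += x[i + 2] * y[i + 2];
        sum[3] += x[i + 3] * y[i + 3];
    }
    float total = (sum[0] + sum[1]) + (sum[2] + sum[3]);
    for (; i < n; ++i) {
        total += x[i] * y[i];  // scalar remainder
    }
    return total;
}
```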
Ivan Filipov
b2ead7d6f4 ggml: add support for float16 input tensors in pooling operations (ggml/895)
* Add support for float16 tensors in 1d pooling operations

* Add support for float16 input tensors in 2d pooling operations

* code cleanup

remove unnecessary casting during srow ptr initialization

---------

Co-authored-by: vanaka11 <vanaka1189@gmail.com>
2024-08-08 22:48:46 +03:00
Tony Wasserka
8da6fd4dff vulkan : initialize vk_buffer_struct members to VK_NULL_HANDLE (ggml/893)
This prevents invalid frees when destroying a partially initialized
vk_buffer_struct. For example, this could happen in ggml_vk_create_buffer
when running out of device memory.

Co-authored-by: Tony Wasserka <neobrain@users.noreply.github.com>
2024-08-08 22:48:46 +03:00
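The underlying C++ pattern is to give every Vulkan handle a well-defined initial value so the destroy path can safely skip members that were never created. A rough sketch with illustrative member names, not the exact ggml-vulkan struct:

```cpp
#include <vulkan/vulkan.h>

// Handles default to VK_NULL_HANDLE, so tearing down a partially initialized
// buffer (e.g. after running out of device memory) never frees a garbage handle.
struct vk_buffer_sketch {
    VkBuffer       buffer = VK_NULL_HANDLE;
    VkDeviceMemory memory = VK_NULL_HANDLE;
    size_t         size   = 0;

    void destroy(VkDevice device) {
        if (buffer != VK_NULL_HANDLE) vkDestroyBuffer(device, buffer, nullptr);
        if (memory != VK_NULL_HANDLE) vkFreeMemory(device, memory, nullptr);
        buffer = VK_NULL_HANDLE;
        memory = VK_NULL_HANDLE;
    }
};
```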
Borislav Stanimirov
ab8ec9e940 cmake : only enable GGML_NATIVE and x86 flags if not crosscompiling (ggml/885) 2024-08-08 22:48:46 +03:00
Matt Stephenson
f68298ce06
whisper : use vulkan as gpu backend when available (#2302)
* ggml: use vulkan as gpu backend when available

Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>

* whisper: enable using vk as default buffer type

Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>

---------

Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>
2024-07-16 10:21:09 +03:00
Georgi Gerganov
49868aa851 ggml : sync sycl (skip) (#0) 2024-07-08 14:53:55 +03:00
Daniel Bevenius
95f2a191c0 ggml : remove unnecessary UNUSED macro call (ggml/880)
This commit removes an UNUSED macro call that is not needed as the
variable n0 is used in the code and will not produce a warning.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-07-08 14:53:55 +03:00
Natsu
00422ec3cf cmake : add GGML_BUILD and GGML_SHARED macro definitions (llama/8281) 2024-07-08 14:53:55 +03:00
Ouadie EL FAROUKI
c5b05321e9 Enabled more data types for oneMKL gemm_batch (llama/8236) 2024-07-08 14:53:55 +03:00
Johannes Gäßler
5dc636a65a CUDA: MMQ support for iq4_nl, iq4_xs (llama/8278) 2024-07-08 14:53:55 +03:00
Daniele
73703a144f CUDA: revert part of the RDNA1 optimizations (llama/8309)
The change to the launch_bounds was causing a small performance drop of 25 t/s in the perplexity benchmark
2024-07-08 14:53:55 +03:00
Johannes Gäßler
e89fdceec2 CUDA: fix MMQ stream-k rounding if ne00 % 128 != 0 (llama/8311) 2024-07-08 14:53:55 +03:00
luoyu-intel
29a2739d27 Fix WARP_SIZE=16 bug of Intel GPU (llama/8266)
* fix group_norm ut

* split softmax

* fix softmax

* add concat support condition

* revert debug code

* move QK_WARP_SIZE to presets.hpp
2024-07-08 14:53:55 +03:00
Neo Zhang Jianyu
ee6d17f6b4 replace get_work_group_size() with a local cache for performance (llama/8286)
Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>
2024-07-08 14:53:55 +03:00
Daniele
95e90823d9 Define and optimize RDNA1 (llama/8085) 2024-07-08 14:53:55 +03:00
Judd
005cc45df3 fix typo (llama/8267)
Co-authored-by: Judd <foldl@boxvest.com>
2024-07-08 14:53:55 +03:00
Clint Herron
c2c60dc9ba Removes multiple newlines at the end of files that is breaking the editorconfig step of CI. (llama/8258) 2024-07-08 14:53:55 +03:00
slaren
4af3194b7c cuda : update supports_op for matrix multiplication (llama/8245) 2024-07-08 14:53:55 +03:00
luoyu-intel
4a2ba1a065 Fix win build conflict of math library (llama/8230)
* fix win build conflict of math library

* fix the condition: !(win32 & SYCL)

* revert warp_size=16
2024-07-08 14:53:55 +03:00
luoyu-intel
f096cc6807 Fix the sub group size of Intel (llama/8106)
* use warp_size macro for all sycl kernels

* fix mask of permute_sub_group_by_xor

* fix rms_norm with correct warp number

* fix rms_norm_f32/group_norm_f32

* move norm to norm.cpp file

* fix quantize bug

* fix mmvq's batch size
2024-07-08 14:53:55 +03:00
Johannes Gäßler
e4bc83ab47 CUDA: refactor and optimize IQ MMVQ (llama/8215)
* CUDA: refactor and optimize IQ MMVQ

* uint -> uint32_t

* __dp4a -> ggml_cuda_dp4a

* remove MIN_CC_DP4A checks

* change default

* try CI fix
2024-07-08 14:53:55 +03:00
zhentaoyu
db7e0dbe6e Update SYCL-Rope op and Refactor (llama/8157)
* align with rope.cu and move sycl-op to a single file
2024-07-08 14:53:55 +03:00
Johannes Gäßler
bf88c94da9 CUDA: fix MMQ stream-k for --split-mode row (llama/8167) 2024-07-08 14:53:55 +03:00
John Balis
3eea171cab feat: cuda implementation for ggml_conv_transpose_1d (ggml/854)
* conv transpose 1d passing test for 1d input and kernel

* working for different input and output channel counts, added test for variable stride

* initial draft appears to work with stride other than 1

* working with all old and new conv1d  tests

* added a test for large tensors

* removed use of cuda hardcoding

* restored test-conv-transpose.c

* removed unused arguments, and fixed bug where test failure would cause subsequent tests to fail

* fixed accumulator bug

* added test to test-backend-ops

* fixed mistake

* addressed review

* fixed includes

* removed blank lines

* style and warning fixes

* return failure when test fails

* fix supports_op

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-07-08 14:53:55 +03:00
slaren
04e7fa6f4f
ggml : add GGML_CUDA_USE_GRAPHS option, restore GGML_CUDA_FORCE_CUBLAS (cmake) (llama/8140) 2024-06-26 23:18:11 +03:00
Georgi Gerganov
e30c679928
whisper : reorganize source code + improve CMake (#2256)
* scripts : update sync [no ci]

* files : reorganize [no ci]

* sync : llama.cpp

* cmake : link math library

* cmake : build normal ggml library

* files : move headers to include

* objc : fix path to ggml-metal.h

* ci : fix WHISPER_CUDA -> GGML_CUDA

* scripts : sync LICENSE [no ci]
2024-06-26 19:34:09 +03:00