Commit Graph

1364 Commits

Author SHA1 Message Date
slaren
8f5dc729d9 ggml : use atomic_flag for critical section (llama/7598)
* ggml : use atomic_flag for critical section

* add windows shims
2024-06-16 18:19:48 +03:00
Georgi Gerganov
02fc147a0b examples : adapt to new ggml_concat (ggml/0) 2024-06-16 18:19:48 +03:00
zhouwg
109148ac84 ggml : fix typo in ggml.c (llama/7603) 2024-06-16 18:19:48 +03:00
Meng, Hengyu
3563473d2c Align GEMM dispatch (llama/7566)
* align GEMM dispatch
2024-06-16 18:19:48 +03:00
Georgi Gerganov
046834198d sycl : fix assert (llama/7563) 2024-06-16 18:19:48 +03:00
k.h.lai
0a2ad9de06 vulkan: properly initialize vulkan devices for LLAMA_SPLIT_MODE_NONE (llama/7552) 2024-06-16 18:19:48 +03:00
Radoslav Gerganov
39b0640b09 rpc : resource management rework (llama/7562)
* rpc : resource management rework

* address review comments
2024-06-16 18:19:48 +03:00
Neo Zhang
8dca71de64 fix ggml_sycl_mul_mat_id() to match the change of api (llama/7436)
* fix mul_mat_id to match the change of api

* rm comment

* rm unused or duplicated code, rename as review comment
2024-06-16 18:19:48 +03:00
Georgi Gerganov
812787cbc5 ggml : generalize GGML_OP_CONCAT (llama/7563)
* ggml : generalize GGML_OP_CONCAT (WIP)

ggml-ci

* tests : add dim != 2 tests

* metal : generalize concat kernel

* tests : naming

* cuda : generalize concat kernel

ggml-ci

* sycl : add warning and assert

* ggml : fix op params handling

* metal : bugfix kernel

ggml-ci

* ggml : reimplement CPU and Metal

* cuda : add asserts

ggml-ci

* ggml : fix ptrs

ggml-ci
2024-06-16 18:19:48 +03:00
Djip007
68ef10805e update HIP_UMA #7399 (llama/7414)
* update HIP_UMA #7399

add use of hipMemAdviseSetCoarseGrain when LLAMA_HIP_UMA is enable.
- get x2 on prompte eval and x1.5 on token gen with rocm6.0 on ryzen 7940HX iGPU (780M/gfx1103)

* simplify code, more consistent style

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-16 18:19:48 +03:00
agray3
96fdb90f5f Allow multiple copy function pointers for CUDA graph kernel param updates (llama/7565)
CUDA graphs require parameter updates to kernels associated with
GGML_OP_CPY nodes. Previously the implementation only checked for a
single CUDA kernel in such nodes, but this caused a bug in cases where
2 such kernels exist. This fixes the issue by using a vector to allow
multiple function pointers to be stored and checked against.

Fixes #7942
2024-06-16 18:19:48 +03:00
AidanBeltonS
e98f9ac554 Fix q_xxs using mul_mat_q (llama/7459) 2024-06-16 18:19:48 +03:00
AidanBeltonS
02d481595b Add freq factors (llama/7495) 2024-06-16 18:19:48 +03:00
Georgi Gerganov
7091c7ab5a metal : add GGML_OP_REPEAT kernels (llama/7557)
ggml-ci
2024-06-16 18:19:48 +03:00
Georgi Gerganov
d70ccb75f5 metal : disable FA kernel for HS=256 (llama/7556)
ggml-ci
2024-06-16 18:19:48 +03:00
Georgi Gerganov
5ee048eb67 ggml : restore ggml_rope_xpos_inplace (ggml/0)
ggml-ci
2024-06-16 18:19:48 +03:00
Masaya, Kato
37ed71c964 ggml: aarch64: SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot (llama/7433)
* Add SVE support for q4_0_q8_0 q8_0_q8_0

* remove ifdef
2024-06-16 18:19:48 +03:00
Georgi Gerganov
8cd7a3df37 ggml : silence UB sanitizer error during iq2_xxs quantization (llama/0) 2024-06-16 18:19:48 +03:00
Georgi Gerganov
04a3279320 ggml : remove ggml_flash_attn and ggml_flash_ff (llama/7463)
ggml-ci
2024-06-16 18:19:48 +03:00
Georgi Gerganov
45ddda8e0c ggml : drop support for QK_K=64 (llama/7473)
* ggml : drop support for QK_K=64

ggml-ci

* opencl : restore QK_K=256 define
2024-06-16 18:19:48 +03:00
0cc4m
c41317fd66 Update vulkan rope implementation to support frequency factors (llama/7475) 2024-06-16 18:19:48 +03:00
Johannes Gäßler
96b8419b27 CUDA: fix FA out-of-bounds reads (llama/7479) 2024-06-16 18:19:48 +03:00
Johannes Gäßler
3c63f4cf35 CUDA: fix FA out-of-bounds writes (llama/7465) 2024-06-16 18:19:48 +03:00
Georgi Gerganov
5848dfd9c8 cuda : fix compile warning (llama/7454) 2024-06-16 18:19:48 +03:00
Johannes Gäßler
29ab5d0326 CUDA: remove incorrect precision check (llama/7454) 2024-06-16 18:19:48 +03:00
Georgi Gerganov
c4d6958b3e cuda : fix rope + add tests (llama/7452)
* cuda : fix rope pos data

ggml-ci

* ggml : drop mode & 1 == 1 support for ggml_rope

ggml-ci

* ggml : support freq_factors for f16 rope (CPU)

ggml-ci

* tests : add rope tests using frequency factors

ggml-ci
2024-06-16 18:19:48 +03:00
liuwei-git
c9dcb75118 llama : add phi3 128K model support (llama/7225)
* add phi3 128k support in convert-hf-to-gguf

* add phi3 128k support in cuda

* address build warnings on llama.cpp

* adjust index value in cuda long rope freq factors

* add long rope support in ggml cpu backend

* make freq factors only depend on ctx size

* remove unused rope scaling type 'su' frin gguf converter

* fix flint warnings on convert-hf-to-gguf.py

* set to the short freq factor when context size is small than trained context size

* add one line of comments

* metal : support rope freq_factors

* ggml : update ggml_rope_ext API to support freq. factors

* backends : add dev messages to support rope freq. factors

* minor : style

* tests : update to use new rope API

* backends : fix pragma semicolons

* minor : cleanup

* llama : move rope factors from KV header to tensors

* llama : remove tmp assert

* cuda : fix compile warning

* convert : read/write n_head_kv

* llama : fix uninitialized tensors

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-16 18:19:48 +03:00
Georgi Gerganov
bbdbc3fc62 metal : handle F16 inf values, fix FA partial offload (llama/7434)
ggml-ci
2024-06-16 18:19:48 +03:00
Johannes Gäßler
28c207a541 CUDA: fix unused warning in mmq.cu (llama/7442) 2024-06-16 18:19:48 +03:00
Johannes Gäßler
c23f830983 CUDA: deduplicate mmq code (llama/7397) 2024-06-16 18:19:48 +03:00
Radoslav Gerganov
caeeb32b41 rpc : track allocated buffers (llama/7411)
* rpc : track allocated buffers

ref: #7407

* rpc : pack rpc_tensor tightly
2024-06-16 18:19:48 +03:00
AidanBeltonS
584cc1177a Update SYCL upscale operation (llama/7321)
* Update SYCL upscale operation

* Formatting

* Remove messages
2024-06-16 18:19:48 +03:00
Herman Semenov
cc1ae10989 ggml-opencl, llama: using reserve() if count already known (llama/7272) 2024-06-16 18:19:48 +03:00
junchao-loongson
eb26f55b40 ggml : add loongarch lsx and lasx support (llama/6454)
* add loongarch lsx and lasx optimize code

* Add loongarch compilation support to makefile

* revert stb_image.h

* opt bytes_from_nibbles_32 and sum_i16_pairs_float

* fix undeclared

* format code

* update

* update 2

---------

Co-authored-by: Jinyang He <hejinyang@loongson.cn>
2024-06-16 18:19:48 +03:00
Srihari-mcw
eb2b086584 Add provisions for windows support for BF16 code including CMake provision for enabling AVX512_BF16 (llama/7258) 2024-06-16 18:19:48 +03:00
0cc4m
67919cfe11 Vulkan Embedding Fix (llama/7360)
* Fix empty Vulkan host buffers

Add fp32 fp16 matmul shader

Fix matmul shader alignment

* Remove deprecated tensor->backend uses

* Fix Vulkan validation errors on embedding models with no offloaded layers

* Fix Vulkan llava segfault when not offloading layers
2024-06-16 18:19:48 +03:00
slaren
bf5fc81a8a ggml : fix another case of quants nans (llama/7387) 2024-06-16 18:19:48 +03:00
Johannes Gäßler
2b07dc3186 ggml: implement quantized KV cache for FA (llama/7372) 2024-06-16 18:19:48 +03:00
slaren
951c463d39 cuda : clear error after buffer allocation failure (llama/7376) 2024-06-16 18:19:48 +03:00
fraxy-v
7f257b210f Capture CUDA logging output (llama/7298)
* logging: output capture in cuda module

* fix compile error

* fix: vsnprintf terminates with 0, string use not correct

* post review

* Update llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* Update llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-16 18:19:48 +03:00
Georgi Gerganov
705fe30a02 android : use "ci-android" branch for CI (llama/7341)
* android : use "ci-android" branch for CI

* ggml : disable SIMD exp and silu for 32-bit ARM

ggml-ci

* android : do not fetch, use add_subdirectory instead

* cmake : provide binary dir
2024-06-16 18:19:48 +03:00
Johannes Gäßler
45b5b95e29 CUDA: deduplicate FlashAttention code (llama/7352) 2024-06-16 18:19:48 +03:00
Engininja2
f2c47d1e6a cuda : add half2 __shfl_xor() for ROCm 5.5 (llama/7263) 2024-06-16 18:19:48 +03:00
0cc4m
b4bb9b9036 Update and fix Vulkan soft_max and argsort implementations (llama/7237)
* Update and fix Vulkan softmax implementation

* Update and fix Vulkan argsort implementation
2024-06-16 18:19:48 +03:00
slaren
2bc6483299 ggml : fix quants nans when all the group weights are very close to zero (llama/7313) 2024-06-16 18:19:48 +03:00
Johannes Gäßler
ec52f900e4 CUDA: faster large batch FA without tensor cores (llama/7314) 2024-06-16 18:19:48 +03:00
Radoslav Gerganov
77d708fabb rpc : set SO_REUSEADDR for the server socket (llama/7320)
ref: #7293
2024-06-16 18:19:48 +03:00
Herman Semenov
c00149c861 ggml-quants, llama : removed excess checks (llama/7274) 2024-06-16 18:19:48 +03:00
Justine Tunney
574661f2e6 ggml : rewrite silu and softmax for cpu (llama/7154)
This change upstreams llamafile's vectorized expf() functions. This lets
us compute softmax and silu more accurately than the short[65536] lookup
table that GGML previously used to make this operation go faster. We can
support aarch64 and sse2+ with the worst case rounding error of 2ulp. It
makes make -j8 tests && ./tests/test-backend-ops -o SOFT_MAX -b CPU perf
go 1.5x faster for SSE2+FMA, 1.9x faster for AVX2+FMA and 2.1x on AVX512
2024-06-16 18:19:48 +03:00
Radoslav Gerganov
7bd69349bf rpc : add command line arg for specifying backend memory
ref: #7293
2024-06-16 18:19:48 +03:00