29 Commits

Author SHA1 Message Date
Karol Kontny
5981352bb5 ggml: Fix data race in ggml threadpool (llama/11736)
After the barrier in the last iteration is executed, the loop termination
condition is still evaluated. However, the main thread may already have destroyed the cgraph object
and its nodes by then, so another thread would access memory that is already gone.
Trouble can also happen when n_nodes == 0 or abort is called, but I'm not sure whether the
former situation is possible.

The last synchronization should be done after the loop to ensure the cgraph/cplan won't be
accessed after the main thread exits from the function.
2025-02-27 08:55:36 +02:00
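
The synchronization pattern described in the commit above can be sketched roughly as follows. This is a minimal Python illustration, not the ggml code: `compute_graph`, `worker`, and the per-thread `partial` sums are hypothetical names, and `nodes.clear()` only stands in for the main thread destroying the graph.

```python
import threading

def compute_graph(nodes, n_threads):
    """Toy model of a threadpool walking a graph (all names are hypothetical)."""
    barrier = threading.Barrier(n_threads)
    partial = [0] * n_threads          # one slot per thread, so no write races

    def worker(ith):
        n = 0
        while n < len(nodes):          # the loop condition reads the shared graph
            partial[ith] += nodes[n] * (ith + 1)   # stand-in for real node work
            n += 1
            if n < len(nodes):
                barrier.wait()         # sync between nodes, but not after the last one
        # Every read of `nodes`, including the final loop check, is done by now.
        barrier.wait()                 # last synchronization, placed after the loop

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(1, n_threads)]
    for t in threads:
        t.start()
    worker(0)                          # the main thread takes part as thread 0
    nodes.clear()                      # stand-in for destroying the graph
    for t in threads:
        t.join()
    return sum(partial)
```

Because the final barrier comes after the loop, the main thread cannot reach the point where it tears down the graph until every worker has finished its last loop check. The same path also covers an empty graph, where all threads go straight to the final barrier.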
Jinyang He
bbd8364f5e ggml : optimize and build warning fix for LoongArch (llama/11709)
* ggml : optimize convert f32<->f16 for loongarch_asx

* ggml : optimize loongarch_asx extend i16,i8,u8 to i32,i16

* ggml : Fix warnings when running CPU CI locally on LoongArch
2025-02-27 08:55:36 +02:00
Johannes Gäßler
c262dc80e2 CPU/CUDA: fix (GQA) mul mat back, add CUDA support (llama/11380) 2025-02-03 22:00:57 +02:00
Jeff Bolz
7183a1eb72 vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (llama/11166)
* vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl

Shaders are based on cpy.cu.

* vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32

* ggml: copy q->f32 assumes some contiguity in the destination
2025-02-03 22:00:57 +02:00
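
For readers unfamiliar with what a copy from f32 to q4_0 involves, here is a rough sketch of block quantization in the spirit of the q4_0 format. It is an assumption-laden illustration rather than the shader or the reference code: the function names are hypothetical, the scale is kept as a Python float (the real format stores it in half precision), and the nibble packing order shown is my understanding of the CPU reference layout.

```python
QK4_0 = 32  # values per block (assumed block size for q4_0)

def quantize_block_q4_0(x):
    """Quantize 32 floats to one scale plus 16 bytes of packed 4-bit values."""
    assert len(x) == QK4_0
    vmax = max(x, key=abs)             # value with the largest magnitude, sign kept
    d = vmax / -8.0                    # scale: maps vmax onto quantized level 0
    inv = 1.0 / d if d != 0.0 else 0.0
    half = QK4_0 // 2
    qs = bytearray(half)
    for j in range(half):
        # x * inv lies in [-8, 8]; adding 8.5 and truncating rounds to 0..16
        q0 = min(15, int(x[j] * inv + 8.5))
        q1 = min(15, int(x[j + half] * inv + 8.5))
        qs[j] = q0 | (q1 << 4)         # low nibble: first half, high nibble: second half
    return d, bytes(qs)

def dequantize_block_q4_0(d, qs):
    """Inverse mapping: unpack the nibbles and rescale."""
    half = len(qs)
    out = [0.0] * (2 * half)
    for j in range(half):
        out[j] = ((qs[j] & 0x0F) - 8) * d
        out[j + half] = ((qs[j] >> 4) - 8) * d
    return out
```

A round trip through these two functions loses at most about one quantization step `|d|` per value, which is why the reverse copy (q->f32) is a plain dequantization.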
Johannes Gäßler
de49024e49 CUDA: backwards pass for misc. ops, add tests (llama/11257)
* CUDA: backwards pass for misc. ops, add tests

* remove restrict from pointers
2025-02-03 22:00:57 +02:00
Johannes Gäßler
54a2ee648f RoPE: fix back, CUDA support for back + noncont. (llama/11240)
* RoPE: fix back, CUDA support for back + noncont.

* fix comments reg. non-cont. RoPE support [no-ci]
2025-02-03 22:00:57 +02:00
issixx
f12559d590 ggml-cpu : fix ggml_graph_compute_thread did not terminate on abort. (ggml/1065)
some threads kept looping and failed to terminate properly after an abort during CPU execution.

Co-authored-by: issi <issi@gmail.com>
2025-02-03 22:00:57 +02:00
Molly Sophia
06209f6683 llama: add support for QRWKV6 model architecture (llama/11001)
* WIP: Add support for RWKV6Qwen2

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV: Some graph simplification

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Add support for RWKV6Qwen2 with cpu and cuda GLA

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix some typos

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* code format changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix wkv test & add gla test

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix cuda warning

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update README.md

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update ggml/src/ggml-cuda/gla.cu

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix fused lerp weights loading with RWKV6

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* better sanity check skipping for QRWKV6 in llama-quant

thanks @compilade

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: compilade <git@compilade.net>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <git@compilade.net>
2025-01-14 10:38:01 +02:00
Djip007
bcf937c216 ggml : more performance with llamafile tinyblas on x86_64 (llama/10714)
* more performance with llamafile tinyblas on x86_64.

- add bf16 support
- change dispatch strategy (thanks:
https://github.com/ikawrakow/ik_llama.cpp/pull/71 )
- reduce memory bandwidth

simple tinyblas dispatch and more cache friendly

* tinyblas dynamic dispatching

* sgemm: add M blocks.

* - git 2.47 uses short ids of length 9.
- show-progress is not part of GNU Wget2

* remove unstable test
2025-01-04 10:45:01 +02:00
Diego Devesa
3387415bad ggml : fix const usage in SSE path (llama/10962) 2025-01-04 10:45:01 +02:00
HimariO
e22d38e4f2 llama : add Qwen2VL support + multimodal RoPE (llama/10361)
* Barebone Qwen2VL LLM converter

* Add Qwen2VL cli entrypoint

* [WIP] add qwen2vl arch

* Verify m-rope output

* Add vl-rope/2d-rope support for qwen2vl ViT

* update qwen2vl cli tool

* update 5D tensor op workaround

* [WIP] qwen2vl vision model

* make batch and clip utils compatible with qwen2vl

* [WIP] create inference workflow, gguf convert script but fix

* correcting vision-rope behavior, add the missing last layer back to ViT

* add arg parser to qwen2vl_surgery

* replace variable size array with vector

* cuda-gdb cmake preset

* add fp32 mrope, vision rope kernel

* add fp16 support for qwen2vl and m-rope

* add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION`

* fix rope op mode switching, outdated func args

* update `llama_hparams`

* update to keep up with upstream changes

* resolve linter, test errors

* add makefile entry, update special image padding token

* add mrope unit test, fix few compiler warnings

* rename `mrope` related function, params

* minor updates on debug util, bug fixes

* add `m-rope` testcase to `test-backend-ops`

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix trailing whitespace

* store `llama_hparams.rope_sections` with fixed size array

* update position id tensor size check in GGML_OP_ROPE

* minor updates

* update `ggml_backend_*_supports_op` of unsupported backends

* remove old `rope_section` compare operator

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-12-18 12:52:16 +02:00
Karol Kontny
e6eed605cf ggml : Fix compilation issues on ARM platform when building without fp16 (llama/10811) 2024-12-18 12:52:16 +02:00
Diego Devesa
1193e494a9 remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS (llama/10797)
other windows build fixes
2024-12-18 12:52:16 +02:00
Djip007
e990d1b791 ggml : refactor online repacking (llama/10446)
* rename ggml-cpu-aarch64.c to .cpp

* reformat extra cpu backend.

- clean Q4_0_N_M and IQ4_0_N_M
  - remove from "file" tensor type
  - allow only with dynamic repack

- extract cpu extra bufts and convert to C++
  - hbm
  - "aarch64"

- more generic use of extra buffer
  - generalise extra_supports_op
  - new API for "cpu-accel":
     - amx
     - aarch64

* clang-format

* Clean Q4_0_N_M ref

Enable restrict on C++

* add op GGML_OP_MUL_MAT_ID for Q4_0_N_M with runtime repack

* added/corrected control on tensor size for Q4 repacking.

* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add debug logs on repacks.

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-12-18 12:52:16 +02:00
Diego Devesa
a815940e0e ggml : add predefined list of CPU backend variants to build (llama/10626)
* ggml : add predefined list of CPU backend variants to build

* update CPU dockerfiles
2024-12-08 20:14:35 +02:00
Diego Devesa
904e307bce ggml-cpu : fix HWCAP2_I8MM value (llama/10646) 2024-12-08 20:14:35 +02:00
PAB
b7c64a4352 ggml: add GGML_SET Metal kernel + i32 CPU kernel (ggml/1037)
* implemented cpu kernel

* add i32 test cases in test-backend-ops

* typedef `ggml_metal_kargs_set`

* implemented `kernel_set`

* memcpy
2024-12-08 20:14:35 +02:00
PAB
7895d39508 ggml : add GGML_PAD_REFLECT_1D operation (ggml/1034)
* ggml_pad_reflect_1d defined in header

* implemented on CPU

* called the forward pass

* impl Metal kernel

* added Metal kernel

* added OP_PAD_REFLECT_1D in test-backend-ops.cpp

* add test-pad-reflect-1d test case

* test case support multiple backend
2024-12-08 20:14:35 +02:00
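
As a rough illustration of what a 1D reflect padding operation computes, the sketch below pads a single row by mirroring it around its first and last elements without repeating the edge value. `pad_reflect_1d` here is a hypothetical Python helper, not the ggml function, and the exact edge convention is an assumption based on the usual meaning of "reflect" padding.

```python
def pad_reflect_1d(row, p0, p1):
    """Pad `row` with p0 mirrored values on the left and p1 on the right."""
    n = len(row)
    # Reflection without repeating the edge needs pads shorter than the row.
    assert 0 <= p0 < n and 0 <= p1 < n
    left = [row[i] for i in range(p0, 0, -1)]     # row[p0], ..., row[1]
    right = [row[n - 2 - i] for i in range(p1)]   # row[n-2], row[n-3], ...
    return left + list(row) + right
```

For example, padding `[1, 2, 3, 4]` with `p0=2` and `p1=1` gives `[3, 2, 1, 2, 3, 4, 3]`.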
Diego Devesa
3daeacad24 ggml : move AMX to the CPU backend (llama/10570)
ggml : automatic selection of best CPU backend (llama/10606)
2024-12-08 20:14:35 +02:00
Georgi Gerganov
3623bd58f2 ggml : fix I8MM Q4_1 scaling factor conversion (llama/10562)
ggml-ci
2024-12-08 20:14:35 +02:00
Georgi Gerganov
4ca1e72fe0 ggml : fix row condition for i8mm kernels (llama/10561)
ggml-ci
2024-12-08 20:14:35 +02:00
Shupei Fan
330273901f ggml-cpu: support IQ4_NL_4_4 by runtime repack (llama/10541)
* ggml-cpu: support IQ4_NL_4_4 by runtime repack

* ggml-cpu: add __ARM_FEATURE_DOTPROD guard
2024-12-08 20:14:35 +02:00
Diego Devesa
77e3e4a090 ggml : add support for dynamic loading of backends (llama/10469)
* ggml : add support for dynamic loading of backends

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-12-08 20:14:35 +02:00
Diego Devesa
8b1c1c30a7 ggml : do not use ARM features not included in the build (llama/10457) 2024-12-08 20:14:35 +02:00
FirstTimeEZ
45cf1634dc ggml : fix undefined reference to 'getcpu' (llama/10354)
https://github.com/ggerganov/llama.cpp/issues/10352
2024-11-20 21:00:08 +02:00
Johannes Gäßler
c9541741e6 ggml: new optimization interface (ggml/988)
* ggml: new optimization interface

remove test2.c, test3.c

store adamw params in tensor

move grads from tensor to graph

* avoid segfault upon API misuse

* add ggml-opt.h to public headers

* remove dependence of ggml-opt.cpp on ggml-cpu.h
2024-11-20 21:00:08 +02:00
Eve
3216efef2e AVX BF16 and single scale quant optimizations (llama/10212)
* use 128 bit loads (i've tried 256->128 to death and it's slower)

* double accumulator

* avx bf16 vec dot

* +3% q4_0 inference

* +7% tg +5% pp compared to master

* slower f16c version, kept for reference

* 256b version, also slow. i tried :)

* revert f16

* faster with madd

* split to functions

* Q8_0 and IQ4_NL, 5-7% faster

* fix potential overflow (performance reduced)

* 16 bit add for q4_0 only

* merge
2024-11-20 21:00:08 +02:00
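
The bf16 vec dot mentioned in the commit above relies on bf16 being the upper 16 bits of an IEEE-754 float32. The scalar sketch below shows that relationship and a plain dot product. It says nothing about the AVX code itself: the helper names are hypothetical, and the round-to-nearest-even conversion is a common choice that I am assuming here.

```python
import struct

def bf16_to_f32(h):
    """Widen a bf16 bit pattern by placing it in the top half of a float32."""
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]

def f32_to_bf16(x):
    """Narrow a float to bf16 bits with round-to-nearest-even (assumed rounding)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits & 0x7FFFFFFF) > 0x7F800000:      # NaN: keep it a quiet NaN
        return (bits >> 16) | 0x40
    # Add a rounding bias, plus the lowest kept bit to break ties to even.
    return ((bits + 0x7FFF + ((bits >> 16) & 1)) >> 16) & 0xFFFF

def vec_dot_bf16(a, b):
    """Scalar reference: widen each pair of bf16 values and accumulate products."""
    return sum(bf16_to_f32(x) * bf16_to_f32(y) for x, y in zip(a, b))
```

A SIMD version performs the same widening with a 16-bit left shift on packed integers before the multiply-add, which is why bf16 is cheap to support on hardware without native bf16 instructions.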
Charles Xu
3298916e5e backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels (llama/9921)
* backend-cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-11-20 21:00:08 +02:00
Diego Devesa
746bf2596f ggml : build backends as libraries (llama/10256)
* ggml : build backends as libraries

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
2024-11-20 21:00:08 +02:00