c98681e6d5
ggml : upgrade init_tensor API to return a ggml_status (llama/11854)
...
* Upgrade init_tensor API to return a ggml_status
To prepare for an 'abort-free' ggml
(ggml not aborting on OOMs but returning an OOM status),
as agreed with Diego in the ggml repo,
upgrade the init_tensor() and view_init() APIs
to return a ggml_status.
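A minimal sketch of the change (assuming the ggml-backend buffer interface; the call site below is hypothetical):

    // Before: init_tensor returned void and had to abort on failure.
    //   void (*init_tensor)(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor);
    // After: failures surface as a ggml_status so callers can react to OOM.
    enum ggml_status (*init_tensor)(ggml_backend_buffer_t buffer,
                                    struct ggml_tensor * tensor);

    // Hypothetical call site: propagate the status instead of asserting.
    enum ggml_status status = buffer->iface.init_tensor(buffer, tensor);
    if (status != GGML_STATUS_SUCCESS) {
        return status;
    }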
* misc fixes
---------
Co-authored-by: slaren <slarengh@gmail.com >
2025-03-08 15:13:01 +02:00
992b51b3d5
ggml: aarch64: implement SVE kernels for q2_k_q8_k vector dot (llama/12064)
...
* Added SVE Support for Q2_K Quantized Models
* Use 4-space indentation in the switch cases
* removed comment lines
* Remove the loop; retain the curly braces for better understanding of the code
* Remove the comment line added for the q3_k_q8_k kernel
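The q2_K kernel itself works on packed quantized blocks; purely as an illustration of the predicated-loop pattern SVE kernels follow (a hypothetical f32 helper, not the committed kernel):

    #include <arm_sve.h>

    // Illustrative SVE dot-product skeleton: whilelt predication covers
    // the tail, so no scalar remainder loop is needed.
    static float dot_f32_sve(int n, const float * x, const float * y) {
        svfloat32_t acc = svdup_f32(0.0f);
        for (int i = 0; i < n; i += (int) svcntw()) {
            svbool_t    pg = svwhilelt_b32_s32(i, n);
            svfloat32_t vx = svld1_f32(pg, x + i);
            svfloat32_t vy = svld1_f32(pg, y + i);
            acc = svmla_f32_m(pg, acc, vx, vy); // acc += x*y on active lanes
        }
        return svaddv_f32(svptrue_b32(), acc);
    }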
---------
Co-authored-by: vithulep <p.m.vithule1517@gmail.com >
2025-03-08 15:13:01 +02:00
394768c48b
ggml-cpu: Fix build with sve (llama/12059)
...
* ggml-cpu: Fix build with sve
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
* ggml-cpu: Remove unused variable in sve q3_k vec dot
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
2025-03-08 15:13:01 +02:00
60d2ddebdf
cuda/cpu: Increase support for fp16 unary operations (ggml/1125)
...
* Support fp16 unary operations in the CUDA backend
* cpu: increase fp16 support for unary operators in the CPU backend
* cuda: increase fp16 support for unary operators in the CUDA backend
* Add test cases for fp16 unary operators
* metal: update supports_op for unary operators that don't support fp16, to prevent test-backend-ops from failing
* metal: fix PR comments for unary op support after fp16 unary tests
2025-03-08 15:13:01 +02:00
cdaee8b4bd
Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend (ggml/1121)
...
* Support float16-to-float16 add/sub/mul/div operations in the CUDA backend
* Add fp16 support for add/sub/mul/div on the CPU backend
* Add test cases for fp16 add/sub/mul/div
2025-02-27 08:55:36 +02:00
82e04e7670
ggml-cpu: Support s390x SIMD Instruction Set (llama/12019)
...
* ggml: add s390x ARCH_FLAGS for compilation
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: add SIMD for s390x using vector intrinsics
SIMD is activated for:
* ggml_vec_dot_f32
* ggml_vec_dot_f16
* ggml_vec_mad_f32
* ggml_vec_mad_f16
* ggml_vec_mad_f32_unroll
* ggml_vec_scale_f32
* ggml_vec_scale_f16
SIMD is NOT activated for:
* ggml_vec_dot_f16_unroll (pending bugfix)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
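For illustration, the VXE activations above follow a pattern like this hedged sketch (simplified; the committed ggml_vec_dot_f32 unrolls across several accumulators):

    #include <vecintrin.h>

    // Simplified s390x VXE f32 dot product: one 128-bit vector holds four
    // floats, and vec_madd performs a per-lane fused multiply-add.
    static float dot_f32_vxe(int n, const float * x, const float * y) {
        __vector float sum = { 0.0f, 0.0f, 0.0f, 0.0f };
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            __vector float vx = vec_xl(0, x + i);
            __vector float vy = vec_xl(0, y + i);
            sum = vec_madd(vx, vy, sum);
        }
        float s = sum[0] + sum[1] + sum[2] + sum[3];
        for (; i < n; ++i) {
            s += x[i] * y[i]; // scalar tail
        }
        return s;
    }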
* ggml: fix missing escape character in GGML_F32x4_REDUCE
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: fix s390x GGML_F32x4_REDUCE
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: full SIMD activation for F32,F16 s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: add option to disable s390x VXE/VXE2
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: change vecintrin.h include to ggml-cpu-impl
* add __VXE__ and __VXE2__ macros
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* cmake: add s390x target detection for VX/VXE/VXE2
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: move s390x vector intrinsics to ggml-cpu-impl.h
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: s390x Q8_0 SIMD
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: correct documentation for Q8_0
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: s390x reduce code complexity Q8_0
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: s390x bugfix typo Q8_0
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: s390x SIMD activated for Q4_1
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: s390x inline vec_reve
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: s390x SIMD activation for Q4_0
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: add VXE backend feature
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: remove test.py
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: s390x SIMD activation for quantize_row_q8_0
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: s390x SIMD activation for quantize_row_q8_1
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: s390x SIMD activation for iq4_xs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: bugfix iq4_xs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: s390x SIMD activation for iq4_nl
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: add float, double, and long vector data types
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: clean up iq4_xs SIMD
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: fix improper use of restrict keyword
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: update warning message for ggml_vec_tbl
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: switch to restrict for iq4_nl
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: slight dot product speed improvement for q4_1_q8_1
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: s390x SIMD activation for q6_K
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: add missing `_t` to ggml_int8x16x4_t
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: fix missing `_t` for ggml_vec_xl_s8x4
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: fix more missing `_t`
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: add unroll and prefetch to Q8_0
increase of 3.86% for prompt processing and 32.22% for token generation
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
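The unroll/prefetch idea, sketched on plain data (the distance, unroll factor, and per-block work are illustrative, not the committed values):

    #include <stddef.h>

    // Unroll by two and prefetch ahead so upcoming data is already in
    // cache when the loop reaches it.
    static float sum_unrolled(const float * x, size_t n) {
        const size_t dist = 8; // hypothetical prefetch distance, in elements
        float s0 = 0.0f, s1 = 0.0f;
        size_t i = 0;
        for (; i + 2 <= n; i += 2) {
            __builtin_prefetch(x + i + dist, /*rw=*/0, /*locality=*/1);
            s0 += x[i];     // unrolled iteration 0
            s1 += x[i + 1]; // unrolled iteration 1
        }
        for (; i < n; ++i) {
            s0 += x[i]; // tail
        }
        return s0 + s1;
    }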
* ggml: patch Q8_0 to use proper vector sizes
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: optimise Q8_0 dot prod compute kernel further
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: add unroll and prefetch to Q4_1
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: refactor Q6_K variable naming for readability
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: fix Q6_K typos
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: s390x SIMD activation for Q5_K
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: fix wrong char*x16_t naming
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: Q5_K y0 wrong signedness
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: fix Q5_K invalid uchar type
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: fix Q5_K invalid uchar type
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: s390x SIMD activation for Q4_K
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: fix Q4_K invalid vector intrinsics
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: simplify ggml_padd_s16 compute kernel
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: correct ggml-cpu vxe wording
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: change ggml_aligned_malloc alignment to 256
256 is the cache line size for s390x platforms
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
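A hedged sketch of cache-line-sized alignment (assuming aligned_alloc; the real ggml_aligned_malloc also handles platform fallbacks):

    #include <stdlib.h>

    // 256-byte alignment matches the s390x cache line; aligned_alloc
    // requires the size to be a multiple of the alignment.
    static void * aligned_malloc_256(size_t size) {
        const size_t align = 256;
        size = (size + align - 1) & ~(align - 1); // round size up
        return aligned_alloc(align, size);
    }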
* ggml: resolve pr merge via cherry-pick 225bbbf
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml : fix LoongArch compile error with 128-bit SIMD (llama/11701)
* ggml: resolve pr merge via cherry-pick 4571953
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
* ggml: cmake remove fork when determining s390x machine type
thank you @ericcurtin
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com >
Co-authored-by: Jinyang He <hejinyang@loongson.cn >
Co-authored-by: junchao-zhao <68935141+junchao-loongson@users.noreply.github.com >
2025-02-27 08:55:36 +02:00
dc21871fcb
ggml-cpu: Add CPU backend support for KleidiAI library (llama/11390)
...
* ggml-cpu: Add CPU backend support for KleidiAI library
* Add environmental variable GGML_KLEIDIAI_SME
* Add support for multithread LHS conversion
* Switch kernel selection order to dotprod and i8mm
* updates for review comments
* More updates for review comments
* Reorganize and rename KleidiAI files
* Move ggml-cpu-traits.h to source file
* Update cmake for SME build and add alignment for SME
* Stop appending GGML_USE_CPU_KLEIDIAI to the GGML_CDEF_PUBLIC list
2025-02-27 08:55:36 +02:00
64a430bc81
ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot (llama/11917)
...
* Added SVE Implementation for Q3_K Kernel in ggml-cpu-quants.c file
* Improved formatting of code in ggml-cpu-quants.c file
* style : minor fixes
* style : less whitespaces
* style : ptr spacing
---------
Co-authored-by: vithulep <p.m.vithule1517@gmail.com >
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
2025-02-27 08:55:36 +02:00
a7fc1038ca
repo : update links to new url (llama/11886)
...
* repo : update links to new url
ggml-ci
* cont : more urls
ggml-ci
2025-02-27 08:55:36 +02:00
e3d9ffb98b
ggml: optimize some vec dot functions for LoongArch ASX (llama/11842)
...
* Optimize ggml_vec_dot_q3_K_q8_K for LoongArch ASX
* Optimize ggml_vec_dot_q4_K_q8_K for LoongArch ASX
* Optimize ggml_vec_dot_q6_K_q8_K for LoongArch ASX
* Optimize ggml_vec_dot_q5_K_q8_K for LoongArch ASX
* Optimize ggml_vec_dot_q2_K_q8_K for LoongArch ASX
* Optimize mul_sum_i8_pairs_float for LoongArch ASX
* Optimize ggml_vec_dot_iq4_xs_q8_K for LoongArch ASX
2025-02-27 08:55:36 +02:00
defe731263
llamafile: use member variable instead of constant for iq4nlt (llama/11780)
2025-02-27 08:55:36 +02:00
d2c5154bb5
ggml-cpu : add chunking support to mul_mat_id (llama/11666)
...
* ggml-cpu : add chunking support to mul_mat_id
* allocate chunk counter in wdata
parallelize src1 quantization by column to allow parallelization even when there is only one row (see the sketch after this list)
* disable for arm
* cleanup
* better way to disable for arm
* fix uninitialized counter when using 1 thread only
* revert test-backend-ops changes
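A hedged sketch of the chunk-counter pattern (names are illustrative; the real counter is allocated in the op's wdata workspace):

    #include <stdatomic.h>

    // Each thread repeatedly claims the next unprocessed chunk, which
    // balances load even when work per chunk is uneven.
    static void worker(atomic_int * counter, int n_chunks) {
        for (;;) {
            const int chunk = atomic_fetch_add(counter, 1);
            if (chunk >= n_chunks) {
                break; // all chunks claimed
            }
            // process_chunk(chunk); // hypothetical per-chunk work
        }
    }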
2025-02-27 08:55:36 +02:00
4fac43fe00
ggml : x2 speed for WASM by optimizing SIMD (llama/11453)
...
* ggml : x2 speed for WASM by optimizing SIMD
* fix bad merging
* rm trailing spaces
* rm redundant clamp
* better quantize_row_q8_K
Co-authored-by: camel-cdr <camel-cdr@protonmail.com >
* remove memset that causes buffer overflow
Co-authored-by: camel-cdr <camel-cdr@protonmail.com >
---------
Co-authored-by: camel-cdr <camel-cdr@protonmail.com >
2025-02-27 08:55:36 +02:00
d597f83e1a
ggml : fix multi-threaded clamp_f32 (llama/11824)
...
* Bug fix for clamp_f32
When using tensors larger than 1D, the clamp operation does not work due to the restriction of returning early if ith is not 0.
* Bug fix for clamp_f32
* Bug fix for clamp_f32
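A hedged sketch of the multi-threaded pattern the fix needs (simplified; real ggml ops derive their row ranges from ith/nth):

    #include <stddef.h>

    // Instead of returning when ith != 0, each thread clamps its own
    // interleaved slice of the rows, so tensors with more than one row
    // are fully processed.
    static void clamp_rows(float * data, int nr, int ne0,
                           float min, float max, int ith, int nth) {
        for (int ir = ith; ir < nr; ir += nth) {
            float * row = data + (size_t) ir * ne0;
            for (int i = 0; i < ne0; ++i) {
                row[i] = row[i] < min ? min : (row[i] > max ? max : row[i]);
            }
        }
    }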
2025-02-27 08:55:36 +02:00
e5edcc6259
ggml-cpu: Fix duplicate MATMUL_INT8 (llama/11817)
...
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com >
2025-02-27 08:55:36 +02:00
91d02de332
Fix #11802 : Compile bug - RegQueryValueExA changed to RegQueryValueEx (llama/11803)
...
* Fix #11802 : Compile bug - RegQueryValueExA changed to RegQueryValueEx
* Fix #11802 : PR #11803 - keep RegQueryValueExA, remove TEXT macro, description needs to be ANSI string
2025-02-27 08:55:36 +02:00
5981352bb5
ggml: Fix data race in ggml threadpool (llama/11736)
...
After the barrier in the last iteration is executed, the loop termination
condition is still evaluated. However, the main thread may already have
destroyed the cgraph object and its nodes, so another thread then accesses
memory that is already gone. Trouble can also occur when n_nodes == 0 or
abort is called, though I'm not sure whether the former situation is possible.
The last synchronization should be done after the loop to ensure the
cgraph/cplan won't be accessed after the main thread exits from the function.
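As a hedged model of the fix (stubbed types and helpers, not the actual threadpool code):

    #include <stdatomic.h>

    struct pool { atomic_bool stop; /* graph state, barrier state, ... */ };

    static void barrier(struct pool * p) { (void) p; /* full barrier here */ }

    static void worker_loop(struct pool * p) {
        while (!atomic_load(&p->stop)) {
            /* compute assigned graph nodes */
            barrier(p);
        }
        barrier(p); // final sync: past this point workers never touch cgraph/cplan
    }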
2025-02-27 08:55:36 +02:00
bbd8364f5e
ggml : optimize and build warning fix for LoongArch (llama/11709)
...
* ggml : optimize convert f32<->f16 for loongarch_asx
* ggml : optimize loongarch_asx extend i16,i8,u8 to i32,i16
* ggml : Fix warnings when running CPU CI locally on LoongArch
2025-02-27 08:55:36 +02:00
f8242ec483
ggml : fix LoongArch compile error with 128-bit SIMD (llama/11701)
2025-02-27 08:55:36 +02:00
46d07b9c85
cmake : fix compile assumptions for power9/etc (#2777)
...
* Add small comment re: VSX to readme
Co-authored-by: midnight <midnight@example.com >
2025-02-05 14:41:10 +02:00
c262dc80e2
CPU/CUDA: fix (GQA) mul mat back, add CUDA support (llama/11380)
2025-02-03 22:00:57 +02:00
7183a1eb72
vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (llama/11166)
...
* vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl
Shaders are based on cpy.cu.
* vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32
* ggml: copy q->f32 assumes some contiguity in the destination
2025-02-03 22:00:57 +02:00
de49024e49
CUDA: backwards pass for misc. ops, add tests (llama/11257)
...
* CUDA: backwards pass for misc. ops, add tests
* remove restrict from pointers
2025-02-03 22:00:57 +02:00
db6383094c
ggml: aarch64: implement SVE kernels for q4_K_q8_K vector dot (llama/11227)
...
* Add SVE support for q4_K_q8_K
* Update ggml/src/ggml-cpu/ggml-cpu-quants.c
change to use K_SCALE_SIZE
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
2025-02-03 22:00:57 +02:00
54a2ee648f
RoPE: fix back, CUDA support for back + noncont. (llama/11240)
...
* RoPE: fix back, CUDA support for back + noncont.
* fix comments reg. non-cont. RoPE support [no-ci]
2025-02-03 22:00:57 +02:00
f12559d590
ggml-cpu : fix ggml_graph_compute_thread did not terminate on abort. (ggml/1065)
...
Some threads kept looping and failed to terminate properly after an abort during CPU execution.
Co-authored-by: issi <issi@gmail.com >
2025-02-03 22:00:57 +02:00
06209f6683
llama: add support for QRWKV6 model architecture (llama/11001)
...
* WIP: Add support for RWKV6Qwen2
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
* RWKV: Some graph simplification
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
* Add support for RWKV6Qwen2 with cpu and cuda GLA
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
* RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
* Fix some typos
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
* code format changes
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
* Fix wkv test & add gla test
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
* Fix cuda warning
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
* Update README.md
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
* Update ggml/src/ggml-cuda/gla.cu
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* Fix fused lerp weights loading with RWKV6
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
* better sanity check skipping for QRWKV6 in llama-quant
thanks @compilade
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
Co-authored-by: compilade <git@compilade.net >
---------
Signed-off-by: Molly Sophia <mollysophia379@gmail.com >
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
Co-authored-by: compilade <git@compilade.net >
2025-01-14 10:38:01 +02:00
124eec1664
llamafile : ppc64le MMA INT8 implementation (llama/10912)
...
This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le using MMA
builtins for the quantised int8 datatype.
This change results in a 10% - 70% improvement
in total speed (i.e. all tokens / total time) across
various batch sizes.
The patch was tested with the Meta-Llama-3-8B,
Mistral-7B, and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.
Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com >
2025-01-14 10:38:01 +02:00
09fabffdf5
ggml-backend : only offload from host buffers (fix) (llama/11124)
2025-01-14 10:38:01 +02:00
3fcba3e58b
ggml : fixes for AVXVNNI instruction set with MSVC and Clang (llama/11027)
...
* Fixes for clang AVX VNNI
* enable AVX VNNI and Alder Lake build for MSVC
* Apply suggestions from code review
---------
Co-authored-by: slaren <slarengh@gmail.com >
2025-01-04 10:45:01 +02:00
bcf937c216
ggml : more perfo with llamafile tinyblas on x86_64 (llama/10714)
...
* more performance with llamafile tinyblas on x86_64.
- add bf16 support
- change dispatch strategy (thanks:
https://github.com/ikawrakow/ik_llama.cpp/pull/71)
- reduce memory bandwidth
simpler tinyblas dispatch that is more cache friendly
* tinyblas dynamic dispatching
* sgemm: add M blocks.
* - git 2.47 uses short ids of length 9.
- show-progress is not part of GNU Wget2
* remove unstable test
2025-01-04 10:45:01 +02:00
b8d90953d7
ggml : use wstring for backend search paths (llama/10960)
...
ggml-ci
2025-01-04 10:45:01 +02:00
60a422147b
ggml : fix arm enabled features check (llama/10961)
2025-01-04 10:45:01 +02:00
3387415bad
ggml : fix const usage in SSE path (llama/10962)
2025-01-04 10:45:01 +02:00
6d502f33dc
ggml-cpu: replace NEON asm with intrinsics in ggml_gemv_q4_0_4x8_q8_0() (llama/10874)
...
* ggml-cpu: replace NEON asm with intrinsics in ggml_gemv_q4_0_4x8_q8_0()
Signed-off-by: Adrien Gallouët <angt@huggingface.co >
* ggml-cpu: format code
Signed-off-by: Adrien Gallouët <angt@huggingface.co >
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co >
2025-01-04 10:45:01 +02:00
1462d92588
ggml : add test for SVE and disable when it fails (llama/10906)
2025-01-04 10:45:01 +02:00
7ba1a41f47
ggml: fix arm build with gcc (llama/10895)
...
Signed-off-by: Adrien Gallouët <angt@huggingface.co >
2025-01-04 10:45:01 +02:00
5ea088636f
ggml : fix arm build (llama/10890)
...
* ggml: GGML_NATIVE uses -mcpu=native on ARM
Signed-off-by: Adrien Gallouët <angt@huggingface.co >
* ggml: Show detected features with GGML_NATIVE
Signed-off-by: Adrien Gallouët <angt@huggingface.co >
* remove msvc support, add GGML_CPU_ARM_ARCH option
* disable llamafile in android example
* march -> mcpu, skip adding feature macros
ggml-ci
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co >
Co-authored-by: Adrien Gallouët <angt@huggingface.co >
2025-01-04 10:45:01 +02:00
6576af00d7
files : remove old sources
2024-12-18 12:52:16 +02:00
479499dc0e
ggml : update ggml_backend_cpu_device_supports_op (llama/10867)
...
* ggml : fix cpy op for IQ-quants to use reference impl
ggml-ci
* ggml : disable tests involving i-matrix quantization
* ggml : update ggml_backend_cpu_device_supports_op
ggml-ci
2024-12-18 12:52:16 +02:00
e22d38e4f2
llama : add Qwen2VL support + multimodal RoPE (llama/10361)
...
* Barebone Qwen2VL LLM convertor
* Add Qwen2VL cli entrypoint
* [WIP] add qwen2vl arch
* Verify m-rope output
* Add vl-rope/2d-rope support for qwen2vl ViT
* update qwen2vl cli tool
* update 5D tensor op workaround
* [WIP] qwen2vl vision model
* make batch and clip utils compatible with qwen2vl
* [WIP] create inference workflow, gguf convert script but fix
* correcting vision-rope behavior, add the missing last layer back to ViT
* add arg parser to qwen2vl_surgery
* replace variable size array with vector
* cuda-gdb cmake preset
* add fp32 mrope, vision rope kernel
* add fp16 support for qwen2vl and m-rope
* add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION`
* fix rope op mode switching, out dated func args
* update `llama_hparams`
* update to keep up with upstream changes
* resolve linter, test errors
* add makefile entry, update special image padding token
* add mrope unit test, fix few compiler warnings
* rename `mrope` related function, params
* minor updates on debug util, bug fixes
* add `m-rope` testcase to `test-backend-ops`
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* fix trailing whitespace
* store `llama_hparams.rope_sections` with fixed size array
* update position id tensor size check in GGML_OP_ROPE
* minor updates
* update `ggml_backend_*_supports_op` of unsupported backends
* remove old `rope_section` compare operator
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
2024-12-18 12:52:16 +02:00
e6eed605cf
ggml : Fix compilation issues on ARM platform when building without fp16 (llama/10811)
2024-12-18 12:52:16 +02:00
1193e494a9
remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS (llama/10797)
...
other Windows build fixes
2024-12-18 12:52:16 +02:00
d0a050b51f
ggml : disable iq4_nl interleave size 8 (llama/10709)
...
ggml-ci
2024-12-18 12:52:16 +02:00
e990d1b791
ggml : refactor online repacking (llama/10446)
...
* rename ggml-cpu-aarch64.c to .cpp
* reformat extra cpu backend.
- clean Q4_0_N_M and IQ4_0_N_M
- remove from "file" tensor type
- allow only with dynamic repack
- extract cpu extra bufts and convert to C++
- hbm
- "aarch64"
- more generic use of extra buffer
- generalise extra_supports_op
- new API for "cpu-accel":
- amx
- aarch64
* clang-format
* Clean Q4_0_N_M ref
Enable restrict on C++
* add op GGML_OP_MUL_MAT_ID for Q4_0_N_M with runtime repack
* added/corrected control on tensor size for Q4 repacking.
* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
* add debug logs on repacks.
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com >
2024-12-18 12:52:16 +02:00
94e7da1ff2
cmake : fix "amd64" processor string ( #2638 )
2024-12-17 18:34:32 +02:00
a815940e0e
ggml : add predefined list of CPU backend variants to build (llama/10626)
...
* ggml : add predefined list of CPU backend variants to build
* update CPU dockerfiles
2024-12-08 20:14:35 +02:00
904e307bce
ggml-cpu : fix HWCAP2_I8MM value (llama/10646)
2024-12-08 20:14:35 +02:00
b7c64a4352
ggml: add GGML_SET Metal kernel + i32 CPU kernel (ggml/1037)
...
* implemented cpu kernel
* add i32 test cases in test-backend-ops
* typedef `ggml_metal_kargs_set`
* implemented `kernel_set`
* memcpy
2024-12-08 20:14:35 +02:00
7895d39508
ggml : add GGML_PAD_REFLECT_1D operation (ggml/1034)
...
* ggml_pad_reflect_1d defined in header
* implemented on CPU
* called the forward pass
* impl Metal kernel
* added Metal kernel
* added OP_PAD_REFLECT_1D in test-backend-ops.cpp
* add test-pad-reflect-1d test case
* test case support multiple backend
2024-12-08 20:14:35 +02:00