whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2024-11-07 16:44:13 +01:00

Author	SHA1	Message	Date
slaren	b2ad484c89	ggml : do not crash when quantizing q4_x_x with an imatrix (llama/9192)	2024-08-28 13:22:20 +03:00
Georgi Gerganov	d96a17848f	metal : separate scale and mask from QKT in FA kernel (llama/9189) * metal : separate scale and mask from QKT in FA kernel * metal : ne01 check no longer necessary * metal : keep data in local memory	2024-08-28 13:22:20 +03:00
Georgi Gerganov	0e7798677a	ggml : add SSM Metal kernels (llama/8546) * ggml : add ggml_ssm_conv metal impl * ggml : add ssm_scan metal impl ggml-ci	2024-08-28 13:22:20 +03:00
slaren	58a36d2e3b	metal : gemma2 flash attention support (llama/9159)	2024-08-28 13:22:20 +03:00
Johannes Gäßler	24d8534bd8	CPU/CUDA: Gemma 2 FlashAttention support (llama/8542) * CPU/CUDA: Gemma 2 FlashAttention support * apply logit_softcap to scale in kernel * disable logit softcapping tests on Metal * remove metal check	2024-08-28 13:22:20 +03:00
Akarshan Biswas	9b16ddd3a5	Add a space to supress a cmake warning (llama/9133)	2024-08-28 13:22:20 +03:00
luoyu-intel	32f88af17b	Add oneDNN primitive support (llama/9091) * add onednn * add sycl_f16 * add dnnl stream * add engine map * use dnnl for intel only * use fp16fp16fp16 * update doc	2024-08-28 13:22:20 +03:00
compilade	9bf7250bf9	llama : simplify Mamba with advanced batch splits (llama/8526) * llama : advanced batch splits This includes equal-sequence-length batch splits which are useful to simplify recurrent model operators. * llama : always make recurrent state slots contiguous * ggml : simplify mamba operators * llama : fix integer signedness mixing * llama : logits_all has priority over batch->logits Otherwise, the server embeddings tests failed. This was likely an existing problem but was only detected here because of an additional assertion. * llama : apply suggestions Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * llama : fix t5 segfault * llama : fix Mamba session save and restore * llama : minor cosmetic changes * llama : rename llama_reorder_outputs to llama_output_reorder Also move it closer to llama_output_reserve. * llama : fix pooled embeddings when using batches with equal_seqs * minor : add struct members for clarity ggml-ci * llama : fix T5 segfault again * llama : fix Mamba pooled embeddings with multiple sequences Until the pooled embeddings are refactored to allow splitting across ubatches for causal embeddings, recurrent models can only process a single sequence per ubatch when calculating pooled embeddings. * llama : add llama_model_is_recurrent to simplify figuring that out This will make it easier to more cleanly support RWKV-v6 and Mamba-2. * llama : fix simple splits when the batch contains embeddings --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-08-28 13:22:20 +03:00
Meng, Hengyu	17e49d3ab2	fallback mmvq (llama/9088) * fallback mmvq to mul_mat * mmvq in cuda path * Update ggml/src/ggml-sycl.cpp Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com> --------- Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>	2024-08-28 13:22:20 +03:00
zhentaoyu	58b725282a	Fix SYCL `im2col` and `convert` Overflow with Large Dims (llama/9052) * sycl: fix im2col overflow and sync with cuda Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * sycl: fix convert overflow Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * sycl: fix convert and dequantize Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * sycl: fix ib in dmmv Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * sycl:refine convert Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * sycl: move downsample global_range into common Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * test: add im2col and convert test cases Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * test: make new cases only in sycl Signed-off-by: zhentaoyu <zhentao.yu@intel.com> * test: comment new test_cases for only local testing Signed-off-by: zhentaoyu <zhentao.yu@intel.com> --------- Signed-off-by: zhentaoyu <zhentao.yu@intel.com>	2024-08-28 13:22:20 +03:00
Radoslav Gerganov	7e59afa1e0	rpc : print error message when failed to connect endpoint (llama/9042)	2024-08-28 13:22:20 +03:00
Radoslav Gerganov	5ac022140e	rpc : prevent crashes on invalid input (llama/9040) Add more checks which prevent RPC server from crashing if invalid input is received from client	2024-08-28 13:22:20 +03:00
Nico Bosshard	0eaa67280c	ggml : dynamic ggml_sched_max_splits based on graph_size (llama/9047) * ggml : Dynamic ggml_sched_max_splits based on graph_size * Fixed and readded debug code for causes	2024-08-28 13:22:20 +03:00
Georgi Gerganov	5a62fdb735	cmake : remove unused option GGML_CURL (llama/9011)	2024-08-28 13:22:20 +03:00
Daniel Bevenius	60098d6204	ggml : move rope type enum to ggml.h (llama/8949) * ggml : move rope type enum to ggml.h This commit moves the `llama_rope_type` enum from `llama.h` to `ggml.h` and changes its name to `ggml_rope_type`. The motivation for this change is to address the TODO in `llama.h` and use the enum in ggml. Note: This commit does not change the `mode` parameter to be of type `enum ggml_rope_type`. The name `mode` and its usage suggest that it might be more generic and possibly used as a bit field for multiple flags. Further investigation/discussion may be needed to determine if `mode` should be restricted to RoPE types. * squash! ggml : move rope type enum to ggml.h This commit removes GGML_ROPE_TYPE_NONE and GGML_ROPE_TYPE_GLM from ggml.h, and back the llama_rope_type enum. I've kept the assert for GGML_ROPE_TYPE_GLM as I'm not sure if it is safe to remove it yet. * squash! ggml : move rope type enum to ggml.h This commit removes the enum ggml_rope_type from ggml.h and replaces it with a define (GGML_ROPE_TYPE_NEOX). This define is used in the code to check if the mode is set to GPT-NeoX. Also the enum llama_rope_type has been updated to reflect this change. * squash! ggml : move rope type enum to ggml.h This commit contains a suggestion enable the GGML_ROPE_TYPE_NEOX macro/define to be passed to the shader compiler. * squash! ggml : move rope type enum to ggml.h This commit fixes the editorconfig-checker warnings. * squash! ggml : move rope type enum to ggml.h Update comment for ggml_rope function. * Revert "squash! ggml : move rope type enum to ggml.h" This reverts commit 6261222bd0dc0efd51f0fb0435ad3f16a5b52fd6. * squash! ggml : move rope type enum to ggml.h Add GGML_ROPE_TYPE_NEOX to rope_common.comp. * remove extra line --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-08-28 13:22:20 +03:00
DavidKorczynski	317293e6a7	ggml: fix div-by-zero (llama/9003) Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70724 In order to access the above bug you need to login using one of the emails in https://github.com/google/oss-fuzz/blob/master/projects/llamacpp/project.yaml#L3-L5 Signed-off-by: David Korczynski <david@adalogics.com>	2024-08-28 13:22:20 +03:00
Markus Tavenrath	488a966c07	Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. (llama/8943) * Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. - Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove. - ggml_vk_sync_buffer introduce a full pipeline sync which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader read/writes and transfers seems to be sufficient looking at the code which either launches compute kernels or copies tensors. * Fix small typo --------- Co-authored-by: 0cc4m <picard12@live.de>	2024-08-28 13:22:20 +03:00
Johannes Gäßler	8954769aa2	feat: ref. cross entropy, add CUDA, fix grad test (ggml/929)	2024-08-28 13:22:20 +03:00
Johannes Gäßler	df06468d9e	ggml: remove bad assert (ggml/928)	2024-08-28 13:22:20 +03:00
Johannes Gäßler	1fbd828a5d	examples: add MNIST training + missing ops	2024-08-28 13:22:20 +03:00
Georgi Gerganov	9e3c5345cd	sync : ggml vulkan (ggml/0) ggml-ci	2024-08-21 11:07:13 +03:00
Radoslav Gerganov	b6c05ce82f	yolo : add backend support (ggml/924) * yolo : add backend support * metal : add sub and sqrt kernels --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-08-21 11:07:13 +03:00
Daniel Bevenius	52c80cac00	ggml : fix typo in ggml-quants.c comment (ggml/922)	2024-08-21 11:07:13 +03:00
Ronsor	3643120690	feat: add new `sin` and `cos` operators (ggml/919) * ggml : add sin/cos operators * ggml-cuda : add sin/cos operators * ggml : add corresponding tests for sin/cos * ggml : add backward computation for sin/cos operators * ggml-vulkan : add sin/cos operators * ggml-vulkan : add sin/cos shader source * metal : add sin, cos --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-08-21 11:07:13 +03:00
Salvatore Mesoraca	993f0df419	ggml : support forward pass broadcasting in ggml_sub (ggml/914) * ggml: support forward pass broadcasting in ggml_sub Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com> * Use assert instead of GGML_ASSERT in ggml_compute_forward_sub_f32 The check is already performed in ggml_sub_impl Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com> --------- Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>	2024-08-12 11:58:49 +03:00
slaren	9b1788483c	metal : fix uninitialized abort_callback (llama/8968)	2024-08-12 11:58:49 +03:00
Georgi Gerganov	ad37d26983	rpc : sanitize tensor data + warnings (llama/0) Co-authored-by: slaren <slarengh@gmail.com>	2024-08-12 11:58:46 +03:00
Mengqing Cao	81c999fe0a	cann : add Ascend NPU support (#2336) * enable Ascend NPU in src/whisper.cpp * sync test-backend-ops with llama.cpp	2024-08-09 15:21:56 +03:00
hipudding	be88ee1d75	ggml : add CANN backend (llama/0) ggml-ci	2024-08-09 09:58:16 +03:00
slaren	ee14c02365	ggml-backend : fix async copy from CPU (llama/8897) * ggml-backend : fix async copy from CPU * cuda : more reliable async copy, fix stream used when the devices are the same	2024-08-08 22:48:46 +03:00
Ouadie EL FAROUKI	ab39dd34e1	Updated SYCL device filtering (llama/8901) * Updated device filter to depend on default_selector (fixes non-intel device issues) * Small related update to example/sycl Readme	2024-08-08 22:48:46 +03:00
Johannes Gäßler	b1348d3530	CUDA/HIP: fix tests/test-backend-ops (llama/8896)	2024-08-08 22:48:46 +03:00
Johannes Gäßler	90641b5cf4	CUDA: fix padding logic for FP16/FP32 (llama/8884)	2024-08-08 22:48:46 +03:00
Molly Sophia	4160b930f1	ggml : add epsilon as a parameter for group_norm (llama/8818) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-08-08 22:48:46 +03:00
Justine Tunney	7a96e661e4	ggml : fix overflows in elu function (llama/8866) It's helpful to use expm1f(x), because expf(x)-1 will result in overflow for 25% of single-precision floating point numbers.	2024-08-08 22:48:46 +03:00
jdomke	a902fb4ab2	ggml : reading the runtime sve config of the cpu (llama/8709) * ggml : reading the runtime sve config of the cpu * change to one time init to prevent performance drop * prefix variable to avoid possible conflicts * revert xxhash fix and add brackets --------- Co-authored-by: domke <673751-domke@users.noreply.gitlab.com>	2024-08-08 22:48:46 +03:00
Sigbjørn Skjæret	6cb38c3673	Fix conversion of unnormalized BF16->BF16 weights (llama/7843) * add truncate_bf16 * truncate intermediate fp32 if converting bf16 to bf16 * fix masking in __compute_fp32_to_bf16 * np.int16 no longer used * missing cast and additional numpy 2.x fix * ggml-impl : do not flush bf16 subnormals to zero * ggml : add reference fp32 to bf16 conversion The fast version is no longer equivalent for all platforms because of the handling of subnormal values. * gguf-py : remove flush to zero for bf16 subnormals * gguf-py : remove float32 truncation to bf16 Rounding achieves the same thing in the cases where this was used. * missed prototype update in merge * merge cleanup --------- Co-authored-by: Francis Couture-Harpin <git@compilade.net>	2024-08-08 22:48:46 +03:00
Ouadie EL FAROUKI	9cf14ebcbc	Fixing wrong VDR iq4nl value (llama/8812)	2024-08-08 22:48:46 +03:00
matteo	8e39ee171f	ggml-cuda: Adding support for unified memory (llama/8035) * Adding support for unified memory * adding again the documentation about unified memory * refactoring: Moved the unified memory code in the correct location. * Fixed compilation error when using hipblas * cleaning up the documentation * Updating the documentation Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * adding one more case where the PR should not be enabled --------- Co-authored-by: matteo serva <matteo.serva@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-08-08 22:48:46 +03:00
Alex O'Connell	d26250f78c	Build: Only include execinfo.h on linux systems that support it (llama/8783) * Only enable backtrace on GLIBC linux systems * fix missing file from copy * use glibc macro instead of defining a custom one	2024-08-08 22:48:46 +03:00
slaren	5218ea21b8	cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X (llama/8800) * cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X * update asserts * only use dmmv for supported types * add test	2024-08-08 22:48:46 +03:00
l3utterfly	e60be821ce	added android implementation of ggml_print_backtrace_symbols (llama/8751) * added android implementation of ggml_print_backtrace_symbols * Update ggml/src/ggml.c Co-authored-by: slaren <slarengh@gmail.com> * Update ggml/src/ggml.c Co-authored-by: slaren <slarengh@gmail.com> * Update ggml/src/ggml.c Co-authored-by: slaren <slarengh@gmail.com> * Update ggml/src/ggml.c Co-authored-by: slaren <slarengh@gmail.com> * Update ggml/src/ggml.c Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-08-08 22:48:46 +03:00
wangshuai09	19708df884	cann: update cmake (llama/8765)	2024-08-08 22:48:46 +03:00
zhentaoyu	3f190addda	Add `TIMESTEP_EMBEDDING` OP (llama/8707) Signed-off-by: zhentaoyu <zhentao.yu@intel.com>	2024-08-08 22:48:46 +03:00
CarterLi999	b355ee7cfa	ggml: bugfix: fix the inactive elements is agnostic for risc-v vector (llama/8748) In these codes, we want to retain the value that they previously held when mask[i] is false. So we should use undisturbed. With the default agnostic policy of rvv intrinsic, these values can be held or be written with 1s. Co-authored-by: carter.li <carter.li@starfivetech.com>	2024-08-08 22:48:46 +03:00
R0CKSTAR	49ac8872b4	cuda : organize vendor-specific headers into vendors directory (llama/8746) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-08-08 22:48:46 +03:00
Meng, Hengyu	8ef98ae7e3	add conv support (llama/8688)	2024-08-08 22:48:46 +03:00
R0CKSTAR	e471adcfa5	feat: Support Moore Threads GPU (llama/8383) * Update doc for MUSA Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Add GGML_MUSA in Makefile Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Add GGML_MUSA in CMake Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * CUDA => MUSA Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * MUSA adds support for __vsubss4 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Fix CI build failure Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-08-08 22:48:46 +03:00
Borislav Stanimirov	aa816c922c	ggml : ignore more msvc warnings (ggml/906)	2024-08-08 22:48:46 +03:00
Georgi Gerganov	b3264eb266	metal : fix struct name (ggml/912) ggml-ci	2024-08-08 22:48:46 +03:00
Conrad Kramer	eb2eb87a58	metal : add abort callback (ggml/905)	2024-08-08 22:48:46 +03:00
0cc4m	83fcb0e486	vulkan : implement Stable Diffusion operators (ggml/904) * Fix Vulkan repeat op * Implement Vulkan concat op * Delete old Vulkan shader generator * Implement Vulkan im2col op * Implement Vulkan unary gelu_quick op * Implement Vulkan group_norm op * Implement Vulkan timestep_embedding op * Implement Vulkan upscale op * Fix Vulkan vk_context tensor extra index issue * Fix Vulkan matmul shader parameter bug * Properly fix Vulkan matmul shader parameter bug * Add Vulkan ADD f16 + f32 -> f16 operator support * Implement Vulkan tanh op * Fix Vulkan group count too large Validation error on non-Nvidia GPUs * Throw error when too much memory is requested * Fix another Vulkan group count too large Validation error on non-Nvidia GPUs * Fix matmul MMQ condition * Implement Vulkan pad op * Fix Vulkan crash when tensor is used multiple times in a compute graph * Add Vulkan CONCAT f16 + f16 -> f16 op * Add Vulkan LEAKY_RELU op	2024-08-08 22:48:46 +03:00
Daniel Bevenius	f7bb412878	ggml : move c parameter comment to ggml_rope_ext (ggml/901) This commit moves the comment for the c parameter from ggml_rope to ggml_rope_ext. The comment is currently incorrect as ggml_rope does not have a c parameter (freq_factors tensor). Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-08-08 22:48:46 +03:00
Georgi Gerganov	ef6dcf0d0c	ggml : resolve sync conflicst (ggml/0) ggml-ci	2024-08-08 22:48:46 +03:00
Dibakar Gope	525f190917	ggml : add ggml-aarch64 (ggml/0)	2024-08-08 22:48:46 +03:00
slaren	dd916a2852	ggml : reduce hash table reset cost (llama/8698) * ggml : reduce hash table reset cost * fix unreachable code warnings after GGML_ASSERT(false) * GGML_ASSERT(false) -> GGML_ABORT("fatal error") * GGML_ABORT use format string	2024-08-08 22:48:46 +03:00
DavidKorczynski	0620fe00ec	ggml: handle ggml_init failure to fix NULL pointer deref (llama/8692) `ggml_init` can fail if no unused context is found. In that case, a NULL-pointer deref will happen later in the code during a call to `ggml_set_on_alloc`. This fixes it by bailing out if no context is found.	2024-08-08 22:48:46 +03:00
Chen Xi	31d0a9a14f	fix multi-gpu issue on sycl (llama/8554) --------- Signed-off-by: Chen Xi <xi2chen@intel.com> Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>	2024-08-08 22:48:46 +03:00
Georgi Gerganov	c06970dd72	ggml : add and use ggml_cpu_has_llamafile() (llama/8664)	2024-08-08 22:48:46 +03:00
Joe Todd	7598acf525	Re-add erroneously removed -fsycl from GGML_EXTRA_LIBS (llama/8667)	2024-08-08 22:48:46 +03:00
Joe Todd	43ddfce969	sycl : Add support for non-release DPC++ & oneMKL (llama/8644) * Update cmake to support nvidia hardware & open-source compiler --------- Signed-off-by: Joe Todd <joe.todd@codeplay.com>	2024-08-08 22:48:46 +03:00
0cc4m	a7e6d2cd9c	Vulkan IQ4_NL Support (llama/8613) * Fix Vulkan matmul tests compile errors * Add Vulkan IQ4_NL support * Fix Vulkan DeepSeek-Coder-V2-Lite MoE support	2024-08-08 22:48:46 +03:00
Jeroen Mostert	86506b0c5c	Allow all RDNA2 archs to use sdot4 intrinsic (llama/8629) The check gating the use of `__builtin_amdgc_sdot4` specifically checks for gfx1030. This causes a severe perf regression for anything gfx103? that's not gfx1030 and not using `HSA_OVERRIDE_GFX_VERSION` (if you've built ROCm to support it). We already have a generic RDNA2 define, let's use it.	2024-08-08 22:48:46 +03:00
luoyu-intel	11182fae34	fix scratch size of softmax (llama/8642)	2024-08-08 22:48:46 +03:00
Mark Zhuang	0bc8bffe1d	ggml: fix compile error for RISC-V (llama/8623)	2024-08-08 22:48:46 +03:00
Johannes Gäßler	8c4f30497a	CUDA: MMQ code deduplication + iquant support (llama/8495) * CUDA: MMQ code deduplication + iquant support * 1 less parallel job for CI build	2024-08-08 22:48:46 +03:00
Georgi Gerganov	b1ee3a8444	gguf : handle null name during init (llama/8587)	2024-08-08 22:48:46 +03:00
slaren	be9a16fd3f	ggml : fix quant dot product with odd number of blocks (llama/8549) * ggml : fix iq4_nl dot product with odd number of blocks * ggml : fix odd blocks for ARM_NEON (llama/8556) * ggml : fix iq4_nl dot product with odd number of blocks * ggml : fix q4_1 * ggml : fix q5_0 * ggml : fix q5_1 * ggml : fix iq4_nl metal ggml-ci * ggml : fix q4_0 * ggml : fix q8_0 ggml-ci * ggml : remove special Q4_0 code for first 2 blocks * ggml : fix sumf redefinition --------- Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-08-08 22:48:46 +03:00
Clint Herron	f4d9a95b0f	ggml : add friendlier error message to fopen errors (llama/8575) * Add additional error information when model files fail to load. * Adding additional error information to most instances of fopen.	2024-08-08 22:48:46 +03:00
Johannes Gäßler	a8ab3abe09	CUDA: fix partial offloading for ne0 % 256 != 0 (llama/8572)	2024-08-08 22:48:46 +03:00
65a	fb6a835938	cmake : install all ggml public headers (llama/8480) Co-authored-by: 65a <65a@65a.invalid>	2024-08-08 22:48:46 +03:00
hipudding	8923bb4292	Add Ascend NPU backend (llama/6035) * [CANN] Add Ascend NPU backend Ascend is a full-stack AI computing infrastructure for industry applications and services based on Huawei Ascend processors and software. CANN (Compute Architecture of Neural Networks), developped by Huawei, is a heterogeneous computing architecture for AI. Co-authored-by: wangshuai09 <391746016@qq.com> * delete trailing whitespaces * Modify the code based on review comment * Rename LLAMA_CANN to GGML_CANN * Make ggml-common.h private * add ggml_cann prefix for acl funcs * Add logging for CANN backend * Delete Trailing whitespace --------- Co-authored-by: wangshuai09 <391746016@qq.com>	2024-08-08 22:48:46 +03:00
Johannes Gäßler	fcba6aa352	make/cmake: add missing force MMQ/cuBLAS for HIP (llama/8515)	2024-08-08 22:48:46 +03:00
Xuan Son Nguyen	8807fe608b	Refactor lora adapter support (llama/8332) * lora: load to devide buft * add patch tensor function * correct tensor patch * llama_lora_adapter_apply * correct ggml_backend_tensor_copy * add llm_build_mm * fix auto merge * update based on review comments * add convert script * no more transpose A * add f16 convert * add metadata check * add sanity check * fix ftype * add requirements * fix requirements * fix outfile * conversion: only allow selected models * fix types * cuda : do not use dmmv if the tensor does not have enough cols * llama : lora fixes * do not disable mmap with lora Co-authored-by: slaren <slarengh@gmail.com> * llm_build_lora_mm_id * convert_lora : MoE LoRA conversion support * convert_lora : prefer safetensors, similarly to convert_hf * convert_hf : simplify modify_tensors for InternLM2 * convert_lora : lazy conversion * llama : load and use alpha from LoRA adapters * llama : use llm_build_lora_mm in most model graphs * auto scale * Revert "auto scale" This reverts commit 42415a4874e0f963e4aca6796ea5dfb97cd17464. * remove redundant params * Apply suggestions from code review Co-authored-by: slaren <slarengh@gmail.com> * change kv metadata * move add_type to __init__ * convert_hf : move add_type to main() * convert_lora : use the GGUFWriter from Model instead of overwriting it --------- Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Francis Couture-Harpin <git@compilade.net>	2024-08-08 22:48:46 +03:00
Meng, Hengyu	3e94c7a81d	add concat through dim 1/2 (llama/8483) * add concat through dim 1/2	2024-08-08 22:48:46 +03:00
0cc4m	77af3254e1	Vulkan MMQ Fix (llama/8479) * Fix incoherence by adding missing LOAD_VEC_A parameter * Fix Vulkan op result checker build error	2024-08-08 22:48:46 +03:00
bandoti	d4b3cffec4	vulkan : cmake integration (llama/8119) * Add Vulkan to CMake pkg * Add Sycl to CMake pkg * Add OpenMP to CMake pkg * Split generated shader file into separate translation unit * Add CMake target for Vulkan shaders * Update README.md * Add make target for Vulkan shaders * Use pkg-config to locate vulkan library * Add vulkan SDK dep to ubuntu-22-cmake-vulkan workflow * Clean up tabs * Move sudo to apt-key invocation * Forward GGML_EXTRA_LIBS to CMake config pkg * Update vulkan obj file paths * Add shaderc to nix pkg * Add python3 to Vulkan nix build * Link against ggml in cmake pkg * Remove Python dependency from Vulkan build * code review changes * Remove trailing newline * Add cflags from pkg-config to fix w64devkit build * Update README.md * Remove trailing whitespace * Update README.md * Remove trailing whitespace * Fix doc heading * Make glslc required Vulkan component * remove clblast from nix pkg	2024-08-08 22:48:46 +03:00
Georgi Gerganov	b852a4c5ca	metal : template-ify some of the kernels (llama/8447) ggml-ci	2024-08-08 22:48:46 +03:00
Georgi Gerganov	2157abaab4	ggml : minor naming changes (llama/8433) * ggml : minor naming changes ggml-ci * ggml : use PRId64 [no ci] * ggml : revert FA K/Q names	2024-08-08 22:48:46 +03:00
Chen Xi	68d609a12c	fix the mul_mat_id ut issues (llama/8427) * fix part of mul_mat_id * skip the bfloat 16 sycl ut Signed-off-by: Chen Xi <xi2chen@intel.com> --------- Signed-off-by: Chen Xi <xi2chen@intel.com> Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com> Co-authored-by: Chen Xi <xi2chen@intel.com>	2024-08-08 22:48:46 +03:00
Nicholai Tukanov	5a8ae474f0	ggml : add NVPL BLAS support (ggml/8329) (llama/8425) * ggml : add NVPL BLAS support * ggml : replace `<BLASLIB>_ENABLE_CBLAS` with `GGML_BLAS_USE_<BLASLIB>` --------- Co-authored-by: ntukanov <ntukanov@nvidia.com>	2024-08-08 22:48:46 +03:00
Daniel Bevenius	84493d7f3e	cuda : suppress 'noreturn' warn in no_device_code (llama/8414) * cuda : suppress 'noreturn' warn in no_device_code This commit adds a while(true) loop to the no_device_code function in common.cuh. This is done to suppress the warning: ```console /src/ggml-cuda/template-instances/../common.cuh:346:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn] 346 | } | ^ ``` The motivation for this is to reduce the number of warnings when compilng with GGML_HIPBLAS=ON. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * squash! cuda : suppress 'noreturn' warn in no_device_code Update __trap macro instead of using a while loop to suppress the warning. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-08-08 22:48:46 +03:00
Johannes Gäßler	15d71189e9	CUDA: optimize and refactor MMQ (llama/8416) * CUDA: optimize and refactor MMQ * explicit q8_1 memory layouts, add documentation	2024-08-08 22:48:46 +03:00
AidanBeltonS	37e962580f	Use multi_ptr to clean up deprecated warnings (llama/8256)	2024-08-08 22:48:46 +03:00
Georgi Gerganov	db0ea7a2f2	ggml : move sgemm sources to llamafile subfolder (llama/8394) ggml-ci	2024-08-08 22:48:46 +03:00
Dibakar Gope	5498b0e6c0	ggml : add AArch64 optimized GEMV and GEMM Q4 kernels (llama/5780) * Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization * Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions * Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions * Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions * Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions * Arm AArch64: add copyright claim only to ggml-aarch64.cpp and ggml-aarch64.h files * Arm AArch64: minor code refactoring for rebase * Arm AArch64: minor code refactoring for resolving a build issue with cmake * Arm AArch64: minor code refactoring to split the Q4_0_AARC64 type into three separate types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8 * Arm AArch64: minor code change for resolving a build issue with server-windows * retrigger checks * Arm AArch64: minor code changes for rebase * Arm AArch64: minor changes to skip the pr#7433 vec_dot code for arm cpus with SVE VL not equal to 256 bits * Arm AArch64: remove stale LLAMA_QKK_64 from CMakeLists.txt and delete build.zig * Arm AArch64: add reference scalar gemm and gemv, and avoid dynamic memory allocations during quantization for Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8 * Arm AArch64: add multithreaded quantization support for the new types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8 * Arm AArch64: minor code refactoring * Arm AArch64: simplify logic for calling gemm and gemv functions in ggml_compute_forward_mul_mat * Arm AArch64: minimize changes in ggml_compute_forward_mul_mat * Arm AArch64: minor code refactoring, and add reference scalar code to quantize routines for new quant types * Arm AArch64: minor code refactoring * Arm AArch64: minor code refactoring * Arm AArch64: minor code refactoring * rebase on the latest master commit 3fd62a6 and adapt to the new directory structure * Arm AArch64: remove a redundant comment * Arm AArch64: add pragma in ggml-aarch64.c to turn -Woverlength-strings warning off * Arm AArch64: use __aarch64__ check to guard 64-bit neon kernels * Arm AArch64: update docs/build.md README to include compile time flags for buiilding the Q4_0_4_4 quant type	2024-08-08 22:48:46 +03:00
Alberto Cabrera Pérez	2af4a52c39	sycl : Reenabled mmvq path for the SYCL Nvidia Backend (llama/8372) * SYCL : Reenabled mmvq path for the SYCL Nvidia Backend * Reduced verbosity of comment	2024-08-08 22:48:46 +03:00
Alberto Cabrera Pérez	eee2fe882e	sycl : fix powf call in device code (llama/8368)	2024-08-08 22:48:46 +03:00
Mahesh Madhav	0d1a11e5e2	ggml : loop tiling optimizations for scalar path (ggml/898) Apply a loop tiling technique to the generic path, which provides performance upside for ISAs with enough registers to take advantage of it. Also helps the compiler optimize this path.	2024-08-08 22:48:46 +03:00
Ivan Filipov	b2ead7d6f4	ggml: add support for float16 input tensors in pooling operations (ggml/895) * Add support for float16 tensors in 1d pooling operations * Add support for float16 input tensors in 2d pooling operations * code cleanup remove unnecessary casting during srow ptr initialization --------- Co-authored-by: vanaka11 <vanaka1189@gmail.com>	2024-08-08 22:48:46 +03:00
Tony Wasserka	8da6fd4dff	vulkan : initialize vk_buffer_struct members to VK_NULL_HANDLE (ggml/893) This prevents invalid frees when destroying a partially initialized vk_buffer_struct. For example, this could happen in ggml_vk_create_buffer when running out of device memory. Co-authored-by: Tony Wasserka <neobrain@users.noreply.github.com>	2024-08-08 22:48:46 +03:00
Borislav Stanimirov	ab8ec9e940	cmake : only enable GGML_NATIVE and x86 flags if not crosscompiling (ggml/885)	2024-08-08 22:48:46 +03:00
Matt Stephenson	f68298ce06	whisper : use vulkan as gpu backend when available (#2302) * ggml: use vulkan as gpu backend when available Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com> * whisper: enable using vk as default buffer type Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com> --------- Signed-off-by: Matt Stephenson <mstephenson6@users.noreply.github.com>	2024-07-16 10:21:09 +03:00
Georgi Gerganov	49868aa851	ggml : sync sycl (skip) (#0)	2024-07-08 14:53:55 +03:00
Daniel Bevenius	95f2a191c0	ggml : remove unnecessary UNUSED macro call (ggml/880) This commit removes an UNUSED macro call that is not needed as the variable n0 is used in the code and will not produce a warning. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-07-08 14:53:55 +03:00
Natsu	00422ec3cf	cmake : add GGML_BUILD and GGML_SHARED macro definitions (llama/8281)	2024-07-08 14:53:55 +03:00
Ouadie EL FAROUKI	c5b05321e9	Enabled more data types for oneMKL gemm_batch (llama/8236)	2024-07-08 14:53:55 +03:00
Johannes Gäßler	5dc636a65a	CUDA: MMQ support for iq4_nl, iq4_xs (llama/8278)	2024-07-08 14:53:55 +03:00
Daniele	73703a144f	CUDA: revert part of the RDNA1 optimizations (llama/8309) The change on the launch_bounds was causing a small performance drop in perplexity of 25 t/s	2024-07-08 14:53:55 +03:00
Johannes Gäßler	e89fdceec2	CUDA: fix MMQ stream-k rounding if ne00 % 128 != 0 (llama/8311)	2024-07-08 14:53:55 +03:00

164 Commits