whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-08-15 17:02:31 +02:00

Author	SHA1	Message	Date
mky_coder	bf4cb4abad	whisper : optimize fft() function (#2242) Co-authored-by: Mike Fan <60965742+mike-fzy@users.noreply.github.com>	2024-06-18 18:10:33 +03:00
Georgi Gerganov	e293f17d34	talk-llama : sync llama.cpp	2024-06-18 09:45:37 +03:00
Georgi Gerganov	5d950c4b8d	whisper : use ggml_backend_sched (#2239) * whisper : use ggml_backend_sched (wip) * use sched in whisper_allocr * whisper : single backend in whisper_context * whisper : remove whisper_state->backends_used * whisper : remove whisper_context->backend * whisper : reset scheduler after init * whisper : fix external encoder (e.g. CoreML) * whisper : cleanup * whisper : handle null GPU buffer types + fix sycl --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-18 09:39:40 +03:00
Georgi Gerganov	820446e230	fix : remove extra files	2024-06-18 09:39:40 +03:00
Georgi Gerganov	54d5823ebe	scripts : sync ggml-blas	2024-06-18 09:39:40 +03:00
Georgi Gerganov	5181494e9f	build : update make / cmake	2024-06-18 09:39:40 +03:00
Georgi Gerganov	4a6e6e8b30	sync : ggml	2024-06-18 09:39:40 +03:00
slaren	de29b193f6	move BLAS to a separate backend (cont) (llama/6210) ggml-ci	2024-06-18 09:39:40 +03:00
0cc4m	922971041b	Vulkan Shader Refactor, Memory Debugging Option (llama/7947) * Refactor shaders, extract GLSL code from ggml_vk_generate_shaders.py into vulkan-shaders directory * Improve debug log code * Add memory debug output option * Fix flake8 * Fix unnecessary high llama-3 VRAM use	2024-06-18 09:39:40 +03:00
Georgi Gerganov	63a767a134	scripts : stop sync whisper example from ggml	2024-06-18 09:39:40 +03:00
Georgi Gerganov	30841fa786	cmake : fix sycl build (#0)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	3b1ac03828	ggml : remove OpenCL (#0)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	990de617b5	sycl : sync (#0)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	6975600b4b	cuda : enable CUDA graphs (#0)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	061eeb9f61	talk-llama : sync llama.cpp	2024-06-16 18:19:48 +03:00
Georgi Gerganov	4942b1b428	cmake : fix CUDA build (#0)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	3c7cc5c437	sync : ggml ggml-ci	2024-06-16 18:19:48 +03:00
Hong Bo PENG	5cd42ee2cc	ggml : fix and optimize ppc64le (ggml/849) * fix compile issues introduced by loongarch_asx * restore quant changes to merge * fix compile issues introduced by loongarch_asx * further optimize by using vec_msum & vec_sum4s on ppc64le	2024-06-16 18:19:48 +03:00
Daniel Bevenius	ee718f3da6	ggml : remove duplicate include of ggml-common.h (ggml/853) Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-06-16 18:19:48 +03:00
Meng, Hengyu	63eac1f608	remove global variables (llama/7710) * separate DPCT helpers outside * replace global variables with context * remove useless extra * update mul_mat condition * remove duplicate buft initialization * remove duplicate extra and global work group size * remove useless backend check * remove duplicated extras * use macro for group_size and remove cuda-related	2024-06-16 18:19:48 +03:00
Johannes Gäßler	b17ba2815b	CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (llama/7921) * CUDA: faster q2_K, q3_K MMQ + int8 tensor cores * try CI fix * try CI fix * try CI fix * fix data race * revert q2_K precision related changes	2024-06-16 18:19:48 +03:00
Georgi Gerganov	7a489af2f3	metal : utilize max shared memory for mul_mat_id (llama/7935)	2024-06-16 18:19:48 +03:00
Radoslav Gerganov	4a4ea13d6d	rpc : fix ggml_backend_rpc_supports_buft() (llama/7918)	2024-06-16 18:19:48 +03:00
slaren	174a461fc6	move BLAS to a separate backend (llama/6210) * move BLAS to a separate backend * rename GGML_USE_OPENBLAS to GGML_USE_BLAS * alloc : reuse same buffer when the same buffer type is used multiple times * set number of threads automatically for openblas and blis * sched : print assignments when GGML_SCHED_DEBUG env variable is set * sched : allow ops with weights on an incompatible buffer type This will cause the weight to be copied to a backend that supports the op, which is very costly. The weight should have been stored in a buffer of a backend that can run the op, but llama.cpp cannot do this automatically at the moment. --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-16 18:19:48 +03:00
Johannes Gäßler	d8b7a24bc9	CUDA: fix broken oob check for FA vec f32 kernel (llama/7904)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	acf3832c9c	tests : add non-cont unary tests (llama/7857) * tests : add non-cont unary tests * ggml : update unary asserts and "supports_op" ggml-ci	2024-06-16 18:19:48 +03:00
Georgi Gerganov	d29ac44303	ggml : improve ggml_is_contiguous logic (llama/7856) * ggml : improve ggml_is_contiguous logic ggml-ci * ggml : support more contiguous cases ggml-ci	2024-06-16 18:19:48 +03:00
k.h.lai	12638dfef0	vulkan: select only one device for single gpu with multiple drivers (llama/7582)	2024-06-16 18:19:48 +03:00
0cc4m	f100b3b523	Update Vulkan RoPE implementation (llama/7818) * Update Vulkan RoPE implementation * Return nullptr on alloc_buffer when allocation fails, instead of throwing an exception Minor fixes * Fix segfault when running out of VRAM Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-16 18:19:48 +03:00
Johannes Gäßler	a99e213a82	CUDA: int8 tensor cores for MMQ (q4_K, q5_K, q6_K) (llama/7860)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	7483d2b61c	CUDA: use tensor cores for MMQ (llama/7676) * CUDA: int8 tensor cores for MMQ (legacy quants) * fix out-of-bounds writes * __builtin_assume -> GGML_CUDA_ASSUME * fix writeback returning too early	2024-06-16 18:19:48 +03:00
Ben Ashbaugh	1fe5948227	use the correct SYCL context for host USM allocations (llama/7777) Signed-off-by: Ben Ashbaugh <ben.ashbaugh@intel.com>	2024-06-16 18:19:48 +03:00
Johannes Gäßler	760497e1ab	CUDA: revise q8_1 data layout for mul_mat_q (llama/7824)	2024-06-16 18:19:48 +03:00
slaren	b172e7714c	vulkan : reuse parent extra for views (llama/7806) * vulkan : reuse parent extra for views * Fix validation error when multiple compute contexts are used in a graph --------- Co-authored-by: 0cc4m <picard12@live.de>	2024-06-16 18:19:48 +03:00
pengxin99	dc01aadb18	fix softmax r2r result wrong issue (llama/7811)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	e08c62149b	CUDA: refactor mmq, dmmv, mmvq (llama/7716) * CUDA: refactor mmq, dmmv, mmvq * fix out-of-bounds write * struct for qk, qr, qi * fix cmake build * mmq_type_traits	2024-06-16 18:19:48 +03:00
Georgi Gerganov	abab4500fa	ggml : refactor rope norm/neox (llama/7634) * ggml : unify rope norm/neox (CPU) * ggml : fix compile warning * ggml : remove GLM rope mode ggml-ci * metal : better rope implementation ggml-ci * cuda : better rope implementation ggml-ci * naming : n_orig_ctx -> n_ctx_orig ggml-ci * dev : add reminders to update backends ggml-ci * vulkan : fix ggml_rope_ext() usage * cuda : fix array size + indents ggml-ci	2024-06-16 18:19:48 +03:00
agray3	e666315fa8	Allow number of nodes in CUDA graph to change (llama/7738) Previously the code would have failed to cope in the case that the number of nodes changes in an existing CUDA graph. This fixes the issue by removing an unnecessary conditional.	2024-06-16 18:19:48 +03:00
Georgi Gerganov	3f869af14c	ggml : remove OpenCL (llama/7735) ggml-ci	2024-06-16 18:19:48 +03:00
Georgi Gerganov	cbacb7634c	ggml : prevent builds with -ffinite-math-only (llama/7726) This enforces a check that -fno-finite-math-only was set and that the operating compiling mode is not in finite maths mode. This is because during rewriting of silu and softmax for cpu #7154 there emerged an issue where the result that was observed when >1 slot was nondeterministic as found by @JohannesGaessler. @LostRuins narrowed the problem down to -ffinite-math-only which was theorised to be due to SiLU, instead of flushing small values to 0, returns NaN or some other garbage. @jart proposed a fix that @ggerganov then implemented in this fix ref https://github.com/ggerganov/llama.cpp/pull/7154#issuecomment-2145661825	2024-06-16 18:19:48 +03:00
Radoslav Gerganov	6cc3b022ee	llama : offload to RPC in addition to other backends (llama/7640) * llama : offload to RPC in addition to other backends * - fix copy_tensor being called on the src buffer instead of the dst buffer - always initialize views in the view_src buffer - add RPC backend to Makefile build - add endpoint to all RPC object names * add rpc-server to Makefile * Update llama.cpp Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-16 18:19:48 +03:00
Masaya, Kato	e5e38d4920	ggml : use OpenMP as a thread pool (llama/7606) * ggml: Added OpenMP for multi-threads processing * ggml : Limit the number of threads used to avoid deadlock * update shared state n_threads in parallel region * clear numa affinity for main thread even with openmp * enable openmp by default * fix msvc build * disable openmp on macos * ci : disable openmp with thread sanitizer * Update ggml.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-16 18:19:48 +03:00
0cc4m	2a6bab5655	Vulkan Mixture of Experts (MoE) support (llama/7628) * Finish Vulkan mul_mat_id implementation * Add Vulkan sum_rows and div ops * Fix MUL_MAT_ID matrix matrix shader * Fix MUL_MAT_ID matrix vector shader dispatch size * Fix MUL_MAT_ID matrix vector shader and dispatch code * Update Vulkan CPU offload for MUL_MAT_ID * Fix crash when using split mode none and setting a main GPU	2024-06-16 18:19:48 +03:00
woachk	8c01c9b85c	kompute : implement op_getrows_f32 (llama/6403) op_getrows_f32 is required since https://github.com/ggerganov/llama.cpp/pull/6122 for the Vulkan w/ Kompute backend to be functional. As such, implement this op to make this backend functional again.	2024-06-16 18:19:48 +03:00
Dave Airlie	d1123d795e	fix bug introduced in using calloc (llama/7701) compilade pointed this out on the previous MR	2024-06-16 18:19:48 +03:00
Johannes Gäßler	9b3d784020	Fix FlashAttention debug test, FP32 assert (llama/7684)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	a16137d13d	CUDA: fix Pascal FA, deq. KV to FP16 for batch > 8 (llama/7681)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	5582039d0a	CUDA: quantized KV support for FA vec (llama/7527) * CUDA: quantized KV support for FA vec * try CI fix * fix commented-out kernel variants * add q8_0 q4_0 tests * fix nwarps > batch size * split fattn compile via extern templates * fix flake8 * fix metal tests * fix cmake * make generate_cu_files.py executable * add autogenerated .cu files * fix AMD * error if type_v != FP16 and not flash_attn * remove obsolete code	2024-06-16 18:19:48 +03:00
Georgi Gerganov	9a16c643e2	ggml : fix loongson compile warnings (llama/7537) * ggml : fix loongson compile warnings ggml-ci * Fix loongarch quantize test fail. Fix unexpected error introduced during rebase code. * tests : disable json test due to lack of python on the CI node ggml-ci --------- Co-authored-by: junchao-loongson <zhaojunchao@loongson.cn>	2024-06-16 18:19:48 +03:00
Chris Elrod	10a8a23100	faster avx512 exp implementation (llama/7551) * faster avx512 exp implementation * x->r * improve accuracy, handle special cases * remove `e`	2024-06-16 18:19:48 +03:00

1420 Commits