whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2024-11-07 16:44:13 +01:00

Author	SHA1	Message	Date
Georgi Gerganov	cbacb7634c	ggml : prevent builds with -ffinite-math-only (llama/7726) This enforces a check that -fno-finite-math-only was set and that the compiler is not operating in finite-math mode. During the rewrite of SiLU and softmax for CPU (#7154), an issue emerged where the result observed with >1 slot was nondeterministic, as found by @JohannesGaessler. @LostRuins narrowed the problem down to -ffinite-math-only; the theory was that SiLU, instead of flushing small values to 0, returns NaN or some other garbage. @jart proposed a fix that @ggerganov then implemented in this commit. ref https://github.com/ggerganov/llama.cpp/pull/7154#issuecomment-2145661825	2024-06-16 18:19:48 +03:00
Radoslav Gerganov	6cc3b022ee	llama : offload to RPC in addition to other backends (llama/7640) * llama : offload to RPC in addition to other backends * - fix copy_tensor being called on the src buffer instead of the dst buffer - always initialize views in the view_src buffer - add RPC backend to Makefile build - add endpoint to all RPC object names * add rpc-server to Makefile * Update llama.cpp Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-16 18:19:48 +03:00
Masaya, Kato	e5e38d4920	ggml : use OpenMP as a thread pool (llama/7606) * ggml: Added OpenMP for multi-threads processing * ggml : Limit the number of threads used to avoid deadlock * update shared state n_threads in parallel region * clear numa affinity for main thread even with openmp * enable openmp by default * fix msvc build * disable openmp on macos * ci : disable openmp with thread sanitizer * Update ggml.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-16 18:19:48 +03:00
0cc4m	2a6bab5655	Vulkan Mixture of Experts (MoE) support (llama/7628) * Finish Vulkan mul_mat_id implementation * Add Vulkan sum_rows and div ops * Fix MUL_MAT_ID matrix matrix shader * Fix MUL_MAT_ID matrix vector shader dispatch size * Fix MUL_MAT_ID matrix vector shader and dispatch code * Update Vulkan CPU offload for MUL_MAT_ID * Fix crash when using split mode none and setting a main GPU	2024-06-16 18:19:48 +03:00
woachk	8c01c9b85c	kompute : implement op_getrows_f32 (llama/6403) op_getrows_f32 is required since https://github.com/ggerganov/llama.cpp/pull/6122 for the Vulkan w/ Kompute backend to be functional. As such, implement this op to make this backend functional again.	2024-06-16 18:19:48 +03:00
Dave Airlie	d1123d795e	fix bug introduced in using calloc (llama/7701) compilade pointed this out on the previous MR	2024-06-16 18:19:48 +03:00
Johannes Gäßler	9b3d784020	Fix FlashAttention debug test, FP32 assert (llama/7684)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	a16137d13d	CUDA: fix Pascal FA, deq. KV to FP16 for batch > 8 (llama/7681)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	5582039d0a	CUDA: quantized KV support for FA vec (llama/7527) * CUDA: quantized KV support for FA vec * try CI fix * fix commented-out kernel variants * add q8_0 q4_0 tests * fix nwarps > batch size * split fattn compile via extern templates * fix flake8 * fix metal tests * fix cmake * make generate_cu_files.py executable * add autogenerated .cu files * fix AMD * error if type_v != FP16 and not flash_attn * remove obsolete code	2024-06-16 18:19:48 +03:00
Georgi Gerganov	9a16c643e2	ggml : fix loongson compile warnings (llama/7537) * ggml : fix loongson compile warnings ggml-ci * Fix loongarch quantize test fail. Fix unexpected error introduced during rebase code. * tests : disable json test due to lack of python on the CI node ggml-ci --------- Co-authored-by: junchao-loongson <zhaojunchao@loongson.cn>	2024-06-16 18:19:48 +03:00
Chris Elrod	10a8a23100	faster avx512 exp implementation (llama/7551) * faster avx512 exp implementation * x->r * improve accuracy, handle special cases * remove `e`	2024-06-16 18:19:48 +03:00
junchao-loongson	29cfeef77f	ggml : fix loongarch build (O2 issue) (llama/7636)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	e66e9ea25b	metal : remove invalid asserts (llama/7617)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	276779a849	metal : add missing asserts (llama/7617)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	1f35ce61c1	ggml : fix YARN + add tests + add asserts (llama/7617) * tests : add rope tests ggml-ci * ggml : fixes (hopefully) ggml-ci * tests : add non-cont tests ggml-ci * cuda : add asserts for rope/norm + fix DS2 ggml-ci * ggml : assert contiguousness * tests : reduce RoPE tests ggml-ci	2024-06-16 18:19:48 +03:00
Georgi Gerganov	4b19cc3ed4	cuda : non-cont concat support (llama/7610) * tests : add non-cont concat tests * cuda : non-cont concat support ggml-ci	2024-06-16 18:19:48 +03:00
Radoslav Gerganov	a535d348dd	llama-bench : add support for the RPC backend (llama/7435)	2024-06-16 18:19:48 +03:00
slaren	8f5dc729d9	ggml : use atomic_flag for critical section (llama/7598) * ggml : use atomic_flag for critical section * add windows shims	2024-06-16 18:19:48 +03:00
Georgi Gerganov	02fc147a0b	examples : adapt to new ggml_concat (ggml/0)	2024-06-16 18:19:48 +03:00
zhouwg	109148ac84	ggml : fix typo in ggml.c (llama/7603)	2024-06-16 18:19:48 +03:00
Meng, Hengyu	3563473d2c	Align GEMM dispatch (llama/7566) * align GEMM dispatch	2024-06-16 18:19:48 +03:00
Georgi Gerganov	046834198d	sycl : fix assert (llama/7563)	2024-06-16 18:19:48 +03:00
k.h.lai	0a2ad9de06	vulkan: properly initialize vulkan devices for LLAMA_SPLIT_MODE_NONE (llama/7552)	2024-06-16 18:19:48 +03:00
Radoslav Gerganov	39b0640b09	rpc : resource management rework (llama/7562) * rpc : resource management rework * address review comments	2024-06-16 18:19:48 +03:00
Neo Zhang	8dca71de64	fix ggml_sycl_mul_mat_id() to match the change of api (llama/7436) * fix mul_mat_id to match the change of api * rm comment * rm unused or duplicated code, rename as review comment	2024-06-16 18:19:48 +03:00
Georgi Gerganov	812787cbc5	ggml : generalize GGML_OP_CONCAT (llama/7563) * ggml : generalize GGML_OP_CONCAT (WIP) ggml-ci * tests : add dim != 2 tests * metal : generalize concat kernel * tests : naming * cuda : generalize concat kernel ggml-ci * sycl : add warning and assert * ggml : fix op params handling * metal : bugfix kernel ggml-ci * ggml : reimplement CPU and Metal * cuda : add asserts ggml-ci * ggml : fix ptrs ggml-ci	2024-06-16 18:19:48 +03:00
Djip007	68ef10805e	update HIP_UMA #7399 (llama/7414) * update HIP_UMA #7399 add use of hipMemAdviseSetCoarseGrain when LLAMA_HIP_UMA is enabled. - get x2 on prompt eval and x1.5 on token gen with rocm6.0 on ryzen 7940HX iGPU (780M/gfx1103) * simplify code, more consistent style --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-16 18:19:48 +03:00
agray3	96fdb90f5f	Allow multiple copy function pointers for CUDA graph kernel param updates (llama/7565) CUDA graphs require parameter updates to kernels associated with GGML_OP_CPY nodes. Previously the implementation only checked for a single CUDA kernel in such nodes, but this caused a bug in cases where 2 such kernels exist. This fixes the issue by using a vector to allow multiple function pointers to be stored and checked against. Fixes #7942	2024-06-16 18:19:48 +03:00
AidanBeltonS	e98f9ac554	Fix q_xxs using mul_mat_q (llama/7459)	2024-06-16 18:19:48 +03:00
AidanBeltonS	02d481595b	Add freq factors (llama/7495)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	7091c7ab5a	metal : add GGML_OP_REPEAT kernels (llama/7557) ggml-ci	2024-06-16 18:19:48 +03:00
Georgi Gerganov	d70ccb75f5	metal : disable FA kernel for HS=256 (llama/7556) ggml-ci	2024-06-16 18:19:48 +03:00
Georgi Gerganov	5ee048eb67	ggml : restore ggml_rope_xpos_inplace (ggml/0) ggml-ci	2024-06-16 18:19:48 +03:00
Masaya, Kato	37ed71c964	ggml: aarch64: SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot (llama/7433) * Add SVE support for q4_0_q8_0 q8_0_q8_0 * remove ifdef	2024-06-16 18:19:48 +03:00
Georgi Gerganov	8cd7a3df37	ggml : silence UB sanitizer error during iq2_xxs quantization (llama/0)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	04a3279320	ggml : remove ggml_flash_attn and ggml_flash_ff (llama/7463) ggml-ci	2024-06-16 18:19:48 +03:00
Georgi Gerganov	45ddda8e0c	ggml : drop support for QK_K=64 (llama/7473) * ggml : drop support for QK_K=64 ggml-ci * opencl : restore QK_K=256 define	2024-06-16 18:19:48 +03:00
0cc4m	c41317fd66	Update vulkan rope implementation to support frequency factors (llama/7475)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	96b8419b27	CUDA: fix FA out-of-bounds reads (llama/7479)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	3c63f4cf35	CUDA: fix FA out-of-bounds writes (llama/7465)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	5848dfd9c8	cuda : fix compile warning (llama/7454)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	29ab5d0326	CUDA: remove incorrect precision check (llama/7454)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	c4d6958b3e	cuda : fix rope + add tests (llama/7452) * cuda : fix rope pos data ggml-ci * ggml : drop mode & 1 == 1 support for ggml_rope ggml-ci * ggml : support freq_factors for f16 rope (CPU) ggml-ci * tests : add rope tests using frequency factors ggml-ci	2024-06-16 18:19:48 +03:00
liuwei-git	c9dcb75118	llama : add phi3 128K model support (llama/7225) * add phi3 128k support in convert-hf-to-gguf * add phi3 128k support in cuda * address build warnings on llama.cpp * adjust index value in cuda long rope freq factors * add long rope support in ggml cpu backend * make freq factors only depend on ctx size * remove unused rope scaling type 'su' from gguf converter * fix flint warnings on convert-hf-to-gguf.py * set to the short freq factor when context size is smaller than trained context size * add one line of comments * metal : support rope freq_factors * ggml : update ggml_rope_ext API to support freq. factors * backends : add dev messages to support rope freq. factors * minor : style * tests : update to use new rope API * backends : fix pragma semicolons * minor : cleanup * llama : move rope factors from KV header to tensors * llama : remove tmp assert * cuda : fix compile warning * convert : read/write n_head_kv * llama : fix uninitialized tensors --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-16 18:19:48 +03:00
Georgi Gerganov	bbdbc3fc62	metal : handle F16 inf values, fix FA partial offload (llama/7434) ggml-ci	2024-06-16 18:19:48 +03:00
Johannes Gäßler	28c207a541	CUDA: fix unused warning in mmq.cu (llama/7442)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	c23f830983	CUDA: deduplicate mmq code (llama/7397)	2024-06-16 18:19:48 +03:00
Radoslav Gerganov	caeeb32b41	rpc : track allocated buffers (llama/7411) * rpc : track allocated buffers ref: #7407 * rpc : pack rpc_tensor tightly	2024-06-16 18:19:48 +03:00
AidanBeltonS	584cc1177a	Update SYCL upscale operation (llama/7321) * Update SYCL upscale operation * Formatting * Remove messages	2024-06-16 18:19:48 +03:00
Herman Semenov	cc1ae10989	ggml-opencl, llama: using reserve() if count already known (llama/7272)	2024-06-16 18:19:48 +03:00

1581 Commits