whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-08-03 09:59:39 +02:00

Author	SHA1	Message	Date
Georgi Gerganov	f25edade2b	whisper : alternative way to handle the external encoders	2024-02-12 16:32:26 +02:00
Georgi Gerganov	74c260fe34	whisper : fix usage of external encoders (e.g. CoreML)	2024-02-12 15:21:21 +02:00
Georgi Gerganov	551529290d	talk-llama : sync llama.cpp	2024-02-12 10:39:58 +02:00
Georgi Gerganov	25a90ffa38	sync : ggml	2024-02-12 09:32:15 +02:00
Georgi Gerganov	866b67ca93	ggml-backend : sync remnant	2024-02-12 09:31:12 +02:00
Johannes Gäßler	d7e9f58f7f	CUDA: mul_mat_vec_q tiling, refactor mul mat logic (llama/5434) * CUDA: mul_mat_vec_q tiling, refactor mul mat logic Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-02-12 09:31:12 +02:00
Sergio López	04839bae22	vulkan: only use M-sized matmul on Apple GPUs (llama/5412) * vulkan: refactor guess_matmul_pipeline for vendor Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor conditionals. Signed-off-by: Sergio Lopez <slp@redhat.com> * vulkan: only use M-sized matmul on Apple GPUs L-sized and S-sized matmuls are broken on Apple GPUs, force using M-size with this vendor. Signed-off-by: Sergio Lopez <slp@redhat.com> --------- Signed-off-by: Sergio Lopez <slp@redhat.com>	2024-02-12 09:31:12 +02:00
Georgi Gerganov	3cc6e04a52	ggml : fix compile warnings (unused vars) (llama/4966)	2024-02-12 09:31:11 +02:00
snadampal	b7ef178b9c	ggml : add mmla kernels for quantized GEMM (llama/4966) * ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q8_0_q8_0 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q4_0_q8_0 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q4_1_q8_1 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: update unit tests for the new vec_dot interface * llama.cpp: add MATMUL_INT8 capability to system_info	2024-02-12 09:31:11 +02:00
Ian Bull	47dfe9d4db	metal : use autoreleasepool to avoid memory leaks (llama/5437) There appears to be a known memory leak when using the `MTLCommandBuffer`. It is suggested to use `@autoreleasepool` in [1,2] [1] https://developer.apple.com/forums/thread/662721 [2] https://forums.developer.apple.com/forums/thread/120931 This change-set wraps the `ggml_metal_graph_compute` in a `@autoreleasepool`. This commit addresses https://github.com/ggerganov/llama.cpp/issues/5436	2024-02-12 09:31:11 +02:00
slaren	1d3270cc8f	ggml-alloc : v3 (ggml/727) * ggml-alloc v3 ggml-ci * fix ci ggml-ci * whisper : check for backend buffer allocation failures * whisper : avoid leaks when initialization fails * cleanup ggml-ci * style fixes ggml-ci	2024-02-12 09:31:11 +02:00
dscripka	a6fb6ab597	examples : added audio_ctx argument to main and server (#1857) * added audio_ctx argument to main and server examples * Better default value Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * better default value (again) Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-12 09:19:07 +02:00
Didzis Gosko	163e74b6c3	metal : option to embed MSL source into compiled binary (#1842) * ggml : embed Metal library source (ggml-metal.metal) into binary enable by setting WHISPER_EMBED_METAL_LIBRARY * rename the build option * rename the preprocessor directive * generate Metal library embedding assembly on-fly during build process	2024-02-11 16:41:41 +02:00
Georgi Gerganov	f273e66dc6	examples : initialize context params properly (#1852)	2024-02-11 16:39:12 +02:00
Georgi Gerganov	02b4c52c12	talk-llama : sync llama.cpp	2024-02-10 10:10:59 +02:00
Georgi Gerganov	518199c09e	sync : ggml	2024-02-10 09:56:47 +02:00
Georgi Gerganov	8b17a2f776	src : relocate new backend sources	2024-02-10 09:55:47 +02:00
Michael Podvitskiy	b6d2827914	ggml : fix `error C2078: too many initializers` for MSVC ARM64 (llama/5404)	2024-02-10 09:55:47 +02:00
Johannes Gäßler	9711bae0b3	CUDA: more warps for mmvq on NVIDIA (llama/5394)	2024-02-10 09:55:47 +02:00
Johannes Gäßler	eec38f63bd	CUDA: fixed mmvq kernel for bs 2,3,4 and -sm row (llama/5386)	2024-02-10 09:55:47 +02:00
0cc4m	ef5e6b746f	Basic Vulkan Multi-GPU implementation (llama/5321) * Initial Vulkan multi-gpu implementation Move most global variables into backend context * Add names to backend device functions * Add further missing cleanup code * Reduce code duplication in tensor split layer assignment * generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h * Only do device info print in the beginning and initialize one backend for cpu assist Add missing cleanup code * Rework backend memory management to make sure devices and buffers get properly allocated and freed * Rename cpu assist free function --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-02-10 09:55:47 +02:00
Johannes Gäßler	77bf6b5f56	CUDA: mul_mat_vec_q max. batch size 8 -> 4 (llama/5370)	2024-02-10 09:55:47 +02:00
Kawrakow	b562fff9d0	Slight quantization improvement for Q4_K and Q5_K (llama/5361) * Q4_K: slightly better quantization * Q5_K: slightly better quantization --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-10 09:55:47 +02:00
Johannes Gäßler	b5dec374f4	CUDA: mul_mat_vec_q for batch sizes > 1 (llama/5351)	2024-02-10 09:55:47 +02:00
Kawrakow	fa0dc6167c	ggml : make use of ggml-quants.h possible in C++ code (llama/5338) * Make use of ggml-quants.h possible in C++ code * One cannot possibly be defining static_assert in a C++ compilation --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-10 09:55:47 +02:00
Dr. Tom Murphy VII Ph.D	55bcd62a4b	ggml : avoid duplicating function calls using MIN/MAX macros (llama/5325) * Avoid duplicating function calls when using MIN/MAX macros. Since these copy "a" and "b" they ask the compiler to evaluate one of them twice. The compiler doesn't have a problem with removing the duplication in something like MAX(0, x + 2), but in some cases we're calling functions, and those calls just happen twice. By explicitly evaluating at the expression we get smaller and faster code without duplicate calls. See ggml_rope_yarn_corr_dims in Compiler Explorer: https://godbolt.org/z/Ee4KMrvKh Code behaves exactly the same. * Update ggml.c --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-10 09:55:46 +02:00
Kawrakow	0ed762d691	iq2_xxs: tune quantization (llama/5320) We get slightly better PPL, and we cut quantization time in nearly half. The trick is to 1st quantize without forcing points onto the E8-lattice. We can then use a narrower search range around the block scale that we got that way. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-10 09:55:46 +02:00
slaren	1b5bb7792e	cuda : fix LLAMA_CUDA_F16 (llama/5262)	2024-02-10 09:55:46 +02:00
Georgi Gerganov	9b735cea77	metal : add im2col F32 dst support (llama/5132)	2024-02-10 09:55:46 +02:00
JidongZhang-THU	12c462d656	llava : add MobileVLM support (llama/5132) * New Feature: 1. Sum_Rows: fix cuda kernel overflow fix block shape error when nrows too big 2. Im2Col: Support Batch in cuda Support f32 to f32 both in cpu && cuda 3. DepthWiseConv: Support by Im2Col && MulMat 4. Pool_2d: Support avg pooling in cuda 5. HardSigmoid: Imp in cuda 6. HardSwish: Imp in cuda * fix tabs instead of spaces * code clean * CUDA POOL2D * ADD POOL2D test case in test-backend-ops.cpp * code clean * fix pool2d_kernel nits * fix bug in pool2d kernel * fix avg pooling, count_include_pad nits * test-backend-ops : add more pool_2d tests * cuda : fix warnings and formatting * ggml : check types in release builds too in pool_2d * test-backend-ops : remove f16 pool_2d tests * cuda : more style fixes * Add assert in ggml_cuda_op_pool2d * pool2d float padding fallback * test-backend-ops : add dst_type to im2col --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-02-10 09:55:46 +02:00
slaren	fc7b0e2c28	ggml : limit n_threads to the max n_tasks (llama/5238)	2024-02-10 09:55:46 +02:00
Jared Van Bortel	f850a067ed	kompute : llama-bench support and ggml_cpu_has_kompute() (llama/5226)	2024-02-10 09:55:46 +02:00
Michael Podvitskiy	f75e1197f1	ggml : add abort_callback for cpu backend (ggml/725) * a way to use abort_callback with the cpu backend * whisper update	2024-02-10 09:55:46 +02:00
Georgi Gerganov	aa8a75e287	extra : update sync scripts	2024-02-10 09:55:19 +02:00
Valentin Gosu	80e8a2ea39	server : allow CORS request with authorization headers (#1850) Whisper plugin in Obsidian requires an API key which is then sent as an authorization header. However, the presence of an authorization header requires a CORS Preflight, so both the OPTIONS method and the Access-Control-Allow-Headers: authorization must be handled.	2024-02-09 17:42:41 +02:00
Neuman Vong	19f8048139	whisper.android : how to build with CLBlast (#1809) * FetchContent * OpenCL * Documentation and make optional * Specify GGML build options in build.gradle * Use gradle properties * @ggerganov Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * @gpokat --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-09 17:39:05 +02:00
Didzis Gosko	0f80e5a80a	whisper : expose CUDA device setting in public API (#1840) * Makefile : allow to override CUDA_ARCH_FLAG * whisper : allow to select GPU (CUDA) device from public API	2024-02-09 17:27:47 +02:00
Didzis Gosko	b6559333ff	make : add macOS deployment target option (#1839)	2024-02-09 17:26:29 +02:00
Georgi Gerganov	434b8f3b96	talk-llama : stream response (#1121)	2024-02-06 19:56:12 +02:00
Georgi Gerganov	7a74e929c8	sync : ggml (#0)	2024-01-30 21:30:26 +02:00
Kawrakow	361ecebe90	ggml : fix IQ3_XXS on Metal (llama/5219) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-30 21:28:00 +02:00
Georgi Gerganov	807cbc672e	sync : ggml (llama/0)	2024-01-30 21:27:59 +02:00
Kawrakow	98ae5276b7	Faster AVX2 dot product for IQ2_XS (llama/5187) * iq2xs: faster AVX2 dot product * iq2xs: small AVX2 improvement * Speed up computing sign bits in AVX2 iq2_xs dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Peter Reid <peter@peterreid.net>	2024-01-30 21:27:59 +02:00
Kawrakow	6adb969b09	SOTA 3-bit quants (llama/5196) * iq3_xxs: quantize/dequantize RMSE seems a bit high-ish at about half-way between q2_K and q3_K, so need to check more. * iq3_xxs: CUDA dequantize works * iq2_xxs: tuning quantization * iq3_xxs: starting to look better PPL on wiki.test.raw LLaMA-v1-7B: 6.4218 LLaMA-v2-7B: 6.3560 Mistral-7B : 6.0717 This is better than Q3_K_XS, with a 5% reduction in quantized model size. * iq3_xxs: CUDA dot product We have PP-512: 5891 t/s TG-128: 143.9 t/s * iq3_xxs: scalar and AVX2 dot products * iq3_xxs: ARM_NEON and Metal Metal performance is decent, ARM_NEON is pathetic * iq3_xxs: slightly better grid points * Faster iq3_xxs and iq2_xs dot products on CUDA * iq3_xxs: add some quant mix * iq3_xxs: fix failing quantization test Dot product still fails. Is this real? * iq3_xxs: hopefully fix ROCm * iq3_xxs: failing tests This time the dot product accuracy did find an actual bug in the AVX2 implementation. * Add IQ3_XXS to test-backend-ops --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-30 21:27:59 +02:00
Paul Tsochantaris	8a7d6ff51a	ggml alloc: Fix for null dereference on alloc failure (llama/5200) * Fix for a null pointer dereference if a metal GGML buffer fails to be allocated * Freeing the allocated buffers rather than the pointer in ggml-alloc.c * Fixed the fix of the fix	2024-01-30 21:27:59 +02:00
Jared Van Bortel	25f650a8e8	Nomic Vulkan backend (llama/4456) Signed-off-by: Jared Van Bortel <jared@nomic.ai> Co-authored-by: niansa <anton-sa@web.de> Co-authored-by: Adam Treat <treat.adam@gmail.com> Co-authored-by: Aaron Miller <apage43@ninjawhale.com> Co-authored-by: ToKiNoBug <tokinobug@163.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2024-01-30 21:27:59 +02:00
slaren	44e517f074	ggml : add max buffer sizes to opencl and metal backends (llama/5181)	2024-01-30 21:27:59 +02:00
Paul Tsochantaris	cb9de61659	metal : free metal objects (llama/5161) * Releasing MTLFunction references after Metal pipeline construction * Keeping the `ggml_metal_kernel` structure * Spacing fix * Whitespace fix	2024-01-30 21:27:59 +02:00
Georgi Gerganov	a2ef80d66f	gguf : fix comparison (ggml/715) ggml-ci	2024-01-30 21:27:59 +02:00
John Balis	baa190446a	`ggml_cuda_cpy` support for 4d tensors and float16->float32 upcasting (ggml/686) * added cuda float16->float32 upcasting to ggml_cuda_cpy * added ability to copy 4d tensors with the cuda backend * added tests for float16_>float32 upcast and 4d tensor cuda copys * added 4d copy test for float32->float16 copy * applied patch suggested by @iamlemec * simplify cpy tests --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-01-30 21:27:59 +02:00

1025 Commits