whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-08-02 02:03:02 +02:00

Author	SHA1	Message	Date
Neuman Vong	a38efcb9fd	vulkan: Find optimal memory type but with fallback (llama/5381) * @0cc4m feedback * More feedback @0cc4m	2024-02-19 15:53:22 +02:00
AT	31591649a0	Early return for zero size calls to get_tensor. (llama/5482) * Early return for zero size calls to get_tensor. Signed-off-by: Adam Treat <treat.adam@gmail.com> * Update ggml-kompute.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml-kompute.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Add an early return to the get/set tensor when the size is null. Signed-off-by: Adam Treat <treat.adam@gmail.com> * Early return after the assertions. Signed-off-by: Adam Treat <treat.adam@gmail.com> * Since we do the early return in the generic backend now no reason to do so here as well. Signed-off-by: Adam Treat <treat.adam@gmail.com> --------- Signed-off-by: Adam Treat <treat.adam@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-19 15:53:22 +02:00
Kawrakow	4f5c46a84f	ggml-quants : fix compiler warnings (shadow variable) (llama/5472) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-19 15:53:22 +02:00
Abhilash Majumder	462ffc58db	ggml-sycl: Replace 3d ops with macro (llama/5458) * use macro * use macro * fix format	2024-02-19 15:53:21 +02:00
Georgi Gerganov	65faae0b6a	build : update CBLAS flags + fix unused var warning (#0)	2024-02-19 14:44:46 +02:00
Davidson Francis	dda4b0ed06	main : check if input files exist before proceeding (#1872) Until the most recent commit (`3d42463`), the main.cpp sample file does not check whether the input files exist or not. Consequently, the model is loaded first before reporting whether there was a failure or not when processing a file. In environments with HDD, this can take about 50 seconds or more, depending on the loaded model. This commit addresses this issue by checking in advance whether the input files exist or not.	2024-02-19 10:51:26 +02:00
Felix	07d04280be	examples : clean up common code (#1871) move some utility functions into common.h	2024-02-19 10:50:15 +02:00
Jumper775	917c56ded4	models : fix openvino setup info (#1874)	2024-02-19 02:19:47 +00:00
Georgi Gerganov	3d42463845	models : add update py requirements	2024-02-13 11:51:32 +02:00
Georgi Gerganov	3ffc83d90a	swift : package no longer use ggml dependency (#1861) * Revert "swift : update Package.swift to use ggml as package dependency (#1701)" This reverts commit `993acb5d41`. * spm : add ggml.h	2024-02-12 19:54:11 +02:00
Georgi Gerganov	e3c5e2cba8	whisper : fix external encoder (#1860)	2024-02-12 19:53:51 +02:00
Georgi Gerganov	b742f13e70	sync : ggml	2024-02-12 19:07:56 +02:00
slaren	52c529eeb1	ggml-alloc : allocate all leafs as if they were inputs (ggml/731) * ggml-alloc : allocate all leafs as if they were inputs * ensure static leafs are allocated * gpt-2-backend : remove unnecessary ggml_new_tensor * update other gpt-2 examples to remove ggml_new_tensor calls in the graph	2024-02-12 19:07:38 +02:00
Georgi Gerganov	551529290d	talk-llama : sync llama.cpp	2024-02-12 10:39:58 +02:00
Georgi Gerganov	25a90ffa38	sync : ggml	2024-02-12 09:32:15 +02:00
Georgi Gerganov	866b67ca93	ggml-backend : sync remnant	2024-02-12 09:31:12 +02:00
Johannes Gäßler	d7e9f58f7f	CUDA: mul_mat_vec_q tiling, refactor mul mat logic (llama/5434) * CUDA: mul_mat_vec_q tiling, refactor mul mat logic Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-02-12 09:31:12 +02:00
Sergio López	04839bae22	vulkan: only use M-sized matmul on Apple GPUs (llama/5412) * vulkan: refactor guess_matmul_pipeline for vendor Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor conditionals. Signed-off-by: Sergio Lopez <slp@redhat.com> * vulkan: only use M-sized matmul on Apple GPUs L-sized and S-sized matmuls are broken on Apple GPUs, force using M-size with this vendor. Signed-off-by: Sergio Lopez <slp@redhat.com> --------- Signed-off-by: Sergio Lopez <slp@redhat.com>	2024-02-12 09:31:12 +02:00
Georgi Gerganov	3cc6e04a52	ggml : fix compile warnings (unused vars) (llama/4966)	2024-02-12 09:31:11 +02:00
snadampal	b7ef178b9c	ggml : add mmla kernels for quantized GEMM (llama/4966) * ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q8_0_q8_0 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q4_0_q8_0 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm armv8.2-a and above supports MMLA instructions that have higher throughput than DOT. this commit adds mmla kernel for q4_1_q8_1 gemm. The feature is enabled if the platform supports "__ARM_FEATURE_MATMUL_INT8" On AWS Graviton3 processors this kernel resulted up to 1.5x improvement for prompt evaluation throughput compared to the default sdot kernel. * ggml: update unit tests for the new vec_dot interface * llama.cpp: add MATMUL_INT8 capability to system_info	2024-02-12 09:31:11 +02:00
Ian Bull	47dfe9d4db	metal : use autoreleasepool to avoid memory leaks (llama/5437) There appears to be a known memory leak when using the `MTLCommandBuffer`. It is suggested to use `@autoreleasepool` in [1,2] [1] https://developer.apple.com/forums/thread/662721 [2] https://forums.developer.apple.com/forums/thread/120931 This change-set wraps the `ggml_metal_graph_compute` in a `@autoreleasepool`. This commit addresses https://github.com/ggerganov/llama.cpp/issues/5436	2024-02-12 09:31:11 +02:00
slaren	1d3270cc8f	ggml-alloc : v3 (ggml/727) * ggml-alloc v3 ggml-ci * fix ci ggml-ci * whisper : check for backend buffer allocation failures * whisper : avoid leaks when initialization fails * cleanup ggml-ci * style fixes ggml-ci	2024-02-12 09:31:11 +02:00
dscripka	a6fb6ab597	examples : added audio_ctx argument to main and server (#1857) * added audio_ctx argument to main and server examples * Better default value Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * better default value (again) Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-12 09:19:07 +02:00
Didzis Gosko	163e74b6c3	metal : option to embed MSL source into compiled binary (#1842) * ggml : embed Metal library source (ggml-metal.metal) into binary enable by setting WHISPER_EMBED_METAL_LIBRARY * rename the build option * rename the preprocessor directive * generate Metal library embedding assembly on-fly during build process	2024-02-11 16:41:41 +02:00
Georgi Gerganov	f273e66dc6	examples : initialize context params properly (#1852)	2024-02-11 16:39:12 +02:00
Georgi Gerganov	02b4c52c12	talk-llama : sync llama.cpp	2024-02-10 10:10:59 +02:00
Georgi Gerganov	518199c09e	sync : ggml	2024-02-10 09:56:47 +02:00
Georgi Gerganov	8b17a2f776	src : relocate new backend sources	2024-02-10 09:55:47 +02:00
Michael Podvitskiy	b6d2827914	ggml : fix `error C2078: too many initializers` for MSVC ARM64 (llama/5404)	2024-02-10 09:55:47 +02:00
Johannes Gäßler	9711bae0b3	CUDA: more warps for mmvq on NVIDIA (llama/5394)	2024-02-10 09:55:47 +02:00
Johannes Gäßler	eec38f63bd	CUDA: fixed mmvq kernel for bs 2,3,4 and -sm row (llama/5386)	2024-02-10 09:55:47 +02:00
0cc4m	ef5e6b746f	Basic Vulkan Multi-GPU implementation (llama/5321) * Initial Vulkan multi-gpu implementation Move most global variables into backend context * Add names to backend device functions * Add further missing cleanup code * Reduce code duplication in tensor split layer assignment * generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h * Only do device info print in the beginning and initialize one backend for cpu assist Add missing cleanup code * Rework backend memory management to make sure devices and buffers get properly allocated and freed * Rename cpu assist free function --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-02-10 09:55:47 +02:00
Johannes Gäßler	77bf6b5f56	CUDA: mul_mat_vec_q max. batch size 8 -> 4 (llama/5370)	2024-02-10 09:55:47 +02:00
Kawrakow	b562fff9d0	Slight quantization improvement for Q4_K and Q5_K (llama/5361) * Q4_K: slightly better quantization * Q5_K: slightly better quantization --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-10 09:55:47 +02:00
Johannes Gäßler	b5dec374f4	CUDA: mul_mat_vec_q for batch sizes > 1 (llama/5351)	2024-02-10 09:55:47 +02:00
Kawrakow	fa0dc6167c	ggml : make use of ggml-quants.h possible in C++ code (llama/5338) * Make use of ggml-quants.h possible in C++ code * One cannot possibly be defining static_assert in a C++ compilation --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-10 09:55:47 +02:00
Dr. Tom Murphy VII Ph.D	55bcd62a4b	ggml : avoid duplicating function calls using MIN/MAX macros (llama/5325) * Avoid duplicating function calls when using MIN/MAX macros. Since these copy "a" and "b" they ask the compiler to evaluate one of them twice. The compiler doesn't have a problem with removing the duplication in something like MAX(0, x + 2), but in some cases we're calling functions, and those calls just happen twice. By explicitly evaluating at the expression we get smaller and faster code without duplicate calls. See ggml_rope_yarn_corr_dims in Compiler Explorer: https://godbolt.org/z/Ee4KMrvKh Code behaves exactly the same. * Update ggml.c --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-10 09:55:46 +02:00
Kawrakow	0ed762d691	iq2_xxs: tune quantization (llama/5320) We get slightly better PPL, and we cut quantization time in nearly half. The trick is to 1st quantize without forcing points onto the E8-lattice. We can then use a narrower search range around the block scale that we got that way. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-02-10 09:55:46 +02:00
slaren	1b5bb7792e	cuda : fix LLAMA_CUDA_F16 (llama/5262)	2024-02-10 09:55:46 +02:00
Georgi Gerganov	9b735cea77	metal : add im2col F32 dst support (llama/5132)	2024-02-10 09:55:46 +02:00
JidongZhang-THU	12c462d656	llava : add MobileVLM support (llama/5132) * New Feature: 1. Sum_Rows: fix cuda kernel overflow fix block shape error when nrows too big 2. Im2Col: Support Batch in cuda Support f32 to f32 both in cpu && cuda 3. DepthWiseConv: Support by Im2Col && MulMat 4. Pool_2d: Support avg pooling in cuda 5. HardSigmoid: Imp in cuda 6. HardSwish: Imp in cuda * fix tabs instead of spaces * code clean * CUDA POOL2D * ADD POOL2D test case in test-backend-ops.cpp * code clean * fix pool2d_kernel nits * fix bug in pool2d kernel * fix avg pooling, count_include_pad nits * test-backend-ops : add more pool_2d tests * cuda : fix warnings and formatting * ggml : check types in release builds too in pool_2d * test-backend-ops : remove f16 pool_2d tests * cuda : more style fixes * Add assert in ggml_cuda_op_pool2d * pool2d float padding fallback * test-backend-ops : add dst_type to im2col --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-02-10 09:55:46 +02:00
slaren	fc7b0e2c28	ggml : limit n_threads to the max n_tasks (llama/5238)	2024-02-10 09:55:46 +02:00
Jared Van Bortel	f850a067ed	kompute : llama-bench support and ggml_cpu_has_kompute() (llama/5226)	2024-02-10 09:55:46 +02:00
Michael Podvitskiy	f75e1197f1	ggml : add abort_callback for cpu backend (ggml/725) * a way to use abort_callback with the cpu backend * whisper update	2024-02-10 09:55:46 +02:00
Georgi Gerganov	aa8a75e287	extra : update sync scripts	2024-02-10 09:55:19 +02:00
Valentin Gosu	80e8a2ea39	server : allow CORS request with authorization headers (#1850) Whisper plugin in Obsidian requires an API key which is then sent as an authorization header. However, the presence of an authorization header requires a CORS Preflight, so both the OPTIONS method and the Access-Control-Allow-Headers: authorization must be handled.	2024-02-09 17:42:41 +02:00
Neuman Vong	19f8048139	whisper.android : how to build with CLBlast (#1809) * FetchContent * OpenCL * Documentation and make optional * Specify GGML build options in build.gradle * Use gradle properties * @ggerganov Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * @gpokat --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-09 17:39:05 +02:00
Didzis Gosko	0f80e5a80a	whisper : expose CUDA device setting in public API (#1840) * Makefile : allow to override CUDA_ARCH_FLAG * whisper : allow to select GPU (CUDA) device from public API	2024-02-09 17:27:47 +02:00
Didzis Gosko	b6559333ff	make : add macOS deployment target option (#1839)	2024-02-09 17:26:29 +02:00
Georgi Gerganov	434b8f3b96	talk-llama : stream response (#1121)	2024-02-06 19:56:12 +02:00
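
Note on 55bcd62a4b (llama/5325): the message describes function-like MIN/MAX macros evaluating one argument twice when that argument is a function call. The sketch below only illustrates that effect and is not the code from the commit. The `MAX` definition is the conventional one, and `g_calls`, `expensive`, `max_via_macro` and `max_via_local` are hypothetical names introduced here.

```cpp
// Conventional function-like macro: each argument is pasted textually,
// so whichever operand is selected ends up evaluated a second time.
#define MAX(a, b) ((a) > (b) ? (a) : (b))

// Hypothetical counter that makes the duplicate evaluation visible.
int g_calls = 0;

// Hypothetical stand-in for a costly function call.
int expensive(int x) {
    ++g_calls;
    return x * 2;
}

// The call is passed straight to the macro: expensive() runs once for the
// comparison and once more if its result is the selected operand.
int max_via_macro(int x) {
    return MAX(0, expensive(x));
}

// The expression is evaluated once into a local, and the macro only sees
// the plain value, so expensive() runs exactly once.
int max_via_local(int x) {
    const int v = expensive(x);
    return MAX(0, v);
}
```

With `x = 3`, both functions return 6, but `max_via_macro` increments `g_calls` twice and `max_via_local` once. The result is identical in both forms; only the number of calls differs.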

1036 Commits
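
Note on dda4b0ed06 (#1872): the message describes validating the input files before the slow model load. The sketch below shows one way such an up-front check could look, assuming that "exists" is approximated by "can be opened for reading". `file_readable` and `first_missing_input` are hypothetical helpers introduced here, not the functions added by the commit.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: treats a file as present if it can be opened for reading.
bool file_readable(const std::string & path) {
    std::ifstream f(path);
    return f.good();
}

// Hypothetical helper: returns the index of the first input that cannot be
// opened, or -1 if every input is readable. A caller would run this before
// the expensive model load and bail out early on a non-negative result.
int first_missing_input(const std::vector<std::string> & paths) {
    for (std::size_t i = 0; i < paths.size(); ++i) {
        if (!file_readable(paths[i])) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

A caller might report `paths[idx]` and exit when `idx >= 0`, so a mistyped file name fails immediately instead of after the model has been read from disk.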