whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2024-11-07 16:44:13 +01:00

Author	SHA1	Message	Date
Justine Tunney	7a4f7d825e	ggml : add llamafile sgemm (llama/6414) This change upstreams llamafile's cpu matrix multiplication kernels which improve image and prompt evaluation speed. For starters, Q4_0 and Q8_0 weights should go ~40% faster on CPU. The biggest benefits are with data types like f16 / f32, which process prompts 2x faster thus making them faster than quantized data types for prompt evals. This change also introduces bona fide AVX512 support since tinyBLAS is able to exploit the larger register file. For example, on my CPU llama.cpp llava-cli processes an image prompt at 305 tokens/second, using the Q4_K and Q4_0 types, which has always been faster than if we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With this change, f16 LLaVA performance leap frogs to 464 tokens/second. On Intel Core i9-14900K this change improves F16 prompt perf by 5x. For example, using llama.cpp at HEAD with Mistral 7b f16 to process a 215 token prompt will go 13 tok/sec. This change has fixes making it go 52 tok/sec. It's mostly thanks to my vectorized outer product kernels but also because I added support for correctly counting the number of cores on Alderlake, so the default thread count discounts Intel's new efficiency cores. Only Linux right now can count cores. This work was sponsored by Mozilla who's given permission to change the license of this code from Apache 2.0 to MIT. To read more about what's improved, and how it works, see: https://justine.lol/matmul/	2024-05-13 11:02:26 +03:00
Shijie	fdb2c87350	llama : add qwen2moe (llama/6074) * support qwen2moe * fix-review * metal : support unary ops for nelements % 4 != 0 * metal : require contiguousness for float4 unary kernels * metal : require contiguousness for float4 unary kernels (cont) * fix-review * names : for brevity "SHARED_EXP" -> "SHEXP" * llama : reuse build_moe_ffn() * llama : add model type name --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-05-13 11:02:26 +03:00
Neo Zhang Jianyu	98c0b77e0c	fix mul_mat_id() for new input, make the ut pass (llama/6682)	2024-05-13 11:02:26 +03:00
Dave	9d6d50d933	Added support for GGML_OP_CLAMP in Metal (llama/6662) * Added support for GGML_OP_CLAMP in Metal * Corrected size --------- Co-authored-by: dave-fl <dave@Davids-MacBook-Pro.local>	2024-05-13 11:02:26 +03:00
Neo Zhang Jianyu	c1320c1f0c	fix memcpy() crash, add missed cmd in guide, fix softmax (llama/6622) * disable mmap to fix memcpy crash, add missed cmd in guide, fix softmax * refactor to disable mmap for SYCL backend * fix compile error in other os * refactor the solution, use host buf to fix it, instead of disable mmap * keep to support mmap() * use host buff to reduce malloc times * revert to malloc/free solution, for threaad safe	2024-05-13 11:02:26 +03:00
Johannes Gäßler	66aaf03a7a	CUDA: fix matrix multiplication logic for tests (llama/6667)	2024-05-13 11:02:26 +03:00
slaren	00a0947c65	metal : unify mul_mv_id kernels (llama/6556)	2024-05-13 11:02:26 +03:00
jiez	60f3713026	llama : add gguf_remove_key + remove split meta during quantize (llama/6591) * Remove split metadata when quantize model shards * Find metadata key by enum * Correct loop range for gguf_remove_key and code format * Free kv memory --------- Co-authored-by: z5269887 <z5269887@unsw.edu.au>	2024-05-13 11:02:26 +03:00
Justina Cho	37e6757453	feat: implemented sigmoid function (ggml/806) * added sigmoid function * implemented metal kernel for sigmoid * implemented cuda kernel for sigmoid * added sigmoid unary op and incremented count	2024-05-13 11:02:26 +03:00
Borislav Stanimirov	8dcefdf4a9	build: fix and ignore msvc warnings (ggml/805)	2024-05-13 11:02:26 +03:00
Przemysław Pawełczyk	73d13ad19a	ggml : expose SSE3 and SSSE3 for MSVC when AVX is available (#2128)	2024-05-08 18:33:43 +03:00
Przemysław Pawełczyk	b6680fab50	build : improve disabling AVX-512 (#2129) * cmake : make WHISPER_NO_AVX512=ON disable all subsets of AVX-512 Previously it happened only for MSVC, but it makes sense to have the same behavior for other compilers too. * make : reorder x86 ISA extensions in chronological order And update compiler flags at the end to ease modifying conditions. * make : support WHISPER_NO_AVX512=1 for disabling all AVX-512 subsets. That way you do not have to override each AVX-512 subset setting individually if it has been turned on during autodetection.	2024-05-08 18:32:43 +03:00
Borislav Stanimirov	f760756078	minor: add CMakeSettings.json to gitignore (#2094)	2024-05-08 11:03:21 +03:00
Pedro Probst	58210d6a76	examples : fix node compilation (#2115) * node : fix compilation and update examples * node : fix readme * Update addon.node test	2024-05-02 22:52:55 +01:00
Przemysław Pawełczyk	8fac6455ff	make : change GNU make default CXX from g++ to c++ (#2100)	2024-04-28 22:54:21 +01:00
goldwaving	22b6598cc9	Remove unnecessary memory reallocation in fft (#2080) fft_out needs to be twice the frame_size, not the frame_step. It is resized in fft() anyway, but this change prevents an unnecessary reallocation. n_fft must match the mel filter size, so it is best not to calculate it from the framesize. We only need to get the magnitudes for half the spectrum since the other half is a mirror and not used in the mel filter loop later.	2024-04-28 18:36:12 +01:00
Georgi Gerganov	858452d58d	models : disable old script (#2079)	2024-04-24 14:56:30 +03:00
Georgi Gerganov	7f85e1d7fd	whisper : more prominent log message for sub-1s audio (#2065)	2024-04-24 14:46:06 +03:00
Georgi Gerganov	b0c3cbf2e8	main : pass nullptr when regex is empty (#2070)	2024-04-17 12:23:47 +03:00
AIWintermuteAI	a750868428	readme : add up-to-date repository for Python bindings (#2063) README	2024-04-16 14:15:52 +03:00
Georgi Gerganov	7395c70a74	release : v1.5.5	2024-04-16 14:08:31 +03:00
Emmanuel Schmidbauer	9fab28135c	server : add dtw (#2044) * server.cpp: add dtw * Update examples/server/server.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-15 22:16:58 +03:00
Didzis Gosko	08d3eef97d	build : fix embedded Metal library generation (#2045)	2024-04-15 20:23:05 +03:00
Pedro Probst	1b5439a6c2	node : support no timestamps (#2048) * fix: node: do not compute timestamps if you do not need them * feat: add no_timestamps parameter to node addon	2024-04-15 20:03:34 +03:00
Didzis Gosko	c7f95b7ca2	build : detect AVX512 in Makefile, add AVX512 option in CMake (#2043) * make : add AVX512 detection to Makefile and CMakeLists.txt * make : autodetect more AVX512 instruction subsets * cmake : do not default to AVX512, must be enabled explicitly * cmake : enable a set of AVX512 subsets, when AVX512 is turned on * make : consolidate AVX512 subsets, add AVX512 VBMI * cmake : revert to NO AVX512 setting, add settings for AVX512 VNNI and VBMI * make : re-introduce AVX512VNNI back * cmake : remove superfluous comment line	2024-04-15 20:02:09 +03:00
Kendrick Taylor	5c554c04ff	whisper.nvim : fix missing reference to "model" variable (#2049)	2024-04-15 19:41:28 +03:00
Ikko Eltociear Ashimine	c383f091a1	whisper : update grammar-parser.cpp (#2058) preceeding -> preceding	2024-04-15 19:40:27 +03:00
Georgi Gerganov	8f253ef3af	sync : ggml	2024-04-09 20:27:55 +03:00
Georgi Gerganov	c7dc37f97c	license : update copyright notice + add AUTHORS	2024-04-09 20:27:44 +03:00
Carolinabanana	526332873b	llama : add Command R Plus support (llama/6491) * Add Command R Plus GGUF * Add Command R Plus GGUF * Loading works up to LayerNorm2D * Export new tensors in 1D so they are not quantized. * Fix embedding layer based on Noeda's example * Whitespace * Add line * Fix unexpected tokens on MPS. Re-add F16 fix. ((Noeda) * dranger003: Fix block index overflow in CUDA dequantizing. * Reverted blocked multiplication code as it still has issues and could affect other Llama arches * export norms as f32 * fix overflow issues during quant and other cleanup * Type convention Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * dranger003: Fix more int overflow during quant. --------- Co-authored-by: S <seast@Ss-Mac-Studio.local> Co-authored-by: S <s@example.com> Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-09 20:26:18 +03:00
Abhilash Majumder	1d2721ca72	remove row=1 cond (llama/6532)	2024-04-09 20:26:18 +03:00
Neo Zhang Jianyu	219e601dab	support/fix OPs GGML_TYPE_IQ4_NL, GGML_TYPE_IQ4_XS, GGML_TYPE_IQ3_XXS, GGML_TYPE_IQ3_S, GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS, GGML_TYPE_IQ2_S, GGML_TYPE_IQ1_S, GGML_TYPE_IQ1_M (llama/6521)	2024-04-09 20:26:18 +03:00
Georgi Gerganov	3b8aade3c2	scripts : update sync	2024-04-09 20:25:50 +03:00
Georgi Gerganov	52ccd4a3a8	files : rename ./extra to ./scripts	2024-04-09 20:13:41 +03:00
Brad Murray	5275074d37	whisper : fix DTW memory access (#2012) * Fix DTW memory access * Memory fix - Apply changes from denersc	2024-04-09 18:38:19 +03:00
ulatekh	c15b4cda7d	common : fix file-handle leak in read_wav() (#2026) Now it cleans up in case of error.	2024-04-09 18:34:34 +03:00
Rotem Dan	d3cfb6ca2b	main : set stdin to binary mode on Windows (#2025)	2024-04-09 18:33:32 +03:00
slashlib	956ef860bc	cmake : support for CPU BLAS build via Intel MKL (#2024)	2024-04-09 18:32:46 +03:00
ulatekh	671b4bde6c	main : allow a response-file as the sole parameter (#2019) * The "main" example now allows a response-file as the sole parameter. A response-file is a text file with command-line parameters, one per line. Prefix the name of the response-file with "@" to identify it as such. It's used under MS Windows to work around command-line length limits. It may be useful under other platforms to simplify character-escaping. * minor : style --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-09 18:31:16 +03:00
ulatekh	c8eeb93a6a	whisper : suppress tokens with a regex (#1997) * Allow a regular expression to describe tokens to suppress. Example: --suppress-tokens-re "[,\.]\|[ ]?[0-9]+" will suppress commas, periods, and numeric tokens. Technique inspired by https://github.com/openai/whisper/discussions/1041 Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Blind change to fix Java test. --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-09 18:27:28 +03:00
ulatekh	319fe5146e	cmake : create solution folders (#2004) * Create solution folders in the CMake build. * Fixed non-SDL2 build. * Fixed emscripten build.	2024-04-09 18:23:33 +03:00
Georgi Gerganov	13c22321d1	sync : ggml	2024-04-07 17:04:56 +03:00
Georgi Gerganov	ccbe9d5676	extra : sync grammar-parser	2024-04-07 17:04:22 +03:00
Georgi Gerganov	81a3c41aa0	talk-llama : sync llama.cpp	2024-04-07 16:21:08 +03:00
Georgi Gerganov	a50207c65d	sync : ggml	2024-04-07 16:18:11 +03:00
Georgi Gerganov	97878e53fd	sync : llama.cpp (skip) ggml-ci	2024-04-07 16:15:57 +03:00
Ouadie EL FAROUKI	61b05815e0	Fixed minor bug when enabling FP16 for non intel targets (llama/6464) * moved INTEL_MKL guard from gemm_impl to gemm (wrapper) * Update ggml-sycl.cpp Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com> --------- Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>	2024-04-07 16:15:57 +03:00
slaren	1dce94cf26	ggml : mul_mat_id use the same tensor for all the experts (llama/6387) * ggml : update mul_mat_id to use the same tensor for all the experts * update cuda * minor * update metal * update test-backend-ops * fix cuda * Update ggml-metal.m Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update convert.py * update convert-hf-to-gguf.py * update convert.py for mixtral hf models * Update convert-hf-to-gguf.py Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * cuda : support non-pow-2 number of experts * allow quantize to work for split and merged experts models in the same way * cleanup + disable mmap automatically with split tensors models * update imatrix * test-backend-ops : test qwen argsort * update grok model loading * llama : add merged experts tensors to the grok tensor map * minor * gguf : bump version * fix quantizing of merged experts * convert-hf-to-gguf.py : update grok (untested) * make linter happy * cuda/argsort : use shared memory instead of pool memory * convert : fix grok tensor names * metal : add support for non-pow-2 argsort * llama : more loader cleanup, better error checking * cuda : fix warning * llama : still use mmap for loading old models, but copy the data to a host buffer * add review note * llama : remove ffn tensor counting + add sanity check ggml-ci * convert : fix handling of n_experts == None ggml-ci * imatrix : fix ncall counters * llama : produce error if imatrix size does not match * quantize : terminate on errors + trace logs ggml-ci * metal : pad shared memory to 16 bytes --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-04-07 16:15:57 +03:00
Meng, Hengyu	f12e982c0b	Disable iqx on windows as WA (llama/6435) * disable iqx on windows as WA * array instead of global_memory	2024-04-07 16:15:57 +03:00
0cc4m	fa966b9b40	Vulkan k-quant mmq and ggml-backend offload functionality (llama/6155) * Fix Vulkan no kv offload incoherence * Add k-quant mul mat mat shaders * Rework working buffer allocation, reduces vram use noticeably Clean up cpu assist code, replaced with ggml-backend offload function * Default to all dedicated GPUs * Add fallback for integrated GPUs if no dedicated GPUs are found * Add debug info which device is allocating memory * Fix Intel dequant issue Fix validation issue * Fix Vulkan GGML_OP_GET_ROWS implementation * Clean up merge artifacts * Remove Vulkan warning	2024-04-07 16:15:57 +03:00

1221 Commits