whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2024-12-28 09:39:07 +01:00

Author	SHA1	Message	Date
Kawrakow	97b12212dd	ggml : SOTA 2-bit quants (add IQ2_XS) (llama/4856) * iq2_xs: basics * iq2_xs: this should have been in the basics * iq2_xs: CUDA and scalar CPU works * iq2_xs: WIP Metal * iq2_xs: Metal now works * iq2_xs: working, but dog slow, ARM_NEON dot product * iq2_xs: better ARM_NEON dot product We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when running on the CPU. * iq2_xs: AVX2 dot product - 19.5 t/s * iq2_xs: faster AVX2 dot product 21.4 t/s for TG-128, 59.2 t/s for PP-512. The latter is 2x compared to the previous version. * iq2_xs: had forgotten to delete iq2-data.h * Add llama enum for IQ2_XS --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-11 21:50:01 +02:00
Paul Tsochantaris	9fa34d79ec	metal : put encoder debug group behind a define (llama/4873)	2024-01-11 21:50:01 +02:00
Georgi Gerganov	a0a64a19dd	metal : improve dequantize precision to match CPU (llama/4836) ggml-ci	2024-01-11 21:50:01 +02:00
Georgi Gerganov	bbc23611fa	ggml : fix vld1q_s8_x4 32-bit compat (llama/4828) * ggml : fix vld1q_s8_x4 32-bit compat ggml-ci * ggml : fix 32-bit ARM compat (cont) ggml-ci	2024-01-11 21:50:01 +02:00
Johannes Gäßler	e9783a1fb4	CUDA: faster softmax via shared memory + fp16 math (llama/4742)	2024-01-11 21:50:01 +02:00
Georgi Gerganov	9e0cc28792	metal : fix deprecation warning (ggml/690)	2024-01-11 21:50:00 +02:00
Timothy Cronin	73072a7c73	ggml : remove ggml_cpy_inplace and ggml_cont_inplace (ggml/693)	2024-01-11 21:50:00 +02:00
Jack Mousseau	a8ba1262ff	metal : wrap each operation in debug group (ggml/690)	2024-01-11 21:50:00 +02:00
leejet	e66a9a7806	ggml : change GGML_MAX_NAME at compile time (ggml/682) * change GGML_MAX_NAME to 128 * allow controlling the value of GGML_MAX_NAME through external macro definitions	2024-01-11 21:50:00 +02:00
Halalaluyafail3	338442d773	Fix execlp call (ggml/689) NULL can be an integer constant expression with the value zero, in this case the behavior would be undefined because of an incorrect type being passed to the variable arguments.	2024-01-11 21:50:00 +02:00
Kawrakow	10651bddf6	SOTA 2-bit quants (llama/4773) * iq2_xxs: basics * iq2_xxs: scalar and AVX2 dot products Needed to change Q8_K to have quants in the -127...127 range, else the IQ2_XXS AVX implementation becomes very awkward. The alternative would have been to use Q8_0 instead. Perhaps I'll change later, for now this is what we have. * iq2_xxs: ARM_NEON dot product Somehow strangely slow (112 ms/token). * iq2_xxs: WIP Metal Dequantize works, something is still wrong with the dot product. * iq2_xxs: Metal dot product now works We have PP-512 = 475 t/s TG-128 = 47.3 t/s Not the greatest performance, but not complete garbage either. * iq2_xxs: slightly faster dot product TG-128 is now 48.4 t/s * iq2_xxs: slightly faster dot product TG-128 is now 50.9 t/s * iq2_xxs: even faster Metal dot product TG-128 is now 54.1 t/s. Strangely enough, putting the signs lookup table into shared memory has a bigger impact than the grid values being in shared memory. * iq2_xxs: dequantize CUDA kernel - fix conflict with master * iq2_xxs: quantized CUDA dot product (MMVQ) We get TG-128 = 153.1 t/s * iq2_xxs: slightly faster CUDA dot product TG-128 is now at 155.1 t/s. * iq2_xxs: add to llama ftype enum * iq2_xxs: fix MoE on Metal * Fix missing MMQ ops when on hipBLAS I had put the ggml_supports_mmq call at the wrong place. * Fix bug in qequantize_row_iq2_xxs The 0.25f factor was missing. Great detective work by @ggerganov! * Fixing tests * PR suggestion --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-01-11 21:50:00 +02:00
Johannes Gäßler	53d4d0b30d	CUDA: fixed redundant value dequantization (llama/4809)	2024-01-11 21:50:00 +02:00
Konstantin Zhuravlyov	2865e4710b	ggml : use __builtin_amdgcn_sudot4 in __dp4a for gfx11 (llama/4787)	2024-01-11 21:50:00 +02:00
Georgi Gerganov	c46a74a19d	ggml : do not sched_yield when calling BLAS (llama/4761) * ggml : do not sched_yield when calling BLAS ggml-ci * ggml : fix do_yield logic ggml-ci * ggml : simplify do_yield logic ggml-ci	2024-01-11 21:50:00 +02:00
Georgi Gerganov	46dc49a6a1	ggml : include stdlib.h before intrin.h (llama/4736)	2024-01-11 21:49:59 +02:00
Alexandru Mariuti	cc7f872131	swift : checkout ggml commit instead of branch (#1750)	2024-01-10 18:12:06 +02:00
RhinoDevel	bcc1658cd0	talk-llama : add optional Piper TTS support (#1749) Add commented-out command to optionally use Piper (https://github.com/rhasspy/piper) as text-to-speech solution for the talk-llama example. Piper voices sound almost like real people which is a big improvement (e.g.) from something like espeak.	2024-01-10 16:15:28 +02:00
Emmanuel Schmidbauer	c46886f599	server : add request path option (#1741)	2024-01-08 22:39:51 +00:00
Georgi Gerganov	29f78392c1	main : add cli option to disable system prints (#1740)	2024-01-08 16:41:28 +02:00
Georgi Gerganov	022756a872	server : fix server temperature + add temperature_inc (#1729) * server : fix server temperature + add temperature_inc * server : change dashes to underscores in parameter names	2024-01-07 13:35:14 +02:00
Georgi Gerganov	3b8c2dff57	talk-llama : sync latest llama.cpp	2024-01-06 17:22:57 +02:00
Georgi Gerganov	0b9af32a8b	release : v1.5.4	2024-01-05 17:11:27 +02:00
Erik Scholz	11b1b63b14	fix : cuda order of synchronization when setting a buffer (ggml/679) * fix : cuda order of synchronization when setting a buffer * also sync before memcpy --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-01-05 17:01:59 +02:00
Georgi Gerganov	0e26a6c92e	metal : switch back to default.metallib (ggml/681) ggml-ci	2024-01-05 16:31:30 +02:00
Georgi Gerganov	66d8f0b7f1	ggml : fix q2_k bpw in comments (ggml/680)	2024-01-05 16:31:20 +02:00
Yajing Tang	ba5bcde874	coreml : fix ANE optimized encoder (#1716)	2024-01-04 16:28:30 +02:00
Georgi Gerganov	ab0a8593c5	whisper.swiftui : add .gitignore	2024-01-04 15:00:27 +02:00
Georgi Gerganov	668ffc9b23	whisper : reset the "batched" timings (#1721)	2024-01-04 13:38:39 +02:00
Georgi Gerganov	9962371f71	release : v1.5.3	2024-01-03 19:36:33 +02:00
Ashraful Islam	993acb5d41	swift : update Package.swift to use ggml as package dependency (#1701) * updates Package.swift to use ggml as dependency * cleans up the Package.swift file by removing redundant source files * updates ggml url src to ggerganov	2024-01-03 19:30:26 +02:00
Finn Voorhees	a3d0aa73d1	ggml : add error handling to graph_compute (#1714)	2024-01-03 15:39:43 +02:00
Georgi Gerganov	14c57952f7	cuda : simplify expression Co-authored-by: slaren <slarengh@gmail.com>	2024-01-03 14:43:51 +02:00
Georgi Gerganov	6c369d6788	cuda : mark I16 and I32 ops as unsupported ggml-ci	2024-01-03 14:43:51 +02:00
Georgi Gerganov	4cdd9aad9b	metal : add kernel_get_rows_i32 ggml-ci	2024-01-03 14:43:51 +02:00
Georgi Gerganov	f38c057503	metal : optimize ggml_mul_mat_id (faster Mixtral PP) (llama/4725) * ggml : disable fast-math for Metal (cmake build only) ggml-ci * metal : fix Metal API debug warnings * cmake : add -fno-inline for Metal build (llama/4545) * metal : fix API debug warnings * metal : fix compile warnings * metal : use uint64_t for strides * cmake : rename option to LLAMA_METAL_SHADER_DEBUG * metal : fix mat-vec Q8_0 kernel for BS > 1 * metal : normalize mat-vec kernel signatures * cmake : respect LLAMA_QKK_64 option * metal : fix mat-vec Q4_K kernel for QK_K == 64 * metal : optimizing ggml_mul_mat_id (wip) * metal : minor fix * metal : opt mul_mm_id	2024-01-03 14:43:51 +02:00
Georgi Gerganov	1e5544b39b	metal : enable shader debugging (cmake option) (llama/4705) * ggml : disable fast-math for Metal (cmake build only) ggml-ci * metal : fix Metal API debug warnings * cmake : add -fno-inline for Metal build (llama/4545) * metal : fix API debug warnings * metal : fix compile warnings * metal : use uint64_t for strides * cmake : rename option to LLAMA_METAL_SHADER_DEBUG * metal : fix mat-vec Q8_0 kernel for BS > 1 * metal : normalize mat-vec kernel signatures * cmake : respect LLAMA_QKK_64 option * metal : fix mat-vec Q4_K kernel for QK_K == 64 ggml-ci	2024-01-03 14:43:51 +02:00
Georgi Gerganov	d5673af79f	ggml : add ggml_vdotq_s32 alias (llama/4715) ggml-ci	2024-01-03 14:43:51 +02:00
Johannes Gäßler	a28dacec65	CUDA: fixed tensor cores not being used on RDNA3 (llama/4697)	2024-01-03 14:43:51 +02:00
automaticcat	dbe29d4e33	ggml : add ggml_cpu_has_avx_vnni() (llama/4589) * feat: add avx_vnni based on intel documents * ggml: add avx vnni based on intel document * llama: add avx vnni information display * docs: add more details about using oneMKL and oneAPI for intel processors * docs: add more details about using oneMKL and oneAPI for intel processors * docs: add more details about using oneMKL and oneAPI for intel processors * docs: add more details about using oneMKL and oneAPI for intel processors * docs: add more details about using oneMKL and oneAPI for intel processors * Update ggml.c Fix indentation update Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-03 14:43:51 +02:00
Johannes Gäßler	fe3a67c546	CUDA: fix tensor core logic for Pascal and HIP (llama/4682)	2024-01-03 14:43:51 +02:00
hydai	b138ff2be3	cuda: fix vmm oom issue on NVIDIA AGX Orin (llama/4687) Signed-off-by: hydai <hydai@secondstate.io>	2024-01-03 14:43:51 +02:00
Guillaume Wenzek	cf6f1e4181	ggml : extend ggml_get_rows, ggml_repeat, ggml_concat (ggml/639) * add more int ops * ggml_compute_forward_dup_bytes * add tests * PR comments * tests : minor indentations --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-01-03 14:43:51 +02:00
Georgi Gerganov	620a223814	scripts : fix sync order + metal sed	2024-01-03 14:43:51 +02:00
Andreu Huguet	f39f9690ec	examples : fix WASM Stack Overflow (#1713) Fix for problem: """ RuntimeError: Aborted(Stack overflow! Stack cookie has been overwritten at 0x12be2b10, expected hex dwords 0x89BACDFE and 0x2135467, but received 0x00000000 0x00000000) """ That appears when executing the WASM example with the newer versions.	2024-01-02 16:50:04 +00:00
bobqianic	f9ca90256b	docker : fix the publishing of the CUDA Docker image (#1704)	2023-12-30 23:12:31 +02:00
Georgi Gerganov	2623640cd6	scripts : do not sync commits from this repo	2023-12-29 15:03:08 +02:00
Tamotsu Takahashi	d87de61ae6	ci : build with CLBlast + ggml-opencl use GGML_API (#1576) * Build with CLBlast * Declare GGML_API After rebasing, examples/talk-llama failed: "D:\a\whisper.cpp\whisper.cpp\build\ALL_BUILD.vcxproj" (build target) (1) -> "D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj" (default target) (14) -> (Link target) -> llama.obj : error LNK2019: unresolved external symbol ggml_cl_free_data referenced in function "public: __cdecl llama_model::~llama_model(void)" (??1llama_model@@QEAA@XZ) [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj] llama.obj : error LNK2019: unresolved external symbol ggml_cl_transform_tensor referenced in function "public: void __cdecl llama_model_loader::load_all_data(struct ggml_context ,void (__cdecl)(float,void ),void ,struct llama_mlock *)" (?load_all_data@llama_model_loader@@QEAAXPEAUggml_context@@P6AXMPEAX@Z1PEAUllama_mlock@@@Z) [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj] D:\a\whisper.cpp\whisper.cpp\build\bin\Release\talk-llama.exe : fatal error LNK1120: 2 unresolved externals [D:\a\whisper.cpp\whisper.cpp\build\examples\talk-llama\talk-llama.vcxproj]	2023-12-29 12:23:27 +02:00
bobqianic	f5f485f899	whisper : replace `tensor->n_dims` with `ggml_n_dims(tensor)` (#1694)	2023-12-29 11:38:35 +02:00
Georgi Gerganov	e77b27c331	sync : ggml (VMM, sync-ggml-am, dotprod ARM fixes, CUDA fixes) (#1691) * scripts : add sync-ggml-am.sh * sync : ggml (VMM, ARM dot prod fix, etc.) * build : fix CUDA build * ggml : fix some mul mat cases + add tests for src1 F16 `dbd02958fa`	2023-12-29 11:30:47 +02:00
Dimo	a5cc3dc8a2	download : fix large q5 model name (#1695) fixed typo in large-v3-q5-0 model name to match HF link	2023-12-29 11:14:32 +02:00
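
Note on 338442d773 (Fix execlp call): the message says that a bare NULL passed to a variadic function may be an integer constant zero, which has the wrong type for the argument list terminator. A minimal sketch of the safe pattern follows, assuming a POSIX system with `true` on the PATH. The helper name `run_true` is hypothetical and is not taken from the ggml sources.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from ggml): runs `true` from PATH in a child
 * process and returns its exit status, or -1 if something failed. */
static int run_true(void) {
    pid_t pid = fork();
    if (pid < 0) {
        return -1; /* fork failed */
    }
    if (pid == 0) {
        /* The variadic argument list must end with a null pointer of type
         * char *. The explicit cast guarantees that type even on platforms
         * where NULL is defined as the integer constant 0. */
        execlp("true", "true", (char *) NULL);
        _exit(127); /* reached only if execlp itself failed */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `run_true()` should return 0 when the child runs successfully. The cast costs nothing at run time; it only fixes the type seen by the variadic call.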

940 Commits
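
Note on e66a9a7806 (GGML_MAX_NAME): the message describes letting an external macro definition control a compile-time constant. The usual C idiom for this is an `#ifndef` guard around the default, sketched below. The names `DEMO_MAX_NAME`, `demo_tensor` and `demo_set_name`, and the default of 64, are illustrative assumptions and do not reproduce the ggml header.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative stand-in for a configurable limit: the default applies only
 * when the build has not already defined the macro, e.g. by passing
 * -DDEMO_MAX_NAME=128 on the compiler command line. */
#ifndef DEMO_MAX_NAME
#define DEMO_MAX_NAME 64
#endif

/* Hypothetical struct whose name buffer is sized by the macro. */
struct demo_tensor {
    char name[DEMO_MAX_NAME];
};

/* Copies a name into the fixed-size buffer, truncating if it is too long.
 * snprintf always NUL-terminates when the buffer size is non-zero. */
static void demo_set_name(struct demo_tensor *t, const char *name) {
    snprintf(t->name, sizeof(t->name), "%s", name);
}
```

With this guard, a project that needs longer names can raise the limit from its build flags without patching the header.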