whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-06-20 17:57:52 +02:00

Author	SHA1	Message	Date
Johannes Gäßler	e4bc83ab47	CUDA: refactor and optimize IQ MMVQ (llama/8215) * CUDA: refactor and optimize IQ MMVQ * uint -> uint32_t * __dp4a -> ggml_cuda_dp4a * remove MIN_CC_DP4A checks * change default * try CI fix	2024-07-08 14:53:55 +03:00
zhentaoyu	db7e0dbe6e	Update SYCL-Rope op and Refactor (llama/8157) * align with rope.cu and move sycl-op to a single file	2024-07-08 14:53:55 +03:00
Johannes Gäßler	bf88c94da9	CUDA: fix MMQ stream-k for --split-mode row (llama/8167)	2024-07-08 14:53:55 +03:00
John Balis	3eea171cab	feat: cuda implementation for `ggml_conv_transpose_1d` (ggml/854) * conv transpose 1d passing test for 1d input and kernel * working for different input and output channel counts, added test for variable stride * initial draft appears to work with stride other than 1 * working with all old and new conv1d tests * added a test for large tensors * removed use cuda hardcoding * restored test-conv-transpose.c * removed unused arguments, and fixed bug where test failure would cause subsequent tests to fail * fixed accumulator bug * added test to test-backend-ops * fixed mistake * addressed review * fixed includes * removed blank lines * style and warning fixes * return failure when test fails * fix supports_op --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-07-08 14:53:55 +03:00
Georgi Gerganov	64a56ebf13	ci : disable java build	2024-07-08 14:26:59 +03:00
Emmanuel Schmidbauer	bec9836849	server : add inference path to make OAI API compatible (#2270)	2024-07-08 14:24:58 +03:00
Georgi Gerganov	c118733a29	sync : ggml + fix sync script	2024-06-26 23:20:19 +03:00
Georgi Gerganov	bb3dd45524	make : disable CUDA graphs	2024-06-26 23:20:13 +03:00
slaren	04e7fa6f4f	ggml : add GGML_CUDA_USE_GRAPHS option, restore GGML_CUDA_FORCE_CUBLAS (cmake) (llama/8140)	2024-06-26 23:18:11 +03:00
Georgi Gerganov	9f7f36d4c9	make : disable CUDA mel build	2024-06-26 22:25:25 +03:00
Georgi Gerganov	4a62efbb95	cmake : minor fixes	2024-06-26 21:42:39 +03:00
Georgi Gerganov	0a55a70b9b	make : fix missing -O3 same as https://github.com/ggerganov/llama.cpp/pull/8143	2024-06-26 21:21:12 +03:00
Georgi Gerganov	dc8cc2dd6f	whisper : disable CUDA mel + fix FFMPEG	2024-06-26 20:11:38 +03:00
Georgi Gerganov	3efedb9511	sync : ggml	2024-06-26 19:40:23 +03:00
Georgi Gerganov	e30c679928	whisper : reorganize source code + improve CMake (#2256) * scripts : update sync [no ci] * files : reorganize [no ci] * sync : llama.cpp * cmake : link math library * cmake : build normal ggml library * files : move headers to include * objc : fix path to ggml-metal.h * ci : fix WHISPER_CUDA -> GGML_CUDA * scripts : sync LICENSE [no ci]	2024-06-26 19:34:09 +03:00
mky_coder	bf4cb4abad	whisper : optimize fft() function (#2242) Co-authored-by: Mike Fan <60965742+mike-fzy@users.noreply.github.com>	2024-06-18 18:10:33 +03:00
Georgi Gerganov	e293f17d34	talk-llama : sync llama.cpp	2024-06-18 09:45:37 +03:00
Georgi Gerganov	5d950c4b8d	whisper : use ggml_backend_sched (#2239) * whisper : use ggml_backend_sched (wip) * use sched in whisper_allocr * whisper : single backend in whisper_context * whisper : remove whisper_state->backends_used * whisper : remove whisper_context->backend * whisper : reset scheduler after init * whisper : fix external encoder (e.g. CoreML) * whisper : cleanup * whisper : handle null GPU buffer types + fix sycl --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-18 09:39:40 +03:00
Georgi Gerganov	820446e230	fix : remove extra files	2024-06-18 09:39:40 +03:00
Georgi Gerganov	54d5823ebe	scripts : sync ggml-blas	2024-06-18 09:39:40 +03:00
Georgi Gerganov	5181494e9f	build : update make / cmake	2024-06-18 09:39:40 +03:00
Georgi Gerganov	4a6e6e8b30	sync : ggml	2024-06-18 09:39:40 +03:00
slaren	de29b193f6	move BLAS to a separate backend (cont) (llama/6210) ggml-ci	2024-06-18 09:39:40 +03:00
0cc4m	922971041b	Vulkan Shader Refactor, Memory Debugging Option (llama/7947) * Refactor shaders, extract GLSL code from ggml_vk_generate_shaders.py into vulkan-shaders directory * Improve debug log code * Add memory debug output option * Fix flake8 * Fix unnecessary high llama-3 VRAM use	2024-06-18 09:39:40 +03:00
Georgi Gerganov	63a767a134	scripts : stop sync whisper example from ggml	2024-06-18 09:39:40 +03:00
Georgi Gerganov	30841fa786	cmake : fix sycl build (#0)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	3b1ac03828	ggml : remove OpenCL (#0)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	990de617b5	sycl : sync (#0)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	6975600b4b	cuda : enable CUDA graphs (#0)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	061eeb9f61	talk-llama : sync llama.cpp	2024-06-16 18:19:48 +03:00
Georgi Gerganov	4942b1b428	cmake : fix CUDA build (#0)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	3c7cc5c437	sync : ggml ggml-ci	2024-06-16 18:19:48 +03:00
Hong Bo PENG	5cd42ee2cc	ggml : fix and optimize ppc64le (ggml/849) * fix compile issues introduced by loongarch_asx * restore quant changes to merge * fix compile issues introduced by loongarch_asx * further optimize by using vec_msum & vec_sum4s on ppc64le	2024-06-16 18:19:48 +03:00
Daniel Bevenius	ee718f3da6	ggml : remove duplicate include of ggml-common.h (ggml/853) Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-06-16 18:19:48 +03:00
Meng, Hengyu	63eac1f608	remove global variables (llama/7710) * separate DPCT helpers outside * replace global variables with context * remove useless extra * update mul_mat condition * remove duplicate buft initialization * remove duplicate extra and global work group size * remove useless backend check * remove duplicated extras * use macro for group_size and remove cuda-related	2024-06-16 18:19:48 +03:00
Johannes Gäßler	b17ba2815b	CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (llama/7921) * CUDA: faster q2_K, q3_K MMQ + int8 tensor cores * try CI fix * try CI fix * try CI fix * fix data race * revert q2_K precision related changes	2024-06-16 18:19:48 +03:00
Georgi Gerganov	7a489af2f3	metal : utilize max shared memory for mul_mat_id (llama/7935)	2024-06-16 18:19:48 +03:00
Radoslav Gerganov	4a4ea13d6d	rpc : fix ggml_backend_rpc_supports_buft() (llama/7918)	2024-06-16 18:19:48 +03:00
slaren	174a461fc6	move BLAS to a separate backend (llama/6210) * move BLAS to a separate backend * rename GGML_USE_OPENBLAS to GGML_USE_BLAS * alloc : reuse same buffer when the same buffer type is used multiple times * set number of threads automatically for openblas and blis * sched : print assignments when GGML_SCHED_DEBUG env variable is set * sched : allow ops with weights on an incompatible buffer type This will cause the weight to be copied to a backend that supports the op, which is very costly. The weight should have been stored in a buffer of a backend that can run the op, but llama.cpp cannot do this automatically at the moment. --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-16 18:19:48 +03:00
Johannes Gäßler	d8b7a24bc9	CUDA: fix broken oob check for FA vec f32 kernel (llama/7904)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	acf3832c9c	tests : add non-cont unary tests (llama/7857) * tests : add non-cont unary tests * ggml : update unary asserts and "supports_op" ggml-ci	2024-06-16 18:19:48 +03:00
Georgi Gerganov	d29ac44303	ggml : improve ggml_is_contiguous logic (llama/7856) * ggml : improve ggml_is_contiguous logic ggml-ci * ggml : support more contiguous cases ggml-ci	2024-06-16 18:19:48 +03:00
k.h.lai	12638dfef0	vulkan: select only one device for single gpu with multiple drivers (llama/7582)	2024-06-16 18:19:48 +03:00
0cc4m	f100b3b523	Update Vulkan RoPE implementation (llama/7818) * Update Vulkan RoPE implementation * Return nullptr on alloc_buffer when allocation fails, instead of throwing an exception Minor fixes * Fix segfault when running out of VRAM Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-16 18:19:48 +03:00
Johannes Gäßler	a99e213a82	CUDA: int8 tensor cores for MMQ (q4_K, q5_K, q6_K) (llama/7860)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	7483d2b61c	CUDA: use tensor cores for MMQ (llama/7676) * CUDA: int8 tensor cores for MMQ (legacy quants) * fix out-of-bounds writes * __builtin_assume -> GGML_CUDA_ASSUME * fix writeback returning too early	2024-06-16 18:19:48 +03:00
Ben Ashbaugh	1fe5948227	use the correct SYCL context for host USM allocations (llama/7777) Signed-off-by: Ben Ashbaugh <ben.ashbaugh@intel.com>	2024-06-16 18:19:48 +03:00
Johannes Gäßler	760497e1ab	CUDA: revise q8_1 data layout for mul_mat_q (llama/7824)	2024-06-16 18:19:48 +03:00
slaren	b172e7714c	vulkan : reuse parent extra for views (llama/7806) * vulkan : reuse parent extra for views * Fix validation error when multiple compute contexts are used in a graph --------- Co-authored-by: 0cc4m <picard12@live.de>	2024-06-16 18:19:48 +03:00
pengxin99	dc01aadb18	fix softmax r2r result wrong issue (llama/7811)	2024-06-16 18:19:48 +03:00

1435 Commits