Commit Graph

2896 Commits

b0754136be opencl: add tiled mul_mat_f16_f32 (llama/14535)
* add tiled mul_mat_f16_f32

* fix trailing whitespace

* add insightful comments
2025-07-12 19:23:56 +03:00
6f113cbcaa opencl: add set_rows for f16 and f32 (llama/14547)
* opencl: add `set_rows` for `f16` and `f32`

* opencl: better choose workgroup size for `set_rows`
2025-07-12 19:23:56 +03:00
3c21cde540 SYCL: Initial set_rows kernel implementation (llama/14562)
* SYCL: Initial set_rows kernel implementation

* Revert max_threads to 256

* Refactor set_rows and address review comments

* Deduplicate conversion function

* Remove guard before kernel launch and refactor

* Fix and add back SFINAE
2025-07-12 19:23:56 +03:00
fb885fa48b cuda : support Falcon-H1 state size for SSM_SCAN (llama/14602) 2025-07-12 19:23:56 +03:00
2021870fb8 ggml : add ggml_scale_bias (llama/14417)
* ggml : add ggml_scale_bias

* ggml_vec_mad1_f32

* add more simd

* add CUDA

* sycl

* vulkan

* cann (placeholder)

* opencl

* will this fix cpu?

* fix cuda

* suggestions from coderabbit

* fix cann compile error

* vDSP_vsmsa

* rm __ARM_FEATURE_SVE

* use memcpy for op params

* make code look more consistent

* use scalar for __ARM_FEATURE_SVE

* add x param to ggml_vec_mad1_f32
2025-07-12 19:23:56 +03:00
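A minimal scalar sketch of what a fused scale-with-bias op computes, assuming (as the vDSP_vsmsa and ggml_vec_mad1_f32 references above suggest) y[i] = x[i]*s + b; the function name and signature below are illustrative, not the actual ggml API:

```cpp
#include <cstddef>

// Reference loop for a fused scale+bias: y[i] = x[i]*s + b.
// vDSP_vsmsa (vector scalar-multiply scalar-add) computes the same thing;
// the SIMD/CUDA/SYCL/Vulkan/OpenCL paths in the commit implement this math.
static void scale_bias_f32(float * y, const float * x, size_t n, float s, float b) {
    for (size_t i = 0; i < n; ++i) {
        y[i] = x[i] * s + b;
    }
}
```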
48b18f9eb8 ggml : prevent integer overflow in gguf tensor size calculation (llama/14595) 2025-07-12 19:23:56 +03:00
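The usual way to prevent this kind of overflow is to check the multiplication before performing it; a hedged sketch of the pattern (not the actual gguf code):

```cpp
#include <cstdint>
#include <limits>

// Returns false if rows * row_size would overflow uint64_t,
// otherwise stores the product in *out and returns true.
static bool checked_mul_u64(uint64_t rows, uint64_t row_size, uint64_t * out) {
    if (row_size != 0 && rows > std::numeric_limits<uint64_t>::max() / row_size) {
        return false; // would overflow
    }
    *out = rows * row_size;
    return true;
}
```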
fadb3233b6 vulkan: optimize flash attention split_k_reduce (llama/14554)
* vulkan: allow FA split_k with smaller KV values

* vulkan: spread split_k_reduce work across more threads

k_num can get rather large. Use the whole workgroup to reduce the M/L values.

Launch a thread for each element in the HSV dimension of the output. Helps a
lot for large HSV (like deepseek).
2025-07-12 19:23:56 +03:00
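For context, a flash-attention split_k reduce merges per-split partial results; assuming each split produces a running max m, a sum of exponentials l, and an unnormalized output accumulator acc (the usual online-softmax formulation, not necessarily the exact layout of the Vulkan shader), the merge for one output element looks roughly like:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Numerically stable combination of k_num partial results (m[i], l[i], acc[i]).
static float split_k_merge(const std::vector<float> & m,
                           const std::vector<float> & l,
                           const std::vector<float> & acc) {
    float m_max = -INFINITY;
    for (float mi : m) m_max = std::max(m_max, mi);

    float num = 0.0f, den = 0.0f;
    for (size_t i = 0; i < m.size(); ++i) {
        const float w = std::exp(m[i] - m_max); // rescale each split to the global max
        num += w * acc[i];
        den += w * l[i];
    }
    return num / den;
}
```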
9750e4c988 vulkan : fix rope with partial rotation and non-cont src (llama/14582) 2025-07-12 19:23:56 +03:00
c3942b3db6 cuda : fix rope with partial rotation and non-cont src (llama/14580)
* cuda : fix rope non-cont

ggml-ci

* cont : fix multi-rope + add test

ggml-ci

* sycl : try fix

ggml-ci

* cont : fix sycl + clean-up cuda

ggml-ci
2025-07-12 19:23:56 +03:00
98e7beac6c CUDA: add bilinear interpolation for upscale (llama/14563) 2025-07-12 19:23:56 +03:00
7e9c6bbab2 musa: fix build warnings (unused variable) (llama/14561)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-12 19:23:56 +03:00
8e545f466c CUDA: add bf16 and i32 to getrows (llama/14529) 2025-07-12 19:23:56 +03:00
e753b9a952 vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (llama/14485)
Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260

Co-authored-by: Rémy Oudompheng <remyoudompheng@gmail.com>
2025-07-12 19:23:56 +03:00
9d0c408260 vulkan: fix rms_norm+mul fusion (llama/14545)
The fused operation was grabbing the epsilon value from the wrong place.

Add an env var to disable fusion.

Add some missing checks for supported shapes/types.

Handle fused rms_norm+mul in check_results.
2025-07-12 19:23:56 +03:00
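The fused op is an rms_norm immediately followed by an element-wise mul; a scalar reference of the combined math (the epsilon belongs to the rms_norm node, which is where the bug was):

```cpp
#include <cmath>
#include <cstddef>

// Reference for fused rms_norm(x, eps) * w over one row of n elements.
static void rms_norm_mul_row(float * y, const float * x, const float * w,
                             size_t n, float eps) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) sum += (double) x[i] * x[i];
    const float scale = 1.0f / std::sqrt((float)(sum / n) + eps);
    for (size_t i = 0; i < n; ++i) y[i] = x[i] * scale * w[i];
}
```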
3aebb8d5d3 vulkan: Handle updated FA dim2/3 definition (llama/14518)
* vulkan: Handle updated FA dim2/3 definition

Pack mask boolean and n_head_log2 into a single dword to keep the push
constant block under the 128B limit.

* handle null mask for gqa

* allow gqa with dim3>1
2025-07-12 19:23:56 +03:00
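Packing a boolean together with a small integer into a single 32-bit push-constant word is plain bit twiddling; a sketch of the idea (the field layout here is illustrative, not necessarily the one the shader uses):

```cpp
#include <cstdint>

// Pack has_mask into the top bit and n_head_log2 into the low 31 bits of one dword.
static uint32_t pack_mask_nhl2(bool has_mask, uint32_t n_head_log2) {
    return (uint32_t(has_mask) << 31) | (n_head_log2 & 0x7FFFFFFFu);
}

static void unpack_mask_nhl2(uint32_t packed, bool & has_mask, uint32_t & n_head_log2) {
    has_mask    = (packed >> 31) != 0;
    n_head_log2 = packed & 0x7FFFFFFFu;
}
```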
df5af1dc75 opencl: add GELU_ERF (llama/14476) 2025-07-12 19:23:56 +03:00
10d0d28f7c metal : disable fast math in all quantize kernels (llama/14528)
ggml-ci
2025-07-12 19:23:56 +03:00
af304ef080 CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (llama/14002)
Co-authored-by: luyuhong <luyuhong@kylinos.cn>
2025-07-12 19:23:56 +03:00
e8138c51d2 ggml : implement GEGLU_ERF and GEGLU_QUICK ops (llama/14445) 2025-07-12 19:23:56 +03:00
7cec4cc83a opencl : broadcast for soft_max (llama/14510) 2025-07-12 19:23:56 +03:00
a432929d58 vulkan: support mixed/deepseekR1 FA head sizes (llama/14509)
* vulkan: better parameterize FA by head sizes

* vulkan: support mixed/deepseekR1 FA head sizes
2025-07-12 19:23:56 +03:00
4aaf8114e7 ggml: backward pass for split swiglu (llama/14483) 2025-07-12 19:23:56 +03:00
0ca760433c Fix conditional enabling following arch checks for ggml-sycl (llama/14504)
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
2025-07-12 19:23:56 +03:00
ed639c7f22 kv-cache : use ggml_set_rows (llama/14285)
* kv-cache : use ggml_set_rows

ggml-ci

* graph : separate k and v indices

ggml-ci

* cont : remove redundant ifs

ggml-ci

* kv-cache : improve find_slot impl

* kv-cache : bounds-check when accessing slot_info indices

* kv-cache : add comments

ggml-ci

* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends

ggml-ci
2025-07-12 19:23:56 +03:00
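The row-scatter op used here writes rows of a source tensor into a destination at positions given by an index tensor; a minimal scalar sketch of the semantics (layout and types simplified relative to the real GGML_OP_SET_ROWS):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Scatter src rows into dst: dst row idx[r] <- src row r.
static void set_rows_f32(float * dst, const float * src, const int64_t * idx,
                         size_t n_rows, size_t row_size) {
    for (size_t r = 0; r < n_rows; ++r) {
        std::memcpy(dst + idx[r] * row_size, src + r * row_size,
                    row_size * sizeof(float));
    }
}
```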
0abd0660e1 ggml : fix FA mask dim 2 and 3 (llama/14505)
* ggml : fix FA mask dim 2 and 3

ggml-ci

* backends : unsupport batched FA in CUDA and Vulkan

ggml-ci

* vulkan : disable FA for mask->ne[2] != 1
2025-07-12 19:23:56 +03:00
9cde908c0a CUDA: add dynamic shared mem to softmax, refactor general usage (llama/14497) 2025-07-12 19:23:56 +03:00
d2d120c256 llama : initial Mamba-2 support (llama/9126)
* llama : initial Mamba-2 support

* ggml : SIMD ggml_ssm_scan for Mamba-2

* ggml : improve ggml_mul speed when masking recurrent states

* llama : support running Mamba-Codestral-7B-v0.1

* llama : fix Mamba-2 conv state saving

* ggml : make the ggml_mul fast broadcast path more consistently formatted

* llama : remove unused variable

* llama : add missing break

* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.

* llama : avoid redundant state copy for Mamba 1 and 2

* metal : attempt to adapt SSM_SCAN for Mamba-2

* metal : fix SSM_SCAN pipeline scope

* metal : use log and exp instead of log1pf and expf in SSM_SCAN

* metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

* metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

* metal : fix SSM_SCAN state head offset

* metal : fix wrong number of tokens per sequence in SSM_SCAN

* ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.

* ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

* convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models
to avoid some reshapes.

Not sure if it's a good idea,
but it makes the graph slightly cleaner.

* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks

* convert : fix flake8 lint

* metal : fix confusion between ; and ,

* metal : add missing args for nb references in ssm_scan_f32_group

* metal : single-user mamba2 inference works

* kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.

* convert : avoid AutoConfig for Mamba and Mamba2 hparams

* kv-cache : allow context shift for recurrent models

* graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

* ggml : fix mamba2 ssm scan when compiled with SVE

* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches

* cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2

* mamba : fix mismatched new and delete size for llm_build_mamba

Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This would otherwise cause problems when running Mamba-(1|2) inference
when compiled with -DGGML_SANITIZE_ADDRESS=ON.

* cuda : graceful fallback for Mamba-1 models with weird embd size
2025-07-12 19:23:56 +03:00
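For reference, the selective state-space (SSM) scan at the core of Mamba-1/2 is a per-channel linear recurrence; in simplified scalar form (dimensions and the exact parameterization differ between Mamba versions, and as noted above the D term is applied outside the op):

```cpp
#include <cmath>
#include <cstddef>

// One channel of a simplified SSM scan over T timesteps with state size N:
//   h[n] = exp(dt[t]*A[n]) * h[n] + dt[t] * B[t][n] * x[t]
//   y[t] = sum_n C[t][n] * h[n]
static void ssm_scan_1ch(float * y, const float * x, const float * dt,
                         const float * A, const float * B, const float * C,
                         float * h, size_t T, size_t N) {
    for (size_t t = 0; t < T; ++t) {
        float acc = 0.0f;
        for (size_t n = 0; n < N; ++n) {
            h[n] = std::exp(dt[t] * A[n]) * h[n] + dt[t] * B[t*N + n] * x[t];
            acc += C[t*N + n] * h[n];
        }
        y[t] = acc;
    }
}
```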
fb5c4095ee CUDA: add softmax broadcast (llama/14475)
* CUDA: add softmax broadcast

* Pass by const ref

* Review: Use blockDims for indexing, remove designated initializers

* Add TODO for non-contiguous input/output
2025-07-12 19:23:56 +03:00
70515ed728 CUDA: broadcasting for FlashAttention mask (llama/14500) 2025-07-12 19:23:56 +03:00
1b3e06a400 vulkan: support softmax/FA batch and broadcast (llama/14449) 2025-07-12 19:23:56 +03:00
d1286cf32b ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (llama/14435) 2025-07-12 19:23:56 +03:00
2e04b81f3e opencl : fix possible buffer overflow in dump_tensor (llama/14490) 2025-07-12 19:23:56 +03:00
cd87a2f7e0 opencl : skip empty nodes on cgraph compute (llama/14491) 2025-07-12 19:23:56 +03:00
e43c38f9f1 opencl : update upscale to support align corners (llama/14488) 2025-07-12 19:23:56 +03:00
ab850d4680 ggml : Callback before abort (llama/14481)
* Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed.

* Return previous callback to allow callback chaining

* style fixes

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-07-12 19:23:56 +03:00
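The callback-chaining pattern described above (install a handler, keep the previous one, forward to it) can be sketched as follows; the names are hypothetical, not the actual ggml API:

```cpp
#include <cstdio>

typedef void (*abort_cb_t)(const char * msg);

static abort_cb_t g_abort_cb = nullptr;

// Install a new callback and return the previous one, so callers can chain.
static abort_cb_t set_abort_callback(abort_cb_t cb) {
    abort_cb_t prev = g_abort_cb;
    g_abort_cb = cb;
    return prev;
}

// Example: a console-less app shows/saves something, then forwards to the old handler.
static abort_cb_t g_prev = nullptr;
static void my_abort_handler(const char * msg) {
    std::fprintf(stderr, "about to abort: %s\n", msg); // e.g. show a dialog, flush state
    if (g_prev) g_prev(msg);
}
// Usage: g_prev = set_abort_callback(my_abort_handler);
```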
cdf5e72163 ci : disable fast-math for Metal GHA CI (llama/14478)
* ci : disable fast-math for Metal GHA CI

ggml-ci

* cont : remove -g flag

ggml-ci
2025-07-12 19:23:56 +03:00
32d7c10766 CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (llama/14411)
* [CANN] update to aclnnGroupedMatmulV2

Signed-off-by: noemotiovon <757486878@qq.com>

* Support MUL_MAT_ID on 310p

Signed-off-by: noemotiovon <757486878@qq.com>

* fix editorconfig

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-07-12 19:23:56 +03:00
3c7939cfe5 vulkan: Split large mul_mat_id to fit in shared memory (llama/14451) 2025-07-12 19:23:56 +03:00
6fc80e8456 add GELU_ERF (llama/14455) 2025-07-12 19:23:56 +03:00
19b9aaf044 vulkan : implement bilinear interpolation for ggml_upscale/ggml_interpolate (ggml/1291)
* supports GGML_SCALE_MODE_BILINEAR and GGML_SCALE_FLAG_ALIGN_CORNERS
2025-07-12 19:23:56 +03:00
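The align-corners flag changes how output pixel indices map back to source coordinates; a standard formulation of the two conventions (a sketch, not necessarily the exact shader code):

```cpp
// Map an output index to a source coordinate for bilinear sampling.
// With align_corners the first/last samples line up with the image corners;
// without it, pixel centers are used (the common "half-pixel" rule).
static float src_coord(int dst_i, int dst_size, int src_size, bool align_corners) {
    if (align_corners) {
        return dst_size > 1
            ? (float) dst_i * (float) (src_size - 1) / (float) (dst_size - 1)
            : 0.0f;
    }
    const float scale = (float) src_size / (float) dst_size;
    return ((float) dst_i + 0.5f) * scale - 0.5f;
}
```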
f98cb6607b vulkan : implement ggml_roll (ggml/1290)
* vulkan : implement ggml_roll

* vulkan : refactor vk_op_unary_push_constants initialization
2025-07-12 19:23:56 +03:00
5ea5c58768 ggml : add version function to get lib version (ggml/1286)
* ggml : add version function to get lib version

This commit adds a function `ggml_version()` to the ggml library that
returns the version of the library as a string.

The motivation for this is that it can be useful to be able to
programmatically check the version of the ggml library being used.

Usage:
```c
printf("GGML version: %s\n", ggml_version());
```
Output:
```console
GGML version: 0.0.2219
```

* ggml : add ggml_commit()

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-12 19:23:56 +03:00
869335f2d5 server : add dtw.params for v3-large-turbo (#3307)
* Add DTW model large-v3-turbo parameters to server.cpp example

DTW support is available in whisper.cpp and the large-v3-turbo model has already been added to the sources, but the model hasn't been added to the server.cpp example to make use of it. This commit corrects that issue.

* match the original line break of the server.cpp file after adding large-v3-turbo DTW
2025-07-07 12:51:15 +03:00
d9999d54c8 feat: support vad for addon.node (#3301)
Co-authored-by: linxiaodong <calm.lin@wukongsch.com>
2025-07-02 13:14:29 +03:00
bca021c974 sync : ggml
ggml-ci
2025-07-01 17:54:53 +03:00
1f816de7da talk-llama : sync llama.cpp 2025-07-01 17:54:53 +03:00
c4ea72be9a ggml : remove trailing whitespace (llama/0) 2025-07-01 17:54:53 +03:00
1e930ab1b8 opencl : add GEGLU, REGLU, SWIGLU (llama/14456) 2025-07-01 17:54:53 +03:00
b5b237d49a Add Conv2d for CPU (llama/14388)
* Conv2D: Add CPU version

* Half decent

* Tiled approach for F32

* remove file

* Fix tests

* Support F16 operations

* add assert about size

* Review: further formatting fixes, add assert and use CPU version of fp32->fp16
2025-07-01 17:54:53 +03:00
679f31a9d1 metal : disable fast-math for some cpy kernels (llama/14460)
* metal : disable fast-math for some cpy kernels

ggml-ci

* cont : disable for q4_1

ggml-ci

* cont : disable for iq4_nl

ggml-ci
2025-07-01 17:54:53 +03:00