whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-08-14 04:18:43 +02:00

Author	SHA1	Message	Date
Georgi Gerganov	c3942b3db6	cuda : fix rope with partial rotation and non-cont src (llama/14580) * cuda : fix rope non-cont ggml-ci * cont : fix multi-rope + add test ggml-ci * sycl : try fix ggml-ci * cont : fix sycl + clean-up cuda ggml-ci	2025-07-12 19:23:56 +03:00
Aman Gupta	98e7beac6c	CUDA: add bilinear interpolation for upscale (llama/14563)	2025-07-12 19:23:56 +03:00
R0CKSTAR	7e9c6bbab2	musa: fix build warnings (unused variable) (llama/14561) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-07-12 19:23:56 +03:00
Aman Gupta	8e545f466c	CUDA: add bf16 and i32 to getrows (llama/14529)	2025-07-12 19:23:56 +03:00
Eve	e753b9a952	vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (llama/14485) Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260 Co-authored-by: Rémy Oudompheng <remyoudompheng@gmail.com>	2025-07-12 19:23:56 +03:00
Jeff Bolz	9d0c408260	vulkan: fix rms_norm+mul fusion (llama/14545) The fused operation was grabbing the epsilon value from the wrong place. Add an env var to disable fusion. Add some missing checks for supported shapes/types. Handle fused rms_norm+mul in check_results.	2025-07-12 19:23:56 +03:00
Jeff Bolz	3aebb8d5d3	vulkan: Handle updated FA dim2/3 definition (llama/14518) * vulkan: Handle updated FA dim2/3 definition Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit. * handle null mask for gqa * allow gqa with dim3>1	2025-07-12 19:23:56 +03:00
Sigbjørn Skjæret	df5af1dc75	opencl: add GELU_ERF (llama/14476)	2025-07-12 19:23:56 +03:00
Georgi Gerganov	10d0d28f7c	metal : disable fast math in all quantize kernels (llama/14528) ggml-ci	2025-07-12 19:23:56 +03:00
luyhcsu	af304ef080	CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (llama/14002) Co-authored-by: luyuhong <luyuhong@kylinos.cn>	2025-07-12 19:23:56 +03:00
Sigbjørn Skjæret	e8138c51d2	ggml : implement GEGLU_ERF and GEGLU_QUICK ops (llama/14445)	2025-07-12 19:23:56 +03:00
lhez	7cec4cc83a	opencl : broadcast for soft_max (llama/14510)	2025-07-12 19:23:56 +03:00
Jeff Bolz	a432929d58	vulkan: support mixed/deepseekR1 FA head sizes (llama/14509) * vulkan: better parameterize FA by head sizes * vulkan: support mixed/deepseekR1 FA head sizes	2025-07-12 19:23:56 +03:00
Johannes Gäßler	4aaf8114e7	ggml: backward pass for split swiglu (llama/14483)	2025-07-12 19:23:56 +03:00
Nicolò Scipione	0ca760433c	Fix conditional enabling following arch checks for ggml-sycl (llama/14504) Signed-off-by: nscipione <nicolo.scipione@codeplay.com>	2025-07-12 19:23:56 +03:00
Georgi Gerganov	ed639c7f22	kv-cache : use ggml_set_rows (llama/14285) * kv-cache : use ggml_set_rows ggml-ci * graph : separate k and v indices ggml-ci * cont : remove redundant ifs ggml-ci * kv-cache : improve find_slot impl * kv-cache : bounds-check when accessing slot_info indices * kv-cache : add comments ggml-ci * ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends ggml-ci	2025-07-12 19:23:56 +03:00
Georgi Gerganov	0abd0660e1	ggml : fix FA mask dim 2 and 3 (llama/14505) * ggml : fix FA mask dim 2 and 3 ggml-ci * backends : unsupport batched FA in CUDA and Vulkan ggml-ci * vulkan : disable FA for mask->ne[2] != 1	2025-07-12 19:23:56 +03:00
Aman Gupta	9cde908c0a	CUDA: add dynamic shared mem to softmax, refactor general usage (llama/14497)	2025-07-12 19:23:56 +03:00
compilade	d2d120c256	llama : initial Mamba-2 support (llama/9126) * llama : initial Mamba-2 support * ggml : SIMD ggml_ssm_scan for Mamba-2 * ggml : improve ggml_mul speed when masking recurrent states * llama : support running Mamba-Codestral-7B-v0.1 * llama : fix Mamba-2 conv state saving * ggml : make the ggml_mul fast broadcast path more consistently formatted * llama : remove unused variable * llama : add missing break * convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly. * llama : avoid redundant state copy for Mamba 1 and 2 * metal : attempt to adapt SSM_SCAN for Mamba-2 * metal : fix SSM_SCAN pipeline scope * metal : use log and exp instead of log1pf and expf in SSM_SCAN * metal : remove unused arguments for SSM_SCAN The max index is 31, so trimming the arguments is necessary. * metal : add back n_seqs to SSM_SCAN args Whoops, this is needed for the offset in the concatenated output. * metal : fix SSM_SCAN state head offset * metal : fix wrong number of tokens per sequence in SSM_SCAN * ggml : remove unused fast broadcast path in GGML_MUL This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity. * ggml : avoid multiply by D in GGML_OP_SSM_SCAN This makes the weight buft detection in src/llama.cpp simpler. * convert : transpose Mamba-2 A, D and reshape SSM_NORM This breaks existing conversions of Mamba-2 models to avoid some reshapes. Not sure if it's a good idea, but it makes the graph slightly cleaner. * llama : more appropriate SSM_SCAN and SSM_CONV buft support checks * convert : fix flake8 lint * metal : fix confusion between ; and , * metal : add missing args for nb references in ssm_scan_f32_group * metal : single-user mamba2 inference works * kv-cache : remove const_cast when setting inputs for s_copy And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy. * convert : avoid AutoConfig for Mamba and Mamba2 hparams * kv-cache : allow context shift for recurrent models * graph : fix recurrent state copies when avoiding copies Works, but using lambda functions might not be that clean. * ggml : fix mamba2 ssm scan when compiled with SVE * ggml-cpu : reorder SVE FMA for consistency with other SIMD arches * cuda : implement ssm scan for Mamba2 There is still room for improvement, but it works! * cuda : adapt Mamba1 ssm scan to shape changes from Mamba2 * mamba : fix mismatched new and delete size for llm_build_mamba Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when running Mamba-(1|2) inference when compiled with -DGGML_SANITIZE_ADDRESS=ON * cuda : graceful fallback for Mamba-1 models with weird embd size	2025-07-12 19:23:56 +03:00
Aman Gupta	fb5c4095ee	CUDA: add softmax broadcast (llama/14475) * CUDA: add softmax broadcast * Pass by const ref * Review: Use blockDims for indexing, remove designated initializers * Add TODO for noncontiguous input/output	2025-07-12 19:23:56 +03:00
Johannes Gäßler	70515ed728	CUDA: broadcasting for FlashAttention mask (llama/14500)	2025-07-12 19:23:56 +03:00
Jeff Bolz	1b3e06a400	vulkan: support softmax/FA batch and broadcast (llama/14449)	2025-07-12 19:23:56 +03:00
Georgi Gerganov	d1286cf32b	ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (llama/14435)	2025-07-12 19:23:56 +03:00
zhouwg	2e04b81f3e	opencl : fix possible buffer overflow in dump_tensor (llama/14490)	2025-07-12 19:23:56 +03:00
Eric Zhang	cd87a2f7e0	opencl : skip empty nodes on cgraph compute (llama/14491)	2025-07-12 19:23:56 +03:00
lhez	e43c38f9f1	opencl : update upscale to support align corners (llama/14488)	2025-07-12 19:23:56 +03:00
Björn Ganster	ab850d4680	ggml : Callback before abort (llama/14481) * Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed. * Return previous callback to allow callback chaining * style fixes --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-07-12 19:23:56 +03:00
Georgi Gerganov	cdf5e72163	ci : disable fast-math for Metal GHA CI (llama/14478) * ci : disable fast-math for Metal GHA CI ggml-ci * cont : remove -g flag ggml-ci	2025-07-12 19:23:56 +03:00
Chenguang Li	32d7c10766	CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (llama/14411) * [CANN]update to aclnnGroupedMatmulV2 Signed-off-by: noemotiovon <757486878@qq.com> * Support MUL_MAT_ID on 310p Signed-off-by: noemotiovon <757486878@qq.com> * fix editorconfig Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-07-12 19:23:56 +03:00
Jeff Bolz	3c7939cfe5	vulkan: Split large mul_mat_id to fit in shared memory (llama/14451)	2025-07-12 19:23:56 +03:00
Sigbjørn Skjæret	6fc80e8456	add GELU_ERF (llama/14455)	2025-07-12 19:23:56 +03:00
Acly	19b9aaf044	vulkan : implement bilinear interpolation for ggml_upscale/ggml_interpolate (ggml/1291) * supports GGML_SCALE_MODE_BILINEAR and GGML_SCALE_FLAG_ALIGN_CORNERS	2025-07-12 19:23:56 +03:00
Acly	f98cb6607b	vulkan : implement ggml_roll (ggml/1290) * vulkan : implement ggml_roll * vulkan : refactor vk_op_unary_push_constants initialization	2025-07-12 19:23:56 +03:00
Daniel Bevenius	5ea5c58768	ggml : add version function to get lib version (ggml/1286) * ggml : add version function to get lib version This commit adds a function `ggml_version()` to the ggml library that returns the version of the library as a string. The motivation for this is that it can be useful to be able to programmatically check the version of the ggml library being used. Usage: ```c printf("GGML version: %s\n", ggml_version()); ``` Output: ```console GGML version: 0.0.2219 ``` * ggml : add ggml_commit() --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-07-12 19:23:56 +03:00
accessiblepixel	869335f2d5	server : add dtw.params for v3-large-turbo (#3307) * Add DTW model large-v3-turbo parameters to server.cpp example DTW support is available in whisper.cpp and the large-v3-turbo model has already been added to the sources, but the large-v3-turbo model hasn't been added to the server.cpp file to make use of it. This commit hopefully corrects that issue. * match original linebreak of original server.cpp file after adding large.v3.turbo dtw	2025-07-07 12:51:15 +03:00
Lin Xiaodong	d9999d54c8	feat: support vad for addon.node (#3301) Co-authored-by: linxiaodong <calm.lin@wukongsch.com>	2025-07-02 13:14:29 +03:00
Georgi Gerganov	bca021c974	sync : ggml ggml-ci	2025-07-01 17:54:53 +03:00
Georgi Gerganov	1f816de7da	talk-llama : sync llama.cpp	2025-07-01 17:54:53 +03:00
Georgi Gerganov	c4ea72be9a	ggml : remove trailing whitespace (llama/0)	2025-07-01 17:54:53 +03:00
lhez	1e930ab1b8	opencl : add GEGLU, REGLU, SWIGLU (llama/14456)	2025-07-01 17:54:53 +03:00
Aman Gupta	b5b237d49a	Add Conv2d for CPU (llama/14388) * Conv2D: Add CPU version * Half decent * Tiled approach for F32 * remove file * Fix tests * Support F16 operations * add assert about size * Review: further formatting fixes, add assert and use CPU version of fp32->fp16	2025-07-01 17:54:53 +03:00
Georgi Gerganov	679f31a9d1	metal : disable fast-math for some cpy kernels (llama/14460) * metal : disable fast-math for some cpy kernels ggml-ci * cont : disable for q4_1 ggml-ci * cont : disable for iq4_nl ggml-ci	2025-07-01 17:54:53 +03:00
Romain Biessy	e29e36aee7	ggml-cpu: sycl: Re-enable exp f16 (llama/14462)	2025-07-01 17:54:53 +03:00
xiaobing318	6bb1234a56	cmake : Remove redundant include path in CMakeLists.txt (llama/14452) * Update docker.yml Modify docker.yml so that the workflow no longer runs periodically; it can still be started manually when needed. * Remove redundant include path in CMakeLists.txt The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths. * Enable scheduled Docker image builds Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.	2025-07-01 17:54:53 +03:00
Vedran Miletić	3239359bd1	scripts : make the shell scripts cross-platform (llama/14341)	2025-07-01 17:54:53 +03:00
Akarshan Biswas	e81be92931	SYCL: disable faulty fp16 exp kernel (llama/14395) * SYCL: disable faulty fp16 CPU exponent for now * Revert "SYCL: disable faulty fp16 CPU exponent for now" This reverts commit ed0aab1ec31b4eb4b0f275dd7acd41d96a375202. * SYCL: disable faulty fp16 CPU exponent for now * Fix logic of disabling exponent kernel	2025-07-01 17:54:53 +03:00
Sigbjørn Skjæret	130044f228	ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (llama/14443)	2025-07-01 17:54:53 +03:00
Sigbjørn Skjæret	8bc638ee56	ggml : implement REGLU/GEGLU/SWIGLU ops (llama/14158) * implement unary REGLU/GEGLU/SWIGLU cpu ops * relax constraints * duplicate shape of source * fix ggml_vec_geglu_f16 * special case gated ops * implement unary REGLU/GEGLU/SWIGLU cuda ops * tighten constraints again * refactor into GGML_GLU_OP * metal : add glu kernels ggml-ci * add CUDA_GLU_BLOCK_SIZE [no ci] * more constraints and use 64bit ints ggml-ci * 64bit multiplication [no ci] * implement swapped variants (cpu/cuda) * update comment [no ci] ggml-ci * Vulkan: Add GLU ops and shaders * SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate * ggml : implement GLU for split up/gate (llama/14181) * implement GLU for split up/gate * add tests for ggml_glu_split * Vulkan: Implement glu_split logic and shader support * add split to logging [no ci] * SYCL: refactor element_size ops and add split up and gate support to gated kernels * SYCL: switch GEGLU to use tanh approximation --------- Co-authored-by: 0cc4m <picard12@live.de> Co-authored-by: Akarshan <akarshan@menlo.ai> * GGML: increase OP count in assertion * Refactor: Optimize SYCL element-wise operations with unary function inlining This commit refactors the SYCL element-wise operations to improve performance by: - Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead. - Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic. - Replacing direct kernel calls with calls to these inlined functions. - Using `__dpct_inline__` to encourage compiler inlining. - Minor code cleanup and consistency improvements. The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices. * vulkan: Increase workgroup size for GLU, for performance (llama/14345) * vulkan: Increase workgroup size for GLU, for performance * vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup * merge fix * metal : add support for split and swap ggml-ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: 0cc4m <picard12@live.de> Co-authored-by: Akarshan <akarshan@menlo.ai> Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-07-01 17:54:53 +03:00
Jeff Bolz	00b36237ba	vulkan: Add fusion support for RMS_NORM+MUL (llama/14366) * vulkan: Add fusion support for RMS_NORM+MUL - Add a use_count to ggml_tensor, so we can detect if an output is used more than once. - Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor. - Add detection logic and basic fusion logic in ggml-vulkan. - Add some testing support for fusion. Rather than computing one node at a time, allow for computing the whole graph and just testing one node's results. Add rms_norm_mul tests and enable a llama test. * extract some common fusion logic * fix -Winconsistent-missing-override * move ggml_can_fuse to a common function * build fix * C and C++ versions of can_fuse * move use count to the graph to avoid data races and double increments when used in multiple threads * use hash table lookup to find node index * change use_counts to be indexed by hash table slot * minimize hash lookups style fixes * last node doesn't need single use. fix type. handle mul operands being swapped. * remove redundant parameter --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-07-01 17:54:53 +03:00
Aman Gupta	b900ee424c	CUDA: add bf16 and f32 support to cublas_mul_mat_batched (llama/14361) * CUDA: add bf16 and f32 support to cublas_mul_mat_batched * Review: add type traits and make function more generic * Review: make check more explicit, add back comments, and fix formatting * Review: fix formatting, remove useless type conversion, fix naming for bools	2025-07-01 17:54:53 +03:00
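
Note on commit 3aebb8d5d3 (vulkan: Handle updated FA dim2/3 definition): the message says the mask boolean and n_head_log2 are packed into a single dword so the push constant block stays under the 128B limit. A minimal sketch of that kind of packing follows. The bit layout, the macro names, and the helper names are assumptions made for illustration only; they are not taken from the commit and may differ from what ggml-vulkan actually does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout (an assumption, not taken from the commit):
 *   bits 0..15 : n_head_log2
 *   bit  16    : "mask present" flag
 */
#define N_HEAD_LOG2_BITS 16u
#define N_HEAD_LOG2_MASK ((1u << N_HEAD_LOG2_BITS) - 1u)

/* Pack the flag and the value into one 32-bit word (one dword). */
static inline uint32_t pack_mask_n_head_log2(bool has_mask, uint32_t n_head_log2) {
    assert(n_head_log2 <= N_HEAD_LOG2_MASK); /* value must fit in the low bits */
    return ((uint32_t) has_mask << N_HEAD_LOG2_BITS) | (n_head_log2 & N_HEAD_LOG2_MASK);
}

/* Recover the "mask present" flag from the packed word. */
static inline bool unpack_has_mask(uint32_t packed) {
    return ((packed >> N_HEAD_LOG2_BITS) & 1u) != 0u;
}

/* Recover n_head_log2 from the packed word. */
static inline uint32_t unpack_n_head_log2(uint32_t packed) {
    return packed & N_HEAD_LOG2_MASK;
}
```

Compared with storing the flag and the value as two separate 32-bit fields, this kind of packing saves one dword (4 bytes) in the block, at the cost of a shift and a mask when the fields are read back.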

2888 Commits