Commit Graph

2971 Commits

Author SHA1 Message Date
28b39c624e ggml : remove old kompute, cann (skip) (#3349)
ggml-ci
2025-07-30 16:08:57 +03:00
d0a9d8c7f8 talk-llama : sync llama.cpp 2025-07-28 13:02:32 +03:00
5b4646df1a sync : ggml
ggml-ci
2025-07-28 13:02:32 +03:00
d96f4d8ea1 vulkan : add fp16 support for the conv_2d kernel (llama/14872)
* add f16 to conv_2d testing
* weaken conv2d test error threshold
2025-07-28 13:02:32 +03:00
5693b857d2 vulkan: skip empty set_rows to avoid invalid API usage (llama/14860) 2025-07-28 13:02:32 +03:00
b275e52b46 HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (llama/14624)
This commit adds support for MFMA instructions to MMQ. CDNA1/GFX908, CDNA2/GFX90a and CDNA3/GFX942 are supported by the MFMA-enabled code path added by this commit. The code path and stream-K are only enabled on CDNA3 for now, as they fail to outperform BLAS in all cases on the other devices.
BLAS is currently only consistently outperformed on CDNA3, due to issues in the AMD-provided BLAS libraries.
This commit also makes MMQ more aware of different warp sizes and, as a side effect, improves the performance on GCN GPUs of all quant formats besides q4_0 and q4_1, which regress slightly.
2025-07-28 13:02:32 +03:00
4692558a1f CANN: Implement GLU ops (llama/14884)
Implement REGLU, GEGLU, SWIGLU ops according to #14158
2025-07-28 13:02:32 +03:00
8643960acc musa: fix build warnings (unused variable) (llama/14869)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-28 13:02:32 +03:00
6629201471 ggml-cpu : disable GGML_NNPA by default due to instability (llama/14880)
* docs: update s390x document for sentencepiece

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit e086c5e3a7ab3463d8e0906efcfa39352db0a48d)

* docs: update huggingface links + reword

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 8410b085ea8c46e22be38266147a1e94757ef108)

* ggml-cpu: disable ggml-nnpa compile flag by default

fixes #14877

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 412f4c7c88894b8f55846b4719c76892a23cfe09)

* docs: update s390x build docs to reflect nnpa disable

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit c1eeae1d0c2edc74ab9fbeff2707b0d357cf0b4d)

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-28 13:02:32 +03:00
0b0de0bbf2 metal: SSM_SCAN performance (llama/14743)
* feat: Add s_off as a parameter in the args struct

This may not be necessary, but it more closely mirrors the CUDA kernel

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* perf: Parallelize mamba2 SSM_SCAN metal kernel over d_state

This is a first attempt at optimizing the metal kernel. The changes here
are:

- Launch the kernel with a thread group of size d_state
- Use simd groups and shared memory to do the summation for the y
  computation

When tested with G4 tiny preview, this shows roughly a 3x speedup on
prefill and a 15% speedup on decode.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Update logic to correctly do the multi-layer parallel sum

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Correctly size the shared memory buffer and assert expected size relationships

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Compute block offsets once rather than once per token

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Use local variable for state recursion

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Use a secondary simd_sum instead of a for loop

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add assertion and comment about relationship between simd size and num simd groups

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parallelize over d_state for mamba-1

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parallel sum in SSM_CONV

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Revert "feat: Parallel sum in SSM_CONV"

After discussion with @compilade, the amount of parallelism here is not
worth the cost in complexity or the overhead of the parallel for.

https://github.com/ggml-org/llama.cpp/pull/14743#discussion_r2223395357

This reverts commit 16bc059660c1c59e566628201c0ca2c20c9f4bc3.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Simplify shared memory sizing

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-Authored-By: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-28 13:02:32 +03:00
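The two-stage reduction referenced above ("a secondary simd_sum instead of a for loop") generalizes beyond Metal. Below is a minimal sketch of the same pattern written as its CUDA analogue (warp shuffles in place of Metal's `simd_sum`), since the commit notes the Metal kernel mirrors the CUDA one; all names are illustrative, not the kernel's actual identifiers, and the block size is assumed to be a multiple of 32.

```cuda
// Two-stage block-wide sum: stage 1 reduces within each warp, stage 2 reduces
// the per-warp partials with a second warp-level reduction instead of a
// serial loop. Assumes the number of warps per block does not exceed the warp
// size (32), matching the simd-size assertion added in the commit.
__device__ float block_sum(float v, float * shared /* one slot per warp */) {
    const int lane   = threadIdx.x % 32;
    const int warp   = threadIdx.x / 32;
    const int nwarps = (blockDim.x + 31) / 32;

    for (int off = 16; off > 0; off >>= 1) {     // stage 1: intra-warp sum
        v += __shfl_down_sync(0xffffffff, v, off);
    }
    if (lane == 0) {
        shared[warp] = v;                        // one partial per warp
    }
    __syncthreads();

    float p = (warp == 0 && lane < nwarps) ? shared[lane] : 0.0f;
    if (warp == 0) {
        for (int off = 16; off > 0; off >>= 1) { // stage 2: sum of partials
            p += __shfl_down_sync(0xffffffff, p, off);
        }
    }
    return p;                                    // total is valid in lane 0 of warp 0
}
```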
d414c3f6ac opencl: add fused rms_norm_mul (llama/14841)
* opencl: add fused `rms_norm` + `mul`

* opencl: improve workgroup size for `rms_norm_mul`
2025-07-28 13:02:32 +03:00
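For context on what this fusion buys: instead of writing the rms_norm result to global memory and launching a second kernel for the multiply, a fused kernel computes y = (x * rsqrt(mean(x^2) + eps)) * w in a single pass per row. A minimal sketch in CUDA-style code (the commit itself is OpenCL; the names, the one-warp-per-row launch, and the absence of broadcasting handling are simplifying assumptions):

```cuda
// Fused rms_norm + mul over one row per block; launch as <<<nrows, 32>>>.
__global__ void rms_norm_mul(const float * x, const float * w, float * y,
                             int ncols, float eps) {
    const float * xr = x + (size_t) blockIdx.x * ncols;
    float       * yr = y + (size_t) blockIdx.x * ncols;

    float sum = 0.0f;                        // strided partial sum of squares
    for (int i = threadIdx.x; i < ncols; i += blockDim.x) {
        sum += xr[i] * xr[i];
    }
    for (int off = 16; off > 0; off >>= 1) { // butterfly: all 32 lanes get the total
        sum += __shfl_xor_sync(0xffffffff, sum, off);
    }

    const float scale = rsqrtf(sum / ncols + eps);
    for (int i = threadIdx.x; i < ncols; i += blockDim.x) {
        yr[i] = xr[i] * scale * w[i];        // normalize and multiply in one store
    }
}
```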
bbf2389919 ggml : remove invalid portPos specifiers from dot files (llama/14838)
Neither "g" nor "x" are valid portPos specifiers per the official
[graphviz documents](https://graphviz.org/docs/attr-types/portPos/):

> If a compass point is used, it must have the form "n","ne","e","se","s","sw","w","nw","c","_".

I tested locally for it to fall back to default portPos specifier if an
invalid portPos is specified. As a consequence, we can remove associated
code.
2025-07-28 13:02:32 +03:00
56350ecc12 rpc : check for null buffers in get/set/copy tensor endpoints (llama/14868) 2025-07-28 13:02:32 +03:00
270fa9b25c sched : fix multiple evaluations of the same graph with pipeline parallelism (llama/14855)
ggml-ci
2025-07-28 13:02:32 +03:00
89ae789450 musa: upgrade musa sdk to rc4.2.0 (llama/14498)
* musa: apply mublas API changes

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: update musa version to 4.2.0

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: restore MUSA graph settings in CMakeLists.txt

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: disable mudnnMemcpyAsync by default

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: switch back to non-mudnn images

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* minor changes

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: restore rc in docker image tag

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-28 13:02:32 +03:00
5823eabc78 cmake : Indent ggml-config.cmake (ggml/1310) 2025-07-28 13:02:32 +03:00
7dc5ae2d6a sycl: fixed semantics of block offset calculation (llama/14814) 2025-07-28 13:02:32 +03:00
faedce5dcb metal : fix fusion across different encoders (llama/14849)
* metal : fix fusion across different encoders

ggml-ci

* cont : add assertion

ggml-ci
2025-07-28 13:02:32 +03:00
e648f9f079 sycl: fix undefined variable in work group size check (llama/14843) 2025-07-28 13:02:32 +03:00
95efcf011d CUDA: fix overflow in FA, tune performance (llama/14840) 2025-07-28 13:02:32 +03:00
8272aa9f14 CUDA: fix compilation with GGML_CUDA_F16 (llama/14837) 2025-07-28 13:02:32 +03:00
a65976fc3c CUDA: fix quantized KV cache + multiple sequences (llama/14822)
* CUDA: fix quantized KV cache + multiple sequences

* Update ggml/src/ggml-cuda/fattn-common.cuh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-28 13:02:32 +03:00
026d8a0c6e ggml: fix loongarch quantize_row_q8_1 error (llama/14827) 2025-07-28 13:02:32 +03:00
49d5540206 CANN: weight format to NZ for Ascend310P3 (llama/14407)
* weight format to nz for 310p

* remove quant weight format to nz

* clean code

* fix

* make the conditions for converting weights to NZ format consistent

* clean code
2025-07-28 13:02:32 +03:00
f8402d0a95 CUDA: add fused rms norm (llama/14800) 2025-07-28 13:02:32 +03:00
c91361379a vulkan: fix rms_norm_mul to handle broadcasting dim0 (llama/14817) 2025-07-28 13:02:32 +03:00
810018a63a cuda : implement bf16 cpy ops and enable bf16 cont (llama/14763)
* implement bf16 cpy ops and enable bf16 cont

* deduplicate copy functions

* deduplicate checks
2025-07-28 13:02:32 +03:00
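At its core, a bf16 cpy op is an elementwise convert-and-store; the real code path is templated over src/dst type pairs, but a hedged single-case sketch (f32 -> bf16 only, illustrative kernel name) looks like:

```cuda
#include <cuda_bf16.h>

// One thread per element: load f32, round to bf16, store.
__global__ void cpy_f32_bf16(const float * src, __nv_bfloat16 * dst, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = __float2bfloat16(src[i]);
    }
}
```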
de49384ab3 opencl: remove unreachable return (llama/14806) 2025-07-28 13:02:32 +03:00
9008410087 cuda: remove linking to cublasLt (llama/14790)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-28 13:02:32 +03:00
e81e17b048 opencl: fix im2col when KW!=KH (llama/14803) 2025-07-28 13:02:32 +03:00
a2a5612402 opencl: add conv2d kernel (llama/14403)
* add conv2d kernel

* fix trailing whitespace

* whitespace fix

* handle f16 input and f16 kernel, more optimizations

* resolve conflicts

* use enqueue_ndrange_kernel
2025-07-28 13:02:32 +03:00
52ad451c8a sycl: Fix im2col (llama/14797) 2025-07-28 13:02:32 +03:00
fc2ff438fd kleidiai: add support for get_rows (llama/14676)
* kleidiai: add support for get_rows

* apply fixes based on code review

* apply more fixes based on code review
2025-07-28 13:02:32 +03:00
e3f4162a06 vulkan/cuda: Fix im2col when KW!=KH (llama/14789)
The tid is decomposed into "ow + ky*OW + kx*OW*KH". Change "ksize" to match.
2025-07-28 13:02:32 +03:00
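A short sketch of the index math behind this fix (illustrative helper, not the actual kernel code): with the flat id decomposed as tid = ow + ky*OW + kx*OW*KH, the stride between consecutive kx values is OW*KH, so the "ksize" constant must be OW*KH as well; a KW-based value only coincides with it when KW == KH.

```cuda
// Recover (ow, ky, kx) from the flat thread id tid = ow + ky*OW + kx*OW*KH.
__device__ void decompose_tid(int tid, int OW, int KH,
                              int & ow, int & ky, int & kx) {
    ow = tid % OW;
    ky = (tid / OW) % KH;
    kx =  tid / (OW * KH);
}
```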
92a9e85d8b ggml: adds CONV_2D op and direct GEMM Vulkan implementation (llama/14316)
* ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan

* ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly
with gemm (no need for im2col)

* test-backend-ops: adds test_case_ref to check the validity/performance of ops
against reference implementations having different graphs, adds tests

* Performance fixes: minimized branch divergence, used collectives to
  eliminate redundant calculation, removed macros.

* Kernel shared memory size check

* Updates test-backend-ops to support graphs for performance
  measurement.

* Apple/Win32 compile errors fixed

* Subgroup size used to determine tile size -> fixes llvmpipe errors.

* Collectives disabled by default.

* Intel support is disabled as the performance is poor.

* Conv2d enabled for Intel with disabled collectives, disabled for Apple

* test-backend-ops modifications are reverted

* Trailing spaces and missing override fixed.

* Triggering pipeline relaunch.

* Code formatted with .clang-format.
2025-07-28 13:02:32 +03:00
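The core idea of the direct approach: conv2d is a GEMM, out[oc][p] = sum_k kernel[oc][k] * patch(k, p), whose B-matrix elements are gathered from the input on the fly instead of being materialized by im2col. A scalar sketch of that gather (written in CUDA-style code for illustration; the actual implementation is a tiled f32 Vulkan shader, and the NCHW layout, stride s, dilation d, and padding pad are assumptions):

```cuda
// Virtual im2col element (k, p): k indexes (channel, ky, kx), p indexes the
// output pixel. Out-of-bounds reads return 0, giving implicit zero padding.
__device__ float patch(const float * x, int k, int p,
                       int IW, int IH, int KW, int KH, int OW,
                       int s, int d, int pad) {
    const int kx = k % KW, ky = (k / KW) % KH, c = k / (KW * KH);
    const int ox = p % OW, oy = p / OW;
    const int ix = ox * s + kx * d - pad;
    const int iy = oy * s + ky * d - pad;
    if (ix < 0 || ix >= IW || iy < 0 || iy >= IH) {
        return 0.0f;
    }
    return x[(c * IH + iy) * IW + ix];
}
```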
50f983a17e vulkan: Add logging for bf16 features to ggml_vk_print_gpu_info (#13274) (llama/14707) 2025-07-28 13:02:32 +03:00
b06f314667 Vulkan: Fix fprintf format-security warning (llama/14770) 2025-07-28 13:02:32 +03:00
5c3b794c51 cmake : fix usage issues (ggml/1257)
* CMake config: Create target only once

Fix error on repeated find_package(ggml).
For simplicity, check only for the top-level ggml::ggml.

* CMake config: Add CUDA link libs

* CMake config: Add OpenCL link libs

* CMake config: Use canonical find_dependency

Use set and append to control link lib variables.
Apply more $<LINK_ONLY...>.

* CMake config: Wire OpenMP dependency
2025-07-28 13:02:32 +03:00
e238dc1bdd ggml-cpu : remove stdlib include from repack.cpp (ggml/1276)
This commit removes the inclusion of `<cstdlib>`.

The motivation for this change is that this source file does not seem to
use any functions from this header and the comment about `qsort` is a
little misleading/confusing.
2025-07-28 13:02:32 +03:00
e7bf0294ec Support static xcframework packaging in build-xcframework.sh (#3322)
* This commit allows building a static xcframework by adding a
BUILD_STATIC_XCFRAMEWORK option. When enabled, the build-xcframework.sh
script builds a self-contained static whisper.xcframework.

The motivation for this change is so that command line binaries can link
whisper.cpp without forcing users to install the whisper.xcframework
separately.

* Update build-xcframework.sh

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* Address reviewer feedback: remove extra indentation around static xcframework creation.

* squash! Address reviewer feedback: remove extra indentation around static xcframework creation.

Fix whitespaces.

---------

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2025-07-26 12:25:44 +02:00
7de8dd783f examples : add note about WHISPER_WASM_SINGLE_FILE [no ci] (#3332)
This commit adds a note to the README files of the WASM examples
about the `WHISPER_WASM_SINGLE_FILE` option.

The motivation for this is that currently this option is not documented
and might be surprising to users who expect a separate .wasm file to be
generated.

Refs: https://github.com/ggml-org/whisper.cpp/issues/3290
2025-07-24 16:06:48 +02:00
85e474fd55 ci : add paths to build.yml (#3333)
This commit adds specific paths to the GitHub Actions workflow file
`.github/workflows/build.yml`.

The motivation for this is to avoid unnecessary builds when unrelated files
are changed, which saves resources and time during the CI process.

Refs: https://github.com/ggml-org/whisper.cpp/issues/3285
2025-07-24 16:04:21 +02:00
210bbbe4d5 musa: upgrade musa sdk to rc4.2.0 (#3324)
* musa: upgrade musa sdk to 4.2.0

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: restore rc in docker image tag

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-24 13:19:57 +03:00
1f5cf0b288 server : hide language probabilities option behind flag (#3328)
* examples/server: hide language probabilities option behind flag

* code review

* fix
2025-07-21 13:03:54 +02:00
2e6be2f380 go: fix Mac OS X builds (#3310)
This commit fixes the Go bindings build for Mac OS X (15.1), which is currently failing.

Co-authored-by: Chaitanya Bayapuneni <bvk@mini.cinnamon-interval.ts.net>
2025-07-21 08:47:35 +02:00
c0dc391349 sync : ggml
ggml-ci
2025-07-20 00:23:50 +03:00
0ed687c6f1 metal : fuse add, mul + add tests (llama/14596)
ggml-ci
2025-07-20 00:23:50 +03:00
d4a7ea1634 cuda : Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs (llama/14741)
* Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs

Gemma3n uses matrix-matrix addition as part of its input processing,
wrongly triggering CUDA_GRAPH disablement on NVGPUs even when a batch size
of 1 is used.

* Exclude `project_per_layer_input` by matching node names

This ensures that all other graphs which don't exhibit this pattern do
not have their behavior changed.

* Revert unnecessary formatting changes
2025-07-20 00:23:50 +03:00
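A hedged sketch of the exclusion heuristic described above (the helper name and the substring match are illustrative; only the matched node name `project_per_layer_input` comes from the commit):

```cuda
#include <cstring>

// Nodes belonging to Gemma3n's per-layer input projection are exempted from
// the mat-add check, so they no longer disable CUDA graphs; every other graph
// keeps its existing behavior.
static bool node_is_gemma3n_per_layer_input(const char * node_name) {
    return strstr(node_name, "project_per_layer_input") != nullptr;
}
```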
9a07cb064a CUDA: set_rows + cpy.cu refactor (llama/14712) 2025-07-20 00:23:50 +03:00
fed20b0682 use max work group size for device to replace the magic number (llama/14732) 2025-07-20 00:23:50 +03:00