whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2024-11-07 08:34:37 +01:00

Author	SHA1	Message	Date
Johannes Gäßler	84713613be	CUDA: fix 1D im2col, add tests (ggml/993)	2024-11-01 10:19:05 +02:00
leo-pony	ded89c9d08	Fix cann compilation error (llama/9891) Fix cann compilation error after merging llama.cpp supports dynamically loadable backends.	2024-11-01 10:19:05 +02:00
agray3	042e95d92f	Vectorize load instructions in dmmv f16 CUDA kernel (llama/9816) * Vectorize load instructions in dmmv f16 CUDA kernel Replaces scalar with vector load instructions, which substantially improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup. * addressed comment * Update ggml/src/ggml-cuda/dmmv.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-11-01 10:19:05 +02:00
Diego Devesa	81110c0174	ggml : move more prints to the ggml log system (llama/9839) * ggml : move more prints to the ggml log system * show BLAS OpenMP warnings in all builds using debug print	2024-11-01 10:19:05 +02:00
Diego Devesa	c313723860	rpc : add backend registry / device interfaces (llama/9812) * rpc : add backend registry / device interfaces * llama : add llama_supports_rpc API * ggml_backend_rpc_start_rpc_server -> ggml_backend_rpc_start_server	2024-11-01 10:19:05 +02:00
R0CKSTAR	e69b2371e2	musa: add docker image support (llama/9685) * mtgpu: add docker image support Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: enable docker workflow Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-11-01 10:19:05 +02:00
Diego Devesa	1531259b2c	ggml : fix BLAS with unsupported types (llama/9775) * ggml : do not use BLAS with types without to_float * ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies * ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits it's not really internal if everybody uses it	2024-11-01 10:19:05 +02:00
Diego Devesa	44bc2767fd	ggml : add backend registry / device interfaces to BLAS backend (llama/9752) * ggml : add backend registry / device interfaces to BLAS backend * fix mmap usage when using host buffers	2024-11-01 10:19:05 +02:00
Andrew Minh Nguyen	bd7ace7adc	Update building for Android (llama/9672) * docs : clarify building Android on Termux * docs : update building Android on Termux * docs : add cross-compiling for Android * cmake : link dl explicitly for Android	2024-11-01 10:19:05 +02:00
Georgi Gerganov	315364d7de	ggml : add metal backend registry / device (llama/9713) * ggml : add metal backend registry / device ggml-ci * metal : fix names [no ci] * metal : global registry and device instances ggml-ci * cont : alternative initialization of global objects ggml-ci * llama : adapt to backend changes ggml-ci * fixes * metal : fix indent * metal : fix build when MTLGPUFamilyApple3 is not available ggml-ci * fix merge * metal : avoid unnecessary singleton accesses ggml-ci * metal : minor fix [no ci] * metal : g_state -> g_ggml_ctx_dev_main [no ci] * metal : avoid reference of device context in the backend context ggml-ci * metal : minor [no ci] * metal : fix maxTransferRate check * metal : remove transfer rate stuff --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-11-01 10:19:05 +02:00
Paul Tsochantaris	80753d4da8	metal : single allocation of encode_async block (llama/9747) * Single allocation of encode_async block with non-ARC capture in ggml-metal.m * Moving Block_release to the deallocation code * Release encode block when re-setting encoding buffer count if needed * Update ggml/src/ggml-metal.m --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-11-01 10:19:05 +02:00
Daniel Bevenius	8f9bdca4c4	ggml-alloc : remove buffer_id from leaf_alloc (ggml/987) This commit removes the buffer_id field from the leaf_alloc struct. The motivation is that this field is only written to and never read/used as far as I can tell. Each tensor_alloc has a buffer_id field and this is what caused me to look into this more closely, to understand what the buffer_id in leaf_alloc was used for.	2024-11-01 10:19:05 +02:00
Georgi Gerganov	aa037a60f3	ggml : alloc ggml_contexts on the heap (#2525) * whisper : reduce ggml_context usage * ggml : allocate contexts on the heap (v2) * ggml : aligned malloc -> malloc	2024-10-31 22:00:09 +02:00
SRHMorris	9f346d0084	vulkan : retry allocation with fallback flags (#2451) Co-authored-by: Samuel Morris <samuel.morris@artlist.io>	2024-10-06 10:34:20 +03:00
Georgi Gerganov	1ba185f4af	metal : zero-init buffer contexts (#0)	2024-10-05 15:23:51 +03:00
Georgi Gerganov	941912467d	whisper : adapt to latest ggml (skip) (#0)	2024-10-05 15:23:51 +03:00
Daniel Bevenius	0b1b094a67	ggml : fix typo in example usage ggml_gallocr_new (ggml/984)	2024-10-05 15:23:51 +03:00
Diego Devesa	40e52a76b9	ggml : fixes after sync (ggml/983) ggml : remove test-backend-buffer ggml : fix CUDA build warnings	2024-10-05 15:23:51 +03:00
Diego Devesa	cf977670e6	ggml-backend : add device and backend reg interfaces (llama/9707) Also: - metal : fix compute pass descriptor autorelease crash - ggml-backend : add device description to CPU backend - ggml: unify backend logging mechanism	2024-10-05 15:23:51 +03:00
Ouadie EL FAROUKI	df2c364de7	Fixed dequant precision issues in Q4_1 and Q5_1 (llama/9711)	2024-10-05 15:23:51 +03:00
Diego Devesa	1acfadb721	ggml-backend : add device and backend reg interfaces (llama/9707) Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-10-05 15:23:51 +03:00
Alberto Cabrera Pérez	ea642144d2	Initial cmake support of SYCL for AMD GPUs (llama/9658) sycl: initial cmake support of SYCL for AMD GPUs	2024-10-05 15:23:51 +03:00
Radoslav Gerganov	282a8654c4	vulkan : do not use tensor->extra (llama/9407) * vulkan : do not use tensor->extra This patch allows using the Vulkan backend with the RPC backend as tensor->extra is no longer used. Ref: #8536 * Adapt GGML_VULKAN_CHECK_RESULTS to extra removal (llama/2) --------- Co-authored-by: 0cc4m <picard12@live.de>	2024-10-05 15:23:51 +03:00
Johannes Gäßler	936cf3beb7	ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980)	2024-10-05 15:23:51 +03:00
Johannes Gäßler	bc92c2f8f0	ggml: refactor cross entropy loss CPU impl. (ggml/976)	2024-10-05 15:23:51 +03:00
Georgi Gerganov	162a455402	metal : reduce command encoding overhead (llama/9698)	2024-10-03 12:22:17 +03:00
Johannes Gäßler	5e9d6baa48	test: fix OPT_STEP_ADAMW for test-backend-ops (ggml/974)	2024-10-03 12:22:17 +03:00
Salvatore Mesoraca	845f8d663e	vulkan : mul_mat: fix UB with small warps (ggml/952) When the device's warp size is less than 16, it is possible for loadstride_a (mul_mm.comp:114) and loadstride_b (mul_mm.comp:115) to be set to 0. Because they are calculated as: the workgroup size, multiplied by LOAD_VEC_* (which can be 1) and divided by 16. And the workgroup size is set to be the same as the warp/subgroup size. The loadstride_* variables are used as increments in the loops that populate the buffers used for the multiplication. When they are 0 they cause an infinite loop. But infinite loops without side-effects are UB and the values of loadstride_* are known at compile time. So, the compiler quietly optimizes all the loops away. As a consequence, the buffers are not populated and the multiplication result is just a matrix with all elements set to 0. We prevent the UB by making sure that the workgroup size will never be less than 16, even if our device has a smaller warp size (e.g. 8). Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>	2024-10-03 12:22:17 +03:00
Borislav Stanimirov	31fdf05fda	ggml : fix ggml_cast (ggml/973)	2024-10-03 12:22:17 +03:00
Johannes Gäßler	0ac6666cd2	ggml: fix gradient allocation logic (ggml/966) * ggml: fix gradient allocation logic * gradient allocation in ggml_build_backward_expand * fixup * fix test-backend-ops grad * suggestions by slaren * fix test1.c * fix legacy opt API * fix test-grad0 * remove keep arg	2024-10-03 12:22:17 +03:00
Georgi Gerganov	6c91da80b8	ggml : define missing HWCAP flags (llama/9684) ggml-ci Co-authored-by: Willy Tarreau <w@1wt.eu>	2024-10-03 12:22:17 +03:00
Dan Johansson	c245168ba3	ggml : add run-time detection of neon, i8mm and sve (llama/9331) * ggml: Added run-time detection of neon, i8mm and sve Adds run-time detection of the Arm instructions set features neon, i8mm and sve for Linux and Apple build targets. * ggml: Extend feature detection to include non aarch64 Arm arch * ggml: Move definition of ggml_arm_arch_features to the global data section	2024-10-03 12:22:17 +03:00
Markus Tavenrath	280fee8fa0	Enable use of the rebar feature to upload buffers to the device. (llama/9251)	2024-10-03 12:22:17 +03:00
R0CKSTAR	78b4c1c25f	mtgpu: enable VMM (llama/9597) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-10-03 12:22:17 +03:00
Charles Xu	1edea2eb4b	ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels (llama/9217) * ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels * added fallback mechanism when the offline re-quantized model is not optimized for the underlying target. * fix for build errors * remove prints from the low-level code * Rebase to the latest upstream	2024-10-03 12:22:17 +03:00
Dou Xinpeng	96808786b7	cann: fix crash when llama-bench is running on multiple cann devices (llama/9627)	2024-10-03 12:22:17 +03:00
Johannes Gäßler	bb57ecb85e	CUDA: remove bad assert (ggml/972)	2024-10-03 12:22:17 +03:00
Jeff Bolz	abdb73c7cc	vulkan : multithread pipeline creation (ggml/963)	2024-10-03 12:22:17 +03:00
Jeff Bolz	391e548a43	vulkan : fix build for GGML_VULKAN_RUN_TESTS, add TFLOPS to log (ggml/961)	2024-10-03 12:22:17 +03:00
Salvatore Mesoraca	2a29afd4c6	vulkan : argsort barriers must be under uniform control flow (ggml/951) a return before a barrier (that happens only in some threads in a workgroup) leads to UB. While the old code actually works on some devices, it fails on some others (i.e. "smaller" GPUs). BTW, I think it would be better to set specialization constants when the graph is built, in that way the local workgroup could be sized appropriately. But it would take a lot of work. Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>	2024-10-03 12:22:17 +03:00
Georgi Gerganov	5963004ff9	ggml : fix GGML_MAX_N_THREADS + improve formatting (ggml/969)	2024-10-03 12:22:17 +03:00
Georgi Gerganov	1133ac98a8	ggml : add ggml-cpu-impl.h (skip) (#0)	2024-09-24 19:45:08 +03:00
Eric Zhang	234f9bd320	ggml : add AVX512DQ requirement for AVX512 builds (llama/9622)	2024-09-24 19:45:08 +03:00
Georgi Gerganov	3b183cfae7	log : add CONT level for continuing previous log entry (llama/9610)	2024-09-24 19:45:08 +03:00
Max Krasnyansky	02285dff81	threads: fix msvc build without openmp (llama/9615) We're missing atomic_thread_fence() in MSVC builds when openmp is disabled.	2024-09-24 19:45:08 +03:00
Ivan	2fc1d20f9e	cuda: add q8_0->f32 cpy operation (llama/9571) llama: enable K-shift for quantized KV cache It will fail on unsupported backends or quant types.	2024-09-24 19:45:08 +03:00
Max Krasnyansky	08e8414f27	threads: improve ggml_barrier scaling with large number of threads (llama/9598) Make sure n_barrier and n_barrier_passed do not share the cache line to avoid cache line bouncing. This optimization shows performance improvements even for n_threads <= 8 cases. Resurrect TSAN (Thread Sanitizer) check so that we can avoid doing expensive read-modify-write in the normal case and just use thread-fence as originally intended.	2024-09-24 19:45:08 +03:00
Srihari-mcw	05c6139625	ggml : AVX512 gemm for Q4_0_8_8 (llama/9532) * AVX512 version of ggml_gemm_q4_0_8x8_q8_0 * Remove zero vector parameter passing * Rename functions and rearrange order of macros * Edit comments * style : minor adjustments * Update x to start from 0 --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-09-24 19:45:08 +03:00
Georgi Gerganov	896c41ef30	metal : use F32 prec for K*Q in vec FA (llama/9595) ggml-ci	2024-09-24 19:45:08 +03:00
Akarshan Biswas	c36ddc43c6	Revert "[SYCL] fallback mmvq (ggml/9088)" (llama/9579) This reverts commit 50addec9a532a6518146ab837a85504850627316.	2024-09-24 19:45:08 +03:00


222 Commits