whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2024-11-07 08:34:37 +01:00

Author	SHA1	Message	Date
Radoslav Gerganov	25f9fee6fb	rpc : pack only RPC structs (llama/9959)	2024-11-01 10:19:05 +02:00
Neo Zhang Jianyu	7c1570bee6	fix mul_mat_vec_q and *_vec_q error (llama/9939) Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>	2024-11-01 10:19:05 +02:00
Radoslav Gerganov	4078e4c388	rpc : backend refactoring (llama/9912) * rpc : refactor backend Use structs for RPC request/response messages * rpc : refactor server	2024-11-01 10:19:05 +02:00
Ouadie EL FAROUKI	a4a22daa8f	Add SYCL Backend registry, device and Event Interfaces (llama/9705) * implemented missing SYCL event APIs * sycl : Added device and backend reg interfaces * Restructured ggml-sycl.cpp	2024-11-01 10:19:05 +02:00
Ma Mingfei	e1936eb2a5	add amx kernel for gemm (llama/8998) add intel amx isa detection add vnni kernel for gemv cases add vnni and amx kernel support for block_q8_0 code cleanup fix packing B issue enable openmp fine tune amx kernel switch to aten parallel pattern add error message for nested parallelism code cleanup add f16 support in ggml-amx add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS update CMakeList update README fix some compilation warning fix compiler warning when amx is not enabled minor change ggml-ci move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp ggml-ci update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16 ggml-ci add amx as an ggml-backend update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h minor change update CMakeLists.txt minor change apply weight prepacking in set_tensor method in ggml-backend fix compile error ggml-ci minor change ggml-ci update CMakeLists.txt ggml-ci add march dependency minor change ggml-ci change ggml_backend_buffer_is_host to return false for amx backend ggml-ci fix supports_op use device reg for AMX backend ggml-ci minor change ggml-ci minor change fix rebase set .buffer_from_host_ptr to be false for AMX backend	2024-11-01 10:19:05 +02:00
Diego Devesa	28b044dad9	vulkan : add backend registry / device interfaces (llama/9721) * vulkan : add backend registry / device interfaces * llama : print devices used on model load	2024-11-01 10:19:05 +02:00
Gilad S	b8f11a0a17	fix: allocating CPU buffer with size `0` (llama/9917)	2024-11-01 10:19:05 +02:00
Gilad S	ff5a838099	fix: use `vm_allocate` to allocate CPU backend buffer on macOS (llama/9875) * fix: use `vm_allocate` to allocate CPU backend buffer on macOS * fix: switch to `posix_memalign` to keep existing `free()` usages work * feat: move `GGML_ALIGNED_MALLOC` to `ggml-backend-impl.h`, add support for `vm_allocate` on macOS * style: formatting * fix: move const outside of `#ifndef` * style: formatting * fix: unused var * fix: transform `GGML_ALIGNED_MALLOC` and `GGML_ALIGNED_FREE` into functions and add them to `ggml-impl.h` * fix: unused var * fix: page align to `GGUF_DEFAULT_ALIGNMENT` * fix: page align to `TENSOR_ALIGNMENT` * fix: convert `TENSOR_ALIGNMENT` to a macro * fix: increase page size to `32` on iOS * fix: iOS page size * fix: `hbw_posix_memalign` alignment	2024-11-01 10:19:05 +02:00
Johannes Gäßler	84713613be	CUDA: fix 1D im2col, add tests (ggml/993)	2024-11-01 10:19:05 +02:00
leo-pony	ded89c9d08	Fix cann compilation error (llama/9891) Fix cann compilation error after merging llama.cpp supports dynamically loadable backends.	2024-11-01 10:19:05 +02:00
agray3	042e95d92f	Vectorize load instructions in dmmv f16 CUDA kernel (llama/9816) * Vectorize load instructions in dmmv f16 CUDA kernel Replaces scalar with vector load instructions, which substantially improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup. * addressed comment * Update ggml/src/ggml-cuda/dmmv.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-11-01 10:19:05 +02:00
Diego Devesa	81110c0174	ggml : move more prints to the ggml log system (llama/9839) * ggml : move more prints to the ggml log system * show BLAS OpenMP warnings in all builds using debug print	2024-11-01 10:19:05 +02:00
Diego Devesa	c313723860	rpc : add backend registry / device interfaces (llama/9812) * rpc : add backend registry / device interfaces * llama : add llama_supports_rpc API * ggml_backend_rpc_start_rpc_server -> ggml_backend_rpc_start_server	2024-11-01 10:19:05 +02:00
R0CKSTAR	e69b2371e2	musa: add docker image support (llama/9685) * mtgpu: add docker image support Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: enable docker workflow Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-11-01 10:19:05 +02:00
Diego Devesa	1531259b2c	ggml : fix BLAS with unsupported types (llama/9775) * ggml : do not use BLAS with types without to_float * ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies * ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits it's not really internal if everybody uses it	2024-11-01 10:19:05 +02:00
Diego Devesa	44bc2767fd	ggml : add backend registry / device interfaces to BLAS backend (llama/9752) * ggml : add backend registry / device interfaces to BLAS backend * fix mmap usage when using host buffers	2024-11-01 10:19:05 +02:00
Andrew Minh Nguyen	bd7ace7adc	Update building for Android (llama/9672) * docs : clarify building Android on Termux * docs : update building Android on Termux * docs : add cross-compiling for Android * cmake : link dl explicitly for Android	2024-11-01 10:19:05 +02:00
Georgi Gerganov	315364d7de	ggml : add metal backend registry / device (llama/9713) * ggml : add metal backend registry / device ggml-ci * metal : fix names [no ci] * metal : global registry and device instances ggml-ci * cont : alternative initialization of global objects ggml-ci * llama : adapt to backend changes ggml-ci * fixes * metal : fix indent * metal : fix build when MTLGPUFamilyApple3 is not available ggml-ci * fix merge * metal : avoid unnecessary singleton accesses ggml-ci * metal : minor fix [no ci] * metal : g_state -> g_ggml_ctx_dev_main [no ci] * metal : avoid reference of device context in the backend context ggml-ci * metal : minor [no ci] * metal : fix maxTransferRate check * metal : remove transfer rate stuff --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-11-01 10:19:05 +02:00
Paul Tsochantaris	80753d4da8	metal : single allocation of encode_async block (llama/9747) * Single allocation of encode_async block with non-ARC capture in ggml-metal.m * Moving Block_release to the deallocation code * Release encode block when re-setting encoding buffer count if needed * Update ggml/src/ggml-metal.m --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-11-01 10:19:05 +02:00
Daniel Bevenius	8f9bdca4c4	ggml-alloc : remove buffer_id from leaf_alloc (ggml/987) This commit removes the buffer_id field from the leaf_alloc struct. The motivation for this is that the field is only written to and never read/used as far as I can tell. Each tensor_alloc has a buffer_id field and this is what caused me to look into this more closely, to understand what the buffer_id in leaf_alloc was used for.	2024-11-01 10:19:05 +02:00
Georgi Gerganov	aa037a60f3	ggml : alloc ggml_contexts on the heap (#2525) * whisper : reduce ggml_context usage * ggml : allocate contexts on the heap (v2) * ggml : aligned malloc -> malloc	2024-10-31 22:00:09 +02:00
SRHMorris	9f346d0084	vulkan : retry allocation with fallback flags (#2451) Co-authored-by: Samuel Morris <samuel.morris@artlist.io>	2024-10-06 10:34:20 +03:00
Georgi Gerganov	1ba185f4af	metal : zero-init buffer contexts (#0)	2024-10-05 15:23:51 +03:00
Georgi Gerganov	941912467d	whisper : adapt to latest ggml (skip) (#0)	2024-10-05 15:23:51 +03:00
Daniel Bevenius	0b1b094a67	ggml : fix typo in example usage ggml_gallocr_new (ggml/984)	2024-10-05 15:23:51 +03:00
Diego Devesa	40e52a76b9	ggml : fixes after sync (ggml/983) ggml : remove test-backend-buffer ggml : fix CUDA build warnings	2024-10-05 15:23:51 +03:00
Diego Devesa	cf977670e6	ggml-backend : add device and backend reg interfaces (llama/9707) Also: - metal : fix compute pass descriptor autorelease crash - ggml-backend : add device description to CPU backend - ggml: unify backend logging mechanism	2024-10-05 15:23:51 +03:00
Ouadie EL FAROUKI	df2c364de7	Fixed dequant precision issues in Q4_1 and Q5_1 (llama/9711)	2024-10-05 15:23:51 +03:00
Diego Devesa	1acfadb721	ggml-backend : add device and backend reg interfaces (llama/9707) Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-10-05 15:23:51 +03:00
Alberto Cabrera Pérez	ea642144d2	Initial cmake support of SYCL for AMD GPUs (llama/9658) sycl: initial cmake support of SYCL for AMD GPUs	2024-10-05 15:23:51 +03:00
Radoslav Gerganov	282a8654c4	vulkan : do not use tensor->extra (llama/9407) * vulkan : do not use tensor->extra This patch allows using the Vulkan backend with the RPC backend as tensor->extra is no longer used. Ref: #8536 * Adapt GGML_VULKAN_CHECK_RESULTS to extra removal (llama/2) --------- Co-authored-by: 0cc4m <picard12@live.de>	2024-10-05 15:23:51 +03:00
Johannes Gäßler	936cf3beb7	ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980)	2024-10-05 15:23:51 +03:00
Johannes Gäßler	bc92c2f8f0	ggml: refactor cross entropy loss CPU impl. (ggml/976)	2024-10-05 15:23:51 +03:00
Georgi Gerganov	162a455402	metal : reduce command encoding overhead (llama/9698)	2024-10-03 12:22:17 +03:00
Johannes Gäßler	5e9d6baa48	test: fix OPT_STEP_ADAMW for test-backend-ops (ggml/974)	2024-10-03 12:22:17 +03:00
Salvatore Mesoraca	845f8d663e	vulkan : mul_mat: fix UB with small warps (ggml/952) When the device's warp size is less than 16, it is possible for loadstride_a (mul_mm.comp:114) and loadstride_b (mul_mm.comp:115) to be set to 0. Because they are calculated as: the workgroup size, multiplied by LOAD_VEC_* (which can be 1) and divided by 16. And the workgroup size is set to be the same as the warp/subgroup size. The loadstride_* variables are used as increments in the loops that populate the buffers used for the multiplication. When they are 0 they cause an infinite loop. But infinite loops without side-effects are UB and the values of loadstride_* are known at compile time. So, the compiler quietly optimizes all the loops away. As a consequence, the buffers are not populated and the multiplication result is just a matrix with all elements set to 0. We prevent the UB by making sure that the workgroup size will never be less than 16, even if our device has a smaller warp size (e.g. 8). Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>	2024-10-03 12:22:17 +03:00
Borislav Stanimirov	31fdf05fda	ggml : fix ggml_cast (ggml/973)	2024-10-03 12:22:17 +03:00
Johannes Gäßler	0ac6666cd2	ggml: fix gradient allocation logic (ggml/966) * ggml: fix gradient allocation logic * gradient allocation in ggml_build_backward_expand * fixup * fix test-backend-ops grad * suggestions by slaren * fix test1.c * fix legacy opt API * fix test-grad0 * remove keep arg	2024-10-03 12:22:17 +03:00
Georgi Gerganov	6c91da80b8	ggml : define missing HWCAP flags (llama/9684) ggml-ci Co-authored-by: Willy Tarreau <w@1wt.eu>	2024-10-03 12:22:17 +03:00
Dan Johansson	c245168ba3	ggml : add run-time detection of neon, i8mm and sve (llama/9331) * ggml: Added run-time detection of neon, i8mm and sve Adds run-time detection of the Arm instructions set features neon, i8mm and sve for Linux and Apple build targets. * ggml: Extend feature detection to include non aarch64 Arm arch * ggml: Move definition of ggml_arm_arch_features to the global data section	2024-10-03 12:22:17 +03:00
Markus Tavenrath	280fee8fa0	Enable use of the rebar feature to upload buffers to the device. (llama/9251)	2024-10-03 12:22:17 +03:00
R0CKSTAR	78b4c1c25f	mtgpu: enable VMM (llama/9597) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-10-03 12:22:17 +03:00
Charles Xu	1edea2eb4b	ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels (llama/9217) * ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels * added fallback mechanism when the offline re-quantized model is not optimized for the underlying target. * fix for build errors * remove prints from the low-level code * Rebase to the latest upstream	2024-10-03 12:22:17 +03:00
Dou Xinpeng	96808786b7	cann: fix crash when llama-bench is running on multiple cann devices (llama/9627)	2024-10-03 12:22:17 +03:00
Johannes Gäßler	bb57ecb85e	CUDA: remove bad assert (ggml/972)	2024-10-03 12:22:17 +03:00
Jeff Bolz	abdb73c7cc	vulkan : multithread pipeline creation (ggml/963)	2024-10-03 12:22:17 +03:00
Jeff Bolz	391e548a43	vulkan : fix build for GGML_VULKAN_RUN_TESTS, add TFLOPS to log (ggml/961)	2024-10-03 12:22:17 +03:00
Salvatore Mesoraca	2a29afd4c6	vulkan : argsort barriers must be under uniform control flow (ggml/951) a return before a barrier (that happens only in some threads in a workgroup) leads to UB. While the old code actually works on some devices, it fails on some others (i.e. "smaller" GPUs). BTW, I think it would be better to set specialization constants when the graph is built, in that way the local workgroup could be sized appropriately. But it would take a lot of work. Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>	2024-10-03 12:22:17 +03:00
Georgi Gerganov	5963004ff9	ggml : fix GGML_MAX_N_THREADS + improve formatting (ggml/969)	2024-10-03 12:22:17 +03:00
Georgi Gerganov	1133ac98a8	ggml : add ggml-cpu-impl.h (skip) (#0)	2024-09-24 19:45:08 +03:00
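
Note on 845f8d663e (vulkan : mul_mat: fix UB with small warps): the arithmetic in that commit message can be checked with a small C sketch. The function names below are hypothetical and are not taken from the Vulkan backend or from mul_mm.comp; only the formula (workgroup size multiplied by LOAD_VEC_*, divided by 16) and the "never less than 16" rule come from the message itself.

```c
#include <stdint.h>

/* Hypothetical helper: stride as described in the commit message,
 * workgroup_size * LOAD_VEC / 16, using unsigned integer division. */
static uint32_t load_stride(uint32_t workgroup_size, uint32_t load_vec) {
    return workgroup_size * load_vec / 16;
}

/* Hypothetical helper: the fix described in the commit message,
 * i.e. the workgroup size is never allowed to be smaller than 16,
 * even if the device reports a smaller warp/subgroup size. */
static uint32_t clamped_workgroup_size(uint32_t subgroup_size) {
    return subgroup_size < 16 ? 16 : subgroup_size;
}
```

With a subgroup size of 8 and LOAD_VEC of 1, load_stride(8, 1) truncates to 0, which is the zero loop increment the message describes. After clamping, load_stride(clamped_workgroup_size(8), 1) is 1, so the loops that populate the buffers advance again.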


230 Commits