whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-08-17 04:11:06 +02:00

Author	SHA1	Message	Date
k.h.lai	0a2ad9de06	vulkan: properly initialize vulkan devices for LLAMA_SPLIT_MODE_NONE (llama/7552)	2024-06-16 18:19:48 +03:00
Radoslav Gerganov	39b0640b09	rpc : resource management rework (llama/7562) * rpc : resource management rework * address review comments	2024-06-16 18:19:48 +03:00
Neo Zhang	8dca71de64	fix ggml_sycl_mul_mat_id() to match the change of api (llama/7436) * fix mul_mat_id to match the change of api * rm comment * rm unused or duplicated code, rename as review comment	2024-06-16 18:19:48 +03:00
Georgi Gerganov	812787cbc5	ggml : generalize GGML_OP_CONCAT (llama/7563) * ggml : generalize GGML_OP_CONCAT (WIP) ggml-ci * tests : add dim != 2 tests * metal : generalize concat kernel * tests : naming * cuda : generalize concat kernel ggml-ci * sycl : add warning and assert * ggml : fix op params handling * metal : bugfix kernel ggml-ci * ggml : reimplement CPU and Metal * cuda : add asserts ggml-ci * ggml : fix ptrs ggml-ci	2024-06-16 18:19:48 +03:00
Djip007	68ef10805e	update HIP_UMA #7399 (llama/7414) * update HIP_UMA #7399 add use of hipMemAdviseSetCoarseGrain when LLAMA_HIP_UMA is enabled. - get x2 on prompt eval and x1.5 on token gen with rocm6.0 on ryzen 7940HX iGPU (780M/gfx1103) * simplify code, more consistent style --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-16 18:19:48 +03:00
agray3	96fdb90f5f	Allow multiple copy function pointers for CUDA graph kernel param updates (llama/7565) CUDA graphs require parameter updates to kernels associated with GGML_OP_CPY nodes. Previously the implementation only checked for a single CUDA kernel in such nodes, but this caused a bug in cases where 2 such kernels exist. This fixes the issue by using a vector to allow multiple function pointers to be stored and checked against. Fixes #7942	2024-06-16 18:19:48 +03:00
AidanBeltonS	e98f9ac554	Fix q_xxs using mul_mat_q (llama/7459)	2024-06-16 18:19:48 +03:00
AidanBeltonS	02d481595b	Add freq factors (llama/7495)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	7091c7ab5a	metal : add GGML_OP_REPEAT kernels (llama/7557) ggml-ci	2024-06-16 18:19:48 +03:00
Georgi Gerganov	d70ccb75f5	metal : disable FA kernel for HS=256 (llama/7556) ggml-ci	2024-06-16 18:19:48 +03:00
Georgi Gerganov	5ee048eb67	ggml : restore ggml_rope_xpos_inplace (ggml/0) ggml-ci	2024-06-16 18:19:48 +03:00
Masaya, Kato	37ed71c964	ggml: aarch64: SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot (llama/7433) * Add SVE support for q4_0_q8_0 q8_0_q8_0 * remove ifdef	2024-06-16 18:19:48 +03:00
Georgi Gerganov	8cd7a3df37	ggml : silence UB sanitizer error during iq2_xxs quantization (llama/0)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	04a3279320	ggml : remove ggml_flash_attn and ggml_flash_ff (llama/7463) ggml-ci	2024-06-16 18:19:48 +03:00
Georgi Gerganov	45ddda8e0c	ggml : drop support for QK_K=64 (llama/7473) * ggml : drop support for QK_K=64 ggml-ci * opencl : restore QK_K=256 define	2024-06-16 18:19:48 +03:00
0cc4m	c41317fd66	Update vulkan rope implementation to support frequency factors (llama/7475)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	96b8419b27	CUDA: fix FA out-of-bounds reads (llama/7479)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	3c63f4cf35	CUDA: fix FA out-of-bounds writes (llama/7465)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	5848dfd9c8	cuda : fix compile warning (llama/7454)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	29ab5d0326	CUDA: remove incorrect precision check (llama/7454)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	c4d6958b3e	cuda : fix rope + add tests (llama/7452) * cuda : fix rope pos data ggml-ci * ggml : drop mode & 1 == 1 support for ggml_rope ggml-ci * ggml : support freq_factors for f16 rope (CPU) ggml-ci * tests : add rope tests using frequency factors ggml-ci	2024-06-16 18:19:48 +03:00
liuwei-git	c9dcb75118	llama : add phi3 128K model support (llama/7225) * add phi3 128k support in convert-hf-to-gguf * add phi3 128k support in cuda * address build warnings on llama.cpp * adjust index value in cuda long rope freq factors * add long rope support in ggml cpu backend * make freq factors only depend on ctx size * remove unused rope scaling type 'su' from gguf converter * fix flint warnings on convert-hf-to-gguf.py * set to the short freq factor when context size is smaller than trained context size * add one line of comments * metal : support rope freq_factors * ggml : update ggml_rope_ext API to support freq. factors * backends : add dev messages to support rope freq. factors * minor : style * tests : update to use new rope API * backends : fix pragma semicolons * minor : cleanup * llama : move rope factors from KV header to tensors * llama : remove tmp assert * cuda : fix compile warning * convert : read/write n_head_kv * llama : fix uninitialized tensors --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-16 18:19:48 +03:00
Georgi Gerganov	bbdbc3fc62	metal : handle F16 inf values, fix FA partial offload (llama/7434) ggml-ci	2024-06-16 18:19:48 +03:00
Johannes Gäßler	28c207a541	CUDA: fix unused warning in mmq.cu (llama/7442)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	c23f830983	CUDA: deduplicate mmq code (llama/7397)	2024-06-16 18:19:48 +03:00
Radoslav Gerganov	caeeb32b41	rpc : track allocated buffers (llama/7411) * rpc : track allocated buffers ref: #7407 * rpc : pack rpc_tensor tightly	2024-06-16 18:19:48 +03:00
AidanBeltonS	584cc1177a	Update SYCL upscale operation (llama/7321) * Update SYCL upscale operation * Formatting * Remove messages	2024-06-16 18:19:48 +03:00
Herman Semenov	cc1ae10989	ggml-opencl, llama: using reserve() if count already known (llama/7272)	2024-06-16 18:19:48 +03:00
junchao-loongson	eb26f55b40	ggml : add loongarch lsx and lasx support (llama/6454) * add loongarch lsx and lasx optimize code * Add loongarch compilation support to makefile * revert stb_image.h * opt bytes_from_nibbles_32 and sum_i16_pairs_float * fix undeclared * format code * update * update 2 --------- Co-authored-by: Jinyang He <hejinyang@loongson.cn>	2024-06-16 18:19:48 +03:00
Srihari-mcw	eb2b086584	Add provisions for windows support for BF16 code including CMake provision for enabling AVX512_BF16 (llama/7258)	2024-06-16 18:19:48 +03:00
0cc4m	67919cfe11	Vulkan Embedding Fix (llama/7360) * Fix empty Vulkan host buffers Add fp32 fp16 matmul shader Fix matmul shader alignment * Remove deprecated tensor->backend uses * Fix Vulkan validation errors on embedding models with no offloaded layers * Fix Vulkan llava segfault when not offloading layers	2024-06-16 18:19:48 +03:00
slaren	bf5fc81a8a	ggml : fix another case of quants nans (llama/7387)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	2b07dc3186	ggml: implement quantized KV cache for FA (llama/7372)	2024-06-16 18:19:48 +03:00
slaren	951c463d39	cuda : clear error after buffer allocation failure (llama/7376)	2024-06-16 18:19:48 +03:00
fraxy-v	7f257b210f	Capture CUDA logging output (llama/7298) * logging: output capture in cuda module * fix compile error * fix: vsnprintf terminates with 0, string use not correct * post review * Update llama.cpp Co-authored-by: slaren <slarengh@gmail.com> * Update llama.cpp Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-16 18:19:48 +03:00
Georgi Gerganov	705fe30a02	android : use "ci-android" branch for CI (llama/7341) * android : use "ci-android" branch for CI * ggml : disable SIMD exp and silu for 32-bit ARM ggml-ci * android : do not fetch, use add_subdirectory instead * cmake : provide binary dir	2024-06-16 18:19:48 +03:00
Johannes Gäßler	45b5b95e29	CUDA: deduplicate FlashAttention code (llama/7352)	2024-06-16 18:19:48 +03:00
Engininja2	f2c47d1e6a	cuda : add half2 __shfl_xor() for ROCm 5.5 (llama/7263)	2024-06-16 18:19:48 +03:00
0cc4m	b4bb9b9036	Update and fix Vulkan soft_max and argsort implementations (llama/7237) * Update and fix Vulkan softmax implementation * Update and fix Vulkan argsort implementation	2024-06-16 18:19:48 +03:00
slaren	2bc6483299	ggml : fix quants nans when all the group weights are very close to zero (llama/7313)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	ec52f900e4	CUDA: faster large batch FA without tensor cores (llama/7314)	2024-06-16 18:19:48 +03:00
Radoslav Gerganov	77d708fabb	rpc : set SO_REUSEADDR for the server socket (llama/7320) ref: #7293	2024-06-16 18:19:48 +03:00
Herman Semenov	c00149c861	ggml-quants, llama : removed excess checks (llama/7274)	2024-06-16 18:19:48 +03:00
Justine Tunney	574661f2e6	ggml : rewrite silu and softmax for cpu (llama/7154) This change upstreams llamafile's vectorized expf() functions. This lets us compute softmax and silu more accurately than the short[65536] lookup table that GGML previously used to make this operation go faster. We can support aarch64 and sse2+ with the worst case rounding error of 2ulp. It makes make -j8 tests && ./tests/test-backend-ops -o SOFT_MAX -b CPU perf go 1.5x faster for SSE2+FMA, 1.9x faster for AVX2+FMA and 2.1x on AVX512	2024-06-16 18:19:48 +03:00
Radoslav Gerganov	7bd69349bf	rpc : add command line arg for specifying backend memory ref: #7293	2024-06-16 18:19:48 +03:00
Max Krasnyansky	488ad99c13	Add support for properly optimized Windows ARM64 builds with LLVM and MSVC (llama/7191) * logging: add proper checks for clang to avoid errors and warnings with VA_ARGS * build: add CMake Presets and toolchain files for Windows ARM64 * matmul-int8: enable matmul-int8 with MSVC and fix Clang warnings * ci: add support for optimized Windows ARM64 builds with MSVC and LLVM * matmul-int8: fixed typos in q8_0_q8_0 matmuls Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * matmul-int8: remove unnecessary casts in q8_0_q8_0 --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-16 18:19:48 +03:00
kunnis	7178cceeaa	ggml : use dynamic thread scheduling for matrix multiplication (llama/6915) * Just reordering some structs. * Adding in the calls to mm_pause * Passing around the state * Renaming and moving a bunch of variables around. * Extracting the logic to its own function. * Moving some variable definitions into the chunk function. * Moving some variables around * moving src1_cont inside * Moving row_size * adding the current_chunk * Reorg the code. * Formatting to match the orig patch * starting to setup the chunking variables * Starting the buildup of the loop * The yield shouldn't be necessary. * adding the looping structure based on the chunk configuration. * Add in the re-chunking code. * Making it much more likely to rechunk. * disable resizing if numa is enabled. * Updating comments with what we've learned. * Fix formatting * Couple more formatting fixes. * More style fixes. * Fix Warnings * Going with unused because there's conditional logic that needs it. * Update ggml.c * Update ggml.c	2024-06-16 18:19:48 +03:00
agray3	8d55ccdb8c	Avoid unnecessarily disabling CUDA graphs (llama/7302) As discussed in PR #6766, CUDA graphs were being disabled in the presence of long prompts. This fixes the issue by preventing the consecutive update counter from incrementing unnecessarily for tokens in which CUDA graphs are disabled due to batch size > 1.	2024-06-16 18:19:48 +03:00
slaren	37a72cb170	ggml : tag ggml_tensor::backend as deprecated (llama/7290)	2024-06-16 18:19:48 +03:00
AidanBeltonS	bf9b69284f	Add missing " (llama/7303)	2024-06-16 18:19:48 +03:00


1359 Commits