whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-06-24 19:51:26 +02:00

Author	SHA1	Message	Date
slaren	8f5dc729d9	ggml : use atomic_flag for critical section (llama/7598) * ggml : use atomic_flag for critical section * add windows shims	2024-06-16 18:19:48 +03:00
Georgi Gerganov	02fc147a0b	examples : adapt to new ggml_concat (ggml/0)	2024-06-16 18:19:48 +03:00
zhouwg	109148ac84	ggml : fix typo in ggml.c (llama/7603)	2024-06-16 18:19:48 +03:00
Meng, Hengyu	3563473d2c	Align GEMM dispatch (llama/7566) * align GEMM dispatch	2024-06-16 18:19:48 +03:00
Georgi Gerganov	046834198d	sycl : fix assert (llama/7563)	2024-06-16 18:19:48 +03:00
k.h.lai	0a2ad9de06	vulkan: properly initialize vulkan devices for LLAMA_SPLIT_MODE_NONE (llama/7552)	2024-06-16 18:19:48 +03:00
Radoslav Gerganov	39b0640b09	rpc : resource management rework (llama/7562) * rpc : resource management rework * address review comments	2024-06-16 18:19:48 +03:00
Neo Zhang	8dca71de64	fix ggml_sycl_mul_mat_id() to match the change of api (llama/7436) * fix mul_mat_id to match the change of api * rm comment * rm unused or duplicated code, rename as review comment	2024-06-16 18:19:48 +03:00
Georgi Gerganov	812787cbc5	ggml : generalize GGML_OP_CONCAT (llama/7563) * ggml : generalize GGML_OP_CONCAT (WIP) ggml-ci * tests : add dim != 2 tests * metal : generalize concat kernel * tests : naming * cuda : generalize concat kernel ggml-ci * sycl : add warning and assert * ggml : fix op params handling * metal : bugfix kernel ggml-ci * ggml : reimplement CPU and Metal * cuda : add asserts ggml-ci * ggml : fix ptrs ggml-ci	2024-06-16 18:19:48 +03:00
Djip007	68ef10805e	update HIP_UMA #7399 (llama/7414) * update HIP_UMA #7399 add use of hipMemAdviseSetCoarseGrain when LLAMA_HIP_UMA is enabled. - get x2 on prompt eval and x1.5 on token gen with rocm6.0 on ryzen 7940HX iGPU (780M/gfx1103) * simplify code, more consistent style --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-16 18:19:48 +03:00
agray3	96fdb90f5f	Allow multiple copy function pointers for CUDA graph kernel param updates (llama/7565) CUDA graphs require parameter updates to kernels associated with GGML_OP_CPY nodes. Previously the implementation only checked for a single CUDA kernel in such nodes, but this caused a bug in cases where 2 such kernels exist. This fixes the issue by using a vector to allow multiple function pointers to be stored and checked against. Fixes #7942	2024-06-16 18:19:48 +03:00
AidanBeltonS	e98f9ac554	Fix q_xxs using mul_mat_q (llama/7459)	2024-06-16 18:19:48 +03:00
AidanBeltonS	02d481595b	Add freq factors (llama/7495)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	7091c7ab5a	metal : add GGML_OP_REPEAT kernels (llama/7557) ggml-ci	2024-06-16 18:19:48 +03:00
Georgi Gerganov	d70ccb75f5	metal : disable FA kernel for HS=256 (llama/7556) ggml-ci	2024-06-16 18:19:48 +03:00
Georgi Gerganov	5ee048eb67	ggml : restore ggml_rope_xpos_inplace (ggml/0) ggml-ci	2024-06-16 18:19:48 +03:00
Masaya, Kato	37ed71c964	ggml: aarch64: SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot (llama/7433) * Add SVE support for q4_0_q8_0 q8_0_q8_0 * remove ifdef	2024-06-16 18:19:48 +03:00
Georgi Gerganov	8cd7a3df37	ggml : silence UB sanitizer error during iq2_xxs quantization (llama/0)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	04a3279320	ggml : remove ggml_flash_attn and ggml_flash_ff (llama/7463) ggml-ci	2024-06-16 18:19:48 +03:00
Georgi Gerganov	45ddda8e0c	ggml : drop support for QK_K=64 (llama/7473) * ggml : drop support for QK_K=64 ggml-ci * opencl : restore QK_K=256 define	2024-06-16 18:19:48 +03:00
0cc4m	c41317fd66	Update vulkan rope implementation to support frequency factors (llama/7475)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	96b8419b27	CUDA: fix FA out-of-bounds reads (llama/7479)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	3c63f4cf35	CUDA: fix FA out-of-bounds writes (llama/7465)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	5848dfd9c8	cuda : fix compile warning (llama/7454)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	29ab5d0326	CUDA: remove incorrect precision check (llama/7454)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	c4d6958b3e	cuda : fix rope + add tests (llama/7452) * cuda : fix rope pos data ggml-ci * ggml : drop mode & 1 == 1 support for ggml_rope ggml-ci * ggml : support freq_factors for f16 rope (CPU) ggml-ci * tests : add rope tests using frequency factors ggml-ci	2024-06-16 18:19:48 +03:00
liuwei-git	c9dcb75118	llama : add phi3 128K model support (llama/7225) * add phi3 128k support in convert-hf-to-gguf * add phi3 128k support in cuda * address build warnings on llama.cpp * adjust index value in cuda long rope freq factors * add long rope support in ggml cpu backend * make freq factors only depend on ctx size * remove unused rope scaling type 'su' from gguf converter * fix flint warnings on convert-hf-to-gguf.py * set to the short freq factor when context size is smaller than trained context size * add one line of comments * metal : support rope freq_factors * ggml : update ggml_rope_ext API to support freq. factors * backends : add dev messages to support rope freq. factors * minor : style * tests : update to use new rope API * backends : fix pragma semicolons * minor : cleanup * llama : move rope factors from KV header to tensors * llama : remove tmp assert * cuda : fix compile warning * convert : read/write n_head_kv * llama : fix uninitialized tensors --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-16 18:19:48 +03:00
Georgi Gerganov	bbdbc3fc62	metal : handle F16 inf values, fix FA partial offload (llama/7434) ggml-ci	2024-06-16 18:19:48 +03:00
Johannes Gäßler	28c207a541	CUDA: fix unused warning in mmq.cu (llama/7442)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	c23f830983	CUDA: deduplicate mmq code (llama/7397)	2024-06-16 18:19:48 +03:00
Radoslav Gerganov	caeeb32b41	rpc : track allocated buffers (llama/7411) * rpc : track allocated buffers ref: #7407 * rpc : pack rpc_tensor tightly	2024-06-16 18:19:48 +03:00
AidanBeltonS	584cc1177a	Update SYCL upscale operation (llama/7321) * Update SYCL upscale operation * Formatting * Remove messages	2024-06-16 18:19:48 +03:00
Herman Semenov	cc1ae10989	ggml-opencl, llama: using reserve() if count already known (llama/7272)	2024-06-16 18:19:48 +03:00
junchao-loongson	eb26f55b40	ggml : add loongarch lsx and lasx support (llama/6454) * add loongarch lsx and lasx optimize code * Add loongarch compilation support to makefile * revert stb_image.h * opt bytes_from_nibbles_32 and sum_i16_pairs_float * fix undeclared * format code * update * update 2 --------- Co-authored-by: Jinyang He <hejinyang@loongson.cn>	2024-06-16 18:19:48 +03:00
Srihari-mcw	eb2b086584	Add provisions for windows support for BF16 code including CMake provision for enabling AVX512_BF16 (llama/7258)	2024-06-16 18:19:48 +03:00
0cc4m	67919cfe11	Vulkan Embedding Fix (llama/7360) * Fix empty Vulkan host buffers Add fp32 fp16 matmul shader Fix matmul shader alignment * Remove deprecated tensor->backend uses * Fix Vulkan validation errors on embedding models with no offloaded layers * Fix Vulkan llava segfault when not offloading layers	2024-06-16 18:19:48 +03:00
slaren	bf5fc81a8a	ggml : fix another case of quants nans (llama/7387)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	2b07dc3186	ggml: implement quantized KV cache for FA (llama/7372)	2024-06-16 18:19:48 +03:00
slaren	951c463d39	cuda : clear error after buffer allocation failure (llama/7376)	2024-06-16 18:19:48 +03:00
fraxy-v	7f257b210f	Capture CUDA logging output (llama/7298) * logging: output capture in cuda module * fix compile error * fix: vsnprintf terminates with 0, string use not correct * post review * Update llama.cpp Co-authored-by: slaren <slarengh@gmail.com> * Update llama.cpp Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-16 18:19:48 +03:00
Georgi Gerganov	705fe30a02	android : use "ci-android" branch for CI (llama/7341) * android : use "ci-android" branch for CI * ggml : disable SIMD exp and silu for 32-bit ARM ggml-ci * android : do not fetch, use add_subdirectory instead * cmake : provide binary dir	2024-06-16 18:19:48 +03:00
Johannes Gäßler	45b5b95e29	CUDA: deduplicate FlashAttention code (llama/7352)	2024-06-16 18:19:48 +03:00
Engininja2	f2c47d1e6a	cuda : add half2 __shfl_xor() for ROCm 5.5 (llama/7263)	2024-06-16 18:19:48 +03:00
0cc4m	b4bb9b9036	Update and fix Vulkan soft_max and argsort implementations (llama/7237) * Update and fix Vulkan softmax implementation * Update and fix Vulkan argsort implementation	2024-06-16 18:19:48 +03:00
slaren	2bc6483299	ggml : fix quants nans when all the group weights are very close to zero (llama/7313)	2024-06-16 18:19:48 +03:00
Johannes Gäßler	ec52f900e4	CUDA: faster large batch FA without tensor cores (llama/7314)	2024-06-16 18:19:48 +03:00
Radoslav Gerganov	77d708fabb	rpc : set SO_REUSEADDR for the server socket (llama/7320) ref: #7293	2024-06-16 18:19:48 +03:00
Herman Semenov	c00149c861	ggml-quants, llama : removed excess checks (llama/7274)	2024-06-16 18:19:48 +03:00
Justine Tunney	574661f2e6	ggml : rewrite silu and softmax for cpu (llama/7154) This change upstreams llamafile's vectorized expf() functions. This lets us compute softmax and silu more accurately than the short[65536] lookup table that GGML previously used to make this operation go faster. We can support aarch64 and sse2+ with the worst case rounding error of 2ulp. It makes make -j8 tests && ./tests/test-backend-ops -o SOFT_MAX -b CPU perf go 1.5x faster for SSE2+FMA, 1.9x faster for AVX2+FMA and 2.1x on AVX512	2024-06-16 18:19:48 +03:00
Radoslav Gerganov	7bd69349bf	rpc : add command line arg for specifying backend memory ref: #7293	2024-06-16 18:19:48 +03:00

1364 Commits