whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2024-11-07 16:44:13 +01:00

Author	SHA1	Message	Date
Georgi Gerganov	acf3832c9c	tests : add non-cont unary tests (llama/7857) * tests : add non-cont unary tests * ggml : update unary asserts and "supports_op" ggml-ci	2024-06-16 18:19:48 +03:00
Ben Ashbaugh	1fe5948227	use the correct SYCL context for host USM allocations (llama/7777) Signed-off-by: Ben Ashbaugh <ben.ashbaugh@intel.com>	2024-06-16 18:19:48 +03:00
pengxin99	dc01aadb18	fix softmax r2r result wrong issue (llama/7811)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	abab4500fa	ggml : refactor rope norm/neox (llama/7634) * ggml : unify rope norm/neox (CPU) * ggml : fix compile warning * ggml : remove GLM rope mode ggml-ci * metal : better rope implementation ggml-ci * cuda : better rope implementation ggml-ci * naming : n_orig_ctx -> n_ctx_orig ggml-ci * dev : add reminders to update backends ggml-ci * vulkan : fix ggml_rope_ext() usage * cuda : fix array size + indents ggml-ci	2024-06-16 18:19:48 +03:00
Georgi Gerganov	1f35ce61c1	ggml : fix YARN + add tests + add asserts (llama/7617) * tests : add rope tests ggml-ci * ggml : fixes (hopefully) ggml-ci * tests : add non-cont tests ggml-ci * cuda : add asserts for rope/norm + fix DS2 ggml-ci * ggml : assert contiguousness * tests : reduce RoPE tests ggml-ci	2024-06-16 18:19:48 +03:00
Meng, Hengyu	3563473d2c	Align GEMM dispatch (llama/7566) * align GEMM dispatch	2024-06-16 18:19:48 +03:00
Georgi Gerganov	046834198d	sycl : fix assert (llama/7563)	2024-06-16 18:19:48 +03:00
Neo Zhang	8dca71de64	fix ggml_sycl_mul_mat_id() to match the change of api (llama/7436) * fix mul_mat_id to match the change of api * rm comment * rm unused or duplicated code, rename as review comment	2024-06-16 18:19:48 +03:00
Georgi Gerganov	812787cbc5	ggml : generalize GGML_OP_CONCAT (llama/7563) * ggml : generalize GGML_OP_CONCAT (WIP) ggml-ci * tests : add dim != 2 tests * metal : generalize concat kernel * tests : naming * cuda : generalize concat kernel ggml-ci * sycl : add warning and assert * ggml : fix op params handling * metal : bugfix kernel ggml-ci * ggml : reimplement CPU and Metal * cuda : add asserts ggml-ci * ggml : fix ptrs ggml-ci	2024-06-16 18:19:48 +03:00
AidanBeltonS	e98f9ac554	Fix q_xxs using mul_mat_q (llama/7459)	2024-06-16 18:19:48 +03:00
AidanBeltonS	02d481595b	Add freq factors (llama/7495)	2024-06-16 18:19:48 +03:00
Georgi Gerganov	45ddda8e0c	ggml : drop support for QK_K=64 (llama/7473) * ggml : drop support for QK_K=64 ggml-ci * opencl : restore QK_K=256 define	2024-06-16 18:19:48 +03:00
liuwei-git	c9dcb75118	llama : add phi3 128K model support (llama/7225) * add phi3 128k support in convert-hf-to-gguf * add phi3 128k support in cuda * address build warnings on llama.cpp * adjust index value in cuda long rope freq factors * add long rope support in ggml cpu backend * make freq factors only depend on ctx size * remove unused rope scaling type 'su' frin gguf converter * fix flint warnings on convert-hf-to-gguf.py * set to the short freq factor when context size is small than trained context size * add one line of comments * metal : support rope freq_factors * ggml : update ggml_rope_ext API to support freq. factors * backends : add dev messages to support rope freq. factors * minor : style * tests : update to use new rope API * backends : fix pragma semicolons * minor : cleanup * llama : move rope factors from KV header to tensors * llama : remove tmp assert * cuda : fix compile warning * convert : read/write n_head_kv * llama : fix uninitialized tensors --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-16 18:19:48 +03:00
AidanBeltonS	584cc1177a	Update SYCL upscale operation (llama/7321) * Update SYCL upscale operation * Formatting * Remove messages	2024-06-16 18:19:48 +03:00
AidanBeltonS	bf9b69284f	Add missing " (llama/7303)	2024-06-16 18:19:48 +03:00
John Balis	c4de1e19df	ggml : add `ggml_upscale_ext` (ggml/814) * initial commit with CPU implementation of upscale to shape and test, cuda implementation next * experimental commit to see if dst shape is correct * test version * test * removed unnecessary params * refactor * fixed tests * ggml : metal impl + cleanup + sycl dev warnings * patched ggml_upscale cuda op to handle non-contiguous tensors, added test for non-contiguous behavior * metal : fix upsacle op to support nb00 + style --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-06-16 18:19:48 +03:00
Neo Zhang	8e7c22fbdb	rm wait() (llama/7233)	2024-05-14 19:16:29 +03:00
Georgi Gerganov	e54329da7b	ggml : full ALiBi support (llama/7192) * ggml : full ALiBi support * ggml : update ggml_soft_max_ext() CUDA, SYCL * ggml : ggml_flash_attn_ext() support ALiBi (CPU) * ggml : ggml_flash_attn_ext() support ALiBi (Metal) * ggml : fix warning * ggml : ggml_flash_attn_ext() support ALiBi (CUDA) ggml-ci * ggml : fix assert message * vulkan : add dev notes * ggml : require mask when using ALiBi ggml-ci * convert : fix convert for refact models	2024-05-13 11:02:26 +03:00
Ouadie EL FAROUKI	fe454b8d9e	Minor arithmetic improvement to mmvq wrapper kernel (llama/7172)	2024-05-13 11:02:26 +03:00
Georgi Gerganov	156a33a990	ggml : add Flash Attention (llama/5021) * ggml : add ggml_flash_attn_ext API * ggml : fix GQA support in ggml_flash_attn_ext * ggml : online attention (CPU) * metal : initial implementation * metal : f16 precision * metal : reduce branches * metal : specialize for head size * wip : 8 rows per simd group * wip : 4 rows per simd group * wip : template for rows per warp * metal : parallelize across KV size * metal : parallel reduce across heads * metal : efficient flash_attn_f16 implementation * metal : avoid redundant loads of the attention * metal : scale and mask in matrix form * metal : fix comment * llama : avoid ggml_cast, use F32 query * metal : add parallel reduce version (disabled) * metal : move output into local memory + optimize - the result from each simdgroup now stays in the registers - significantly reduced SRAM usage - more efficient skipping of -INF blocks - avoid simdgroup barrier in hot loop - add comments * metal : add tests, fix scaling, support C > 32 * metal : improve precision * ggml : fix f16 mad * metal : minor * metal : support Q > 8 * tests : add ATTN tests * metal : disable buffer allocation logs * tests : more * metal : faster inner loop for C == 32 * metal : fix array initialization * tests : ifdef * ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext * ggml : fix ggml_soft_max mask requirement * cuda : fix soft_max to use correct mask size * cuda : add flash_attn kernel (wip) * metal : optimize softmax for C > 32 * metal : optimize softmax * tests : minor fix * cuda : avoid zeroing fragments * tests : update dims * cuda : fix __hisinf() result check * cuda : avoid warp_reduce for smax * cuda : use int instead of int64_t Noticeably improves performance (thanks to Johannes) * cuda : make loops use the same loop values Thanks Johannes again for the tip * cuda : unroll some of the loops * cuda : avoid __hisinf branches * cuda : use half2 in softmax * cuda : switch to 1 warp for bs > 16 * cuda : speed-up reduce part of the kernel * cuda : unroll QK^T loop cuda : fix -INF block check * cuda : simplify softmax * cuda : fix matrix names * cuda : minor * llama : adapt to F16 KQ_pos * llama : adapt new models to F16 KQ_mask * ggml : fix F16 store (ARM NEON) * llama : fix type of KQ_mask and KQ_pos * ggml : fix CPU soft_max * tests : add hs=256 * cuda : fix build * metal : improve perf via smaller int registers * cuda : adapt soft_max to F16 mask and pos * CUDA: faster FlashAttention, kernel for bs == 1 * 16 cols for Phi-2 * no vec for hs, no hs==256 ncols==32 for Volta * adjust kernel selection logic * 4 warps, 256 stride for all D * no ncols == 64 * Multiple parallel blocks for batch size 1 * fix compile warnings * fix excessive KQ_b loads * fix cmake build * fix KV cache padding, NaN from INFINITY (llama/6438) * llama : flash_attn cparam + fix defrag * server: support flash_attn param * server: bench: enable flash_attn param * CUDA: refactor host code, dyn. par. blocks * fix flash_attn_vec_f16 race condition * flush softmax exp below threshold to 0 * store temp KQ in registers * Calculate KQ as FP32 if KQV has GGML_PREC_F32 * Add __hgt2_mask implementation for CUDA 11 * fix KQ FP32 precision fpr parallel_blocks > 1 * llama-bench : add -fa,--flash-attn arg * metal : add BS=1 kernel for flash attention (llama/6508) * metal : add BS=1 kernel for flash attention (wip) * metal : support more than 1 warps * metal : opts * metal : opt * metal : switch to parallel reduce * metal : reduce registers * metal : simplify * metal : initial FA vec kernel * metal : use F32 attention accumulators * batched-bench : add fattn arg * llama : simplify llama_build_kv_store ggml-ci * llama : adapt build_olmo to changes * ggml : fix arm fp16 store on windows * metal : clean-up * metal : clean-up kernel code * metal : minor * tests : remove benchmarks ggml-ci * ggml : fix avx512 const correctness ggml-ci * ggml : fix soft_max with bias on CPU ggml-ci * common : print --flash-attn in help * ggml : fix num dimensions in ggml_flash_attn_ext * llama : force disable flash attention for incompatible models * ggml : ggml_soft_max support F16/F32 mask/pos ggml-ci * cuda : uint -> uint32_t * cuda : "constexpr dim3" -> "const dim3" ggml-ci * cuda : try to fix __hgt2_mask ggml-ci * ggml : add TODO's for F16/F32 mask/pos support in other backends * llama : replace bool need_kq_pos with use_alibi * llama : prep ALiBi support for BERT models ggml-ci * llama : fix n_batch requirements ggml-ci * cont * server : add help for --flash-attn arg * llama : disable FA for AMD * tests : remove TMP_ATTN_BENCH ggml-ci * llama : support save/load state with FA enabled ggml-ci * ci : add CUDA save-load-state tests ggml-ci * llama : llama_kv_cache_clear zeroes data + fix save-load seq ggml-ci * llama : fix copy-paste errors, add TODO * llama : disallow incompatible states * llama : update llama_state_get_size after v_trans field * metal : remove tmp log * llama : add static reminder for llama_state_get_size * metal : fix max nsg ggml-ci * ci : fix arg order ggml-ci --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>	2024-05-13 11:02:26 +03:00
Neo Zhang	9ad202bee9	add device version in device list (llama/6959) Co-authored-by: arthw <>	2024-05-13 11:02:26 +03:00
slaren	c96b0a938e	ggml : group all experts in a single ggml_mul_mat_id (llama/6505) * ggml : group all experts in a single ggml_mul_mat_id cuda : improve mmid row copy * cuda : fix bin bcast with non-cont src0 * test-backend-ops : only run all mul mat tests for base types * llama : disable moe offloading with SYCL --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-05-13 11:02:26 +03:00
Neo Zhang Jianyu	98c0b77e0c	fix mul_mat_id() for new input, make the ut pass (llama/6682)	2024-05-13 11:02:26 +03:00
Neo Zhang Jianyu	c1320c1f0c	fix memcpy() crash, add missed cmd in guide, fix softmax (llama/6622) * disable mmap to fix memcpy crash, add missed cmd in guide, fix softmax * refactor to disable mmap for SYCL backend * fix compile error in other os * refactor the solution, use host buf to fix it, instead of disable mmap * keep to support mmap() * use host buff to reduce malloc times * revert to malloc/free solution, for threaad safe	2024-05-13 11:02:26 +03:00
Abhilash Majumder	1d2721ca72	remove row=1 cond (llama/6532)	2024-04-09 20:26:18 +03:00
Neo Zhang Jianyu	219e601dab	support/fix OPs GGML_TYPE_IQ4_NL, GGML_TYPE_IQ4_XS, GGML_TYPE_IQ3_XXS, GGML_TYPE_IQ3_S, GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS, GGML_TYPE_IQ2_S, GGML_TYPE_IQ1_S, GGML_TYPE_IQ1_M (llama/6521)	2024-04-09 20:26:18 +03:00
Ouadie EL FAROUKI	61b05815e0	Fixed minor bug when enabling FP16 for non intel targets (llama/6464) * moved INTEL_MKL guard from gemm_impl to gemm (wrapper) * Update ggml-sycl.cpp Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com> --------- Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>	2024-04-07 16:15:57 +03:00
Meng, Hengyu	f12e982c0b	Disable iqx on windows as WA (llama/6435) * disable iqx on windows as WA * array instead of global_memory	2024-04-07 16:15:57 +03:00
Neo Zhang Jianyu	b83a9fc9d3	fix set main gpu crash (llama/6339)	2024-04-07 16:15:56 +03:00
Georgi Gerganov	2948c740a2	sync : ggml (#2001 ) * sync : update scripts * sync : ggml * talk-llama : sync llama.cpp * make : WHISPER_CUBLAS -> WHISPER_CUDA * ci : try to fix sycl build * talk-llama : fix make build	2024-03-27 18:55:10 +02:00
slaren	8932c2d6ce	llama : add pipeline parallelism support (llama/6017) * llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs ggml-ci * server : add -ub, --ubatch-size parameter * fix server embedding test * llama : fix Mamba inference for pipeline parallelism Tested to work correctly with both `main` and `parallel` examples. * llama : limit max batch size to n_batch * add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism default increase to 4 (from 2) changing this value may improve performance for some systems, but increases memory usage * fix hip build * fix sycl build (disable cpy_tensor_async) * fix hip build * llama : limit n_batch and n_ubatch to n_ctx during context creation * llama : fix norm backend * batched-bench : sync after decode * swiftui : sync after decode * ggml : allow ggml_get_rows to use multiple threads if they are available * check n_ubatch >= n_tokens with non-casual attention * llama : do not limit n_batch to n_ctx with non-casual attn * server : construct batch with size of llama_n_batch * ggml_backend_cpu_graph_compute : fix return value when alloc fails * llama : better n_batch and n_ubatch comment * fix merge * small fix * reduce default n_batch to 2048 --------- Co-authored-by: Francis Couture-Harpin <git@compilade.net> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-15 14:01:13 +02:00
AidanBeltonS	2bddfdd7c8	Update get version (llama/6025)	2024-03-15 14:01:13 +02:00
Georgi Gerganov	46e3c3f112	ggml : reuse quantum structs across backends (llama/5943) * ggml : reuse quant blocks across backends ggml-ci * ggml : define helper constants only for CUDA and SYCL ggml-ci * ggml : define helper quantum constants for SYCL ggml-ci	2024-03-15 14:01:13 +02:00
Georgi Gerganov	a753926f02	sycl : update IQ1_S kernels (WIP - not working!) (llama/5995) * sycl : try to fix after IQ1_S changes * sycl : iq1s_grid -> iq1s_grid_gpu * sycl : fix grid type	2024-03-15 14:01:13 +02:00
Abhilash Majumder	4f88940ff6	Add q3_s and q1_s (llama/5886) * Add q3_s and q1_s * fix compilation * fix build * fix build * fix build * enable ops * rm macro * increase grid space	2024-03-15 14:01:12 +02:00
Georgi Gerganov	24eba5a2ff	ggml : add ggml-common.h to deduplicate shared code (llama/5940) * ggml : add ggml-common.h to shared code ggml-ci * scripts : update sync scripts * sycl : reuse quantum tables ggml-ci * ggml : minor * ggml : minor * sycl : try to fix build	2024-03-15 14:01:12 +02:00
Neo Zhang Jianyu	bae7c23fbf	Revert "[SYCL] fix error when set main gpu to non-zero (llama/5901)" (llama/5918) This reverts commit ceca1aef0738b57951cd12c603c3477e75312dec.	2024-03-08 11:38:33 +02:00
Neo Zhang Jianyu	18ea187d42	fix error when set main gpu to non-zero (llama/5901) * fix error when set main gpu to non-zero * fix delete condition	2024-03-08 11:38:33 +02:00
Neo Zhang Jianyu	7ff1894c34	add wait() to make code stable (llama/5895)	2024-03-08 11:38:33 +02:00
Neo Zhang Jianyu	9d9a405cfd	fix mul_mat fault in CI/unit-test (llama/5862) * fix mul_mat fault in cpy_f32_f16 * rm unused function * add wait() for memcpy * restore ci/run.sh, rename struct defination, fix bug in ggml_sycl_op_mul_mat_sycl * fix format issue * llama : fix segfault from unknown model arch name (llama/5820) * llama : fix segfault from unknown model arch name * llama : make all LLM maps const This also requires using `std::map::at` instead of its `operator[]` which does not exist for const maps. * llama : name LLM_ARCH_UNKNOWN to "(unknown)" This avoids errors from `std::map::at` when getting the general name of the model architecture. Using "(unknown)" instead of an empty string as per suggestion https://github.com/ggerganov/llama.cpp/pull/5820#issuecomment-1973735284 * llama : remove redundant inner const for LLM_TENSOR_NAMES The extra const won't do anything here as const maps return const references to values. Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * llama : remove redundant nullptr check in llm_arch_from_string Since LLM_ARCH_NAMES is a const map, no spurious elements with a NULL name are inserted anymore, so this check is dead code. --------- Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * llama : refactor internal quantization functions (llama/5830) * scripts : add pod-llama.sh * ggml : IQ3_S improvements (llama/5829) * iq3_s: somewhat faster AVX2 dot product On Ryzen a 7950X TG-128 increases to 16 t/s from 15.5 t/s using 16 threads. For 8 threads it is 13.85 t/s vs 11.75 t/s. PP-512 increases to 28.5 t/s from 23.8 t/s. * iq3_s: somewhat faster ARM_NEON dot product Still dog slow - 10.7 t/s up from 9.9 t/s. * iq3_s: another small ARM_NEON improvement 10.7 -> 11.0 t/s. Using vmulq_s8 is faster than the xor - sub trick that works best on AVX2. * iq3_s: minor improvement on Metal 49.4 t/s -> 50.3 t/s * iq3_s: PPL improvement E.g., for a context of 4096 LLaMA-v2-7B goes to 5.1340 from 5.1653. * iq3_s: use new grid everywhere * Fix ARM_NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> * convert-hf : make model class definitions self-contained (llama/5825) * convert : automatically fall back to HfVocab if tokenizer.model doesn't exist (llama/5821) * ggml : fix IQ3_S AVX implementation (llama/5834) ggml-ci * llama : add abort_callback to interrupt computation (llama/5409) * using abort_callback from ggml to stop llama computation * format fix * a brief explaining comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * server: tests: passkey challenge / self-extend with context shift demo (llama/5832) * server: tests: add models endpoint scenario * server: /v1/models add some metadata * server: tests: add debug field in context before scenario * server: tests: download model from HF, add batch size * server: tests: add passkey test * server: tests: add group attention params * server: do not truncate prompt tokens if self-extend through group attention is enabled * server: logs: do not truncate log values * server: tests - passkey - first good working value of nga * server: tests: fix server timeout * server: tests: fix passkey, add doc, fix regex content matching, fix timeout * server: tests: fix regex content matching * server: tests: schedule slow tests on master * server: metrics: fix when no prompt processed * server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1 * server: tests: increase timeout for completion * server: tests: keep only the PHI-2 test * server: tests: passkey add a negative test * flake.lock: Update (llama/5842) Flake lock file updates: • Updated input 'flake-parts': 'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01) → 'github:hercules-ci/flake-parts/f7b3c975cf067e56e7cda6cb098ebe3fb4d74ca2' (2024-03-01) • Updated input 'flake-parts/nixpkgs-lib': 'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29) → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8?dir=lib' (2024-02-29) • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23) → 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> * server : init http requests thread pool with --parallel if set (llama/5836) * ci : schedule slow server tests only on Release or on demand (llama/5839) * llama : fix llama_copy_state_data with fragmented KV cache (llama/5840) The row size of the saved states was based on kv_self.head while it should be based on llama_kv_cache_cell_max. Existing session files should still work. * llama : fix llama_kv_cache_cell_max inability to return 1 I've also changed its return type to uint32_t, because this function is always used to set the value of uint32_t variables, and because the index already has this type. * llama : fix state size calculation Some bytes in the state were unaccounted for in llama_get_state_size. Since the logits reserve so much space, it did not cause problems. * gguf-dump : support i-quants (llama/5841) Co-authored-by: Black_Fox <radekliska@gmail.com> * llama : allow for user specified embedding pooling type (llama/5849) * allow for user specified pooling type * llama : use enum types over int --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * readme : add API changes section * cuda : fix data race in soft max (llama/5853) * main : support special tokens as reverse/anti prompt (llama/5847) * Support special tokens as reverse/anti prompt. * Tokenize antiprompts only once. * main : minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * common : use LLAMA_DEFAULT_SEED (llama/5855) * add some new ops, fix some operators and add batch operations to certain operators. (ggml/747) * cuda: fix group_norm * cuda: add batch inference support for ggml_pad/ggml_upscale * add ggml_arrange * add ggml_timestep_embedding * update ggml_arange/ggml_timestep_embedding tests * cuda: fix im2col * add ggml_arange/ggml_timestep_embbeding support for metal backend * fix some bugs * fix some bugs * Update ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml-cuda.cu Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml-metal.m Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml-metal.m Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml-metal.metal Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * modify according to the review comments * ggml : fix compile warnings + code style * ggml : normalize compute_forward calls + fix seg fault in debug * minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: slaren <slarengh@gmail.com> * sync : ggml * add alias for chat template (llama/5858) * speculative : implement stochastic speculative sampling (llama/5625) * (WIP) Implement stochastic speculative decoding * sample from residual distribution on draft accept failure * fix #5657: force greedy sampling with probs when temp is 0 * remove p_accept parameter * fix style * remove unused variables * add srand() in speculative.cpp * replace use of rand() with mt19937 sampling * fixes based on review (@JohannesGaessler) * fix r random generation * randomly select next sequence to verify + fix bug in memory freeing * fix bug in active_seqs sync * fix uniform int distribution initialization * remove warnings from comparison between int and size_t * check grammar in `llama_sample_probability_distribution_impl` * remove malloc code by utilizing vectors * add PR link to README * cmake : handle cases where git index is not found in .git (llama/5844) * Update CMakeLists.txt * Update CMakeLists.txt * ggml : introduce ggml_status (ggml/750) * using enum as an exit code instead of macros * update return type from enum to unsigned int * indentation fix * compound update ggml_compute_exit_code -> ggml_status changed ggml_status from a bit-field type to simple codes ggml_status to string cast * ggml_status to string cast * GGML_CALL was removed Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * sync : ggml ggml-ci * ggml : fix unknown status (llama/0) * flake : fix * llama : fix embeddings (llama/5796) * llama : fix embeddings ggml-ci * llama : do not use KV cache for non-causal models ggml-ci * embeddings : fix llama_batch_init arg * llama : add pooling switch * llama : distinguish token vs sequence embeddings ggml-ci * llama : assert pooling tensor * llama : simplify causal mask condition ggml-ci * llama : assert input batch with pooling enabled * readme : update API changes list * nix: static build (llama/5814) * fix speculative decoding build on windows (llama/5874) * rebase and rm tailing space --------- Co-authored-by: LiangtaoJin <liang-tao.jin@intel.com> Co-authored-by: compilade <113953597+compilade@users.noreply.github.com> Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Kawrakow <48489457+ikawrakow@users.noreply.github.com> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Jared Van Bortel <jared@nomic.ai> Co-authored-by: Michael Podvitskiy <podvitskiymichael@gmail.com> Co-authored-by: Pierrick Hymbert <pierrick.hymbert@gmail.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Nindaleth <Nindaleth@users.noreply.github.com> Co-authored-by: Black_Fox <radekliska@gmail.com> Co-authored-by: Douglas Hanley <thesecretaryofwar@gmail.com> Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: DAN™ <dranger003@gmail.com> Co-authored-by: leejet <leejet714@gmail.com> Co-authored-by: Minsoo Cheong <54794500+mscheong01@users.noreply.github.com> Co-authored-by: Dane Madsen <dane_madsen@hotmail.com> Co-authored-by: hutli <6594598+hutli@users.noreply.github.com> Co-authored-by: Jeffrey Quesnelle <emozilla@nousresearch.com>	2024-03-08 11:38:32 +02:00
Michael Podvitskiy	9a0b59d990	ggml : introduce ggml_status (ggml/750) * using enum as an exit code instead of macros * update return type from enum to unsigned int * indentation fix * compound update ggml_compute_exit_code -> ggml_status changed ggml_status from a bit-field type to simple codes ggml_status to string cast * ggml_status to string cast * GGML_CALL was removed Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-08 11:38:32 +02:00
Neo Zhang Jianyu	c3bfc9bfda	Support multiple GPUs (split mode) on SYCL backend (llama/5806) * suport multiple cards: split-mode - layer\|row * rm warning * rebase with master, support tow new OPs, close feature for -sm=row, fix for unit test * update news * fix merge error * update according to review comments	2024-03-08 11:38:32 +02:00
AidanBeltonS	11dd0d4482	Use batched mul_mat pathway (llama/5591) * Use batched mul_mat pathway * rm extra line * Explicitly state scaled data type --------- Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>	2024-03-08 11:38:31 +02:00
AidanBeltonS	8408a4be8e	Add support for soft_max ALiBi (llama/5639) * Add support for bias * Update pre-processor * rm commented code * fix format * fix CI --------- Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>	2024-02-28 13:00:29 +02:00
Georgi Gerganov	fac5b43830	code : normalize enum names (llama/5697) * coda : normalize enum names ggml-ci * code : cont * code : cont	2024-02-25 19:58:46 +02:00
UEXTM.com	1cb64f7368	Introduce backend GUIDs (ggml/743) * Introduce backend GUIDs Initial proposed implementation of backend GUIDs (Discussed in https://github.com/ggerganov/ggml/pull/741) Hardcoded CPU backend GUID (for now) Change ggml_backend_is_cpu logic to use GUID * Remove redundant functions Remove redundant functions `ggml_backend_i::get_name` and `ggml_backend_guid` which are not desired for future expansion * Add spaces to match style Co-authored-by: slaren <slarengh@gmail.com> * Fix brace style to match Co-authored-by: slaren <slarengh@gmail.com> * Add void to () in function signature Co-authored-by: slaren <slarengh@gmail.com> * Add back ggml_backend_guid and make CPU_GUID a local static in ggml_backend_cpu_guid * add guids to all backends ggml-ci --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-02-25 19:58:45 +02:00
Meng, Hengyu	208de95ac7	conext add name (llama/5624) * [SYCL] conext add name * name should start with SYCL*	2024-02-22 15:12:36 +02:00
AidanBeltonS	c2ce39c795	Update ggml_sycl_op_mul_mat_vec_q (llama/5502) * Update ggml_sycl_op_mul_mat_vec_q * Apply suggestions from code review Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com> * revert suggestion on macro * fix bug * Add quant type GGML_TYPE_IQ1_S to unsupported * fix format --------- Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>	2024-02-22 15:12:36 +02:00
Abhilash Majumder	462ffc58db	ggml-sycl: Replace 3d ops with macro (llama/5458) * use macro * use macro * fix format	2024-02-19 15:53:21 +02:00
Georgi Gerganov	8b17a2f776	src : relocate new backend sources	2024-02-10 09:55:47 +02:00

50 Commits