Commit Graph

1221 Commits

Author SHA1 Message Date
Justine Tunney
7a4f7d825e ggml : add llamafile sgemm (llama/6414)
This change upstreams llamafile's cpu matrix multiplication kernels
which improve image and prompt evaluation speed. For starters, Q4_0
and Q8_0 weights should go ~40% faster on CPU. The biggest benefits
are with data types like f16 / f32, which process prompts 2x faster
thus making them faster than quantized data types for prompt evals.

This change also introduces bona fide AVX512 support since tinyBLAS
is able to exploit the larger register file. For example, on my CPU
llama.cpp llava-cli processes an image prompt at 305 tokens/second,
using the Q4_K and Q4_0 types, which has always been faster than if
we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With
this change, f16 LLaVA performance leap frogs to 464 tokens/second.

On Intel Core i9-14900K this change improves F16 prompt perf by 5x.
For example, using llama.cpp at HEAD with Mistral 7b f16 to process
a 215 token prompt will go 13 tok/sec. This change has fixes making
it go 52 tok/sec. It's mostly thanks to my vectorized outer product
kernels but also because I added support for correctly counting the
number of cores on Alderlake, so the default thread count discounts
Intel's new efficiency cores. Only Linux right now can count cores.

This work was sponsored by Mozilla who's given permission to change
the license of this code from Apache 2.0 to MIT. To read more about
what's improved, and how it works, see: https://justine.lol/matmul/
2024-05-13 11:02:26 +03:00
Shijie
fdb2c87350 llama : add qwen2moe (llama/6074)
* support qwen2moe

* fix-review

* metal : support unary ops for nelements % 4 != 0

* metal : require contiguousness for float4 unary kernels

* metal : require contiguousness for float4 unary kernels (cont)

* fix-review

* names : for brevity "SHARED_EXP" -> "SHEXP"

* llama : reuse build_moe_ffn()

* llama : add model type name

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-13 11:02:26 +03:00
Neo Zhang Jianyu
98c0b77e0c fix mul_mat_id() for new input, make the ut pass (llama/6682) 2024-05-13 11:02:26 +03:00
Dave
9d6d50d933 Added support for GGML_OP_CLAMP in Metal (llama/6662)
* Added support for GGML_OP_CLAMP in Metal

* Corrected size

---------

Co-authored-by: dave-fl <dave@Davids-MacBook-Pro.local>
2024-05-13 11:02:26 +03:00
Neo Zhang Jianyu
c1320c1f0c fix memcpy() crash, add missed cmd in guide, fix softmax (llama/6622)
* disable mmap to fix memcpy crash, add missed cmd in guide, fix softmax

* refactor to disable mmap for SYCL backend

* fix compile error in other os

* refactor the solution, use host buf to fix it, instead of disable mmap

* keep to support mmap()

* use host buff to reduce malloc times

* revert to malloc/free solution, for threaad safe
2024-05-13 11:02:26 +03:00
Johannes Gäßler
66aaf03a7a CUDA: fix matrix multiplication logic for tests (llama/6667) 2024-05-13 11:02:26 +03:00
slaren
00a0947c65 metal : unify mul_mv_id kernels (llama/6556) 2024-05-13 11:02:26 +03:00
jiez
60f3713026 llama : add gguf_remove_key + remove split meta during quantize (llama/6591)
* Remove split metadata when quantize model shards

* Find metadata key by enum

* Correct loop range for gguf_remove_key and code format

* Free kv memory

---------

Co-authored-by: z5269887 <z5269887@unsw.edu.au>
2024-05-13 11:02:26 +03:00
Justina Cho
37e6757453 feat: implemented sigmoid function (ggml/806)
* added sigmoid function

* implemented metal kernel for sigmoid

* implemented cuda kernel for sigmoid

* added sigmoid unary op and incremented count
2024-05-13 11:02:26 +03:00
Borislav Stanimirov
8dcefdf4a9 build: fix and ignore msvc warnings (ggml/805) 2024-05-13 11:02:26 +03:00
Przemysław Pawełczyk
73d13ad19a
ggml : expose SSE3 and SSSE3 for MSVC when AVX is available (#2128) 2024-05-08 18:33:43 +03:00
Przemysław Pawełczyk
b6680fab50
build : improve disabling AVX-512 (#2129)
* cmake : make WHISPER_NO_AVX512=ON disable all subsets of AVX-512

Previously it happened only for MSVC, but it makes sense to have the
same behavior for other compilers too.

* make : reorder x86 ISA extensions in chronological order

And update compiler flags at the end to ease modifying conditions.

* make : support WHISPER_NO_AVX512=1 for disabling all AVX-512 subsets.

That way you do not have to override each AVX-512 subset setting
individually if it has been turned on during autodetection.
2024-05-08 18:32:43 +03:00
Borislav Stanimirov
f760756078
minor: add CMakeSettings.json to gitignore (#2094) 2024-05-08 11:03:21 +03:00
Pedro Probst
58210d6a76
examples : fix node compilation (#2115)
* node : fix compilation and update examples

* node : fix readme

* Update addon.node test
2024-05-02 22:52:55 +01:00
Przemysław Pawełczyk
8fac6455ff
make : change GNU make default CXX from g++ to c++ (#2100) 2024-04-28 22:54:21 +01:00
goldwaving
22b6598cc9
Remove unnecessary memory reallocation in fft (#2080)
fft_out needs to be twice the frame_size, not the frame_step.  It is resized in fft() anyway, but this change prevents an unnecessary reallocation.

n_fft must match the mel filter size, so it is best not to calculate it from the framesize.

We only need to get the magnitudes for half the spectrum since the other half is a mirror and not used in the mel filter loop later.
2024-04-28 18:36:12 +01:00
Georgi Gerganov
858452d58d
models : disable old script (#2079) 2024-04-24 14:56:30 +03:00
Georgi Gerganov
7f85e1d7fd
whisper : more prominent log message for sub-1s audio (#2065) 2024-04-24 14:46:06 +03:00
Georgi Gerganov
b0c3cbf2e8
main : pass nullptr when regex is empty (#2070) 2024-04-17 12:23:47 +03:00
AIWintermuteAI
a750868428
readme : add up-to-date repository for Python bindings (#2063)
README
2024-04-16 14:15:52 +03:00
Georgi Gerganov
7395c70a74
release : v1.5.5 2024-04-16 14:08:31 +03:00
Emmanuel Schmidbauer
9fab28135c
server : add dtw (#2044)
* server.cpp: add dtw

* Update examples/server/server.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-15 22:16:58 +03:00
Didzis Gosko
08d3eef97d
build : fix embedded Metal library generation (#2045) 2024-04-15 20:23:05 +03:00
Pedro Probst
1b5439a6c2
node : support no timestamps (#2048)
* fix: node: do not compute timestamps if you do not need them

* feat: add no_timestamps parameter to node addon
2024-04-15 20:03:34 +03:00
Didzis Gosko
c7f95b7ca2
build : detect AVX512 in Makefile, add AVX512 option in CMake (#2043)
* make : add AVX512 detection to Makefile and CMakeLists.txt

* make : autodetect more AVX512 instruction subsets

* cmake : do not default to AVX512, must be enabled explicitly

* cmake : enable a set of AVX512 subsets, when AVX512 is turned on

* make : consolidate AVX512 subsets, add AVX512 VBMI

* cmake : revert to NO AVX512 setting, add settings for AVX512 VNNI and VBMI

* make : re-introduce AVX512VNNI back

* cmake : remove superfluous comment line
2024-04-15 20:02:09 +03:00
Kendrick Taylor
5c554c04ff
whisper.nvim : fix missing reference to "model" variable (#2049) 2024-04-15 19:41:28 +03:00
Ikko Eltociear Ashimine
c383f091a1
whisper : update grammar-parser.cpp (#2058)
preceeding -> preceding
2024-04-15 19:40:27 +03:00
Georgi Gerganov
8f253ef3af
sync : ggml 2024-04-09 20:27:55 +03:00
Georgi Gerganov
c7dc37f97c
license : update copyright notice + add AUTHORS 2024-04-09 20:27:44 +03:00
Carolinabanana
526332873b
llama : add Command R Plus support (llama/6491)
* Add Command R Plus GGUF

* Add Command R Plus GGUF

* Loading works up to LayerNorm2D

* Export new tensors in 1D so they are not quantized.

* Fix embedding layer based on Noeda's example

* Whitespace

* Add line

* Fix unexpected tokens on MPS. Re-add F16 fix. ((Noeda)

* dranger003: Fix block index overflow in CUDA dequantizing.

* Reverted blocked multiplication code as it still has issues and could affect other Llama arches

* export norms as f32

* fix overflow issues during quant and other cleanup

* Type convention

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* dranger003: Fix more int overflow during quant.

---------

Co-authored-by: S <seast@Ss-Mac-Studio.local>
Co-authored-by: S <s@example.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-09 20:26:18 +03:00
Abhilash Majumder
1d2721ca72
remove row=1 cond (llama/6532) 2024-04-09 20:26:18 +03:00
Neo Zhang Jianyu
219e601dab
support/fix OPs GGML_TYPE_IQ4_NL, GGML_TYPE_IQ4_XS, GGML_TYPE_IQ3_XXS, GGML_TYPE_IQ3_S, GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS, GGML_TYPE_IQ2_S, GGML_TYPE_IQ1_S, GGML_TYPE_IQ1_M (llama/6521) 2024-04-09 20:26:18 +03:00
Georgi Gerganov
3b8aade3c2
scripts : update sync 2024-04-09 20:25:50 +03:00
Georgi Gerganov
52ccd4a3a8
files : rename ./extra to ./scripts 2024-04-09 20:13:41 +03:00
Brad Murray
5275074d37
whisper : fix DTW memory access (#2012)
* Fix DTW memory access

* Memory fix - Apply changes from denersc
2024-04-09 18:38:19 +03:00
ulatekh
c15b4cda7d
common : fix file-handle leak in read_wav() (#2026)
Now it cleans up in case of error.
2024-04-09 18:34:34 +03:00
Rotem Dan
d3cfb6ca2b
main : set stdin to binary mode on Windows (#2025) 2024-04-09 18:33:32 +03:00
slashlib
956ef860bc
cmake : support for CPU BLAS build via Intel MKL (#2024) 2024-04-09 18:32:46 +03:00
ulatekh
671b4bde6c
main : allow a response-file as the sole parameter (#2019)
* The "main" example now allows a response-file as the sole parameter.

A response-file is a text file with command-line parameters, one per line.
Prefix the name of the response-file with "@" to identify it as such.
It's used under MS Windows to work around command-line length limits.
It may be useful under other platforms to simplify character-escaping.

* minor : style

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-09 18:31:16 +03:00
ulatekh
c8eeb93a6a
whisper : suppress tokens with a regex (#1997)
* Allow a regular expression to describe tokens to suppress.

Example: --suppress-tokens-re "[,\.]|[ ]?[0-9]+" will suppress commas, periods, and numeric tokens.

Technique inspired by https://github.com/openai/whisper/discussions/1041

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Blind change to fix Java test.

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-09 18:27:28 +03:00
ulatekh
319fe5146e
cmake : create solution folders (#2004)
* Create solution folders in the CMake build.

* Fixed non-SDL2 build.

* Fixed emscripten build.
2024-04-09 18:23:33 +03:00
Georgi Gerganov
13c22321d1
sync : ggml 2024-04-07 17:04:56 +03:00
Georgi Gerganov
ccbe9d5676
extra : sync grammar-parser 2024-04-07 17:04:22 +03:00
Georgi Gerganov
81a3c41aa0
talk-llama : sync llama.cpp 2024-04-07 16:21:08 +03:00
Georgi Gerganov
a50207c65d
sync : ggml 2024-04-07 16:18:11 +03:00
Georgi Gerganov
97878e53fd
sync : llama.cpp (skip)
ggml-ci
2024-04-07 16:15:57 +03:00
Ouadie EL FAROUKI
61b05815e0
Fixed minor bug when enabling FP16 for non intel targets (llama/6464)
* moved INTEL_MKL guard from gemm_impl to gemm (wrapper)

* Update ggml-sycl.cpp

Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>

---------

Co-authored-by: AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>
2024-04-07 16:15:57 +03:00
slaren
1dce94cf26
ggml : mul_mat_id use the same tensor for all the experts (llama/6387)
* ggml : update mul_mat_id to use the same tensor for all the experts

* update cuda

* minor

* update metal

* update test-backend-ops

* fix cuda

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update convert.py

* update convert-hf-to-gguf.py

* update convert.py for mixtral hf models

* Update convert-hf-to-gguf.py

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* cuda : support non-pow-2 number of experts

* allow quantize to work for split and merged experts models in the same way

* cleanup + disable mmap automatically with split tensors models

* update imatrix

* test-backend-ops : test qwen argsort

* update grok model loading

* llama : add merged experts tensors to the grok tensor map

* minor

* gguf : bump version

* fix quantizing of merged experts

* convert-hf-to-gguf.py : update grok (untested)

* make linter happy

* cuda/argsort : use shared memory instead of pool memory

* convert : fix grok tensor names

* metal : add support for non-pow-2 argsort

* llama : more loader cleanup, better error checking

* cuda : fix warning

* llama : still use mmap for loading old models, but copy the data to a host buffer

* add review note

* llama : remove ffn tensor counting + add sanity check

ggml-ci

* convert : fix handling of n_experts == None

ggml-ci

* imatrix : fix ncall counters

* llama : produce error if imatrix size does not match

* quantize : terminate on errors + trace logs

ggml-ci

* metal : pad shared memory to 16 bytes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-07 16:15:57 +03:00
Meng, Hengyu
f12e982c0b
Disable iqx on windows as WA (llama/6435)
* disable iqx on windows as WA

* array instead of global_memory
2024-04-07 16:15:57 +03:00
0cc4m
fa966b9b40
Vulkan k-quant mmq and ggml-backend offload functionality (llama/6155)
* Fix Vulkan no kv offload incoherence

* Add k-quant mul mat mat shaders

* Rework working buffer allocation, reduces vram use noticeably

Clean up cpu assist code, replaced with ggml-backend offload function

* Default to all dedicated GPUs

* Add fallback for integrated GPUs if no dedicated GPUs are found

* Add debug info which device is allocating memory

* Fix Intel dequant issue

Fix validation issue

* Fix Vulkan GGML_OP_GET_ROWS implementation

* Clean up merge artifacts

* Remove Vulkan warning
2024-04-07 16:15:57 +03:00