whisper.cpp

mirror of https://github.com/ggerganov/whisper.cpp.git synced 2025-07-15 19:35:07 +02:00

Files

agray3 24f0aa460b Introduction of CUDA Graphs to LLama.cpp (llama/6766)

* DRAFT: Introduction of CUDA Graphs to LLama.cpp

* Fix issues raised in comments

* Tidied to now only use CUDA runtime (not mixed with driver calls)

* disable for multi-gpu and batch size > 1

* Disable CUDA graphs for old GPU arch and with env var

* added missing CUDA_CHECKs

* Addressed comments

* further addressed comments

* limit to GGML_ALLOW_CUDA_GRAPHS defined in llama.cpp cmake

* Added more comprehensive graph node checking

* With mechanism to fall back if graph capture fails

* Revert "With mechanism to fall back if graph capture fails"

This reverts commit eb9f15fb6fcb81384f732c4601a5b25c016a5143.

* Fall back if graph capture fails and address other comments

* - renamed GGML_ALLOW_CUDA_GRAPHS to GGML_CUDA_USE_GRAPHS

- rename env variable to disable CUDA graphs to GGML_CUDA_DISABLE_GRAPHS

- updated Makefile build to enable CUDA graphs

- removed graph capture failure checking in ggml_cuda_error
  using a global variable to track this is not thread safe, and I am also not satisfied with checking an error by string
  if this is necessary to work around some issues with graph capture with e.g. cuBLAS, we can pass the ggml_backend_cuda_context to the error checking macro and store the result in the context

- fixed several resource leaks

- fixed issue with zero node graphs

- changed fixed size arrays to vectors

- removed the count of the number of evaluations before starting capture, and instead changed the capture mode to relaxed

- removed the check for multiple devices so that it is still possible to use a single device; instead, checks for split buffers to disable CUDA graphs with -sm row

- changed the op for checking batch size to GGML_OP_ADD, should be more reliable than GGML_OP_SOFT_MAX

- code style fixes

- things to look into
  - VRAM usage of the cudaGraphExec_t, if it is significant we may need to make it optional
  - possibility of using cudaStreamBeginCaptureToGraph to keep track of which ggml graph nodes correspond to which cuda graph nodes

* fix build without cuda graphs

* remove outdated comment

* replace minimum cc value with a constant

---------

Co-authored-by: slaren <slarengh@gmail.com>

2024-05-13 11:02:26 +03:00
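
The commit message above lists several conditions under which CUDA graphs are turned off: the GGML_CUDA_DISABLE_GRAPHS environment variable, older GPU architectures, split buffers (as with -sm row), and batch sizes greater than 1. A minimal C++ sketch of that kind of gating logic follows. It is not the ggml-cuda source: the struct, the function names, and the threshold constant are hypothetical, and the value 800 for the minimum compute capability is an assumption made for illustration only.

```cpp
#include <cstdlib>

// Hypothetical names throughout; this is an illustration, not the ggml-cuda code.

// Assumed minimum compute capability (major*100 + minor*10) for using graphs.
// The commit only says the minimum cc value became a constant; 800 is a guess.
constexpr int kMinGraphComputeCapability = 800;

// Hypothetical summary of the properties the commit message mentions.
struct GraphRequest {
    int  compute_capability;  // e.g. 860 for compute capability 8.6
    bool uses_split_buffers;  // true when tensors are row-split across devices
    int  batch_size;          // number of tokens evaluated in this graph
};

// True when the environment variable named in the commit message is set
// to any value (an assumption about how the variable is interpreted).
inline bool graphs_disabled_by_env() {
    return std::getenv("GGML_CUDA_DISABLE_GRAPHS") != nullptr;
}

// Decide whether graph capture should be attempted for this evaluation.
// The environment check is passed in so the decision itself stays pure.
inline bool should_use_cuda_graph(const GraphRequest & req, bool disabled_by_env) {
    if (disabled_by_env) {
        return false;  // user opted out
    }
    if (req.compute_capability < kMinGraphComputeCapability) {
        return false;  // old GPU architecture
    }
    if (req.uses_split_buffers) {
        return false;  // split buffers disable graphs
    }
    if (req.batch_size > 1) {
        return false;  // graphs are limited to batch size 1
    }
    return true;
}
```

A caller would evaluate `should_use_cuda_graph(req, graphs_disabled_by_env())` once per graph evaluation and fall back to launching kernels directly whenever it returns false, which mirrors the fallback behaviour the commit describes for failed captures.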

acc.cu

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

acc.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

alibi.cu

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

alibi.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

arange.cu

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

arange.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

argsort.cu

ggml : mul_mat_id use the same tensor for all the experts (llama/6387)

2024-04-07 16:15:57 +03:00

argsort.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

binbcast.cu

ggml : group all experts in a single ggml_mul_mat_id (llama/6505)

2024-05-13 11:02:26 +03:00

binbcast.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

clamp.cu

Introduction of CUDA Graphs to LLama.cpp (llama/6766)

2024-05-13 11:02:26 +03:00

clamp.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

common.cuh

Introduction of CUDA Graphs to LLama.cpp (llama/6766)

2024-05-13 11:02:26 +03:00

concat.cu

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

concat.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

convert.cu

Introduction of CUDA Graphs to LLama.cpp (llama/6766)

2024-05-13 11:02:26 +03:00

convert.cuh

llama : add Command R Plus support (llama/6491)

2024-04-09 20:26:18 +03:00

cpy.cu

Introduction of CUDA Graphs to LLama.cpp (llama/6766)

2024-05-13 11:02:26 +03:00

cpy.cuh

Introduction of CUDA Graphs to LLama.cpp (llama/6766)

2024-05-13 11:02:26 +03:00

dequantize.cuh

llama : add Command R Plus support (llama/6491)

2024-04-09 20:26:18 +03:00

diagmask.cu

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

diagmask.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

dmmv.cu

llama : add Command R Plus support (llama/6491)

2024-04-09 20:26:18 +03:00

dmmv.cuh

sync : llama.cpp (skip)

2024-04-07 16:15:57 +03:00

fattn.cu

CUDA: CUDART < 11.7 workaround for __hmax, __hmax2 (llama/7019)

2024-05-13 11:02:26 +03:00

fattn.cuh

ggml : add Flash Attention (llama/5021)

2024-05-13 11:02:26 +03:00

getrows.cu

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

getrows.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

im2col.cu

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

im2col.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

mmq.cu

Introduction of CUDA Graphs to LLama.cpp (llama/6766)

2024-05-13 11:02:26 +03:00

mmq.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

mmvq.cu

Introduction of CUDA Graphs to LLama.cpp (llama/6766)

2024-05-13 11:02:26 +03:00

mmvq.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

norm.cu

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

norm.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

pad.cu

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

pad.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

pool2d.cu

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

pool2d.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

quantize.cu

llama : add Command R Plus support (llama/6491)

2024-04-09 20:26:18 +03:00

quantize.cuh

llama : add Command R Plus support (llama/6491)

2024-04-09 20:26:18 +03:00

rope.cu

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

rope.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

scale.cu

Introduction of CUDA Graphs to LLama.cpp (llama/6766)

2024-05-13 11:02:26 +03:00

scale.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

softmax.cu

ggml : add Flash Attention (llama/5021)

2024-05-13 11:02:26 +03:00

softmax.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

sumrows.cu

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

sumrows.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

tsembd.cu

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

tsembd.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

unary.cu

feat: implemented sigmoid function (ggml/806)

2024-05-13 11:02:26 +03:00

unary.cuh

feat: implemented sigmoid function (ggml/806)

2024-05-13 11:02:26 +03:00

upscale.cu

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

upscale.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00

vecdotq.cuh

sync : ggml (#2001)

2024-03-27 18:55:10 +02:00