Mirror of https://github.com/ggerganov/whisper.cpp.git (synced 2025-05-04 08:04:39 +02:00)
* DRAFT: Introduction of CUDA Graphs to llama.cpp
* Fix issues raised in comments
* Tidied to now only use CUDA runtime (not mixed with driver calls)
* disable for multi-gpu and batch size > 1
* Disable CUDA graphs for old GPU arch and with env var
* added missing CUDA_CHECKs
* Addressed comments
* further addressed comments
* limit to GGML_ALLOW_CUDA_GRAPHS defined in llama.cpp cmake
* Added more comprehensive graph node checking
* With mechanism to fall back if graph capture fails
* Revert "With mechanism to fall back if graph capture fails"
  This reverts commit eb9f15fb6fcb81384f732c4601a5b25c016a5143.
* Fall back if graph capture fails and address other comments
* Further changes:
  - renamed GGML_ALLOW_CUDA_GRAPHS to GGML_CUDA_USE_GRAPHS
  - renamed the env variable that disables CUDA graphs to GGML_CUDA_DISABLE_GRAPHS
  - updated Makefile build to enable CUDA graphs
  - removed graph capture failure checking in ggml_cuda_error
    - using a global variable to track this is not thread safe, but I am also not satisfied with checking an error by string
    - if this is necessary to work around some issues with graph capture with e.g. cuBLAS, we can pass the ggml_backend_cuda_context to the error checking macro and store the result in the context
  - fixed several resource leaks
  - fixed issue with zero node graphs
  - changed fixed size arrays to vectors
  - removed the count of number of evaluations before start capturing, and instead changed the capture mode to relaxed
  - removed the check for multiple devices so that it is still possible to use a single device; instead checks for split buffers to disable CUDA graphs with -sm row
  - changed the op for checking batch size to GGML_OP_ADD, should be more reliable than GGML_OP_SOFT_MAX
  - code style fixes
  - things to look into
    - VRAM usage of the cudaGraphExec_t; if it is significant we may need to make it optional
    - possibility of using cudaStreamBeginCaptureToGraph to keep track of which ggml graph nodes correspond to which CUDA graph nodes
* fix build without cuda graphs
* remove outdated comment
* replace minimum cc value with a constant

Co-authored-by: slaren <slarengh@gmail.com>

Name | ||
---|---|---|
acc.cu | ||
acc.cuh | ||
alibi.cu | ||
alibi.cuh | ||
arange.cu | ||
arange.cuh | ||
argsort.cu | ||
argsort.cuh | ||
binbcast.cu | ||
binbcast.cuh | ||
clamp.cu | ||
clamp.cuh | ||
common.cuh | ||
concat.cu | ||
concat.cuh | ||
convert.cu | ||
convert.cuh | ||
cpy.cu | ||
cpy.cuh | ||
dequantize.cuh | ||
diagmask.cu | ||
diagmask.cuh | ||
dmmv.cu | ||
dmmv.cuh | ||
fattn.cu | ||
fattn.cuh | ||
getrows.cu | ||
getrows.cuh | ||
im2col.cu | ||
im2col.cuh | ||
mmq.cu | ||
mmq.cuh | ||
mmvq.cu | ||
mmvq.cuh | ||
norm.cu | ||
norm.cuh | ||
pad.cu | ||
pad.cuh | ||
pool2d.cu | ||
pool2d.cuh | ||
quantize.cu | ||
quantize.cuh | ||
rope.cu | ||
rope.cuh | ||
scale.cu | ||
scale.cuh | ||
softmax.cu | ||
softmax.cuh | ||
sumrows.cu | ||
sumrows.cuh | ||
tsembd.cu | ||
tsembd.cuh | ||
unary.cu | ||
unary.cuh | ||
upscale.cu | ||
upscale.cuh | ||
vecdotq.cuh |
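
The commit message at the top of this listing names several conditions under which CUDA graphs are turned off: the GGML_CUDA_DISABLE_GRAPHS environment variable, older GPU architectures, split buffers (`-sm row`), and batch size > 1. A minimal sketch of that gating logic, in plain C++ with no CUDA dependency, might look like the following. The struct, the function names, and the `min_cc` parameter are hypothetical and are not taken from the files listed here; only the environment variable name comes from the commit message. Treating the variable as "set to anything" is also an assumption.

```cpp
#include <cstdlib>

// Hypothetical snapshot of the conditions named in the commit message.
struct GraphGateInput {
    int  compute_capability;  // device compute capability; the numeric encoding is an assumption
    bool has_split_buffers;   // e.g. when running with -sm row
    bool batch_size_gt_1;     // more than one token per evaluation
};

// Assumption: any value of the variable, even an empty string, means "disable".
inline bool graphs_disabled_by_env() {
    return std::getenv("GGML_CUDA_DISABLE_GRAPHS") != nullptr;
}

// Returns true only if none of the disabling conditions applies.
// min_cc is a parameter because the commit message does not state its value.
inline bool should_use_cuda_graphs(const GraphGateInput& in, int min_cc, bool env_disabled) {
    if (env_disabled)                   return false;
    if (in.compute_capability < min_cc) return false;
    if (in.has_split_buffers)           return false;
    if (in.batch_size_gt_1)             return false;
    return true;
}
```

A caller would typically evaluate `graphs_disabled_by_env()` once at start-up, pass the result in, and fall back to launching kernels directly whenever `should_use_cuda_graphs` returns false.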