mirror of
https://github.com/ggerganov/whisper.cpp.git
synced 2025-01-24 23:09:05 +01:00
examples : vim plugin and LSP server (#1144)
* Initial proof of concept Vim plugin At present, this is likely only slightly better than feature parity with the existing whisper.nvim Known issues: Trailing whitespace Up to an existing length(5 seconds) of speech may be processed when listening is enabled CPU cycles are spent processing speech even when not listening. Fixing these issues is likely dependent upon future efforts to create a dedicated library instead of wrapping examples/stream * Support $WHISPER_CPP_HOME environment variable A minor misunderstanding of the whisper.nvim implementation resulted in a plugin that was functional, but not a drop in replacement as it should be now. * Initial progress on LSP implementation Libcall is nonviable because the library is immediately freed after a call is made. Further investigation has shown Language Server Protocol as a promising alternative that both simplifies the required logic on the vimscript side and increases the ease with which plugins for other editors could be made in the future. This is a very large undertaking and my progress has slowed substantially. Work is far from being in a usable state, but I wish to keep track of major refactors for organizational purposes. * Rewrite audio windowing of guided transcription One of the defining goals of this venture is allowing consecutive commands to be rattled off without the existing deadzones of the current implementation. * Add unguided_transcription. Cleanup. The unguided transcription implantation heavily borrows from existing example implementations and the guided_transcription logic. A high level pass was done to check that method arguments are accurate to what inputs are actually required. A first attempt at cancellation support was added for record keeping, but will be deleted in a future commit. * Fix compilation. Resolves a large number of compilation errors. No testing has been done yet for execution errors. Update Makefile and .gitignore * Functional unguided_transcription * Functional guided_transcription Fix commandset_list being passed by value Properly register the first token of a multitoken command * Minor changes before time fix I've apparently made an awfully major mistake in thinking that unix time was in milliseconds and will be changing all timekeeping code to use standardized methods. In preparation for this is a number of minor bugfixes. Output is manually flushed. An echo method has been added. registerCommandset now wraps the returned index * Swap timekeeping to use std::chrono * Add work in progress lsp backed whisper.vim plugin Current progress blockers are Adding modality awareness to the command processing (specifically, motion prompting) Improving the VAD to be a little more responsive (testing start of activity) * Reworked vim plugin command loop * Fix change inside Multiple bug fixes that, crucially, bring the plugin to the point where a demonstration video is possible Add better echo messaging so whisper_log isn't required Add loading complete message as indicator when listening has started Insert/append are actually included in command sets Some more heavy handed corrections to prevent a double exit when leaving insert mode As a somewhat hacky fix, the very first space is removed when inserting. This cleans up most use cases, but leaves me unsatisfied with the few cases it would be desired. * Forcibly set commandset_index to 0 after subinsert Also remove unnecessary ! to use builtin vim command * Fix upper A minor scope mistake was causing upper'd inputs to be eaten. This was fixed and echoing was slightly improved for clarity. * Fix formatting Corrects indentation to 4 spaces as project standard Slightly better error support for malformed json input * Remove obsolete vim plugin * Add json.hpp library The same library that is used for the llama.cpp server * Minor cleanups add lsp to the make clean directive. remove a redundant params definition. reorder whisper.vim logging for subtranscriptions Corrections to unlets (variables of argument scope appear immutable) * Fix indentation. Fallback for subTranscription Indentation has been changed to 4 spaces. Unit testing has been set up, I'm opting not to include it in the repository for now. It however, has revealed a bug in the state logic where a subtranscription can be initiated without having a saved command When this occurs, append is added as a fallback * Move audio polling logic to a subfunction While work on the improved vad will continue, It's grown to be a little out of scope. Instead, a future commit will perform multiple detection passes at substretches of audio when a backlog of audio exists. To facilitate this, and prevent code duplication, the vad code has been moved into a subfunction shared by both the unguided and guided transcription functions. * Test for voice over subchunks if backlog > 1s As the existing VAD implementation only checks for a falling edge at the end of an audio chunk. It fails to detect voice in cases where the recorded voice is only at the beginning of the audio. To ameliorate this, when the timestamp would cause analysis of audio over a second in length, it is split into 1 second length subchunks which are individually tested. Results are promising, but there seems to be a remaining bug with unguided transcription likely related to saving context * Limit the maximum length of audio input. This existing VAD implementation only detects falling edges, which means any gap in the users speaking is processed for transcription. This simply establishes a constant maximum length depending on the type of transcription. Uguided gets a generous 10 seconds and guided, 2. While quick testing showed that commands are generally around a half a second to a second, limiting commands to an even second resulted in extreme degradation of quality. (Seemingly always the same output for a given commandset) * Unguided timestamp tracking, cleanup Unguided transcriptions where not setup to allow for passing of timestamp data forward, but have been corrected. No_context is now always set to false. While conceptually desirable for the quality of guided transcription, It was seemingly responsible for prior command inputs ghosting in unguided transcription. Save and Run are now tracked by command number instead of command text. While command_text was provided for convenience, I wish to keep command index authoritative. This gives greater consistency and potentially allows for end users to rename or even translate the spoken versions of these commands * By default, maintain mode. Previously, mode was reset to 0 unless otherwise set. In addition to causing some edge cases, this was didn't mesh well with the existing approach to visual mode. With this change, initial tests indicate visual mode is functional. * Add undo breaks before subtranscriptions Subtranscriptions use undo as a hack to allow for partial responses to be displayed. However, scripts don't cause an undo break mid execution unless specifically instructed to. This meant that multiple unguided transcriptions from a single session would cause a latter to undo a former. This is now fixed and undo should be reasonably usable as a command. * Append instead of insert for new undo sequence When entering and leavening insert mode with `i`, the cursor shifts one column to the left. This is remedied by using append instead of insert for setting these breaks in the undo sequence `-` was also added to the pronunciation dictionary to be pronounced as minus as it was causing a particularly high failure rate. * Move undo sequence breaks to command execution Previously, undo sequence breaks were triggered when there was a command that caused a move to insert mode. This caused commands that changed state (like delete or paste) to be bundled together with into the last command that caused text to be entered. * Fix repeat. Add space, carrot, dollar commands Repeat (.) wasn't being tracked properly just like undo and is being manually tracked now. While efforts have been made to properly handle spaces, it was particularly finicky to add a single space when one is needed. A special 'space' command has been added to insert a single space and move the cursor after it. Carrot and Dollar commands have been added for start of line and end of line respectively. These are both simple to implement, and just a matter of defining a pronunciation. * Return error on duplicate in commandset Not every command in the commandset tokenizes to a single token. Because of this, it's possible for that two commands could resolve to the same single token after subsequent tokens are discarded. This commit adds a simple check for duplicates when a commandset is registered and returns an error if so. Additional code will be required later on the vim side to actually process this error. * Add support for user-defined commands This adds a user definable dictionary from spoken keys to strings or funcrefs. All keys are added to the commandlist and when spoken, trigger the corresponding function. Like "save" and "run", these user commands are only available when the command buffer is empty. * Add readme, update cmake * Add area commandset. Refactor spoken_dict Area commands (inside word, around sentence...) have been given a commandset as considered earlier. Verbose definitions for spoken_dict entries now use dicts instead of lists. This shortens the definition for most keys that require it and scales better with the addition of further commandsets * Add mark, jump. Fix change under visual. Mark (m) and jump (') have been added. When a visual selection was executed upon a command that initiated a subtranscription (change) the area of the visual selection is not properly tracked which causes the attempt to stream in partial response to fail. This is solved by disabling partial transcriptions from being streamed when a subtranscription is started while in visual mode. * Accommodate ignorecase. Fix change. From testing on older different versions of vim, the test for distinguishing an 'R' replace all from an 'r' replace could fail if ignorecase was set. The comparison has been changed to explicitly require case matching Change detection has been moved to the execution section as it was missing the change+motion case. * Support registers. Fix README typo There's no logic to prevent doubled register entry, but the functional result is equivalent to if the same key order was typed into vim. A minor typo in the readme. I've mismemorized the mnemonic for 't' as 'to' instead of till., but 'to' can't be used as it's a homophone with '2'. While there was no mistake in the actual logic, it was misleading to use 'to' in the readme.
This commit is contained in:
parent
cb5fb0a12d
commit
175ffa64ee
1
.gitignore
vendored
1
.gitignore
vendored
@ -24,6 +24,7 @@ build-sanitize-thread/
|
||||
/talk-llama
|
||||
/bench
|
||||
/quantize
|
||||
/lsp
|
||||
|
||||
arm_neon.h
|
||||
sync.sh
|
||||
|
5
Makefile
5
Makefile
@ -266,7 +266,7 @@ libwhisper.so: ggml.o $(WHISPER_OBJ)
|
||||
$(CXX) $(CXXFLAGS) -shared -o libwhisper.so ggml.o $(WHISPER_OBJ) $(LDFLAGS)
|
||||
|
||||
clean:
|
||||
rm -f *.o main stream command talk talk-llama bench quantize libwhisper.a libwhisper.so
|
||||
rm -f *.o main stream command talk talk-llama bench quantize lsp libwhisper.a libwhisper.so
|
||||
|
||||
#
|
||||
# Examples
|
||||
@ -293,6 +293,9 @@ stream: examples/stream/stream.cpp $(SRC_COMMON) $(SRC_COMMON_SDL) ggml.o $(WHIS
|
||||
command: examples/command/command.cpp $(SRC_COMMON) $(SRC_COMMON_SDL) ggml.o $(WHISPER_OBJ)
|
||||
$(CXX) $(CXXFLAGS) examples/command/command.cpp $(SRC_COMMON) $(SRC_COMMON_SDL) ggml.o $(WHISPER_OBJ) -o command $(CC_SDL) $(LDFLAGS)
|
||||
|
||||
lsp: examples/lsp/lsp.cpp $(SRC_COMMON) $(SRC_COMMON_SDL) ggml.o $(WHISPER_OBJ)
|
||||
$(CXX) $(CXXFLAGS) examples/lsp/lsp.cpp $(SRC_COMMON) $(SRC_COMMON_SDL) ggml.o $(WHISPER_OBJ) -o lsp $(CC_SDL) $(LDFLAGS)
|
||||
|
||||
talk: examples/talk/talk.cpp examples/talk/gpt-2.cpp $(SRC_COMMON) $(SRC_COMMON_SDL) ggml.o $(WHISPER_OBJ)
|
||||
$(CXX) $(CXXFLAGS) examples/talk/talk.cpp examples/talk/gpt-2.cpp $(SRC_COMMON) $(SRC_COMMON_SDL) ggml.o $(WHISPER_OBJ) -o talk $(CC_SDL) $(LDFLAGS)
|
||||
|
||||
|
@ -69,4 +69,5 @@ else()
|
||||
add_subdirectory(quantize)
|
||||
add_subdirectory(talk)
|
||||
add_subdirectory(talk-llama)
|
||||
add_subdirectory(lsp)
|
||||
endif()
|
||||
|
9
examples/lsp/CMakeLists.txt
Normal file
9
examples/lsp/CMakeLists.txt
Normal file
@ -0,0 +1,9 @@
|
||||
if (WHISPER_SDL2)
|
||||
# stream
|
||||
set(TARGET lsp)
|
||||
add_executable(${TARGET} lsp.cpp)
|
||||
|
||||
include(DefaultTargetOptions)
|
||||
|
||||
target_link_libraries(${TARGET} PRIVATE common common-sdl whisper ${CMAKE_THREAD_LIBS_INIT})
|
||||
endif ()
|
104
examples/lsp/README.md
Normal file
104
examples/lsp/README.md
Normal file
@ -0,0 +1,104 @@
|
||||
# Language Server
|
||||
|
||||
This example consists of a simple language server to expose both unguided
|
||||
and guided (command) transcriptions by sending json messages over stdout/stdin
|
||||
as well as a rather robust vim plugin that makes use of the language server.
|
||||
|
||||
## Vim plugin quick start
|
||||
|
||||
Compile the language server with
|
||||
|
||||
```bash
|
||||
make lsp
|
||||
```
|
||||
Install the plugin itself by copying or symlinking whisper.vim into ~/.vim/autoload/
|
||||
|
||||
In your vimrc, set the path of your whisper.cpp directory and optionally add some keybinds.
|
||||
|
||||
```vim
|
||||
let g:whisper_dir = "~/whisper.cpp"
|
||||
" Start listening for commands when Ctrl - g is pressed in normal mode
|
||||
nnoremap <C-G> call whisper#requestCommands()<CR>
|
||||
" Start unguided transcription when Ctrl - g is pressed in insert mode
|
||||
inoremap <C-G> <Cmd>call whisper#doTranscription()<CR>
|
||||
```
|
||||
|
||||
## Vim plugin usage
|
||||
|
||||
The vim plugin was designed to closely follow the mnemonics of vim
|
||||
|
||||
`s:spoken_dict` is used to translate keys to their spoken form.
|
||||
|
||||
|
||||
Keys corresponding to a string use that spoken value normally and when a motion is expected, but use the key itself when a character is expected.
|
||||
Keys corresponding to a dict, like `i`, can have manual difinitions given to each possible commandset.
|
||||
|
||||
0 is normal (insert), 1 is motion (inside), 2 is it's usage as a single key ([till] i), and 3 is it's usage in an area selection (s -> [around] sentence)
|
||||
|
||||
Some punctuation items, like `-` are explicitly given pronunciations to prevent them from being picked as punctuation instead of an actual command word.
|
||||
|
||||
Not all commands will tokenize to a single token and this can interfere with interpretation. "yank" as an example, takes multiple tokens and correspondingly, will give more accurate detection when only the first "ya" is used. While it could be changed to something else that is a single token (copy), value was placed on maintaining vim mnemonics.
|
||||
|
||||
Commands that would normally move the editor into insert mode (insert, append, open, change) will begin unguided transcription.
|
||||
Unguided transcription will end when a speech segment ends in exit.
|
||||
Presence of punctuation can be designated by whether or not you add a pause between the previous speech segment and exit.
|
||||
Exiting only occurs if exit is the last word, so "Take the first exit on your right" would not cause transcription to end.
|
||||
|
||||
After a command is evaluated, the plugin will continue listening for the next command.
|
||||
|
||||
While in command mode, "Exit" will end listening.
|
||||
|
||||
A best effort approach is taken to keep track of audio that is recorded while a previous chunk is still processing and immediately interpret it afterwards, but the current voice detection still needs a fairly sizable gap to determine when a command has been spoken.
|
||||
|
||||
Log information is sent to a special `whisper_log` buffer and can be accessed with
|
||||
```vim
|
||||
:e whisper_log
|
||||
```
|
||||
|
||||
## Vim plugin configuration
|
||||
|
||||
`g:whisper_dir`
|
||||
A full path to the whisper.cpp repo. It can be expanded in the definition like so:
|
||||
```vim
|
||||
let g:whisper_dir = expand("~/whisper.cpp/")
|
||||
```
|
||||
(The WHISPER_CPP_HOME environment variable is also checked for users of the existing whisper.nvim script)
|
||||
|
||||
`g:whisper_lsp_path`
|
||||
Can be used to manually set the path to the language server.
|
||||
If not defined, it will be inferred from the above whisper_dir
|
||||
|
||||
`g:whisper_model_path`
|
||||
A full path to the model to load. If not defined, it will default to ggml-base.en.bin
|
||||
|
||||
`g:whisper_user_commands`
|
||||
A dictionary of spoken commands that correspond to either strings or funcrefs.
|
||||
This can be used to create connections with other user plugins, for example
|
||||
```vim
|
||||
let g:whisper_user_commands = {"gen": "llama#doLlamaGen"}
|
||||
```
|
||||
will trigger the llama.cpp plugin to begin generation when "gen" is spoken
|
||||
|
||||
## Language server methods
|
||||
|
||||
`registerCommandset`
|
||||
`params` is a list of strings that should be checked for with this commandset. The server prepends a space to these strings before tokenizing.
|
||||
Responds with
|
||||
`result.index` an integer index for the commandset registered, which should be included when initiating a guided transcription to select this commandset.
|
||||
Will return an error if any of the commands in the commandset have duplicate tokenizations
|
||||
|
||||
`guided`
|
||||
`params.commandset_index` An index returned by a corresponding commandset registration. If not set, the most recently registered commandset is used.
|
||||
`params.timestamp` A positive unsigned integer which designates a point in time which audio should begin processing from. If left blank, the start point of audio processing will be the moment the message is recieved. This should be left blank unless you have a timestamp from a previous response.
|
||||
Responds with
|
||||
`result.command_index` The numerical index (starting from 0) of the detected command in the selected commandset
|
||||
`result.command_text` A string containing the command as provided in the commandset
|
||||
`result.timestamp` A positive unsigned integer that designates the point in time which audio stopped being processed at. Pass this timestamp back in a subsequent message to mask the latency of transcription.
|
||||
|
||||
`unguided`
|
||||
`params.no_context` Sets the corresponding whisper `no_context` param. Defaults to true. Might provide more accurate results for consecutive unguided transcriptions if those after the first are set to false.
|
||||
`params.prompt` If provided, sets the initial prompt used during transcription.
|
||||
`params.timestamp` A positive unsigned integer which designates a point in time which audio should begin processing from. If left blank, the start point of audio processing will be the moment the message is recieved. This should be left blank unless you have a timestamp from a previous response.
|
||||
Responds with
|
||||
`result.transcription` A string containing the transcribed text. N.B. This will almost always start with a space due to how text is tokenized.
|
||||
`result.timestamp` A positive unsigned integer that designates the point in time which audio stopped being processed at. Pass this timestamp back in a subsequent message to mask the latency of transcription.
|
24596
examples/lsp/json.hpp
Normal file
24596
examples/lsp/json.hpp
Normal file
File diff suppressed because it is too large
Load Diff
458
examples/lsp/lsp.cpp
Normal file
458
examples/lsp/lsp.cpp
Normal file
@ -0,0 +1,458 @@
|
||||
#include "common.h"
|
||||
#include "common-sdl.h"
|
||||
#include "whisper.h"
|
||||
#include "json.hpp"
|
||||
|
||||
#include <iostream>
|
||||
#include <cassert>
|
||||
#include <cstdio>
|
||||
#include <string>
|
||||
#include <thread>
|
||||
#include <vector>
|
||||
#include <deque>
|
||||
#include <set>
|
||||
|
||||
using json = nlohmann::json;
|
||||
|
||||
// command-line parameters
|
||||
struct whisper_params {
|
||||
int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());
|
||||
int32_t prompt_ms = 5000;
|
||||
int32_t command_ms = 8000;
|
||||
int32_t capture_id = -1;
|
||||
int32_t max_tokens = 32;
|
||||
int32_t audio_ctx = 0;
|
||||
|
||||
float vad_thold = 0.6f;
|
||||
float freq_thold = 100.0f;
|
||||
|
||||
bool speed_up = false;
|
||||
bool translate = false;
|
||||
bool print_special = false;
|
||||
bool print_energy = false;
|
||||
|
||||
std::string language = "en";
|
||||
std::string model = "models/ggml-base.en.bin";
|
||||
};
|
||||
struct command {
|
||||
std::vector<whisper_token> tokens;
|
||||
std::string plaintext;
|
||||
};
|
||||
struct commandset {
|
||||
std::vector<struct command> commands;
|
||||
std::vector<whisper_token> prompt_tokens;
|
||||
// TODO: Store longest command?
|
||||
// Multi-token commands should have probabilities of subsequent logits
|
||||
// given that the prior logit is correct.
|
||||
// In this case, all commands must be iterated.
|
||||
// This however, is likely highly involved as different tokens
|
||||
// almost certainly have different spoken lengths
|
||||
// It would also have performance implications equivalent to a beam search
|
||||
};
|
||||
|
||||
void whisper_print_usage(int argc, char ** argv, const whisper_params & params);
|
||||
|
||||
bool whisper_params_parse(int argc, char ** argv, whisper_params & params) {
|
||||
for (int i = 1; i < argc; i++) {
|
||||
std::string arg = argv[i];
|
||||
|
||||
if (arg == "-h" || arg == "--help") {
|
||||
whisper_print_usage(argc, argv, params);
|
||||
exit(0);
|
||||
}
|
||||
else if (arg == "-t" || arg == "--threads") { params.n_threads = std::stoi(argv[++i]); }
|
||||
else if (arg == "-pms" || arg == "--prompt-ms") { params.prompt_ms = std::stoi(argv[++i]); }
|
||||
else if (arg == "-cms" || arg == "--command-ms") { params.command_ms = std::stoi(argv[++i]); }
|
||||
else if (arg == "-c" || arg == "--capture") { params.capture_id = std::stoi(argv[++i]); }
|
||||
else if (arg == "-mt" || arg == "--max-tokens") { params.max_tokens = std::stoi(argv[++i]); }
|
||||
else if (arg == "-ac" || arg == "--audio-ctx") { params.audio_ctx = std::stoi(argv[++i]); }
|
||||
else if (arg == "-vth" || arg == "--vad-thold") { params.vad_thold = std::stof(argv[++i]); }
|
||||
else if (arg == "-fth" || arg == "--freq-thold") { params.freq_thold = std::stof(argv[++i]); }
|
||||
else if (arg == "-su" || arg == "--speed-up") { params.speed_up = true; }
|
||||
else if (arg == "-tr" || arg == "--translate") { params.translate = true; }
|
||||
else if (arg == "-ps" || arg == "--print-special") { params.print_special = true; }
|
||||
else if (arg == "-pe" || arg == "--print-energy") { params.print_energy = true; }
|
||||
else if (arg == "-l" || arg == "--language") { params.language = argv[++i]; }
|
||||
else if (arg == "-m" || arg == "--model") { params.model = argv[++i]; }
|
||||
else {
|
||||
fprintf(stderr, "error: unknown argument: %s\n", arg.c_str());
|
||||
whisper_print_usage(argc, argv, params);
|
||||
exit(0);
|
||||
}
|
||||
}
|
||||
|
||||
return true;
|
||||
}
|
||||
|
||||
void whisper_print_usage(int /*argc*/, char ** argv, const whisper_params & params) {
|
||||
fprintf(stderr, "\n");
|
||||
fprintf(stderr, "usage: %s [options]\n", argv[0]);
|
||||
fprintf(stderr, "\n");
|
||||
fprintf(stderr, "options:\n");
|
||||
fprintf(stderr, " -h, --help [default] show this help message and exit\n");
|
||||
fprintf(stderr, " -t N, --threads N [%-7d] number of threads to use during computation\n", params.n_threads);
|
||||
fprintf(stderr, " -pms N, --prompt-ms N [%-7d] prompt duration in milliseconds\n", params.prompt_ms);
|
||||
fprintf(stderr, " -cms N, --command-ms N [%-7d] command duration in milliseconds\n", params.command_ms);
|
||||
fprintf(stderr, " -c ID, --capture ID [%-7d] capture device ID\n", params.capture_id);
|
||||
fprintf(stderr, " -mt N, --max-tokens N [%-7d] maximum number of tokens per audio chunk\n", params.max_tokens);
|
||||
fprintf(stderr, " -ac N, --audio-ctx N [%-7d] audio context size (0 - all)\n", params.audio_ctx);
|
||||
fprintf(stderr, " -vth N, --vad-thold N [%-7.2f] voice activity detection threshold\n", params.vad_thold);
|
||||
fprintf(stderr, " -fth N, --freq-thold N [%-7.2f] high-pass frequency cutoff\n", params.freq_thold);
|
||||
fprintf(stderr, " -su, --speed-up [%-7s] speed up audio by x2 (reduced accuracy)\n", params.speed_up ? "true" : "false");
|
||||
fprintf(stderr, " -tr, --translate [%-7s] translate from source language to english\n", params.translate ? "true" : "false");
|
||||
fprintf(stderr, " -ps, --print-special [%-7s] print special tokens\n", params.print_special ? "true" : "false");
|
||||
fprintf(stderr, " -pe, --print-energy [%-7s] print sound energy (for debugging)\n", params.print_energy ? "true" : "false");
|
||||
fprintf(stderr, " -l LANG, --language LANG [%-7s] spoken language\n", params.language.c_str());
|
||||
fprintf(stderr, " -m FNAME, --model FNAME [%-7s] model path\n", params.model.c_str());
|
||||
fprintf(stderr, "\n");
|
||||
}
|
||||
uint64_t wait_for_vad(audio_async & audio, json jparams, const whisper_params & params, uint64_t maxlength_ms, std::vector<float> & pcmf32) {
|
||||
using namespace std::chrono;
|
||||
uint64_t time_now = time_point_cast<milliseconds>(system_clock::now()).time_since_epoch().count();
|
||||
uint64_t start_time = time_now;
|
||||
if (jparams.contains("timestamp")) {
|
||||
start_time = jparams.at("timestamp");
|
||||
}
|
||||
if(time_now - start_time < 500) {
|
||||
//wait for a backlog of audio
|
||||
std::this_thread::sleep_for(milliseconds(500 - (time_now - start_time)));
|
||||
time_now = time_point_cast<milliseconds>(system_clock::now()).time_since_epoch().count();
|
||||
} else if (time_now - start_time > 1000) {
|
||||
audio.get(time_now-start_time, pcmf32);
|
||||
size_t max_offset = pcmf32.size() - WHISPER_SAMPLE_RATE;
|
||||
for(size_t offset=0;offset < max_offset;offset+=WHISPER_SAMPLE_RATE/10) {
|
||||
std::vector<float> audio_chunk(&pcmf32[offset], &pcmf32[offset+WHISPER_SAMPLE_RATE]);
|
||||
if(::vad_simple(audio_chunk, WHISPER_SAMPLE_RATE, 1000, params.vad_thold, params.freq_thold, params.print_energy)) {
|
||||
pcmf32.resize(offset+WHISPER_SAMPLE_RATE);
|
||||
if (offset*1000/WHISPER_SAMPLE_RATE+1000 > maxlength_ms) {
|
||||
//remove samples from the beginning
|
||||
pcmf32.erase(pcmf32.begin(),pcmf32.end()-(maxlength_ms*WHISPER_SAMPLE_RATE/1000));
|
||||
fprintf(stderr, "Shortened samples");
|
||||
}
|
||||
return start_time + offset*1000/WHISPER_SAMPLE_RATE+1000;
|
||||
}
|
||||
}
|
||||
}
|
||||
size_t window_duration = std::max((uint64_t)1000, time_now-start_time);
|
||||
audio.get(window_duration, pcmf32);
|
||||
while (!::vad_simple(pcmf32, WHISPER_SAMPLE_RATE, 1000, params.vad_thold, params.freq_thold, params.print_energy)) {
|
||||
std::this_thread::sleep_for(milliseconds(100));
|
||||
time_now = time_point_cast<milliseconds>(system_clock::now()).time_since_epoch().count();
|
||||
window_duration = std::max((uint64_t)1000,time_now-start_time);
|
||||
audio.get(window_duration, pcmf32);
|
||||
}
|
||||
if (time_now - start_time > maxlength_ms) {
|
||||
audio.get(maxlength_ms, pcmf32);
|
||||
} else {
|
||||
audio.get(time_now - start_time, pcmf32);
|
||||
}
|
||||
|
||||
return time_now;
|
||||
}
|
||||
|
||||
json unguided_transcription(struct whisper_context * ctx, audio_async &audio, json jparams, const whisper_params ¶ms) {
|
||||
std::vector<whisper_token> prompt_tokens;
|
||||
std::vector<float> pcmf32;
|
||||
uint64_t unprocessed_audio_timestamp = wait_for_vad(audio, jparams, params, 10000U, pcmf32);
|
||||
|
||||
whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
|
||||
if (jparams.contains("prompt")) {
|
||||
// unlikely to see much use. Under normal circumstances, no_context would be set to false
|
||||
std::string prompt = jparams.at("prompt");
|
||||
prompt_tokens.resize(1024);
|
||||
int n = whisper_tokenize(ctx, prompt.c_str(), prompt_tokens.data(), 1024);
|
||||
prompt_tokens.resize(n);
|
||||
|
||||
wparams.prompt_tokens = prompt_tokens.data();
|
||||
wparams.prompt_n_tokens = prompt_tokens.size();
|
||||
}
|
||||
wparams.print_progress = false;
|
||||
wparams.print_special = params.print_special;
|
||||
wparams.print_realtime = false;
|
||||
wparams.print_timestamps = false;
|
||||
wparams.translate = params.translate;
|
||||
wparams.no_context = jparams.value("no_context", true);
|
||||
wparams.single_segment = true;
|
||||
wparams.max_tokens = params.max_tokens;
|
||||
wparams.language = params.language.c_str();
|
||||
wparams.n_threads = params.n_threads;
|
||||
|
||||
wparams.audio_ctx = params.audio_ctx;
|
||||
wparams.speed_up = params.speed_up;
|
||||
wparams.suppress_non_speech_tokens = true;
|
||||
// run the transformer and a single decoding pass
|
||||
if (whisper_full(ctx, wparams, pcmf32.data(), pcmf32.size()) != 0) {
|
||||
fprintf(stderr, "%s: ERROR: whisper_full() failed\n", __func__);
|
||||
throw json{
|
||||
{"code", -32803},
|
||||
{"message", "ERROR: whisper_full() failed"}
|
||||
};
|
||||
}
|
||||
std::string result = whisper_full_get_segment_text(ctx,0);
|
||||
return json {
|
||||
{"transcription", result},
|
||||
{"timestamp", unprocessed_audio_timestamp}
|
||||
};
|
||||
}
|
||||
|
||||
// command-list mode
|
||||
// guide the transcription to match the most likely command from a provided list
|
||||
json guided_transcription(struct whisper_context * ctx, audio_async &audio, const whisper_params ¶ms, json jparams, std::vector<struct commandset> commandset_list) {
|
||||
struct commandset cs = commandset_list[jparams.value("commandset_index", commandset_list.size()-1)];
|
||||
std::vector<float> pcmf32;
|
||||
uint64_t unprocessed_audio_timestamp = wait_for_vad(audio, jparams, params, 2000U, pcmf32);
|
||||
|
||||
fprintf(stderr, "%s: Speech detected! Processing ...\n", __func__);
|
||||
whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
|
||||
|
||||
wparams.print_progress = false;
|
||||
wparams.print_special = params.print_special;
|
||||
wparams.print_realtime = false;
|
||||
wparams.print_timestamps = false;
|
||||
wparams.translate = params.translate;
|
||||
wparams.no_context = true;
|
||||
wparams.single_segment = true;
|
||||
wparams.max_tokens = 1;
|
||||
wparams.language = params.language.c_str();
|
||||
wparams.n_threads = params.n_threads;
|
||||
|
||||
wparams.audio_ctx = params.audio_ctx;
|
||||
wparams.speed_up = params.speed_up;
|
||||
|
||||
// TODO: Do some time testing. Does an overly long prompt slow down processing?
|
||||
// Set up command sets/precompute prompts
|
||||
wparams.prompt_tokens = cs.prompt_tokens.data();
|
||||
wparams.prompt_n_tokens = cs.prompt_tokens.size();
|
||||
// TODO: properly expose as option
|
||||
wparams.suppress_non_speech_tokens = true;
|
||||
|
||||
// run the transformer and a single decoding pass
|
||||
if (whisper_full(ctx, wparams, pcmf32.data(), pcmf32.size()) != 0) {
|
||||
fprintf(stderr, "%s: ERROR: whisper_full() failed\n", __func__);
|
||||
throw json{
|
||||
{"code", -32803},
|
||||
{"message", "ERROR: whisper_full() failed"}//TODO: format string (sprintf?)
|
||||
};
|
||||
}
|
||||
|
||||
// estimate command probability
|
||||
// NOTE: not optimal
|
||||
{
|
||||
const auto * logits = whisper_get_logits(ctx);
|
||||
|
||||
std::vector<float> probs(whisper_n_vocab(ctx), 0.0f);
|
||||
|
||||
// compute probs from logits via softmax
|
||||
{
|
||||
float max = -1e9;
|
||||
for (int i = 0; i < (int) probs.size(); ++i) {
|
||||
max = std::max(max, logits[i]);
|
||||
}
|
||||
|
||||
float sum = 0.0f;
|
||||
for (int i = 0; i < (int) probs.size(); ++i) {
|
||||
probs[i] = expf(logits[i] - max);
|
||||
sum += probs[i];
|
||||
}
|
||||
|
||||
for (int i = 0; i < (int) probs.size(); ++i) {
|
||||
probs[i] /= sum;
|
||||
}
|
||||
}
|
||||
|
||||
std::vector<std::pair<float, int>> probs_id;
|
||||
|
||||
// In my testing, the most verbose token is always the desired.
|
||||
// TODO: Trim commandset struct once efficacy has been verified
|
||||
for (int i = 0; i < (int) cs.commands.size(); ++i) {
|
||||
probs_id.emplace_back(probs[cs.commands[i].tokens[0]], i);
|
||||
}
|
||||
|
||||
// sort descending
|
||||
{
|
||||
using pair_type = decltype(probs_id)::value_type;
|
||||
std::sort(probs_id.begin(), probs_id.end(), [](const pair_type & a, const pair_type & b) {
|
||||
return a.first > b.first;
|
||||
});
|
||||
}
|
||||
int id = probs_id[0].second;
|
||||
return json{
|
||||
{"command_index", id},
|
||||
{"command_text", cs.commands[id].plaintext},
|
||||
{"timestamp", unprocessed_audio_timestamp},
|
||||
};
|
||||
}
|
||||
}
|
||||
|
||||
json register_commandset(struct whisper_context * ctx, json jparams, std::vector<struct commandset> &commandset_list) {
|
||||
// TODO: check for token collision
|
||||
struct commandset cs;
|
||||
|
||||
std::string k_prompt = " select one from the available words: ";
|
||||
std::set<whisper_token> token_set;
|
||||
whisper_token tokens[32];
|
||||
for (std::string s : jparams) {
|
||||
std::vector<whisper_token> token_vec;
|
||||
// The existing command implementation uses a nested for loop to tokenize single characters
|
||||
// I fail to see the purpose of this when ' a' has a wholly different pronunciation than the start of ' apple'
|
||||
const int n = whisper_tokenize(ctx, (" " + s).c_str(), tokens, 32);
|
||||
if (n < 0) {
|
||||
fprintf(stderr, "%s: error: failed to tokenize command '%s'\n", __func__, s.c_str());
|
||||
return 3;
|
||||
}
|
||||
token_vec.push_back(tokens[0]);
|
||||
if (!token_set.insert(tokens[0]).second) {
|
||||
fprintf(stderr, "%s: warning: %s is a duplicate of an existing token\n", __func__, s.c_str());
|
||||
throw json{
|
||||
{"code",-31000},
|
||||
{"message", "Duplicate token in token set: " + s}
|
||||
};
|
||||
}
|
||||
if (n > 1) {// empty string if n=0? Should never occur
|
||||
fprintf(stderr, "%s: error: command is more than a single token: %s\n", __func__, s.c_str());
|
||||
}
|
||||
struct command command = {token_vec, s};
|
||||
cs.commands.push_back(command);
|
||||
k_prompt += s;
|
||||
}
|
||||
k_prompt = k_prompt.substr(0,k_prompt.length()-2) + ". Selected word:";
|
||||
cs.prompt_tokens.resize(1024);
|
||||
int n = whisper_tokenize(ctx, k_prompt.c_str(), cs.prompt_tokens.data(), 1024);
|
||||
cs.prompt_tokens.resize(n);
|
||||
// prepare response
|
||||
int index = commandset_list.size();
|
||||
commandset_list.push_back(cs);
|
||||
return json{{"index",index}};
|
||||
}
|
||||
json seek(struct whisper_context * ctx, audio_async &audio, json params) {
|
||||
// whisper_state has the pertinent offsets, but there also seem to be a large
|
||||
// number of scratch buffers that would prevent rewinding context in a manner similar to llama
|
||||
// I'll give this a another pass once everything else is implemented,
|
||||
// but for now, it's unsupported
|
||||
throw json{
|
||||
{"code", -32601},
|
||||
{"message", "Seeking is not yet supported."}
|
||||
};
|
||||
}
|
||||
json parse_job(const json &body, struct whisper_context * ctx, audio_async &audio, const whisper_params ¶ms, std::vector<struct commandset> &commandset_list) {
|
||||
// See: https://www.jsonrpc.org/specification
|
||||
json id = body.at("id");
|
||||
try {
|
||||
std::string version = body.at("jsonrpc");
|
||||
if (version != "2.0") {
|
||||
// unsupported version
|
||||
throw json{
|
||||
{"code", -3260},
|
||||
{"message", "invalid jsonrpc version"}
|
||||
};
|
||||
}
|
||||
std::string method = body.at("method");
|
||||
json jparams = json{{"dummy", "dummy"}};
|
||||
if (body.contains("params"))
|
||||
jparams = body.at("params");
|
||||
json res;
|
||||
// TODO: be consistent about argument order
|
||||
fprintf(stderr, "Dispatching a job\n");
|
||||
if (method == "unguided") { res = unguided_transcription(ctx, audio, jparams, params); }
|
||||
else if (method == "guided") { res = guided_transcription(ctx, audio, params, jparams, commandset_list); }
|
||||
else if (method == "seek") { res = seek(ctx, audio, jparams); }
|
||||
else if (method == "registerCommandset") { res = register_commandset(ctx, jparams, commandset_list); }
|
||||
else if (method == "echo") { res = jparams; }
|
||||
|
||||
|
||||
return json{
|
||||
{"jsonrpc", "2.0"},
|
||||
{"result", res},
|
||||
{"id", id}
|
||||
};
|
||||
} catch(json ex) {
|
||||
return json {
|
||||
{"jsonrpc", "2.0"},
|
||||
{"error", ex},
|
||||
{"id", id}
|
||||
};
|
||||
}
|
||||
}
|
||||
|
||||
void process_loop(struct whisper_context * ctx, audio_async &audio, const whisper_params ¶ms) {
|
||||
std::deque<json> jobqueue;
|
||||
std::vector<struct commandset> commandset_list;
|
||||
while (true) {
|
||||
// For eventual cancellation support, shouldn't block if job exists
|
||||
if (std::cin.rdbuf()->in_avail() > 22 || jobqueue.size() == 0) {
|
||||
int content_length;
|
||||
if (scanf("Content-Length: %d", &content_length) != 1) {
|
||||
fprintf(stderr, "Could not read input: %d", std::cin.peek());
|
||||
return;
|
||||
}
|
||||
// scanf leaves the new lines intact
|
||||
std::cin.ignore(2);
|
||||
if (std::cin.peek() != 13) {
|
||||
// Content-Type. jsonrpc necessitates utf8.
|
||||
std::cin.ignore(200,10);
|
||||
}
|
||||
std::cin.ignore(2);
|
||||
// A message is being sent and blocking is acceptable
|
||||
std::string content(content_length,'\0');
|
||||
std::cin.read(&content[0], content_length);
|
||||
json job = json::parse(content);
|
||||
// TODO: Some messages(cancellation) should skip queue here
|
||||
if (job.is_array()) {
|
||||
// response must also be batched. Will implement later
|
||||
// for (subjob : job.begin())
|
||||
// TODO: At the very least respond with an unsupported error.
|
||||
} else {
|
||||
jobqueue.push_back(job);
|
||||
}
|
||||
}
|
||||
assert(jobqueue.size() > 0);
|
||||
json job = jobqueue.front();
|
||||
json resp = parse_job(job, ctx, audio, params, commandset_list);
|
||||
if (resp != "unfinished") {
|
||||
jobqueue.pop_front();
|
||||
// send response
|
||||
std::string data = resp.dump(-1, ' ', false, json::error_handler_t::replace);
|
||||
fprintf(stdout, "Content-Length: %d\r\n\r\n%s\n", data.length()+1, data.c_str());
|
||||
std::cout.flush();
|
||||
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
int main(int argc, char ** argv) {
|
||||
whisper_params params;
|
||||
if (whisper_params_parse(argc, argv, params) == false) {
|
||||
return 1;
|
||||
}
|
||||
|
||||
if (whisper_lang_id(params.language.c_str()) == -1) {
|
||||
fprintf(stderr, "error: unknown language '%s'\n", params.language.c_str());
|
||||
whisper_print_usage(argc, argv, params);
|
||||
exit(0);
|
||||
}
|
||||
|
||||
// whisper init
|
||||
struct whisper_context * ctx = whisper_init_from_file(params.model.c_str());
|
||||
// init audio
|
||||
|
||||
audio_async audio(30*1000);
|
||||
if (!audio.init(params.capture_id, WHISPER_SAMPLE_RATE)) {
|
||||
fprintf(stderr, "%s: audio.init() failed!\n", __func__);
|
||||
return 1;
|
||||
}
|
||||
|
||||
audio.resume();
|
||||
// TODO: Investigate why this is required. An extra second of startup latency is not great
|
||||
// wait for 1 second to avoid any buffered noise
|
||||
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
|
||||
audio.clear();
|
||||
// TODO: consider some sort of indicator to designate loading has finished?
|
||||
// Potentially better for the client to just start with a non-blocking message (register commands)
|
||||
process_loop(ctx, audio, params);
|
||||
|
||||
audio.pause();
|
||||
whisper_print_timings(ctx);
|
||||
whisper_free(ctx);
|
||||
|
||||
return 0;
|
||||
}
|
362
examples/lsp/whisper.vim
Normal file
362
examples/lsp/whisper.vim
Normal file
@ -0,0 +1,362 @@
|
||||
if !exists("g:whisper_dir")
|
||||
let g:whisper_dir = expand($WHISPER_CPP_HOME)
|
||||
if g:whisper_dir == ""
|
||||
echoerr "Please provide a path to the whisper.cpp repo in either the $WHISPER_CPP_HOME environment variable, or g:whisper_dir"
|
||||
endif
|
||||
endif
|
||||
if !exists("g:whisper_lsp_path")
|
||||
let g:whisper_lsp_path = g:whisper_dir .. "lsp"
|
||||
if !filereadable(g:whisper_lsp_path)
|
||||
echoerr "Was not able to locate a lsp executable at: " .. g:whisper_lsp_path
|
||||
throw "Executable not found"
|
||||
endif
|
||||
endif
|
||||
if !exists("g:whisper_model_path")
|
||||
" TODO: allow custom paths relative to the repo dir
|
||||
let g:whisper_model_path = g:whisper_dir .. "models/ggml-base.en.bin"
|
||||
if !filereadable(g:whisper_model_path)
|
||||
echoerr "Could not find model at: " .. g:whisper_model_path
|
||||
throw "Model not found"
|
||||
endif
|
||||
endif
|
||||
let s:output_buffer = bufnr("whisper_log", v:true)
|
||||
call setbufvar(s:output_buffer,"&buftype","nofile")
|
||||
let s:lsp_command = [g:whisper_lsp_path,"-m",g:whisper_model_path]
|
||||
" For faster execution. TODO: server load multiple models/run multiple servers?
|
||||
" let s:lsp_command = [g:whisper_lsp_path, "-m", g:whisper_dir .. "models/ggml-tiny.en.bin", "-ac", "128"]
|
||||
|
||||
" requestCommands([params_dict])
|
||||
func whisper#requestCommands(...)
|
||||
let l:req = {"method": "guided", "params": {"commandset_index": 0}}
|
||||
if a:0 > 0
|
||||
call extend(l:req.params, a:1)
|
||||
endif
|
||||
let resp = ch_sendexpr(g:lsp_job, l:req, {"callback": function("s:commandCallback", [l:req.params, 0])})
|
||||
endfunction
|
||||
|
||||
" doTranscription([params_dict])
|
||||
func whisper#doTranscription(...)
|
||||
let l:req = {"method": "unguided", "params": {}}
|
||||
if a:0 > 0
|
||||
call extend(l:req.params, a:1)
|
||||
endif
|
||||
let resp = ch_sendexpr(g:lsp_job, l:req, {"callback": function("s:transcriptionCallback", [function("s:insertText"),function("s:endTranscription")])})
|
||||
endfunction
|
||||
|
||||
" For testing
|
||||
func whisper#uppertest(cha)
|
||||
echo tr(a:cha, s:c_lowerkeys, s:c_upperkeys)
|
||||
endfunction
|
||||
|
||||
|
||||
" (upper, exit, count, motion, command, insert/append, save run) "base"
|
||||
" (upper, exit, count, motion, command, inside/around) "motion/visual"
|
||||
" (upper, exit, count, motion, line, inside/around) "command already entered"
|
||||
" (upper, exit, key, ) "from/till"
|
||||
|
||||
" upper and lower keys is used to translate between cases with tr
|
||||
" Must be sunchronized
|
||||
let s:c_lowerkeys = "1234567890-=qwertyuiop[]\\asdfghjkl;'zxcvbnm,./\""
|
||||
let s:c_upperkeys = "!@#$%^&*()_+QWERTYUIOP{}|ASDFGHJKL:\"ZXCVBNM<>?'"
|
||||
let s:c_count = split("1234567890\"",'\zs')
|
||||
let s:c_command = split("ryuogpdxcv.iam", '\zs')
|
||||
let s:c_motion = split("wetf'hjklnb$^)",'\zs')
|
||||
" object words: Word, Sentence, Paragraph, [, (, <, Tag, {. ", '
|
||||
let s:c_area = split("wsp])>t}\"'",'\zs')
|
||||
"Special commands.
|
||||
let s:c_special_always = ["exit", "upper"]
|
||||
let s:c_special_normal = ["save", "run", "space"]
|
||||
|
||||
" If not in dict, key is spoken word,
|
||||
" If key resolves to string, value is used for normal/motion, but key for chars
|
||||
" If key resolves to dict, {0: "normal",1: "motion",2:"single char",3: "area"}
|
||||
" Missing entries fall back as follows {0: "required", 1: 0, 2: "key", 3: 0}
|
||||
let s:spoken_dict = {"w": "word", "e": "end", "r": "replace", "t": {0: "till", 3: "tag"}, "y": "yank", "u": "undo", "i": {0: "insert", 1: "inside"}, "o": "open", "p": {0: "paste", 3: "paragraph"}, "a": {0: "append", 1: "around"}, "s": {0: "substitute", 3: "sentence"}, "d": "delete", "f": "from", "g": "go", "h": "left", "j": "down", "k": "up", "l": "right", "c": "change", "v": "visual", "b": "back", "n": "next", "m": "mark", ".": {0: "repeat", 2: "period"}, "]": {0: "bracket", 2: "bracket"}, "'": {0: "jump", 2: "apostrophe", 3: "apostrophe"}, '"': {0: 'register', 2: "quotation", 3: "quotation"}, "-": {0: "minus", 2: "minus"}, "$": {0: "dollar", 2: "dollar"}, "^": {0: "carrot", 2: "carrot"}, ")": {0: "sentence", 2: "parenthesis", 3: "parenthesis"}, "}": {0: "paragraph", 2: "brace", 3: "brace"}, ">": {0: "indent", 2: "angle", 3: "angle"}}
|
||||
|
||||
" Give this another pass. This seems overly hacky even if it's functional
|
||||
let s:sub_tran_msg = ""
|
||||
func s:subTranProg(msg)
|
||||
if s:sub_tran_msg != ""
|
||||
let s:sub_tran_msg = s:sub_tran_msg .. a:msg
|
||||
if mode() !=? 'v'
|
||||
exe "normal" "u" .. s:sub_tran_msg
|
||||
endif
|
||||
else
|
||||
if s:command_backlog == ""
|
||||
" this should not occur
|
||||
call s:logCallback(0, "Warning: Encountered sub transcription without prior command")
|
||||
let s:command_backlog = "a"
|
||||
endif
|
||||
if a:msg[0] == ' '
|
||||
let s:sub_tran_msg = s:command_backlog .. a:msg[1:-1]
|
||||
else
|
||||
let s:sub_tran_msg = s:command_backlog .. a:msg
|
||||
endif
|
||||
if mode() !=? 'v'
|
||||
exe "normal" s:sub_tran_msg
|
||||
endif
|
||||
endif
|
||||
call appendbufline(s:output_buffer, "$", s:sub_tran_msg .. ":" .. string(a:msg ))
|
||||
endfunction
|
||||
|
||||
func s:subTranFinish(params, timestamp)
|
||||
let s:repeat_command = s:sub_tran_msg
|
||||
" Visual selection is lot if used with streaming, so streaming of partial
|
||||
" transcriptions is disabled in visual mode
|
||||
if mode() ==? 'v'
|
||||
exe "normal" s:sub_tran_msg
|
||||
endif
|
||||
let s:sub_tran_msg = ""
|
||||
let s:command_backlog = ""
|
||||
exe "normal a\<C-G>u"
|
||||
let l:params = a:params
|
||||
let l:params.timestamp = a:timestamp
|
||||
if exists("l:params.commandset_index")
|
||||
unlet l:params.commandset_index
|
||||
endif
|
||||
call whisper#requestCommands(a:params)
|
||||
endfunction
|
||||
|
||||
func s:logCallback(channel, msg)
|
||||
call appendbufline(s:output_buffer,"$",a:msg)
|
||||
endfunction
|
||||
|
||||
|
||||
func s:transcriptionCallback(progressCallback, finishedCallback, channel, msg)
|
||||
let l:tr = a:msg.result.transcription
|
||||
|
||||
let l:ex_ind = match(tolower(l:tr),"exit", len(l:tr)-6)
|
||||
" The worst case I've observed so far is " Exit.", which is 6 characters
|
||||
if l:ex_ind != -1
|
||||
call a:progressCallback(strpart(l:tr,0,l:ex_ind-1))
|
||||
call a:finishedCallback(a:msg.result.timestamp)
|
||||
else
|
||||
call a:progressCallback(l:tr)
|
||||
let req = {"method": "unguided", "params": {"timestamp": a:msg.result.timestamp, "no_context": v:true}}
|
||||
let resp = ch_sendexpr(g:lsp_job, req, {"callback": function("s:transcriptionCallback", [a:progressCallback, a:finishedCallback])})
|
||||
endif
|
||||
endfunc
|
||||
func s:insertText(msg)
|
||||
exe "normal a" .. a:msg
|
||||
endfunction
|
||||
func s:endTranscription(timestamp)
|
||||
call appendbufline(s:output_buffer, "$", "Ending unguided transcription")
|
||||
endfunction
|
||||
|
||||
|
||||
|
||||
" If a command does not include a whole actionable step, attempting to execute
|
||||
" it discards the remainder of things. There is likely a simpler solution,
|
||||
" but it can be made functional now by storing a backbuffer until actionable
|
||||
let s:command_backlog = ""
|
||||
let s:repeat_command = ""
|
||||
let s:preceeding_upper = v:false
|
||||
func s:commandCallback(params, commandset_index, channel, msg)
|
||||
let l:command_index = a:msg.result.command_index
|
||||
let l:do_execute = v:false
|
||||
let l:next_mode = a:commandset_index
|
||||
let l:command = s:commandset_list[a:commandset_index][l:command_index]
|
||||
call s:logCallback(0, string(a:msg) .. " " .. a:commandset_index .. " " .. l:command)
|
||||
if l:command_index == 0
|
||||
"exit
|
||||
"if s:command_backlog == ""
|
||||
call s:logCallback(0,"Stopping command mode")
|
||||
echo "No longer listening"
|
||||
let s:command_backlog = ""
|
||||
return
|
||||
"else
|
||||
" Legacy code to clear an existing buffer with exit.
|
||||
" Was found to be rarely desired and is better introduced as a
|
||||
" standalone command (clear?)
|
||||
" call s:logCallback(0,"Clearing command_backlog" .. s:command_backlog)
|
||||
" let s:command_backlog = ""
|
||||
" let s:preceeding_upper = v:false
|
||||
" endif
|
||||
elseif l:command_index == 1
|
||||
" upper
|
||||
let s:preceeding_upper = !s:preceeding_upper
|
||||
elseif l:command == "save"
|
||||
" save and run can only happen in commandset 0,
|
||||
exe "w"
|
||||
elseif l:command == "run"
|
||||
exe "make run"
|
||||
elseif l:command == "space"
|
||||
exe "normal i \<ESC>l"
|
||||
elseif has_key(s:c_user, l:command)
|
||||
let Userfunc = s:c_user[l:command]
|
||||
if type(Userfunc) == v:t_string
|
||||
let Userfunc = function(Userfunc)
|
||||
endif
|
||||
call Userfunc()
|
||||
else
|
||||
if s:preceeding_upper
|
||||
" Upper should keep commandset
|
||||
let s:preceeding_upper = v:false
|
||||
let l:visual_command = tr(l:command, s:c_lowerkeys, s:c_upperkeys)
|
||||
else
|
||||
let l:visual_command = l:command
|
||||
endif
|
||||
echo s:command_backlog .. " - " .. l:visual_command
|
||||
let s:command_backlog = s:command_backlog .. l:visual_command
|
||||
if a:commandset_index == 2 || a:commandset_index == 3
|
||||
" single key, either completes motion, replace, or register
|
||||
" Should move to execute unless part of a register
|
||||
" Change will be caught at execute
|
||||
if s:command_backlog[-2:-2] !=# '"'
|
||||
call s:logCallback(0,"not register")
|
||||
let l:do_execute = v:true
|
||||
end
|
||||
let l:next_mode = 0
|
||||
" commandset index only matters for a/i
|
||||
elseif (l:command == "a" || l:command == "i") && a:commandset_index == 1
|
||||
" inside/around. Is commandset 3
|
||||
let l:next_mode = 3
|
||||
elseif l:command ==# '"'
|
||||
let l:next_mode = 2
|
||||
elseif index(s:c_count, l:command) != -1
|
||||
let l:next_mode = a:commandset_index
|
||||
elseif index(s:c_motion, l:command) != -1
|
||||
if l:command == 't' || l:command == 'f' || l:command == "'"
|
||||
" prompt single key
|
||||
let l:next_mode = 2
|
||||
else
|
||||
let l:do_execute = v:true
|
||||
let l:next_mode = 0
|
||||
endif
|
||||
elseif index(s:c_command, l:command) != -1
|
||||
if index(["y","g","d","c"], s:command_backlog[-1:-1]) != -1 && s:command_backlog[-1:-1] != s:command_backlog[-2:-2] && mode() !=? 'v'
|
||||
" need motion or repeated command
|
||||
" Potential for bad state here if disparaging command keys are
|
||||
" entered (i.e. yd), but vim can handle checks for this at exe
|
||||
" And checking for cases like y123d would complicate things
|
||||
let l:next_mode = 1
|
||||
elseif index(["i","a","c", "o", "s"], l:command) != -1 || s:command_backlog[-1:-1] ==# 'R'
|
||||
"'Insert' mode, do general transcription
|
||||
let l:req = {"method": "unguided", "params": a:params}
|
||||
let l:req.params.timestamp = a:msg.result.timestamp
|
||||
let l:req.params.no_context = v:true
|
||||
let resp = ch_sendexpr(g:lsp_job, req, {"callback": function("s:transcriptionCallback", [function("s:subTranProg"), function("s:subTranFinish", [a:params])])})
|
||||
return
|
||||
elseif l:command == 'r' || l:command == 'm'
|
||||
let l:next_mode = 2
|
||||
elseif l:command == '.'
|
||||
let l:next_mode = 0
|
||||
let l:do_execute = v:true
|
||||
let s:command_backlog = s:command_backlog[0:-2] .. s:repeat_command
|
||||
else
|
||||
if l:command ==? 'v'
|
||||
let l:next_mode = 1
|
||||
else
|
||||
let l:next_mode = 0
|
||||
endif
|
||||
let l:do_execute = v:true
|
||||
endif
|
||||
else
|
||||
throw "Invalid command state: " .. l:command .. " " .. a:commandset_index .. " " .. s:command_backlog
|
||||
endif
|
||||
endif
|
||||
if l:do_execute
|
||||
if mode() ==?'v' && l:next_mode == 0
|
||||
let l:next_mode = 1
|
||||
elseif match(s:command_backlog, 'c') != -1
|
||||
let l:req = {"method": "unguided", "params": a:params}
|
||||
let l:req.params.timestamp = a:msg.result.timestamp
|
||||
let l:req.params.no_context = v:true
|
||||
let resp = ch_sendexpr(g:lsp_job, req, {"callback": function("s:transcriptionCallback", [function("s:subTranProg"), function("s:subTranFinish", [a:params])])})
|
||||
return
|
||||
endif
|
||||
exe "normal" s:command_backlog
|
||||
if index(s:c_motion + ["u"],l:command) == -1
|
||||
exe "normal a\<C-G>u"
|
||||
let s:repeat_command = s:command_backlog
|
||||
call s:logCallback(0, s:command_backlog)
|
||||
endif
|
||||
let s:command_backlog = ""
|
||||
endif
|
||||
let l:req = {"method": "guided", "params": a:params}
|
||||
let l:req.params.timestamp = a:msg.result.timestamp
|
||||
let l:req.params.commandset_index = l:next_mode
|
||||
let resp = ch_sendexpr(g:lsp_job, l:req, {"callback": function("s:commandCallback",[a:params, l:next_mode])})
|
||||
endfunction
|
||||
|
||||
func s:loadedCallback(channel, msg)
|
||||
echo "Loading complete"
|
||||
call s:logCallback(a:channel, a:msg)
|
||||
endfunction
|
||||
|
||||
func s:registerCommandset(commandlist, is_final)
|
||||
let req = {"method": "registerCommandset"}
|
||||
let req.params = a:commandlist
|
||||
call s:logCallback(0, join(a:commandlist))
|
||||
call add(g:whisper_commandlist_spoken, a:commandlist)
|
||||
if a:is_final
|
||||
let resp = ch_sendexpr(g:lsp_job, req, {"callback": "s:loadedCallback"})
|
||||
else
|
||||
let resp = ch_sendexpr(g:lsp_job, req, {"callback": "s:logCallback"})
|
||||
endif
|
||||
endfunction
|
||||
|
||||
func s:registerAllCommands()
|
||||
let l:normal = s:c_special_always + s:c_special_normal + s:c_count + s:c_command + s:c_motion + keys(s:c_user)
|
||||
let l:visual = s:c_special_always + s:c_count + s:c_command + s:c_motion
|
||||
" Currently the same as visual.
|
||||
" let l:post_command = s:c_special_always + s:c_count + s:c_command + s:c_motion
|
||||
let l:single_key = s:c_special_always + split(s:c_lowerkeys, '\zs')
|
||||
let l:area = s:c_special_always + s:c_area
|
||||
|
||||
" Used only for compatibility with the testing script
|
||||
let g:whisper_commandlist_spoken = []
|
||||
|
||||
let s:commandset_list = [l:normal, l:visual, l:single_key, l:area]
|
||||
call s:registerCommandset(s:commandsetToSpoken(l:normal, 0), v:false)
|
||||
call s:registerCommandset(s:commandsetToSpoken(l:visual, 1), v:false)
|
||||
call s:registerCommandset(s:commandsetToSpoken(l:single_key, 2), v:false)
|
||||
call s:registerCommandset(s:commandsetToSpoken(l:area, 3), v:true)
|
||||
endfunction
|
||||
|
||||
func s:commandsetToSpoken(commandset, spoken_index)
|
||||
let l:spoken_list = []
|
||||
for l:command in a:commandset
|
||||
if has_key(s:spoken_dict, l:command)
|
||||
let l:spoken_value = s:spoken_dict[l:command]
|
||||
if type(l:spoken_value) == v:t_dict
|
||||
if has_key(l:spoken_value, a:spoken_index)
|
||||
let l:spoken_value = l:spoken_value[a:spoken_index]
|
||||
else
|
||||
if a:spoken_index == 2
|
||||
let l:spoken_value = l:command
|
||||
else
|
||||
let l:spoken_value = l:spoken_value[0]
|
||||
endif
|
||||
endif
|
||||
else
|
||||
if a:spoken_index == 2
|
||||
let l:spoken_value = l:command
|
||||
endif
|
||||
endif
|
||||
else
|
||||
let l:spoken_value = l:command
|
||||
endif
|
||||
call add(l:spoken_list, l:spoken_value)
|
||||
endfor
|
||||
return l:spoken_list
|
||||
endfunction
|
||||
|
||||
" TODO: Check lifetime. If the script is resourced, is the existing
|
||||
" s:lsp_job dropped and therefore killed?
|
||||
" This seems to not be the case and I've had to deal with zombie processes
|
||||
" that survive exiting vim, even though said behavior conflicts with my
|
||||
" understanding of the provided documentation
|
||||
let s:lsp_opts = {"in_mode": "lsp", "out_mode": "lsp", "err_mode": "nl", "err_io": "buffer", "err_buf": s:output_buffer}
|
||||
if !exists("g:lsp_job")
|
||||
if exists("g:whisper_user_commands")
|
||||
let s:c_user = g:whisper_user_commands
|
||||
else
|
||||
let s:c_user = {}
|
||||
endif
|
||||
let g:lsp_job = job_start(s:lsp_command, s:lsp_opts)
|
||||
if job_status(g:lsp_job) == "fail"
|
||||
echoerr "Failed to start whisper job"
|
||||
endif
|
||||
call s:registerAllCommands()
|
||||
endif
|
@ -1,117 +0,0 @@
|
||||
" The current whisper ecosystem shows mighty powerful potential, but seems to
|
||||
" lack the required structure to make a speech to text plugin frictionless.
|
||||
" The most direct path forward will be to have a standalone library interfaced
|
||||
" with vim's libcall
|
||||
" Libcall only allows for a single argument(a string or number) and a single
|
||||
" output (always a string).
|
||||
" This... honestly fits well for common interactions as follows
|
||||
" init(modelname) -> (null string for success or error message)
|
||||
" unload(ignored) -> (null string for success or error message)
|
||||
" likely never needed
|
||||
" processCommand(newline separated commands string) -> commands
|
||||
" Should have support for consecutive commands
|
||||
" stream(maybe sentinel?) -> processed text.
|
||||
"
|
||||
" Support for streaming responses is desired, but care is needed to support
|
||||
" backtracking when more refined output is available.
|
||||
" Perhaps the greatest element of difficulty, speech input should be buffered
|
||||
" and it should be possible to 'rewind' input to mask latency and pivot off
|
||||
" modal changes. (If a command sends the editor to insert mode, stt should no
|
||||
" longer be limited by the command syntax)
|
||||
"
|
||||
" For now though, a simple proof of concept shall suffice.
|
||||
if !exists("g:whisper_dir")
|
||||
let g:whisper_dir = expand($WHISPER_CPP_HOME)
|
||||
if g:whisper_dir == ""
|
||||
echoerr "Please provide a path to the whisper.cpp repo in either the $WHISPER_CPP_HOME environment variable, or g:whisper_dir"
|
||||
endif
|
||||
endif
|
||||
if !exists("g:whisper_stream_path")
|
||||
if executable("stream")
|
||||
" A version of stream already exists in the path and should be used
|
||||
let g:whisper_stream_path = "stream"
|
||||
else
|
||||
let g:whisper_stream_path = g:whisper_dir .. "stream"
|
||||
if !filereadable(g:whisper_stream_path)
|
||||
echoerr "Was not able to locate a stream executable at: " .. g:whisper_stream_path
|
||||
throw "Executable not found"
|
||||
endif
|
||||
endif
|
||||
endif
|
||||
if !exists("g:whisper_model_path")
|
||||
" TODO: allow paths relative the repo dir
|
||||
let g:whisper_model_path = g:whisper_dir .. "models/ggml-base.en.bin"
|
||||
if !filereadable(g:whisper_model_path)
|
||||
echoerr "Could not find model at: " .. g:whisper_model_path
|
||||
throw "Model not found"
|
||||
endif
|
||||
endif
|
||||
let s:streaming_command = [g:whisper_stream_path,"-m",g:whisper_model_path,"-t","8","--step","0","--length","5000","-vth","0.6"]
|
||||
|
||||
let s:listening = v:false
|
||||
let s:cursor_pos = getpos(".")
|
||||
let s:cursor_pos[0] = bufnr("%")
|
||||
let s:loaded = v:false
|
||||
func s:callbackHandler(channel, msg)
|
||||
" Large risk of breaking if msg isn't line buffered
|
||||
" TODO: investigate sound_playfile as an indicator that listening has started?
|
||||
if a:msg == "[Start speaking]"
|
||||
let s:loaded = v:true
|
||||
if s:listening
|
||||
echo "Loading complete. Now listening"
|
||||
else
|
||||
echo "Loading complete. Listening has not been started"
|
||||
endif
|
||||
endif
|
||||
if s:listening
|
||||
let l:msg_lines = split(a:msg,"\n")
|
||||
let l:new_text = ""
|
||||
for l:line in l:msg_lines
|
||||
" This is sloppy, but will suffice until library is written
|
||||
if l:line[0] == '['
|
||||
let l:new_text = l:new_text .. l:line[28:-1] .. ' '
|
||||
endif
|
||||
endfor
|
||||
let l:buffer_line = getbufoneline(s:cursor_pos[0],s:cursor_pos[1])
|
||||
if len(l:buffer_line) == 0
|
||||
" As a special case, an empty line is instead set to the text
|
||||
let l:new_line = l:new_text
|
||||
let s:cursor_pos[2] = len(l:new_text)
|
||||
else
|
||||
" Append text after the cursor
|
||||
let l:new_line = strpart(l:buffer_line,0,s:cursor_pos[2]) .. l:new_text
|
||||
let l:new_line = l:new_line .. strpart(l:buffer_line,s:cursor_pos[2])
|
||||
let s:cursor_pos[2] = s:cursor_pos[2]+len(l:new_text)
|
||||
endif
|
||||
call setbufline(s:cursor_pos[0],s:cursor_pos[1],l:new_line)
|
||||
endif
|
||||
endfunction
|
||||
|
||||
function! whisper#startListening()
|
||||
let s:cursor_pos = getpos(".")
|
||||
let s:cursor_pos[0] = bufnr("%")
|
||||
let s:listening = v:true
|
||||
endfunction
|
||||
function! whisper#stopListening()
|
||||
let s:listening = v:false
|
||||
endfunction
|
||||
function! whisper#toggleListening()
|
||||
let s:cursor_pos = getpos(".")
|
||||
let s:cursor_pos[0] = bufnr("%")
|
||||
let s:listening = !s:listening
|
||||
if s:loaded
|
||||
if s:listening
|
||||
echo "Now listening"
|
||||
else
|
||||
echo "No longer listening"
|
||||
endif
|
||||
endif
|
||||
endfunction
|
||||
|
||||
" Note this includes stderr at present. It's still filtered and helps debugging
|
||||
let s:whisper_job = job_start(s:streaming_command, {"callback": "s:callbackHandler"})
|
||||
" TODO: Check lifetime. If the script is resourced, is the existing
|
||||
" s:whisper_job dropped and therefore killed?
|
||||
if job_status(s:whisper_job) == "fail"
|
||||
echoerr "Failed to start whisper job"
|
||||
endif
|
Loading…
Reference in New Issue
Block a user