tests : adding transcription tests

Georgi Gerganov 2022-11-28 22:44:01 +02:00
parent 061fc81bd6
commit 9b7df68753
GPG Key ID: 449E073F9DC10735
7 changed files with 140 additions and 0 deletions

Makefile

@@ -206,3 +206,11 @@ tiny.en tiny base.en base small.en small medium.en medium large: main
./main -m models/ggml-$@.bin -f $$f ; \
echo "" ; \
done
#
# Tests
#
.PHONY: tests
tests:
bash ./tests/run-tests.sh

3
tests/.gitignore vendored Normal file

@@ -0,0 +1,3 @@
*.wav
*.ogg
*.wav.txt

1
tests/en-0-ref.txt Normal file

@@ -0,0 +1 @@
My fellow Americans, this day has brought terrible news and great sadness to our country. At 9 o'clock this morning, Mission Control in Houston lost contact with our space shuttle, Columbia. A short time later, debris was seen falling from the skies above Texas. The Colombians lost. There are no survivors. On board was a crew of seven. Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark, Captain David Brown, Commander William McCool, Dr. Kultna Shavla, and Ilan Ramon, a colonel in the Israeli Air Force. These men and women assumed great risk in the service to all humanity. In an age when spaceflight has come to seem almost routine, it is easy to overlook the dangers of travel by rocket and the difficulties of navigating the fierce outer atmosphere of the Earth. These astronauts knew the dangers, and they faced them willingly, knowing they had a high and noble purpose in life. Because of their courage and daring and idealism, we will miss them all the more. All Americans today are thinking as well of the families of these men and women who have been given this sudden shock and grief. You're not alone. Our entire nation grieves with you. And those you love will always have the respect and gratitude of this country. The cause in which they died will continue. Mankind is led into the darkness beyond our world by the inspiration of discovery and the longing to understand. Our journey into space will go on. In the skies today, we saw destruction and tragedy. Yet farther than we can see, there is comfort and hope. In the words of the prophet Isaiah, "Lift your eyes and look to the heavens. Who created all these? He who brings out the starry hosts one by one and calls them each by name." Because of His great power and mighty strength, not one of them is missing. The same Creator who names the stars also knows the names of the seven souls we mourn today. 
The crew of the shuttle Columbia did not return safely to Earth, yet we can pray that all are safely home. May God bless the grieving families. And may God continue to bless America. [Silence]

1
tests/en-1-ref.txt Normal file

@@ -0,0 +1 @@
Henry F. Phillips from Wikipedia, the free encyclopedia at en.wikipedia.org. Henry F. Phillips from Wikipedia, the free encyclopedia. Henry F. Phillips 1890-1958, a U.S. businessman from Portland, Oregon, has the honor of having the Phillips head screw and screwdriver named after him. The importance of the cross head screw design lies in its self-centering property, useful on automated production lines that use powered screwdrivers. Phillips' major contribution was in driving the cross head concept forward to the point where it was adopted by screw makers and automobile companies. Although he received patents for the design in 1936, U.S. Patent #2,046,343, U.S. Patents #2,046,837 to #2,046,840, it was so widely copied that by 1949 Phillips lost his patent. The American Screw Company was responsible for devising a means of manufacturing the screw, and successfully patented and licensed their method. Other screw makers of the 1930s dismissed the Phillips concept since it calls for a relatively complex recessed socket shape in the head of the screw, as distinct from the simple milled slot of a slotted type screw. The Phillips Screw Company and the American Screw Company went on to devise the Pawsadrive screw, which differs from the Phillips in that it is designed to accommodate greater torque than the Phillips. An image accompanied this article, captioned "Phillips Screw Head." The following is an info box which accompanies this article. Info box, part of the series on screw drive types. Slotted, commonly erroneously flat head. Phillips, cross head. Pawsadrive, super drive. Torques. Hex, Allen. Robertson. Tri-wing. Torx set. Spanner head. Triple square, XZN. Others, poly drive, spline drive, double hex. Many images accompanied this info box. This page was last modified on the 9th of April, 2008, at 1704. All text is available under the terms of the GNU Free Documentation License. See copyrights for details. 
Wikipedia is a registered trademark of the Wikimedia Foundation Incorporated, a U.S. registered 501(c)(3) tax-deductible nonprofit charity. This sound file and all text in the article are licensed under the GNU Free Documentation License, available at www.gnu.org/copyleft/fdl.html.

1
tests/en-2-ref.txt Normal file

@@ -0,0 +1 @@
This is the Micro Machine Man presenting the most midget miniature motorcade of Micro Machines. Each one has dramatic details, terrific trim, precision paint jobs, plus incredible Micro Machine Pocket Playsets. There's a police station, fire station, restaurant, service station, and more. Perfect pocket portables to take anyplace. And there are many miniature playsets to play with, and each one comes with its own special edition Micro Machine vehicle and fun, fantastic features that miraculously move. Raise the boat lift at the airport marina, man the gun turret at the army base, clean your car at the car wash, raise the toll bridge. And these playsets fit together to form a Micro Machine world. Micro Machine Pocket Playsets, so tremendously tiny, so perfectly precise, so dazzlingly detailed, you'll want to pocket them all. Micro Machines are Micro Machine Pocket Playsets sold separately from Galoob. The smaller they are, the better they are.

1
tests/es-0-ref.txt Normal file

@@ -0,0 +1 @@
Hola, como están todos? Mi nombre es Julián Virrueta Mendoza y en este podcast les vengo a hablar sobre la contaminación del agua. Bueno, empezaré por decir que el ser humano no está midiendo las consecuencias de sus actos. No hay duda que uno de los mayores problemas a los que se enfrentan muchas poblaciones actualmente es la contaminación del agua. Principalmente porque como bien sabemos el agua prácticamente es fundamental para la vida, por lo que la contaminación puede ser algo muy negativo para el desarrollo tanto económico como social de los pueblos o de las poblaciones próximas en ese lugar contaminado. Los comienzos de la contaminación, como lo definen muchos expertos en la materia, la contaminación del agua es causada por las actividades humanas. Es un fenómeno ambiental de importancia, el cual se comienza a producir desde los primeros intentos de industrialización para transformarse luego en un problema tan habitual como generalizado. Generalmente la contaminación del agua se produce a través de la introducción directa o indirecta en los acuíferos o caos de agua, ríos, mares, lagos, océanos, etc. o de diversas sustancias que pueden ser consideradas como contaminantes. Pero existen dos formas principales de contaminación del agua. Una de ellas tiene que ver con la contaminación natural del agua que se corresponde con el ciclo natural de esta durante el que puede entrar en contacto con ciertos constituyentes contaminantes como sustancias minerales y orgánicas disueltas o en suspensión que se vierten en la corteza terrestre, la atmósfera y en las aguas. Pero todo esto se puede contradecir si el ser humano comía sus consecuencias, si no tirara basura a los lagos, a los ríos, no tirara botes de aceite, no contaminara. Bueno amigos, yo los invito a que no contaminen el agua y que sepan cuidar la naturaleza. Los saluda su buen amigo y compañero Julián Virreta. Nos vemos. ¡Claro!

125
tests/run-tests.sh Executable file

@@ -0,0 +1,125 @@
#!/bin/bash
# This script runs the selected model against a collection of audio files from the web.
# It downloads, converts and transcribes each file and then compares the result with the expected reference
# transcription. The comparison is performed using git's diff command and shows the differences at the character level.
# It can be used to quickly verify that the model is working as expected across a wide range of audio files.
# I.e. like an integration test. The verification is done by visual inspection of the diff output.
#
# The reference data can be for example generated using the original OpenAI Whisper implementation, or entered manually.
#
# Feel free to suggest extra audio files to add to the list.
# Make sure they are between 1 and 3 minutes long since we don't want to make the test too slow.
#
# Usage:
#
# ./tests/run-tests.sh <model_name>
#
cd "$(dirname "$0")"
# Whisper models
models=( "tiny.en" "tiny" "base.en" "base" "small.en" "small" "medium.en" "medium" "large" )
# list available models
function list_models {
printf "\n"
printf " Available models:"
for model in "${models[@]}"; do
printf " $model"
done
printf "\n\n"
}
if [ $# -eq 0 ]; then
printf "Usage: %s <model_name>\n\n" "$0"
printf "No model specified. Aborting\n"
list_models
exit 1
fi
model=$1
main="../main"
if [ ! -f ../models/ggml-$model.bin ]; then
printf "Model $model not found. Aborting\n"
list_models
exit 1
fi
if [ ! -f $main ]; then
printf "Executable $main not found. Aborting\n"
exit 1
fi
# add various audio files for testing purposes here
# the order of the files is important so don't change the existing order
# when adding new files, make sure to add the expected "ref.txt" file with the correct transcript
urls_en=(
"https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg"
"https://upload.wikimedia.org/wikipedia/en/d/d4/En.henryfphillips.ogg"
"https://cdn.openai.com/whisper/draft-20220913a/micro-machines.wav"
)
urls_es=(
"https://upload.wikimedia.org/wikipedia/commons/c/c1/La_contaminacion_del_agua.ogg"
)
urls_it=(
)
urls_pt=(
)
urls_de=(
)
urls_jp=(
)
urls_ru=(
)
function run_lang() {
lang=$1
shift
urls=("$@")
i=0
for url in "${urls[@]}"; do
echo "- [$lang] Processing '$url' ..."
ext="${url##*.}"
fname_src="$lang-${i}.${ext}"
fname_dst="$lang-${i}-16khz.wav"
# download the source audio once and reuse it on later runs
if [ ! -f "$fname_src" ]; then
wget --quiet --show-progress -O "$fname_src" "$url"
fi
# convert to 16 kHz mono 16-bit PCM WAV
if [ ! -f "$fname_dst" ]; then
ffmpeg -loglevel -0 -y -i "$fname_src" -ar 16000 -ac 1 -c:a pcm_s16le "$fname_dst"
if [ $? -ne 0 ]; then
echo "Error: ffmpeg failed to convert $fname_src to $fname_dst"
exit 1
fi
fi
# transcribe to "$fname_dst.txt", then show a character-level diff against the reference
"$main" -m "../models/ggml-$model.bin" -f "$fname_dst" -l "$lang" -otxt 2> /dev/null
git diff --no-index --word-diff=color --word-diff-regex=. "$fname_dst.txt" "$lang-$i-ref.txt"
i=$((i+1))
done
}
run_lang "en" "${urls_en[@]}"
if [[ $model != *.en ]]; then
run_lang "es" "${urls_es[@]}"
run_lang "it" "${urls_it[@]}"
run_lang "pt" "${urls_pt[@]}"
run_lang "de" "${urls_de[@]}"
run_lang "ja" "${urls_jp[@]}" # Whisper's language code for Japanese is "ja"
run_lang "ru" "${urls_ru[@]}"
fi
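
The script's comments say that the order of the URLs must not change. The reason is that run_lang derives every file name from the language code and the position of the URL in its list, so the reference file for a recording is found purely by index. The sketch below mirrors that naming logic without downloading or transcribing anything. The helper derive_names is hypothetical and not part of the commit; it only illustrates which files each list entry maps to.

```shell
#!/bin/bash
# Sketch only: reproduces the file naming used by run_lang.
# derive_names is a hypothetical helper, not part of the commit.
function derive_names {
    local lang=$1
    local i=$2
    local url=$3
    # keep only the text after the last dot of the URL, e.g. "ogg" or "wav"
    local ext="${url##*.}"
    # downloaded file, converted file, transcript written by -otxt, reference file
    echo "$lang-$i.$ext $lang-$i-16khz.wav $lang-$i-16khz.wav.txt $lang-$i-ref.txt"
}

urls_en=(
    "https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg"
    "https://upload.wikimedia.org/wikipedia/en/d/d4/En.henryfphillips.ogg"
    "https://cdn.openai.com/whisper/draft-20220913a/micro-machines.wav"
)

# walk the list the same way run_lang does, with a running index
i=0
for url in "${urls_en[@]}"; do
    derive_names "en" "$i" "$url"
    i=$((i+1))
done
```

Each printed line lists the downloaded file, the 16 kHz WAV, the generated transcript and the reference it is compared with. Inserting a URL in the middle of a list would shift the index of every later entry, so the later recordings would be compared against the wrong "ref.txt" files. New URLs are therefore best appended at the end of a list.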