diff --git a/EvolutionOfThorstenDataset.pdf b/EvolutionOfThorstenDataset.pdf
new file mode 100755
index 0000000..3eb85de
Binary files /dev/null and b/EvolutionOfThorstenDataset.pdf differ
diff --git a/README.md b/README.md
index 82c9429..24aab07 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,35 @@
 # Introduction
-Many smart assistants like Amazon Alexa, Google Home, Apple Siri and Microsoft Cortana need an internet connection to offer the functions STT (speech in text) and TTS (text in speech) in decent quality. But there are also open source projects that develop alternative wizards, some of which work offline. Personally i'm playing currently around with MyCroft AI which has a great community.
+Many smart voice assistants like Amazon Alexa, Google Home, Apple Siri and Microsoft Cortana use cloud services to offer their (base) functionality.
-For the area "STT / TTS", however, good training test data (eg for deep learning) are required. This is where the Mozilla Common Voice project comes into play.
+Because some people have privacy concerns about these services, there are several (open source) projects trying to build offline and/or privacy-aware alternatives.
+
+But speech recognition and speech synthesis in decent quality still largely depend on cloud services.
+
+# MyCroft AI
+> https://mycroft.ai/
+
+MyCroft is a company developing an open source voice assistant with a very nice and active community. The STT/TTS parts are still cloud based (e.g. Google services), even if requests are anonymized by a Mycroft proxy in between; however, integration with locally hosted services such as DeepSpeech (STT) or Mimic/Tacotron (TTS) is possible.
+
+# Mozilla
+Mozilla works on these important building blocks for free and open human-machine voice interaction.
+
+## STT - speech to text
+> https://commonvoice.mozilla.org/
+
+STT needs lots of audio training data from many speakers (women/men/kids) of all ages and dialects, and in various audio quality levels. So any voice contribution to the Common Voice project is highly welcome.
+
+## TTS - text to speech
+> https://github.com/mozilla/tts
+
+TTS needs many clean recordings of a single speaker to train a model. Mozilla is developing a software stack for proper model training based on the Tacotron 2 papers.
 
 # And?!
-I want to make my small modest contribution and contribute the model (tacotron version 1 and 2) of my personal voice (**german**) to the community for free to use.
+I want to make the most personal contribution I can and donate my personal voice (**German**) for TTS training to the community for free usage.
+> Please don't use it for evil!
 
-# Samples of my original voice
-To get an impression of what my voice sounds to decide if it fits to your project i published some sample recordings. As soon the final model is available i'll upload some generated TTS samples too.
+# Dataset "thorsten"
+## Samples of my voice
+To get an impression of what my voice sounds like and to decide whether it fits your project, I have published some sample recordings, so there is no need to download the complete dataset first.
 
 * [Das Teilen eines Benutzerkontos ist strengstens untersagt.](./samples/original_recording/recorded_sample_01.wav )
 * [Der Prophet spricht stets in Gleichnissen.](./samples/original_recording/recorded_sample_02.wav )
@@ -18,89 +40,78 @@ To get an impression of what my voice sounds to decide if it fits to your projec
 * [Jeden Gedanken kannst du hier loswerden.](./samples/original_recording/recorded_sample_07.wav )
 
-# Dataset information
-I recorded my voice using mimic-recording-studio by mycroft and a german corpus provided by @gras64. After recording and removing bad recordings, @domcross optimized the dataset concerning random noise, echo and background beeps.
-Finally the dataset used for training is:
+## Information on dataset "thorsten"
 
-* 20.711 recorded phrases (wav files)
-* more than 20 hours of pure audio
+* ljspeech-1.1 structure
+* 22.668 recorded phrases (wav files)
+* more than 23 hours of pure audio
 * samplerate 22.050Hz
 * mono
-* avg sentence length: 47 chars
+* phrase length (min/avg/max): 2 / 52 / 180 chars
+* no silence at beginning/ending
 * avg spoken chars per second: 14
-* sentences with question mark: 2.759
-* sentences with exclamation mark: 1.869
-* ljspeech-1.1 structure
-
-> TODO: Add bokeh graph.
-Changing recording location and equipment duing the recording session leads to two clusters in recording bokeh.
-
-# Training tacotron version 1 (for mimic2 usage)
-
-## Graphs
-### analyze output
-![char_len_vs_avg_secs](./img/char_len_vs_avg_secs.png?raw=true "char_len_vs_avg_secs")
-![char_len_vs_med_secs](./img/char_len_vs_med_secs.png?raw=true "char_len_vs_med_secs")
-![char_len_vs_mode_secs](./img/char_len_vs_mode_secs.png?raw=true "char_len_vs_mode_secs")
-![char_len_vs_num_samples](./img/char_len_vs_num_samples.png?raw=true "char_len_vs_num_samples")
-![char_len_vs_std](./img/char_len_vs_std.png?raw=true "char_len_vs_std")
-
-### alignment
-> Training in progress - step 81k
-
-![](./img/tacotron1-step-81000-align.png?raw=true "tacotron1-step-81000-align")
-
-### Tensorboard
-> Training in progress - step 81k
-
-![](./img/tacotron1-tensorboard-01.png?raw=true "tacotron1-tensorboard-01")
-![](./img/tacotron1-tensorboard-02.png?raw=true "tacotron1-tensorboard-02")
-
-## TTS samples
-> All samples are "training in progress"
-* [Tacotron 1 training (Step 81.000)](./samples/tts_tacotron1/step-81000-audio.wav )
-
-## Model release (tacotron 1)
-> Will be released as soon training process is finished
+* sentences with question mark: 2.780
+* sentences with exclamation mark: 1.840
 
-# Mozilla TTS training/model (tacotron 2)
+![text length vs. mean audio duration](./img/thorsten-de---datasetAnalysis1.png)
+![text length vs. median audio duration](./img/thorsten-de---datasetAnalysis2.png)
+![text length vs. STD](./img/thorsten-de---datasetAnalysis3.png)
+![text length vs. number of instances](./img/thorsten-de---datasetAnalysis4.png)
+![signal noise ratio](./img/thorsten-de---datasetAnalysis5.png)
+![bokeh](./img/thorsten-de---datasetAnalysis6.png)
 
-## Training info
+> Interested in the evolution of this dataset? See the following PDF document: [evolution of thorsten dataset](./EvolutionOfThorstenDataset.pdf)
 
-## Model release (tacotron2)
-> Will be released as soon training process is finished
+
+## Please read some personal words before downloading the dataset
+I contribute my voice as a person who believes in a world where all people are equal, no matter their gender, sexual orientation, religion, skin color or the geocoordinates of their birthplace; a global world where everybody is warmly welcome in any place on this planet, and where open and free knowledge and education are available to everyone.
+
+> https://drive.google.com/file/d/1yKJM1LAOQpRVojKunD9r8WN_p5KzBxjc/view?usp=sharing
+
+> So hopefully my voice will be used in this spirit to make the world a better place for all of us :-).
+
+
+# Trained tacotron2 model "thorsten"
+> Training is currently in progress.
+
+> If you have trained a model on the "thorsten" dataset, please file an issue with some information on it. Sharing a trained model is highly appreciated.
+
+
+# Feel free to file an issue if you ...
+* have improvements to the dataset
+* use my TTS voice in your project(s)
+* want to share your trained "thorsten" model
+* learn about any abusive usage of my voice
 
 # Special thanks
-I want to thank the whole community for providing great open source products.
+I want to thank all open source communities for providing great projects.
-Special thanks go to
-- domcross (https://github.com/domcross/)
-- gras64 (https://github.com/gras64/)
-- erogol (https://github.com/erogol/)
-- krisgesling (https://github.com/krisgesling/)
-- eltocino (https://github.com/el-tocino/)
-- nmstoker (https://github.com/nmstoker)
+Just to name some of the nice people who joined me on this TTS road trip:
 
-for their support or this.
+* eltocino (https://github.com/el-tocino/)
+* erogol (https://github.com/erogol/)
+* gras64 (https://github.com/gras64/)
+* krisgesling (https://github.com/krisgesling/)
+* nmstoker (https://github.com/nmstoker)
+* othiele (https://discourse.mozilla.org/u/othiele/summary)
+* repodiac (https://github.com/repodiac)
 
-A very special thank you for @domcross for his time, support, audio optimzing knowhow and finally his gpu computing power. Thank you :-)
+And last but not least, I want to say a huge thank you to a special guy who supported me on this journey right from the beginning: not just with nice words, but with his time, audio optimization know-how and finally his GPU computing power.
 
-# Finally
-If you use my (concrete) TTS voice I would be grateful for an info about the project and a demo.
+Without his amazing support this dataset would not exist in its current form.
 
-> Please do not use my voice for evil.
-
-# TODO
-- add training plots
-- add bokeh graph
-- google drive download urls for model download
+Thank you Dominik (@domcross / https://github.com/domcross/)
 
 # Links
+* https://discourse.mozilla.org/t/contributing-my-german-voice-for-tts/48150
 * https://community.mycroft.ai/
 * https://github.com/MycroftAI/mimic-recording-studio
 * https://voice.mozilla.org/
 * https://github.com/mozilla/TTS
 * https://raw.githubusercontent.com/mozilla/voice-web/master/server/data/de/sentence-collector.txt
-* https://github.com/gras64/corpus-file-gen
\ No newline at end of file
+
+Hope to hear from you in the future :-)
+
+Thorsten
\ No newline at end of file
diff --git a/img/char_len_vs_avg_secs.png b/img/char_len_vs_avg_secs.png
deleted file mode 100644
index 11272e8..0000000
Binary files a/img/char_len_vs_avg_secs.png and /dev/null differ
diff --git a/img/char_len_vs_med_secs.png b/img/char_len_vs_med_secs.png
deleted file mode 100644
index b9925a4..0000000
Binary files a/img/char_len_vs_med_secs.png and /dev/null differ
diff --git a/img/char_len_vs_mode_secs.png b/img/char_len_vs_mode_secs.png
deleted file mode 100644
index 50e2c30..0000000
Binary files a/img/char_len_vs_mode_secs.png and /dev/null differ
diff --git a/img/char_len_vs_num_samples.png b/img/char_len_vs_num_samples.png
deleted file mode 100644
index 0f190f2..0000000
Binary files a/img/char_len_vs_num_samples.png and /dev/null differ
diff --git a/img/char_len_vs_std.png b/img/char_len_vs_std.png
deleted file mode 100644
index e328565..0000000
Binary files a/img/char_len_vs_std.png and /dev/null differ
diff --git a/img/tacotron1-step-81000-align.png b/img/tacotron1-step-81000-align.png
deleted file mode 100644
index 757f967..0000000
Binary files a/img/tacotron1-step-81000-align.png and /dev/null differ
diff --git a/img/tacotron1-tensorboard-01.png b/img/tacotron1-tensorboard-01.png
deleted file mode 100644
index 0ad8778..0000000
Binary files a/img/tacotron1-tensorboard-01.png and /dev/null differ
diff --git a/img/tacotron1-tensorboard-02.png b/img/tacotron1-tensorboard-02.png
deleted file mode 100644
index e2c0584..0000000
Binary files a/img/tacotron1-tensorboard-02.png and /dev/null differ
diff --git a/img/thorsten-de---datasetAnalysis3.png b/img/thorsten-de---datasetAnalysis3.png
new file mode 100644
index 0000000..bb8ce7b
Binary files /dev/null and b/img/thorsten-de---datasetAnalysis3.png differ
diff --git a/img/thorsten-de---datasetAnalysis4.png b/img/thorsten-de---datasetAnalysis4.png
new file mode 100644
index 0000000..4932db5
Binary files /dev/null and b/img/thorsten-de---datasetAnalysis4.png differ
diff --git a/img/thorsten-de---datasetAnalysis5.png b/img/thorsten-de---datasetAnalysis5.png
new file mode 100644
index 0000000..542ef79
Binary files /dev/null and b/img/thorsten-de---datasetAnalysis5.png differ
diff --git a/img/thorsten-de---datasetAnalysis6.png b/img/thorsten-de---datasetAnalysis6.png
new file mode 100644
index 0000000..6cec5de
Binary files /dev/null and b/img/thorsten-de---datasetAnalysis6.png differ
diff --git a/samples/original_recording/resemblyzer/125b660d37dc16ec21d28eac29e80bcb.wav b/samples/original_recording/resemblyzer/125b660d37dc16ec21d28eac29e80bcb.wav
deleted file mode 100644
index ed03f4b..0000000
Binary files a/samples/original_recording/resemblyzer/125b660d37dc16ec21d28eac29e80bcb.wav and /dev/null differ
diff --git a/samples/original_recording/resemblyzer/2331dc4a3d9c7a0ed0b2bd7142a9c22c.wav b/samples/original_recording/resemblyzer/2331dc4a3d9c7a0ed0b2bd7142a9c22c.wav
deleted file mode 100644
index 2ed222c..0000000
Binary files a/samples/original_recording/resemblyzer/2331dc4a3d9c7a0ed0b2bd7142a9c22c.wav and /dev/null differ
diff --git a/samples/original_recording/resemblyzer/473c263192a14d5ab493b06461d634a8.wav b/samples/original_recording/resemblyzer/473c263192a14d5ab493b06461d634a8.wav
deleted file mode 100644
index c44d9ed..0000000
Binary files a/samples/original_recording/resemblyzer/473c263192a14d5ab493b06461d634a8.wav and /dev/null differ
diff --git a/samples/original_recording/resemblyzer/5062b8cb7557bf7c5f7d1e0b8c8bd7f7.wav b/samples/original_recording/resemblyzer/5062b8cb7557bf7c5f7d1e0b8c8bd7f7.wav
deleted file mode 100644
index 403f70c..0000000
Binary files a/samples/original_recording/resemblyzer/5062b8cb7557bf7c5f7d1e0b8c8bd7f7.wav and /dev/null differ
diff --git a/samples/original_recording/resemblyzer/51c5be2769a9808f0d549861ec65baf6.wav b/samples/original_recording/resemblyzer/51c5be2769a9808f0d549861ec65baf6.wav
deleted file mode 100644
index e3f3ce0..0000000
Binary files a/samples/original_recording/resemblyzer/51c5be2769a9808f0d549861ec65baf6.wav and /dev/null differ
diff --git a/samples/original_recording/resemblyzer/722c00bce2cf074493ec3d674cdd58e7.wav b/samples/original_recording/resemblyzer/722c00bce2cf074493ec3d674cdd58e7.wav
deleted file mode 100644
index 4678fad..0000000
Binary files a/samples/original_recording/resemblyzer/722c00bce2cf074493ec3d674cdd58e7.wav and /dev/null differ
diff --git a/samples/original_recording/resemblyzer/a45c050583f8e09a8bb143800ba80875.wav b/samples/original_recording/resemblyzer/a45c050583f8e09a8bb143800ba80875.wav
deleted file mode 100644
index f133fd9..0000000
Binary files a/samples/original_recording/resemblyzer/a45c050583f8e09a8bb143800ba80875.wav and /dev/null differ
diff --git a/samples/original_recording/resemblyzer/a6751996aa51045ea4734772a35ead62.wav b/samples/original_recording/resemblyzer/a6751996aa51045ea4734772a35ead62.wav
deleted file mode 100644
index 35a6cce..0000000
Binary files a/samples/original_recording/resemblyzer/a6751996aa51045ea4734772a35ead62.wav and /dev/null differ
diff --git a/samples/original_recording/resemblyzer/c884c866ed804ffc507697eeb033f3a8.wav b/samples/original_recording/resemblyzer/c884c866ed804ffc507697eeb033f3a8.wav
deleted file mode 100644
index 98364b7..0000000
Binary files a/samples/original_recording/resemblyzer/c884c866ed804ffc507697eeb033f3a8.wav and /dev/null differ
diff --git a/samples/tts_tacotron1/step-81000-align.png b/samples/tts_tacotron1/step-81000-align.png
deleted file mode 100644
index 757f967..0000000
Binary files a/samples/tts_tacotron1/step-81000-align.png and /dev/null differ
diff --git a/samples/tts_tacotron1/step-81000-audio.wav b/samples/tts_tacotron1/step-81000-audio.wav
deleted file mode 100644
index ce3fd5d..0000000
Binary files a/samples/tts_tacotron1/step-81000-audio.wav and /dev/null differ
diff --git a/samples/tts_tacotron2/185k_ThankYou.wav b/samples/tts_tacotron2/185k_ThankYou.wav
deleted file mode 100644
index cfae190..0000000
Binary files a/samples/tts_tacotron2/185k_ThankYou.wav and /dev/null differ
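
The phrase statistics quoted in the README hunk above (phrase count, min/avg/max length, question/exclamation counts) can be re-derived from the dataset's metadata file. Below is a minimal sketch, assuming a pipe-separated `metadata.csv` in the `id|transcription` layout implied by the stated ljspeech-1.1 structure; the file name, column order, and the `phrase_stats` helper itself are illustrative assumptions, not part of this diff.

```python
import csv
import statistics

def phrase_stats(metadata_path):
    """Collect simple phrase statistics from an ljspeech-style metadata file.

    Assumes pipe-separated rows of the form ``id|transcription`` (the
    ljspeech-1.1 layout the README describes); column order is an assumption.
    """
    lengths = []
    questions = 0
    exclamations = 0
    with open(metadata_path, newline="", encoding="utf-8") as f:
        # ljspeech metadata uses "|" as separator and no quoting.
        for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            text = row[1].strip()
            lengths.append(len(text))
            questions += text.endswith("?")
            exclamations += text.endswith("!")
    return {
        "phrases": len(lengths),
        "min_chars": min(lengths),
        "avg_chars": round(statistics.mean(lengths)),
        "max_chars": max(lengths),
        "question_marks": questions,
        "exclamation_marks": exclamations,
    }
```

Run against the dataset's own `metadata.csv`, a script like this should roughly reproduce the numbers in the "Information on dataset" list (e.g. phrase length min/avg/max of 2 / 52 / 180 chars); exact counts depend on which dataset revision is downloaded.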