Update tensorboard graphs, dataset details and added samples

2020-04-17 18:56:35 +02:00
commit 9fee9ad529
parent 8aede533fd
22 changed files with 81 additions and 42 deletions
--- a/README.md
+++ b/README.md
@@ -1,21 +1,15 @@
-[english version below](#Introduction)
+[Deutsche Version weiter unten](#Einleitung)

-# Einleitung
-Viele (aktuell so angesagte) smarte Assistenten wie Amazon Alexa, Google Home, Apple Siri und Microsoft Cortana benötigen zwingend eine Internetverbindung um u.a. die Funktionen STT (Sprache in Text) und TTS (Text in Sprache) in ordentlicher Qualität anzubieten. Es gibt aber auch Open Source Projekte die alternative Assistenten entwickeln, die teils offline funktionieren.
+# Introduction
+Many smart assistants like Amazon Alexa, Google Home, Apple Siri and Microsoft Cortana need an internet connection to offer STT (speech-to-text) and TTS (text-to-speech) in decent quality. There are also open source projects that develop alternative assistants, some of which work offline. Personally, I am currently playing around with MyCroft AI, which has a great community.

-Für den Bereich "STT/TTS" werden dazu jedoch gute Trainings-Testdaten (bspw. zum Deep-Learning) benötigt. Hier kommt das Projekt Mozilla Common Voice ins Spiel.
+STT and TTS, however, require good training and test data (e.g. for deep learning). This is where the Mozilla Common Voice project comes into play.

-# Und?!
-Ich möchte meinen kleinen bescheidenen Beitrag leisten und stelle meine Stimme unter der CC0 Lizenz zur Verfügung. Die notwendigen Sätze entstammen dem Mozilla Common Voice Projekt und die Aufzeichnung der Stimme habe ich mit Mimic-Recording-Studio (von MyCroft) vorgenommen.
+# And?!
+I want to make my small, modest contribution by giving the model (tacotron version 1 and 2) of my personal voice (**German**) to the community, free to use.

-# Klingt gut. Was genau gibt es hier.
-* Der Corpus als CSV Format, so dass er vom Mimic-Recording-Studio verwendet werden kann (Datenquelle: Mozilla commion voice (anteilig))
-* Die LJSpeech-1.1 Struktur (metadata.csv und zugehörige WAV-Dateien) zur Verarbeitung mit mimic2 (basiert auf Tacotron)
-> Aufgrund von Github-Größenbeschränkung liegen die gezippten WAV-Dateien im Google Drive ([Download-Link](https://drive.google.com/open?id=1NTi-4r3EWl5dw0k2o4Xh92G0OHvhoxAJ)
-
-# Beispiele
-## Orinalaufnahme
-Um einen Eindruck von meiner Stimme zu bekommen habe ich einige Original Aufnahmedateien als Beispiel im Ordner /samples/original_recording bereitgestellt.
+# Samples of my original voice
+To give an impression of what my voice sounds like, so you can decide whether it fits your project, I published some sample recordings. As soon as the final model is available, I'll upload some generated TTS samples too.

 * [Das Teilen eines Benutzerkontos ist strengstens untersagt.](./samples/original_recording/recorded_sample_01.wav )
 * [Der Prophet spricht stets in Gleichnissen.](./samples/original_recording/recorded_sample_02.wav )
@@ -25,45 +19,90 @@ Um einen Eindruck von meiner Stimme zu bekommen habe ich einige Original Aufnahm
 * [Jede gute Küchenwaage hat eine Tara-Funktion.](./samples/original_recording/recorded_sample_06.wav )
 * [Jeden Gedanken kannst du hier loswerden.](./samples/original_recording/recorded_sample_07.wav )

-# Sonstiges
-Bitte verwende es nicht für Böses!
-Solltest Du meine (konkrete) TTS Stimme verwenden wäre ich für eine Info zum Projekt und eine Demo dankbar

-Außerdem gilt mein Dank an die Projekte/Communities von Mozilla Common Voice und MyCroft / Mimic.
-Besonds an Lindsay Saunders (Mozilla) für den netten Kontakt und eltocino, gras64, dominik von der MyCroft Community für die Gedult meine Anfängerfragen gedultig zu beantworten :-).
+# Dataset information
+I recorded my voice using mimic-recording-studio by MyCroft and a German corpus provided by @gras64. After I had finished recording and removed the bad recordings, @domcross optimized the dataset with regard to random noise, echo and background beeps.
+The final dataset used for training is:

-# Introduction
-Many (currently so hip) smart assistants like Amazon Alexa, Google Home, Apple Siri and Microsoft Cortana need an internet connection to offer the functions STT (speech in text) and TTS (text in speech) in decent quality. But there are also open source projects that develop alternative wizards, some of which work offline.
+* 20,711 recorded phrases (wav files)
+* more than 20 hours of pure audio
+* sample rate: 22,050 Hz
+* mono
+* avg sentence length: 47 chars
+* avg spoken chars per second: 14
+* sentences with question mark: 2,759
+* sentences with exclamation mark: 1,869
+* ljspeech-1.1 structure

-For the area "STT / TTS", however, good training test data (eg for deep learning) are required. This is where the Mozilla Common Voice project comes into play.
+> TODO: Add bokeh graph.
+Changing the recording location and equipment during the recording sessions led to two clusters in the bokeh graph of the recordings.

-# And?!
-I want to make my small modest contribution and make my voice available under the CC0 license. The necessary sentences came from the Mozilla Common Voice project and I recorded the voice with Mimic Recording Studio (by MyCroft).
+# Training tacotron version 1 (for mimic2 usage)

-# Sounds good. What exactly is here.
-* The Corpus as a CSV format that can be used by the Mimic recording studio (datasource is partial mozilla common voice project)
-* The LJSpeech-1.1 structure (metadata.csv and associated WAV files) for processing with mimic2 (based on Tacotron)
->> Due github size restrictions the compressed wav-files can be downloaded from google drive ([Download-Link](https://drive.google.com/open?id=1NTi-4r3EWl5dw0k2o4Xh92G0OHvhoxAJ)
+## Graphs
+### analyze output
+![char_len_vs_avg_secs](./img/char_len_vs_avg_secs.png?raw=true "char_len_vs_avg_secs")
+![char_len_vs_med_secs](./img/char_len_vs_med_secs.png?raw=true "char_len_vs_med_secs")
+![char_len_vs_mode_secs](./img/char_len_vs_mode_secs.png?raw=true "char_len_vs_mode_secs")
+![char_len_vs_num_samples](./img/char_len_vs_num_samples.png?raw=true "char_len_vs_num_samples")
+![char_len_vs_std](./img/char_len_vs_std.png?raw=true "char_len_vs_std")
+
+### alignment
+> Training in progress - step 81k
+
+![](./img/tacotron1-step-81000-align.png?raw=true "tacotron1-step-81000-align")
+
+### Tensorboard
+> Training in progress - step 81k
+
+![](./img/tacotron1-tensorboard-01.png?raw=true "tacotron1-tensorboard-01")
+![](./img/tacotron1-tensorboard-02.png?raw=true "tacotron1-tensorboard-02")
+
+## TTS samples
+> All samples are "training in progress"
+* [Tacotron 1 training (step 81,000)](./samples/tts_tacotron1/step-81000-audio.wav )
+
+## Model release (tacotron 1)
+> Will be released as soon as the training process is finished


-# Miscellaneous
-Please do not use it for evil!
+# Mozilla TTS training/model (tacotron 2)
+
+## Training info
+
+## Model release (tacotron2)
+> Will be released as soon as the training process is finished
+
+
+# Special thanks
+I want to thank the whole community for providing great open source products.
+
+Special thanks go to
+- @domcross
+- @gras64
+- @erogol
+- @krisgesling
+- eltocino from MyCroft community
+- nmstroker from Mozilla forum
+
+for their support on this.
+
+A very special thank you to @domcross for his time, support, audio optimizing know-how and, finally, his GPU computing power. Thank you :-)
+
+# Finally
 If you use my TTS voice, I would be grateful for a note about your project and a demo.

-Also, my thanks go to the projects / communities of Mozilla Common Voice and MyCroft / Mimic. Especially to Lindsay Saunders (Mozilla) for nice contact and eltocino, gras64, dominik from the MyCroft community for the patience to patiently answer my beginner questions :-).
+> Please do not use my voice for evil.

-# Mimic analyze(.py) results (after 12000 spoken phrases)
-![char_len_vs_avg_secs](./img/12000_phrases_char_len_vs_avg_secs.png?raw=true "char_len_vs_avg_secs")
-![char_len_vs_med_secs](./img/12000_phrases_char_len_vs_med_secs.png?raw=true "char_len_vs_med_secs")
-![char_len_vs_mode_secs](./img/12000_phrases_char_len_vs_mode_secs.png?raw=true "char_len_vs_mode_secs")
-![char_len_vs_num_samples](./img/12000_phrases_char_len_vs_num_samples.png?raw=true "char_len_vs_num_samples")
-![char_len_vs_std](./img/12000_phrases_char_len_vs_std.png?raw=true "char_len_vs_std")
+# TODO
+- add training plots
+- add bokeh graph
+- add Google Drive download URLs for the models

 # Links
-* https://voice.mozilla.org/
-* https://github.com/mozilla/CorporaCreator
-* https://raw.githubusercontent.com/mozilla/voice-web/master/server/data/de/sentence-collector.txt
 * https://community.mycroft.ai/
-* https://github.com/MycroftAI/mimic2
 * https://github.com/MycroftAI/mimic-recording-studio
+* https://voice.mozilla.org/
+* https://github.com/mozilla/TTS
+* https://raw.githubusercontent.com/mozilla/voice-web/master/server/data/de/sentence-collector.txt
 * https://github.com/gras64/corpus-file-gen
--- a/img/12000_phrases_char_len_vs_avg_secs.png
+++ b/img/12000_phrases_char_len_vs_avg_secs.png
--- a/img/12000_phrases_char_len_vs_med_secs.png
+++ b/img/12000_phrases_char_len_vs_med_secs.png
--- a/img/12000_phrases_char_len_vs_mode_secs.png
+++ b/img/12000_phrases_char_len_vs_mode_secs.png
--- a/img/12000_phrases_char_len_vs_num_samples.png
+++ b/img/12000_phrases_char_len_vs_num_samples.png
--- a/img/12000_phrases_char_len_vs_std.png
+++ b/img/12000_phrases_char_len_vs_std.png
--- a/img/5000_phrases_char_len_vs_avg_secs.png
+++ b/img/5000_phrases_char_len_vs_avg_secs.png
--- a/img/5000_phrases_char_len_vs_med_secs.png
+++ b/img/5000_phrases_char_len_vs_med_secs.png
--- a/img/5000_phrases_char_len_vs_mode_secs.png
+++ b/img/5000_phrases_char_len_vs_mode_secs.png
--- a/img/5000_phrases_char_len_vs_num_samples.png
+++ b/img/5000_phrases_char_len_vs_num_samples.png
--- a/img/5000_phrases_char_len_vs_std.png
+++ b/img/5000_phrases_char_len_vs_std.png
--- a/img/char_len_vs_avg_secs.png
+++ b/img/char_len_vs_avg_secs.png
--- a/img/char_len_vs_med_secs.png
+++ b/img/char_len_vs_med_secs.png
--- a/img/char_len_vs_mode_secs.png
+++ b/img/char_len_vs_mode_secs.png
--- a/img/char_len_vs_num_samples.png
+++ b/img/char_len_vs_num_samples.png
--- a/img/char_len_vs_std.png
+++ b/img/char_len_vs_std.png
--- a/img/tacotron1-step-81000-align.png
+++ b/img/tacotron1-step-81000-align.png
--- a/img/tacotron1-tensorboard-01.png
+++ b/img/tacotron1-tensorboard-01.png
--- a/img/tacotron1-tensorboard-02.png
+++ b/img/tacotron1-tensorboard-02.png
--- a/samples/tts_tacotron1/step-73000-audio.wav
+++ b/samples/tts_tacotron1/step-73000-audio.wav
--- a/samples/tts_tacotron1/step-81000-align.png
+++ b/samples/tts_tacotron1/step-81000-align.png
--- a/samples/tts_tacotron1/step-81000-audio.wav
+++ b/samples/tts_tacotron1/step-81000-audio.wav
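
Some of the figures in the README's "Dataset information" list (phrase count, average sentence length, sentences with question or exclamation marks) could be recomputed from an LJSpeech-1.1 style `metadata.csv`. The following is a minimal sketch, not the script used for this dataset. It assumes the usual LJSpeech layout of pipe-separated rows with the transcription in the second column, and it counts a sentence as a question or exclamation when it contains `?` or `!`. The name `corpus_stats`, the class `CorpusStats` and the sample row IDs are hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class CorpusStats:
    phrases: int        # number of rows with a transcription
    avg_chars: float    # mean transcription length in characters
    questions: int      # transcriptions containing "?"
    exclamations: int   # transcriptions containing "!"


def corpus_stats(metadata_file) -> CorpusStats:
    """Summarise an LJSpeech-style metadata file.

    Assumption: rows look like "id|text|normalized text", so the
    transcription is taken from the second column.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(metadata_file, delimiter="|", quoting=csv.QUOTE_NONE)
    texts = [row[1] for row in reader if len(row) >= 2]
    total_chars = sum(len(text) for text in texts)
    return CorpusStats(
        phrases=len(texts),
        avg_chars=total_chars / len(texts) if texts else 0.0,
        questions=sum("?" in text for text in texts),
        exclamations=sum("!" in text for text in texts),
    )


# Two sentences from the sample list above, with made-up row IDs.
demo = io.StringIO(
    "sample_02|Der Prophet spricht stets in Gleichnissen.|"
    "Der Prophet spricht stets in Gleichnissen.\n"
    "sample_07|Jeden Gedanken kannst du hier loswerden.|"
    "Jeden Gedanken kannst du hier loswerden.\n"
)
print(corpus_stats(demo))
```

For the real dataset, the same function could be called with `open("metadata.csv", encoding="utf-8", newline="")` in place of the in-memory sample. The "spoken chars per second" figure additionally needs the audio durations, which this sketch does not read.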