README update due dataset release

Thorsten Mueller 2020-08-05 17:25:01 +02:00
parent d8c7e0f303
commit 5f0386341c
26 changed files with 79 additions and 68 deletions

EvolutionOfThorstenDataset.pdf (binary file, not shown)

README.md
@ -1,13 +1,35 @@
# Introduction
Many smart assistants like Amazon Alexa, Google Home, Apple Siri and Microsoft Cortana need an internet connection to offer STT (speech to text) and TTS (text to speech) in decent quality. But there are also open source projects that develop alternative assistants, some of which work offline. Personally, I'm currently playing around with MyCroft AI, which has a great community.
Many smart voice assistants like Amazon Alexa, Google Home, Apple Siri and Microsoft Cortana use cloud services to offer their (base) functionality.
STT and TTS, however, require good training and test data (e.g. for deep learning). This is where the Mozilla Common Voice project comes into play.
As some people have privacy concerns about using these services, there are some (open source) projects trying to build offline and/or privacy-aware alternatives.
But speech recognition and speech synthesis still require cloud services to deliver decent quality.
# MyCroft AI
> https://mycroft.ai/
MyCroft is a company developing an open source voice assistant with a very nice and active community. The STT/TTS parts are still cloud based (e.g. Google services), even if requests are anonymized by a Mycroft proxy in between. However, integration with locally hosted services such as DeepSpeech (STT) or Mimic/Tacotron (TTS) is possible.
# Mozilla
Mozilla works on these really important aspects of free and open human-machine voice interaction.
## STT - speech to text
> https://commonvoice.mozilla.org/
"STT" needs lots of audio training data from many speakers (women/men/kids) of all ages and dialects, at various audio quality levels. So any voice contribution to the Common Voice project is highly welcome.
## TTS - text to speech
> https://github.com/mozilla/tts
"TTS" needs lots of clean recordings by one speaker to train a model. Mozilla is developing a software stack for proper model training based on the Tacotron 2 papers.
# And?!
I want to make my small, modest contribution and give the model (tacotron version 1 and 2) of my personal voice (**German**) to the community, free to use.
I want to make the most personal contribution I can and give my personal voice (**German**) for TTS training to the community, free to use.
> Please don't use for evil!
# Samples of my original voice
To give you an impression of how my voice sounds, so you can decide whether it fits your project, I published some sample recordings. As soon as the final model is available, I'll upload some generated TTS samples too.
# Dataset "thorsten"
## Samples of my voice
To give you an impression of how my voice sounds, so you can decide whether it fits your project, I published some sample recordings, so there is no need to download the complete dataset first.
* [Das Teilen eines Benutzerkontos ist strengstens untersagt.](./samples/original_recording/recorded_sample_01.wav )
* [Der Prophet spricht stets in Gleichnissen.](./samples/original_recording/recorded_sample_02.wav )
@ -18,89 +40,78 @@ To get an impression of what my voice sounds to decide if it fits to your projec
* [Jeden Gedanken kannst du hier loswerden.](./samples/original_recording/recorded_sample_07.wav )
# Dataset information
I recorded my voice using mimic-recording-studio by Mycroft and a German corpus provided by @gras64. After I recorded the phrases and removed bad recordings, @domcross optimized the dataset with regard to random noise, echo and background beeps.
The final dataset used for training:
## Information on dataset "thorsten"
* 20,711 recorded phrases (wav files)
* more than 20 hours of pure audio
* ljspeech-1.1 structure
* 22,668 recorded phrases (wav files)
* more than 23 hours of pure audio
* sample rate 22,050 Hz
* mono
* avg sentence length: 47 chars
* phrase length (min/avg/max): 2 / 52 / 180 chars
* no silence at beginning/ending
* avg spoken chars per second: 14
* sentences with question mark: 2,759
* sentences with exclamation mark: 1,869
* ljspeech-1.1 structure
> TODO: Add bokeh graph.
Changing the recording location and equipment during the recording sessions led to two clusters in the bokeh plot of the recordings.
# Training tacotron version 1 (for mimic2 usage)
## Graphs
### analyze output
![char_len_vs_avg_secs](./img/char_len_vs_avg_secs.png?raw=true "char_len_vs_avg_secs")
![char_len_vs_med_secs](./img/char_len_vs_med_secs.png?raw=true "char_len_vs_med_secs")
![char_len_vs_mode_secs](./img/char_len_vs_mode_secs.png?raw=true "char_len_vs_mode_secs")
![char_len_vs_num_samples](./img/char_len_vs_num_samples.png?raw=true "char_len_vs_num_samples")
![char_len_vs_std](./img/char_len_vs_std.png?raw=true "char_len_vs_std")
### alignment
> Training in progress - step 81k
![](./img/tacotron1-step-81000-align.png?raw=true "tacotron1-step-81000-align")
### Tensorboard
> Training in progress - step 81k
![](./img/tacotron1-tensorboard-01.png?raw=true "tacotron1-tensorboard-01")
![](./img/tacotron1-tensorboard-02.png?raw=true "tacotron1-tensorboard-02")
## TTS samples
> All samples are "training in progress"
* [Tacotron 1 training (Step 81.000)](./samples/tts_tacotron1/step-81000-audio.wav )
## Model release (tacotron 1)
> Will be released as soon as the training process is finished
* sentences with question mark: 2,780
* sentences with exclamation mark: 1,840
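
The figures in the list above can be re-checked locally. What follows is only a minimal sketch, assuming the dataset follows the LJSpeech-1.1 layout (a pipe-separated `metadata.csv` without a header row, plus a folder of wav files). The function names `summarize_metadata` and `wav_info` are illustrative and not part of the dataset or of any tool mentioned here.

```python
import csv
import wave


def summarize_metadata(metadata_path):
    """Summarize transcripts in an LJSpeech-style metadata file.

    Assumption: each line is `file_id|transcript|normalized transcript`,
    pipe-separated and without a header row.
    """
    lengths = []
    questions = 0
    exclamations = 0
    with open(metadata_path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            text = row[1]
            lengths.append(len(text))
            questions += "?" in text      # counts phrases containing "?"
            exclamations += "!" in text   # counts phrases containing "!"
    if not lengths:
        raise ValueError("no phrases found in metadata file")
    return {
        "phrases": len(lengths),
        "min_chars": min(lengths),
        "avg_chars": sum(lengths) / len(lengths),
        "max_chars": max(lengths),
        "with_question_mark": questions,
        "with_exclamation_mark": exclamations,
    }


def wav_info(wav_path):
    """Read sample rate, channel count and duration from a WAV header."""
    with wave.open(str(wav_path), "rb") as clip:
        rate = clip.getframerate()
        return {
            "sample_rate": rate,
            "channels": clip.getnchannels(),
            "seconds": clip.getnframes() / rate,
        }
```

Calling `summarize_metadata` on the metadata file gives the phrase count and the min/avg/max phrase length, and `wav_info` on a single clip should report a sample rate of 22050 and one channel if the clip matches the list above. The exact file and folder names inside the download are assumptions, so adjust the paths as needed.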
# Mozilla TTS training/model (tacotron 2)
![text length vs. mean audio duration](./img/thorsten-de---datasetAnalysis1.png)
![text length vs. median audio duration](./img/thorsten-de---datasetAnalysis2.png)
![text length vs. STD](./img/thorsten-de---datasetAnalysis3.png)
![text length vs. number instances](./img/thorsten-de---datasetAnalysis4.png)
![signal noise ratio](./img/thorsten-de---datasetAnalysis5.png)
![bokeh](./img/thorsten-de---datasetAnalysis6.png)
## Training info
> Interested in the evolution of this dataset? See the following PDF document ([evolution of thorsten dataset](./EvolutionOfThorstenDataset.pdf) )
## Model release (tacotron2)
> Will be released as soon as the training process is finished
## Please read some personal words before downloading the dataset
I contribute my voice as a person who believes in a world where all people are equal, regardless of gender, sexual orientation, religion, skin color and geocoordinates of birth location. A global world where everybody is warmly welcome anywhere on this planet, and where open and free knowledge and education are available to everyone.
> https://drive.google.com/file/d/1yKJM1LAOQpRVojKunD9r8WN_p5KzBxjc/view?usp=sharing
> So hopefully my voice is used in this manner to make this world a better place for all of us :-).
# Trained tacotron2 model "thorsten"
> Training is currently in progress.
> If you trained a model on the "thorsten" dataset, please file an issue with some information about it. Sharing a trained model is highly appreciated.
# Feel free to file an issue if you ...
* have improvements for the dataset
* use my TTS voice in your project(s)
* want to share your trained "thorsten" model
* learn of any abuse of my voice
# Special thanks
I want to thank the whole community for providing great open source products.
I want to thank all open source communities for providing great projects.
Special thanks go to
- domcross (https://github.com/domcross/)
- gras64 (https://github.com/gras64/)
- erogol (https://github.com/erogol/)
- krisgesling (https://github.com/krisgesling/)
- eltocino (https://github.com/el-tocino/)
- nmstoker (https://github.com/nmstoker)
Just to name some nice guys who joined me on this TTS road trip:
for their support on this.
* eltocino (https://github.com/el-tocino/)
* erogol (https://github.com/erogol/)
* gras64 (https://github.com/gras64/)
* krisgesling (https://github.com/krisgesling/)
* nmstoker (https://github.com/nmstoker)
* othiele (https://discourse.mozilla.org/u/othiele/summary)
* repodiac (https://github.com/repodiac)
A very special thank you to @domcross for his time, support, audio optimizing know-how and finally his GPU computing power. Thank you :-)
And last but not least, I want to say a huge thank you to a special guy who supported me on this journey right from the beginning. Not just with nice words, but with his time, audio optimization know-how and finally his GPU computing power.
# Finally
If you use my (concrete) TTS voice, I would be grateful for some information about the project and a demo.
Without his amazing support this dataset (in its current form) would not exist.
> Please do not use my voice for evil.
# TODO
- add training plots
- add bokeh graph
- google drive download urls for model download
Thank you Dominik (@domcross / https://github.com/domcross/)
# Links
* https://discourse.mozilla.org/t/contributing-my-german-voice-for-tts/48150
* https://community.mycroft.ai/
* https://github.com/MycroftAI/mimic-recording-studio
* https://voice.mozilla.org/
* https://github.com/mozilla/TTS
* https://raw.githubusercontent.com/mozilla/voice-web/master/server/data/de/sentence-collector.txt
* https://github.com/gras64/corpus-file-gen
We'll hear from each other in the future :-)
Thorsten
