Compare commits

...

90 Commits

Author SHA1 Message Date
Thorsten Müller
f13bcaf63e
Added Windows TTS training recipe
Added modified vits recipe for Thorsten-Voice model training using Windows
2023-03-05 16:19:50 +01:00
Thorsten Müller
04c5683194
German Corpus for Mimic-Recording-Studio 2022-12-16 22:54:02 +01:00
Thorsten Müller
50e09d49bf
Added social media info 2022-11-13 17:08:26 +01:00
Thorsten Müller
b0afed75f4
Added new 2022.10 ThorstenVoice dataset. 2022-11-13 16:47:46 +01:00
Thorsten Müller
9b7b4c6836
Added new released Tacotron2 DDC model to README
tts-server --model_name tts_models/de/thorsten/tacotron2-DDC
2022-08-23 19:03:47 +02:00
Thorsten Müller
aba10bc64a
Added info on new VITS model. 2022-06-24 18:00:12 +02:00
Thorsten Müller
07e85b3905
Merge pull request #35 from thorstenMueller/thorstenMueller-patch-1
Add new project logo to header.
2022-05-09 20:56:03 +02:00
Thorsten Müller
e08d50d6bb
Added new logo to header 2022-05-09 20:46:17 +02:00
Thorsten Müller
e691aa4ee3
Delete Logo_Thorsten-Voice-kleiner.jpg 2022-05-09 20:45:31 +02:00
Thorsten Müller
625f73e986
Delete Logo_Thorsten-Voice.jpg 2022-05-09 20:45:18 +02:00
Thorsten Müller
de1802f8ce
Update README.md 2022-05-09 20:34:50 +02:00
Thorsten Müller
f0500309d6
Test with embedded logo 2022-05-09 18:19:02 +02:00
Thorsten Müller
41c91b9865
Add files via upload 2022-05-09 18:14:50 +02:00
Thorsten Müller
fcb1e705a9
Add files via upload 2022-05-09 18:13:06 +02:00
Thorsten Müller
b8802db4f8
Uploaded transparent Thorsten-Voice logo. 2022-05-09 18:12:03 +02:00
Thorsten Müller
b00c768343
Added badge links. 2022-04-28 18:13:49 +02:00
Thorsten Mueller
3b0b4f898f Fixed typo. 2022-04-24 09:13:13 +02:00
Thorsten Müller
2106fc6b00
Test 2022-04-23 23:31:15 +02:00
Thorsten Müller
e4ff3ce04a
Initial draft FUNDING.yml 2022-04-23 23:29:15 +02:00
Thorsten Müller
f408508cd7
Merge pull request #31 from thorstenMueller/prep-thorsten-22.05
Merge new README (preparation for new TTS model release)
2022-04-23 23:26:17 +02:00
Thorsten Mueller
6b4cfb41d4 Added Youtube link. 2022-04-23 23:22:27 +02:00
Thorsten Mueller
521dd33483 Updated TOC 2022-04-23 21:15:26 +02:00
Thorsten Mueller
6efb25310a preparations for new Thorsten models 2022-04-23 21:13:30 +02:00
Thorsten Müller
5654397f3e
Add citation file. 2022-04-20 23:48:54 +02:00
Thorsten Mueller
b5ec9ef991 Fixed minor issues 2022-02-15 17:52:03 +01:00
Thorsten Mueller
77ad01d4ff Making ffmpeg conversion optional. 2022-02-15 17:28:40 +01:00
Thorsten Mueller
c35507b1f7 Added link for VoiceLunch slides. 2022-01-03 20:09:43 +01:00
Thorsten Mueller
b536dfd958 Added check if audio file exists in getDatasetSpeechRate 2021-12-19 18:44:01 +01:00
Thorsten Mueller
29238f2a31 Updated Download links / Cites 2021-12-11 17:44:49 +01:00
Thorsten Müller
8c5f4503f3
Added two hyperlinks
To http://www.Thorsten-Voice.de and https://OpenVoice-Tech.net Wiki
2021-11-28 11:33:54 +01:00
Thorsten Mueller
2ff7e3961b Added Forward Tacotron samples. 2021-10-30 21:48:21 +02:00
Thorsten Müller
1221713314
Remove Wikipedia link to "Thorsten (Stimme)" 2021-10-23 16:52:59 +02:00
Thorsten Mueller
d3225b48f8 Added Citation to README. 2021-10-08 18:22:34 +02:00
Thorsten Mueller
33c030f844 Added two scripts for dataset analysis/cleaning. 2021-09-28 06:10:21 +02:00
Thorsten Müller
2daabae53e
Added DOIs in README 2021-09-24 16:32:16 +02:00
Thorsten Müller
1d445b09f8
Added DOI badge for emotional dataset 2021-09-23 21:58:54 +02:00
Thorsten Mueller
2853f111dc Merge branch 'master' of https://github.com/thorstenMueller/deep-learning-german-tts 2021-09-18 16:04:59 +02:00
Thorsten Mueller
7540606247 Added download link for new recording-in-progress neutral dataset. 2021-09-18 16:04:33 +02:00
Thorsten Mueller
0b9e929ce0 Added Fullband-MelGAN model download path. Thanks to (see #26) 2021-08-20 06:02:47 +02:00
Thorsten Mueller
bc06fa923f Added info on TensorVox by ZDisket - thanks :-) 2021-08-12 18:30:55 +02:00
Thorsten Mueller
f19144b085 Adjusted quick setup example to new vocoder model. 2021-08-06 09:50:44 +02:00
Thorsten Müller
251c093ad4
Added locale settings for german Umlaut handling. 2021-08-04 09:24:51 +02:00
Thorsten Mueller
f505fd38df Dockerfile draft for NVIDIA Jetson Xavier AGX and Coqui 2021-08-02 19:54:38 +02:00
Thorsten Mueller
3e09ae8615 Added link to my Youtube channel. 2021-07-21 22:49:47 +02:00
Thorsten Mueller
2ed2413dda Explain how i recorded emotional phrases. 2021-07-13 21:53:55 +02:00
Thorsten Mueller
51c5f55bbd Added check that recording exists before export. 2021-07-12 23:27:50 +02:00
Thorsten Mueller
4f875ac591 Added --mrs_dir param for more flexibility 2021-07-07 22:00:47 +02:00
Thorsten Mueller
2ea44ede87 Added REAME for helperScripts 2021-07-04 22:38:38 +02:00
Thorsten Mueller
ba60fc57d4 Added script to create LJSpeech dataset out of Mimic-Recording-Studio recordings. 2021-07-04 22:33:38 +02:00
Thorsten Müller
9e68d99ee7
Updated emotional dataset v02 download link 2021-06-20 08:57:39 +02:00
Thorsten Mueller
7172604eed Added v02 emotional dataset (drunk + whispering) 2021-06-13 10:59:04 +02:00
Thorsten Mueller
58dece7c55 Added chapter on public talks 2021-06-08 07:18:30 +02:00
Thorsten Mueller
c81f374aca Test Commit 2021-06-07 21:52:31 +02:00
Thorsten Mueller
2c6aca780b Added table with trained model checkpoint downloads 2021-05-11 22:34:10 +02:00
Thorsten Müller
68e60f2a92
Format Wikipedia link 2021-04-22 18:57:40 +02:00
Thorsten Mueller
a3b0dde296 Added info about Wikipedia article 2021-04-22 18:53:39 +02:00
Thorsten Mueller
28d81a0fb2 Update on emotional dataset info 2021-04-11 11:42:24 +02:00
Thorsten Mueller
12c6d26dbd Moved emotional samples to other table. 2021-04-11 11:39:29 +02:00
Thorsten Mueller
4c06db69dd Added silero models to audio comparison 2021-04-11 11:04:20 +02:00
Thorsten Müller
bae96a75a5
Added badge for link to TTS comparison page 2021-04-09 19:29:24 +02:00
Thorsten Müller
1313520064
Playing around with some cool badges :-) 2021-04-09 19:05:43 +02:00
Thorsten Mueller
e2ecf68c13 added details on coqui model usage. 2021-04-05 16:57:36 +02:00
Thorsten Mueller
c8a5e1082e Small TOC fix 2021-04-03 23:48:10 +02:00
Thorsten Mueller
40aae591d7 Small fixes in TOC 2021-04-03 23:45:46 +02:00
Thorsten Mueller
4f722e96a9 Adding info on emotional dataset. 2021-04-03 23:24:53 +02:00
Thorsten Müller
7e1530b742
Merge pull request #14 from snakers4/master
Add silero-models
2021-04-03 22:12:09 +02:00
snakers4
647786be6c Add silero-models 2021-04-03 05:17:14 +00:00
Thorsten Müller
00685a008d
Added cute sloth smiley. 2021-03-30 12:07:41 +02:00
Thorsten Mueller
e5481a82a6 Added smaller logo 2021-03-30 08:00:58 +02:00
Thorsten Mueller
2d1428cd13 Switch to non-transparent logo 2021-03-30 07:55:08 +02:00
Thorsten Mueller
df55a19ae2 Added ThorstenVoice logo 2021-03-30 07:53:48 +02:00
Thorsten Müller
9585b73cc3
Modify title 2021-03-16 20:23:29 +01:00
Thorsten Müller
70158ba7c8
Small README updates 2021-03-16 18:51:21 +01:00
Thorsten Mueller
e1e9f8666a Small text adjustments and formatting on README. 2021-03-16 18:41:39 +01:00
Thorsten Müller
cca10c215e
Added download link to v03 dataset. 2021-02-10 19:46:21 +01:00
Thorsten Mueller
09705597b8 Merge branch 'master' of https://github.com/thorstenMueller/deep-learning-german-tts 2021-01-23 18:50:15 +01:00
Thorsten Mueller
bdb3aa7d47 Added hifiGAN samples trained by SanjaESC 2021-01-23 18:15:56 +01:00
Thorsten Müller
f0c0f63ae1
Added nice guy SanjaESC to thanks section 2021-01-22 16:24:56 +01:00
Thorsten Müller
036c266ad7
Added Sebastian to thanks section - Thank you :-) 2021-01-16 08:24:10 +01:00
Thorsten Mueller
8e6137b3af Added wavegrad samples (training in progress) 2020-12-14 17:53:32 +01:00
Thorsten Mueller
9ee0353da4 Changed main and subheading for TensorFlowTTS 2020-12-02 12:23:20 +01:00
Thorsten Mueller
a99d4b6477 Added first samples for TensorFlowTTS 2020-12-02 12:14:16 +01:00
Thorsten Mueller
02020e54f7 added sample 05 for griffin lim. 2020-11-21 10:19:13 +01:00
Thorsten Mueller
5347394f3e Added Griffin Lim vocoder samples 2020-11-21 10:08:08 +01:00
Thorsten Mueller
c59d19e0a1 Added detail on glowtts training steps. 2020-11-17 22:04:09 +01:00
Thorsten Mueller
e45736f62d added sample05 with GlowTTS. 2020-11-17 21:53:08 +01:00
Thorsten Mueller
e96de3a095 fixed typo 2020-11-16 18:25:38 +01:00
Thorsten Mueller
eaead5cebe Rename to docs folder for Github pages 2020-11-16 17:28:20 +01:00
Thorsten Mueller
7b27bdac2d Added github page with index and sample wavs 2020-11-16 17:25:42 +01:00
Thorsten Müller
f55e16d0fc
fixed typo 2020-09-23 19:32:27 +02:00
83 changed files with 24467 additions and 107 deletions

2
.github/FUNDING.yml vendored Normal file

@ -0,0 +1,2 @@
# These are supported funding model platforms

28
CITATION.cff Normal file

@ -0,0 +1,28 @@
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: Thorsten-Voice
message: >-
Please cite Thorsten-Voice project if you use
datasets or trained TTS models.
type: dataset
authors:
- given-names: Thorsten
family-names: Müller
email: tm@thorsten-voice.de
- given-names: Dominik
family-names: Kreutz
repository-code: 'https://github.com/thorstenMueller/Thorsten-Voice'
url: 'https://www.Thorsten-Voice.de'
abstract: >-
A free to use, offline working, high quality german
TTS voice should be available for every project
without any license struggling.
keywords:
- Thorsten
- Voice
- Open
- German
- TTS
- Dataset

BIN
Logo_Thorsten-Voice.png Normal file

Binary file not shown.

Size: 25 KiB

324
README.md

@ -1,135 +1,245 @@
# Introduction ![Thorsten-Voice logo](Logo_Thorsten-Voice.png)
Many smart voice assistants like Amazon Alexa, Google Home, Apple Siri and Microsoft Cortana use cloud services to offer their (base) functionality.
As some people have privacy concerns using these services there are some (open source) projects trying to build offline and/or privacy aware alternatives. - [Project motivation](#motivation-for-thorsten-voice-project-speaking_head-speech_balloon)
- [Personal note](#some-personal-words-before-using-thorsten-voice)
But speech recognition and text synthesis still requires cloud services for providing these in a decent quality. - [**Thorsten** Voice Datasets](#voice-datasets)
- [Thorsten-21.02-neutral](#thorsten-2102-neutral)
- [Thorsten-21.06-emotional](#thorsten-2106-emotional)
- [Thorsten-22.10-neutral](#thorsten-2210-neutral)
# MyCroft AI - [**Thorsten** TTS-Models](#tts-models)
> https://mycroft.ai/ - [Thorsten-21.04-Tacotron2-DCA](#thorsten-2104-tacotron2-dca)
- [Thorsten-22.05-VITS](#thorsten-2205-vits)
- [Thorsten-22.08-Tacotron2-DDC](#thorsten-2208-tacotron2-ddc)
- [Other models](#other-models)
- [Public talks](#public-talks)
MyCroft is a company developing an opensource voice assistant with a very nice and active community. But the stt/tts parts are still cloud based (eg. google services), even if requests are anonymized by a mycroft proxy in between. But integration with locally hosted services such as deepspeech (stt) or mimic/tacotron (tts) is possible. - [My Youtube channel](#youtube-channel)
# Mozilla - [Special Thanks](#thanks-section)
Mozilla works on these really important aspects for free and open human machine voice interaction.
## STT - speech to text
> https://commonvoice.mozilla.org/
"STT" needs lots of audio training data by many speakers (women/men/kids) of all ages, dialects and in various audio quality levels. So any voice contribution for common voice project is highly welcome.
## TTS - text to speech
> https://github.com/mozilla/tts
"TTS" needs lots of clean recordings by one speaker to train a model. Mozilla is developing a software stack for proper model training based on tacotron2 papers.
# And?!
I want to make the most personal contribution i can give and contribute my personal voice (**german**) for TTS training to the community for free usage.
## Please read some personal words before downloading the dataset
I contribute my voice as a person believing in a world where all people are equal. No matter of gender, sexual orientation, religion, skin color and geocoordinates of birth location. A global world where everybody is warmly welcome on any place on this planet and open and free knowledge and education is available to everyone.
So hopefully my voice is used in this manner to make this world a better place for all of us :-).
**tl;dr** Please don't use for evil!
# Dataset "thorsten"
## Samples of my voice
To get an impression what my voice sounds to decide if it fits to your project i published some sample recordings, so no need to download complete dataset first.
* [Das Teilen eines Benutzerkontos ist strengstens untersagt.](./samples/original_recording/recorded_sample_01.wav )
* [Der Prophet spricht stets in Gleichnissen.](./samples/original_recording/recorded_sample_02.wav )
* [Bitte schmeißt euren Müll nicht einfach in die Walachei.](./samples/original_recording/recorded_sample_03.wav )
* [So etwas würde mir nie in den Sinn kommen.](./samples/original_recording/recorded_sample_04.wav )
* [Sie klettert auf einen Stein und nimmt eine Denkerpose ein.](./samples/original_recording/recorded_sample_05.wav )
* [Jede gute Küchenwaage hat eine Tara-Funktion.](./samples/original_recording/recorded_sample_06.wav )
* [Jeden Gedanken kannst du hier loswerden.](./samples/original_recording/recorded_sample_07.wav )
## Dataset information # Motivation for Thorsten-Voice project :speaking_head: :speech_balloon:
A **free** to use, **offline** working, **high quality** **german** **TTS** voice should be available for every project without any license struggling.
* ljspeech-1.1 structure <a href="https://twitter.com/intent/follow?screen_name=ThorstenVoice"><img src="https://img.shields.io/twitter/follow/ThorstenVoice?style=social&logo=twitter" alt="follow on Twitter"></a>
* 22.668 recorded phrases (wav files) [![YouTube Channel Subscribers](https://img.shields.io/youtube/channel/subscribers/UCjqqTVVBTsxpm0iOhQ1fp9g?style=social)](https://www.youtube.com/c/ThorstenMueller)
* more than 23 hours of pure audio [![Project website](https://img.shields.io/badge/Project_website-www.Thorsten--Voice.de-92a0c0)](https://www.Thorsten-Voice.de)
* samplerate 22.050Hz
* mono # Social media
* normalized to -24dB Please check and follow me on my social media profiles - Thank you.
* phrase length (min/avg/max): 2 / 52 / 180 chars
* no silence at beginning/ending | Platform | Link |
* avg spoken chars per second: 14 | --------------- | ------- |
* sentences with question mark: 2.780 | Youtube | [ThorstenVoice on Youtube](https://www.youtube.com/c/ThorstenMueller) |
* sentences with exclamation mark: 1.840 | Twitter | [ThorstenVoice on Twitter](https://twitter.com/ThorstenVoice) |
| Instagram | [ThorstenVoice on Instagram](https://www.instagram.com/thorsten_voice/) |
| LinkedIn | [Thorsten Müller on LinkedIn](https://www.linkedin.com/in/thorsten-m%C3%BCller-848a344/) |
# Some personal words before using **Thorsten-Voice**
> I contribute my voice as a person believing in a world where all people are equal. No matter of gender, sexual orientation, religion, skin color and geocoordinates of birth location. A global world where everybody is warmly welcome on any place on this planet and open and free knowledge and education is available to everyone. :earth_africa: (*Thorsten Müller*)
Please keep in mind that **I am not a professional voice talent**. I'm just a normal guy sharing his voice with the world.
# Voice-Datasets
Voice datasets are listed on Zenodo:
| Dataset | DOI Link |
| --------------- | ------- |
| Thorsten-21.02-neutral | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5525342.svg)](https://doi.org/10.5281/zenodo.5525342) |
| Thorsten-21.06-emotional | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5525023.svg)](https://doi.org/10.5281/zenodo.5525023) |
| Thorsten-22.10-neutral | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7265581.svg)](https://doi.org/10.5281/zenodo.7265581) |
## Thorsten-21.02-neutral
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5525342.svg)](https://doi.org/10.5281/zenodo.5525342)
```
@dataset{muller_thorsten_2021_5525342,
author = {Müller, Thorsten and
Kreutz, Dominik},
title = {Thorsten-Voice - "Thorsten-21.02-neutral" Dataset},
month = feb,
year = 2021,
note = {{Please use it to make the world a better place for
whole humankind.}},
publisher = {Zenodo},
version = {3.0},
doi = {10.5281/zenodo.5525342},
url = {https://doi.org/10.5281/zenodo.5525342}
}
```
> :speaking_head: **Listen to some audio recordings from this dataset [here](https://drive.google.com/drive/folders/1KVjGXG2ij002XRHb3fgFK4j0OEq1FsWm?usp=sharing).**
### Dataset summary
* Recorded by Thorsten Müller
* Optimized by Dominik Kreutz
* LJSpeech file and directory structure
* 22.668 recorded phrases (*wav files*)
* More than 23 hours of pure audio
* Samplerate 22.050Hz
* Mono
* Normalized to -24dB
* Phrase length (min/avg/max): 2 / 52 / 180 chars
* No silence at beginning/ending
* Avg spoken chars per second: 14
* Sentences with question mark: 2.780
* Sentences with exclamation mark: 1.840
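The figures in this summary can be spot-checked on a downloaded copy. The following is a minimal sketch, not the project's own tooling: it assumes an LJSpeech-style, pipe-delimited metadata file with the transcription in the second column, and the function names `phrase_length_stats` and `wav_matches_spec` are hypothetical.

```python
import statistics
import wave


def phrase_length_stats(metadata_path, delimiter="|", text_column=1):
    """Return (min, mean, max) transcription length in characters.

    Assumes one recording per line, e.g. "file_id|transcription".
    """
    lengths = []
    with open(metadata_path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split(delimiter)
            if len(fields) > text_column:
                lengths.append(len(fields[text_column]))
    if not lengths:
        raise ValueError("no transcriptions found in metadata file")
    return min(lengths), statistics.mean(lengths), max(lengths)


def wav_matches_spec(wav_path, sample_rate=22050, channels=1):
    """Check a WAV header against the sample rate and channel count listed above."""
    with wave.open(wav_path, "rb") as wav:
        return wav.getframerate() == sample_rate and wav.getnchannels() == channels
```

Both helpers use only the standard library, so they can run before any TTS framework is installed.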
### Dataset evolution
As described in the PDF document ([evolution of thorsten dataset](./EvolutionOfThorstenDataset.pdf)) this dataset consists of three recording phases.
* **Phase 1**: Recorded with a cheap usb microphone (*low quality*)
* **Phase 2**: Recorded with a good microphone (*good quality*)
* **Phase 3**: Recorded with same good microphone but longer phrases (> 100 chars) (*good quality*)
If you want to use only a subset of the dataset, the [recording quality](./RecordingQuality.csv) CSV file shows which files belong to which recording phase.
![text length vs. mean audio duration](./img/thorsten-de---datasetAnalysis1.png) ## Thorsten-21.06-emotional
![text length vs. median audio duration](./img/thorsten-de---datasetAnalysis2.png) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5525023.svg)](https://doi.org/10.5281/zenodo.5525023)
![text length vs. STD](./img/thorsten-de---datasetAnalysis3.png)
![text length vs. number instances](./img/thorsten-de---datasetAnalysis4.png)
![signal noise ratio](./img/thorsten-de---datasetAnalysis5.png)
![bokeh](./img/thorsten-de---datasetAnalysis6.png)
## Dataset evolution ```
As decribed in the pdf document ([evolution of thorsten dataset](./EvolutionOfThorstenDataset.pdf)) this dataset consists of three recording phases. @dataset{muller_thorsten_2021_5525023,
author = {Müller, Thorsten and
Kreutz, Dominik},
title = {{Thorsten-Voice - "Thorsten-21.06-emotional"
Dataset}},
month = jun,
year = 2021,
note = {{Please use it to make the world a better place for
whole humankind.}},
publisher = {Zenodo},
version = {2.0},
doi = {10.5281/zenodo.5525023},
url = {https://doi.org/10.5281/zenodo.5525023}
}
```
* phase1: Recorded with a cheap usb microphone All emotional recordings where recorded by myself and i tried to feel and pronounce that emotion even if the phrase context does not match that emotion. Example: I pronounced the sleepy recordings in the tone i have shortly before falling asleep.
* phase2: Recorded with a good microphone
* phase3: Recorded with same good microphone but longer phrases (> 100 chars)
If you wanna use just a dataset subset (phase1 and/or phase2 and/or phase3) you can see which files belong to which recording phase in [recording quality](./RecordingQuality.csv) csv file. ### Samples
Listen to the phrase "**Mist, wieder nichts geschafft.**" in the following emotions:
* :slightly_smiling_face: [Neutral](./samples/thorsten-21.06-emotional/neutral.wav)
* :nauseated_face: [Disgusted](./samples/thorsten-21.06-emotional/disgusted.wav)
* :angry: [Angry](./samples/thorsten-21.06-emotional/angry.wav)
* :grinning: [Amused](./samples/thorsten-21.06-emotional/amused.wav)
* :astonished: [Surprised](./samples/thorsten-21.06-emotional/surprised.wav)
* :pensive: [Sleepy](./samples/thorsten-21.06-emotional/sleepy.wav)
* :dizzy_face: [Drunk](./samples/thorsten-21.06-emotional/drunk.wav)
* 🤫 [Whispering](./samples/thorsten-21.06-emotional/whisper.wav)
### Dataset summary
* Recorded by Thorsten Müller
* Optimized by Dominik Kreutz
* 300 sentences * 8 emotions = 2.400 recordings
* Mono
* Samplerate 22.050Hz
* Normalized to -24dB
* No silence at beginning/ending
* Sentence length: 59 - 148 chars
## Download information ## Thorsten-22.10-neutral
> Download size: 2,7GB [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7265581.svg)](https://doi.org/10.5281/zenodo.7265581)
> :speaking_head: **Listen to some audio recordings from this dataset [here](https://drive.google.com/drive/folders/1dxoSo8Ktmh-5E0rSVqkq_Jm1r4sFnwJM?usp=sharing).**
Version | Description | Date | Link ```
------------ | ------------- | ------------- | ------------- @dataset{muller_thorsten_2022_7265581,
thorsten-de-v01 | Initial version | 2020-06-28 | [Google Drive Download v01](https://drive.google.com/file/d/1yKJM1LAOQpRVojKunD9r8WN_p5KzBxjc/view?usp=sharing) author = {Müller, Thorsten and
thorsten-de-v02 | normalized to -24dB and split metadata.csv into shuffeled metadata_train.csv and metadata_val.csv | 2020-08-22 | [Google Drive Download v02](https://drive.google.com/file/d/1mGWfG0s2V2TEg-AI2m85tze1m4pyeM7b/view?usp=sharing) Kreutz, Dominik},
title = {ThorstenVoice Dataset 2022.10},
month = oct,
year = 2022,
publisher = {Zenodo},
version = {1.0},
doi = {10.5281/zenodo.7265581},
url = {https://doi.org/10.5281/zenodo.7265581}
}
```
# TTS Models
## Thorsten-21.04-Tacotron2-DCA
This [TTS-model](https://drive.google.com/drive/folders/1m4RuffbvdOmQWnmy_Hmw0cZ_q0hj2o8B?usp=sharing) has been trained on [**Thorsten-21.02-neutral**](#thorsten-2102-neutral) dataset. The recommended trained Fullband-MelGAN Vocoder can be downloaded [here](https://drive.google.com/drive/folders/1hsfaconm4Yd9wPVyOtrXjWQs4ZAPoouY?usp=sharing).
Run the model:
* pip install TTS==0.5.0
* tts-server --model_name tts_models/de/thorsten/tacotron2-DCA
# Trained tacotron2 model "thorsten" ## Thorsten-22.05-VITS
If you trained a model on "thorsten" dataset please file an issue with some information on it. Sharing a trained model is highly appreciated. Trained on dataset **Thorsten-22.05-neutral**.
Audio samples are available on [Thorsten-Voice website](https://www.thorsten-voice.de/en/just-get-started/).
## Trained models (TODO) To run TTS server just follow these steps:
* pip install tts==0.7.1
* tts-server --model_name tts_models/de/thorsten/vits
* Open browser on http://localhost:5002 and enjoy playing
## Thorsten-22.08-Tacotron2-DDC
Trained on dataset [**Thorsten-22.05-neutral**](#thorsten-2205-neutral).
Audio samples are available on [Thorsten-Voice website](https://www.thorsten-voice.de/2022/08/14/welches-tts-modell-klingt-besser/).
To run TTS server just follow these steps:
* pip install tts==0.8.0
* tts-server --model_name tts_models/de/thorsten/tacotron2-DDC
* Open browser on http://localhost:5002 and enjoy playing
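Besides the browser UI, a running `tts-server` can also be queried from a script. The sketch below assumes the server exposes a GET endpoint at `/api/tts` that takes a `text` query parameter and returns WAV audio; please verify this against the TTS version you installed. The helper names `build_tts_url` and `synthesize` are hypothetical.

```python
import urllib.parse
import urllib.request


def build_tts_url(text, host="localhost", port=5002):
    """Build the request URL; the /api/tts path is an assumption, see above."""
    query = urllib.parse.urlencode({"text": text})
    return f"http://{host}:{port}/api/tts?{query}"


def synthesize(text, out_path, host="localhost", port=5002, timeout=60):
    """Request audio for `text` from a running tts-server and save it to `out_path`."""
    with urllib.request.urlopen(build_tts_url(text, host, port), timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return out_path
```

With the server from the steps above running, `synthesize("Sei eine Stimme, kein Echo.", "sample.wav")` would write the returned audio to `sample.wav`.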
## Other models
### Silero
You can use free AGPL-licensed models trained on the **Thorsten-21.02-neutral** dataset via the [silero-models](https://github.com/snakers4/silero-models/blob/master/models.yml) project.
* [Thorsten 16kHz](https://drive.google.com/drive/folders/1tR6w4kgRS2JJ1TWZhwoFuU04Xkgo6YAs?usp=sharing)
* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-models/blob/master/examples_tts.ipynb)
### ZDisket
[ZDisket](https://github.com/ZDisket/TensorVox) made a tool called TensorVox for setting up a TTS environment on Windows and included a German TTS model trained by [monatis](https://github.com/monatis/german-tts). Thanks for sharing that :thumbsup:. See it in action on [Youtube](https://youtu.be/tY6_xZnkv-A).
# Public talks
I really want to bring the topic "**Open Voice For An Open Future**" to wider public attention.
* I took part in a Linux User Group podcast about Mycroft AI and talked about my TTS efforts (*May 2021*).
* I was invited by [Yusuf](https://github.com/monatis/) from the Turkish TensorFlow community to talk on "How to make machines speak with your own voice". This talk was streamed live on Youtube and is available [here](https://www.youtube.com/watch?v=m-Uwb-Bg144&t=2303s). If you're interested in the slides shown, feel free to download my presentation [here](https://docs.google.com/presentation/d/1ynnw0ilKV3WwMSJHytrN3GXRiFr8x3r0DUimBm1y0LI/edit?usp=sharing) (*June 2021*).
* I was invited as a speaker at VoiceLunch Language & Linguistics on 03.01.2022. [Here are my slides](https://docs.google.com/presentation/d/1Gi6BmYHs7g4ZgdAiIKGBnBwZDCvJOD9DJxQOGlgds1o/edit?usp=sharing) (*January 2022*).
# Youtube channel
In summer 2021 I started sharing my lessons learned and experiences with open voice tech, especially **TTS**, on my little [Youtube channel](https://www.youtube.com/c/ThorstenMueller). If you check out and like my videos, I'd be happy to welcome you as a subscriber and member of my little Youtube community.
Folder | Date | Link | Description
------------ | ------------- | ------------- | -------------
thorsten-taco2-ddc-v0.1 | to do | to do | to do
# Feel free to file an issue if you ... # Feel free to file an issue if you ...
* have improvements on dataset * Use my TTS voice in your project(s)
* use my TTS voice in your project(s) * Want to share your trained "Thorsten" model
* want to share your trained "thorsten" model * Get to know about any abuse usage of my voice
* get to know about any abuse usage of my voice
# Special thanks # Thanks section
I want to thank all open source communities for providing great projects. ## Cool projects
* https://commonvoice.mozilla.org/
* https://coqui.ai/
* https://mycroft.ai/
* https://github.com/rhasspy/
Just to name some nice guys who joined me on this tts-roadtrip: ## Cool people
* [El-Tocino](https://github.com/el-tocino/)
* [Eren Gölge](https://github.com/erogol/)
* [Gras64](https://github.com/gras64/)
* [Kris Gesling](https://github.com/krisgesling/)
* [Nmstoker](https://github.com/nmstoker)
* [Othiele](https://discourse.mozilla.org/u/othiele/summary)
* [Repodiac](https://github.com/repodiac)
* [SanjaESC](https://github.com/SanjaESC)
* [Synesthesiam](https://github.com/synesthesiam/)
* eltocino (https://github.com/el-tocino/) ## Even more special people
* erogol (https://github.com/erogol/) Additionally, a really nice thanks for my dear colleague, Sebastian Kraus, for supporting me with audio recording equipment and for being the creative mastermind behind the logo design.
* gras64 (https://github.com/gras64/)
* krisgesling (https://github.com/krisgesling/)
* nmstoker (https://github.com/nmstoker)
* othiele (https://discourse.mozilla.org/u/othiele/summary)
* repodiac (https://github.com/repodiac)
And last but not least i want to say a huge thank you to a special guy who supported me on this journey right from the beginning. Not just with nice words, but with his time, audio optimization knowhow and finally his gpu computing power. And last but not least i want to say a **huge, huge thank you** to a special guy who supported me on this journey as a partner right from the beginning. Not just with nice words, but with his time, audio optimization knowhow and finally GPU power.
Without his amazing support this dataset (in it's current way) would not exists. **Thank you so much, dear **Dominik** ([@domcross](https://github.com/domcross/)) for being my partner on this journey.**
Thank you Dominik (@domcross / https://github.com/domcross/) Thorsten (*Twitter: @ThorstenVoice*)
# Links
* https://discourse.mozilla.org/t/contributing-my-german-voice-for-tts/48150
* https://community.mycroft.ai/
* https://github.com/MycroftAI/mimic-recording-studio
* https://voice.mozilla.org/
* https://github.com/mozilla/TTS
(https://github.com/repodiac/tit-for-tat/tree/master/thorsten-TTS)
* https://raw.githubusercontent.com/mozilla/voice-web/master/server/data/de/sentence-collector.txt
We'll hear us in future :-)
Thorsten

94
Youtube/train_vits_win.py Normal file

@ -0,0 +1,94 @@
import os
from multiprocessing import freeze_support

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsAudioConfig
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor


def main():
    # Checkpoints, logs and the phoneme cache are written next to this script.
    output_path = os.path.dirname(os.path.abspath(__file__))
    # output_path = "c:\\temp\\tts"

    # Dataset in LJSpeech layout; adjust the path to your local copy.
    dataset_config = BaseDatasetConfig(
        formatter="ljspeech",
        meta_file_train="metadata_small.csv",
        path="C:\\Users\\ThorstenVoice\\TTS-Training\\ThorstenVoice-Dataset_2022.10",
    )
    audio_config = VitsAudioConfig(
        sample_rate=22050, win_length=1024, hop_length=256, num_mels=80, mel_fmin=0, mel_fmax=None
    )

    config = VitsConfig(
        audio=audio_config,
        run_name="vits_thorsten-voice",
        batch_size=4,
        eval_batch_size=4,
        batch_group_size=5,
        num_loader_workers=1,
        num_eval_loader_workers=1,
        run_eval=True,
        test_delay_epochs=-1,
        epochs=1000,
        text_cleaner="phoneme_cleaners",
        use_phonemes=True,
        phoneme_language="de",
        phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
        compute_input_seq_cache=True,
        print_step=25,
        print_eval=True,
        mixed_precision=False,
        output_path=output_path,
        datasets=[dataset_config],
        cudnn_benchmark=False,
        # German test sentences synthesized during evaluation.
        test_sentences=[
            "Es hat mich viel Zeit gekostet ein Stimme zu entwickeln, jetzt wo ich sie habe werde ich nicht mehr schweigen.",
            "Sei eine Stimme, kein Echo.",
            "Es tut mir Leid David. Das kann ich leider nicht machen.",
            "Dieser Kuchen ist großartig. Er ist so lecker und feucht.",
            "Vor dem 22. November 1963.",
        ],
    )

    # INITIALIZE THE AUDIO PROCESSOR
    # Audio processor is used for feature extraction and audio I/O.
    # It mainly serves the dataloader and the training loggers.
    ap = AudioProcessor.init_from_config(config)

    # INITIALIZE THE TOKENIZER
    # Tokenizer is used to convert text to sequences of token IDs.
    # config is updated with the default characters if not defined in the config.
    tokenizer, config = TTSTokenizer.init_from_config(config)

    # LOAD DATA SAMPLES
    # Each sample is a list of ```[text, audio_file_path, speaker_name]```
    # You can define your custom sample loader returning the list of samples.
    # Or define your custom formatter and pass it to the `load_tts_samples`.
    # Check `TTS.tts.datasets.load_tts_samples` for more details.
    train_samples, eval_samples = load_tts_samples(
        dataset_config,
        eval_split=True,
        eval_split_max_size=config.eval_split_max_size,
        eval_split_size=config.eval_split_size,
    )

    # init model
    model = Vits(config, ap, tokenizer, speaker_manager=None)

    # init the trainer and 🚀
    trainer = Trainer(
        TrainerArgs(),
        config,
        output_path,
        model=model,
        train_samples=train_samples,
        eval_samples=eval_samples,
    )
    trainer.fit()
    print("Fertig!")  # German for "Done!"


if __name__ == '__main__':
    freeze_support()  # needed for Windows
    main()

1
docs/_config.yml Normal file

@ -0,0 +1 @@
theme: jekyll-theme-cayman

449
docs/audio_compare.md Normal file

@ -0,0 +1,449 @@
# Vocoder comparison based on the "thorsten" Tacotron 2 model
Here are audio samples produced with different vocoders. All spoken texts (*Samples 1 - 4*) are based on recordings in the dataset; the audio is generated from the trained Tacotron 2 model rather than from the "ground truth" spectrogram. Sample 5 is the opening of the fairy tale "Der Froschkönig" and was not recorded for the dataset.
## Sentences
* **Sample #01**: Eure Schoko-Bonbons sind sagenhaft lecker!
* **Sample #02**: Eure Tröte nervt.
* **Sample #03**: Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet.
* **Sample #04**: Euer Plan hat ja toll geklappt.
* *Sample #05: "In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön." (opening of "Der Froschkönig")*
# Ground truth
Original recordings from the "thorsten" dataset.
<dl>
<table>
<thead>
<tr>
<th>Sample</th>
<th>Text</th>
<th>Audio</th>
</tr>
</thead>
<tbody>
<tr>
<td>01</td>
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
<td><audio controls="" preload="none"><source src="samples/sample01-gt.wav"></audio></td>
</tr>
<tr>
<td>02</td>
<td>Eure Tröte nervt</td>
<td><audio controls="" preload="none"><source src="samples/sample02-gt.wav"></audio></td>
</tr>
<tr>
<td>03</td>
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
<td><audio controls="" preload="none"><source src="samples/sample03-gt.wav"></audio></td>
</tr>
<tr>
<td>04</td>
<td>Euer Plan hat ja toll geklappt.</td>
<td><audio controls="" preload="none"><source src="samples/sample04-gt.wav"></audio></td>
</tr>
</tbody>
</table>
</dl>
# Griffin Lim
> Model details: (todo: link)
> Tacotron2 + DDC: trained for 460k steps
<dl>
<table>
<thead>
<tr>
<th>Sample</th>
<th>Text</th>
<th>Audio</th>
</tr>
</thead>
<tbody>
<tr>
<td>01</td>
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
<td><audio controls="" preload="none"><source src="samples/sample01-griffin-lim.wav"></audio></td>
</tr>
<tr>
<td>02</td>
<td>Eure Tröte nervt</td>
<td><audio controls="" preload="none"><source src="samples/sample02-griffin-lim.wav"></audio></td>
</tr>
<tr>
<td>03</td>
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
<td><audio controls="" preload="none"><source src="samples/sample03-griffin-lim.wav"></audio></td>
</tr>
<tr>
<td>04</td>
<td>Euer Plan hat ja toll geklappt.</td>
<td><audio controls="" preload="none"><source src="samples/sample04-griffin-lim.wav"></audio></td>
</tr>
<tr>
<td>05</td>
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
<td><audio controls="" preload="none"><source src="samples/sample05-griffin-lim.wav"></audio></td>
</tr>
</tbody>
</table>
</dl>
# ParallelWaveGAN
> Details: [Olaf's notebook](https://colab.research.google.com/drive/15kJHTDTVxyIjxiZgqD1G_s5gUeVNLkfy?usp=sharing)
> Tacotron2 + DDC: trained for 360k steps; PWGAN vocoder: trained for 925k steps
<dl>
<table>
<thead>
<tr>
<th>Sample</th>
<th>Text</th>
<th>Audio</th>
</tr>
</thead>
<tbody>
<tr>
<td>01</td>
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
<td><audio controls="" preload="none"><source src="samples/sample01-pwgan.wav"></audio></td>
</tr>
<tr>
<td>02</td>
<td>Eure Tröte nervt</td>
<td><audio controls="" preload="none"><source src="samples/sample02-pwgan.wav"></audio></td>
</tr>
<tr>
<td>03</td>
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
<td><audio controls="" preload="none"><source src="samples/sample03-pwgan.wav"></audio></td>
</tr>
<tr>
<td>04</td>
<td>Euer Plan hat ja toll geklappt.</td>
<td><audio controls="" preload="none"><source src="samples/sample04-pwgan.wav"></audio></td>
</tr>
<tr>
<td>05</td>
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
<td><audio controls="" preload="none"><source src="samples/sample05-pwgan.wav"></audio></td>
</tr>
</tbody>
</table>
</dl>
# WaveGrad
> Tacotron2 + DDC: trained for 460k steps; WaveGrad vocoder: trained for 510k steps (incl. noise schedule)
<dl>
<table>
<thead>
<tr>
<th>Sample</th>
<th>Text</th>
<th>Audio</th>
</tr>
</thead>
<tbody>
<tr>
<td>01</td>
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
<td><audio controls="" preload="none"><source src="samples/sample01-wavegrad.wav"></audio></td>
</tr>
<tr>
<td>02</td>
<td>Eure Tröte nervt</td>
<td><audio controls="" preload="none"><source src="samples/sample02-wavegrad.wav"></audio></td>
</tr>
<tr>
<td>03</td>
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
<td><audio controls="" preload="none"><source src="samples/sample03-wavegrad.wav"></audio></td>
</tr>
<tr>
<td>04</td>
<td>Euer Plan hat ja toll geklappt.</td>
<td><audio controls="" preload="none"><source src="samples/sample04-wavegrad.wav"></audio></td>
</tr>
<tr>
<td>05</td>
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
<td><audio controls="" preload="none"><source src="samples/sample05-wavegrad.wav"></audio></td>
</tr>
</tbody>
</table>
</dl>
# HifiGAN
> Thanks to SanjaESC (https://github.com/SanjaESC) for training this model.
<dl>
<table>
<thead>
<tr>
<th>Sample</th>
<th>Text</th>
<th>Audio</th>
</tr>
</thead>
<tbody>
<tr>
<td>01</td>
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
<td><audio controls="" preload="none"><source src="samples/sample01-hifigan.wav"></audio></td>
</tr>
<tr>
<td>02</td>
<td>Eure Tröte nervt</td>
<td><audio controls="" preload="none"><source src="samples/sample02-hifigan.wav"></audio></td>
</tr>
<tr>
<td>03</td>
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
<td><audio controls="" preload="none"><source src="samples/sample03-hifigan.wav"></audio></td>
</tr>
<tr>
<td>04</td>
<td>Euer Plan hat ja toll geklappt.</td>
<td><audio controls="" preload="none"><source src="samples/sample04-hifigan.wav"></audio></td>
</tr>
<tr>
<td>05</td>
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
<td><audio controls="" preload="none"><source src="samples/sample05-hifigan.wav"></audio></td>
</tr>
</tbody>
</table>
</dl>
# VocGAN
> **These samples are based on "ground truth" and not on the Tacotron 2 model**
> 200 epochs / 284k training steps
<dl>
<table>
<thead>
<tr>
<th>Sample</th>
<th>Text</th>
<th>Audio</th>
</tr>
</thead>
<tbody>
<tr>
<td>01</td>
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
<td><audio controls="" preload="none"><source src="samples/sample01-vocgan.wav"></audio></td>
</tr>
<tr>
<td>02</td>
<td>Eure Tröte nervt</td>
<td><audio controls="" preload="none"><source src="samples/sample02-vocgan.wav"></audio></td>
</tr>
<tr>
<td>03</td>
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
<td><audio controls="" preload="none"><source src="samples/sample03-vocgan.wav"></audio></td>
</tr>
<tr>
<td>04</td>
<td>Euer Plan hat ja toll geklappt.</td>
<td><audio controls="" preload="none"><source src="samples/sample04-vocgan.wav"></audio></td>
</tr>
</tbody>
</table>
</dl>
# GlowTTS / Waveglow
> Details: [Synesthesiam's GitHub repository](https://github.com/rhasspy/de_larynx-thorsten)
> GlowTTS trained for 380k steps and the vocoder for 500k steps.
<dl>
<table>
<thead>
<tr>
<th>Sample</th>
<th>Text</th>
<th>Audio</th>
</tr>
</thead>
<tbody>
<tr>
<td>01</td>
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
<td><audio controls="" preload="none"><source src="samples/sample01-waveglow.wav"></audio></td>
</tr>
<tr>
<td>02</td>
<td>Eure Tröte nervt</td>
<td><audio controls="" preload="none"><source src="samples/sample02-waveglow.wav"></audio></td>
</tr>
<tr>
<td>03</td>
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
<td><audio controls="" preload="none"><source src="samples/sample03-waveglow.wav"></audio></td>
</tr>
<tr>
<td>04</td>
<td>Euer Plan hat ja toll geklappt.</td>
<td><audio controls="" preload="none"><source src="samples/sample04-waveglow.wav"></audio></td>
</tr>
<tr>
<td>05</td>
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
<td><audio controls="" preload="none"><source src="samples/sample05-waveglow.wav"></audio></td>
</tr>
</tbody>
</table>
</dl>
# TensorFlowTTS
## Multiband MelGAN
> Thanks [Monatis](https://github.com/monatis)
> Details: [Monatis's notebook](https://colab.research.google.com/drive/1W0nSFpsz32M0OcIkY9uMOiGrLTPKVhTy?usp=sharing#scrollTo=SCbWCChVkfnn)
> Tacotron 2 model trained for 80k steps, Multiband MelGAN for 800k steps.
<dl>
<table>
<thead>
<tr>
<th>Sample</th>
<th>Text</th>
<th>Audio</th>
</tr>
</thead>
<tbody>
<tr>
<td>01</td>
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
<td><audio controls="" preload="none"><source src="samples/sample01-TensorFlowTTS.wav"></audio></td>
</tr>
<tr>
<td>02</td>
<td>Eure Tröte nervt</td>
<td><audio controls="" preload="none"><source src="samples/sample02-TensorFlowTTS.wav"></audio></td>
</tr>
<tr>
<td>03</td>
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
<td><audio controls="" preload="none"><source src="samples/sample03-TensorFlowTTS.wav"></audio></td>
</tr>
<tr>
<td>04</td>
<td>Euer Plan hat ja toll geklappt.</td>
<td><audio controls="" preload="none"><source src="samples/sample04-TensorFlowTTS.wav"></audio></td>
</tr>
<tr>
<td>05</td>
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
<td><audio controls="" preload="none"><source src="samples/sample05-TensorFlowTTS.wav"></audio></td>
</tr>
</tbody>
</table>
</dl>
# Silero models
> Thanks [snakers4](https://github.com/snakers4/silero-models)
> Details: [Silero's notebook](https://colab.research.google.com/github/snakers4/silero-models/blob/master/examples_tts.ipynb#scrollTo=indirect-berry)
<dl>
<table>
<thead>
<tr>
<th>Sample</th>
<th>Text</th>
<th>Audio</th>
</tr>
</thead>
<tbody>
<tr>
<td>01</td>
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
<td><audio controls="" preload="none"><source src="samples/sample01-silero.wav"></audio></td>
</tr>
<tr>
<td>02</td>
<td>Eure Tröte nervt</td>
<td><audio controls="" preload="none"><source src="samples/sample02-silero.wav"></audio></td>
</tr>
<tr>
<td>03</td>
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
<td><audio controls="" preload="none"><source src="samples/sample03-silero.wav"></audio></td>
</tr>
<tr>
<td>04</td>
<td>Euer Plan hat ja toll geklappt.</td>
<td><audio controls="" preload="none"><source src="samples/sample04-silero.wav"></audio></td>
</tr>
<tr>
<td>05</td>
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
<td><audio controls="" preload="none"><source src="samples/sample05-silero.wav"></audio></td>
</tr>
</tbody>
</table>
</dl>
# Forward Tacotron
> Thanks [cschaefer26](https://github.com/as-ideas/ForwardTacotron)
> Config: Forward-Tacotron, trained to 300k, alpha set to 0.8, pretrained HifiGAN vocoder
<dl>
<table>
<thead>
<tr>
<th>Sample</th>
<th>Text</th>
<th>Audio</th>
</tr>
</thead>
<tbody>
<tr>
<td>01</td>
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
<td><audio controls="" preload="none"><source src="samples/sample01-ForwardTacotron-HifiGAN.wav"></audio></td>
</tr>
<tr>
<td>02</td>
<td>Eure Tröte nervt</td>
<td><audio controls="" preload="none"><source src="samples/sample02-ForwardTacotron-HifiGAN.wav"></audio></td>
</tr>
<tr>
<td>03</td>
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
<td><audio controls="" preload="none"><source src="samples/sample03-ForwardTacotron-HifiGAN.wav"></audio></td>
</tr>
<tr>
<td>04</td>
<td>Euer Plan hat ja toll geklappt.</td>
<td><audio controls="" preload="none"><source src="samples/sample04-ForwardTacotron-HifiGAN.wav"></audio></td>
</tr>
<tr>
<td>05</td>
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
<td><audio controls="" preload="none"><source src="samples/sample05-ForwardTacotron-HifiGAN.wav"></audio></td>
</tr>
</tbody>
</table>
</dl>

48
docs/index.md Normal file

@ -0,0 +1,48 @@
# Motivation
<span style="font-size:1.5em;font-weight:bold">
A free, high-quality German TTS voice that can be generated offline should be available to every project without licensing problems.
</span>
# No matter what your background is:
* Private hobby project
* Open-source/community project
* Education/research/science
* Commercial company
* ...
# No matter which field interests you:
* Smart voice assistants
* Navigation systems
* Smart homes
* Talking refrigerators
* Reading on-screen text aloud (accessibility)
* Interactive robotics
* ...
# Who we are
We are a small, motivated group of hobbyist TTS enthusiasts who named ourselves after a modified "Lord of the Rings" quote: "**Fellowership of free german tts**"
# Where we currently stand
We are still working on training models of even higher quality, but you can listen to the current "stable" state here:
* [Es ist im Moment klarer Himmel bei 18 Grad.](https://drive.google.com/file/d/1cDIq4QG6i60WjUYNT6fr2cpEjFQIi8w5/view?usp=sharing)
* [Ich verstehe das nicht, aber ich lerne jeden Tag neue Dinge.](https://drive.google.com/file/d/1kja_2RsFt6EmC33HTB4ozJyFlvh_DTFQ/view?usp=sharing)
* [Ich bin jetzt bereit.](https://drive.google.com/file/d/1GkplGH7LMJcPDpgFJocXHCjRln_ccVFs/view?usp=sharing)
* [Bitte warte einen Moment, bis ich fertig mit dem Booten bin.](https://drive.google.com/file/d/19Td-F14n_05F-squ3bNlt2BDE-NMFaq1/view?usp=sharing)
* [Mein Name ist Mycroft und ich bin funky.](https://drive.google.com/file/d/1dbyOyE7Oy8YdAsYqQ4vz4VJjiWIyc8oV/view?usp=sharing)
## Comparison of several vocoders
We are currently experimenting with different configurations to determine the best model. A comparison of the results so far can be found on this page.
> [Comparison of the different models](./audio_compare)
# Interested?
[Further details, downloads and acknowledgements can be found here.](https://github.com/thorstenMueller/deep-learning-german-tts "Dataset details and Thorsten model download")
---
<span style="font-size:1.5em;font-weight:bold">
We wish you lots of fun and success in realizing your projects :-)
</span>

Binary file not shown.


File diff suppressed because it is too large


@ -0,0 +1,51 @@
# Dockerfile for running Coqui TTS trainings in a Docker container on the NVIDIA Jetson platform.
# Based on the NVIDIA Jetson ML image; provided as is, without any warranty, by Thorsten Müller (https://twitter.com/ThorstenVoice) in August 2021.
FROM nvcr.io/nvidia/l4t-ml:r32.5.0-py3
RUN echo "deb https://repo.download.nvidia.com/jetson/common r32.4 main" >> /etc/apt/sources.list.d/nvidia-l4t-apt-source.list
RUN echo "deb https://repo.download.nvidia.com/jetson/t194 r32.4 main" >> /etc/apt/sources.list.d/nvidia-l4t-apt-source.list
RUN apt-get update -y
RUN apt-get install vim python-mecab libmecab-dev cuda-toolkit-10-2 libcudnn8 libcudnn8-dev libsndfile1-dev locales -y
# Setting some environment vars
ENV LLVM_CONFIG=/usr/bin/llvm-config-9
ENV PYTHONPATH=/coqui/TTS/
ENV LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH
# Skipping OPENBLAS_CORETYPE might cause an "Illegal instruction (core dumped)" error
ENV OPENBLAS_CORETYPE=ARMV8
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
LABEL com.nvidia.volumes.needed="nvidia_driver"
# Adjust locale setting to your personal needs
RUN sed -i '/de_DE.UTF-8/s/^# //g' /etc/locale.gen && \
    locale-gen
ENV LANG de_DE.UTF-8
ENV LANGUAGE de_DE:de
ENV LC_ALL de_DE.UTF-8
RUN mkdir /coqui
WORKDIR /coqui
ARG COQUI_BRANCH
RUN git clone -b ${COQUI_BRANCH} https://github.com/coqui-ai/TTS.git
WORKDIR /coqui/TTS
RUN pip3 install pip setuptools wheel --upgrade
RUN pip uninstall -y tensorboard tensorflow tensorflow-estimator nbconvert matplotlib
RUN pip install -r requirements.txt
RUN python3 ./setup.py develop
# Jupyter Notebook
RUN python3 -c "from notebook.auth.security import set_password; set_password('nvidia', '/root/.jupyter/jupyter_notebook_config.json')"
CMD /bin/bash -c "jupyter lab --ip 0.0.0.0 --port 8888 --allow-root"
# Build example:
# nvidia-docker build . -f Dockerfile.Jetson-Coqui --build-arg COQUI_BRANCH=v0.1.3 -t jetson-coqui
# Run example:
# nvidia-docker run -p 8888:8888 -d --shm-size 32g --gpus all -v /ssd/___prj/tts/dataset-july21:/coqui/TTS/data jetson-coqui
# Bash example:
# nvidia-docker exec -it <containerId> /bin/bash


@ -0,0 +1,157 @@
# This script generates the folder structure for ljspeech-1.1 processing from mimic-recording-studio database
# Changelog
# v1.0 - Initial release by Thorsten Müller (https://github.com/thorstenMueller/deep-learning-german-tts)
# v1.1 - Great improvements by Peter Schmalfeldt (https://github.com/manifestinteractive)
# - Audio processing with ffmpeg (mono and sample rate of 22,050 Hz)
# - Much better Python coding than my original version
# - Greater logging output to command line
# - See more details here: https://gist.github.com/manifestinteractive/6fd9be62d0ede934d4e1171e5e751aba
# - Thanks Peter, it's a great contribution :-)
# v1.2 - Added choice for choosing which recording session should be exported as LJSpeech
# v1.3 - Added parameter mrs_dir to pass directory of Mimic-Recording-Studio
# v1.4 - Script won't crash when audio recorded has been deleted on disk
# v1.5 - Added parameter "ffmpeg" to make converting with ffmpeg optional
from os.path import exists  # public equivalent of the internal genericpath.exists
import glob
import sqlite3
import os
import argparse
import sys
from shutil import copyfile
from shutil import rmtree
# Setup Directory Data
cwd = os.path.dirname(os.path.abspath(__file__))
output_dir = os.path.join(cwd, "dataset")
output_dir_audio = ""
output_dir_audio_temp=""
output_dir_speech = ""
# Create folders needed for ljspeech
def create_folders():
    global output_dir
    global output_dir_audio
    global output_dir_audio_temp
    global output_dir_speech
    print('→ Creating Dataset Folders')
    output_dir_speech = os.path.join(output_dir, "LJSpeech-1.1")
    # Delete existing folder if exists for clean run
    if os.path.exists(output_dir_speech):
        rmtree(output_dir_speech)
    output_dir_audio = os.path.join(output_dir_speech, "wavs")
    output_dir_audio_temp = os.path.join(output_dir_speech, "temp")
    # Create Clean Folders
    os.makedirs(output_dir_speech)
    os.makedirs(output_dir_audio)
    os.makedirs(output_dir_audio_temp)
def convert_audio():
    global output_dir_audio
    global output_dir_audio_temp
    recordings = len([name for name in os.listdir(output_dir_audio_temp) if os.path.isfile(os.path.join(output_dir_audio_temp, name))])
    print('→ Converting %s Audio Files to 22050 Hz, 16 Bit, Mono\n' % "{:,}".format(recordings))
    # Please use `pip install ffmpeg-python`
    import ffmpeg
    for idx, wav in enumerate(glob.glob(os.path.join(output_dir_audio_temp, "*.wav"))):
        percent = (idx + 1) / recordings
        print(' \033[96m%s\033[0m \033[2m%s / %s (%s)\033[0m ' % (os.path.basename(wav), "{:,}".format((idx + 1)), "{:,}".format(recordings), "{:.0%}".format(percent)))
        # Convert WAV file to required format
        (
            ffmpeg
            .input(wav)
            .output(os.path.join(output_dir_audio, os.path.basename(wav)), acodec='pcm_s16le', ac=1, ar=22050, loglevel='error')
            .overwrite_output()
            .run(capture_stdout=True)
        )
def copy_audio():
    global output_dir_audio
    # This branch runs when ffmpeg conversion is disabled, so the files are copied unchanged
    print('→ Copying recordings without ffmpeg conversion')
    recordings = len([name for name in os.listdir(output_dir_audio_temp) if os.path.isfile(os.path.join(output_dir_audio_temp, name))])
    print('→ Copy %s Audio Files to LJSpeech Dataset\n' % "{:,}".format(recordings))
    for wav in glob.glob(os.path.join(output_dir_audio_temp, "*.wav")):
        copyfile(wav, os.path.join(output_dir_audio, os.path.basename(wav)))
def create_meta_data(mrs_dir):
    print('→ Creating META Data')
    conn = sqlite3.connect(os.path.join(mrs_dir, "backend", "db", "mimicstudio.db"))
    c = conn.cursor()
    # Create metadata.csv for ljspeech
    metadata = open(os.path.join(output_dir_speech, "metadata.csv"), mode="w", encoding="utf8")
    # List available recording sessions
    user_models = c.execute('SELECT uuid, user_name from usermodel ORDER BY created_date DESC').fetchall()
    user_id = user_models[0][0]
    for row in user_models:
        print(row[0] + ' -> ' + row[1])
    user_answer = input('Please choose ID of recording session to export (default is newest session) [' + user_id + ']: ')
    if user_answer:
        user_id = user_answer
    # Pass user_id as a bound parameter instead of concatenating it into the SQL string
    for row in c.execute('SELECT audio_id, prompt, lower(prompt) FROM audiomodel WHERE user_id = ? ORDER BY length(prompt)', (user_id,)):
        source_file = os.path.join(mrs_dir, "backend", "audio_files", user_id, row[0] + ".wav")
        if exists(source_file):
            metadata.write(row[0] + "|" + row[1] + "|" + row[2] + "\n")
            copyfile(source_file, os.path.join(output_dir_audio_temp, row[0] + ".wav"))
        else:
            print("Wave file {} not found.".format(source_file))
    metadata.close()
    conn.close()
def cleanup():
    global output_dir_audio_temp
    # Remove Temp Folder
    rmtree(output_dir_audio_temp)
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--mrs_dir', required=True)
    # Parse the value explicitly; otherwise any non-empty string (even "False") would count as true
    parser.add_argument('--ffmpeg', required=False, default=False,
                        type=lambda value: value.lower() in ('1', 'true', 'yes'))
    args = parser.parse_args()
    if not os.path.isdir(os.path.join(args.mrs_dir, "backend")):
        sys.exit("Passed directory is not a valid Mimic-Recording-Studio main directory!")
    print('\n\033[48;5;22m MRS to LJ Speech Processor \033[0m\n')
    create_folders()
    create_meta_data(args.mrs_dir)
    if args.ffmpeg:
        convert_audio()
    else:
        copy_audio()
    cleanup()
    print('\n\033[38;5;86;1m✔\033[0m COMPLETE【ツ】\n')
if __name__ == '__main__':
    main()

27
helperScripts/README.md Normal file

@ -0,0 +1,27 @@
# A short collection of helpful scripts for dataset creation and TTS training
## MRS2LJSpeech
A Python script that takes recordings (filesystem and SQLite database) made with Mycroft Mimic-Recording-Studio (https://github.com/MycroftAI/mimic-recording-studio) and creates an audio-optimized dataset in the widely supported LJSpeech directory structure.
Peter Schmalfeldt (https://github.com/manifestinteractive) did an amazing job optimizing my original (quick and dirty) version of the script, so thank you, Peter :-)
See more details here: https://gist.github.com/manifestinteractive/6fd9be62d0ede934d4e1171e5e751aba#file-mrs2ljspeech-py
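The `metadata.csv` written by MRS2LJSpeech holds one pipe-separated line per recording (audio ID, prompt, lowercased prompt). As a minimal sketch, assuming exactly that three-column layout, the file can be read back as shown here. The helper name `read_ljspeech_metadata` and the sample ID `abc123` are made up for illustration and are not part of the script.

```python
import csv
import io


def read_ljspeech_metadata(handle):
    """Yield (audio_id, prompt, normalized_prompt) tuples from a metadata.csv stream."""
    # QUOTE_NONE keeps quotation marks inside prompts as literal characters.
    reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip empty or malformed lines
        yield row[0], row[1], row[2]


# Hypothetical sample content; a real file would be opened with encoding="utf8".
sample = io.StringIO("abc123|Eure Tröte nervt.|eure tröte nervt.\n")
entries = list(read_ljspeech_metadata(sample))
print(entries)
```

For a real dataset, pass `open("metadata.csv", encoding="utf8", newline="")` instead of the in-memory sample.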
## Dockerfile.Jetson-Coqui
> Add your user to the `docker` group so that you don't need sudo for every operation.
Thanks to NVIDIA for providing Docker images for the Jetson platform. I use the "machine learning (ML)" image as the base image for setting up a Coqui environment.
> You can use any branch or tag as the COQUI_BRANCH argument. v0.1.3 is just the current stable version.
Switch to the directory containing the Dockerfile and run `nvidia-docker build . -f Dockerfile.Jetson-Coqui --build-arg COQUI_BRANCH=v0.1.3 -t jetson-coqui` to build your container image. Once the build has finished, you can start a container from that image.
### Mapped volumes
We need to bring your dataset and configuration file into the container, so map a volume when starting it:
`nvidia-docker run -p 8888:8888 -d --shm-size 32g --gpus all -v [host path with dataset and config.json]:/coqui/TTS/data jetson-coqui`. Now we have a running container ready for Coqui TTS magic.
### Jupyter notebook
Coqui provides lots of useful Jupyter notebooks for dataset analysis. Once your container is up and running you should be able to call
### Opening a bash shell in the container
`nvidia-docker exec -it <containerId> /bin/bash` (`exec` expects the ID or name of the running container, not the image name). Now you're inside the container, and `ls /coqui/TTS/data` should show your dataset files.
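Beyond a quick `ls`, a few lines of Python can confirm that the mounted directory looks like an LJSpeech-style dataset. This is only a sketch: it assumes the mapped folder directly contains `metadata.csv` and a `wavs/` subdirectory, and the function name `check_ljspeech_layout` is hypothetical.

```python
import os
import tempfile


def check_ljspeech_layout(data_dir):
    """Return a list of problems found in an LJSpeech-style dataset directory."""
    problems = []
    if not os.path.isfile(os.path.join(data_dir, "metadata.csv")):
        problems.append("metadata.csv is missing")
    if not os.path.isdir(os.path.join(data_dir, "wavs")):
        problems.append("wavs/ directory is missing")
    return problems


# Demonstration on a throwaway directory; inside the container you would
# pass the mapped path (/coqui/TTS/data) instead.
with tempfile.TemporaryDirectory() as tmp:
    empty_result = check_ljspeech_layout(tmp)
    os.makedirs(os.path.join(tmp, "wavs"))
    open(os.path.join(tmp, "metadata.csv"), "w").close()
    complete_result = check_ljspeech_layout(tmp)

print(empty_result)
print(complete_result)
```

An empty list means both expected entries were found; it says nothing about whether the audio files themselves are valid.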


@ -0,0 +1,41 @@
# This script gets speech rate per audio recording from a voice dataset (ljspeech file and directory structure)
# Written by Thorsten Müller (deep-learning-german@gmx.net) and provided without any warranty.
# https://github.com/thorstenMueller/deep-learning-german-tts/
# https://twitter.com/ThorstenVoice
# Changelog:
# v0.1 - 26.09.2021 - Initial version
from os.path import exists  # public equivalent of the internal genericpath.exists
import os
import librosa
import csv
dataset_dir = "/home/thorsten/___dev/tts/dataset/Thorsten-neutral-Dec2021-44k/" # Directory where metadata.csv is in
out_csv_file = os.path.join(dataset_dir,"speech_rate_report.csv")
decimal_use_comma = True # False: Splitting decimal value with a dot (.); True: Comma (,)
out_csv = open(out_csv_file,"w")
out_csv.write("filename;audiolength_sec;number_chars;chars_per_sec;remove_from_dataset\n")
# Open metadata.csv file
with open(os.path.join(dataset_dir,"metadata.csv"), encoding="utf8") as csvfile:
    reader = csv.reader(csvfile, delimiter='|')
    for row in reader:
        wav_file = os.path.join(dataset_dir,"wavs",row[0] + ".wav")
        if exists(wav_file):
            # Gather values for report.csv output
            phrase_len = len(row[1]) - 1 # Subtract one so the sentence-final punctuation mark is not counted.
            duration = round(librosa.get_duration(filename=wav_file),2)
            char_per_sec = round(phrase_len / duration,2)
            if decimal_use_comma:
                duration = str(duration).replace(".",",")
                char_per_sec = str(char_per_sec).replace(".",",")
            out_csv.write(row[0] + ".wav;" + str(duration) + ";" + str(phrase_len) + ";" + str(char_per_sec) + ";no\n")
        else:
            print("File " + wav_file + " does not exist.")
out_csv.close()


@ -0,0 +1,48 @@
# This script removes recordings from an ljspeech file/directory structured dataset based on CSV file from "getDatasetSpeechRate"
# Written by Thorsten Müller (deep-learning-german@gmx.net) and provided without any warranty.
# https://github.com/thorstenMueller/deep-learning-german-tts/
# https://twitter.com/ThorstenVoice
# Changelog:
# v0.1 - 26.09.2021 - Initial version
import os
import csv
import shutil
dataset_dir = "/Users/thorsten/Downloads/thorsten-export-20210909/" # Directory where metadata.csv is in
subfolder_removed = "___removed"
in_csv_file = os.path.join(dataset_dir,"speech_rate_report.csv")
to_remove = []
# Open the speech rate report (in_csv_file already contains the full path)
with open(in_csv_file) as csvfile:
    reader = csv.reader(csvfile, delimiter=';')
    for row in reader:
        if row[4] == "yes":
            # Recording in that row should be removed from dataset
            to_remove.append(row[0])
            print("Recording " + row[0] + " will be removed from dataset.")
print("\n" + str(len(to_remove)) + " recordings have been marked for deletion.")
if len(to_remove) > 0:
    metadata_cleaned = open(os.path.join(dataset_dir,"metadata_cleaned.csv"),"w")
    # Create new subdirectory for removed wav files
    removed_dir = os.path.join(dataset_dir,subfolder_removed)
    if not os.path.exists(removed_dir):
        os.makedirs(removed_dir)
    # Remove lines from metadata.csv and move wav files to new subdirectory
    with open(os.path.join(dataset_dir,"metadata.csv")) as csvfile:
        reader = csv.reader(csvfile, delimiter='|')
        for row in reader:
            if (row[0] + ".wav") not in to_remove:
                metadata_cleaned.write(row[0] + "|" + row[1] + "|" + row[2] + "\n")
            else:
                # Move recording to new subfolder
                shutil.move(os.path.join(dataset_dir,"wavs",row[0] + ".wav"),removed_dir)
    metadata_cleaned.close()
