forked from extern/Thorsten-Voice
Compare commits
90 Commits
githubPage
...
master
Author | SHA1 | Date | |
---|---|---|---|
|
f13bcaf63e | ||
|
04c5683194 | ||
|
50e09d49bf | ||
|
b0afed75f4 | ||
|
9b7b4c6836 | ||
|
aba10bc64a | ||
|
07e85b3905 | ||
|
e08d50d6bb | ||
|
e691aa4ee3 | ||
|
625f73e986 | ||
|
de1802f8ce | ||
|
f0500309d6 | ||
|
41c91b9865 | ||
|
fcb1e705a9 | ||
|
b8802db4f8 | ||
|
b00c768343 | ||
|
3b0b4f898f | ||
|
2106fc6b00 | ||
|
e4ff3ce04a | ||
|
f408508cd7 | ||
|
6b4cfb41d4 | ||
|
521dd33483 | ||
|
6efb25310a | ||
|
5654397f3e | ||
|
b5ec9ef991 | ||
|
77ad01d4ff | ||
|
c35507b1f7 | ||
|
b536dfd958 | ||
|
29238f2a31 | ||
|
8c5f4503f3 | ||
|
2ff7e3961b | ||
|
1221713314 | ||
|
d3225b48f8 | ||
|
33c030f844 | ||
|
2daabae53e | ||
|
1d445b09f8 | ||
|
2853f111dc | ||
|
7540606247 | ||
|
0b9e929ce0 | ||
|
bc06fa923f | ||
|
f19144b085 | ||
|
251c093ad4 | ||
|
f505fd38df | ||
|
3e09ae8615 | ||
|
2ed2413dda | ||
|
51c5f55bbd | ||
|
4f875ac591 | ||
|
2ea44ede87 | ||
|
ba60fc57d4 | ||
|
9e68d99ee7 | ||
|
7172604eed | ||
|
58dece7c55 | ||
|
c81f374aca | ||
|
2c6aca780b | ||
|
68e60f2a92 | ||
|
a3b0dde296 | ||
|
28d81a0fb2 | ||
|
12c6d26dbd | ||
|
4c06db69dd | ||
|
bae96a75a5 | ||
|
1313520064 | ||
|
e2ecf68c13 | ||
|
c8a5e1082e | ||
|
40aae591d7 | ||
|
4f722e96a9 | ||
|
7e1530b742 | ||
|
647786be6c | ||
|
00685a008d | ||
|
e5481a82a6 | ||
|
2d1428cd13 | ||
|
df55a19ae2 | ||
|
9585b73cc3 | ||
|
70158ba7c8 | ||
|
e1e9f8666a | ||
|
cca10c215e | ||
|
09705597b8 | ||
|
bdb3aa7d47 | ||
|
f0c0f63ae1 | ||
|
036c266ad7 | ||
|
8e6137b3af | ||
|
9ee0353da4 | ||
|
a99d4b6477 | ||
|
02020e54f7 | ||
|
5347394f3e | ||
|
c59d19e0a1 | ||
|
e45736f62d | ||
|
e96de3a095 | ||
|
eaead5cebe | ||
|
7b27bdac2d | ||
|
f55e16d0fc |
2
.github/FUNDING.yml
vendored
Normal file
@ -0,0 +1,2 @@
|
||||
# These are supported funding model platforms
|
||||
|
28
CITATION.cff
Normal file
@ -0,0 +1,28 @@
|
||||
# This CITATION.cff file was generated with cffinit.
|
||||
# Visit https://bit.ly/cffinit to generate yours today!
|
||||
|
||||
cff-version: 1.2.0
|
||||
title: Thorsten-Voice
|
||||
message: >-
|
||||
Please cite Thorsten-Voice project if you use
|
||||
datasets or trained TTS models.
|
||||
type: dataset
|
||||
authors:
|
||||
- given-names: Thorsten
|
||||
family-names: Müller
|
||||
email: tm@thorsten-voice.de
|
||||
- given-names: Dominik
|
||||
family-names: Kreutz
|
||||
repository-code: 'https://github.com/thorstenMueller/Thorsten-Voice'
|
||||
url: 'https://www.Thorsten-Voice.de'
|
||||
abstract: >-
|
||||
A free to use, offline working, high quality german
|
||||
TTS voice should be available for every project
|
||||
without any license struggling.
|
||||
keywords:
|
||||
- Thorsten
|
||||
- Voice
|
||||
- Open
|
||||
- German
|
||||
- TTS
|
||||
- Dataset
|
BIN
EvolutionOfThorstenDataset.pdf
Executable file
Binary file not shown.
BIN
Logo_Thorsten-Voice.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 25 KiB |
245
README.md
Normal file
@ -0,0 +1,245 @@
|
||||
![Thorsten-Voice logo](Logo_Thorsten-Voice.png)
|
||||
|
||||
- [Project motivation](#motivation-for-thorsten-voice-project-speaking_head-speech_balloon)
|
||||
|
||||
- [Personal note](#some-personal-words-before-using-thorsten-voice)
|
||||
|
||||
- [**Thorsten** Voice Datasets](#voice-datasets)
|
||||
- [Thorsten-21.02-neutral](#thorsten-2102-neutral)
|
||||
- [Thorsten-21.06-emotional](#thorsten-2106-emotional)
|
||||
- [Thorsten-22.10-neutral](#thorsten-2210-neutral)
|
||||
|
||||
- [**Thorsten** TTS-Models](#tts-models)
|
||||
- [Thorsten-21.04-Tacotron2-DCA](#thorsten-2104-tacotron2-dca)
|
||||
- [Thorsten-22.05-VITS](#thorsten-2205-vits)
|
||||
- [Thorsten-22.08-Tacotron2-DDC](#thorsten-2208-tacotron2-ddc)
|
||||
- [Other models](#other-models)
|
||||
|
||||
- [Public talks](#public-talks)
|
||||
|
||||
- [My Youtube channel](#youtube-channel)
|
||||
|
||||
- [Special Thanks](#thanks-section)
|
||||
|
||||
|
||||
# Motivation for Thorsten-Voice project :speaking_head: :speech_balloon:
|
||||
A **free** to use, **offline** working, **high quality** **German** **TTS** voice should be available for every project without any licensing struggles.
|
||||
|
||||
<a href="https://twitter.com/intent/follow?screen_name=ThorstenVoice"><img src="https://img.shields.io/twitter/follow/ThorstenVoice?style=social&logo=twitter" alt="follow on Twitter"></a>
|
||||
[![YouTube Channel Subscribers](https://img.shields.io/youtube/channel/subscribers/UCjqqTVVBTsxpm0iOhQ1fp9g?style=social)](https://www.youtube.com/c/ThorstenMueller)
|
||||
[![Project website](https://img.shields.io/badge/Project_website-www.Thorsten--Voice.de-92a0c0)](https://www.Thorsten-Voice.de)
|
||||
|
||||
# Social media
|
||||
Please check out and follow my social media profiles. Thank you.
|
||||
|
||||
| Platform | Link |
|
||||
| --------------- | ------- |
|
||||
| Youtube | [ThorstenVoice on Youtube](https://www.youtube.com/c/ThorstenMueller) |
|
||||
| Twitter | [ThorstenVoice on Twitter](https://twitter.com/ThorstenVoice) |
|
||||
| Instagram | [ThorstenVoice on Instagram](https://www.instagram.com/thorsten_voice/) |
|
||||
| LinkedIn | [Thorsten Müller on LinkedIn](https://www.linkedin.com/in/thorsten-m%C3%BCller-848a344/) |
|
||||
|
||||
# Some personal words before using **Thorsten-Voice**
|
||||
> I contribute my voice as a person believing in a world where all people are equal. No matter of gender, sexual orientation, religion, skin color and geocoordinates of birth location. A global world where everybody is warmly welcome on any place on this planet and open and free knowledge and education is available to everyone. :earth_africa: (*Thorsten Müller*)
|
||||
|
||||
Please keep in mind that **I am not a professional voice talent**. I'm just a normal guy sharing his voice with the world.
|
||||
|
||||
# Voice-Datasets
|
||||
Voice datasets are listed on Zenodo:
|
||||
| Dataset | DOI Link |
|
||||
| --------------- | ------- |
|
||||
| Thorsten-21.02-neutral | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5525342.svg)](https://doi.org/10.5281/zenodo.5525342) |
|
||||
| Thorsten-21.06-emotional | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5525023.svg)](https://doi.org/10.5281/zenodo.5525023) |
|
||||
| Thorsten-22.10-neutral | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7265581.svg)](https://doi.org/10.5281/zenodo.7265581) |
|
||||
|
||||
## Thorsten-21.02-neutral
|
||||
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5525342.svg)](https://doi.org/10.5281/zenodo.5525342)
|
||||
|
||||
```
|
||||
@dataset{muller_thorsten_2021_5525342,
|
||||
author = {Müller, Thorsten and
|
||||
Kreutz, Dominik},
|
||||
title = {Thorsten-Voice - "Thorsten-21.02-neutral" Dataset},
|
||||
month = feb,
|
||||
year = 2021,
|
||||
note = {{Please use it to make the world a better place for
|
||||
whole humankind.}},
|
||||
publisher = {Zenodo},
|
||||
version = {3.0},
|
||||
doi = {10.5281/zenodo.5525342},
|
||||
url = {https://doi.org/10.5281/zenodo.5525342}
|
||||
}
|
||||
```
|
||||
|
||||
> :speaking_head: **Listen to some audio recordings from this dataset [here](https://drive.google.com/drive/folders/1KVjGXG2ij002XRHb3fgFK4j0OEq1FsWm?usp=sharing).**
|
||||
|
||||
### Dataset summary
|
||||
* Recorded by Thorsten Müller
* Optimized by Dominik Kreutz
* LJSpeech file and directory structure
* 22,668 recorded phrases (*wav files*)
* More than 23 hours of pure audio
* Sample rate: 22,050 Hz
* Mono
* Normalized to -24 dB
* Phrase length (min/avg/max): 2 / 52 / 180 chars
* No silence at beginning/ending
* Avg spoken chars per second: 14
* Sentences with question mark: 2,780
* Sentences with exclamation mark: 1,840
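
The phrase-length figures in this summary can be recomputed from the dataset's metadata file. The following is a minimal sketch, assuming the LJSpeech convention of a pipe-separated `metadata.csv` whose first field is the file ID and whose second field is the transcript. The function name `phrase_length_stats` and the two sample rows are made up for illustration and are not part of the dataset.

```python
import csv
import io


def phrase_length_stats(metadata_lines):
    """Return (min, avg, max) transcript length in characters.

    Assumes LJSpeech-style rows: file_id|transcript[|normalized transcript].
    """
    reader = csv.reader(metadata_lines, delimiter="|", quoting=csv.QUOTE_NONE)
    lengths = [len(row[1]) for row in reader if len(row) >= 2]
    if not lengths:
        raise ValueError("no transcripts found")
    return min(lengths), sum(lengths) / len(lengths), max(lengths)


# Tiny in-memory stand-in for metadata.csv (illustrative rows, not real dataset entries).
sample = io.StringIO("a001|Hallo Welt.\na002|Ja.\n")
shortest, average, longest = phrase_length_stats(sample)
print(shortest, round(average), longest)
```

For the real file, pass an open file handle instead of the `StringIO` object, for example `open("metadata.csv", encoding="utf-8", newline="")`.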
|
||||
|
||||
### Dataset evolution
|
||||
As described in the PDF document ([evolution of thorsten dataset](./EvolutionOfThorstenDataset.pdf)) this dataset consists of three recording phases.
|
||||
|
||||
* **Phase 1**: Recorded with a cheap USB microphone (*low quality*)
|
||||
* **Phase 2**: Recorded with a good microphone (*good quality*)
|
||||
* **Phase 3**: Recorded with same good microphone but longer phrases (> 100 chars) (*good quality*)
|
||||
|
||||
If you want to use only a subset of the dataset, the [recording quality](./RecordingQuality.csv) CSV file shows which files belong to which recording phase.
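
Selecting such a subset could look roughly like the sketch below. The column layout of `RecordingQuality.csv` is not shown here, so everything about it is an assumption: the sketch supposes one row per recording with a file ID in the first column and a phase label in the second. The helper name `files_in_phase`, the labels, and the demo rows are hypothetical.

```python
import csv
import io


def files_in_phase(rows, phase, id_column=0, phase_column=1):
    """Collect file IDs whose phase label equals `phase`.

    Column positions are assumptions; adjust them to the real CSV layout.
    Rows that are too short (e.g. blank lines) are skipped.
    """
    needed = max(id_column, phase_column)
    return [
        row[id_column]
        for row in rows
        if len(row) > needed and row[phase_column].strip() == phase
    ]


# Illustrative rows only; the real file may use other labels, columns, or a header.
demo = io.StringIO("file_a,Phase1\nfile_b,Phase2\nfile_c,Phase2\n")
print(files_in_phase(csv.reader(demo), "Phase2"))
```

The resulting ID list can then be used to filter the metadata file before training.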
|
||||
|
||||
|
||||
## Thorsten-21.06-emotional
|
||||
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5525023.svg)](https://doi.org/10.5281/zenodo.5525023)
|
||||
|
||||
```
|
||||
@dataset{muller_thorsten_2021_5525023,
|
||||
author = {Müller, Thorsten and
|
||||
Kreutz, Dominik},
|
||||
title = {{Thorsten-Voice - "Thorsten-21.06-emotional"
|
||||
Dataset}},
|
||||
month = jun,
|
||||
year = 2021,
|
||||
note = {{Please use it to make the world a better place for
|
||||
whole humankind.}},
|
||||
publisher = {Zenodo},
|
||||
version = {2.0},
|
||||
doi = {10.5281/zenodo.5525023},
|
||||
url = {https://doi.org/10.5281/zenodo.5525023}
|
||||
}
|
||||
```
|
||||
|
||||
All emotional recordings were recorded by me, and I tried to feel and express each emotion even if the content of the phrase does not match it. Example: I spoke the sleepy recordings in the tone I have shortly before falling asleep.
|
||||
|
||||
### Samples
|
||||
Listen to the phrase "**Mist, wieder nichts geschafft.**" in the following emotions.
|
||||
|
||||
* :slightly_smiling_face: [Neutral](./samples/thorsten-21.06-emotional/neutral.wav)
|
||||
* :nauseated_face: [Disgusted](./samples/thorsten-21.06-emotional/disgusted.wav)
|
||||
* :angry: [Angry](./samples/thorsten-21.06-emotional/angry.wav)
|
||||
* :grinning: [Amused](./samples/thorsten-21.06-emotional/amused.wav)
|
||||
* :astonished: [Surprised](./samples/thorsten-21.06-emotional/surprised.wav)
|
||||
* :pensive: [Sleepy](./samples/thorsten-21.06-emotional/sleepy.wav)
|
||||
* :dizzy_face: [Drunk](./samples/thorsten-21.06-emotional/drunk.wav)
|
||||
* 🤫 [Whispering](./samples/thorsten-21.06-emotional/whisper.wav)
|
||||
### Dataset summary
|
||||
* Recorded by Thorsten Müller
* Optimized by Dominik Kreutz
* 300 sentences * 8 emotions = 2,400 recordings
* Mono
* Sample rate: 22,050 Hz
* Normalized to -24 dB
* No silence at beginning/ending
* Sentence length: 59 - 148 chars
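
If you mix these recordings with your own, it can help to check that every file matches the format listed above (mono, 22,050 Hz). The following is a minimal sketch using only Python's standard `wave` module. The function name `matches_dataset_format` is made up for illustration, and the demo file is generated silence, not a dataset recording.

```python
import os
import tempfile
import wave


def matches_dataset_format(path, expected_rate=22050, expected_channels=1):
    """Return True if the WAV file has the expected sample rate and channel count."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == expected_rate
            and wav.getnchannels() == expected_channels
        )


# Write one second of 16-bit silence as a stand-in for a dataset recording.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 16-bit samples (an assumption for the demo file)
    wav.setframerate(22050)
    wav.writeframes(b"\x00\x00" * 22050)

print(matches_dataset_format(demo_path))
```

The check covers only the sample rate and channel count; it does not measure loudness normalization or leading/trailing silence.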
|
||||
|
||||
|
||||
## Thorsten-22.10-neutral
|
||||
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7265581.svg)](https://doi.org/10.5281/zenodo.7265581)
|
||||
> :speaking_head: **Listen to some audio recordings from this dataset [here](https://drive.google.com/drive/folders/1dxoSo8Ktmh-5E0rSVqkq_Jm1r4sFnwJM?usp=sharing).**
|
||||
|
||||
```
|
||||
@dataset{muller_thorsten_2022_7265581,
|
||||
author = {Müller, Thorsten and
|
||||
Kreutz, Dominik},
|
||||
title = {ThorstenVoice Dataset 2022.10},
|
||||
month = oct,
|
||||
year = 2022,
|
||||
publisher = {Zenodo},
|
||||
version = {1.0},
|
||||
doi = {10.5281/zenodo.7265581},
|
||||
url = {https://doi.org/10.5281/zenodo.7265581}
|
||||
}
|
||||
```
|
||||
|
||||
# TTS Models
|
||||
|
||||
## Thorsten-21.04-Tacotron2-DCA
|
||||
This [TTS-model](https://drive.google.com/drive/folders/1m4RuffbvdOmQWnmy_Hmw0cZ_q0hj2o8B?usp=sharing) has been trained on [**Thorsten-21.02-neutral**](#thorsten-2102-neutral) dataset. The recommended trained Fullband-MelGAN Vocoder can be downloaded [here](https://drive.google.com/drive/folders/1hsfaconm4Yd9wPVyOtrXjWQs4ZAPoouY?usp=sharing).
|
||||
|
||||
Run the model:
* `pip install TTS==0.5.0`
* `tts-server --model_name tts_models/de/thorsten/tacotron2-DCA`
|
||||
|
||||
|
||||
## Thorsten-22.05-VITS
|
||||
Trained on dataset **Thorsten-22.05-neutral**.
|
||||
Audio samples are available on [Thorsten-Voice website](https://www.thorsten-voice.de/en/just-get-started/).
|
||||
|
||||
To run the TTS server, just follow these steps:
* `pip install tts==0.7.1`
* `tts-server --model_name tts_models/de/thorsten/vits`
* Open a browser at http://localhost:5002 and enjoy playing
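
Besides the browser page, a running server can also be queried from a script. The following is a minimal sketch, assuming the server exposes an `/api/tts` endpoint that takes the text as a `text` query parameter and returns WAV audio. That endpoint is an assumption that may differ between TTS versions, so check it against the version you installed. The helper names `tts_url` and `synthesize` are made up for this illustration.

```python
import urllib.parse
import urllib.request


def tts_url(text, host="localhost", port=5002):
    """Build the request URL for the assumed /api/tts endpoint."""
    query = urllib.parse.urlencode({"text": text})
    return f"http://{host}:{port}/api/tts?{query}"


def synthesize(text, out_path, **kwargs):
    """Fetch synthesized audio from a running tts-server and save it to out_path."""
    with urllib.request.urlopen(tts_url(text, **kwargs)) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)


# Building the URL works without a server; the request itself needs one running.
print(tts_url("Sei eine Stimme, kein Echo."))
# With the server running: synthesize("Sei eine Stimme, kein Echo.", "echo.wav")
```

Keeping URL construction separate from the network call makes the encoding of umlauts and punctuation easy to inspect before any request is sent.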
|
||||
|
||||
## Thorsten-22.08-Tacotron2-DDC
|
||||
Trained on dataset [**Thorsten-22.05-neutral**](#thorsten-2205-neutral).
|
||||
Audio samples are available on the [Thorsten-Voice website](https://www.thorsten-voice.de/2022/08/14/welches-tts-modell-klingt-besser/).
|
||||
|
||||
To run the TTS server, just follow these steps:
* `pip install tts==0.8.0`
* `tts-server --model_name tts_models/de/thorsten/tacotron2-DDC`
* Open a browser at http://localhost:5002 and enjoy playing
|
||||
|
||||
|
||||
## Other models
|
||||
### Silero
|
||||
|
||||
You can use free AGPL-licensed models trained on the **Thorsten-21.02-neutral** dataset via the [silero-models](https://github.com/snakers4/silero-models/blob/master/models.yml) project.
|
||||
|
||||
* [Thorsten 16kHz](https://drive.google.com/drive/folders/1tR6w4kgRS2JJ1TWZhwoFuU04Xkgo6YAs?usp=sharing)
|
||||
* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-models/blob/master/examples_tts.ipynb)
|
||||
|
||||
### ZDisket
|
||||
[ZDisket](https://github.com/ZDisket/TensorVox) made a tool called TensorVox for setting up a TTS environment on Windows and included a German TTS model trained by [monatis](https://github.com/monatis/german-tts). Thanks for sharing that :thumbsup:. See it in action on [Youtube](https://youtu.be/tY6_xZnkv-A).
|
||||
|
||||
# Public talks
|
||||
I really want to bring the topic "**Open Voice For An Open Future**" to wider public attention.
|
||||
|
||||
* I was part of a Linux User Group podcast about Mycroft AI and talked about my TTS efforts there (*May 2021*).
|
||||
* I was invited by [Yusuf](https://github.com/monatis/) from the Turkish TensorFlow community to talk about "How to make machines speak with your own voice". This talk was streamed live on Youtube and is available [here](https://www.youtube.com/watch?v=m-Uwb-Bg144&t=2303s). If you're interested in the slides shown, feel free to download my presentation [here](https://docs.google.com/presentation/d/1ynnw0ilKV3WwMSJHytrN3GXRiFr8x3r0DUimBm1y0LI/edit?usp=sharing) (*June 2021*).
|
||||
* I was invited as a speaker at VoiceLunch language & linguistics on 03.01.2022. [Here are my slides](https://docs.google.com/presentation/d/1Gi6BmYHs7g4ZgdAiIKGBnBwZDCvJOD9DJxQOGlgds1o/edit?usp=sharing) (*January 2022*).
|
||||
|
||||
# Youtube channel
|
||||
In summer 2021 I started sharing my lessons learned and experiences with open voice tech, especially **TTS**, on my little [Youtube channel](https://www.youtube.com/c/ThorstenMueller). If you check out and like my videos, I'd be happy to welcome you as a subscriber and member of my little Youtube community.
|
||||
|
||||
|
||||
# Feel free to file an issue if you ...
|
||||
* Use my TTS voice in your project(s)
|
||||
* Want to share your trained "Thorsten" model
|
||||
* Become aware of any abuse of my voice
|
||||
|
||||
# Thanks section
|
||||
## Cool projects
|
||||
* https://commonvoice.mozilla.org/
|
||||
* https://coqui.ai/
|
||||
* https://mycroft.ai/
|
||||
* https://github.com/rhasspy/
|
||||
|
||||
## Cool people
|
||||
* [El-Tocino](https://github.com/el-tocino/)
|
||||
* [Eren Gölge](https://github.com/erogol/)
|
||||
* [Gras64](https://github.com/gras64/)
|
||||
* [Kris Gesling](https://github.com/krisgesling/)
|
||||
* [Nmstoker](https://github.com/nmstoker)
|
||||
* [Othiele](https://discourse.mozilla.org/u/othiele/summary)
|
||||
* [Repodiac](https://github.com/repodiac)
|
||||
* [SanjaESC](https://github.com/SanjaESC)
|
||||
* [Synesthesiam](https://github.com/synesthesiam/)
|
||||
|
||||
## Even more special people
|
||||
Additionally, a big thank-you to my dear colleague, Sebastian Kraus, for supporting me with audio recording equipment and for being the creative mastermind behind the logo design.
|
||||
|
||||
And last but not least, I want to say a **huge, huge thank you** to a special guy who supported me on this journey as a partner right from the beginning. Not just with nice words, but with his time, audio optimization know-how and, finally, GPU power.
|
||||
|
||||
**Thank you so much, dear Dominik ([@domcross](https://github.com/domcross/)), for being my partner on this journey.**
|
||||
|
||||
Thorsten (*Twitter: @ThorstenVoice*)
|
22944
RecordingQuality.csv
Normal file
File diff suppressed because it is too large
Load Diff
94
Youtube/train_vits_win.py
Normal file
@ -0,0 +1,94 @@
|
||||
import os
from multiprocessing import freeze_support

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsAudioConfig
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor


def main():
    # Training output (checkpoints, logs, phoneme cache) goes next to this script.
    output_path = os.path.dirname(os.path.abspath(__file__))
    # output_path = "c:\\temp\\tts"

    # The dataset uses the LJSpeech layout; adjust the path to your local copy.
    dataset_config = BaseDatasetConfig(
        formatter="ljspeech",
        meta_file_train="metadata_small.csv",
        path="C:\\Users\\ThorstenVoice\\TTS-Training\\ThorstenVoice-Dataset_2022.10",
    )
    audio_config = VitsAudioConfig(
        sample_rate=22050, win_length=1024, hop_length=256, num_mels=80, mel_fmin=0, mel_fmax=None
    )

    config = VitsConfig(
        audio=audio_config,
        run_name="vits_thorsten-voice",
        batch_size=4,
        eval_batch_size=4,
        batch_group_size=5,
        num_loader_workers=1,
        num_eval_loader_workers=1,
        run_eval=True,
        test_delay_epochs=-1,
        epochs=1000,
        text_cleaner="phoneme_cleaners",
        use_phonemes=True,
        phoneme_language="de",
        phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
        compute_input_seq_cache=True,
        print_step=25,
        print_eval=True,
        mixed_precision=False,
        output_path=output_path,
        datasets=[dataset_config],
        cudnn_benchmark=False,
        test_sentences=[
            "Es hat mich viel Zeit gekostet ein Stimme zu entwickeln, jetzt wo ich sie habe werde ich nicht mehr schweigen.",
            "Sei eine Stimme, kein Echo.",
            "Es tut mir Leid David. Das kann ich leider nicht machen.",
            "Dieser Kuchen ist großartig. Er ist so lecker und feucht.",
            "Vor dem 22. November 1963.",
        ],
    )

    # INITIALIZE THE AUDIO PROCESSOR
    # Audio processor is used for feature extraction and audio I/O.
    # It mainly serves to the dataloader and the training loggers.
    ap = AudioProcessor.init_from_config(config)

    # INITIALIZE THE TOKENIZER
    # Tokenizer is used to convert text to sequences of token IDs.
    # config is updated with the default characters if not defined in the config.
    tokenizer, config = TTSTokenizer.init_from_config(config)

    # LOAD DATA SAMPLES
    # Each sample is a list of ```[text, audio_file_path, speaker_name]```
    # You can define your custom sample loader returning the list of samples.
    # Or define your custom formatter and pass it to the `load_tts_samples`.
    # Check `TTS.tts.datasets.load_tts_samples` for more details.
    train_samples, eval_samples = load_tts_samples(
        dataset_config,
        eval_split=True,
        eval_split_max_size=config.eval_split_max_size,
        eval_split_size=config.eval_split_size,
    )

    # init model
    model = Vits(config, ap, tokenizer, speaker_manager=None)

    # init the trainer and 🚀
    trainer = Trainer(
        TrainerArgs(),
        config,
        output_path,
        model=model,
        train_samples=train_samples,
        eval_samples=eval_samples,
    )
    trainer.fit()
    print("Fertig!")  # "Done!"


if __name__ == '__main__':
    freeze_support()  # needed for Windows
    main()
|
@ -6,7 +6,7 @@ Here are audio samples with different vocoders. All spoken texts (*S
|
||||
* **Sample #02**: Eure Tröte nervt.
|
||||
* **Sample #03**: Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet.
|
||||
* **Sample #04**: Euer Plan hat ja toll geklappt.
|
||||
* *Sample #05: "In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön ..." (beginning of "The Frog Prince")*
|
||||
* *Sample #05: "In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön." (beginning of "The Frog Prince")*
|
||||
|
||||
# Ground truth
|
||||
Original recordings from the "thorsten" dataset.
|
||||
@ -52,10 +52,50 @@ Original recordings from the "thorsten" dataset.
|
||||
> Model details: (todo: link)
> Tacotron2 + DDC: trained for 460k steps
|
||||
|
||||
# ParallelWaveGAN
|
||||
> Tacotron2 + DDC: trained for 360k steps, PWGAN vocoder: trained for 925k steps
|
||||
<dl>
|
||||
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Sample</th>
|
||||
<th>Text</th>
|
||||
<th>Audio</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>01</td>
|
||||
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample01-griffin-lim.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>02</td>
|
||||
<td>Eure Tröte nervt</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample02-griffin-lim.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>03</td>
|
||||
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample03-griffin-lim.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>04</td>
|
||||
<td>Euer Plan hat ja toll geklappt.</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample04-griffin-lim.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>05</td>
|
||||
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample05-griffin-lim.wav"></audio></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
</dl>
|
||||
|
||||
# ParallelWaveGAN
|
||||
> Details: [Notebook by Olaf](https://colab.research.google.com/drive/15kJHTDTVxyIjxiZgqD1G_s5gUeVNLkfy?usp=sharing)
> Tacotron2 + DDC: trained for 360k steps, PWGAN vocoder: trained for 925k steps
|
||||
<dl>
|
||||
|
||||
<table>
|
||||
@ -89,7 +129,7 @@ Original recordings from the "thorsten" dataset.
|
||||
</tr>
|
||||
<tr>
|
||||
<td>05</td>
|
||||
<td>Beginning of "The Frog Prince"</td>
|
||||
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample05-pwgan.wav"></audio></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
@ -99,10 +139,89 @@ Original recordings from the "thorsten" dataset.
|
||||
|
||||
|
||||
# WaveGrad
|
||||
> todo
|
||||
> Tacotron2 + DDC: trained for 460k steps, WaveGrad vocoder: trained for 510k steps (incl. noise schedule)
|
||||
<dl>
|
||||
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Sample</th>
|
||||
<th>Text</th>
|
||||
<th>Audio</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>01</td>
|
||||
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample01-wavegrad.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>02</td>
|
||||
<td>Eure Tröte nervt</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample02-wavegrad.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>03</td>
|
||||
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample03-wavegrad.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>04</td>
|
||||
<td>Euer Plan hat ja toll geklappt.</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample04-wavegrad.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>05</td>
|
||||
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample05-wavegrad.wav"></audio></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
</dl>
|
||||
|
||||
# HifiGAN
|
||||
> todo
|
||||
> Thanks to SanjaESC (https://github.com/SanjaESC) for training this model.
|
||||
<dl>
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Sample</th>
|
||||
<th>Text</th>
|
||||
<th>Audio</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>01</td>
|
||||
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample01-hifigan.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>02</td>
|
||||
<td>Eure Tröte nervt</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample02-hifigan.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>03</td>
|
||||
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample03-hifigan.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>04</td>
|
||||
<td>Euer Plan hat ja toll geklappt.</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample04-hifigan.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>05</td>
|
||||
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample05-hifigan.wav"></audio></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
</dl>
|
||||
|
||||
# VocGAN
|
||||
> **These examples are based on "ground truth" and not on the Tacotron 2 model**
|
||||
@ -146,6 +265,7 @@ Original recordings from the "thorsten" dataset.
|
||||
|
||||
# GlowTTS / Waveglow
|
||||
> Details: [GitHub repository by Synesthesiam](https://github.com/rhasspy/de_larynx-thorsten)
> GlowTTS trained for 380k steps and the vocoder for 500k steps.
|
||||
|
||||
<dl>
|
||||
|
||||
@ -178,6 +298,151 @@ Original recordings from the "thorsten" dataset.
|
||||
<td>Euer Plan hat ja toll geklappt.</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample04-waveglow.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>05</td>
|
||||
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample05-waveglow.wav"></audio></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
</dl>
|
||||
|
||||
|
||||
|
||||
# TensorFlowTTS
|
||||
## Multiband MelGAN
|
||||
> Thanks [Monatis](https://github.com/monatis)
|
||||
> Details: [Notebook by Monatis](https://colab.research.google.com/drive/1W0nSFpsz32M0OcIkY9uMOiGrLTPKVhTy?usp=sharing#scrollTo=SCbWCChVkfnn)
> Taco2 model trained for 80k steps, Multiband MelGAN for 800k steps.
|
||||
|
||||
<dl>
|
||||
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Sample</th>
|
||||
<th>Text</th>
|
||||
<th>Audio</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>01</td>
|
||||
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample01-TensorFlowTTS.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>02</td>
|
||||
<td>Eure Tröte nervt</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample02-TensorFlowTTS.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>03</td>
|
||||
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample03-TensorFlowTTS.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>04</td>
|
||||
<td>Euer Plan hat ja toll geklappt.</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample04-TensorFlowTTS.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>05</td>
|
||||
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample05-TensorFlowTTS.wav"></audio></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
</dl>
|
||||
|
||||
|
||||
# Silero models
|
||||
> Thanks [snakers4](https://github.com/snakers4/silero-models)
|
||||
> Details: [Notebook by Silero](https://colab.research.google.com/github/snakers4/silero-models/blob/master/examples_tts.ipynb#scrollTo=indirect-berry)
|
||||
|
||||
<dl>
|
||||
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Sample</th>
|
||||
<th>Text</th>
|
||||
<th>Audio</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>01</td>
|
||||
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample01-silero.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>02</td>
|
||||
<td>Eure Tröte nervt</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample02-silero.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>03</td>
|
||||
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample03-silero.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>04</td>
|
||||
<td>Euer Plan hat ja toll geklappt.</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample04-silero.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>05</td>
|
||||
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample05-silero.wav"></audio></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
</dl>
|
||||
|
||||
# Forward Tacotron
|
||||
> Thanks [cschaefer26](https://github.com/as-ideas/ForwardTacotron)
|
||||
> Config: Forward-Tacotron, trained to 300k, alpha set to 0.8, pretrained HifiGAN vocoder
|
||||
|
||||
<dl>
|
||||
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Sample</th>
|
||||
<th>Text</th>
|
||||
<th>Audio</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>01</td>
|
||||
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample01-ForwardTacotron-HifiGAN.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>02</td>
|
||||
<td>Eure Tröte nervt</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample02-ForwardTacotron-HifiGAN.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>03</td>
|
||||
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample03-ForwardTacotron-HifiGAN.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>04</td>
|
||||
<td>Euer Plan hat ja toll geklappt.</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample04-ForwardTacotron-HifiGAN.wav"></audio></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>05</td>
|
||||
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
|
||||
<td><audio controls="" preload="none"><source src="samples/sample05-ForwardTacotron-HifiGAN.wav"></audio></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
|
@ -32,9 +32,10 @@ Wir arbeiten weiterhin daran qualitativ noch bessere Modell zu trainieren, aber
|
||||
* [Bitte warte einen Moment, bis ich fertig mit dem Booten bin.](https://drive.google.com/file/d/19Td-F14n_05F-squ3bNlt2BDE-NMFaq1/view?usp=sharing)
|
||||
* [Mein Name ist Mycroft und ich bin funky.](https://drive.google.com/file/d/1dbyOyE7Oy8YdAsYqQ4vz4VJjiWIyc8oV/view?usp=sharing)
|
||||
|
||||
|
||||
## Comparison of several vocoders
|
||||
We are currently experimenting with different configurations to find the best model. You can find a comparison of the results so far on this page.
|
||||
> [Vergleich der unterschiedlichen Modell](./audio_compare)
|
||||
> [Vergleich der unterschiedlichen Modelle](./audio_compare)
|
||||
|
||||
# Interested?
|
||||
[Further details, downloads, and acknowledgements can be found here.](https://github.com/thorstenMueller/deep-learning-german-tts "Dataset details and Thorsten model download")
|
||||
|
@ -1,60 +0,0 @@
|
||||
# tl;dr
|
||||
---
|
||||
|
||||
<span style="font-family:Papyrus; font-size:3em;color:green"> A free, high-quality German voice that can be generated locally!</span>
|
||||
|
||||
---
|
||||
|
||||
|
||||
# A free German voice
|
||||
Even though the heading sounds very much like a political statement, this is about a completely different topic.
|
||||
|
||||
Voice-based control of machines is currently gaining importance rapidly. Many people already know this kind of interaction from everyday use of smartphones or so-called smart assistants such as Apple Siri, Google Home, or Amazon Alexa.
|
||||
|
||||
Alongside many advantages, the systems of the big vendors also bring some serious privacy drawbacks (mandatory cloud use, lack of control over your own data, concerns about "eavesdroppers", ...). As a result, there are people who would like to use the advantages of such systems but refrain from doing so because of these privacy concerns.
|
||||
|
||||
# Alternatives to (_online speech generation_) from Amazon, Google, Apple, ...
|
||||
Fortunately, alternatives (including open source ones) are emerging to counter the market power of the "big players". Some of them are:
|
||||
|
||||
* Mozilla voice projects
|
||||
* MyCroft AI
|
||||
|
||||
These (and other) communities are working on offering such alternatives. However, the focus is often on the English language, which is naturally not helpful when interacting with German-speaking users.
|
||||
|
||||
# Free German TTS - what is that?
|
||||
Most people have probably asked a personal smart assistant (or smartphone) about the weather, appointments, or something similar at some point.
|
||||
If so, and the device gave an easily understandable German answer, "cloud resources" were used in that case.
|
||||
|
||||
Of course Amazon, Google, and Apple know about the good quality of their artificial voices and are, among other reasons, therefore not willing to make them available for private, free offline use.
|
||||
And that is exactly one of the big problems for (open source) alternatives. Even if large parts can be run free of charge and offline, when it comes to speech output they depend on the "big players" as soon as they have a certain quality standard.
|
||||
|
||||
# How and whom does this project help
|
||||
The free German dataset contains more than 23 hours of recordings based on free texts. The TTS models trained with machine learning are built on it.
|
||||
It can be used **without licensing concerns** and is therefore open to everyone interested. For example:
|
||||
|
||||
* Open source projects/communities
|
||||
* Education/research/science
|
||||
* Commercial use cases
|
||||
|
||||
Small communities in particular should get the chance to ship offline TTS functionality with their projects.
|
||||
|
||||
# Examples
|
||||
* [Es ist im Moment klarer Himmel bei 18 Grad.](https://drive.google.com/file/d/1cDIq4QG6i60WjUYNT6fr2cpEjFQIi8w5/view?usp=sharing)
|
||||
* [Ich verstehe das nicht, aber ich lerne jeden Tag neue Dinge.](https://drive.google.com/file/d/1kja_2RsFt6EmC33HTB4ozJyFlvh_DTFQ/view?usp=sharing)
|
||||
* [Ich bin jetzt bereit.](https://drive.google.com/file/d/1GkplGH7LMJcPDpgFJocXHCjRln_ccVFs/view?usp=sharing)
|
||||
* [Bitte warte einen Moment, bis ich fertig mit dem Booten bin.](https://drive.google.com/file/d/19Td-F14n_05F-squ3bNlt2BDE-NMFaq1/view?usp=sharing)
|
||||
* [Mein Name ist MyCroft und ich bin funky.](https://drive.google.com/file/d/1dbyOyE7Oy8YdAsYqQ4vz4VJjiWIyc8oV/view?usp=sharing)
|
||||
|
||||
# Current status
|
||||
We (a group of friendly TTS enthusiasts) know that the current model still has a lot of room for improvement in quality. But we remain motivated and hope to provide even better models in the future.
|
||||
|
||||
# Last but not least
|
||||
Since I have little influence over which statements will be made with my voice in the future, I would like to state a few points that are personally important to me:
|
||||
|
||||
I share my voice as a person who believes that all people are equal, regardless of gender, sexual orientation, religion, skin color, or the geocoordinates of their birth. I believe in a world where every person is warmly welcome at any time and where education and knowledge are freely available to everyone.
|
||||
|
||||
# Links
|
||||
* https://github.com/thorstenMueller/deep-learning-german-tts/
|
||||
* https://medium.com/@thorsten_Mueller/why-ive-chosen-to-donate-my-german-voice-for-mankind-177beeb91675
|
||||
* TODO GitHub links of fellow contributors
|
||||
* TODO Publish model (TTS server package)
|
BIN (normal files, binary files not shown):
docs/samples/sample01-ForwardTacotron-HifiGAN.wav
docs/samples/sample01-TensorFlowTTS.wav
docs/samples/sample01-griffin-lim.wav
docs/samples/sample01-hifigan.wav
docs/samples/sample01-silero.wav
docs/samples/sample01-wavegrad.wav
docs/samples/sample02-ForwardTacotron-HifiGAN.wav.wav
docs/samples/sample02-TensorFlowTTS.wav
docs/samples/sample02-griffin-lim.wav
docs/samples/sample02-hifigan.wav
docs/samples/sample02-silero.wav
docs/samples/sample02-wavegrad.wav
docs/samples/sample03-ForwardTacotron-HifiGAN.wav
docs/samples/sample03-TensorFlowTTS.wav
docs/samples/sample03-griffin-lim.wav
docs/samples/sample03-hifigan.wav
docs/samples/sample03-silero.wav
docs/samples/sample03-wavegrad.wav
docs/samples/sample04-ForwardTacotron-HifiGAN.wav.wav
docs/samples/sample04-TensorFlowTTS.wav
docs/samples/sample04-griffin-lim.wav
docs/samples/sample04-hifigan.wav
docs/samples/sample04-silero.wav
docs/samples/sample04-wavegrad.wav
docs/samples/sample05-ForwardTacotron-HifiGAN.wav
docs/samples/sample05-TensorFlowTTS.wav
docs/samples/sample05-griffin-lim.wav
docs/samples/sample05-hifigan.wav
docs/samples/sample05-silero.wav
docs/samples/sample05-wavegrad.wav
23304
german_corpus-mimic_recording_studio.csv
Normal file
File diff suppressed because it is too large
51
helperScripts/Dockerfile.Jetson-Coqui
Normal file
@ -0,0 +1,51 @@
|
||||
# Dockerfile for running Coqui TTS trainings in a Docker container on the NVIDIA Jetson platform.
|
||||
# Based on the NVIDIA Jetson ML image; provided as is, without any warranty, by Thorsten Müller (https://twitter.com/ThorstenVoice) in August 2021
|
||||
|
||||
FROM nvcr.io/nvidia/l4t-ml:r32.5.0-py3
|
||||
|
||||
RUN echo "deb https://repo.download.nvidia.com/jetson/common r32.4 main" >> /etc/apt/sources.list.d/nvidia-l4t-apt-source.list
|
||||
RUN echo "deb https://repo.download.nvidia.com/jetson/t194 r32.4 main" >> /etc/apt/sources.list.d/nvidia-l4t-apt-source.list
|
||||
|
||||
RUN apt-get update -y
|
||||
RUN apt-get install vim python-mecab libmecab-dev cuda-toolkit-10-2 libcudnn8 libcudnn8-dev libsndfile1-dev locales -y
|
||||
|
||||
# Setting some environment vars
|
||||
ENV LLVM_CONFIG=/usr/bin/llvm-config-9
|
||||
ENV PYTHONPATH=/coqui/TTS/
|
||||
ENV LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH
|
||||
# Skipping OPENBLAS_CORETYPE might cause an "Illegal instruction (core dumped)" error
|
||||
ENV OPENBLAS_CORETYPE=ARMV8
|
||||
|
||||
ENV NVIDIA_VISIBLE_DEVICES all
|
||||
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
|
||||
LABEL com.nvidia.volumes.needed="nvidia_driver"
|
||||
|
||||
# Adjust locale setting to your personal needs
|
||||
RUN sed -i '/de_DE.UTF-8/s/^# //g' /etc/locale.gen && \
|
||||
locale-gen
|
||||
ENV LANG de_DE.UTF-8
|
||||
ENV LANGUAGE de_DE:de
|
||||
ENV LC_ALL de_DE.UTF-8
|
||||
|
||||
RUN mkdir /coqui
|
||||
WORKDIR /coqui
|
||||
|
||||
ARG COQUI_BRANCH
|
||||
RUN git clone -b ${COQUI_BRANCH} https://github.com/coqui-ai/TTS.git
|
||||
WORKDIR /coqui/TTS
|
||||
RUN pip3 install pip setuptools wheel --upgrade
|
||||
RUN pip uninstall -y tensorboard tensorflow tensorflow-estimator nbconvert matplotlib
|
||||
RUN pip install -r requirements.txt
|
||||
RUN python3 ./setup.py develop
|
||||
|
||||
# Jupyter Notebook
|
||||
RUN python3 -c "from notebook.auth.security import set_password; set_password('nvidia', '/root/.jupyter/jupyter_notebook_config.json')"
|
||||
CMD /bin/bash -c "jupyter lab --ip 0.0.0.0 --port 8888 --allow-root"
|
||||
|
||||
|
||||
# Build example:
|
||||
# nvidia-docker build . -f Dockerfile.Jetson-Coqui --build-arg COQUI_BRANCH=v0.1.3 -t jetson-coqui
|
||||
# Run example:
|
||||
# nvidia-docker run -p 8888:8888 -d --shm-size 32g --gpus all -v /ssd/___prj/tts/dataset-july21:/coqui/TTS/data jetson-coqui
|
||||
# Bash example:
|
||||
# nvidia-docker exec -it <containerId> /bin/bash
|
157
helperScripts/MRS2LJSpeech.py
Normal file
@ -0,0 +1,157 @@
|
||||
# This script generates the folder structure for ljspeech-1.1 processing from mimic-recording-studio database
|
||||
|
||||
# Changelog
|
||||
# v1.0 - Initial release by Thorsten Müller (https://github.com/thorstenMueller/deep-learning-german-tts)
|
||||
# v1.1 - Great improvements by Peter Schmalfeldt (https://github.com/manifestinteractive)
|
||||
# - Audio processing with ffmpeg (mono and sample rate of 22050 Hz)
|
||||
# - Much better Python coding than my original version
|
||||
# - Greater logging output to command line
|
||||
# - See more details here: https://gist.github.com/manifestinteractive/6fd9be62d0ede934d4e1171e5e751aba
|
||||
# - Thanks Peter, it's a great contribution :-)
|
||||
# v1.2 - Added choice for choosing which recording session should be exported as LJSpeech
|
||||
# v1.3 - Added parameter mrs_dir to pass directory of Mimic-Recording-Studio
|
||||
# v1.4 - Script won't crash when a recorded audio file has been deleted from disk
|
||||
# v1.5 - Added parameter "ffmpeg" to make converting with ffmpeg optional
|
||||
|
||||
from os.path import exists
|
||||
import glob
|
||||
import sqlite3
|
||||
import os
|
||||
import argparse
|
||||
import sys
|
||||
|
||||
from shutil import copyfile
|
||||
from shutil import rmtree
|
||||
|
||||
# Setup Directory Data
|
||||
cwd = os.path.dirname(os.path.abspath(__file__))
|
||||
output_dir = os.path.join(cwd, "dataset")
|
||||
output_dir_audio = ""
|
||||
output_dir_audio_temp=""
|
||||
output_dir_speech = ""
|
||||
|
||||
# Create folders needed for ljspeech
|
||||
def create_folders():
|
||||
global output_dir
|
||||
global output_dir_audio
|
||||
global output_dir_audio_temp
|
||||
global output_dir_speech
|
||||
|
||||
print('→ Creating Dataset Folders')
|
||||
|
||||
output_dir_speech = os.path.join(output_dir, "LJSpeech-1.1")
|
||||
|
||||
# Delete existing folder if exists for clean run
|
||||
if os.path.exists(output_dir_speech):
|
||||
rmtree(output_dir_speech)
|
||||
|
||||
output_dir_audio = os.path.join(output_dir_speech, "wavs")
|
||||
output_dir_audio_temp = os.path.join(output_dir_speech, "temp")
|
||||
|
||||
# Create Clean Folders
|
||||
os.makedirs(output_dir_speech)
|
||||
os.makedirs(output_dir_audio)
|
||||
os.makedirs(output_dir_audio_temp)
|
||||
|
||||
def convert_audio():
|
||||
global output_dir_audio
|
||||
global output_dir_audio_temp
|
||||
|
||||
recordings = len([name for name in os.listdir(output_dir_audio_temp) if os.path.isfile(os.path.join(output_dir_audio_temp,name))])
|
||||
|
||||
print('→ Converting %s Audio Files to 22050 Hz, 16 Bit, Mono\n' % "{:,}".format(recordings))
|
||||
|
||||
# Please use `pip install ffmpeg-python`
|
||||
import ffmpeg
|
||||
|
||||
for idx, wav in enumerate(glob.glob(os.path.join(output_dir_audio_temp, "*.wav"))):
|
||||
|
||||
percent = (idx + 1) / recordings
|
||||
|
||||
print('› \033[96m%s\033[0m \033[2m%s / %s (%s)\033[0m ' % (os.path.basename(wav), "{:,}".format((idx + 1)), "{:,}".format(recordings), "{:.0%}".format(percent)))
|
||||
|
||||
# Convert WAV file to required format
|
||||
(ffmpeg
|
||||
.input(wav)
|
||||
.output(os.path.join(output_dir_audio, os.path.basename(wav)), acodec='pcm_s16le', ac=1, ar=22050, loglevel='error')
|
||||
.overwrite_output()
|
||||
.run(capture_stdout=True)
|
||||
)
|
||||
|
||||
|
||||
def copy_audio():
|
||||
global output_dir_audio
|
||||
|
||||
print('→ Skipping ffmpeg conversion, copying recordings unchanged')
|
||||
recordings = len([name for name in os.listdir(output_dir_audio_temp) if os.path.isfile(os.path.join(output_dir_audio_temp,name))])
|
||||
|
||||
print('→ Copy %s Audio Files to LJSpeech Dataset\n' % "{:,}".format(recordings))
|
||||
|
||||
for idx, wav in enumerate(glob.glob(os.path.join(output_dir_audio_temp, "*.wav"))):
|
||||
copyfile(wav,os.path.join(output_dir_audio, os.path.basename(wav)))
|
||||
|
||||
def create_meta_data(mrs_dir):
|
||||
print('→ Creating META Data')
|
||||
|
||||
conn = sqlite3.connect(os.path.join(mrs_dir, "backend", "db", "mimicstudio.db"))
|
||||
c = conn.cursor()
|
||||
|
||||
# Create metadata.csv for ljspeech
|
||||
metadata = open(os.path.join(output_dir_speech, "metadata.csv"), mode="w", encoding="utf8")
|
||||
|
||||
# List available recording sessions
|
||||
user_models = c.execute('SELECT uuid, user_name from usermodel ORDER BY created_date DESC').fetchall()
|
||||
user_id = user_models[0][0]
|
||||
|
||||
for row in user_models:
|
||||
print(row[0] + ' -> ' + row[1])
|
||||
|
||||
user_answer = input('Please choose ID of recording session to export (default is newest session) [' + user_id + ']: ')
|
||||
|
||||
if user_answer:
|
||||
user_id = user_answer
|
||||
|
||||
|
||||
for row in c.execute('SELECT audio_id, prompt, lower(prompt) FROM audiomodel WHERE user_id = ? ORDER BY length(prompt)', (user_id,)):
|
||||
source_file = os.path.join(mrs_dir, "backend", "audio_files", user_id, row[0] + ".wav")
|
||||
if exists(source_file):
|
||||
metadata.write(row[0] + "|" + row[1] + "|" + row[2] + "\n")
|
||||
copyfile(source_file, os.path.join(output_dir_audio_temp, row[0] + ".wav"))
|
||||
else:
|
||||
print("Wave file {} not found.".format(source_file))
|
||||
|
||||
metadata.close()
|
||||
conn.close()
|
||||
|
||||
def cleanup():
|
||||
global output_dir_audio_temp
|
||||
|
||||
# Remove Temp Folder
|
||||
rmtree(output_dir_audio_temp)
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument('--mrs_dir', required=True)
|
||||
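# Parse the value explicitly; otherwise any non-empty string (even "False") would be truthy
parser.add_argument('--ffmpeg', required=False, default=False, type=lambda value: value.lower() in ('1', 'true', 'yes'))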
|
||||
args = parser.parse_args()
|
||||
|
||||
if not os.path.isdir(os.path.join(args.mrs_dir,"backend")):
|
||||
sys.exit("Passed directory is not a valid Mimic-Recording-Studio main directory!")
|
||||
|
||||
print('\n\033[48;5;22m MRS to LJ Speech Processor \033[0m\n')
|
||||
|
||||
create_folders()
|
||||
create_meta_data(args.mrs_dir)
|
||||
|
||||
if(args.ffmpeg):
|
||||
convert_audio()
|
||||
|
||||
else:
|
||||
copy_audio()
|
||||
|
||||
cleanup()
|
||||
|
||||
print('\n\033[38;5;86;1m✔\033[0m COMPLETE【ツ】\n')
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
27
helperScripts/README.md
Normal file
@ -0,0 +1,27 @@
|
||||
# Short collection of helpful scripts for dataset creation and/or TTS training stuff
|
||||
|
||||
## MRS2LJSpeech
|
||||
Python script that takes recordings (filesystem and SQLite database) made with Mycroft Mimic-Recording-Studio (https://github.com/MycroftAI/mimic-recording-studio) and creates an audio-optimized dataset in the widely supported LJSpeech directory structure.
|
||||
|
||||
Peter Schmalfeldt (https://github.com/manifestinteractive) did an amazing job optimizing my original (quick and dirty) version of that script, so thank you Peter :-)
|
||||
See more details here: https://gist.github.com/manifestinteractive/6fd9be62d0ede934d4e1171e5e751aba#file-mrs2ljspeech-py
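
The resulting layout can be sanity-checked with a few lines of Python. The following is a minimal sketch and not part of the repository: it assumes the structure that MRS2LJSpeech.py writes (a `metadata.csv` with `id|text|lowercased text` rows next to a `wavs` folder), and the function name `check_ljspeech_dataset` is hypothetical.

```python
import csv
import os


def check_ljspeech_dataset(dataset_dir):
    """Return (rows, missing) for an LJSpeech-style dataset directory.

    rows    -- list of (audio_id, text, normalized_text) tuples from metadata.csv
    missing -- audio ids listed in metadata.csv without a matching wavs/<id>.wav
    """
    rows, missing = [], []
    metadata_path = os.path.join(dataset_dir, "metadata.csv")
    with open(metadata_path, encoding="utf8", newline="") as csvfile:
        # Assumption: metadata is pipe-separated and unquoted
        reader = csv.reader(csvfile, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 3:
                continue  # skip blank or malformed lines
            audio_id, text, normalized = row[0], row[1], row[2]
            rows.append((audio_id, text, normalized))
            wav_path = os.path.join(dataset_dir, "wavs", audio_id + ".wav")
            if not os.path.isfile(wav_path):
                missing.append(audio_id)
    return rows, missing
```

Calling `check_ljspeech_dataset("dataset/LJSpeech-1.1")` after an export should return an empty `missing` list if every metadata row has its recording.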
|
||||
|
||||
## Dockerfile.Jetson-Coqui
|
||||
> Add your user to the `docker` group so you don't need sudo for every operation.
|
||||
|
||||
Thanks to NVIDIA for providing Docker images for the Jetson platform. I use the "machine learning (ML)" image as the base image for setting up a Coqui environment.
|
||||
|
||||
> You can use any branch or tag as the `COQUI_BRANCH` argument. v0.1.3 is just the current stable version.
|
||||
|
||||
Switch to the directory containing the Dockerfile and run `nvidia-docker build . -f Dockerfile.Jetson-Coqui --build-arg COQUI_BRANCH=v0.1.3 -t jetson-coqui` to build your container image. When the build process has finished, you can start a container from that image.
|
||||
|
||||
|
||||
### Mapped volumes
|
||||
Your dataset and configuration file need to be available inside the container, so map a volume when starting it:
|
||||
`nvidia-docker run -p 8888:8888 -d --shm-size 32g --gpus all -v [host path with dataset and config.json]:/coqui/TTS/data jetson-coqui`. Now we have a running container ready for Coqui TTS magic.
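
Before starting the container it can help to confirm that the host directory really holds what the training expects. This is a minimal sketch under assumptions: it supposes the mounted directory directly contains an LJSpeech-style `metadata.csv`, a `wavs` folder, and a `config.json`; the helper name `missing_training_inputs` is hypothetical, and your layout may differ.

```python
import os


def missing_training_inputs(host_dir):
    """List expected entries that are absent from the directory to be mounted."""
    # Assumed layout: LJSpeech-style dataset plus a Coqui config.json
    expected = ["metadata.csv", "wavs", "config.json"]
    return [name for name in expected
            if not os.path.exists(os.path.join(host_dir, name))]
```

An empty list means all three entries were found; anything else names what still needs to be copied into the host path before mapping it to `/coqui/TTS/data`.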
|
||||
|
||||
### Jupyter notebook
|
||||
Coqui provides lots of useful Jupyter notebooks for dataset analysis. Once your container is up and running you should be able to call
|
||||
|
||||
### Running bash inside the container
|
||||
`nvidia-docker exec -it <containerId> /bin/bash` opens a shell inside the running container; an `ls /coqui/TTS/data` should then show your dataset files.
|
41
helperScripts/getDatasetSpeechRate.py
Normal file
@ -0,0 +1,41 @@
|
||||
# This script gets speech rate per audio recording from a voice dataset (ljspeech file and directory structure)
|
||||
# Written by Thorsten Müller (deep-learning-german@gmx.net) and provided without any warranty.
|
||||
# https://github.com/thorstenMueller/deep-learning-german-tts/
|
||||
# https://twitter.com/ThorstenVoice
|
||||
|
||||
# Changelog:
|
||||
# v0.1 - 26.09.2021 - Initial version
|
||||
|
||||
from os.path import exists
|
||||
import os
|
||||
import librosa
|
||||
import csv
|
||||
|
||||
dataset_dir = "/home/thorsten/___dev/tts/dataset/Thorsten-neutral-Dec2021-44k/" # Directory where metadata.csv is in
|
||||
out_csv_file = os.path.join(dataset_dir,"speech_rate_report.csv")
|
||||
decimal_use_comma = True # False: Splitting decimal value with a dot (.); True: Comma (,)
|
||||
|
||||
out_csv = open(out_csv_file,"w")
|
||||
out_csv.write("filename;audiolength_sec;number_chars;chars_per_sec;remove_from_dataset\n")
|
||||
|
||||
# Open metadata.csv file
|
||||
with open(os.path.join(dataset_dir,"metadata.csv")) as csvfile:
|
||||
reader = csv.reader(csvfile, delimiter='|')
|
||||
for row in reader:
|
||||
wav_file = os.path.join(dataset_dir,"wavs",row[0] + ".wav")
|
||||
|
||||
if exists(wav_file):
|
||||
# Gather values for report.csv output
|
||||
phrase_len = len(row[1]) - 1 # Subtract 1 so the trailing punctuation mark is not counted.
|
||||
duration = round(librosa.get_duration(filename=wav_file),2)
|
||||
char_per_sec = round(phrase_len / duration,2)
|
||||
|
||||
if decimal_use_comma:
|
||||
duration = str(duration).replace(".",",")
|
||||
char_per_sec = str(char_per_sec).replace(".",",")
|
||||
|
||||
out_csv.write(row[0] + ".wav;" + str(duration) + ";" + str(phrase_len) + ";" + str(char_per_sec) + ";no\n")
|
||||
else:
|
||||
print("File " + wav_file + " does not exist.")
|
||||
|
||||
out_csv.close()
|
48
helperScripts/removeFilesFromDataset.py
Normal file
@ -0,0 +1,48 @@
|
||||
# This script removes recordings from an ljspeech file/directory structured dataset based on CSV file from "getDatasetSpeechRate"
|
||||
# Written by Thorsten Müller (deep-learning-german@gmx.net) and provided without any warranty.
|
||||
# https://github.com/thorstenMueller/deep-learning-german-tts/
|
||||
# https://twitter.com/ThorstenVoice
|
||||
|
||||
# Changelog:
|
||||
# v0.1 - 26.09.2021 - Initial version
|
||||
|
||||
import os
|
||||
import csv
|
||||
import shutil
|
||||
|
||||
dataset_dir = "/Users/thorsten/Downloads/thorsten-export-20210909/" # Directory where metadata.csv is in
|
||||
subfolder_removed = "___removed"
|
||||
in_csv_file = os.path.join(dataset_dir,"speech_rate_report.csv")
|
||||
to_remove = []
|
||||
|
||||
# Open speech_rate_report.csv (output of "getDatasetSpeechRate")
|
||||
with open(in_csv_file) as csvfile:
|
||||
reader = csv.reader(csvfile, delimiter=';')
|
||||
for row in reader:
|
||||
if row[4] == "yes":
|
||||
# Recording in that row should be removed from dataset
|
||||
to_remove.append(row[0])
|
||||
print("Recording " + row[0] + " will be removed from dataset.")
|
||||
|
||||
print("\n" + str(len(to_remove)) + " recording(s) have been marked for removal.")
|
||||
|
||||
if len(to_remove) > 0:
|
||||
|
||||
metadata_cleaned = open(os.path.join(dataset_dir,"metadata_cleaned.csv"),"w")
|
||||
|
||||
# Create new subdirectory for removed wav files
|
||||
removed_dir = os.path.join(dataset_dir,subfolder_removed)
|
||||
if not os.path.exists(removed_dir):
|
||||
os.makedirs(removed_dir)
|
||||
|
||||
# Remove lines from metadata.csv and move wav files to new subdirectory
|
||||
with open(os.path.join(dataset_dir,"metadata.csv")) as csvfile:
|
||||
reader = csv.reader(csvfile, delimiter='|')
|
||||
for row in reader:
|
||||
if (row[0] + ".wav") not in to_remove:
|
||||
metadata_cleaned.write(row[0] + "|" + row[1] + "|" + row[2] + "\n")
|
||||
else:
|
||||
# Move recording to new subfolder
|
||||
shutil.move(os.path.join(dataset_dir,"wavs",row[0] + ".wav"),removed_dir)
|
||||
|
||||
metadata_cleaned.close()
|
BIN (normal files, binary files not shown):
samples/thorsten-21.06-emotional/amused.wav
samples/thorsten-21.06-emotional/angry.wav
samples/thorsten-21.06-emotional/disgusted.wav
samples/thorsten-21.06-emotional/drunk.wav
samples/thorsten-21.06-emotional/neutral.wav
samples/thorsten-21.06-emotional/sleepy.wav
samples/thorsten-21.06-emotional/surprised.wav
samples/thorsten-21.06-emotional/whisper.wav