Compare commits
90 Commits
githubPage
...
master
Author | SHA1 | Date | |
---|---|---|---|
|
f13bcaf63e | ||
|
04c5683194 | ||
|
50e09d49bf | ||
|
b0afed75f4 | ||
|
9b7b4c6836 | ||
|
aba10bc64a | ||
|
07e85b3905 | ||
|
e08d50d6bb | ||
|
e691aa4ee3 | ||
|
625f73e986 | ||
|
de1802f8ce | ||
|
f0500309d6 | ||
|
41c91b9865 | ||
|
fcb1e705a9 | ||
|
b8802db4f8 | ||
|
b00c768343 | ||
|
3b0b4f898f | ||
|
2106fc6b00 | ||
|
e4ff3ce04a | ||
|
f408508cd7 | ||
|
6b4cfb41d4 | ||
|
521dd33483 | ||
|
6efb25310a | ||
|
5654397f3e | ||
|
b5ec9ef991 | ||
|
77ad01d4ff | ||
|
c35507b1f7 | ||
|
b536dfd958 | ||
|
29238f2a31 | ||
|
8c5f4503f3 | ||
|
2ff7e3961b | ||
|
1221713314 | ||
|
d3225b48f8 | ||
|
33c030f844 | ||
|
2daabae53e | ||
|
1d445b09f8 | ||
|
2853f111dc | ||
|
7540606247 | ||
|
0b9e929ce0 | ||
|
bc06fa923f | ||
|
f19144b085 | ||
|
251c093ad4 | ||
|
f505fd38df | ||
|
3e09ae8615 | ||
|
2ed2413dda | ||
|
51c5f55bbd | ||
|
4f875ac591 | ||
|
2ea44ede87 | ||
|
ba60fc57d4 | ||
|
9e68d99ee7 | ||
|
7172604eed | ||
|
58dece7c55 | ||
|
c81f374aca | ||
|
2c6aca780b | ||
|
68e60f2a92 | ||
|
a3b0dde296 | ||
|
28d81a0fb2 | ||
|
12c6d26dbd | ||
|
4c06db69dd | ||
|
bae96a75a5 | ||
|
1313520064 | ||
|
e2ecf68c13 | ||
|
c8a5e1082e | ||
|
40aae591d7 | ||
|
4f722e96a9 | ||
|
7e1530b742 | ||
|
647786be6c | ||
|
00685a008d | ||
|
e5481a82a6 | ||
|
2d1428cd13 | ||
|
df55a19ae2 | ||
|
9585b73cc3 | ||
|
70158ba7c8 | ||
|
e1e9f8666a | ||
|
cca10c215e | ||
|
09705597b8 | ||
|
bdb3aa7d47 | ||
|
f0c0f63ae1 | ||
|
036c266ad7 | ||
|
8e6137b3af | ||
|
9ee0353da4 | ||
|
a99d4b6477 | ||
|
02020e54f7 | ||
|
5347394f3e | ||
|
c59d19e0a1 | ||
|
e45736f62d | ||
|
e96de3a095 | ||
|
eaead5cebe | ||
|
7b27bdac2d | ||
|
f55e16d0fc |
2
.github/FUNDING.yml
vendored
Normal file
@ -0,0 +1,2 @@
|
|||||||
|
# These are supported funding model platforms
|
||||||
|
|
28
CITATION.cff
Normal file
@ -0,0 +1,28 @@
|
|||||||
|
# This CITATION.cff file was generated with cffinit.
|
||||||
|
# Visit https://bit.ly/cffinit to generate yours today!
|
||||||
|
|
||||||
|
cff-version: 1.2.0
|
||||||
|
title: Thorsten-Voice
|
||||||
|
message: >-
|
||||||
|
Please cite Thorsten-Voice project if you use
|
||||||
|
datasets or trained TTS models.
|
||||||
|
type: dataset
|
||||||
|
authors:
|
||||||
|
- given-names: Thorsten
|
||||||
|
family-names: Müller
|
||||||
|
email: tm@thorsten-voice.de
|
||||||
|
- given-names: Dominik
|
||||||
|
family-names: Kreutz
|
||||||
|
repository-code: 'https://github.com/thorstenMueller/Thorsten-Voice'
|
||||||
|
url: 'https://www.Thorsten-Voice.de'
|
||||||
|
abstract: >-
|
||||||
|
A free to use, offline working, high quality german
|
||||||
|
TTS voice should be available for every project
|
||||||
|
without any license struggling.
|
||||||
|
keywords:
|
||||||
|
- Thorsten
|
||||||
|
- Voice
|
||||||
|
- Open
|
||||||
|
- German
|
||||||
|
- TTS
|
||||||
|
- Dataset
|
BIN
Logo_Thorsten-Voice.png
Normal file
After Width: | Height: | Size: 25 KiB |
320
README.md
@ -1,135 +1,245 @@
|
|||||||
# Introduction
|
![Thorsten-Voice logo](Logo_Thorsten-Voice.png)
|
||||||
Many smart voice assistants like Amazon Alexa, Google Home, Apple Siri and Microsoft Cortana use cloud services to offer their (base) functionality.
|
|
||||||
|
|
||||||
As some people have privacy concerns using these services there are some (open source) projects trying to build offline and/or privacy aware alternatives.
|
- [Project motivation](#motivation-for-thorsten-voice-project-speaking_head-speech_balloon)
|
||||||
|
|
||||||
But speech recognition and text synthesis still requires cloud services for providing these in a decent quality.
|
- [Personal note](#some-personal-words-before-using-thorsten-voice)
|
||||||
|
|
||||||
# MyCroft AI
|
- [**Thorsten** Voice Datasets](#voice-datasets)
|
||||||
> https://mycroft.ai/
|
- [Thorsten-21.02-neutral](#thorsten-2102-neutral)
|
||||||
|
- [Thorsten-21.06-emotional](#thorsten-2106-emotional)
|
||||||
|
- [Thorsten-22.10-neutral](#thorsten-2210-neutral)
|
||||||
|
|
||||||
MyCroft is a company developing an open source voice assistant with a very nice and active community. However, the STT/TTS parts are still cloud based (e.g. Google services), even if requests are anonymized by a MyCroft proxy in between. Integration with locally hosted services such as DeepSpeech (STT) or Mimic/Tacotron (TTS) is possible, though.
|
- [**Thorsten** TTS-Models](#tts-models)
|
||||||
|
- [Thorsten-21.04-Tacotron2-DCA](#thorsten-2104-tacotron2-dca)
|
||||||
|
- [Thorsten-22.05-VITS](#thorsten-2205-vits)
|
||||||
|
- [Thorsten-22.08-Tacotron2-DDC](#thorsten-2208-tacotron2-ddc)
|
||||||
|
- [Other models](#other-models)
|
||||||
|
|
||||||
# Mozilla
|
- [Public talks](#public-talks)
|
||||||
Mozilla works on these really important aspects for free and open human machine voice interaction.
|
|
||||||
|
|
||||||
## STT - speech to text
|
- [My Youtube channel](#youtube-channel)
|
||||||
> https://commonvoice.mozilla.org/
|
|
||||||
|
|
||||||
"STT" needs lots of audio training data by many speakers (women/men/kids) of all ages, dialects and in various audio quality levels. So any voice contribution for common voice project is highly welcome.
|
- [Special Thanks](#thanks-section)
|
||||||
|
|
||||||
## TTS - text to speech
|
|
||||||
> https://github.com/mozilla/tts
|
|
||||||
|
|
||||||
"TTS" needs lots of clean recordings by one speaker to train a model. Mozilla is developing a software stack for proper model training based on tacotron2 papers.
|
|
||||||
|
|
||||||
# And?!
|
|
||||||
I want to make the most personal contribution I can and contribute my own voice (**German**) for TTS training to the community for free usage.
|
|
||||||
|
|
||||||
## Please read some personal words before downloading the dataset
|
|
||||||
I contribute my voice as a person believing in a world where all people are equal. No matter of gender, sexual orientation, religion, skin color and geocoordinates of birth location. A global world where everybody is warmly welcome on any place on this planet and open and free knowledge and education is available to everyone.
|
|
||||||
|
|
||||||
So hopefully my voice is used in this manner to make this world a better place for all of us :-).
|
|
||||||
|
|
||||||
**tl;dr** Please don't use for evil!
|
|
||||||
|
|
||||||
# Dataset "thorsten"
|
|
||||||
## Samples of my voice
|
|
||||||
To give you an impression of what my voice sounds like, so you can decide whether it fits your project, I published some sample recordings. There is no need to download the complete dataset first.
|
|
||||||
|
|
||||||
* [Das Teilen eines Benutzerkontos ist strengstens untersagt.](./samples/original_recording/recorded_sample_01.wav )
|
|
||||||
* [Der Prophet spricht stets in Gleichnissen.](./samples/original_recording/recorded_sample_02.wav )
|
|
||||||
* [Bitte schmeißt euren Müll nicht einfach in die Walachei.](./samples/original_recording/recorded_sample_03.wav )
|
|
||||||
* [So etwas würde mir nie in den Sinn kommen.](./samples/original_recording/recorded_sample_04.wav )
|
|
||||||
* [Sie klettert auf einen Stein und nimmt eine Denkerpose ein.](./samples/original_recording/recorded_sample_05.wav )
|
|
||||||
* [Jede gute Küchenwaage hat eine Tara-Funktion.](./samples/original_recording/recorded_sample_06.wav )
|
|
||||||
* [Jeden Gedanken kannst du hier loswerden.](./samples/original_recording/recorded_sample_07.wav )
|
|
||||||
|
|
||||||
|
|
||||||
## Dataset information
|
# Motivation for Thorsten-Voice project :speaking_head: :speech_balloon:
|
||||||
|
A **free** to use, **offline** working, **high quality** **german** **TTS** voice should be available for every project without any license struggling.
|
||||||
|
|
||||||
* ljspeech-1.1 structure
|
<a href="https://twitter.com/intent/follow?screen_name=ThorstenVoice"><img src="https://img.shields.io/twitter/follow/ThorstenVoice?style=social&logo=twitter" alt="follow on Twitter"></a>
|
||||||
* 22.668 recorded phrases (wav files)
|
[![YouTube Channel Subscribers](https://img.shields.io/youtube/channel/subscribers/UCjqqTVVBTsxpm0iOhQ1fp9g?style=social)](https://www.youtube.com/c/ThorstenMueller)
|
||||||
* more than 23 hours of pure audio
|
[![Project website](https://img.shields.io/badge/Project_website-www.Thorsten--Voice.de-92a0c0)](https://www.Thorsten-Voice.de)
|
||||||
* samplerate 22.050Hz
|
|
||||||
* mono
|
# Social media
|
||||||
* normalized to -24dB
|
Please check and follow me on my social media profiles - Thank you.
|
||||||
* phrase length (min/avg/max): 2 / 52 / 180 chars
|
|
||||||
* no silence at beginning/ending
|
| Platform | Link |
|
||||||
* avg spoken chars per second: 14
|
| --------------- | ------- |
|
||||||
* sentences with question mark: 2.780
|
| Youtube | [ThorstenVoice on Youtube](https://www.youtube.com/c/ThorstenMueller) |
|
||||||
* sentences with exclamation mark: 1.840
|
| Twitter | [ThorstenVoice on Twitter](https://twitter.com/ThorstenVoice) |
|
||||||
|
| Instagram | [ThorstenVoice on Instagram](https://www.instagram.com/thorsten_voice/) |
|
||||||
|
| LinkedIn | [Thorsten Müller on LinkedIn](https://www.linkedin.com/in/thorsten-m%C3%BCller-848a344/) |
|
||||||
|
|
||||||
|
# Some personal words before using **Thorsten-Voice**
|
||||||
|
> I contribute my voice as a person believing in a world where all people are equal. No matter of gender, sexual orientation, religion, skin color and geocoordinates of birth location. A global world where everybody is warmly welcome on any place on this planet and open and free knowledge and education is available to everyone. :earth_africa: (*Thorsten Müller*)
|
||||||
|
|
||||||
|
Please keep in mind that **I am not a professional voice talent**. I'm just a normal guy sharing his voice with the world.
|
||||||
|
|
||||||
|
# Voice-Datasets
|
||||||
|
Voice datasets are listed on Zenodo:
|
||||||
|
| Dataset | DOI Link |
|
||||||
|
| --------------- | ------- |
|
||||||
|
| Thorsten-21.02-neutral | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5525342.svg)](https://doi.org/10.5281/zenodo.5525342) |
|
||||||
|
| Thorsten-21.06-emotional | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5525023.svg)](https://doi.org/10.5281/zenodo.5525023) |
|
||||||
|
| Thorsten-22.10-neutral | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7265581.svg)](https://doi.org/10.5281/zenodo.7265581) |
|
||||||
|
|
||||||
|
## Thorsten-21.02-neutral
|
||||||
|
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5525342.svg)](https://doi.org/10.5281/zenodo.5525342)
|
||||||
|
|
||||||
|
```
|
||||||
|
@dataset{muller_thorsten_2021_5525342,
|
||||||
|
author = {Müller, Thorsten and
|
||||||
|
Kreutz, Dominik},
|
||||||
|
title = {Thorsten-Voice - "Thorsten-21.02-neutral" Dataset},
|
||||||
|
month = feb,
|
||||||
|
year = 2021,
|
||||||
|
note = {{Please use it to make the world a better place for
|
||||||
|
whole humankind.}},
|
||||||
|
publisher = {Zenodo},
|
||||||
|
version = {3.0},
|
||||||
|
doi = {10.5281/zenodo.5525342},
|
||||||
|
url = {https://doi.org/10.5281/zenodo.5525342}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
> :speaking_head: **Listen to some audio recordings from this dataset [here](https://drive.google.com/drive/folders/1KVjGXG2ij002XRHb3fgFK4j0OEq1FsWm?usp=sharing).**
|
||||||
|
|
||||||
|
### Dataset summary
|
||||||
|
* Recorded by Thorsten Müller
* Optimized by Dominik Kreutz
* LJSpeech file and directory structure
* 22,668 recorded phrases (*wav files*)
* More than 23 hours of pure audio
* Sample rate: 22,050 Hz
* Mono
* Normalized to -24dB
* Phrase length (min/avg/max): 2 / 52 / 180 chars
* No silence at beginning/ending
* Avg spoken chars per second: 14
* Sentences with question mark: 2,780
* Sentences with exclamation mark: 1,840
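
The figures in this summary can be recomputed from the dataset's metadata file. The following is a minimal sketch, assuming LJSpeech-style rows of the form `file_id|text`. The function name `phrase_stats` and the exact column layout are assumptions for illustration, not part of the dataset documentation.

```python
from statistics import mean


def phrase_stats(metadata_lines):
    """Summarize phrase lengths from LJSpeech-style metadata lines.

    Assumption: each line looks like ``file_id|text``; any further
    pipe-separated columns are ignored.
    """
    texts = []
    for line in metadata_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split("|")
        if len(parts) < 2:
            continue  # skip rows without a text column
        texts.append(parts[1])

    if not texts:
        raise ValueError("no phrases found in metadata")

    lengths = [len(text) for text in texts]
    return {
        "phrases": len(texts),
        "min_chars": min(lengths),
        "avg_chars": round(mean(lengths)),
        "max_chars": max(lengths),
        # Count phrases that contain the respective punctuation mark.
        "questions": sum("?" in text for text in texts),
        "exclamations": sum("!" in text for text in texts),
    }
```

Passing an open file handle for `metadata.csv` (opened with `encoding="utf-8"`) should give values comparable to the list above, provided the file follows the assumed layout.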
|
||||||
|
|
||||||
|
### Dataset evolution
|
||||||
|
As described in the PDF document ([evolution of thorsten dataset](./EvolutionOfThorstenDataset.pdf)) this dataset consists of three recording phases.
|
||||||
|
|
||||||
|
* **Phase 1**: Recorded with a cheap usb microphone (*low quality*)
|
||||||
|
* **Phase 2**: Recorded with a good microphone (*good quality*)
|
||||||
|
* **Phase 3**: Recorded with same good microphone but longer phrases (> 100 chars) (*good quality*)
|
||||||
|
|
||||||
|
If you want to use a dataset subset you can see which files belong to which recording phase in [recording quality](./RecordingQuality.csv) csv file.
|
||||||
|
|
||||||
|
|
||||||
![text length vs. mean audio duration](./img/thorsten-de---datasetAnalysis1.png)
|
## Thorsten-21.06-emotional
|
||||||
![text length vs. median audio duration](./img/thorsten-de---datasetAnalysis2.png)
|
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5525023.svg)](https://doi.org/10.5281/zenodo.5525023)
|
||||||
![text length vs. STD](./img/thorsten-de---datasetAnalysis3.png)
|
|
||||||
![text length vs. number instances](./img/thorsten-de---datasetAnalysis4.png)
|
|
||||||
![signal noise ratio](./img/thorsten-de---datasetAnalysis5.png)
|
|
||||||
![bokeh](./img/thorsten-de---datasetAnalysis6.png)
|
|
||||||
|
|
||||||
## Dataset evolution
|
```
|
||||||
As described in the PDF document ([evolution of thorsten dataset](./EvolutionOfThorstenDataset.pdf)) this dataset consists of three recording phases.
|
@dataset{muller_thorsten_2021_5525023,
|
||||||
|
author = {Müller, Thorsten and
|
||||||
|
Kreutz, Dominik},
|
||||||
|
title = {{Thorsten-Voice - "Thorsten-21.06-emotional"
|
||||||
|
Dataset}},
|
||||||
|
month = jun,
|
||||||
|
year = 2021,
|
||||||
|
note = {{Please use it to make the world a better place for
|
||||||
|
whole humankind.}},
|
||||||
|
publisher = {Zenodo},
|
||||||
|
version = {2.0},
|
||||||
|
doi = {10.5281/zenodo.5525023},
|
||||||
|
url = {https://doi.org/10.5281/zenodo.5525023}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
* phase1: Recorded with a cheap usb microphone
|
All emotional recordings were recorded by myself, and I tried to feel and pronounce each emotion even if the phrase context does not match it. Example: I pronounced the sleepy recordings in the tone I have shortly before falling asleep.
|
||||||
* phase2: Recorded with a good microphone
|
|
||||||
* phase3: Recorded with same good microphone but longer phrases (> 100 chars)
|
|
||||||
|
|
||||||
If you wanna use just a dataset subset (phase1 and/or phase2 and/or phase3) you can see which files belong to which recording phase in [recording quality](./RecordingQuality.csv) csv file.
|
### Samples
|
||||||
|
Listen to the phrase "**Mist, wieder nichts geschafft.**" in the following emotions:
|
||||||
|
|
||||||
|
* :slightly_smiling_face: [Neutral](./samples/thorsten-21.06-emotional/neutral.wav)
|
||||||
|
* :nauseated_face: [Disgusted](./samples/thorsten-21.06-emotional/disgusted.wav)
|
||||||
|
* :angry: [Angry](./samples/thorsten-21.06-emotional/angry.wav)
|
||||||
|
* :grinning: [Amused](./samples/thorsten-21.06-emotional/amused.wav)
|
||||||
|
* :astonished: [Surprised](./samples/thorsten-21.06-emotional/surprised.wav)
|
||||||
|
* :pensive: [Sleepy](./samples/thorsten-21.06-emotional/sleepy.wav)
|
||||||
|
* :dizzy_face: [Drunk](./samples/thorsten-21.06-emotional/drunk.wav)
|
||||||
|
* 🤫 [Whispering](./samples/thorsten-21.06-emotional/whisper.wav)
|
||||||
|
### Dataset summary
|
||||||
|
* Recorded by Thorsten Müller
* Optimized by Dominik Kreutz
* 300 sentences * 8 emotions = 2,400 recordings
* Mono
* Sample rate: 22,050 Hz
* Normalized to -24dB
* No silence at beginning/ending
* Sentence length: 59 - 148 chars
|
||||||
|
|
||||||
|
|
||||||
## Download information
|
## Thorsten-22.10-neutral
|
||||||
> Download size: 2.7 GB
|
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7265581.svg)](https://doi.org/10.5281/zenodo.7265581)
|
||||||
|
> :speaking_head: **Listen to some audio recordings from this dataset [here](https://drive.google.com/drive/folders/1dxoSo8Ktmh-5E0rSVqkq_Jm1r4sFnwJM?usp=sharing).**
|
||||||
|
|
||||||
Version | Description | Date | Link
|
```
|
||||||
------------ | ------------- | ------------- | -------------
|
@dataset{muller_thorsten_2022_7265581,
|
||||||
thorsten-de-v01 | Initial version | 2020-06-28 | [Google Drive Download v01](https://drive.google.com/file/d/1yKJM1LAOQpRVojKunD9r8WN_p5KzBxjc/view?usp=sharing)
|
author = {Müller, Thorsten and
|
||||||
thorsten-de-v02 | normalized to -24dB and split metadata.csv into shuffled metadata_train.csv and metadata_val.csv | 2020-08-22 | [Google Drive Download v02](https://drive.google.com/file/d/1mGWfG0s2V2TEg-AI2m85tze1m4pyeM7b/view?usp=sharing)
|
Kreutz, Dominik},
|
||||||
|
title = {ThorstenVoice Dataset 2022.10},
|
||||||
|
month = oct,
|
||||||
|
year = 2022,
|
||||||
|
publisher = {Zenodo},
|
||||||
|
version = {1.0},
|
||||||
|
doi = {10.5281/zenodo.7265581},
|
||||||
|
url = {https://doi.org/10.5281/zenodo.7265581}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
# TTS Models
|
||||||
|
|
||||||
|
## Thorsten-21.04-Tacotron2-DCA
|
||||||
|
This [TTS-model](https://drive.google.com/drive/folders/1m4RuffbvdOmQWnmy_Hmw0cZ_q0hj2o8B?usp=sharing) has been trained on [**Thorsten-21.02-neutral**](#thorsten-2102-neutral) dataset. The recommended trained Fullband-MelGAN Vocoder can be downloaded [here](https://drive.google.com/drive/folders/1hsfaconm4Yd9wPVyOtrXjWQs4ZAPoouY?usp=sharing).
|
||||||
|
|
||||||
|
Run the model:
|
||||||
|
* pip install TTS==0.5.0
|
||||||
|
* tts-server --model_name tts_models/de/thorsten/tacotron2-DCA
|
||||||
|
|
||||||
|
|
||||||
# Trained tacotron2 model "thorsten"
|
## Thorsten-22.05-VITS
|
||||||
If you trained a model on "thorsten" dataset please file an issue with some information on it. Sharing a trained model is highly appreciated.
|
Trained on dataset **Thorsten-22.05-neutral**.
|
||||||
|
Audio samples are available on [Thorsten-Voice website](https://www.thorsten-voice.de/en/just-get-started/).
|
||||||
|
|
||||||
## Trained models (TODO)
|
To run TTS server just follow these steps:
|
||||||
|
* pip install tts==0.7.1
|
||||||
|
* tts-server --model_name tts_models/de/thorsten/vits
|
||||||
|
* Open browser on http://localhost:5002 and enjoy playing
|
||||||
|
|
||||||
|
## Thorsten-22.08-Tacotron2-DDC
|
||||||
|
Trained on dataset [**Thorsten-22.05-neutral**](#thorsten-2205-neutral).
|
||||||
|
Audio samples are available on [Thorsten-Voice website](https://www.thorsten-voice.de/2022/08/14/welches-tts-modell-klingt-besser/).
|
||||||
|
|
||||||
|
To run TTS server just follow these steps:
|
||||||
|
* pip install tts==0.8.0
|
||||||
|
* tts-server --model_name tts_models/de/thorsten/tacotron2-DDC
|
||||||
|
* Open browser on http://localhost:5002 and enjoy playing
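
Besides the browser page, the running server can presumably be queried from a script. The sketch below assumes that the demo server exposes a GET endpoint `/api/tts` that takes a `text` query parameter and returns WAV audio. This is an assumption about the server at the pinned versions and is not confirmed by this README. The helper names `build_tts_url` and `synthesize_to_file` are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_tts_url(text, host="localhost", port=5002):
    """Build the request URL for the (assumed) /api/tts endpoint."""
    query = urlencode({"text": text})  # percent-encodes umlauts and spaces
    return f"http://{host}:{port}/api/tts?{query}"


def synthesize_to_file(text, wav_path, host="localhost", port=5002):
    """Request synthesized speech and write the returned bytes to wav_path.

    Requires a running tts-server; returns the number of bytes written.
    """
    with urlopen(build_tts_url(text, host, port)) as response:
        audio = response.read()
    with open(wav_path, "wb") as wav_file:
        wav_file.write(audio)
    return len(audio)
```

With the server started as described above, `synthesize_to_file("Sei eine Stimme, kein Echo.", "output.wav")` would store the spoken sentence in `output.wav`, if the endpoint behaves as assumed.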
|
||||||
|
|
||||||
|
|
||||||
|
## Other models
|
||||||
|
### Silero
|
||||||
|
|
||||||
|
You can use free AGPL-licensed models trained on the **Thorsten-21.02-neutral** dataset via the [silero-models](https://github.com/snakers4/silero-models/blob/master/models.yml) project.
|
||||||
|
|
||||||
|
* [Thorsten 16kHz](https://drive.google.com/drive/folders/1tR6w4kgRS2JJ1TWZhwoFuU04Xkgo6YAs?usp=sharing)
|
||||||
|
* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-models/blob/master/examples_tts.ipynb)
|
||||||
|
|
||||||
|
### ZDisket
|
||||||
|
[ZDisket](https://github.com/ZDisket/TensorVox) made a tool called TensorVox for setting up a TTS environment on Windows and included a German TTS model trained by [monatis](https://github.com/monatis/german-tts). Thanks for sharing that :thumbsup:. See it in action on [Youtube](https://youtu.be/tY6_xZnkv-A).
|
||||||
|
|
||||||
|
# Public talks
|
||||||
|
I really want to bring the topic "**Open Voice For An Open Future**" to wider public attention.
|
||||||
|
|
||||||
|
* I took part in a Linux User Group podcast about Mycroft AI and talked about my TTS efforts (*May 2021*).
|
||||||
|
* I was invited by [Yusuf](https://github.com/monatis/) from the Turkish TensorFlow community to talk about "How to make machines speak with your own voice". The talk was streamed live on Youtube and is available [here](https://www.youtube.com/watch?v=m-Uwb-Bg144&t=2303s). If you're interested in the slides, feel free to download my presentation [here](https://docs.google.com/presentation/d/1ynnw0ilKV3WwMSJHytrN3GXRiFr8x3r0DUimBm1y0LI/edit?usp=sharing) (*June 2021*).
|
||||||
|
* I was invited as a speaker at VoiceLunch Language & Linguistics on 03.01.2022. [Here are my slides](https://docs.google.com/presentation/d/1Gi6BmYHs7g4ZgdAiIKGBnBwZDCvJOD9DJxQOGlgds1o/edit?usp=sharing) (*January 2022*).
|
||||||
|
|
||||||
|
# Youtube channel
|
||||||
|
In summer 2021 I started to share my lessons learned and experiences with open voice tech, especially **TTS**, on my little [Youtube channel](https://www.youtube.com/c/ThorstenMueller). If you check out and like my videos, I'd be happy to welcome you as a subscriber and member of my little Youtube community.
|
||||||
|
|
||||||
Folder | Date | Link | Description
|
|
||||||
------------ | ------------- | ------------- | -------------
|
|
||||||
thorsten-taco2-ddc-v0.1 | to do | to do | to do
|
|
||||||
|
|
||||||
# Feel free to file an issue if you ...
|
# Feel free to file an issue if you ...
|
||||||
* have improvements on dataset
|
* Use my TTS voice in your project(s)
|
||||||
* use my TTS voice in your project(s)
|
* Want to share your trained "Thorsten" model
|
||||||
* want to share your trained "thorsten" model
|
* Become aware of any abuse of my voice
|
||||||
* get to know about any abuse usage of my voice
|
|
||||||
|
|
||||||
# Special thanks
|
# Thanks section
|
||||||
I want to thank all open source communities for providing great projects.
|
## Cool projects
|
||||||
|
* https://commonvoice.mozilla.org/
|
||||||
|
* https://coqui.ai/
|
||||||
|
* https://mycroft.ai/
|
||||||
|
* https://github.com/rhasspy/
|
||||||
|
|
||||||
Just to name some nice guys who joined me on this tts-roadtrip:
|
## Cool people
|
||||||
|
* [El-Tocino](https://github.com/el-tocino/)
|
||||||
|
* [Eren Gölge](https://github.com/erogol/)
|
||||||
|
* [Gras64](https://github.com/gras64/)
|
||||||
|
* [Kris Gesling](https://github.com/krisgesling/)
|
||||||
|
* [Nmstoker](https://github.com/nmstoker)
|
||||||
|
* [Othiele](https://discourse.mozilla.org/u/othiele/summary)
|
||||||
|
* [Repodiac](https://github.com/repodiac)
|
||||||
|
* [SanjaESC](https://github.com/SanjaESC)
|
||||||
|
* [Synesthesiam](https://github.com/synesthesiam/)
|
||||||
|
|
||||||
* eltocino (https://github.com/el-tocino/)
|
## Even more special people
|
||||||
* erogol (https://github.com/erogol/)
|
Additionally, a big thank-you to my dear colleague, Sebastian Kraus, for supporting me with audio recording equipment and for being the creative mastermind behind the logo design.
|
||||||
* gras64 (https://github.com/gras64/)
|
|
||||||
* krisgesling (https://github.com/krisgesling/)
|
|
||||||
* nmstoker (https://github.com/nmstoker)
|
|
||||||
* othiele (https://discourse.mozilla.org/u/othiele/summary)
|
|
||||||
* repodiac (https://github.com/repodiac)
|
|
||||||
|
|
||||||
And last but not least i want to say a huge thank you to a special guy who supported me on this journey right from the beginning. Not just with nice words, but with his time, audio optimization knowhow and finally his gpu computing power.
|
Last but not least, I want to say a **huge, huge thank you** to a special guy who has supported me on this journey as a partner right from the beginning. Not just with kind words, but with his time, audio optimization know-how and, finally, GPU power.
|
||||||
|
|
||||||
Without his amazing support this dataset (in its current form) would not exist.
|
**Thank you so much, dear Dominik ([@domcross](https://github.com/domcross/)), for being my partner on this journey.**
|
||||||
|
|
||||||
Thank you Dominik (@domcross / https://github.com/domcross/)
|
Thorsten (*Twitter: @ThorstenVoice*)
|
||||||
|
|
||||||
# Links
|
|
||||||
* https://discourse.mozilla.org/t/contributing-my-german-voice-for-tts/48150
|
|
||||||
* https://community.mycroft.ai/
|
|
||||||
* https://github.com/MycroftAI/mimic-recording-studio
|
|
||||||
* https://voice.mozilla.org/
|
|
||||||
* https://github.com/mozilla/TTS
|
|
||||||
(https://github.com/repodiac/tit-for-tat/tree/master/thorsten-TTS)
|
|
||||||
* https://raw.githubusercontent.com/mozilla/voice-web/master/server/data/de/sentence-collector.txt
|
|
||||||
|
|
||||||
We'll hear from each other in the future :-)
|
|
||||||
|
|
||||||
Thorsten
|
|
||||||
|
94
Youtube/train_vits_win.py
Normal file
@ -0,0 +1,94 @@
import os

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsAudioConfig
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor


def main():
    # Write checkpoints and logs next to this script.
    output_path = os.path.dirname(os.path.abspath(__file__))
    # output_path = "c:\\temp\\tts"
    dataset_config = BaseDatasetConfig(
        formatter="ljspeech",
        meta_file_train="metadata_small.csv",
        path="C:\\Users\\ThorstenVoice\\TTS-Training\\ThorstenVoice-Dataset_2022.10",
    )
    audio_config = VitsAudioConfig(
        sample_rate=22050, win_length=1024, hop_length=256, num_mels=80, mel_fmin=0, mel_fmax=None
    )

    config = VitsConfig(
        audio=audio_config,
        run_name="vits_thorsten-voice",
        batch_size=4,
        eval_batch_size=4,
        batch_group_size=5,
        num_loader_workers=1,
        num_eval_loader_workers=1,
        run_eval=True,
        test_delay_epochs=-1,
        epochs=1000,
        text_cleaner="phoneme_cleaners",
        use_phonemes=True,
        phoneme_language="de",
        phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
        compute_input_seq_cache=True,
        print_step=25,
        print_eval=True,
        mixed_precision=False,
        output_path=output_path,
        datasets=[dataset_config],
        cudnn_benchmark=False,
        test_sentences=[
            "Es hat mich viel Zeit gekostet ein Stimme zu entwickeln, jetzt wo ich sie habe werde ich nicht mehr schweigen.",
            "Sei eine Stimme, kein Echo.",
            "Es tut mir Leid David. Das kann ich leider nicht machen.",
            "Dieser Kuchen ist großartig. Er ist so lecker und feucht.",
            "Vor dem 22. November 1963.",
        ],
    )

    # INITIALIZE THE AUDIO PROCESSOR
    # Audio processor is used for feature extraction and audio I/O.
    # It mainly serves to the dataloader and the training loggers.
    ap = AudioProcessor.init_from_config(config)

    # INITIALIZE THE TOKENIZER
    # Tokenizer is used to convert text to sequences of token IDs.
    # config is updated with the default characters if not defined in the config.
    tokenizer, config = TTSTokenizer.init_from_config(config)

    # LOAD DATA SAMPLES
    # Each sample is a list of ```[text, audio_file_path, speaker_name]```
    # You can define your custom sample loader returning the list of samples.
    # Or define your custom formatter and pass it to the `load_tts_samples`.
    # Check `TTS.tts.datasets.load_tts_samples` for more details.
    train_samples, eval_samples = load_tts_samples(
        dataset_config,
        eval_split=True,
        eval_split_max_size=config.eval_split_max_size,
        eval_split_size=config.eval_split_size,
    )

    # init model
    model = Vits(config, ap, tokenizer, speaker_manager=None)

    # init the trainer and 🚀
    trainer = Trainer(
        TrainerArgs(),
        config,
        output_path,
        model=model,
        train_samples=train_samples,
        eval_samples=eval_samples,
    )
    trainer.fit()
    print("Fertig!")


from multiprocessing import freeze_support

if __name__ == '__main__':
    freeze_support()  # needed for Windows
    main()
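
The `VitsAudioConfig` values in the script above (`sample_rate=22050`, `win_length=1024`, `hop_length=256`) determine the time resolution of the spectrogram frames. As a rough worked example, here is the plain arithmetic, independent of the TTS library; the helper name `frame_timing` is made up for illustration.

```python
def frame_timing(sample_rate, win_length, hop_length):
    """Convert STFT settings given in samples into time-based quantities."""
    return {
        # Number of spectrogram frames produced per second of audio.
        "frames_per_second": sample_rate / hop_length,
        # Duration covered by one analysis window, in milliseconds.
        "window_ms": 1000 * win_length / sample_rate,
        # Time shift between two consecutive frames, in milliseconds.
        "hop_ms": 1000 * hop_length / sample_rate,
    }


timing = frame_timing(sample_rate=22050, win_length=1024, hop_length=256)
```

For these settings the result is roughly 86 frames per second, with each window covering about 46 ms and consecutive frames about 11.6 ms apart.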
|
1
docs/_config.yml
Normal file
@ -0,0 +1 @@
|
|||||||
|
theme: jekyll-theme-cayman
|
449
docs/audio_compare.md
Normal file
@ -0,0 +1,449 @@
|
|||||||
|
# Vocoder comparison based on the "thorsten" Tacotron 2 model
|
||||||
|
Here are audio samples generated with different vocoders. All spoken texts (*Sample 1 - 4*) are taken from recordings in the dataset; the audio, however, is generated from the trained Tacotron 2 model rather than from the "ground truth" spectrogram. Sample 5 is the opening of the fairy tale "Der Froschkönig" and was not recorded for the dataset.
|
||||||
|
|
||||||
|
## Sentences
|
||||||
|
* **Sample #01**: Eure Schoko-Bonbons sind sagenhaft lecker!
|
||||||
|
* **Sample #02**: Eure Tröte nervt.
|
||||||
|
* **Sample #03**: Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet.
|
||||||
|
* **Sample #04**: Euer Plan hat ja toll geklappt.
|
||||||
|
* *Sample #05: "In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön." (opening of "Der Froschkönig")*
|
||||||
|
|
||||||
|
# Ground truth
|
||||||
|
Original recordings from the "thorsten" dataset.
|
||||||
|
|
||||||
|
<dl>
|
||||||
|
|
||||||
|
<table>
|
||||||
|
<thead>
|
||||||
|
<tr>
|
||||||
|
<th>Sample</th>
|
||||||
|
<th>Text</th>
|
||||||
|
<th>Audio</th>
|
||||||
|
</tr>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
<tr>
|
||||||
|
<td>01</td>
|
||||||
|
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample01-gt.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>02</td>
|
||||||
|
<td>Eure Tröte nervt</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample02-gt.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>03</td>
|
||||||
|
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample03-gt.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>04</td>
|
||||||
|
<td>Euer Plan hat ja toll geklappt.</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample04-gt.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
</tbody>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
</dl>
|
||||||
|
|
||||||
|
|
||||||
|
# Griffin Lim
|
||||||
|
> Model details: (todo: link)
|
||||||
|
> Tacotron2 + DDC: trained for 460k steps
|
||||||
|
|
||||||
|
<dl>
|
||||||
|
|
||||||
|
<table>
|
||||||
|
<thead>
|
||||||
|
<tr>
|
||||||
|
<th>Sample</th>
|
||||||
|
<th>Text</th>
|
||||||
|
<th>Audio</th>
|
||||||
|
</tr>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
<tr>
|
||||||
|
<td>01</td>
|
||||||
|
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample01-griffin-lim.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>02</td>
|
||||||
|
<td>Eure Tröte nervt</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample02-griffin-lim.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>03</td>
|
||||||
|
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample03-griffin-lim.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>04</td>
|
||||||
|
<td>Euer Plan hat ja toll geklappt.</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample04-griffin-lim.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>05</td>
|
||||||
|
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample05-griffin-lim.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
</tbody>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
</dl>
|
||||||
|
|
||||||
|
# ParallelWaveGAN
|
||||||
|
> Details: [Notebook by Olaf](https://colab.research.google.com/drive/15kJHTDTVxyIjxiZgqD1G_s5gUeVNLkfy?usp=sharing)
> Tacotron2 + DDC: trained for 360k steps, PWGAN vocoder: trained for 925k steps
|
||||||
|
<dl>
|
||||||
|
|
||||||
|
<table>
|
||||||
|
<thead>
|
||||||
|
<tr>
|
||||||
|
<th>Sample</th>
|
||||||
|
<th>Text</th>
|
||||||
|
<th>Audio</th>
|
||||||
|
</tr>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
<tr>
|
||||||
|
<td>01</td>
|
||||||
|
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample01-pwgan.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>02</td>
|
||||||
|
<td>Eure Tröte nervt</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample02-pwgan.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>03</td>
|
||||||
|
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample03-pwgan.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>04</td>
|
||||||
|
<td>Euer Plan hat ja toll geklappt.</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample04-pwgan.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>05</td>
|
||||||
|
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample05-pwgan.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
</tbody>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
</dl>
|
||||||
|
|
||||||
|
|
||||||
|
# WaveGrad
|
||||||
|
> Tacotron2 + DDC: trained for 460k steps, WaveGrad vocoder: trained for 510k steps (incl. noise schedule)
|
||||||
|
<dl>
|
||||||
|
|
||||||
|
<table>
|
||||||
|
<thead>
|
||||||
|
<tr>
|
||||||
|
<th>Sample</th>
|
||||||
|
<th>Text</th>
|
||||||
|
<th>Audio</th>
|
||||||
|
</tr>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
<tr>
|
||||||
|
<td>01</td>
|
||||||
|
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample01-wavegrad.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>02</td>
|
||||||
|
<td>Eure Tröte nervt</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample02-wavegrad.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>03</td>
|
||||||
|
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample03-wavegrad.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>04</td>
|
||||||
|
<td>Euer Plan hat ja toll geklappt.</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample04-wavegrad.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>05</td>
|
||||||
|
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample05-wavegrad.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
</tbody>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
</dl>
|
||||||
|
|
||||||
|
# HifiGAN
|
||||||
|
> Thanks to SanjaESC (https://github.com/SanjaESC) for training this model.
|
||||||
|
<dl>
|
||||||
|
<table>
|
||||||
|
<thead>
|
||||||
|
<tr>
|
||||||
|
<th>Sample</th>
|
||||||
|
<th>Text</th>
|
||||||
|
<th>Audio</th>
|
||||||
|
</tr>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
<tr>
|
||||||
|
<td>01</td>
|
||||||
|
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample01-hifigan.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>02</td>
|
||||||
|
<td>Eure Tröte nervt</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample02-hifigan.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>03</td>
|
||||||
|
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample03-hifigan.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>04</td>
|
||||||
|
<td>Euer Plan hat ja toll geklappt.</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample04-hifigan.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>05</td>
|
||||||
|
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample05-hifigan.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
</tbody>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
</dl>
|
||||||
|
|
||||||
|
# VocGAN
|
||||||
|
> **These examples are based on "ground truth" and not on the Tacotron 2 model**
> 200 epochs / 284k training steps
|
||||||
|
|
||||||
|
<dl>
|
||||||
|
|
||||||
|
<table>
|
||||||
|
<thead>
|
||||||
|
<tr>
|
||||||
|
<th>Sample</th>
|
||||||
|
<th>Text</th>
|
||||||
|
<th>Audio</th>
|
||||||
|
</tr>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
<tr>
|
||||||
|
<td>01</td>
|
||||||
|
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample01-vocgan.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>02</td>
|
||||||
|
<td>Eure Tröte nervt</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample02-vocgan.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>03</td>
|
||||||
|
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample03-vocgan.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>04</td>
|
||||||
|
<td>Euer Plan hat ja toll geklappt.</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample04-vocgan.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
</tbody>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
</dl>
|
||||||
|
|
||||||
|
# GlowTTS / Waveglow
|
||||||
|
> Details: [GitHub repository by Synesthesiam](https://github.com/rhasspy/de_larynx-thorsten)
> GlowTTS trained for 380k steps and the vocoder for 500k steps.
|
||||||
|
|
||||||
|
<dl>
|
||||||
|
|
||||||
|
<table>
|
||||||
|
<thead>
|
||||||
|
<tr>
|
||||||
|
<th>Sample</th>
|
||||||
|
<th>Text</th>
|
||||||
|
<th>Audio</th>
|
||||||
|
</tr>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
<tr>
|
||||||
|
<td>01</td>
|
||||||
|
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample01-waveglow.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>02</td>
|
||||||
|
<td>Eure Tröte nervt</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample02-waveglow.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>03</td>
|
||||||
|
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample03-waveglow.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>04</td>
|
||||||
|
<td>Euer Plan hat ja toll geklappt.</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample04-waveglow.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>05</td>
|
||||||
|
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample05-waveglow.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
</tbody>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
</dl>
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
# TensorFlowTTS
|
||||||
|
## Multiband MelGAN
|
||||||
|
> Thanks [Monatis](https://github.com/monatis)
|
||||||
|
> Details: [Notebook by Monatis](https://colab.research.google.com/drive/1W0nSFpsz32M0OcIkY9uMOiGrLTPKVhTy?usp=sharing#scrollTo=SCbWCChVkfnn)
> Taco2 model trained for 80k steps, Multiband MelGAN for 800k steps.
|
||||||
|
|
||||||
|
<dl>
|
||||||
|
|
||||||
|
<table>
|
||||||
|
<thead>
|
||||||
|
<tr>
|
||||||
|
<th>Sample</th>
|
||||||
|
<th>Text</th>
|
||||||
|
<th>Audio</th>
|
||||||
|
</tr>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
<tr>
|
||||||
|
<td>01</td>
|
||||||
|
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample01-TensorFlowTTS.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>02</td>
|
||||||
|
<td>Eure Tröte nervt</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample02-TensorFlowTTS.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>03</td>
|
||||||
|
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample03-TensorFlowTTS.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>04</td>
|
||||||
|
<td>Euer Plan hat ja toll geklappt.</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample04-TensorFlowTTS.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>05</td>
|
||||||
|
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample05-TensorFlowTTS.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
</tbody>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
</dl>
|
||||||
|
|
||||||
|
|
||||||
|
# Silero models
|
||||||
|
> Thanks [snakers4](https://github.com/snakers4/silero-models)
|
||||||
|
> Details: [Notebook by Silero](https://colab.research.google.com/github/snakers4/silero-models/blob/master/examples_tts.ipynb#scrollTo=indirect-berry)
|
||||||
|
|
||||||
|
<dl>
|
||||||
|
|
||||||
|
<table>
|
||||||
|
<thead>
|
||||||
|
<tr>
|
||||||
|
<th>Sample</th>
|
||||||
|
<th>Text</th>
|
||||||
|
<th>Audio</th>
|
||||||
|
</tr>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
<tr>
|
||||||
|
<td>01</td>
|
||||||
|
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample01-silero.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>02</td>
|
||||||
|
<td>Eure Tröte nervt</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample02-silero.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>03</td>
|
||||||
|
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample03-silero.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>04</td>
|
||||||
|
<td>Euer Plan hat ja toll geklappt.</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample04-silero.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>05</td>
|
||||||
|
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample05-silero.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
</tbody>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
</dl>
|
||||||
|
|
||||||
|
# Forward Tacotron
|
||||||
|
> Thanks [cschaefer26](https://github.com/as-ideas/ForwardTacotron)
|
||||||
|
> Config: Forward-Tacotron, trained to 300k, alpha set to 0.8, pretrained HifiGAN vocoder
|
||||||
|
|
||||||
|
<dl>
|
||||||
|
|
||||||
|
<table>
|
||||||
|
<thead>
|
||||||
|
<tr>
|
||||||
|
<th>Sample</th>
|
||||||
|
<th>Text</th>
|
||||||
|
<th>Audio</th>
|
||||||
|
</tr>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
<tr>
|
||||||
|
<td>01</td>
|
||||||
|
<td>Eure Schoko-Bonbons sind sagenhaft lecker</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample01-ForwardTacotron-HifiGAN.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>02</td>
|
||||||
|
<td>Eure Tröte nervt</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample02-ForwardTacotron-HifiGAN.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>03</td>
|
||||||
|
<td>Europa und Asien zusammengenommen wird auch als Eurasien bezeichnet</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample03-ForwardTacotron-HifiGAN.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>04</td>
|
||||||
|
<td>Euer Plan hat ja toll geklappt.</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample04-ForwardTacotron-HifiGAN.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>05</td>
|
||||||
|
<td>In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.</td>
|
||||||
|
<td><audio controls="" preload="none"><source src="samples/sample05-ForwardTacotron-HifiGAN.wav"></audio></td>
|
||||||
|
</tr>
|
||||||
|
</tbody>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
</dl>
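
Every table on this page repeats the same row pattern: a two-digit sample number, the spoken text, and an audio element pointing at `samples/sampleNN-<suffix>.wav`. As a rough sketch of that pattern (not the way the page was actually produced), the rows could be generated as follows. The function name `sample_rows` and its parameters are hypothetical.

```python
from html import escape
from typing import List


def sample_rows(texts: List[str], suffix: str) -> str:
    """Build the <tr> rows for one model, numbering samples from 01."""
    rows = []
    for number, text in enumerate(texts, start=1):
        # File naming follows the pattern used by the tables on this page.
        src = "samples/sample{:02d}-{}.wav".format(number, suffix)
        rows.append(
            "<tr>\n"
            "<td>{:02d}</td>\n"
            "<td>{}</td>\n"
            '<td><audio controls="" preload="none"><source src="{}"></audio></td>\n'
            "</tr>".format(number, escape(text), src)
        )
    return "\n".join(rows)


print(sample_rows(["Eure Schoko-Bonbons sind sagenhaft lecker", "Eure Tröte nervt"], "hifigan"))
```

Generating the rows from one list of texts would keep the sample sentences identical across all model sections.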
|
48
docs/index.md
Normal file
@ -0,0 +1,48 @@
|
|||||||
|
# Motivation
|
||||||
|
|
||||||
|
<span style="font-size:1.5em;font-weight:bold">
|
||||||
|
A free, high-quality German TTS voice that can be generated offline should be available to every project without licensing problems.
|
||||||
|
</span>
|
||||||
|
|
||||||
|
|
||||||
|
# No matter what background you come from:
* Private hobby project
* Open source/community project
* Education/research/science
* Commercial company
* ...
|
||||||
|
|
||||||
|
# No matter which area interests you:
* Smart voice assistants
* Navigation systems
* Smart homes
* Talking refrigerators
* Reading on-screen text aloud (accessibility)
* Interactive robotics
* ...
|
||||||
|
|
||||||
|
# Who we are
We are a small, motivated group of hobbyist TTS enthusiasts who named ourselves after a modified "Lord of the Rings" quote: "**Fellowership of free german tts**"
|
||||||
|
|
||||||
|
# Where we currently stand
We are still working on training even better models, but you can listen to the current "stable" state here:
|
||||||
|
* [Es ist im Moment klarer Himmel bei 18 Grad.](https://drive.google.com/file/d/1cDIq4QG6i60WjUYNT6fr2cpEjFQIi8w5/view?usp=sharing)
|
||||||
|
* [Ich verstehe das nicht, aber ich lerne jeden Tag neue Dinge.](https://drive.google.com/file/d/1kja_2RsFt6EmC33HTB4ozJyFlvh_DTFQ/view?usp=sharing)
|
||||||
|
* [Ich bin jetzt bereit.](https://drive.google.com/file/d/1GkplGH7LMJcPDpgFJocXHCjRln_ccVFs/view?usp=sharing)
|
||||||
|
* [Bitte warte einen Moment, bis ich fertig mit dem Booten bin.](https://drive.google.com/file/d/19Td-F14n_05F-squ3bNlt2BDE-NMFaq1/view?usp=sharing)
|
||||||
|
* [Mein Name ist Mycroft und ich bin funky.](https://drive.google.com/file/d/1dbyOyE7Oy8YdAsYqQ4vz4VJjiWIyc8oV/view?usp=sharing)
|
||||||
|
|
||||||
|
|
||||||
|
## Comparison of some vocoders
We are currently experimenting with different configurations to find the best model. A comparison of the results so far is available on this page.
> [Comparison of the different models](./audio_compare)
|
||||||
|
|
||||||
|
# Interested?
[Further details, downloads and acknowledgements can be found here.](https://github.com/thorstenMueller/deep-learning-german-tts "Dataset details and Thorsten model download")
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<span style="font-size:1.5em;font-weight:bold">
|
||||||
|
We wish you lots of fun and success with your projects :-)
|
||||||
|
</span>
|
BIN
docs/samples/sample01-ForwardTacotron-HifiGAN.wav
Normal file
BIN
docs/samples/sample01-TensorFlowTTS.wav
Normal file
BIN
docs/samples/sample01-griffin-lim.wav
Normal file
BIN
docs/samples/sample01-gt.wav
Normal file
BIN
docs/samples/sample01-hifigan.wav
Normal file
BIN
docs/samples/sample01-pwgan.wav
Normal file
BIN
docs/samples/sample01-silero.wav
Normal file
BIN
docs/samples/sample01-vocgan.wav
Normal file
BIN
docs/samples/sample01-waveglow.wav
Normal file
BIN
docs/samples/sample01-wavegrad.wav
Normal file
BIN
docs/samples/sample02-ForwardTacotron-HifiGAN.wav.wav
Normal file
BIN
docs/samples/sample02-TensorFlowTTS.wav
Normal file
BIN
docs/samples/sample02-griffin-lim.wav
Normal file
BIN
docs/samples/sample02-gt.wav
Normal file
BIN
docs/samples/sample02-hifigan.wav
Normal file
BIN
docs/samples/sample02-pwgan.wav
Normal file
BIN
docs/samples/sample02-silero.wav
Normal file
BIN
docs/samples/sample02-vocgan.wav
Normal file
BIN
docs/samples/sample02-waveglow.wav
Normal file
BIN
docs/samples/sample02-wavegrad.wav
Normal file
BIN
docs/samples/sample03-ForwardTacotron-HifiGAN.wav
Normal file
BIN
docs/samples/sample03-TensorFlowTTS.wav
Normal file
BIN
docs/samples/sample03-griffin-lim.wav
Normal file
BIN
docs/samples/sample03-gt.wav
Normal file
BIN
docs/samples/sample03-hifigan.wav
Normal file
BIN
docs/samples/sample03-pwgan.wav
Normal file
BIN
docs/samples/sample03-silero.wav
Normal file
BIN
docs/samples/sample03-vocgan.wav
Normal file
BIN
docs/samples/sample03-waveglow.wav
Normal file
BIN
docs/samples/sample03-wavegrad.wav
Normal file
BIN
docs/samples/sample04-ForwardTacotron-HifiGAN.wav.wav
Normal file
BIN
docs/samples/sample04-TensorFlowTTS.wav
Normal file
BIN
docs/samples/sample04-griffin-lim.wav
Normal file
BIN
docs/samples/sample04-gt.wav
Normal file
BIN
docs/samples/sample04-hifigan.wav
Normal file
BIN
docs/samples/sample04-pwgan.wav
Normal file
BIN
docs/samples/sample04-silero.wav
Normal file
BIN
docs/samples/sample04-vocgan.wav
Normal file
BIN
docs/samples/sample04-waveglow.wav
Normal file
BIN
docs/samples/sample04-wavegrad.wav
Normal file
BIN
docs/samples/sample05-ForwardTacotron-HifiGAN.wav
Normal file
BIN
docs/samples/sample05-TensorFlowTTS.wav
Normal file
BIN
docs/samples/sample05-griffin-lim.wav
Normal file
BIN
docs/samples/sample05-hifigan.wav
Normal file
BIN
docs/samples/sample05-pwgan.wav
Normal file
BIN
docs/samples/sample05-silero.wav
Normal file
BIN
docs/samples/sample05-waveglow.wav
Normal file
BIN
docs/samples/sample05-wavegrad.wav
Normal file
23304
german_corpus-mimic_recording_studio.csv
Normal file
51
helperScripts/Dockerfile.Jetson-Coqui
Normal file
@ -0,0 +1,51 @@
|
|||||||
|
# Dockerfile for running Coqui TTS training in a Docker container on the NVIDIA Jetson platform.
|
||||||
|
# Based on the NVIDIA Jetson ML image; provided as is, without any warranty, by Thorsten Müller (https://twitter.com/ThorstenVoice) in August 2021
|
||||||
|
|
||||||
|
FROM nvcr.io/nvidia/l4t-ml:r32.5.0-py3
|
||||||
|
|
||||||
|
RUN echo "deb https://repo.download.nvidia.com/jetson/common r32.4 main" >> /etc/apt/sources.list.d/nvidia-l4t-apt-source.list
|
||||||
|
RUN echo "deb https://repo.download.nvidia.com/jetson/t194 r32.4 main" >> /etc/apt/sources.list.d/nvidia-l4t-apt-source.list
|
||||||
|
|
||||||
|
RUN apt-get update -y
|
||||||
|
RUN apt-get install vim python-mecab libmecab-dev cuda-toolkit-10-2 libcudnn8 libcudnn8-dev libsndfile1-dev locales -y
|
||||||
|
|
||||||
|
# Setting some environment vars
|
||||||
|
ENV LLVM_CONFIG=/usr/bin/llvm-config-9
|
||||||
|
ENV PYTHONPATH=/coqui/TTS/
|
||||||
|
ENV LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH
|
||||||
|
# Leaving OPENBLAS_CORETYPE unset might cause an "Illegal instruction (core dumped)" error
|
||||||
|
ENV OPENBLAS_CORETYPE=ARMV8
|
||||||
|
|
||||||
|
ENV NVIDIA_VISIBLE_DEVICES all
|
||||||
|
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
|
||||||
|
LABEL com.nvidia.volumes.needed="nvidia_driver"
|
||||||
|
|
||||||
|
# Adjust locale setting to your personal needs
|
||||||
|
RUN sed -i '/de_DE.UTF-8/s/^# //g' /etc/locale.gen && \
    locale-gen
|
||||||
|
ENV LANG de_DE.UTF-8
|
||||||
|
ENV LANGUAGE de_DE:de
|
||||||
|
ENV LC_ALL de_DE.UTF-8
|
||||||
|
|
||||||
|
RUN mkdir /coqui
|
||||||
|
WORKDIR /coqui
|
||||||
|
|
||||||
|
ARG COQUI_BRANCH
|
||||||
|
RUN git clone -b ${COQUI_BRANCH} https://github.com/coqui-ai/TTS.git
|
||||||
|
WORKDIR /coqui/TTS
|
||||||
|
RUN pip3 install pip setuptools wheel --upgrade
|
||||||
|
RUN pip uninstall -y tensorboard tensorflow tensorflow-estimator nbconvert matplotlib
|
||||||
|
RUN pip install -r requirements.txt
|
||||||
|
RUN python3 ./setup.py develop
|
||||||
|
|
||||||
|
# Jupyter Notebook
|
||||||
|
RUN python3 -c "from notebook.auth.security import set_password; set_password('nvidia', '/root/.jupyter/jupyter_notebook_config.json')"
|
||||||
|
CMD /bin/bash -c "jupyter lab --ip 0.0.0.0 --port 8888 --allow-root"
|
||||||
|
|
||||||
|
|
||||||
|
# Build example:
|
||||||
|
# nvidia-docker build . -f Dockerfile.Jetson-Coqui --build-arg COQUI_BRANCH=v0.1.3 -t jetson-coqui
|
||||||
|
# Run example:
|
||||||
|
# nvidia-docker run -p 8888:8888 -d --shm-size 32g --gpus all -v /ssd/___prj/tts/dataset-july21:/coqui/TTS/data jetson-coqui
|
||||||
|
# Bash example:
|
||||||
|
# nvidia-docker exec -it <containerId> /bin/bash
|
157
helperScripts/MRS2LJSpeech.py
Normal file
@ -0,0 +1,157 @@
|
|||||||
|
# This script generates the folder structure for ljspeech-1.1 processing from mimic-recording-studio database
|
||||||
|
|
||||||
|
# Changelog
|
||||||
|
# v1.0 - Initial release by Thorsten Müller (https://github.com/thorstenMueller/deep-learning-german-tts)
|
||||||
|
# v1.1 - Great improvements by Peter Schmalfeldt (https://github.com/manifestinteractive)
|
||||||
|
# - Audio processing with ffmpeg (mono, sample rate of 22050 Hz)
|
||||||
|
# - Much better Python coding than my original version
|
||||||
|
# - Greater logging output to command line
|
||||||
|
# - See more details here: https://gist.github.com/manifestinteractive/6fd9be62d0ede934d4e1171e5e751aba
|
||||||
|
# - Thanks Peter, it's a great contribution :-)
|
||||||
|
# v1.2 - Added choice for choosing which recording session should be exported as LJSpeech
|
||||||
|
# v1.3 - Added parameter mrs_dir to pass directory of Mimic-Recording-Studio
|
||||||
|
# v1.4 - Script won't crash when audio recorded has been deleted on disk
|
||||||
|
# v1.5 - Added parameter "ffmpeg" to make converting with ffmpeg optional
|
||||||
|
|
||||||
|
from os.path import exists  # public location of exists(); genericpath is an internal module
import glob
import sqlite3
import os
import argparse
import sys

from shutil import copyfile
from shutil import rmtree
|
||||||
|
|
||||||
|
# Setup Directory Data
|
||||||
|
cwd = os.path.dirname(os.path.abspath(__file__))
|
||||||
|
output_dir = os.path.join(cwd, "dataset")
|
||||||
|
output_dir_audio = ""
|
||||||
|
output_dir_audio_temp=""
|
||||||
|
output_dir_speech = ""
|
||||||
|
|
||||||
|
# Create folders needed for ljspeech
def create_folders():
    global output_dir
    global output_dir_audio
    global output_dir_audio_temp
    global output_dir_speech

    print('→ Creating Dataset Folders')

    output_dir_speech = os.path.join(output_dir, "LJSpeech-1.1")

    # Delete existing folder if exists for clean run
    if os.path.exists(output_dir_speech):
        rmtree(output_dir_speech)

    output_dir_audio = os.path.join(output_dir_speech, "wavs")
    output_dir_audio_temp = os.path.join(output_dir_speech, "temp")

    # Create Clean Folders
    os.makedirs(output_dir_speech)
    os.makedirs(output_dir_audio)
    os.makedirs(output_dir_audio_temp)
|
||||||
|
|
||||||
|
def convert_audio():
    global output_dir_audio
    global output_dir_audio_temp

    recordings = len([name for name in os.listdir(output_dir_audio_temp) if os.path.isfile(os.path.join(output_dir_audio_temp, name))])

    print('→ Converting %s Audio Files to 22050 Hz, 16 Bit, Mono\n' % "{:,}".format(recordings))

    # Please use `pip install ffmpeg-python`
    # Imported here so the package is only required when conversion is requested
    import ffmpeg

    for idx, wav in enumerate(glob.glob(os.path.join(output_dir_audio_temp, "*.wav"))):

        percent = (idx + 1) / recordings

        print('› \033[96m%s\033[0m \033[2m%s / %s (%s)\033[0m ' % (os.path.basename(wav), "{:,}".format((idx + 1)), "{:,}".format(recordings), "{:.0%}".format(percent)))

        # Convert WAV file to required format
        (ffmpeg
            .input(wav)
            .output(os.path.join(output_dir_audio, os.path.basename(wav)), acodec='pcm_s16le', ac=1, ar=22050, loglevel='error')
            .overwrite_output()
            .run(capture_stdout=True)
        )
|
||||||
|
|
||||||
|
|
||||||
|
def copy_audio():
    global output_dir_audio
    global output_dir_audio_temp

    # This path does not call ffmpeg: recordings are copied unchanged
    print('→ Skipping ffmpeg conversion, copying recordings as recorded')
    recordings = len([name for name in os.listdir(output_dir_audio_temp) if os.path.isfile(os.path.join(output_dir_audio_temp, name))])

    print('→ Copy %s Audio Files to LJSpeech Dataset\n' % "{:,}".format(recordings))

    for idx, wav in enumerate(glob.glob(os.path.join(output_dir_audio_temp, "*.wav"))):
        copyfile(wav, os.path.join(output_dir_audio, os.path.basename(wav)))
|
||||||
|
|
||||||
|
def create_meta_data(mrs_dir):
    print('→ Creating META Data')

    conn = sqlite3.connect(os.path.join(mrs_dir, "backend", "db", "mimicstudio.db"))
    c = conn.cursor()

    # Create metadata.csv for ljspeech
    metadata = open(os.path.join(output_dir_speech, "metadata.csv"), mode="w", encoding="utf8")

    # List available recording sessions (newest first)
    user_models = c.execute('SELECT uuid, user_name from usermodel ORDER BY created_date DESC').fetchall()
    user_id = user_models[0][0]

    for row in user_models:
        print(row[0] + ' -> ' + row[1])

    user_answer = input('Please choose ID of recording session to export (default is newest session) [' + user_id + ']: ')

    if user_answer:
        user_id = user_answer

    # Bind user_id as a query parameter instead of concatenating it into the SQL string
    for row in c.execute('SELECT audio_id, prompt, lower(prompt) FROM audiomodel WHERE user_id = ? ORDER BY length(prompt)', (user_id,)):
        source_file = os.path.join(mrs_dir, "backend", "audio_files", user_id, row[0] + ".wav")
        if exists(source_file):
            # One line per recording: audio_id|prompt|lowercased prompt
            metadata.write(row[0] + "|" + row[1] + "|" + row[2] + "\n")
            copyfile(source_file, os.path.join(output_dir_audio_temp, row[0] + ".wav"))
        else:
            print("Wave file {} not found.".format(source_file))

    metadata.close()
    conn.close()
|
||||||
|
|
||||||
|
def cleanup():
    global output_dir_audio_temp

    # Remove Temp Folder
    rmtree(output_dir_audio_temp)
|
||||||
|
|
||||||
|
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--mrs_dir', required=True)
    # Parse the value explicitly: without a type, any non-empty string (even "False") counts as true
    parser.add_argument('--ffmpeg', required=False, default=False,
                        type=lambda value: value.lower() in ('1', 'true', 'yes'))
    args = parser.parse_args()

    if not os.path.isdir(os.path.join(args.mrs_dir, "backend")):
        sys.exit("Passed directory is no valid Mimic-Recording-Studio main directory!")

    print('\n\033[48;5;22m  MRS to LJ Speech Processor  \033[0m\n')

    create_folders()
    create_meta_data(args.mrs_dir)

    if args.ffmpeg:
        convert_audio()
    else:
        copy_audio()

    cleanup()

    print('\n\033[38;5;86;1m✔\033[0m COMPLETE【ツ】\n')
|
||||||
|
|
||||||
|
if __name__ == '__main__':
    main()
|
27
helperScripts/README.md
Normal file
@ -0,0 +1,27 @@
|
|||||||
|
# Short collection of helpful scripts for dataset creation and/or TTS training stuff
|
||||||
|
|
||||||
|
## MRS2LJSpeech
|
||||||
|
Python script that takes recordings (file system and SQLite database) made with Mycroft Mimic-Recording-Studio (https://github.com/MycroftAI/mimic-recording-studio) and creates an audio-optimized dataset in the widely supported LJSpeech directory structure.
|
||||||
|
|
||||||
|
Peter Schmalfeldt (https://github.com/manifestinteractive) did an amazing job optimizing my original (quick-and-dirty) version of that script, so thank you, Peter :-)
|
||||||
|
See more details here: https://gist.github.com/manifestinteractive/6fd9be62d0ede934d4e1171e5e751aba#file-mrs2ljspeech-py
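
As a rough illustration of the LJSpeech layout the script produces, the sketch below parses one line of `metadata.csv` in the `audio_id|prompt|lowercased prompt` form that MRS2LJSpeech.py writes. The names `MetadataEntry` and `parse_metadata_line` and the sample id `abc123` are hypothetical, and prompts containing a `|` character are assumed not to occur.

```python
from typing import NamedTuple


class MetadataEntry(NamedTuple):
    audio_id: str    # file name in wavs/ without the .wav extension
    text: str        # prompt as recorded
    normalized: str  # lowercased prompt, as written by MRS2LJSpeech.py


def parse_metadata_line(line: str) -> MetadataEntry:
    """Split one 'id|text|normalized' line into its three fields."""
    audio_id, text, normalized = line.rstrip("\n").split("|")
    return MetadataEntry(audio_id, text, normalized)


entry = parse_metadata_line("abc123|Eure Tröte nervt|eure tröte nervt\n")
print(entry.audio_id, "->", entry.text)
```

Each entry then corresponds to the file `wavs/<audio_id>.wav` inside the `LJSpeech-1.1` folder.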
|
||||||
|
|
||||||
|
## Dockerfile.Jetson-Coqui
|
||||||
|
> Add your user to `docker` group to not require sudo on all operations.
|
||||||
|
|
||||||
|
Thanks to NVIDIA for providing Docker images for the Jetson platform. I use the "machine learning (ML)" image as the base image for setting up a Coqui environment.
|
||||||
|
|
||||||
|
> You can use any branch or tag as COQUI_BRANCH argument. v0.1.3 is just the current stable version.
|
||||||
|
|
||||||
|
Switch to the directory containing the Dockerfile and run `nvidia-docker build . -f Dockerfile.Jetson-Coqui --build-arg COQUI_BRANCH=v0.1.3 -t jetson-coqui` to build your container image. When the build has finished, you can start a container from that image.
|
||||||
|
|
||||||
|
|
||||||
|
### Mapped volumes
|
||||||
|
We need to bring your dataset and configuration file into the container, so we map a volume when starting it:
`nvidia-docker run -p 8888:8888 -d --shm-size 32g --gpus all -v [host path with dataset and config.json]:/coqui/TTS/data jetson-coqui`. Now we have a running container ready for Coqui TTS magic.
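
Before starting a training run it can help to confirm that the mapped volume really contains the dataset. The sketch below is one possible check, run inside the container. It assumes that `metadata.csv` and the `wavs` folder sit directly in the mounted directory `/coqui/TTS/data`, and the function name `missing_dataset_parts` is hypothetical.

```python
import os
from typing import List


def missing_dataset_parts(data_dir: str) -> List[str]:
    """Return the expected LJSpeech entries that are absent from data_dir."""
    missing = []
    if not os.path.isfile(os.path.join(data_dir, "metadata.csv")):
        missing.append("metadata.csv")
    if not os.path.isdir(os.path.join(data_dir, "wavs")):
        missing.append("wavs/")
    return missing


if __name__ == "__main__":
    # /coqui/TTS/data is the container-side mount point of the mapped volume.
    print(missing_dataset_parts("/coqui/TTS/data") or "dataset layout looks complete")
```

An empty result means both entries were found; if the dataset lives in a subfolder of the mount, pass that path instead.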
|
||||||
|
|
||||||
|
### Jupyter notebook
|
||||||
|
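Coqui provides lots of useful Jupyter notebooks for dataset analysis. Once your container is up and running, you should be able to reach Jupyter Lab in a browser on port 8888 of the host (the port mapped by the `run` command; the Dockerfile sets the Jupyter password to `nvidia`).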
|
||||||
|
|
||||||
|
### Running bash into container
|
||||||
|
Run `nvidia-docker exec -it <containerId> /bin/bash` (`exec` expects the ID or name of the running container, not the image name `jetson-coqui`). Now you're inside the container, and `ls /coqui/TTS/data` should show your dataset files.
|
41
helperScripts/getDatasetSpeechRate.py
Normal file
@ -0,0 +1,41 @@
|
|||||||
|
# This script gets speech rate per audio recording from a voice dataset (ljspeech file and directory structure)
|
||||||
|
# Written by Thorsten Müller (deep-learning-german@gmx.net) and provided without any warranty.
|
||||||
|
# https://github.com/thorstenMueller/deep-learning-german-tts/
|
||||||
|
# https://twitter.com/ThorstenVoice
|
||||||
|
|
||||||
|
# Changelog:
|
||||||
|
# v0.1 - 26.09.2021 - Initial version
|
||||||
|
|
||||||
|
from os.path import exists  # public location of exists(); genericpath is an internal module
import os
import librosa
import csv
|
||||||
|
|
||||||
|
dataset_dir = "/home/thorsten/___dev/tts/dataset/Thorsten-neutral-Dec2021-44k/" # Directory where metadata.csv is in
|
||||||
|
out_csv_file = os.path.join(dataset_dir,"speech_rate_report.csv")
|
||||||
|
decimal_use_comma = True # False: Splitting decimal value with a dot (.); True: Comma (,)
|
||||||
|
|
||||||
|
out_csv = open(out_csv_file,"w")
|
||||||
|
out_csv.write("filename;audiolength_sec;number_chars;chars_per_sec;remove_from_dataset\n")
|
||||||
|
|
||||||
|
# Open metadata.csv file
|
||||||
|
with open(os.path.join(dataset_dir,"metadata.csv")) as csvfile:
|
||||||
|
reader = csv.reader(csvfile, delimiter='|')
|
||||||
|
for row in reader:
|
||||||
|
wav_file = os.path.join(dataset_dir,"wavs",row[0] + ".wav")
|
||||||
|
|
||||||
|
if exists(wav_file):
|
||||||
|
# Gather values for report.csv output
|
||||||
|
phrase_len = len(row[1]) - 1 # Do not count punctuation marks.
|
||||||
|
duration = round(librosa.get_duration(filename=wav_file),2)
|
||||||
|
char_per_sec = round(phrase_len / duration,2)
|
||||||
|
|
||||||
|
if decimal_use_comma:
|
||||||
|
duration = str(duration).replace(".",",")
|
||||||
|
char_per_sec = str(char_per_sec).replace(".",",")
|
||||||
|
|
||||||
|
out_csv.write(row[0] + ".wav;" + str(duration) + ";" + str(phrase_len) + ";" + str(char_per_sec) + ";no\n")
|
||||||
|
else:
|
||||||
|
print("File " + wav_file + " does not exist.")
|
||||||
|
|
||||||
|
out_csv.close()
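
The report writes `no` into the `remove_from_dataset` column for every row, so recordings to drop have to be marked `yes` afterwards. The following is a minimal sketch of one way to pre-fill that column automatically; it is not part of the original scripts. The function name `flag_outliers`, the `max_dev` parameter, and the rule itself (flag rows whose `chars_per_sec` is more than `max_dev` standard deviations from the mean) are assumptions chosen for illustration.

```python
import csv
import statistics


def flag_outliers(in_path, out_path, max_dev=2.0):
    """Copy the speech rate report, marking speech rate outliers with "yes".

    Returns the number of rows that were flagged. The outlier rule is an
    illustrative assumption, not the original author's method.
    """
    with open(in_path, newline="") as f:
        rows = list(csv.reader(f, delimiter=";"))
    if len(rows) < 2:
        return 0  # header only or empty file: nothing to flag
    header, data = rows[0], rows[1:]

    # Column 3 is chars_per_sec; accept both decimal comma and decimal dot.
    rates = [float(row[3].replace(",", ".")) for row in data]
    mean = statistics.mean(rates)
    stdev = statistics.pstdev(rates)

    flagged = 0
    for row, rate in zip(data, rates):
        if stdev > 0 and abs(rate - mean) > max_dev * stdev:
            row[4] = "yes"  # column 4 is remove_from_dataset
            flagged += 1

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter=";", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(data)
    return flagged
```

Because the input is read completely before anything is written, `in_path` and `out_path` may point to the same file. Reviewing the flagged rows by ear before removing anything is still advisable, since an unusual speech rate is only a hint that a recording is bad.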
48
helperScripts/removeFilesFromDataset.py
Normal file
@ -0,0 +1,48 @@
# This script removes recordings from an ljspeech file/directory structured dataset based on CSV file from "getDatasetSpeechRate"
# Written by Thorsten Müller (deep-learning-german@gmx.net) and provided without any warranty.
# https://github.com/thorstenMueller/deep-learning-german-tts/
# https://twitter.com/ThorstenVoice

# Changelog:
# v0.1 - 26.09.2021 - Initial version

import os
import csv
import shutil

dataset_dir = "/Users/thorsten/Downloads/thorsten-export-20210909/" # Directory where metadata.csv is in
subfolder_removed = "___removed"
in_csv_file = os.path.join(dataset_dir,"speech_rate_report.csv")
to_remove = []

# Open speech rate report (CSV file written by "getDatasetSpeechRate")
with open(in_csv_file) as csvfile:
    reader = csv.reader(csvfile, delimiter=';')
    for row in reader:
        if row[4] == "yes":
            # Recording in that row should be removed from dataset
            to_remove.append(row[0])
            print("Recording " + row[0] + " will be removed from dataset.")

print("\n" + str(len(to_remove)) + " recordings have been marked for deletion.")

if len(to_remove) > 0:

    metadata_cleaned = open(os.path.join(dataset_dir,"metadata_cleaned.csv"),"w")

    # Create new subdirectory for removed wav files
    removed_dir = os.path.join(dataset_dir,subfolder_removed)
    if not os.path.exists(removed_dir):
        os.makedirs(removed_dir)

    # Write kept lines to metadata_cleaned.csv and move removed wav files to new subdirectory
    with open(os.path.join(dataset_dir,"metadata.csv")) as csvfile:
        reader = csv.reader(csvfile, delimiter='|')
        for row in reader:
            if (row[0] + ".wav") not in to_remove:
                metadata_cleaned.write(row[0] + "|" + row[1] + "|" + row[2] + "\n")
            else:
                # Move recording to new subfolder
                shutil.move(os.path.join(dataset_dir,"wavs",row[0] + ".wav"),removed_dir)

    metadata_cleaned.close()
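
After a cleanup run it can be useful to confirm that `metadata_cleaned.csv` and the `wavs` directory still agree. The following is a minimal sketch under the same ljspeech layout the script above assumes; the function name `check_cleaned_dataset` is hypothetical and not part of the original scripts.

```python
import csv
import os


def check_cleaned_dataset(dataset_dir, metadata_name="metadata_cleaned.csv"):
    """Compare the ids in the metadata file with the wav files on disk.

    Returns a tuple of two sorted lists:
    (ids listed in the metadata without a wav file,
     wav files that are not listed in the metadata).
    """
    with open(os.path.join(dataset_dir, metadata_name), newline="") as f:
        # QUOTE_NONE keeps quotation marks inside the transcript text literal.
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        listed = {row[0] for row in reader if row}

    wav_dir = os.path.join(dataset_dir, "wavs")
    on_disk = {
        os.path.splitext(name)[0]
        for name in os.listdir(wav_dir)
        if name.endswith(".wav")
    }
    return sorted(listed - on_disk), sorted(on_disk - listed)
```

Two empty lists mean every remaining metadata line has its recording and no stray wav file was left behind.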