- [Project motivation](#motivation-for-thorsten-voice-project-speaking_head-speech_balloon)
- [Personal note](#some-personal-words-before-using-thorsten-voice)
- [**Thorsten** Voice Datasets](#voice-datasets)
  - [Thorsten-21.02-neutral](#thorsten-2102-neutral)
  - [Thorsten-21.06-emotional](#thorsten-2106-emotional)
  - [Thorsten-22.05-neutral](#thorsten-2205-neutral)
- [**Thorsten** TTS-Models](#tts-models)
  - [Thorsten-21.04-Tacotron2-DCA](#thorsten-2104-tacotron2-dca)
  - [Thorsten-22.05-VITS](#thorsten-2205-vits)
  - [Thorsten-22.05-Tacotron2-DDC](#thorsten-2205-tacotron2-ddc)
  - [Other models](#other-models)
- [Public talks](#public-talks)
- [My Youtube channel](#youtube-channel)
- [Special Thanks](#thanks-section)
# Motivation for Thorsten-Voice project :speaking_head: :speech_balloon:
A **free**-to-use, **offline**-working, **high-quality** **German** **TTS** voice should be available for every project without any licensing struggles.

[![Open Source](https://badges.frapsoft.com/os/v1/open-source.svg?v=103)](https://opensource.org/)
<a href="https://twitter.com/intent/follow?screen_name=ThorstenVoice"><img src="https://img.shields.io/twitter/follow/ThorstenVoice?style=social&logo=twitter" alt="follow on Twitter"></a>
[![YouTube Channel Subscribers](https://img.shields.io/youtube/channel/subscribers/UCjqqTVVBTsxpm0iOhQ1fp9g?style=social)](https://www.youtube.com/c/ThorstenMueller)
[![Project website](https://img.shields.io/badge/Project_website-www.Thorsten--Voice.de-92a0c0)](https://www.Thorsten-Voice.de)
# Some personal words before using **Thorsten-Voice**
> I contribute my voice as a person believing in a world where all people are equal. No matter of gender, sexual orientation, religion, skin color and geocoordinates of birth location. A global world where everybody is warmly welcome on any place on this planet and open and free knowledge and education is available to everyone. :earth_africa: (*Thorsten Müller*)

Please keep in mind that **I am not a professional voice talent**. I'm just a normal guy sharing his voice with the world.
# Voice-Datasets
Voice datasets are listed on Zenodo:

| Dataset | DOI Link |
| --------------- | ------- |
| Thorsten-21.02-neutral | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5525342.svg)](https://doi.org/10.5281/zenodo.5525342) |
| Thorsten-21.06-emotional | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5525023.svg)](https://doi.org/10.5281/zenodo.5525023) |
| Thorsten-22.05-neutral | soon to come |
## Thorsten-21.02-neutral
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5525342.svg)](https://doi.org/10.5281/zenodo.5525342)
```
@dataset{muller_thorsten_2021_5525342,
  author       = {Müller, Thorsten and
                  Kreutz, Dominik},
  title        = {Thorsten-Voice - "Thorsten-21.02-neutral" Dataset},
  month        = feb,
  year         = 2021,
  note         = {{Please use it to make the world a better place for
                   whole humankind.}},
  publisher    = {Zenodo},
  version      = {3.0},
  doi          = {10.5281/zenodo.5525342},
  url          = {https://doi.org/10.5281/zenodo.5525342}
}
```
> :speaking_head: **Listen to some audio recordings from this dataset [here](https://drive.google.com/drive/folders/1KVjGXG2ij002XRHb3fgFK4j0OEq1FsWm?usp=sharing).**
### Dataset summary
* Recorded by Thorsten Müller
* Optimized by Dominik Kreutz
* LJSpeech file and directory structure
* 22,668 recorded phrases (*wav files*)
* More than 23 hours of pure audio
* Sample rate: 22,050 Hz
* Mono
* Normalized to -24 dB
* Phrase length (min/avg/max): 2 / 52 / 180 chars
* No silence at beginning/ending
* Avg spoken chars per second: 14
* Sentences with question mark: 2,780
* Sentences with exclamation mark: 1,840
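Because the dataset follows the LJSpeech layout, figures like the phrase lengths above can be recomputed from the metadata file. The following is only a minimal sketch: it assumes pipe-separated rows with the file id in the first field and the transcript in the last field, and the helper name `phrase_stats` and the sample rows are made up for illustration.

```python
from statistics import mean


def phrase_stats(metadata_lines):
    """Return (min, avg, max) transcript length in characters."""
    lengths = []
    for line in metadata_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        # Assumption: LJSpeech-style row, id first, transcript in the last field.
        fields = line.split("|")
        lengths.append(len(fields[-1]))
    return min(lengths), mean(lengths), max(lengths)


# Hypothetical sample rows, not taken from the real dataset.
sample = [
    "clip_0001|Hallo Welt.",
    "clip_0002|Mist, wieder nichts geschafft.",
]
print(phrase_stats(sample))
```

For the real dataset, the same function can be fed an open file handle, for example `open("metadata.csv", encoding="utf-8")`, assuming that is where the metadata lives in your download.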
### Dataset evolution
As described in the PDF document ([evolution of thorsten dataset](./EvolutionOfThorstenDataset.pdf)), this dataset consists of three recording phases.

* **Phase 1**: Recorded with a cheap USB microphone (*low quality*)
* **Phase 2**: Recorded with a good microphone (*good quality*)
* **Phase 3**: Recorded with the same good microphone but longer phrases (> 100 chars) (*good quality*)

If you want to use only a subset of the dataset, the [recording quality](./RecordingQuality.csv) CSV file shows which files belong to which recording phase.
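Selecting a subset by recording phase could look roughly like the sketch below. The column layout of `RecordingQuality.csv` is not described here, so the column names `file` and `phase`, the helper `files_for_phase` and the example rows are all hypothetical; adjust them to the actual header of the file.

```python
import csv
import io


def files_for_phase(csv_text, phase, file_col="file", phase_col="phase"):
    """Return the file names whose phase column equals the given phase."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[file_col] for row in reader if row[phase_col] == str(phase)]


# Hypothetical CSV content; the real file may use different columns.
example = "file,phase\na.wav,1\nb.wav,2\nc.wav,2\n"
print(files_for_phase(example, 2))
```

The resulting list can then be used to filter the metadata rows before training.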
## Thorsten-21.06-emotional
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5525023.svg)](https://doi.org/10.5281/zenodo.5525023)
```
@dataset{muller_thorsten_2021_5525023,
  author       = {Müller, Thorsten and
                  Kreutz, Dominik},
  title        = {{Thorsten-Voice - "Thorsten-21.06-emotional"
                   Dataset}},
  month        = jun,
  year         = 2021,
  note         = {{Please use it to make the world a better place for
                   whole humankind.}},
  publisher    = {Zenodo},
  version      = {2.0},
  doi          = {10.5281/zenodo.5525023},
  url          = {https://doi.org/10.5281/zenodo.5525023}
}
```
All emotional recordings were recorded by me. I tried to feel and express each emotion even when the content of the phrase did not match it. For example, I spoke the sleepy recordings in the tone I have shortly before falling asleep.
### Samples
Listen to the phrase "**Mist, wieder nichts geschafft.**" (roughly: "Damn, got nothing done again.") in the following emotions:
* :slightly_smiling_face: [Neutral](./samples/thorsten-21.06-emotional/neutral.wav)
* :nauseated_face: [Disgusted](./samples/thorsten-21.06-emotional/disgusted.wav)
* :angry: [Angry](./samples/thorsten-21.06-emotional/angry.wav)
* :grinning: [Amused](./samples/thorsten-21.06-emotional/amused.wav)
* :astonished: [Surprised](./samples/thorsten-21.06-emotional/surprised.wav)
* :pensive: [Sleepy](./samples/thorsten-21.06-emotional/sleepy.wav)
* :dizzy_face: [Drunk](./samples/thorsten-21.06-emotional/drunk.wav)
* 🤫 [Whispering](./samples/thorsten-21.06-emotional/whisper.wav)
### Dataset summary
* Recorded by Thorsten Müller
* Optimized by Dominik Kreutz
* 300 sentences × 8 emotions = 2,400 recordings
* Mono
* Sample rate: 22,050 Hz
* Normalized to -24 dB
* No silence at beginning/ending
* Sentence length: 59 - 148 chars
## Thorsten-22.05-neutral
> :speaking_head: **Listen to some audio recordings from this dataset [here](https://drive.google.com/drive/folders/1dxoSo8Ktmh-5E0rSVqkq_Jm1r4sFnwJM?usp=sharing).**

Soon to come
# TTS Models
## Thorsten-21.04-Tacotron2-DCA
This [TTS-model](https://drive.google.com/drive/folders/1m4RuffbvdOmQWnmy_Hmw0cZ_q0hj2o8B?usp=sharing) was trained on the [**Thorsten-21.02-neutral**](#thorsten-2102-neutral) dataset. The recommended trained Fullband-MelGAN vocoder can be downloaded [here](https://drive.google.com/drive/folders/1hsfaconm4Yd9wPVyOtrXjWQs4ZAPoouY?usp=sharing).

Run the model:
* `pip install TTS==0.5.0`
* `tts-server --model_name tts_models/de/thorsten/tacotron2-DCA`
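Once `tts-server` is running, it can be queried over HTTP. The sketch below assumes the server's usual defaults (port 5002 and an `/api/tts` endpoint that takes a `text` query parameter and returns WAV audio); check the server's startup output if your version behaves differently. The helper names `tts_url` and `synthesize` and the output path are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def tts_url(text, host="localhost", port=5002):
    """Build the request URL for the assumed /api/tts endpoint."""
    return f"http://{host}:{port}/api/tts?{urlencode({'text': text})}"


def synthesize(text, out_path="output.wav"):
    """Fetch synthesized audio from a running tts-server and save it."""
    with urlopen(tts_url(text)) as response, open(out_path, "wb") as f:
        f.write(response.read())


# Requires a running tts-server, so the call is left commented out:
# synthesize("Hallo Welt")
print(tts_url("Hallo Welt"))
```

Keeping the URL construction in its own function makes it easy to point the same code at a server on another host or port.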
## Thorsten-22.05-VITS
Trained on the **Thorsten-22.05-neutral** dataset.
> TODO
## Thorsten-22.05-Tacotron2-DDC
Trained on the [**Thorsten-22.05-neutral**](#thorsten-2205-neutral) dataset.
> :speaking_head: **Listen to synthesized samples [here](https://drive.google.com/drive/folders/1cZlLYkLWKtF0cZQ74Pef8fJ8fiG1G7du?usp=sharing).**

Soon to come.
## Other models
### Silero
You can use free AGPL-licensed models trained on the **Thorsten-21.02-neutral** dataset via the [silero-models](https://github.com/snakers4/silero-models/blob/master/models.yml) project.
* [Thorsten 16kHz](https://drive.google.com/drive/folders/1tR6w4kgRS2JJ1TWZhwoFuU04Xkgo6YAs?usp=sharing)
* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-models/blob/master/examples_tts.ipynb)
### ZDisket
[ZDisket](https://github.com/ZDisket/TensorVox) made a tool called TensorVox for setting up a TTS environment on Windows and included a German TTS model trained by [monatis](https://github.com/monatis/german-tts). Thanks for sharing that :thumbsup:. See it in action on [YouTube](https://youtu.be/tY6_xZnkv-A).
# Public talks
I really want to bring the topic "**Open Voice For An Open Future**" to wider public attention.
* I took part in a Linux User Group podcast about Mycroft AI and talked about my TTS efforts (*May 2021*).
* I was invited by [Yusuf](https://github.com/monatis/) from the Turkish TensorFlow community to give a talk on "How to make machines speak with your own voice". The talk was streamed live on YouTube and is available [here](https://www.youtube.com/watch?v=m-Uwb-Bg144&t=2303s). If you're interested in the slides, feel free to download my presentation [here](https://docs.google.com/presentation/d/1ynnw0ilKV3WwMSJHytrN3GXRiFr8x3r0DUimBm1y0LI/edit?usp=sharing) (*June 2021*).
* I was invited as a speaker at VoiceLunch Language & Linguistics on 3 January 2022. [Here are my slides](https://docs.google.com/presentation/d/1Gi6BmYHs7g4ZgdAiIKGBnBwZDCvJOD9DJxQOGlgds1o/edit?usp=sharing) (*January 2022*).
# Youtube channel
In summer 2021 I started sharing my lessons learned and experiences with open voice technology, especially **TTS**, on my little [YouTube channel](https://www.youtube.com/c/ThorstenMueller). If you check out and like my videos, I'd be happy to welcome you as a subscriber and member of my little YouTube community.
# Feel free to file an issue if you ...
* Use my TTS voice in your project(s)
* Want to share your trained "Thorsten" model
* Become aware of any abuse of my voice
# Thanks section
## Cool projects
* https://commonvoice.mozilla.org/
* https://coqui.ai/
* https://mycroft.ai/
* https://github.com/rhasspy/
## Cool people
* [El-Tocino](https://github.com/el-tocino/)
* [Eren Gölge](https://github.com/erogol/)
* [Gras64](https://github.com/gras64/)
* [Kris Gesling](https://github.com/krisgesling/)
* [Nmstoker](https://github.com/nmstoker)
* [Othiele](https://discourse.mozilla.org/u/othiele/summary)
* [Repodiac](https://github.com/repodiac)
* [SanjaESC](https://github.com/SanjaESC)
* [Synesthesiam](https://github.com/synesthesiam/)
## Even more special people
Additionally, many thanks to my dear colleague, Sebastian Kraus, for supporting me with audio recording equipment and for being the creative mastermind behind the logo design.

And last but not least, I want to say a **huge, huge thank you** to a special guy who supported me on this journey as a partner right from the beginning. Not just with nice words, but with his time, audio optimization know-how and finally GPU power.

**Thank you so much, dear Dominik ([@domcross](https://github.com/domcross/)), for being my partner on this journey.**

Thorsten (*Twitter: @ThorstenVoice*)