mirror of https://github.com/thorstenMueller/Thorsten-Voice.git synced 2024-11-21 15:33:10 +01:00

Thorsten-Voice: A free-to-use, offline, high-quality German TTS voice should be available for every project without any licensing hassle.

Introduction to "Thorsten-Voice" 🗣️ 💬 🦥
A personal note
Voice "Thorsten" (neutral)
Voice "Thorsten" (emotional)
- Emotional dataset information and samples 🎤
- Emotional dataset download information
Pretrained TTS models
Public talks
Feel free to file an issue if you ...
Recommended projects / communities
Special thanks
Additional links

Introduction to "Thorsten-Voice" 🗣️ 💬 🦥

A free-to-use, offline, high-quality German TTS voice should be available for every project without any licensing hassle.

My datasets are listed on Zenodo with the following DOIs:

Dataset	DOI Link
Thorsten (neutral)	https://doi.org/10.5281/zenodo.5525342
Thorsten (emotional)	https://doi.org/10.5281/zenodo.5525023

Talking tech devices and voice-based smart assistants are very popular nowadays. But to provide nice-sounding TTS, a lot of projects depend on big tech cloud services for voice synthesis. While the quality is quite good, critical issues remain, such as privacy concerns and missing offline availability.

➡️ http://www.Thorsten-Voice.de

➡️ https://OpenVoice-Tech.net

True, but what is this all about?

I want to (hopefully) fill that German TTS gap and make the most personal contribution I can give.
I contribute my personal voice! 💚

This contribution is split into three parts:

"Thorsten" neutral dataset
"Thorsten" emotional dataset
Pretrained TTS models based on "Thorsten" dataset

Please read some personal words before using dataset / TTS models

I contribute my voice as a person who believes in a world where all people are equal, regardless of gender, sexual orientation, religion, skin color, or the geocoordinates of their birthplace. A global world where everybody is warmly welcome anywhere on this planet, and where open and free knowledge and education are available to everyone. 🌍

So hopefully my voice is used in this manner to make this world a better place for all of us 😃.

tl;dr Please don't use for evil!

Datasets

For both datasets, please keep in mind that I am no professional voice talent. I'm just a normal guy sharing his voice with you.

Dataset "Thorsten" neutral

NEW RECORDING-IN-PROGRESS SNEAK PREVIEW OOPS PREHEAR 🗣️ 🚧 🎤

I am currently recording a new neutral dataset on a new corpus. This time with a BETTER MICROPHONE, a BETTER ROOM SITUATION, and a MORE NATURAL SPEECH FLOW right from the beginning. So far I have recorded only 8,000 phrases (much recording work remains), but I am already sharing this dataset with you. Any feedback on quality, understandability, or naturalness is highly appreciated, so that I can adjust my recording voice for further recordings.

https://drive.google.com/file/d/1Pqdwrv63OnPnp5TVJt1PmrcBTIEx6Zko/view?usp=sharing

Samples of my neutral voice

To give you an impression of what my voice sounds like, so you can decide whether it fits your project, I have published some sample recordings. There is no need to download the complete dataset first.

Dataset information 🎤

LJSpeech-1.1 structure
22,668 recorded phrases (wav files)
more than 23 hours of pure audio
sample rate 22,050 Hz
mono
normalized to -24 dB
phrase length (min/avg/max): 2 / 52 / 180 chars
no silence at beginning/ending
avg spoken chars per second: 14
sentences with question mark: 2,780
sentences with exclamation mark: 1,840
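
The figures above can be cross-checked with a few lines of Python. This is a minimal sketch, assuming the metadata file follows the usual LJSpeech-1.1 layout of pipe-delimited lines with the wav file id in the first field and the transcription in the second. The sample lines and the function names `parse_metadata`, `length_stats` and `estimate_hours` are made up for illustration and are not part of the dataset.

```python
from statistics import mean


def parse_metadata(lines):
    """Parse pipe-delimited LJSpeech-style lines into (file_id, text) pairs."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        fields = line.split("|")
        # Assumption: field 0 is the wav file id, field 1 the transcription.
        entries.append((fields[0], fields[1]))
    return entries


def length_stats(entries):
    """Return (min, rounded average, max) phrase length in characters."""
    lengths = [len(text) for _, text in entries]
    return min(lengths), round(mean(lengths)), max(lengths)


def estimate_hours(num_phrases, avg_chars, chars_per_second):
    """Rough audio duration in hours derived from the averages listed above."""
    return num_phrases * avg_chars / chars_per_second / 3600


# Made-up sample lines, not taken from the real dataset.
sample = [
    "0001|Hallo Welt.|Hallo Welt.",
    "0002|Wie geht es dir heute?|Wie geht es dir heute?",
    "0003|Ja.|Ja.",
]

if __name__ == "__main__":
    print(length_stats(parse_metadata(sample)))
    print(f"{estimate_hours(22668, 52, 14):.1f} hours")
```

With the listed values (22,668 phrases, 52 characters on average, 14 characters per second) the estimate comes out at roughly 23.4 hours, which is consistent with "more than 23 hours of pure audio".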

Dataset evolution

As described in the PDF document (EvolutionOfThorstenDataset.pdf), this dataset consists of three recording phases.

phase1: Recorded with a cheap USB microphone
phase2: Recorded with a good microphone
phase3: Recorded with the same good microphone but longer phrases (> 100 chars)

If you want to use only a subset of the dataset (phase1 and/or phase2 and/or phase3), the recording quality CSV file (RecordingQuality.csv) shows which files belong to which recording phase.

Neutral dataset download information

Download: https://zenodo.org/record/5525342 (size 2.7 GB)

@dataset{muller_thorsten_2021_5525342,
  author       = {Müller, Thorsten and
                  Kreutz, Dominik},
  title        = {Thorsten - Open German Voice (Neutral) Dataset},
  month        = feb,
  year         = 2021,
  note         = {{Please use it to make the world a better place for 
                   whole humankind.}},
  publisher    = {Zenodo},
  version      = {3.0},
  doi          = {10.5281/zenodo.5525342},
  url          = {https://doi.org/10.5281/zenodo.5525342}
}

Dataset "Thorsten" emotional

Emotional dataset information and samples 🎤

All emotional recordings were recorded by myself, and I tried to feel and pronounce each emotion even if the phrase context does not match it. Example: I pronounced the sleepy recordings in the tone I have shortly before falling asleep.

300 sentences * 8 emotions = 2,400 recordings
recorded by Thorsten Müller (optimized by Dominik Kreutz)
mono
sample rate 22,050 Hz
normalized to -24 dB
no silence at beginning/ending
sentence length: 59 - 148 chars
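
To verify that a downloaded recording matches the format listed above (mono, 22,050 Hz), the Python standard library is enough. This is a minimal sketch, assuming the files are uncompressed PCM wav files that the `wave` module can read. The function names `describe_wav` and `matches_dataset_format` and the example path are hypothetical.

```python
import wave


def describe_wav(path):
    """Read sample rate, channel count and duration from a PCM wav file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "seconds": wav.getnframes() / rate,
        }


def matches_dataset_format(info, sample_rate=22050, channels=1):
    """Check the properties against the values from the dataset description."""
    return info["sample_rate"] == sample_rate and info["channels"] == channels


if __name__ == "__main__":
    # Hypothetical path; point this at any wav file from the dataset.
    info = describe_wav("wavs/example.wav")
    print(info, matches_dataset_format(info))
```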

By the way, I mentioned that I'm no professional voice talent, didn't I?

"Mist, wieder nichts geschafft." (English: "Damn, got nothing done again.")

Emotion	Minutes	Sample
Neutral 🙂	19 min.	neutral sample
Disgusted 🤢	23 min.	disgusted sample
Angry 😠	20 min.	angry sample
Amused 😀	18 min.	amused sample
Surprised 😲	18 min.	surprised sample
Sleepy 😔	30 min.	sleepy sample
Drunk (I was "not" drunk while recording!) 😵	25 min.	drunk sample
Whispering 🤫	22 min.	whispering sample

Emotional dataset download information

Download: https://zenodo.org/record/5525023 (size 350MB)

@dataset{muller_thorsten_2021_5525023,
  author       = {Müller, Thorsten and
                  Kreutz, Dominik},
  title        = {Thorsten - Open German Voice (Emotional) Dataset},
  month        = jun,
  year         = 2021,
  note         = {{Please use it to make the world a better place for 
                   whole humankind.}},
  publisher    = {Zenodo},
  version      = {2.0},
  doi          = {10.5281/zenodo.5525023},
  url          = {https://doi.org/10.5281/zenodo.5525023}
}

Pretrained TTS models

If you have trained a model on the "Thorsten" dataset, please file an issue with some information about it. Sharing a trained model is highly appreciated.

My personal training sessions are based on the TTS repo code, originally initiated by Mozilla and now maintained by Coqui (https://www.coqui.ai, 🐸).

Coqui models

Quick steps for synthesizing voice

For all "Thorsten" Coqui models I recommend setting up a virtual environment (venv).

Python 3.6 - 3.9 required

mkdir ThorstenVoice
cd ThorstenVoice
python3 -m venv .
source ./bin/activate
pip install -U pip TTS
tts-server --model_name tts_models/de/thorsten/tacotron2-DCA
Then open a web browser at http://localhost:5002

Details: https://github.com/coqui-ai/TTS/releases/tag/v0.0.11 or https://github.com/coqui-ai/TTS/releases/tag/v0.1.3

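Instead of the web frontend, you can call the server from the command line. Quote the URL so that the shell does not interpret the "?", and URL-encode the text if it contains spaces or special characters.

curl "http://localhost:5002/api/tts?text=TEXT" --output test.wav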
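
The same request can be made from Python, which also takes care of URL-encoding spaces and umlauts in the text. This is a minimal sketch, assuming a `tts-server` is running locally on port 5002 and answers `/api/tts?text=...` with wav data as in the curl call above. The function names `build_tts_url` and `synthesize` are made up for this example and are not part of the Coqui TTS package.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_tts_url(text, host="localhost", port=5002):
    """Build the request URL with the text safely URL-encoded."""
    return f"http://{host}:{port}/api/tts?{urlencode({'text': text})}"


def synthesize(text, out_path="test.wav", host="localhost", port=5002):
    """Fetch the synthesized audio from the running server and save it."""
    with urlopen(build_tts_url(text, host, port)) as response:
        audio = response.read()
    with open(out_path, "wb") as out:
        out.write(audio)
    return out_path


if __name__ == "__main__":
    # Requires the tts-server from the quick steps to be running.
    print(synthesize("Guten Tag, ich bin Thorsten."))
```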

Download Coqui trained checkpoints / config

Model name	Coqui Repo branch / commit	Release date	Google Drive Download Link
Thorsten Tacotron2 DCA	master / 0ee3eeefb553678d56c49534f3972a426a254649	2021-04-02	Google Drive Thorsten Taco2 DCA
Thorsten Vocoder WaveGrad	master / 0ee3eeefb553678d56c49534f3972a426a254649	2021-04-02	Google Drive Thorsten Vocoder WaveGrad
Thorsten Vocoder Fullband-MelGAN	master / 0ee3eeefb553678d56c49534f3972a426a254649	2021-07-26	Google Drive Thorsten Vocoder Fullband-MelGAN or Coqui v0.1.3 model download
Thorsten Vocoder HifiGAN		planned	planned
Thorsten Vocoder WaveRNN		planned	planned

Silero

You can use free AGPL-licensed models trained on this dataset via the silero-models project. The full list of models, including their older versions, is available via this yaml file.

Speaker	Gender	Language	Examples	Colab
thorsten_8khz	m	de	8000 / 16000
thorsten_16khz	m	de	8000 / 16000

ZDisket

ZDisket made a tool called TensorVox for easily setting up a TTS environment on Windows and included the German TTS model trained by monatis. Thanks for sharing that 👍. You can find more details on how to set it up here or see it live in action on YouTube.

Public talks

I really want to bring the topic "OpenVoice" to wider public attention, so I am happy to be invited as a speaker on it.

In May 2021 I took part in a Linux User Group podcast about Mycroft AI and talked about my TTS efforts. I'll publish a link to that talk once it is released to the public.

In addition, Yusuf from the Turkish TensorFlow community invited me to talk about "How to make machines speak with your own voice" on June 2nd, 2021. The talk was streamed live on YouTube and is available here. If you're interested in the slides shown, feel free to download my presentation here.

Whenever I have something on my mind about open voice that I'd like to share, I post a video on YouTube.

Feel free to file an issue if you ...

have improvements on dataset
use my TTS voice in your project(s)
want to share your trained "Thorsten" model
become aware of any abuse of my voice

Recommended projects

https://mycroft.ai/ (for building an opensource privacy friendly voice assistant)
https://www.mozilla.org (for initiating voice projects for STT and TTS)
https://coqui.ai/ (for keeping voice projects running)
https://github.com/coqui-ai/TTS
https://github.com/TensorSpeech/TensorFlowTTS
https://github.com/rhasspy/de_larynx-thorsten

Special thanks

I want to thank all open source communities for providing great projects.

Just to name some nice guys who joined me on this TTS roadtrip:

eltocino (https://github.com/el-tocino/)
erogol (https://github.com/erogol/)
gras64 (https://github.com/gras64/)
krisgesling (https://github.com/krisgesling/)
nmstoker (https://github.com/nmstoker)
othiele (https://discourse.mozilla.org/u/othiele/summary)
repodiac (https://github.com/repodiac)
SanjaESC (https://github.com/SanjaESC)

Additionally, a big thank-you to my dear colleague, Sebastian Kraus, for supporting me with audio recording equipment and for being the creative mastermind behind the logo design.

And last but not least, I want to say a huge thank you to a special guy who supported me on this journey right from the beginning. Not just with nice words, but with his time, his audio optimization know-how, and finally his GPU computing power.

Without his amazing support this dataset (in its current form) would not exist.

Thank you Dominik (@domcross / https://github.com/domcross/)

Additional links

We'll hear from each other in the future 🗣️

Thorsten (https://twitter.com/ThorstenVoice)