Thorsten-Voice: A free-to-use, offline, high-quality German TTS voice should be available for every project without any licensing hassle.

Thorsten - Open German Voice Dataset

Introduction to "Thorsten-Voice" 🗣️ 💬 🦥

A free-to-use, offline, high-quality German TTS voice should be available for every project without any licensing hassle.

License: CC-0 | Open source | Follow the maintainer on Twitter | Audio comparison page

A summary of my open German voice dataset is available on Wikipedia:

https://de.wikipedia.org/wiki/Thorsten_(Stimme) 😃

Talking devices and voice-based smart assistants are very popular nowadays. But to provide good-sounding TTS, a lot of projects depend on big tech cloud services for synthesizing speech. While the quality is quite good, critical issues remain, such as privacy concerns and missing offline availability.

True, but what is this all about?

I want to (hopefully) fill that German TTS gap and make the most personal contribution I can.
I contribute my personal voice! 💚

This contribution is split into three parts:

  • "Thorsten" neutral dataset
  • "Thorsten" emotional dataset
  • Pretrained TTS models based on "Thorsten" dataset

Please read some personal words before using the datasets or TTS models

I contribute my voice as a person who believes in a world where all people are equal, regardless of gender, sexual orientation, religion, skin color, or the geocoordinates of their birthplace. A global world where everybody is warmly welcome anywhere on this planet, and where open and free knowledge and education are available to everyone. 🌍

So hopefully my voice will be used in this spirit, to make this world a better place for all of us 😃.

tl;dr Please don't use for evil!

Datasets

For both datasets, please keep in mind that I am not a professional voice talent. I'm just a normal guy sharing his voice with you.

Dataset "Thorsten" neutral

Samples of my neutral voice

To give you an impression of how my voice sounds, so you can decide whether it fits your project, I have published some sample recordings. There is no need to download the complete dataset first.

Dataset information 🎤

  • LJSpeech-1.1 structure
  • 22,668 recorded phrases (wav files)
  • more than 23 hours of pure audio
  • sample rate 22,050 Hz
  • mono
  • normalized to -24 dB
  • phrase length (min/avg/max): 2 / 52 / 180 chars
  • no silence at beginning/end
  • avg spoken chars per second: 14
  • sentences with question mark: 2,780
  • sentences with exclamation mark: 1,840
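As a rough illustration of what the LJSpeech-1.1 structure means in practice, here is a minimal Python sketch that parses a metadata file and recomputes the phrase length statistics. It assumes the usual LJSpeech layout (a pipe-separated `metadata.csv` with the file ID in the first column, and a `wavs` folder next to it); the function names are made up for this example, and the exact column count in the "Thorsten" metadata should be checked against the downloaded files.

```python
from pathlib import Path
from statistics import mean


def parse_metadata(text):
    """Parse LJSpeech-style metadata: one 'file_id|transcript[|normalized]' entry per line."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        parts = line.split("|")
        if len(parts) < 2:
            raise ValueError(f"unexpected metadata line: {line!r}")
        # Assumption: the last column holds the text to train on
        # (the normalized transcript when a third column exists).
        entries.append((parts[0], parts[-1]))
    return entries


def phrase_length_stats(entries):
    """Return (min, avg, max) transcript length in characters."""
    lengths = [len(text) for _, text in entries]
    return min(lengths), mean(lengths), max(lengths)


def wav_path(dataset_dir, file_id):
    """Assumption: audio lives in a 'wavs' folder, named '<file_id>.wav'."""
    return Path(dataset_dir) / "wavs" / f"{file_id}.wav"
```

For the real dataset you would read the file first, for example `parse_metadata(Path("metadata.csv").read_text(encoding="utf-8"))`, and pass the result to `phrase_length_stats`.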

[Figures: text length vs. mean audio duration; text length vs. median audio duration; text length vs. standard deviation; text length vs. number of instances; signal-to-noise ratio; bokeh plot]

Dataset evolution

As described in the PDF document (Evolution of Thorsten Dataset), this dataset consists of three recording phases.

  • phase1: Recorded with a cheap USB microphone
  • phase2: Recorded with a good microphone
  • phase3: Recorded with the same good microphone but longer phrases (> 100 chars)

If you want to use just a subset of the dataset (phase1 and/or phase2 and/or phase3), the recording quality CSV file shows which files belong to which recording phase.
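Selecting a subset by recording phase could look roughly like the Python sketch below. The column layout of the recording quality CSV is not described here, so the column names are passed in as parameters; `file_id` and `phase` in the usage example are purely hypothetical, and you should check the header row of the real file before using this.

```python
import csv
import io


def files_for_phases(csv_text, wanted_phases, id_column, phase_column, delimiter=","):
    """Return the file IDs whose phase value is one of wanted_phases.

    id_column and phase_column are the header names in the CSV;
    they are assumptions and must match the real file.
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    wanted = {str(phase) for phase in wanted_phases}
    return [row[id_column] for row in reader if row[phase_column] in wanted]


# Usage with made-up data (hypothetical column names and values):
example = "file_id,phase\na1,1\na2,2\na3,3\n"
good_mic_only = files_for_phases(example, [2, 3], "file_id", "phase")
```

The resulting list of IDs can then be used to filter the metadata entries before training.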

Neutral dataset download information

Download size: 2.7 GB

| Version | Description | Date | Link |
|---|---|---|---|
| thorsten-de-v01 | Initial version | 2020-06-28 | Google Drive Download v01 |
| thorsten-de-v02 | Normalized to -24 dB and split metadata.csv into shuffled metadata_train.csv and metadata_val.csv | 2020-08-22 | Google Drive Download v02 |
| thorsten-de-v03 | Based on the v02 dataset, but with speed increased by 10% (using ffmpeg atempo=1.1) | 2021-02-10 | Google Drive Download v03 |
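If you want to apply a similar speed change to your own recordings, the following Python sketch builds an ffmpeg call around the `atempo` audio filter, which changes tempo without shifting pitch. Only `atempo=1.1` comes from the description of v03; the function names and the rest of the command line are assumptions, not the exact command used to produce the dataset.

```python
import subprocess


def atempo_command(src, dst, factor=1.1):
    """Build an ffmpeg command that changes the tempo of src and writes dst."""
    return ["ffmpeg", "-i", str(src), "-filter:a", f"atempo={factor}", str(dst)]


def speed_up(src, dst, factor=1.1):
    """Run the command; requires the ffmpeg binary to be on PATH."""
    subprocess.run(atempo_command(src, dst, factor), check=True)
```

Calling `speed_up("in.wav", "out.wav")` would write a copy that plays about 10% faster.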

Dataset "Thorsten" emotional

Emotional dataset information and samples 🎤

  • 300 sentences * 6 emotions = 1,800 recordings
  • recorded by Thorsten Müller (optimized by Dominik Kreutz)
  • mono
  • sample rate 22,050 Hz
  • normalized to -24dB
  • no silence at beginning/ending
  • sentence length: 59 - 148 chars

By the way, I did mention that I'm no professional voice talent, didn't I?

Sample sentence: "Mist, wieder nichts geschafft." (roughly: "Darn, got nothing done again.")

| Emotion | Minutes | Sample |
|---|---|---|
| Neutral 🙂 | 19 min. | neutral sample |
| Disgusted 🤢 | 23 min. | disgusted sample |
| Angry 😠 | 20 min. | angry sample |
| Amused 😀 | 18 min. | amused sample |
| Surprised 😲 | 18 min. | surprised sample |
| Sleepy 😔 | 30 min. | sleepy sample |

Emotional dataset download information

Download size: 300 MB

| Version | Description | Date | Link |
|---|---|---|---|
| thorsten-de-emotional-v01 | Initial version | 2021-04-03 | Google Drive Download v01 |

Pretrained TTS models

If you have trained a model on the "Thorsten" dataset, please file an issue with some information about it. Sharing a trained model is highly appreciated.

My personal training sessions are based on the TTS repo code, originally initiated by Mozilla and now maintained by Coqui (https://www.coqui.ai 🐸).

Coqui models

For all "Thorsten" Coqui models I recommend setting up a virtual environment (venv):

  • mkdir ThorstenVoice
  • cd ThorstenVoice
  • python3 -m venv .
  • source ./bin/activate
  • pip install -U pip
  • pip install -U tts
  • start the Coqui TTS server with one of the model combinations listed below
  • open a web browser at http://localhost:5002
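Besides the browser page, a running server can usually be queried from a script as well. The Python sketch below assumes the server exposes a `GET /api/tts?text=...` endpoint that returns WAV data; that endpoint is an assumption about your installed TTS version and is not stated in this README, so verify it against your server before relying on it. The function names are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def tts_url(text, host="localhost", port=5002):
    """Build the request URL (assumed endpoint: /api/tts with a 'text' parameter)."""
    return f"http://{host}:{port}/api/tts?" + urlencode({"text": text})


def synthesize_to_file(text, out_path, host="localhost", port=5002):
    """Fetch synthesized audio from a running server and save it; returns bytes written."""
    with urlopen(tts_url(text, host, port)) as response:
        data = response.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return len(data)
```

With the server running, `synthesize_to_file("Hallo Welt", "hallo.wav")` would store the response in `hallo.wav`.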

Tacotron2 + DCA (Dynamic Convolution Attention) & WaveGrad vocoder

Using the option "use_cuda=true" is recommended for a better real-time factor (RTF): around 25x realtime on CPU versus around 4x realtime on GPU.

  • tts-server --model_name tts_models/de/thorsten/tacotron2-DCA

See: https://github.com/coqui-ai/TTS/releases/tag/v0.0.11

Tacotron2 + DCA (Dynamic Convolution Attention) & Fullband-MelGAN (universal) vocoder

RTF is less than 0.5x realtime.

  • tts-server --model_name tts_models/de/thorsten/tacotron2-DCA --vocoder_name vocoder_models/universal/libri-tts/fullband-melgan
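To compare model combinations on your own hardware, you can measure the real-time factor yourself. The sketch below assumes the common definition, synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time; the function name is made up for this example.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Made-up timings for illustration only:
slow = real_time_factor(50.0, 2.0)  # 50 s of compute for 2 s of audio
fast = real_time_factor(1.0, 4.0)   # 1 s of compute for 4 s of audio
```

Under that definition, the first call corresponds to 25x realtime and the second to 0.25x realtime.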

Silero-models

You can use free AGPL-licensed models trained on this dataset via the silero-models project. The full list of models, including older versions, is available in this YAML file.

| Speaker | Gender | Language | Examples | Colab |
|---|---|---|---|---|
| thorsten_8khz | m | de | 8000 / 16000 | Open In Colab |
| thorsten_16khz | m | de | 8000 / 16000 | Open In Colab |

Feel free to file an issue if you ...

  • have improvements to the dataset
  • use my TTS voice in your project(s)
  • want to share your trained "Thorsten" model
  • become aware of any abuse of my voice

Recommended projects

Special thanks

I want to thank all open source communities for providing great projects.

Just to name some nice guys who joined me on this TTS roadtrip:

Additionally, a big thank you to my dear colleague, Sebastian Kraus, for supporting me with audio recording equipment and for being the creative mastermind behind the logo design.

And last but not least, I want to say a huge thank you to a special guy who supported me on this journey right from the beginning. Not just with kind words, but with his time, his audio optimization know-how, and finally his GPU computing power.

Without his amazing support this dataset (in its current form) would not exist.

Thank you Dominik (@domcross / https://github.com/domcross/)

Additional links

We'll hear from each other in the future 🗣️

Thorsten (https://twitter.com/ThorstenVoice)