REX Minnesang

Description

Voice conversion is a technique that transforms a person's utterance into the voice of another. In our Minnesang Exhibit we want to use this technique to allow visitors to recite a minnesong poem in medieval German.

Scenario: Jane reads aloud the verse of a minnesong poem in English, her native language. The system analyzes the characteristics of her voice and modifies a prerecorded sample, spoken by an expert speaker in medieval German, so that it sounds as if she had spoken it. When the converted sample is played back, Jane hears the original poem in medieval German, but spoken in her own voice.

Visit the official Minnesang interaktiv, the public i10 Minnesang, or contact Daniel Spelmezan.

Voice Conversion Core Technology

Learn a conversion function that maps the acoustic space of a source speaker to the acoustic space of a target speaker.

1. Monolingual voice conversion (on parallel utterances)

well researched; the conversion quality obtained is good
requires corresponding utterances (time frames) of the source and target speaker
learn the spectral conversion function from the time-aligned spectral data of both speakers
cluster (model) the acoustic space with a Gaussian Mixture Model (GMM) into different classes (phonemes), then learn a specific transformation for each class
the conversion function rests on the probabilistic classification achieved by the GMM and on the characteristics of each class (mean vector and covariance matrix); the mixture weights represent the statistical frequency of each class in the observed speech
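
For orientation, a GMM-based conversion function of this kind is commonly written in the literature as shown below. This is the usual joint-density formulation, given here as an assumption; it is not taken from any of the toolboxes discussed on this page, and the symbols are chosen for illustration only.

```latex
F(x) = \sum_{i=1}^{M} p_i(x)\left[\mu_i^{y} + \Sigma_i^{yx}\left(\Sigma_i^{xx}\right)^{-1}\left(x - \mu_i^{x}\right)\right],
\qquad
p_i(x) = \frac{\alpha_i\,\mathcal{N}\!\left(x;\,\mu_i^{x},\,\Sigma_i^{xx}\right)}
              {\sum_{j=1}^{M}\alpha_j\,\mathcal{N}\!\left(x;\,\mu_j^{x},\,\Sigma_j^{xx}\right)}
```

Here x is a source spectral vector, M the number of classes, \alpha_i the mixture weights, \mu_i^{x} and \mu_i^{y} the class means for source and target, \Sigma_i^{xx} and \Sigma_i^{yx} the covariance and cross-covariance matrices, and p_i(x) the probability that x belongs to class i.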

2. Genuine cross-language voice conversion (on non-parallel utterances)

new research field, preliminary results not yet convincing
no corresponding time frames of source and target speaker exist
the source and target languages usually use different phoneme sets
subdivide the source and target speech samples into artificial phonetic classes
determine for each source class the most similar target class
use this mapping to train the voice conversion parameters
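
The class mapping step can be sketched as follows. The sketch assumes that each artificial phonetic class is summarized by a centroid vector and that "most similar" means the smallest Euclidean distance; both are assumptions made for illustration, and the function name and data are hypothetical.

```python
import math

def nearest_target_class(source_centroids, target_centroids):
    """Map each source class index to the index of the closest target class."""
    mapping = {}
    for s_idx, s_vec in enumerate(source_centroids):
        # Euclidean distance is an assumed similarity measure
        distances = [math.dist(s_vec, t_vec) for t_vec in target_centroids]
        mapping[s_idx] = distances.index(min(distances))
    return mapping

# Toy centroids (hypothetical two-dimensional spectral features)
source = [(1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
target = [(0.1, 0.9), (4.0, 6.0), (1.2, -0.1)]
mapping = nearest_target_class(source, target)
print(mapping)  # {0: 2, 1: 0, 2: 1}
```

The resulting pairs of source and target classes could then stand in for the time-aligned frames that parallel training would otherwise provide.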

Existing Systems

This collection is not complete, but it represents some approaches that try to solve or circumvent the problem of non-parallel voice conversion.

VTLN (vocal tract length normalization) based voice conversion (David Sündermann, Ney)

works with both parallel and non-parallel utterances

successfully transforms the source voice, but is not sufficient for convincing voice conversion

a warping function with one parameter warps the whole frequency axis in the same direction (a male voice becomes female and vice versa)

we have licensed the code, but will not use it

Matlab code
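
To illustrate what a one-parameter warping function looks like, here is a sketch of the bilinear warp, one commonly used single-parameter choice. Whether the licensed code uses this particular function is an assumption; the function name and parameter name are hypothetical.

```python
import math

def bilinear_warp(omega, alpha):
    """Warp a normalized frequency omega in [0, pi] with one parameter alpha in (-1, 1).

    alpha > 0 moves every interior frequency upwards, alpha < 0 moves every
    interior frequency downwards; the end points 0 and pi stay fixed.
    """
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))

for omega in (0.5, 1.5, 2.5):
    print(omega, bilinear_warp(omega, 0.2), bilinear_warp(omega, -0.2))
```

Because a single parameter controls the whole curve, all frequencies move in the same direction, which matches the limitation noted above.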

VTLN-based cross-language voice conversion (David Sündermann, Ney, Höge)

non-parallel utterances

currently being developed (expecting results by October 2005)

use warping function with several parameters to better describe the characteristics of the speaker's vocal tract

move certain parts of the axis to higher frequencies and other parts to lower frequencies

Matlab code
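
A warping function with several parameters can be sketched as a piecewise linear curve. This is only one possible parameterization, assumed here for illustration; the knot positions are hypothetical and the frequency axis is normalized to the range 0 to 1.

```python
def piecewise_linear_warp(omega, knots):
    """Interpolate linearly between (input, output) knots sorted by input frequency."""
    for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
        if x0 <= omega <= x1:
            return y0 + (y1 - y0) * (omega - x0) / (x1 - x0)
    raise ValueError("omega lies outside the knot range")

# Hypothetical knots: the lower part of the axis is raised, the upper part lowered
knots = [(0.0, 0.0), (0.3, 0.4), (0.7, 0.6), (1.0, 1.0)]
for omega in (0.15, 0.3, 0.5, 0.7, 0.85):
    print(omega, piecewise_linear_warp(omega, knots))
```

With these knots the frequency 0.3 moves up to 0.4 while 0.7 moves down to 0.6, which a single-parameter warp cannot do.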

Voice conversion for unknown speakers (Hui Ye, Steve Young, Cambridge University)

parallel utterances

speech recognition (HTK toolkit) tags source and target utterances

use the source indices to reassemble the source utterance from the target's training database

currently licensed exclusively to Anthropics until March 2006, but we have set up a licence agreement with Anthropics

Hui Ye has finished the project and has no time for research, so not much support is to be expected

Matlab code

parallel voice conversion demo samples freely available

female	source	converted	target
male	source	converted	target
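
The indexing idea behind this approach (tag every frame with a recognizer label, then reassemble the source utterance from target frames that carry the same labels) can be illustrated with a toy sketch. It is not the licensed implementation: the function names, the cycling through candidates, and the fallback to the source frame are all assumptions.

```python
from collections import defaultdict

def build_target_database(target_labels, target_frames):
    """Group target frames by the recognizer label assigned to each frame."""
    database = defaultdict(list)
    for label, frame in zip(target_labels, target_frames):
        database[label].append(frame)
    return database

def reassemble(source_labels, source_frames, database):
    """Replace each source frame by a target frame with the same label.

    Keeps the original source frame when the label never occurs in the
    target data (an assumed fallback, one of several possibilities).
    """
    counters = defaultdict(int)
    output = []
    for label, src_frame in zip(source_labels, source_frames):
        candidates = database.get(label)
        if candidates:
            # Cycle through the available target frames for this label
            output.append(candidates[counters[label] % len(candidates)])
            counters[label] += 1
        else:
            output.append(src_frame)
    return output

database = build_target_database(["a", "a", "b"], ["T_a1", "T_a2", "T_b1"])
result = reassemble(["a", "b", "c", "a", "a"],
                    ["S1", "S2", "S3", "S4", "S5"], database)
print(result)  # ['T_a1', 'T_b1', 'S3', 'T_a2', 'T_a1']
```

In the toy output the label "c" has no target data, so the source frame is kept, and the third "a" has to reuse a target frame because only two are available.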

Using phone and diphone based acoustic models for voice conversion: a step towards creating voice fonts (Arun Kumar, Ashish Verma)

non-parallel utterances

characterizing the target voice requires 100-150 sentences (15 min. of speech)

parallel and non-parallel demo samples freely available, but the results didn't convince me

Progress report

Anthropics' algorithm converts an unknown source voice to a known target voice. "In order to ensure that the continuous spectral evolution of the source is carried over into the transformed speech, it is important to select continuous target sentences wherever possible ... the selected target vectors are then concatenated to form the counterpart of the source vectors. ... each target speaker recorded about 650 sentences." We pursue the inverse direction: converting a well-known source voice to an unknown target voice. Since the visitor (target voice) speaks only 3-4 sentences, we cannot analyze this voice in advance (i.e., create a large target database). According to the developers, just reversing the conversion direction may not produce convincing results because we don't have enough target data to create parallel utterances.

Another approach?

we have parallel utterances of both the source and the target speaker in the visitor's native language (at least in German and English)

train the conversion function on the parallel utterance but apply the conversion to the medieval utterance

Problem: the GMM estimates the conversion parameters for the parallel training utterance. If we change the utterance (the language), the correspondence and the probabilities in the conversion function change, so the conversion function may not be appropriate for converting the medieval utterance correctly.

Describing the characteristics of a speaker's voice seems to be a fundamental problem that cannot be resolved by using only 2-3 training utterances. According to recent research literature, about 64 parallel sentences are usually required to train a robust transformation matrix. The previously mentioned non-parallel approach for creating voice fonts used 100-150 training sentences to characterize the voice.

We may not be able to reassemble the continuously spoken source utterance from the target speaker's utterance, but a first phoneme analysis of the given source / target text revealed that only a few phonemes are missing.

can we interpolate the missing phonemes from the existing target phonemes?

however, phoneme frequency and context (allophones) vary considerably between languages

Language	Words	Phonemes	Phoneme match	Wasted phonemes
medieval German	138	33
German	135	32	93.9%	3.1%
English US	133	35	87.9%	14.7%
English UK	133	34	78.8%	25.7%
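
The two percentage columns can be computed along the following lines. The definitions are assumptions inferred from the German row, which they reproduce (31 of the 33 medieval German phonemes matched, 1 of the 32 German phonemes unused); the analysis itself does not state them. The phoneme inventories below are placeholders, not the real phoneme sets.

```python
def phoneme_overlap(source_phonemes, target_phonemes):
    """Return (match, wasted) as fractions.

    match:  share of source phonemes that also occur in the target text
    wasted: share of target phonemes that the source text never uses
    """
    source, target = set(source_phonemes), set(target_phonemes)
    match = len(source & target) / len(source)
    wasted = len(target - source) / len(target)
    return match, wasted

# Placeholder inventories: 33 source phonemes, 32 target phonemes, 31 shared
source_set = [f"p{i}" for i in range(33)]
target_set = [f"p{i}" for i in range(31)] + ["x0"]
match, wasted = phoneme_overlap(source_set, target_set)
print(f"{match:.1%} {wasted:.1%}")  # 93.9% 3.1%
```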

We need speaker-independent, HMM-based (Hidden Markov Model) speech recognizers for all target languages in order to index the spoken target phonemes.

Problem: HMM models differ for each target language

different HMM states due to different phonemes, different acoustic models, and different language models

how to find an equivalent path in the source and target HMM?

Possible solution: model the source and target language with the same phoneme set

since the phonemes are the same, rename the recognized target HMM states to the corresponding recognized source HMM states

use the renamed target indices to approximate the source utterance from the target utterance

the amount of target data, however, will not cover all HMM states of the source - to replace the missing frames use the original source frames or frames from a different but complete target database?

because the acoustic models of the source and target language differ (different mono-/di-/triphones, i.e., phoneme joints), the indexed target frames that match the source frames may be only single, out-of-context phonemes; such small units may not suffice to generate a continuous and natural utterance of good quality (as usually spoken by a human) that closely matches the source utterance

the recognition accuracy of speaker-independent HMM-based speech recognizers on the target utterance is not perfect (90-95%, or even less)

see Hauke Schramm's PhD thesis on including accents in speech recognition to improve the recognition rate
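
The renaming idea and the coverage problem can be made concrete with a small sketch. The phoneme names and the mapping table are hypothetical, and the sketch only shows the bookkeeping (which source states have no target data), not the recognition itself.

```python
def rename_labels(labels, shared_names):
    """Rename language-specific phoneme labels to a shared phoneme set."""
    return [shared_names.get(label, label) for label in labels]

def missing_states(source_labels, target_labels):
    """Return the source labels, in order of first occurrence, that have no target data."""
    available = set(target_labels)
    missing = []
    for label in source_labels:
        if label not in available and label not in missing:
            missing.append(label)
    return missing

# Hypothetical mapping from target-language labels to shared labels
shared_names = {"ae": "a", "ih": "i"}
target_labels = rename_labels(["ae", "ih", "s"], shared_names)
source_labels = ["a", "i", "ch", "s", "a", "uo"]
print(target_labels)                                 # ['a', 'i', 's']
print(missing_states(source_labels, target_labels))  # ['ch', 'uo']
```

The list of missing states shows where frames would have to come from elsewhere, for example from the original source frames or from a different, complete target database.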

First conversion tests with the received package yielded noisy samples (and sometimes even errors in the Matlab code).

Recognizing an utterance using the complete word network, which is constructed from the language grammar, does not yield good recognition results. Only a few phonemes are recognized correctly.

Specify the uttered sentence in advance to improve the phoneme recognition. This is called forced alignment (usually used to train or retrain an HMM; see the HTK book, page 195 / 186). Make sure to include 'sp' (short pause) after each pronunciation in the dictionary to model silence between words; otherwise the force-aligned phonemes will not coincide with the times at which the phonemes are actually heard, and the converted sample will still be noisy. My tests still yield a shift of about 0.2 s to the right.
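
Appending 'sp' to every pronunciation can be automated. The sketch below assumes a plain-text HTK dictionary with one entry per line (word followed by its phones); the function name and the example entries are hypothetical, and entries that already end in 'sp' or 'sil' are left unchanged.

```python
def add_short_pause(dictionary_lines, pause="sp"):
    """Append the short-pause model to each pronunciation that lacks it."""
    result = []
    for line in dictionary_lines:
        fields = line.split()
        # Skip empty lines and entries that already end in a pause or silence model
        if len(fields) >= 2 and fields[-1] not in (pause, "sil"):
            fields.append(pause)
        result.append(" ".join(fields))
    return result

entries = ["HERZE h e r z e", "MINNE m i n n e sp", "SENT-END [] sil"]
for line in add_short_pause(entries):
    print(line)
```

Running it on the three example entries adds 'sp' only to the first one.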

This project mentions that it is important to train the users on the same microphone that is going to be used for testing. From another research paper on speech recognition: "Performance degrades when noise is introduced or when conditions differ from the training session used to build the reference templates. To compensate... must always use ... noise-limiting microphone with the same response characteristics as the microphone used during training."

(22nd September 2005) Hui Ye told me that our licensed version from Anthropics is the old version that uses speech recognition and a single transformation. Their new version uses forced alignment, multiple transforms, and signal enhancement techniques, such as spectral refinement and phase prediction, which significantly improve the quality. No wonder that my results with the original samples sound noisy.

"The use of a single transform results in a significant averaging effect on the formant structure, and the use of multiple transforms generally delivers better quality... However, in practice, the shortage of source training data makes it difficult to robustly estimate multiple transforms."

The new version may have a better signal representation, but I doubt that it will improve the conversion quality with our limited target data.

(28th September 2005) I finished scripts that index any given sample and create a target database from it, just like the original target database provided.

I found a bug in findtgtvectors.m :) My recorded samples yielded an MLF file with float entries, but only integers are allowed as array indices.
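
One way to guard against this kind of problem is to round label times to whole frame indices before using them. The sketch assumes that the MLF times are in HTK's 100 ns units and that frames are 10 ms apart; the frame shift and the function name are assumptions, and this is not the fix that was applied to findtgtvectors.m.

```python
HTK_UNITS_PER_SECOND = 10_000_000  # HTK label times are given in 100 ns units

def time_to_frame_index(htk_time, frame_shift_s=0.010):
    """Convert an HTK label time (int, float, or string) to an integer, zero-based frame index."""
    seconds = float(htk_time) / HTK_UNITS_PER_SECOND
    return int(round(seconds / frame_shift_s))

print(time_to_frame_index("2500000"))    # 25
print(time_to_frame_index("1234567.0"))  # 12
```

In Matlab, where array indices start at 1, the result would additionally need an offset of one.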

Samples recorded with our top-notch microphone and mixer yield an error in Matlab's built-in poly2lsf function. The iSight works pretty well, but its samples are too quiet.

Cross-gender conversion sounds bad; within-gender conversion is a little better. Using the original samples produces a better conversion quality than using the samples that I recorded (maybe it's my accent :)

The conversion quality with the limited target samples is in general too noisy and not expressive enough.

Forced alignment does not work very well because phonemes are not time-aligned correctly (shift forward/backward in time). Forced-alignment / phoneme recognition clearly needs improvement, and this is usually done by hand. But I doubt that the new version is better since forced alignment is done with the HTK toolkit, and not with the VC software.
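
A constant offset, such as the shift of about 0.2 s observed in the earlier tests, could be compensated by shifting all label times. The sketch assumes HTK label lines of the form "start end label" with times in 100 ns units; the function name and the example labels are hypothetical. Offsets that vary over the utterance would still need manual correction.

```python
def shift_labels(label_lines, offset_s):
    """Shift 'start end label' lines by offset_s seconds (negative means earlier)."""
    offset = int(round(offset_s * 10_000_000))  # seconds to 100 ns units
    shifted = []
    for line in label_lines:
        start, end, label = line.split()[:3]
        # Clamp at zero so that no label starts before the recording
        new_start = max(0, int(start) + offset)
        new_end = max(0, int(end) + offset)
        shifted.append(f"{new_start} {new_end} {label}")
    return shifted

labels = ["2000000 3500000 h", "3500000 5000000 e"]
print(shift_labels(labels, -0.2))  # ['0 1500000 h', '1500000 3000000 e']
```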

The created target database has too few samples. The matched frames are in general too short to be continuous (mostly 1- or 2-state matches), some frames are repeated, and some frames simply do not exist in the target database (the silence in the samples seems to match really well ;)
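
How continuous the selected target frames are can be quantified by counting runs of consecutive target indices. This is a diagnostic sketch only; the function name and the convention of marking missing frames with None are assumptions.

```python
def continuous_run_lengths(target_indices):
    """Return the lengths of runs of consecutive target frame indices.

    None marks a source frame without any match and ends the current run.
    """
    runs = []
    previous = None
    for index in target_indices:
        if index is None:
            previous = None
            continue
        if previous is not None and index == previous + 1:
            runs[-1] += 1
        else:
            runs.append(1)
        previous = index
    return runs

selection = [10, 11, None, 42, 7, 8, 9, 9]
print(continuous_run_lengths(selection))  # [2, 1, 3, 1]
```

Many runs of length 1 or 2 would correspond to the short matches described above, and a repeated index (the second 9) starts a new run because it does not continue the target signal.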

The converted source vectors still sound like the source speaker, and additional noise appears.

Idea: let the target speaker utter, in his or her native language, some dedicated words that sound similar to the source words, and use these to fill in the gaps that cannot easily be resampled from the limited target database.

(4th October 2005) The new version is not licenced to Anthropics, and it's not for public use (I never assumed that :)

The same holds for the pitch marking source code, which we need for Macintosh.

Hui Ye said that we need a good acoustic model to improve phoneme recognition by forced alignment and to reduce the noise. I guess this means that the provided English acoustic model, which I received from Anthropics and which was built by the developers themselves, is not good!?

(7th October 2005) I contacted Steve Young to set up a license agreement for his new version of the voice morphing code.

(31st October 2005) Steve Young replied that their enhanced version is not publicly available.

Maybe we can agree on a research project?

(22nd December 2005) I asked David Sündermann about his text-independent voice conversion research.

(31st December 2005) D. Sündermann replied that his improved algorithm may suit our needs. We will stay in contact and, hopefully, receive the new toolbox this year after he presents his improved algorithm at the ICASSP conference on May 14-19.

(22nd March 2006) According to Sündermann's paper Text-Independent Voice Conversion Based on Unit Selection, to appear in Proc. of ICASSP 2006, his improved text-independent algorithm is, for male voices, as good as the traditional text-dependent conversion. But the algorithm has problems with female voices.

The voice conversion toolbox is written in Matlab. The new Matlab version R14 SR3 now also supports the Matlab compiler on Mac OS X.

Sündermann converted first test samples for us with his new algorithm (only medieval German samples, female2male and male2female). The converted speech is free of distortions and of good quality. The resemblance to the target speaker is not yet ideal, but one can nevertheless recognize some of the target speaker's voice characteristics.

Biggest problems

Converted samples sound noisy.

speech recognition is bad

forced alignment is bad too if the phonemes are not time-aligned correctly

manual correction needed to compensate for the time shift

Not enough target data to resample the source utterance in good quality.

some frames skipped because they do not exist in the target sample

matched states are not long enough (1-state match); this increases noise and corrupts the spectral evolution of the sample

utter dedicated words to fill in the gaps?

Feedback from Rex-Preview (July 2005)

Technical note: do not expose the fire-i camera to strong light for long periods! It will damage the color filters!
Add lute music to the converted speech to make the experience more poetic.
Add scenery to the video recording so that the visitor has the impression of being in medieval times (e.g., the visitor stands in front of a picture of a castle).
Make the converted utterance available on CD / website.