by Adriana Stan, Junichi Yamagishi, Simon King, Matthew Aylett
Abstract:
This paper first introduces a newly-recorded high quality Romanian speech corpus designed for speech synthesis, called ''RSS'', along with Romanian front-end text processing modules and HMM-based synthetic voices built from the corpus. All of these are now freely available for academic use in order to promote Romanian speech technology research. The RSS corpus comprises 3500 training sentences and 500 test sentences uttered by a female speaker and was recorded using multiple microphones at 96 kHz sampling frequency in a hemianechoic chamber. The details of the new Romanian text processor we have developed are also given. Using the database, we then revisit some basic configuration choices of speech synthesis, such as waveform sampling frequency and auditory frequency warping scale, with the aim of improving speaker similarity, which is an acknowledged weakness of current HMM-based speech synthesisers. As we demonstrate using perceptual tests, these configuration choices can make substantial differences to the quality of the synthetic speech. Contrary to common practice in automatic speech recognition, higher waveform sampling frequencies can offer enhanced feature extraction and improved speaker similarity for HMM-based speech synthesis.
Reference:
Adriana Stan, Junichi Yamagishi, Simon King, Matthew Aylett, "The Romanian speech synthesis (RSS) corpus: Building a high quality HMM-based speech synthesis system using a high sampling rate", In Speech Communication, vol. 53, no. 3, pp. 442-450, 2011.
Bibtex Entry:
@article{Stan2011442,
author = {Adriana Stan and Junichi Yamagishi and Simon King and
Matthew Aylett},
title = {The {R}omanian speech synthesis ({RSS}) corpus:
Building a high quality {HMM}-based speech synthesis
system using a high sampling rate},
journal = {Speech Communication},
volume = {53},
number = {3},
pages = {442--450},
note = {},
abstract = {This paper first introduces a newly-recorded high
quality Romanian speech corpus designed for speech
synthesis, called ''RSS'', along with Romanian
front-end text processing modules and HMM-based
synthetic voices built from the corpus. All of these
are now freely available for academic use in order to
promote Romanian speech technology research. The RSS
corpus comprises 3500 training sentences and 500 test
sentences uttered by a female speaker and was recorded
using multiple microphones at 96 kHz sampling
frequency in a hemianechoic chamber. The details of the
new Romanian text processor we have developed are also
given. Using the database, we then revisit some basic
configuration choices of speech synthesis, such as
waveform sampling frequency and auditory frequency
warping scale, with the aim of improving speaker
similarity, which is an acknowledged weakness of
current HMM-based speech synthesisers. As we
demonstrate using perceptual tests, these configuration
choices can make substantial differences to the quality
of the synthetic speech. Contrary to common practice in
automatic speech recognition, higher waveform sampling
frequencies can offer enhanced feature extraction and
improved speaker similarity for HMM-based speech
synthesis.},
doi = {10.1016/j.specom.2010.12.002},
issn = {0167-6393},
keywords = {Speech synthesis, HTS, Romanian, HMMs, Sampling
frequency, Auditory scale},
url = {http://www.sciencedirect.com/science/article/pii/S0167639310002074},
year = 2011
}