by Adriana Stan, Mircea Giurgiu
Abstract:
This paper addresses a superpositional model based on a DCT (Discrete Cosine Transform) parameterization of F0 contours. We examine the capacity of the DCT coefficients to capture both the fast variations of the F0 contour at syllable level and the overall trend at phrase level. The syllable-level coefficients are computed after subtracting the estimated phrase-level contour from the original one, so that each syllable is treated as having an additive prosodic effect on the phrase-level contour. We also compare three decision and regression tree algorithms for clustering and predicting the DCT coefficients. Additional features are selected with a greedy stepwise feature selection method without backtracking. The results support the proposed method, with low average squared errors and few or no perceptible errors in the synthesized speech.
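The two-level decomposition described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names (`dct`, `idct`, `decompose`, `reconstruct`), the number of coefficients kept at each level, the orthonormal DCT-II scaling, and the synthetic, fully voiced F0 contour with fixed syllable boundaries.

```python
import math


def dct(x, k_max):
    """First k_max orthonormal DCT-II coefficients of the sequence x."""
    n = len(x)
    coeffs = []
    for k in range(min(k_max, n)):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * s)
    return coeffs


def idct(coeffs, n):
    """Rebuild n samples from (possibly truncated) orthonormal DCT-II coefficients."""
    out = []
    for i in range(n):
        v = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            v += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(v)
    return out


def decompose(f0, syllable_bounds, n_phrase=3, n_syl=2):
    """Phrase-level DCT first, then a DCT of the residual for each syllable.

    n_phrase and n_syl are illustrative values, not the paper's settings.
    """
    phrase_coeffs = dct(f0, n_phrase)
    phrase_contour = idct(phrase_coeffs, len(f0))
    # Additive assumption: the syllable contribution is original minus phrase trend.
    residual = [a - b for a, b in zip(f0, phrase_contour)]
    syl_coeffs = [dct(residual[s:e], n_syl) for s, e in syllable_bounds]
    return phrase_coeffs, syl_coeffs


def reconstruct(phrase_coeffs, syl_coeffs, syllable_bounds, n):
    """Superpose the syllable-level contours on the phrase-level contour."""
    contour = idct(phrase_coeffs, n)
    for (s, e), c in zip(syllable_bounds, syl_coeffs):
        for i, v in enumerate(idct(c, e - s)):
            contour[s + i] += v
    return contour


# Synthetic contour (Hz): a declining phrase trend plus one bump per syllable.
f0 = [120.0 - 2.0 * i + 8.0 * math.sin(math.pi * (i % 4 + 0.5) / 4) for i in range(12)]
bounds = [(0, 4), (4, 8), (8, 12)]  # hypothetical syllable boundaries (frame indices)

phrase_c, syl_c = decompose(f0, bounds)
approx = reconstruct(phrase_c, syl_c, bounds, len(f0))
rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(f0, approx)) / len(f0))
print("RMSE with truncated coefficients (Hz):", rmse)
```

In a synthesis setting the coefficients would be predicted from text-derived features rather than computed from a reference contour; the sketch covers only the analysis and resynthesis steps. Keeping as many syllable-level coefficients as there are frames in each syllable makes the reconstruction exact, which is a convenient sanity check.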
Reference:
Adriana Stan, Mircea Giurgiu, "A Superpositional Model Applied to F0 Parametrisation using DCT for Text-to-Speech Synthesis", In Proceedings of the 6th Conference on Speech Technology and Human-Computer Dialogue, Brasov, Romania, 2011.
Bibtex Entry:
@inproceedings{SPED11,
author = {Adriana Stan and Mircea Giurgiu},
title = {{A Superpositional Model Applied to F0 Parametrisation using
DCT for Text-to-Speech Synthesis}},
year = 2011,
abstract = {This paper addresses a superpositional model based on a DCT
(Discrete Cosine Transform) parameterization of F0 contours. We
examine the capacity of the DCT coefficients to capture both the
fast variations of the F0 contour at syllable level and the overall
trend at phrase level. The syllable-level coefficients are computed
after subtracting the estimated phrase-level contour from the
original one, so that each syllable is treated as having an additive
prosodic effect on the phrase-level contour. We also compare three
decision and regression tree algorithms for clustering and predicting
the DCT coefficients. Additional features are selected with a greedy
stepwise feature selection method without backtracking. The results
support the proposed method, with low average squared errors and few
or no perceptible errors in the synthesized speech.},
booktitle = {Proceedings of the 6th Conference on Speech Technology and
Human-Computer Dialogue},
address = {Brasov, Romania}
}