A Superpositional Model Applied to F0 Parametrisation using DCT for Text-to-Speech Synthesis

Stan, Adriana; Giurgiu, Mircea

Abstract:

This paper applies a superpositional model to F0 contours parameterized with the DCT (Discrete Cosine Transform). We examine the capacity of the DCT coefficients to capture both the fast variations of the F0 contour at the syllable level and its overall trend at the phrase level. The syllable-level coefficients are computed from the residual obtained by subtracting the estimated phrase-level contour from the original contour, so that each syllable is treated as having an additive prosodic effect on the phrase-level contour. We also compare three decision and regression tree algorithms for clustering and predicting the DCT coefficients. Additional features are selected with a greedy stepwise feature selection method without backtracking. The results support the proposed method: the average squared errors are low and the synthesized speech contains few or no perceptible errors.
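The decomposition described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the number of retained coefficients (3 per phrase, 4 per syllable), the frame-index syllable boundaries, and the synthetic contour are all assumptions made for the example. It also assumes a continuous F0 track, with unvoiced frames already interpolated.

```python
import math


def dct(x, k):
    """First k coefficients of the orthonormal DCT-II of sequence x."""
    n = len(x)
    coefs = []
    for j in range(k):
        scale = math.sqrt((1.0 if j == 0 else 2.0) / n)
        coefs.append(scale * sum(
            v * math.cos(math.pi * (i + 0.5) * j / n) for i, v in enumerate(x)))
    return coefs


def idct(coefs, n):
    """Rebuild n samples from (possibly truncated) orthonormal DCT-II coefficients."""
    samples = []
    for i in range(n):
        total = 0.0
        for j, c in enumerate(coefs):
            scale = math.sqrt((1.0 if j == 0 else 2.0) / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * j / n)
        samples.append(total)
    return samples


def decompose(f0, syllable_bounds, n_phrase=3, n_syll=4):
    """Split an F0 contour into phrase-level and syllable-level DCT coefficients.

    syllable_bounds is a list of (start, end) frame indices (end exclusive);
    n_phrase and n_syll are assumed values, not taken from the paper.
    """
    # Phrase level: a few low-order coefficients capture the overall trend.
    phrase_coefs = dct(f0, n_phrase)
    phrase_contour = idct(phrase_coefs, len(f0))
    # Syllable level: parameterize what is left after removing the phrase trend.
    residual = [a - b for a, b in zip(f0, phrase_contour)]
    syll_coefs = [dct(residual[s:e], min(n_syll, e - s)) for s, e in syllable_bounds]
    return phrase_coefs, syll_coefs


def reconstruct(phrase_coefs, syll_coefs, syllable_bounds, n):
    """Superpose the syllable-level contours on the phrase-level contour."""
    contour = idct(phrase_coefs, n)
    for (s, e), coefs in zip(syllable_bounds, syll_coefs):
        for i, v in enumerate(idct(coefs, e - s)):
            contour[s + i] += v
    return contour


if __name__ == "__main__":
    # Synthetic contour in Hz: a declining phrase trend plus syllable-sized bumps.
    f0 = [120.0 - 0.5 * i + 8.0 * math.sin(2 * math.pi * i / 10.0) for i in range(40)]
    bounds = [(0, 10), (10, 20), (20, 30), (30, 40)]
    phrase_coefs, syll_coefs = decompose(f0, bounds)
    rebuilt = reconstruct(phrase_coefs, syll_coefs, bounds, len(f0))
    mse = sum((a - b) ** 2 for a, b in zip(f0, rebuilt)) / len(f0)
    print("mean squared reconstruction error:", mse)
```

In a setup like the one the abstract describes, the coefficient vectors returned by `decompose` would be the targets that the decision and regression trees predict from linguistic features, and `reconstruct` would turn predicted coefficients back into an F0 contour at synthesis time.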

Reference:

Adriana Stan, Mircea Giurgiu, "A Superpositional Model Applied to F0 Parametrisation using DCT for Text-to-Speech Synthesis", In Proceedings of the 6th Conference on Speech Technology and Human-Computer Dialogue, Brasov, Romania, 2011.

Bibtex Entry:

@inproceedings{SPED11,
  author = {Adriana Stan and Mircea Giurgiu},
  title =  {{A Superpositional Model Applied to F0 Parametrisation using 
                    DCT for Text-to-Speech Synthesis}},
  year = 2011,
  abstract = {This paper applies a superpositional model to F0 contours 
                parameterized with the DCT (Discrete Cosine Transform). We examine 
                the capacity of the DCT coefficients to capture both the fast 
                variations of the F0 contour at the syllable level and its overall 
                trend at the phrase level. The syllable-level coefficients are 
                computed from the residual obtained by subtracting the estimated 
                phrase-level contour from the original contour, so that each 
                syllable is treated as having an additive prosodic effect on the 
                phrase-level contour. We also compare three decision and regression 
                tree algorithms for clustering and predicting the DCT coefficients. 
                Additional features are selected with a greedy stepwise feature 
                selection method without backtracking. The results support the 
                proposed method: the average squared errors are low and the 
                synthesized speech contains few or no perceptible errors.},
  booktitle = {Proceedings of the 6th Conference on Speech Technology and 
                    Human-Computer Dialogue},
  address = {Brasov, Romania}
}