Romanian language statistics and resources for text-to-speech systems

Stan, Adriana; Giurgiu, Mircea

Abstract:

This paper presents a series of results and experiments from the development of a Romanian text-to-speech system, focusing on text statistics. We investigate the occurrence of several linguistic units used in text-to-speech systems, from phonemes to words. The text corpus we used, News-Romanian (News-RO), comprises 4500 newspaper articles. A subset of it, around 2500 sentences, constitutes the Romanian Speech Synthesis (RSS) recorded speech database. The results offer important insight into how a speech database should be designed. We also describe the methods used to develop a 50,000-word Romanian lexicon with phonetic transcription and accent positioning. Such a lexicon is useful for the machine learning algorithms in the front-end of a text-to-speech system. In addition, we study the use of the Maximal Onset Principle for Romanian syllabification.

Reference:

Adriana Stan, Mircea Giurgiu, "Romanian language statistics and resources for text-to-speech systems", In Proceedings of the 9th Edition of the International Symposium on Electronics and Telecommunications, Timisoara, Romania, pp. 381-384, 2010.

Bibtex Entry:

@inproceedings{ISETC10,
  author = {Adriana Stan and Mircea Giurgiu},
  title = {Romanian language statistics and resources for text-to-speech 
                    systems},
  booktitle = {Proceedings of the 9th Edition of the International 
                    Symposium on Electronics and Telecommunications},
  abstract = {This paper presents a series of results and experiments 
                   from the development of a Romanian text-to-speech 
                   system, focusing on text statistics. We investigate the 
                   occurrence of several linguistic units used in text-to-speech 
                   systems, from phonemes to words. The text corpus we used, 
                   News-Romanian (News-RO), comprises 4500 newspaper articles. 
                   A subset of it, around 2500 sentences, constitutes the Romanian 
                   Speech Synthesis (RSS) recorded speech database. The results 
                   offer important insight into how a speech database should be 
                   designed. We also describe the methods used to develop a 
                   50,000-word Romanian lexicon with phonetic transcription 
                   and accent positioning. Such a lexicon is useful for the machine 
                   learning algorithms in the front-end of a text-to-speech 
                   system. In addition, we study the use of the Maximal Onset 
                   Principle for Romanian syllabification.},
  year = {2010},
  pages = {381--384},
  address = {Timisoara, Romania}
}