A Grapheme-based Method for Automatic Alignment of Speech and Text Data

Stan, Adriana; Bell, Peter; King, Simon

Abstract:

This paper introduces a method for automatic alignment of speech data with unsynchronised, imperfect transcripts, for a domain where no initial acoustic models are available. Using grapheme-based acoustic models, word skip networks and orthographic speech transcripts, we are able to harvest 55% of the speech with a 93% utterance-level accuracy and 99% word accuracy for the produced transcriptions. The work is based on the assumption that there is a high degree of correspondence between the speech and text, and that a full transcription of all of the speech is not required. The method is language independent and the only prior knowledge and resources required are the speech and text transcripts, and a few minor user interventions.
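
As a rough illustration of why grapheme-based models need no pronunciation dictionary, the sketch below builds a lexicon in which each word is "pronounced" as its own letter sequence. This is not the authors' implementation: the function name `grapheme_lexicon`, the lowercasing, and the punctuation stripping are all assumptions made for the example.

```python
from typing import Dict, List


def grapheme_lexicon(transcript: str) -> Dict[str, List[str]]:
    """Map each distinct word in a transcript to its sequence of letters.

    Hypothetical helper: text normalisation here (lowercasing, dropping
    non-alphabetic characters) is an assumption, not taken from the paper.
    """
    lexicon: Dict[str, List[str]] = {}
    for token in transcript.lower().split():
        # Keep alphabetic characters only, so "sat." becomes "sat".
        word = "".join(ch for ch in token if ch.isalpha())
        if word:
            # The acoustic units of a word are simply its graphemes.
            lexicon[word] = list(word)
    return lexicon


lexicon = grapheme_lexicon("The cat sat.")
```

Because the units are derived directly from the orthography, the same procedure applies to any language with an alphabetic script, which is in keeping with the language independence claimed in the abstract.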

Reference:

Adriana Stan, Peter Bell, Simon King, "A Grapheme-based Method for Automatic Alignment of Speech and Text Data", In Proc. IEEE Workshop on Spoken Language Technology, Miami, Florida, USA, pp. 286-290, 2012.

Bibtex Entry:

@inproceedings{stan12_grapheme_alignment,
  author = {Stan, Adriana and Bell, Peter and King, Simon},
  title = {A Grapheme-based Method for Automatic Alignment of
                   Speech and Text Data},
  booktitle = {Proc. IEEE Workshop on Spoken Language Technology},
  address = {Miami, Florida, USA},
  abstract = {This paper introduces a method for automatic alignment
                   of speech data with unsynchronised, imperfect
                   transcripts, for a domain where no initial acoustic
                   models are available. Using grapheme-based acoustic
                   models, word skip networks and orthographic speech
                   transcripts, we are able to harvest 55\% of the speech
                   with a 93\% utterance-level accuracy and 99\% word
                   accuracy for the produced transcriptions. The work is
                   based on the assumption that there is a high degree of
                   correspondence between the speech and text, and that a
                   full transcription of all of the speech is not
                   required. The method is language independent and the
                   only prior knowledge and resources required are the
                   speech and text transcripts, and a few minor user
                   interventions.},
  month = dec,
  year = 2012,
  pages = {286--290},
  url = {papers/2012_SLT.pdf}
}