Extending automatic transcripts in a unified data representation towards a prosodic-based metadata annotation and evaluation


Automatic speech processing
Speech alignment
Structural metadata
Speech prosody
Speech data representation
Multiple-domain speech corpora
Cross-language speech processing

How to Cite

Batista F, Moniz H, Trancoso I, Mamede N, Mata AI. Extending automatic transcripts in a unified data representation towards a prosodic-based metadata annotation and evaluation. J. of Speech Sci. [Internet]. 2021 Feb. 4 [cited 2023 May 29];2(2):113-36. Available from:


This paper describes a framework that extends automatic speech transcripts to accommodate relevant information from manual transcripts, the speech signal itself, and other resources such as lexica. The framework automatically collects, relates, computes, and stores all relevant information in a single self-contained data source, making it easy to provide a wide range of interconnected information suitable for speech analysis and for training and evaluating a number of automatic speech processing tasks. Its main goal is to integrate different linguistic and paralinguistic layers of knowledge for a more complete view of their representation and interactions across several domains and languages. The processing chain has two main stages: the first integrates the relevant manual annotations into the speech recognition data, and the second further enriches that output with prosodic information. The framework has been used to identify and analyze structural metadata in automatic speech transcripts. Initially used for automatic detection of punctuation marks and for capitalization recovery from speech data, it has more recently been used to study the characterization of disfluencies in speech. It has already been applied to Portuguese corpora from several domains, and to English and Spanish Broadcast News corpora.
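The two-stage chain can be illustrated with a minimal sketch. It is not the authors' implementation: the `Word` record, the function names `integrate_manual` and `add_prosody`, and the annotation keys are all hypothetical. The alignment uses Python's generic `difflib.SequenceMatcher` as a stand-in for whatever aligner the framework uses, and word duration and inter-word pause, both derived from timestamps, stand in for the prosodic information.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher

PUNCT = ".,?!"


@dataclass
class Word:
    """One recognized word with its time span and a free-form annotation layer."""
    text: str
    start: float  # seconds
    end: float    # seconds
    annotations: dict = field(default_factory=dict)


def integrate_manual(asr_words, manual_tokens):
    """Stage 1 (sketch): copy capitalization and punctuation from the manual
    transcript onto the ASR words that align with it."""
    asr_keys = [w.text.lower() for w in asr_words]
    man_keys = [t.rstrip(PUNCT).lower() for t in manual_tokens]
    matcher = SequenceMatcher(a=asr_keys, b=man_keys, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            word = asr_words[block.a + k]
            token = manual_tokens[block.b + k]
            stripped = token.rstrip(PUNCT)
            word.annotations["reference"] = stripped
            word.annotations["capitalized"] = stripped[:1].isupper()
            word.annotations["punct_after"] = token[len(stripped):]
    return asr_words


def add_prosody(words):
    """Stage 2 (sketch): add timing-derived features to every word.
    Real prosodic enrichment would also draw on the speech signal."""
    for cur, nxt in zip(words, words[1:] + [None]):
        cur.annotations["duration"] = round(cur.end - cur.start, 3)
        cur.annotations["pause_after"] = (
            round(nxt.start - cur.end, 3) if nxt is not None else None
        )
    return words


# Hypothetical data: the recognizer misheard "world" as "word".
asr = [Word("hello", 0.0, 0.4), Word("word", 0.5, 0.9), Word("today", 1.5, 2.0)]
manual = ["Hello", "world,", "today."]
enriched = add_prosody(integrate_manual(asr, manual))
```

In this sketch the misrecognized word receives no manual annotation but still gets timing features, so every layer stays attached to the same word-level record, in the spirit of the self-contained data source described above.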




This work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright (c) 2012 F. Batista, H. Moniz, I. Trancoso, N. Mamede, A. I. Mata

