A method for lexical tone classification in audio-visual speech





Multimodal speech, Lexical tone, Cantonese language, Statistical learning, Linear discriminant analysis


This work presents a method for lexical tone classification in audio-visual speech. The method is applied to a speech data set consisting of syllables and words produced by a female native speaker of Cantonese and recorded in an audio-visual speech production experiment. The visual component of speech was measured by tracking the positions of active markers placed on the speaker's face, and the acoustic component was recorded with an ordinary microphone. A pitch tracking algorithm was used to estimate the fundamental frequency (F0) from the acoustic signal, and a head motion compensation procedure was applied to the tracked marker positions to separate the head and face motion components. The data were then organized into four signal groups: F0, Face, Head, and Face+Head. The signals in each group were parameterized by a polynomial approximation and used to train a Linear Discriminant Analysis (LDA) classifier that maps the input signals to one of the output classes (the lexical tones of the language), with one classifier trained per signal group. The ability of each signal group to predict the correct lexical tone was assessed by the accuracy of the corresponding classifier, estimated by k-fold cross-validation. The classifiers for all signal groups performed above chance, with F0 achieving the highest accuracy, followed by Face+Head, Face, and Head, in that order. The differences in performance between all signal groups were statistically significant.
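The pipeline summarized in the abstract (polynomial parameterization of each signal, an LDA classifier, and k-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy contours, the three tone labels, the polynomial degree, the noise level, and the number of folds are all assumptions chosen for illustration, and the code uses only the Python standard library so that it runs without dependencies.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def poly_features(contour, degree=2):
    """Least-squares polynomial coefficients (low to high order), time scaled to [-1, 1]."""
    n = len(contour)
    t = [2.0 * i / (n - 1) - 1.0 for i in range(n)]
    V = [[ti ** p for p in range(degree + 1)] for ti in t]
    AtA = [[sum(v[a] * v[b] for v in V) for b in range(degree + 1)] for a in range(degree + 1)]
    Aty = [sum(v[a] * yi for v, yi in zip(V, contour)) for a in range(degree + 1)]
    return solve(AtA, Aty)

def lda_fit(X, y):
    """Fit LDA: class means, pooled within-class covariance, linear discriminants."""
    classes = sorted(set(y))
    d = len(X[0])
    means = {c: [sum(x[j] for x, lab in zip(X, y) if lab == c) / y.count(c)
                 for j in range(d)] for c in classes}
    S = [[0.0] * d for _ in range(d)]
    for x, lab in zip(X, y):
        dev = [x[j] - means[lab][j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                S[a][b] += dev[a] * dev[b] / (len(X) - len(classes))
    model = {}
    for c in classes:
        w = solve(S, means[c])  # w = S^-1 * mean
        bias = -0.5 * sum(wi * mi for wi, mi in zip(w, means[c])) + math.log(y.count(c) / len(y))
        model[c] = (w, bias)
    return model

def lda_predict(model, x):
    """Return the class with the largest linear discriminant score."""
    return max(model, key=lambda c: sum(wi * xi for wi, xi in zip(model[c][0], x)) + model[c][1])

def kfold_accuracy(X, y, k=5, seed=0):
    """Estimate accuracy by k-fold cross-validation over shuffled indices."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = lda_fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(lda_predict(model, X[i]) == y[i] for i in fold)
    return correct / len(X)

def make_toy_contours(n_per_class=40, n_samples=30, seed=0):
    """Hypothetical data: noisy level, rising, and falling contours (not real F0)."""
    rng = random.Random(seed)
    slopes = {"level": 0.0, "rising": 0.5, "falling": -0.5}
    X, y = [], []
    for label, slope in slopes.items():
        for _ in range(n_per_class):
            contour = [slope * (2.0 * i / (n_samples - 1) - 1.0) + rng.gauss(0.0, 0.2)
                       for i in range(n_samples)]
            X.append(poly_features(contour))
            y.append(label)
    return X, y

X, y = make_toy_contours()
accuracy = kfold_accuracy(X, y)
print(f"5-fold accuracy: {accuracy:.2f} (chance = {1/3:.2f})")
```

In the setting described above, one such classifier would be trained per signal group (F0, Face, Head, Face+Head), and the resulting cross-validated accuracies would then be compared across groups and against chance.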



Author Biographies

João Vítor Possamai de Menezes, Federal University of Minas Gerais

Master's from the Graduate Program in Electrical Engineering at the Federal University of Minas Gerais.

Maria Mendes Cantoni, Federal University of Minas Gerais

Assistant Professor at the Faculty of Arts of the Federal University of Minas Gerais.

Denis Burnham, Western Sydney University

PhD in Psychology from Monash University. Research Professor at the MARCS Institute for Brain, Behavior and Development at Western Sydney University.

Adriano Vilela Barbosa, Federal University of Minas Gerais

Graduate Program in Electrical Engineering at the Federal University of Minas Gerais.





How to Cite

Menezes JVP de, Cantoni MM, Burnham D, Barbosa AV. A method for lexical tone classification in audio-visual speech. J. of Speech Sci. [Internet]. 2020 Sep 9 [cited 2021 Sep 17];9(00):93-104. Available from: https://econtents.bc.unicamp.br/inpec/index.php/joss/article/view/14960



Thematic Issue