A method for lexical tone classification in audio-visual speech

João Vítor Possamai de Menezes; Maria Mendes Cantoni; Denis Burnham; Adriano Vilela Barbosa

doi:10.20396/joss.v9i00.14960

Vol. 9 (2020), Thematic Issue

Published 2020-09-09

João Vítor Possamai de Menezes

Federal University of Minas Gerais

http://orcid.org/0000-0002-7612-9754

Maria Mendes Cantoni

Federal University of Minas Gerais

https://orcid.org/0000-0001-9515-1802

Denis Burnham

Western Sydney University

http://orcid.org/0000-0002-1980-3458

Adriano Vilela Barbosa

Federal University of Minas Gerais

http://orcid.org/0000-0003-1083-8256

Keywords

Multimodal speech
Lexical tone
Cantonese language
Statistical learning
Linear discriminant analysis

How to Cite

Menezes JVP de, Cantoni MM, Burnham D, Barbosa AV. A method for lexical tone classification in audio-visual speech. J. of Speech Sci. [Internet]. 2020 Sep. 9 [cited 2024 Jul. 26];9(00):93-104. Available from: https://econtents.bc.unicamp.br/inpec/index.php/joss/article/view/14960

Abstract

This work presents a method for lexical tone classification in audio-visual speech. The method is applied to a speech data set consisting of syllables and words produced by a female native speaker of Cantonese and recorded in an audio-visual speech production experiment. The visual component of speech was measured by tracking the positions of active markers placed on the speaker's face, whereas the acoustic component was recorded with an ordinary microphone. A pitch tracking algorithm is used to estimate F0 from the acoustic signal, and a head motion compensation procedure is applied to the tracked marker positions to separate the head and face motion components. The data are then organized into four signal groups: F0, Face, Head, and Face+Head. The signals in each group are parameterized by a polynomial approximation and used to train a Linear Discriminant Analysis (LDA) classifier that maps the input signals to one of the output classes (the lexical tones of the language). One classifier is trained for each signal group, and the ability of each group to predict the correct lexical tone is assessed by the accuracy of the corresponding classifier, estimated by k-fold cross-validation. The classifiers for all signal groups performed above chance, with F0 achieving the highest accuracy, followed by Face+Head, Face, and Head, in that order. The differences in performance between all signal groups were statistically significant.
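The pipeline summarized in the abstract (polynomial parameterization of each signal, an LDA classifier, accuracy estimated by k-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation: the polynomial degree (2), the number of folds (5), the pooled-covariance form of LDA, and the three synthetic contour shapes are all assumptions chosen for illustration, and every function name is hypothetical. The toy data stand in for F0 contours and are not Cantonese recordings.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def poly_features(signal, degree=2):
    """Least-squares polynomial coefficients over normalized time [0, 1]."""
    n = len(signal)
    t = [i / (n - 1) for i in range(n)]
    # Normal equations (V^T V) c = V^T y for the Vandermonde matrix V.
    a = [[sum(ti ** (p + q) for ti in t) for q in range(degree + 1)]
         for p in range(degree + 1)]
    b = [sum(y * ti ** p for ti, y in zip(t, signal)) for p in range(degree + 1)]
    return solve(a, b)


def fit_lda(X, y):
    """Fit LDA with a pooled covariance; return (label, weights, bias) per class."""
    d = len(X[0])
    classes = sorted(set(y))
    means = {}
    for k in classes:
        rows = [x for x, yk in zip(X, y) if yk == k]
        means[k] = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for x, yk in zip(X, y):
        diff = [x[j] - means[yk][j] for j in range(d)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / (len(X) - len(classes))
    model = []
    for k in classes:
        w = solve(cov, means[k])  # w = Sigma^{-1} mu_k
        prior = y.count(k) / len(y)
        bias = -0.5 * sum(wi * mi for wi, mi in zip(w, means[k])) + math.log(prior)
        model.append((k, w, bias))
    return model


def predict(model, x):
    """Return the label whose linear discriminant score is largest."""
    score = lambda kwb: sum(wi * xi for wi, xi in zip(kwb[1], x)) + kwb[2]
    return max(model, key=score)[0]


def kfold_accuracy(X, y, k=5, seed=0):
    """Fraction of held-out samples classified correctly across k folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for fold in (idx[i::k] for i in range(k)):
        held = set(fold)
        train = [i for i in idx if i not in held]
        model = fit_lda([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in fold)
    return correct / len(X)


def synth_contour(tone, rng, n=20):
    """Toy contours (assumed): 0 = level, 1 = rising, 2 = falling."""
    slope = {0: 0.0, 1: 40.0, 2: -40.0}[tone]
    return [200.0 + slope * i / (n - 1) + rng.gauss(0, 5) for i in range(n)]


rng = random.Random(1)
labels = [tone for tone in (0, 1, 2) for _ in range(30)]
features = [poly_features(synth_contour(tone, rng)) for tone in labels]
accuracy = kfold_accuracy(features, labels)
print(f"cross-validated accuracy: {accuracy:.2f}")
```

In a setup like the one described, one such classifier would be fitted per signal group (F0, Face, Head, Face+Head), with the feature vector for the motion groups formed by concatenating the polynomial coefficients of each marker coordinate. The resulting accuracies can then be compared against the chance level of one over the number of tones.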

This work is licensed under a Creative Commons Attribution 4.0 International License.
