Review of research on speech technology: main contributions from Spanish research groups

Rubén San-Segundo; Carlos D. Martínez-Hinarejos; Alfonso Ortega

doi:10.20396/joss.v1i1.15010

Vol. 1 No. 1 (2011), Reviews

Published 2011-07-01

Rubén San-Segundo

Universidad Politécnica de Madrid

Carlos D. Martínez-Hinarejos

Universidad Politécnica de Valencia

Alfonso Ortega

Universidad de Zaragoza

Keywords

Speech technology
Spain
Castilian
Catalan
Basque
Galician

How to Cite

San-Segundo R, Martínez-Hinarejos CD, Ortega A. Review of research on speech technology: main contributions from Spanish research groups. J. of Speech Sci. [Internet]. 2011 Jul. 1 [cited 2024 Jul. 22];1(1):31-53. Available from: https://econtents.bc.unicamp.br/inpec/index.php/joss/article/view/15010

Abstract

Over the last two decades, research on speech technology in Spain has grown substantially, driven mainly by increased funding from European, Spanish and local institutions and by growing interest in these technologies for developing new services and applications. This paper reviews the main areas of speech technology addressed by research groups in Spain, their main contributions in recent years and their current focus of interest. The review is organized into five main areas: audio processing including speech, speaker characterization, speech and language processing, text-to-speech conversion and spoken language applications. The paper also introduces the Spanish Network of Speech Technologies (RTTH, Red Temática en Tecnologías del Habla), the research network that includes almost all the researchers working in this area, and presents some figures about it, its objectives and the main activities it has carried out in recent years.

This work is licensed under a Creative Commons Attribution 4.0 International License.
