Prosody prediction for Arabic via the open-source Boundary-Annotated Qur’an corpus

Keywords

Phrase break prediction
Prosodic annotation
Tajwid recitation
N-gram and HMM taggers
Boundary-annotated
PoS-tagged Qur’an

How to Cite

Sawalha MS, Brierley C, Atwell E. Prosody prediction for Arabic via the open-source Boundary-Annotated Qur’an corpus. J. of Speech Sci. [Internet]. 2021 Feb. 4 [cited 2024 Mar. 29];2(2):175-91. Available from: https://econtents.bc.unicamp.br/inpec/index.php/joss/article/view/15038

Abstract

humans or machines. To develop phrase break classifiers, we need a boundary-annotated and part-of-speech tagged corpus. Boundary annotations in English speech corpora are descriptive, delimiting intonation units perceived by the listener; manual annotation must be done by an expert linguist. For Arabic, there are no existing suitable resources. We take a novel approach to phrase break prediction for Arabic, deriving our prosodic annotation scheme from Tajwid (recitation) mark-up in the Qur’an, which we then interpret as additional text-based data for computational analysis. This mark-up is prescriptive and signifies a widely used recitation style, one of seven original styles of transmission. Here we report on version 1.0 of our Boundary-Annotated Qur’an dataset of 77430 words and 8230 sentences, where each word is tagged with prosodic and syntactic information at two coarse-grained levels. We then use this dataset to train, test, and compare two probabilistic taggers (trigram and HMM) for Arabic phrase break prediction, where the task is to predict boundary locations in an unseen test set stripped of boundary annotations by classifying words as breaks or non-breaks. The preponderance of non-breaks in the training data sets a challenging baseline success rate: 85.56%. However, we achieve significant gains in accuracy with a trigram tagger, and significant gains in the recognition of minority class instances with both taggers, as measured by the Balanced Classification Rate metric. This is initial work on a long-term research project to produce annotation schemes, language resources, algorithms, and applications for Classical and Modern Standard Arabic.
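The evaluation set-up described in the abstract can be illustrated with a minimal sketch. It assumes that the baseline is a majority-class predictor (every word labelled as a non-break) and that the Balanced Classification Rate is the mean of sensitivity and specificity; the abstract does not spell out either definition, so both are assumptions here. The function names and the toy labels are hypothetical and are not taken from the Boundary-Annotated Qur’an dataset.

```python
from collections import Counter

def majority_baseline(train_labels):
    """Return the most frequent label seen in training (assumed baseline)."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(gold, pred):
    """Fraction of words whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def balanced_classification_rate(gold, pred, positive="break"):
    """Mean of sensitivity and specificity (assumed definition of BCR)."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    pos = sum(g == positive for g in gold)
    neg = len(gold) - pos
    sensitivity = tp / pos if pos else 0.0
    specificity = tn / neg if neg else 0.0
    return (sensitivity + specificity) / 2

# Toy, made-up labels: non-breaks heavily outnumber breaks.
train = ["non-break"] * 17 + ["break"] * 3
gold = ["non-break", "non-break", "break", "non-break", "non-break",
        "non-break", "break", "non-break", "non-break", "non-break"]

baseline_label = majority_baseline(train)
baseline_pred = [baseline_label] * len(gold)

# A hypothetical tagger output that recovers one of the two breaks.
tagger_pred = list(gold)
tagger_pred[6] = "non-break"

print(accuracy(gold, baseline_pred),
      balanced_classification_rate(gold, baseline_pred))
print(accuracy(gold, tagger_pred),
      balanced_classification_rate(gold, tagger_pred))
```

On skewed data such as this, the majority baseline scores well on accuracy while finding no breaks at all, which is why a class-balanced metric is reported alongside accuracy.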

https://doi.org/10.20396/joss.v2i2.15038

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright (c) 2012 M. S. Sawalha, C. Brierley, E. Atwell
