Prosody Prediction for Arabic via the Open-Source Boundary-Annotated Qur’an Corpus

A phrase break classifier is needed to predict natural prosodic pauses in text to be read out loud by humans or machines. To develop phrase break classifiers, we need a boundary-annotated and part-of-speech tagged corpus. Boundary annotations in English speech corpora are descriptive, delimiting intonation units perceived by the listener; manual annotation must be done by an expert linguist. For Arabic, there are no existing suitable resources. We take a novel approach to phrase break prediction for Arabic, deriving our prosodic annotation scheme from Tajwid (recitation) mark-up in the Qur’an, which we then interpret as additional text-based data for computational analysis. This mark-up is prescriptive, and signifies a widely-used recitation style which is one of seven original styles of transmission. Here we report on version 1.0 of our Boundary-Annotated Qur’an dataset of 77430 words and 8230 sentences, where each word is tagged with prosodic and syntactic information at two coarse-grained levels. We then use this dataset to train, test, and compare two probabilistic taggers (trigram and HMM) for Arabic phrase break prediction, where the task is to predict boundary locations in an unseen test set stripped of boundary annotations by classifying words as breaks or non-breaks. The preponderance of non-breaks in the training data sets a challenging baseline success rate: 85.56%. However, we achieve significant gains in accuracy with a trigram tagger, and significant gains in recognition of minority class instances with both taggers, as measured by the Balanced Classification Rate metric. This is initial work on a long-term research project to produce annotation schemes, language resources, algorithms, and applications for Classical and Modern Standard Arabic.


Introduction
An accepted Universal of language is that people process speech (and text) in chunks (1), which in turn can be interpreted syntactically as function word groups (2) and prosodically as tone units (3,4). A phrase break classifier is needed to predict natural chunks in text to be read out loud by humans or machines. Phrase break prediction is a classification task within the Text-to-Speech synthesis pipeline that attempts to simulate human chunking strategies by assigning prosodic-syntactic boundaries to input text.
To develop phrase break classifiers, we need a boundary-annotated and part-of-speech tagged corpus.
Boundary annotations in English speech corpora are descriptive, delimiting intonation units perceived by the listener; manual annotation must be done by an expert English linguist. Our research applies techniques honed on English (5) to another stress-timed language, Arabic, and to the entire text of the Qur'an (§4). For Modern Arabic, there are no existing suitable resources with prosodic phrase boundaries annotated by Arabic linguistics experts. However, the Qur'an can be used as a reputable "gold standard" for phrasing in Arabic, because traditional editions include boundary mark-up to aid correct recitation, based on long-established traditions of Qur'anic Arabic linguistics developed to help believers read and understand the Qur'an. We can therefore harness the recitation mark-up in traditional Qur'an editions as phrase break mark-up in a Boundary-Annotated Qur'an Corpus.
Chunking text via automatic assignment of sentence-medial and sentence-terminal prosodic-syntactic boundaries is a Natural Language Processing (NLP) and machine learning task which attempts to simulate human parsing and phrasing strategies. The latter are represented by "gold standard" boundary annotations in a speech corpus. Phrase break classifiers are typically trained and tested on such datasets, and assume prior sentence segmentation and part-of-speech (PoS) tagging for input text. Here, we utilize our boundary-annotated Qur'an corpus of Classical Arabic (6) to develop and evaluate two probabilistic taggers (n-gram and HMM) for the phrase break prediction task, using two different feature sets. We regard the Qur'an as a reputable 'gold standard' for phrasing in Arabic because recitation is integral to this text, and many editions (§4) already carry prescriptive boundary mark-up representative of the long-established traditions of Arabic linguistics. Hence we plan to assess the naturalness and intelligibility of outputs from our best-performing tagger over a sample of Modern Standard Arabic (MSA) text (6).

Phrase Break Prediction
Automated phrase break prediction is a natural language processing (NLP) task within the Text-to-Speech (TTS) synthesis pipeline, and sub-divides input sentences into meaningful chunks to copy the way in which a native speaker might parse or phrase the utterance. This equates to classifying junctures between words, or the words themselves, in terms of a finite set of boundary types, for example breaks or non-breaks. Establishing these delimiters is an essential component of the symbolic linguistic representation of text as output to a speech synthesizer.

General Procedure for Phrase Break Prediction
Phrase break prediction assumes prior sentence segmentation and part-of-speech tagging for input text, and therefore punctuation and syntax are traditionally used as classificatory features. Another prerequisite is a boundary-annotated and part-of-speech (PoS) tagged corpus (6) as 'gold standard' for developing phrase break classifiers. The classifier is trained on a substantive sample of 'gold-standard' boundary-annotated text, and tested on a smaller, unseen sample from the same source minus the boundary annotations.

Machine Learning Approaches to Phrase Break Prediction
There are two generic approaches to machine learning: rule-based or probabilistic. Phrase break models exemplifying these two approaches are: (i) Liberman and Church's chinks 'n' chunks algorithm

Metrics for Phrase Break Prediction
Performance is primarily evaluated in terms of accuracy, namely the number of correct predictions made during test, or the sum of true positives and true negatives (TP + TN). There are also other relevant metrics such as f-score and balanced classification rate (BCR). The former is the trade-off between, or weighted harmonic mean of, recall (i.e. TP total / total number of boundaries in the sample) and precision (i.e. TP total / total number of boundaries retrieved). The latter (i.e. BCR) mitigates against high accuracy scores arising from class imbalance, a typical scenario for phrase break prediction, since instances of the majority class (non-breaks) far outnumber breaks.
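The metrics above can be illustrated with a minimal Python sketch. It assumes the usual definition of BCR as the mean of the hit rates on the two classes (sensitivity and specificity); the function name and dictionary keys are illustrative only and are not taken from our implementation. The worked example is a majority-class baseline on a test set of 1057 breaks and 6261 non-breaks.

```python
def phrase_break_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F1 and BCR for a binary break/non-break task."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # hit rate on breaks
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # hit rate on non-breaks
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
        # Assumed definition: mean of the per-class hit rates.
        "bcr": 0.5 * (recall + specificity),
    }


# Majority-class baseline: every word is tagged as a non-break.
baseline = phrase_break_metrics(tp=0, fp=0, tn=6261, fn=1057)
print(round(baseline["accuracy"], 4), baseline["bcr"])
```

Under this definition the baseline scores highly on accuracy but only 0.5 on BCR, which is why the two metrics are best read together.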

Building the Open-Source Boundary Annotated Qur'an Corpus
We derive a coarse-grained boundary annotation scheme for Arabic from traditional recitation mark-up (Tajwid) in the Qur'an; this is then compared with existing schemes for British and American English speech corpora (8,10). We then merge a PoS-tagged version of the text (12) with these boundary annotations. A prerequisite for developing and evaluating phrase break classifiers is a "gold standard" boundary-annotated and PoS-tagged corpus. We regard the Qur'an as a reputable "gold standard" for phrasing in Arabic because recitation is integral to this text, and many editions already carry prescriptive boundary mark-up representative of the long-established traditions of Arabic linguistics.

Pause Markers in the Qur'an
Qur'anic verses are meant to be recited aloud from memory at least as much as they are meant for silent reading: '…The Arabic word qur'an means "recitation"...While the words have…been available in written form, equal prominence has been given to the continuing oral tradition…' (14).
The art of Tajwid has developed over time to help believers achieve "clearly articulated recitation", and one aspect of this is the system of stops and starts (وَقْفْ وَٱبْتِدَاء or waqf wa ibtidā) defining intelligible and naturalistic phrasing within and between verses (15). We have derived a coarse-grained boundary annotation scheme for Arabic (16) from Tajwid stops and starts mark-up in a reputable edition of the Qur'an², and in a widely-used recitation style: ḥafṣ bin 'āṣim (cf. 17). This uses the Qurayshi or Meccan dialect, and, according to a 'strong' hadīth, is one of seven original styles of transmission: '…The Qur'an has been revealed to be recited in seven different ways, so recite of it that which is easier for you…' (Sahih al-Bukhari in (18)).
Our annotation scheme is coarse-grained because, for our immediate purposes (19), we have collapsed eight degrees of boundary strength (i.e. three major boundary types, four minor boundary types, and one prohibited stop) into the familiar {major, minor, none} set (Figure 1). Future work will implement the full fine-grained boundary annotation scheme for text analytic investigation and experimentation with an updated version of the corpus. For the present, we note that in addition to its specificity, boundary mark-up in the Qur'an is prescriptive and proactive rather than descriptive and reactive, as in existing systems for English.
Figure 2 displays Verse 45 from Chapter 29 of the Qur'an (Al-Ankabūt or The Spider) in decorative othmāni script, followed by the same verse as it appears in our corpus, in MSA script and with major/minor boundary mark-up. It also displays a transliteration and an English translation of the text. We consider MSA script as preferable for speech and language processing, and for boosting the currency of this corpus for the wider research community.

||  End of verse is a compulsory major break.
||  Major and compulsory verse-medial break which completes the meaning of a phrase.
||  Minor break: a break is allowed and preferable.
ـۗ  |  Minor break (continuation mark): the reader can continue without pausing, but a pause is preferable.
ـۖ  |  Minor break permitted: readers can pause if they wish, but it is better not to.
|  Minor break for a shorter time without breathing, where the last letter before the break is pronounced without its short vowel.
Alternative boundaries in the same phrase: if the reader breaks in one position, they must not break in the other, and vice versa.
non-break  Non-break: pausing is not permitted as it would change the meaning of the verse.
Figure 1: Mapping from Tajwid symbols to coarse-grained tripartite boundary annotation scheme for Arabic. The majority of words do not carry Tajwid boundary mark-up and these are thus tagged as non-breaks in our corpus.

An additional novelty is that we use compulsory and recommended, plus prohibited stops in Tajwid mark-up to segment the text into sentences (cf. Figure 3). Such 'sentences' may constitute the grammatical units of common parlance but may also be realised as sequences of intonation units or

Coarse-Grained Syntactic Annotation
Traditional Arabic grammar (21-23) classifies words into one of three syntactic categories {noun, verb, particle}, and we therefore retain this coarse-grained feature set as the default in our initial experiments (19). Qur'anic Arabic is fully vowelised, unlike MSA; and this facilitates syntactic analysis via this ostensibly straightforward scheme which, without vowelisation, becomes problematic (24). For example, native Arabic speakers will use context to disambiguate the non-vowelised form ورد wrd, which could either be the noun وَرْدٌ wardun (roses), or the verb وَرَدَ warada (to come). A further problem is the mismatch between descriptive frameworks for Arabic and English (aka 'Western') grammar; Arabic nouns subsume adjectives, adverbs, and some prepositions, while particles also subsume some prepositions, as well as conjunctions and negatives (25). Subsequently, we extend our sparse tag set to differentiate a limited selection of subcategories extracted from fully parsed sections of QAC, the Qur'anic Arabic Corpus³ (12). Morpho-syntactic analysis in QAC is fine-grained. For example, in an earlier version of the corpus (v.2.0), the word الرَّحِيمِ ar-raḥīm in Chapter 1:3 (the Most Merciful) is tagged as follows (cf. Figure 4).
r~aHiymi Al+ POS:ADJ LEM:r~aHiym ROOT:rHm MS GEN
particles; disconnected letters}. We therefore extract this information from QAC to tag each token with its main part-of-speech; we also map these categories to the tripartite notation of traditional Arabic grammar: {noun, verb, particle}.

Building the Dataset
To build the Boundary-Annotated Qur'an Corpus we have extracted, processed, and merged information from two online sources: the Tanzil Qur'an project (27) and an earlier version of QAC, the Qur'anic Arabic Corpus (12). A full account of the dataset build is intended for a future publication, but the processing steps involved are, in outline: (i) gathering and tracking boundary stops from Tanzil; (ii) extracting PoS tags from QAC; (iii) collapsing boundary stops into two alternative coarse-grained schemes; (iv) collapsing PoS tags into two alternative coarse-grained schemes; (v) merging these two data streams; (vi) segmenting long paragraphs into sentences.
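Steps (iii) and (vi) can be sketched in Python as follows. This is an illustration under stated assumptions rather than our build code: the eight fine-grained label names are hypothetical stand-ins for the Tajwid stop types, and sentence terminals are taken to be major breaks.

```python
# Hypothetical names for the eight Tajwid boundary degrees (three major,
# four minor, one prohibited stop); the corpus itself stores the symbols.
FINE_TO_COARSE = {
    "verse_end": "major",
    "compulsory_stop": "major",
    "recommended_stop": "major",
    "pause_preferred": "minor",
    "pause_permitted": "minor",
    "short_pause": "minor",
    "alternative_stop": "minor",
    "prohibited_stop": "none",
}


def collapse_boundary(fine_label):
    """Map a fine-grained label (or None for unmarked words) to both coarse schemes."""
    three_class = FINE_TO_COARSE.get(fine_label, "none")
    two_class = "non-break" if three_class == "none" else "break"
    return three_class, two_class


def segment_sentences(words):
    """Split (word, three_class) pairs into sentences that end at major breaks."""
    sentences, current = [], []
    for word, boundary in words:
        current.append((word, boundary))
        if boundary == "major":
            sentences.append(current)
            current = []
    if current:  # keep any trailing words that lack a closing major break
        sentences.append(current)
    return sentences


demo = [("w1", "none"), ("w2", "minor"), ("w3", "major"), ("w4", "none")]
print(segment_sentences(demo))
```

Unmarked words fall through to the "none" class, which mirrors the fact that most words carry no Tajwid boundary symbol.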
The constructed boundary-annotated corpus of 77430 words and 8230 sentences is stored in a tab-separated column file, with each word also stored in a separate file (cf. Figure 5). The first four columns contain tracking information, with Sura (i.e. chapter) number and Aya (i.e. verse) number in the first two columns. The Arabic word in Othmani and then MSA script occupies the fifth and sixth columns respectively. Part-of-speech information is given in the next two columns, with tripartite coarse-grained tags in column seven, and more detailed syntactic annotation in column eight. Column nine stores the Tajwid boundary symbol (if present); and the next two columns show each word classified in terms of boundary type: first as {major, minor, none}, and then as {breaks, non-breaks}. The penultimate column identifies sentence terminals, and the last column gives the word-for-word English translation.
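A reader for this thirteen-column layout might look like the sketch below. The field names are our own labels for the columns just described (the third and fourth tracking columns are not named in the text, so their names are placeholders), and the sample row contains invented placeholder values rather than a real corpus line.

```python
import csv
import io
from typing import NamedTuple


class CorpusRow(NamedTuple):
    sura: str               # column 1: chapter number
    aya: str                # column 2: verse number
    tracking_3: str         # columns 3-4: further tracking fields (names assumed)
    tracking_4: str
    othmani: str            # column 5: word in Othmani script
    msa: str                # column 6: word in MSA script
    pos_coarse: str         # column 7: {noun, verb, particle}
    pos_detailed: str       # column 8: more detailed syntactic tag
    tajwid_symbol: str      # column 9: empty when no symbol is present
    boundary_3class: str    # column 10: {major, minor, none}
    boundary_2class: str    # column 11: {break, non-break}
    sentence_terminal: str  # column 12
    gloss: str              # column 13: word-for-word English translation


def read_rows(handle):
    """Parse tab-separated lines into CorpusRow records, skipping malformed lines."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    width = len(CorpusRow._fields)
    return [CorpusRow(*fields) for fields in reader if len(fields) == width]


# Invented placeholder row, for illustration only.
SAMPLE = "1\t2\t1\t1\tW_OTH\tW_MSA\tnoun\tN\t\tnone\tnon-break\t-\tall-praises-and-thanks\n"
rows = read_rows(io.StringIO(SAMPLE))
print(rows[0].pos_coarse, rows[0].boundary_2class)
```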

Taggers
We implement a trigram tagger based on the Natural Language Toolkit's (20) NgramTagger class to assign boundaries to a corpus of Qur'anic Arabic which is segmented into sentences and PoS-tagged, and where outputs from the tagger can be evaluated against 'gold standard' boundary annotations in the dataset (6). We also implement an HMM or sequence model based on NLTK's HiddenMarkovModelTagger class. Input to the tagger is the same in both cases: our purpose-built Qur'an dataset (6) is segmented into 8230 sentence tokens, and each sentence token is represented as a list of tuples from which we specify permutations of features that match our research questions. A sample Qur'anic sentence is given in Figure 6.
Both taggers used in our experiments take input text segmented into sentences. Since we have classified compulsory and recommended stops in recitation mark-up as major breaks, these are used to identify sentence terminals. Then for our series of experiments, we prepare different permutations of the data to include/exclude words mapped to coarse and slightly finer-grained PoS and either two or three boundary classes. Figure 6 shows sample training input to the tagger as nested lists of tuples.

The Trigram Tagger
Our trigram tagger is coded in Python and trained on Qur'an text represented as (PoS, boundary-type) or ((word, PoS), boundary-type) pairings. For the former, it assigns the most likely boundary type (e.g. break or non-break) based on the current PoS, plus the two preceding boundary types as context. Figure 7 is an adaptation from (20 p.204): shaded areas denote context, and the target for prediction is italicised.
Readers will note that this trigram tagger is based on Python dictionaries: a look-up table is consulted to determine an appropriate tag for each instance; and the tagger backs off to a majority class tagger (i.e. tags the instance as non-break) if look-up fails.
Figure 7: Abstract representation of trigram context used for predicting breaks or non-breaks
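The look-up and back-off behaviour can be illustrated with a small dictionary-based sketch in plain Python. This is a simplified stand-in for NLTK's n-gram tagger, not our experimental code: the function names and the toy training sentences are invented, and the context is the current PoS plus the two preceding boundary types.

```python
from collections import Counter, defaultdict


def train_trigram(sentences):
    """Build a look-up table: (boundary[-2], boundary[-1], PoS) -> most frequent boundary."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        history = [None, None]  # padding for sentence-initial positions
        for pos, boundary in sentence:
            counts[(history[-2], history[-1], pos)][boundary] += 1
            history.append(boundary)
    return {context: c.most_common(1)[0][0] for context, c in counts.items()}


def tag_trigram(table, pos_sequence, default="non-break"):
    """Tag left to right; back off to the majority class when look-up fails."""
    history = [None, None]
    output = []
    for pos in pos_sequence:
        boundary = table.get((history[-2], history[-1], pos), default)
        output.append(boundary)
        history.append(boundary)  # predicted tags become context for later words
    return output


# Toy sentences of (PoS, boundary-type) pairs, invented for illustration.
toy_train = [
    [("particle", "non-break"), ("noun", "non-break"), ("noun", "break")],
    [("verb", "non-break"), ("noun", "non-break"), ("noun", "break")],
]
table = train_trigram(toy_train)
print(tag_trigram(table, ["verb", "noun", "noun"]))
```

Because each prediction is appended to the history and never revisited, an early error can propagate through the remainder of the sentence.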

The HMM Tagger
One drawback of the trigram method is that there is no way to revise previously assigned boundaries as the algorithm iterates through the list (i.e. the sentence). To resolve this, we also implement NLTK's HMM tagger for comparative evaluation (§6). For these initial experiments, we have simply used the train() and evaluate() methods with default parameter settings, plus the train() method with labeled and unlabeled sequences (i.e. training and test set splits), to determine the optimal/most probable combination of break types for each sentence via the Maximum Likelihood Estimate (MLE), which maximizes the joint probability of symbol/state sequences. The HMM tagger generates a probability distribution over all possible boundary types: either break versus non-break (the two-class problem), or major/minor/non-break (the three-class problem). The product of these probabilities then gives a probability score for each boundary sequence, and the highest-scoring sequence is then chosen.
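To make the idea of choosing the highest-scoring boundary sequence concrete, the following is a compact Viterbi sketch in plain Python. It is not NLTK's implementation: boundary types are treated as hidden states and PoS tags as observations, probabilities are relative frequencies with add-one smoothing (our assumption, so that unseen events do not score zero), and all names and the toy data are illustrative.

```python
import math
from collections import Counter, defaultdict


def train_hmm(sentences):
    """Count start, transition and emission events from (PoS, boundary) sentences."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sentence in sentences:
        prev = None
        for pos, state in sentence:
            if prev is None:
                start[state] += 1
            else:
                trans[prev][state] += 1
            emit[state][pos] += 1
            prev = state
    states = sorted(emit)
    vocab = {pos for state in emit for pos in emit[state]}
    return states, len(vocab) + 1, start, trans, emit  # +1 slot for unseen PoS


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log relative frequency."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(model, observations):
    """Return the most probable state sequence for a non-empty PoS sequence."""
    states, n_obs, start, trans, emit = model
    n_states = len(states)
    best = {s: log_prob(start, s, n_states) + log_prob(emit[s], observations[0], n_obs)
            for s in states}
    backpointers = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + log_prob(trans[p], s, n_states))
            scores[s] = (best[prev] + log_prob(trans[prev], s, n_states)
                         + log_prob(emit[s], obs, n_obs))
            pointers[s] = prev
        best = scores
        backpointers.append(pointers)
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]


toy_train = [
    [("particle", "non-break"), ("noun", "non-break"), ("noun", "break")],
    [("verb", "non-break"), ("noun", "non-break"), ("noun", "break")],
]
model = train_hmm(toy_train)
print(viterbi(model, ["verb", "noun", "noun"]))
```

Unlike the left-to-right look-up tagger, the back-tracing step allows a later word to change which boundary is preferred for an earlier one.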

Evaluation
The immediate research question pertaining to this study is: Can we successfully recapture prosodic boundaries authenticated by Tajwid recitation markup using probabilistic taggers trained and tested on our Boundary-Annotated Qur'an Corpus?

Methodology
To address this question, we comparatively evaluate the performance of a trigram tagger and an HMM tagger on two classification tasks, in each of which an Arabic word will be resolved as an instance of one, and only one, specific boundary type.

Test Set Selection
Test set sentences were not randomly selected. There is agreement on the provenance of most Qur'anic verses in terms of whether they originate from the Prophet's period of residence in Mecca or Medina. However, there are 21 (out of 114) chapters where Mecca/Medina verse associations are in doubt (cf. 28). Meccan and Medinan verses differ stylistically (28), and therefore the 21 disputed chapters were used as our test set, since they constitute a representative sample of both styles and a fair test for a tagger trained on the rest of the corpus.
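A chapter-based split of this kind is easy to express in code. In the sketch below, sentences are assumed to be paired with their chapter number, and the chapter numbers passed in are placeholders for illustration only; they are not the actual list of 21 disputed chapters, which is not reproduced here.

```python
def split_by_chapter(sentences, test_chapters):
    """Partition (chapter, sentence) pairs into training and test sentences."""
    train, test = [], []
    for chapter, sentence in sentences:
        (test if chapter in test_chapters else train).append(sentence)
    return train, test


def strip_boundaries(sentence):
    """Remove gold boundary labels so that only PoS tags are shown to the tagger."""
    return [pos for pos, _boundary in sentence]


# Toy corpus: (chapter, [(PoS, boundary), ...]) pairs with placeholder chapter numbers.
corpus = [
    (1, [("noun", "non-break"), ("noun", "break")]),
    (2, [("verb", "non-break"), ("noun", "break")]),
    (3, [("particle", "non-break"), ("verb", "break")]),
]
train, test = split_by_chapter(corpus, test_chapters={2})
print(len(train), len(test), strip_boundaries(test[0]))
```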

Confusion Matrices
Tagger accuracy for each classification task can be expressed as an overall percentage calculated by summing the number of correct predictions for each boundary type, and dividing this total by the total word count (i.e. the total number of items to be classified). Output predictions are presented as a confusion matrix where false positives and false negatives (FPs and FNs) are used to infer basic issues in performance. Table 1 is an example of the confusion matrix for the two-class problem, where shaded area counts constitute the proportion of correct predictions (true positives and true negatives) retrieved during test for our trigram tagger using very coarse-grained PoS. Readers will note that class distributions in the test set are highly skewed: 6261 non-breaks versus 1057 breaks.

The Two-Class Problem
Table 2 displays results for binary classification experiments with both taggers, and feature set permutations which include/exclude words PoS-tagged at two different levels of granularity. What is immediately obvious is that data skew (i.e. the over-preponderance of non-breaks) sets a high baseline accuracy of 85.56%. Nevertheless, the trigram tagger in Runs 1 and 5 significantly outperforms the baseline for both syntactic feature sets: 88.47% for 3 PoS categories, and 88.44% for 10 PoS categories. Success rate for the HMM tagger is below par, but its superior true positive hit rate (i.e. 600 TPs) and BCR statistic of 0.72 suggest that this tagger has learnt the concept better than the others. Brierley (5) recommends consideration of more than one evaluation metric when comparing phrase break classifiers. She also recommends further significance testing to verify apparent gains in accuracy, and to explore conflicting results: in this case, accuracy versus BCR scores for the HMM tagger in Runs 2 and 6.
What is additionally interesting is that the HMM taggers in Runs 2 and 6 also represent a statistically significant gain in performance in terms of BCR even when set against Run 1. This claim is verified by applying McNemar's significance test to compare the performance of both classifiers (29). In this case, the focus for comparison is success in minority class recognition. For example, we assembled counts for concordant and discordant output predictions for the trigram and HMM taggers in Runs 5 and 6 in a 2 x 2 contingency table (Table 3). Here, the difference turns out to be significant: the two-tailed p-value is <0.000001, and the odds ratio is 20.98 with a 95% confidence interval⁴. Thus, while the HMM tagger over-predicts, it captures many more TPs and achieves a better average positive hit rate.

The Three-Class Problem
Table 4 records results for tripartite classification. Significant gains in both accuracy and BCR over baseline performance were achieved by the trigram tagger for the 3-class problem using both feature sets in Runs 9 and 13: 88.69% and 88.62% respectively. The HMM tagger also achieved significant gains in terms of BCR (Runs 10 and 14), and in one experiment (Run 16), where words were disabled as a feature, improved on baseline success rate, albeit at the expense of BCR.
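For readers unfamiliar with McNemar's test as applied to the paired tagger outputs in the two-class experiments, the calculation can be sketched as below. The sketch uses the continuity-corrected chi-square form with one degree of freedom, which is an assumption on our part about the variant, and the discordant counts in the example are invented; they are not the counts from Table 3.

```python
import math


def mcnemar(b, c):
    """McNemar's test on discordant pairs.

    b: items that only the first classifier labelled correctly
    c: items that only the second classifier labelled correctly
    Returns the continuity-corrected chi-square statistic and two-tailed p-value.
    """
    if b + c == 0:
        raise ValueError("no discordant pairs to compare")
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented discordant counts, for illustration only.
chi2, p_value = mcnemar(b=40, c=10)
print(round(chi2, 2), p_value < 0.001)
```

Only the discordant cells enter the statistic; items that both taggers get right, or both get wrong, carry no information about which tagger is better.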

Scheme Ratification on Modern Standard Arabic
We construe our boundary-annotated and PoS-tagged Qur'an as a 'gold standard' for supervised learning of the phrase break prediction task. The Qur'an is a rich dataset, despite its size, and has previously been used as an evaluative 'gold-standard' for machine learning (e.g. for Arabic morphological analysers in Morpho Challenge 2009)⁵. The general procedure is to train the classifier on a substantive sample of 'gold-standard' boundary-annotated text, and to hold out a smaller sample from the same source for testing. Although target boundary sites in the test set are available to the researcher for comparative evaluation, they are missing from test data presented to the classifier. Classifier accuracy therefore equates to the number of correct boundaries retrieved during test.

Delimiting Sentences in the MSA Corpus
Our MSA corpus replicates our Qur'an dataset classification of each word in terms of two levels of syntactic plus prosodic information. In the Qur'an dataset, "sentences" within longer paragraphs are readily identified via major breaks as sentence terminals, whereas for MSA text we segment on punctuation.
Working with MSA text is not straightforward. First, it is not fully vowelised, and restoring full vowelisation is an essential preliminary step to morphological analysis, PoS-tagging and parsing. In our "gold-standard" excerpt⁶ from the Corpus of Contemporary Arabic (30), full vowelisation has been restored automatically by the SALMA Tagger (31,24). Another problem is that sentences in Arabic can be very long, and punctuation is sparse at best. For this study, sentence segmentation was done manually. A longer-term goal is to develop reliable chunking algorithms for Arabic such that MSA text can be chunked automatically and extra intelligible and naturalistic boundaries inserted which meet with human approval.

Long-term Goals
Our over-arching research objectives are: (i) to determine whether Qur'anic Arabic speech rhythms still inform native speaker intuitions, and parsing and phrasing strategies, for Modern Standard Arabic; and (ii) to analyse and leverage prosodic-syntactic boundary correlates in the Qur'an for Arabic speech and language applications. This will eventually entail use of subjective human judgment to scrutinise output predictions from our best-performing tagger, which is first evaluated on the boundary-annotated Qur'an (6), and then tested on unseen 'gold standard' PoS-tagged MSA text⁷.
We take a novel approach to phrase break prediction for Arabic, deriving our prosodic annotation scheme from Tajwid (recitation) mark-up in the Qur'an. Our original insight is then to view the Tajwid system of chunk boundary delimiters, and other features extracted from the orthographic form (5,6,35), as additional sources of text-based data for computational analysis. Text analytics techniques honed on English (5) will then be used to discover significant linguistic patterns in the vicinity of these benchmark phrase break annotations, to be evaluated as classificatory features in machine learning experiments. The best-performing feature set will then be evaluated on and adapted for Modern Standard Arabic.

Conclusion
Our boundary-annotated Qur'an corpus is a unique, open-source dataset for Arabic phrase break prediction and for Arabic speech and language processing in general. Boundary annotations in this corpus differ from similar corpora for English in that they are proactive, not reactive, and provide detailed and corroborated guidance for the reader/speaker on optimal parsing and phrasing strategies for interpreting and conveying meaning. Thus, in the longer term, we are interested in the possibility of leveraging this received wisdom for Modern Standard Arabic language engineering applications. This will entail enriching the dataset with morpho-syntactic analyses via the SALMA tagger (32,31,24), and with symbolic prosodic information (5,33), prior to exploratory text data mining and feature extraction of prospective boundary correlates.
This paper constitutes initial work and compares the performance of sequence models for Qur'anic Arabic phrase break prediction. The trigram and HMM taggers in these experiments are prototypes, and use coarse-grained syntactic features only. Nevertheless, sharable experience and insights of interest to fellow corpus linguists are to be gained from the present implementation and evaluation. As with English (2,11,34), syntactic information proves a reliable feature, but what is especially interesting is that our highest accuracy scores have been achieved with a very coarse-grained feature set with a long-established history: the tripartite classification of Arabic words as {noun, verb, particle} in traditional Arabic grammar (cf. 9).
What also emerges is the vexed question of class imbalance, potentially compounded by the problem of sparse data: our Qur'an corpus is only 77430 words long, and it is one of a kind. The morphological complexity of Arabic increases the likelihood of data sparseness. We will ascertain whether data sparseness is affecting classification results and if so, how this can best be addressed as part of future work.
Another recommendation (cf. 5) for understanding as well as evaluating classifier performance in this task is to use a combination of performance metrics (not just accuracy) to determine how well the classifier has learnt the concept. Selective use of one or other metric, and inconsistency of metrics used across studies in phrase break prediction, is counter-productive, and prosodic-syntactic chunking is already inherently variable.
This is original research in that: (i) our goal is to derive chunking algorithms for Arabic speech and language applications from traditional prosodic mark-up in the Qur'an; and (ii) our underpinning question is whether Qur'anic Arabic speech rhythms still inform native speaker intuition and judgment when processing Modern Standard Arabic. This, along with our other recent publications (10,16,19), represents groundwork for a larger-scale project to produce annotation schemes, language resources, algorithms, and applications for Classical and Modern Standard Arabic.