Prosodic features of situational variation across nine speaking styles in French

This paper presents results from an ongoing study of prosodic and phonostylistic variation across speaking styles, i.e., acoustic images associated with types of language production, also called phonogenres. It extends previous work (1, 2) by enlarging the corpus (C-PhonoGenre, 8 hours) and by exploring a more comprehensive collection of genres. The situational parameters in (3, 4) are reduced to four situational features, each admitting three values, the combination of which differentiates sub-phonogenres. The main goal of this study is to establish correlations between the situational and prosodic features of discourse. Corpus processing, annotation and measure calculation are performed semi-automatically, through a set of tools implemented under Praat and manual steps. Rhythmical measurements by DurationAnalyser (5), combined with the output of ProsoReport (6), produce an acoustic analysis of the differences between phonogenres. A large number of micro- and macro-prosodic measures provide a fine-grained 'prosometric' description. This article presents the methodology for collecting the corpus, and results for the phonogenres.


Introduction
A speaking style does not depend only on the speaker's identity, such as his level of education or his place of origin. It also depends on the speaking situation that calls for a more or less determinate encoding of specific speech features, regarding the lexis, grammar and discourse structure, as well as phonetic and phonological features. The goal of the present study is to establish correlations between situational settings and prosodic features.
The general context for the present research is the growing interest in genre in language sciences (7, 8, 9), and more specifically in spoken language genres and situational variation (1, 2, 10-17). We use the term phonogenre for a spoken language genre, and we distinguish it from a phonostyle. The phonogenre is defined as a typified acoustic image associated with a situation and speech activity, whereas phonostyle refers to the features of a given speech sample within a phonogenre. The term speaking style, commonly used in this research field, is to be understood as embracing both phonogenre and phonostyle. By introducing the distinction between genre and style we underline that speech is necessarily produced within one phonogenre, but that each speaker has an individual phonostyle within each phonogenre.
Situational variation is approached from two points of view. On the one hand, situations are grouped according to an implicit typology, which still needs to be determined; on the other hand, typical prosodic features tend to characterise phonogenres (e.g., sport commentary, religious sermons) and to make them highly recognisable.
Speech samples are collected and grouped according to shared situational features, inspired by "situational invariants" (3) and "speech conceptional features" (4) that situate each sample within a continuum that ranges from "language of immediacy" to "language of distance". Four features of speech situations are considered, each one admitting three degrees (Table 1). The studied phonogenres are described according to their situational features in Table 3.
The degree semi- is introduced to reflect the complexity of certain speaking situations, where the discrete opposition between presence and absence of a feature is not sufficient because the discourse can either share the two conditions or stand in between. For example, the religious sermon would usually be considered non-media, but as it is also broadcast in the media (sermons on the Internet and televised church mass), we labelled it semi-media. As for the interactivity feature, we labelled parliamentary speech semi-interactive as it lies between a monologue and an interactive dialogue: one member of the parliament addresses a prepared question (monologue), and then the government minister answers the question. This is once more a monologue, but it is semi-prepared since the minister has to improvise from a prepared draft. Therefore, the label semi-interactive was assigned for two reasons: (a) this question-and-answer exchange represents a sort of dialogue and (b) the government minister has to respond to the reactions of the members of the opposition in the Parliament.
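The feature space described above can be sketched as a small data structure. This is only an illustration: the feature names (media, interactivity, preparation, audience) and the degree names are assumptions inferred from the running text, and only the two labels stated explicitly (semi-media for sermons, semi-interactive for parliamentary speech) are encoded.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Degree(Enum):
    """Three degrees of a situational feature (names are assumptions)."""
    NON = 0
    SEMI = 1
    FULL = 2


@dataclass(frozen=True)
class Situation:
    # Feature names are assumptions inferred from the running text.
    media: Degree
    interactivity: Degree
    preparation: Degree
    audience: Degree


# Every theoretically possible combination: 3 degrees ** 4 features = 81.
ALL_SITUATIONS = [Situation(*combo) for combo in product(Degree, repeat=4)]

# Only the labels stated explicitly in the text; everything else is left open.
KNOWN_LABELS = {
    "LIT": {"media": Degree.SEMI},          # religious sermon
    "ASS": {"interactivity": Degree.SEMI},  # parliamentary speech
}


def compatible(partial):
    """Return all situations that agree with a partial description."""
    return [s for s in ALL_SITUATIONS
            if all(getattr(s, name) is value for name, value in partial.items())]


print(len(ALL_SITUATIONS))                   # 81
print(len(compatible(KNOWN_LABELS["LIT"])))  # 27
```

Four features with three degrees each allow 81 combinations in principle, of which the corpus realises only a few.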
This research expands the set of previously studied phonogenres, as well as the corpus duration, both globally and per studied genre; it relies on the same improved semi-automatic speech annotation methodology as (1,2,6). It further joins rhythmical measurements (5) to ProsoReport (6).

Corpus collection and annotation
Previous work on a smaller multi-speaking style speech corpus (C-Prom, 17) has oriented this research in two ways: in pointing out (a) the need for more homogeneous phonogenres by constraining their situational features, and (b) the need to avoid idiosyncrasy by studying a larger number of speakers for each phonogenre. Therefore, our C-PhonoGenre corpus is composed of speaking styles whose situational features are more constrained, with ideally 10 speakers per style.

Corpus collection
The C-PhonoGenre corpus is composed of eight phonogenres that we describe here in detail, providing their metadata (source, year of recording, variety of French). The total duration of each phonogenre, as well as the number of recordings, is summarised in Table 2. Female and male speakers are represented equally in the corpus whenever achievable. However, LIT (religious sermons) and SPO (sports commentaries) consist exclusively of recordings of male speakers. Within the presidential New Year's wishes genre, there are two Swiss female speakers.
Data cover three different French-speaking areas: Metropolitan France, Belgium and Switzerland. Regional variation is not explored in this study; nevertheless, the information is present in the corpus and can be used for further study. In the discussion part, regional variation is taken into account as a partial explanation of the dispersion observed at the intra-genre and inter-speaker levels.
The average duration of the recordings is 4 min 40 s (minimum: 37 s; maximum: 13 min). 75% of the recordings have a duration between 3 and 5 minutes. The corpus has been gathered from various media sources: television (DID, LIT, MET, SPO, VXP), radio (DID, RPR) or Internet (LIT, ASS), whereas NAR and LEC have been home-recorded.
For this comparative study of different phonogenres, the existing C-PhonoGenre corpus was increased by 16 recordings of "reading" [LEC] from the C-Prom-PFC corpus (18).

Situational features
At a first level, genre identification is based on the type of speech activity. This allows for grouping under DID (educational speech) both media and non-media speech, as well as for considering that parliamentary speech is distinguished from presidential New Year's wishes, though they could have been united under a "political discourse" genre label. A description of the assumed phonogenres based on their four situational features is given in Table 3. The two sub-phonogenres of parliamentary speech [ASS], by contrast, are determined by the position of the speaker in the exchange: he is either the one who asks or the one who answers. Therefore, these distinctions are to be considered at the interactional and psychosociological level. The division of [SPO] into a basketball sub-phonogenre on the one hand, and a football and rugby sub-phonogenre on the other, was made because the latter two are interactive (several commentators), whereas in basketball there is only one speaker.

Segmentation
A manual orthographic transcription in Praat (19) and semi-automatic processing (EasyAlign, 20) result in a lexical, syllabic and phonemic segmentation of the corpus. Moreover, EasyAlign automatically detects pauses and provides a PSU (Pause-Separated Units) tier. This information is relevant for the study of discourse structure and of prosodic boundaries. In order to retrieve reliable results from instrumental and acoustic analysis, the segment boundaries of the whole corpus have been manually corrected at each level of segmentation. Of 129 566 intervals in the syllable tier, 117 502 (90.7%) are plain articulated syllables and 12 064 (9.3%) are pause intervals. More details about phone and word levels are presented in Table 4.
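As an illustration of what a pause-separated unit is, the sketch below groups a syllable tier into units delimited by pause intervals. It is a minimal sketch and not EasyAlign's implementation: the tuple representation of intervals and the pause label passed as `pause_label` are assumptions made for the example.

```python
from typing import List, Tuple

Interval = Tuple[float, float, str]  # (start in s, end in s, label)


def pause_separated_units(syllables: List[Interval],
                          pause_label: str = "_") -> List[Interval]:
    """Merge consecutive non-pause syllables into pause-separated units."""
    units: List[Interval] = []
    current: List[Interval] = []

    def flush() -> None:
        # Close the unit under construction, if any.
        if current:
            text = " ".join(label for _, _, label in current)
            units.append((current[0][0], current[-1][1], text))
            current.clear()

    for interval in syllables:
        if interval[2] == pause_label:
            flush()              # a pause closes the current unit
        else:
            current.append(interval)
    flush()                      # close the last unit at the end of the tier
    return units


tier = [(0.0, 0.25, "bO~"), (0.25, 0.5, "ZuR"), (0.5, 1.0, "_"),
        (1.0, 1.25, "sa"), (1.25, 1.5, "va")]
print(pause_separated_units(tier))
# [(0.0, 0.5, 'bO~ ZuR'), (1.0, 1.5, 'sa va')]
```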

Delivery
In addition to the above-mentioned tiers, an extra tier named delivery has been created by duplicating the syllable tier and annotating it manually with stylistic and phonological variations such as liaisons, elisions and hesitations, breath and mouth noises in pauses, and post-tonic schwas. In French, a post-tonic schwa is optional, in the sense that it may be omitted. This specificity makes it interesting for various levels of speech studies. For example, at a phonological level, a pronounced schwa adds an extra syllable to the word (cf. Figure 1, where schwa [@] is pronounced in the word "groupes" [gRu.p@]). The various symbols used for the delivery tier are grouped in Table 5. Among the 117 502 articulated syllables, 8 490 (7.2%) are tagged with one or more delivery symbols, representing 8 731 delivery symbols (as one syllable may have several tags). The 11 754 silences are all tagged with the main silence symbol _ and 5 028 (42.7%) are tagged with one or more silence-related delivery symbols (* o t).
The information contained in the delivery tier is used for describing a phonogenre as well as for characterising the personal speaking style of each speaker. Furthermore, it is an indicator of speech fluency or disfluency. It is also used in the detection of prosodic prominence for distinguishing between effective vowel lengthening and hesitation when taking into consideration the parameter of duration (cf. 2.3.5). A detailed annotation of silence characteristics can be used for detecting the prosodic boundaries. Finally, the annotation grouped under other symbols in Table 5 eliminates invalid syllables in the process of acoustic analyses (cf. 2.4).

Grammatical annotation
The lexical segmentation tier is doubled with a part-of-speech (POS) tier (named pos-min in Figure 1). Each word has been labelled automatically with its grammatical category by the tool DisMo (21). A simplified version of the tag-set is used to set apart lexical (content) words from functional (grammatical) words. The total number of words obtained by DisMo (82 025) is greater than the word count obtained during segmentation with EasyAlign (77 985), as DisMo correctly makes a lexical separation (in tier tok-min) for contracted forms, such as preposition-noun pairs (e.g., "d'opposition", illustrated in Figure 1).
Grammatical information is used subsequently for the automatic creation of three additional tiers. (i) In the lex tier, the symbol * indicates a lexical word. (ii) This information is used in the following processing steps to detect potentially stressed groups in the SG tier. The potentially stressed group ("groupe accentuable", or "stress group" as defined in 22) is considered to be the minimal potentially stressed unit in French and can contain more than one word (e.g., functional word(s) followed by a lexical word). The term "potentially stressed" refers to the fact that the automatic detection is based on a predictive model (22) that postulates that each final full syllable of a lexical word may, but does not have to, be stressed. (iii) The if tier indicates the different kinds of stress that can possibly take place within an SG: final stress, tagged f, is the most frequent in French; initial stress is tagged i if it is at the beginning of a lexical word, or 1 if it is at the beginning of an SG; if the penultimate syllable is stressed, it is tagged p in the if tier. Final syllables containing schwa are considered non-stressable. For that reason the syllable [p@] of [gRu.p@] is tagged @ in the if tier and the final stress f is assigned to [gRu] (cf. Figure 1).
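The grouping and final-stress tagging can be sketched as follows. This is a simplified illustration, not the predictive model of (22): it covers only the f and @ tags, it assumes that a lexical word closes a stress group, and it detects schwa syllables by a trailing "@" in a SAMPA-like transcription. The word representation is hypothetical.

```python
from typing import List, Tuple

Word = Tuple[List[str], bool]  # (syllables, is_lexical)


def stress_groups(words: List[Word]) -> List[List[Word]]:
    """Group function words with the lexical word that follows them."""
    groups: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        current.append(word)
        if word[1]:                 # a lexical word closes the group
            groups.append(current)
            current = []
    if current:                     # trailing function words (assumption:
        groups.append(current)      # kept as a group without final stress)
    return groups


def tag_final_stress(group: List[Word]) -> List[Tuple[str, str]]:
    """Tag the final full syllable of each lexical word with 'f'."""
    tagged: List[Tuple[str, str]] = []
    for syllables, is_lexical in group:
        tags = [""] * len(syllables)
        if is_lexical:
            last = len(syllables) - 1
            # A final schwa syllable is not stressable: tag it '@'
            # and move the potential stress one syllable to the left.
            if last > 0 and syllables[last].endswith("@"):
                tags[last] = "@"
                last -= 1
            tags[last] = "f"
        tagged.extend(zip(syllables, tags))
    return tagged


words = [(["le"], False), (["gRu", "p@"], True), (["paRl"], True)]
for group in stress_groups(words):
    print(tag_final_stress(group))
# [('le', ''), ('gRu', 'f'), ('p@', '@')]
# [('paRl', 'f')]
```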
In summary, the TextGrids have one tier at phone level, four tiers at syllabic level (syllable, if, prominence, delivery), four tiers at word level (words, tok-min, pos-min and lex), one SG tier and three tiers at pause-separated units level (phono, ortho and PSU) as illustrated in Figure 1.

Pitch
Praat's automatic pitch detection method resulted in frequent errors for data with particularly noisy environments, such as sport commentary or church mass, and for data containing many hesitations, such as spontaneous narration. This proved to be an obstacle for some acoustic measurements, and therefore the pitch tier was recalculated and then manually corrected within Praat. For the sake of homogeneity of the corpus, pitch was corrected for all data and not only for those where the recording was problematic.

Prominence
After segmentation and alignment of the speech signal, the corpus received prosodic annotation of prominence. Each syllable was assigned a score of acoustic prominence from 0 to 4 using ProsoProm (23). This tool is built upon findings on acoustic correlates of perceived prominence in French. These findings were first obtained with a protocol defining three degrees of prominence, applied to a manual annotation of the 70-minute corpus C-Prom (17), and are now based on another protocol defining five such degrees, applied to a manual annotation of an 18-minute corpus.
As mentioned under 2.3.2, the information from the delivery tier helps to improve the automatic detection of prominence by taking into account hesitation and the post-tonic syllabic schwa. Both phenomena were indicated as being problematic for prominence detection in (24). ProsoProm takes into account the pitch and duration of syllables relative to surrounding syllables, as well as pauses and pitch rises. For the entire corpus, 63.2% of the syllables were labelled [0] non-prominent, 11.9% scored [1], 5.8% scored [2], 5.5% scored [3] and 13.6% scored [4]. The automatic, semi-automatic and manual preliminary steps described under 2.3 are required for the acoustic analyses and further results.

Acoustic analysis and prosodic report
Three tools implemented as Praat scripts (Prosogram, ProsoReport and DurationAnalyser) are used for the acoustic and statistical treatment of the corpus.
Prosogram (25) is applied for pitch stylisation of the data. Its two-step algorithm first detects vocalic nuclei for each syllable based on voicing and intensity; and then the nucleus pitch curve is stylised into a static or dynamic tone based on a perceptual glissando approach (26).
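The static versus dynamic decision can be illustrated with a glissando threshold of the form G/T², where T is the duration of the pitch movement in seconds and the rate of change is expressed in semitones per second. The sketch below is not Prosogram's code: it treats the nucleus as a single movement between two pitch values, and the constant `g` is an assumed, adjustable parameter.

```python
import math


def interval_in_semitones(f_start_hz: float, f_end_hz: float) -> float:
    """Size of a pitch movement in semitones (12 semitones per octave)."""
    return 12.0 * math.log2(f_end_hz / f_start_hz)


def classify_tone(f_start_hz: float, f_end_hz: float,
                  duration_s: float, g: float = 0.32) -> str:
    """Return 'dynamic' if the pitch change exceeds the glissando threshold.

    The threshold g / duration**2 is in semitones per second; g is an
    assumed parameter, not a value taken from the paper.
    """
    rate = abs(interval_in_semitones(f_start_hz, f_end_hz)) / duration_s
    threshold = g / duration_s ** 2
    return "dynamic" if rate > threshold else "static"


print(classify_tone(200.0, 202.0, 0.1))  # static
print(classify_tone(200.0, 250.0, 0.2))  # dynamic
```

Short nuclei need a much steeper movement to count as dynamic, since the threshold grows with the inverse square of the duration.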
Taking the pitch stylisation from Prosogram as a starting point, ProsoReport (6) produces a detailed prosodic report that provides measures at local (phones, syllables, pauses) and global (Pause-Separated Units, PSU) levels, as well as measures and statistics for the entire recording (e.g., articulation rate/ratio, duration/pitch mean and deviation, pitch distribution). In some cases only relative measures (e.g., mean, rate, percentage) are considered, and in other cases only absolute measures (e.g., number of phones, total duration). This choice is useful when groups of recordings are compared and the size and number of recordings, as well as the speakers' individual properties, need to be ignored. Thanks to the previous automatic prominence detection (cf. 2.3.5), ProsoReport also computes the tonal and rhythmic distribution of prominent and non-prominent syllables (e.g., percentage of prominent syllables in various positions).
DurationAnalyser (5) computes exclusively temporal measures and rhythmical variability measures based on vocalic, consonantal and syllabic intervals.
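To give an idea of what interval-based rhythm measures look like, the sketch below computes three measures that are common in the rhythm literature: %V, the standard deviation of interval durations (Δ), and the normalised Pairwise Variability Index (nPVI). Whether DurationAnalyser computes exactly these measures is an assumption; the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def percent_v(vocalic: List[float], consonantal: List[float]) -> float:
    """Percentage of total duration taken up by vocalic intervals."""
    total = sum(vocalic) + sum(consonantal)
    return 100.0 * sum(vocalic) / total


def delta(durations: List[float]) -> float:
    """Standard deviation of interval durations (e.g. deltaV, deltaC)."""
    return pstdev(durations)


def npvi(durations: List[float]) -> float:
    """Normalised Pairwise Variability Index over successive intervals."""
    ratios = [abs(a - b) / ((a + b) / 2.0)
              for a, b in zip(durations, durations[1:])]
    return 100.0 * mean(ratios)


vowels = [0.08, 0.12, 0.10, 0.20]      # durations in seconds (toy data)
consonants = [0.06, 0.09, 0.07, 0.08]
print(round(percent_v(vowels, consonants), 1))
print(round(delta(consonants), 4))
print(round(npvi(vowels), 1))
```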

Results
Data measures are grouped either by phonogenre, by sub-phonogenre, or by each situational feature. They are divided into three parts.

Acoustics

Macroprosodic measures
The macroprosodic component refers to the speaker's choice of rhythm and intonation patterns.
Measures of the articulation ratio (proportion of articulated speech vs. silences) at phonogenre level show a global effect (F(12,91)=30.96; p<0.001) and set religious sermons [LIT], presidential wishes [VXP] and sport commentaries [SPO] apart from the others, thus corroborating their character announced under 3.1.
Articulation ratio at the sub-phonogenre level brings some new insights, illustrated in Figure 3. Ministers (or their delegates) who answer [ASS-R] during Question Time in parliamentary speech tend to occupy more speech time than deputies who formulate their question [ASS-Q]. This is probably because the answer provokes more or less loud reactions among deputies in the parliament. The speaker reacts in turn to this situation by reducing the number of pauses in order to maximise his use of the two minutes allocated to him.
Educational speech [DID] also presents differences at the sub-phonogenre level, mainly between radio and TV. There are two likely reasons: TV recordings are longer (10 minutes) than radio ones (3 minutes), and speech time on television must be shared with the visual flow, so that prosodic features are likely to differ when speech refers to an image (27). A lecture at a scientific conference [DID-cnf] delivered by university professors is naturally closer to [DID-Rad], which gathers answers from teachers and researchers, than to [DID-TV], whose context is clearly media-oriented, though instructional. The media framework can have an impact on the speech flow. For example, artificial silences are often introduced and the speech flow is cut and reorganised in order to create a documentary or reportage.
The two situations of religious sermons [LIT] are hardly distinguishable in terms of articulation ratio; nevertheless, the two sub-phonogenres differ more clearly in other prosodic measures, as discussed below.
[...] comes to speech rate (number of syllables per second, silences included). Post-hoc tests for the articulation rate (F(12,91)=9.33; p<0.001) significantly set religious sermons [LIT] and presidential wishes [VXP] apart from the other phonogenres (Figure 4). Weather forecast [MET] stands apart from the others as the phonogenre with the highest speech rate.
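The three temporal measures used in this section can be computed from a syllable tier as sketched below. The definitions of articulation ratio and speech rate follow the text; taking articulation rate as syllables per second of articulated speech (silences excluded) is the usual definition and is assumed here. The interval format and the pause label are assumptions, and this is not ProsoReport's code.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float, str]  # (start in s, end in s, label)


def temporal_measures(syllables: List[Interval],
                      pause_label: str = "_") -> Dict[str, float]:
    """Compute articulation ratio, speech rate and articulation rate."""
    speech_time = sum(end - start for start, end, label in syllables
                      if label != pause_label)
    pause_time = sum(end - start for start, end, label in syllables
                     if label == pause_label)
    n_syllables = sum(1 for _, _, label in syllables if label != pause_label)
    total_time = speech_time + pause_time
    return {
        # share of the total time that is articulated speech, in percent
        "articulation_ratio": 100.0 * speech_time / total_time,
        # syllables per second, silences included
        "speech_rate": n_syllables / total_time,
        # syllables per second, silences excluded (assumed definition)
        "articulation_rate": n_syllables / speech_time,
    }


tier = [(0.0, 0.25, "bO~"), (0.25, 0.5, "ZuR"), (0.5, 1.0, "_"),
        (1.0, 1.25, "sa"), (1.25, 1.5, "va")]
print(temporal_measures(tier))
```

With 1.0 s of speech, 0.5 s of pause and four syllables, the toy tier has an articulation rate of 4 syllables per second but a lower speech rate, because the pause counts in the denominator.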
Reading [LEC] shows the widest dispersion: this can be explained by the fact that the recordings are equally shared between older and younger generations, as well as between speakers from Paris and Lyon (east of France). Studies in regional prosody of French (28) report that the speech rate is faster in the former than in the latter. The dispersion of narration [NAR] data is smaller; nevertheless, it reflects the heterogeneity of story topics and of the speaking styles of storytellers. A similar dispersion is observed for presidential New Year's wishes [VXP] and reflects the geographic and diachronic differences described under 2.1.
Intonational properties indicate a lower relative pitch variation (standard deviation σ of pitch divided by mean x̄ of pitch, measured in semitones) for phonogenres with a larger audience (F(2,102)=10.5; p<0.001); this is surprising, as we hypothesised that public speaking would entail greater speaker involvement. However, this acoustic parameter varies according to our predictions across the media feature (F(2,102)=12.06; p<0.001).
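The relative pitch variation can be sketched as a coefficient of variation on a semitone scale. The reference frequency of 1 Hz for the conversion and the use of the population standard deviation are assumptions made for this example; the paper does not specify either, and the resulting value depends on the reference chosen.

```python
import math
from statistics import mean, pstdev
from typing import List


def hz_to_semitones(f_hz: float, ref_hz: float = 1.0) -> float:
    """Convert a frequency to semitones relative to ref_hz (assumed 1 Hz)."""
    return 12.0 * math.log2(f_hz / ref_hz)


def relative_pitch_variation(pitch_hz: List[float]) -> float:
    """Standard deviation of pitch divided by mean pitch, in semitones."""
    semitones = [hz_to_semitones(f) for f in pitch_hz]
    return pstdev(semitones) / mean(semitones)


flat = [150.0, 150.0, 150.0]          # monotone delivery (toy data)
lively = [100.0, 200.0]               # one octave apart (toy data)
print(relative_pitch_variation(flat))              # 0.0
print(round(relative_pitch_variation(lively), 3))  # 0.07
```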

Microprosodic measures
The study of initial and final positions of prominent syllables makes it possible to differentiate phonogenres according to their situational features. The percentage of prominent final syllables decreases as the phonogenre becomes more interactive (F(2,102)=10.43; p<0.001). This can be explained by a high score of hesitation in narration [NAR] and of vowel lengthening, typical for sport commentaries [SPO] (Figure 6, left).
The percentage of prominent initial syllables increases when a phonogenre falls within the broadcast media speaking style, where it is important to clearly distinguish discourse segments (Figure 6, right). The initial prominent syllables of a potentially stressed group (SG) show similar results (F(2,102)=5.88; p<0.001). The relative length of initial and final syllables of the potentially stressed group (SG) varies significantly across the preparation dimension (initial syllables F(2,102)=4.05; p<0.001; final syllables F(2,102)=5.42; p<0.001). Initial syllables of SGs tend to be shorter in prepared discourse than in non-prepared discourse, whereas final syllables become longer (Figure 7).
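The F statistics reported in this section have the form F(k-1, N-k). Assuming they come from a one-way analysis of variance (the paper does not name the test), a feature with three degrees measured on 105 recordings gives the degrees of freedom (2, 102). The sketch below shows how such an F value is obtained, using toy data.

```python
from statistics import mean
from typing import List, Tuple


def one_way_anova(groups: List[List[float]]) -> Tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Toy data: one measure for three degrees of a situational feature.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
print(one_way_anova(groups))  # (3.0, 2, 6)
```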

Principal Components Analysis
A Principal Components Analysis (PCA) was applied to investigate globally the differences between phonogenres and between situational features. This statistical technique builds optimal linear combinations of all the acoustic parameters. The resulting "principal components" (PCs) are dimensions of a normalised vector space but do not correspond to the original features. Instead, each principal component is a linear combination of various acoustic features, and each additional component "explains" a further part of the population's variation. In our case, the parameters of the two tools ProsoReport and DurationAnalyser were grouped to model phonogenre distinction in the PCA. The first two principal components explain 58% of the variation, while the first eight explain 90.5%. A discriminant analysis for automatic classification with those first eight PCs over nine phonogenres showed that 93% of recordings were identified correctly. The graphical distribution of phonogenres (represented by abbreviations in Figure 12) shows the projection of the selection of 105 recordings onto the first two principal components. It can be observed that parliamentary speech [ASS] and weather forecasts [MET] are the most compact phonogenres, probably because of the strict constraints of the situation. The dispersion of reading [LEC] and narration [NAR] is slightly larger and reflects the geographical and age differences among speakers. Educational speech [DID] and religious sermon [LIT] are even less compact: this is because of the differences in speech situation explained above. The same holds for radio press review [RPR], where the dispersion is probably due to one speaker with a particular speaking style who is represented three times in the corpus.
Finally, presidential wishes [VXP] present more than one particularity: (a) the grouping of French presidents into three chronological periods (1970s, 1980-1990s and 2000s); (b) the clear separation of the discourse of Swiss and French presidents, which shows the impact of the geographical dimension.
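The notion of "explained variation" can be illustrated on the smallest possible case: two standardised acoustic parameters. For two standardised variables with Pearson correlation r, the eigenvalues of the correlation matrix are 1 + |r| and 1 - |r|, so the first principal component explains (1 + |r|) / 2 of the total variance. This toy sketch is not the analysis run on the corpus, which combines many more parameters; the data are invented for the example.

```python
import math
from statistics import mean
from typing import List, Tuple


def pearson_r(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def explained_variance_2d(xs: List[float],
                          ys: List[float]) -> Tuple[float, float]:
    """Share of variance explained by PC1 and PC2 of two standardised variables."""
    r = abs(pearson_r(xs, ys))
    # Eigenvalues of [[1, r], [r, 1]] are 1 + r and 1 - r; their sum is 2.
    return (1.0 + r) / 2.0, (1.0 - r) / 2.0


correlated = explained_variance_2d([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
unrelated = explained_variance_2d([1.0, 2.0, 1.0, 2.0], [1.0, 1.0, 2.0, 2.0])
print(correlated)  # PC1 explains (almost exactly) all of the variance
print(unrelated)   # (0.5, 0.5)
```

When parameters are strongly correlated, a few components capture most of the variation, which is why a small number of PCs can summarise a large set of prosodic measures.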

Discussion
We have presented a large corpus consisting of a variety of nine phonogenres. Except for the sport commentary [SPO] and religious sermon [LIT] phonogenres, each of these is represented by at least ten speakers, with the idea that we study the phonogenre itself and abstract away from individual characteristics.
We have shown how acoustic measures give rise to groupings of phonogenres among themselves and according to situational features, and how they characterise phonogenres. This justifies a posteriori our choice of ten speakers per phonogenre, and represents progress with respect to the previous studies quoted in this paper. During this research it appeared that four situational features with three degrees each (cf. Table 3) are significant for the study of phonogenre. In this sense, reducing the much larger set of features and degrees established in earlier work is justified.
Our results show that phonogenres present evidence for groupings according to unforeseen, or hidden, situational features that are part of their prototypical image. For example, external time pressure, reflected in the duration of speech runs, is inherent in parliamentary speech [ASS], weather forecasts [MET] and radio press review [RPR]; ritual pressure on solemnity is characteristic for religious sermons [LIT] and presidential wishes [VXP] and is reflected in the bigger proportion of falling syllables.
The sub-phonogenre level was introduced to ensure solid definitions and to reduce the excessive heterogeneity of some phonogenres. The differences observed between questions and answers within the parliamentary speech [ASS] phonogenre suggest the relevance of the interactivity situational feature at the sub-phonogenre distinction level. They reveal a prosodic reflection of a discursive (not situational) category, namely being the initiating or the reactive member of an exchange (though not in direct interaction). This essentially shows that, when collecting a sub-phonogenre corpus, a general speaking style label should be avoided and that special attention should be given to the exact situation of each recording, i.e., to accurately defining its situational features.
This study provides a methodology for broad prosodic investigation of a large and varied corpus by using a semi-automatic set of procedures. Although some annotation steps remain manual, most of the procedure is automatic. As this framework was built in a very generic way, future work should propose a more targeted selection of prosodic measures, and test corpora of other languages. Two kinds of applications can be considered: verification of linguistic hypotheses and automatic phonogenre identification.
Finally, we should mention that the corpus will be made available to the community for research purposes.