|Project Framework||Horizon 2020 Marie Skłodowska-Curie Actions Individual Fellowship|
|Project title||ProsoDeep : Deep understanding and modelling of the hierarchical structure of Prosody|
|Supervisor||Gérard Bailly, GIPSA-lab, Grenoble Institute of Technology, Grenoble, France|
|Researcher||Brаnislаv Gеrаzоv, FEEIT, Ss Cyril and Methodius University in Skopje, Macedonia|
|Project partners||Yi Xu, University College London, London, United Kingdom|
|Philip Garner, Idiap Research Institute, Martigny, Switzerland|
The prime objective of the ProsoDeep project is to gain a deeper understanding of the language of prosody through the analysis of all the levels in the production hierarchy of prosody. In particular, it will exploit the benefits of the top-down and bottom-up approaches through their incorporation within a deep prosody model (DPM).
The DPM will facilitate the advancement of speech technologies that rely both on the synthesis of prosody, e.g. text-to-speech (TTS) systems, and its analysis, e.g. speech recognition and speech emotion recognition (SER). To reach this objective the project will draw on a variety of scientific fields, including signal processing, physiology of prosody production and biomechanics, linguistics, and machine learning, and will also be augmented with respiratory measurements.
Prosody is a multidimensional phenomenon comprising intonation, energy, and rhythm. It is the carrier of both linguistic information, e.g. sentence structure, focus and contrast, lexical stress; as well as paralinguistic information, e.g. gender, age, personality, and emotions. In recent years, prosodic research has considerably enlarged the spectrum of properties and functions. In contrast, prosodic models able to map signals to functions or vice-versa are rare: comprehensive models of rhythm and intonation have difficulties coping with this expanding dimensionality, and machine learning techniques still have difficulties with offering structuring principles.
The importance of prosody in TTS systems has been the driving force in the development of prosody models, with a special focus on intonation. Most of these models follow a bottom-up approach, i.e. from signals to functions. A number of intonation models following this approach also incorporate physiological constrains. Only a few models seek to model prosody taking a top-down function that starts with the linguistic functions themselves.
The Superposition of Functional Contours (SFC) model is a top-down approach based on the decomposition of prosodic contours into functionally relevant elementary contours [sfc]. It proposes a generative mechanism for encoding socio-communicative functions, such as syntactic structure and attitudes, through the use of prosody. The SFC has been successfully used to model different linguistic levels, including: attitudes, dependency relations of word groups, word focus, tones in Mandarin, etc. It has been used for a number of languages including: French, Galician, German and Chinese. Recently, the SFC model has been extended into the visual prosody domain through modelling facial expressions and head and gaze motion [sfc-av]. An example decomposition of an intonation contour into its functional prototypes is shown in Fig. 1.
SFC tries to solve the many-to-many ill-posed problem of determining the shape of the function specific contours. A major limitation of the SFC is that it only extracts the average contour for each function prototype.
One representative bottom-up model with physiological constraints is the Generalized Command Response (GCR) model [gcr] that decomposes the intonation contour using \(f_0\) atoms that correspond to elementary muscle activations, as shown in Fig. 2. The parameters of the GCR model can be trained completely automatically using a matching pursuit algorithm with the perceptually relevant weighted correlation as a cost function. As it is based on the physiology of intonation production, the model is inherently speaker and language independent. This is experimentally confirmed for English, French and German. The GCR atoms were also shown to have linguistic significance when compared to ToBI events, and can be used to help extract and synthesise emphasis.
The implementation of the GCR parameter extraction algorithm called Weighted Correlation based Atom Decomposition (WCAD) is free software and is available on GitHub at: https://github.com/dipteam/wcad
The main problem with the bottom-up approach is that it is difficult to establish the linguistic significance of the extracted model parameters.
The DPM will seek to merge the strengths of these two opposing paradigms. Specifically, we will expand further on the methodology developed in the SFC while incorporating GCR’s physiological constraints.