Multi-task learning for speech attribute detection of children’s speech

Mostafa Shahin, Beena Ahmed, Julien Epps

Friday, December 16th, 2022, 11.15am – 11.45am


Speech attributes, including manner and place of articulation, provide a detailed description of sound production directly related to the speech articulators. Accurate modeling of such attributes enriches automatic pronunciation assessment applications by providing informative feedback to the user. In this paper we propose a DNN-based classification model structure to automatically detect the presence or absence of 25 specific attributes in English children's speech. In the models, we utilised multi-task learning (MTL) with frame-level phoneme classification as an auxiliary task and a discriminative additive (DA) task for highly confusable phonemes to improve attribute detection. We compared the performance of the 25 DNN-MTL and DNN-MTL-DA attribute detection models across all phonemes against baseline DNN models on two different children's speech corpora to determine the impact of the MTL and DA tasks. We also compared the performance of our DNN-MTL-DA models across different children's age groups. Our attribute models achieved detection accuracies ranging from ~80% to ~91%, with the best detection accuracy for the nasal and retroflex attributes and the worst for the tense and affricate attributes.
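The multi-task setup described above (a shared network with a main binary attribute-detection head and an auxiliary frame-level phoneme-classification head, trained with a weighted combined loss) can be sketched minimally as below. This is an illustrative sketch only, not the paper's implementation: the feature dimension, hidden size, auxiliary-task weight `lam`, and all weight initialisations are hypothetical placeholders, and the DA task is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 40-dim acoustic frames,
# one binary attribute target, and a phoneme inventory for the auxiliary task.
N_FRAMES, N_FEATS, HIDDEN, N_PHONES = 8, 40, 64, 40

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Shared trunk weights plus one head per task (MTL: both heads share the trunk).
W_shared = rng.normal(scale=0.1, size=(N_FEATS, HIDDEN))
W_attr = rng.normal(scale=0.1, size=(HIDDEN, 1))          # main attribute head
W_phone = rng.normal(scale=0.1, size=(HIDDEN, N_PHONES))  # auxiliary phoneme head

def mtl_forward(x):
    """Shared hidden layer feeding two task-specific output heads."""
    h = np.tanh(x @ W_shared)
    p_attr = sigmoid(h @ W_attr)    # P(attribute present) per frame
    p_phone = softmax(h @ W_phone)  # phoneme posterior per frame
    return p_attr, p_phone

def mtl_loss(p_attr, p_phone, y_attr, y_phone, lam=0.3):
    """Main binary cross-entropy plus lam-weighted auxiliary cross-entropy."""
    eps = 1e-9
    bce = -np.mean(y_attr * np.log(p_attr + eps)
                   + (1 - y_attr) * np.log(1 - p_attr + eps))
    ce = -np.mean(np.log(p_phone[np.arange(len(y_phone)), y_phone] + eps))
    return bce + lam * ce

# Random stand-in data; one binary attribute label and one phoneme label per frame.
x = rng.normal(size=(N_FRAMES, N_FEATS))
y_attr = rng.integers(0, 2, size=(N_FRAMES, 1))
y_phone = rng.integers(0, N_PHONES, size=N_FRAMES)
p_attr, p_phone = mtl_forward(x)
loss = mtl_loss(p_attr, p_phone, y_attr, y_phone)
```

The design point the sketch captures is that gradients from the auxiliary phoneme loss flow back through the shared trunk `W_shared`, regularising the representation used by the attribute head; in practice one such model would be trained per attribute, and `lam` tuned on held-out data.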