
Data selection for the training of deep neural networks in the framework of automatic speech recognition

Start: Spring 2017
To apply: send a CV, a letter of motivation, and your BSc/MSc transcripts to Irina Illina and Dominique Fohr

About 300 hours of multimedia content are uploaded to the Internet every minute, and spoken data make up a large share of it. The classical approach to retrieving content from spoken documents is automatic speech recognition followed by text retrieval. Automatic speech recognition relies on an acoustic model that infers phonetic labels from sounds. Currently, the best-performing acoustic models are based on deep neural networks. These models are usually trained on a large set of spoken documents for which the exact text transcript is available (supervised training).

The Multi-Genre Broadcast (MGB) Challenge [Bell2015] is an evaluation of speech recognition systems, using TV recordings in English or Arabic. The speech data covers 8 genres of broadcast TV: advice, children’s, comedy, competition, documentary, drama, events and news. This variety makes it a challenging task for speech technology. One problem with this dataset is that the exact transcription of the training data is not available. Only subtitles (with start and end times of appearance on the screen) are given, and they are sometimes far from what is actually said: some words are omitted, hesitations are rarely transcribed, and some sentences are reformulated. Standard supervised training therefore cannot be applied directly.

The goal of the internship is to develop data selection methods for obtaining high-performance acoustic models, that is, models that yield the lowest possible word error rate. If we use all the training data, the errors in the subtitles will lead to poor-quality acoustic models and therefore to a high word error rate. We propose to use a deep neural network (DNN) [Deng2013] to classify the segments into two categories: audio segments whose subtitles are accurate, and all other segments. The student will analyze what information, acoustic and/or linguistic, is relevant to this selection task and can be used as input to the DNN [Lanchantin2016]. The student will validate the proposed approaches using the automatic TV broadcast transcription system developed in our team.
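As a rough illustration of the quantities involved, the sketch below computes a word error rate with a word-level edit distance and uses it to filter segments. It is not the method proposed for the internship: a fixed threshold stands in for the DNN classifier, and the function names, the example segments, and the threshold value of 0.2 are all hypothetical. It also assumes that each subtitle can be compared with an automatic transcript of the same audio segment.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, deletions, insertions)
    divided by the number of words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def select_segments(segments, max_mismatch=0.2):
    """Keep the (subtitle, asr_output) pairs whose mismatch rate is at most
    max_mismatch. The threshold is an illustrative assumption; the internship
    proposes a DNN classifier for this decision instead."""
    return [(subtitle, asr_output)
            for subtitle, asr_output in segments
            if word_error_rate(subtitle, asr_output) <= max_mismatch]


# Hypothetical segments: (subtitle text, automatic transcript of the audio).
segments = [
    ("the weather is nice today", "the weather is nice today"),
    ("so we went to the market", "um so we we went down to the market"),
]
kept = select_segments(segments)
```

In this toy example the second subtitle omits a hesitation, a repetition, and one word, so its mismatch rate against the automatic transcript is high and the segment is discarded. A mismatch score of this kind could also serve as one linguistic input feature among others for a classifier.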

[Bell2015] P. Bell, M.J.F. Gales, T. Hain, J. Kilgour, P. Lanchantin, X. Liu, A. McParland, S. Renals, O. Saz, M. Wester, P.C. Woodland. The MGB Challenge: Evaluating Multi-Genre Broadcast Media Recognition, ASRU 2015.

[Deng2013] L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, J. Williams, Y. Gong, A. Acero. Recent Advances in Deep Learning for Speech Research at Microsoft, ICASSP 2013.

[Lanchantin2016] P. Lanchantin, M.J.F. Gales, P. Karanasou, X. Liu, Y. Qian, L. Wang, P.C. Woodland, C. Zhang. Selection of Multi-Genre Broadcast Data for the Training of Automatic Speech Recognition Systems, Interspeech 2016.

[Venkataraman2004] A. Venkataraman, A. Stolcke, W. Wang, D. Vergyri, V. Ramana R. Gadde, J. Zheng. An Efficient Repair Procedure For Quick Transcriptions, ICSLP 2004.

Inria Nancy - Grand Est
54600 Villers-lès-Nancy  
Education level: Bac +4, Bac +5, or Bac +8 (four, five, or eight years of higher education)

BSc in computer science, machine learning, or a related field, with an MSc or PhD in progress. Programming experience in Python.
A background in statistics or natural language processing, and experience with deep learning toolkits (Theano, TensorFlow, Keras, Chainer…) and Kaldi, are a plus.

Duration: 4 to 6 months