Language models (LMs) play a key role in modern automatic speech recognition systems by ensuring that the output respects the patterns of the language. In state-of-the-art systems, the language model is a combination of n-gram LMs [Bellegarda2004] and neural network LMs [Yu2015], because the two are complementary. These LMs are trained on a large corpus of varied texts, which yields average performance across all types of data. However, document content is generally heavily influenced by the domain, which can include topic, genre (documentary, news, etc.) and speaking style. It has been shown that adapting LMs to small amounts of matched in-domain text data provides significant improvements in both perplexity and word error rate.
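As an illustration of how two complementary LMs can be combined and scored, the sketch below linearly interpolates per-word probabilities from an n-gram LM and a neural network LM, then computes perplexity. Linear interpolation is only one common way of combining LMs and is assumed here. The function names, the weight `lam` and the toy probabilities are hypothetical and not taken from any cited system.

```python
import math


def interpolate(p_ngram, p_nn, lam=0.5):
    """Linearly interpolate per-word probabilities from two LMs.

    p_ngram, p_nn: probabilities each LM assigns to the same word sequence.
    lam: weight of the n-gram LM (hypothetical value, normally tuned on
    held-out data).
    """
    if len(p_ngram) != len(p_nn):
        raise ValueError("both LMs must score the same number of words")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * a + (1.0 - lam) * b for a, b in zip(p_ngram, p_nn)]


def perplexity(word_probs):
    """Perplexity = exp of the average negative log-probability per word."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# Toy per-word probabilities, invented purely for illustration.
p_ngram = [0.10, 0.02, 0.30, 0.05]
p_nn = [0.05, 0.08, 0.20, 0.10]
p_mix = interpolate(p_ngram, p_nn, lam=0.5)

print("n-gram perplexity:", perplexity(p_ngram))
print("neural perplexity:", perplexity(p_nn))
print("mixture perplexity:", perplexity(p_mix))
```

In practice the interpolation weight would be tuned on held-out data, and an adapted LM would be judged by how much it lowers perplexity on in-domain text.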
The objective of the internship is to adapt a neural network based LM to the domain of a spoken document without knowing the exact text of this document. For this, we will use the transcription produced by a first pass of automatic speech recognition run without the adapted LM. This work will be applied to the Multi-Genre Broadcast (MGB) Challenge [Bell2015], which is an evaluation of speech recognition systems using TV recordings in English or Arabic. The speech data covers 8 genres of broadcast TV: advice, children’s, comedy, competition, documentary, drama, events and news. This represents a challenging task for speech technology.
Neural network LM adaptation can be categorized as either feature-based [Watanabe2016] or model-based [Gangireddy2016]. In feature-based adaptation, the input of the neural network is augmented with auxiliary features, which model domain, topic information, etc. However, these auxiliary features must be seen during training, so this approach requires retraining the whole LM. Model-based adaptation consists of adding complementary layers and training these layers with domain-specific adaptation data [Deena2016]. An advantage of this method is that full retraining is not necessary. Another model-based adaptation method is fine-tuning: after the model has been trained on the whole training data, it is tuned on the target domain data. The downside of this approach is the lack of an optimization objective [Watanabe2016].
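The two families of adaptation can be contrasted with a minimal sketch. The first function augments the network input with a one-hot genre vector (feature-based). The second rescales the hidden activations of a frozen network with a small set of domain-specific parameters (model-based). The rescaling uses the form 2·sigmoid(r), in the spirit of learning hidden unit contributions; this is an assumption chosen for illustration and not necessarily the layer used in the cited works. All function names, vector sizes and values are hypothetical.

```python
import math

# The eight MGB genres listed in the task description.
GENRES = ["advice", "children's", "comedy", "competition",
          "documentary", "drama", "events", "news"]


def genre_one_hot(genre):
    """Auxiliary feature: one-hot encoding of the genre."""
    if genre not in GENRES:
        raise ValueError("unknown genre: " + genre)
    return [1.0 if g == genre else 0.0 for g in GENRES]


def augment_input(word_embedding, genre):
    """Feature-based adaptation: concatenate the auxiliary feature to the
    network input. The network must be trained with this extended input."""
    return list(word_embedding) + genre_one_hot(genre)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def scale_hidden(hidden, r):
    """Model-based adaptation (assumed form): rescale each hidden unit by
    2 * sigmoid(r_i). Only r is trained on the adaptation data; the base
    network weights stay frozen. With r = 0 the scaling is the identity."""
    if len(hidden) != len(r):
        raise ValueError("hidden and r must have the same length")
    return [2.0 * sigmoid(ri) * hi for hi, ri in zip(hidden, r)]


# Toy values, invented for illustration.
embedding = [0.1, -0.3, 0.7]
x = augment_input(embedding, "news")        # 3 + 8 = 11 input dimensions
h = [0.5, -1.0, 0.25]
h_unadapted = scale_hidden(h, [0.0, 0.0, 0.0])   # unchanged activations
h_adapted = scale_hidden(h, [1.0, -1.0, 0.0])    # domain-specific rescaling
```

The sketch makes the trade-off visible: the feature-based variant changes the input dimension and therefore requires training the network with the auxiliary feature, whereas the model-based variant only adds one parameter per hidden unit for each domain.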
During the internship, the student will carry out a bibliographic study on model adaptation approaches. Based on the pros and cons of these approaches, we will propose a method that is applicable to challenging datasets such as MGB. This method may involve changing the architecture of the neural network. The student will validate the proposed approach using the automatic radio broadcast transcription system developed in our team.
[Bell2015] P. Bell, M. Gales, T. Hain, J. Kilgour, P. Lanchantin, X. Liu, A. McParland, S. Renals, O. Saz, M. Wester, P. Woodland. The MGB Challenge: Evaluating Multi-Genre Broadcast Media Recognition, ASRU 2015.
[Bellegarda2004] J. Bellegarda. Statistical language model adaptation: review and perspectives, Speech Communication vol. 42, pp. 93–108, 2004.
[Deena2016] S. Deena, M. Hasan, M. Doulaty, O. Saz and T. Hain. Combining Feature and Model-Based Adaptation of RNNLMs for Multi-Genre Broadcast Speech Recognition, Interspeech 2016.
[Gangireddy2016] S. Gangireddy, P. Swietojanski, P. Bell and S. Renals. Unsupervised Adaptation of Recurrent Neural Network Language Models, Interspeech 2016.
[Swietojanski2014] P. Swietojanski, S. Renals. Learning Hidden Unit Contributions for Unsupervised Speaker Adaptation of Neural Network Acoustic Model, IEEE SLT Workshop 2014.
[Watanabe2016] Y. Watanabe, K. Hashimoto, Y. Tsuruoka. Domain Adaptation for Neural Networks by Parameter Augmentation, arxiv.org/pdf/1607.00410.pdf.
[Yu2015] D. Yu, L. Deng. Deep Neural Network. In Automatic Speech Recognition, pp. 57–77, 2015.