Speech activity detection is the task of identifying the speech and non-speech segments in a continuous recording, and possibly their gross characteristics, such as silence, noise, music, clean speech or noisy speech. It is a key module in speech recognition and transcription systems.
Early approaches used a limited set of features such as frame energy and zero-crossing rate. More recently, machine-learning-based techniques have been applied. Among them, [Mesgarani et al., 2006] uses auditory model features and an SVM classifier. A multi-layer perceptron (MLP) approach is used in [Dines et al., 2006] for speech segmentation of meeting recordings. [Ng et al., 2012] compares a Gaussian mixture model (GMM) approach and an MLP approach; it also shows that combining both approaches improves performance. [Ryant et al., 2013] shows that using deep neural networks (DNN) leads to a much lower frame classification error rate than GMM-based models on YouTube data. In the 2015 Multi-Genre Broadcast (MGB) challenge, [Saz et al., 2015] and [Woodland et al., 2015] used DNN-based approaches for detecting speech segments in multi-genre broadcast shows (including advice, comedy, documentary, drama, events and news). The difficulty of the task came from the variety of noises involved and from the lightly supervised annotations (subtitles of the video shows).
Considering this brief overview, the classifier to be developed during the internship will be based on deep learning techniques. First experiments and an analysis of the speech material will be used to precisely define the most relevant classes to be handled by the classifier. Speech vs. non-speech classification is mandatory; however, identifying some environmental conditions can also be useful, as it allows the speech recognition process to use acoustic models that match those conditions.
Large amounts of data labelled with the expected classes are necessary for training the classifier. As it would be too time-consuming to manually annotate large amounts of data, we will consider using artificially corrupted data for training. Hence, relying on the annotations available for some rather clean corpora, additional corpora will be created by adding noise and/or music signals at various signal-to-noise ratios (SNR). One challenge is the choice of environmental noises and SNR levels when creating the training data set. A possible tradeoff is to optimally weight the training data, similarly to what was done in [Sivasankaran et al., submitted] for robust speech recognition. Another challenge is the choice of the decision threshold to be applied to the classifier output. The speech detection module is only one part of the whole transcription system, and the overall goal is to obtain the best possible transcription performance, so the threshold should ultimately be tuned with respect to that performance.
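The corruption step described above can be sketched as follows. This is only a minimal illustration, not the procedure that will actually be used: the function names (`power`, `mix_at_snr`, `measured_snr_db`), the looping of the noise signal to the speech length, the synthetic tone and white noise standing in for real recordings, and the list of SNR levels are all assumptions made for the example. Signals are plain Python lists of samples so that the sketch has no external dependency.

```python
import math
import random


def power(signal):
    """Mean power (average squared amplitude) of a list of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Return speech + scaled noise so that the mixture has the target SNR (dB).

    The noise is looped or truncated to the speech length (illustrative choice).
    """
    # Repeat the noise if it is shorter than the speech, then cut to length.
    repeats = -(-len(speech) // len(noise))  # ceiling division
    noise = (noise * repeats)[:len(speech)]

    p_speech = power(speech)
    p_noise = power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both have non-zero power")

    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """SNR of a mixture, recovered from the known clean speech."""
    residual = [m - s for m, s in zip(mixture, speech)]
    return 10 * math.log10(power(speech) / power(residual))


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-ins for real audio: a 440 Hz tone sampled at 16 kHz and white noise.
    speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]
    # Hypothetical SNR levels; choosing them is one of the open questions.
    for snr in (20, 10, 5, 0):
        mixture = mix_at_snr(speech, noise, snr)
        print(snr, round(measured_snr_db(speech, mixture), 2))
```

Because the noise is added sample by sample, the speech/non-speech annotations of the clean corpus remain aligned with each corrupted copy. Note that this sketch computes the SNR over the whole signal; in practice the speech power would more sensibly be estimated over the annotated speech segments only.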
[Dines et al., 2006] Dines, J., Vepa, J., & Hain, T. (2006). The segmentation of multi-channel meeting recordings for automatic speech recognition. In Proc. ICSLP’2006, Int. Conf. on Spoken Language Processing.
[Mesgarani et al., 2006] Mesgarani, N., Slaney, M., & Shamma, S. A. (2006). Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations. IEEE Transactions on Audio, Speech, and Language Processing, 14(3), 920-930.
[Ng et al., 2012] Ng, T., Zhang, B., Nguyen, L., Matsoukas, S., Zhou, X., Mesgarani, N., Vesely, K., & Matejka, P. (2012). Developing a Speech Activity Detection System for the DARPA RATS Program. In Proc. INTERSPEECH’2012, 13th Annual Conference of the International Speech Communication Association, Portland, Oregon, USA, pp. 1969-1972.
[Ryant et al., 2013] Ryant, N., Liberman, M., & Yuan, J. (2013). Speech activity detection on YouTube using deep neural networks. In Proc. INTERSPEECH’2013, 14th Annual Conference of the International Speech Communication Association, Lyon, France, pp. 728-731.
[Saz et al., 2015] Saz, O., Doulaty, M., Deena, S., Milner, R., Ng, R. W., Hasan, M., Liu, Y., & Hain, T. (2015). The 2015 Sheffield system for transcription of multi-genre broadcast media. In Proc. ASRU’2015, IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 624-631.
[Sivasankaran et al., submitted] Sivasankaran, S., Vincent, E., & Illina, I. (submitted). Discriminative importance weighting of augmented training data for acoustic model training. Submitted.
[Woodland et al., 2015] Woodland, P. C., Liu, X., Qian, Y., Zhang, C., Gales, M. J. F., Karanasou, P., Lanchantin, P., & Wang, L. (2015). Cambridge University transcription systems for the Multi-Genre Broadcast challenge. In Proc. ASRU’2015, IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 639-646.