PhD thesis offer: Deep Learning-based Video Face and Voice Digital Markers for Early Detection and Stratification of Parkinson’s Disease
at Telecom SudParis / Institut Polytechnique de Paris
19 place Marguerite Perey 91120 Palaiseau, France
- Prof. Mounîm A. El Yacoubi, SAMOVAR, Telecom SudParis, IMT, thesis supervisor (50%)
- Dr. Dijana Petrovska, SAMOVAR, Telecom SudParis, IMT, co-supervisor (50%)
- In collaboration with the medical research team: Jean-Christophe CORVOL, Marie VIDAILHET and Stéphane LEHERICY (Sorbonne University, Inserm, CNRS, Paris Brain Institute - ICM, Paris, France; APHP, Hôpital Pitié-Salpêtrière, Department of Neurology, Clinical Investigation Center for Neurosciences, Paris, France; APHP, Hôpital Pitié-Salpêtrière, Department of Neuroradiology, Paris, France)
- Mounîm A. El Yacoubi : email@example.com
- Dijana Petrovska : firstname.lastname@example.org
This thesis aims to develop robust deep-learning digital biomarkers for early-stage Parkinson’s disease detection based on face video and voice, by detecting alterations such as Hypomimia and Dysarthria. Our proposal goes beyond the current state of the art in several ways. First, we will perform feature representation learning directly from the raw face videos and voice signals instead of relying on handcrafted features. To tackle the lack of sufficient training data, we will investigate advanced adaptive adversarial transfer learning mechanisms from models trained on large voice and face classification corpora. A key innovation of our deep learning models is that they will be endowed with interpretability mechanisms, in order to make their decisions explainable to the stakeholders and to reveal which voice and face features are most discriminative of Parkinson’s disease. It is worth noting that our automatic features will be generated not only from cross-sectional data but also from longitudinal data, as our datasets consist of recordings acquired over long periods, a key competitive advantage over existing schemes. We will harness these longitudinal data to stratify patients into different groups, allowing the medical staff to design specific treatment and therapy for each group. Finally, our digital markers will be assessed against neuroimaging and clinical scores.
Early detection and stratification of Parkinson’s disease; audio-visual digital markers; Deep Learning-based Transfer Learning; Interpretability of Deep Neural Networks; Neuroimaging
1. Scientific context
Parkinson’s disease (PD) is the second most common neurodegenerative disease after Alzheimer’s. Its prevalence increases with age: 1% of people over the age of 60 and up to 4% of those over 80 are affected [DeLau-2006]. The disease causes motor disorders that worsen over time, due to a progressive loss of dopaminergic neurons in the substantia nigra, located in the midbrain. The standard diagnosis, mainly based on clinical examination, is usually made when at least two of the following three symptoms are observed: akinesia (slowness in initiating movement), rigidity, and tremor at rest. Unfortunately, these symptoms appear only once 50 to 60% of the dopaminergic neurons in the substantia nigra have been destroyed [Has-tu-2012]. Detecting PD at an early stage is therefore key for testing treatments before extensive, irreversible brain damage occurs, and for slowing down, or even stopping, its progression. Neuroimaging methods, such as Magnetic Resonance Imaging (MRI), are now used to detect PD. Such methods, however, are costly, have limited availability, and some involve exposure to radiation. Consequently, clinical evaluation remains the gold standard for PD. It is, therefore, important to develop innovative methods that are less costly, non-invasive, and can be carried out remotely.
In addition to hospital screening tests, digital biomarkers (DMs) from everyday sensors and devices have the potential to fundamentally change our understanding of PD. They allow quantitative and continuous monitoring of disease symptoms, including monitoring outside clinics and hospitals (telemedicine). Such DMs can be helpful for early PD detection. They also make it possible to monitor the response to treatments, hence opening the opportunity to adapt medication pathways quickly, if necessary.
2. Scientific Content
2.1. Main Objectives
The main aim of this thesis is to develop robust deep-learning digital biomarkers for early-stage Parkinson’s disease detection based on face video and voice alterations such as Hypomimia and Dysarthria. A special emphasis will be put on making our deep neural networks interpretable. A strong competitive advantage over the current state of the art is that we will develop digital biomarkers not only on cross-sectional voice and face data, but also on longitudinal data, as our datasets consist of recordings acquired over long periods. This will allow us not only to detect PD, but also to stratify PD patients into different groups, which will be key for the medical staff to design specific treatment and therapy for each group. Finally, our digital markers will be assessed against neuroimaging and clinical scores.
2.1. State of the art on digital markers for Parkinson’s disease (PD)
Voice-based biomarker: Voice-based digital markers for PD detection have been extensively investigated. These studies have observed disturbances called hypokinetic dysarthria, expressed by reduced prosody, irregularities in phonation and difficulties in articulation. Classification performance ranged from 65 to 99% for moderate to advanced stages of the disease. Most studies do not focus on early PD detection through voice, and rely on small datasets (around 40 subjects) [Orozco-2015], [Novotny-2014]. Recently, PD detection using telephone recordings has been carried out on early stages [Jeancolas-interspeech-2019]. Researchers have also recently applied deep learning models to PD detection. [Xu-2018] used Weighted-MFCC (Mel Frequency Cepstral Coefficients) voiceprint feature extraction and Deep Neural Networks for classification. [Gunduz-2019] proposed different CNN architectures to detect PD using vocal features, by combining several handcrafted feature sets. [Wodzinski-2019] created a modified ResNet architecture trained on spectrogram images. They achieved 91.7% cross-validation accuracy using only the frequency-based features from spectrograms. [Vasquez-2020] applied transfer learning based on fine-tuning to classify PD in three different languages: Spanish, German, and Czech. Mel-scale spectrograms are extracted and used to train a CNN for each language. Then, the trained models are used to fine-tune the training of a new model in the remaining two languages.
Face-based biomarker: [Makinen-2019] have shown, based on the Unified Parkinson's Disease Rating Scale, that patients with PD had upper extremity rigidity and reduced facial expression, and were significantly slowed in reaching a peak expression (i.e., bradykinesia). Although Hypomimia, the reduction in facial expressiveness, is a secondary sign of PD that is often present in its early stages [Jankovic-2008], it has barely been investigated in an automated way. There are, nonetheless, some works that quantify Hypomimia automatically. [Vinokurov-2015] developed machine learning tools to automatically detect and assess the severity of Hypomimia, using a 3D sensor for fairly accurate tracking of facial movements. To evaluate the predicted Hypomimia score, they computed its correlation with the scores provided by two neurologists. They reported reduced expressiveness in PD subjects for the facial movements that accompany speech and for emotional facial expressions. There were only 14 PD and 15 healthy control (HC) subjects in their study. [Bandini-2017] analyzed facial expressions through video-based automatic methods. 17 PD and 17 HC subjects were asked to perform basic facial expressions. Using an existing face tracker, the Euclidean distance of the facial model from a neutral baseline was computed to quantify changes in facial expressivity. An automatic facial expression recognizer was trained to study how PD expressions differed from the standard expressions. Results showed that HC subjects produced larger movements during both posed and imitated facial expressions. More recently, [Grammatikoupoulou-2019] developed PD detection tests based on the interaction of users with everyday technological devices, quantifying the progressive decrease in the variability of facial expressions in early PD patients by analyzing patterns emerging from photos (selfies). Promising results were presented on both a) a small development set of 36 subjects (23 confirmed PD and 13 HC subjects) and b) a large set of selfie photos obtained from 1292 users (with self-reported assessment). Their analysis was based on facial landmarks obtained using the Microsoft Face API.
2.2. Proposed methodology
This thesis is possible thanks to ICEBERG, a unique database with a large number of subjects, a precise medical assessment of PD and HC subjects, and multiple types of data available (DNA, clinical, behavioral, cognitive, sleep, DATscan and multimodal MRI data). ICEBERG comprises 221 French subjects: 121 recently diagnosed with idiopathic PD (less than 4 years beforehand) and 100 healthy controls. The temporal span of the acquired data per subject is 5 years. Telecom SudParis (TSP) has collaborated with CIC, ICM and APHP on the acquisition of the audio and facial video data. Our previous results on audio digital markers, from the thesis work of [Jeancolas-PhD-2019], [Jeancolas-interspeech-2018], have confirmed the ability of these markers to detect early PD subjects. To improve accuracy, we propose innovative research directions that go well beyond the current state of the art. First, we will perform feature representation learning directly from the raw face videos and voice signals instead of relying on handcrafted features, in order to detect face Hypomimia and voice Dysarthria. To tackle the lack of sufficient training data, we will investigate advanced adaptive adversarial transfer learning mechanisms from models trained on large voice and face classification corpora. A key innovation is that we will design new neural architectures that are interpretable, a highly desirable feature in e-health. It is worth noting that our automatic features will be generated not only from cross-sectional data but also from longitudinal data, in order to stratify PD patients into different groups, thereby allowing the medical staff to design specific treatment for each group. Finally, our digital markers will be assessed against neuroimaging and clinical scores. More precisely, we will address the following research challenges:
PD Detection from Voice Recording: Compared to our previous work, which discriminates PD patients by learning x-vector features from standard MFCC voice features through a TDNN network, we propose new research directions by considering two deep LSTM networks that learn from spectrograms and from raw audio waveforms, respectively. Our rationale is that spectrograms, and above all the raw voice signal, may contain characteristics that are discriminative of PD but that are not captured by MFCC. Most existing techniques are based only on standard voice features used for speech/speaker recognition. A spectrogram is a visual representation of the signal’s frequency spectrum that encodes how the frequencies change over time. Among the few works that have considered spectrograms, [Wodzinski-2019] used a neural network pretrained on the ImageNet dataset, which consists of natural images, and fine-tuned it on the spectrograms. This is questionable, as spectrograms are synthetic images whose pixel distributions differ from those of natural images. Our proposal of learning directly from the raw signal, inspired by the approach of [Ravanelli-2020] for speaker recognition, has never been investigated in the context of PD. It has the potential to extract fine local sound features, unlike spectrograms, for which a good frequency resolution requires long voice segments.
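To make the time-frequency trade-off mentioned above concrete, here is a minimal sketch, in pure Python, of a magnitude spectrogram computed frame by frame with a Hann window and a naive DFT. The function name `spectrogram`, the frame length, the hop size and the toy 1 kHz tone are illustrative assumptions, not the settings of the planned system; a real pipeline would rely on an FFT library.

```python
import cmath
import math


def spectrogram(signal, frame_len=64, hop=32):
    """Return a list of frames, each a list of DFT magnitudes (bins 0..frame_len//2)."""
    # Hann window: reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT, O(frame_len^2); kept only for readability.
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


# Toy signal (assumption): a 1 kHz tone sampled at 8 kHz, 256 samples long.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(256)]
spec = spectrogram(tone)

# Strongest frequency bin in the first frame, converted to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / 64
```

With 64-sample frames at 8 kHz, neighbouring bins are 125 Hz apart. Shorter frames would localize events better in time but widen that spacing, which is the constraint that motivates learning from the raw waveform.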
PD Detection from Facial Video Recordings: To detect Hypomimia from face video recordings, we will investigate three new approaches. The first builds on our long experience in analyzing faces for biometric purposes, validated within the recent audio-visual NIST challenge [Nist-av-challenge-2019], where our face recognizer was ranked among the top five systems. After using this system to detect the face and to extract its landmarks at each frame, we will estimate the motion amplitude variation of each landmark across the frames of the video, and design a machine learning classifier to discriminate typical facial expressions from the Parkinsonian ones associated with Hypomimia. The second approach will feed the vector sequence of landmarks in the video directly into a long short-term memory (LSTM) network, to learn how these landmarks evolve over time and to detect Hypomimia accordingly. Finally, the third approach will feed the raw face video recording directly into an LSTM-like network with the same objective, but relying on the face pixels, thus making it possible to detect Hypomimia based not only on the landmarks but also on other face locations. These approaches have not been proposed in the current state of the art.
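As an illustration of the first approach, the sketch below computes a simple per-landmark motion amplitude: the mean Euclidean displacement of each landmark from its position in the first frame. Treating the first frame as a neutral baseline, the function name `motion_amplitude` and the toy coordinates are assumptions made for this example; the actual feature definition remains to be designed in the thesis.

```python
import math


def motion_amplitude(frames):
    """Per-landmark mean displacement from the first (assumed neutral) frame.

    frames: list of frames; each frame is a list of (x, y) landmark positions,
    with landmarks given in the same order in every frame.
    """
    neutral = frames[0]
    totals = [0.0] * len(neutral)
    for frame in frames[1:]:
        for i, (x, y) in enumerate(frame):
            nx, ny = neutral[i]
            totals[i] += math.hypot(x - nx, y - ny)
    n_moving = max(len(frames) - 1, 1)  # avoid division by zero for 1-frame input
    return [total / n_moving for total in totals]


# Toy video (assumption): 3 frames, 2 landmarks.
# Landmark 0 never moves; landmark 1 moves away from its neutral position.
video = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 0.0), (1.0, 3.0)],
    [(0.0, 0.0), (4.0, 4.0)],
]
amplitudes = motion_amplitude(video)
```

A vector of such amplitudes, one value per landmark, could then serve as the input of a conventional classifier; reduced amplitudes would be the expected signature of Hypomimia.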
Combination of Voice and Face Video based digital markers: In contrast to existing work, we will detect voice and face digital markers simultaneously since, in our dataset, each face video is acquired while the subject is talking. The rationale is that Hypomimia is not present in every PD patient, and voice alterations are not systematic either. Their combination, therefore, is expected to significantly increase the likelihood of detecting early PD.
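One simple way to combine the two modalities is score-level (late) fusion, sketched below as a weighted average of the per-modality PD probabilities. The fusion level, the function name `fuse_scores`, the weight and the 0.5 decision threshold are assumptions for illustration; the text above does not prescribe a particular fusion scheme.

```python
def fuse_scores(voice_prob, face_prob, voice_weight=0.5):
    """Weighted average of two PD probabilities, each in [0, 1]."""
    if not 0.0 <= voice_weight <= 1.0:
        raise ValueError("voice_weight must lie in [0, 1]")
    return voice_weight * voice_prob + (1.0 - voice_weight) * face_prob


# Toy case (assumption): marked dysarthria but little hypomimia.
voice_prob = 0.875   # hypothetical output of the voice model
face_prob = 0.375    # hypothetical output of the face model
fused = fuse_scores(voice_prob, face_prob)

THRESHOLD = 0.5
detected_by_face_alone = face_prob >= THRESHOLD
detected_by_fusion = fused >= THRESHOLD
```

In this toy case the face model alone would miss the subject, while the fused score crosses the threshold. The weight would normally be tuned on validation data, and fusing learned features rather than scores is an equally plausible alternative.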
Transfer Learning: To tackle the lack of sufficient training data, we will investigate transfer learning mechanisms to transfer knowledge from models trained on large corpora. The targeted corpora will be the very large raw face video and audio datasets used for face/speaker recognition. It is worth mentioning that their raw nature will allow our neural networks to learn a much wider spectrum of automatic features than standard handcrafted features, and thus with a much larger potential to include features that are specific to PD detection. To this end, we will investigate novel adaptive adversarial transfer learning techniques, building on our previous work in the glycaemia prediction context [De Bois-CMPB2021].
Longitudinal Analysis for Stratification: Our deep features will be generated not only from cross-sectional data but also from longitudinal data, a key competitive advantage over existing schemes. We will harness these longitudinal data to stratify PD patients into different groups, allowing the medical staff to design specific treatment and therapy for each group. We have already investigated such a stratification for Alzheimer’s assessment based on online handwriting [El-Yacoubi-2019].
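The idea of stratifying patients from longitudinal data can be sketched as follows: summarize each subject's trajectory by the least-squares slope of a marker over the visits, then group the subjects by slope. The scalar marker, the yearly visits, the function names and the tiny two-cluster k-means are all assumptions for illustration; the thesis would work on learned deep features rather than a single hand-picked score.

```python
def slope(times, values):
    """Least-squares slope of values against times."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den


def two_means_1d(xs, iterations=20):
    """Tiny 1-D k-means with k=2, initialised at the minimum and maximum."""
    c0, c1 = min(xs), max(xs)
    labels = [0] * len(xs)
    for _ in range(iterations):
        labels = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
        group0 = [x for x, lab in zip(xs, labels) if lab == 0]
        group1 = [x for x, lab in zip(xs, labels) if lab == 1]
        if group0:
            c0 = sum(group0) / len(group0)
        if group1:
            c1 = sum(group1) / len(group1)
    return labels


# Toy data (assumption): a hypothetical marker measured at 5 yearly visits.
visits = [0, 1, 2, 3, 4]
trajectories = [
    [1.0, 1.0, 1.0, 1.0, 1.0],    # stable
    [0.0, 0.5, 1.0, 1.5, 2.0],    # slow progression
    [0.0, 3.0, 6.0, 9.0, 12.0],   # fast progression
    [0.0, 4.0, 8.0, 12.0, 16.0],  # fast progression
]
slopes = [slope(visits, trajectory) for trajectory in trajectories]
groups = two_means_1d(slopes)  # 0 = slower progressors, 1 = faster progressors
```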
Interpretability: A key innovation of our deep learning models is that they will be endowed with interpretability mechanisms, in order to make their decisions explainable to the stakeholders and to reveal which voice and face features are most discriminative of Parkinson’s disease. To do this, we will investigate new neural architectures for interpretability, building on our previous work in the health context [De Bois-IJPRAI2021].
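Attention is one family of interpretability mechanisms, and the one used in the cited prior work. The sketch below shows attention pooling over per-frame features: each frame receives a softmax weight, and the weights indicate which frames contribute most to the pooled representation. In a real network the scores would be learned; here they are fixed by hand, and the function name and toy values are assumptions.

```python
import math


def attention_pool(frame_features, frame_scores):
    """Softmax the scores; return (weights, weighted average of the features)."""
    top = max(frame_scores)
    exps = [math.exp(s - top) for s in frame_scores]  # shifted for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frame_features[0])
    pooled = [sum(w * feat[d] for w, feat in zip(weights, frame_features))
              for d in range(dim)]
    return weights, pooled


# Toy input (assumption): 3 frames with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 2.0]  # hypothetical relevance scores, one per frame
weights, pooled = attention_pool(features, scores)

# The frame with the largest weight is the one the model "attends" to most.
most_relevant_frame = max(range(len(weights)), key=lambda i: weights[i])
```

Inspecting `weights` for a given recording is what would allow a clinician to see which moments of the voice or video signal drove the decision.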
Correlation with neuroimaging data: Our voice and facial markers will be assessed against neuroimaging and clinical data made available by ICM. Such comparative assessments, never carried out before, are key to validating our digital markers for real-life aid-to-diagnosis usage.
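Such an assessment can be illustrated with two standard correlation coefficients between a digital marker and a clinical score: Pearson (linear association) and Spearman (monotonic association, computed here on ranks and assuming no ties). The function names and all values below are made up for the example; they are not ICEBERG data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Ranks starting at 1; assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical values for four subjects (not real data).
marker = [0.2, 0.4, 0.5, 0.9]           # digital marker output
clinical = [10.0, 18.0, 30.0, 65.0]     # clinical severity score
r_pearson = pearson(marker, clinical)
r_spearman = spearman(marker, clinical)
```

Because clinical rating scales are often ordinal, a rank-based coefficient such as Spearman may be the more appropriate choice; that preference is a suggestion, not a requirement stated above.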
2.3. International Collaboration
This thesis is proposed in the context of DIGIPD: “Parkinson’s Disease, digital biomarkers, telehealth, precision medicine, Artificial Intelligence”, a European project that investigates how digital techniques, such as the detection of voice alterations, the classification of Hypomimia in facial expressions, and the detection of gait changes, can be harnessed for a more precise and individualized diagnosis and prognosis of Parkinson's disease. In addition to Telecom SudParis and Paris Brain Institute, this project brings together Fraunhofer SCAI (Germany), the University of Luxembourg, the University of Namur (UN), and the University Hospital Erlangen (Germany). This collaboration will allow combining the digital biomarkers (voice, face and gait-based) with clinical, brain imaging and different types of molecular data.
[Bandini-2017] A. Bandini, S. Orlandi, H.J. Escalante, et al., Analysis of facial expressions in Parkinson's disease through video-based automatic methods, Journal of Neuroscience Methods, 2017.
[De Bois-CMPB2021] M. De Bois, M.A. El Yacoubi, M. Ammi, Adversarial multi-source transfer learning in healthcare: Application to glucose prediction for diabetic people, Computer Methods and Programs in Biomedicine, 2021.
[De Bois-IJPRAI2021] M. De Bois, M.A. El Yacoubi and M. Ammi, Enhancing the Interpretability of Deep Models in Healthcare Through Attention: Application to Glucose Forecasting for Diabetic People, ICPRAI 2020 and IJPRAI Journal, 2021.
[DeLau-2006] De Lau, L.M. and Breteler, M.M., Epidemiology of Parkinson’s disease. The Lancet Neurology, 5, 2006.
[El-Yacoubi-2019] M.A. El-Yacoubi, et al., “From aging to early-stage Alzheimer’s: Uncovering handwriting multimodal behaviors by semi-supervised learning and sequential representation learning,” Pattern Recognition, Vol. 86, 2019.
[Grammatikoupoulou-2019] A. Grammatikoupoulou, N.Grammalidis, S.Bostanjopoulou, Petra 2019.
[Gunduz-2019] H. Gunduz, Deep learning based Parkinson’s Disease classification using vocal feature sets, IEEE Access, 2019.
[Jankovic-2008] Jankovic, J., Parkinson’s disease: clinical features and diagnosis. Journal of Neurology, Neurosurgery & Psychiatry, 79(4), 2008.
[Jeancolas-interspeech-2019] Jeancolas, L., Petrovska-Delacrétaz, D., Benkelfat, B.-E., et al., Comparison of Telephone Recordings and Professional Microphone Recordings for Early Detection of Parkinson’s Disease, Using Mel-Frequency Cepstral Coefficients with Gaussian Mixture Models. Interspeech 2019.
[Jeancolas-PhD-2019] Laetitia Jeancolas. Détection précoce de la maladie de Parkinson par l’analyse de la voix et corrélations avec la neuroimagerie. Université Paris-Saclay, 2019.
[Jeancolas-2021] L. Jeancolas, D. Petrovska, et al., X-Vectors: New Quantitative Biomarkers for Early Parkinson's Disease Detection From Speech, Frontiers in Neuroinformatics, 2021.
[Makinen-2019] E. Mäkinen, et al. Individual parkinsonian motor signs and striatal dopamine transporter deficiency: a study with [I-123]FP-CIT SPECT. J Neurol 266, 2019.
[Nist-av-challenge-2019] https://www.nist.gov/publications/2019-nist-audio-visual-speaker-recognition-evaluation, 2019.
[Novotny-2014] M. Novotny, et al. Automatic Evaluation of Articulatory Disorders in Parkinson’s Disease. IEEE/ACM Trans. on Audio, Speech, and Language Processing, 2014.
[Orozco-2015] Orozco-Arroyave, et al., Voiced/unvoiced transitions in speech as a potential bio-marker to detect parkinson’s disease. INTERSPEECH, 95–99, 2015.
[Ravanelli-2020] M. Ravanelli and Y. Bengio, "Speaker Recognition from Raw Waveform with SincNet," IEEE Spoken Language Technology Workshop (SLT), 2018.
[Vasquez-2020] J.C. Vasquez-Correa et al., Convolutional Neural Networks and a Transfer Learning Strategy to Classify Parkinson’s Disease from Speech in Three Different Languages, arXiv:2002.04374, 2020.
[Vinokurov-2015] N Vinokurov et al., Quantifying hypomimia in Parkinson patients using a depth camera. Int. Symposium on Pervasive Computing Paradigms for Mental Health 2015.
[Wodzinski-2019] M. Wodzinski, et al., CNN Dedicated to Image Classification, 2019 EMBC.
[Xu-2018] Zhijing Xu, Juan Wang, Ying Zhang, Xiangjian He. (2018). Voiceprint recognition of Parkinson patients based on deep learning, arXiv:1812.06613.