Neurodevelopmental syndromes, a continuously growing issue, are impairments in the growth and development of the brain and CNS that manifest in a variety of emotional, cognitive, motor and social skills. Early assessment and detection of typical, clinically correlated early signs of developmental abnormalities are crucial for effective intervention: they support early initiation of treatment and help minimize neurological and functional deficits. Successful interventions can then be directed to the early time windows of higher neural plasticity. Various syndromes are reflected in early vocal and motor characteristics, making them suitable indicators of an infant’s neural development. The computerized classifiers we developed show approximately 90% accuracy on a database of diagnosed babies. The results demonstrate the potential of vocal and motor analysis for computer-assisted early detection of neurodevelopmental insults.
- brain development
- early neurodevelopment classifier
- brain injury
- computer-assisted diagnosis
- tracking algorithm
- ferns algorithm
- early motor & vocal expression
- premature babies
1.1. Brain development: Cortical Subplate
Neurodevelopmental syndromes, a continuously growing issue, are impairments in the growth and development of the brain and CNS that manifest in a variety of emotional, cognitive, motor and social skills.
Fetuses at 33–41 weeks’ gestational age recognize properties of their mothers’ voice and their native language [1, 2] suggesting that neural networks connecting circuits which process afferent sensory information (such as voice and pain), basic efferent vocal expressions, and high brain function (attention, memory, learning) are already being formed [3, 4, 5].
In early fetal life a transient structure develops in the subcortical future white matter, situated between the intermediate zone and the developing cortical plate: the Cortical Subplate. The Subplate is thickest at around 29 weeks’ postmenstrual age (PMA), and is absorbed gradually until around 4–6 months’ post-term, with relocation of fiber terminals into the cortex, in the order of “first to come, last to go” [7, 8, 9]. Most of its afferent and efferent connections run through the (future) periventricular white matter. The size of the Subplate and the duration of its visibility increase along mammalian evolution and peak in human fetuses, concomitantly with an increase in cortical fiber complexity. Hence it is considered a phylogenetically recent structure that expanded to enable the increasing complexity of cortical circuitry.
The Cortical Subplate serves as a relay station for the growing neural protrusions of the developing cortical neurons and as an integrating element that synchronizes neural network activity by distributing signals effectively in the developing cortical plate [11, 12]. As such, Subplate neurons are essential in creating the accurate wiring and functionality of the cerebral cortex: Subplate neurons create preliminary transient synapses between thalamic axons (carrying mainly sensory inputs) and their targets in the forming cortical layer 4.
Subplate neurons, with their multiple excitatory, inhibitory and neuromodulatory synaptic connections, are key elements affecting cortical development and maturation [14, 15, 16]. During the time window of their existence, the Subplate circuits are sensitive to hypoxic injury, which may have a long-lasting impact on brain development and lead to sensory, cognitive, social and motor deficits. The time window of high Subplate circuit activity is also the period in which infants born prematurely first cope with life outside the uterus.
The premature birth rate is around 10% of all deliveries on average; it is a continuously growing phenomenon (WHO, 2012), and prematurity is the second most common cause of death among infants worldwide (after pneumonia). Recent medical advances enable increasingly more premature infants to survive; however, many of them are at high risk for brain injury [18, 19, 20, 21] and neurodevelopmental impairments [22, 23]. For example, babies may develop Periventricular Leukomalacia, which may cause severe, long-term damage to brain tissue [24, 25, 26, 27]. Early assessment and detection of developmental impairments is crucial for effective intervention, as early identification supports early initiation of treatment and may minimize neurological and functional deficits. The babies participating in this study were born prematurely and took part in the research during the time window of Cortical Subplate activity.
1.2. Vocal expression
A preliminary model of the neural mechanism of baby vocalization was called “the brainstem model”, based on reports about vocalization of anencephalic human infants whose cerebral hemispheres were largely non-functional. Further studies with baby primates and mammal pups showed elevated activation in additional brain areas during early vocalization, including the periaqueductal grey, cerebellar vermis, thalamus, anterior cingulate gyrus, amygdala and the neural pathways connecting the components of the limbic system [29, 30, 31].
Early studies of infant cry were conducted with babies considered at risk for Developmental impairments, and babies with identified disorders [32, 33, 34]. More recent studies have described acoustic features (Formant, Pitch, etc.) of infant cry in babies at risk for poor developmental outcome due to perinatal risks or medical complications, such as hyperbilirubinemia, prenatal substance exposure, lead exposure, or evidence of brain damage [35, 36, 37, 38]. Thus, baby’s vocal characteristics are regarded as an indication of neurobehavioral insult.
Neonatal vagal tone is an indicator of balanced, mature function of the autonomic nervous system and has been shown to predict the infant’s neurobehavioral and cognitive development and social-emotional adaptation across infancy and up to 6 years of age [40, 41]. Neonatal vagal tone was also found to predict the degree of mother-infant synchrony at 3 months. Moreover, vagal tone is expressed in the baby’s vocalization and in infant oral neuromotor competence; the production of vocalization or cry is the result of air being forced through the vocal tract, over the larynx and through the vibrating vocal folds. Pitch frequency, the fundamental repetition rate of a sound waveform, is dictated by vocal cord vibration, vagal tone and autonomic regulation. Formants (the resonant frequencies of the upper airways) reflect breath, utterance and soft palate synchronization. Hence, vocal features may be indicators of an infant’s brain development.
Here we present a new approach to early developmental diagnosis of young infants, comprising computerized vocal analysis and a database of prematurely born babies diagnosed in a double-blind manner. The clinical diagnosis of the babies in the database is based on spontaneous movements and oral neuromotor competence.
1.3. Motor expression: spontaneous movements and oral neuromotor competence
Markers of neural development can be observed and quantified in the newborn and young infant using the quality of spontaneous general movements (employing the “General Movements assessment tool”, GM) [43, 44], and oral neuromotor competence (employing the “Neonatal Oral-Motor Assessment Scale”, NOMAS).
The assessment of the quality of spontaneous general movements is a sensitive, age-specific neurodevelopmental tool, which is particularly informative about the integrity of complex supraspinal neural circuits. Spontaneous General Movements are first identified early in fetal development; they are produced by deep brain circuits, before afferent stimuli exist. From about 34 weeks’ postmenstrual age, three consecutive age-specific characteristic forms of Spontaneous General Movements appear, reflecting maturation and reorganization of subcortical and cortical brain circuits [44, 47]. Spontaneous General Movements are composite movement patterns comprising head, trunk, arms and legs, with variable movements and trajectories, and they disappear when goal-directed movements appear at the age of 3–4 months’ post-term. Significant relationships between abnormal general movements at 1 and 3 months and cerebral white matter abnormalities on MRI in preterm infants have been demonstrated, supporting the concept that abnormal spontaneous general movements reflect developmental damage to cerebral white matter and Cortical Subplate neural circuits. Additionally, assessment of spontaneous general movements has predictive power for developmental outcome at school age, both for major disorders (cerebral palsy) and for minor developmental disorders [51, 52, 53].
Moreover, neural development, as well as later developmental outcome, is reflected by the newborn’s feeding patterns, as these require synchronization of the neural circuits that control integrated sucking-swallowing-breathing ability, as well as normal vagal tone [45, 54, 55, 56]. The sucking reflex develops at 16 weeks in utero, coordination of suck/swallow appears at 32–34 gestational weeks, and coordination of suck/swallow/breath appears at 37 gestational weeks or later [57, 58, 59]. Control of suck/swallow/breath coordination and rhythmicity requires the involvement of many brain circuits, including afferent and efferent fibers of cranial nerves (IV, V, VII, IX, X, XII), brain stem lower medulla nuclei (Ambiguus, Solitarius, Hypoglossus) participating in bulbar circuits, sensory supra-bulbar fibers and motor cortical circuits. Problematic feeding patterns occur frequently in babies born prematurely, and in cases of neonatal encephalopathy, chronic lung disease, intra-uterine drug exposure, abnormal somatosensory balance, structural abnormalities and pain. The NOMAS scale is used for assessing either breastfeeding or bottle-feeding (expressed human milk or formula) behaviors [45, 56, 61].
NOMAS and GM performance reflect integrity of the neural circuitries of the Cortical Subplate and together they may give complementary information for better prediction of developmental outcome [64, 65]. Hence both tools were employed here to clinically evaluate infants’ neural development and create the database for the computerized motor and vocal algorithms.
As described above, early detection of brain insults enables an early start of intervention, which may minimize neurological and functional deficits. However, clinical experts are seldom available in remote or poor populations. Previous works have shown that motion and vocalization can be used as diagnostic tools for health and development, using various methods to develop automated systems for characterizing pathologies [64, 65, 66, 67]. However, these methods require complex classifiers and long training times. Therefore, in this work, we introduce accessible, simple and cost-effective assisting tools for convenient early diagnosis of infants, requiring only recorded samples of infant vocalization and motor performance.
2. Computer-assisted early developmental assessment
2.1. Babies’ reference developmental status
Tracking of infants’ neurodevelopment was conducted after parents signed informed consent. Participating babies were recorded using Microsoft’s Kinect for Windows (formerly PrimeSense). Double-blind neuromotor diagnosis was conducted using the GM (General Movements) and NOMAS (Neonatal Oral-Motor Assessment Scale) tools. The tools were shown to be highly correlated [62, 63] (Figure 1).
GM diagnosis referred mainly to Complexity, Variability and Fluency of babies’ motor performance, and was classified as normal (2 grades: optimal, sub-optimal) or abnormal (2 grades: mildly, definitely).
NOMAS diagnosis referred mainly to functionality and synchronization of suck, swallow and breathing, and was classified as normal, disorganized, or dysfunctional.
2.2. Computerized analysis of babies’ motor expression
Young infants with brain injury have typical neuromotor performance, expressed in synchronized movements, relative dominance of upper or lower limbs, asymmetry, absent or abnormal fidgety general movements, and more. These higher order motor features have been shown to have clinical correlations.
2.2.1. Tracking algorithm: following a ‘Cloud of points’
Existing skeletal joint tracking tools [68, 69, 70] were originally developed for gaming consoles and for humans taller than one meter, and are not suitable for tracking babies’ joints. In addition, the surface on which the baby lies is too close to the baby for the algorithm to perform stably; the morphology of the baby consists of round silhouette contours rather than sharp angular joints; and the delicate, high/low-tone spontaneous movements performed by babies inherently make joint tracking unstable and inefficient. To solve these issues, a new tracking method was implemented, based first on volumes and then on a computed baby skeleton. First, the baby’s movement was estimated from the normalized volume occupied between the segmented body and the background, using the depth stream.
where vi(t) is the overall volume occupied by the baby in video i and time frame t.
The background of the baby was erased (using body segmentation), and spatial alignment and temporal synchronization of the sensor streams (RGB camera and depth sensor) were performed. Because of the filming angles, depth rectification was also implemented.
The tracking algorithm followed changes in the volume occupied between the baby’s segmented body and the background, using shape descriptors known as Hu central moments and the depth stream.
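As a rough illustration of the volume signal, the sketch below sums, over the segmented body pixels, the gap between the background surface and the body in each depth frame. The function names, the `pixel_area` parameter and, in particular, the normalization by the peak volume of the recording are assumptions made for this example; the normalization actually used in this work is not specified here.

```python
def occupied_volume(depth, background, mask, pixel_area=1.0):
    """Volume between the background surface and the body in one depth frame.

    depth, background: 2-D lists of distances from the sensor.
    mask: 2-D list of booleans, True where the pixel belongs to the baby.
    pixel_area: assumed constant footprint of one pixel (hypothetical).
    """
    total = 0.0
    for depth_row, back_row, mask_row in zip(depth, background, mask):
        for d, b, is_body in zip(depth_row, back_row, mask_row):
            if is_body:
                # The body is closer to the sensor than the background.
                total += max(b - d, 0.0) * pixel_area
    return total


def normalized_volume_series(frames, background, masks):
    """Per-frame volume divided by the peak volume (assumed normalization)."""
    raw = [occupied_volume(f, background, m) for f, m in zip(frames, masks)]
    peak = max(raw)
    return [v / peak for v in raw] if peak > 0 else raw


# Tiny 2x2 example with two frames.
background = [[1000.0, 1000.0], [1000.0, 1000.0]]
frames = [[[900.0, 1000.0], [1000.0, 1000.0]],
          [[800.0, 950.0], [1000.0, 1000.0]]]
masks = [[[True, False], [False, False]],
         [[True, True], [False, False]]]
volume = normalized_volume_series(frames, background, masks)
```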
2.2.2. Identification and selection of motor features
Feature extraction is conducted according to high order movement parameters:
Complexity and variation are calculated as spatial and temporal motor variability.
Fluency is calculated as one over the jerkiness (third time derivative) of the limbs’ trajectories.
Predictability is calculated according to predictive information which is the mutual information between the position at time t and position at time t + 1. In other words, the more predictable the trajectory, the higher the predictive information.
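The predictive information described above can be estimated from a sampled trajectory by discretizing positions and measuring the mutual information between consecutive samples. The sketch below is a minimal histogram-based estimator; the bin count and the function name are assumptions for illustration, not the estimator used in this work.

```python
import math
import random
from collections import Counter


def predictive_information(series, n_bins=8):
    """Mutual information (bits) between x(t) and x(t + 1), histogram estimate."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return 0.0  # a constant signal carries no information
    width = (hi - lo) / n_bins
    symbols = [min(int((x - lo) / width), n_bins - 1) for x in series]
    pairs = list(zip(symbols[:-1], symbols[1:]))
    n = len(pairs)
    joint = Counter(pairs)
    past = Counter(p for p, _ in pairs)
    future = Counter(f for _, f in pairs)
    info = 0.0
    for (p, f), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
        info += (count / n) * math.log2(count * n / (past[p] * future[f]))
    return info


# A smooth trajectory is more predictable than uniform noise.
smooth = [math.sin(2 * math.pi * t / 100) for t in range(500)]
rng = random.Random(1)
noisy = [rng.random() for _ in range(500)]
pi_smooth = predictive_information(smooth)
pi_noisy = predictive_information(noisy)
```

With these inputs the smooth trajectory should yield clearly more predictive information than the noise, which matches the statement that more predictable trajectories have higher predictive information.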
Several statistics were computed for each baby’s normalized volume V_i(t): its variance, its jerkiness (third time derivative), its wavelet coefficients, and the predictability of the normalized volume variation (first derivative of the volume), the latter using the Hurst exponent to distinguish random from non-random behavior.
Variance of the normalized volume:
$$\mathrm{Var}(V_i) = \frac{1}{T}\sum_{t=1}^{T}\bigl(V_i(t) - \mu(V_i)\bigr)^2$$
where μ(V_i) is the average of the normalized volume over time and T is the number of time frames. The variance of the normalized volume is negatively correlated with the developmental diagnosis.
Average absolute jerkiness of the normalized volume:
$$J_i = \frac{1}{T}\sum_{t=1}^{T}\left|\frac{d^{3}V_i(t)}{dt^{3}}\right|$$
The jerkiness is also negatively correlated with the developmental diagnosis.
Wavelet coefficients obtained with the continuous wavelet transform (CWT) of the signal. Given a signal x(t) and a basic wavelet ψ(t), the coefficient for a shifting parameter b and a scaling parameter a is:
$$W_x(a, b) = \frac{1}{\sqrt{a}}\int_{-\infty}^{\infty} x(t)\,\psi^{*}\!\left(\frac{t - b}{a}\right)dt$$
As we were interested in the high-frequency content of the signal, a symlet wavelet with a center frequency of 0.66 Hz was chosen, and the coefficients obtained with scales [9.0, 9.1, …, 10] were averaged over the entire time series.
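The variance and the average absolute jerkiness of a sampled volume signal can be sketched with finite differences, as below. The sampling interval `dt` and the function names are assumptions for illustration; the population variance (division by the number of samples) is also an assumed convention.

```python
def variance(values):
    """Population variance of a sampled signal."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def finite_difference(values, dt):
    """First derivative approximated by forward differences."""
    return [(b - a) / dt for a, b in zip(values[:-1], values[1:])]


def mean_abs_jerk(values, dt):
    """Average absolute third time derivative (jerkiness)."""
    jerk = finite_difference(
        finite_difference(finite_difference(values, dt), dt), dt)
    return sum(abs(j) for j in jerk) / len(jerk)


# For a cubic signal t**3 sampled at dt = 1, the third difference is constant.
cubic = [float(t ** 3) for t in range(6)]
jerk_of_cubic = mean_abs_jerk(cubic, dt=1.0)
```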
Predictability of the normalized volume variation (i.e. the first derivative of the volume). The Hurst exponent H is used to determine the predictability of a time series and to distinguish random from non-random systems. It is defined as:
$$\mathbb{E}\!\left[\frac{R(n)}{S(n)}\right] = C\,n^{H} \quad \text{as } n \to \infty$$
where
R(n) = max(x1, …, xn) − min(x1, …, xn) is the range of the first n values,
S(n) is the standard deviation of the first n values,
E[·] is the expected value,
n is the number of points in the time series,
C is a constant.
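One common way to estimate H in practice is rescaled-range (R/S) analysis: compute R/S on windows of several lengths n and fit the slope of log(R/S) against log(n). The sketch below uses the classical variant in which the range is taken over cumulative deviations from the window mean; this choice, the window sizes and the function names are assumptions for illustration and may differ from the procedure used in this work.

```python
import math
import random


def rescaled_range(window):
    """Classical R/S: range of cumulative mean-adjusted sums over the std."""
    n = len(window)
    mean = sum(window) / n
    cumulative, lowest, highest = 0.0, 0.0, 0.0
    for x in window:
        cumulative += x - mean
        lowest = min(lowest, cumulative)
        highest = max(highest, cumulative)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    return (highest - lowest) / std if std > 0 else None


def hurst_exponent(series, window_sizes=(8, 16, 32, 64, 128)):
    """Slope of log(mean R/S) versus log(n), fitted by least squares."""
    xs, ys = [], []
    for n in window_sizes:
        values = []
        for start in range(0, len(series) - n + 1, n):
            rs = rescaled_range(series[start:start + n])
            if rs is not None and rs > 0:
                values.append(rs)
        if values:
            xs.append(math.log(n))
            ys.append(math.log(sum(values) / len(values)))
    if len(xs) < 2:
        raise ValueError("series too short for the chosen window sizes")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope_num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope_den = sum((x - mean_x) ** 2 for x in xs)
    return slope_num / slope_den


# White noise should give a lower exponent than a strongly persistent signal.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(1024)]
walk, position = [], 0.0
for step in noise:
    position += step
    walk.append(position)
h_noise = hurst_exponent(noise)
h_walk = hurst_exponent(walk)
```

In the pipeline described here the input would be the first derivative of the normalized volume rather than synthetic noise.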
2.2.3. Developmental classification according to motor features
A machine learning algorithm was applied to analyze the clinically correlated features, using samples from the training series (clinically diagnosed by a specialist).
A linear estimator was trained using step-wise linear discriminant analysis (SWLDA) to estimate the developmental diagnosis from input features, using the following computed features for each video: variance of the normalized volume, average CWT coefficient of the normalized volume and Hurst exponent of the normalized volume variations (adding the average absolute jerkiness of the normalized volume degraded the performance). So for each video i, we have a vector of features:
$$\mathbf{x}_i = \bigl[\mathrm{Var}(V_i),\ \overline{W}_i,\ H_i\bigr]$$
where W̄_i is the average CWT coefficient of the normalized volume and H_i is the Hurst exponent of the normalized volume variations.
A linear decoder was trained as follows: for each combination of the 12 videos, 9 videos were included in the training set and 3 videos in the testing set, which led to 220 training runs, each with a different combination of training samples. This means that each video was tested 55 times using different training datasets. The following model was fitted on each run using step-wise linear discriminant analysis (SWLDA):
$$\hat{y}_i = \mathbf{w}^{\top}\mathbf{x}_i + b$$
where ŷ_i is the estimated developmental diagnosis for the feature vector x_i of video i, w are the weights of the predictive model and b is the bias term of the prediction.
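The run counts quoted above follow from simple counting: choosing 9 training videos out of 12 gives C(12, 9) = 220 splits, and a given video lands in the 3-video test set in C(11, 2) = 55 of them. The short enumeration below checks this; the integer video identifiers are placeholders.

```python
from collections import Counter
from itertools import combinations

videos = list(range(12))  # placeholder identifiers for the 12 videos

splits = []
for train in combinations(videos, 9):
    test = tuple(v for v in videos if v not in train)
    splits.append((train, test))

# How many times each video appears in a test set across all runs.
times_tested = Counter(v for _, test in splits for v in test)
```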
In order to evaluate the performance of diagnosis classification, baby videos were divided according to clinical assessment into normal neurodevelopment (grades 3, 4) and abnormal neurodevelopment (grades 1, 2). The prediction for the test videos of each run was plotted against the developmental diagnosis. To evaluate the performance of the computer classifier, a threshold was applied to the estimated diagnosis.
We define: true positive (TP): correct diagnosis of non-healthy baby; true negative (TN): correct diagnosis of healthy baby; false positive (FP): false alarm (type I error); false negative (FN): missed non-healthy baby (type II error).
We then have three different performance measures:
$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN},\qquad \text{sensitivity} = \frac{TP}{TP + FN},\qquad \text{specificity} = \frac{TN}{TN + FP}$$
The classification performance for different threshold values is shown in Figure 2. For a threshold value of 2.3, the classification performance is 87%, with a sensitivity of 73% and a specificity of 98%. This value is close to the clinical division between normal and abnormal neurodevelopment. The motor system architecture up to this stage is shown in Figure 3.
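Thresholding the estimated diagnosis and scoring the result can be sketched as follows. The example assumes that an estimate below the threshold is read as abnormal (grades 1 and 2 being the abnormal ones); the function name and the toy values are illustrative and are not data from this study.

```python
def classification_metrics(estimates, is_abnormal, threshold):
    """Accuracy, sensitivity and specificity for one threshold value.

    estimates: estimated developmental diagnosis per test video.
    is_abnormal: clinical reference, True for abnormal neurodevelopment.
    An estimate below the threshold is classified as abnormal (assumption).
    """
    tp = tn = fp = fn = 0
    for estimate, abnormal in zip(estimates, is_abnormal):
        predicted_abnormal = estimate < threshold
        if predicted_abnormal and abnormal:
            tp += 1
        elif predicted_abnormal and not abnormal:
            fp += 1  # false alarm
        elif abnormal:
            fn += 1  # missed non-healthy baby
        else:
            tn += 1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity


# Toy values only.
toy_estimates = [1.2, 2.1, 2.6, 3.4, 3.9, 2.2]
toy_reference = [True, True, True, False, False, False]
acc, sens, spec = classification_metrics(toy_estimates, toy_reference, 2.3)
```

Sweeping `threshold` over a range of values and recording the three measures gives a curve of the kind summarized in Figure 2.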
2.2.4. Tracking algorithm—from a ‘Cloud of points’ back to joints in a skeleton
Our aim is to develop specific fine-tuned tools for specific syndromes. Tracking a ‘cloud of points’ does not allow direct translation and fine tuning of clinical criteria related to specific body parts into mathematical phrases. Hence the first stage giving a result “normal” or “abnormal” neurodevelopment is important but not final.
The Kinect sensor uses a predefined skeleton model to return the joints’ positions based on the depth data. There is no option to inject new rules or to adjust the predefined joint distances, which are not suitable for infants’ dimensions, as there is no access to the internal functions that process and output the human skeleton. Hence we are unable to use the built-in skeleton as a reference for reconstructing the babies’ skeleton. Instead, we use the Random Ferns [73, 74, 75] and Random Forests algorithms.
Random Decision Forests combine binary depth comparison features to assign a body part label to each input depth pixel; the joint positions are then estimated from the detected body parts. However, training the forests is computationally very intensive and inefficient for large amounts of data. To simplify the training procedure and to adapt the system flexibly to different application requirements, we use Random Ferns, an efficient and robust alternative to random forests, to find the 3D positions of body joints in single depth images.
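To make the idea concrete, the sketch below shows a minimal random-fern pixel classifier: each fern is a fixed list of binary depth comparisons at random offsets around a pixel, the comparison bits form a leaf index, and class likelihoods per leaf are combined across ferns in a semi-naive Bayes fashion. All names, the offset range, the background value for out-of-image probes and the Laplace smoothing are assumptions for illustration, not the configuration used in this work.

```python
import math
import random

BACKGROUND = 1e6  # assumed depth for probes that fall outside the image


def probe(depth, row, col):
    if 0 <= row < len(depth) and 0 <= col < len(depth[0]):
        return depth[row][col]
    return BACKGROUND


class Fern:
    """One fern: a fixed list of binary depth comparisons."""

    def __init__(self, n_tests, max_offset, n_classes, rng):
        def offset():
            return (rng.randint(-max_offset, max_offset),
                    rng.randint(-max_offset, max_offset))
        self.tests = [(offset(), offset()) for _ in range(n_tests)]
        self.n_leaves = 2 ** n_tests
        # counts[label][leaf]: training pixels of that class reaching the leaf
        self.counts = [[0] * self.n_leaves for _ in range(n_classes)]

    def leaf(self, depth, row, col):
        index = 0
        for (dr1, dc1), (dr2, dc2) in self.tests:
            bit = (probe(depth, row + dr1, col + dc1)
                   < probe(depth, row + dr2, col + dc2))
            index = (index << 1) | int(bit)
        return index

    def train(self, depth, row, col, label):
        self.counts[label][self.leaf(depth, row, col)] += 1

    def log_likelihoods(self, depth, row, col):
        leaf = self.leaf(depth, row, col)
        # Laplace smoothing avoids log(0) for leaves never seen in training.
        return [math.log((per_class[leaf] + 1)
                         / (sum(per_class) + self.n_leaves))
                for per_class in self.counts]


def classify_pixel(ferns, depth, row, col):
    """Sum log-likelihoods over ferns (uniform class prior assumed)."""
    n_classes = len(ferns[0].counts)
    totals = [0.0] * n_classes
    for fern in ferns:
        for label, value in enumerate(fern.log_likelihoods(depth, row, col)):
            totals[label] += value
    return max(range(n_classes), key=totals.__getitem__)


# Toy usage: train on a single labeled pixel of a tiny depth image.
rng = random.Random(0)
ferns = [Fern(n_tests=4, max_offset=2, n_classes=2, rng=rng) for _ in range(5)]
toy_depth = [[1.0, 1.2, 2.0], [1.1, 1.3, 2.1], [1.0, 1.4, 2.2]]
for fern in ferns:
    fern.train(toy_depth, 1, 1, label=1)
predicted = classify_pixel(ferns, toy_depth, 1, 1)
```

Training a fern only increments counters, which is what makes it cheaper to train than a forest whose split tests must be optimized.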
We use Random Ferns to combine binary depth comparison features into a pixel-wise body part classifier. Each pixel of the body is allocated to a body part class, and the joints’ positions are deduced by calculating the mean of the cluster of each body part. A large amount of labeled training data was created using an open-source three-dimensional model of a baby’s body. Open-source software and the CMU motion capture dataset were used to animate the model in various postures. Depth images were then created from the model, and a body part label was allocated to each pixel. The virtual camera viewpoint used to generate our depth images is a frontal view of the virtual body, corresponding to the babies in our real data. The babies’ reconstructed skeletons are then classified by a machine learning algorithm using the existing and additional higher-order motor features, which directly reflect early signs of specific brain injuries.
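Once every body pixel carries a body part label, deducing a joint position as the mean of its cluster is a short computation. The sketch below assumes pixels have already been converted to 3D points; the label names are placeholders.

```python
from collections import defaultdict


def joints_from_labels(points, labels):
    """Mean 3D position of the points assigned to each body part label."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for (x, y, z), label in zip(points, labels):
        entry = sums[label]
        entry[0] += x
        entry[1] += y
        entry[2] += z
        entry[3] += 1
    return {label: (sx / n, sy / n, sz / n)
            for label, (sx, sy, sz, n) in sums.items()}


# Toy usage with placeholder labels.
toy_points = [(0.0, 0.0, 1.0), (2.0, 0.0, 1.0), (10.0, 4.0, 2.0)]
toy_labels = ["left_hand", "left_hand", "head"]
joints = joints_from_labels(toy_points, toy_labels)
```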
2.3. Computerized analysis of babies’ vocal expression
2.3.1. Identification and selection of vocal features
To process the vocal signal and extract distinguishing acoustic features that are correlated with brain development, baby recordings were divided into frames of 15 ms and segments of 0.3 s with 0.15 s overlap. In this way each segment contains 20 frames, which enables good detection of time-varying phenomena with minimal loss.
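The framing scheme can be sketched as below: 0.3 s segments that advance by 0.15 s, each cut into consecutive 15 ms frames, which gives 0.3 / 0.015 = 20 frames per segment. The sampling rate of 8000 Hz and the function name are assumptions for illustration.

```python
def split_segments(signal, fs, frame_s=0.015, segment_s=0.3, hop_s=0.15):
    """Split a signal into overlapping segments of non-overlapping frames."""
    frame_len = round(frame_s * fs)
    segment_len = round(segment_s * fs)
    hop = round(hop_s * fs)
    segments = []
    for start in range(0, len(signal) - segment_len + 1, hop):
        segment = signal[start:start + segment_len]
        frames = [segment[i:i + frame_len]
                  for i in range(0, segment_len - frame_len + 1, frame_len)]
        segments.append(frames)
    return segments


# One second of silence at an assumed sampling rate of 8000 Hz.
fs = 8000
segments = split_segments([0.0] * fs, fs)
```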
The most prominent vocal features extracted from frames were:
Pitch frequency: the frequency of the periodic vocal cord vibrations when the baby’s sound is produced. A typical pitch range in babies is 200–450 Hz, hence frames of 15 ms contain at least 3 pitch periods, the minimum needed for reliable pitch detection [81, 82, 83].
Formants: the dominant resonant frequencies of the air flowing through the oral and nasal cavities during the baby’s vocal expression. Typical formant frequencies of healthy babies are 1100 Hz (F1), 3300 Hz (F2) and 3500 Hz (F3). The first three formants were extracted.
Spectral centroid: an estimate of the spectral content of the voice frame, with a typical value of 1000 Hz for healthy babies. It was calculated for each frame using the FFT.
Dominant frequencies: the first, second and third quartile frequencies of each frame, i.e. the frequencies above which 25%, 50% and 75% of the vocal energy resides.
Mel-Frequency Cepstrum coefficients: extracted in order to better estimate the spectral envelope of the vocal signal, using a logarithmic scale for the frequency axis (in order to imitate human pitch perception).
Linear Predictive Coding coefficients: represent the spectral envelope of the signal using a linear predictive model. LPC coefficients are mostly used for speech compression and encoding, and are also plausible for infant vocal analysis. We extracted the first three LPC coefficients from each frame.
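Two of the frame-level features can be illustrated with a few lines of standard-library Python: a pitch estimate from the autocorrelation peak within the 200–450 Hz range, and a spectral centroid from a direct discrete Fourier transform. This is a sketch under assumptions (16 kHz sampling rate, rectangular window, magnitude-weighted centroid, hypothetical function names) and not the pitch detector or FFT code used in this work.

```python
import cmath
import math


def estimate_pitch(frame, fs, fmin=200.0, fmax=450.0):
    """Pitch from the lag that maximizes the normalized autocorrelation."""
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]
    best_lag, best_score = None, -math.inf
    for lag in range(int(fs / fmax), int(fs / fmin) + 1):
        if lag >= n:
            break
        score = sum(x[i] * x[i + lag] for i in range(n - lag)) / (n - lag)
        if score > best_score:
            best_lag, best_score = lag, score
    return fs / best_lag if best_lag else None


def spectral_centroid(frame, fs):
    """Magnitude-weighted mean frequency, computed with a direct DFT."""
    n = len(frame)
    weighted, total = 0.0, 0.0
    for k in range(n // 2 + 1):
        coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        magnitude = abs(coeff)
        weighted += (k * fs / n) * magnitude
        total += magnitude
    return weighted / total if total > 0 else 0.0


# Synthetic 15 ms frames at an assumed sampling rate of 16 kHz.
fs = 16000
n_samples = 240  # 15 ms at 16 kHz
cry_like = [math.sin(2 * math.pi * 300 * t / fs) for t in range(n_samples)]
tone = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(n_samples)]
pitch_hz = estimate_pitch(cry_like, fs)
centroid_hz = spectral_centroid(tone, fs)
```

The lag search is limited to periods between 1/450 s and 1/200 s, which is why a 15 ms frame, holding at least three periods at 200 Hz, is long enough for the estimate.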
The most prominent vocal features extracted from segments are changes in the pitch contour, which were shown to correlate with developmental disorders in infants.
Glide: defined as a steep rise or fall of the pitch contour of at least 600 Hz in 0.1 s.
Vibrato: defined as rapid falling and rising of the pitch contour. Vibrato was detected from sub-segment groups containing runs of more than 2 positive/negative pitch differences larger than 3 Hz. These were summed and normalized to define vibrato intensity.
Modes: describe a continuous temporal state in a vocal segment, in which the pitch contour is either in a certain range or cannot be clearly detected (as in an aperiodic signal). Two modes were extracted: phonation (pitch up to 750 Hz) and hyperphonation (pitch above 1000 Hz).
Melodies (falling, rising, flat): describe the general trend of the pitch contour as rising, falling or flat. Vocal expression containing a majority of one melody may reflect a developmental impairment [86, 87, 88, 89]. Cry melody was calculated using derivatives of the pitch contour, where a positive value reflects a rising trend in the pitch contour and a negative value reflects a falling trend, both of over 50 Hz. When the pitch contour derivative was smaller than 50 Hz, the melody was defined as flat.
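Given a pitch contour with one value per 15 ms frame, the glide and melody definitions above can be sketched as simple rules. The sketch simplifies the melody to the net pitch change across the segment and checks glides over a sliding 0.1 s window; both simplifications and the function names are assumptions, not the exact procedure used in this work.

```python
def melody(pitch_contour, flat_band_hz=50.0):
    """Classify the overall pitch trend of a segment (net change, assumed)."""
    change = pitch_contour[-1] - pitch_contour[0]
    if change > flat_band_hz:
        return "rising"
    if change < -flat_band_hz:
        return "falling"
    return "flat"


def has_glide(pitch_contour, frame_s=0.015, min_jump_hz=600.0, window_s=0.1):
    """True if pitch rises or falls by min_jump_hz within window_s."""
    max_gap = int(window_s / frame_s)  # frames that fit inside the window
    last = len(pitch_contour) - 1
    for i in range(len(pitch_contour)):
        for j in range(i + 1, min(i + max_gap, last) + 1):
            if abs(pitch_contour[j] - pitch_contour[i]) >= min_jump_hz:
                return True
    return False


# Toy contours in Hz, one value per 15 ms frame.
steep = [300.0, 300.0, 500.0, 950.0, 950.0]
gradual = [300.0 + 50.0 * i for i in range(20)]
```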
According to the ReliefF iterative feature selection algorithm, the most prominent features for correct categorization of training samples in the developmental classification are short-time energy, the third formant, the vibrato feature and the melodies (falling, rising, flat).
2.3.2. Developmental classification according to vocal features
After the babies’ reference developmental status (“data base”) had been quantitatively categorized as “normal” (44%) or “impaired” (56%) (Figure 1), a machine learning approach was employed with the babies’ vocal recordings. Computerized vocal classification was based on computer analysis of the vocal features extracted from the babies’ vocal expressions, against the clinical diagnosis of the premature infants (data base). The vocal system architecture is shown in Figure 4.
For the machine learning process, a k-NN algorithm was employed. In the training phase each frame was represented as a 22-dimensional feature vector, according to the ten vocal features described above. In the training set, vocal signals were divided into disjoint sets so that each piece of vocal recording would appear in only one set. System performance was evaluated using 5-fold balanced cross validation. The system correctly classified 89% of the babies. The percentage of babies’ vocal signals falsely classified as “impaired” while diagnosed “healthy” was approximately 9% (type 1 error, false positive, alpha), and the percentage of babies’ vocal signals falsely classified as “healthy” while diagnosed “impaired” was 2% (type 2 error, false negative, beta). The Infant Cry Analyzer algorithm at work is shown in Figure 5.
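A k-NN classifier evaluated with cross validation over disjoint recording groups can be sketched in a few lines. The example keeps all frames of one recording in the same fold, in the spirit of the disjoint sets described above; it uses toy 2-dimensional features instead of the 22-dimensional vectors, does not balance class proportions across folds, and its names and the value k = 3 are assumptions.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority label among the k nearest training vectors (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def grouped_cross_validation(features, labels, groups, n_folds=5, k=3):
    """Accuracy with folds built from whole groups (e.g. recordings)."""
    fold_of = {g: i % n_folds for i, g in enumerate(sorted(set(groups)))}
    correct = 0
    for fold in range(n_folds):
        train = [(f, y) for f, y, g in zip(features, labels, groups)
                 if fold_of[g] != fold]
        test = [(f, y) for f, y, g in zip(features, labels, groups)
                if fold_of[g] == fold]
        correct += sum(knn_predict(train, f, k) == y for f, y in test)
    return correct / len(features)


# Toy data: two well-separated clusters, two frames per recording.
toy_features = ([(i * 0.1, i * 0.1) for i in range(10)]
                + [(10 + i * 0.1, 10 + i * 0.1) for i in range(10)])
toy_labels = ["normal"] * 10 + ["impaired"] * 10
toy_groups = [i // 2 for i in range(20)]
accuracy = grouped_cross_validation(toy_features, toy_labels, toy_groups)
```

Grouping by recording matters because frames from the same recording are highly similar; letting them fall on both sides of a split would inflate the measured accuracy.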
The results verify the correlation between early motor and vocal features and infant neurodevelopment. The classifiers show approximately 90% accuracy on a database of diagnosed babies. The results demonstrate the potential of a computerized neuromotor and vocal analysis system as an assisting tool in early detection of developmental insults. We are currently characterizing and grouping specific features that are early signs of specific brain injuries. Ultimately this system could be widely used in remote clinics, leading to earlier diagnosis of developmental insults and earlier intervention.
We wish to thank the parents of the babies for their trust and cooperation; Chava Kasher (Sheba MC), Dr. Amir Kushnir (Poriah MC), Moran Moskovich (Ben-Gurion University), Dr. Guillaume Sikard, Amit Oren, Avi Matzliach, Rami Cohen (the Technion), Dr. Oren Forkosh, Dr. Goren Gordon (Weizmann Institute) and Marjory M. Palmer (NOMAS International) for their contributions; and the National Institute for Psychobiology in Israel for its generous support.