Speaker Recognition

Speaker recognition is a multidisciplinary technology which uses the vocal characteristics of speakers to deduce information about their identities. It is a branch of biometrics that may be used for identification, verification, and classification of individual speakers, with the capability of tracking, detection, and segmentation by extension. A speaker recognition system first tries to model the vocal tract characteristics of a person. This may be a mathematical model of the physiological system producing the human speech or simply a statistical model with output characteristics similar to those of the human vocal tract. Once a model is established and has been associated with an individual, new instances of speech may be assessed to determine the likelihood of their having been generated by the model of interest, in contrast with other observed models. This is the underlying methodology for all speaker recognition applications. The earliest known papers on speaker recognition were published in the 1950s (Pollack et al., 1954; Shearme & Holmes, 1959). Initial speaker recognition techniques relied on a human expert examining representations of the speech of an individual and deciding on the person's identity by comparing the characteristics in this representation with others. The most popular representation was the formant representation. In recent decades, fully automated speaker recognition systems have been developed and are in use (Beigi, 2011). A number of tutorials, surveys, and review papers have been published in recent years (Bimbot et al., 2004; Campbell, 1997; Furui, 2005). In a somewhat different approach, we have tried to present the material in the form of a comprehensive summary of the field, with an ample number of references for the avid reader to follow.
Most aspects of the field are covered, rather than only a list of the different algorithms and techniques used for handling part of the problem, as has been done before. As for the importance of speaker recognition, it is noteworthy that speaker identity is the only biometric which may be easily tested (identified or verified) remotely through the existing infrastructure, namely the telephone network. This makes speaker recognition quite valuable and unrivaled in many real-world applications. Needless to say, with the growing number of cellular (mobile) telephones and their ever-growing complexity, speaker recognition will become more popular in the future. There are countless applications for the different branches of speaker recognition. If audio is involved, one or more of the speaker recognition branches may be used. However, in terms of deployment, speaker recognition is still in its infancy. This is partly due to the unfamiliarity of the general public with the subject and its existence, and partly because of the limited development in the field. These include, but are certainly not limited to, financial,

activities, a text-dependent analysis was made, in order to simplify the task of identification. In 1959, not long after Pollack's analysis, Shearme, et al. [4] started comparing the formants of speech, in order to facilitate the identification process. However, a human expert would still do the analysis. This first incarnation of speaker recognition, namely using human expertise, has been used to date to handle forensic speaker identification [5,6]. This class of approaches has been improved and used in a variety of criminal and forensic analyses by legal experts [7,8]. Although it is always important to have a human expert available for important cases, such as those in forensic applications, the need for an automatic approach to speaker recognition was soon established. Pruzansky, et al. [9,10] started by looking at an automatic statistical comparison of speakers using a text-dependent approach. This was done by analyzing a population of 10 speakers uttering several unique words. However, it is well understood that, at least for speaker identification, a text-dependent analysis is not practical in the least [1]. Nevertheless, there are cases where a text-dependent analysis has some merit for the speaker verification problem. This is usually when computational resources are limited and/or obtaining speech samples longer than a couple of seconds is not feasible.
To date, the most prevalent modeling techniques are still the Gaussian mixture model (GMM) and support vector machine (SVM) approaches. Neural networks and other types of classifiers have also been used, although not in significant numbers. In the next two sections, we will briefly recap the GMM and SVM approaches. See Beigi [1] for a detailed treatment of these and other classifiers.

Gaussian Mixture Model (GMM) recognizers
In a GMM recognition engine, the models are the parameters for collections of multivariate normal density functions which describe the distribution of the features [1] for speakers' enrollment data. The best results have been shown on many occasions, and by many research projects, to have come from the use of Mel-Frequency Cepstral Coefficient (MFCC) features [1], although later we will review other features which may perform better for certain special cases. The Gaussian mixture model (GMM) is a model that expresses the probability density function of a random variable in terms of a weighted sum of its components, each of which is described by a Gaussian (normal) density function. In other words,
$$p(\mathbf{x}|\boldsymbol{\varphi}) = \sum_{\gamma=1}^{\Gamma} p(\mathbf{x}|\boldsymbol{\theta}_\gamma) P(\boldsymbol{\theta}_\gamma) \tag{1}$$
where the supervector of parameters, $\boldsymbol{\varphi}$, is defined as an augmented set of $\Gamma$ vectors constituting the free parameters associated with the $\Gamma$ mixture components, $\boldsymbol{\theta}_\gamma$, $\gamma \in \{1, 2, \cdots, \Gamma\}$, and the $\Gamma - 1$ mixture weights, $P(\boldsymbol{\theta} = \boldsymbol{\theta}_\gamma)$, $\gamma \in \{1, 2, \cdots, \Gamma - 1\}$, which are the prior probabilities of each of these mixture models known as the mixing distribution [11].
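The parameter vectors associated with each mixture component, in the case of the Gaussian mixture model, are the parameters of the normal density function,
$$\boldsymbol{\theta}_\gamma = \left[ \boldsymbol{\mu}_\gamma^T \;\; \mathbf{u}^T(\boldsymbol{\Sigma}_\gamma) \right]^T \tag{2}$$
where the unique parameters vector, $\mathbf{u}(\cdot)$, is an invertible transformation that stacks all the free parameters of a matrix into vector form. For example, if $\boldsymbol{\Sigma}_\gamma$ is a full covariance matrix, then $\mathbf{u}(\boldsymbol{\Sigma}_\gamma)$ is the vector of the elements in the upper triangle of $\boldsymbol{\Sigma}_\gamma$ including the diagonal elements. On the other hand, if $\boldsymbol{\Sigma}_\gamma$ is a diagonal matrix, then,
$$\left(\mathbf{u}(\boldsymbol{\Sigma}_\gamma)\right)_d = \left(\boldsymbol{\Sigma}_\gamma\right)_{dd} \quad \forall \, d \in \{1, 2, \cdots, D\} \tag{3}$$
where $D$ is the dimension of $\mathbf{x}$. Therefore, we may always reconstruct $\boldsymbol{\Sigma}_\gamma$ from $\mathbf{u}_\gamma = \mathbf{u}(\boldsymbol{\Sigma}_\gamma)$ using the inverse transformation,
$$\boldsymbol{\Sigma}_\gamma = \mathbf{u}^{-1}(\mathbf{u}_\gamma) \tag{4}$$
The parameter vector for the mixture model may be constructed as follows,
$$\boldsymbol{\varphi} = \left[ \boldsymbol{\mu}_1^T \cdots \boldsymbol{\mu}_\Gamma^T \;\; \mathbf{u}_1^T \cdots \mathbf{u}_\Gamma^T \;\; P(\boldsymbol{\theta}_1) \cdots P(\boldsymbol{\theta}_{\Gamma-1}) \right]^T \tag{5}$$
where only $(\Gamma - 1)$ mixture coefficients (prior probabilities), $P(\boldsymbol{\theta}_\gamma)$, are included in $\boldsymbol{\varphi}$, due to the constraint that
$$\sum_{\gamma=1}^{\Gamma} P(\boldsymbol{\theta}_\gamma) = 1 \tag{6}$$
Thus the number of free parameters in the prior probabilities is only $\Gamma - 1$.
New Trends and Developments in Biometrics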
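The stacking transformation and the count of free parameters described above can be illustrated with a short sketch. The function names (`u`, `u_inverse`, `num_free_parameters`) are hypothetical and chosen only for this illustration; the row-by-row ordering of the upper triangle is an assumption, since any fixed ordering gives an invertible transformation.

```python
def u(sigma):
    """Stack the upper triangle (including the diagonal) of a square matrix, row by row."""
    d = len(sigma)
    return [sigma[i][j] for i in range(d) for j in range(i, d)]


def u_inverse(vec, d):
    """Rebuild the symmetric d x d matrix from its stacked upper triangle."""
    sigma = [[0.0] * d for _ in range(d)]
    k = 0
    for i in range(d):
        for j in range(i, d):
            sigma[i][j] = sigma[j][i] = vec[k]
            k += 1
    return sigma


def num_free_parameters(num_mixtures, d, diagonal=False):
    """Count means, unique covariance entries, and the (num_mixtures - 1) free weights."""
    cov = d if diagonal else d * (d + 1) // 2
    return num_mixtures * (d + cov) + (num_mixtures - 1)
```

For instance, under these assumptions a model with 3 mixture components over 2-dimensional features has 3 × (2 + 3) + 2 = 17 free parameters with full covariances and 14 with diagonal covariances.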
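For a sequence of independent and identically distributed (i.i.d.) observations, $\{\mathbf{x}\}_1^N$, the log of likelihood of the sequence may be written as follows,
$$\ell(\boldsymbol{\varphi}|\{\mathbf{x}\}_1^N) = \sum_{n=1}^{N} \ln p(\mathbf{x}_n|\boldsymbol{\varphi}) \tag{7}$$
Assuming the mixture model, defined by Equation 1, the likelihood of the sequence, $\{\mathbf{x}\}_1^N$, may be written in terms of the mixture components,
$$\ell(\boldsymbol{\varphi}|\{\mathbf{x}\}_1^N) = \sum_{n=1}^{N} \ln \left( \sum_{\gamma=1}^{\Gamma} p(\mathbf{x}_n|\boldsymbol{\theta}_\gamma) P(\boldsymbol{\theta}_\gamma) \right) \tag{8}$$
Since maximizing Equation 8 requires the maximization of the logarithm of a sum, we can utilize the incomplete data approach that is used in the development of the EM algorithm to simplify the solution.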
Beigi [1] shows the derivation of the incomplete data equivalent of the maximization of Equation 8 using the EM algorithm.
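As an illustration of the log-likelihood of an i.i.d. sequence under a mixture model, the following minimal sketch evaluates it for a GMM. It assumes diagonal covariance matrices for brevity and uses the log-sum-exp trick for numerical stability; both choices, as well as the function names, are assumptions of this sketch and not part of the referenced derivation. It only scores a sequence against fixed parameters and does not perform EM training.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log of a multivariate normal density with a diagonal covariance matrix."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + quad)


def gmm_log_likelihood(frames, weights, means, variances):
    """Sum over frames of the log of the weighted sum of component densities."""
    total = 0.0
    for x in frames:
        terms = [
            math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        peak = max(terms)  # log-sum-exp: factor out the largest term
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total
```

In a speaker identification setting, one would typically compute this quantity for the test frames against each enrolled speaker's model and compare the results.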
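Each multivariate distribution is represented by Equation 9.
$$p(\mathbf{x}|\boldsymbol{\theta}_\gamma) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}_\gamma|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_\gamma)^T \boldsymbol{\Sigma}_\gamma^{-1} (\mathbf{x} - \boldsymbol{\mu}_\gamma) \right\} \tag{9}$$
In Equation 9, $D$ is the dimension of the feature vector $\mathbf{x}$ and $\boldsymbol{\mu}_\gamma$ is the mean vector for cluster $\gamma$ computed from the vectors in that cluster, where,
$$\boldsymbol{\mu}_\gamma = E\{\mathbf{x}\} = \int \mathbf{x} \, p(\mathbf{x}|\boldsymbol{\theta}_\gamma) \, d\mathbf{x} \tag{10}$$
The sample mean approximation for Equation 10 is,
$$\hat{\boldsymbol{\mu}}_\gamma = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \tag{11}$$
where $N$ is the number of samples and $\mathbf{x}_i$ are the MFCC vectors [1].
The covariance matrix is defined as,
$$\boldsymbol{\Sigma}_\gamma = E\left\{ (\mathbf{x} - E\{\mathbf{x}\})(\mathbf{x} - E\{\mathbf{x}\})^T \right\} \tag{12}$$
The diagonal elements of $\boldsymbol{\Sigma}_\gamma$ are the variances of the individual dimensions of $\mathbf{x}$. The off-diagonal elements are the covariances across the different dimensions.
The unbiased estimate of $\boldsymbol{\Sigma}_\gamma$, $\tilde{\boldsymbol{\Sigma}}_\gamma$, is given by the following,
$$\tilde{\boldsymbol{\Sigma}}_\gamma = \frac{1}{N-1} \left[ \mathbf{S}_{\gamma|N} - N \hat{\boldsymbol{\mu}}_\gamma \hat{\boldsymbol{\mu}}_\gamma^T \right] \tag{13}$$
where the sample mean, $\hat{\boldsymbol{\mu}}_\gamma$, is given by Equation 11 and the second order sum matrix (Scatter Matrix), $\mathbf{S}_{\gamma|N}$, is given by,
$$\mathbf{S}_{\gamma|N} = \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T \tag{14}$$
Therefore, in a general GMM model, the above statistical parameters are computed and stored for the set of Gaussians along with the corresponding mixture coefficients, to represent each speaker. The features used by the recognizer are Mel-Frequency Cepstral Coefficients (MFCC). Beigi [1] describes details of such a GMM-based recognizer.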
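The per-cluster statistics described above (sample mean, second order sum matrix, and unbiased covariance estimate) can be sketched as follows. This is a minimal illustration with hypothetical function names, assuming the feature vectors of one cluster are given as plain lists of floats and that the cluster contains at least two vectors.

```python
def sample_mean(vectors):
    """Sample mean of a list of equal-length feature vectors."""
    n, d = len(vectors), len(vectors[0])
    return [sum(v[k] for v in vectors) / n for k in range(d)]


def scatter_matrix(vectors):
    """Second order sum matrix: the sum of outer products x_i x_i^T."""
    d = len(vectors[0])
    return [[sum(v[i] * v[j] for v in vectors) for j in range(d)] for i in range(d)]


def unbiased_covariance(vectors):
    """Unbiased covariance estimate computed from the scatter matrix and the sample mean."""
    n = len(vectors)
    mu = sample_mean(vectors)
    s = scatter_matrix(vectors)
    d = len(mu)
    return [[(s[i][j] - n * mu[i] * mu[j]) / (n - 1) for j in range(d)] for i in range(d)]
```

Keeping the scatter matrix and the sum of vectors as running totals is convenient because both can be updated incrementally as new frames arrive, without revisiting earlier data.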

Support Vector Machine (SVM) recognizers
In general, SVMs are formulated as two-class classifiers. Γ-class classification problems are usually reduced to Γ two-class problems [12], where the γth two-class problem compares the γth class with the rest of the classes combined. There are also other generalizations of the SVM formulation which are geared toward handling Γ-class problems directly. Vapnik has proposed such formulations in Section 10.10 of his book [12]. He also credits M. Jaakkola and C. Watkins, et al. for having proposed similar generalizations independently. For such generalizations, the constrained optimization problem becomes much more complex. For this reason, the approximation using a set of Γ two-class problems has been preferred in the literature. It has the characteristic that if a data point is accepted by the decision function of more than one class, then it is deemed not classified. Furthermore, it is not classified if no decision function claims that data point to be in its class. This characteristic has both positive and negative connotations. It allows for better rejection of outliers, but then it may also be viewed as giving up on handling outliers.
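The decision behavior described above, where a data point remains unclassified unless exactly one of the Γ two-class decision functions accepts it, can be sketched as follows. The function name and the use of a zero threshold on the decision values are assumptions of this illustration; the decision values themselves would come from Γ trained two-class SVMs, which are not shown.

```python
def one_against_all_decision(decision_values, threshold=0.0):
    """Return the index of the single accepting class, or None if zero or several accept."""
    accepted = [i for i, value in enumerate(decision_values) if value > threshold]
    return accepted[0] if len(accepted) == 1 else None
```

With decision values `[-1.0, 0.5, -0.2]` only the second classifier accepts, so the point is assigned to class 1. With `[0.3, 0.5, -1.0]` two classifiers accept, and with `[-1.0, -1.0]` none does, so in both cases the point is left unclassified.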
In application to speaker recognition, experimental results have shown that SVM implementations of speaker recognition may perform similarly or sometimes even be slightly inferior to the less complex and less resource intensive GMM approaches. However, it has also been noted that systems which combine GMM and SVM approaches often enjoy a higher accuracy, suggesting that part of the information revealed by the two approaches may be complementary [13].
The problem of overtraining (overfitting) plagues many learning techniques, and it has been one of the driving factors for the development of support vector machines [1]. In the process of developing the concept of capacity and eventually SVM, Vapnik considered the generalization capacity of learning machines, especially neural networks. The main goal of support vector machines is to maximize the generalization capability of the learning algorithm, while keeping good performance on the training patterns. This is the basis for the Vapnik-Chervonenkis theory (VC theory) [12], which computes bounds on the risk, R(o), according to the definition of the VC dimension and the empirical risk (see Beigi [1]).
The multiclass classification problem is also quite important, since it is the basis for the speaker identification problem. In Section 10.10 of his book, Vapnik [12] proposed a simple approach in which one class is compared to all other classes combined, and this is repeated for each class. This approach converts a Γ-class problem to Γ two-class problems. This is the most popular approach for handling multi-class SVM and has been dubbed the one-against-all approach [1]. There is also the one-against-one approach, which transforms the problem into Γ(Γ − 1)/2 two-class SVM problems, one for each pair of classes. In Section 6.2.1 we will see more recent techniques for handling multi-class SVM.

Challenging audio
One of the most important challenges in speaker recognition stems from inconsistencies in the different types of audio and their quality. One such problem, which has been the focus of most research and publications in the field, is the problem of channel mismatch, in which the enrollment audio has been gathered using one apparatus and the test audio has been produced by a different channel. It is important to note that the sources of mismatch vary and are generally quite complicated. They could be any combination and usually are not limited to mismatch in the handset or recording apparatus, the network capacity and quality, noise conditions, illness related conditions, stress related conditions, transition between different media, etc. Some approaches involve normalization of some kind to either transform the data (raw or in the feature space) or to transform the model parameters. Chapter 18 of Beigi [1] discusses many different channel compensation techniques in order to resolve this issue. Vogt, et al. [14] provide a good coverage of methods for handling modeling mismatch.
Another such problem is to obtain ample coverage for the different types of phonation in the training and enrollment phases, in order to have better performance in situations where different phonation types are uttered. An example is the handling of whispered phonation which is, in general, very hard to collect and is not available under natural speech scenarios. Whisper is normally used by individuals who desire to have more privacy. This may happen under normal circumstances when the user is on a telephone and either does not want others to hear his/her conversation or does not wish to bother others in the vicinity, while interacting with the speaker recognition system. In Section 3.1, we will briefly review the different styles of phonation. Section 3.2 will then cover some work which has been done in order to handle whispered speech.
Another challenging issue with audio is to handle multiple speakers with possibly overlapping speech. The most difficult scenario would be the presence of multiple speakers on a single microphone, say a telephone handset, where each speaker is producing similar level of audio at the same time. This type of cross-talk is very hard to handle and indeed it is very difficult to identify the different speakers while they speak simultaneously. A somewhat simpler scenario is the one which generally happens in a conference setting, in a room, in which case, a far-field microphone (or microphone array) is capturing the audio. When multiple speakers speak in such a setting, there are some solutions which have worked out well in reducing the interference of other speakers, when focusing on the speech of a certain individual. In Section 3.4, we will review some work that has been done in this field.

Different styles of phonation
Phonation deals with the acoustic energy generated by the vocal folds at the larynx. The different kinds of phonation are unvoiced, voiced, and whisper.
Unvoiced phonation may be either in the form of nil phonation which corresponds to zero energy or breath phonation which is based on relaxed vocal folds passing a turbulent air stream.
The majority of voiced sounds are generated through normal voiced phonation, which happens when the vocal folds are vibrating at a periodic rate and generate certain resonance in the upper chamber of the vocal tract. Another category of voiced phonation is called laryngealization (creaky voice). It occurs when the arytenoid cartilages fix the posterior portion of the vocal folds, only allowing the anterior part of the vocal folds to vibrate. Yet another type of voiced phonation is falsetto, which is basically the unnatural creation of a high-pitched voice by tightening the basic shape of the vocal folds to achieve a false high pitch.
In another view, the emotional condition of the speaker may affect his/her phonation. For example, speech under stress may manifest different phonetic qualities than that of, so-called, neutral speech [15]. Whispered speech also changes the general condition of phonation. It is thought that this does not affect unvoiced consonants as much. In Sections 3.2 and 3.3 we will briefly look at whispered speech and speech under stressful conditions.

Treatment of whispered speech
Whispered phonation happens when the speaker acts as if generating a voiced phonation, with the exception that the vocal folds are made more relaxed so that a greater flow of air can pass through them, generating more of a turbulent airstream compared to a voiced resonance. However, the vocal folds are not relaxed enough to generate an unvoiced phonation.
As early as the first known paper on speaker identification [3], the challenges of whispered speech were apparent. The general text-independent analysis of speaker characteristics relies mainly on the normal voiced phonation as the primary source of speaker-dependent information [1]. This is due to the high-energy periodic signal which is generated with rich resonance information. Normally, very little natural whisper data is available for training. However, in some languages, such as Amerindian languages (e.g., Comanche [16] and Tlingit, spoken in Alaska) and some old languages, voiceless vocoids exist and carry independent meaning from their voiced counterparts [1].
An example of a whispered phone in English is the egressive pulmonic whisper [1] which is the sound that an [h] makes in the word, "home." However, any utterance may be produced by relaxing the vocal folds and generating a whispered version of the utterance. This partial relaxation of the vocal folds can significantly change the vocal characteristics of the speaker. Without ample data in whisper mode, it would be hard to identify the speaker.
Pollack, et al. [3] say that we need about three times as many speech samples for whispered speech in order to obtain an equivalent accuracy to that of normal speech. This assessment was made according to a comparison done using human listeners and identical speech content, as well as an attempted equivalence in the recording volume levels.
Jin, et al. [17] deal with the insufficient amount of whisper data by creating two GMM models for each individual, assuming that ample data is available for the normal-speech mode for any target speaker. Then, in the test phase, they use the frame-based score competition (FSC) method, comparing each frame of audio to the two models for every speaker (normal and whispered) and only using the result for that frame, from the model which produces the higher score. Otherwise, they continue with the standard process of recognition.
Jin, et al. [17] conducted experiments on whispered speech when almost no whisper data was available for the enrollment phase. The experiments showed that noise greatly impacts recognition with whispered speech. They also concentrate on using a throat microphone, which happens to be more robust in terms of noise, but it also picks up more resonance for whispered speech. In general, using the two-model approach with FSC, Jin, et al. [17] show a significant reduction in the error rate.
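The frame-based score competition described above can be sketched as follows. This is a minimal illustration, not the implementation of Jin, et al. [17]: it assumes that per-frame log-likelihood scores against the normal and whispered models of each speaker have already been computed, and that the per-frame winners are simply summed. The function names and the summation are assumptions of this sketch.

```python
def fsc_score(normal_scores, whisper_scores):
    """For each frame keep the higher of the two model scores, then sum over frames."""
    return sum(max(a, b) for a, b in zip(normal_scores, whisper_scores))


def identify_speaker(per_speaker_scores):
    """per_speaker_scores maps a speaker name to (normal_scores, whisper_scores)."""
    return max(per_speaker_scores, key=lambda name: fsc_score(*per_speaker_scores[name]))
```

For example, with frame scores `[-1.0, -5.0, -2.0]` from the normal model and `[-3.0, -1.0, -4.0]` from the whispered model, the competition keeps -1.0, -1.0, and -2.0, for a total of -4.0.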
Fan, et al. [18] have looked into the differences between whisper and neutral speech. By neutral speech, they mean normal speech which is recorded in a modal (voiced) speech setting in a quiet recording studio. They use the fact that the unvoiced consonants are quite similar in the two types of speech and that most of the differences stem from the remaining phones. Using this, they separate whispered speech into two parts. The first part includes all the unvoiced consonants, and the second part includes the rest of the phones. Furthermore, they show better performance for unvoiced consonants in the whispered speech when using linear frequency cepstral coefficients (LFCC) and exponential frequency cepstral coefficients (EFCC) (see Section 4.3). In contrast, the rest of the phones show better performance with MFCC features. Therefore, they detect unvoiced consonants and treat them using LFCC/EFCC features. They send the rest of the phones (e.g., voiced consonants, vowels, diphthongs, triphthongs, glides, liquids) through an MFCC-based system. Then they combine the scores from the two segments to make a speaker recognition decision.
The unvoiced consonant detection proposed by Fan, et al. [18] uses two measures for determining the frames stemming from unvoiced consonants. For each frame, $l$, the energy of the frame in the lower part of the spectrum, $E_l^{(l)}$, and that of the higher part of the band, $E_l^{(h)}$, (for $f \le 4000$ Hz and $4000 \text{ Hz} < f \le 8000$ Hz respectively) are computed, along with the total energy of the frame, $E_l$, to be used for normalization. The relative energy of the lower frequency is then computed for each frame by Equation 15.
$$R_l = \frac{E_l^{(l)}}{E_l} \tag{15}$$
It is assumed that most of the spectral energy of unvoiced consonants is concentrated in the higher half of the frequency spectrum, compared to the rest of the phones. In addition, the Jeffreys' divergence [1] of the higher portion of the spectrum relative to the previous frame is computed using Equation 16.
$$D_J(l \leftrightarrow l-1) = \sum_{k} \left( P_{l-1}(k) - P_l(k) \right) \log \frac{P_{l-1}(k)}{P_l(k)} \tag{16}$$
where $P_l(k)$ is the power spectrum of frame $l$ at frequency index $k$ in the higher band, normalized to sum to one over that band. Two separate thresholds may be set for $R_l$ and $D_J(l \leftrightarrow l-1)$, in order to detect unvoiced consonants from the rest of the phones.
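The two detection measures can be sketched as follows. This is a hedged illustration rather than the detector of Fan, et al. [18]: it assumes that each frame is given as a list of power spectrum values, that the index `split` separates the lower band from the higher band, that the relative energy is the low-band energy divided by the total energy, and that the Jeffreys divergence takes its standard symmetric form over the normalized high-band spectra. The function names and the small `eps` guard against zero bins are assumptions of this sketch.

```python
import math


def relative_low_energy(power, split):
    """Fraction of the frame energy that lies in the bins below `split`."""
    total = sum(power)
    return sum(power[:split]) / total if total > 0 else 0.0


def normalize(band):
    """Scale a band of power values so that it sums to one."""
    total = sum(band)
    if total <= 0:
        return [1.0 / len(band)] * len(band)
    return [b / total for b in band]


def jeffreys_divergence(p, q, eps=1e-12):
    """Symmetric divergence: sum of (p - q) * log(p / q) over all bins."""
    return sum((pi - qi) * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def frame_measures(power, prev_power, split):
    """Return (relative low-band energy, high-band divergence from the previous frame)."""
    r = relative_low_energy(power, split)
    d = jeffreys_divergence(normalize(prev_power[split:]), normalize(power[split:]))
    return r, d
```

The two returned values would then be compared against separately tuned thresholds. The threshold values and the direction of the divergence test are not specified here and would need to be determined empirically.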

Speech under stress
As noted earlier, the phonation undergoes certain changes when the speaker is under stressful conditions. Bou-Ghazale, et al. [15] have shown that this may affect the significance of certain frequency bands, making MFCC features miss certain nuances in the speech of the individual under stress. They propose a new frequency scale which they call the exponential-logarithmic (expo-log) scale. In Section 4.3 we will describe this scale in more detail since it is also used by Fan, et al. [18] to handle the unvoiced consonants. On another note, although research has generally shown that cepstral coefficients derived from the FFT are more robust for the handling of neutral speech [19], Bou-Ghazale, et al. [15] suggest that for speech recorded under stressful conditions, cepstral coefficients derived from the linear predictive model [1] perform better.

Multiple sources of speech and far-field audio capture
This problem has been addressed in the presence of microphone arrays, to handle cases when sources are semi-stationary in a room, say in a conference environment. The main goal would amount to extracting the source(s) of interest from a set of many sources of audio and reducing the interference from other sources in the process [20]. For instance, Kumatani, et al. [21] address the problem using the so-called beamforming technique [20,22] for two speakers speaking simultaneously in a room. They construct a generalized sidelobe canceler (GSC) for each source and adjust the active weight vectors of the two GSCs to extract two speech signals with minimum mutual information [1] between the two. Of course, this makes a few essential assumptions which may not be true in most situations. The first assumption is that the number of speakers is known. The second assumption is that they are semi-stationary and sitting at different angles from the microphone array. Kumatani, et al. [21] show performance results on the far-field PASCAL speech separation challenge, by performing speech recognition trials.
One important part of the above task is to localize the speakers. Takashima, et al. [23] use an HMM-based approach to separate the acoustic transfer function so that they can separate the sources, using a single microphone. It is done by using an HMM model of the speech of each speaker to estimate the acoustic transfer function from each position in the room. They have experimented with up to 9 different source positions and have shown that their accuracy of localization decreases with increasing number of positions.

Channel mismatch
Many publications deal with the problem of channel mismatch, since it is the most important challenge in speaker recognition. Early approaches to the treatment of this problem concentrated on normalization of the features or the score. Vogt, et al. [14] present a good coverage of different normalization techniques. Barras, et al. [24] compare cepstral mean subtraction (CMS) and variance normalization, Feature Warping, T-Norm, Z-Norm and the cohort methods. Later approaches started by using techniques from factor analysis or discriminant analysis to transform features such that they convey the most information about speaker differences and least about channel differences. Most GMM techniques use some variation of joint factor analysis (JFA) [25]. An offshoot of JFA is the i-vector technique which does away with the channel part of the model and falls back toward a PCA approach [26]. See Section 5.1 for more on the i-vector approach.
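Two of the normalization techniques named above can be sketched briefly. The first function applies cepstral mean subtraction together with variance normalization over the frames of one utterance. The second applies Z-Norm, here assumed to mean shifting and scaling a raw score by the mean and standard deviation of the scores that the same target model produces on a set of impostor utterances. The function names, the use of the population standard deviation, and the fallback to a unit scale when the deviation is zero are assumptions of this sketch.

```python
import math


def cmvn(frames):
    """Subtract the per-dimension mean and divide by the per-dimension standard deviation."""
    n, d = len(frames), len(frames[0])
    means = [sum(f[k] for f in frames) / n for k in range(d)]
    # A zero standard deviation falls back to 1.0 so constant dimensions map to 0.0.
    stds = [
        math.sqrt(sum((f[k] - means[k]) ** 2 for f in frames) / n) or 1.0
        for k in range(d)
    ]
    return [[(f[k] - means[k]) / stds[k] for k in range(d)] for f in frames]


def z_norm(raw_score, impostor_scores):
    """Normalize a score by the statistics of the model's scores on impostor utterances."""
    n = len(impostor_scores)
    mu = sum(impostor_scores) / n
    sigma = math.sqrt(sum((s - mu) ** 2 for s in impostor_scores) / n) or 1.0
    return (raw_score - mu) / sigma
```

Feature-domain normalization such as `cmvn` is applied before modeling, while score-domain normalization such as `z_norm` is applied after scoring, so the two can be combined in one system.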
SVM systems use techniques such as nuisance attribute projection (NAP) [27]. NAP [13] modifies the original kernel, used for a support vector machine (SVM) formulation, to one that is unable to tell specific channel information apart. The premise behind this approach is that by doing so, in both training and recognition stages, the system will not have the ability to distinguish channel-specific information. This channel-specific information is what is dubbed nuisance by Solomonoff, et al. [13].
NAP is a projection technique which assumes that most of the information related to the channel is stored in specific low-dimensional subspaces of the higher dimensional space to which the original features are mapped. Furthermore, these regions are assumed to be somewhat distinct from the regions which carry speaker information. This is quite similar to the idea of joint factor analysis. Seo, et al. [28] use the statistics of the eigenvalues of background speakers to come up with discriminative weight for each background speaker and to decide on the between class scatter matrix and the within-class scatter matrix.
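The projection idea behind NAP can be illustrated with a small sketch. It assumes that an orthonormal basis for the nuisance (channel) subspace has already been estimated, and it removes the components of a feature vector along those directions before the kernel inner product is taken. How the basis is estimated is not shown, and the function names are hypothetical.

```python
def nap_project(vector, nuisance_basis):
    """Remove the components of `vector` along each orthonormal nuisance direction."""
    out = list(vector)
    for direction in nuisance_basis:
        coeff = sum(u * v for u, v in zip(direction, vector))
        out = [o - coeff * u for o, u in zip(out, direction)]
    return out


def nap_kernel(a, b, nuisance_basis):
    """Linear kernel evaluated after projecting both inputs away from the nuisance subspace."""
    pa = nap_project(a, nuisance_basis)
    pb = nap_project(b, nuisance_basis)
    return sum(x * y for x, y in zip(pa, pb))
```

Because both inputs are projected, two vectors that differ only along a nuisance direction produce the same kernel values, which is the sense in which the modified kernel cannot distinguish channel-specific information.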
Shanmugapriya, et al. [29] propose a fuzzy wavelet network (FWN) which is a neural network with a wavelet activation function (known as a Wavenet). A fuzzy neural network is used in this case, with the wavelet activation function. Unfortunately, [29] only provides results for the TIMIT database [1] which is a database acquired under a clean and controlled environment and is not very challenging.
Villalba, et al. [30] attempt to detect two types of low-tech spoofing attempts. The first one is the use of a far-field microphone to record the victim's speech and then to play it back into a telephone handset. The second type is the concatenation of segments of short recordings to build the input required for a text-dependent speaker verification system. The former is handled by using an SVM classifier for spoof and non-spoof segments trained based on some training data. The latter is detected by comparing the pitch and MFCC feature contours of the enrollment and test segments using dynamic time warping (DTW).
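The contour comparison by dynamic time warping mentioned above can be sketched with the classic dynamic programming recursion. This is a generic DTW distance between two one-dimensional contours (for example, pitch values per frame) using the absolute difference as the local cost. It is not the specific configuration of Villalba, et al. [30], whose local cost and path constraints are not given here.

```python
def dtw_distance(a, b):
    """Minimum cumulative absolute difference over all monotonic alignments of a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

A contour that is merely stretched in time, such as `[1, 2, 3]` versus `[1, 2, 2, 3]`, has a distance of zero, whereas contours with genuinely different values accumulate cost.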

Alternative features
As seen in the past, most classic features used in speech and speaker recognition are based on LPC, LPCC, or MFCC. In Section 6.3 we see that Dhanalakshmi, et al. [19] report trying these three classic features and have shown that MFCC outperforms the other two. Also, Beigi [1] discusses many other features such as those generated by wavelet filterbanks, instantaneous frequencies, EMD, etc. In this section, we will discuss several new features, some of which are variations of cepstral coefficients with a different frequency scaling, such as CFCC, LFCC, EFCC, and GFCC. In Section 6.2 we will also see the RMFCC which was used to handle speaker identification for gaming applications. Other features are also discussed, which are more fundamentally different, such as missing feature theory (MFT), and local binary features.

Multitaper MFCC features
Standard MFCC features are usually computed using a periodogram estimate of the spectrum, with a window function such as the Hamming window [1]. MFCC features computed by this method exhibit a large variance. To reduce the variance, multitaper spectrum estimation techniques [31] have been used. Multitaper estimates of the spectrum show lower bias and variance. Although bias terms are generally small with the windowed periodogram estimate, the reduction in the variance, using multitaper estimation, seems to be significant.
A multitaper estimate of a spectrum is made by using the mean value of periodogram estimates of the spectrum using a set of orthogonal windows (known as tapers). The multitaper approach has been around since the early 1980s. Examples of such taper estimates are the Thomson taper [32], Tukey's split cosine taper [33], the sinusoidal taper [34], and peak matched estimates [35]. However, their use in computing MFCC features seems to be new. In Section 5.1, we will see that they have recently been used in conjunction with the i-vector formulation and have also shown promising results.
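A multitaper spectrum estimate of a single frame can be sketched as follows, using sinusoidal tapers because they have a simple closed form. This is a minimal illustration with uniform averaging of the tapered periodograms. The function names, the number of tapers, and the direct (slow) DFT are assumptions made to keep the sketch self-contained; a practical implementation would use an FFT routine.

```python
import cmath
import math


def sine_tapers(n, count):
    """Return `count` orthonormal sinusoidal tapers of length n."""
    scale = math.sqrt(2.0 / (n + 1))
    return [
        [scale * math.sin(math.pi * (j + 1) * (t + 1) / (n + 1)) for t in range(n)]
        for j in range(count)
    ]


def periodogram(frame, taper):
    """Squared DFT magnitude of the tapered frame for bins 0 .. n/2."""
    n = len(frame)
    x = [s * w for s, w in zip(frame, taper)]
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
        for k in range(n // 2 + 1)
    ]


def multitaper_spectrum(frame, count=4):
    """Average the periodograms obtained with each of the orthogonal tapers."""
    spectra = [periodogram(frame, w) for w in sine_tapers(len(frame), count)]
    return [sum(s[k] for s in spectra) / count for k in range(len(spectra[0]))]
```

The resulting spectrum would replace the single windowed periodogram at the start of an MFCC pipeline, with the filter bank, logarithm, and cosine transform stages left unchanged.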

Cochlear Filter Cepstral Coefficients (CFCC)
Li, et al. [36] present results for speaker identification using cochlear filter cepstral coefficients (CFCC) based on an auditory transform [37] while trying to emulate natural cochlear signal processing. They maintain that the CFCC features outperform MFCC, PLP, and RASTA-PLP features [1] under conditions with very low signal-to-noise ratios. Figure 1 shows the block diagram of the CFCC feature extraction proposed by Li, et al. [36]. The auditory transform is a wavelet transform which was proposed by Li, et al. [37]. It may be implemented in the form of a filter bank, as is usually done for the extraction of MFCC features [1]. Equations 18 and 19 show a generic wavelet transform associated with one such filter.
$$T(a,b) = \int_{-\infty}^{\infty} f(t) \, \psi_{(a,b)}(t) \, dt \tag{18}$$
where $f(t)$ is the signal and
$$\psi_{(a,b)}(t) = \frac{1}{\sqrt{|a|}} \, \psi\left( \frac{t-b}{a} \right) \tag{19}$$
The wavelet basis functions [1], $\{\psi_{(a,b)}(t)\}$, are defined by Li, et al. [37], based on the mother wavelet, $\psi(t)$ (Equation 20), which mimics the cochlear impulse response function.
Each wavelet basis function, according to the scaling and translation parameters $a > 0$ and $b > 0$, is therefore given by Equation 21.
In Equation 21, α and β are strictly positive parameters which define the shape and the bandwidth of the cochlear filter in the frequency domain. Li, et al. [36] determine them empirically for each filter in the filter bank. u(t) is the unit step (Heaviside) function defined by Equation 22.

Linear and Exponential Frequency Cepstral Coefficients (LFCC and EFCC)
Some experiments have shown that using linear frequency cepstral coefficients (LFCC) and exponential frequency cepstral coefficients (EFCC) for processing unvoiced consonants may produce better results for speaker recognition. For instance, Fan, et al. [18] use an unvoiced consonant detector to separate frames which contain such phones and to use LFCC and EFCC features for these frames (see Section 3.2). These features are then used to train up a GMM-based speaker recognition system. In turn, they send the remaining frames to a GMM-based recognizer using MFCC features. The two recognizers are treated as separate systems. At the recognition stage, the same segregation of frames is used and the scores of two recognition engines are combined to reach the final decision.
The EFCC scale was proposed by Bou-Ghazale, et al. [15] and later used by Fan, et al. [18]. This mapping is given by Equation 23, where the two constants, c and k, are computed by solving Equations 24 and 25. Fan, et al. [18] show better accuracy for unvoiced consonants when EFCC is used over MFCC. However, they show even better accuracy when LFCC is used for these frames.

Gammatone Frequency Cepstral Coefficients (GFCC)
Shao, et al. [38] use gammatone frequency cepstral coefficients (GFCC) as features, which are the products of a cochlear filter bank, based on psychophysical observations of the total auditory system. The Gammatone filter bank proposed by Shao, et al. [38] has 128 filters, centered from 50 Hz to 8 kHz, at equal partitions on the equivalent rectangular bandwidth (ERB) scale [39,40],
$$E = 21.4 \log_{10}\left( 4.37 \times 10^{-3} f + 1 \right)$$
where $f$ is the frequency in Hertz and $E$ is the number of ERBs, in a similar fashion as Barks or Mels are defined [1]. The bandwidth, $E_b$, associated with each center frequency, $f$, is then given by Equation 29. Both $f$ and $E_b$ are in Hertz (Hz) [40].
$$E_b = 24.7 \left( 4.37 \times 10^{-3} f + 1 \right) \tag{29}$$
The impulse response of each filter is given by Equation 30,
$$g(f,t) = t^{a-1} e^{-2\pi b t} \cos(2\pi f t), \quad t \ge 0 \tag{30}$$
where $t$ denotes the time and $f$ is the center frequency of the filter of interest. $a$ is the order of the filter and is taken to be $a = 4$ [38], and $b$ is the filter bandwidth.
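The layout of such a filter bank can be sketched as follows. The sketch assumes the commonly used ERB-rate and ERB-bandwidth formulas attributed to Glasberg and Moore, a gammatone impulse response of the form t^(a-1) exp(-2πbt) cos(2πft), and a filter bandwidth of b = 1.019 times the ERB bandwidth. These formulas, the 1.019 factor, the default sample rate, and all function names are assumptions of this illustration and are not taken from Shao, et al. [38].

```python
import math


def hz_to_erb_number(f):
    """Map a frequency in Hz to the ERB-rate scale (assumed Glasberg-Moore form)."""
    return 21.4 * math.log10(4.37e-3 * f + 1.0)


def erb_number_to_hz(e):
    """Inverse of hz_to_erb_number."""
    return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3


def erb_bandwidth(f):
    """Equivalent rectangular bandwidth in Hz at center frequency f in Hz."""
    return 24.7 * (4.37e-3 * f + 1.0)


def center_frequencies(low=50.0, high=8000.0, count=128):
    """Center frequencies equally spaced on the ERB-rate scale."""
    lo, hi = hz_to_erb_number(low), hz_to_erb_number(high)
    step = (hi - lo) / (count - 1)
    return [erb_number_to_hz(lo + i * step) for i in range(count)]


def gammatone_impulse_response(f, num_samples=400, sample_rate=16000, order=4):
    """Sampled gammatone impulse response for center frequency f."""
    b = 1.019 * erb_bandwidth(f)  # assumed relation between bandwidth and ERB
    response = []
    for n in range(num_samples):
        t = n / sample_rate
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        response.append(envelope * math.cos(2.0 * math.pi * f * t))
    return response
```

Filtering a signal with the impulse response of each of the 128 center frequencies would give the filter bank outputs from which the cepstral coefficients are subsequently derived.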
In addition, as it is done with other models such as MFCC, LPCC, and PLP, the magnitude also needs to be warped. Shao, et al. [38] base their magnitude warping on the method of cubic root warping (magnitude to loudness conversion) used in PLP [1].
The same group that published [38] followed up by using a computational auditory scene analysis (CASA) front-end [43] to estimate a binary spectrographic mask which determines the useful part of the signal (see Section 4.5), based on auditory scene analysis (ASA) [44]. They claim great improvements in noisy environments over standard speaker recognition approaches.

Missing Feature Theory (MFT)
Missing feature theory (MFT) tries to deal with bandlimited speech in the presence of non-stationary background noise. Such missing data techniques have been used in the speech community, mostly to handle applications of noisy speech recognition. Vizinho, et al. [45] describe such techniques by estimating the reliable regions of the spectrogram of speech and then using these reliable portions to perform speech recognition. They do this by estimating the noise spectrum and the SNR and by creating a mask that removes the noisy part from the spectrogram. In a related approach, some feature selection methods use Bayesian estimation to estimate a spectrographic mask which removes the unwanted part of the spectrogram, thereby removing features which are attributed to the noisy part of the signal.
The goal of these techniques is to be able to handle non-stationary noise. Seltzer, et al. [46] propose one such Bayesian technique. This approach concentrates on extracting as much useful information from the noisy speech as it can, rather than trying to estimate the noise and to subtract it from the signal, as it is done by Vizinho, et al. [45]. However, there are many parameters which need to be optimized, making the process quite expensive, calling for suboptimal search. Pullella, et al. [47] have combined the two techniques of spectrographic mask estimation and dynamic feature selection to improve the accuracy of speaker recognition under noisy conditions. Lim, et al. [48] propose an optimal mask estimation and feature selection algorithm.

Local binary features (slice classifier)
The idea of statistical boosting is not new and was proposed by several researchers, starting with Schapire [49] in 1990. The Adaboost algorithm was introduced by Freund, et al. [50] in 1996 as one specific boosting algorithm. The idea behind statistical boosting is that several weak classifiers may be combined to build a strong one.
Rodriguez [51] used the statistical boosting idea and several extensions of the Adaboost algorithm to introduce face detection and verification algorithms which would use features based on local differences between pixels in a 9 × 9 pixel grid, compared to the central pixel of the grid.
Inspired by [51], Roy, et al. [52] created local binary features according to the differences between the bands of the discrete Fourier transform (DFT) values to compare two models. One important claim of this classifier is that it is less prone to overfitting issues and that it performs better than conventional systems under low SNR values. The resulting features are binary because they are based on a threshold which categorizes the difference between different bands of the DFT as either 0 or 1. The classifier of [52] has a built-in discriminant nature, since it uses certain data as those coming from impostors, in contrast with the data which is generated by the target speaker. The labels of impostor versus target allow for this built-in discrimination. The authors of [52] call these features boosted binary features (BBF). In a more recent paper [53], Roy, et al. refined their approach and renamed the method a slice classifier. They show similar results with this classifier, compared to the state of the art, but they explain that the method is less computationally intensive and is more suitable for use in mobile devices with limited resources.
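The thresholded band-difference idea can be illustrated with a minimal sketch. The use of a log-magnitude spectrum, the Hamming window, the function names, and the particular band pairs and thresholds below are all hypothetical choices made for illustration; in [52] the pairs and thresholds are selected by boosting, which is not shown here.

```python
import numpy as np

def log_magnitude_spectrum(frame, n_fft=256):
    """Log-magnitude DFT of one windowed frame (an assumed front-end)."""
    windowed = frame * np.hamming(len(frame))
    return np.log(np.abs(np.fft.rfft(windowed, n=n_fft)) + 1e-10)

def binary_feature(spectrum, band_a, band_b, threshold):
    """1 if the difference between two DFT bands reaches the threshold, else 0."""
    return int(spectrum[band_a] - spectrum[band_b] >= threshold)

rng = np.random.default_rng(0)
frame = rng.standard_normal(200)        # stand-in for one frame of speech
spec = log_magnitude_spectrum(frame)

# Hypothetical (band_a, band_b, threshold) triples; a boosting stage would pick these.
pairs = [(5, 40, 0.0), (10, 90, 0.5), (60, 20, -0.5)]
features = [binary_feature(spec, a, b, th) for a, b, th in pairs]
```

Each feature costs one subtraction and one comparison, which is consistent with the low computational load claimed for the method.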

Alternative speaker modeling
Classic modeling techniques for speaker recognition have used Gaussian mixture models (GMM), support vector machines (SVM), and neural networks [1]. In Section 6 we will see some other modeling techniques such as non-negative matrix factorization. Also, in Section 4, new modeling implementations were used in applying the new features presented in that section. Generally, most new modeling techniques use some transformation of the features in order to handle mismatch conditions, such as joint factor analysis (JFA), nuisance attribute projection (NAP), and principal component analysis (PCA) techniques such as the i-vector implementation [1]. In the next few sections, we will briefly look at some recent developments in these and other techniques.

The i-vector model (total variability space)
Dehak, et al. [54] recombined the channel variability space in the JFA formulation [25] with the speaker variability space, since they discovered that there was considerable leakage from the speaker space into the channel space. The combined space produces a new projection (Equation 31) which resembles a PCA, rather than a factor analysis process.
They called the new space the total variability space and in their later works [55][56][57], they referred to the projections of feature vectors into this space as i-vectors. Speaker factor coefficients are related to the speaker coordinates, in which each speaker is represented as a point. This space is defined by the Eigenvoice matrix. These speaker factor vectors are relatively short, having on the order of 300 elements [58], which makes them desirable for use with support vector machines, as the observed vector in the observation space (x).
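For orientation, the total variability model is commonly written in the literature in the form below. The symbols are assumptions of this sketch and need not match the notation used elsewhere in this chapter.

```latex
\mathbf{M} = \mathbf{m} + \mathbf{T}\,\mathbf{w}
```

Here M is the speaker- and session-dependent GMM mean supervector, m is the speaker- and session-independent supervector (usually taken from the universal background model), T is a low-rank rectangular matrix spanning the total variability space, and w is the low-dimensional vector of total factors, namely the i-vector.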
Generally, in order to use an i-vector approach, several recording sessions are needed from the same speaker, to be able to compute the within class covariance matrix in order to do within class covariance normalization (WCCN). Methods using linear discriminant analysis (LDA) along with WCCN [57] and, recently, probabilistic LDA (PLDA) with WCCN [59][60][61][62] have also shown promising results.
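The need for several sessions per speaker becomes clear when the normalization is written out. The sketch below follows one commonly used WCCN formulation: average the per-speaker covariance matrices into W, then project with a matrix B satisfying B B^T = W^{-1}. The averaging convention, the function names, and the synthetic data are assumptions of this sketch rather than the procedure of [57].

```python
import numpy as np

def within_class_covariance(vectors, speaker_ids):
    """Average of the per-speaker covariance matrices."""
    vectors = np.asarray(vectors, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    speakers = np.unique(speaker_ids)
    dim = vectors.shape[1]
    w = np.zeros((dim, dim))
    for s in speakers:
        x = vectors[speaker_ids == s]          # all sessions of one speaker
        centered = x - x.mean(axis=0)
        w += centered.T @ centered / len(x)
    return w / len(speakers)

def wccn_matrix(vectors, speaker_ids):
    """Matrix B with B B^T = W^{-1}; row vectors are normalized as v @ B."""
    w = within_class_covariance(vectors, speaker_ids)
    return np.linalg.cholesky(np.linalg.inv(w))

# Synthetic stand-in for i-vectors: 5 speakers, 20 sessions each, 4 dimensions.
rng = np.random.default_rng(0)
n_speakers, n_sessions, dim = 5, 20, 4
ids = np.repeat(np.arange(n_speakers), n_sessions)
speaker_means = rng.standard_normal((n_speakers, dim)) * 3.0
session_noise = rng.standard_normal((n_speakers * n_sessions, dim)) * np.array([0.5, 1.0, 2.0, 4.0])
ivectors = speaker_means[ids] + session_noise

b = wccn_matrix(ivectors, ids)
normalized = ivectors @ b                      # within-class covariance becomes identity
```

With a single session per speaker, every centered vector would be zero and W could not be inverted, which is why multiple sessions are required.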
Glembek, et al. [26] provide simplifications to the formulation of the i-vectors to reduce the memory usage and to increase the speed of computing the vectors. Glembek, et al. [26] also explore linear transformations using principal component analysis (PCA) and heteroscedastic linear discriminant analysis (HLDA) [64] to achieve orthogonality of the components of the Gaussian mixture.

Non-negative matrix factorization
In Section 6.3, we will see several implementations of extensions of non-negative matrix factorization [65,66]. These techniques have been successfully applied to classification problems, and more detail is given in that section.

Using multiple models
In Section 3.2 we briefly covered a few model combination and selection techniques that would use different specialized models to achieve better recognition rates. For example, Fan, et al. [18] used two different models to handle unvoiced consonants and the rest of the phones. Both models had similar form, but they used slightly different types of features (MFCC vs. EFCC/LFCC). Similar ideas will be discussed in this section.

Frame-based score competition (FSC):
In Section 3.2 we discussed the fact that Jin, et al. [17] used two separate models, one based on normal (neutral) speech and the second one based on whisper data. Then, at the recognition stage, each frame is evaluated against the two models and the higher score is used [17]. Therefore, it is called a frame-based score competition (FSC) method.
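The per-frame competition can be sketched in a few lines. The frame scores below are made-up log-likelihoods, and accumulating the winning scores into a single utterance score is an assumption of this sketch; the text above only states that the higher score is used for each frame.

```python
import numpy as np

def frame_score_competition(neutral_scores, whisper_scores):
    """Keep the higher of the two model scores for every frame."""
    neutral_scores = np.asarray(neutral_scores, dtype=float)
    whisper_scores = np.asarray(whisper_scores, dtype=float)
    winners = np.maximum(neutral_scores, whisper_scores)
    # Assumed aggregation: sum the winning frame scores over the utterance.
    return winners.sum(), winners

# Hypothetical per-frame log-likelihoods under the neutral and whisper models.
total, per_frame = frame_score_competition([-4.0, -2.5, -6.0], [-3.0, -5.0, -5.5])
```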

SNR-Matched Recognition:
After performing voice activity detection (VAD), Bartos, et al. [67] estimate the signal to noise ratio (SNR) of that part of the signal which contains speech. This value is used to load models which have been created with data recorded under similar SNR conditions. Generally, the SNR is computed in decibels (dB), as given by Equations 32 and 33 (see [1] for more),
$$\mathrm{SNR} = 10 \log_{10}\left(\frac{P_s}{P_n}\right) \qquad (32)$$
$$\mathrm{SNR} = 20 \log_{10}\left(\frac{A_s}{A_n}\right) \qquad (33)$$
where P_s and P_n are the powers of the signal and the noise, and A_s and A_n are their amplitudes. Bartos, et al. [67] consider an SNR of 30dB or higher to be clean speech. An SNR of 30dB happens to be equivalent to the signal amplitude being about 30 times that of the noise. When the SNR is 0dB, the signal amplitude is roughly the same as the amplitude of the noise.
Of course, to evaluate the SNR from Equation 32 or 33, we would need to know the power or amplitude of the noise as well as the true signal. Since this is not possible, estimation techniques are used to come up with an instantaneous SNR and to average that value over the whole signal. Bartos, et al. [67] present such an algorithm.
Once the SNR of the speech signal is computed, it is categorized within a quantization of 4dB segments and then identification or verification is done using models which have been enrolled with similar SNR values. This, according to [67], allows for a lower equal error rate in case of speaker verification trials. In order to generate speaker models for different SNR levels (of 4dB steps), [67] degrades clean speech iteratively, using some additive noise, amplified by a constant gain associated with each 4dB level of degradation.
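The arithmetic behind the decibel values and the 4dB quantization can be checked with a short sketch. The function names, the clipping of negative values to the lowest band, and the grouping of everything at or above 30dB into one band are assumptions of this sketch, not details reported in [67].

```python
import math

def snr_db_from_power(p_signal, p_noise):
    """SNR in dB from signal and noise power."""
    return 10.0 * math.log10(p_signal / p_noise)

def snr_db_from_amplitude(a_signal, a_noise):
    """SNR in dB from signal and noise amplitude."""
    return 20.0 * math.log10(a_signal / a_noise)

def snr_bin(snr_db, step_db=4.0, clean_db=30.0):
    """Index of the 4 dB band used to pick a matching set of models.

    Assumption: values below 0 dB fall in band 0 and values at or above
    clean_db share the band that contains clean_db.
    """
    clipped = min(max(snr_db, 0.0), clean_db)
    return int(clipped // step_db)

amplitude_ratio_at_30db = 10.0 ** (30.0 / 20.0)   # about 31.6, i.e. roughly 30 times
band = snr_bin(snr_db_from_power(20.0, 1.0))      # about 13 dB, so band 3 (12 to 16 dB)
```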

Branch-specific progress
In this section, we will quickly review the latest developments for the main branches of speaker recognition as listed at the beginning of this chapter. Some of these have already been reviewed in the above sections. Most of the work on speaker recognition is performed on speaker verification. In the next section we will review some such systems.

Verification
As we mentioned in Section 4, Roy, et al. [52,53] used the so-called boosted binary features (slice classifier) for speaker verification. Also, we reviewed several developments regarding the i-vector formulation in Section 5.1. The i-vector has basically been used for speaker verification. Many recent papers have dealt with aspects such as LDA, PLDA, and other discriminative aspects of the training.
Salman, et al. [68] use a neural network architecture with a large number of layers to perform greedy discriminative learning for the speaker verification problem. The deep neural architecture (DNA) proposed by [68] uses two identical subnets to process two MFCC feature vectors, respectively, providing discrimination results between two speakers. They show promising results using this network.
Sarkar, et al. [69] use multiple background models associated with different vocal tract length (VTL) [1] estimates for the speakers, using MAP [1] to derive these background models from a root background model. Once the best VTL-based background model for the training or test audio is computed, the transformation to get from that universal background model (UBM) to the root UBM is used to transform the features of the segment to those associated with the VTL of the root UBM. Sarkar, et al. [69] show that the results of this single UBM system are comparable to those of a multiple background model system.

Identification
In Section 5.3.2 we discussed new developments on SNR-matched recognition. The work of Bartos, et al. [67] was applied to improving speaker identification based on a matched SNR condition.
Bharathi, et al. [70] try to identify phonetic content for which specific speakers may be efficiently recognized. Using these speaker-specific phonemes, a special text is created to enhance the discrimination capability for the target speaker. The results are presented for the TIMIT database [1] which is a clean and controlled database and not very challenging. However, the idea seems to have merit.
Cai, et al. [71] use some of the features described in Section 4, such as MFCC and GFCC, in order to identify the voice of singers from a monophonic recording of songs in the presence of sounds of music from several instruments.
Do, et al. [72] examine the speaker identification problem for identifying the person playing a computer game. The specific challenges are the fact that the recording is done through a far-field microphone (see Section 3.4) and that the audio is generally short, apparently based on the commands used for gaming. To handle the reverberation and background noise, Do, et al. [72] argue for the use of the, so-called, reverse Mel frequency cepstral coefficients (RMFCC). They propose this set of features by reversing the triangular filters [1] used for computing the MFCC, such that the lower frequency filters have larger bandwidths and the higher frequency filters have smaller bandwidths. This is exactly the opposite of the filters being used for MFCC. They also use LPC and F0 (the fundamental frequency) as additional features.
In Section 3.2 we saw the treatment of speaker identification for whispered speech in some detail. Also, Ghiurcau, et al. [73] study the effect of the emotional state of speakers on the results of speaker identification. The study treats happiness, anger, fear, boredom, sadness, and neutral conditions; it shows that these emotions significantly affect identification results. Therefore, they [73] propose using emotion detection and having emotion-specific models. Once the emotion is identified, the proper model is used to identify the test speaker.
Liu, et al. [74] use the Hilbert Huang Transform to come up with new acoustic features. This is the use of intrinsic mode decomposition described in detail in [1].
In the next section, we will look at the multi-class SVM which is used to perform speaker identification.

Multi-Class SVM
In Section 2.2 we discussed the popular one-against-all technique for handling multi-class SVM. There have been other more recent techniques which have been proposed in the last few years. One such technique is due to Platt, et al. [75], who proposed the, so-called, decision directed acyclic graph (DDAG) which produces a classification node for each pair of classes, in a Γ-class problem. This leads to Γ(Γ − 1)/2 classifiers and results in the creation of the DAGSVM algorithm [75].
Wang [76] presents a tree-based multi-class SVM which reduces the number of matches to the order of log(Γ), although at the training phase the number of SVMs is similar to that of DDAG, namely Γ(Γ − 1)/2. This reduction in the number of matches can significantly reduce the amount of computation for speaker identification.
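To make the counts concrete, the pairwise elimination behind the DDAG of [75] can be sketched with a list of candidate classes: each node compares the first and last remaining candidates and rules one of them out, so a decision takes Γ − 1 evaluations even though Γ(Γ − 1)/2 pairwise classifiers are trained. The function names and the toy pairwise rule standing in for trained SVMs are hypothetical.

```python
from itertools import combinations

def ddag_classify(classes, prefers_first):
    """Eliminate one class per node until a single candidate remains.

    prefers_first(a, b) is a hypothetical pairwise decision that returns
    True when the binary classifier trained on the pair (a, b) votes for a.
    """
    remaining = list(classes)
    evaluations = 0
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        if prefers_first(first, last):
            remaining.pop()       # the last candidate is ruled out
        else:
            remaining.pop(0)      # the first candidate is ruled out
        evaluations += 1
    return remaining[0], evaluations

speakers = ["spk0", "spk1", "spk2", "spk3", "spk4"]
n_pairwise = len(list(combinations(speakers, 2)))   # classifiers to train

# Toy rule in place of trained SVMs: the true speaker wins every pair it is in.
truth = "spk3"
winner, n_eval = ddag_classify(speakers, lambda a, b: a == truth or b != truth)
```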

Classification and diarization
Aside from the more prominent research on speaker verification and identification, audio source and gender classification are also quite important in most audio processing systems including speaker and speech recognition.
In many practical audio processing systems, it is important to determine the type of audio. For instance, consider a telephone-based system which includes a speech recognizer. Such recognition engines would produce spurious results if they were presented with non-speech, say music. These results may be detrimental to the operation of an automated process. This is also true for speaker identification and verification systems which expect to receive human speech. They may be confused if they are presented with music or other types of audio such as noise. For text-independent speaker identification systems, this may result in mis-identifying the audio as a viable choice in the database, with dire consequences. Similarly, some systems are only interested in processing music. An example is a music search system which would look for a specific piece of music or one resembling the presented segment. These systems may be confused if presented with human speech, uttered inadvertently, while only music is expected.
As an example, an important goal for audio source classification research is to develop filters which would tag a segment of audio as speech, music, noise, or silence [77]. Sometimes, we would also look into classifying the genre of audio or video such as movie, cartoon, news, advertisement, etc. [19].
The basic problem contains two separate parts. The first part is the segmentation of the audio stream into segments of similar content. This work has been under development for the past few decades with some good results [78][79][80].
The second part is the classification of each segment into relevant classes such as speech, music, or the rejection of the segment as silence or noise. Furthermore, when the audio type is human speech, it is desirable to do a further classification to determine the gender of the individual speaker.
Gender classification [77] is helpful in choosing appropriate models for conducting better speech recognition, more accurate speaker verification, and reducing the computation load in large-scale speaker identification. For the speaker diarization problem, the identity of the speaker also needs to be recognized.
Dhanalakshmi, et al. [19] report developments in classifying the genre of audio, as stemming from different video sources, containing movies, cartoons, news, etc. Beigi [77] uses a text and language independent speaker recognition engine to achieve these goals by performing audio classification. The classification problem is posed by Beigi [77] as an identification problem among a series of speech, music, and noise models.

Age and Gender Classification
Another goal for classification is to be able to classify age groups. Bocklet, et al. [81] categorized the age of the individuals, in relation to their voice quality, into 4 categories (classes). These classes are given by Table 1. With the natural exception of the child group (13 years or younger), each group is further split into the two male and female genders, leading to 7 total age-gender classes. Bahari, et al. [82] use a slightly different definition of age groups, compared to those used by [81]. They use 3 age groups for each gender, not considering individuals who are less than 18 years old. These age categories are given in Table 2.
They use weighted supervised non-negative matrix factorization (WSNMF) to classify the age and gender of the individual. This technique combines weighted non-negative matrix factorization (WNMF) [83] and supervised non-negative matrix factorization (SNMF) [84] which are themselves extensions of non-negative matrix factorization (NMF) [65,66]. NMF techniques have also been successfully used in other classification implementations such as that of the identification of musical instruments [85].
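The basic factorization that these extensions build on can be sketched as follows. The code uses the well-known multiplicative update rules for the squared-error objective and is only the plain, unweighted and unsupervised NMF; it is not the WSNMF procedure of [82]. The function name, iteration count, and synthetic data are assumptions of this sketch.

```python
import numpy as np

def nmf(v, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix v (n x m) as w @ h with w, h >= 0."""
    rng = np.random.default_rng(seed)
    n, m = v.shape
    w = rng.random((n, rank)) + eps
    h = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        h *= (w.T @ v) / (w.T @ w @ h + eps)
        w *= (v @ h.T) / (w @ h @ h.T + eps)
    return w, h

# Synthetic non-negative data with an exact rank-3 structure.
rng = np.random.default_rng(1)
v = rng.random((20, 3)) @ rng.random((3, 30))
w, h = nmf(v, rank=3)
rel_error = np.linalg.norm(v - w @ h) / np.linalg.norm(v)
```

Because the updates only multiply by non-negative ratios, the columns of w can only add to one another when reconstructing v, which is the additive property discussed next.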
NMF distinguishes itself as a method which only allows additive components that are considered to be parts of the information contained in an entity. Due to their additive and positive nature, the components are each considered to be part of the information that builds up a description. In contrast, methods such as principal component analysis and vector quantization techniques are considered to be learning holistic information and hence are not considered to be parts-based [66]. According to the image recognition example presented by Lee, et al. [66], a PCA method such as Eigenfaces [86,87] provides a holistic representation of faces, in contrast with the parts-based representation learned by NMF.
Although it depends on the data being used, an accuracy of 96% for the gender classification case is not necessarily a great result. It is hard to make a qualitative assessment without running the same algorithms under the same conditions and on exactly the same data. But Beigi [77] shows 98.1% accuracy for gender classification.
In [77], 700 male and 700 female speakers were selected, completely at random, from over 70,000 speakers. The speakers were non-native speakers of English, at a variety of proficiency levels, speaking freely. This introduced a significantly higher number of pauses in each recording, as well as a higher than average number of humming sounds while the candidates would think about their speech. The segments were live responses of these non-native speakers to test questions in English, aimed at evaluating their linguistic proficiency.
Dhanalakshmi, et al. [19] also present a method based on an auto-associative neural network (AANN) for performing audio source classification. AANN is a special branch of feedforward neural networks which tries to learn the nonlinear principal components of a feature vector. The way this is accomplished is that the network consists of three layers, an input layer, an output layer of the same size, and a hidden layer with a smaller number of neurons. The input and output neurons generally have linear activation functions and the hidden (middle) layer has nonlinear functions.
In the training phase, the input and target output vectors are identical. This is done to allow for the system to learn the principal components that have built the patterns which most likely have built-in redundancies. Once such a network is trained, a feature vector undergoes a dimensional reduction and is then mapped back to the same dimensional space as the input space. If the training procedure is able to achieve a good reduction in the output error over the training samples and if the training samples are representative of the reality and span the operating conditions of the true system, the network can learn the essential information in the input signal. Autoassociative networks (AANN) have also been successfully used in speaker verification [88].
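A minimal sketch of such a network, under the description above (linear input and output layers, a smaller nonlinear hidden layer, and targets identical to the inputs), is given below. The tanh activation, plain batch gradient descent, the learning rate, and the synthetic data are assumptions of this sketch and are not taken from [19] or [88].

```python
import numpy as np

def train_aann(x, n_hidden, n_epochs=500, lr=0.1, seed=0):
    """Train a three-layer auto-associative network to reproduce its input."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w1 = rng.standard_normal((d, n_hidden)) * 0.1
    b1 = np.zeros(n_hidden)
    w2 = rng.standard_normal((n_hidden, d)) * 0.1
    b2 = np.zeros(d)
    losses = []
    for _ in range(n_epochs):
        hidden = np.tanh(x @ w1 + b1)      # nonlinear bottleneck layer
        out = hidden @ w2 + b2             # linear output layer
        err = out - x                      # the target equals the input
        losses.append(float(np.mean(err ** 2)))
        # Gradients of the mean squared reconstruction error.
        g_out = 2.0 * err / err.size
        g_w2 = hidden.T @ g_out
        g_b2 = g_out.sum(axis=0)
        g_hidden = (g_out @ w2.T) * (1.0 - hidden ** 2)
        g_w1 = x.T @ g_hidden
        g_b1 = g_hidden.sum(axis=0)
        w1 -= lr * g_w1
        b1 -= lr * g_b1
        w2 -= lr * g_w2
        b2 -= lr * g_b2
    return (w1, b1, w2, b2), losses

# Synthetic 6-dimensional features with built-in redundancy (2 latent factors).
rng = np.random.default_rng(1)
latent = rng.standard_normal((200, 2))
x = latent @ rng.standard_normal((2, 6))
x = (x - x.mean(axis=0)) / x.std(axis=0)
params, losses = train_aann(x, n_hidden=2)
```

The recorded losses give the reconstruction error over the training samples, which is the quantity the paragraph above refers to when it speaks of a good reduction in the output error.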
Table 3. Audio Classification Categories used by [19]
Class Name
Advertisement
Cartoon
Movie
News
Songs
Sports
Dhanalakshmi, et al. [19] use the audio classes represented in Table 3. They consider three different front-end processors for extracting features, used with two different modeling techniques. The features are LPC, LPCC, and MFCC features [1]. The models are Gaussian mixture models (GMM) and autoassociative neural networks (AANN) [1]. According to these experiments, Dhanalakshmi, et al. [19] show consistently higher classification accuracies with MFCC features over LPC and LPCC features. The comparison between AANN and GMM is somewhat inconclusive and both systems seem to produce similar results. Although the accuracy of AANN with LPC and LPCC seems to be higher than that of GMM modeling, for the case when MFCC features are used, the difference seems somewhat insignificant. Especially given the fact that GMM are simpler to implement than AANN and are less prone to problems such as encountering local minima, it makes sense to conclude that the combination of MFCC and GMM still provides the best results in audio classification. A combination of GMM with MFCC and Maximum a-Posteriori (MAP) adaptation provides a very simple approach with considerable results for gender classification, as seen in [77].

Music Modeling
Beigi [77] classifies musical instruments along with noise and gender of speakers. Much in the same spirit as described in Section 6.3.1, [77] has made an effort to choose a variety of different instruments or sets of instruments to be able to cover most types of music. Table 4 shows these choices. A total of 14 different music models were trained to represent all music, with an attempt to cover different types of timbre [89].
An equal amount of music was chosen by Beigi [77] to create a balance in the quantity of data, reducing any bias toward speech or music. The music was downsampled from its original quality to 8kHz, using 8-bit µ-Law amplitude encoding, in order to match the quality of speech. The 1400 segments of music were chosen at random from European style classical music, as well as jazz, Persian classical, Chinese classical, folk, and instructional performances. Most of the music samples were orchestral pieces, with some solos and duets present.
Although very low quality audio, based on highly compressed telephony data (AAC compressed [1]), was used by Beigi [77], the system achieved a 1% error rate in discriminating between speech and music and a 1.9% error in determining the gender of individual speakers once the audio is tagged as speech. Beigi [77] has shown that MAP adaptation techniques used with GMM models and MFCC features may be used successfully for the classification of audio into speech and music and to further classify the speech by the gender of the speaker and the music by the type of instrument being played.

Open problems
With all the new accomplishments in the last couple of years, covered here and many that did not make it to our list due to shortage of space, there is still a lot more work to be done. Although incremental improvements are made every day in all branches of speaker recognition, channel and audio type mismatch still seem to be the biggest hurdles in reaching perfect results in speaker recognition. It should be noted that perfect results are asymptotes and will probably never be reached. Inherently, as the size of the population in a speaker database grows, the intra-speaker variations exceed the inter-speaker variations. This is the main source of error for large-scale speaker identification, which is the holy grail of the different goals in speaker recognition. In fact, if large-scale speaker identification approaches acceptable results, most other branches of the field may be considered trivial. However, this is quite a complex problem and will definitely need a lot more time to be perfected, if it is indeed possible to do so. In the meantime, we still seem to be in our infancy when it comes to large-scale identification.
New Trends and Developments in Biometrics