Classification and Separation of Audio and Music Signals

This chapter addresses the classification and separation of audio and music signals, a very important and challenging research area. The classification of a stream of sounds is needed to build two different libraries: a speech library and a music library. The separation process, on the other hand, is needed in cocktail-party problems to separate speech from music and remove the undesired component. In this chapter, some existing algorithms for the classification and separation processes are presented and discussed thoroughly. The classification algorithms are divided into three categories. The first category includes most of the real-time (time-domain) approaches. The second category includes most of the frequency-domain approaches. The third category introduces some of the time-frequency-distribution approaches. The time-domain approaches discussed in this chapter are the short-time energy (STE), the zero-crossing rate (ZCR), modified versions of the ZCR and the STE with positive derivative, the neural networks, and the roll-off variance. The frequency-spectrum approaches are, specifically, the roll-off of the spectrum, the spectral centroid and its variance, the spectral flux and its variance, the cepstral residual, and the delta pitch. The time-frequency-domain approaches have not yet been tested thoroughly for the classification and separation of audio and music signals; therefore, the spectrogram and the evolutionary spectrum will be introduced and discussed. In addition, some algorithms for the separation and segregation of music and audio signals, like the independent component analysis, the pitch cancelation, and the artificial neural networks, will be introduced.


Introduction
Audio signal processing is an important subfield of signal processing that is concerned with the electronic manipulation of audio signals [1][2][3][4][5][6]. The problem of discriminating music from audio has become increasingly important as automatic audio signal recognition (ASR) systems are increasingly applied in real-world multimedia domains [7]. The human ear can easily distinguish audio without any influence from the mixed music [8][9][10][11][12][13][14][15][16][17][18][19][20][21][22][23]. Due to the new methods of analysis and synthesis of audio signals, the processing of musical signals has gained particular weight [16,24], and therefore the classical sound analysis methods may be used in the processing of musical signals [25][26][27][28]. Many types of musical signals exist, such as rock, pop, classical, country, Latin, Arabic, disco, jazz, and electronic music [29]. The sound-type signal hierarchy is shown in Figure 1 [30].
The audio signal changes randomly and continuously through time. As an example, music and audio signals have strong energy content at low frequencies and weaker energy content at high frequencies [31,32]. Figure 2 depicts generalized time and frequency spectra of audio signals [33]. The maximum frequency f_max varies according to the type of audio signal: in telephone transmission f_max is equal to 4 kHz, 5 kHz in mono-loudspeaker recording, 6 kHz in multi-loudspeaker (stereo) recording, 11 kHz in FM broadcasting, and 22 kHz in CD recording.
Acoustically speaking, audio signals can be classified into the following classes:
1. A single talker at a specific time [34].
3. A mixture of background music and single-talker audio.
4. Songs, which are a mixture of music and a singing voice.
5. A pure music signal without any audio component.
6. A complex sound mixture, like multiple singers or multiple speakers with multiple music sources.
7. Non-music and non-audio signals, like fan, motor, car, and jet sounds, etc.

Representation of audio signal
The letter symbols used for writing are not adequate, as the way they are pronounced varies; for example, the letter "o" in English is pronounced differently in the words "pot", "most", and "one". It is almost impossible to tackle the audio classification problem without first establishing some way of representing the spoken utterances by a group of symbols representing the sounds produced [39][40][41][42][43]. The phonemes in Table 1 are divided into groups based on the way they are produced [44], forming a set of allophones [45]. In some tonal languages, such as Vietnamese and Mandarin, the intonation determines the meaning of each word [46][47][48].

Production of audio signal
The range of sounds that can be produced by any system is limited [39][40][41][42][43][44]. To produce sound, the pressure in the lungs is increased, pushing air up the trachea; the larynx is situated at the top of the trachea. By changing the shape of the vocal tract, different sounds are produced, so the fundamental frequency changes with time. The spectrogram (or sonogram) for the sentence "What can I have for dinner tonight?" is shown in Figure 3. The way that humans recognize and interpret audio signals has been considered by many researchers [1,25,39]. Many researchers have shown that the two lowest formants are necessary to produce a complete set of English vowels, and that the three lowest formants in frequency are necessary for good audio intelligibility. As the number of formants is increased, more natural sounds are produced. However, when we deal with continuous audio, the problem becomes more complex. The history of audio signal identification can be found in [1,25,[39][40][41][42][43][44][45][46][47][48].

Representation of music signal
There are two kinds of tone structures in a music signal. The first is a simple tone formed of a single sinusoidal waveform; the second is a more complex tone consisting of more than one harmonic [31,[49][50][51][52]. The spectrum of a music signal has twice the bandwidth of the audio spectrum, and most of the power of an audio signal is concentrated at lower frequencies. Melodists and musicians divide the musical scale into eight parts, each named an octave, where each octave is divided into seven parts called tones [30]. A tempered scale for different instruments is shown in Table 2. These tones, shown in Table 2, are named (Do, Re, Mi, Fa, Sol, La, and Si) or simply (A, B, C, D, E, F, and G). Every first tone in an octave takes double the frequency of the first tone of the previous octave, i.e., A_n = 2^(n-1) A_1, B_n = 2^(n-1) B_1, and so on, where n ∈ {2, 3, 4, 5, 6, 7, 8}. From Table 2, the highest tone, C8, occurs at the frequency of 4186 Hz, which is the highest frequency produced by the human sound system. This leads musical instrument manufacturers to try their best to bound music frequencies to the limits of the human sound system to achieve strong concord [35,53,54]. (In the real world, musical instruments cover more frequencies than the audible band, which is limited to 20 kHz.)

Production of music signal
The concept of tone quality that is most common depends on subjective acoustic properties, regardless of partials or formants, and the production of music depends mainly on the kind of musical instrument [53,54]. These instruments can be summarized as follows: 1. The string musical instrument. Its tones are produced by vibrating chords made from horsetail hair or other manufactured materials like copper or plastic. Every vibrating chord has its own fundamental frequency, producing complex tones that cover most of the audible band. Figure 4 shows string instruments.
2. The brass musical instrument. The brass musical instrument depends on blowing air, like the woodwind. Its shape looks like an animal horn, and it has manual valves to control the cavity size. The brass musical instrument has a huge number of non-harmonic components in its spectrum. Figure 5 shows brass instruments.
3. The woodwind musical instrument. The woodwind instrument consists of a cylindrical tube open at both ends. Some woodwind instruments may use a small vibrating piece of copper (a reed) to produce tones. It produces a large number of harmonic tones. Figure 6 shows woodwind instruments.

4. The percussion musical instrument. Examples of percussion instruments are the piano, snare drum, chimes, marimba, timpani, and xylophone. Most of the power of tones in percussion instruments goes into non-harmonic components. Figure 7 shows some percussion instruments.

5. The electronic musical instrument. The most robust and accurate electronic musical instrument is the organ. It has a large keyboard and a memory that can store notes and use their frequencies as basic cadences or tones. Without the organ, disco, pop, rock, and jazz could hardly stand [29,[35][36][37][38]. The organ is not the only electronic music producer. If electronic musical instruments are used for producing music, the tone quality measure of the fundamental frequency or harmonics is not needed. Figure 8 shows an example of the organ electronic instrument.

Characteristics and differences between audio and music
The audio signal is a slowly time-varying signal in the sense that, when examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are stationary within this period of time. A simple example of an audio signal is shown in Figure 9, and Figure 10 is a typical example of a music portion. It is very clear from the two spectra in Figures 9 and 10 that we can distinguish between the two types of signals. Figures 11 and 12 depict the evolutionary spectra of two different types of signals, audio and music. Now, let us discuss some of the main similarities and differences between the two types of signals. Tonality. By tone, we mean a single harmonic of a pure periodic sinusoid. Regardless of the type of instrument or music, the musical signal is composed of multiple tones; however, this is not the case in the voice signal [47,52,[55][56][57].
Bandwidth. Normally, the audio signal has 90% of its power concentrated at frequencies lower than 4 kHz and is limited to 8 kHz; however, the music signal can extend its power to the upper limit of the ear's response, which is 20 kHz [52,58]. Alternating sequence. Audio exhibits an alternating sequence of noise-like segments, while music alternates in a more tonal shape. In other words, the audio signal is distributed through its spectrum more randomly than music is.
Power distribution. Normally, the power distribution of an audio signal is concentrated at frequencies lower than 4 kHz, and then collapsed rapidly above this frequency. On the other hand, there is no specific shape of the power of music spectrum [59].
Dominant frequency. For a single talker, the dominant frequency can be determined accurately and uniquely; however, for a single musical instrument, only the average dominant frequency can be determined. With multiple musical instruments, the case is even worse.
Fundamental frequency. For a single talker, his fundamental frequency can be accurately configured. However, this is not the case for a single music instrument.
Excitation patterns. The excitation signals (pitch) for audio usually exist only over a span of three octaves, while the fundamental music tones can span up to six octaves [60].
Energy sequences. A reasonable generalization is that audio follows a pattern of high-energy conditions of voicing followed by low energy conditions, which the envelope of music is less likely to exhibit.
Tonal duration. The duration of vowels in audio is very regular, following the syllabic rate. Music exhibits a wider variation in tone lengths, not being constrained by the process of articulation. Hence, tonal duration would likely be a good discriminator.
Consonants. The audio signal contains many consonants, while music is usually continuous through time [33].
Zero crossing rate (ZCR). The ZCR in music is greater than that in audio. We can use this idea to design a discriminator [60].
In the frequency domain, there is a strong overlap between audio and music signals, so no ordinary filter can separate them. As mentioned before, the audio signal may cover the spectrum between 0 and 4 kHz, with an average dominant frequency of 1.8747 kHz. However, the lowest fundamental frequency (A1) of a music signal is about 27.5 Hz, and the highest frequency of the tone C8 is around 4186 Hz. The reason behind this is that musical instrument manufacturers try to bound music frequencies to the human sound limits in order to achieve strong concord, and hence a strong frequency overlap. Moreover, music may propagate over the audible spectrum to cover more than the audible band of 20 kHz, with an average dominant frequency of 1.9271 kHz [25]. Table 3 summarizes the main similarities and differences between music and audio signals.

Table 3. Key differences between audio and music: units of analysis (phonemes for audio, notes for music), temporal structure (short samples of 40 ms-200 ms, more steady state than dynamic), and syntactic/semantic structure (symbolic, productive, and combinable in a grammar).

Audio and music signals classification
The main classification approaches will be discussed in this section. They can be categorized into three different approaches: (1) time-domain approaches, (2) frequency-domain approaches, and (3) time-frequency-domain approaches. A two-level music and audio classifier was developed by El-Maleh [61,62], who used a combination of long-term features such as the variance, the differential parameters, the zero-crossing rate (ZCR), and the time averages of spectral parameters. Saunders [60] proposed another two-level classifier, based on the short-time energy (STE) and the average ZCR features. In addition, Matityaho and Furst [63] developed a neural-network-based model for classifying music signals. Their model was designed based on the functional performance of the human cochlea.
For audio detection, Hoyt and Wechsler [64] developed a neural-network-based model using the Fourier transform, Hamming filtering, and a logarithmic function as pre-processing; they then applied a simple threshold algorithm for detecting audio, music, wind, traffic, or any interfering sound. To improve the performance, they also suggested a wavelet transform feature for pre-processing. Their work is much like the work done by Matityaho and Furst [63,64]. Thirteen features were examined by Scheirer and Slaney [65], some of which were simple modifications of each other. They also tried combining them in several multidimensional classification forms. From these previous works, the most powerful discrimination features were the STE and the ZCR; therefore, the STE and the ZCR will be discussed thoroughly. Finally, the common classifiers of audio and music signals can be divided into time-domain, frequency-domain [32, 33, 35, 59, 66-77, 112], and time-frequency-domain approaches.

The ZCR algorithm
The ZCR can be defined as the number of times the signal crosses the zero axis within a specific window. It is widely used because of its simplicity and robustness [34]. We may define the ZCR as in Eq. (1),
where Z_n is the ZCR, N is the number of samples in one window, and sgn is the sign of the signal, such that sgn[x(n)] = 1 when x(n) > 0 and sgn[x(n)] = -1 when x(n) < 0. An essential note is that the sampling rate must be high enough to catch every crossing through zero. Another important note is that, before evaluating the ZCR, the signal should be normalized by subtracting its average value. It is clear from Eq. (1) that the value of the ZCR is proportional to the rate of sign changes in the signal, i.e., to the dominant frequency of x(n). Therefore, we may find that the ZCR of music is, in general, higher than that of audio, although this is not assured for unvoiced audio.
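As a minimal sketch of this definition (with our own variable names, a rectangular window, and normalization by the window length, as described in the text), the ZCR can be computed as follows:

```python
import math

def zero_crossing_rate(x):
    """Normalized count of adjacent-sample sign changes in one window.

    Follows the text's description: the mean is subtracted first, then
    sign changes between consecutive samples are counted and scaled by
    twice the window length.
    """
    mean = sum(x) / len(x)
    x = [s - mean for s in x]
    sgn = lambda v: 1 if v > 0 else -1
    crossings = sum(abs(sgn(x[n]) - sgn(x[n - 1])) for n in range(1, len(x)))
    return crossings / (2 * len(x))

# A 100 Hz sine sampled at 8 kHz crosses zero twice per period,
# so the ZCR should be close to 2 * 100 / 8000 = 0.025.
fs, f = 8000, 100
x = [math.sin(2 * math.pi * f * n / fs) for n in range(fs)]  # one second
print(zero_crossing_rate(x))  # ≈ 0.025
```

A higher-pitched (more music-like) tone would yield a proportionally larger value, which is the discriminative property the text exploits.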
Properties of ZCR: The ZCR properties can be summarized as follows.

The Principle of Dominant Frequency
The dominant frequency of a pure sinusoid is the only value in its spectrum; this frequency is equal to the ZCR of the signal in one period. If we have a non-sinusoidal periodic signal, its dominant frequency is the frequency with the largest amplitude. The dominant frequency ω_0 can be evaluated as follows.
where N is the number of intervals, E{·} is the expected value, and D_0 is the ZCR per interval.

The Highest frequency
Since D_0 denotes the ZCR of a discrete-time signal Z(i), let us assume that D_n denotes the ZCR of the n-th derivative of Z(i), i.e., D_1 is the ZCR of the first derivative of Z(i), D_2 is the ZCR of the second derivative of Z(i), and so on. Then, the highest frequency ω_max in the signal can be evaluated as follows,
where N is the number of samples. If the sampling rate equals 11 kHz, then the change in ω_max can be ignored for i > 10.

The Lowest frequency
Assuming that the time period between any two samples is normalized to unity, the derivative ∇ of Z(i) can be defined accordingly, and the ZCR of the n-th derivative of Z(i) is denoted D_n. Now, let us define ∇+ as the +ve derivative of Z(i); then ∇+[Z(i)] can be defined as follows.
Let the ZCR of the n-th +ve derivative of Z(i) be denoted by the symbol nD.
Then we can find the lowest frequency ω_min of a signal as follows.

Measure of Periodicity
A signal is said to be purely periodic if and only if.
The Ratio of High ZCR (RHZCR). It was found that the variation of the ZCR is more discriminative than the exact ZCR, so the RHZCR can be considered as one feature [78]. The RHZCR is defined as the ratio of the number of frames whose ZCR is above 1.5 times the average ZCR in one window, and can be defined as follows.
where N is the number of frames per one-window, n is the index of the frame, sgn[.] is a sign function and ZCR(n) is the zero-crossing rate at the n th frame. In general, audio signals consist of alternating voiced and unvoiced sounds in each syllable rate, while music does not have this kind of alternation. Therefore, from Eq. (7) and Eq. (8), we may observe that the variation of the ZCR (or the RHZCR) in an audio signal is greater than that of a music, as shown in Figure 13.
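A small sketch of the RHZCR computation over per-frame ZCR values (the 1.5-fold threshold follows the common definition in the literature; the frame values below are illustrative, not measured data):

```python
def high_zcr_ratio(frame_zcrs, factor=1.5):
    """Fraction of frames whose ZCR exceeds `factor` times the
    window-average ZCR (a sketch of the RHZCR feature)."""
    avg = sum(frame_zcrs) / len(frame_zcrs)
    return sum(1 for z in frame_zcrs if z > factor * avg) / len(frame_zcrs)

# Speech alternates voiced (low-ZCR) and unvoiced (high-ZCR) frames,
# while music keeps a steadier ZCR, as the text argues.
speech = [0.02, 0.30, 0.03, 0.28, 0.02, 0.31]
music  = [0.10, 0.11, 0.09, 0.10, 0.11, 0.10]
print(high_zcr_ratio(speech))  # 0.5
print(high_zcr_ratio(music))   # 0.0
```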

The STE algorithm
The amplitude of the audio signal varies appreciably with time. In particular, the amplitude of unvoiced segments is generally much lower than the amplitude of voiced segments. The STE of the audio signal provides a convenient representation that reflects these amplitude variations (Figure 13 shows music and audio sharing some feature values [65]). Unlike the audio signal, since the music signal does not contain unvoiced segments, the STE of the music signal is usually bigger than that of audio [60]. The STE of a discrete-time signal s(n) can be defined as follows,
where STE_s in Eq. (9) is the total energy of the signal. The average power of s(n) is defined as in Eq. (10).
In general, signals can be classified into three types: an energy signal, which has non-zero and finite energy; a power signal, which has non-zero and finite average power; and a third type that is neither an energy nor a power signal, see Table 4. Now, let us define another sequence {f_s(n, m)} as follows,
where w(n) is a window of length N with a value of zero outside [0, N-1]. Therefore, f_s(n, m) will be zero outside [m-N+1, m].
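The framed STE can be sketched as below, assuming a rectangular window of length N ending at sample m (variable names are ours; the example amplitudes are illustrative):

```python
def short_time_energy(s, m, N):
    """STE of the frame f_s(n, m): the sum of squared samples of s
    over [m-N+1, m], i.e. with a rectangular window of length N that
    is zero outside that range."""
    start = max(0, m - N + 1)
    return sum(v * v for v in s[start:m + 1])

# Voiced speech has much larger amplitude than unvoiced speech,
# which is exactly the variation the STE feature captures.
voiced   = [0.8, -0.7, 0.9, -0.8]
unvoiced = [0.05, -0.04, 0.06, -0.05]
print(short_time_energy(voiced, 3, 4))    # ≈ 2.58
print(short_time_energy(unvoiced, 3, 4))  # ≈ 0.0102
```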

Deriving short term features
The silence and unvoiced periods in audio can be considered stochastic background noise. Now, let us define F_s as a feature of {s(n)}, mapping values from the Hilbert space, H, to the set of complex numbers C such that.
The long-term feature of {s(n)} may be defined as follows.
The long-term average, when applied to energy signals, will have zero value; however, it is appropriate for power signals. Eq. (13) can be re-written as follows.

Table 4. Types of signals.
This results in a family of mappings. If each member of the family is indexed by a parameter λ, then we can use the notation F_s(λ). The discrete-time Fourier transform is an example of a parametric long-term feature. The long-term feature can be of the form,
where M in Eq. (15) is the mapping sequence: it maps {s(n)} to another sequence. The long-term feature F_s(λ) is defined as L ∘ M, a composition of the functions L and M. If F_s(λ) is the long-term feature of Eq. (12), then the short-term feature F_s(λ, m) of time period m can be constructed as follows: • Define a frame as in Eq. (11).
• Apply the long-term feature transformation to the frame sequence as in Eq. (16).
Low Short-Time Energy Ratio (LSTER). As done with the ZCR, the variation is selected [33]; here, the LSTER is used to represent the variation of the STE. The LSTER is defined as the ratio of the number of frames whose STE is less than 0.5 times the average STE in a one-second window, as in Eq. (17), where
N is the total number of frames, STE(n) is the STE at the n-th frame, and STE_av in Eq. (18) is the average STE in one window. Figure 14 shows the pre-processing flow on Z(i) using the positive derivative concept (∇+), which provided some improvement in the discrimination process [78].

The effect of the positive derivative
This pre-processing increased the ZCR of music and reduced the ZCR of audio, at the expense of some delay. The averages of the ZCR for speech, mixture, and music are shown in Figure 15, after applying the +ve derivative of order 50.
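One plausible reading of the ∇+ operator in Figure 14 is a first difference with negative values clipped to zero, applied repeatedly before the ZCR is evaluated. The sketch below implements that reading; the exact operator used in [78] may differ:

```python
def positive_derivative(z):
    """Assumed form of the +ve derivative: forward first difference
    with negative values clipped to zero (our reading of the text)."""
    return [max(z[i] - z[i - 1], 0.0) for i in range(1, len(z))]

def preprocess(z, order):
    """Apply the +ve derivative `order` times (e.g. order 50, as
    used for Figure 15) before evaluating the ZCR."""
    for _ in range(order):
        z = positive_derivative(z)
    return z

print(positive_derivative([1.0, 3.0, 2.0, 5.0]))  # [2.0, 0.0, 3.0]
```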

Figure 14.
The preprocessing using the +ve derivative before evaluating the ZCR.

Spectral flux mean and variance
This feature characterizes the change in the shape of the spectrum, so it measures the frame-to-frame spectral difference. Audio signals go through fewer frame-to-frame changes than music, so the spectral flux values of an audio signal are lower than those of music.
The spectral flux, sometimes called the delta spectrum magnitude, is defined as the second norm of the spectral amplitude of the difference vector and defined as in Eq. (19).
where X(k) is the signal power and k is the corresponding frequency. Another definition of the SF is described as follows,
where A(n, k) in Eq. (20) is the discrete Fourier transform (DFT) of the n th frame of the input signal and can be described as in Eq. (21).
and x(m) is the original audio data, L is the window length, M is the order of the DFT, N is the total number of frames, δ is an arbitrary constant, and w(m) is the window function. Scheirer and Slaney [65] found the SF feature very useful in discriminating audio from music. Figure 16 depicts that the variances are lower for music than for audio, and the means are lower for audio than for music. Rossignol and others [133] computed the means and variances of one-second segments using frames of length 18 milliseconds.
Rossignol and others [133] tested three classification approaches to classify the segments: the k-nearest-neighbors (kNN) classifier with k = 7, the Gaussian mixture model (GMM), and the ANN classifier. Their results, using the mean and the variance of the SF, are shown in Table 5.
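The flux computation can be sketched as the second norm of the difference between consecutive magnitude spectra (one common reading of Eq. (19); log-magnitude variants also appear in the literature). A direct DFT is used here for self-containment, with a rectangular window:

```python
import cmath

def dft_magnitudes(frame):
    """Magnitude spectrum |A(n, k)| of one frame via a direct DFT
    (a rectangular-window sketch of Eq. (21))."""
    M = len(frame)
    return [abs(sum(frame[m] * cmath.exp(-2j * cmath.pi * k * m / M)
                    for m in range(M))) for k in range(M)]

def spectral_flux(prev_frame, frame):
    """Frame-to-frame spectral difference: the second norm of the
    difference of the two magnitude spectra."""
    a = dft_magnitudes(prev_frame)
    b = dft_magnitudes(frame)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

steady  = [1.0, 0.0, -1.0, 0.0]            # unchanged tone: zero flux
changed = [1.0, 1.0, -1.0, -1.0]           # different spectral shape
print(spectral_flux(steady, steady))       # 0.0
print(spectral_flux(steady, changed) > 0)  # True
```

A steady musical tone yields zero flux between identical frames, while any change in spectral shape produces a positive value, matching the discriminative behavior described above.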

The mean and variance of the spectral centroid
In the frequency domain, the spectral centroid describes the center frequency at which most of the power in the signal is found; its mean and variance are used as features. In audio signals, the pitches are concentrated in a narrow range of low frequencies. In contrast, music signals contain higher frequencies, which result in higher spectral means, i.e., higher spectral centroids. For a frame at time t, the spectral centroid can be evaluated as follows,
where X(k) is the power of the signal at the corresponding frequency band k. When the mean and the variance of the SF are combined with the mean and the variance of the SC in Eq. (22), and the mean and the variance of the ZCR, the results in Table 6 are obtained [133].
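Eq. (22) is a power-weighted mean frequency, which can be sketched as below (the band powers are illustrative values, not measured spectra):

```python
def spectral_centroid(power, freqs):
    """Power-weighted mean frequency (Eq. (22)): the 'center of mass'
    of the spectrum, where power[k] plays the role of X(k)."""
    return sum(f * p for f, p in zip(freqs, power)) / sum(power)

freqs = [100.0, 200.0, 300.0, 400.0]
speech_like = [4.0, 2.0, 1.0, 0.0]  # power concentrated at low bands
music_like  = [1.0, 1.0, 1.0, 1.0]  # power spread across the band
print(spectral_centroid(speech_like, freqs))  # ≈ 157.1
print(spectral_centroid(music_like, freqs))   # 250.0
```

The low-frequency-heavy (speech-like) spectrum yields a lower centroid than the flat (music-like) one, as the text predicts.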

Energy at 4 Hz modulation
The audio signal has an energy peak centered at the 4 Hz syllabic rate; therefore, a second-order band-pass filter with a center frequency of 4 Hz is used. Although audio signals have higher energy at 4 Hz, some bass music instruments were found to have modulation energy around this frequency as well [65,133].

Roll-off point
In the frequency domain, the roll-off point feature is the value of the frequency below which 95% of the power of the signal is contained. The value of the roll-off point can be found as follows [65,133],
where the left-hand side of Eq. (23) is the sum of the power up to the frequency value V, the right-hand side of Eq. (23) is 95% of the total power of the signal in the frame, and X(k) is the DFT of x(t).
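The search for V in Eq. (23) can be sketched as a cumulative-power scan over spectrum bins (the bin powers below are illustrative):

```python
def rolloff_index(power, fraction=0.95):
    """Smallest bin index V whose cumulative power reaches `fraction`
    of the total power (a sketch of Eq. (23))."""
    target = fraction * sum(power)
    acc = 0.0
    for k, p in enumerate(power):
        acc += p
        if acc >= target:
            return k
    return len(power) - 1

# A speech-like spectrum collapses above the low bands, so its 95%
# roll-off sits early; a flat music-like spectrum rolls off late.
print(rolloff_index([8.0, 1.0, 0.4, 0.4, 0.2]))      # 3
print(rolloff_index([2.0, 2.0, 2.0, 2.0, 2.0]))      # 4
```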

Cepstrum
The cepstrum of a signal can be defined as the inverse DFT of the logarithm of the spectrum of the signal. Music signals have higher cepstrum values than speech signals. The complex cepstrum is defined in Eqs. (24) and (25) [122][123][124], where X(e^jω) is the DFT of the sequence x(n). Table 7 summarizes the percentage error of a simulation done for each feature; latency refers to the amount of past input data required to calculate the feature.
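A self-contained sketch of the cepstrum (using the real cepstrum for simplicity rather than the complex cepstrum of Eqs. (24)-(25), and a direct DFT so no libraries are needed). The small epsilon guards the logarithm where the magnitude is near zero, which, as the text later notes, otherwise blows up:

```python
import cmath, math

def real_cepstrum(x):
    """Real cepstrum: inverse DFT of the log magnitude spectrum.
    An epsilon is added before the logarithm to avoid log(0)."""
    N = len(x)
    X = [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
         for k in range(N)]
    log_mag = [math.log(abs(v) + 1e-12) for v in X]
    return [sum(log_mag[k] * cmath.exp(2j * cmath.pi * k * n / N)
                for k in range(N)).real / N for n in range(N)]

# A strongly periodic (pitched) signal produces a peak in the cepstrum
# at its period: the "pitch peak" referred to in the separation section.
x = [math.sin(2 * math.pi * n / 8) for n in range(64)]  # period 8 samples
c = real_cepstrum(x)
print(c[8] > c[4])  # True: peak at the 8-sample period, dip off-period
```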

Summary
Scheirer and Slaney [65] evaluated their models using a 20-minute-long data set of music and audio. Their data set consists of 80 samples, each a 15-second-long recording. They collected their samples using a 16-bit monophonic FM tuner with a sampling rate of 22.05 kHz, from a variety of stations, with different content styles and different noise levels, over a period of three days in the San Francisco Bay Area. They also claimed to have audio from both male and female speakers. They also recorded samples of many types of music, like pop, jazz, salsa, country, classical, reggae, various sorts of rock, and various non-Western styles [29,65]. They also used several features in a spatial partitioning classifier. Table 8 summarizes their results.

Table 7. Latency and univariate discrimination performance for each feature [65].
The features used in "Best 8" include the 4 Hz modulation, the variance features, the pulse metric, and the low-energy frame [80,134]. In "Best 3", they used the pulse metric, the 4 Hz energy, and the variance of the spectral flux. In "Fast 5", they used the five basic features. From the results shown in Table 8, we conclude that it is not necessary to use all the features in order to obtain a good classification, so in real time a good-performance system may be built using only a few features. A more detailed discussion can be found in [29,65,80,134].

Spectrogram (or sonogram)
The spectrogram is an example of a time-frequency distribution, and it was found to be a good classical tool for analyzing audio signals [13,19,86,127]. The spectrogram (or sonogram) of a signal x(n) can be defined as follows,
where N is the length of the sequence x(n) and W(n) is a specific window. The spectrogram method can be used to discriminate audio from music signals; however, it may have a high percentage error, because it depends on the strength of the frequencies in the tested samples. Figure 17 depicts two examples of spectrograms of audio and music signals.
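The windowed-DFT computation behind Eq. (26) can be sketched as below (a Hann window and a direct DFT are assumed for self-containment; frame length and hop are free parameters):

```python
import cmath, math

def spectrogram(x, frame_len, hop):
    """Magnitude STFT: each row is the DFT magnitude of one
    Hann-windowed frame W(n)x(n), so time runs along rows and
    frequency along columns (only the non-negative bins are kept)."""
    win = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
           for n in range(frame_len)]
    rows = []
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = [x[start + n] * win[n] for n in range(frame_len)]
        rows.append([abs(sum(seg[n] * cmath.exp(-2j * cmath.pi * k * n / frame_len)
                             for n in range(frame_len)))
                     for k in range(frame_len // 2 + 1)])
    return rows

# A pure tone at DFT bin 4 should light up bin 4 in every frame.
x = [math.sin(2 * math.pi * 4 * n / 32) for n in range(128)]
S = spectrogram(x, frame_len=32, hop=16)
print(max(range(17), key=lambda k: S[0][k]))  # 4
```

For a real classifier, the per-frame magnitudes would feed a feature extractor or be inspected for the tonal tracks that distinguish music from the noise-like bursts of speech.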

Evolutionary spectrum (ES)
The spectral representation of a stationary signal may be viewed as an infinite sum of sinusoids with random amplitudes and phases, as described in Eq. (27),
where Z(ω) is a process with orthogonal increments, and S(ω) in Eq. (28) is the spectrum of e(n) [81]. Since the audio signal is, in general, non-stationary, we use the Wold-Cramer (WC) representation of a non-stationary signal. The WC representation considers the discrete-time non-stationary process {x(n)} as the output of a causal, linear, time-variant (LTV) system driven by a white noise input e(n) with zero mean and unit variance, where h(n, m) is defined as the unit impulse response of the LTV system. Substituting e(n) into x(n) of Eq. (29) (assuming S(ω) = 1 for white noise), we obtain Eq. (31); the instantaneous power of x(n) is given by Eq. (32), and the Wold-Cramer ES is then defined as in Eq. (33). The ES S(n, ω) in Eq. (33) was found to be a good classifier for distinguishing audio from music signals [81,129]. Because of the extensive mathematical calculation required for the time-frequency spectrum, these methods may be most useful in off-line classification and analysis. The ESs of music and audio signals are shown in Figure 18(a) and (b), respectively. The suppression of the amplitude for audio might be due to Gaussianity.

Separation of audio and music signals
Since the separation of audio and music signals is more complicated than their classification, in this section we introduce only two approaches [7-13, 22, 76, 77, 86, 135]. The first is the independent component analysis (ICA) approach with an ANN; the second is the pitch cancelation approach. A block diagram of a classifier integrated with a separator is depicted in Figure 19.

ICA with ANN separation approach
In [13,20,21,127,136], Wang and Brown proposed a model for an audio segregation algorithm. Their model consists of pre-processing using cochlear filtering (gammatone filters), a correlogram formed from autocorrelation functions, and feature extraction. The impulse response of the gammatone filters is represented as follows,
where n is the filter order, N is the number of channels, and U is the unit step function. Therefore, the gammatone system can be considered a causal, time-invariant system with an infinite impulse response. For the i-th channel, f_i is the center frequency of the channel, ϕ_i is the phase of the channel, b is the rate of decay of the impulse response, and g(i) is an equalizing gain adjusted for each filter. Figure 20 depicts the impulse response of the gammatone system, and Figure 21 depicts the block diagram of the Wang and Brown model. The Wang and Brown model has some drawbacks. The first drawback is its complexity: the model needs high-specification hardware to perform the calculations. In [20], Andre reported that the Wang and Brown model needs to be improved. The ICA method can be used for separation if two sources of the mixture are available, assuming that the signals from the two different sources are statistically independent [66,74,75,121,137]. In [19], Takigawa tried to improve the performance of the Wang and Brown model, using the short-time Fourier transform (STFT) in the input stage and the spectrogram values instead of the correlogram; however, the amount of improvement was not reported. A similar work for separating the voiced audio of two talkers speaking simultaneously at similar intensities in a single channel, using pitch-peak canceling in the cepstrum domain, was done by Stubbs [8].

The pitch cancelation
The pitch cancelation method is widely used in noise reduction. A good attempt to separate two talkers speaking simultaneously at similar intensities in a single channel (in other words, separation of two talkers without any restriction) was introduced by Stubbs [8]. For a certain speaker, the letters A and R carry a lot of consonance. These consonants have low amplitudes in the frequency domain; however, they appear as a long pitch peak in the cepstrum domain. If these consonants are deleted by replacing the five cepstral samples centered at the pitch peak with zeros, the audio segment may be attenuated or distorted completely. A typical example of the cepstra of 5-second audio and music signals is depicted in Figure 22. The logarithmic operation will amplify low amplitudes and reduce high ones, and values near zero will become very large in magnitude after the logarithm.

Conclusions
In this chapter, a general review of the common classification and separation algorithms used for speech and music was presented, and some were introduced and discussed thoroughly. The approaches dealing with classification were divided into three categories. The first category included most of the real-time approaches: the ZCR, the STE, the ZCR and the STE with positive derivative, some of their modified versions, and the neural networks. The second category included most of the frequency-domain approaches, such as the spectral centroid and its variance, the spectral flux and its variance, the roll-off of the spectrum, the cepstral residual, and the delta pitch. The last category introduced two time-frequency approaches, mainly the spectrogram and the evolutionary spectrum. It has been noticed that the time-frequency classifiers provide excellent and robust results in discriminating speech from music signals in digital audio. The decision of which feature should be chosen depends on the application: the algorithms of the first category are faster, since the processing is done in real time, while those of the second category are more precise. The time-frequency approaches have not been discussed thoroughly in the literature, and they still need more research and elaboration. Lastly, we may conclude that many classification algorithms have been proposed in the literature, but few have been proposed for separation. The algorithms introduced in this chapter are summarized in Table 9.