Open access peer-reviewed chapter - ONLINE FIRST

Classification and Separation of Audio and Music Signals

By Abdullah I. Al-Shoshan

Submitted: May 6th 2020. Reviewed: November 6th 2020. Published: December 15th 2020

DOI: 10.5772/intechopen.94940



This chapter addresses the classification and separation of audio and music signals, a very important and challenging research area. Classifying a stream of sounds matters when building two different libraries: a speech library and a music library. Separation, on the other hand, is sometimes needed in a cocktail-party problem, to separate speech from music and remove the undesired component. In this chapter, some existing algorithms for the classification and the separation processes are presented and discussed thoroughly. The classification algorithms are divided into three categories. The first category includes most of the time domain approaches, the second includes most of the frequency domain approaches, and the third introduces some approaches based on time-frequency distributions. The time domain approaches discussed in this chapter are the short-time energy (STE), the zero-crossing rate (ZCR), a modified version of the ZCR and the STE with positive derivative, neural networks, and the roll-off variance. The frequency domain approaches are, specifically, the spectral roll-off, the spectral centroid and its variance, the spectral flux and its variance, the cepstral residual, and the delta pitch. Time-frequency domain approaches have not yet been tested thoroughly for the classification and separation of audio and music signals; therefore, the spectrogram and the evolutionary spectrum are introduced and discussed. In addition, some algorithms for the separation and segregation of music and audio signals, such as independent component analysis, pitch cancelation, and artificial neural networks, are introduced.


  • audio signal
  • music signal
  • classification
  • separation
  • time domain
  • frequency domain
  • time-frequency domain

1. Introduction

Audio signal processing is an important subfield of signal processing concerned with the electronic manipulation of audio signals [1, 2, 3, 4, 5, 6]. The problem of discriminating music from audio has become increasingly important as automatic audio signal recognition (ASR) systems are increasingly applied in real-world multimedia [7]. The human ear can easily distinguish audio without any influence of the mixed music [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]. Due to new methods for the analysis and synthesis of audio signals, the processing of musical signals has gained particular weight [16, 24]; therefore, classical sound analysis methods may be used in the processing of musical signals [25, 26, 27, 28]. Many types of musical signals exist, such as rock, pop, classical, country, Latin, Arabic, disco, jazz, and electronic music [29]. The hierarchy of sound signal types is shown in Figure 1 [30].

Figure 1.

Types of audio signals.

An audio signal changes randomly and continuously through time. For example, music and audio signals have strong energy content at low frequencies and weaker energy content at high frequencies [31, 32]. Figure 2 depicts a generalized time and frequency spectrum of audio signals [33]. The maximum frequency fmax varies with the type of audio signal: in telephone transmission fmax equals 4 kHz; it is 5 kHz in mono-loudspeaker recording, 6 kHz in multi-loudspeaker (stereo) recording, 11 kHz in FM broadcasting, and 22 kHz in CD recording.

Figure 2.

Generalized frequency spectrum for audio signal [33].

Acoustically speaking, the audio signals can be classified into the following classes:

  1. A single talker at a specific time [38].

  2. Singing without music.

  3. A mixture of background music and a single talker.

  4. Songs that are a mixture of music and a singing voice.

  5. A pure music signal without any audio component.

  6. A complex sound mixture, such as multiple singers or multiple speakers with multiple music sources.

  7. Non-music, non-audio signals, such as fan, motor, car, and jet sounds.

  8. An audio signal that is a mixture of more than one speaker talking simultaneously [8].

  9. Abnormal music, such as a single-word cadence, human whistling, or reverse reverberation [4, 34, 35, 36, 37, 38].

2. Analysis of audio and music signals

2.1 Properties of audio signal

2.1.1 Representation of audio signal

The letter symbols used for writing are not adequate, as the way they are pronounced varies; for example, the letter "o" in English is pronounced differently in the words "pot", "most", and "one". It is almost impossible to tackle the audio classification problem without first establishing some way of representing spoken utterances by a group of symbols representing the sounds produced [39, 40, 41, 42, 43]. The phonemes in Table 1 are divided into groups based on the way they are produced [44], forming a set of allophones [45]. In some tonal languages, such as Vietnamese and Mandarin, the intonation determines the meaning of each word [46, 47, 48].


Table 1.

Phoneme categories of British English and examples of words in which they are used [44].

2.1.2 Production of audio signal

The range of sounds that can be produced by any system is limited [39, 40, 41, 42, 43, 44]. Sound production begins when the pressure in the lungs is increased: the lungs push the air up the trachea, at the top of which the larynx is situated. By changing the shape of the vocal tract, different sounds are produced, so the fundamental frequency changes with time. The spectrogram (or sonogram) for the sentence "What can I have for dinner tonight?" is shown in Figure 3.

Figure 3.

A sonogram for the sentence “What can I have for dinner tonight?” [43].

The way that humans recognize and interpret audio signals has been considered by many researchers [1, 25, 39]. Many researchers have shown that the two lowest formants are necessary to produce a complete set of English vowels, and that the three lowest formants in frequency are necessary for good audio intelligibility. As the number of formants increases, more natural sounds are produced. However, when we deal with continuous audio, the problem becomes more complex. The history of audio signal identification can be found in [1, 25, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48].

2.2 Properties of music signal

2.2.1 Representation of music signal

There are two kinds of tone structures in a music signal. The first is a simple tone formed of a single sinusoidal waveform; the second is a more complex tone consisting of more than one harmonic [31, 49, 50, 51, 52]. The spectrum of a music signal has twice the bandwidth of the audio spectrum, and most of the power of an audio signal is concentrated at the lower frequencies. Melodists and musicians divide the musical range into eight parts, each called an octave, where each octave is divided into seven parts called tones [30]. A tempered scale for different instruments is shown in Table 2. These tones, shown in Table 2, are named (Do, Re, Mi, Fa, Sol, La, and Si), or simply (A, B, C, D, E, F, and G). The tone A1 of the first octave has the fundamental frequency of the first tone in each octave; every first tone in each octave takes double the frequency of the corresponding tone of the previous octave, i.e., An = 2^(n−1) A1, Bn = 2^(n−1) B1, and so on, where n ∈ {2, 3, …, 8}.

A (Hz)      B (Hz)      C (Hz)      D (Hz)      E (Hz)      F (Hz)      G (Hz)
A1 27.5     B1 30.863   C1 32.703   D1 36.708   E1 41.203   F1 43.654   G1 48.99
A2 55       B2 61.735   C2 65.406   D2 73.416   E2 82.407   F2 87.307   G2 97.99
A3 110      B3 123.47   C3 130.81   D3 146.83   E3 164.81   F3 174.61   G3 196
A4 220      B4 246.94   C4 261.63   D4 293.66   E4 329.63   F4 349.23   G4 392
A5 440      B5 493.88   C5 523.25   D5 587.33   E5 659.26   F5 698.46   G5 783.9
A6 880      B6 987.77   C6 1046.5   D6 1174.7   E6 1318.5   F6 1396.9   G6 1568
A7 1760     B7 1975.5   C7 2093     D7 2349.3   E7 2637     F7 2793     G7 3136
A8 3520     B8 3951.1   C8 4186

Table 2.

Frequencies of notes in the tempered scale [3].

From Table 2, the highest tone, C8, occurs at a frequency of 4186 Hz, which is the highest frequency produced by the human sound system. This leads musical instrument manufacturers to try their best to bound music frequencies to the limits of the human sound system to achieve strong concord [34, 53, 54]. In the real world, musical instruments cover more frequencies than the audible band, which is limited to 20 kHz.
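The octave-doubling rule behind Table 2 can be verified with a short sketch. This is illustrative code, not from the chapter; the only input is the base frequency A1 = 27.5 Hz from the table.

```python
# Minimal sketch of the octave-doubling rule behind Table 2:
# every tone doubles in frequency from one octave to the next,
# i.e. A_n = 2**(n - 1) * A_1, with A_1 = 27.5 Hz.

A1 = 27.5  # fundamental frequency of tone A in the first octave (Hz)

def tone_frequency(base_hz: float, octave: int) -> float:
    """Frequency of a tone in the given octave (octaves are 1-indexed)."""
    return base_hz * 2 ** (octave - 1)

if __name__ == "__main__":
    for n in range(1, 9):
        print(f"A{n}: {tone_frequency(A1, n):8.1f} Hz")  # A5 -> 440.0, A8 -> 3520.0
```

Running the loop reproduces the A column of Table 2, including A5 = 440 Hz, the usual concert-pitch reference.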

2.2.2 Production of music signal

The concept of tone quality that is most common depends on the subjective acoustic properties, regardless of partials or formants and the production of music depends mainly on the kind of musical instruments [53, 54]. These instruments can be summarized as follows:

  1. The string instruments. Their tones are produced by vibrating strings made of horsetail hair or other manufactured materials such as copper or plastic. Every vibrating string has its own fundamental frequency, producing complex tones that cover most of the audible band. Figure 4 shows string instruments.

  2. The brass instruments. A brass instrument depends on blown air, like a woodwind. Its shape resembles an animal horn, and it has manual valves to control the cavity size. Brass instruments have a huge number of nonharmonic components in their spectra. Figure 5 shows brass instruments.

  3. The woodwind instruments. A woodwind instrument consists of a cylindrical tube open at both ends. Some woodwind instruments use a small vibrating piece of copper (a reed) to produce tones. They produce a large number of harmonic tones. Figure 6 shows woodwind instruments.

  4. The percussion instruments. Examples of percussion instruments are the piano, snare drum, chimes, marimba, timpani, and xylophone. Most of the power of percussion tones lies in non-harmonic components. Figure 7 shows some percussion instruments.

  5. The electronic instruments. The most robust and accurate electronic musical instrument is the organ. It has a large keyboard and a memory that can store notes and use their frequencies as basic cadences or tones. Without the organ, disco, pop, rock, and jazz could hardly stand [29, 34, 35, 36, 37]. The organ is not the only electronic music producer. If electronic instruments are used to produce music, the tone-quality measure of the fundamental frequency or harmonics is not needed. Figure 8 shows an example of an electronic organ.

Figure 4.

String instruments.

Figure 5.

Brass instruments.

Figure 6.

Woodwind instruments.

Figure 7.

Percussion instruments.

Figure 8.

Electronic organ.

2.3 Characteristics and differences between audio and music

The audio signal is a slowly time-varying signal in the sense that, when examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are stationary. A simple example of an audio signal is shown in Figure 9.

Figure 9.

An example of an audio signal of the two-second-long spoken phrase "Very good night": (a) time domain, (b) magnitude, (c) phase.

Figure 10 is a typical example of a music portion. It is very clear from the two spectra in Figures 9 and 10 that we can distinguish between the two types of signals.

Figure 10.

A 2-second long music signal: (a) time domain. (b) Spectrum. (c) Phase.

Figures 11 and 12 depict the evolutionary spectrum of two different types of signals, audio and music.

Figure 11.

The spectrum of an average of 500 specimens: (a) audio, (b) music.

Figure 12.

Evolutionary spectrum of an average of 500 specimens: (a) audio, (b) music.

Now, let us discuss some of the main similarities and differences between the two types of signals.

Tonality. By tone, we mean a single harmonic, i.e., a pure periodic sinusoid. Regardless of the type of instrument or music, the musical signal is composed of multiple tones; however, this is not the case for the voice signal [47, 52, 55, 56, 57].

Bandwidth. Normally, the audio signal has 90% of its power concentrated within frequencies lower than 4 kHz and limited to 8 kHz; however, music signal can extend its power to the upper limits of the ear’s response, which is 20 kHz [52, 58].

Alternating sequence. Audio exhibits an alternating sequence of noise-like segments, while music alternates in a more tonal shape. In other words, an audio signal is distributed through its spectrum more randomly than music is.

Power distribution. Normally, the power distribution of an audio signal is concentrated at frequencies lower than 4 kHz and collapses rapidly above this frequency. On the other hand, there is no specific shape for the power spectrum of music [59].

Dominant frequency. For a single talker, the dominant frequency can be determined accurately and uniquely; however, for a single musical instrument, only the average dominant frequency can be determined. With multiple musical instruments, the case is even worse.

Fundamental frequency. For a single talker, the fundamental frequency can be accurately determined. However, this is not the case for a single musical instrument.

Excitation patterns. The excitation signals (pitch) for audio usually exist only over a span of three octaves, while fundamental music tones can span up to six octaves [60].

Energy sequences. A reasonable generalization is that audio follows a pattern of high-energy conditions of voicing followed by low energy conditions, which the envelope of music is less likely to exhibit.

Tonal duration. The duration of vowels in audio is very regular, following the syllabic rate. Music exhibits a wider variation in tone lengths, not being constrained by the process of articulation. Hence, tonal duration would likely be a good discriminator.

Consonants. An audio signal contains many consonants, while music is usually continuous through time [33].

Zero crossing rate (ZCR). The ZCR of music is greater than that of audio; this can be used to design a discriminator [60].

In the frequency domain, there is strong overlap between audio and music signals, so no ordinary filter can separate them. As mentioned before, an audio signal may cover the spectrum between 0 and 4 kHz, with an average dominant frequency of 1.8747 kHz. However, the lowest fundamental frequency (A1) of a music signal is about 27.5 Hz, and the highest frequency, that of the tone C8, is around 4186 Hz. The reason is that musical instrument manufacturers try to bound music frequencies to the limits of the human voice in order to achieve strong consonance, which creates a strong frequency overlap. Moreover, music may propagate over the audible spectrum to cover more than the audible band of 20 kHz, with an average dominant frequency of 1.9271 kHz [25].

Table 3 summarizes the main similarities and differences between music and audio signals.

Key difference: Units of analysis
  Audio: phonemes.
  Music: notes (a finite set).

Key difference: Temporal structure
  Audio:
  • Short sample (40 ms–200 ms).
  • More steady state than dynamic.
  • Timing unstrained but variable.
  • Amplitude modulation rate for sentences is slow (~4 Hz).
  Music:
  • Longer sample: 600–1200 ms.
  • Mix of steady state (strings, winds) and transient (percussion).
  • Strong periodicity.

Key difference: Spectral structure
  Audio:
  • Largely harmonic (vowels, voiced consonants).
  • Tend to group in formants.
  • Some inharmonic stops.
  Music:
  • Largely harmonic and some inharmonic (percussion).

Key difference: Syntactic/semantic structure
  Audio:
  • Symbolic.
  • Productive.
  • Can be combined in a grammar.
  Music:
  • Symbolic.
  • Productive.
  • Combined in a grammar.

Table 3.

The main differences between audio and music signals.

3. Audio and music signals classification

The main classification approaches are discussed in this section. They can be categorized into three families: (1) time domain approaches, (2) frequency domain approaches, and (3) time-frequency domain approaches. A two-level music and audio classifier was developed by El-Maleh [61, 62], who used a combination of long-term features such as the variance, the differential parameters, the zero-crossing rate (ZCR), and time averages of spectral parameters. Saunders [60] proposed another two-level classifier, based on the short-time energy (STE) and the average ZCR features. In addition, Matityaho and Furst [63] developed a neural-network-based model for classifying music signals, designed on the basis of the functional performance of the human cochlea.

For audio detection, Hoyt and Wechsler [64] developed a neural-network-based model using the Fourier transform, Hamming filtering, and a logarithmic function as pre-processing, and then applied a simple threshold algorithm to detect audio, music, wind, traffic, or any interfering sound. To improve performance, they suggested wavelet-transform features for pre-processing. Their work is very similar to that of Matityaho and Furst [63, 64]. Scheirer and Slaney [65] examined 13 features, some of which were simple modifications of one another, and also tried combining them in several multidimensional classification forms. From these previous works, the most powerful discrimination features were the STE and the ZCR; therefore, the STE and the ZCR will be discussed thoroughly. Finally, the common classifiers of audio and music signals can be divided into the following approaches:

  1. The Time domain algorithms:

    1. The ZCR algorithm [1, 38, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 136]:

      1. The standard deviation of first order difference of the ZCR.

      2. The 3rd central moment of the mean of ZCR.

      3. The total number of zero crossings exceeding a specific threshold.

    2. The STE [60, 61, 62, 63, 64, 65, 66].

    3. The ZCR and the STE positive derivative [66, 73].

    4. The Pulse Metric [31, 59, 67, 68, 69].

    5. The number of silence [32, 60].

    6. The HMM (Hidden Markov Model) [70, 71, 72].

    7. The ANN (Artificial neural networks) [12, 49, 58, 63, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108].

    8. The Roll-Off Variance [31, 59].

  2. The Frequency-domain algorithms [32, 33, 34, 59, 100, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 136]:

    1. The Spectrum [31, 99]:

      1. The Spectral Centroid.

      2. The Spectral Flux Variance.

      3. The Spectral Centroid Mean and Variance.

      4. The Spectral Flux Mean and Variance.

      5. The Spectrum Roll-Off.

      6. The Signal Bandwidth.

      7. The Spectrum Amplitude.

      8. The Delta Amplitude.

    2. The Cepstrum [110]:

      1. The Cepstral Residual [110, 111, 112].

      2. The Variance of the Cepstral Residual [110, 111, 112].

      3. The Cepstral feature [110, 111, 112].

      4. The Pitch [82, 95, 96, 105, 106, 107, 113, 114].

      5. The Delta Pitch [76, 107].

  3. The Time-Frequency domain algorithms:

    1. The Spectrogram (or Sonogram) [13, 19, 74, 115].

    2. The Evolutionary Spectrum and the Evolutionary Bispectrum [68, 116, 117].

3.1 Time domain algorithms

3.1.1 The ZCR algorithm

The ZCR can be defined as the number of times the signal crosses the zero axis within a specific window. It is widely used because of its simplicity and robustness [38]. We may define the ZCR as in the following equation.


where Zn is the ZCR, N is the number of samples in one window, and sgn is the sign of the signal, such that sgn[x(n)] = 1 when x(n) > 0 and sgn[x(n)] = −1 when x(n) < 0. An essential note is that the sampling rate must be high enough to catch every crossing through zero. Another important step before evaluating the ZCR is to normalize the signal by subtracting its average value. It is clear from Eq. (1) that the value of the ZCR is proportional to the rate of sign changes in the signal, i.e., to the dominant frequency of x(n). Therefore, we may find that the ZCR of music is, in general, higher than that of audio, although this is less certain for unvoiced audio.
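As a concrete illustration of Eq. (1), the sketch below computes the ZCR of one window by counting sign changes after mean normalization; the exact framing convention is an assumption of the sketch.

```python
import numpy as np

def zero_crossing_rate(x) -> float:
    """ZCR of one window: fraction of adjacent sample pairs whose sign differs.

    The window is mean-normalized first, as the text recommends, so that a
    DC offset does not hide zero crossings.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    s = np.sign(x)
    s[s == 0] = 1.0  # treat exact zeros as positive so sgn is +/-1, as in Eq. (1)
    return 0.5 * float(np.mean(np.abs(np.diff(s))))

# A pure sinusoid crosses zero twice per period, so with f cycles in a window
# of N samples the ZCR is roughly 2*f/N; white noise gives a ZCR near 0.5.
n = np.arange(1000)
tone = np.sin(2 * np.pi * 50 * n / 1000)  # 50 cycles in 1000 samples
noise = np.random.default_rng(0).standard_normal(1000)
print(zero_crossing_rate(tone))   # roughly 0.1
print(zero_crossing_rate(noise))  # roughly 0.5
```

Consistent with the discussion above, a tonal signal's ZCR tracks its dominant frequency, while a noise-like (unvoiced) segment pushes the ZCR toward 0.5.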

Properties of ZCR:

The ZCR properties can be summarized as follow.

  1. The Principle of Dominant Frequency

    The dominant frequency of a pure sinusoid is the only value in its spectrum, and it is equal to the ZCR of the signal over one period. If we have a non-sinusoidal periodic signal, its dominant frequency is the frequency with the largest amplitude. The dominant frequency (ω0) can be evaluated as follows.


where N is the number of intervals, E{.} is the expected value, and Do is the ZCR per interval.

  • The Highest frequency

    Since D0 denotes the ZCR of a discrete-time signal Z(i), let us assume that Dn denotes the ZCR of the nth derivative of Z(i), i.e., D 1 is the ZCR of the first derivative of Z(i), D 2 is the ZCR of the second derivative of Z(i), and so on. Then, the highest frequency ωmax in the signal can be evaluated as follow.


    where N is the number of samples. If the sampling rate equals 11 kHz, then the change in ωmax can be ignored for i > 10.

  • The Lowest frequency

    Assuming that the time period between any two samples is normalized to unity, the derivative of Z(i) can be defined as ∇Z(i) = Z(i) − Z(i−1). Then, the ZCR of the nth derivative of Z(i) is denoted Dn. Now, let us define ∇+ as the positive (+ve) derivative of Z(i); then ∇+[Z(i)] can be defined as follows.


    Now, let us denote the ZCR of the nth +ve derivative of Z(i) by the symbol nD. Then we can find the lowest frequency ωmin of a signal as follows.


  • Measure of Periodicity

    A signal is said to be purely periodic if and only if.


    Using Eq. (6), it was found that music is more periodic than audio [44, 45, 46, 47, 55, 56, 57, 118].

  • The Ratio of High ZCR (RHZCR)

    It was found that the variation of the ZCR is more discriminative than the exact ZCR, so the RHZCR can be considered as one feature [66]. The RHZCR is defined as the ratio of the number of frames whose ZCR is above 1.5 times the average ZCR in a one-second window, and can be written as follows.


    where N is the number of frames per window, n is the index of the frame, sgn[.] is the sign function, and ZCR(n) is the zero-crossing rate at the nth frame. In general, audio signals consist of alternating voiced and unvoiced sounds at the syllable rate, while music does not have this kind of alternation. Therefore, from Eqs. (7) and (8), we may observe that the variation of the ZCR (or the RHZCR) of an audio signal is greater than that of music, as shown in Figure 13.

    Figure 13.

    Music and audio sharing some values [65].

    3.1.2 The STE algorithm

    The amplitude of the audio signal varies appreciably with time. In particular, the amplitude of unvoiced segments is generally much lower than the amplitude of voiced segments. The STE of the audio signal provides a convenient representation that reflects these amplitude variations. Unlike the audio signal, the music signal does not contain unvoiced segments, so the STE of a music signal is usually larger than that of audio [60]. The STE of a discrete-time signal s(n) can be defined as follows.


    where STEs in Eq. (9) is the total energy of the signal. The average power of s(n) is defined as.


    In general, signals can be classified into three types: an energy signal, which has non-zero and finite energy; a power signal, which has non-zero and finite power; and a third type that is neither an energy signal nor a power signal (see Table 4). Now, let us define another sequence {fs(n; m)} as follows.

    Energy signal (0 < Es < ∞):
      Transient:        s(n) = α^n u(n),                 |α| < 1
      Finite sequence:  s(n) = e^(βn) [u(n) − u(n−255)], |β| < ∞
    Power signal (0 < Ps < ∞):
      Constant:         s(n) = α,                        −∞ < α < ∞
      Periodic:         s(n) = α sin(nω0 + φ),           −∞ < α < ∞
      Stochastic:       s(n) = rand(seed)
    Neither energy nor power signal:
      Zero:             s(n) = 0
      Blow-up:          s(n) = α^n u(n),                 |α| > 1

    Table 4.

    Types of signals.


    where w(n) is a window of length N whose value is zero outside [0, N−1]. Therefore, fs(n, m) will be zero outside [m−N+1, m].

    Deriving short term features

    The silence and unvoiced periods in audio can be considered stochastic background noise. Now, let us define Fs as a feature of {s(n)}, mapping its values from the Hilbert space H to the set of complex numbers C, such that.


    The long-term feature of {s(n)} may be defined as follow.


    The long-term average, when applied to energy signals, will have zero values, however, it is appropriate for power signals. Eq. (13) can be re-written as follow.


    This results in a family of mappings. If each member of the family is indexed by a parameter λ, then we can use the notation Fs(λ). The discrete-time Fourier transform is an example of a parametric long-term feature. The long-term feature can be of the form.


    where M in Eq. (15) is the mapping sequence: it maps {s(n)} to another sequence. The long-term feature Fs(λ) is defined as L∘M, a composition of the functions L and M. If Fs(λ) is the long-term feature of Eq. (12), then the short-term feature Fs(λ, m) at time period m can be constructed as follows:

    • Define a frame as in Eq. (11).

    • Apply the long-term feature transformation to the frame sequence as in Eq. (16).


    Low Short Time Energy Ratio (LSTER)

    As with the ZCR, a variation measure is selected [33]. Here, the LSTER is used to represent the variation of the STE. The LSTER is defined as the ratio of the number of frames whose STE is less than 0.5 times the average STE in a one-second window, as in Eq. (17).




    where N is the total number of frames, STE(n) is the STE at the nth frame, and STEav in Eq. (18) is the average STE in a one-second window.
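A minimal sketch of Eqs. (17) and (18) follows: compute the STE per frame and take the fraction of frames below half the window-average STE. Non-overlapping frames and the frame length are assumptions of the sketch.

```python
import numpy as np

def short_time_energy(frame) -> float:
    """STE of one frame: the sum of squared samples."""
    frame = np.asarray(frame, dtype=float)
    return float(np.sum(frame ** 2))

def lster(signal, frame_len: int) -> float:
    """Low Short-Time Energy Ratio over one window, as in Eq. (17):
    the fraction of frames whose STE is below 0.5 times the average STE."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    energies = np.array([
        short_time_energy(signal[i * frame_len:(i + 1) * frame_len])
        for i in range(n_frames)
    ])
    return float(np.mean(energies < 0.5 * energies.mean()))
```

Speech, with its silent and unvoiced stretches, yields a higher LSTER than music, which is the basis of the discriminator described above.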

    3.1.3 The effect of positive derivation

    Figure 14 shows the preprocessing flow on Z(i) using the positive derivation concept (∇+), which provided some improvement in the discrimination process [66].

    Figure 14.

    The preprocessing using the +ve derivative before evaluating the ZCR.

    This pre-processing increases the ZCR of music and reduces the ZCR of audio, at the expense of some delay. The averages of the ZCR for speech, mixture, and music after applying the +ve derivative of order 50 are shown in Figure 15.

    Figure 15.

    The average ZCR of speech, mixture, and music, after pre-processing with the +ve derivative [66].
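The preprocessing of Figure 14 can be sketched as below. The exact form of the ∇+ operator is an assumption here: it is taken as the positive part of the first difference, applied repeatedly (order 50 in the experiment above).

```python
import numpy as np

def positive_derivative(x):
    """One application of the assumed +ve derivative: max(x[i] - x[i-1], 0).

    The first sample is compared with itself, so the output length matches
    the input length.
    """
    x = np.asarray(x, dtype=float)
    d = np.diff(x, prepend=x[0])
    return np.maximum(d, 0.0)

def preprocess(x, order: int = 50):
    """Apply the +ve derivative `order` times before evaluating the ZCR."""
    x = np.asarray(x, dtype=float)
    for _ in range(order):
        x = positive_derivative(x)
    return x
```

The ZCR would then be computed on `preprocess(x)` instead of on the raw window.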

    3.1.4 Artificial neural network (ANN) approach

    The ANN is a multipurpose technique that has been used to implement many algorithms [14, 35, 63, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 98, 113], especially for classification problems [16, 49, 95, 96, 97, 98, 99, 107, 108, 119, 120]. A multi-layer ANN has been used in many classification tools since it can represent nonlinear decision surfaces.

    3.2 Algorithms in the frequency domain

    3.2.1 The spectrum approaches

    Spectral flux mean and variance

    This feature characterizes the change in the shape of the spectrum by measuring the frame-to-frame spectral difference. Audio signals undergo fewer frame-to-frame changes than music, so the spectral flux values of audio are lower than those of music.

    The spectral flux (SF), sometimes called the delta spectrum magnitude, is defined as the 2-norm of the difference vector between consecutive spectral amplitudes, as in Eq. (19).


    where X(k) is the signal power and k is the corresponding frequency. Another definition of the SF is also described as follow.


    where A(n, k) in Eq. (20) is the discrete Fourier transform (DFT) of the n th frame of the input signal and can be described as in Eq. (21).


    where x(m) is the original audio data, L is the window length, M is the order of the DFT, N is the total number of frames, δ is an arbitrary constant, and w(m) is the window function. Scheirer and Slaney [65] found the SF feature very useful in discriminating audio from music. Figure 16 shows that the variances are lower for music than for audio, while the means are lower for audio than for music. Rossignol et al. [109] computed the means and variances over one-second segments using frames of length 18 milliseconds.

    Figure 16.

    3D histogram normalized features (the mean and the variance of spectral flux) of: (a) music signal, (b) audio signal [109].
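A minimal sketch of the spectral flux as the 2-norm of the difference between consecutive magnitude spectra, in the spirit of Eq. (19); the framing and FFT details are assumptions.

```python
import numpy as np

def spectral_flux(frames) -> np.ndarray:
    """Frame-to-frame spectral flux: the 2-norm of the difference between
    consecutive magnitude spectra (rows of `frames` are time-domain frames)."""
    mags = np.abs(np.fft.rfft(np.asarray(frames, dtype=float), axis=1))
    return np.linalg.norm(np.diff(mags, axis=0), axis=1)

# Identical (steady) frames give zero flux; frames that change from one to
# the next give large flux. The mean and variance of this flux sequence are
# the features discussed above.
rng = np.random.default_rng(1)
varying = rng.standard_normal((50, 256))                                 # changing frames
steady = np.tile(np.sin(2 * np.pi * 8 * np.arange(256) / 256), (50, 1))  # repeated tone
print(spectral_flux(steady).mean())   # 0.0
print(spectral_flux(varying).mean())  # large
```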

    Rossignol et al. [109] tested three approaches to classify the segments: the k-nearest-neighbors (kNN) classifier with k = 7, the Gaussian mixture model (GMM), and the ANN classifier. Their results using the mean and the variance of the SF are shown in Table 5.


    Table 5.

    Percentage of misclassified segments [109].

    The mean and variance of the spectral centroid

    In the frequency domain, the spectral centroid feature describes the center frequency around which most of the power of the signal is concentrated. In audio signals, the pitches are concentrated in a narrow range of low frequencies. In contrast, music signals contain higher frequencies, which result in higher spectral means, i.e., higher spectral centroids. For a frame at time t, the spectral centroid can be evaluated as follows.


    where X(k) is the power of the signal at the corresponding frequency band k. When the mean and the variance of the SF are combined with the mean and the variance of the SC in Eq. (22), and with the mean and the variance of the ZCR, the results of Table 6 are obtained.


    Table 6.

    Percentage of misclassified segments [109].

    Energy at 4 Hz modulation

    An audio signal has an energy peak centered at the 4 Hz syllabic rate; therefore, a 2nd-order band-pass filter with a center frequency of 4 Hz is used. Although audio signals have higher energy at 4 Hz, some bass music instruments were found to have modulation energy around this frequency as well [65, 109].

    Roll-off point

    In the frequency domain, the roll-off point feature is the frequency value below which 95% of the power of the signal is contained. The value of the roll-off point can be found as follows [65, 109].


    where the left-hand side of Eq. (23) is the sum of the power up to the frequency value V, the right-hand side of Eq. (23) is 95% of the total power of the signal in the frame, and X(k) is the DFT of x(t).
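The spectral centroid of Eq. (22) and the roll-off point of Eq. (23) can be sketched as follows. Frequencies are expressed in DFT-bin units here, which is a simplification of the sketch.

```python
import numpy as np

def spectral_centroid(mag) -> float:
    """Power-weighted mean frequency, as in Eq. (22), in bin units."""
    power = np.asarray(mag, dtype=float) ** 2
    k = np.arange(len(power))
    return float(np.sum(k * power) / np.sum(power))

def rolloff_point(mag, fraction: float = 0.95) -> int:
    """Smallest bin V whose cumulative power reaches `fraction` of the
    total power, as in Eq. (23)."""
    power = np.asarray(mag, dtype=float) ** 2
    cumulative = np.cumsum(power)
    return int(np.searchsorted(cumulative, fraction * cumulative[-1]))

# With all the power in bin 3, both features point at bin 3.
mag = np.zeros(16)
mag[3] = 2.0
print(spectral_centroid(mag))  # 3.0
print(rolloff_point(mag))      # 3
```

Multiplying the bin index by the bin width (sampling rate divided by DFT length) converts either feature back to hertz.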

    3.2.2 Cepstrum

    The cepstrum of a signal can be defined as the inverse DFT of the logarithm of the spectrum of the signal. Music signals have higher cepstrum values than speech signals. The complex cepstrum is defined in the following equations [110, 111, 112].


    and then.


    where X(e^jω) is the DFT of the sequence x(n).
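Eqs. (24) and (25) can be illustrated with the sketch below. For simplicity, it computes the real cepstrum (the inverse DFT of the log magnitude spectrum), which drops the unwrapped-phase term that the full complex cepstrum requires; that simplification is an assumption of the sketch.

```python
import numpy as np

def real_cepstrum(x) -> np.ndarray:
    """Real cepstrum: inverse DFT of the log magnitude spectrum of x."""
    spectrum = np.fft.fft(np.asarray(x, dtype=float))
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # epsilon guards against log(0)
    return np.real(np.fft.ifft(log_mag))

# A unit impulse has a flat spectrum, so its cepstrum is (numerically) zero.
impulse = np.zeros(64)
impulse[0] = 1.0
c = real_cepstrum(impulse)
print(c.shape)  # (64,)
```

In practice the cepstral residual features cited above are derived from such cepstra computed frame by frame.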

    3.2.3 Summary

    Table 7 summarizes the percentage error of a simulation done per each feature. Latency refers to the amount of past input data required to calculate the feature.

    Feature                       Latency   Error
    4 Hz modulation energy        1 sec     12 +/- 1.7%
    Low energy                    1 sec     14 +/- 3.6%
    Roll-off                      1 frame   46 +/- 2.9%
    Roll-off variance             1 sec     20 +/- 6.4%
    Spectral centroid             1 frame   39 +/- 8.0%
    Spectral centroid variance    1 sec     14 +/- 3.7%
    Spectral flux                 1 frame   39 +/- 1.1%
    Spectral flux variance        1 sec     5.9 +/- 1.9%
    ZCR                           1 frame   38 +/- 4.6%
    ZCR variance                  1 sec     18 +/- 4.8%
    Cepstrum residual             1 frame   37 +/- 7.5%
    Cepstrum residual variance    1 sec     22 +/- 5.7%
    Pulse metric                  5 sec     18 +/- 2.9%

    Table 7.

    Latency and univariate discrimination performance for each feature [65].

    Scheirer and Slaney [65] evaluated their models using a 20-minute data set of music and audio. The data set consists of 80 samples, each 15 seconds long. They collected the samples using a 16-bit monophonic FM tuner at a sampling rate of 22.05 kHz, from a variety of stations with different content styles and different noise levels, over a period of three days in the San Francisco Bay Area. They also state that the speech samples include both male and female speakers.

They also recorded samples of many types of music, such as pop, jazz, salsa, country, classical, reggae, various sorts of rock, and various non-Western styles [29, 65]. They then used several features together in a spatial partitioning classifier. Table 8 summarizes their results.

Subset | All features | Best 8 | Best 3 | Var. of spectral flux only | Fast 5
Audio % error | 5.8 ± 2.1 | 6.2 ± 2.2 | 6.7 ± 1.9 | 12 ± 2.2 | 33 ± 4.7
Music % error | 7.8 ± 6.4 | 7.3 ± 6.1 | 4.9 ± 3.7 | 15 ± 6.4 | 21 ± 6.6
Total % error | 6.8 ± 3.5 | 6.7 ± 3.3 | 5.8 ± 2.1 | 13 ± 3.5 | 27 ± 4.6

    Table 8.

    Performance for various subsets of features.

The features used in Best 8 are the 4 Hz modulation energy, the variance features, the pulse metric, and the low-energy frame [67, 121]. In Best 3, they used the pulse metric, the 4 Hz energy, and the variance of the spectral flux. In Fast 5, they used the five basic features. From the results shown in Table 8, we conclude that it is not necessary to use all features in order to achieve good classification, so in real time a well-performing system may be built using only a few features. A more detailed discussion can be found in [29, 65, 67, 121].

    3.3 Algorithms in the time-frequency domain

    3.3.1 Spectrogram (or sonogram)

The spectrogram is an example of a time-frequency distribution, and it has proven to be a good classical tool for analyzing audio signals [13, 19, 74, 115]. The spectrogram (or sonogram) of a signal x(n) can be defined as follows:

S(n,\omega) = \left| \sum_{m=0}^{N-1} x(m)\, W(n-m)\, e^{-j\omega m} \right|^2    (26)

where N is the length of the sequence x(n), and W(n) is a specific window.

The spectrogram can be used to discriminate audio from music signals; however, it may have a high percentage error, because it depends on the strength of the frequencies in the tested samples. Figure 17 depicts two examples of spectrograms of audio and music signals.

    Figure 17.

(a) Audio spectrogram, (b) music spectrogram.
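A direct NumPy sketch of Eq. (26) is shown below, computing windowed DFT frames over fixed hops. The frame length, hop size, and the test chirp are illustrative assumptions for the example.

```python
import numpy as np

def spectrogram(x, frame_len=256, hop=128):
    """Magnitude-squared STFT: |DFT of each windowed frame|^2 (time x freq)."""
    window = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * window
              for i in range(0, len(x) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=-1)) ** 2

# A chirp's energy peak moves to higher bins over time
fs = 8000
t = np.arange(fs) / fs
chirp = np.sin(2 * np.pi * (200 + 1800 * t) * t)  # sweep from 200 Hz upward
S = spectrogram(chirp)
first_peak = np.argmax(S[0])    # peak bin of the first frame
last_peak = np.argmax(S[-1])    # peak bin of the last frame
print(first_peak < last_peak)   # True: frequency rises along the chirp
```

This time-varying frequency track is exactly the kind of structure a spectrogram-based classifier compares between speech and music frames.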

    3.3.2 Evolutionary spectrum (ES)

The spectral representation of a stationary signal may be viewed as an infinite sum of sinusoids with random amplitudes and phases, as described in Eq. (27):

x(n) = \int_{-\pi}^{\pi} e^{j\omega n} \, dZ(\omega)    (27)

where Z(ω) is a process with orthogonal increments, i.e.,

E\{ dZ(\omega)\, dZ^{*}(\Omega) \} = S(\omega)\, \delta(\omega - \Omega)\, d\omega \, d\Omega    (28)
and S(ω) in Eq. (28) is the spectrum of e(n) [68]. Since the audio signal is, in general, nonstationary, we will use the Wold-Cramer (WC) representation of a nonstationary signal. WC considers the discrete-time nonstationary process {x(n)} as the output of a causal, linear, time-variant (LTV) system driven by a white noise input e(n) with zero mean and unit variance, i.e.,

x(n) = \sum_{m=-\infty}^{n} h(n,m)\, e(m)    (29)

where h(n,m) is defined as the unit impulse response of the LTV system. Substituting e(n) into x(n) of Eq. (29) (assuming S(ω) = 1 for white noise), we get

x(n) = \int_{-\pi}^{\pi} H(n,\omega)\, e^{j\omega n} \, dZ(\omega)    (30)

where H(n,ω) in Eq. (30) is the time-frequency transfer function of the LTV system, defined as

H(n,\omega) = \sum_{m=-\infty}^{n} h(n,m)\, e^{-j\omega (n-m)}    (31)
and the instantaneous power of x(n) is given by

E\{ |x(n)|^2 \} = \frac{1}{2\pi} \int_{-\pi}^{\pi} |H(n,\omega)|^2 \, d\omega    (32)

and then the Wold-Cramer ES is defined as

S(n,\omega) = |H(n,\omega)|^2    (33)
The ES S(n,ω) in Eq. (33) was found to be a good classifier for distinguishing audio from music signals [68, 117]. Because of the extensive computation required by time-frequency spectra, they may be most useful in off-line classification and analysis. The ESs of music and audio signals are shown in Figure 18(a) and (b), respectively. The suppression of the amplitude for audio might be due to its Gaussianity.

    Figure 18.

    (a) The ES of a music signal, (b) the ES of an audio signal [68].
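To make Eqs. (31) and (33) concrete, the following sketch evaluates H(n, ω) and S(n, ω) for a hypothetical LTV system whose truncated impulse response h(n, k) = a(n)^k (k the lag) is a one-pole smoother with a slowly drifting coefficient. The sizes, lag truncation, and coefficient trajectory are all illustrative assumptions, not part of the chapter's model.

```python
import numpy as np

# Hypothetical LTV system: h(n, k) = a(n)**k, truncated at K lags
K = 64                                # number of lags kept
N = 128                               # number of time instants
n = np.arange(N)
a = 0.3 + 0.6 * n / (N - 1)           # pole drifts from 0.3 to 0.9
k = np.arange(K)
omega = np.linspace(0, np.pi, 256)

# H(n, w) = sum_k h(n, k) e^{-j w k};  ES: S(n, w) = |H(n, w)|^2
h = a[:, None] ** k[None, :]                     # shape (N, K)
H = h @ np.exp(-1j * np.outer(k, omega))         # shape (N, 256)
S = np.abs(H) ** 2

# As the pole approaches 1, the spectrum concentrates at low frequencies,
# so the ES changes along the time axis even though e(n) is stationary
print(S[0, 0] < S[-1, 0])   # True
```

The design choice here is that nonstationarity enters only through h(n, k); the driving noise is white and stationary, exactly as in the Wold-Cramer model.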

    4. Separation of audio and music signals

Since the separation of audio and music signals is more complicated than their classification, in this section we introduce only two approaches [7, 8, 9, 10, 11, 12, 13, 22, 74, 134, 135, 136]. The first is the independent component analysis (ICA) approach combined with an ANN. The second is the pitch-cancelation approach. A block diagram of a classifier integrated with a separator is depicted in Figure 19.

    Figure 19.

    A block diagram of a classifier integrated with a separator.

    4.1 ICA with ANN separation approach

In [13, 20, 21, 115, 122], Wang and Brown proposed a model for an audio segregation algorithm. Their model consists of preprocessing using cochlear (gammatone) filtering, a correlogram formed from autocorrelation functions, and feature extraction. The impulse response of the gammatone filters is represented as

g_i(t) = g(i)\, t^{\,n-1} e^{-2\pi b t} \cos(2\pi f_i t + \phi_i)\, U(t), \quad 1 \le i \le N

where n is the filter order, N is the number of channels, and U is the unit step function. Therefore, the gammatone system can be considered a causal, time-invariant system with an infinite impulse response. For the i-th channel, fi is the center frequency of the channel, ϕi is the phase of the channel, b is the rate of decay of the impulse response, and g(i) is an equalizing gain adjusted for each filter. Figure 20 depicts the impulse response of the gammatone system, while Figure 21 depicts the block diagram of the Wang and Brown model.

    Figure 20.

Impulse response of the 4th-order gammatone system: (a) in the time domain when i = 1, fi = 80 Hz; (b) in the time domain when i = 5, fi = 244 Hz; (c) in the frequency domain for the first five filters (i.e., i = 1 to i = 5) with gain g(i) set to unity.

    Figure 21.

    A block diagram of Wang and Brown model.
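The gammatone impulse response above can be sketched directly in NumPy. The bandwidth parameter b = 125 Hz, the 1 kHz center frequency, and the 50 ms duration below are illustrative assumptions for the example, not Wang and Brown's actual parameter values.

```python
import numpy as np

def gammatone_ir(fs, fc, order=4, b=125.0, phase=0.0, gain=1.0, dur=0.05):
    """Gammatone impulse response for t >= 0 (a sketch).

    g(t) = gain * t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phase)
    The unit step U(t) is implicit: we only evaluate t >= 0.
    """
    t = np.arange(int(dur * fs)) / fs
    return (gain * t ** (order - 1) * np.exp(-2 * np.pi * b * t)
            * np.cos(2 * np.pi * fc * t + phase))

fs = 16000
ir = gammatone_ir(fs, fc=1000.0)
# The envelope starts at zero, rises to a peak, then decays toward zero,
# consistent with a causal, stable (but infinitely long) impulse response
```

The t^(order-1) factor delays the envelope peak away from t = 0, which is what gives gammatone filters their cochlea-like onset behavior.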

The Wang and Brown model has some drawbacks. The first is its complexity: the model needs high-specification hardware to perform the calculations. In [20], Andre reported that the Wang and Brown model needs to be improved. The ICA method can be used for separation if mixtures from two sources are available, assuming that the two signals from the two different sources are statistically independent [123, 124, 132, 133, 137]. In [19], Takigawa tried to improve the performance of the Wang and Brown model, using the short-time Fourier transform (STFT) in the input stage and spectrogram values instead of the correlogram; however, they did not report the amount of improvement. Similar work for separating the voiced audio of two talkers speaking simultaneously at similar intensities in a single channel, using pitch-peak canceling in the cepstrum domain, was done by Stubbs [8].

    4.2 The pitch cancelation

The pitch-cancelation method is widely used in noise reduction. A notable attempt to separate two talkers speaking simultaneously at similar intensities in a single channel, in other words, separation of two talkers without any restriction, was introduced by Stubbs [8]. For a given talker, voiced sounds such as the letters A and R have strong periodic content; in the frequency domain these components have low amplitudes, but they appear as a prominent pitch peak in the cepstrum domain. If these components are deleted by replacing the five cepstral samples centered at the pitch peak with zeros, the audio segment may be attenuated or distorted completely. A typical example of the cepstra of audio and music signals is depicted in Figure 22 for 5-second signals. The logarithm compresses high amplitudes and expands low ones, so values near zero become very large in magnitude after the logarithm.

    Figure 22.

(a) A typical 5-second audio signal in the cepstrum domain; the pitch peak appears near zero. (b) A typical 5-second music signal in the cepstrum domain.

    5. Conclusions

In this chapter, a general review of the common classification and separation algorithms used for speech and music signals was presented, and some were discussed thoroughly. The classification approaches were divided into three categories. The first category included most of the real-time approaches: the ZCR, the STE, the ZCR and the STE with positive derivative, some of their modified versions, and the neural networks. The second category included most of the frequency-domain approaches, such as the spectral centroid and its variance, the spectral flux and its variance, the roll-off of the spectrum, the cepstral residual, and the delta pitch. The last category introduced two time-frequency approaches, namely the spectrogram and the evolutionary spectrum. It has been noticed that the time-frequency classifiers provide excellent and robust discrimination of speech from music in digital audio. The choice of features depends on the application: the algorithms of the first category are faster, since the processing is done in real time, whereas those of the second are more precise. The time-frequency approaches have not been discussed thoroughly in the literature and still need more research and elaboration. Lastly, we may conclude that many classification algorithms have been proposed in the literature, but few have been proposed for separation. The algorithms introduced in this chapter are summarized in Table 9.

Time domain | Frequency domain (spectrum) | Frequency domain (cepstrum) | Time-frequency domain
ZCR | Spectral centroid | Cepstral residual | Spectrogram (sonogram)
STE | Spectral flux | Variance of the cepstral residual | Evolutionary spectrum
Roll-off variance | Spectrum roll-off | Cepstral feature | Evolutionary bispectrum
Pulse metric | Signal bandwidth | Pitch |
Number of silences | Spectrum amplitude | Delta pitch |
HMM | Delta amplitude | |

    Table 9.

    Summary of the classification and separation algorithms.


    © 2020 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

    How to cite and reference


    Abdullah I. Al-Shoshan (December 15th 2020). Classification and Separation of Audio and Music Signals [Online First], IntechOpen, DOI: 10.5772/intechopen.94940. Available from:
