Open access peer-reviewed chapter

Classification and Separation of Audio and Music Signals

Written By

Abdullah I. Al-Shoshan

Submitted: May 6th, 2020 Reviewed: November 6th, 2020 Published: December 15th, 2020

DOI: 10.5772/intechopen.94940



This chapter addresses the classification and separation of audio and music signals, a very important and challenging research area. The importance of the classification of a stream of sounds arises from the need to build two different libraries: a speech library and a music library. The separation process, on the other hand, is sometimes needed in a cocktail-party problem to separate speech from music and remove the undesired one. In this chapter, some existing algorithms for the classification and separation processes are presented and discussed thoroughly. The classification algorithms are divided into three categories: the first includes most of the time-domain approaches, the second includes most of the frequency-domain approaches, and the third introduces some of the approaches in the time-frequency distribution. The time-domain approaches discussed in this chapter are the short-time energy (STE), the zero-crossing rate (ZCR), a modified version of the ZCR and the STE with positive derivative, neural networks, and the roll-off variance. The frequency-spectrum approaches are, specifically, the roll-off of the spectrum, the spectral centroid and its variance, the spectral flux and its variance, the cepstral residual, and the delta pitch. The time-frequency-domain approaches have not yet been tested thoroughly for the classification and separation of audio and music signals; therefore, the spectrogram and the evolutionary spectrum are introduced and discussed. In addition, some algorithms for the separation and segregation of music and audio signals, such as independent component analysis, pitch cancelation, and artificial neural networks, are introduced.


Keywords

  • audio signal
  • music signal
  • classification
  • separation
  • time domain
  • frequency domain
  • time-frequency domain

1. Introduction

Audio signal processing is an important subfield of signal processing that is concerned with the electronic manipulation of audio signals [1, 2, 3, 4, 5, 6]. The problem of discriminating music from audio has become increasingly important as automatic audio signal recognition (ASR) systems are increasingly applied in real-world multimedia [7]. The human ear can easily distinguish audio without any influence from the mixed music [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]. Due to new methods for the analysis and synthesis of audio signals, the processing of musical signals has gained particular weight [16, 24]; therefore, classical sound analysis methods may be used in the processing of musical signals [25, 26, 27, 28]. Many types of musical signals exist, such as rock, pop, classical, country, Latin, Arabic, disco, jazz, and electronic music [29]. The hierarchy of sound signal types is shown in Figure 1 [30].

Figure 1.

Types of audio signals.

An audio signal changes randomly and continuously through time. As an example, music and audio signals have strong energy content at low frequencies and weaker energy content at high frequencies [31, 32]. Figure 2 depicts generalized time and frequency spectra of audio signals [33]. The maximum frequency fmax varies according to the type of audio signal: in telephone transmission fmax equals 4 kHz; it is 5 kHz in mono-loudspeaker recording, 6 kHz in multi-loudspeaker (stereo) recording, 11 kHz in FM broadcasting, and 22 kHz in CD recording.

Figure 2.

Generalized frequency spectrum for audio signal [33].

Acoustically speaking, the audio signals can be classified into the following classes:

  1. A single talker at a specific time [34].

  2. Singing without music.

  3. Mixture of background music and single talker audio.

  4. Songs that are a mixture of music with a singer voice.

  5. Pure music signal without any audio component.

  6. Complex sound mixture like multi-singers or multi-speakers with multi-music sources.

  7. Non-music and non-audio signals, such as fan, motor, car, and jet sounds.

  8. Audio signal that is a mixture of more than one speaker talking simultaneously [8].

  9. Abnormal music, such as a single-word cadence, a human whistle, or reverse reverberation [4, 34, 35, 36, 37, 38].


2. Analysis of audio and music signals

2.1 Properties of audio signal

2.1.1 Representation of audio signal

The letter symbols used for writing are not adequate, as the way they are pronounced varies; for example, the letter “o” in English is pronounced differently in the words “pot”, “most”, and “one”. It is almost impossible to tackle the audio classification problem without first establishing some way of representing spoken utterances by a group of symbols representing the sounds produced [39, 40, 41, 42, 43]. The phonemes in Table 1 are divided into groups based on the way they are produced [44], forming a set of allophones [45]. In some tonal languages, such as Vietnamese and Mandarin, the intonation determines the meaning of each word [46, 47, 48].


Table 1.

Phoneme categories of British English and examples of words in which they are used [44].

2.1.2 Production of audio signal

Since the range of sounds that any system can produce is limited [39, 40, 41, 42, 43, 44], speech production begins in the lungs: increased lung pressure pushes air up the trachea, at the top of which sits the larynx. By changing the shape of the vocal tract, different sounds are produced, so the fundamental frequency changes with time. The spectrogram (or sonogram) for the sentence “What can I have for dinner tonight?” is shown in Figure 3.

Figure 3.

A sonogram for the sentence “What can I have for dinner tonight?” [43].

The way that humans recognize and interpret audio signals has been considered by many researchers [1, 25, 39]. Many researchers have shown that the two lowest formants are necessary to produce a complete set of English vowels, and that the three lowest formants in frequency are necessary for good audio intelligibility. As the number of formants increases, more natural sounds are produced. However, when we deal with continuous audio, the problem becomes more complex. The history of audio signal identification can be found in [1, 25, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48].

2.2 Properties of music signal

2.2.1 Representation of music signal

There are two kinds of tone structures in a music signal. The first is a simple tone formed of a single sinusoidal waveform; the second is a more complex tone consisting of more than one harmonic [31, 49, 50, 51, 52]. The spectrum of a music signal has twice the bandwidth of the audio spectrum, and most of the power of an audio signal is concentrated at the lower frequencies. Melodists and musicians divide the musical scale into eight parts, each called an octave, and each octave is divided into seven parts called tones [30]. A tempered scale for different instruments is shown in Table 2. These tones, shown in Table 2, are named (Do, Re, Mi, Fa, Sol, La, and Si) or simply (A, B, C, D, E, F, and G). The tone (A1) in the first octave carries the fundamental frequency of the first tone; every first tone in an octave doubles the frequency of the first tone of the previous octave, i.e., An = 2^(n−1)·A1 (and likewise Bn = 2^(n−1)·B1, and so on), where n ∈ {2, 3, …, 8}.

A (Hz)      B (Hz)      C (Hz)      D (Hz)      E (Hz)      F (Hz)      G (Hz)
A1 27.5     B1 30.863   C1 32.703   D1 36.708   E1 41.203   F1 43.654   G1 48.99
A2 55       B2 61.735   C2 65.406   D2 73.416   E2 82.407   F2 87.307   G2 97.99
A3 110      B3 123.47   C3 130.81   D3 146.83   E3 164.81   F3 174.61   G3 196
A4 220      B4 246.94   C4 261.63   D4 293.66   E4 329.63   F4 349.23   G4 392
A5 440      B5 493.88   C5 523.25   D5 587.33   E5 659.26   F5 698.46   G5 783.9
A6 880      B6 987.77   C6 1046.5   D6 1174.7   E6 1318.5   F6 1396.9   G6 1568
A7 1760     B7 1975.5   C7 2093     D7 2349.3   E7 2637     F7 2793     G7 3136
A8 3520     B8 3951.1   C8 4186

Table 2.

Frequencies of notes in the tempered scale [3].
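The octave-doubling rule can be checked against Table 2 in a few lines; a minimal sketch (the function name is ours, and A1 = 27.5 Hz is taken from the table):

```python
def octave_freq(n, base=27.5):
    """Frequency of tone A in the n-th octave: each octave doubles the
    previous one, i.e. A_n = 2**(n - 1) * A_1 with A_1 = 27.5 Hz."""
    return base * 2 ** (n - 1)

# Reproduce the A column of Table 2
a_column = [octave_freq(n) for n in range(1, 9)]
```

Running this yields 27.5, 55, 110, …, 3520 Hz, matching the doubling pattern of the A column of the table.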

From Table 2, the highest tone C8 occurs at a frequency of 4186 Hz, which is close to the highest frequency produced by the human sound system; this leads musical instrument manufacturers to try their best to bound music frequencies to the limits of the human sound system in order to achieve strong concord [35, 53, 54]. In the real world, musical instruments cover more frequencies than the audible band, which is limited to 20 kHz.

2.2.2 Production of music signal

The most common concept of tone quality depends on subjective acoustic properties, regardless of partials or formants, and the production of music depends mainly on the kind of musical instrument [53, 54]. These instruments can be summarized as follows:

  1. The string musical instrument. Its tones are produced by vibrating chords made from horsetail hair or other manufactured material such as copper or plastic. Every vibrating chord has its own fundamental frequency, producing complex tones that cover most of the audible band. Figure 4 shows string instruments.

  2. The brass musical instrument. The brass musical instrument depends on blowing air, like the woodwind. Its shape looks like an animal horn, and it has manual valves to control the cavity size. The brass musical instrument has a huge number of nonharmonic components in its spectrum. Figure 5 shows brass instruments.

  3. The woodwind musical instrument. A woodwind instrument consists of a cylindrical tube open at both ends. Some woodwind instruments may use a small vibrating piece of copper to produce tones. It produces a large number of harmonic tones. Figure 6 shows woodwind instruments.

  4. The percussion musical instrument. Examples of percussion instruments are the piano, snare drum, chimes, marimba, timpani, and xylophone. Most of the power of tones in percussion instruments lies in non-harmonic components. Figure 7 shows some percussion instruments.

  5. The electronic musical instrument. The most robust and accurate electronic musical instrument is the organ. It has a large keyboard and a memory that can store notes and use their frequencies as basic cadences or tones. Disco, pop, rock, and jazz depend heavily on the organ [29, 35, 36, 37, 38], although the organ is not the only electronic music producer. If electronic musical instruments are used for producing music, the tone quality measure of the fundamental frequency or harmonics is not needed. Figure 8 shows an example of an electronic organ.

Figure 4.

String instruments.

Figure 5.

Brass instruments.

Figure 6.

Woodwind instruments.

Figure 7.

Percussion instruments.

Figure 8.

Electronic organ.

2.3 Characteristics and differences between audio and music

The audio signal is a slowly time-varying signal in the sense that, when examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are stationary. A simple example of an audio signal is shown in Figure 9.

Figure 9.

An example of an audio signal of speaking the two-second-long phrase “Very good night”: (a) time domain, (b) magnitude, (c) phase.

Figure 10 is a typical example of a music segment. It is clear from the two spectra in Figures 9 and 10 that we can distinguish between the two types of signals.

Figure 10.

A 2-second-long music signal: (a) time domain, (b) spectrum, (c) phase.

Figures 11 and 12 depict the evolutionary spectrum of two different types of signals, audio and music.

Figure 11.

The spectrum of an average of 500 specimens: (a) audio, (b) music.

Figure 12.

Evolutionary spectrum of an average of 500 specimens: (a) audio, (b) music.

Now, let us discuss some of the main similarities and differences between the two types of signals.

Tonality. By tone, we mean a single harmonic of a pure periodic sinusoid. Regardless of the type of instrument or music, the musical signal is composed of multiple tones; however, this is not the case for the voice signal [47, 52, 55, 56, 57].

Bandwidth. Normally, the audio signal has 90% of its power concentrated within frequencies lower than 4 kHz and limited to 8 kHz; however, music signal can extend its power to the upper limits of the ear’s response, which is 20 kHz [52, 58].

Alternative sequence. Audio exhibits an alternating sequence of noise-like and tone-like segments, while music alternates in a more tonal shape. In other words, the audio signal is distributed through its spectrum more randomly than music.

Power distribution. Normally, the power distribution of an audio signal is concentrated at frequencies lower than 4 kHz, and then collapsed rapidly above this frequency. On the other hand, there is no specific shape of the power of music spectrum [59].

Dominant frequency. For a single talker, the dominant frequency can be determined accurately and uniquely; however, for a single musical instrument only the average dominant frequency can be determined. With multiple musical instruments, the case is even worse.

Fundamental frequency. For a single talker, the fundamental frequency can be accurately determined. However, this is not the case for a single musical instrument.

Excitation patterns. The excitation signals (pitch) for audio usually exist only over a span of three octaves, while the fundamental music tones can span up to six octaves [60].

Energy sequences. A reasonable generalization is that audio follows a pattern of high-energy voiced segments followed by low-energy segments, a pattern that the envelope of music is less likely to exhibit.

Tonal duration. The duration of vowels in audio is very regular, following the syllabic rate. Music exhibits a wider variation in tone lengths, not being constrained by the process of articulation. Hence, tonal duration would likely be a good discriminator.

Consonants. An audio signal contains many consonants, while music is usually continuous through time [33].

Zero crossing rate (ZCR). The ZCR in music is greater than that in audio. We can use this idea to design a discriminator [60].

In the frequency domain, there is a strong overlap between audio and music signals, so no ordinary filter can separate them. As mentioned before, an audio signal may cover the spectrum between 0 and 4 kHz, with an average dominant frequency of 1.8747 kHz. However, the lowest fundamental frequency (A1) of a music signal is about 27.5 Hz and the highest frequency of the tone C8 is around 4186 Hz. The reason is that musical instrument manufacturers try to bound music frequencies to the limits of human sound in order to achieve a strong consonance, and hence a strong frequency overlap. Moreover, music may propagate over the audible spectrum to cover more than the audible band of 20 kHz, with an average dominant frequency of 1.9271 kHz [25].

Table 3 summarizes the main similarities and differences between music and audio signals.

Key Difference: Units of Analysis
  Audio: Phonemes.
  Music: Notes (finite set).

Key Difference: Temporal Structure
  Audio:
  • Short sample (40 ms–200 ms).
  • More steady state than dynamic.
  • Timing unstrained but variable.
  • Amplitude modulation rate for sentences is slow (∼4 Hz).
  Music:
  • Longer sample: 600–1200 ms.
  • Mix of steady state (strings, winds) and transient (percussion).
  • Strong periodicity.

Key Difference: Spectral Structure
  Audio:
  • Largely harmonic (vowels, voiced consonants).
  • Tend to group in formants.
  • Some inharmonic stops.
  Music:
  • Largely harmonic and some inharmonic (percussion).

Key Difference: Syntactic/Semantic Structure
  Audio:
  • Symbolic.
  • Productive.
  • Can be combined in a grammar.
  Music:
  • Symbolic.
  • Productive.
  • Combined in a grammar.

Table 3.

The main differences between audio and music signals.


3. Audio and music signals classification

The main classification approaches are discussed in this section. They can be categorized into three groups: (1) time-domain approaches, (2) frequency-domain approaches, and (3) time-frequency-domain approaches. A two-level music and audio classifier was developed by El-Maleh [61, 62], using a combination of long-term features such as the variance, the differential parameters, the zero-crossing rate (ZCR), and the time averages of spectral parameters. Saunders [60] proposed another two-level classifier based on the short-time energy (STE) and the average ZCR features. In addition, Matityaho and Furst [63] developed a neural-network-based model for classifying music signals, designed after the functional performance of the human cochlea.

For audio detection, Hoyt and Wecheler [64] developed a neural-network-based model using the Fourier transform, Hamming filtering, and a logarithmic function as pre-processing, and then applied a simple threshold algorithm for detecting audio, music, wind, traffic, or any interfering sound. To improve the performance, they suggested a wavelet-transform feature for pre-processing. Their work is very similar to that of Matityaho and Furst [63, 64]. Thirteen features were examined by Scheirer and Slaney [65], some of which were simple modifications of one another. They also tried combining them in several multidimensional classification forms. From these previous works, the most powerful discrimination features were the STE and the ZCR; therefore, the STE and the ZCR will be discussed thoroughly. Finally, the common classifiers of audio and music signals can be divided into the following approaches:

  1. The Time domain algorithms:

    1. The ZCR algorithm [1, 34, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77]:

      1. The standard deviation of first order difference of the ZCR.

      2. The 3rd central moment of the mean of ZCR.

      3. The total number of zero crossings exceeding a specific threshold.

    2. The STE [60, 61, 62, 63, 64, 65, 78].

    3. The ZCR and the STE positive derivative [78, 79].

    4. The Pulse Metric [31, 59, 80, 81, 82].

    5. The number of silence [32, 60].

    6. The HMM (Hidden Markov Model) [83, 84, 85].

    7. The ANN (Artificial neural networks) [12, 49, 58, 63, 79, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120].

    8. The Roll-Off Variance [31, 59].

  2. The Frequency-domain algorithms [32, 33, 35, 59, 112, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 121]:

    1. The Spectrum [31, 111]:

      1. The Spectral Centroid.

      2. The Spectral Flux Variance.

      3. The Spectral Centroid Mean and Variance.

      4. The Spectral Flux Mean and Variance.

      5. The Spectrum Roll-Off.

      6. The Signal Bandwidth.

      7. The Spectrum Amplitude.

      8. The Delta Amplitude.

    2. The Cepstrum [122]:

      1. The Cepstral Residual [122, 123, 124].

      2. The Variance of the Cepstral Residual [122, 123, 124].

      3. The Cepstral feature [122, 123, 124].

      4. The Pitch [94, 107, 108, 117, 118, 119, 125, 126].

      5. The Delta Pitch [88, 119].

  3. The Time-Frequency domain algorithms:

    1. The Spectrogram (or Sonogram) [13, 19, 86, 127].

    2. The Evolutionary Spectrum and the Evolutionary Bispectrum [81, 128, 129].

3.1 Time domain algorithms

3.1.1 The ZCR algorithm

The ZCR can be defined as the number of times the signal crosses the zero axis within a specific window. It is widely used because of its simplicity and robustness [34]. We may define the ZCR as in the following equation.

Zn = (1/(2N)) Σ_{m=1}^{N−1} |sgn[x(m)] − sgn[x(m−1)]|    (1)
where Zn is the ZCR, N is the number of samples in one window, and sgn is the sign of the signal, such that sgn[x(n)] = 1 when x(n) > 0 and sgn[x(n)] = −1 when x(n) < 0. An essential note is that the sampling rate must be high enough to catch every crossing through zero. Another important step before evaluating the ZCR is to normalize the signal by subtracting its average value. It is clear from Eq. (1) that the value of the ZCR is proportional to the sign changes in the signal, i.e., to the dominant frequency of x(n). Therefore, we may find that the ZCR of music is, in general, higher than that of audio, although this is less certain for unvoiced audio.
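A minimal sketch of Eq. (1) in code; the helper name and the test tone are ours, not the chapter's:

```python
import math

def zero_crossing_rate(x):
    """ZCR of one window per Eq. (1): count sign changes of the
    mean-removed signal and normalize by 2N."""
    n = len(x)
    mean = sum(x) / n
    x = [s - mean for s in x]                     # remove the DC offset first
    sgn = lambda v: 1 if v > 0 else -1
    return sum(abs(sgn(x[i]) - sgn(x[i - 1])) for i in range(1, n)) / (2 * n)

# A 100 Hz sinusoid sampled at 8 kHz crosses zero about 200 times per second
fs = 8000
tone = [math.sin(2 * math.pi * 100 * i / fs + 0.1) for i in range(fs)]
crossings_per_sec = zero_crossing_rate(tone) * fs
```

For this tone the estimate lands at about 200 crossings per second, i.e. twice the 100 Hz frequency, which is the link between the ZCR and the dominant frequency discussed below.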

Properties of ZCR:

The ZCR properties can be summarized as follows.

  1. The Principle of Dominant Frequency

    The dominant frequency of a pure sinusoid is the only value in its spectrum, and it is equal to the ZCR of the signal in one period. If we have a non-sinusoidal periodic signal, its dominant frequency is the frequency with the largest amplitude. The dominant frequency (ω0) can be evaluated as follows.

    ω0 = (π/N) · E{D0}    (2)
    where N is the number of intervals, E{.} is the expected value, and Do is the ZCR per interval.

  2. The Highest frequency

    Since D0 denotes the ZCR of a discrete-time signal Z(i), let us assume that Dn denotes the ZCR of the nth derivative of Z(i), i.e., D1 is the ZCR of the first derivative of Z(i), D2 is the ZCR of the second derivative of Z(i), and so on. Then, the highest frequency ωmax in the signal can be evaluated as follow.

    ωmax = lim_{i→∞} (π/N) · E{Di}    (3)
    where N is the number of samples. If the sampling rate equals 11 kHz, then the change in ωmax can be ignored for i > 10.

  3. The Lowest frequency

    Assuming that the time period between any two samples is normalized to unity, the derivative of Z(i) can be defined as Z′(i) = Z(i) − Z(i−1). Then, the ZCR of the nth derivative of Z(i) is defined as Dn. Now, let us define ∇+ as the +ve derivative of Z(i); then ∇+[Z(i)] can be defined as follows.

    ∇+[Z(i)] = Z(i) − Z(i−1) if Z(i) ≥ Z(i−1), and 0 otherwise    (4)
    Now, let us define the ZCR of the nth + ve derivative of Z(i) by the symbol nD. Then we can find the lowest frequency ωmin of a signal as follow.

    ωmin = lim_{n→∞} (π/N) · E{nD}    (5)
  4. Measure of Periodicity

    A signal is said to be purely periodic if and only if.

    D0 = D1 = D2 = ⋯ = Dn    (6)
    Using Eq. (6), it was found that music is more periodic than audio [44, 45, 46, 47, 55, 56, 57, 130].

The Ratio of High ZCR (RHZCR)

It was found that the variation of the ZCR is more discriminative than the exact ZCR, so the RHZCR can be considered as a feature [78]. The RHZCR is defined as the ratio of the number of frames whose ZCR is above 1.5 times the average ZCR in a one-second window, and can be defined as follows.

RHZCR = (1/(2N)) Σ_{n=0}^{N−1} [sgn(ZCR(n) − 1.5 · avZCR) + 1]    (7)

avZCR = (1/N) Σ_{n=0}^{N−1} ZCR(n)    (8)
where N is the number of frames per window, n is the index of the frame, sgn[.] is the sign function, ZCR(n) is the zero-crossing rate at the nth frame, and avZCR in Eq. (8) is the average ZCR over the window. In general, audio signals consist of alternating voiced and unvoiced sounds at the syllable rate, while music does not have this kind of alternation. Therefore, from Eq. (7) and Eq. (8), we may observe that the variation of the ZCR (or the RHZCR) of an audio signal is greater than that of music, as shown in Figure 13.

Figure 13.

Music and audio sharing some values [65].
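The RHZCR computation can be sketched as follows, assuming the 1.5×-average threshold; the per-frame ZCR values are synthetic illustrations of speech-like alternation versus music-like stability:

```python
def rhzcr(frame_zcrs):
    """Ratio of frames whose ZCR exceeds 1.5x the window-average ZCR
    (a sketch of Eqs. (7)-(8); the counting is done directly rather
    than with the sgn trick)."""
    avg = sum(frame_zcrs) / len(frame_zcrs)
    return sum(1 for z in frame_zcrs if z > 1.5 * avg) / len(frame_zcrs)

# Voiced/unvoiced alternation gives a few frames with much higher ZCR
speech_like = [0.05, 0.04, 0.30, 0.05, 0.06, 0.28, 0.05, 0.04]
music_like = [0.12, 0.11, 0.13, 0.12, 0.11, 0.12, 0.13, 0.12]
```

Here `rhzcr(speech_like)` is 0.25 while `rhzcr(music_like)` is 0, reflecting the voiced/unvoiced alternation argument above.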

3.1.2 The STE algorithm

The amplitude of the audio signal varies appreciably with time. In particular, the amplitude of unvoiced segments is generally much lower than that of voiced segments. The STE of the audio signal provides a convenient representation that reflects these amplitude variations. Unlike the audio signal, the music signal does not contain unvoiced segments, so the STE of music is usually larger than that of audio [60]. The STE of a discrete-time signal s(n) can be defined as follows.

STEs = Σ_{n=−∞}^{∞} s²(n)    (9)
where STEs in Eq. (9) is the total energy of the signal. The average power of s(n) is defined as follows.

Ps = lim_{N→∞} (1/(2N+1)) Σ_{n=−N}^{N} s²(n)    (10)
Signals can, in general, be classified into three types: an energy signal, which has a non-zero and finite energy; a power signal, which has a non-zero and finite average power; and a third type that is neither an energy nor a power signal, see Table 4. Now, let us define another sequence {f(n;m)} as follows.

Energy signal (0 < Es < ∞):
  Transient:        s(n) = α^n u(n), |α| < 1
  Finite sequence:  s(n) = e^{βn} [u(n) − u(n−255)], |β| < ∞
Power signal (0 < Ps < ∞):
  Constant:         s(n) = α, −∞ < α < ∞
  Periodic:         s(n) = α sin(ω0 n + φ), −∞ < α < ∞
  Stochastic:       s(n) = rand(seed)
Neither energy nor power signal:
  Zero:             s(n) = 0
  Blow up:          s(n) = α^n u(n), |α| > 1

Table 4.

Types of signals.

fs(n; m) = s(n) · w(m − n)    (11)
where w(n) is a window of length N with a value of zero outside [0, N−1]. Therefore, fs(n;m) will be zero outside [m−N+1, m].
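Framing followed by an energy sum gives the per-frame STE; a minimal sketch with a rectangular window and non-overlapping frames (the names and the toy signal are ours):

```python
def frame_ste(s, frame_len):
    """Short-time energy per frame: sum of squared samples inside a
    rectangular window of length frame_len."""
    return [sum(v * v for v in s[i:i + frame_len])
            for i in range(0, len(s) - frame_len + 1, frame_len)]

# A loud, voiced-like frame followed by a quiet, unvoiced-like frame
s = [0.9, -0.8, 0.9, -0.9, 0.05, -0.04, 0.05, -0.05]
energies = frame_ste(s, 4)
```

The first (voiced-like) frame carries far more energy than the second, which is exactly the frame-to-frame variation that the LSTER feature measures.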

Deriving short term features

The silence and unvoiced periods in audio can be considered stochastic background noise. Now, let us define Fs as a feature of {s(n)}, mapping values from the Hilbert space H to the set of complex numbers C, such that:


The long-term feature of {s(n)} may be defined as follows.


The long-term average, when applied to energy signals, will have zero values; however, it is appropriate for power signals. Eq. (13) can be re-written as follows.


This results in a family of mappings. If each member of the family is indexed by a parameter λ, then we can use the notation Fs(λ). The discrete-time Fourier transform is an example of a parametric long-term feature. The long-term feature can be of the form:


where M in Eq. (15) is the mapping sequence; it maps {s(n)} to another sequence. The long-term feature Fs(λ) is defined as L∘M, a composition of the functions L and M. If Fs(λ) is the long-term feature of Eq. (12), then the short-term feature Fs(λ, m) at time period m can be constructed as follows:

  • Define a frame as in Eq. (11).

  • Apply the long-term feature transformation to the frame sequence as in Eq. (16).


Low Short Time Energy Ratio (LSTER)

As done with the ZCR, the variation is selected [33]. Here, the LSTER is used to represent the variation of the STE. The LSTER is defined as the ratio of the number of frames whose STE is less than 0.5 times the average STE in a one-second window, as in Eq. (17).

LSTER = (1/(2N)) Σ_{n=0}^{N−1} [sgn(0.5 · STEav − STE(n)) + 1]    (17)

STEav = (1/N) Σ_{n=0}^{N−1} STE(n)    (18)
N is the total number of frames, STE(n) is the STE at the nth frame, and STEav in Eq. (18) is the average STE over one window.
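Given per-frame STE values, Eqs. (17)–(18) reduce to a simple counting rule; a sketch with synthetic frame energies (the speech-like frames include near-silent gaps):

```python
def lster(frame_stes):
    """Low short-time energy ratio: fraction of frames whose STE falls
    below half the window-average STE, per Eqs. (17)-(18)."""
    avg = sum(frame_stes) / len(frame_stes)
    return sum(1 for e in frame_stes if e < 0.5 * avg) / len(frame_stes)

speech_frames = [1.0, 0.01, 0.9, 0.02, 1.1, 0.01]   # silences between syllables
music_frames = [0.8, 0.9, 1.0, 0.85, 0.95, 0.9]     # steadier envelope
```

`lster(speech_frames)` is 0.5 while `lster(music_frames)` is 0, matching the observation that speech alternates between high- and low-energy states while music does not.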

3.1.3 The effect of the positive derivative

Figure 14 shows the pre-processing flow on Z(i) using the positive derivative concept (∇+), which provided some improvement in the discrimination process [78].

Figure 14.

The preprocessing using the +ve derivative before evaluating the ZCR.

This pre-processing increased the ZCR of music and reduced the ZCR of audio, at the expense of some delay. The averages of the ZCR for speech, mixture, and music, after applying the +ve derivative of order 50, are shown in Figure 15.

Figure 15.

The average ZCR of speech, mixture, and music, after pre-processing with the +ve derivative [78].
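One plausible reading of the ∇+ operator is a half-wave-rectified first difference; the sketch below (our names, and our assumption about the operator's exact form) shows how repeated application implements the pre-processing of Figure 14:

```python
def pos_derivative(z):
    """Half-wave-rectified first difference: keep only the positive
    part of Z(i) - Z(i-1) (an assumed form of the +ve derivative)."""
    return [max(z[i] - z[i - 1], 0.0) for i in range(1, len(z))]

def preprocess(z, order=50):
    """Apply the +ve derivative `order` times before computing the ZCR,
    as in the order-50 pre-processing described above."""
    for _ in range(order):
        z = pos_derivative(z)
    return z
```

Each pass shortens the sequence by one sample, which is one source of the delay mentioned above.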

3.1.4 Artificial neural network (ANN) approach

The ANN approach is a multipurpose technique that has been used to implement many algorithms [14, 36, 63, 79, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 110, 125], especially for classification problems [16, 49, 107, 108, 109, 110, 111, 119, 120, 131, 132]. A multi-layer ANN has been used in many classification tools since it can represent nonlinear decision surfaces.

3.2 Algorithms in the frequency domain

3.2.1 The spectrum approaches

Spectral flux mean and variance

This feature characterizes the change in the shape of the spectrum by measuring the frame-to-frame spectral difference. Audio signals go through fewer frame-to-frame changes than music, so the spectral flux values of an audio signal are lower than those of music.

The spectral flux, sometimes called the delta spectrum magnitude, is defined as the 2-norm of the frame-to-frame spectral amplitude difference vector, as in Eq. (19).

SFn = ‖Xn − Xn−1‖2 = ( Σ_k (Xn(k) − Xn−1(k))² )^(1/2)    (19)
where X(k) is the signal power and k is the corresponding frequency. Another definition of the SF is described as follows.

SF = (1/((N−1)(M−1))) Σ_{n=1}^{N−1} Σ_{k=1}^{M−1} [log(A(n,k) + δ) − log(A(n−1,k) + δ)]²    (20)
where A(n, k) in Eq. (20) is the discrete Fourier transform (DFT) of the nth frame of the input signal and can be described as in Eq. (21).

A(n, k) = | Σ_{m=−∞}^{∞} x(m) w(nL − m) e^(−j(2π/M)km) |    (21)
and x(m) is the original audio data, L is the window length, M is the order of the DFT, N is the total number of frames, δ is an arbitrary constant, and w(m) is the window function. Scheirer and Slaney [65] found the SF feature very useful in discriminating audio from music. Figure 16 depicts that the variances are lower for music than for audio, and the means are lower for audio than for music. Rossignol et al. [133] computed the means and variances of one-second segments using frames of length 18 milliseconds.

Figure 16.

3D histogram normalized features (the mean and the variance of spectral flux) of: (a) music signal, (b) audio signal [133].
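The frame-to-frame spectral difference can be sketched directly; the code below uses a direct DFT on short synthetic tones and the plain 2-norm form of the flux (not Eq. (20)'s log-scaled, normalized version):

```python
import cmath, math

def dft_mag(frame):
    """Magnitude spectrum by direct DFT (adequate for short frames)."""
    n = len(frame)
    return [abs(sum(frame[m] * cmath.exp(-2j * cmath.pi * k * m / n)
                    for m in range(n))) for k in range(n // 2)]

def spectral_flux(frames):
    """2-norm of the change in magnitude spectrum between frames."""
    mags = [dft_mag(f) for f in frames]
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(mags[i], mags[i - 1])))
            for i in range(1, len(mags))]

n = 64
tone_a = [math.sin(2 * math.pi * 4 * m / n) for m in range(n)]   # bin-4 tone
tone_b = [math.sin(2 * math.pi * 8 * m / n) for m in range(n)]   # bin-8 tone
steady = spectral_flux([tone_a, tone_a, tone_a])    # stable spectrum
varying = spectral_flux([tone_a, tone_b, tone_a])   # changing spectrum
```

The steady sequence produces near-zero flux, while the frequency change between frames produces a large flux, illustrating why the mean and variance of this feature separate the two classes.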

Rossignol et al. [133] tested three classification approaches to classify the segments: the k-nearest-neighbors (kNN) classifier with k = 7, the Gaussian mixture model (GMM), and the ANN. Their results, using the mean and the variance of the SF, are shown in Table 5.


Table 5.

Percentage of misclassified segments [133].

The mean and variance of the spectral centroid

In the frequency domain, the mean and variance of the spectral centroid describe the center frequency around which most of the signal power is found. In audio signals, the pitches are concentrated in a narrow range of low frequencies. In contrast, music signals contain higher frequencies, resulting in higher spectral means, i.e., higher spectral centroids. For a frame at time t, the spectral centroid can be evaluated as follows.

SCt = ( Σ_k k · Xt(k) ) / ( Σ_k Xt(k) )    (22)
where X(k) is the power of the signal at the corresponding frequency band k. When the mean and the variance of the SF are combined with the mean and the variance of the SC in Eq. (22), and the mean and the variance of the ZCR, the results in Table 6 are obtained.
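Eq. (22) is a power-weighted mean of the frequency index; a minimal sketch over synthetic band powers (the two power vectors are illustrations, not measured data):

```python
def spectral_centroid(power):
    """Power-weighted mean frequency band, per Eq. (22); power[k] is
    the signal power in band k."""
    return sum(k * p for k, p in enumerate(power)) / sum(power)

speech_power = [0, 10, 5, 1, 0, 0, 0, 0]   # power packed into low bands
music_power = [0, 1, 2, 4, 4, 4, 2, 1]     # power spread to higher bands
```

The low-band-heavy vector yields a much smaller centroid than the spread-out one, matching the claim that music has the higher spectral centroid.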


Table 6.

Percentage of misclassified segments [133].

Energy at 4 Hz modulation

Audio signals have an energy peak centered at the 4 Hz syllabic rate; therefore, a 2nd-order band-pass filter with a center frequency of 4 Hz is used. Although audio signals have higher energy at 4 Hz, some bass music instruments were also found to have modulation energy around this frequency [65, 133].

Roll-off point

In the frequency domain, the roll-off point feature is the frequency below which 95% of the signal power is contained. The value of the roll-off point can be found as follows [65, 133].

Σ_{k≤V} X(k) = 0.95 · Σ_k X(k)    (23)
where the left-hand side of Eq. (23) is the sum of the power up to the frequency value V, the right-hand side is 95% of the total power of the signal in the frame, and X(k) is the DFT of x(t).
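Eq. (23) amounts to scanning the cumulative power until 95% of the total is reached; a sketch over a discrete, bin-indexed power spectrum (names ours):

```python
def rolloff_bin(power, fraction=0.95):
    """Smallest bin V whose cumulative power reaches `fraction` of
    the total, per Eq. (23)."""
    target = fraction * sum(power)
    running = 0.0
    for k, p in enumerate(power):
        running += p
        if running >= target:
            return k
    return len(power) - 1

low_heavy = [8, 1, 1, 0, 0, 0, 0, 0]   # power collapses after the low bins
flat = [1, 1, 1, 1, 1, 1, 1, 1]        # power spread over all bins
```

The low-heavy spectrum rolls off at bin 2 while the flat one rolls off at bin 7, which is why the roll-off point separates low-concentrated audio from wider-band music.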

3.2.2 Cepstrum

The cepstrum of a signal can be defined as the inverse DFT of the logarithm of the spectrum of the signal. Music signals have higher cepstrum values than speech signals. The complex cepstrum is defined in the following equations [122, 123, 124].

X̂(e^(jω)) = log X(e^(jω))    (24)
and then.

x̂(n) = (1/2π) ∫_{−π}^{π} X̂(e^(jω)) e^(jωn) dω    (25)
where X(e^(jω)) is the discrete-time Fourier transform of the sequence x(n).
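A magnitude-only sketch of Eqs. (24)–(25), i.e. the real cepstrum, which drops the phase term of the complex cepstrum (direct DFT, names ours):

```python
import cmath, math

def real_cepstrum(x):
    """Real cepstrum: inverse DFT of the log magnitude spectrum."""
    n = len(x)
    X = [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
         for k in range(n)]
    logmag = [math.log(abs(v) + 1e-12) for v in X]   # guard against log(0)
    return [sum(logmag[k] * cmath.exp(2j * cmath.pi * k * q / n)
                for k in range(n)).real / n for q in range(n)]
```

As a sanity check, the cepstrum of a unit impulse (flat spectrum, log magnitude near 0) is essentially zero everywhere, and the result is an even sequence for real input.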

3.2.3 Summary

Table 7 summarizes the percentage error of a simulation performed for each feature. Latency refers to the amount of past input data required to calculate the feature.

Feature                  Latency   Error
4 Hz Mod Energy          1 sec     12 ± 1.7%
Low Energy               1 sec     14 ± 3.6%
Roll-Off                 1 frame   46 ± 2.9%
Roll-Off Var             1 sec     20 ± 6.4%
Spec Centroid            1 frame   39 ± 8.0%
Spec Centroid Var        1 sec     14 ± 3.7%
Spec Flux                1 frame   39 ± 1.1%
Spec Flux Var            1 sec     5.9 ± 1.9%
ZCR                      1 frame   38 ± 4.6%
ZCR Var                  1 sec     18 ± 4.8%
Cepstrum Resid           1 frame   37 ± 7.5%
Cepstrum Resid Var       1 sec     22 ± 5.7%
Pulse Metric             5 sec     18 ± 2.9%

Table 7.

Latency and univariate discrimination performance for each feature [65].

Scheirer and Slaney [65] evaluated their models using a 20-minute data set of music and audio, consisting of 80 samples, each 15 seconds long. They collected the samples using a 16-bit monophonic FM tuner at a sampling rate of 22.05 kHz, from a variety of stations with different content styles and different noise levels, over a period of three days in the San Francisco Bay Area. They also stated that the audio samples include both male and female speakers.

They also recorded samples of many types of music, such as pop, jazz, salsa, country, classical, reggae, various sorts of rock, and various non-Western styles [29, 65]. They then combined several features in a spatial partitioning classifier. Table 8 summarizes their results.

Subset               Audio % error   Music % error   Total % error
All features         5.8 ± 2.1       7.8 ± 6.4       6.8 ± 3.5
Best 8               6.2 ± 2.2       7.3 ± 6.1       6.7 ± 3.3
Best 3               6.7 ± 1.9       4.9 ± 3.7       5.8 ± 2.1
Spec flux var only   12 ± 2.2        15 ± 6.4        13 ± 3.5
Fast 5               33 ± 4.7        21 ± 6.6        27 ± 4.6

Table 8.

Performance for various subsets of features.

The features used in Best 8 are the 4 Hz modulation energy, the variance features, the pulse metric, and the low-energy frame [80, 134]. In Best 3, they used the pulse metric, the 4 Hz energy, and the variance of the spectral flux. In Fast 5, they used the five basic features. From the results shown in Table 8, we conclude that it is not necessary to use all features to obtain good classification, so in real time a well-performing system may be built using only a few features. A more detailed discussion can be found in [29, 65, 80, 134].

3.3 Algorithms in the time-frequency domain

3.3.1 Spectrogram (or sonogram)

The spectrogram is an example of a time-frequency distribution, and it has been found to be a good classical tool for analyzing audio signals [13, 19, 86, 127]. The spectrogram (or sonogram) of a signal x(n) can be defined as follows:

S(n, ω) = |∑_{m=0}^{N−1} x(m) W(n − m) e^{−jωm}|²        (26)

where N is the length of the sequence x(n), and W(n) is a specific window.
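A minimal discrete implementation of the spectrogram (Hann window, frame-by-frame DFT; the function name and parameter defaults are illustrative choices):

```python
import numpy as np

def spectrogram(x, win_len=256, hop=128):
    """Magnitude-squared STFT of x, one row per frame."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2  # shape (time, frequency)

# For an upward chirp the dominant frequency bin grows frame by frame.
fs = 8000
t = np.arange(2 * fs) / fs
chirp = np.sin(2 * np.pi * (200.0 + 400.0 * t) * t)
S = spectrogram(chirp)
print(S.shape, np.argmax(S[0]), np.argmax(S[-1]))
```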

The spectrogram can be used to discriminate audio from music signals; however, it may have a high percentage error, because it depends on the strength of the frequency components in the tested samples. Figure 17 depicts two examples of spectrograms of audio and music signals.

Figure 17.

(a) Audio spectrogram, (b) music spectrogram.

3.3.2 Evolutionary spectrum (ES)

The spectral representation of a stationary signal may be viewed as an infinite sum of sinusoids with random amplitudes and phases, as described in Eq. (27):

x(n) = ∫_{−π}^{π} e^{jωn} dZ(ω)        (27)

where Z(ω) is a process with orthogonal increments, i.e.,

E[|dZ(ω)|²] = S(ω) dω        (28)

and S(ω) in Eq. (28) is the spectrum of e(n) [81]. Since the audio signal is, in general, nonstationary, we will use the Wold-Cramer (WC) representation of a nonstationary signal. WC considers the discrete-time nonstationary process {x(n)} as the output of a causal, linear, time-variant (LTV) system with a white noise input e(n) that has zero mean and unit variance, i.e.,

x(n) = ∑_{m=−∞}^{n} h(n, m) e(m)        (29)

where h(n, m) is defined as the unit impulse response of the LTV system. Substituting the spectral representation of e(n) into x(n) of Eq. (29) (assuming S(ω) = 1 for white noise) we get

x(n) = ∫_{−π}^{π} H(n, ω) e^{jωn} dZ(ω)        (30)

where H(n, ω) in Eq. (30) is the time-frequency transfer function of the LTV system, defined as

H(n, ω) = ∑_{m=−∞}^{n} h(n, m) e^{−jω(n−m)}        (31)

and the instantaneous power of x(n) is given by

E[|x(n)|²] = (1/2π) ∫_{−π}^{π} |H(n, ω)|² dω        (32)

and then the Wold-Cramer ES is defined as

S(n, ω) = |H(n, ω)|²        (33)
The ES S(n, ω) in Eq. (33) was found to be a good classifier for distinguishing audio from music signals [81, 129]. Because of the extensive computation required by time-frequency spectra, they may be most useful in off-line classification and analysis. The ESs of music and audio signals are shown in Figure 18(a) and (b), respectively. The suppression of the amplitude for audio might be due to its Gaussianity.

Figure 18.

(a) The ES of a music signal, (b) the ES of an audio signal [81].


4. Separation of audio and music signals

Since the separation of audio and music signals is more complicated than their classification, in this section we introduce only two approaches [7, 8, 9, 10, 11, 12, 13, 22, 76, 77, 86, 135]. The first is the independent component analysis (ICA) with ANN approach; the second is the pitch-cancelation approach. A block diagram of a classifier integrated with a separator is depicted in Figure 19.

Figure 19.

A block diagram of a classifier integrated with a separator.

4.1 ICA with ANN separation approach

In [13, 20, 21, 127, 136], Wang and Brown proposed a model for an audio segregation algorithm. Their model consists of preprocessing using cochlear filtering, gammatone filtering, a correlogram formed from autocorrelation functions, and feature extraction. The impulse response of the gammatone filters is represented as

g_i(t) = g(i) t^(n−1) e^(−2πbt) cos(2πf_i t + φ_i) U(t),  1 ≤ i ≤ N        (34)

where n is the filter order, N is the number of channels, and U(t) is the unit step function. Therefore, the gammatone system can be considered a causal, time-invariant system with an infinite impulse response. For the ith channel, f_i is the center frequency of the channel, φ_i is the phase of the channel, b is the decay rate of the impulse response, and g(i) is an equalizing gain adjusted for each filter. Figure 20 depicts the impulse response of the gammatone system, and Figure 21 depicts the block diagram of the Wang and Brown model.

Figure 20.

4th-order gammatone impulse response: (a) in the time domain for i = 1, f_i = 80 Hz; (b) in the time domain for i = 5, f_i = 244 Hz; (c) in the frequency domain for the first five filters (i = 1 to i = 5) with gain g(i) set to unity.

Figure 21.

A block diagram of Wang and Brown model.
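A sketch of the gammatone impulse response follows. The ERB-based choice of the decay rate b uses the common Glasberg-Moore rule, which is an assumption on our part (the chapter only names b as the decay rate), as are the function name and defaults:

```python
import numpy as np

def gammatone_ir(fs, fc, order=4, duration=0.064, gain=1.0, phase=0.0):
    """g(i) t^(n-1) e^(-2 pi b t) cos(2 pi fc t + phi) U(t), sampled at fs."""
    t = np.arange(int(duration * fs)) / fs        # t >= 0, so U(t) = 1
    erb = 24.7 + 0.108 * fc                       # equivalent rectangular bandwidth
    b = 1.019 * erb                               # assumed decay-rate rule
    return (gain * t ** (order - 1) * np.exp(-2 * np.pi * b * t)
            * np.cos(2 * np.pi * fc * t + phase))

h = gammatone_ir(fs=16000, fc=1000)
print(len(h))
```

Feeding the same input through a bank of such filters with channel-dependent center frequencies f_i gives the cochlear front end sketched in Figure 21.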

The Wang and Brown model has some drawbacks. The first is its complexity: the model needs high-specification hardware to perform its calculations. In [20], it was reported that the Wang and Brown model needs improvement. The ICA method can be used for separation when two mixtures of the sources are available, assuming that the two signals from the two different sources are statistically independent [66, 74, 75, 121, 137]. In [19], Takigawa et al. tried to improve the performance of the Wang and Brown model by using the short-time Fourier transform (STFT) at the input stage and spectrogram values instead of the correlogram; however, they did not report the amount of improvement. A similar work, separating the voiced audio of two talkers speaking simultaneously at similar intensities in a single channel using pitch-peak canceling in the cepstrum domain, was done by Stubbs [8].

4.2 The pitch cancelation

The pitch-cancelation method is widely used in noise reduction. A good attempt to separate two talkers speaking simultaneously at similar intensities in a single channel (in other words, to separate two talkers without any restriction) was introduced by Stubbs [8]. For a given speaker, voiced sounds such as the letters A and R carry a strong pitched component. In the frequency domain this component has low amplitude; however, it appears as a tall pitch peak in the cepstrum domain. If this peak is deleted by replacing the five cepstral samples centered at the pitch peak with zeros, the audio segment may be attenuated or distorted completely. A typical example of the cepstra of 5-second audio and music signals is depicted in Figure 22. The logarithm raises low amplitudes and reduces high ones, so values near zero become very large in magnitude after the logarithm.

Figure 22.

(a) A typical 5-second audio signal in the cepstrum domain; the pitch peak appears near zero. (b) A typical 5-second music signal in the cepstrum domain.
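The five-sample cepstral zeroing described above can be sketched as follows. This sketch uses the real cepstrum and resynthesizes with the original phase, so it only illustrates the idea behind Stubbs' method rather than reproducing his algorithm; the search bounds and function name are assumptions:

```python
import numpy as np

def cancel_pitch_peak(x, q_min=20, q_max=80, eps=1e-9):
    """Zero the five cepstral samples centered on the pitch peak."""
    X = np.fft.fft(x)
    c = np.real(np.fft.ifft(np.log(np.abs(X) + eps)))   # real cepstrum
    peak = q_min + int(np.argmax(c[q_min:q_max]))       # pitch-peak quefrency
    for q in range(peak - 2, peak + 3):                 # five samples centered on it
        c[q] = 0.0
        c[-q] = 0.0                                     # keep the cepstrum symmetric
    new_mag = np.exp(np.real(np.fft.fft(c)))            # back to a magnitude spectrum
    y = np.real(np.fft.ifft(new_mag * np.exp(1j * np.angle(X))))
    return y, peak

x = np.zeros(1000)
x[::50] = 1.0                  # strongly pitched input, period 50 samples
y, peak = cancel_pitch_peak(x)
print(peak)
```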


5. Conclusions

In this chapter, a general review of the common classification and separation algorithms used for speech and music was presented, and some were discussed thoroughly. The approaches dealing with classification were divided into three categories. The first category included most of the real-time approaches: the ZCR, the STE, the ZCR and the STE with positive derivative, some of their modified versions, and the neural networks. The second category included most of the frequency-domain approaches, such as the spectral centroid and its variance, the spectral flux and its variance, the roll-off of the spectrum, the cepstral residual, and the delta pitch. The last category introduced two time-frequency approaches, namely the spectrogram and the evolutionary spectrum. It has been noticed that the time-frequency classifiers provide excellent and robust results in discriminating speech from music in digital audio. Which feature to choose depends on the application: the algorithms of the first category are faster, since the processing is done in real time, whereas those of the second are more precise. The time-frequency approaches have not been discussed thoroughly in the literature and still need more research and elaboration. Lastly, we may conclude that many classification algorithms have been proposed in the literature, but few have been proposed for separation. The algorithms introduced in this chapter are summarized in Table 9.

Time domain          Frequency domain (spectrum)   Frequency domain (cepstrum)         Time-frequency domain
ZCR                  Spectral centroid             Cepstral residual                   Spectrogram (sonogram)
STE                  Spectral flux                 Variance of the cepstral residual   Evolutionary spectrum
Roll-off variance    Spectrum roll-off             Cepstral feature                    Evolutionary bispectrum
Pulse metric         Signal bandwidth              Pitch
Number of silences   Spectrum amplitude            Delta pitch
HMM                  Delta amplitude

Table 9.

Summary of the classification and separation algorithms.


  1. 1. Al-Shoshan A. I. Speech and Music Classification and Separation: A Review. Journal of King Saud University- Engineering Sciences. 2006; 19(1): 95–133. doi:10.1016/S1018-3639(18)30850-X
  2. 2. Martin, K. Towards Automatic Sound Source Recognition: Identifying Musical Instruments. In: Proceedings of NATO Computational Hearing Advanced Study Institute, Italy, July 1998
  3. 3. Herrera-Boyer P., Amatriain X., Batlle E., Serra X. Towards Instrument Segmentation for Music Content Description: a Critical Review of Instrument Classification Techniques. In: Proceedings of International Society for Music Information Retrieval (ISMIR), 2000
  4. 4. Gjerdingen R.O. Using Connectionist Models to Explore Complex Musical Patterns. Computer Music Journal. 1989; 13(3):67–75. DOI: 10.2307/3680013
  5. 5. Hörnel D., Menzel W. Learning Musical Structure and Style with Neural Networks. Computer Music Journal. 1998; 22(4):44–62. doi:10.2307/3680893
  6. 6. Leman, M., Van Renterghem P. Transputer Implementation of the Kohonen Feature Map for a Music Recognition Task. In: Proceedings of the Second International Transputer Conference; Antwerpen: BIRA, 1989. pp. 1–20
  7. 7. Al-Atiyah A. Music and Speech Separation [thesis]. King Saud University; 2002
  8. 8. Stubbs R., Summerfield Q. Effects of signal-to-noise ratio, signal periodicity, and degree of hearing impairment on the performance of voice-separation algorithms. J. Acoustical Society of America. 1991; 89:1383–1393. DOI:10.1121/1.400539
  9. 9. Lee T-W., Koehler B-U. Blind Source Separation of Nonlinear Mixing Modes. In: Proceedings of the IEEE Signal Processing Society Workshop (1997). Amelia Island, FL. USA: IEEE; 1997. pp. 406–415. doi: 10.1109/NNSP.1997.622422
  10. 10. Lee T-W., Orglmeister R. A Contextual Blind Separation of Delayed and Convolved Sources. IEEE ICASSP’97; (1997). pp. 1199–1202. DOI: 10.1109/ICASSP.1997.596159
  11. 11. Lee T-W., Bell A., Lambert R. Blind Separation of Convolved and Delayed Sources. Advance in Neural Information Processing System. MIT Press. 1997
  12. 12. Lee T-W., Bell A. J., Orglmeister R. Blind Source Separation of Real Word Signals. IEEE ICNN. Houston, USA; (1997). 2129–2134. DOI: 10.1109/ICNN.1997.614235
  13. 13. Wang D. L., Brown G. J. Separation of Speech From Interfering Sounds Based on Oscillatory Correlation. IEEE Transaction on Neural Networks. Vol. 10: No. 3. (May 1999), 684–697. DOI: 10.1109/72.761727
  14. 14. Leman M. The Theory of Tone Semantics: Concept, Foundation, and Application, Minds and Machines. 2(4): (1992); pp. 345–363.
  15. 15. Patel A.D., Gibson E., Ratner J., Besson M., Holcomb P.J. Processing Grammatical Relations in Music and Language: An Event-Related Potential (ERP) Study. Proceedings of the Fourth International Conference on Music Perception and Cognition. Montreal: McGill University. (1996). 337–342
  16. 16. Stevens C., Latimer C. A Comparison of Connectionist Models of Music Recognition and Human Performance. Minds and Machines, 2(4): (1992); pp. 379–400
  17. 17. Weigend A.S. Connectionism for Music and Audition. In J. Cowan, G. Tesauro & J. Alspector (Eds.), Advances in Neural Information Processing Systems 6, San Francisco: Morgan Kaufmann. (1994); pp. 1163–1164
  18. 18. Anagnostopoulou C., Westermann G. Classification in Music: A Computational Model for Paradigmatic Analysis. Proceedings of the International Computer Music Conference, San Francisco, (1997), 125–128
  19. 19. Takigawa I., Toyama J., Shimbo M. A Modified LEGION using a spectrogram for speech segregation. IEEE. (1999); I526-I531. DOI: 10.1109/ICSMC.1999.814147
  20. 20. Andre J. W., Kouwe V. D., Wang D., Brown G. J. A Comparison of Auditory and blind Separation Techniques for speech segregation. IEEE transaction on speech and audio processing. 9(3): (March 2001); pp. 189–195. DOI: 10.1109/89.905993
  21. 21. Wang D. L., Brown G. J. Speech Segregation on Sound Localization. IEEE, (2001), 2861–2866
  22. 22. Belouchrani A., Aben-Meraim K., Cardoso J. F., Moulines E. A Blind Source Separation Technique Using Second Order Statistics. IEEE Trans. Signal processing, vol. 45, (Feb. 1997), pp. 434–444. DOI: 10.1109/78.554307
  23. 23. Govindarajan K.K., Grossberg S., Wyse L.L., Cohen M.A. A Neural Network Model of Auditory Scene Analysis and Source Segregation. Technical Report CAS/CNS-TR-94-039, Boston University, Dept. of Cognitive and Neural Systems, 1994
  24. 24. Kahrs M., Brandenburg K. Application of Digital Signal Processing to Audio and Acoustics. Kluwer Academic Publishers, Boston/Dordrecht/London, 1998
  25. 25. Backus J. The Acoustical Foundations of Music. 2nd, W. W. Norton & Company, 1977
  26. 26. Gang D., Lehmann D., Wagner N. Harmonizing Melodies in Real-Time: The Connectionist Approach. Proceedings of the International Computer Music Conference, San Francisco, (1997), pp. 27–31
  27. 27. Kaipainen M., Toiviainen P., Louhivuori J. A Self-Organizing Map that Recognizes and Generates Melodies. In P. Pylkkänen & P. Pylkkö (Eds.), New Directions in Cognitive Science, (1995), 286–315
  28. 28. Port R., Anderson S. Recognition of Melody Fragments in Continuously Performed Music. Proceedings of the Eleventh Annual Conference of the Cognitive Science Society, Hillsdale, NJ: Erlbaum Associates, (1989), 820–827
  29. 29. Toiviainen P. Modeling the Target-Note Technique of Bebop-Style Jazz Improvisation: An Artificial Neural Network Approach. Music Perception, 12(4), (1995), 399–413
  30. 30. Cook, N. A Guide to Musical Analysis. Oxford University Press, 1987
  31. 31. Roy, D. and Malamud, C. Speaker Identification Based Text to Audio Alignment for an Audio Retrieval System. IEEE ICASSP’97, vol. 2, Munich, Germany, (April 1997), 1099–1102
  32. 32. Beigi H., Maes S., Sorensen J., Chaudhari U. A Hierarchical Approach to Large-Scale Speaker Recognition. IEEE ICASSP’99, Phoenix, Arizona, March 1999
  33. 33. Rabiner L., Juang B. H. Fundamentals of Speech Recognition. Prentice-Hall, Englewood Cliffs, NJ, 1993
  34. 34. Kedem B. Spectral Analysis and Discrimination by Zero-Crossings. Proceedings of IEEE, Vol. 74, NO. 11, (Nov. 1986), 1477–1492
  35. 35. Bateman W. Introduction to Computer Music. John Wiley&sons, 1984
  36. 36. Fedor P. Principles of the Design of D-Neuronal Networks I: Net Representation for Computer Simulation of a Melody Compositional Process. International Journal of Neural Systems, 3(1), (1992), 65–73
  37. 37. Horner A., Goldberg D.E. Genetic Algorithms and Computer-Assisted Music Composition. In B. Alphonce & B. Pennycock (Eds.), Proceedings of the 1991 International Computer Music Conference, San Francisco, (1991 ), 479–482
  38. 38. McIlwain P. The Yuri Program: Computer Generated Music for Multi-Speaker Sound Systems. Proceedings of the ACMA Conference, Melbourne, Australia, (1995), 150–151
  39. 39. Ainsworth W. A. Speech Recognition by Machine. Peter Peregrinus Ltd., 1988
  40. 40. Muthusamy Y. K., Barnard E., Cole R. A. Reviewing Automatic Language Identification. IEEE Signal Processing Magazine, (October 1994), 33–41
  41. 41. Ladefoged P. Elements of Acoustic Phonetics. University of Chicago Press, 1962
  42. 42. Fry D. B. The Physics of Speech. Cambridge University Press, 1979
  43. 43. Beck D.L., Callaway S.L. Breakthroughs in signal processing and feedback reduction lead to better speech understanding. Hearing Review. 2019; 26(4) [Apr]:30–31
  44. 44. Linster C. Rhythm Analysis with Backpropagation. In R. Pfeifer, Z. Schreter, F. Fogelman-Soulie & L. Steels (Eds.), Connectionism in Perspective, North-Holland: Elsevier Science Publishers B.V, (1989), 385–393
  45. 45. Jakobsson M. Machine-Generated Music with Themes. Proceedings of the International Conference on Artificial Neural Networks (Vol. 2) Amsterdam: Elsevier, (1992), 1645–1646
  46. 46. Griffith N.J.L. Connectionist Visualization of Tonal Structure. AI Review, 8, (1995), 393–408
  47. 47. Stevens C., Wiles J. Representations of Tonal Music: A Case Study in the Development of Temporal Relationships. In M.C. Mozer, P. Smolensky, D.S. Touretzky, J.E. Elman & A.S. Weigend (Eds.), Proceedings of the Connectionist Models Summer School, Hillsdale, NJ: Erlbaum, (1993), 228–235
  48. 48. Young P. H. Electrical Communication Techniques. 2nd, MERRILL, 1990
  49. 49. Laine P. Generating Musical Patterns Using Mutually Inhibited Artificial Neurons. Proceedings of the International Computer Music Conference, San Francisco, (1997), 422–425
  50. 50. Leman M. Symbolic and Subsymbolic Description of Music. In G. Haus (Ed.), Music Processing, New York: Oxford University Press, (1993), 119–164
  51. 51. Lischka C. Understanding Music Cognition: A Connectionist View. In G. De Poli, A. Piccialli & C. Roads (Eds.), Representations of Musical Signals, Cambridge, MA: MIT Press, (1991), 417–445
  52. 52. Griffith N., Todd P. M. Musical Networks. Bradford Books The MIT Press, 1999
  53. 53. Pierce J. R. The Science of Musical Sound. 3rd Ed., W.H.Freeman and company, 1996
  54. 54. Lerdahl F. and Jackendoff, R., A Generative Theory of Tonal Music. MIT Press, Cambridge, 1983
  55. 55. Monelle R. Linguistics and Semiotics in Music. Harwood Academic Publishers, 1992
  56. 56. Gang D., Berger J. Modeling the Degree of Realized Expectation in Functional Tonal Music: A Study of Perceptual and Cognitive Modeling Using Neural Networks. In D. Rossiter (Ed.), Proceedings of the International Computer Music Conference, San Francisco, (1996), 454–457
  57. 57. Bharucha J. Tonality and Expectation. In R. Aiello (Ed.), Musical Perceptions, New York: Oxford University Press, (1994 ), 213–239
  58. 58. Feiten B., Ungvary T. “Organizing Sounds with Neural Nets”, Int. Computer Music Conference, San Francisco, (1991 ), 441–443
  59. 59. Foote J. T. Content-Based Retrieval of Music and Audio. SPIE’97, (1997), 138–147
  60. 60. Saunders J. Real-Time Discrimination of Broadcast Speech/Music. IEEE ICASSP’96, (1996), 993–996
  61. 61. El-maleh K., Samoulian A., Kabal P. Frame-Level Noise Classification in Mobile Environment. proc. IEEE Int. Conf. On Acoustics, Speech, Signal processing, Phoenix, Arizona, (March 1999), 237–240
  62. 62. El-maleh K., Klein M., Petrucci G., Kabal P. Speech/Music Discriminator for Multimedia Application. Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (Istanbul), (June 2000), 2445–2448
  63. 63. Benyamin M., Miriam F. Neural Network Based Model for Classification of Music Type. IEEE Cat., No. 95, (1995 ), 640–645
  64. 64. Hoyt J. D., Wechsler H. Detection of Human Speech Using Hybrid Recognition Models. IEEE, (1994 ), 330–333
  65. 65. Scheirer E., Slaney M. Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator. Proceedings of the 1997 International Conference on Acoustics, Speech, and Signal Processing (ICASSP97), Munich, Germany, April 1997
  66. 66. Chien J-T. Source Separation and Machine Learning. Elsevier Inc. 2019.
  67. 67. Pope S. T., Holm F., Kouznetsov A. Feature extraction and database design for music software. Proceedings of the International Computer Music Conference, (2004), 596–603
  68. 68. McKay C., Fujinaga I. Automatic genre classification using large high-level musical feature sets, Proceedings of the International Conference on Music Information Retrieval, (2004), 525–530
  69. 69. Essed S., Richard G., David B. Musical instrument recognition based on class pairwise feature selection. Proceedings of the International Conference on Music Information Retrieval, (2004), 560–568
  70. 70. Downie J. The scientific evaluation of music information retrieval systems: Foundations and future. Computer Music Journal, 28, 2, (2004), 12–33
  71. 71. West K., Cox, S. Finding an Optimal Segmentation for Audio Genre Classification. Proceedings of the 6th Int. Symposium on Music Information Retrieval, University of London, (2005), 680–685
  72. 72. Tzanetaki G. Music Information Retrieval. ICASSP2005, Tutorial TUT-5, Philadelphia, 2005
  73. 73. West C., Cox S. Features and classifiers for the automatic classification of musical audio signals. Proceedings of the International Conference on Music Information Retrieval, (2004), 531–537
  74. 74. Yang X-S. Introduction to Algorithms for Data Mining and Machine Learning. Elsevier Inc. 2019.
  75. 75. Kotu V. Data Science: Concepts and Practice. Elsevier Inc. 2019.
  76. 76. Mimilakis S. I., Drossos K., Cano E. Schuller G. Examining the Mapping Functions of Denoising Autoencoders in Singing Voice Separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing. vol. 28, pp. 266–278, 2020. DOI: 10.1109/TASLP.2019.2952013
  77. 77. Sharma G., Umapathy K., Krishnan S. Trends in audio signal feature extraction methods. Applied Acoustics. Elsevier Ltd. Volume 158, 15 January 2020.
  78. 78. Al-Shoshan A., Al-Atiyah A., Al-Mashouq K. A Three-Level Speech, Music, and Mixture Classifier. Journal of King Saud University - Engineering Sciences. Volume 16, No. 2, (1424), 319–332
  79. 79. Al-Shoshan, A.I., “A Classification of Music, Speech and Mixture Signals Via Fuzzy Logic,” The 28th International Conference on Computers and Their Applications, (CATA-2013), Honolulu, Hawaii, USA, pp. 117–122, March 4–6, 2013
  80. 80. Berger J., Gang D. A Neural Network Model of Metric Perception and Cognition in the Audition of Functional Tonal Music. Proceedings of the 1997 International Computer Music Conference, San Francisco, (1997), 23–26
  81. 81. Al-Shoshan, A.I., “Audio Signal Discrimination Using Evolutionary Spectrum,” International Journal of Computers and Applications, Volume 31, No. 2, pp. 69–73, 2009
  82. 82. Toiviainen P., Kaipainen M., Louhivuori J. Musical Timbre: Similarity Ratings Correlate with Computational Feature Space Distances. Journal of New Music Research, 24(3), (1995), 282–298
  83. 83. Jin H., Kubala F., Schwartz R. Automatic Speaker Clustering. Proc. of the Speech Recognition Workshop, (1997 ), 108–111
  84. 84. Meddis R., Hewitt M. Modeling the Identification of Concurrent Vowels with Different Fundamental Frequency. J. Acoust. Soc. Am., vol. 91, (1992), 233–245
  85. 85. Raphael C. “Automatic Segmentation of Acoustic Musical Signals Using Hidden Markov Models”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.21, No. 4, April 1999
  86. 86. Hyvarinen, A. and Oja, E. Independent Component Analysis: Algorithms and Applications. Int. J. of Neural Networks, April 1999
  87. 87. Akarte N.J. Music Composition Using Neural Networks. Master’s thesis, University of Nevada, Reno, 1992
  88. 88. Barnard E., Cole R.A., Vea M.P., Alleva F.A. Pitch Detection with a Neural-Net Classifier. IEEE Transactions on Signal Processing, 39(2), (1991), 298–307
  89. 89. Bellgard M.I., Tsang C.P. Harmonizing Music Using a Network of Boltzmann Machines. In Proceedings of the Fifth Annual Conference of Artificial Neural Networks and Their Applications (Neuro-Nimes), (1992), 321–332
  90. 90. Bellgard M.I., Tsang C.P. “On the use of an Effective Boltzmann Machine for Musical Style Recognition and Harmonization”, Proceedings of the International Computer Music Conference, San Francisco, (1996), 461–464
  91. 91. Berger J., Gang D. Modeling Musical Expectations: A Neural Network Model of Dynamic Changes of Expectation in the Audition of Functional Tonal Music. Proceedings of the Fourth International Conference on Music Perception and Cognition, Montreal: McGill University, (1996), 373–378
  92. 92. Bharucha J. Neural Net Modeling of Music. Proceedings of the First Workshop on Artificial Intelligence and Music, Menlo Park, CA, (1988), 173–182
  93. 93. Bharucha J. Neural Networks and Perceptual Learning of Tonal Expectancies. Proceedings of the First International Conference on Music Perception and Cognition, Kyoto: Kyoto City University of Arts, (1989), 81–86
  94. 94. Bharucha J. Pitch, Harmony and Neural Nets: A Psychological Perspective. In P.M. Todd & D.G. Loy (Eds.), Music and Connectionism, Cambridge, MA: MIT Press, (1991), 84–99
  95. 95. Bharucha J., Olney K.L. Tonal Cognition, Artificial Intelligence and Neural Nets. Contemporary Music Review, Vol. 4, (1989), 341–356
  96. 96. Bresin R, Vedovetto A. Neural Networks for Musical Tones Compression, Control, and Synthesis. In Proceedings of the International Computer Music Conference, San Francisco, (1994), 368–371
  97. 97. Bresin R., Vedovetto A. Neural Networks for the Compression of Musical Tones and for the Control of Their Resynthesis. Proceedings of the IEEE-SP International Symposium on Time-Frequency and Time-Scale Analysis, 1994
  98. 98. Carpinteiro O. A Neural Model to Segment Musical Pieces. In E. Miranda (Ed.), Proceedings of the Second Brazilian Symposium on Computer Music, Fifteenth Congress of the Brazilian Computer Society, (1995), 114–120
  99. 99. Ciaccia P., Lugli F., Maio D. Using Neural Networks to Perform Harmonic Analysis in Music. The Fifth Italian Workshop on Neural Nets, WIRN VIETRI-92, Singapore, (1992), 273–279
  100. 100. Cosi P., DePoli G., Lauzzana G. Auditory Modeling and Self-Organizing Neural Networks for Timbre Classification. Journal of New Music Research, 23(1), (1994), 71–98
  101. 101. Fedor P. Principles of the Design of D-Neuronal Networks II: Composing Simple Melodies. International Journal of Neural Systems, 3(1), (1992), 75–82
  102. 102. Feiten B., Guenzel S. Automatic Indexing of a Sound Data Base Using Self-Organizing Neural Nets. Computer Music Journal, 18(3), (1994), 53–65
  103. 103. Feulner J. Learning the Harmonies of Western Tonal Music Using Neural Networks. Proceedings of the International Symposium on Computer and Information Sciences VII, Paris: EHEI Press, (1992), 303–307
  104. 104. Feulner J. Neural Networks that Learn and Reproduce Various Styles of Harmonization. Proceedings of the International Computer Music Conference, San Francisco, (1993), 236–239
  105. 105. Gang D., Lehmann D. An Artificial Neural Net for Harmonizing Melodies. Proceedings of the International Computer Music Conference, San Francisco, (1995), 440–447
  106. 106. Gjerdingen R.O. Categorization of Musical Patterns by Self-Organizing Neuronlike Networks. Music Perception, 7(4), (1990), 339–370
  107. 107. Laden B. A Parallel Learning Model for Pitch Perception. Journal of New Music Research, 23(2), (1994), 133–144
  108. 108. Laden B., Keefe B.H. The Representation of Pitch in a Neural Net Model of Pitch Classification. Computer Music Journal, 13(4), (1989), 12–26. Also in P.M. Todd & D.G. Loy (Eds.), Music and Connectionism, Cambridge, MA: MIT Press, (1991), 64–78
  109. 109. Leman M. Artificial Neural Networks in Music Research. In A. Marsden & A. Pople (Eds.), Computer Representations and Models in Music, London: Academic Press, (1991), 265–301
  110. 110. Mencl W.E. Effects of Tuning Sharpness on Tone Categorization by Self-Organizing Neural Networks. Proceedings of the Fourth International Conference on Music Perception and Cognition, Montreal: McGill University, (1996), 217–218
  111. 111. Mourjopoulos J.N., Tsoukalas D.E. Neural Network Mapping to Subjective Spectra of Music Sounds. Journal of the Audio Engineering Society, 40(4), (1992), 253–259
  112. 112. Cohen M.A., Grossberg S., Wyse L.L. A Spectral Network Model of Pitch Perception. Journal of the Acoustical Society of America, 498(2), (1995), 862–879.
  113. 113. Ohya K. A Sound Synthesis by Recurrent Neural Network. In E. Michie (Ed.), Proceedings of the International Computer Music Conference, San Francisco, (1995), 420–423
  114. 114. Palmieri F. Learning Binaural Sound Localization through a Neural Network, Proceedings of the IEEE Seventeenth Annual Northeast Bioengineering Conference, (1991), 13–14
  115. 115. Röbel A. Neural Networks for Modeling Time Series of Musical Instruments. In E. Michie (Ed.), Proceedings of the International Computer Music Conference, San Francisco, (1995), 424–428
  116. 116. Röbel A. Neural Network Modeling of Speech and Music Signals. In M.C. Mozer, M.I. Jordan & T. Petsche (Eds.), Advances in Neural Information Processing Systems 9, Cambridge, MA: MIT Press, 1997
  117. 117. Sano H., Jenkins K.B. A Neural Network Model for Pitch Perception. Computer Music Journal, 13(3), (1989), 41–48. Also in P.M. Todd & D.G. Loy (Eds.), Music and Connectionism, Cambridge, MA: MIT Press, (1991), 42–49
  118. 118. Taylor I. Artificial Neural Network Types for the Determination of Musical Pitch. Unpublished doctoral thesis, University of Wales, College of Cardiff, Dept. of Physics, 1994
  119. 119. Taylor I. J. Greenhough, M. Neural Network Pitch Tracking Over the Pitch Continuum. In E. Michie (Ed.), Proceedings of the International Computer Music Conference, San Francisco, (1995), 432–435
  120. 120. Trubitt D.R., Todd P.M. The Computer Musician: Neural Networks and Computer Music. Electronic Musician, 7(1), (1991), 20–24
  121. 121. Walpole R. E., Myers R. H. Probability and Statistics for Engineer and Scientists. 5th Ed., Macmillan Publishing, 1993
  122. 122. Bogert B. P., Healy M. J. R., Tukey J. W. The Quefrency Alanysis of Time Series for Echoes: Cepstrum, Pseudo-autocovariance, Cross-Cepstrum, and Saphe Cracking. John Wiley and Sons, New York, (1963), 209–243
  123. 123. Eronen A., Klapuri A. Musical Instrument Recognition Using Cepstral Coefficients and Temporal Features. Proc. ICASSP 2000
  124. 124. Cosi P., DePoli G., Prandoni P. “Timbre characterization with mel-cepstrum and neural nets”, Proceedings of the International Computer Music Conference, (1994), 42–45
  125. 125. Griffith N. J. L. Modeling the Influence of Pitch Duration on the Induction of Tonality from Pitch-use. Proceedings of the International Computer Music Conference, San Francisco, (1994), 35–37
  126. 126. Taylor I., Greenhough M. An Object Oriented ARTMAP System for Classifying Pitch. Proceedings of the International Computer Music Conference, San Francisco, (1993), 244–247
  127. 127. Mu G., Wang D. L. An Extended Model for Speech Segregation. Proceeding of IEEE, (2001), 1089–1094
  128. 128. Priestley M. B. Non-linear and Non-stationary Time Series Analysis. New York, NY:Academic Press, 1988
  129. 129. Al-Shoshan A.I. LTV System Identification Using the Time-Varying Autocorrelation Function and Application to Audio Signal Discrimination. ICSP02, Bejing, China, 2002. DOI: 10.1109/ICOSP.2002.1181036
  130. 130. Scarborough D.L., Miller B.O., Jones, J.A. Connectionist Models for Tonal Analysis. Computer Music Journal, 13(3), (1989), 49–55. Also in P.M. Todd & D.G. Loy (Eds.), Music and Connectionism, Cambridge, MA: MIT Press, (1991), 54–60
  131. 131. Shuttleworth T., Wilson R., A Neural Network for Triad Classification. In E. Michie (Ed.), Proceedings of the International Computer Music Conference, San Francisco, (1995), 428–431
  132. 132. Sergent J. Mapping the Musician Brain, Human Brain Mapping, (1993), 20–38
  133. 133. Rossignol S., Rodet X., Soumagne J., Collette L., Depalle P. Feature extraction and temporal segmentation of acoustic signals. Proceedings of the International Computer Music Conference, 1998
  134. 134. Scarborough D.L., Miller B.O., Jones J.A. On the Perception of Meter. In M. Balaban, K. Ebcioglu & O. Laske (Eds.), Understanding Music with AI: Perspectives in Music Cognition, Cambridge, MA: MIT Press, (1992), 427–447
  135. 135. Magron P., Virtanen T. Online Spectrogram Inversion for Low-Latency Audio Source Separation. IEEE Signal Processing Letters. vol. 27, pp. 306–310, 2020. DOI: 10.1109/LSP.2020.2970310
  136. 136. Wang D. L. Primitive Auditory Segregation Based on Oscillator Correlation. Cognit. Sci., vol. 20, (1996), 409–456
  137. 137. Israr M., Khan M. S., Khan K. Speech Sources Separation Based on Models of Interaural Parameters and Spatial Properties of Room. International Journal of Engineering Works. 7(1): January 2020; pp. 22–26.
