Open access peer-reviewed chapter

Convolutional Neural Networks for Raw Speech Recognition

By Vishal Passricha and Rajesh Kumar Aggarwal

Submitted: April 24th, 2018 | Reviewed: July 6th, 2018 | Published: December 12th, 2018

DOI: 10.5772/intechopen.80026

Abstract

State-of-the-art automatic speech recognition (ASR) systems map the speech signal into its corresponding text. Traditional ASR systems were based on the Gaussian mixture model (GMM). The emergence of deep learning has drastically improved the recognition rate of ASR systems, and such systems are replacing traditional ones. They can also be trained in an end-to-end manner. End-to-end ASR systems are gaining popularity due to their simplified model-building process and their ability to map speech directly into text without any predefined alignments. Three major types of end-to-end architectures for ASR are attention-based methods, connectionist temporal classification (CTC), and the convolutional neural network (CNN)-based direct raw speech model. In this chapter, a CNN-based acoustic model for the raw speech signal is discussed. It establishes the relation between the raw speech signal and phones in a data-driven manner. The relevant features and the classifier are jointly learned from the raw speech. The raw speech is processed by the first convolutional layer to learn a feature representation. The output of the first convolutional layer, that is, the intermediate representation, is more discriminative and is further processed by the remaining convolutional layers. This system uses only a few parameters and performs better than traditional cepstral-feature-based systems. The performance of the system is evaluated on TIMIT, where it is reported to be similar to that of MFCC-based systems.

Keywords

  • ASR
  • attention-based model
  • connectionist temporal classification
  • CNN
  • end-to-end model
  • raw speech signal

1. Introduction

An ASR system has two important tasks: phoneme recognition and whole-word decoding. In ASR, the relationship between the speech signal and phones is established in two different steps [1]. In the first step, useful features are extracted from the speech signal on the basis of prior knowledge. This phase is known as the information selection or dimensionality reduction phase; the dimensionality of the speech signal is reduced by selecting information based on task-specific knowledge. Highly specialized features like MFCC [2] are the preferred choice in traditional ASR systems. In the second step, discriminative models estimate the likelihood of each phoneme. Finally, the word sequence is recognized using dynamic programming techniques. A deep learning system can map the acoustic features into the spoken phonemes directly, and a phoneme sequence is then easily generated from the frames using frame-level classification.

End-to-end systems, on the other hand, perform the mapping from acoustic frames to phones in a single step. End-to-end training means that all the modules are learned simultaneously. Advanced deep learning methods make it possible to train the system in an end-to-end manner and to train it directly on raw signals, i.e., without hand-crafted features. Therefore, the ASR paradigm is shifting from cepstral features like MFCC [2] and PLP [3] to discriminative features learned directly from raw speech. An end-to-end model may take the raw speech signal as input and generate phoneme class conditional probabilities as output. The three major types of end-to-end architectures for ASR are the attention-based method, connectionist temporal classification (CTC), and the CNN-based direct raw speech model.

Attention-based models directly transcribe the speech into phonemes. The attention-based encoder-decoder uses a recurrent neural network (RNN) to perform sequence-to-sequence mapping without any predefined alignment. In this model, the input sequence is first transformed into a fixed-length vector representation, and the decoder then maps this fixed-length vector into the output sequence. The attention-based encoder-decoder is well suited to learning the mapping between variable-length input and output sequences. Chorowski and Jaitly proposed a speaker-independent sequence-to-sequence model and achieved 10.6% WER without a separate language model and 6.7% WER with a trigram language model on the Wall Street Journal dataset [4]. In attention-based systems, the alignment between the acoustic frames and the recognized symbols is performed by the attention mechanism, whereas the CTC model uses conditional independence assumptions to efficiently solve sequential problems by dynamic programming. The attention model has shown higher performance than the CTC approach because it uses the history of the target characters without any conditional independence assumptions.

In another line of work, a CNN-based acoustic model proposed by Palaz et al. [5, 6, 7] processes the raw speech directly as input. This model consists of two stages: a feature learning stage, i.e., several convolutional layers, and a classifier stage, i.e., fully connected layers. Both stages are learned jointly by minimizing a cost function based on relative entropy. In this model, the information is extracted by the filters of the first convolutional layer and modeled between the first and second convolutional layers. In the classifier stage, the learned features are classified by fully connected layers and a softmax layer. This approach is reported to give comparable or better performance than a traditional cepstral-feature-based system followed by ANN training for phoneme recognition on the TIMIT dataset.

This chapter is organized as follows: Section 2 reviews related work in the field of ASR. Section 3 covers the various architectures of ASR. Section 4 presents a brief introduction to CNNs. Section 5 explains the CNN-based direct raw speech recognition model. In Section 6, available experimental results are shown. Finally, Section 7 concludes this chapter with a brief discussion.

2. Related work

Traditional ASR systems leveraged the GMM/HMM paradigm for acoustic modeling. The GMM efficiently processes the vectors of input features and estimates emission probabilities for each HMM state, while the HMM efficiently normalizes the temporal variability present in the speech signal. The combination of the HMM and a language model is used to estimate the most likely sequence of phones. A discriminative objective function is used to improve the recognition rate of the system through discriminative fine-tuning methods [8]. However, the GMM has a shortcoming: it is unable to model data that lie close to the class boundaries. Artificial neural networks (ANNs) can learn much better models of such boundary data. Deep neural networks (DNNs) as acoustic models have tremendously improved the performance of ASR systems [9, 10, 11]. Generally, the discriminative power of the DNN is used for phoneme recognition, and for the decoding task, the HMM is the preferred choice. DNNs have many hidden layers with a large number of nonlinear units and produce a very large number of outputs. The benefit of this large output layer is that it accommodates the large number of HMM states. DNN architectures have densely connected layers and are therefore more prone to overfitting. Secondly, features with local correlations become difficult for such architectures to learn. In [12], speech frames are classified into clustered context-dependent states using DNNs. In [13, 14], a GMM-free DNN training process is proposed by the researchers. However, the GMM-free process demands iterative procedures such as building decision trees and generating forced alignments. DNN-based acoustic models are gaining much popularity in large-vocabulary speech recognition tasks [10], but components like the HMM and the n-gram language model remain the same as in their predecessors.

GMM- or DNN-based ASR systems perform the task in three steps: feature extraction, classification, and decoding, as shown in Figure 1. Firstly, the short-term signal $s_t$ is processed at time $t$ to extract the features $x_t$. These features are provided as input to the GMM or DNN acoustic model, which estimates the class conditional probabilities $P(i \mid x_t)$ for each phone class $i \in \{1, \ldots, I\}$. The emission probabilities are as follows:

$$p(x_t \mid i) = \frac{P(i \mid x_t)\, p(x_t)}{P(i)} \propto \frac{P(i \mid x_t)}{P(i)}, \qquad \forall i \in \{1, \ldots, I\} \tag{1}$$

Figure 1.

General framework of automatic speech recognition system.

The prior class probability $P(i)$ is computed by counting on the training set.
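
To make Eq. (1) concrete, the following minimal sketch (an illustration under assumed array shapes, not the authors' code) computes scaled likelihoods from DNN posteriors and class priors counted on the training alignments; the names `posteriors`, `alignment_labels`, and `num_classes` are hypothetical.

```python
import numpy as np

def class_priors(alignment_labels, num_classes):
    # Estimate P(i) by counting phone-class occurrences in the training alignments.
    counts = np.bincount(alignment_labels, minlength=num_classes).astype(float)
    return counts / counts.sum()

def scaled_likelihoods(posteriors, priors, eps=1e-10):
    # Eq. (1): p(x_t | i) is proportional to P(i | x_t) / P(i), computed framewise.
    return posteriors / (priors[np.newaxis, :] + eps)

# Hypothetical usage: `post` has shape (T, I) for T frames and I phone classes.
labels = np.random.randint(0, 40, size=1000)        # stand-in training alignment
post = np.random.dirichlet(np.ones(40), size=5)     # stand-in DNN outputs, shape (5, 40)
lik = scaled_likelihoods(post, class_priors(labels, 40))
```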

A DNN is a feed-forward NN containing multiple hidden layers with a large number of hidden units. DNNs are trained using back-propagation methods and then discriminatively fine-tuned to reduce the gap between the desired output and the actual output. DNN/HMM-based hybrid systems are effective models that use a tri-phone HMM model and an n-gram language model [10, 15]. Traditional DNN/HMM hybrid systems have several independent components that are trained separately, such as the acoustic model, the pronunciation model, and the language model. In the hybrid model, the speech recognition task is factorized into several independent subtasks. Each subtask is handled independently by a separate module, which simplifies the objective. The classification task is much simpler in HMM-based models than classifying a set of variable-length sequences directly. Figure 2 shows the hybrid DNN/HMM phoneme recognition model.

Figure 2.

Hybrid DNN/HMM phoneme recognition.

On the other side, researchers have proposed end-to-end ASR systems that directly map the speech into labels without any intermediate components. With the advancements in deep learning, it has become possible to train the system in an end-to-end fashion. The high success rate of deep learning methods in vision tasks motivates researchers to focus on the classifier step for speech recognition. Such architectures are called deep because they are composed of many layers, in contrast to classical "shallow" systems. The main goal of an end-to-end ASR system is to simplify the conventional module-based ASR system into a single deep learning framework. Earlier systems used divide-and-conquer approaches to optimize each step independently, whereas deep learning approaches use a single architecture that leads to a more optimal system. End-to-end speech recognition systems directly map the speech to text without requiring a predefined alignment between acoustic frames and characters [16, 17, 18, 19, 20, 21, 22, 23, 24]. These systems are generally divided into three broad categories: attention-based models [19, 20, 21, 22], connectionist temporal classification [16, 17, 18, 25], and the CNN-based direct raw speech method [5, 6, 7, 26]. All these models have the capability to address the problem of variable-length input and output sequences.

Attention-based models are gaining much popularity in a variety of tasks like handwriting synthesis [27], machine translation [28], and visual object classification [29]. Attention-based models directly map the acoustic frames into character sequences. However, this task differs from other sequence-to-sequence tasks such as machine translation in that it involves much longer input sequences. The model generates a character based on the inputs and the history of the target characters. Attention-based models use an encoder-decoder architecture to perform the sequence mapping from speech feature sequences to text, as shown in Figure 3. Its extension, i.e., attention-based recurrent networks, has also been successfully applied to speech recognition. In noisy environments, these models perform poorly because the estimated alignment is easily corrupted by noise. Another issue with this model is that it is hard to train from scratch due to misalignment on longer input sequences. Sequence-to-sequence networks have also achieved many breakthroughs in speech recognition [20, 21, 22]. They can be divided into three modules: an encoding module that transforms sequences, an attention module that estimates the alignment between the hidden vectors and the targets, and a decoding module that generates the output sequence. To develop a successful sequence-to-sequence model, its limitations must be understood and prevented. Discriminative training is a different way of training that raises the performance of the system; it allows the model to focus on the most informative features, at the risk of overfitting.

Figure 3.

Attention-based ASR model.

End-to-end trainable speech recognition systems are an important application of attention-based models. The decoder network computes a matching score against the hidden states generated by the acoustic encoder network at each input time and processes these scores to form a temporal alignment distribution, which is then used to weight the corresponding encoder states. The difficulty of the attention mechanism in speech recognition is that the feature inputs and the corresponding letter outputs generally proceed in the same order with only small deviations within a word; however, the different lengths of the input and output sequences make it harder to track the alignment. The advantage of the attention mechanism is that no conditional independence assumptions (Markov assumptions) are required. The attention-based approach replaces the HMM with an RNN to perform the sequence prediction, and the attention mechanism automatically learns the alignment between the input features and the desired character sequence.

CTC techniques infer the speech-label alignment automatically. CTC [25] was developed for labeling unsegmented sequence data; Hannun et al. [17] first used it for decoding in Baidu's Deep Speech network. CTC uses dynamic programming [16] for the efficient computation of a strictly monotonic alignment, although graph-based decoding and a language model are required. CTC approaches use RNNs for feature extraction [28]. Graves et al. [30] used the CTC objective function in a deep bidirectional long short-term memory (LSTM) system. This model sums over all possible alignments between the input and output sequences during training rather than requiring a prior alignment.

Two different versions of beam search are adopted by [16, 31] for decoding CTC models. Figure 4 shows the working architecture of the CTC model. In this model, noisy and uninformative frames are discarded through the introduction of the blank label, which results in the optimal output sequence. CTC uses an intermediate label representation that allows blank labels, i.e., frames with no output label. CTC-based NN models show high recognition rates for both phoneme recognition [32] and LVCSR [16, 31]. A CTC-trained neural network combined with a language model offers excellent results [17].

Figure 4.

CTC model for speech recognition.
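
As an illustration of how the blank label and the collapsing of repeated labels determine the final output sequence, the following best-path (greedy) decoding sketch is assumed rather than taken from the chapter; full decoders use beam search with a language model, as noted above.

```python
import numpy as np

def ctc_greedy_decode(framewise_posteriors, blank=0):
    # framewise_posteriors: array of shape (T, U + 1), one distribution per frame.
    best_path = framewise_posteriors.argmax(axis=1)   # most likely label per frame
    collapsed, prev = [], None
    for label in best_path:
        if label != prev:                             # collapse consecutive repeats
            collapsed.append(int(label))
        prev = label
    return [l for l in collapsed if l != blank]       # drop the blank label

# Hypothetical usage with random posteriors over 5 letters plus blank (index 0).
print(ctc_greedy_decode(np.random.rand(20, 6)))
```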

End-to-end ASR systems perform well and achieve good results, yet they face two major challenges. The first is how to incorporate lexicons and language models into decoding; however, [16, 31, 33] have incorporated lexicons for searching paths. The second is that there is no shared experimental platform for benchmarking. End-to-end systems differ from traditional systems in both aspects: model architecture and decoding methods. Some efforts have also been made to model the raw speech signal with little or no preprocessing [34]. Palaz et al. [6] showed in their study that CNNs [35] can calculate the class conditional probabilities from the raw speech signal taken directly as input. Therefore, CNNs are a preferred choice for learning features from raw speech. The learned-feature process has two stages: initially, features are learned by the filters of the first convolutional layer, and then the learned features are modeled by the second and higher-level convolutional layers. An end-to-end phoneme sequence recognizer directly processes the raw speech signal as input and produces a phoneme sequence. The end-to-end system is composed of two parts: convolutional neural networks and a conditional random field (CRF). The CNN performs feature learning and classification, and the CRF is used for the decoding stage. CRFs, ANNs, multilayer perceptrons, etc. have been successfully used as decoders. The results on the TIMIT phone recognition task also confirm that the system effectively learns the features from raw speech and performs better than traditional systems that take cepstral features as input [36]. This model also produces good results for LVCSR [7].

3. Various architectures of ASR

In this section, a brief review of conventional GMM/DNN ASR, attention-based end-to-end ASR, and CTC is given.

3.1. GMM/DNN

An ASR system performs a sequence mapping from a T-length sequence of speech features, $X = \{x_t \in \mathbb{R}^D \mid t = 1, \ldots, T\}$, into an N-length word sequence, $W = \{w_n \in \mathcal{V} \mid n = 1, \ldots, N\}$, where $x_t$ represents the D-dimensional speech feature vector at frame $t$ and $w_n$ represents the word at position $n$ in the vocabulary $\mathcal{V}$.

The ASR problem is formulated within the Bayesian framework. In this method, an utterance is represented by some sequence of acoustic feature vectors $X$, derived from the underlying sequence of words $W$, and the recognition system needs to find the most likely word sequence, as given below [37]:

$$\hat{W} = \arg\max_{W} \, p(W \mid X) \tag{2}$$

In Eq. (2), the word sequence $W$ that maximizes $p(W \mid X)$ for the given feature vector $X$ is found. Using Bayes' rule, it can be written as

$$\hat{W} = \arg\max_{W} \, \frac{p(X \mid W)\, p(W)}{p(X)} \tag{3}$$

In Eq. (3), the denominator $p(X)$ is ignored as it is constant with respect to $W$. Therefore,

$$\hat{W} = \arg\max_{W} \, p(X \mid W)\, p(W) \tag{4}$$

where $p(X \mid W)$ represents the likelihood of the sequence of speech features given the words, and it is evaluated with the help of the acoustic model. $p(W)$ represents the prior knowledge about the sequence of words $W$, and it is determined by the language model. Current ASR systems, however, are based on a hybrid HMM/DNN [38], which is also derived using Bayes' theorem and introduces the HMM state sequence $S$ to factorize $p(W \mid X)$ into the following three distributions:

$$\arg\max_{W \in \mathcal{V}^{*}} \, p(W \mid X) \tag{5}$$
$$= \arg\max_{W \in \mathcal{V}^{*}} \, \sum_{S} p(X \mid S, W)\, p(S \mid W)\, p(W) \tag{6}$$
$$\approx \arg\max_{W \in \mathcal{V}^{*}} \, \sum_{S} p(X \mid S)\, p(S \mid W)\, p(W) \tag{7}$$

where $p(X \mid S)$, $p(S \mid W)$, and $p(W)$ represent the acoustic, lexicon, and language models, respectively. Equation (6) is changed into Eq. (7) by assuming that the observations depend only on the HMM state sequence, i.e., $p(X \mid S, W) \approx p(X \mid S)$.
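
The decision rule in Eqs. (4)-(7) can be illustrated with a small rescoring sketch: given a hypothetical list of candidate word sequences, the acoustic and language-model scores are combined in the log domain and the best-scoring sequence is returned. The scoring functions below are placeholders, not an actual acoustic or language model.

```python
def decode(candidates, acoustic_logprob, lm_logprob):
    # Eq. (4): W_hat = argmax_W  log p(X | W) + log p(W)
    return max(candidates, key=lambda W: acoustic_logprob(W) + lm_logprob(W))

# Placeholder scores for three candidate transcriptions of the same utterance.
scores_am = {"recognize speech": -12.0, "wreck a nice beach": -11.5, "recognise peach": -14.0}
scores_lm = {"recognize speech": -3.0, "wreck a nice beach": -7.5, "recognise peach": -9.0}
best = decode(scores_am.keys(), scores_am.get, scores_lm.get)
print(best)   # "recognize speech" wins once the language model is included
```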

3.1.1. Acoustic model $p(X \mid S)$

$p(X \mid S)$ can be further factorized using the probabilistic chain rule and a Markov assumption as follows:

$$p(X \mid S) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}, S) \tag{8}$$
$$\approx \prod_{t=1}^{T} p(x_t \mid s_t) \propto \prod_{t=1}^{T} \frac{p(s_t \mid x_t)}{p(s_t)} \tag{9}$$

In Eq. (9), the framewise likelihood function $p(x_t \mid s_t)$ is converted into the framewise posterior distribution $p(s_t \mid x_t)/p(s_t)$, which is computed using DNN classifiers via the pseudo-likelihood trick [38]. The Markov assumption in Eq. (9) is quite strong, as the contexts of the inputs and hidden states are not considered. This issue can be resolved using either recurrent neural networks (RNNs) or DNNs with long-context features. A framewise state alignment, which is provided by an HMM/GMM system, is required to train the framewise posterior.

3.1.2. Lexicon model $p(S \mid W)$

$p(S \mid W)$ can be further factorized using the probabilistic chain rule and a (first-order) Markov assumption as follows:

$$p(S \mid W) = \prod_{t=1}^{T} p(s_t \mid s_1, \ldots, s_{t-1}, W) \tag{10}$$
$$\approx \prod_{t=1}^{T} p(s_t \mid s_{t-1}, W) \tag{11}$$

An HMM state transition represents this probability. A pronunciation dictionary performs the conversion from $w_n$ to HMM states through a phoneme representation.

3.1.3. Language model $p(W)$

Similarly, $p(W)$ can be factorized using the probabilistic chain rule and an (m−1)th-order Markov assumption as an m-gram model, i.e.,

$$p(W) = \prod_{n=1}^{N} p(w_n \mid w_1, \ldots, w_{n-1}) \tag{12}$$
$$\approx \prod_{n=1}^{N} p(w_n \mid w_{n-m+1}, \ldots, w_{n-1}) \tag{13}$$

The limitation of the Markov assumption is addressed using a recurrent neural network language model (RNNLM) [39], but this increases the complexity of the decoding process. A combination of an RNNLM and an m-gram language model is generally used, working through a rescoring technique.
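
A minimal sketch of the m-gram approximation in Eq. (13) for m = 2 (a bigram model with add-one smoothing) is given below; the corpus and vocabulary size are placeholders for illustration.

```python
from collections import Counter

def train_bigram(sentences, vocab_size):
    # Count unigrams and bigrams, then return p(w_n | w_{n-1}) with add-one smoothing.
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        for prev, cur in zip(words[:-1], words[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    return lambda cur, prev: (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)

# Placeholder corpus: p("speech" | "recognize") comes out higher than p("beach" | "recognize").
prob = train_bigram([["recognize", "speech"], ["recognize", "speech"], ["nice", "beach"]], vocab_size=5)
print(prob("speech", "recognize"), prob("beach", "recognize"))
```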

3.2. Attention mechanism

The approach based on the attention mechanism does not make any Markov assumptions. It directly finds the posterior $p(C \mid X)$ on the basis of the probabilistic chain rule:

$$p(C \mid X) = \prod_{l=1}^{L} p(c_l \mid c_1, \ldots, c_{l-1}, X) \triangleq p_{\mathrm{att}}(C \mid X) \tag{14}$$

where $p_{\mathrm{att}}(C \mid X)$ represents the attention-based objective function. $p(c_l \mid c_1, \ldots, c_{l-1}, X)$ is obtained by

$$h_t = \mathrm{Encoder}(X) \tag{15}$$
$$a_{lt} = \begin{cases} \mathrm{ContentAttention}(q_{l-1}, h_t) \\ \mathrm{LocationAttention}(\{a_{l-1}\}_{t=1}^{T}, q_{l-1}, h_t) \end{cases} \tag{16}$$
$$r_l = \sum_{t=1}^{T} a_{lt}\, h_t \tag{17}$$
$$p(c_l \mid c_1, \ldots, c_{l-1}, X) = \mathrm{Decoder}(r_l, q_{l-1}, c_{l-1}) \tag{18}$$

Eq. (15) represents the encoder network and Eq. (18) represents the decoder network. $a_{lt}$ represents the soft alignment of the hidden vector $h_t$. Here, $r_l$ represents the weighted letter-wise hidden vector, computed by a weighted summation of the hidden vectors. The content-based attention mechanisms without and with convolutional features are denoted by $\mathrm{ContentAttention}(\cdot)$ and $\mathrm{LocationAttention}(\cdot)$, respectively.

3.2.1. Encoder network

The input feature vector $X$ is converted into a framewise hidden vector $h_t$ using Eq. (15). The preferred choice for the encoder network is the BLSTM, i.e.,

$$\mathrm{Encoder}(X) \triangleq \mathrm{BLSTM}_t(X) \tag{19}$$

It is to be noted that the computational complexity of the encoder network is reduced by subsampling the outputs [20, 21].
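
The following sketch (assumed layer sizes, using PyTorch) illustrates a BLSTM encoder whose outputs are subsampled by a factor of 2 to reduce the computational complexity mentioned above; it is not the exact encoder used in [20, 21].

```python
import torch
import torch.nn as nn

class SubsampledBLSTMEncoder(nn.Module):
    def __init__(self, input_dim=40, hidden_dim=320, num_layers=3, subsample=2):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden_dim, num_layers,
                             batch_first=True, bidirectional=True)
        self.subsample = subsample

    def forward(self, x):                    # x: (batch, T, input_dim) feature frames
        h, _ = self.blstm(x)                 # h_t = BLSTM_t(X), shape (batch, T, 2 * hidden_dim)
        return h[:, ::self.subsample, :]     # keep every 2nd frame to shorten the sequence

# Hypothetical usage: 100 frames of 40-dimensional features are reduced to 50 hidden vectors.
h = SubsampledBLSTMEncoder()(torch.randn(1, 100, 40))
print(h.shape)   # torch.Size([1, 50, 640])
```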

3.2.2. Content-based attention mechanism

$\mathrm{ContentAttention}(\cdot)$ is defined as

$$e_{lt} = g^{\mathsf{T}} \tanh\!\big(\mathrm{Lin}(q_{l-1}) + \mathrm{LinB}(h_t)\big) \tag{20}$$
$$a_{lt} = \mathrm{Softmax}\big(\{e_{lt}\}_{t=1}^{T}\big) \tag{21}$$

$g$ represents a learnable parameter vector, and $\{e_{lt}\}_{t=1}^{T}$ is a T-dimensional vector. $\tanh(\cdot)$ and $\mathrm{Lin}(\cdot)$ represent the hyperbolic tangent activation function and a linear layer with learnable matrix parameters, respectively.
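
A minimal PyTorch sketch of Eqs. (20) and (21), followed by the context vector of Eq. (17), is given below; the dimensions are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class ContentAttention(nn.Module):
    def __init__(self, dec_dim, enc_dim, att_dim):
        super().__init__()
        self.lin_q = nn.Linear(dec_dim, att_dim, bias=False)   # Lin(q_{l-1})
        self.lin_h = nn.Linear(enc_dim, att_dim)               # LinB(h_t)
        self.g = nn.Linear(att_dim, 1, bias=False)             # g^T

    def forward(self, q_prev, h):          # q_prev: (B, dec_dim), h: (B, T, enc_dim)
        e = self.g(torch.tanh(self.lin_q(q_prev).unsqueeze(1) + self.lin_h(h))).squeeze(-1)
        a = torch.softmax(e, dim=-1)                        # Eq. (21): alignment over frames
        r = torch.bmm(a.unsqueeze(1), h).squeeze(1)         # Eq. (17): r_l = sum_t a_lt * h_t
        return r, a

# Hypothetical usage with a batch of 2, 50 encoder frames, and assumed dimensions.
att = ContentAttention(dec_dim=256, enc_dim=640, att_dim=128)
r, a = att(torch.randn(2, 256), torch.randn(2, 50, 640))
```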

3.2.3. Location-aware attention mechanism

This is an extended version of the content-based attention mechanism that deals with location-aware attention. With $a_{l-1} = \{a_{l-1,t}\}_{t=1}^{T}$ substituted in Eq. (16), $\mathrm{LocationAttention}(\cdot)$ is represented as follows:

$$\{f_t\}_{t=1}^{T} = R * a_{l-1} \tag{22}$$
$$e_{lt} = g^{\mathsf{T}} \tanh\!\big(\mathrm{Lin}(q_{l-1}) + \mathrm{Lin}(h_t) + \mathrm{LinB}(f_t)\big) \tag{23}$$
$$a_{lt} = \mathrm{Softmax}\big(\{e_{lt}\}_{t=1}^{T}\big) \tag{24}$$

Here, $*$ denotes 1-D convolution along the input feature axis $t$, with the convolution parameter $R$, to produce the set of T features $\{f_t\}_{t=1}^{T}$.
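
A short sketch of Eq. (22) alone: the previous alignment $a_{l-1}$ is convolved along the frame axis with the parameter $R$ to produce the location features $\{f_t\}$, which then enter the score of Eq. (23). The kernel size and channel count below are assumptions.

```python
import torch
import torch.nn as nn

# R in Eq. (22), realized as a 1-D convolution over the frame axis (assumed 10 channels, width 101).
conv_R = nn.Conv1d(in_channels=1, out_channels=10, kernel_size=101, padding=50, bias=False)

def location_features(a_prev):             # a_prev: (B, T) previous attention weights
    f = conv_R(a_prev.unsqueeze(1))        # (B, 10, T): convolve a_{l-1} with R
    return f.transpose(1, 2)               # (B, T, 10): one feature vector f_t per frame

f = location_features(torch.softmax(torch.randn(2, 50), dim=-1))
print(f.shape)   # torch.Size([2, 50, 10])
```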

3.2.4. Decoder network

The decoder network is an RNN conditioned on the previous output $c_{l-1}$ and the hidden vector $q_{l-1}$. The LSTM is the preferred choice of RNN, and the decoder is represented as follows:

$$\mathrm{Decoder}(\cdot) \triangleq \mathrm{Softmax}\big(\mathrm{LinB}(\mathrm{LSTM}_l(\cdot))\big) \tag{25}$$

$\mathrm{LSTM}_l(\cdot)$ represents a unidirectional LSTM that generates the hidden vector $q_l$ as output:

$$q_l = \mathrm{LSTM}_l(r_l, q_{l-1}, c_{l-1}) \tag{26}$$

$r_l$ represents the letter-wise hidden (context) vector, and $c_{l-1}$, the output of the previous step, is taken as input together with it.
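
The following sketch (assumed dimensions and an assumed character embedding) illustrates one decoder step of Eqs. (25) and (26): an LSTM cell consumes the context vector $r_l$ together with an embedding of the previous character $c_{l-1}$, updates the hidden state $q_l$, and emits a distribution over characters.

```python
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    def __init__(self, vocab_size, enc_dim, dec_dim, embed_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # embeds c_{l-1}
        self.cell = nn.LSTMCell(enc_dim + embed_dim, dec_dim)  # LSTM_l(.) in Eq. (26)
        self.out = nn.Linear(dec_dim, vocab_size)              # LinB(.) before the softmax

    def forward(self, r_l, c_prev, state):                     # state = (q_{l-1}, cell_{l-1})
        x = torch.cat([r_l, self.embed(c_prev)], dim=-1)
        q_l, cell = self.cell(x, state)
        probs = torch.softmax(self.out(q_l), dim=-1)           # Eq. (18)/(25): p(c_l | c_<l, X)
        return probs, (q_l, cell)

# Hypothetical usage for a batch of 2 with a 30-character vocabulary.
step = AttentionDecoderStep(vocab_size=30, enc_dim=640, dec_dim=256)
state = (torch.zeros(2, 256), torch.zeros(2, 256))
probs, state = step(torch.randn(2, 640), torch.tensor([1, 5]), state)
```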

3.2.5. Objective function

The objective function of the attention model is computed from the sequence posterior

$$p_{\mathrm{att}}(C \mid X) \approx \prod_{l=1}^{L} p(c_l \mid c_1^{*}, \ldots, c_{l-1}^{*}, X) \triangleq p_{\mathrm{att}}^{*}(C \mid X) \tag{27}$$

where $c_l^{*}$ represents the ground truth of the previous characters. The attention-based approach is thus a combination of letter-wise objectives based on multiclass classification with the conditional ground-truth history $c_1^{*}, \ldots, c_{l-1}^{*}$ at each output $l$.

3.3. Connectionist temporal classification (CTC)

The CTC formulation is also based on Bayes' decision theory. It is to be noted that the L-length letter sequence $C$ is augmented with blank symbols $\langle b \rangle$ into the $(2L+1)$-length sequence

$$C' = (\langle b \rangle, c_1, \langle b \rangle, c_2, \ldots, \langle b \rangle, c_L, \langle b \rangle) = \{c'_l \in \mathcal{U} \cup \{\langle b \rangle\} \mid l = 1, 2, \ldots, 2L+1\} \tag{28}$$

In $C'$, $c'_l$ is always $\langle b \rangle$ when $l$ is an odd number and a letter when $l$ is an even number. Similar to the DNN/HMM model, a framewise letter sequence with the additional blank symbol,

$$Z = \{z_t \in \mathcal{U} \cup \{\langle b \rangle\} \mid t = 1, \ldots, T\}, \tag{29}$$

is also introduced. The posterior distribution, $p(C \mid X)$, can be factorized as

$$p(C \mid X) = \sum_{Z} p(C \mid Z, X)\, p(Z \mid X) \tag{30}$$
$$\approx \sum_{Z} p(C \mid Z)\, p(Z \mid X) \tag{31}$$

As in the DNN/HMM formulation, CTC also uses a Markov (conditional independence) assumption, i.e., $p(C \mid Z, X) \approx p(C \mid Z)$, to simplify the dependencies of the CTC acoustic model, $p(Z \mid X)$, and the CTC letter model, $p(C \mid Z)$.

3.3.1. CTC acoustic model

As with the DNN/HMM acoustic model, $p(Z \mid X)$ can be further factorized using the probabilistic chain rule and a Markov assumption as follows:

$$p(Z \mid X) = \prod_{t=1}^{T} p(z_t \mid z_1, \ldots, z_{t-1}, X) \tag{32}$$
$$\approx \prod_{t=1}^{T} p(z_t \mid X) \tag{33}$$

The framewise posterior distribution, $p(z_t \mid X)$, is computed from all the inputs, $X$, and is directly modeled using a bidirectional LSTM [30, 40]:

$$p(z_t \mid X) = \mathrm{Softmax}\big(\mathrm{LinB}(h_t)\big) \tag{34}$$
$$h_t = \mathrm{BLSTM}_t(X) \tag{35}$$

where $\mathrm{Softmax}(\cdot)$ represents the softmax activation function. $\mathrm{LinB}(\cdot)$ converts the hidden vector $h_t$ into a $(U+1)$-dimensional vector using learnable matrix and bias-vector parameters. $\mathrm{BLSTM}_t(\cdot)$ takes the full input sequence as input and produces the hidden vector $h_t$ at frame $t$.
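
A minimal PyTorch sketch of Eqs. (34) and (35) is given below: a BLSTM over the whole input followed by a linear layer with U + 1 outputs (letters plus blank) and a softmax. The layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class CTCAcousticModel(nn.Module):
    def __init__(self, input_dim=40, hidden_dim=320, num_letters=28):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden_dim, num_layers=4,
                             batch_first=True, bidirectional=True)
        self.lin_b = nn.Linear(2 * hidden_dim, num_letters + 1)   # U + 1 classes, blank included

    def forward(self, x):                        # x: (batch, T, input_dim)
        h, _ = self.blstm(x)                     # Eq. (35): h_t = BLSTM_t(X)
        return torch.log_softmax(self.lin_b(h), dim=-1)   # Eq. (34): log p(z_t | X) per frame

log_probs = CTCAcousticModel()(torch.randn(2, 100, 40))   # shape (2, 100, 29)
```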

3.3.2. CTC letter model

By applying Bayes’ decision theory probabilistic chain rule and Markov assumption, pZXcan be written as

$$p(C \mid Z) = \frac{p(Z \mid C)\, p(C)}{p(Z)} \tag{36}$$
$$= \prod_{t=1}^{T} p(z_t \mid z_1, \ldots, z_{t-1}, C)\, \frac{p(C)}{p(Z)} \tag{37}$$
$$\approx \prod_{t=1}^{T} p(z_t \mid z_{t-1}, C)\, \frac{p(C)}{p(Z)} \tag{38}$$

where $p(z_t \mid z_{t-1}, C)$ represents the state transition probability, $p(C)$ represents the letter-based language model, and $p(Z)$ represents the state prior probability. The CTC architecture thus incorporates a letter-based language model; it can also incorporate a word-based language model by using a letter-to-word finite state transducer during decoding [18]. CTC has the monotonic alignment property, i.e.,

when $z_{t-1} = c'_m$, then $z_t = c'_l$, where $l \geq m$.

The monotonic alignment property is an important constraint for speech recognition, so the ASR sequence-to-sequence mapping should follow it. This property is also satisfied by the HMM/DNN.

3.3.3. Objective function

The posterior, $p(C \mid X)$, is represented as

$$p(C \mid X) \approx \underbrace{\sum_{Z} \prod_{t=1}^{T} p(z_t \mid z_{t-1}, C)\, p(z_t \mid X)}_{\triangleq\, p_{\mathrm{ctc}}(C \mid X)} \cdot \frac{p(C)}{p(Z)} \tag{39}$$

The Viterbi method and the forward-backward algorithm are dynamic programming algorithms used to efficiently compute the summation over all possible $Z$. The CTC objective function $p_{\mathrm{ctc}}(C \mid X)$ is obtained by excluding $p(C)/p(Z)$ from Eq. (39).

The CTC formulation is thus quite similar to that of the HMM/DNN. The minor difference is that Bayes' rule is applied to $p(C \mid Z)$ instead of $p(W \mid X)$. It likewise has three distribution components, i.e., the framewise posterior distribution, $p(z_t \mid X)$; the transition probability, $p(z_t \mid z_{t-1}, C)$; and the letter model, $p(C)$, and it also uses Markov assumptions. Hence, it does not fully exploit the benefits of end-to-end ASR, but its character-level output representation still retains some of those benefits.
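
In practice, the summation over all alignments $Z$ in Eq. (39) is carried out by the forward-backward algorithm inside standard CTC loss implementations. The sketch below uses PyTorch's built-in CTC loss on placeholder tensors; it illustrates the objective, not the authors' training setup.

```python
import torch
import torch.nn as nn

T, B, U = 100, 2, 28                                        # frames, batch size, letters
log_probs = torch.randn(T, B, U + 1).log_softmax(-1).requires_grad_()   # framewise log p(z_t | X)
targets = torch.randint(1, U + 1, (B, 20))                  # letter indices (no blanks)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 20, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)               # forward-backward is run internally
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                             # gradients for training the acoustic model
```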

4. Convolutional neural networks

CNNs are popular variants of deep learning that are widely adopted in ASR systems. CNNs have several attractive properties, i.e., weight sharing, convolutional filters, and pooling, and have therefore achieved impressive performance in ASR. CNNs are composed of multiple convolutional layers. Figure 5 shows the block diagram of a CNN. LeCun and Bengio [41] describe the three stages of a convolutional layer, i.e., convolution, pooling, and nonlinearity.

Figure 5.

Block diagram of convolutional neural network.

Deep CNNs set a new milestone by achieving approximately human-level performance through advanced architectures and optimized training [42]. CNNs use nonlinear functions to directly process low-level data. They are capable of learning high-level features with high complexity and abstraction. Pooling is at the heart of CNNs; it reduces the dimensionality of a feature map. Maxout is a widely used nonlinearity that has shown its effectiveness in ASR tasks [43, 44].

Pooling is an important concept that transforms the joint feature representation into valuable information by keeping the useful information and eliminating insignificant information. Small frequency shifts, which are common in speech signals, are efficiently handled using pooling. Pooling also helps in reducing the spectral variance present in the input speech. It maps the input from p adjacent units into an output by applying a special function. After the element-wise nonlinearities, the features are passed through the pooling layer. This layer downsamples the feature maps coming from the previous layer and produces new feature maps with a condensed resolution, drastically reducing the spatial dimension of the input. It serves two main purposes. The first is that the number of parameters or weights is reduced by about 65%, thus lessening the computational cost. The second is that it helps control overfitting, i.e., the situation in which a model becomes overly tuned to the training examples.
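
A small sketch of the pooling operation described above, applied to an assumed feature map: max pooling with width 3 downsamples the time axis by a factor of 3, condensing the resolution of the feature maps.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool1d(kernel_size=3)          # stride defaults to the kernel width (non-overlapping)
feature_map = torch.randn(1, 40, 300)       # (batch, feature maps, frames) from a previous layer
pooled = pool(feature_map)
print(pooled.shape)                         # torch.Size([1, 40, 100]): condensed resolution
```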

5. CNN-based end-to-end approach

A novel acoustic model based on CNNs was proposed by Palaz et al. [5]; it is shown in Figure 6. In this model, the raw speech signal is segmented into windows $s_t^c = \{s_{t-c}, \ldots, s_t, \ldots, s_{t+c}\}$ with a context of 2c frames, spanning a window of $w_{in}$ milliseconds. The first convolutional layer learns useful features from the raw speech signal, and the remaining convolutional layers further process these features into useful information. After processing the speech signal, the CNN estimates the class conditional probability, i.e., $P(i \mid s_t^c)$, which is used to calculate the emission scaled likelihood $p(s_t^c \mid i)$. Several filter stages are present in the network before the classification stage; a filter stage is a combination of a convolutional layer, a pooling layer, and a nonlinearity. The joint training of the feature stage and the classifier stage is performed using the back-propagation algorithm.

Figure 6.

CNN-based raw speech phoneme recognition system.
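
The sketch below is a hypothetical PyTorch rendering of this architecture (the layer sizes are assumptions, not the configuration reported by Palaz et al.): a few filter stages, each a convolution, max pooling, and a nonlinearity, process the raw waveform window $s_t^c$, and fully connected layers with a softmax produce $P(i \mid s_t^c)$ for each phone class $i$.

```python
import torch
import torch.nn as nn

class RawSpeechCNN(nn.Module):
    def __init__(self, num_phones=39):
        super().__init__()
        # Filter stages: convolution + max pooling + nonlinearity, applied to the raw waveform.
        self.filter_stages = nn.Sequential(
            nn.Conv1d(1, 80, kernel_size=30, stride=10), nn.MaxPool1d(3), nn.Tanh(),
            nn.Conv1d(80, 60, kernel_size=7), nn.MaxPool1d(3), nn.Tanh(),
            nn.Conv1d(60, 60, kernel_size=7), nn.MaxPool1d(3), nn.Tanh(),
        )
        # Classifier stage: fully connected layers followed by a softmax over phone classes.
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(500), nn.Tanh(), nn.Linear(500, num_phones),
        )

    def forward(self, waveform):                  # waveform: (batch, 1, samples), the window s_t^c
        features = self.filter_stages(waveform)   # learned intermediate representation
        return torch.log_softmax(self.classifier(features), dim=-1)   # log P(i | s_t^c)

# Hypothetical usage: a 250 ms window at 16 kHz is 4000 samples.
log_post = RawSpeechCNN()(torch.randn(2, 1, 4000))
print(log_post.shape)   # torch.Size([2, 39])
```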

The end-to-end approach employs the following understanding:

  1. Speech signals are non-stationary in nature; therefore, they are processed in a short-term manner. Traditional feature extraction methods generally use a 20–40 ms sliding window. In the end-to-end approach, short-term processing of the signal is still required, but the size of the short-term window is taken as a hyperparameter that is automatically determined during training.

  2. Feature extraction is a filtering operation, because its components, such as the Fourier transform and the discrete cosine transform, are filtering operations. In traditional systems, filtering is applied in both frequency and time, so this factor is also considered when building the convolutional layers in the end-to-end system. Therefore, the number of filter banks and their parameters are taken as hyperparameters that are automatically determined during training.

  3. The short-term processing of the speech signal spreads the information across time. In traditional systems, this spread information is modeled by calculating temporal derivatives and contextual information. Therefore, the intermediate representation supplied to the classifier is calculated over a long time span of the input speech signal, and $w_{in}$, the size of this input window, is taken as a hyperparameter that is estimated during training.

The end-to-end model estimates $P(i \mid s_t^c)$ by processing the speech signal with minimal assumptions or prior knowledge.

6. Experimental results

In this model, a number of hyperparameters are used to specify the structure of the network. The number of hidden units in each hidden layer is very important; hence, it is taken as a hyperparameter. $w_{in}$ represents the time span of the input speech signal. $kW$ represents the kernel or temporal window width, and $dW$ represents the shift of the temporal window. $kW_{mp}$ represents the max-pooling kernel width, and $dW_{mp}$ represents the shift of the max-pooling kernel. The values of all hyperparameters are estimated during training based on frame-level classification accuracy on validation data. The ranges of the hyperparameters after validation are shown in Table 1.

Hyperparameter | Units | Range
Input window size (win) | ms | 100–700
Kernel width of the first ConvNet layer (kW1) | Samples | 10–90
Kernel width of the nth ConvNet layer (kWn) | Samples | 1–11
Number of filters per kernel (dout) | Filters | 20–100
Max-pooling kernel width (kWmp) | Frames | 2–6
Number of hidden units in the classifier | Units | 200–1500

Table 1.

Range of hyperparameters for the TIMIT dataset during validation.
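
A hypothetical configuration drawn from the ranges in Table 1 is shown below for orientation; only the 250 ms window and the 10 ms shift are reported in the text (Section 6), and the remaining values are assumptions.

```python
# Hypothetical hyperparameter choices within the validated ranges of Table 1.
hyperparameters = {
    "win_ms": 250,           # input window size (reported in the text)
    "dW_ms": 10,             # shift of the temporal window (reported in the text)
    "kW1_samples": 30,       # kernel width of the first ConvNet layer (assumed)
    "kWn_samples": 7,        # kernel width of the nth ConvNet layer (assumed)
    "dout_filters": 80,      # number of filters per kernel (assumed)
    "kWmp_frames": 3,        # max-pooling kernel width (assumed)
    "classifier_units": 500  # hidden units in the classifier (assumed)
}
```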

The experiments are conducted with three convolutional layers. The speech window size $w_{in}$ is taken as 250 ms with a temporal window shift $dW$ of 10 ms. Table 2 shows a comparison of existing end-to-end speech recognition models in terms of PER. The results of the experiments conducted on the TIMIT dataset for this model are compared with already existing techniques in Table 3. The main advantages of this model are that it uses only a few parameters and offers better performance. It also increases the generalization capability of the classifiers.

End-to-end speech recognition model | PER (%)
CNN-based speech recognition system using raw speech as input [7] | 33.2
Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks [36] | 32.4
Convolutional neural network-based continuous speech recognition using raw speech signal [6] | 32.3
End-to-end phoneme sequence recognition using convolutional neural networks [5] | 27.2
CNN-based direct raw speech model | 21.9
End-to-end continuous speech recognition using attention-based recurrent NN: First results [19] | 18.57
Toward end-to-end speech recognition with deep convolutional neural networks [44] | 18.2
Attention-based models for speech recognition [20] | 17.6
Segmental recurrent neural networks for end-to-end speech recognition [45] | 17.3

Table 2.

Comparison of existing end-to-end speech model in the context of PER (%).

Bold value and text represent the performance of the CNN-based direct raw speech model.

Methods | PER (%)
GMM-/HMM-based ASR system [46] | 34
CNN-based direct raw speech model | 21.9
Attention-based models for speech recognition [20] | 17.6
Segmental recurrent neural networks for end-to-end speech recognition [45] | 17.3
Combining time and frequency domain convolution in convolutional neural network-based phone recognition [47] | 16.7
Phone recognition with hierarchical convolutional deep maxout networks [48] | 16.5

Table 3.

Comparison of existing techniques with CNN-based direct raw speech model in the context of PER (%).

Bold value and text represent the performance of the CNN-based direct raw speech model.

7. Conclusion

This chapter discussed the CNN-based direct raw speech recognition model. This model directly learns the relevant representation from the speech signal in a data-driven manner and calculates the conditional probability for each phoneme class. In this approach, the CNN acoustic model consists of a feature stage and a classifier stage, and both stages are trained jointly. Raw speech is supplied as input to the first convolutional layer and is further processed by several convolutional layers. Classifiers like ANNs, CRFs, MLPs, or fully connected layers calculate the conditional probabilities for each phoneme class, after which decoding is performed using an HMM. This model shows performance similar to that of a conventional MFCC-based model.

© 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
