A VLSI Architecture for Output Probability and Likelihood Score Computations of HMM-Based Recognition Systems

the input to OPCs and LSCs, e.g., from 8-bit to 16-bit. We demonstrate a high-speed VLSI architecture that supports FastStoreBPP. We secondly show multiple store-based block parallel processing (MultipleStoreBPP) for OPCs and LSCs, and present a Viterbi scorer which supports MultipleStoreBPP. MultipleStoreBPP has high performance scalability obtained by further extending the bit length of the input to OPCs and LSCs, e.g., from 8-bit to 32-bit.


Introduction
Due to their effectiveness and efficiency for user-independent recognition, hidden Markov models (HMMs) are widely used in applications such as speech recognition (word recognition, connected word recognition and continuous speech recognition), lip-reading and gesture recognition. Output probability computations (OPCs) of continuous HMMs and likelihood score computations (LSCs) are the most time-consuming parts of HMM-based recognition systems.
High-speed VLSI architectures optimized for recognition tasks have been developed to support well-optimized HMM-based recognition systems (Mathew et al., 2003a; Nakamura et al., 2010; Yoshizawa et al., 2004; 2006; Kim & Jeong, 2007). Yoshizawa et al. investigated block-wise parallel processing (BPP) for OPCs and LSCs, and proposed a high-speed VLSI architecture for word recognition (Yoshizawa et al., 2002; 2004; 2006). Nakamura et al. investigated a BPP variant, store-based block parallel processing (StoreBPP), for OPCs, and proposed a high-speed VLSI architecture for OPCs (Nakamura et al., 2010). For OPCs and LSCs with StoreBPP, a Viterbi scorer for the StoreBPP architecture is required, but none has been presented yet. A naive application of a Viterbi scorer to the StoreBPP architecture requires many registers and reduces the advantage of using StoreBPP. Different BPPs require different Viterbi scorer architectures, so a Viterbi scorer suited to StoreBPP is needed for the development of well-optimized future HMM-based recognition systems.
In this chapter, we firstly show fast store-based block parallel processing (FastStoreBPP) for OPCs and LSCs, and present a Viterbi scorer which supports FastStoreBPP. FastStoreBPP exploits the full performance of StoreBPP by doubling the bit length of the input to OPCs and LSCs, e.g., from 8-bit to 16-bit. We demonstrate a high-speed VLSI architecture that supports FastStoreBPP. We secondly show multiple store-based block parallel processing (MultipleStoreBPP) for OPCs and LSCs, and present a Viterbi scorer which supports MultipleStoreBPP. MultipleStoreBPP has high performance scalability obtained by further extending the bit length of the input to OPCs and LSCs, e.g., from 8-bit to 32-bit. Compared with the StreamBPP (Yoshizawa et al., 2002; 2004; 2006) architecture, our FastStoreBPP and MultipleStoreBPP architectures have fewer registers and require less processing time. From a VLSI architectural viewpoint, a comparison shows the efficiency of the MultipleStoreBPP architecture through its efficient use of processing elements (PEs).
The remainder of this chapter is organized as follows: the structure of HMM-based recognition systems is described in Section 2, the FastStoreBPP architecture is introduced in Section 3, the MultipleStoreBPP architecture is introduced in Section 4, the architectures are evaluated in Section 5, and conclusions are presented in Section 6.

HMM-based recognition systems
2.1 HMM-based recognition hardware

Figure 1 shows the basic structure of the relevant part of HMM-based recognition hardware (Mathew et al., 2003a; Nakamura et al., 2010; Yoshizawa et al., 2002; 2004; 2006; Kim & Jeong, 2007). The OPC circuit and the Viterbi scorer for LSC work together as a recognition engine. The inputs to the OPC circuit are feature vectors of several dimensions and HMM parameters, which are stored in RAM and ROM, respectively, as shown in Fig. 1 (Nakamura et al., 2010; Yoshizawa et al., 2002; 2004; 2006). For a P-dimensional input feature vector O_t = (o_t1, o_t2, ..., o_tP), the log output probability of state j is

log b_j(O_t) = ω_j + ∑_{p=1}^{P} σ_jp (o_tp − µ_jp)^2,   (1)

where ω_j, σ_jp and µ_jp are the parameters of the Gaussian probability density function, which are precomputed (ω_j absorbs the normalization constant and σ_jp the scaled inverse variance) and stored in ROM. The OPC circuit computes log b_j(O_t) based on Eq. (1), where 1 ≤ j ≤ N and 1 ≤ t ≤ T. All HMM parameters ω_j, σ_jp, and µ_jp are stored in ROM and the input feature vectors are stored in RAM. The values of T, N, P, and the number of HMMs V differ for each recognition system. For one isolated word recognition system, T, N, P, and V are 86, 32, 38, and 800, respectively (Yoshizawa et al., 2004; 2006), and for another word recognition system, T, N, P, and V are 89, 12, 16 and 100 (Yoshizawa et al., 2002).
For the output probabilities log b_j(O_t), where 1 ≤ j ≤ N and 1 ≤ t ≤ T, the intermediate scores log δ_t(j) are given by

log δ_1(j) = log π_j + log b_j(O_1),   (2)
log δ_t(j) = max{log δ_{t−1}(j) + log a_{j,j}, log δ_{t−1}(j−1) + log a_{j−1,j}} + log b_j(O_t),   (3)

and the log-likelihood score log S* is obtained from the intermediate scores at t = T. All HMM parameters log π_j, log a_{j−1,j}, and log a_{j,j} are stored in ROM, and the Viterbi scorer computes log δ_t(j) based on Eqs. (2) and (3). A flowchart of OPCs and LSCs is shown in Fig. 2 (Yoshizawa et al., 2004; 2006). All HMM output probabilities are obtained by P · N · T · V calls of the partial computation of log b_j(O_t). Each partial computation of log b_j(O_t) performs four arithmetic operations in Eq. (1): an addition, a subtraction and two multiplications. All likelihood scores are obtained by N · T · V calls of the partial computation of log δ_t(j). Each partial computation of log δ_t(j) performs three additions in Eq. (3). The OPC circuit and the Viterbi scorer accelerate these computations. More details can be found in (Yoshizawa et al., 2002; 2004; 2006).
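As a concrete reference for Eqs. (1)-(3), the following Python sketch computes a log output probability and runs the Viterbi recursion for a left-to-right HMM. This is illustrative software, not the hardware datapath; taking the final state's score as log S* is an assumption here, as is usual for left-to-right word HMMs.

```python
import math

def log_output_prob(omega, sigma, mu, o):
    # Eq. (1): log b_j(O_t) = omega_j + sum_p sigma_jp * (o_tp - mu_jp)^2
    # (omega_j and sigma_jp are precomputed log-domain Gaussian parameters)
    return omega + sum(s * (x - m) ** 2 for s, x, m in zip(sigma, o, mu))

def viterbi_score(log_pi, log_a, log_b):
    # log_b[t][j]: T x N matrix of log output probabilities from Eq. (1)
    T, N = len(log_b), len(log_b[0])
    # Eq. (2): initialization at t = 1
    delta = [log_pi[j] + log_b[0][j] for j in range(N)]
    for t in range(1, T):
        new = []
        for j in range(N):
            stay = delta[j] + log_a[j][j]                       # self loop
            move = delta[j - 1] + log_a[j - 1][j] if j > 0 else -math.inf
            new.append(max(stay, move) + log_b[t][j])           # Eq. (3)
        delta = new
    return delta[-1]  # log S* (score of the final state; an assumption)
```

A single-state example: with log_pi = [0.0], a self-loop of log a = 0.0, and log b = [[-1.0], [-2.0]], the recursion yields -3.0.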

Block parallel processing for OPCs and LSCs
BPP for OPCs and LSCs was proposed as an efficient high-speed parallel processing method for HMM-based isolated word speech recognition (Yoshizawa et al., 2002). In BPP, the set of input feature vectors is called a block, and HMM parameters are effectively shared between the different input feature vectors for OPC. In recent years, two types of BPP are classified according to input data flow: StreamBPP and StoreBPP (Nakamura et al., 2010).
A block can be considered as an M × P matrix whose elements are o_t′p, where 1 ≤ t′ ≤ M (≤ T) (Yoshizawa et al., 2006). StreamBPP performs N OPCs in parallel with N PE1s (Fig. 1) and obtains N output probabilities log b_1(O_t), log b_2(O_t), ..., and log b_N(O_t) simultaneously. These N output probabilities are obtained every P clock cycles and are fed to the Viterbi scorer for LSCs. A Viterbi scorer that supports the OPCs and LSCs in StreamBPP was presented in (Yoshizawa et al., 2006), where N LSCs are performed with N PE2s (Fig. 1). In the Viterbi scorer, N intermediate scores log δ_t(1), log δ_t(2), ..., and log δ_t(N) are computed simultaneously.
A block can be considered as a set of M input feature vectors O_t′, where 1 ≤ t′ ≤ M (≤ T). StoreBPP performs arithmetic operations on locally stored input feature vectors O_1, O_2, ..., and O_M (Nakamura et al., 2010). StoreBPP performs ⌈M/2⌉ OPCs in parallel with ⌈M/2⌉ PE1s (Fig. 1) and obtains M HMM output probabilities log b_j(O_t′+1), log b_j(O_t′+2), ..., and log b_j(O_t′+M). These M HMM output probabilities are obtained every 2 · P clock cycles and are fed to the Viterbi scorer for LSCs. Different BPPs require different Viterbi scorer architectures. In StoreBPP, a Viterbi scorer that supports the OPCs, where M HMM output probabilities are computed simultaneously, is required, but such a Viterbi scorer was not addressed in (Nakamura et al., 2010). A naive introduction of a Viterbi scorer to StoreBPP requires many registers, which reduces the advantage of the StoreBPP architecture.
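Both BPP styles partition the T input feature vectors into ⌈T/M⌉ blocks of up to M vectors. A minimal sketch of this partitioning (function name hypothetical):

```python
def blocks(T, M):
    # Split frame indices 1..T into ceil(T/M) blocks of at most M vectors;
    # each block is processed as one unit by a BPP architecture.
    return [list(range(t, min(t + M, T + 1))) for t in range(1, T + 1, M)]
```

For the isolated word recognition parameters above (T = 86) and a block size of M = 12, this yields 8 blocks, the last holding only 2 feature vectors.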

Fast store-based block parallel processing
A two-step process was adopted in StoreBPP to compute M HMM output probabilities log b_j(O_t′+1), log b_j(O_t′+2), ..., and log b_j(O_t′+M), where half of the output probabilities are computed simultaneously with ⌈M/2⌉ PE1s (Nakamura et al., 2010). The ⌈M/2⌉ parallel computations and a ROM access are performed simultaneously in StoreBPP, where two HMM parameters −µ_j,p+1 and σ_j,p+1 are required for the next OPC. Because it takes two cycles to read two HMM parameters from ROM, a two-step process was appropriate for StoreBPP.
StoreBPP performs ⌈M/2⌉ OPCs in parallel by using a register array of size M, where M is the block size (Nakamura et al., 2010). We improve StoreBPP by reducing the register size required for performing M OPCs in parallel. We modify the parallel computation of ⌈M/2⌉ OPCs by doubling the bit length of the input to OPC. With this bit-length extension, two HMM parameters can be read simultaneously. We call the modified parallel processing fast store-based block parallel processing (FastStoreBPP), and we show a pipelined Viterbi scorer that supports it. FastStoreBPP performs M OPCs in parallel by using a register array of size M.
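The doubled input bit length amounts to packing the two 8-bit parameters −µ_jp and σ_jp into one 16-bit ROM word so both arrive in a single cycle. A minimal sketch of such packing (function names and field layout are assumptions, not the chapter's ROM format):

```python
def pack_rom_word(neg_mu, sigma, width=8):
    # Place the 8-bit -mu_jp in the high half and sigma_jp in the low half
    # of one 16-bit word, so one ROM read delivers both parameters.
    mask = (1 << width) - 1
    return ((neg_mu & mask) << width) | (sigma & mask)

def unpack_rom_word(word, width=8):
    # Recover (-mu_jp, sigma_jp) from a packed 16-bit ROM word.
    mask = (1 << width) - 1
    return (word >> width) & mask, word & mask
```

In hardware this is simply a wider ROM data bus; no extra cycles or logic are spent on the split.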

FastStoreBPP architecture for OPCs and LSCs
Our VLSI architecture that supports FastStoreBPP is shown in Fig. 4, where we assume M ≤ P and hence ⌈M/P⌉ = 1. The FastStoreBPP architecture consists of an OPC circuit and a Viterbi scorer. The OPC circuit has two register arrays (RegO and Regω), two registers (Regµ and Regσ), and M PE1s for OPCs. (In Fig. 3, the parameter loads take one cycle each: −µ_1,1 and σ_1,1 to Regµ and Regσ, respectively; ω_j to Regω, with log π_j to RegTmpδ_j,1 when t = t′ + 1 = 1; and log a_j,j to Rega_j,j,1, with log a_j−1,j to Rega_j−1,j,1 when 2 ≤ j.) Each PE1 consists of two adders and two multipliers, which are used for computing ω_j + ∑_{p=1}^{P} σ_jp (o_tp − µ_jp)^2. The PE1s in the FastStoreBPP architecture, the StreamBPP architecture (Yoshizawa et al., 2006), and the StoreBPP architecture (Nakamura et al., 2010) are identical but differ in number. In addition, the architecture has three register arrays (RegInδ_1, RegLastδ, and RegTmpδ_j−1,1), three registers (Rega_j,j,1, Rega_j−1,j,1, and RegTmpδ_j,1), and a PE2 for LSCs. PE2 consists of three adders, two selectors and two comparators, which are used for LSC based on Eqs. (2) and (3). The PE2s in the FastStoreBPP architecture and the StreamBPP architecture (Yoshizawa et al., 2006) are identical but differ in number.
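The per-cycle behavior of the two PE types can be sketched directly from the operation counts above. This is a behavioral model only (function names hypothetical); each call corresponds to one PE cycle.

```python
def pe1_step(acc, sigma_jp, o_tp, mu_jp):
    # PE1: two adders and two multipliers per Eq. (1) term.
    d = o_tp - mu_jp                 # adder 1 (subtraction)
    return acc + sigma_jp * (d * d)  # multiplier 1, multiplier 2, adder 2

def pe2_step(delta_prev_j, delta_prev_jm1, log_a_jj, log_a_jm1_j, log_b):
    # PE2: three adders; the comparators/selectors realize the max of Eq. (3).
    stay = delta_prev_j + log_a_jj        # adder 1
    move = delta_prev_jm1 + log_a_jm1_j   # adder 2
    return max(stay, move) + log_b        # compare/select, then adder 3
```

Iterating pe1_step over p = 1, ..., P starting from acc = ω_j yields log b_j(O_t); one pe2_step yields one intermediate score log δ_t(j).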

M-parallel OPC
OPC starts by reading M input feature vectors O_t′+1, O_t′+2, ..., and O_t′+M from RAM and storing them in RegO in the OPC circuit (Fig. 4) based on Loop C1 (Fig. 3). M · P/2 cycles are required for reading the M input feature vectors. Then, the HMM parameters of the v-th HMM, −µ_11, σ_11, and ω_1, are read from ROM and stored in Regµ, Regσ, and Regω, respectively, based on Loop C1 and Loop B′ (Fig. 3). For the stored input feature vectors, M intermediate results of M OPCs are simultaneously computed with the stored HMM parameters by using M PE1s (Fig. 4) based on Loop A′ (Fig. 3).
The stored HMM parameters are shared by all PE1s, and the obtained M intermediate results are stored in Regω. At the same time, the two HMM parameters −µ_j,p+1 and σ_j,p+1 of the v-th HMM are read from ROM and stored in Regµ and Regσ, respectively, overwriting the previous values. These HMM parameters are used in the next M computations, which are performed in parallel with M PE1s based on Loop A′ (Fig. 3).
M HMM output probabilities, log b_j(O_t′+1), ..., and log b_j(O_t′+M), are simultaneously obtained every P cycles by the M PE1s based on Loop A′ (Fig. 3).

The results are copied from Regω to RegInδ_1 to start the LSCs log δ_t′′+1(j), log δ_t′′+2(j), ..., and log δ_t′′+M(j) and the next M OPCs for the (j + 1)-th state based on Loop B′ (Fig. 3). M · N HMM output probabilities of the v-th HMM are obtained by Loop B′ (Fig. 3). N · T HMM output probabilities of the v-th HMM are obtained by Loop C1 (Fig. 3). N · T · V HMM output probabilities of all HMMs are obtained by Loop D′ (Fig. 3).
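The nesting of Loops D′, C1, B′ and A′ can be sketched as software. This is a behavioral model of the FastStoreBPP OPC schedule (names hypothetical), showing how one stored block of M feature vectors and one shared (µ, σ) pair per dimension feed all M parallel computations:

```python
def faststore_opc(features, hmms, M):
    """features: T-list of P-dim vectors; each HMM is (omega[N], sigma[N][P],
    mu[N][P]). Returns results[v][t][j] = log b_j(O_{t+1}) of HMM v."""
    T = len(features)
    results = []
    for omega, sigma, mu in hmms:                  # Loop D': over HMMs
        probs = [[None] * len(omega) for _ in range(T)]
        for t0 in range(0, T, M):                  # Loop C1: one block
            block = features[t0:t0 + M]            # vectors stored in RegO
            for j in range(len(omega)):            # Loop B': one state
                # Loop A': M partial sums updated "in parallel" by M PE1s,
                # all sharing the single (mu, sigma) pair read per dimension p
                acc = [omega[j]] * len(block)
                for p in range(len(block[0])):
                    for i, o in enumerate(block):
                        acc[i] += sigma[j][p] * (o[p] - mu[j][p]) ** 2
                for i in range(len(block)):
                    probs[t0 + i][j] = acc[i]
        results.append(probs)
    return results
```

The inner `for i` loop is sequential here but corresponds to the M PE1s operating in the same cycle in hardware.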
Our Viterbi scorer, which supports FastStoreBPP, was presented in Fig. 4, where M ≤ P and ⌈M/P⌉ = 1. Our ⌈M/P⌉-stage pipelined Viterbi scorer, which supports P < M and 1 < ⌈M/P⌉, is shown in Fig. 5. The Viterbi scorer in Fig. 4 is an instance of the generalized ⌈M/P⌉-stage pipelined Viterbi scorer where ⌈M/P⌉ = 1. The ⌈M/P⌉-stage pipelined Viterbi scorer consists of RegLastδ and ⌈M/P⌉ sub Viterbi scorers. The i-th stage, i.e., the i-th sub Viterbi scorer, consists of two register arrays (RegInδ_i and RegTmpδ_j−1,i), three registers (RegTmpδ_j,i, Rega_j,j,i and Rega_j−1,j,i) and a PE2. Each RegInδ_i consists of i · P registers, where i = 1, 2, ..., ⌈M/P⌉ − 1. RegInδ_⌈M/P⌉ consists of ⌈M/P⌉ · (M mod P) registers. In each RegInδ_i, rows are shifted upward every P clock cycles. Each RegTmpδ_j−1,i consists of P registers, where i = 1, 2, ..., ⌈M/P⌉ − 1. RegTmpδ_j−1,⌈M/P⌉ consists of M mod P registers. HMM parameters in Rega_j,j,i and Rega_j−1,j,i are copied every P clock cycles to Rega_j,j,i+1 and Rega_j−1,j,i+1, respectively, where i = 1, 2, ..., ⌈M/P⌉ − 1. The last intermediate score obtained by PE2 based on Loop E_i (Fig. 3), where i = 1, 2, ..., ⌈M/P⌉ − 1, is stored in RegTmpδ_j−1,i and RegTmpδ_j,i+1 every P clock cycles based on the ⌈M/P⌉-stage pipelined LSC (Fig. 3). The last intermediate score obtained by PE2 based on Loop E_⌈M/P⌉ (Fig. 3) is stored in RegTmpδ_j−1,⌈M/P⌉ and RegLastδ every P clock cycles during Loop B′ (Fig. 3). The stored intermediate scores are required when starting LSC with new M output probabilities at the first computation of Loop E_1 (Fig. 3). The required intermediate score is read from RegLastδ and stored in RegTmpδ_j,1 before computation.
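The stage count and RegInδ register sizes stated above can be tabulated mechanically. A small sketch under those stated sizing rules (the last-stage formula is taken as written; when P divides M the last stage is assumed to hold ⌈M/P⌉ · P registers):

```python
import math

def pipeline_register_sizes(M, P):
    # Number of pipeline stages and the RegInDelta_i sizes per the text:
    # stage i (i < S) holds i*P registers; the last stage holds
    # S * (M mod P) registers (S * P assumed when P divides M).
    S = math.ceil(M / P)
    sizes = [i * P for i in range(1, S)]
    sizes.append(S * (M % P) if M % P else S * P)
    return S, sizes
```

For example, M = 40 and P = 12 gives four stages sized 12, 24, 36 and 16 registers, while the Fig. 4 case M ≤ P collapses to a single stage.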

VLSI architecture for OPCs and LSCs with multiple store-based block parallel processing
FastStoreBPP is obtained from StoreBPP by doubling the bit length of the input to OPC so that two HMM parameters can be read simultaneously. We further extend the bit length of the input to OPC to obtain multiple store-based block parallel processing (MultipleStoreBPP).

Multiple store-based block parallel processing
StoreBPP performs ⌈M/2⌉ OPCs in parallel by using a register array of size M and ⌈M/2⌉ PE1s, where the ⌈M/2⌉ OPCs belong to a single HMM (Nakamura et al., 2010). We improve StoreBPP and further reduce the register size required for performing OPCs in parallel. We modify the M/2-parallel OPC, in which ⌈M/2⌉ OPCs are performed in parallel, to deal with multiple HMMs by further extending the bit length of the input to OPC. With this bit-length extension, L′ · M′/2 OPCs are performed in parallel, where L′ is the number of HMMs whose parameters can be read from ROM simultaneously. We call the modified parallel processing MultipleStoreBPP. Our MultipleStoreBPP applies the M′/2-parallel OPC, in which M′/2 OPCs are performed in parallel, to L′ HMMs, so that L′ · M′/2 OPCs are performed in parallel by using a register array of size M′.
A flowchart of our MultipleStoreBPP is shown in Fig. 6. Loops D2′, C1′, B′′, and A′′ in Fig. 6 are based on StoreBPP. In our MultipleStoreBPP, Loop D′ (Fig. 3) is partially expanded as shown in Fig. 6 to perform L′ · M′/2 OPCs in parallel in Loop A′′. By this expansion, input feature vectors are effectively shared between the different M′/2-parallel OPCs.
The flowchart consists of L′ M′/2-parallel OPCs and L′ ⌈M′/(2 · P)⌉-stage pipelined LSCs, denoted by dashed and double-dashed lines in Fig. 6, respectively. (In Fig. 6, for the L′ HMMs, the loads take two cycles each: −µ_1,1 and σ_1,1 to Regµ and Regσ; ω_j to Regω, with log π_j to RegTmpδ_j when t = t′ + 1 = 1; log a_j,j to Rega_j,j, with log a_j−1,j to Rega_j−1,j when 2 ≤ j; and Regω is copied to RegInδ.) The LSCs use the same PE2s as FastStoreBPP (Fig. 3) but differ in number. Loops A′ and B′ (Fig. 3) correspond to Loops A′′ and B′′, respectively. By Loop A′′, the output probabilities of L′ HMMs, the (v′ + 1)-th to (v′ + L′)-th HMMs, are computed with L′ · M′/2 PE1s. These output probabilities are simultaneously obtained every P clock cycles by Loop A′′. In Loop A′′, firstly, L′ M′/2-parallel OPCs and the ROM access for −µ_j,p+1 of the L′ HMMs are performed simultaneously. Secondly, L′ M′/2-parallel OPCs and the ROM access for σ_j,p+1 of the L′ HMMs are performed simultaneously. Thus 2 · L′ HMM parameters are read from ROM in two cycles. These HMM parameters are needed for the next computation in Loop A′′. Then, the obtained L′ · M′/2 output probabilities are fed to L′ LSCs, where each LSC uses the same Viterbi scorer as that introduced for FastStoreBPP, as shown in Fig. 3. These L′ LSCs support the LSCs of the (v′ + 1)-th to (v′ + L′)-th HMMs. Loop A′′ and the L′ LSCs proceed simultaneously.
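The key point of Loop A′′, sharing one stored block of M′ feature vectors across L′ HMMs, can be sketched behaviorally (names hypothetical; the sketch uses a single-state HMM per group member to keep it short):

```python
def multiplestore_opc(features_block, hmm_group):
    """Loop A'' sketch: the M' stored feature vectors are shared by the L'
    HMMs of one group; per dimension p, the 2*L' parameters (-mu, sigma) are
    read in two cycles while L'*M'/2 PE1s update the partial sums.
    Each HMM here is (omega, sigma[P], mu[P]) for one state."""
    out = []
    for omega, sigma, mu in hmm_group:     # the L' HMMs of one group
        probs = []
        for o in features_block:           # the M' shared feature vectors
            acc = omega
            for p, x in enumerate(o):
                acc += sigma[p] * (x - mu[p]) ** 2
            probs.append(acc)
        out.append(probs)
    return out
```

In hardware the two outer loops collapse into one set of L′ · M′/2 PE1s working in the same cycle; the feature block is read from RAM only once per group of L′ HMMs.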

MultipleStoreBPP architecture for OPCs and LSCs
Our MultipleStoreBPP VLSI architecture is shown in Fig. 7, where we assume M′ ≤ 2 · P and hence ⌈M′/(2 · P)⌉ = 1. The MultipleStoreBPP architecture consists of L′ OPC circuits and L′ Viterbi scorers. The architecture has 2 · L′ + 1 register arrays (RegO, Regσ and Regω), L′ registers (Regµ), and L′ · M′/2 PE1s for the OPCs of L′ HMMs. Each PE1 consists of two adders and two multipliers, which are used for computing ω_j + ∑_{p=1}^{P} σ_jp (o_tp − µ_jp)^2. The PE1s in our MultipleStoreBPP and FastStoreBPP architectures and those of Yoshizawa et al. (2006) and Nakamura et al. (2010) are identical but differ in number. In addition, the architecture has 3 · L′ register arrays (RegInδ_1, RegLastδ, and RegTmpδ_j−1,1), 3 · L′ registers (Rega_j,j,1, Rega_j−1,j,1, and RegTmpδ_j,1), and L′ PE2s for the LSCs of L′ HMMs. PE2 consists of three adders, two selectors and two comparators, which are used for LSC on the basis of Eqs. (2) and (3). The PE2s in our MultipleStoreBPP and FastStoreBPP architectures and that of Yoshizawa et al. (2006) are identical but differ in number.
In each OPC circuit, denoted by a dashed line in Fig. 7, OPCs proceed as in FastStoreBPP. L′ · M′ output probabilities are simultaneously obtained every 2 · P clock cycles by the L′ OPC circuits; these are log b_j(O_t′+1), ..., and log b_j(O_t′+M′) of the L′ HMMs based on Loop A′′ (Fig. 6). The results are copied from Regω to RegInδ_1 for starting the LSCs of the L′ HMMs and the next OPCs for the (j + 1)-th states of the L′ HMMs, log b_j+1(O_t′+1), ..., and log b_j+1(O_t′+M′), based on Loop B′′ (Fig. 6). L′ · M′ · N output probabilities of the L′ HMMs are obtained by Loop B′′ with the same M′ input feature vectors O_t′+1, ..., and O_t′+M′.
Each Viterbi scorer, denoted by double-dashed lines in Fig. 7, performs the ⌈M′/(2 · P)⌉-stage pipelined LSC, denoted by double-dashed lines in Fig. 3. LSC starts by reading the HMM parameters of the L′ HMMs, log π_1 and log a_1,1, from ROM and storing them in RegTmpδ_j,1 and Rega_j,j,1 based on Loop B′′ (Fig. 6). Then, in each Viterbi scorer, an intermediate score log δ_1(1) is computed by PE2 with the HMM parameter and the output probability obtained using Eq. (1). The obtained intermediate score is stored in both RegTmpδ_j,1 and RegTmpδ_j−1,1 (Fig. 7). RegTmpδ_j,1 stores the intermediate score that is needed in the next computation in Loop E_1 (Fig. 3). RegTmpδ_j−1,1 stores M′ intermediate scores log δ_t′+1(j), ..., and log δ_t′+M′(j), which are needed in the next LSC for the (j + 1)-th state of the HMM in Loop B′′. After the computation of log δ_1(1), M′ − 1 intermediate scores log δ_2(1), ..., and log δ_M′(1) are sequentially computed by PE2 on the basis of Eq. (3). In this sequential computation, the last obtained intermediate score, log δ_M′(1), is stored in RegTmpδ_j−1,1 and RegLastδ (Fig. 7). RegLastδ stores the N intermediate scores that are the last ones obtained by Loop E_1 during Loop B′′. These intermediate scores are log δ_t′+M′(1), ..., and log δ_t′+M′(N) of the v-th HMM, which are required when starting LSC with the new M′ output probabilities log b_j(O_t′+M′+1), ..., and log b_j(O_t′+2·M′) at the first computations in Loop E_1, yielding log δ_t′+M′+1(1), ..., and log δ_t′+M′+1(N). A required intermediate score is read from RegLastδ and stored in RegTmpδ_j,1 before computation. Rega_j−1,j,1 (Fig. 7) stores the HMM parameter log a_j−1,j, which is used for computing log δ_t′(j) on the basis of Eq. (3) when 2 ≤ t′ and 2 ≤ j.

Evaluation
We compared the StreamBPP (Yoshizawa et al., 2004; 2006), StoreBPP (Nakamura et al., 2010), FastStoreBPP (Figs. 3, 4, and 5), and MultipleStoreBPP (Figs. 6, 7, and 8) VLSI architectures. Table 1 shows the register sizes of the MultipleStoreBPP, FastStoreBPP, StoreBPP, and StreamBPP architectures, where x_µ, x_σ, x_ω, x_o, x_a, and x_f represent the bit lengths of µ_jp, σ_jp, ω_j, o_tp, a_jj, and the output of PE1, respectively. N, P, and M are the number of HMM states, the dimension of the input feature vector, and the number of input feature vectors in a block, respectively. M′ and L′ are the number of input feature vectors in a block with MultipleStoreBPP and the number of HMMs whose output probabilities are simultaneously computed by Loop A′′ (Fig. 6) with MultipleStoreBPP, respectively. OPC and Viterbi scorer denote the register sizes of the OPC circuit and the Viterbi scorer, respectively. Table 2 shows the processing times for computing the output probabilities of V HMMs and the likelihood scores with the MultipleStoreBPP, FastStoreBPP, StoreBPP, and StreamBPP architectures, where L is the number of HMMs whose output probabilities are computed using the same input feature vectors with StoreBPP (Nakamura et al., 2010). OPC and Viterbi scorer denote the number of clock cycles for OPC and the additional cycles for LSC, respectively.

Embedded Systems - High Performance Systems, Applications and Projects

Table 2. Processing times. StreamBPP (Yoshizawa et al., 2006) [cycle]: OPC, V · (2 · N · P + N + P · T); Viterbi scorer, V · (3 · N − 1). StoreBPP (Nakamura et al., 2010) [cycle]: OPC, ⌈V/L⌉ · {P · M + (1 + 2 · P) · L · N} · ⌈T/M⌉; Viterbi scorer, not available.

Table 3 shows the register size, processing time, and number of PEs for computing the output probabilities of 800 HMMs and the likelihood scores, where it is assumed that N = 32, P = 38, T = 86, x_µ = 8, x_σ = 8, x_f = 24, x_o = 8, x_a = 8, and V = 800. These values are the same as those used in a recent circuit design for isolated word recognition (Nakamura et al., 2010; Yoshizawa et al., 2004; 2006). In addition, we assume that M′ = 12 and L′ = 4 for the MultipleStoreBPP architecture. Furthermore, ratios relative to StreamBPP are shown in Table 3. Compared with the StreamBPP architecture, the MultipleStoreBPP architecture has fewer registers (48% = 10,432/21,752) and requires less processing time (91% = 4,233,600/4,661,600). The numbers of PE2s in MultipleStoreBPP and FastStoreBPP are less than that in StreamBPP.

Figure 9 shows the processing times of the MultipleStoreBPP, FastStoreBPP, and StreamBPP architectures versus the value of M (block size). This graph shows that the processing time of MultipleStoreBPP is less than that of the FastStoreBPP architecture. It is also less than that of the StreamBPP architecture when M is greater than 22. Figure 10 shows the register sizes of the MultipleStoreBPP, FastStoreBPP, and StreamBPP architectures versus the value of M (block size). This graph shows that the register size of MultipleStoreBPP is less than those of the FastStoreBPP and StreamBPP architectures when M is greater than 36 and less than 74.

Table 4 (Area, delay and power of OPC and LSC circuits) shows the circuit area, clock period, and power dissipation of the OPC and LSC circuits based on the MultipleStoreBPP and FastStoreBPP architectures, derived from the report of the Synopsys Design Compiler (Ver. B-2008.09-SP5), where the target technology is a 90 nm technology (STARC 90nm) and the power dissipation is obtained with the report_power command after logic synthesis. In the table, the delay and area represent the minimum clock period and area of the circuit, respectively. Power represents the power dissipation of the circuit operating at a clock frequency of 11 MHz; for 800-word real-time isolated word recognition, recognition of a 1-s speech (T = 86) in 0.2 s is achieved by the MultipleStoreBPP architecture for V = 800, N = 32, P = 38, M′ = 22 and L′ = 4. Compared with the FastStoreBPP architecture, the MultipleStoreBPP architecture has lower power and less area, because the MultipleStoreBPP architecture has fewer registers when the number of PE1s is 44.
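The StreamBPP totals used in the Table 3 comparison can be reproduced from the closed-form cycle counts in Table 2:

```python
def streambpp_cycles(V, N, P, T):
    # Table 2, StreamBPP: OPC cycles plus additional Viterbi-scorer cycles
    opc = V * (2 * N * P + N + P * T)   # V * (2*N*P + N + P*T)
    viterbi = V * (3 * N - 1)           # V * (3*N - 1)
    return opc + viterbi
```

With the Table 3 parameters (N = 32, P = 38, T = 86, V = 800) this evaluates to 4,661,600 cycles, matching the denominator of the 91% processing-time ratio.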

Conclusions
We presented MultipleStoreBPP for OPCs and LSCs together with a new VLSI architecture. MultipleStoreBPP performs parallel OPCs and pipelined LSCs for multiple HMMs. Compared with the conventional StoreBPP architecture, the MultipleStoreBPP architecture supports LSC. Furthermore, compared with the StreamBPP and FastStoreBPP architectures, the MultipleStoreBPP architecture requires fewer registers and less processing time. From a VLSI architectural viewpoint, the comparison shows the efficiency of the MultipleStoreBPP architecture.
