9 Design and Applications of Embedded Systems for Speech Processing

This chapter focuses on speech processing techniques, which involve speech feature extraction, sound localization, speaker identification/verification, and interactive retrieval of spoken documents. Several hardware design issues are discussed in each section. Speech processing applications frequently involve extensive mathematical computation, which makes the management of resources and power consumption important. Therefore, this chapter presents not only algorithms but also their improved implementations for embedded systems, such as fixed-point arithmetic design, field-programmable gate array (FPGA) verification, ARM-based system-on-a-programmable-chip (SoPC) architecture, and other single-chip designs.

Step 3. Calculate the energy in each channel.
Step 4. Take the logarithm and perform the cosine transform to obtain the Mel-frequency cepstral coefficients (MFCCs), given by (3).

Improved algorithm for calculating mel-frequency cepstral coefficients
Generally, the required computational power and ROM in each frame can be determined according to Table 1 (Wang et al., 2000; Wang et al., 2003). As shown in the table, the total required computational power is quite high because of the redundant operations and the memory that stores the required constants. Accordingly, some modifications must be made to reduce the computational load.
The weighted energy spectrum in the Mel-window, $E_{k+1}(j)$, can be obtained by subtracting the weighted energy spectrum $E_k(j)$ from the energy spectrum $X(j)$. All of the multiplications in (4) can therefore be replaced by subtraction operations, so the memory required to store the weight constants for (4) becomes redundant and can be eliminated.


Additionally, applying the symmetric property of the cosine function to (3) flattens all of the operations and enables related items to be combined in a new formula, given below.
Therefore, the computational complexity of the C[n] operations can be re-estimated, and the result is given in Table 2 (Wang et al., 2000; Wang et al., 2003).
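The effect of the cosine symmetry can be illustrated with a short sketch. It assumes that (3) has the common form C[n] = Σ_k log S[k] · cos(n(k − 0.5)π/M), which is not reproduced in this text, and that M is even; the function names are hypothetical. Because cos(n(M + 1 − k − 0.5)π/M) = (−1)^n cos(n(k − 0.5)π/M), channels k and M + 1 − k can be combined before the multiplication, which halves the number of multiplications.

```python
import numpy as np

def mfcc_direct(log_energy, num_ceps):
    """Direct cosine transform: M multiplications per coefficient."""
    M = len(log_energy)
    k = np.arange(1, M + 1)
    return np.array([np.sum(log_energy * np.cos(n * (k - 0.5) * np.pi / M))
                     for n in range(1, num_ceps + 1)])

def mfcc_folded(log_energy, num_ceps):
    """Symmetric version: channels k and M+1-k share one cosine constant."""
    M = len(log_energy)              # assumed even
    half = M // 2
    k = np.arange(1, half + 1)
    head = log_energy[:half]         # log S[1] ... log S[M/2]
    tail = log_energy[::-1][:half]   # log S[M] ... log S[M/2+1]
    coeffs = []
    for n in range(1, num_ceps + 1):
        sign = -1.0 if n % 2 else 1.0    # (-1)^n from the cosine symmetry
        coeffs.append(np.sum((head + sign * tail)
                             * np.cos(n * (k - 0.5) * np.pi / M)))
    return np.array(coeffs)
```

With M=20 channels and L=12 coefficients, the folded version needs 10 multiplications per coefficient instead of 20, and only half of the cosine constants have to be stored.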

[Table: number of operations and required memory for K=256, M=20, L=12; columns: computational power, actual computational power; rows: C[n], S[k].]
The modified procedure in the MFCC algorithm is based mainly on the improved S[k] calculation, as discussed below and shown in Fig. 1 (Wang et al., 2000). Every Mel-window is divided into two blocks with equal bandwidth on the Mel-scale. Because the Mel-windows overlap each other, every block except for the first and last belongs to two Mel-windows.
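The block-sharing idea can be sketched as follows. This is a minimal illustration, not the chip's data path: it assumes triangular Mel-windows whose weights in each shared block sum to one, and it takes the block boundaries `edges` as hypothetical FFT-bin indices rather than computing them from the Mel-scale. Within a block, the falling weight of one window is applied by a multiplication, and the rising contribution of the next window is obtained by subtraction.

```python
import numpy as np

def channel_energy_direct(X, edges):
    """Reference: each Mel-window applies its own triangular weights."""
    M = len(edges) - 2
    S = np.zeros(M)
    for k in range(1, M + 1):
        lo, mid, hi = edges[k - 1], edges[k], edges[k + 1]
        for j in range(lo, mid):
            S[k - 1] += X[j] * (j - lo) / (mid - lo)   # rising slope
        for j in range(mid, hi):
            S[k - 1] += X[j] * (hi - j) / (hi - mid)   # falling slope
    return S

def channel_energy_shared(X, edges):
    """Block-wise version: one multiplication per bin, one subtraction."""
    M = len(edges) - 2
    S = np.zeros(M)
    for b in range(M + 1):                 # block b spans [edges[b], edges[b+1])
        lo, hi = edges[b], edges[b + 1]
        for j in range(lo, hi):
            falling = X[j] * (hi - j) / (hi - lo)   # share of window b
            rising = X[j] - falling                 # share of window b+1
            if b >= 1:
                S[b - 1] += falling
            if b < M:
                S[b] += rising
    return S
```

Both functions return the same channel energies, but the shared version stores one weight per bin instead of two.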

Fixed-point arithmetic design
The word recognition system is based on the hidden Markov model (HMM). To achieve area-efficiency, MFCC chips are designed using fixed-point arithmetic. The procedure for implementing the fixed-point program is as follows.
Step 1. Partition the algorithm into n modules; this involves calculations of the energy spectrum, channel energy, and MFCCs.
Step 2. Determine the lower bound and upper bound on each module.
The format of fixed-point variables is determined based on the dynamic range of the input variables in the first module. Once this module has been analyzed, its output is fed into the next module, and the analysis continues until all modules fit the fixed-point data format, as presented in Table 3 (Wang et al., 2003).
Step 3. Error measurement for each module. This step evaluates the quantization error of each module by comparing its output with the output of the corresponding floating-point routine, as shown in Table 4.
Step 4. Performance measurement. The impact of the fixed-point MFCC algorithm on the recognition rate is evaluated at this stage, as shown in Table 5 (Wang et al., 2003).

[Table 3 columns: maximum, minimum, absolute minimum, fixed-point format.]

Table 5. Comparison of recognition rates achieved using floating-point and fixed-point structure.
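Steps 2 and 3 of the fixed-point procedure can be sketched in a few lines. The code below is an illustrative assumption rather than the authors' tool flow: it picks a signed Q format for a 16-bit word from the largest magnitude observed in a module, quantizes with saturation, and measures the average error against the floating-point values. All function names and the 16-bit word length are hypothetical.

```python
import math

def choose_q_format(max_abs, word_bits=16):
    """Return (integer_bits, fraction_bits) of a signed word covering max_abs > 0."""
    int_bits = max(0, math.floor(math.log2(max_abs)) + 1)
    frac_bits = word_bits - 1 - int_bits          # one bit is the sign
    if frac_bits < 0:
        raise ValueError("dynamic range does not fit the word length")
    return int_bits, frac_bits

def to_fixed(value, frac_bits, word_bits=16):
    """Quantize to the nearest representable integer, with saturation."""
    raw = round(value * (1 << frac_bits))
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, raw))

def to_float(raw, frac_bits):
    """Convert a fixed-point integer back to a real value."""
    return raw / (1 << frac_bits)

def average_error(values, frac_bits, word_bits=16):
    """Mean absolute difference between floating-point and fixed-point values."""
    errors = [abs(v - to_float(to_fixed(v, frac_bits, word_bits), frac_bits))
              for v in values]
    return sum(errors) / len(errors)
```

For a module whose values stay within ±100, `choose_q_format(100.0)` yields 7 integer bits and 8 fraction bits, so the rounding error of any in-range value is at most half of 2^-8.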

Circuit design
Improved partitioned look-up tables are another commonly used means of evaluating elementary functions such as logarithms, square roots, and trigonometric functions. Figure 2(a) shows the architecture of the proposed MFCC chip (Wang et al., 2003).
Based on this architecture, only one processing unit is used and all data are processed in pipeline fashion. Figure 2(b) displays the architecture of the improved partitioned table. The shifter logic is used to find the Q value of the minimum left shift and to output the least significant 16 bits, which are the addresses of the two subtables. Only one two-stage pipelined multiplier and adder, which is shared by both the main data path and the look-up table, is used.
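The role of the shifter in a table-based logarithm can be illustrated with a simplified sketch. It is not the partitioned two-subtable design of Fig. 2(b): it assumes a single 64-entry table (`TABLE_BITS` is a hypothetical parameter), normalizes the input by the minimum left shift that sets the most significant bit, and uses the bits below the leading one as the table address.

```python
import math

TABLE_BITS = 6  # hypothetical address width of the mantissa table
LOG_TABLE = [math.log2(1.0 + i / (1 << TABLE_BITS))
             for i in range(1 << TABLE_BITS)]

def log2_lut(x, word_bits=16):
    """Approximate log2 of an integer x with 0 < x < 2**word_bits."""
    if not 0 < x < (1 << word_bits):
        raise ValueError("x is outside the supported range")
    shift = word_bits - x.bit_length()     # minimum left shift that sets the MSB
    normalized = x << shift                # now in [2^(w-1), 2^w)
    # The TABLE_BITS bits just below the leading one address the table.
    index = (normalized >> (word_bits - 1 - TABLE_BITS)) & ((1 << TABLE_BITS) - 1)
    exponent = word_bits - 1 - shift       # integer part of log2(x)
    return exponent + LOG_TABLE[index]
```

Because the mantissa is truncated to six bits, the result underestimates the true logarithm by less than log2(1 + 1/64), about 0.022. A partitioned table reduces this error without enlarging a single table.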
At the verification stage, an FPGA board is utilized to implement the MFCC system prototype. First, synthesizable Verilog-HDL descriptions are coded. Synopsys FPGA Express (www.synopsys.com) generates the corresponding netlist files. The Xilinx Flow-Engine then generates the placement, routing, and bit-stream files. The design is implemented successfully in the XC4062XL FPGA chip.

Embedded system design for sound localization
This section introduces a sound localization system, which exploits the average magnitude difference function (AMDF), for finding the directions of environmental signal sources. To verify the accuracy of the algorithm, the entire system is implemented on a single FPGA development board using the Quartus II software tool. Then, the System-on-Chip (SoC) design, based on the FPGA code with the 0.18 μm CMOS process, is implemented. The experimental results indicate that the proposed system can achieve higher accuracy with reduced complexity and area of the hardware.

Sound activity detection
When the pair of microphones receives the sound signals, the system begins to determine whether the input signal needs to be handled. The detection of sound activity comprises three steps, which are as follows.
Threshold Value Detection: This step determines whether the amplitude of the input signal exceeds a threshold. If the amplitude exceeds the threshold, then the system begins to store the input signal data in memory.
Zero-Crossing Rate Estimation: In acoustics, a sound wave has positive and negative values of displacement around the zero amplitude. Zero-crossing rates are calculated by counting the crossings of the baseline over time. The presence of an active ZCR signal can improve threshold value detection.
End-Point Detection: An end-point beacon is generated when an ongoing input signal falls below the threshold for a preset period.
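The three detection steps above can be combined in a frame-based sketch. The frame length, thresholds, and the rule that a frame must pass both the amplitude and the ZCR test to open a segment are illustrative assumptions; the chapter does not specify how the ZCR result is combined with the threshold test.

```python
def zero_crossing_count(frame):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

def detect_segments(samples, frame_len, amp_threshold, zcr_threshold, hangover_frames):
    """Return (start, end) sample indices of the detected active segments."""
    segments = []
    start = None          # start index of the segment being recorded
    quiet = 0             # consecutive frames below the amplitude threshold
    n_frames = len(samples) // frame_len
    for f in range(n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        loud = max(abs(s) for s in frame) > amp_threshold
        if start is None:
            # Threshold detection, confirmed by zero-crossing activity.
            if loud and zero_crossing_count(frame) >= zcr_threshold:
                start, quiet = f * frame_len, 0
        elif loud:
            quiet = 0
        else:
            quiet += 1
            if quiet >= hangover_frames:      # end-point beacon
                segments.append((start, (f + 1 - quiet) * frame_len))
                start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments
```

The `hangover_frames` parameter plays the role of the preset period: a segment is closed only after that many consecutive quiet frames, and its end is placed where the quiet run began.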

Direction-of-arrival estimation
Figure 5 displays a microphone array, where $x_1(t)$ and $x_2(t)$ represent the acoustic signals that are received by microphones 1 and 2, respectively; d denotes the distance between these two microphones; and θ is the direction between the array and an unknown source, whose signal is represented as s(t). The source is assumed to be far enough from the microphone array that the acoustic wave-front that impinges upon the microphone array can be approximated as a plane wave. Let microphone 1 be the reference point; the relationship between the received signals and the source signal in the time domain is given by the equation,


where τ is the propagation delay from the source to the microphone. As shown in the figure, after the wave-front impinges on microphone 2, the wave-front takes time τ to reach microphone 1. The distance between the wave-front and microphone 1 is $d\cos\theta$ (Chen et al., 2010). Therefore, $\tau = d\cos\theta / c$, where c is the sound velocity. In real-life applications, noise and reverberation may distort wave shapes, potentially affecting the propagation delay. A feasible way to estimate the time delay accurately involves using the AMDF. The AMDF first fixes the signal at microphone 1, and then shifts the signal at microphone 2 to calculate the time delay. When the two signals are most similar, the difference between the waves is minimized. In other words, the τ value is obtained when the correlation between the waveforms of the two microphone signals is maximal. Let N be the total number of windows and i represent the sliding window index. The AMDF can be expressed as

FPGA implementation
The entire sound localization system (except for the microphone signal amplifiers) was implemented on a single Altera DE2-70 FPGA board. The design was developed using the Quartus II software tool. First, on the FPGA board, the AD converter controller used the I²C protocol to control serial input and serial output data. The sound activity detection block was divided into three modules that were implemented separately. At the time-delay estimation stage, the AMDF block used conventional basic operational logic elements, such as shift registers, subtraction, absolute value operations, and accumulation, to facilitate the entire design. All blocks were pipelined to further accelerate computation. Finally, the output result was displayed on the DE2-70 board using a seven-segment display and LEDs. The system used a total of around 15,600 logic elements (around 188,000 logic gates).

SoC implementation
After the FPGA simulation and validation were complete, the sound localization system was ported to the chip level. In this system, after an input signal passed through the sound activity detection module, it was stored in the left and right SRAMs, respectively. Next, the subtraction/absolute/accumulation (SAA) (Wang et al., 2011) module performed the major operations in the AMDF, including subtraction, taking absolute values, and accumulation. Hence, the AMDF block was able to estimate the time delay using the SAA module and convert it into a direction by accessing a predefined table in the ROM.
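A software model of the delay estimation and the delay-to-direction conversion may help to fix ideas. The sketch below assumes a common form of the AMDF, the mean of |x1[i] − x2[i + lag]| over a window, and the far-field relation τ = d·cos θ / c described earlier; the sign convention of the lag, the function names, and the default sound velocity of 343 m/s are assumptions.

```python
import math
import numpy as np

def amdf_delay(x1, x2, max_lag):
    """Return the lag (in samples) that minimizes mean |x1[i] - x2[i + lag]|."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n = len(x1)
    ref = x1[max_lag:n - max_lag]            # fixed window of microphone 1
    best_lag, best_val = 0, math.inf
    for lag in range(-max_lag, max_lag + 1):
        shifted = x2[max_lag + lag:n - max_lag + lag]   # shifted microphone 2
        val = np.mean(np.abs(ref - shifted))
        if val < best_val:
            best_lag, best_val = lag, val
    return best_lag

def delay_to_angle(lag, sample_rate, mic_distance, sound_speed=343.0):
    """Convert a lag in samples to a direction in degrees via tau = d*cos(theta)/c."""
    tau = lag / sample_rate
    cos_theta = max(-1.0, min(1.0, sound_speed * tau / mic_distance))
    return math.degrees(math.acos(cos_theta))
```

Because only 2·max_lag + 1 lags are possible, the angles can be precomputed once with `delay_to_angle` and stored in a table, which mirrors the ROM look-up used on the chip.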

However, while running the AMDF, the system must repeat the correlation analysis many times, and the number of repetitions grows with N. The variable N is set herein to 64 for convenience of chip implementation. To reach a favorable trade-off between the chip area and performance, the system used a folding technique to realize the SAA architecture (Fig. 6 and (9)). A comparison with the unfolded SAA architecture revealed that the number of adders had been reduced from 127 to eight, and the number of units that performed the absolute value operation had been decreased from 64 to four. The length of the critical paths was effectively minimized, enhancing the clock rate.
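The folding trade-off can be modeled in software. The sketch below is an interpretation, not the circuit of Fig. 6: it assumes that the unfolded design uses N subtractors plus an adder tree of N − 1 adders, and that the folded design reuses four lanes (four subtractors, a three-adder tree, and one accumulator) over N/4 clock cycles. These assumed counts reproduce the reported figures of 127 and eight adders for N=64.

```python
def saa_unfolded(left, right):
    """All |l - r| terms computed in parallel and summed by an adder tree."""
    return sum(abs(l - r) for l, r in zip(left, right))

def saa_folded(left, right, lanes=4):
    """Reuse `lanes` subtract/absolute units; return (result, cycles)."""
    acc, cycles = 0, 0
    for start in range(0, len(left), lanes):
        diffs = [abs(l - r) for l, r in zip(left[start:start + lanes],
                                            right[start:start + lanes])]
        acc += sum(diffs)      # small adder tree plus one accumulator
        cycles += 1
    return acc, cycles

def adders_unfolded(n):
    """Assumed count: n subtractors plus n - 1 tree adders."""
    return 2 * n - 1

def adders_folded(lanes):
    """Assumed count: lanes subtractors, lanes - 1 tree adders, 1 accumulator."""
    return 2 * lanes
```

The folded version produces the same sum in 16 cycles for N=64, trading latency for area.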

Experimental results
The sound localization system was tested with sources in different directions, ranging from 15° to 90° in steps of 15°, at five distances (1-5 m). The experimental results indicated that the average accuracy was 80%-90%. The estimation error could be maintained in the range ±5°-±10°. With respect to chip performance, the number of logic gates was reduced to 32,616. Also, the core size and power consumption were minimized (see Wang et al., 2008a; Wang et al., 2009; Wang et al., 2011 for details).

Embedded system design for speaker identification/verification
The field of speaker recognition has existed for five decades (Furui, 2004). Recently, speaker recognition systems have found many applications in the real world. They are highly flexible and convenient for a wide range of daily-life applications. Various approaches, involving neural networks (Clarkson et al., 2001), Gaussian mixture models (GMMs) (Burget et al., 2007), and support vector machines (SVMs) (Cortes et al., 1995), have been adopted for recognizing speakers. Among them, SVM-based speaker recognition has recently attracted much attention.
Based on the idea of the working set, Platt et al. (1998) proposed the use of the sequential minimal optimization (SMO) algorithm, which is a widely used learning algorithm that involves decomposition, to solve the quadratic programming (QP) problem. Basically, the SMO algorithm performs the following two processes repeatedly: 1) selecting a fixed number of Lagrange multipliers, and 2) solving the QP problem of the multipliers until an optimal solution is found. Although the SMO algorithm makes SVM learning feasible when the number of training samples is very large, the number of required iterations still results in a heavy computational burden, which makes it unsuitable for use with stand-alone embedded devices.
The operation of the proposed system based on SMO involves a training phase and an identification phase. Since the SMO training algorithm has a huge computational load, it is realized as a dedicated very large-scale integration (VLSI) hardware module. The remaining processes of the system, such as speech preprocessing, speech feature extraction, and SVM-based voting, are implemented in software. The proposed system requires 90% less training time than an embedded C implementation on the ARM processor, and achieves an accuracy of 89.9% with the 2010 speaker recognition database of the National Institute of Standards and Technology (NIST). The proposed system was tested and found to be fully functional on a Socle CDK prototype development board (www.socle-tech.com.tw) with an AMBA-based Xilinx FPGA board and an ARM926EJ processor.

Support vector machine
Support vector classification (Cortes et al., 1995) is a computationally efficient means of finding hyperplanes in a high-dimensional feature space. Training an SVM is equivalent to finding a hyperplane with the maximum margin.
The canonical representation of a decision hyperplane is (10), where w is the weight vector; b is a constant; and $y_i$ is the label of $x_i$. The optimization problem involves minimizing $\|w\|^2$. In imperfect separation, the optimal hyperplane is obtained by solving the following constrained optimization problem (11),

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i}\xi_i \quad \text{subject to} \quad y_i\big(w\cdot\phi(x_i) + b\big) - 1 + \xi_i \ge 0,\ \ \xi_i \ge 0,\ \ i = 1, 2, \ldots$$

where C is a real-valued cost parameter, and $\xi_i$ is a penalty parameter (slack variable). If $\phi(x_i) = x_i$, the SVM finds a linear separating hyperplane with the maximal margin. An SVM is called a nonlinear SVM when φ maps $x_i$ into a higher-dimensional space. Equation (12) is the Lagrange function for imperfect separation.
Basically, (12) is a QP problem and can be solved using the SMO algorithm.

Sequential Minimal Optimization
The basic task of the SMO algorithm is to find the hyperplane parameters, w and b, by updating the Lagrange parameters α. In each iteration, the SMO algorithm searches through the feasible region of the dual problem and maximizes the objective function by choosing two α terms and jointly optimizing them, with the values of the other α terms fixed. Then, the objective function can be written as (13).
The terms Δα1 and Δα2 are used to update the hyperplane parameters w and b according to (18) and (19).
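One iteration of the pairwise update can be sketched as follows. Equations (13)-(19) are not reproduced in this text, so the code follows the widely used textbook form of Platt's update with a linear kernel and the decision function f(x) = w·x + b; the sign convention of b, the function name, and the variable names are assumptions, and this is a software illustration rather than the hardware module.

```python
import numpy as np

def smo_pair_update(i, j, alpha, b, X, y, C):
    """Jointly optimize alpha[i] and alpha[j]; return (alpha, b, changed)."""
    K = X @ X.T                              # linear kernel matrix (assumption)
    f = (alpha * y) @ K + b                  # current decision values
    E_i, E_j = f[i] - y[i], f[j] - y[j]      # prediction errors
    # Box bounds that keep 0 <= alpha <= C and sum(alpha * y) unchanged.
    if y[i] != y[j]:
        low, high = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        low, high = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    eta = 2.0 * K[i, j] - K[i, i] - K[j, j]  # curvature along the constraint line
    if low == high or eta >= 0:
        return alpha, b, False
    new = alpha.copy()
    new[j] = np.clip(alpha[j] - y[j] * (E_i - E_j) / eta, low, high)
    new[i] = alpha[i] + y[i] * y[j] * (alpha[j] - new[j])
    d_i, d_j = new[i] - alpha[i], new[j] - alpha[j]   # the two delta-alpha terms
    b1 = b - E_i - y[i] * d_i * K[i, i] - y[j] * d_j * K[i, j]
    b2 = b - E_j - y[i] * d_i * K[i, j] - y[j] * d_j * K[j, j]
    if 0 < new[i] < C:
        b_new = b1
    elif 0 < new[j] < C:
        b_new = b2
    else:
        b_new = 0.5 * (b1 + b2)
    return new, b_new, True
```

For a linear kernel, the weight vector follows from the multipliers as `w = (alpha * y) @ X`. A full trainer repeats this update over pairs that violate the optimality conditions, which is the iteration count that motivates the hardware realization.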

Hardware implementation
The proposed system can perform both speaker training and identification. Its hardware/software partitioning follows the complexity analysis in Fig. 7.

Fig. 8. Proposed hardware/software co-design system for speaker identification.

Experimental results
The NIST 2010 speaker recognition evaluation (SRE10) speech corpus (by nine speakers) was adopted to evaluate the proposed hardware/software co-design framework. Six datasets, including nine speakers' files in SRE10, were used to evaluate a speaker identification system for an entrance security application. The training utterance of each speaker was 10 s long. The duration of the testing utterances was 2-6 s. The order of the linear predictive cepstral coefficients (LPCCs) was 18.
Figure 9 presents a time-cost comparison between the proposed hardware/software system and the embedded C code system (ARM-ported system). The proposed design had a 90% lower time-cost than the embedded C code one in the case of interest. Details of the evaluation can be found in (Wang et al., 2008b).

Embedded system design for interactive retrieval of spoken documents
Owing to the increasingly widespread use of personal portable devices, an efficient method for retrieving spoken data with limited resources is required. This section proposes an efficient feature-based sentence-matching algorithm for speaker-dependent personal spoken sentence retrieval. Such a system can efficiently retrieve database sentences that only partially match the query sentence inputs. A whole matching plane-based accumulation (WMPB) scheme is then designed to determine the global similarity score. The proposed algorithms are based on the feature-level comparison and do not require acoustical and language models.

Sentence matching for retrieving spoken sentences
Sentence matching is performed to determine the similarity between two sentences. Consider two spoken sentences A and B. Assume that $A = \{a_1, a_2, \ldots, a_m\}$ is an m-word spoken sentence and $B = \{b_1, b_2, \ldots, b_n\}$ is an n-word spoken sentence. The similarity between A and B can then be directly determined from the number of matched words (common words) in these two sentences. For example, if spoken sentences A and B are "I have a meeting in London tomorrow" and "Where is my meeting tomorrow?", respectively, then "meeting" and "tomorrow" are the matched words. Since only subsets of words in sentences are matched, sentence matching is a form of partial matching. This partial sentence-matching concept can be applied to spoken sentence retrieval.
Because this similarity is defined semantically, using a speech recognition system with acoustical and language models to transcribe spoken sentences into semantic texts is intuitive.To develop a language-independent retrieval system with a small required memory and favorable performance for a medium-sized sentence database, feature-level partial matching algorithms that do not use acoustic and language models are proposed herein.
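The word-level notion of partial matching used in the example sentences can be stated in a few lines of code. This sketch works on text transcriptions, which is exactly what the proposed feature-level method avoids; it is included only to make the definition of matched words concrete, and the function name is hypothetical.

```python
import string

def matched_words(sentence_a, sentence_b):
    """Return the set of words common to both sentences, ignoring case and punctuation."""
    def words(sentence):
        return {w.strip(string.punctuation).lower() for w in sentence.split()} - {""}
    return words(sentence_a) & words(sentence_b)
```

The size of the returned set serves as the word-level similarity between the two sentences.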

Spoken sentence retrieval based on feature-level partial matching
This subsection presents a new partial matching system that is applied to the feature level. Figure 10 shows the proposed feature-level partial matching. First, the features of the spoken sentence are extracted frame by frame. The feature sequence is then segmented into equally sized matching units that are called feature pattern units (FPUs). Given a query sentence Q with l FPUs and a database sentence D with k FPUs, each sentence is represented as its sequence of FPUs. Threshold-based matching of such units requires a distance threshold (Itoh, 2001; Itoh & Tanaka, 2002), and this threshold is difficult to define owing to variation in speech. Without a threshold comparison, an attempt is made herein to find a better similarity score function based on only the feature distances. The inverse distance weighting (IDW) method provides a measure for estimating the uncertainty of variables. Moreover, this approach is sufficiently flexible to model the variables in a trend curve (Tomczak, 1998).

WMPB algorithm
Based on the above description, the proposed spoken sentence retrieval is summarized as follows.
Step 1. Sentence segmentation and feature extraction. Assume that the FPU size is n frames. The length of the overlap between successive FPUs is n/2 frames (Ng & Victor, 2000). A spoken query sentence and a spoken sentence from the database are segmented based on the FPU size, with n/2 overlapping frames. The FPU overlap of n/2 frames is taken from another work (Itoh, 2001). Moreover, such a setting covers each frame in the query and database sentences; this scheme of redundancy is thought to be advantageous for partial matching. According to Fig. 11, the query sentence has l FPUs and the database sentence has k FPUs.
Step 2. Determination of matching plane. For a query sentence with l FPUs and a database sentence with k FPUs, a 2-D matching plane that contains l×k matching blocks is created. T matching planes are created if the database contains T sentences. Figure 11 illustrates the creation of the matching planes.
Step 3. Calculation of the similarity score of each matching block. For each matching block, dynamic programming is utilized to calculate the feature distance of the two FPUs. These feature distances are then used to determine local similarity scores using the IDW function.
Step 4. Accumulation of similarity scores. Over the whole matching plane, the similarity scores associated with all of the matching blocks are accumulated to yield a global similarity score.
Step 5. Iterative checking of the other database sentences. Repeat steps 1 to 4 for the other database sentences until all of their global similarity scores are obtained.
Step 6. Ranking of database sentences. Rank the database sentences in accordance with their global similarity scores. Because the local similarity scores of all the matching blocks in the matching plane are accumulated to yield a global similarity score, the proposed spoken sentence retrieval method is called the whole matching plane-based (WMPB) algorithm.
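The six steps can be summarized in a compact sketch. It makes several assumptions that the text does not fix: features are given as NumPy arrays of shape (frames, dimensions), the dynamic-programming distance is a plain dynamic time warping cost normalized by the combined length, and a small constant `eps` is added before the inverse distance weighting so that identical FPUs do not cause a division by zero. All names and default values are hypothetical.

```python
import numpy as np

def split_into_fpus(features, n):
    """Step 1: cut a (frames x dims) array into FPUs of n frames with n/2 overlap."""
    hop = n // 2
    return [features[s:s + n] for s in range(0, len(features) - n + 1, hop)]

def dtw_distance(a, b):
    """Step 3a: dynamic-programming alignment cost between two FPUs."""
    la, lb = len(a), len(b)
    cost = np.full((la + 1, lb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[la, lb] / (la + lb)

def idw(distance, p=2, eps=1e-3):
    """Step 3b: inverse distance weighting, 1 / X^p, guarded against X = 0."""
    return 1.0 / (distance + eps) ** p

def global_similarity(query, sentence, n=4, p=2):
    """Steps 2 and 4: accumulate local scores over the whole l x k matching plane."""
    return sum(idw(dtw_distance(q, d), p)
               for q in split_into_fpus(query, n)
               for d in split_into_fpus(sentence, n))

def rank_database(query, database, n=4, p=2):
    """Steps 5 and 6: score every database sentence and sort by descending score."""
    scores = [global_similarity(query, s, n, p) for s in database]
    return sorted(range(len(database)), key=lambda t: scores[t], reverse=True)
```

Because every matching block contributes to the sum, a database sentence that shares even one well-matched region with the query receives a large score from the corresponding blocks, which is what makes the scheme suitable for partial matching without a distance threshold.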

Embedded system implementation
The proposed spoken sentence retrieval system was realized on a Pocket PC (HP iPAQ H5550) with 128 MB of RAM and 48 MB of flash memory. The Pocket PC uses an Intel PXA255 processor (an XScale micro-architecture based on the ARM V5TE), which is a dedicated portable chip suitable for handheld devices (www.intel.com). A 16-bit integrated audio codec (AC'97 2.0) was adopted for concurrent real-time speech input/output. The average memory size of one sentence was 142.3 kB at a sampling rate of 8 kHz. The Microsoft embedded compiler based on Visual C++ 4.0 was used for the OS of the Pocket PC. Since the PXA255 processor does not support floating-point computation, a fixed-point conversion strategy was adopted to tackle the problem (see Lin & Wang, 2007). After the conversion method transformed the partial-matching program into a fixed-point format, the program was burned into the onboard flash memory. The program occupied only 140 kB of memory, which is appropriate for portable devices.

Experimental results
The experiments are divided into two phases: the parameter setting phase and the evaluation phase. In the parameter setting phase, experiments are conducted to find the best parameters of the IDW function and the FPU size for the proposed algorithm. Table 6 lists the characteristics of the experimental environment. In the evaluation phase, experiments were conducted to evaluate the retrieval performance of the proposed partial matching algorithm. Sentences were spoken naturally by one person without controlling the duration of the words or speaking at a deliberately chosen rate. The query sentences partially matched their related database sentences. Here, matching keywords are defined as the terms that are common to queries and their related database sentences. Table 7 lists the overall statistics concerning the experimental database. The database sentences were ranked by their global similarity scores. The retrieval performance was assessed using the most commonly used measurement, which is non-interpolated mean average precision (mAP) (Baeza-Yates & Ribeiro-Neto, 1999; Lo et al., 2002). The mAP is defined as

$$\mathrm{mAP} = \frac{1}{L}\sum_{i=1}^{L}\frac{1}{M_i}\sum_{j=1}^{M_i}\frac{1}{N_j}\sum_{k=1}^{N_j}\mathrm{precision}_j(k)$$

where $N_j$ denotes the total number of relevant sentences for query j; $M_i$ represents the total number of queries in batch i; L is the total number of query batches; and $\mathrm{precision}_j(k)$ is the precision of $Q_j$ when k sentences are retrieved.
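The measure can be computed with a short routine. The sketch follows the usual definition of non-interpolated average precision, in which the precision is sampled at the rank of each relevant sentence; reading "when k sentences are retrieved" as "when the k-th relevant sentence is retrieved" is an interpretation, and the data layout (a list of batches, each a list of ranked relevance flags with the number of relevant sentences) is an assumption.

```python
def average_precision(ranked_relevance, num_relevant):
    """Non-interpolated AP of one query from its ranked relevance flags."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank      # precision at this relevant sentence
    return total / num_relevant if num_relevant else 0.0

def mean_average_precision(batches):
    """Average AP over the queries of each batch, then over the batches."""
    batch_means = []
    for batch in batches:
        aps = [average_precision(flags, n) for flags, n in batch]
        batch_means.append(sum(aps) / len(aps))
    return sum(batch_means) / len(batch_means)
```

For a query with two relevant sentences ranked first and third, the average precision is (1/1 + 2/3) / 2, or about 0.83.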

Conclusion
This chapter presented various speech processing approaches for use in embedded systems, involving speech feature extraction, sound localization, speaker identification/verification, and interactive retrieval of spoken documents.To facilitate implementation, related algorithms and methods of improving them are discussed with reference to FPGA and ARM-based architectures.Experiments were also conducted using testing datasets; the results showed that proper hardware design can improve the performance of the approaches, and the efficacy of the improved algorithms was subsequently demonstrated.

Fig. 2. Circuit design of an embedded system. (a) Architecture of the proposed MFCC chip. (b) Architecture of the look-up table (Wang et al., 2003).

Figure 3 presents the overall architecture of the sound localization system, including a sound signal amplifier, an analogue-to-digital (AD) converter, a sound activity detector, and an AMDF module. External acoustic signals are received by a pair of microphones and magnified by an amplifier. The AD converter transforms analogue data to digital data. The sound activity detection block consists of threshold value detection, zero-crossing rate (ZCR), and end-point detection modules. These three methods are utilized to distinguish desired segments from silent periods. Finally, the AMDF module (Wang et al., 2008a; Wang et al., 2009) estimates the delay based on the desired signal segments, and converts the delay into angles. A brief workflow of the system is shown in Fig. 4.
According to the complexity analysis, the SMO training, which takes 90.89% of the training time, is the computational bottleneck. Hence, the SMO is realized in hardware, and the remaining processes, including preprocessing, feature extraction, and voting analysis, are implemented in software. As shown in Fig. 8, the proposed design comprises the software-based extraction block (SEB), the hardware-based training block (HTB), and the software-based voting block (SVB). The SEB mainly performs speech preprocessing and speech feature extraction. The HTB executes the SMO algorithm, and the SVB is designed to find the target speaker based on a multiclass SVM. This design can be applied to a fast-trainable system in a stand-alone embedded environment (see Kuan et al., 2010 for details).
The equally sized FPUs of the query and database sentences form a matching plane, shown in Fig. 11. Each matching block in the matching plane is associated with an FPU in the query sentence and an FPU in the database sentence. A feature-level similarity function is evaluated for each matching block, and the global similarity score for Q and D at the feature level is calculated from these block scores.

Fig. 10. Proposed feature-level partial matching algorithm.

To test which weighting function performs well, experiments on inverse exponential weighting ($\mathrm{IEW}(X) = 1/e^{X}$) and inverse distance weighting ($\mathrm{IDW}(X) = 1/X^{p}$, where p is an integer weighting power) techniques for summing local similarity scores were conducted (see the previous work (Lin & Wang, 2007) for details). Based on this experiment, the IDW function outperformed the IEW function; therefore, IDW was used to evaluate the similarity score. The global similarity score function Ψ is defined as,

Table 1. Number of operations and the required memory estimated using the original MFCC algorithm.

Table 2. Improvement of C[n] calculation by rescheduling the original MFCC algorithm and the total improvement provided by the proposed method.

Table 3. Analysis of dynamic range and determined fixed-point data format.

Table 4. Average error with the determined fixed-point data format.
Table 8 summarizes the overall statistics for the entire experimental database.

Table 6. Characteristics of experimental environments.

Table 7. Database statistics.