Quantized Neural Networks and Neuromorphic Computing for Embedded Systems

Deep learning techniques have achieved great success in areas such as computer vision, speech recognition and natural language processing. These breakthroughs are changing every aspect of our lives. However, deep learning techniques have not realized their full potential in embedded systems such as mobile devices and vehicles, because their high performance comes at the cost of high computation resource and energy consumption. It is therefore very challenging to deploy deep learning models in embedded systems, which have very limited computation resources and strict power constraints. Extensive research on deploying deep learning techniques in embedded systems has been conducted and considerable progress has been made. In this book chapter, we introduce two approaches. The first is model compression, one of the most popular approaches proposed in recent years. The second is neuromorphic computing, a novel computing system that mimics the human brain.


Introduction
Deep learning is a branch of machine learning inspired by the biological processes of the human brain, and it is not a new concept. It was not popular earlier because there was not enough computational power or data available. With the development of the semiconductor industry and the Internet, stronger computational power and the tremendous amount of data generated by the Internet have made deep learning techniques practical [1][2][3][4][5][6][7][8][9][10].
Even though deep learning techniques have achieved great success in many fields, we still have not realized their full potential, especially in embedded systems, because such systems do not have enough computation power. In the era of mobile computing, enabling deep learning techniques to run on mobile devices is very important, and many researchers have been working in this area [4,6]. Research has been conducted in two directions. The first direction aims to reduce the model size and computation of deep learning models. The second direction is to design new hardware architectures with much stronger computation power. In this chapter, we introduce two techniques. The first is quantization, which reduces the computation and model size of deep learning models. The second is neuromorphic computing, a new hardware architecture that enhances computation power.

Neural network
An artificial neural network is a computing system capable of mimicking the human brain. Its purpose is to identify patterns in input data and learn an approximate function that maps inputs to outputs. The most basic building units of a neural network are neurons, each of which has inputs, outputs and a processing unit. To learn complicated patterns in input data, a neural network consists of a huge number of neurons organized into layers [11,12]. These layers are stacked on each other so that the output of one layer is the input of the following layer. A neuron in a layer is connected to multiple neurons in the previous layer in order to receive data from them. Data received from neurons in the previous layer are multiplied by corresponding weights and the products are accumulated to generate an output.

Single neuron
The most basic building unit of a neural network is the neuron. A neuron receives data from multiple neurons in the previous layer and each of these data is multiplied by a weight. These weighted data are then accumulated to generate an output, and a non-linear function is applied to the output before it is sent to other neurons. More details about why a non-linear function is needed are presented in Section 2.3. The working mechanism of a single neuron can be expressed as Eq. (1):

Out = Σ_i w_i x_i + b, (1)

where x_i is the i-th input, w_i is the weight corresponding to the i-th input, b is the bias value and Out is the accumulated output. Figure 1 illustrates a single neuron with three inputs, where x_1, x_2 and x_3 are the three inputs and w_1, w_2 and w_3 are the corresponding weights. The output of this neuron can be computed using Eq. (1).

Figure 1.
Single neuron with three inputs.
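The weighted-sum computation of a single neuron can be sketched in a few lines of Python (the input values, weights and bias below are invented for illustration, not taken from the chapter):

```python
import numpy as np

def neuron(x, w, b):
    """Weighted sum of inputs plus bias: Out = sum_i(w_i * x_i) + b."""
    return float(np.dot(w, x) + b)

# Three inputs, matching the setup of Figure 1.
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.25, 0.1])
out = neuron(x, w, b=0.2)
```

The non-linear activation discussed in Section 2.3 would then be applied to `out` before passing it on.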

Multilayer perceptron
A multilayer perceptron (MLP) is a kind of neural network that has at least one layer of neurons between the input and output layer. These intermediate layers are called hidden layers. The multilayer perceptron was introduced because hidden layers make a neural network much more powerful and able to learn very complicated patterns in input data. If there is no layer between the input and output layer, the input is transformed to the output by a linear transformation and the neural network can only handle linearly separable data. To enable a neural network to handle data that are not linearly separable, we need at least one hidden layer, and non-linear functions must be applied to the outputs of the hidden layers. Let us take the MLP in Figure 2 as an example. This neural network has three inputs, two outputs and one hidden layer with four neurons. Lines represent connections between neurons. Each connection has an associated weight, and we use weight matrices to represent the connections between the input, hidden and output layers.
The behaviour of the neural network in Figure 2 can be expressed mathematically using Eq. (2):

h_1 = θ(W_1 · x), y = W_2 · h_1, (2)

where θ is the non-linear activation function; the symbol · denotes the dot product between two matrices; W_1 is the weight matrix between the input and hidden layer; W_2 is the weight matrix between the hidden and output layer; h_1 is the output of the hidden layer; x is the neural network input and y is the neural network output. The input x is first mapped to the four hidden neurons by the weight matrix W_1, and the non-linear activation function is applied to each neuron to produce h_1. This output is then multiplied by the weight matrix W_2 to obtain the final result y.

Non-linear activation function
A non-linear activation function [13,14] is very important for an MLP. Without one, a neural network performs a linear transformation from input to output no matter how many hidden layers exist between the input and output layer. This is because a linear transformation of a linear transformation is still a linear transformation, so any number of hidden layers can be reduced to a single linear transformation. Let us take the MLP in Figure 2 as an example. Without non-linear activation, the neural network can be expressed mathematically as h_1 = W_1 · x and y = W_2 · h_1. Substituting the first equation into the second gives y = W_2 · (W_1 · x) = (W_2 W_1) · x = W · x, where W = W_2 W_1. Therefore, the MLP in Figure 2 reduces to a single linear transformation from input to output.
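The collapse of stacked linear layers into one can be checked numerically. The sketch below uses random matrices with the same shapes as the network in Figure 2 (three inputs, four hidden neurons, two outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # weights: input (3) -> hidden (4)
W2 = rng.standard_normal((2, 4))   # weights: hidden (4) -> output (2)
x = rng.standard_normal((3, 1))    # one input column vector

# Without a non-linear activation, two layers collapse into one:
y_two_layers = W2 @ (W1 @ x)
W = W2 @ W1                        # a single 2x3 linear transformation
y_single = W @ x
```

Both computations produce the same output, which is exactly why a non-linear activation is needed between layers.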

Types of hidden layers
In MLP neural networks [15], hidden layers between the input and output layer play a very important role in performance. There are many different types of hidden layers, such as convolutional layers, fully-connected layers and pooling layers. In this section, we present more details about convolutional and fully-connected layers.

Fully-connected layers
In a fully-connected layer, each neuron is connected to all neurons in the previous layer and each connection has an associated weight. Each output of a neuron in the previous layer is multiplied by the weight associated with the connection, and the products are accumulated together.
Let us take the hidden layer of the MLP in Figure 2 as an example. This hidden layer is a fully-connected layer. Each neuron in the hidden layer connects to all three inputs in the input layer and generates one output. The weight matrix is represented in Eq. (6). In the weight matrix, each row holds the weights of one neuron, so the matrix size is 4 × 3 since there are four outputs and three inputs.
The input can be represented as a matrix of size 3 × 1.
Mathematically, a fully-connected layer is computed as a matrix multiplication.
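As a small numeric sketch of this matrix multiplication (the weight values are invented for illustration; the shapes match the 4 × 3 case described above):

```python
import numpy as np

# Hidden-layer weights: 4 neurons x 3 inputs (values invented for illustration)
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6],
              [0.7, 0.8, 0.9],
              [1.0, 1.1, 1.2]])
x = np.array([[1.0],
              [2.0],
              [3.0]])   # input as a 3x1 column vector
out = W @ x             # 4x1 output, one accumulated value per neuron
```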

Convolutional layer
The convolutional layer is used in many deep learning applications, especially in computer vision [1,2,[16][17][18]. In computer vision, processing and understanding images is a major task. An image has three dimensions: width, height and channel. Meanwhile, an image is highly structured and has strong spatial dependency [11].
A convolutional layer has a group of kernels and each kernel has three dimensions: width, height and channel. The width and height of a kernel are hyper-parameters defined by the designer, while the channel size of a kernel equals the channel size of the previous layer. Unlike in a fully-connected layer, each neuron in a convolutional layer is connected only to a small spatial region of neurons, but to all channels, in the previous layer. The size of this spatial region depends on the width and height of the kernel. Each kernel slides over the whole image with a specific stride to extract a feature, such as an edge, from the image. Therefore, each kernel extracts a specific feature from each local region.
Let us use Figure 3 as an example to demonstrate how a convolutional layer works. In Figure 3, the image has one channel with size 6 × 6 × 1 and there is one kernel with size 3 × 3 × 1, with weight matrix W. The weight matrix W is multiplied element-wise by the pixels of a small region in the image and the products are accumulated together. Assume we are applying our kernel on the yellow region of the image in Figure 3. Then, we can get the output Out using Eq. (10), where ⊙ represents the element-wise product between two matrices.
The convolutional layer has several advantages over other layers when dealing with images, which make it very popular in computer vision. First, convolutional layers need far fewer weights than fully-connected layers. In a fully-connected layer, a neuron is connected to all neurons in the previous layer; if the dimension of the previous layer is very large, the number of weights required is very large, since the total number of weights equals the number of neurons in the previous layer times the number of neurons in the fully-connected layer. Second, the convolutional layer focuses on local spatial regions instead of the whole image, and many applications benefit from this. For example, in object detection we only need to focus on regions where the object appears; background regions are not needed. Third, the convolutional layer is translation invariant: the responses of a kernel to an object are the same regardless of the location of the object in the image.
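The sliding-window computation can be sketched as follows. The image and kernel values are made up (a simple vertical-edge kernel applied to a 6 × 6 single-channel image, mirroring the Figure 3 setup):

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Slide one kernel over a single-channel image (no padding)."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            region = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(region * kernel)  # element-wise product, then sum
    return out

image = np.arange(36, dtype=float).reshape(6, 6)   # 6x6, one channel
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])                 # vertical-edge kernel
out = conv2d_single(image, kernel)                 # 4x4 feature map
```

With stride 1 and no padding, a 3 × 3 kernel on a 6 × 6 image produces a 4 × 4 feature map, one value per local region.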

Model compression
In the era of mobile computing, enabling deep learning techniques to run on mobile devices is very important. In general, large and complicated models tend to have high performance, but they increase the computation requirement dramatically. Embedded systems do not have sufficient computation and memory resources to support the complexity of a high-performance deep learning model. Therefore, deploying deep learning models in embedded systems without sacrificing much performance has been a hot research topic [6,[19][20][21][22].

Model quantization
Deep learning models use floating-point arithmetic in both the training and inference phases. Floating-point arithmetic requires many computation resources, which is one reason why deploying deep learning models in embedded systems is difficult. To address this issue, researchers have proposed many approaches [4][5][6][23][24][25][26][27][28][29] to replace floating-point arithmetic during the inference phase. Low bit-precision arithmetic [30][31][32] is one of these approaches.
In the low-bit arithmetic approach, floating-point numbers and floating-point arithmetic are still used in the training phase. After training is done, model weights and activations are quantized to low-bit integers such as 8-bit or 16-bit integers. During inference, integer arithmetic is used instead of floating-point arithmetic, so the computation resource requirement is reduced dramatically.

Quantization scheme
In Ref. [4], the authors proposed a quantization scheme that has been successfully adopted in TensorFlow [33]. During the inference phase, the proposed scheme uses integer-only arithmetic, while floating-point arithmetic is still used in the training phase. Since this approach uses different data types in the training and inference phases, a one-to-one mapping between floating-point numbers and integers is needed. The authors use Eq. (11) [4] to describe this mapping:

r = S(q − Z), (11)

where r is the real (floating-point) value and q is its quantized integer.
In this equation, S and Z are the quantization parameters, which are constant for each layer. There is only one set of quantization parameters for each activation layer and each weight layer.
The constant S is a floating-point number: a scale constant representing the size of each quantization level. For example, assume we are going to quantize a floating-point number r in a layer to an 8-bit integer. To calculate the scale constant S, we first obtain the maximum and minimum floating-point numbers of the layer, denoted r_max and r_min respectively. Since we are using an 8-bit integer, there are n = 2^8 = 256 quantization levels. The scale constant S can then be computed using Eq. (12):

S = (r_max − r_min) / (n − 1). (12)

The constant Z is the quantized integer that represents the real number 0. The reason we need to represent 0 exactly is that 0 is widely used in deep learning, for example in zero-padding in convolutional neural networks, and representing it exactly improves the performance of deep learning models. Z can be calculated using Eq. (13) [4]:

Z = q_min − r_min / S. (13)
In Eq. (13), q_min is the minimum quantization level of the quantized integer. For example, if we use an unsigned 8-bit integer, q_min is equal to 0. r_min and S are two floating-point numbers representing the minimum value of a layer and the scale constant of that layer, respectively. Because Z is an integer, we need to round r_min/S to the nearest integer. However, Eq. (13) only works when r_min is smaller than 0 and r_max is larger than 0. The other cases are handled as follows: if r_min is larger than 0, we set r_min to 0 and recalculate the scale constant S using the new r_min; if r_max is smaller than 0, we set r_max to 0 and recalculate S using the new r_max. After we obtain the scale constant S, we can calculate the zero-point.

Integer-only multiplication
After floating-point numbers are quantized into integers using the quantization scheme in Eq. (11), the authors in [4] describe how to compute the multiplication of two floating-point numbers using their quantized integers.
Assume we are going to compute the product of two floating-point numbers r_1 and r_2, with the result stored in a floating-point number r_3. We first quantize r_1, r_2 and r_3 into integers q_1, q_2 and q_3 respectively using Eq. (11), with scale constants S_1, S_2 and S_3 and zero-points Z_1, Z_2 and Z_3. Expanding r_3 = r_1 r_2 using Eq. (11) gives

q_3 = Z_3 + (S_1 S_2 / S_3)(q_1 − Z_1)(q_2 − Z_2). (15)

In Eq. (15) [4], every operation is between two integers except the multiplier S_1 S_2 / S_3, which is a floating-point number. To make the whole computation integer-only, the authors in [4] proposed an approach to quantize this multiplier. They found that M = S_1 S_2 / S_3 is always in the interval (0, 1) and used Eq. (16) [4] to describe the relationship between M and M_0:

M = 2^(−n) M_0, (16)

where M_0 is a number between 0.5 and 1, and n is a non-negative integer. Using Eq. (16), the authors in [4] make M a fixed-point multiplier. If a 16-bit integer is used in the multiplication, M_0 can be represented as the 16-bit integer 2^15 M_0, and the multiplication by 2^(−n) is computed with a bit-shift operation. The whole expression can then be computed using integer-only arithmetic.
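The fixed-point decomposition of Eq. (16) can be sketched as below; the value of M is a made-up stand-in for S_1 S_2 / S_3:

```python
def fixed_point_multiplier(M):
    """Decompose M in (0, 1) as M = 2^(-n) * M0 with M0 in [0.5, 1),
    then represent M0 as a 16-bit fixed-point integer."""
    n = 0
    M0 = M
    while M0 < 0.5:
        M0 *= 2.0
        n += 1
    q_M0 = round(M0 * (1 << 15))   # 2^15 * M0 as an integer
    return q_M0, n

def multiply_by_M(x, q_M0, n):
    """Integer-only multiplication by M via fixed-point multiply and shifts."""
    return ((x * q_M0) >> 15) >> n

M = 0.3                                  # stand-in for S1*S2/S3
q_M0, n = fixed_point_multiplier(M)
approx = multiply_by_M(1000, q_M0, n)    # close to 1000 * 0.3 = 300
```

The only operations at inference time are an integer multiply and two right shifts, which is what makes the scheme integer-only.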

Quantization-aware training
There are two common approaches to train quantized neural networks. The first is to train the neural network using floating-point numbers and then quantize the weights and activations after training. However, this approach might not work for some models. In [4], the authors found that it does not work for small models, which tend to suffer significant accuracy drops. The authors in [4] give two reasons. The first is that the range of weights differs greatly across output channels, so channels with a small weight range suffer large quantization errors. The second is that outlier weight values make the quantization of weights much less accurate.
For these reasons, the authors in [4] proposed a training approach that includes the quantization effects in the forward pass of training. The backward pass works as in traditional training, and floating-point numbers are still used for the weights and activations. During the forward pass, the authors in [4] use Eq. (17) to quantize each layer, applying it to each floating-point number element-wise:

clamp(r; a, b) = min(max(r, a), b),
s(a, b, n) = (b − a) / (n − 1),
q(r; a, b, n) = round((clamp(r; a, b) − a) / s(a, b, n)) · s(a, b, n) + a, (17)
where r is a floating-point number; a and b are the minimum and maximum values of a layer, and n is the number of quantization levels; for example, n = 2^8 = 256 quantization levels if an 8-bit integer is used.
The round function rounds a number to its nearest integer.

For the weights, the authors proposed setting a and b to the minimum and maximum floating-point numbers of the weight layer, respectively. For activation layers, the authors used exponential moving averages to track the minimum and maximum floating-point numbers. Once we have the range parameters a and b, the other parameters are easy to compute. This approach has been implemented in TensorFlow [33,34].
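This simulated ("fake") quantization of the forward pass can be sketched as follows (the value range [−1, 1] and the sample input are invented):

```python
def fake_quantize(r, a, b, n_bits=8):
    """Simulated quantization used in the forward pass: clamp r to
    [a, b], snap it to the nearest of n levels, then map back to a float."""
    n = 2 ** n_bits                     # e.g. 256 levels for 8 bits
    s = (b - a) / (n - 1)               # quantization step size
    r_clamped = min(max(r, a), b)
    return round((r_clamped - a) / s) * s + a

q = fake_quantize(0.37, a=-1.0, b=1.0)  # 0.37 snapped to an 8-bit level
```

Because the output stays a floating-point number, ordinary backpropagation can still flow through the layer while the forward pass sees quantization error.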

Comparison between different quantization approaches
Binarized neural network: A binarized neural network is an aggressive quantization approach that quantizes each weight to a binary value. In binary neural networks, the dot product between two matrices can be computed with XNOR and bit-count operations, where bit count counts the number of 1s in a vector. The binary neural network in [5] achieves a 32× reduction in model size and a 58× speed-up without losing much accuracy compared to an equivalent network using single-precision values.
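The bit-count trick can be sketched for two short {−1, +1} vectors packed into integers (the encoding below, bit 1 for +1 with the most significant bit first, is one possible convention, not necessarily the one used in [5]):

```python
def binary_dot(x_bits, w_bits, n):
    """Dot product of two {-1, +1} vectors packed as n-bit integers:
    XNOR marks positions where the signs agree, bit count tallies them."""
    xnor = ~(x_bits ^ w_bits) & ((1 << n) - 1)
    matches = bin(xnor).count("1")
    return 2 * matches - n              # agreements minus disagreements

# x = [+1, -1, +1, +1] and w = [+1, +1, -1, +1], most significant bit first
x_bits = 0b1011
w_bits = 0b1101
dot = binary_dot(x_bits, w_bits, n=4)
```

The entire dot product reduces to one XOR, one NOT and one population count, which is where the large speed-up comes from.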
DoReFa-Net: DoReFa-Net is one of the most popular quantization approaches. It applies quantization not only to the weights and activations but also to the gradients. By quantizing gradients, the training speed can be increased significantly. In [25], the proposed DoReFa-Net achieved 46.1% top-1 accuracy on ImageNet using 2-bit activations and 1-bit weights, trained with 6-bit gradients.
Log-based quantization: In [35], the authors proposed a multiplication-free hardware accelerator for deep neural networks. The approach quantizes each weight to the nearest power of two using logarithm and rounding functions, and quantizes each activation output to an 8-bit integer. By quantizing each weight to the nearest power of two, multiplications can be replaced by bit-shift operations, which reduces resource utilization significantly. In [35], the authors demonstrate that the proposed quantization approach achieves almost the same accuracy as the floating-point version while reducing energy consumption significantly.
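Replacing multiplication by bit shifts can be sketched as follows (the weight value 0.24 is invented; the nearest-power-of-two rule follows the description above):

```python
import math

def quantize_pow2(w):
    """Quantize a weight to the nearest power of two, keeping its sign."""
    if w == 0:
        return 0.0, 0
    exp = round(math.log2(abs(w)))           # nearest exponent
    return math.copysign(2.0 ** exp, w), exp

def shift_multiply(x_int, exp):
    """Multiply an integer by 2^exp using bit shifts, not multiplication."""
    return x_int << exp if exp >= 0 else x_int >> -exp

w_q, exp = quantize_pow2(0.24)   # nearest power of two is 2^-2 = 0.25
y = shift_multiply(64, exp)      # 64 * 0.25 computed as 64 >> 2
```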

Progress in model compression
Besides quantization, many other model compression approaches have been proposed. Pruning is one of the most popular [6]. Pruning reduces the number of weights by removing weights that meet certain criteria. Besides weights, pruning can also be applied to activations and biases.
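Magnitude-based pruning is one common criterion (the chapter does not fix a specific one); it can be sketched as:

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    flat = np.sort(np.abs(weights).ravel())
    k = int(len(flat) * sparsity)
    threshold = flat[k]                     # k-th smallest magnitude
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

w = np.array([[0.05, -0.8],
              [0.30, -0.01]])
w_pruned, mask = prune_by_magnitude(w, sparsity=0.5)   # keeps -0.8 and 0.30
```

The resulting zeros can be skipped at inference time or stored in a sparse format, reducing both computation and model size.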
Knowledge distillation is another very popular approach to model compression [36]. It involves two models: a teacher model and a student model. The teacher model is a trained model with a much larger size than the student model. The main idea of knowledge distillation is to transfer the knowledge of the teacher model to the student model so that the student can achieve performance comparable to the teacher's.
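A common way to expose the teacher's knowledge is a temperature-softened softmax over its logits, which the student is trained to match; the logits below are invented:

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; larger T gives softer probabilities."""
    e = np.exp((z - np.max(z)) / T)
    return e / e.sum()

# Soft targets from the teacher's logits; the student is trained to match
# these (usually combined with the ordinary hard-label loss).
teacher_logits = np.array([6.0, 2.0, -1.0])
soft_targets = softmax(teacher_logits, T=4.0)
hard_targets = softmax(teacher_logits)       # T = 1 for comparison
```

The softened distribution keeps the teacher's relative ranking of classes but reveals more about the near-misses, which is the extra information transferred to the student.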

Spiking neurons
The basic building blocks of spiking neural networks are spiking neurons. The working mechanism of a spiking neuron differs from that of the neurons introduced in Section 2.1. Spiking neurons exchange information through electrical pulses, also called spikes. Spikes are discrete, time-dependent data represented as binary signals. In [8], the authors introduced several properties of spiking neurons. First, spiking neurons receive information from many inputs and generate one output. Second, whether a spike is generated depends on the amount of excitatory and inhibitory input. Third, the spikes a spiking neuron receives from other neurons are integrated over time, and the neuron fires a spike if the integrated result exceeds a certain threshold.

Neuron models
There are several commonly used spiking neuron models, such as the leaky integrate-and-fire, Hodgkin-Huxley [60] and FitzHugh-Nagumo neuron models.
Leaky integrate-and-fire: According to [61], the leaky integrate-and-fire neuron model is the simplest to implement, and its operation can be completed with a few floating-point operations such as additions and multiplications. However, there is no phasic spiking in the leaky integrate-and-fire model since it has only one variable. Meanwhile, its spikes have no spiking latencies because the threshold is fixed. The behaviour of the leaky integrate-and-fire neuron model can be expressed using Eq. (18) [61]. If the voltage V reaches a certain threshold V_th, a spike is fired and V is reset to c.
In Eq. (18), a, b, c and V_th are the parameters.
FitzHugh-Nagumo: The FitzHugh-Nagumo neuron model [61] is more complicated than the leaky integrate-and-fire model and needs slightly more floating-point operations. The model has multiple variables and thus exhibits phasic spiking. Meanwhile, its spikes have spiking latencies because the threshold is not fixed. The behaviour of the FitzHugh-Nagumo neuron model can be expressed using Eq. (19) [61].
Hodgkin-Huxley: The Hodgkin-Huxley model [60] is much more complicated than the leaky integrate-and-fire and FitzHugh-Nagumo neuron models. It is described by multiple equations and many parameters. In [61], the authors state that the parameters of the Hodgkin-Huxley model are biophysically meaningful. More importantly, the model is very helpful for researchers investigating single-cell dynamics. However, it is hard to implement since it requires over 100 floating-point operations. More details about this model can be found in [60].

Leaky integrate-and-fire spiking neuron model
In this section, we present more details about the leaky integrate-and-fire spiking neuron model. Its behaviour can be described using Eq. (18). If the voltage V rises above a certain threshold V_th, the neuron fires a spike and V is reset. The behaviour of the leaky integrate-and-fire model can also be described by the circuit shown in Figure 4.
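A discrete-time sketch of the leaky integrate-and-fire dynamics, assuming a simple form dV/dt = I + a − bV with firing threshold V_th and reset value c (the parameter values below are invented):

```python
def simulate_lif(I, steps=1000, dt=0.1, a=0.0, b=0.1, c=0.0, v_th=1.0):
    """Euler simulation of dV/dt = I + a - b*V; when V reaches v_th,
    record a spike and reset V to c."""
    V = c
    spike_times = []
    for t in range(steps):
        V += dt * (I + a - b * V)     # leaky integration of the input
        if V >= v_th:
            spike_times.append(t)
            V = c
    return spike_times

spikes = simulate_lif(I=0.5)          # constant input current -> regular spikes
```

With a constant input current, the neuron charges toward its steady state, crosses the threshold at regular intervals and fires a periodic spike train.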

Neuromorphic computing for embedded systems
As stated above, embedded systems have very limited computation resources and power constraints. Compared to conventional neural networks, spiking neural networks are more analogous to human brains and consume much less power. Because of these features, neuromorphic computing is suitable for embedded systems. Many researchers [62][63][64][65][66] have implemented neuromorphic computing on embedded platforms such as FPGAs. In [63], the authors implement a liquid state machine on an FPGA for speech recognition. The overall architecture achieves an 88× speed-up compared to a CPU implementation, while reducing power consumption by 30%.

Hardware implementation of spiking neural networks
Many researchers have been working on hardware implementations of spiking neural networks and many neuromorphic chips have been developed. For example, Neurogrid [67] was developed at Stanford University and TrueNorth [68] was developed by IBM.
Neurogrid: Neurogrid [67] is a mixed-signal hardware system for simulating biological brains. To improve energy efficiency, it implements all circuits with analog circuits except the axonal arbors, which are implemented with digital circuits. The whole system consists of 16 Neurocores, each with a 256 × 256 silicon-neuron array, a receiver, a transmitter and two RAMs. Neurogrid is able to simulate a million neurons while consuming only a few watts.
TrueNorth: TrueNorth [68] is a brain-inspired neurosynaptic processor with a non-von Neumann architecture. The whole system has 4096 cores, 1 million digital neurons and 256 million synapses. TrueNorth achieves 58 giga-synaptic operations per second (GSOPS) at 400 GSOPS per watt. More importantly, the authors have successfully implemented several applications, such as object recognition, on TrueNorth with much lower power consumption than conventional processors.

Recent progress in neuromorphic computing
Spiking neural networks use spikes to represent information, so effective and efficient approaches to representing information with spikes are very important. In spiking neural networks, the input is encoded into spikes and each spike is represented as a single binary bit. There are two types of encoding approaches [7,10,45,47,[69][70][71][72]. The first is rate encoding, in which the input is encoded as the rate of spikes over an encoding window [7,69]. The second is temporal encoding; inter-spike interval encoding is one temporal encoding method, in which the information is encoded by the time difference between two adjacent spikes [7,69,70].
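Rate encoding can be sketched as a Bernoulli spike train whose firing probability equals the encoded value (the window length and value below are invented):

```python
import numpy as np

def rate_encode(value, window=100, seed=0):
    """Encode a value in [0, 1] as a binary spike train: at each time step,
    fire with probability equal to the value (a Poisson-like rate code)."""
    rng = np.random.default_rng(seed)
    return (rng.random(window) < value).astype(int)

def rate_decode(spikes):
    """Recover the value as the spike rate over the encoding window."""
    return spikes.mean()

spikes = rate_encode(0.3, window=1000)
estimate = rate_decode(spikes)     # approaches 0.3 as the window grows
```

The trade-off is latency: a longer encoding window gives a more precise rate estimate but delays the result.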
Researchers have successfully implemented neural encoders in hardware [10,62,69]. In [62], the authors proposed a spike time-dependent encoder on an FPGA. In [69], the authors implemented an inter-spike interval-based encoder for neuromorphic processors using analog integrated circuits. The proposed analog implementation of the inter-spike interval encoder eliminates ADCs and op-amps and thus consumes less power.

Conclusion
In the era of mobile computing and the Internet of things, embedded systems are everywhere. They can be found in consumer electronics, automobiles, industry and many other applications. Without embedded systems, our daily life would become extremely inconvenient. Deep learning is a technology as important to our daily life as embedded systems. In recent years, deep learning has become a fundamental technology that impacts every aspect of our daily life. Therefore, deploying deep learning in embedded systems draws a lot of attention nowadays. Researchers have been working in many directions: for example, designing new layers and applying quantization techniques to reduce computation, and proposing new architectures such as neuromorphic computing. Through these techniques, many deep learning models have been implemented successfully in embedded systems.