Deep learning techniques have achieved great success in areas such as computer vision, speech recognition and natural language processing, and these breakthroughs are changing many aspects of our lives. However, deep learning has not yet realized its full potential in embedded systems such as mobile devices and vehicles, because its high performance comes at the cost of heavy computation and high energy consumption. Deploying deep learning models in embedded systems is therefore very challenging, since such systems have very limited computation resources and tight power constraints. Extensive research on deploying deep learning techniques in embedded systems has been conducted and considerable progress has been made. In this book chapter, we introduce two approaches. The first is model compression, one of the most popular approaches proposed in recent years. The second is neuromorphic computing, a novel computing system that mimics the human brain.
- machine learning
- deep learning
- model compression
- pattern recognition
- neuromorphic computing
Deep learning is a branch of machine learning inspired by the biological processes of the human brain, and it is not a new concept. It was not popular earlier because neither sufficient computational power nor sufficient data was available. With the development of the semiconductor industry and the Internet, stronger computational power and the tremendous amount of data generated on the Internet have made the use of deep learning techniques practical [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].
Even though deep learning techniques have achieved great success in many fields, we still have not realized their full potential, especially in embedded systems, because such systems do not have enough computation power. In the era of mobile computing, enabling deep learning techniques to run on mobile devices is very important, and many researchers have been working in this area [4, 6]. Research has been conducted in two directions. The first direction aims to reduce the model size and computation of deep learning models. The second direction is to design new hardware architectures that have much stronger computation power. In this chapter, we introduce one technique from each direction. The first technique is quantization, which is used to reduce the computation and model size of deep learning models. The second technique is neuromorphic computing, which is a new hardware architecture that enhances computation power.
2. Neural network
An artificial neural network is a computing system that mimics the human brain. The purpose of an artificial neural network is to identify patterns in input data and learn an approximate function that maps inputs to outputs. The most basic building units of a neural network are neurons, which have inputs, outputs and a processing unit. To better learn complicated patterns in input data, a neural network consists of a huge number of neurons, which are organized into layers [11, 12]. These layers are stacked on each other so that the output of one layer is the input of the following layer. A neuron in a layer is connected to multiple neurons in the previous layer in order to receive data from those neurons. Data received from neurons in the previous layer are multiplied by the corresponding weights, and the products are accumulated to generate an output.
2.1 Single neuron
The most basic building unit of a neural network is the neuron. A neuron receives data from multiple neurons in the previous layer, and each of these data is multiplied by a weight. Then, these weighted data are accumulated to generate an output. A non-linear function is applied to the output before it is sent to other neurons. More details about why we need a non-linear function are presented in Section 2.3. The working mechanism of a single neuron can be expressed as,
$$y = f\left(\sum_{i} w_i x_i + b\right) \tag{1}$$
where $x_i$ is the input, $w_i$ is the weight corresponding to input $x_i$, $b$ is the bias value and $f$ is the non-linear function applied before the output is sent to other neurons.
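The computation of a single neuron described above can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's implementation: the function names, the choice of the logistic sigmoid as the non-linear function, and the input values are all assumptions made for the example.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, used here as one example of a non-linear function."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Accumulate the weighted inputs, add the bias, then apply the activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Illustrative values only (not taken from the chapter).
y = neuron_output(inputs=[1.0, 2.0, 3.0], weights=[0.5, -0.25, 0.0], bias=0.0)
```

With these values the weighted sum is 0, so the sigmoid returns 0.5.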
2.2 Multilayer perceptron
A multilayer perceptron (MLP) is a kind of neural network that has at least one layer of neurons between the input and output layers. The layers between the input and output layers are called hidden layers. Hidden layers are introduced because they make a neural network much more powerful and able to learn very complicated patterns in input data. If there is no layer between the input and output layers, the input is transformed to the output by a linear transformation, and the neural network can only work on linearly separable data. To enable a neural network to handle data that are not linearly separable, we need at least one layer between the input and output layers, and non-linear functions must be applied to the outputs of the hidden layers. Let us take the MLP neural network in Figure 2 as an example. This neural network has three inputs, two outputs and one hidden layer with four neurons. Lines represent connections between neurons. Each connection has an associated weight, and we use weight matrices to represent the connections between the input, hidden and output layers.
$$h = \sigma(W_1 \cdot x), \qquad y = W_2 \cdot h \tag{2}$$
where $\sigma$ is the non-linear activation function; $\cdot$ is the dot product between two matrices; $W_1$ is the weight matrix between the input and hidden layers; $W_2$ is the weight matrix between the hidden and output layers; $h$ is the output of the hidden layer; $x$ is the neural network input and $y$ is the neural network output.
Input $x$ is first mapped to four neurons by the weight matrix $W_1$, and a non-linear activation function is then applied to each neuron. $h$ is the output of the hidden layer, and this output is multiplied by another weight matrix $W_2$ to obtain the final result $y$.
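As a minimal sketch of the forward pass described above, assuming a network with three inputs, four hidden neurons and two outputs and no bias terms, the computation can be written in plain Python. The weight values and the choice of ReLU as the activation function are illustrative assumptions, not values from the chapter.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector (list)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def relu(values):
    """Element-wise ReLU, used here as an example non-linear activation."""
    return [max(0.0, v) for v in values]


def mlp_forward(x, w1, w2, activation=relu):
    """Hidden layer: activation(W1 x). Output layer: W2 h."""
    h = activation(matvec(w1, x))
    y = matvec(w2, h)
    return h, y


# Hypothetical weights: W1 is 4 x 3 (four hidden neurons, three inputs),
# W2 is 2 x 4 (two outputs, four hidden neurons).
w1 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0],
      [0.0, 0.0, 1.0],
      [1.0, 1.0, 1.0]]
w2 = [[1.0, 1.0, 1.0, 1.0],
      [0.0, 0.0, 0.0, -1.0]]
x = [1.0, -2.0, 3.0]

h, y = mlp_forward(x, w1, w2)
```

Here the second hidden neuron receives a negative weighted sum and is set to 0 by the ReLU, which is exactly the kind of non-linear behaviour a purely linear network cannot express.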
2.3 Non-linear activation function
Non-linear activation functions [13, 14] are very important for MLPs. Without a non-linear activation function, a neural network performs a linear transformation from input to output no matter how many hidden layers exist between the input and output layers. This is because a linear transformation of a linear transformation is still a linear transformation, and thus any number of hidden layers can be reduced to a single linear transformation. Let us take the MLP neural network in Figure 2 as an example. Without non-linear activation, the neural network can be expressed mathematically as,
$$h = W_1 \cdot x, \qquad y = W_2 \cdot h \tag{3}$$
Substituting $h$ with $W_1 \cdot x$, we have,
$$y = W_2 \cdot (W_1 \cdot x) = (W_2 \cdot W_1) \cdot x \tag{4}$$
Assume $W = W_2 \cdot W_1$,
$$y = W \cdot x \tag{5}$$
Therefore, the MLP neural network in Figure 2 can be expressed mathematically as a single linear transformation from input to output.
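The argument above can be checked numerically. The following sketch, with hypothetical integer weights chosen so that the arithmetic is exact, applies two weight matrices one after the other and compares the result with a single multiplication by their product.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector (list)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n)."""
    columns = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in columns] for row in a]


# Hypothetical weights for a 3-input, 4-hidden, 2-output network without activation.
w1 = [[1, 0, 0],
      [0, 1, 0],
      [0, 0, 1],
      [1, 1, 1]]
w2 = [[1, 1, 1, 1],
      [0, 0, 0, -1]]
x = [1, -2, 3]

two_step = matvec(w2, matvec(w1, x))   # hidden layer followed by output layer
w = matmul(w2, w1)                     # single combined 2 x 3 weight matrix
one_step = matvec(w, x)                # one linear transformation
```

Both paths give the same output, so the hidden layer adds no expressive power unless a non-linear function is applied to it.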
2.4 Types of hidden layers
In MLP neural networks, there are hidden layers between the input and output layers, and these hidden layers play a very important role in the performance of MLP neural networks. There are many different types of hidden layers, such as convolutional layers, fully-connected layers and pooling layers. In this section, we present more details about fully-connected layers and convolutional layers.
2.4.1 Fully-connected layers
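In a fully-connected layer, each neuron is connected to all neurons in the previous layer, and each connection has an associated weight. Each output of a neuron in the previous layer is multiplied by the weight associated with the connection. Then, the products are accumulated together.
Let us take the hidden layer of the MLP neural network in Figure 2 as an example. The hidden layer in Figure 2 is a fully-connected layer. Each neuron in the hidden layer connects to all three inputs in the input layer and generates one output. The weight matrix $W_1$ is represented in Eq. (6). In the weight matrix, each row represents the weights of a neuron, and thus the matrix size is $4 \times 3$ since there are four outputs and three inputs.
$$W_1 = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \\ w_{41} & w_{42} & w_{43} \end{bmatrix} \tag{6}$$
The input can be represented as a matrix $x$ that has the size of 3 × 1.
$$x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \tag{7}$$
Mathematically, a fully-connected layer is computed as a matrix multiplication,
$$W_1 \cdot x = \begin{bmatrix} w_{11}x_1 + w_{12}x_2 + w_{13}x_3 \\ w_{21}x_1 + w_{22}x_2 + w_{23}x_3 \\ w_{31}x_1 + w_{32}x_2 + w_{33}x_3 \\ w_{41}x_1 + w_{42}x_2 + w_{43}x_3 \end{bmatrix} \tag{8}$$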
2.4.2 Convolutional layer
The convolutional layer is used in many deep learning applications, especially in computer vision [1, 2, 16, 17, 18]. In computer vision, processing and understanding an image is a major task. An image has three dimensions: width, height and channel. Meanwhile, an image is highly structured and has strong spatial dependency.
A convolutional layer has a group of kernels, and each of these kernels has three dimensions: width, height and channel. The width and height of a kernel are hyper-parameters defined by the designer. The number of channels of a kernel is equal to the number of channels of the previous layer. Unlike in a fully-connected layer, each neuron in a convolutional layer is connected only to a small spatial region of neurons in the previous layer, but to all channels of that region. The size of this spatial region depends on the width and height of each kernel. Each kernel slides over the whole image with a specific stride to extract a feature, such as edges, from the image. Therefore, each kernel extracts one specific feature from each local region.
Let us use Figure 3 as an example to demonstrate how a convolution layer works. In Figure 3, the image has only one channel and there is one kernel. Let $K$ denote the weight matrix of this kernel.
The weight matrix is multiplied element-wise by the pixels of a small region of the image, and the products are then accumulated. Assume we are applying our kernel to the yellow region of the image in Figure 3, and let $X$ denote the pixels of this region. Then, we can get the output Out using Eq. (10),
$$\mathrm{Out} = \sum_{i}\sum_{j} (K \odot X)_{ij} \tag{10}$$
where ⊙ represents the element-wise product between two matrices.
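The sliding-window computation described above can be sketched as follows for a single-channel image and a single kernel. As is usual in deep learning, the kernel is applied without flipping and, in this sketch, without padding. The image and kernel values are hypothetical and are not the ones shown in Figure 3.

```python
def conv2d_single_channel(image, kernel, stride=1):
    """Slide the kernel over the image and accumulate element-wise products."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    image_h, image_w = len(image), len(image[0])
    output = []
    for top in range(0, image_h - kernel_h + 1, stride):
        row = []
        for left in range(0, image_w - kernel_w + 1, stride):
            acc = 0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    acc += kernel[i][j] * image[top + i][left + j]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical 4 x 4 image and 2 x 2 kernel.
image = [[1, 2, 3, 0],
         [4, 5, 6, 0],
         [7, 8, 9, 0],
         [0, 0, 0, 0]]
kernel = [[1, 0],
          [0, -1]]

feature_map = conv2d_single_channel(image, kernel)            # 3 x 3 output
strided_map = conv2d_single_channel(image, kernel, stride=2)  # 2 x 2 output
```

Each entry of the output is one application of the element-wise product and accumulation to one local region, and a larger stride produces a smaller output.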
The convolution layer has several advantages over other layers when dealing with images, which make it very popular in computer vision. First, convolutional layers need far fewer weights than fully-connected layers. In a fully-connected layer, a neuron is connected to all neurons in the previous layer. If the dimension of the previous layer is very large, the number of weights required by the fully-connected layer is also very large, since the total number of weights is equal to the number of neurons in the previous layer times the number of neurons in the fully-connected layer. In a convolutional layer, by contrast, the number of weights depends only on the kernel size, the number of channels and the number of kernels, not on the size of the image. Secondly, the convolution layer focuses on local spatial regions instead of the whole image, and many applications benefit from this characteristic. For example, when detecting an object in an image, we only need to focus on the regions where the object appears; other regions, such as the background, are not needed. Thirdly, the convolution layer is translation invariant: the response of a kernel to an object is the same regardless of the location of the object in the image.
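To make the first advantage concrete, the following sketch counts the weights of the two layer types using the rule stated above for fully-connected layers and the kernel dimensions described earlier for convolutional layers. The layer sizes (a 32 × 32 image with 3 channels, 64 output neurons, 64 kernels of size 3 × 3) are hypothetical, and bias terms are ignored.

```python
def fully_connected_weights(num_inputs, num_outputs):
    """Every output neuron has one weight per neuron in the previous layer."""
    return num_inputs * num_outputs


def convolution_weights(kernel_h, kernel_w, in_channels, num_kernels):
    """Each kernel has width x height x channels weights, independent of image size."""
    return kernel_h * kernel_w * in_channels * num_kernels


# Hypothetical sizes: a 32 x 32 image with 3 channels feeding 64 neurons or 64 kernels.
fc_count = fully_connected_weights(32 * 32 * 3, 64)
conv_count = convolution_weights(3, 3, 3, 64)
```

Under these assumptions the fully-connected layer needs 196,608 weights while the convolutional layer needs 1,728.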
3. Model compression
In the era of mobile computing, enabling deep learning techniques to run on mobile devices is very important. In general, large and complicated models tend to have high performance, but they also increase the computation requirement dramatically. Embedded systems do not have sufficient computation and memory resources to support the complexity of a high-performance deep learning model. Therefore, deploying deep learning models in embedded systems without sacrificing much performance has been a hot research topic [6, 19, 20, 21, 22].
3.1 Model quantization
Deep learning models typically use floating-point arithmetic in both the training and inference phases. Floating-point arithmetic requires substantial computation resources, which is one of the reasons why deploying deep learning models in embedded systems is difficult. To address this issue, researchers have proposed many approaches [4, 5, 6, 23, 24, 25, 26, 27, 28, 29] to replace floating-point arithmetic during the inference phase. Low bit-precision arithmetic [30, 31, 32] is one of those approaches.
In the low-bit arithmetic approach, floating-point numbers and floating-point arithmetic are still used in the training phase. After training is done, model weights and activation layers are quantized using low-bit integer numbers such as 8-bit or 16-bit integers. During inference, integer arithmetic is used instead of floating-point arithmetic and thus the computation resource requirement is reduced dramatically.
3.1.1 Quantization scheme
In Ref. , the authors proposed a quantization scheme that has been successfully adopted in TensorFlow . During the inference phase, the proposed quantization scheme uses integer-only arithmetic, while floating-point arithmetic is still used in the training phase. Since this approach uses different data types in the training and inference phases, a one-to-one mapping between floating-point numbers and integer numbers is needed. The authors use Eq. (11) to describe the mapping between a floating-point number $r$ and an integer number $q$.
$$r = S(q - Z) \tag{11}$$
In this equation, $S$ and $Z$ are quantization parameters, which are constant for each layer. There is only one set of quantization parameters for each activation layer and each weight layer.
The constant $S$ is a floating-point number: it is a scale constant that represents the size of each quantization level. For example, assume we are going to quantize the floating-point numbers in a layer to 8-bit integers. To calculate the scale constant $S$, we first obtain the maximum and minimum floating-point numbers of the layer, which we denote by $r_{max}$ and $r_{min}$ respectively. Since we are using an 8-bit integer, there are $2^8$ quantization levels, which divide the range into $2^8 - 1$ steps. Then, the scale constant can be computed as,
$$S = \frac{r_{max} - r_{min}}{2^8 - 1} \tag{12}$$
The zero-point $Z$ is the quantized integer that represents the real number 0. We need to represent the number 0 exactly because it is widely used in deep learning, for example in the zero-padding of convolutional neural networks. Representing the number 0 exactly improves the performance of deep learning models. The zero-point can be calculated using Eq. (13).
$$Z = \mathrm{round}\left(q_{min} - \frac{r_{min}}{S}\right) \tag{13}$$
In Eq. (13), $q_{min}$ is the minimum quantization level of our quantized integer. For example, if we use an unsigned 8-bit integer, $q_{min}$ is equal to 0. $r_{min}$ and $S$ are two floating-point numbers representing the minimum value of a layer and the scale constant of this layer, respectively. Because $Z$ is an integer, we need to round $q_{min} - r_{min}/S$ to the nearest integer. However, Eq. (13) only works for the case where $r_{min}$ is smaller than 0 and $r_{max}$ is larger than 0. To make it work for all cases, we handle the remaining cases as follows. If $r_{min}$ is larger than 0, we set $r_{min}$ to 0 and calculate the scale constant $S$ using the new $r_{min}$. If $r_{max}$ is smaller than 0, we set $r_{max}$ to 0 and calculate the scale constant $S$ using the new $r_{max}$. After we obtain the scale constant $S$, we can calculate the zero-point.
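The procedure above can be sketched in Python as follows. The sketch assumes the affine mapping r = S(q − Z) for Eq. (11), a scale equal to the range divided by the number of steps between levels, and an unsigned integer type. The function names and the handling of an all-zero layer are assumptions made for the example, not part of the scheme as published.

```python
def choose_quantization_params(r_min, r_max, num_bits=8, q_min=0):
    """Return (scale, zero_point) for a layer whose values lie in [r_min, r_max]."""
    # Extend the range so that it always contains 0, as described in the text.
    r_min = min(r_min, 0.0)
    r_max = max(r_max, 0.0)
    levels = 2 ** num_bits
    q_max = q_min + levels - 1
    if r_max == r_min:
        # Assumption: an all-zero layer gets an arbitrary scale of 1.
        return 1.0, q_min
    scale = (r_max - r_min) / (levels - 1)
    zero_point = round(q_min - r_min / scale)
    zero_point = max(q_min, min(q_max, zero_point))
    return scale, zero_point


def quantize(r, scale, zero_point, q_min=0, q_max=255):
    """Map a floating-point number to the nearest quantized integer."""
    q = round(r / scale) + zero_point
    return max(q_min, min(q_max, q))


def dequantize(q, scale, zero_point):
    """Map a quantized integer back to a floating-point number."""
    return scale * (q - zero_point)


# Hypothetical layer range [-1, 3] quantized to unsigned 8-bit integers.
scale, zero_point = choose_quantization_params(-1.0, 3.0)
q_zero = quantize(0.0, scale, zero_point)
```

Because the zero-point is itself an integer level, the real number 0 maps to it and back without any error, while other values are recovered to within half a quantization step.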
3.1.2 Integer-only multiplication
After floating-point numbers are quantized into integers using the quantization scheme described in Eq. (11), the authors in  describe how to compute the multiplication between two floating-point numbers using quantized integers.
Assume we are going to compute the multiplication between two floating-point numbers $r_1$ and $r_2$. The multiplication result is stored in the floating-point number $r_3$. We first quantize the floating-point numbers $r_1$, $r_2$ and $r_3$ into quantized integers $q_1$, $q_2$ and $q_3$ respectively using Eq. (11). We have scale constants $S_1$, $S_2$ and $S_3$ corresponding to $r_1$, $r_2$ and $r_3$. Meanwhile, we have quantized zero-points $Z_1$, $Z_2$ and $Z_3$ corresponding to $r_1$, $r_2$ and $r_3$.
Then, we want to compute the product between $r_1$ and $r_2$. We have,
$$S_3(q_3 - Z_3) = S_1(q_1 - Z_1)\,S_2(q_2 - Z_2) \tag{14}$$
$$q_3 = Z_3 + \frac{S_1 S_2}{S_3}(q_1 - Z_1)(q_2 - Z_2) \tag{15}$$
In Eq. (15), every operation is between two integers except for the factor $S_1 S_2 / S_3$, which is a floating-point number. To make the whole computation integer-only, the authors in  proposed an approach to quantize the floating-point number $M = S_1 S_2 / S_3$.
$$M = 2^{-n} M_0 \tag{16}$$
In Eq. (16), the authors set $M_0$ to a number between 0.5 and 1, and $n$ is a positive integer. Using Eq. (16), the authors make $M_0$ a fixed-point multiplier. If a 16-bit integer is used in the multiplication, $M_0$ can be represented as a 16-bit integer, which is the integer nearest to $2^{15} M_0$, and a bit-shift operation is used to compute the multiplication by $2^{-n}$. Then, the whole expression can be computed using integer-only arithmetic.
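The fixed-point trick described above can be sketched as follows. The sketch assumes that the multiplier M = S1·S2/S3 lies strictly between 0 and 1, that M0 is stored as a 16-bit fixed-point integer, and that rounding is done by adding half before the right shift. The function names and the example scales and zero-points are hypothetical.

```python
def quantize_multiplier(m, bits=16):
    """Split 0 < m < 1 into (m0_int, n) with m ~= 2**-n * m0_int / 2**(bits - 1)."""
    if not 0.0 < m < 1.0:
        raise ValueError("this sketch assumes a multiplier in (0, 1)")
    n = 0
    m0 = m
    while m0 < 0.5:          # normalise M0 into [0.5, 1)
        m0 *= 2.0
        n += 1
    m0_int = round(m0 * (1 << (bits - 1)))
    m0_int = min(m0_int, (1 << (bits - 1)) - 1)   # keep M0 strictly below 1
    return m0_int, n


def integer_multiply(q1, z1, q2, z2, z3, m0_int, n, bits=16):
    """Compute q3 = Z3 + M (q1 - Z1)(q2 - Z2) using integer operations only."""
    acc = (q1 - z1) * (q2 - z2)              # integer product
    shift = n + bits - 1                     # undoes 2**-n and the scaling of M0
    rounding = 1 << (shift - 1)              # add half so the shift rounds to nearest
    return z3 + ((acc * m0_int + rounding) >> shift)


# Hypothetical scales S1 = 0.02, S2 = 0.05, S3 = 0.1, so M = 0.01.
m0_int, n = quantize_multiplier(0.01)
q3_negative = integer_multiply(150, 128, 100, 120, 10, m0_int, n)
q3_positive = integer_multiply(200, 128, 130, 120, 10, m0_int, n)
```

In the first call the real-valued result would be 10 + 0.01 × (22 × −20) = 5.6, and in the second 10 + 0.01 × (72 × 10) = 17.2; the integer-only computation returns the nearest integers.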
3.2 Quantization-aware training
There are two common approaches to training quantized neural networks. The first approach is to train neural networks using floating-point numbers and then quantize weights and activation layers after training. However, this approach might not work for some models. In , the authors found that this approach does not work well for small models, because small models tend to have significant accuracy drops. The authors listed two reasons for the accuracy drops. The first is that the weight ranges differ greatly across output channels, so channels with small weight ranges suffer large quantization errors. The second is that outlier weight values make the quantization of the remaining weights much less accurate.
For these reasons, the authors in  proposed a training approach that includes the quantization effects in the forward pass of training. The backward pass of training works as in the traditional training method, and floating-point numbers are still used for weights and activation layers. During the forward pass of training, the authors use Eq. (17) to quantize each layer, and these equations are applied to each floating-point number element-wise.
$$\mathrm{clamp}(r; a, b) = \min(\max(r, a), b)$$
$$s(a, b, n) = \frac{b - a}{n - 1}$$
$$q(r; a, b, n) = \mathrm{round}\left(\frac{\mathrm{clamp}(r; a, b) - a}{s(a, b, n)}\right) s(a, b, n) + a \tag{17}$$
where $r$ is a floating-point number; $a$ and $b$ are the minimum and maximum values of a layer; and $n$ is the number of quantization levels. For example, $n = 2^8$ if an 8-bit integer is used.
For weights, the authors proposed to set $a$ and $b$ to the minimum and maximum floating-point numbers of a weight layer, respectively. For an activation layer, the authors used exponential moving averages to track the minimum and maximum floating-point numbers of the layer. After we have the range parameters $a$ and $b$, we can compute the other parameters easily. This approach has been implemented in TensorFlow [33, 34].
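The simulated quantization used in the forward pass can be sketched as follows, assuming that Eq. (17) clamps a value to the range [a, b] and snaps it to the nearest of n evenly spaced levels. The class that tracks the activation range with exponential moving averages is a simplified illustration; its name and the decay value are hypothetical hyper-parameters.

```python
def fake_quantize(r, a, b, n=256):
    """Clamp r to [a, b] and snap it to the nearest of n evenly spaced levels."""
    if b <= a:
        raise ValueError("the range must satisfy a < b")
    step = (b - a) / (n - 1)
    clamped = min(max(r, a), b)
    return round((clamped - a) / step) * step + a


class EmaRange:
    """Track the minimum and maximum of activations with exponential moving averages."""

    def __init__(self, decay=0.99):
        self.decay = decay
        self.low = None
        self.high = None

    def update(self, values):
        batch_low, batch_high = min(values), max(values)
        if self.low is None:
            self.low, self.high = batch_low, batch_high
        else:
            self.low = self.decay * self.low + (1.0 - self.decay) * batch_low
            self.high = self.decay * self.high + (1.0 - self.decay) * batch_high
        return self.low, self.high


# Three quantization levels on [0, 1] are 0.0, 0.5 and 1.0.
snapped = fake_quantize(0.3, 0.0, 1.0, n=3)

tracker = EmaRange(decay=0.5)
tracker.update([0.0, 2.0])
activation_range = tracker.update([-2.0, 4.0])
```

The output stays a floating-point number, so the rest of training is unchanged, but it can only take values that the quantized model will be able to represent at inference time.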
3.3 Comparison between different quantization approaches
3.4 Progress in model compression
Besides quantization, many other model compression approaches have been proposed. Pruning is one of the most popular approaches for model compression. Pruning reduces the number of weights by removing those weights that meet certain criteria. Besides weights, pruning can also be applied to activations and biases.
Knowledge distillation is another very popular approach for model compression. There are two models in knowledge distillation: a teacher model and a student model. The teacher model is a trained model and is much larger than the student model. The main idea of knowledge distillation is to transfer the knowledge of the teacher model to the student model so that the student model achieves performance comparable to that of the teacher model.
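As a minimal sketch of pruning, the following example removes weights by setting them to zero when their absolute value falls below a threshold. Magnitude is only one possible criterion and is an assumption made here; the text above does not prescribe a specific one. The weight values and the threshold are hypothetical.

```python
def prune_by_magnitude(weights, threshold):
    """Set every weight whose absolute value is below the threshold to zero."""
    return [[0.0 if abs(w) < threshold else w for w in row] for row in weights]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total


# Hypothetical 2 x 3 weight matrix.
weights = [[0.8, -0.05, 0.3],
           [0.01, -0.6, 0.09]]

pruned = prune_by_magnitude(weights, threshold=0.1)
pruned_fraction = sparsity(pruned)
```

Zeroed weights can then be stored in a sparse format or skipped during computation, which is where the reduction in model size and computation comes from.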
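One common way to transfer knowledge from the teacher to the student, assumed here purely for illustration, is to train the student to match the teacher's softened output distribution. The sketch below computes the cross-entropy between the two distributions at a given temperature; the function names, the temperature value and the logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]   # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


# Hypothetical logits for a three-class problem.
teacher_logits = [2.0, 1.0, 0.1]
matched = distillation_loss([2.0, 1.0, 0.1], teacher_logits)
mismatched = distillation_loss([0.1, 1.0, 2.0], teacher_logits)
```

The loss is smallest when the student reproduces the teacher's distribution, so minimising it pushes the student towards the teacher's behaviour. In practice such a term is usually combined with the ordinary training loss on the true labels.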
4. Neuromorphic computing
Neuromorphic computing [10, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50] is an emerging computing paradigm that mimics the architecture of the human brain. Carver Mead proposed the concept of neuromorphic computing in the late 1980s [43, 51, 52, 53]. Neuromorphic computing systems use spiking neural networks to process information. Compared to conventional neural networks, spiking neural networks are more analogous to the human brain and consume much less power. Recently, neuromorphic computing has been successfully applied to many applications [54, 55, 56, 57, 58, 59].
4.1 Spiking neurons
The basic building blocks of spiking neural networks are spiking neurons. The working mechanism of a spiking neuron is different from that of the neurons introduced in Section 2.1. Spiking neurons exchange information through electrical pulses, which are also called spikes. Spikes are discrete, time-dependent data and are represented as binary signals. In , the authors introduced several properties of spiking neurons. First, spiking neurons receive information from many inputs and generate one output. Secondly, whether a spike is generated depends on the amount of excitatory and inhibitory input. Thirdly, the spikes that a spiking neuron receives from other spiking neurons are integrated over time, and the neuron fires a spike if the integrated result exceeds a certain threshold.
4.1.1 Neuron models
There are several commonly used spiking neuron models, such as the leaky integrate-and-fire, Hodgkin-Huxley and FitzHugh-Nagumo neuron models.
In Eq. (18), a, b, c and are the parameters.
4.1.2 Leaky integrate-and-fire spiking neuron model
In this section, we present more details about the leaky integrate-and-fire spiking neuron model. The behaviour of the integrate-and-fire spiking neuron model can be described using Eq. (18). If the voltage V rises above a certain threshold voltage $V_{th}$, the neuron fires a spike and V is reset to 0. The behaviour of the leaky integrate-and-fire model can be described by the circuit shown in Figure 4.
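The firing behaviour described above can be sketched with a simple simulation. Because the governing equation is not reproduced here, the sketch assumes a standard textbook form of the leaky integrate-and-fire model, τ dV/dt = −V + R·I, integrated with the forward Euler method. The parameter names and values are hypothetical and are not those of the circuit in Figure 4.

```python
def simulate_lif(input_current, dt=1.0, tau=10.0, resistance=1.0, v_threshold=1.0):
    """Forward-Euler simulation of a leaky integrate-and-fire neuron."""
    v = 0.0
    voltages, spikes = [], []
    for current in input_current:
        # Leak towards 0 while integrating the input current.
        v += dt / tau * (-v + resistance * current)
        if v > v_threshold:
            spikes.append(1)   # fire a spike ...
            v = 0.0            # ... and reset the voltage to 0
        else:
            spikes.append(0)
        voltages.append(v)
    return voltages, spikes


# A constant input strong enough to drive the neuron above threshold.
voltages, spikes = simulate_lif([2.0] * 50)

# A weaker input that leaks away before reaching the threshold.
_, quiet_spikes = simulate_lif([0.5] * 100)
```

With a strong constant input the neuron charges, fires, resets and repeats at a regular interval, whereas with a weak input the leak keeps the voltage below the threshold and no spike is produced.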
4.2 Neuromorphic computing for embedded systems
As stated above, embedded systems have very limited computation resources and tight power constraints. Compared to conventional neural networks, spiking neural networks are more analogous to the human brain and consume much less power. Because of these features, neuromorphic computing is suitable for embedded systems. Many researchers [62, 63, 64, 65, 66] have implemented neuromorphic computing on embedded platforms such as FPGAs. In , the authors implement a liquid state machine on an FPGA for speech recognition. The overall architecture achieves an 88× speed-up compared to a CPU implementation. Meanwhile, the proposed approach reduces power consumption by 30%.
4.3 Hardware implementation of spiking neural networks
Many researchers have been working on the hardware implementation of spiking neural networks, and many neuromorphic chips have been developed. For example, Neurogrid has been developed at Stanford University and TrueNorth has been developed by IBM.
4.4 Recent progress in neuromorphic computing
Spiking neural networks use spikes to represent information, and thus effective and efficient approaches to representing information with spikes are very important. In spiking neural networks, the input is encoded into spikes and each spike is represented as a single binary bit. There are two types of encoding approaches [7, 10, 45, 47, 69, 70, 71, 72]. The first type is rate encoding, in which the input is encoded as the rate of spikes over an encoding window [7, 69]. The second type is temporal encoding. Inter-spike interval encoding is one method of temporal encoding, in which the information is encoded by the time difference between two adjacent spikes [7, 69, 70].
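The two encoding approaches can be sketched as follows for an input value normalised to the range [0, 1]. The sketch is deterministic and simplified: the window length, the even spacing of spikes in rate encoding, and the convention that a larger value gives a shorter inter-spike interval are all assumptions made for the example.

```python
def rate_encode(value, window=10):
    """Encode a value in [0, 1] as the number of spikes within the window."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    count = round(value * window)
    spikes = [0] * window
    for k in range(count):
        spikes[(k * window) // count] = 1   # spread the spikes evenly
    return spikes


def isi_encode(value, min_gap=1, max_gap=10):
    """Encode a value in [0, 1] as the time gap between two adjacent spikes."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    gap = max_gap - round(value * (max_gap - min_gap))
    spikes = [0] * (max_gap + 1)
    spikes[0] = 1
    spikes[gap] = 1
    return spikes


def isi_decode(spikes, min_gap=1, max_gap=10):
    """Recover the value from the gap between the first two spikes."""
    first = spikes.index(1)
    second = spikes.index(1, first + 1)
    return (max_gap - (second - first)) / (max_gap - min_gap)


rate_train = rate_encode(0.3, window=10)
isi_train = isi_encode(1.0)
```

Rate encoding needs the whole window to convey one value, while inter-spike interval encoding conveys it with just two spikes, which illustrates why the choice of encoding affects both latency and the number of spikes that must be processed.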
Researchers have successfully implemented neural encoders in hardware [10, 62, 69]. In , the authors proposed a spike time-dependent encoder on FPGA. In , the authors implemented an inter-spike interval-based encoder for neuromorphic processors using analog integrated circuits. The proposed analog implementation of the inter-spike interval encoder eliminates ADCs and op-amps and thus consumes less power.
In recent years, an increasing number of researchers have started to implement neuromorphic computing using analog integrated circuits [46, 47, 49, 50, 71, 73, 74, 75, 76, 77, 78, 79]. Compared to digital implementations, analog implementations of neuromorphic computing are more energy efficient and occupy less chip area.
The three-dimensional integrated circuit (3D IC) technique [80, 81, 82] is an emerging technique for improving the performance of integrated circuits. Compared to conventional fabrication techniques, the 3D IC technique consumes less power and has a smaller footprint. Recently, the 3D IC technique has been applied to neuromorphic computing [79, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93]. Through the 3D IC technique, power consumption and chip area are reduced dramatically.
In the era of mobile computing and the Internet of Things, embedded systems are everywhere. They can be found in consumer electronics, automobiles, industrial equipment and many other applications. Without embedded systems, our daily life would become extremely inconvenient. Deep learning is a technology that is as important to our daily life as embedded systems, and in recent years it has become a fundamental technology that affects many aspects of daily life. Therefore, deploying deep learning in embedded systems draws a lot of attention nowadays. Researchers have been conducting research in many directions. For example, researchers are designing new layers and applying quantization techniques to reduce computation. Meanwhile, new architectures such as neuromorphic computing have been proposed. Through these techniques, many deep learning models have been implemented in embedded systems successfully.