Deep learning (DL) is a type of machine learning that mimics the thinking patterns of a human brain to learn the new abstract features automatically by deep and hierarchical layers. DL is implemented by deep neural network (DNN) which has multi-hidden layers. DNN is developed from traditional artificial neural network (ANN). However, in the training process of DL, it has certain inefficiency due to very long training time required. Meta-heuristic aims to find good or near-optimal solutions at a reasonable computational cost. In this article, meta-heuristic algorithms are reviewed, such as genetic algorithm (GA) and particle swarm optimization (PSO), for traditional neural network’s training and parameter optimization. Thereafter the possibilities of applying meta-heuristic algorithms on DL training and parameter optimization are discussed.
- deep learning
- meta-heuristic algorithm
- neural network training
- nature-inspired computing algorithms
- algorithm design
Deep learning (DL) is a branch of machine learning. Based on a set of algorithms, DL attempts to model high-level abstractions in data by using multiple processing layers with complex structures. Developed by Professor Hinton  in 2006, DL now is becoming the most prevalent research area in machine learning. DL is a collective concept of a series of algorithms and models including but not limited to convolutional neural networks (CNNs), deep belief networks (DBNs), restricted Boltzmann machines (RBMs), deep Boltzmann machines (DBM), recursive auto encoders and deep representation respectively, regarding the problem to be solved. The most famous models for machine learning are RBM and DBN and CNN for image classification.
DL is the so-called “Deep” towards those widely used “shallow learning” algorithms such as support vector machine (SVM), boosting and maximum entropy method. Those shallow learning extracts feature mostly by artificial sampling or empirical sampling, thus the model or network will learn a non layer structure feature. On the contrary, DL learns raw data layer by layer by transforming data from raw feature space to transformed feature space. Additionally, deep structure can learn and approach nonlinear function. All these advantages are beneficial to classification and feature visualization [2, 25].
To process big data and large scale dataset, DL has been widely used in many research areas and industrial areas including pattern recognition, image classification and natural language processing (NLP). For instance, in Image Net large-scale visual recognition challenge (ILSVRC) , DL and convolutional neural network are first implemented in image classification and made a great success. As shown in Figure 1, the error rate declines from 25 to 15% when DL and CNN are implemented. From then on, “DL + big data + GPU” is integrated into traditional image classification application, and companies such as Google and Baidu have successfully updated their image searching production by DL’s implementation .
Though DL outperforms those shallow learning methods and it has been widely adopted by the R&D industry, its application and implementation still have some shortcomings. A large deep neural network (DNN) may have millions of parameters and is mostly trained by contrastive divergence (CD) learning algorithm which is iterative and known to be time consuming . The most significant problem is that when facing very large scaled data the DNN will take several days or even months to learn even though the greedy search strategy is proposed. In this case, many companies and researchers are trying to improve hardware capabilities and applying high-processing facilities such as GPU and parallel computing. Others are focusing on using other training algorithms to substitute traditional CD to speed up the training process.
Meta-heuristic algorithms aim to find global or near-optimal solutions at a reasonable computational cost. In terms of meta-heuristic, the words of “meta” and “heuristic” are Greek, where “meta” is “higher level” or “beyond” and heuristics means “to find”, “to know”, “to guide an investigation” or “to discover”. Heuristics are methods to find good (near-) optimal solutions in a reasonable computational cost without guaranteeing feasibility or optimality. In other words, meta-heuristics are a set of intelligent strategies to enhance the efficiency of heuristic procedures .
Meta-heuristic algorithms usually rely on different agents such as chromosome (genetic algorithm [GA]), particles (particle swarm optimization [PSO]), and firefly (firefly algorithm), searching iteratively to find the global optimum or near-global optimum. Many strategies such as evolutionary strategy, social behavior and information transition are implemented to guarantee the whole population will move towards global optimum iteratively and protect from falling in local optimum.
Traditional artificial neural network (ANN) used to have multilayer feed-forward neural network structure and use back-propagation (BP) to modify the weight. The mainly used strategy is gradient descent (GD). Meta-heuristic has been successfully implemented on traditional neural network to speed up the training process by substituting the GD strategy with iterative evolutionary strategy or swarm intelligence strategy [5, 7–10].
Gudise and Venayagamoorthy  compared feed forward with PSO and feed forward with BP, and the results show that feed forward with PSO is better than BP in terms of nonlinear function.
Leung et al.  presents the tuning of the structure and parameters of a neural network using an improved GA. It will also be shown that the improved GA performs better than the standard GA based on some benchmark test functions.
Juang  proposed a new evolutionary learning algorithm based on a hybrid of GA and PSO called HGAPSO. In each epoch, the upper-half of the GA population are defined as elites and the rest will be discarded. Enhanced by PSO and GA, the elites will form the next generation of GA. The hybrid method outperforms PSO and GA in recurrent or fuzzy neural network.
Meissner et al.  used optimized particle swarm optimization (OPSO) to accelerate the training process of neural network. The main idea of OPSO is to optimize the free parameters of the PSO by having swarms within a swarm. Applying the OPSO method to neural network training with the aim to build a quantitative model, OPSO approach yields parameter combinations improving overall optimization performance.
Zhang et al.  proposed a hybrid algorithm of PSO plus BP for neural network training. By utilizing the advantage of PSO’s global searching ability as well as BP’s deep search ability, the hybrid algorithm has a very good performance in both convergent speed and convergent accuracy.
Structures of DL model are similar to the traditional ANN, some modifications are implemented for better learning ability. For instance, the CNN is a traditional ANN modified with pooling procession and the structure of RBM is an undirected graph or a bidirectional neural network. DL model shares similar model with neural network and may take different training algorithm instead of GD strategy.
In this survey, the relevant work of applying meta-heuristic algorithms for ANN’s training is reviewed. In addition, the structure and working process of RBM are analyzed and as the basic structure of DL model, RBM’s training process is introduced. We survey the possibility of implementing meta-heuristic algorithms on RBM’s parameter training process.
The rest of the sections is organized as follows: meta-heuristic algorithms including GA and PSO are introduced in Section 2. ANN is introduced in Section 3. The implementation of meta-heuristic on ANN is introduced in Section 4. DL and RBM are introduced in Section 5. Some conclusion and discussion are drawn in Section 6.
2. Meta-heuristic algorithm
Meta-heuristic is a collective concept of a series of algorithms including evolutionary algorithm, the most famous one is GA , naturally inspired algorithm, the most prevalent one is PSO , trajectory algorithm, such as Tabu search , and so on. The classification of meta-heuristic is shown in Figure 2. In this paper, I mainly focus on surveying GA and PSO because these two are the most prevalent as well as most widely implemented. GA and its working process are introduced in Section 2.1 and PSO is introduced in Section 2.2, respectively.
2.1. Genetic algorithm
Founded in 1975 by Professor Holland, GA sets up an evolution model that simulates Darwinian genetic selection and the natural elimination process . Chromosomes carry information and the operation of crossover and mutation on chromosome guarantees the whole algorithm could find the global optimum or near-global optimum iteratively. Crossover operation keeps the whole population moving towards the global optimum through giving better chromosome higher chance for propagation. Mutation operation keeps the whole population’s diversity and avoids population falling in local optimum easily. The main working process could be summarized as follows:
Step 1. Chromosomes which carry information are randomly generated.
Step 2. All chromosomes are evaluated with fitness functions to calculate fitness values of each as fitnessi.
Step 3. Based on Russian roulette strategy, chromosomes are randomly chosen as parents for next generation.
Step 4. Parents do crossover operations to get next generations.
Step 5. Do mutation operation on each chromosome.
Step 6. Back to Step 2 until meeting the stop criteria or exceeding pre-set iteration number.
2.1.1. Russian roulette strategy
Russian roulette strategy is designed for choosing the parents of the next generation, as shown in Figure 5. Each chromosome has a possibility pi of being chosen as parents.
pi is calculated as below for each chromosome i:
This strategy keeps the idea that the chromosome with higher fitness value will have higher possibility of being chosen thus make sure better chromosome’s information will not be missed or has few possibilities of missing in the next generation. In the meantime, it also allows those chromosomes without better fitness value to transmit their information to the next generation .
2.1.2. Crossover and mutation
Figure 6 shows us a typical crossover operation and mutation operation. The most common way of applying crossover operation is selecting a site, cutting up each chromosome and reconstructing both to get next generation. If their fitness value is better than their parents, submit as next generation otherwise keep their parents as next generation. Furthermore, regarding different real problem to be solved, different crossover operation such as multisite crossover would be used and these will be discussed later.
Figure 6 also shows us the typical mutation operation in GA. The most common way of applying mutation is to randomly select some site of chromosome and randomly change it with any possible value regardless of the fitness value will be better or not. Of course the mutation operation may make one chromosome “worse” while mutation operation can keep the population’s diversity and avoid falling in local optimum.
2.1.3. Fitness function and genetic coding strategy
GA is first used to optimize complex continuous functions which are not able to calculate its mathematical global optimal location. Thus the design of chromosome itself, fitness function and crossover operation are very easy. However, when trying implementing GA on real problems, many types of problems such as discrete problem, time series problem and multi objective problem are faced. In this way, how to design the chromosome itself to represent the real problem better and how to calculate its fitness value when given chromosome need reconsidering. To solve the above troubles, genetic coding strategy is put forward which encodes real problem to chromosome and calculates its fitness value. When whole optimizing is finished, the best chromosome will be decoded to real problem’s expression and then analyzed and discussed why it is global best.
2.2. Particle swarm optimization
PSO is developed by Professor James Kennedy and Russell Eberhart . The main difference of PSO is that it takes “social behaviors” and “memory” to guarantee whole population’s moving towards global optimum iteratively instead of evolutionary strategy.
PSO optimizes a problem by having a population of candidate solutions and moving these particles around in the search-space according to simple mathematical formula over the particle's position xi and velocity vi. Each particle has memory and its movement is influenced by its local best known position pbesti but is also guided toward the best known positions in the search-space, which are updated as better positions found by all particles called gbest.
The iterative formulas of PSO are
Velocity constraint ()
Search range constraint ()
w is called the inertia weight that controls the exploration and exploitation of the search space because it dynamically adjusts velocity and is usually set as 0.8. c1 and c2 are called learning factors and usually set as 2.
In the iterative formula of PSO, vi * w refers to the particle keeps its own search direction. c1 * rand * (pbesti − xi) refers to the particle’s movement is influenced by its own personal best record. c2 * rand * (gbest − xi) refers to the particle’s movement is influenced by whole population’s best record. To keep whole population will not premature or overfit, when implementing iteration process, a limitation of vmax and vmin is predefined to constraint particles’ movement. Additionally, the search space is also predefined thus for each particle’s location xi, there are upper bound xmax and lower bound xmin to make sure each particle will not be out of range. A schematic diagram is shown in Figure 7 and the working flow of PSO is shown in Figure 8.
PSO is prevalent because of its concise evolutionary strategy as well as its concise operations. Unlike GA’s crossover and mutation operation and complex genetic coding strategy, PSO only needs randomly generated location x and velocity v. However, naïve PSO is designed for math function optimization which is a continuous problem. For application, researchers still need to design own coding strategy to make sure PSO’s iterative formulas are meaningful, especially for discrete problems. One example is applying PSO on graph optimization to solve some NP-hard problem such as travelling salesman problem (TSP) problem . TSP asks the following question: Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city? It is an NP-hard problem in combinatorial optimization, important in operations research and theoretical computer science.
With the development of meta-heuristic, researchers are beginning to try implementing meta-heuristic on TSP in case to get a “good enough” result in a reasonable computation time. To solve and adapt TSP to PSO, researchers innovate a concept of “swap operator” and “swap sequence” which gives the “+” operator a new meaning [14, 15].
3. Neural network
In machine learning and cognitive science, ANNs are a family of models inspired by biological neural networks (the central nervous systems of animals, in particular the brain) and are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown. ANNs are generally presented as systems of interconnected "neurons" which exchange messages between each other. The connections have numeric weights that can be tuned based on experience, making neural nets adaptive to inputs and capable of learning . The ANN was developed in the 1950s, there are almost three generations of neural networks: perceptron , feed-forward BP neural network  and spiking neural network (SNN) .
Founded in the 1950s by Professor Rosenblatt, the basic perceptron is put forward as a probabilistic model to simulate human brain’s working process when receiving information. Figure 9 shows us a typical structure of perceptron that only has one layer, with multi-input neurons as (t1, t2, , tn) in which k is the total number of neurons. Their corresponding inputs are (x1, x2, , xn). Each neuron has its own weight (w1j, w2j, , wnj) in which j refers to the output layer. The net input is processed by the activation φ. If φ(netj) > threshold θj, then the output neuron is activated and output oj = 1, otherwise output neuron is not activated and output oj = 0. The training process of perceptron is that suppose the desired output is od and actual output is oa, then
By using these training strategies, when|(od − oa)| < eps, in which eps is a predefined accuracy, i.e., 10−5, the perceptron could be seen as mature. This training strategy is the ancestor of the widely used BP algorithm.
Geometrically, trained perceptron could be seen as a linear function. It can correctly classify the linear separable problem but not linear inseparable problem such as XOR problem (Figure 10). XOR problem contains two classes A and B while the diagonal belongs to the same class, i.e., data (0, 0) and data (1, 1) belong to class A while data (1, 0) and (0, 1) belong to class B. This will cause no lines or linear functions can separate these two categories correctly unless using curves or reflect the data to high-dimensional space .
3.2. Feed-forward neural network
Figure 11 shows us a typical multilayer unidirectional feed-forward neural network and Figure 12 shows us a typical recurrent neural network, Hopfield network. The most prevalent structure of three-layer neural network with input layer, hidden layer and output layer, every two neurons are connected with weight w.
In a typical feed-forward neural network, the information is transmitted forward from input layer to hidden layer and then from hidden layer to output layer. It works similar to perceptron above, for each neuron j in hidden layer, calculate
Then for each neuron h in output layer, calculate
The output layer could have one or more neurons to represent the final result, different output layer structure may lead to different training time and prediction accuracy.
BP algorithm is the most prevalent supervised learning algorithm for neural network. BP’s main idea is using GD algorithm to find the gradient-descent-most way to modify the weight in neural network. The modification starts from the output layer, then to hidden layer and input layer. Because the GD is based on the difference or error between the desired output and actual output, BP is also called error-BP method. Suppose desired output is od and actual output is oa, the error function is defined as:
Suppose the layers are input layer i, hidden layer j and output layer h, respectively, using partial derivative,
in which η is called learning rate, usually set as 0.01 or 0.1 according to the set of weight of the network. Weight w is usually set as a random number in [0, 1] in case the learning process is too long. Then, update the weight between each neurons,
One epoch of training process is finished. GD will modify weight iteratively using the formula above till the whole network is trained mature.
Unlike perceptron, multilayer network can represent complex nonlinear function with hidden layers, thus can process XOR problems well. To some extent, multilayer network can represent any type of nonlinear function by using hidden layers. However, when implementing BP, if the depth of network is over 5, the error transmitted back to input layer or first hidden layer will be decayed significantly to almost 0 thus making the modification useless, so in real application, the depth of network is smaller than 5 .
3.3. Spiking neural network
SNN is recognized as the third generation of neural network. Maass  proved the spiking neuron and SNN has a powerful learning ability and information processing ability than traditional neural network. Put forward by Gerstner  in 1997, the SNN gives us a concise biological neuron model processed by temporal coding called spiking neuron model (SRM).
Figure 13(A) and (B) shows us the basic structure of SNN, the network structure and information transition of neural network, SNN is similar to using feed-forward multilayer neural network. However, the transition process is a little different with feed-forward neural network. Between neurons i and j, there are multiple delays (d1, d2, , dk) and every link has its own weight corresponding to each delay. Compared to traditional NN, SNN has more powerful learning ability and information processing ability.
For each spiking neuron j, its internal state xj(t) is
Neuron i is its predecessor and weight and delay are the synaptic connection between neuron i and j. ti is the ith input from presynaptic neuron. If xj(t) ≥ v, v is pre-set threshold, the neuron is spiked and could transmit spike to other neurons. ε(t)is called spiking response function,
τ is the membrane potential decay time constant that determines the rise and decay time of the postsynaptic potential (PSP). Figure 14 shows us how one spiking neuron begins to receive input pulse step by step and become spiked. Each neuron could begin to transmit information to its posterity only when this neuron is spiked. SNN implements temporal coding strategy which means the whole network is only based on one variable t which is initially set as 0. Then t increment 0.01 epoch by epoch until the output layer is spiked. The final output is the output layer spiking time t.
Owing to the multiple delay connection, one SNN can achieve times of computation scale and ability than ANN in the same multilayer structure. With temporal coding, the whole network has only one variable t, thus the iterative process is simpler than ANN.
For supervised learning algorithm of SNN, Bohte et al.  first time give us a proved error-propagation supervised learning algorithm called spikeprop algorithm which is also based on GD. Adeli  revised the spikeprop algorithm and put forward another supervised learning algorithm quickprop.
3.4. Neural network encoding strategy
When dealing with different types of data for classification or prediction, encoding strategy similar to genetic coding needs being implemented to transform real problem to neural network format. In addition, because of the longtime of neural network training, some preprocessing method needs to be taken for extracting significant information or features thus accelerate neural network learning.
For instance, dealing with image classification, feature extraction method need being taken as preprocessing for neural network learning and the famous one is CNN. Furthermore, when using neural network for speech recognition, feature extraction is also a need for transforming continuous speech data into discrete neural network input.
4. Applying meta-heuristic algorithm on neural network training
The main idea of applying meta-heuristic on neural network training is using meta-heuristic algorithm instead of GD algorithm modifying weight training. That is from
By using meta-heuristic algorithm’s global optimum searching ability, researchers are aiming to train neural network faster than traditional GD algorithm. In the following part, Section 1 reviewed four researchers’ work on implementing meta-heuristic on neural network training including GA on NN, PSO on NN and hybrid GA and PSO on NN.
4.1. GA on neural network
Leung et al.  first tried implementing GA on neural network training. Their work was published in IEEE Transactions on Neural Network, 2003.
An improved GA is put forward by Leung. Crossover operations, mutation operations and fitness function of GA are all redefined by Professor Leung. Firstly, when chromosomes p1 and p2 do crossover operation, four possible offsprings will be generated and one with the biggest fitness value will be chosen as offspring. The four possible crossover offsprings os1 to os4 are generated as regulations listed below:
are calculated respectively. For instance, Max([1,−1,4], [−3,3,2]) = [1,3,4] and Min([1,−1,4],[−3,3,2]) = [−3,−1,2].
Secondly, mutation operations are redefined. The regulations are given below
os is the chromosome with biggest fitness value in all four possible offsprings. bi random equals to 0 or 1 and Δnosi is a random number making sure . os′ is the final generation after crossover operation and mutation operation.
Thirdly, the fitness value is defined. By adding parameters in the neural network mathematical expression, the actual output of GA-optimized neural network yk equals to:
in which denotes link from ith neuron in input layer to jth neuron in hidden layer, sjk denotes link from jth neuron in hidden layer to kth neuron in output layer, wjk denotes weight between each neuron, denote bias in input layer and hidden layer respectively, denote the number of neurons of input layer, hidden layer and output layer, respectively.
The error of the whole network is defined as mean of all chromosomes:
in which nd denotes the number of chromosomes used in the experiment, denotes the desired output of output neuron k. Given the error of the network, GA is implemented to optimize the network, thus minimizing the error. The fitness function is defined as
the smaller the error and the bigger the fitness value. GA is implemented to find the global optimum of the fitness function, thus the parameter combinations of weight w are the trained weight for the network.
4.2. PSO on neural network
Gudise and Venayagamoorthy  implemented PSO on neural network training in 2003.
The fitness value of each particle (member) of the swarm is the value of the error function evaluated at the current position of the particle and position vector of the article corresponds to the weight matrix of the network.
Zhang et al.  developed a hybrid algorithm of BP and PSO that could balance training speed and accuracy.
The PSO algorithm was showed to converge rapidly during the initial stages of a global search, but around global optimum, the search process will become very slow. On the contrary, the gradient descending method can achieve faster convergent speed around global optimum, and at the same time, the convergent accuracy can be higher.
When the iteration process is approaching end and current best solution is near-global optimum, if the change of the weight in PSO is big, the result will vibrate severely. Under this condition, Zhang supposed with the increase of iteration time, the weight in PSO should decline with the iteration time’s increasing to narrow the search range thus paying more attention to local search for global best. He suggests the weight decline linearly first, then decline nonlinearly as shown in Figure 15.
The concrete working process is summarized below:
For all particle pi, it has a global best location pglobalbest. If the pglobalbest keep unchanged for over 10 generations, that may infer the PSO pays too much time on global search thus BP is implemented for pglobalbest to deep search for a better solution.
Similar to GA’s implementation in neural network, the fitness function defined is also based on whole network’s error and to minimize the error as the optimization of PSO.
The learning rate η of neural network is also controlled in the algorithm, as
where η is the learning rate, k and η0 are constants, epoch is a variable that represents iterative times, through adjusting k and η0, the reducing speed of learning rate is controlled.
By implementing the strategy that BP focusing on deep searching and PSO focusing on global searching, the hybrid algorithm has a very good performance.
4.3. Hybrid GA and PSO on neural network
Juang  hybrids GA and PSO thus optimize recurrent network’s training. The work was published in IEEE Transactions on Systems, Man, and Cybernetics, 2004.
The hybrid algorithm called HGAPSO is put forward because the learning performance of GA may be unsatisfactory for complex problems. In addition, for the learning of recurrent network weights, many possible solutions exist. Two individuals with high fitness values are likely to have dissimilar set of weights, and the recombination may result in offspring with poor performance.
Juang put forward a conception of “elite” of the first half to enhance the next generation’s performance. In each generation, after the fitness values of all the individuals in the same population are calculated, the top-half best-performing ones are marked. These individuals are regarded as elites.
In every epoch, the worse half of the chromosome is discarded. The better half is chosen for reproduction through PSO’s enhancement. All elite chromosomes are regarded as particles in PSO. By performing PSO on the elites, we may avoid the premature convergence in elite GAs and increase the search ability. Half of the population in the next generation is occupied by the enhanced individuals, the others by crossover operation. The working flow of algorithm is shown in Figure 16.
The crossover operation of HGAPSO is similar to normal GA, randomly selecting site on chromosome and exchange the sited piece of chromosome to finish the crossover operation. The crossover schematic diagram is shown in Figure 17. In HGAPSO, uniform mutation is adopted, that is, the mutated gene is drawn randomly, uniformly from the corresponding search interval.
5. DL and RBM
When dealing with image classification or other problems, traditional method is using preprocessing transforming data as input values for neural network learning while using DL method for classification, raw data (pixel values) are used as input values. This will keep to the maximum extent protecting all information regardless of useful or not from being destroyed by extraction methods. The most advantage lies that all the extraction methods are based on expert knowledge and expert choice and thus are not extensible to other problems, while DL algorithm can overcome these limitations by using all data with its powerful processing ability. A CNN is shown in Figure 18.
The Boltzmann machine only has two layers (as Figure 19), the first layer is the input layer and the second layer is the output layer, although the structure is very simple and only contains two layers, its mathematical function in it is not simple. Herein we need to introduce the following probability equation to know the RBM.
From the equation we know there are three items in it, they are the energy of the visible layer neural, the energy of the hidden layer neural, and the energy is consisted of the two layers. From the energy function of the RBM, we can see there are three kinds of parameters lie in the RBM, different from the neural network, each neuron has a parameter too, but for the neural network only the connection of two layers has the parameters. Also, the neural network does not carry any energy function. They use the exponential function to express the potential function. There are also the probabilities existing in the RBM. They work exactly in the same way as the two kind of the probabilities such as p(v|h) and p(h|v). It has a similarity with probability graph model in details, examples are Bayesian network and Markov network. It looks like a Bayesian network because of the conditional probability. On the other hand it does not look like a Bayesian because of the two direction probabilities. These probabilities are lying on the two variables that have only one direction.
Compared with the Markov network, the RBM seems to be having a little bit relation with it, because the RBM has the energy function just as the Markov network. But it is not so much alike because the RBM’s variable has parameter, and the Markov network does not have any parameter. Furthermore, the Markov network does not have conditional probability because it has no direction but just the interaction. From the graph’s perspective, variables in the Markov network use cliques or clusters to represent the relations of close and communicated variables. It uses the production of the potentials of the clique to express the joint probability instead of conditional probability just like the RBM. Its input data are the kind of the Boolean data, within the range between 0 and 1.
The training way of the RBM is to maximize the probability of the visible layer p(v), and to generate the distribution of the input data. RBM is a kind of special Markov random function and a special kind of Boltzmann machine. Its graphical model is corresponding to the factor product analysis.
Different from the probability graphical model, the RBM’s joint distribution directly uses the energy function of both visible layer v and hidden layer h to define instead of potential of it given as
Later we will know that Z is the partition function defined as the sum of e−E(V,h) over all the possible configurations. In other words, it is just a constant normalizing the sum over all the possible hidden layer configurations.
The hidden unit activations are mutually independent given the activations. That is, for m visible and n hidden units, the conditional probability of a configuration of the visible unit v, given a configuration
Conversely, the conditional probability of h given v is . Our goal is to infer the weights that maximize the marginal of the visible, in details we can step through the following equation to infer and learn the RBM.
As for the training algorithm, the main idea is also applied GD idea into RBM. Hinton put forward CD  as a faster learning algorithm. Firstly, the derivative of the log probability of a training vector with respect to a weight is computed as
where the angle brackets are used to denote expectations under the distribution specified by the subscript that follows. This leads to a very simple learning rule for performing stochastic steepest ascent in the log probability of the training data:
where ε is a learning rate.
Because there are no direct connections between hidden units in an RBM, it is very easy to get an unbiased sample of 〈vjhj〉data. Given a randomly selected training image, v, the binary state, hj, of each hidden unit, j, is set to 1 with probability
where bj is the current state of hidden neuron j, σ(x) is the logistic sigmoid function . vjhj is then an unbiased sample. The CD is used to calculate the latter part 〈vjhj〉model. Details can be found in their respective publications.
Considering the complicated computation of implementing CD, the training process of RBM is not easy. Under this condition, implementing meta-heuristic on RBM training to substitute CD is of high possibility.
6. Conclusion and discussion
Meta-heuristic has successfully implemented in neural network training. The algorithm used includes GA, PSO, their hybrid and many other meta-heuristic algorithms. Moreover, feed-forward BP neural network and SNN  are all trained with tests on famous classification problems.
The basic structure of DL is similar to traditional neural network. CNN is a special neural network with different weight computation regulations and RBM is a weighted biodimensional neural network or biodimensional graph. Their training processes are also mainly executed through iterative formula on error which is similar to traditional neural network’s training.
Having seen the above two concluding remarks, it can be assumed that it is of high possibility of applying meta-heuristic in DL to speed up training without declining performance. However, relevant publications along this direction are still rare.
Lastly, there still exists a question that goes under today’s computation ability, especially the GPU whose computation ability is several times stronger than CPU that has been widely used in industrial area. One may ask, although elegant, is meta-heuristic still necessary? This question is not easy to answer. However, one can be certain that when traversing all the possible solutions is certainly highly time-consuming. Search for near-optimal results by meta-heuristic is still useful that can provide us a reasonable searched result (instead of non optimal or far from optimal) near a global optimum value at an acceptable computational cost.