Table 1. Representative architectures of GANs in recent years.
Deep learning, also known as deep representation learning, has dramatically improved performance on a variety of learning tasks and achieved tremendous success in the past few years. Artificial neural networks are the main object of study, chiefly Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). Among these networks, CNNs have received the most attention due to their kernel methods with the weight sharing mechanism, and they have achieved state-of-the-art results in many domains, especially computer vision. In this research, we conduct a comprehensive survey of the recent improvements in CNNs, and we present these advances from the low level to the high level, including the convolution operations, convolutional layers, architecture design, loss functions, and advanced applications.
- deep learning
- kernel methods
- weight sharing
- comprehensive survey
Convolutional Neural Networks (CNNs) are specially designed to handle data that consists of multiple arrays/matrices, such as an image composed of three matrices in RGB channels. The key idea behind CNNs is the convolution operation, which uses multiple small kernels/filters to extract local features by sliding over the same input. Each kernel outputs a feature map, and all the feature maps are concatenated together; this is known as a convolutional layer and it is the core component of a CNN. Note that these concatenated maps can be further processed by the next layer. To reduce the computational cost, a pooling operation such as max-pooling is usually applied to these feature maps. A typical CNN is structured as a series of layers, including multiple convolutional layers and a few fully connected layers. For example, the famous LeNet consists of two convolutional layers and three fully connected layers, and the pooling operation is used after each convolutional layer.
In addition to building a neural network, a loss function is essential to measure the model performance. Therefore, the process of training a CNN model is transformed into an optimization problem, which normally seeks to minimize the value of the loss function over the training data. Specifically, a gradient-descent based algorithm is usually adopted to iteratively optimize the parameters in a CNN.
Figure 1 shows the high-level abstraction of CNNs in this survey. Specifically, we first introduce two types of convolution operations in Section 2. Then four methods for constructing convolutional layers in CNNs are summarized in Section 3. In Section 4, we group the current CNN architectures into three types: encoder, encoder-decoder and GANs. Next, we discuss two main types of loss functions in Section 5. In Section 6, we present the advanced applications based on the three types of CNN structures. Finally, in Section 7, we conclude this research and discuss future trends.
2. Convolution operations
The main reason why CNNs are so successful on a variety of problems is that kernels (also known as filters) with fixed numbers of parameters are adopted to handle spatial data such as images. In particular, the weight sharing mechanism helps reduce the number of parameters and thus the computational cost while retaining the spatial invariance properties. In general, there are two main types of convolution operations: basic convolution and transposed convolution.
2.1 Basic convolution and dilated kernels
As shown on the left in Figure 2, the convolution operation is essentially a linear model for the local spatial input. Specifically, it sums the element-wise products between the local input and the kernel (usually adding a bias), and outputs a value after an activation function. Each kernel slides over all spatial locations in the input with a fixed step. The result is a 1-channel feature map. Note that there are generally many kernels in one convolutional layer, and all of the output feature maps are concatenated together, so the number of channels in the concatenated feature map equals the number of kernels used in the layer.
While small kernels are widely used in current CNNs, we may need large receptive fields in the input for observing more information during each convolution operation. However, if we directly increase the kernel size, the total number of parameters, which also scales with the depth of the input, will increase dramatically and the computational cost will be prohibitive. In practice, as shown on the right in Figure 2, we can insert zeros between the elements of the kernels and get dilated kernels. Dilated kernels have been applied in many tasks such as image segmentation, translation and speech recognition.
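The sliding-window operation and the zero-insertion idea can be illustrated with a minimal NumPy sketch. It assumes a single 2D channel, no padding and no activation function, and, like most deep learning libraries, it does not flip the kernel. The helper names `conv2d` and `dilate_kernel` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between neighbouring kernel elements."""
    kh, kw = kernel.shape
    dilated = np.zeros((rate * (kh - 1) + 1, rate * (kw - 1) + 1), dtype=kernel.dtype)
    dilated[::rate, ::rate] = kernel
    return dilated

def conv2d(image, kernel, stride=1, bias=0.0):
    """Slide the kernel over one 2D channel (no padding) and sum the products."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel) + bias
    return out

image = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.ones((3, 3))

plain_out = conv2d(image, kernel)                      # 3x3 receptive field
dilated_out = conv2d(image, dilate_kernel(kernel, 2))  # 5x5 receptive field, still 9 weights
```

In this sketch the dilated kernel covers a 5x5 region of the input while keeping the same nine learnable weights, which is the trade-off described above.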
2.2 Transposed convolution and dilated kernels
Normally the size of the output feature maps generated by basic convolution is smaller than the input space (compare the dimensions of the input and the output in Figure 2), which results in high-level abstraction when multiple convolutional layers are stacked. Transposed convolution can be seen as the reverse idea of basic convolution. Its primary purpose is to obtain an output feature map that is bigger than the input space, as shown on the left in Figure 3. Specifically, during transposed convolution, each output field in the feature map is just the kernel multiplied by the scalar value of one element in the input; where output fields overlap, their values are summed.
Similarly, we can still use dilated kernels in transposed convolution. Transposed convolution matters because it is the fundamental building block of a decoder network, which maps a latent space into an output image, such as the decoders in U-Net and GANs. Specifically, transposed convolution is widely used in tasks such as model visualization, image segmentation, image classification and image super-resolution.
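A minimal NumPy sketch of transposed convolution is given below, assuming a single 2D channel and no padding. Each input element scales the kernel and the scaled copy is added into the output at a position determined by the stride. The function name `transposed_conv2d` is hypothetical.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=1):
    """Scatter a scaled copy of the kernel into the output for every input element."""
    kh, kw = kernel.shape
    out_h = (x.shape[0] - 1) * stride + kh
    out_w = (x.shape[1] - 1) * stride + kw
    out = np.zeros((out_h, out_w))
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            # Overlapping output fields are summed.
            out[i * stride:i * stride + kh, j * stride:j * stride + kw] += x[i, j] * kernel
    return out

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
kernel = np.ones((3, 3))
y = transposed_conv2d(x, kernel, stride=2)  # a 2x2 input becomes a 5x5 output
```

With these assumed sizes, a 2x2 input grows to a 5x5 feature map, and the centre element receives a contribution from all four input elements.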
3. Convolutional layers
The core components of CNNs are convolutional layers. In the last section, we described two types of convolution operations, which are the basis for constructing convolutional layers. In this section, we summarize the main methods in deep learning for building convolutional layers, including basic convolutional layers, convolutional layers with shortcut connection, convolutional layers with mixed kernels and convolutional capsule layers.
3.1 Basic convolutional layers
Recall that there are normally $n$ kernels in one convolutional layer, where $n$ also denotes the depth of the output feature map. In other words, the number of channels in the output map depends on the number of kernels used in the convolutional layer. More formally, we can denote it as
$$F = [\, k_1 \otimes I,\; k_2 \otimes I,\; \dots,\; k_n \otimes I \,]$$
where $I$ is the input, $k_i$ is the $i$-th kernel, $\otimes$ represents the convolution operation which has been addressed above, $[\cdot]$ denotes the concatenation operation and $F$ is the output feature map. After the convolution operation, a non-linear activation function $\sigma$ is applied to each element in the concatenated feature map, which can be denoted as
$$F' = \sigma(F)$$
While there are many variants of the activation function, the typical ones that are widely adopted are ReLU, tanh and sigmoid. Note that non-linear activation functions are essential for building multi-layer networks: the universal approximation theorem shows that a two-layer network with enough neurons can uniformly approximate any continuous function.
Note that after the convolution operation, the width and height of the output feature map are usually close to the width $W$ and height $H$ of the input. To further reduce the dimensions of the output feature maps and thus the computational cost, the pooling operation is widely used in current CNNs. Specifically, for the 2D pooling operation, two main hyper-parameters are involved: the filter size $f$ and the stride $s$. After the pooling operation, the width of the feature map is reduced to $(W - f)/s + 1$ and the height to $(H - f)/s + 1$. In brief, we can have
$$F_{\mathrm{pool}} = \mathrm{pool}(F)$$
where $F$ is the feature map to be reduced and $\mathrm{pool}$ denotes the pooling operation discussed above. Typical pooling operations include max-pooling and average-pooling. A common choice is to use a stride of 2 with a $2 \times 2$ filter, which means that every 4 pixels in the 2D feature map are compressed into 1 pixel. As a toy example, for a window of only four pixels, max-pooling outputs the largest of the four values and average-pooling outputs their mean.
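The pooling operation can be sketched in a few lines of NumPy. The example assumes a single 2D feature map, a square window and no padding; the function name `pool2d` and the toy values are hypothetical.

```python
import numpy as np

def pool2d(fmap, size=2, stride=2, mode="max"):
    """Reduce each size x size window of a 2D feature map to a single value."""
    reduce_fn = np.max if mode == "max" else np.mean
    out_h = (fmap.shape[0] - size) // stride + 1
    out_w = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = fmap[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = reduce_fn(window)
    return out

fmap = np.array([[1.0, 3.0, 2.0, 0.0],
                 [4.0, 2.0, 1.0, 1.0],
                 [0.0, 0.0, 5.0, 6.0],
                 [1.0, 2.0, 7.0, 8.0]])

max_pooled = pool2d(fmap, mode="max")   # [[4, 2], [2, 8]]
avg_pooled = pool2d(fmap, mode="mean")  # [[2.5, 1.0], [0.75, 6.5]]
```

With a 2x2 window and a stride of 2, the 4x4 map is reduced to 2x2, i.e., every four pixels are compressed into one.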
3.2 Convolutional layers with shortcut connection
It is true that deep neural networks can normally learn better representations from the data than shallow neural networks. However, stacking more layers in a CNN can lead to the problems of vanishing or exploding gradients, which make the networks hard to optimize. A simple and effective way to address this problem is to use shortcut connections, which directly transfer the information from a previous layer to the current layer in a network.
Note that a shortcut connection can combine the previous input with the current output through two types of operations.
Element-Wise Sum: Each element in the current output is added to the corresponding element in the previous input, which means that the dimensions of the two must be the same, and the result is an output of the same size. This type of operation is well known as the identity shortcut connection and it is the core idea in ResNet [11, 12]. The main advantage is that it does not add any extra parameters or computational complexity. The disadvantage is its inflexibility.
Concatenation: We can concatenate the current output and the previous input together. Suppose the size of the current output feature map is $W \times H \times C_1$ and the size of the previous input is $W \times H \times C_2$; after concatenation, we have a concatenated feature map with a size of $W \times H \times (C_1 + C_2)$. Note that the widths and heights of the input and the output must be the same. The advantage is that we can retain the information from the previous layers. The disadvantage is that we have to use extra parameters to handle the concatenated feature map (i.e., the depth of the kernels for processing this feature map is $C_1 + C_2$). Specifically, this type of convolutional layer is broadly adopted in networks for image segmentation such as U-Net.
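The two shortcut operations can be compared with a short NumPy sketch. It assumes feature maps stored as height x width x channels arrays; the function names `shortcut_sum` and `shortcut_concat` and the channel counts are hypothetical.

```python
import numpy as np

def shortcut_sum(prev_input, current_output):
    """Identity shortcut: element-wise sum, which needs identical shapes."""
    if prev_input.shape != current_output.shape:
        raise ValueError("element-wise sum needs identical shapes")
    return prev_input + current_output

def shortcut_concat(prev_input, current_output):
    """Concatenation shortcut: stack along the channel axis (same height and width)."""
    if prev_input.shape[:2] != current_output.shape[:2]:
        raise ValueError("concatenation needs identical height and width")
    return np.concatenate([current_output, prev_input], axis=-1)

prev_input = np.ones((8, 8, 16))        # assumed 16 channels
same_depth_out = np.ones((8, 8, 16))    # output with the same depth
other_depth_out = np.ones((8, 8, 32))   # output with a different depth

summed = shortcut_sum(prev_input, same_depth_out)       # shape (8, 8, 16)
stacked = shortcut_concat(prev_input, other_depth_out)  # shape (8, 8, 48)
```

The sum keeps the depth unchanged, while the concatenation increases it, so the next layer needs deeper kernels to process the result.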
3.3 Convolutional layers with mixed kernels
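So far we have assumed that all convolutional kernels in one convolutional layer have the same size. To enlarge the receptive field, we may adopt dilated kernels instead. However, it is difficult to know what kernel size we should use in a CNN. Naturally, we may apply kernels of different sizes in each convolutional layer, e.g., $1 \times 1$, $3 \times 3$ and $5 \times 5$ kernels are adopted in one convolutional layer. More formally, we define one convolutional layer with mixed kernels as
$$F = [\, k_{1 \times 1} \otimes I,\; k_{3 \times 3} \otimes I,\; k_{5 \times 5} \otimes I,\; \mathrm{pool}(I) \,]$$
where $I$ is the input, $k_{s \times s} \otimes I$ denotes the convolution of $I$ with the kernels of size $s \times s$, $[\cdot]$ denotes the concatenation operation and pool(I) denotes the pooling operation such as max-pooling. Therefore, the depth of the output feature map is the sum of the depths of the concatenated branches.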
However, if we directly add kernels of different sizes to one convolutional layer, the computational cost involved will increase sharply. In the inception module [13, 14], a $1 \times 1$ convolutional layer is applied before the $3 \times 3$ and $5 \times 5$ convolutional layers in order to reduce the computational cost.
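The saving obtained by placing a 1x1 convolution before a larger kernel can be checked with simple parameter counting. The sketch below is only illustrative: the channel counts (256 input channels, a 32-channel bottleneck, 64 output channels) are assumptions made for this example, and `conv_params` is a hypothetical helper.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Number of weights plus biases in one convolutional layer."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels

# A 5x5 layer applied directly to a 256-channel input.
direct = conv_params(5, 256, 64)

# A 1x1 layer first reduces the depth to 32 channels, then the 5x5 layer is applied.
reduced = conv_params(1, 256, 32) + conv_params(5, 32, 64)

print(direct, reduced)  # 409664 59488
```

Under these assumed sizes, the bottleneck version uses roughly one seventh of the parameters of the direct 5x5 layer.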
3.4 Convolutional capsule layers
“The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.”—Geoffrey Hinton.
In general, the pooling operation is essential for reducing the size of the output feature maps, so that we can obtain high-level abstractions from the input by stacking multiple convolutional layers in a CNN. However, the cost is that some information in the feature maps is discarded, e.g., max-pooling keeps only the largest value in each window.
In 2017, Hinton et al. proposed an alluring version of convolutional architectures known as capsule networks, followed by updated versions in 2018 and 2019. The convolutional capsule layers in capsule networks are very similar to traditional convolutional layers. The main difference is that each capsule (i.e., an element in the convolutional feature maps) has a weight matrix, whose size differs between these versions.
4. Architecture design
Although numerous variants of CNN architectures for solving different tasks are proposed by the deep learning community every year, their essential components and overall structures are very similar. We group the recent classic network structures into three main types: encoder, encoder-decoder and GANs.
4.1 Encoder
In 1990, LeCun et al. proposed a seminal network called LeNet, which helped establish the modern CNN structure. Since then, many new methods and compositions have been proposed to handle the difficulties encountered in training deep networks for challenging tasks such as object detection and recognition in computer vision. Some representative works in recent years are AlexNet, ZFNet, VGGNet, GoogleNet, ResNet and Inception. As mentioned earlier, new methods for constructing convolutional layers in these networks have been proposed, e.g., shortcut connection and mixed kernels [14, 20].
In general, the above-mentioned networks can all be regarded as encoders, in which each input such as an image is encoded into a high-level feature representation, as shown on the left in Figure 4. This encoded representation can be further used for tasks such as image classification and object detection. In some of the literature, an encoder is also called a feature extractor. Specifically, the basic convolutional layers are the main components for constructing an encoder network: by stacking multiple layers, each layer in the network can learn high-level abstractions from the previous layers. More formally, an encoder network can be written as
$$z = f(x; \theta)$$
where $x$ is the input, $\theta$ is the parameters to learn (e.g., kernels and bias) in the network and $z$ denotes the encoded representation such as a vector.
4.2 Encoder-decoder
In some specific tasks such as image segmentation, our goal is to map an input image to a segmented output image rather than an abstraction. An encoder-decoder structure is specifically designed for solving this type of task. There are many possible ways to implement an encoder-decoder structure, and many variants have been proposed in the last few years to address its drawbacks. A naive version of the encoder-decoder network can be denoted as
$$z = f_{\mathrm{enc}}(x), \qquad \hat{x} = f_{\mathrm{dec}}(z)$$
where $f_{\mathrm{enc}}$ denotes an encoder CNN that maps an input sample $x$ to a representation $z$, and $f_{\mathrm{dec}}$ represents a decoder CNN that reconstructs the input sample from $z$ as $\hat{x}$. Specifically, CNN encoders usually conduct basic convolution operations (i.e., Section 2.1) and CNN decoders perform transposed convolution operations (i.e., Section 2.2).
As shown in the middle in Figure 4, an encoder-decoder network is still one complete network and we can train it end-to-end. Note that there are generally many convolutional layers in both the encoder and the decoder, so it can be challenging to train a deep encoder-decoder network directly. Recall that the shortcut connection is often adopted to address the problems in deep CNNs. Naturally, we can add connections between the encoder and the decoder. An influential network based on this idea is U-Net, which is widely applied in many challenging domains such as medical image segmentation. The encoder and the decoder can also be written together as a composition of two functions, i.e., the decoder applied to the output of the encoder.
Specifically, in unsupervised learning, an encoder-decoder network is also well known as an autoencoder. Many variants of autoencoders have been proposed in recent years, including the variational autoencoder, the denoising variational autoencoder and the conditional variational autoencoder [23, 24].
4.3 GANs
Since generative adversarial networks were first proposed by Goodfellow et al. in 2014, this type of architecture for playing a two-player minimax game has been extensively studied. This is partly because it is an unsupervised learning method that yields a generator network able to generate fake examples from a latent space (i.e., a vector with some random noise). The right of Figure 4 shows the basic structure of GANs, in which a generator network maps some input noise into a fake example and makes it look as real as possible, while a discriminator network tries to identify the fake samples in its input. By iteratively training the two players, both can improve. More formally, we can have
$$\tilde{x} = G(z), \qquad \hat{y} = D(u), \quad u \in \{x, \tilde{x}\}$$
where $G$ denotes the generator function and $D$ represents the discriminator function. $z$ is the latent space input to the generator, and its output $\tilde{x}$ is a fake example. $x$ is the real samples we have collected, and $\hat{y}$ is the predicted result of the discriminator, which shows whether its input $u$ is real or fake.
As shown in Table 1, numerous variants of GAN architectures can be found in the recently published literature, and we broadly summarize these representative networks according to their publication time. Note that the fundamental methods behind these architectures are very similar.
|Architecture|Year|Description|
|---|---|---|
|GANs|2014|The original version of GANs, where $G$ and $D$ are implemented with fully connected neural networks.|
|Conditional GANs|2014|Labels are included in $G$ and $D$.|
|Laplacian Pyramid GANs|2015|CNNs with the Laplacian pyramid method.|
|Deep Convolutional GANs|2015|Transposed convolutional layers are used to construct $G$.|
|Bidirectional GANs|2016|An extra encoder was adopted based on the traditional GANs.|
|Semi-supervised GANs|2016|$D$ can also classify the real samples while distinguishing the real from the fake.|
|InfoGANs|2016|An extra classifier was added into the GANs.|
|Energy-based GANs|2016|$D$ was replaced with an encoder-decoder network.|
|Auxiliary Classifier GANs|2017|An auxiliary classifier was used in $D$.|
|Progressive GANs|2017|Progressive steps are adopted to expand the networks.|
|BigGANs|2018|A large GAN with a self-attention module and hinge loss.|
|Self-attention GANs|2019|The self-attention mechanism is proposed to build $G$ and $D$.|
|Label-noise Robust GANs|2019|A noise transition model is included in $D$.|
|AutoGANs|2019|The neural architecture search algorithm is used to obtain $G$ and $D$.|
|Your Local GANs|2020|A new local sparse attention layer was proposed.|
|MSG-GANs|2020|There are connections from $G$ to $D$.|
5. Loss functions
Before introducing the loss functions, we need to understand that the ultimate goal of training a neural network is to find a suitable set of parameters so that our model can achieve good performance on unseen samples (i.e., the test dataset). The typical way to search for these parameters in machine learning is to use loss functions as a criterion during training. In other words, training neural networks is equivalent to optimizing the loss functions by back-propagation. More precisely, a loss function outputs a scalar value that measures the difference between the predicted result and the true label for one sample. During training, our goal is to minimize this scalar value over the training samples (i.e., the cost function). Therefore, as shown in Figure 1, loss functions play a significant role in constructing CNNs.
$$J = \frac{1}{m} \sum_{i=1}^{m} L^{(i)}$$
where $L^{(i)}$ denotes the loss function for the training sample $i$, and $J$ is often known as the cost function, which is just the mean of the losses over $m$ training samples (usually a batch of training samples is fed into a CNN during each iteration of training).
Note that there are numerous variants of loss functions used in the deep learning literature. However, the fundamental theories behind them are very similar. We group them into two categories, namely Divergence Loss Functions and Margin Loss Functions. And we also introduce six typical and classic loss functions that are commonly used for training neural networks.
5.1 Divergence loss functions
Divergence loss functions denote a family of loss functions based on computing the divergences between the predicted results and true labels, mainly including Kullback-Leibler Divergence, Log Loss, Mean Squared Error.
5.1.1 Kullback-Leibler divergence
Before introducing the Kullback–Leibler divergence, we need to understand that the fundamental goal of deep learning is to learn a data distribution $Q$ over the training dataset so that $Q$ is close to the true data distribution $P$. Back in 1951, the Kullback–Leibler divergence was proposed to measure the difference between two distributions on the same probability space. It is defined as
$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} = H(P, Q) - H(P)$$
where $D_{KL}(P \,\|\, Q)$ denotes the Kullback–Leibler divergence from $Q$ to $P$, $H(P) = -\sum_{x} P(x) \log P(x)$ is the entropy of $P$ and $H(P, Q) = -\sum_{x} P(x) \log Q(x)$ is the cross entropy of $P$ and $Q$. There is also a symmetrized form of the Kullback–Leibler divergence, which is known as the Jensen–Shannon divergence. It is a measure of the similarity between $P$ and $Q$:
$$D_{JS}(P \,\|\, Q) = \frac{1}{2} D_{KL}(P \,\|\, M) + \frac{1}{2} D_{KL}(Q \,\|\, M), \qquad M = \frac{1}{2}(P + Q)$$
Specifically, $D_{JS}(P \,\|\, Q) = 0$ means the two distributions are the same. Therefore, if we minimize the Jensen–Shannon divergence, we can make the distribution $Q$ close to the distribution $P$. More specifically, let $P$ denote the distribution of the data and $Q$ represent the distribution learned by a CNN model; by minimizing the divergence, we can learn a model that is close to the true data distribution. This is the main idea of GANs. The loss function of GANs is defined as
$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim P}[\log D(x)] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z)))]$$
where $G$ denotes the generator, $D$ denotes the discriminator and $P_z$ is the distribution of the latent input $z$. Our goal is to make the distribution of the generated examples $G(z)$ close to $P$. In other words, when the generative distribution of fake examples is close to the distribution of real samples, the discriminator cannot distinguish between the fake and the real.
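Both divergences are easy to compute for discrete distributions. The NumPy sketch below uses the standard definitions with the natural logarithm and assumes that `q` is non-zero wherever `p` is non-zero; the function names and the example distributions are hypothetical.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with P(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence: a symmetrized form of the KL divergence."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])

kl_pq = kl_divergence(p, q)  # differs from kl_divergence(q, p)
js_pq = js_divergence(p, q)  # equals js_divergence(q, p)
```

The example illustrates that the Kullback–Leibler divergence is not symmetric, whereas the Jensen–Shannon divergence is, and that both are zero when the two distributions are identical.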
5.1.2 Log loss
Log loss is widely used in current deep neural networks due to its simplicity and power. The binary log loss function is defined as
$$L(y, \hat{y}) = -\big[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\big]$$
where $y \in \{0, 1\}$ denotes the binary label for a sample and $\hat{y}$ is the predicted result (i.e., given a training sample $x$ with its corresponding label $y$, we obtain the predicted result $\hat{y} = f(x; \theta)$ from an encoder network $f$).
When the learning task is multi-class classification, each sample label is normally encoded in the one-hot format, which can be denoted as $y = (y_1, \dots, y_C)$ for $C$ classes, i.e., if the label is 3, then only $y_3 = 1$ and the others are all given the value of 0. Therefore, the log loss for one sample can be written as
$$L(y, \hat{y}) = -\sum_{j=1}^{C} \mathbb{1}(y_j = 1) \log \hat{y}_j$$
where $\hat{y}_j$ is the predicted result for class $j$, so only the term of the true label contributes. $\mathbb{1}(\cdot)$ denotes the indicator function, which outputs 1 if $y_j = 1$ and 0 otherwise.
We may wonder why the log loss is a reasonable choice. Informally, let $P$ denote the data distribution and $Q$ denote the distribution learned by our model; then, based on the Kullback–Leibler divergence, we have
$$D_{KL}(P \,\|\, Q) = H(P, Q) - H(P)$$
Our goal is to minimize the divergence between $P$ and $Q$ so that the distribution obtained by our model is close to the true data distribution. Because the term $H(P)$ is the entropy of the data and does not depend on the model, we only need to optimize the cross entropy term $H(P, Q)$. Therefore, log loss is also well known as cross-entropy loss.
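The binary and multi-class log losses can be written compactly in NumPy. This is a minimal sketch for a single sample; the clipping constant `eps` is an assumption added for numerical safety, and the function names are hypothetical.

```python
import numpy as np

def binary_log_loss(y, y_hat, eps=1e-12):
    """Binary log loss for one sample with label y in {0, 1}."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # avoid log(0)
    return float(-(y * np.log(y_hat) + (1 - y) * np.log(1.0 - y_hat)))

def multiclass_log_loss(one_hot, probs, eps=1e-12):
    """Multi-class log loss: only the true class contributes to the sum."""
    one_hot = np.asarray(one_hot, dtype=float)
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    return float(-np.sum(one_hot * np.log(probs)))

binary = binary_log_loss(1, 0.5)                           # -log(0.5)
multi = multiclass_log_loss([0, 0, 1], [0.2, 0.3, 0.5])    # -log(0.5), label 3 is the true class
```

In both calls the model assigns probability 0.5 to the correct answer, so the two losses coincide.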
5.1.3 Mean squared error
The mean squared error is probably one of the most familiar loss functions, as it is closely related to the least squares loss function. It directly calculates the difference between the predicted result and the true label, which is denoted as
$$L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
where $y_i$ is the true label and $\hat{y}_i$ is the predicted result for sample $i$. One example that helps us understand the mean squared error more deeply is that minimizing the mean squared loss of a linear regression model is equivalent to maximum likelihood estimation when the noise is assumed to be Gaussian. In other words, this is a method of optimizing the parameters of our model so that the observed training data is most probable under the distribution learned by our model. Therefore, the fundamental goal is still the same as above, which is to make the model distribution and the data distribution as close as possible.
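The link between the mean squared error and least squares can be illustrated with a small NumPy sketch. The data points are made up for this example, and `np.linalg.lstsq` is used as the least squares solver; any change to the fitted weights can only increase the mean squared error.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical 1D regression data, roughly y = 2x + 1 with small noise.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Linear model y_hat = w * x + b, fitted by least squares.
design = np.stack([x, np.ones_like(x)], axis=1)
weights, *_ = np.linalg.lstsq(design, y, rcond=None)

best = mean_squared_error(y, design @ weights)
worse = mean_squared_error(y, design @ (weights + np.array([0.1, 0.0])))
```

Here `best` is the smallest mean squared error that a linear model can reach on these points, and `worse` shows that perturbing the slope increases it.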
5.2 Margin loss functions
Margin loss functions represent a family of margin maximizing loss functions. The typical functions include Hinge Loss, Contrastive Loss and Triplet Loss. Unlike the divergence loss functions, margin loss functions calculate the relative distances between outputs and they are more flexible in terms of training data.
5.2.1 Hinge loss
Hinge loss is well known for training Support Vector Machine classifiers. Specifically, there are two main types of hinge losses. The first type is for samples with only one correct label; it is denoted as
$$L(y, \hat{y}) = \sum_{j \neq c} \max(0,\; m + \hat{y}_j - \hat{y}_c)^p$$
where $j$ indexes each element in the one-hot-encoded label and $c$ is the correct class. $\hat{y}_j$ represents the predicted result of our neural network for class $j$. $m = 1$ is the standard choice for the margin. If $p = 1$, the above loss denotes the standard hinge loss, and if $p = 2$, it is the squared hinge loss.
However, in real tasks such as attribute classification, each sample can have multiple correct labels, e.g., a photo posted on Facebook may include a set of hashtags. Therefore, the second type for multiple labels is
$$L(y, \hat{y}) = \sum_{j} \max(0,\; m - \delta_j \hat{y}_j)^p$$
where $\delta_j = 1$ if $y_j = 1$, otherwise $\delta_j = -1$. $m = 1$ is still the common choice for the margin and $p = 1$ or $p = 2$.
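The two hinge losses can be sketched in NumPy as follows. The formulas implemented here are common forms of the multi-class and multi-label hinge loss and should be read as an assumption rather than as the exact definitions of any particular library; the function names and the example scores are hypothetical.

```python
import numpy as np

def multiclass_hinge(scores, correct, margin=1.0, p=1):
    """Sum over wrong classes of max(0, margin + score_j - score_correct) ** p."""
    scores = np.asarray(scores, dtype=float)
    losses = np.maximum(0.0, margin + scores - scores[correct]) ** p
    losses[correct] = 0.0  # the correct class is excluded from the sum
    return float(np.sum(losses))

def multilabel_hinge(scores, labels, margin=1.0, p=1):
    """Sum over classes of max(0, margin - sign_j * score_j) ** p, sign_j in {+1, -1}."""
    scores = np.asarray(scores, dtype=float)
    signs = np.where(np.asarray(labels) == 1, 1.0, -1.0)
    return float(np.sum(np.maximum(0.0, margin - signs * scores) ** p))

single = multiclass_hinge([2.0, 1.5, -1.0], correct=0)        # 0.5
squared = multiclass_hinge([2.0, 1.5, -1.0], correct=0, p=2)  # 0.25
multi = multilabel_hinge([2.0, -0.5, 0.3], labels=[1, 0, 1])  # 1.2
```

In the first call only the second class violates the margin, so it alone contributes to the loss.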
5.2.2 Contrastive loss
Contrastive loss is specially designed for measuring the similarity of a pair of training samples. Consider two pairs of samples $(x_a, x_p)$ and $(x_a, x_n)$, where $x_a$ is known as the anchor sample, $x_p$ denotes the positive sample and $x_n$ represents the negative sample. Specifically, if the pair is matching, then the loss for the pair is the distance between their outputs from the network $f$. If the pair is not matching and the distance between their outputs from the model is smaller than the pre-defined margin $m$, then we also need to calculate the loss. Formally, we can have
$$L = \begin{cases} d\big(f(x_a), f(x_p)\big) & \text{for a matching pair} \\ \max\big(0,\; m - d(f(x_a), f(x_n))\big) & \text{for a non-matching pair} \end{cases}$$
where $d$ can be the Euclidean distance (i.e., $d(u, v) = \|u - v\|_2$). Alternatively, the above equation can be rewritten for a generic pair $(x_1, x_2)$ as
$$L = y \, d\big(f(x_1), f(x_2)\big) + (1 - y) \max\big(0,\; m - d(f(x_1), f(x_2))\big)$$
where $y = 1$ if the given pair is matching, otherwise $y = 0$. $m$ is the margin, which affects the loss calculation for the unmatched pairs.
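A minimal NumPy sketch of the contrastive loss is shown below. It assumes that the network outputs (embeddings) are already available as vectors and uses the Euclidean distance; the function name and the example vectors are hypothetical.

```python
import numpy as np

def contrastive_loss(z1, z2, is_match, margin=1.0):
    """Distance for matching pairs, margin violation for non-matching pairs."""
    distance = float(np.linalg.norm(np.asarray(z1, dtype=float) - np.asarray(z2, dtype=float)))
    if is_match:
        return distance
    return max(0.0, margin - distance)

anchor = np.array([0.0, 0.0])
near = np.array([0.3, 0.4])  # Euclidean distance 0.5 from the anchor
far = np.array([3.0, 4.0])   # Euclidean distance 5.0 from the anchor

match_loss = contrastive_loss(anchor, near, is_match=True)        # 0.5
close_negative = contrastive_loss(anchor, near, is_match=False)   # 0.5, inside the margin
far_negative = contrastive_loss(anchor, far, is_match=False)      # 0.0, outside the margin
```

A non-matching pair contributes to the loss only while its distance is smaller than the margin.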
5.2.3 Triplet loss
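Triplet loss looks similar to the contrastive loss, but it measures the difference between the matched pair and the unmatched pair. Considering three samples $(x_a, x_p, x_n)$, i.e., an anchor, a positive and a negative sample, the triplet loss is denoted as
$$L = \max\big(0,\; m + d(f(x_a), f(x_p)) - d(f(x_a), f(x_n))\big)$$
where $d$ is a distance such as the Euclidean distance, $f$ is the network and $m$ is the margin. Note that minimizing the loss function is equivalent to minimizing the distances of matched pairs and maximizing the distances of unmatched pairs.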
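The triplet loss can be sketched in the same way. The example assumes that the embeddings of the anchor, positive and negative samples are given and uses the Euclidean distance; the function name, the margin value and the example vectors are hypothetical.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, margin + d(anchor, positive) - d(anchor, negative))."""
    anchor = np.asarray(anchor, dtype=float)
    d_pos = float(np.linalg.norm(anchor - np.asarray(positive, dtype=float)))
    d_neg = float(np.linalg.norm(anchor - np.asarray(negative, dtype=float)))
    return max(0.0, margin + d_pos - d_neg)

anchor = [0.0, 0.0]
positive = [0.3, 0.4]       # distance 0.5 from the anchor
hard_negative = [0.6, 0.8]  # distance 1.0 from the anchor
easy_negative = [3.0, 4.0]  # distance 5.0 from the anchor

hard = triplet_loss(anchor, positive, hard_negative)  # 1.0 + 0.5 - 1.0 = 0.5
easy = triplet_loss(anchor, positive, easy_negative)  # 0.0, the margin is already satisfied
```

The loss vanishes once the negative sample is farther from the anchor than the positive sample by at least the margin.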
6. Advanced applications
One of the most exciting aspects of deep learning is that we can apply neural networks to numerous applications that cannot be solved well or handled at all by traditional machine learning methods. In this section, we summarize the typical advances that CNNs have achieved based on the three types of CNN structures.
6.1 Applications with encoders
6.1.1 Image classification
A basic task in machine learning is classification, which is the problem of identifying to which of a list of labels a new sample belongs. A well-known example is the CIFAR-10 dataset, which contains 10 categories of images; the goal is to train a model that correctly classifies an unseen image after observing the training dataset. In particular, CNNs have made many breakthroughs on large-scale image datasets such as the ImageNet challenge. As mentioned in Section 4.1, classic encoders such as AlexNet, ZFNet, VGGNet, GoogleNet, ResNet and Inception are regarded as the milestones of the past few years. The successes of these encoders are all based on supervised learning, which means that manual labelling is essential for datasets such as ImageNet. Specifically, a labeled dataset is normally divided into a training and a test dataset (and possibly a validation dataset). Our goal is to achieve good performance on the test dataset after training a neural network on the training dataset, and the trained model can then be used to classify new images that come from the same data distribution.
Classification can also be treated as a fundamental problem in machine learning, and the successes of these encoders on image classification help establish the foundation for many other applications. Specifically, we can utilize an encoder to extract a high-level representation from the low-level input image, and the obtained representation can be further used for many other applications.
6.1.2 Object detection
In addition to image classification, object detection is also very important in computer vision. Image classification tells us what a given image is, while object detection tells us the specific positions of objects in an image. Specifically, the goal is to train an encoder to output a suitable bounding box and the associated class probabilities for each object in a given image. Two typical methods are widely used in current computer vision: YOLO and SSD. The core idea of YOLO is that object detection is treated as a regression problem: each image is divided into a grid, and each grid cell outputs a pre-defined number of bounding boxes, the corresponding confidence for each box and the class probabilities. Since the first version of YOLO was proposed, several updated versions have followed. SSD is a simpler method, which utilizes a set of default boxes with different aspect ratios, and each box outputs the shape offsets and the class confidences.
6.1.3 Pose estimation
The multiple levels of representations learned in the multiple layers of CNNs can also be used for solving the task of human-body pose estimation. Specifically, there are two main types of approaches: regression of body joint coordinates and heat-maps for each body part. In 2014, a framework called DeepPose was introduced to learn pose estimation with a deep CNN, in which estimating the human-body pose is equivalent to regressing the body joint coordinates. There are also some extensions based on this method, such as a process called iterative error feedback, which encompasses both the input and output spaces of the CNN to enhance the performance. In 2014, Tompson et al. proposed a hybrid architecture that consists of a CNN and a Markov Random Field, in which the output of the CNN for an input image is a heat-map. Some recent works build on the heat-map method, e.g., by proposing a multi-context attention mechanism that is incorporated into CNNs.
6.2 Applications with encoder-decoders
6.2.1 Image restoration
The task of image restoration is to recover the clean image from a damaged or corrupted one, as in image denoising and super-resolution. A natural way to implement this idea is to utilize a trained encoder-decoder network, where the encoder maps a noisy image into a high-level representation, and the decoder transforms the representation back into the original image. For example, Mao et al. apply a deep convolutional encoder-decoder network to image restoration; in particular, the shortcut connection method described in Section 3.2 is adopted between the encoder and the decoder, and transposed convolution is used for constructing the decoder network, as mentioned in Section 2.2. Similar work has also been introduced for image restoration, in which a residual method is used in the network (i.e., Section 3.2).
6.2.2 Image segmentation
The task of image segmentation is to map an input image into a segmented output image. Encoder-decoder networks have developed dramatically in recent years and have had a significant impact on computer vision. Specifically, there are two main types of tasks: semantic segmentation and instance segmentation. In 2015, Long et al. first showed that an end-to-end fully convolutional network can achieve state-of-the-art results in image segmentation tasks. Similar work was also introduced in 2015, in which the U-Net architecture was proposed for medical image segmentation; the main advance in this architecture is that the shortcut connection method is also used between the encoder and the decoder network. Since then, a series of papers based on these two methods have been published. In particular, U-Net based architectures are nowadays widely used for medical image diagnosis.
6.2.3 Image captioning
One of the exciting applications achieved by CNNs is image captioning, which is to describe the content of an input image in natural language. The basic idea is as follows. Firstly, a pre-trained CNN encoder is used to extract some high-level features from an input image. Secondly, these features are typically fed into a recurrent neural network for generating a sentence. For example, Li et al. proposed a fully convolutional localization network for extracting representations from images, and the decoder for generating captions is an LSTM. Recently, the attention mechanism has been widely used for sequence processing and has achieved significant improvements in tasks such as machine translation. Huang et al. introduce an encoder-decoder framework in which an attention module is used in both the encoder and the decoder. Specifically, the encoder is a CNN-based network.
6.2.4 Speech processing
Speech signals exhibit spectral variations and correlations, and CNNs are well suited to handling them. CNNs can therefore also be used for speech processing tasks such as speech recognition. Sainath et al. applied deep CNNs to large-vocabulary speech tasks, and CNNs are also used for speech recognition in [54, 55, 56]. The fundamental methods are very similar: all of them use CNNs to extract features from the raw input and then feed these features into a decoder for the specific learning task.
6.3 Applications with GANs
6.3.1 Image generation
The most typical application of GANs is to generate fake examples. Recall that a GAN normally consists of two dependent networks, a generator G and a discriminator D. Once the training process is finished, we can use G to generate fake samples that resemble the training dataset.
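The sampling step after training can be sketched as follows. This is an illustration under stated assumptions, not a trained model: the generator is replaced by a hypothetical stand-in (a fixed random linear map followed by tanh), and the latent vectors are assumed to be drawn from a standard normal distribution. The names `make_toy_generator` and `sample_fake` and all dimensions are invented for this example.

```python
import numpy as np

def make_toy_generator(latent_dim, data_dim, seed=0):
    """Build a stand-in for a trained generator (assumption: linear map + tanh)."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(scale=0.1, size=(latent_dim, data_dim))

    def generator(z):
        # Map a batch of latent vectors (N, latent_dim) to samples (N, data_dim).
        return np.tanh(z @ weights)

    return generator

def sample_fake(generator, n_samples, latent_dim, seed=None):
    """Draw latent noise and push it through the generator to get fake samples."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, latent_dim))
    return generator(z)

generator = make_toy_generator(latent_dim=16, data_dim=64)
fake = sample_fake(generator, n_samples=5, latent_dim=16, seed=42)
print(fake.shape)
```

Only the generator is needed at this stage; the discriminator plays no role once training has finished.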
Generating fake samples can be regarded as data augmentation, meaning that the fake data can be further used to train models. Deep learning is well known as a data-driven approach, and most of the advances achieved by deep neural networks are based on supervised learning. Current successful neural network models usually consist of millions of parameters, and annotated data is essential for optimizing these parameters and guaranteeing model accuracy under supervised learning. However, manually labeling data is time-consuming and expensive, especially in specific domains such as medicine. Even worse, it can be hard to collect enough data because of privacy concerns. Numerous works therefore use GANs to enhance model performance. For example, a semi-supervised framework based on GANs has been applied to semantic segmentation to address the lack of annotations, and synthetic medical images have been used to enhance the performance of liver lesion classification.
Despite the successes of GANs, generating high-resolution, diverse samples is still a challenging task. Progressive GANs have been introduced to generate high-resolution human faces. Another impressive work on generating realistic photographs is BigGAN.
6.3.2 Image translation
Another interesting application derived from GANs is image translation. While there are many specific applications, we group them into three categories: image-to-image translation, text-to-image translation, and image super-resolution.
Image to Image: The task of image-to-image translation is to learn a mapping from an input image domain to an output image domain. For example, Isola et al. apply conditional GANs to image-to-image tasks and achieve impressive results, such as mapping sketches to photographs and black-and-white photographs to color. Another typical work is CycleGAN, which can transfer the style of one image to another.
Text to Image: Another interesting application of GANs is to synthesize a realistic image from a text description such as “There is a little bird with red feathers.” Representative works include Reed et al., who introduce a text-conditional convolutional GAN, and Zhang et al., who apply StackGAN to synthesize high-quality images from text.
Super Resolution: The task of super-resolution is to map a low-resolution image to a high-resolution image. In 2017, Ledig et al. proposed a framework named SRGAN, which is regarded as the first work able to generate photo-realistic images for 4× upscaling factors. The loss function used in their framework consists of an adversarial loss and a content loss; the content loss helps preserve the original content of the input images.
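The combination of a content loss and an adversarial loss can be sketched as below. This is a simplified illustration rather than the SRGAN implementation. It assumes that the content loss is a mean squared error over flat lists of values (pixels or pre-extracted feature activations), that the adversarial term is the mean of -log D(G(x)) over discriminator scores in (0, 1], and that the two are combined as a weighted sum. The default `adv_weight` is an illustrative value, and all function names are hypothetical.

```python
import math

def content_loss(sr, hr):
    """Mean squared error between super-resolved and ground-truth values."""
    return sum((s - h) ** 2 for s, h in zip(sr, hr)) / len(hr)

def adversarial_loss(d_scores):
    """Generator-side adversarial term: mean of -log D(G(x)).

    d_scores are discriminator outputs in (0, 1] for generated images.
    """
    return sum(-math.log(p) for p in d_scores) / len(d_scores)

def generator_loss(sr, hr, d_scores, adv_weight=1e-3):
    """Weighted sum of content and adversarial losses (weight is an assumption)."""
    return content_loss(sr, hr) + adv_weight * adversarial_loss(d_scores)

# Toy values: two "pixels" and one discriminator score for the generated image.
sr_values = [1.0, 2.0]
hr_values = [1.0, 4.0]
loss = generator_loss(sr_values, hr_values, d_scores=[0.5])
print(loss)
```

With a small adversarial weight, the content term dominates and keeps the output close to the ground truth, while the adversarial term nudges the generator toward outputs that the discriminator scores as realistic.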
6.3.3 Image editing
Image editing is regarded as a fundamental problem in computer vision, and the emergence of GANs has brought new opportunities for this task. In the past few years, GANs have been developed for image editing tasks such as image inpainting and image matting.
Image inpainting: The task of image inpainting is to recover an arbitrary damaged region in an image. Specifically, an algorithm can learn the content and style of the image and generate the damaged part based on the rest of the input image. One example is the context encoder introduced for natural image inpainting. The works in [65, 66] mainly focus on human face completion.
Image matting: The goal of image matting is to separate the foreground object from the background in an image. This technique can be used for a wide range of applications such as photo editing and video post-production. Representative works include [67, 68].
7. Summary and future trends
In this research, we have conducted a hierarchically structured survey of the main components of CNNs from the low level to the high level, namely convolution operations, convolutional layers, architecture design, and loss functions. In addition to introducing the recent advances in these aspects of CNNs, we have discussed advanced applications based on three types of architectures: encoder, encoder-decoder, and GANs. These applications show that CNNs have made numerous breakthroughs and achieved state-of-the-art results in computer vision, natural language processing, and speech recognition, with the results based on GANs being especially impressive.
From the above analyses, we can conclude that current development in CNNs mainly focuses on designing new architectures and loss functions, because these two aspects are the core parts of applying CNNs to various types of tasks. On the other hand, the fundamental ideas behind these various applications are very similar, as summarized above.
However, current deep learning still has many disadvantages. The first problem is the requirement for large-scale datasets; in particular, constructing a labeled dataset is very time-consuming and expensive, for example in the medical domain. Therefore, much more attention should be paid to semi-supervised and unsupervised learning. The second disadvantage is the high computational cost of training deep CNNs, as standard CNN structures keep becoming deeper and usually consist of millions of parameters. The third issue is that applying CNNs to a task is not easy and usually requires professional skills and experience, because training a network involves many hyper-parameters to tune, such as the number of kernels in each layer, the size of the kernels, the total number of layers, and the learning rate.
Future work should focus on deep learning theory, as a solid theory supporting current neural models is lacking. Other machine learning algorithms such as support vector machines have a clear mathematical foundation, whereas it is usually very hard to fully understand why a deep network achieves such excellent performance on a task. Therefore, based on the current developments in deep learning, we identify three trends that need further work: neural topologies such as graph neural networks, uncertainty estimation such as Bayesian neural networks, and privacy preservation.
This work is supported by the China Scholarship Council and Data61 from CSIRO, Australia.
Conflict of interest
The authors declare no conflict of interest.