Abstract
Deep learning models are data driven. For example, the most popular convolutional neural network (CNN) models used for image classification or object detection require large labeled databases for training to achieve competitive performance. This requirement is not difficult to satisfy in the visible domain, since many labeled video and image databases are available nowadays. However, because infrared (IR) cameras are far less common, the availability of labeled IR video and image databases is limited, and training deep learning models in the IR domain remains challenging. In this chapter, we applied the pix2pix generative adversarial network (Pix2Pix GAN) and cycle-consistent GAN (Cycle GAN) models to convert visible videos to IR videos. The Pix2Pix GAN model requires visible-infrared image pairs for training, while the Cycle GAN relaxes this constraint and requires only unpaired images from the two domains. We applied both models to an open-source database of visible and infrared videos provided by the Signal, Multimedia and Telecommunications Laboratory at the Federal University of Rio de Janeiro. We evaluated the conversion results with the Inception Score (IS), Fréchet Inception Distance (FID), and Kernel Inception Distance (KID) metrics. Our experiments suggest that Cycle GAN is more effective than Pix2Pix GAN for generating IR images from optical images.
Keywords
- image conversion
- generative adversarial network
- cycle-consistent loss
- IR image
- Pix2Pix
- cycle GAN
1. Introduction
Image-to-image conversion, such as data augmentation [1] or style transfer [2], has been applied in many recent computer vision applications. Traditional image conversion models have been investigated for specific applications [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. The creation of the GAN model [15] opened a new door to training generative models for image conversion. For example, computer vision researchers have successfully developed GAN models for day-to-night and sketch-to-photograph image conversions [16]. Two recent popular models that can perform image-to-image translation are Pix2Pix GAN [2] and Cycle GAN [16]. Pix2Pix GAN needs paired images for training, whereas Cycle GAN relaxes this constraint and can be trained with unpaired images. In practice, paired images from different domains are often difficult to obtain, so Cycle GAN is the better choice for image-to-image translation when paired images are not available.
IR image datasets are not as widely available as optical image datasets. As a result, we face a shortage of data when training models for object detection in the IR domain. This problem can be mitigated by using the Cycle GAN model to convert labeled optical images to IR images. In this chapter, we evaluate two models, Pix2Pix GAN and Cycle GAN, for image conversion from the optical domain to the IR domain. We used four different datasets to perform the conversion and three metrics, Inception Score (IS), Fréchet Inception Distance (FID), and Kernel Inception Distance (KID), to assess the quality of the converted IR images.
2. Image to image conversion models
2.1 Generative adversarial network
GAN consists of one generative model and one discriminative model and generates images from noise, as shown in Figure 1. The generator G maps a noise vector z to an image, while the discriminator D learns to distinguish real images from generated ones. The two models are trained adversarially with the minimax loss [15]:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

where $p_{\mathrm{data}}$ is the distribution of real images and $p_z$ is the prior distribution of the noise vector $z$.
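To make the adversarial game concrete, the following is a minimal PyTorch sketch of one training step under this loss; `G`, `D`, and the two optimizers are assumed to exist, and the generator update uses the common non-saturating variant rather than the literal minimax form.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, real, opt_g, opt_d, z_dim=100):
    """One adversarial update for the loss above (all names are illustrative)."""
    n = real.size(0)

    # Discriminator: push D(x) toward "real" and D(G(z)) toward "fake".
    fake = G(torch.randn(n, z_dim)).detach()      # detach blocks gradients into G
    real_out, fake_out = D(real), D(fake)
    d_loss = (F.binary_cross_entropy_with_logits(real_out, torch.ones_like(real_out))
              + F.binary_cross_entropy_with_logits(fake_out, torch.zeros_like(fake_out)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: non-saturating variant, i.e., maximize log D(G(z)).
    out = D(G(torch.randn(n, z_dim)))
    g_loss = F.binary_cross_entropy_with_logits(out, torch.ones_like(out))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```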
2.2 Conditional GAN
GAN can be converted into a conditional model by supplying auxiliary information that imposes a condition on the generator and the discriminator [17]. In the conditional GAN model, additional data $y$ (for example, class labels) are fed into both networks so that data generation can be controlled. The loss function of conditional GAN becomes [17]:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))]$$

where $y$ is the auxiliary information on which both the generator and the discriminator are conditioned.
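As an illustration of how the condition enters the networks, the sketch below embeds a class label and concatenates it with the noise vector before generation; the layer sizes and names are hypothetical, not taken from [17].

```python
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    """G(z | y): generation steered by an embedded label y (illustrative sizes)."""

    def __init__(self, z_dim=100, n_classes=10, img_dim=28 * 28):
        super().__init__()
        self.embed = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.Linear(z_dim + n_classes, 256), nn.ReLU(),
            nn.Linear(256, img_dim), nn.Tanh(),
        )

    def forward(self, z, y):
        # Concatenating the label embedding with z conditions the output on y.
        return self.net(torch.cat([z, self.embed(y)], dim=1))
```

The discriminator is conditioned in the same way, by concatenating the label information with its image input.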
2.3 Pix2Pix GAN
The Pix2Pix GAN model is built upon the concept of conditional GAN, and it has become a common platform for various image conversion tasks. The diagram of the Pix2Pix GAN model is given in Figure 3. Pix2Pix GAN consists of a "U-Net" [18] based generator and a "PatchGAN" discriminator [2]. The "U-Net" generator passes low-level information of the input image to the output image, and the "PatchGAN" discriminator helps capture the statistics of local styles. The loss function of Pix2Pix GAN is:

$$G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G)$$

with

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]$$

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[\lVert y - G(x, z) \rVert_1]$$

Pix2Pix GAN learns to map an input image $x$ and a random noise vector $z$ to an output image $y$; the $L1$ term keeps the generated image close to the ground truth, and $\lambda$ balances the adversarial and $L1$ terms.
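A sketch of the generator-side objective is given below, assuming hypothetical `G` and `D` modules in which the PatchGAN discriminator sees the input image concatenated channel-wise with the real or generated output; [2] sets λ = 100 and injects randomness through dropout rather than an explicit noise vector.

```python
import torch
import torch.nn.functional as F

def pix2pix_g_loss(G, D, x, y, lam=100.0):
    """Conditional adversarial term plus weighted L1 term (names illustrative)."""
    fake = G(x)                                     # noise enters via dropout in [2]
    pred = D(torch.cat([x, fake], dim=1))           # PatchGAN map of local decisions
    adv = F.binary_cross_entropy_with_logits(pred, torch.ones_like(pred))
    l1 = F.l1_loss(fake, y)                         # keep output near ground truth
    return adv + lam * l1
```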
2.4 Cycle GAN
In many cases, it is difficult to obtain paired images from different domains. Cycle GAN [16] addressed this challenge by introducing the cycle-consistency loss shown in Figure 4. Cycle GAN contains two generators, G and F, along with two adversarial discriminators, $D_X$ and $D_Y$, where $X$ and $Y$ are the input domain and the target domain, respectively. The discriminator $D_Y$ drives G to translate images from domain X to domain Y, while $D_X$ drives F to translate images from domain Y to domain X. The two mappings $G: X \to Y$ and $F: Y \to X$ are trained jointly and kept consistent by two cycle-consistency losses. The total loss function of Cycle GAN is given by:

$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \mathcal{L}_{cyc}(G, F)$$

where $\lambda$ controls the relative importance of the cycle-consistency term.

There are two kinds of terms in the loss function of Cycle GAN: adversarial losses and cycle-consistency losses. $\mathcal{L}_{GAN}(G, D_Y, X, Y)$ and $\mathcal{L}_{GAN}(F, D_X, Y, X)$ are the adversarial losses for the $G: X \to Y$ and $F: Y \to X$ mappings, respectively, which ensure that the distribution of the generated images is close to the distribution of the target images. The cycle-consistency loss,

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\lVert F(G(x)) - x \rVert_1] + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}[\lVert G(F(y)) - y \rVert_1]$$

ensures that translating an image to the other domain and back reproduces the original image, that is, $F(G(x)) \approx x$ and $G(F(y)) \approx y$.
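The cycle-consistency term itself is straightforward to compute; a minimal sketch with hypothetical generator modules `G` (X → Y) and `F_` (Y → X) follows.

```python
import torch.nn.functional as F

def cycle_loss(G, F_, real_x, real_y):
    """L1 reconstruction error after a round trip through both generators."""
    forward = F.l1_loss(F_(G(real_x)), real_x)    # x -> G(x) -> F(G(x)) ~ x
    backward = F.l1_loss(G(F_(real_y)), real_y)   # y -> F(y) -> G(F(y)) ~ y
    return forward + backward
```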
Cycle GAN has been used in different applications, including season transfer and style transfer [16]. In addition, the cycle-consistency loss mitigates the mode collapse problem that can arise when only the adversarial loss is used [19]. Mode collapse happens when the generator outputs the same image for different inputs. Though other methods [2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 21, 22, 23, 24] can also offer image-to-image translation with unpaired images, Cycle GAN has become a common platform for many image translation tasks.
3. Experimental setups
3.1 Datasets
For training Pix2Pix GAN and Cycle GAN, we used image pairs from the open-source visible and infrared video database provided by the Signal, Multimedia and Telecommunications Laboratory at the Federal University of Rio de Janeiro [25]. The IR and visible-light video pairs in the database are synchronized and registered. We used 80% of the frames in the "Guanabara Bay_take_1" video pair for training and the remaining 20% for testing (a sketch of this split follows Table 1). In addition, we evaluated the trained models on three other video pairs, "Guanabara Bay_take_2", "Camouflage_take_1", and "Camouflage_take_2". Detailed information of the four video pairs is listed in Table 1, and some example pairs are shown in Figure 5.
| Dataset name | Description [25] |
|---|---|
| Guanabara Bay_take_1 | |
| Guanabara Bay_take_2 | |
| Camouflage_take_1 | |
| Camouflage_take_2 | |

Table 1. Detailed information of the video pairs used in our experiments.
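As a concrete illustration of the split described above, the sketch below takes the first 80% of synchronized frame pairs for training and holds out the rest; the directory layout and the contiguous-split choice are assumptions, not details from [25].

```python
import glob

# Hypothetical file layout: one PNG per extracted frame, names sorted in time order.
vis = sorted(glob.glob("GuanabaraBay_take_1/visible/*.png"))
ir = sorted(glob.glob("GuanabaraBay_take_1/infrared/*.png"))
pairs = list(zip(vis, ir))            # synchronized, registered frame pairs

split = int(0.8 * len(pairs))         # 80% train / 20% test
train_pairs, test_pairs = pairs[:split], pairs[split:]
```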
3.2 Performance metrics
3.2.1 Inception score
Inception Score (IS) is widely used for evaluating GANs [26]. IS considers both the quality and the diversity of generated images by evaluating the entropy of the class probability distributions produced by the pre-trained Inception v3 model [27] on the generated data. A large IS indicates high quality of the generated images. One drawback of IS is that it does not consider the real images used for training the GAN model, so it is not clear how the generated images compare to the real training images.
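Given the Inception v3 softmax outputs for the generated images, IS is the exponential of the mean KL divergence between each conditional distribution p(y|x) and the marginal p(y); a minimal NumPy sketch follows (note that [26] additionally averages the score over several splits of the data).

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (n_images, n_classes) softmax outputs of Inception v3."""
    marginal = probs.mean(axis=0, keepdims=True)   # p(y)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))                # exp(E_x[KL(p(y|x) || p(y))])
```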
3.2.2 Frechet inception distance
Fréchet Inception Distance (FID) measures the similarity between two sets of images and is often used for evaluating GANs [28, 29]. FID is the Wasserstein-2 (Fréchet) distance between Gaussian fits to the feature representations of real and fake images computed by the Inception v3 model [27]. We used the coding layer of the Inception model to obtain the feature representation of each image. FID is consistent with human judgment of image quality, and it can also detect intra-class mode collapse. A lower FID score indicates that the two groups of images are similar, and hence that the generated fake images are of high quality.
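Concretely, FID fits a Gaussian to each set of Inception features and evaluates the closed-form Fréchet distance between the two Gaussians; a sketch, assuming `real` and `fake` are arrays of feature vectors (e.g., n × 2048), is given below.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real, fake):
    """||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2))."""
    mu_r, mu_f = real.mean(axis=0), fake.mean(axis=0)
    cov_r = np.cov(real, rowvar=False)
    cov_f = np.cov(fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):      # numerical noise can leave tiny
        covmean = covmean.real        # imaginary parts; discard them
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```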
3.2.3 Kernel inception distance
Kernel Inception Distance (KID) is another metric often used to assess the quality of GAN-generated images relative to real images [30]. KID first uses the Inception v3 model to obtain representations of the generated images. It then calculates the squared maximum mean discrepancy (MMD) between the representations of the real training images and those of the generated images. KID is also consistent with human judgment of image quality; a small KID value indicates high quality of the generated images.
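A sketch of the computation with the cubic polynomial kernel used in [30] follows; `real` and `fake` are again Inception feature arrays, and [30] reports the score averaged over repeated subsets, which this minimal version omits.

```python
import numpy as np

def kid(real, fake):
    """Unbiased squared MMD with kernel k(a, b) = (a.b / d + 1)^3."""
    d = real.shape[1]
    kern = lambda a, b: (a @ b.T / d + 1.0) ** 3
    k_rr, k_ff, k_rf = kern(real, real), kern(fake, fake), kern(real, fake)
    m, n = len(real), len(fake)
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))   # drop diagonal
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))   # for unbiasedness
    return float(term_rr + term_ff - 2.0 * k_rf.mean())
```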
4. Results
4.1 Testing results on “Guanabara Bay_take_1” and “Guanabara Bay_take_2”
We trained Pix2Pix GAN and Cycle GAN on 80% of the frames in the "Guanabara Bay_take_1" video pair and tested the trained models on the remaining 20%. Some of the visible and IR images used for training are shown in Figure 5. After training, we also applied both models to the "Guanabara Bay_take_2" dataset. Figures 6 and 7 show some generated IR images. By visual inspection, Cycle GAN generates better results than Pix2Pix GAN, and the IR images generated by Cycle GAN are similar to the real IR images. Table 2 lists the quantitative performance metrics of the images generated by the two models. Cycle GAN outperforms Pix2Pix GAN on all metrics, including IS, FID, and KID, on this dataset.
| Metric | Model | Guanabara Bay_take_1 | Guanabara Bay_take_2 | Camouflage_take_1 | Camouflage_take_2 |
|---|---|---|---|---|---|
| IS | Pix2Pix GAN | 2.70 | 1.85 | 1.02 | 1.02 |
| IS | Cycle GAN | 2.88 | 3.61 | 2.72 | 2.66 |
| FID | Pix2Pix GAN | 0.90 | 2.33 | 3.64 | 3.35 |
| FID | Cycle GAN | 0.84 | 1.12 | 1.51 | 1.52 |
| KID | Pix2Pix GAN | 4.24 | 24.00 | 48.61 | 43.55 |
| KID | Cycle GAN | 2.42 | 7.10 | 9.13 | 9.15 |

Table 2. Performance metrics of the two models on the four datasets.
4.2 Testing results on “Camouflage_take_1” and “Camouflage_take_2”
We applied the trained models to the "Camouflage_take_1" and "Camouflage_take_2" datasets, and the results are shown in Figures 8 and 9. Neither model generated good quality IR images, as the quantitative metrics in Table 2 confirm, although Cycle GAN is still slightly better than Pix2Pix GAN. One possible reason is that the data in these two sets follow a different distribution from the training data, causing both models to fail.
5. Conclusion
In this chapter, we have investigated visible-to-IR image conversion using Pix2Pix GAN and Cycle GAN. Cycle GAN is a better model than Pix2Pix GAN, and both can generate IR images of good visual quality from visible images if the training and test data are similar. Overall, the IR images generated by Cycle GAN have sharper appearances and better quantitative performance metrics than those generated by Pix2Pix GAN. However, if the test data have a significant distribution shift from the training data, neither model can generate quality IR images. Therefore, our recommendations are: (1) Cycle GAN appears to be the better tool for converting optical images to IR images when the training and test datasets have similar distributions, and (2) both models are sensitive to distribution shift, and additional techniques are needed to address this challenge.
References
- 1. Fahimi F, Dosen S, Ang KK, Mrachacz-Kersting N, Guan C. Generative adversarial networks-based data augmentation for brain–computer interface. IEEE Transactions on Neural Networks and Learning Systems. 2020
- 2. Isola P, Zhu JY, Zhou T, Efros AA. Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. pp. 1125-1134
- 3. Zhao H, Yang H, Su H, Zheng S. Natural image deblurring based on ringing artifacts removal via knowledge-driven gradient distribution priors. IEEE Access. 2020;8:129975-129991
- 4. Su JW, Chu HK, Huang JB. Instance-aware image colorization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. pp. 7968-7977
- 5. Park B, Yu S, Jeong J. Densely connected hierarchical network for image denoising. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2019
- 6. Chen T, Cheng MM, Tan P, Shamir A, Hu SM. Sketch2Photo: Internet image montage. ACM Transactions on Graphics (TOG). 2009;28(5):1-10
- 7. Shi W, Qiao Y. Fast texture synthesis via pseudo optimizer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. pp. 5498-5507
- 8. Anwar S, Barnes N. Real image denoising with feature attention. In: Proceedings of the IEEE International Conference on Computer Vision. 2019. pp. 3155-3164
- 9. Pan L, Dai Y, Liu M. Single image deblurring and camera motion estimation with depth map. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE; 2019. pp. 2116-2125
- 10. Shih Y, Paris S, Durand F, Freeman WT. Data-driven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics (TOG). 2013;32(6):1-11
- 11. Laffont PY, Ren Z, Tao X, Qian C, Hays J. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics (TOG). 2014;33(4):1-11
- 12. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. pp. 3431-3440
- 13. Eigen D, Fergus R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision. 2015. pp. 2650-2658
- 14. Fergus R, Singh B, Hertzmann A, Roweis ST, Freeman WT. Removing camera shake from a single photograph. In: ACM SIGGRAPH 2006 Papers. 2006. pp. 787-794
- 15. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial nets. In: Advances in Neural Information Processing Systems. 2014. pp. 2672-2680
- 16. Zhu JY, Park T, Isola P, Efros AA. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision. 2017. pp. 2223-2232
- 17. Mirza M, Osindero S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. 2014
- 18. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer; 2015. pp. 234-241
- 19. Goodfellow I. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160. 2016
- 20. Eigen D, Fergus R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision. 2015. pp. 2650-2658
- 21. Johnson J, Alahi A, Fei-Fei L. Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision. Cham: Springer; 2016. pp. 694-711
- 22. Wang X, Gupta A. Generative image modeling using style and structure adversarial networks. In: European Conference on Computer Vision. Cham: Springer; 2016. pp. 318-335
- 23. Xie S, Tu Z. Holistically-nested edge detection. In: Proceedings of the IEEE International Conference on Computer Vision. 2015. pp. 1395-1403
- 24. Zhang R, Isola P, Efros AA. Colorful image colorization. In: European Conference on Computer Vision. Cham: Springer; 2016. pp. 649-666
- 25. Ellmauthaler A, Pagliari CL, da Silva EA, Gois JN, Neves SR. A visible-light and infrared video database for performance evaluation of video/image fusion methods. Multidimensional Systems and Signal Processing. 2019;30(1):119-143
- 26. Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X. Improved techniques for training GANs. In: Advances in Neural Information Processing Systems. 2016. pp. 2234-2242
- 27. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. pp. 2818-2826
- 28. Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems. 2017. pp. 6626-6637
- 29. Fréchet M. Sur la distance de deux lois de probabilité. Comptes Rendus Hebdomadaires des Séances de l'Académie des Sciences. 1957;244(6):689-692
- 30. Bińkowski M, Sutherland DJ, Arbel M, Gretton A. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401. 2018