Open access peer-reviewed chapter

Generative Adversarial Networks for Visible to Infrared Video Conversion

Written By

Mohammad Shahab Uddin and Jiang Li

Submitted: 08 June 2020 Reviewed: 03 September 2020 Published: 04 November 2020

DOI: 10.5772/intechopen.93866

From the Edited Volume

Recent Advances in Image Restoration with Applications to Real World Problems

Edited by Chiman Kwan

Abstract

Deep learning models are data driven. For example, the most popular convolutional neural network (CNN) models used for image classification or object detection require large labeled databases for training to achieve competitive performance. This requirement is not difficult to satisfy in the visible domain since many labeled video and image databases are available nowadays. However, given the lower popularity of infrared (IR) cameras, the availability of labeled IR video or image databases is limited. Therefore, training deep learning models in the IR domain is still challenging. In this chapter, we applied the pix2pix generative adversarial network (Pix2Pix GAN) and cycle-consistent GAN (Cycle GAN) models to convert visible videos to infrared videos. The Pix2Pix GAN model requires visible-infrared image pairs for training, while the Cycle GAN relaxes this constraint and requires only unpaired images from both domains. We applied the two models to an open-source database of visible and infrared videos provided by the Signal Multimedia and Telecommunications Laboratory at the Federal University of Rio de Janeiro. We evaluated conversion results with performance metrics including the Inception Score (IS), Frechet Inception Distance (FID) and Kernel Inception Distance (KID). Our experiments suggest that Cycle GAN is more effective than Pix2Pix GAN for generating IR images from optical images.

Keywords

  • image conversion
  • generative adversarial network
  • cycle-consistent loss
  • IR image
  • Pix2Pix
  • cycle GAN

1. Introduction

Image-to-image conversion, such as data augmentation [1] or style transfer [2], has been applied in many recent computer vision applications. Traditional image conversion models have been investigated for specific applications [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. The creation of the GAN model [15] opened a new door for training generative models for image conversion. For example, computer vision researchers have successfully developed GAN models for day-to-night and sketch-to-photograph image conversions [16]. Two recent popular models that can perform image-to-image translation are Pix2Pix GAN [2] and Cycle GAN [16]. Pix2Pix GAN needs paired images for training, whereas Cycle GAN relaxes this constraint and can be trained with unpaired images. In practice, paired images from different domains are often difficult to obtain. Therefore, Cycle GAN is a better choice for image-to-image translation when paired images are not available.

IR image datasets are not as widely available as optical image datasets. As a result, we face a shortage of data when training models for object detection in the IR domain. This problem can be mitigated by using the Cycle GAN model to convert labeled optical images to IR images. In this chapter, we evaluate two models, Pix2Pix GAN and Cycle GAN, for image conversion from the optical domain to the IR domain. We used four different datasets to perform the conversion and three metrics, including Inception Score (IS), Frechet Inception Distance (FID) and Kernel Inception Distance (KID), to assess the quality of the converted IR images.

2. Image to image conversion models

2.1 Generative adversarial network

GAN consists of one generative model and one discriminative model to generate images from noise, as shown in Figure 1. The generator “G” tries to generate images from the input noise “z” that are as realistic as possible to mislead the discriminator “D,” whereas “D” is trained to discriminate the fake image “G(z)” from the real one “x.” During training, errors at the output of “D” are backpropagated to update the parameters in “G” and “D,” and the following loss function is optimized [15]:

Figure 1.

Structure of generative adversarial network.

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]    (1)

where x and z represent training data and input noise, respectively, and pdata(x) and pz(z) are the distributions of the training data and input noise. The discriminator “D” is trained to assign a low probability of being real to the generated fake image so that it can correctly label “G(z)” and “x” in Figure 1. The generator “G” is trained to maximize D(G(z)), or equivalently to minimize log(1 − D(G(z))) in Eq. (1), thereby generating realistic images. Essentially, the generator learns the distribution of the real data given by the training dataset. Once this goal is achieved, the generator can produce realistic images by sampling from the learned probability distribution.
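To make the alternating optimization of Eq. (1) concrete, the following is a minimal PyTorch-style sketch of one training step. The fully connected networks, dimensions and learning rates are illustrative placeholders, not the architectures used in this chapter.

import torch
import torch.nn as nn

# Illustrative sketch of one GAN training step for Eq. (1); networks and
# hyperparameters are placeholders only.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(x_real):
    batch = x_real.size(0)
    z = torch.randn(batch, 100)

    # Discriminator step: push D(x) toward 1 and D(G(z)) toward 0
    opt_d.zero_grad()
    d_loss = bce(D(x_real), torch.ones(batch, 1)) + \
             bce(D(G(z).detach()), torch.zeros(batch, 1))
    d_loss.backward()
    opt_d.step()

    # Generator step: push D(G(z)) toward 1 (the non-saturating form of
    # minimizing log(1 - D(G(z))))
    opt_g.zero_grad()
    g_loss = bce(D(G(z)), torch.ones(batch, 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()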

2.2 Conditional GAN

GAN can be converted into a conditional model by providing auxiliary information that imposes a condition on the generator and discriminator [17]. In the conditional GAN model, additional data are fed into the generator and discriminator so that data generation can be controlled. The loss function of conditional GAN becomes [17]:

\min_G \max_D V(D, G) = \mathbb{E}_{y \sim p_{data}(y)}[\log D(y \mid x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid x)))]    (2)

where y and z are training data and input noise, respectively. The input noise z, combined with the extra information x, generates the output G(z|x). Figure 2 shows the diagram of conditional GAN.

Figure 2.

Architecture of conditional GAN. Extra information x is given to both G and D. The discriminator is trained to distinguish between real and fake images, while the generator is trained to fool the discriminator by generating images similar to real ones. Here both G and D receive x as input.
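A minimal sketch of how the conditioning in Eq. (2) can be realized, assuming the extra information x is a one-hot class label that is simply concatenated to the inputs of both G and D; the dimensions and architectures below are hypothetical.

import torch
import torch.nn as nn

# Sketch of conditioning in Eq. (2): x (a one-hot label of size 10) is
# concatenated to the inputs of both G and D. Sizes are illustrative.
noise_dim, cond_dim, img_dim = 100, 10, 784

G = nn.Sequential(nn.Linear(noise_dim + cond_dim, 256), nn.ReLU(),
                  nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim + cond_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

z = torch.randn(8, noise_dim)                                             # input noise z
x = nn.functional.one_hot(torch.randint(0, 10, (8,)), cond_dim).float()   # condition x

fake_y = G(torch.cat([z, x], dim=1))        # G(z|x)
score = D(torch.cat([fake_y, x], dim=1))    # D(G(z|x)|x)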

2.3 Pix2Pix GAN

The Pix2Pix GAN model is built upon the concept of conditional GAN, and it has become a common platform for various image conversion tasks. The diagram of the Pix2Pix GAN model is given in Figure 3. Pix2Pix GAN consists of a “U-Net” [18] based generator and a “PatchGAN” discriminator [2]. The “U-Net” generator passes low-level information from the input image to the output image, and the “PatchGAN” discriminator helps capture the statistics of local styles. The loss function of Pix2Pix GAN is:

Figure 3.

Block diagram of Pix2Pix GAN.

\min_G \max_D V(D, G) = \mathbb{E}_{x, y}[\log D(x, y)] + \mathbb{E}_{x, z}[\log(1 - D(x, G(x, z)))] + \lambda \, \mathbb{E}_{x, y, z}[\| y - G(x, z) \|_1]    (3)

Pix2Pix GAN learns to map an input image x and random noise z to an output image y. The generator tries to minimize the loss function while the discriminator tries to maximize it. The L1 loss between the real image and the fake one, weighted by λ, is included to achieve pixel-level matching. Pix2Pix GAN has been applied to many applications including edges-to-photo conversion, sketch-to-photo conversion, map-to-aerial-photo conversion, etc. The main drawback of Pix2Pix GAN is that it needs paired images from both domains for training, which is not always possible in practice.
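A sketch of the generator side of Eq. (3) in PyTorch is given below. The names `generator` and `discriminator` stand in for the U-Net and PatchGAN networks described above and are not defined here; in the original Pix2Pix formulation the noise z is supplied implicitly through dropout, so it does not appear as an explicit input. λ = 100 is the weight used in the Pix2Pix paper [2].

import torch
import torch.nn as nn

# Sketch of the Pix2Pix generator objective in Eq. (3): a conditional
# adversarial term plus a weighted L1 term. The PatchGAN is assumed to
# output raw per-patch scores (logits) on the concatenated input/output pair.
bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()
lam = 100.0  # weight on the L1 term (value from the Pix2Pix paper)

def generator_loss(discriminator, generator, x_visible, y_ir):
    fake_ir = generator(x_visible)                           # G(x, z); z implicit via dropout
    patch_scores = discriminator(torch.cat([x_visible, fake_ir], dim=1))
    adv = bce(patch_scores, torch.ones_like(patch_scores))   # try to fool the PatchGAN
    return adv + lam * l1(fake_ir, y_ir)                     # pixel-level matching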

2.4 Cycle GAN

In many cases, it is difficult to get paired images from different domains. Cycle GAN [16] addressed this challenge by introducing the cycle-consistent loss function, as shown in Figure 4. There are two generators, G and F, in Cycle GAN, along with two adversarial discriminators, DX and DY. X and Y are the input domain and the target domain, respectively. DY encourages G to translate images from domain X to domain Y, while DX encourages F to translate images from domain Y to domain X. G: X → Y and F: Y → X are the two mappings trained in Cycle GAN, and they are kept consistent by two cycle-consistency losses. The total loss function of Cycle GAN is given by:

Figure 4.

Overall architecture of cycle GAN.

\min_{G, F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \, \mathcal{L}_{cyc}(G, F)    (4)

where

\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))]    (5)
\mathcal{L}_{GAN}(F, D_X, Y, X) = \mathbb{E}_{x \sim p_{data}(x)}[\log D_X(x)] + \mathbb{E}_{y \sim p_{data}(y)}[\log(1 - D_X(F(y)))]    (6)
\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\| F(G(x)) - x \|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\| G(F(y)) - y \|_1]    (7)

There are two kinds of terms in the loss function of Cycle GAN: adversarial losses and cycle-consistency losses. LGAN(G, DY, X, Y) and LGAN(F, DX, Y, X) are the adversarial losses for the G: X → Y and F: Y → X mappings, respectively, which ensure that the distribution of generated images is close to the distribution of target images. The cycle-consistency loss, Lcyc(G, F), ensures that the two mappings do not contradict each other, i.e., F(G(x)) ≈ x and G(F(y)) ≈ y. λ is a weight controlling the balance between the two categories of losses.
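A short PyTorch sketch of the cycle-consistency term in Eq. (7) follows; G and F are placeholders for the two generators (visible-to-IR and IR-to-visible), and λ = 10 is the default weight used in the Cycle GAN paper [16].

import torch.nn as nn

# Sketch of Eq. (7): translating to the other domain and back should
# reproduce the input image.
l1 = nn.L1Loss()

def cycle_loss(G, F, x_visible, y_ir, lam=10.0):
    forward_cycle = l1(F(G(x_visible)), x_visible)   # x -> G(x) -> F(G(x)) ≈ x
    backward_cycle = l1(G(F(y_ir)), y_ir)            # y -> F(y) -> G(F(y)) ≈ y
    return lam * (forward_cycle + backward_cycle)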

Cycle GAN has been used in different applications including season transfer, style transfer, etc. [16]. In addition, the cycle-consistency loss helps Cycle GAN avoid the mode collapse problem that can occur when only the adversarial loss is used [19]. Mode collapse happens when the generator outputs the same image for different inputs. Though other methods [2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 21, 22, 23, 24] can also perform image-to-image translation with unpaired images, Cycle GAN has become a common platform for many image translation tasks.

3. Experimental setups

3.1 Datasets

For training Pix2Pix GAN and Cycle GAN, we used image pairs from the open-source visible and infrared video database of the Signal Multimedia and Telecommunications Laboratory at the Federal University of Rio de Janeiro [25]. The IR and visible-light video pairs in the database are synchronized and registered. We utilized 80% of the frames in the “Guanabara Bay_take_1” video pair for training and the remaining 20% of the frames for testing; a simple frame-split sketch is given after Table 1. In addition, we evaluated the trained models on three other video pairs named “Guanabara Bay_take_2”, “Camouflage_take_1” and “Camouflage_take_2”. Detailed information about the four video pairs is listed in Table 1, and some example pairs are shown in Figure 5.

Dataset name and description [25]
Guanabara Bay_take_1
  • Contains scenes of “the Guanabara Bay and the Rio de Janeiro-Niteroi bridge”.

  • Taken during nighttime.

  • Contains 1 scene plane at approximately 500 m distance.

Guanabara Bay_take_2
  • Contains scenes of “the Guanabara Bay and the Rio de Janeiro-Niteroi bridge”.

  • Taken during nighttime.

  • Contains 1 scene plane at approximately 500 m distance.

Camouflage_take_1
  • Contains outdoor scenes.

  • Taken during bright sunlight.

  • Contains 2 scene planes at approximately 10 m and 300 m distances.

  • Contains people who are hiding behind vegetation.

Camouflage_take_2
  • Contains outdoor scenes.

  • Taken during bright sunlight.

  • Contains 2 scene planes at approximately 10 m and 300 m distances.

  • Contains people who are hiding behind vegetation.

Table 1.

Detailed information of video pairs used in our experiments.
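As a concrete illustration of the 80/20 frame split described in Section 3.1, the following sketch reads a video with OpenCV and keeps the first 80% of its frames for training. The file name is a placeholder, and the actual split used in the experiments may differ (for example, it could be random rather than contiguous).

import cv2

# Hypothetical sketch of an 80/20 frame split; the file name and database
# layout are assumptions, not the chapter's exact procedure.
def split_frames(video_path, train_ratio=0.8):
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    n_train = int(len(frames) * train_ratio)
    return frames[:n_train], frames[n_train:]   # first 80% train, last 20% test

train_frames, test_frames = split_frames("guanabara_bay_take_1_visible.avi")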

Figure 5.

Visible-IR images from Guanabara Bay_take_1 video pair used for training Pix2Pix GAN and cycle GAN models. (a) Visible images. (b) IR images.

3.2 Performance metrics

3.2.1 Inception score

Inception score (IS) is widely used for evaluating GANs [26]. IS considers both the quality and the diversity of generated images by evaluating the entropy of the class probability distributions produced by the pre-trained “Inception v3” model [27] on the generated data. A large inception score indicates high quality of the generated images. One drawback of the inception score is that it does not consider information in the real images used for training the GAN model. Therefore, it is not clear how the generated images compare to the real training images.
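A compact NumPy sketch of the IS computation is shown below, assuming a matrix of Inception v3 class probabilities for the generated images has already been obtained; the feature extraction step is omitted.

import numpy as np

# Sketch of the Inception Score from p(y|x) probabilities of shape
# (num_images, num_classes); IS = exp(mean KL(p(y|x) || p(y))).
def inception_score(probs, eps=1e-12):
    p_y = probs.mean(axis=0, keepdims=True)                   # marginal p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))    # per-image KL terms
    return float(np.exp(kl.sum(axis=1).mean()))               # exp of the mean KL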

3.2.2 Frechet inception distance

Frechet Inception Distance (FID) measures the similarity between two sets of images and is often used for evaluating GANs [28, 29]. FID is the Wasserstein-2 distance between feature representations of real and fake images computed by the Inception v3 model [27]. We used the coding layer of the Inception model to obtain the feature representation of each image. FID is consistent with human judgment of image quality, and it can also detect intra-class mode collapse. A lower FID score indicates that the two groups of images are similar and thus that the generated fake images are of high quality.
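A NumPy/SciPy sketch of the FID computation between two sets of Inception v3 features (shape: number of images by feature dimension) follows; feature extraction is again omitted.

import numpy as np
from scipy.linalg import sqrtm

# Sketch of FID = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2)).
def fid(real_feats, fake_feats):
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):      # drop tiny imaginary parts from numerical error
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2.0 * covmean))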

3.2.3 Kernel inception distance

Kernel Inception Distance (KID) is another metric often used to assess the quality of GAN-generated images relative to real images [30]. KID first uses the Inception v3 model to obtain feature representations of the real and generated images. It then calculates the squared maximum mean discrepancy (MMD) between the representations of the real training images and the generated images. The KID score is also consistent with human judgment of image quality. A small KID value indicates high quality of the generated images.
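A NumPy sketch of KID as the unbiased squared MMD with the polynomial kernel of Bińkowski et al. [30] is given below; in practice the estimate is usually averaged over several randomly sampled feature subsets, which is omitted here.

import numpy as np

# Sketch of KID: unbiased squared MMD between Inception v3 feature sets,
# using the polynomial kernel k(a, b) = (a.b / d + 1)^3.
def polynomial_kernel(a, b):
    d = a.shape[1]
    return (a @ b.T / d + 1.0) ** 3

def kid(real_feats, fake_feats):
    k_rr = polynomial_kernel(real_feats, real_feats)
    k_ff = polynomial_kernel(fake_feats, fake_feats)
    k_rf = polynomial_kernel(real_feats, fake_feats)
    m, n = real_feats.shape[0], fake_feats.shape[0]
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))   # exclude diagonal (unbiased)
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
    return float(term_rr + term_ff - 2.0 * k_rf.mean())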

4. Results

4.1 Testing results on “Guanabara Bay_take_1” and “Guanabara Bay_take_2”

We trained the Pix2Pix GAN and Cycle GAN models on 80% of the frames in the “Guanabara Bay_take_1” video pair and tested the trained models on the remaining 20% of the frames. Some visible and IR images used for training are shown in Figure 5. After training, we also applied both models to the “Guanabara Bay_take_2” dataset. Figures 6 and 7 show some of the generated IR images. By visual inspection, Cycle GAN generates better results than Pix2Pix GAN. In addition, we observe that the IR images generated by Cycle GAN are similar to the real IR images. Table 2 lists the quantitative performance metrics of the images generated by the two models. Cycle GAN outperforms Pix2Pix GAN in terms of all the metrics, including IS, FID and KID, on this dataset.

Figure 6.

Fake IR images generated by Pix2Pix GAN and cycle GAN from the visible images in the Guanabara Bay_take_1 dataset. (a) Generated IR images by Pix2Pix GAN. (b) Generated IR images by cycle GAN.

Metric   Model          Guanabara Bay_take_1   Guanabara Bay_take_2   Camouflage_take_1   Camouflage_take_2
IS       Pix2Pix GAN    2.70                   1.85                   1.02                1.02
         Cycle GAN      2.88                   3.61                   2.72                2.66
FID      Pix2Pix GAN    0.90                   2.33                   3.64                3.35
         Cycle GAN      0.84                   1.12                   1.51                1.52
KID      Pix2Pix GAN    4.24                   24.00                  48.61               43.55
         Cycle GAN      2.42                   7.10                   9.13                9.15

Table 2.

Evaluation metrics on generated IR images of different datasets using Pix2Pix GAN and cycle GAN.

Figure 7.

Fake IR images generated by Pix2Pix GAN and cycle GAN from the visible images of Guanabara Bay_take_2 dataset. (a) Generated IR images by Pix2Pix GAN. (b) Generated IR images by cycle GAN.

4.2 Testing results on “Camouflage_take_1” and “Camouflage_take_2”

We applied the trained models to the “Camouflage_take_1” and “Camouflage_take_2” datasets, and the results are shown in Figures 8 and 9. Neither model generated good quality IR images, as reflected by the quantitative metrics in Table 2, though Cycle GAN is still slightly better than Pix2Pix GAN. One possible reason is that the data in these two sets follow a distribution different from that of the training data, causing both models to fail.

Figure 8.

Fake IR images generated by Pix2Pix GAN and cycle GAN from the visible images of Camouflage_take_1 dataset. (a) Generated IR images by Pix2Pix GAN. (b) Generated IR images by cycle GAN.

Figure 9.

Fake IR images generated by Pix2Pix GAN and cycle GAN from the visible images of Camouflage_take_2 dataset. (a) Generated IR images by Pix2Pix GAN. (b) Generated IR images by cycle GAN.

5. Conclusion

In this chapter, we have investigated visible-to-IR image conversion using Pix2Pix GAN and Cycle GAN. Both models can generate IR images of good visual quality from visible images when the training and test data are similar, and Cycle GAN is the better of the two. Overall, IR images generated by Cycle GAN have sharper appearances and better quantitative performance metrics than those generated by Pix2Pix GAN. However, if the test data have a significant distribution shift compared to the training data, neither model can generate quality IR images. Therefore, our recommendations are: 1) Cycle GAN appears to be the better tool for converting optical images to IR images when the training and test datasets have similar distributions, and 2) both models are sensitive to distribution shift, and additional techniques are needed to address this challenge.

References

  1. Fahimi F, Dosen S, Ang KK, Mrachacz-Kersting N, Guan C. Generative adversarial networks-based data augmentation for brain-computer interface. IEEE Transactions on Neural Networks and Learning Systems. 2020
  2. Isola P, Zhu JY, Zhou T, Efros AA. Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. pp. 1125-1134
  3. Zhao H, Yang H, Su H, Zheng S. Natural image deblurring based on ringing artifacts removal via knowledge-driven gradient distribution priors. IEEE Access. 2020;8:129975-129991
  4. Su JW, Chu HK, Huang JB. Instance-aware image colorization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. pp. 7968-7977
  5. Park B, Yu S, Jeong J. Densely connected hierarchical network for image denoising. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2019
  6. Chen T, Cheng MM, Tan P, Shamir A, Hu SM. Sketch2Photo: Internet image montage. ACM Transactions on Graphics (TOG). 2009;28(5):1-0
  7. Shi W, Qiao Y. Fast texture synthesis via pseudo optimizer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. pp. 5498-5507
  8. Anwar S, Barnes N. Real image denoising with feature attention. In: Proceedings of the IEEE International Conference on Computer Vision. 2019. pp. 3155-3164
  9. Pan L, Dai Y, Liu M. Single image deblurring and camera motion estimation with depth map. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE; 2019. pp. 2116-2125
  10. Shih Y, Paris S, Durand F, Freeman WT. Data-driven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics (TOG). 2013;32(6):1-1
  11. Laffont PY, Ren Z, Tao X, Qian C, Hays J. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics (TOG). 2014;33(4):1-1
  12. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. pp. 3431-3440
  13. Eigen D, Fergus R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision. 2015. pp. 2650-2658
  14. Fergus R, Singh B, Hertzmann A, Roweis ST, Freeman WT. Removing camera shake from a single photograph. In: ACM SIGGRAPH 2006 Papers. 2006. pp. 787-794
  15. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial nets. In: Advances in Neural Information Processing Systems. 2014. pp. 2672-2680
  16. Zhu JY, Park T, Isola P, Efros AA. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision. 2017. pp. 2223-2232
  17. Mirza M, Osindero S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. 2014
  18. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer; 2015. pp. 234-241
  19. Goodfellow I. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160. 2016
  20. Eigen D, Fergus R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision. 2015. pp. 2650-2658
  21. Johnson J, Alahi A, Fei-Fei L. Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision. Cham: Springer; 2016. pp. 694-711
  22. Wang X, Gupta A. Generative image modeling using style and structure adversarial networks. In: European Conference on Computer Vision. Cham: Springer; 2016. pp. 318-335
  23. Xie S, Tu Z. Holistically-nested edge detection. In: Proceedings of the IEEE International Conference on Computer Vision. 2015. pp. 1395-1403
  24. Zhang R, Isola P, Efros AA. Colorful image colorization. In: European Conference on Computer Vision. Cham: Springer; 2016. pp. 649-666
  25. Ellmauthaler A, Pagliari CL, da Silva EA, Gois JN, Neves SR. A visible-light and infrared video database for performance evaluation of video/image fusion methods. Multidimensional Systems and Signal Processing. 2019;30(1):119-143
  26. Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X. Improved techniques for training GANs. In: Advances in Neural Information Processing Systems. 2016. pp. 2234-2242
  27. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. pp. 2818-2826
  28. Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems. 2017. pp. 6626-6637
  29. Fréchet M. Sur la distance de deux lois de probabilité. Comptes Rendus Hebdomadaires des Séances de l'Académie des Sciences. 1957;244(6):689-692
  30. Bińkowski M, Sutherland DJ, Arbel M, Gretton A. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401. 2018
