Autoencoder model and classification model.
We propose in this chapter a deep learning-based recommendation system for aesthetic surgery, comprising a mobile application and a deep learning model. The deep learning model, built on a dataset of before- and after-surgery facial images, estimates the probability of perfection of certain parts of a face. In this study, we focus on the two most popular treatments: rejuvenation treatment and eye double-fold surgery. It is assumed that the outcomes of our past surgeries are perfect. Firstly, a convolutional autoencoder is trained on eye images captured before and after surgery from various angles. The trained encoder is used to extract the learned generic eye features. Secondly, the encoder is further trained on pairs of image samples, captured before and after surgery, to predict the probability of perfection, the so-called perfection score. Based on this score, the system suggests whether specific aesthetic surgeries should be performed. We preliminarily achieve 88.9% and 93.1% accuracy on rejuvenation treatment and eye double-fold surgery, respectively.
- aesthetic surgery
- rejuvenation treatment
- eye double-fold surgery
- recommendation system
- convolutional neural network
Plastic surgery is a surgical specialty relating to restoration, reconstruction, or alteration of the human body. There are two major categories: (1) reconstructive surgery and (2) aesthetic or cosmetic surgery. The former is intended to correct dysfunctional areas of the body and is reconstructive in nature. Examples of this kind include breast reconstruction, burn repair surgery, congenital defect repair, lower extremity reconstruction, hand surgery, scar revision surgery, etc. The latter focuses on enhancing the appearance of the patient. Improvements in aesthetic appeal, symmetry, and proportion are among the key goals. The scope of aesthetic surgery procedures includes breast enhancement, facial contouring, facial rejuvenation, body contouring, skin rejuvenation, etc. This chapter is restricted to the latter category, aesthetic surgery.
In aesthetic surgery, the treated areas already function properly; the surgery is optional and depends on the wishes of patients who care about their appearance. The sense of beauty also varies across geographical areas and sometimes follows local or global fashion trends. Therefore, consultation with experienced doctors is extremely important in aesthetic surgery. The severe problem of population aging in developed countries leads to a shortage of highly skilled labor in almost all industrial sectors. This chapter therefore proposes a deep learning-based aesthetic surgery recommendation system, aiming to preserve the valuable know-how of experienced doctors for consulting patients in aesthetic surgery. Moreover, the continuous learning capability of the AI model allows it to update itself with newly fashionable know-how in this field, given a rich set of training data.
Although aesthetic surgery can be performed on all areas of the head, neck, and body, this chapter focuses on the facial area. We consider the most popular treatments for facial areas, namely rejuvenation and eye double-fold surgery. In order to build a deep learning system capable of predicting the perfection of aesthetic surgery, we collect an in-house training dataset composed of pairs of images capturing the eye area of the same person before and after aesthetic surgery. It is assumed that the beauty of facial areas after surgery is perfect, that is, the know-how of an aesthetic surgeon is embedded in these pairs of images.
In order to preserve the know-how of experienced aesthetic surgeons, we propose to train a deep neural network on the pairs of images in our in-house dataset. Among the various neural network architectures proposed in the literature, convolutional neural networks (CNN) have demonstrated outstanding performance in image recognition. AlexNet was the first large and deep CNN to achieve record-breaking results on a highly challenging image recognition dataset, with a margin of more than 10% over the second-best method, which relied on handcrafted features. Even though the performance of AlexNet is still far from that of the inferotemporal pathway of the human visual system, it paved the way for successor models such as Inception, VGG, and ResNet. The convolutional layers learn from data to extract a rich set of features for a variety of purposes such as image classification and recognition, visual tracking, face recognition, object detection, person reidentification, etc. The power of CNNs comes from a learning mechanism in which the weights of the convolutional filters are adapted to fit the labels. Good generalization, that is, outstanding performance on unseen data, depends on the availability of a huge dataset.
However, our in-house dataset is not large enough to guarantee the generalization of a CNN for this task. Therefore, we propose to use convolutional autoencoder neural networks to overcome the limitation of our small dataset. The network is first trained in a layer-wise manner to reconstruct the input images at the output layer. This training is completely unsupervised. After the convolutional autoencoder is trained, the decoder part is removed. Only the encoder part is kept, and it is connected to fully connected layers. The whole network is then trained on images and their labels, before and after surgery. The weights of the encoder part are kept intact because the encoder has already learned the key features of the training image set. As a result, our proposed model is able to achieve 88.9% and 93.1% accuracy on rejuvenation treatment and eye double-fold surgery, respectively.
The rest of this chapter is organized as follows. Section 2 describes our contribution against the backdrop of related work. The proposed method is presented in detail in Section 3, and the experiments are reported in Section 4. Finally, we conclude the chapter and delineate future work.
2. Related work
The number of aesthetic surgeries, particularly for facial areas, has surged drastically in recent years. This trend could spread even further in the next few years owing to the falling average cost of such treatments and the desire for beautification. Numerous research works have been proposed in the literature, especially in the computer vision community, to address the challenges posed by aesthetic surgery. These works generally fall into three categories, namely, skin quality inspection, face recognition after surgery, and surgery planning and recommendation.
Aesthetic surgery in facial areas, performed to correct facial feature anomalies and to improve beauty, generally alters the original facial information. This poses a significant challenge for face recognition algorithms. The majority of the methods proposed in the literature focus on advances in handcrafted features. Richa et al. investigated the effects of aesthetic surgery on face recognition. Amal et al. proposed a face recognition system based on LBP and GIST descriptors to address this problem. Maria et al. combined two methods, face recognition against occlusions and expression variations (FARO) and face analysis for commercial entities (FACE), with a split-face architecture that processes each face region as a separate biometric to deal with the effects of plastic surgery. For a comprehensive survey of face recognition algorithms against variations due to aesthetic surgery, please refer to .
Skin quality inspection and assessment is also a potential application of computer vision and deep learning methods. Assessing skin quality can help the aesthetic surgeon recommend certain kinds of operation to enhance the beauty of the face. Surface roughness, wrinkle depth, volume, and epidermal thickness of the skin are quantitatively computed by applying a deep learning method to images captured by optical coherence tomography. Facial skin patches are classified as normal, spots, or wrinkles by using convolutional neural networks. Batool and Chellappa proposed a method to model wrinkles as texture features or curvilinear objects, the so-called aging skin texture, for facial aging analysis. They reviewed commonly used image features that capture the intensity gradients caused by facial wrinkles, such as Laplacian of Gaussian, Hessian filter, steerable filter bank, Gabor filter bank, active appearance model, and local binary patterns.
In the last category, aesthetic surgery planning and recommendation, facial beauty prediction is the first step in assessing whether or not a person should undergo aesthetic surgery. Yikui et al. described BeautyNet, in which a multiscale CNN is employed to obtain deep features characterizing facial beauty and is combined with a transfer learning strategy to alleviate overfitting and achieve robust performance on unconstrained faces. Lu et al. transferred rich deep features from a VGG16 model pretrained on a face verification task to a Bayesian ridge regression algorithm for predicting facial beauty.
Going beyond facial beauty prediction, Arakawa and Nomoto removed wrinkles and spots while preserving natural skin roughness by using a bank of nonlinear filters for facial beautification. In another approach, 84 facial landmark points are represented as a vector of 234 normalized lengths, which is compared with the vectors of beautiful faces to suggest how to warp the triangulation of the original face toward that of the beautiful ones. Bottino et al. presented a quantitative approach to automatically recommend effective patient-specific improvements of facial attractiveness by comparing the face of the patient with a large database of attractive faces. Simulations are performed by applying features of similar attractive faces to the patient's face with a suitable morphing of the facial shape.
Our research differs from the related works above in two respects. Firstly, a convolutional autoencoder is employed to learn rich features that characterize both unattractive and beautiful faces in an unsupervised manner, rather than under supervised learning [18, 19]. The learned features are more discriminative than handcrafted features [9–13, 15–17]. Secondly, the proposed deep learning framework facilitates a holistic approach to identifying which kinds of facial treatment should be performed to enhance attractiveness, rather than predicting a beauty score [20–22].
The method is divided into two main steps, namely, training the feature extractor and training the classification model, as shown in Figure 1. The dataset is first used to train the feature extraction model, which is based on a convolutional autoencoder. In this step, the convolutional autoencoder learns to encode the input images into a set of features and then tries to reconstruct the input from them. Thus, it learns the features of the input data by minimizing the reconstruction error. We then extract the encoder and use it as the feature extractor.
Figure 2 illustrates the training of the feature extractor. In the first step, the model is trained only to learn filters that extract features from which the input can be reconstructed. The filters in the encoder are then extracted and used as the feature extractor for the classification model, which is a fully connected layer.
3.1 Feature extractor
Autoencoders are well-known unsupervised learning algorithms whose original purpose is to find latent lower-dimensional state spaces of datasets, but they are also capable of solving other problems, such as image denoising, enhancement, or colorization. The main idea behind autoencoders is to reduce the input into a latent state space with lower dimensions and then try to reconstruct the input from this representation. Thus the autoencoder uses its input as the reference for its output in the learning phase. The part that reduces the input and the part that reconstructs it are called the encoder and the decoder, respectively. By reducing the number of variables that represent the data, we force the model to learn how to keep only meaningful information, from which the input can be reconstructed. It can also be viewed as a compression technique, as shown in Figure 2.
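A conventional autoencoder is composed of two layers, corresponding to the encoder and the decoder. It aims to find a code for each input that minimizes the difference between the input $x_i$ and the output $\tilde{x}_i$ over all $N$ samples:

$$\min_{W, W'} \frac{1}{N} \sum_{i=1}^{N} \left\lVert x_i - \tilde{x}_i \right\rVert_2^2, \qquad \tilde{x}_i = g_{W'}\left(f_W(x_i)\right).$$

In the fully connected autoencoder,

$$h = f_W(x) = \sigma(Wx), \qquad \tilde{x} = g_{W'}(h) = \sigma(W'h),$$

where x and h are vectors, W and W' are the learned weights of the encoder and the decoder, and σ is the activation function. After learning, the embedded vector h is a unique representation of the input. In our application, the convolutional autoencoders (CAE) are defined as

$$h = f_W(x) = \sigma(x * W), \qquad \tilde{x} = g_{W'}(h) = \sigma(h * W'),$$

where x and h are matrices or tensors and “*” is the convolution operator.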
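To make the encoder and decoder roles concrete, the following is a minimal sketch of a fully connected autoencoder forward pass and its reconstruction error. It is an illustration only: the sigmoid activation, the mean squared error as the measure of difference, the layer sizes, and the random weights are all assumptions and are not taken from the chapter's models.

```python
import math
import random


def sigmoid(z):
    """Logistic activation, used here as an assumed choice of sigma."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(weights, vector):
    """Compute sigma(W v) for a weight matrix stored as a list of rows."""
    return [sigmoid(sum(w * v for w, v in zip(row, vector))) for row in weights]


def reconstruction_error(samples, w_enc, w_dec):
    """Mean squared difference between each input and its reconstruction."""
    total = 0.0
    for x in samples:
        h = dense(w_enc, x)      # encoder: code h from input x
        x_hat = dense(w_dec, h)  # decoder: reconstruction from code h
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total / len(samples)


random.seed(0)
input_dim, code_dim = 8, 3  # illustrative sizes only
w_enc = [[random.uniform(-0.5, 0.5) for _ in range(input_dim)]
         for _ in range(code_dim)]
w_dec = [[random.uniform(-0.5, 0.5) for _ in range(code_dim)]
         for _ in range(input_dim)]
samples = [[random.random() for _ in range(input_dim)] for _ in range(5)]

code = dense(w_enc, samples[0])
error = reconstruction_error(samples, w_enc, w_dec)
print(len(code), error)
```

Training would adjust `w_enc` and `w_dec` to lower `error`; a convolutional variant replaces the matrix products with convolutions but keeps the same encode, decode, and compare structure.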
We propose a CAE-based feature extraction method that learns generic features of the face. The encoder serves as the feature extractor that encodes an image into a vector h representing the image of the facial part, for example, the eyes.
Table 1 shows the model structures used in our experiments. We tried four models, which differ in the number of layers and in the dimension of the embedded vector. The number of layers in the autoencoders is either 3 or 4, while the dimension of the embedded vector is either 32 or 64.
3.2 Perfection prediction system
After the autoencoder is trained, its encoder serves as the feature extractor, as shown in Figure 2. In this step, we apply transfer learning to reuse the well-learned filters for facial part feature extraction. We try two different lengths of the embedding vector, namely, 32 and 64, as shown in Table 1. We also try both freezing and retraining the extractor.
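The reuse of the encoder can be sketched as follows: the decoder is dropped, and a small fully connected head maps the embedding to a single score. This is a simplified illustration with a dense (not convolutional) encoder; the weights, the sizes, and the sigmoid output are hypothetical choices made only for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode(w_enc, x):
    """Feature extractor: the encoder kept from the trained autoencoder."""
    return [sigmoid(sum(w * v for w, v in zip(row, x))) for row in w_enc]


def perfection_score(w_enc, head_w, head_b, x):
    """Fully connected head on top of the embedding, squashed into (0, 1)."""
    h = encode(w_enc, x)
    return sigmoid(sum(w * v for w, v in zip(head_w, h)) + head_b)


# Hypothetical pretrained encoder weights: 4-dim input, 2-dim embedding.
w_enc = [[0.2, -0.1, 0.4, 0.0],
         [-0.3, 0.5, 0.1, 0.2]]
# Hypothetical classification head, trained in the second step.
head_w = [0.7, -0.4]
head_b = 0.1

score = perfection_score(w_enc, head_w, head_b, [0.5, 0.1, 0.9, 0.3])
print(score)
```

Freezing the extractor corresponds to updating only `head_w` and `head_b` during the second step, while retraining also updates `w_enc`.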
This model predicts the probability of perfection; its output is a single value in the interval [0, 1] that reflects this probability. The probability of perfection is defined as follows:
If the face is perfect (no surgery is needed), the probability is 1.
If the face is not perfect (surgery is needed), the probability is 0.
However, in a real situation, we cannot obtain a large dataset of perfect and non-perfect faces. Thus, we assume that the outcome of surgery is perfect (value is 1) and that the original face is not perfect (value is 0).
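Under this assumption, each before/after pair yields two labeled training samples. A minimal sketch of that labeling step is given below; the file names and the `label_pairs` helper are hypothetical and serve only to illustrate the convention.

```python
def label_pairs(pairs):
    """Turn (before, after) image pairs into (image, label) samples."""
    samples = []
    for before, after in pairs:
        samples.append((before, 0.0))  # before surgery: assumed not perfect
        samples.append((after, 1.0))   # after surgery: assumed perfect
    return samples


# Hypothetical file names standing in for cropped eye images.
pairs = [
    ("p01_before.png", "p01_after.png"),
    ("p02_before.png", "p02_after.png"),
]
samples = label_pairs(pairs)
for image, label in samples:
    print(image, label)
```

One consequence of this construction is that the two classes are balanced by design, since every pair contributes exactly one sample to each label.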
4.1 Dataset preparation
In the dataset, the total numbers of image pairs for rejuvenation treatment and eye double-fold surgery are 5585 and 36,598, respectively. The data are filtered to select good images for training. Typical defects include images that are not well aligned, images showing only one eye, etc. The reject rates are 82.15% and 59.92%, respectively.
The images were cropped around the eye area. Then the data were divided into 70%, 15%, and 15% for training, validation, and testing, respectively.
As mentioned earlier, the output of the model is the probability of perfection. Hence, our training data include images taken before and after surgery, as shown in Figure 3.
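As a rough check of the dataset sizes implied by these figures, the sketch below applies the reject rates to the pair counts and then splits the remainder 70/15/15. It assumes that the reject rates apply to the number of pairs and uses simple rounding; the chapter does not state the exact retained counts or how the split was drawn, so the numbers are estimates.

```python
def retained_pairs(total_pairs, reject_rate_percent):
    """Estimate how many pairs survive filtering (rounded to nearest)."""
    return round(total_pairs * (1 - reject_rate_percent / 100))


def split_counts(n, train_frac=0.70, val_frac=0.15):
    """Split n items into train/validation/test; test takes the remainder."""
    train = int(n * train_frac)
    val = int(n * val_frac)
    return train, val, n - train - val


rejuvenation = retained_pairs(5585, 82.15)
double_fold = retained_pairs(36598, 59.92)

print(rejuvenation, split_counts(rejuvenation))
print(double_fold, split_counts(double_fold))
```

Under these assumptions, roughly 997 rejuvenation pairs and 14,668 double-fold pairs remain after filtering, so the rejuvenation model is trained on a far smaller set.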
4.2 Prediction accuracy
We chose a threshold value for the prediction model. If the predicted value is higher than the threshold, the face is predicted to be perfect. From that, we obtain the accuracy of the different models, as shown in Tables 2 and 3. We use two training schemes, in which the feature extractor is either frozen (not trained together with the classification model) or trainable, resulting in eight models in these two tables. In the first scheme, the feature extractor is frozen because we believe that it has already captured universal features, such as edges and curves, that are relevant to our tasks; we therefore want to keep its weights intact. In the second scheme, we apply different learning rates to the feature extractor and the fully connected layers. The learning rate of the feature extractor is much smaller than that of the fully connected layers because we believe that the weights of the feature extractor are already good enough for our tasks, and we do not want to distort them too quickly or too much while training the classification model.
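The two schemes can be viewed as one gradient update with a separate learning rate per parameter group, where freezing is the special case of a zero rate for the encoder. The sketch below illustrates this with plain SGD; the parameter values, the gradients, and the rates 0.001 and 0.1 are hypothetical, since the chapter does not report the optimizer or the rates used.

```python
def sgd_step(params, grads, learning_rates):
    """One SGD update with a separate learning rate per parameter group."""
    updated = {}
    for group, values in params.items():
        lr = learning_rates[group]
        updated[group] = [v - lr * g for v, g in zip(values, grads[group])]
    return updated


# Toy parameters and gradients for the two groups of weights.
params = {"encoder": [0.5, -0.2], "head": [0.1, 0.3]}
grads = {"encoder": [1.0, -1.0], "head": [1.0, -1.0]}

# Scheme 1: encoder frozen, only the fully connected head is updated.
frozen = sgd_step(params, grads, {"encoder": 0.0, "head": 0.1})

# Scheme 2: encoder trainable, but with a much smaller learning rate.
fine_tuned = sgd_step(params, grads, {"encoder": 0.001, "head": 0.1})

print(frozen)
print(fine_tuned)
```

With the same gradients, the encoder weights move a hundred times less than the head weights in the second scheme, which is the intended protection against distorting the pretrained filters.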
These two training schemes are common practices when fine-tuning deep neural networks. We tried both of them, with the following results. The best models for double-fold surgery and rejuvenation treatment are Models 1 and 8 (see Tables 2 and 3 for more details), with accuracies on the test dataset of 93.1% and 88.9%, respectively. In Model 1 the encoder was frozen, whereas in Model 8 the encoder was retrained. However, the accuracy differences between the best model and the second-best model are less than 1%.
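For reference, accuracy under a threshold rule can be computed as in the following sketch. The threshold of 0.5 and the toy scores and labels are assumptions for illustration; the chapter does not state the threshold value that was chosen.

```python
def predict_perfect(scores, threshold=0.5):
    """A face is predicted perfect when its score exceeds the threshold."""
    return [score > threshold for score in scores]


def accuracy(predictions, labels):
    """Fraction of predictions that agree with the 0/1 labels."""
    correct = sum(int(p == bool(y)) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy perfection scores and ground-truth labels (1 = after surgery).
scores = [0.91, 0.12, 0.64, 0.40, 0.55]
labels = [1, 0, 1, 1, 0]

predictions = predict_perfect(scores)
acc = accuracy(predictions, labels)
print(predictions, acc)
```

In this toy case three of the five predictions match the labels, so the accuracy is 0.6. Moving the threshold trades missed perfect faces against false ones, which is why it is normally chosen on the validation split rather than on the test split.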
We have presented in this chapter an application of deep learning to aesthetic surgery recommendation, along with its encouraging results. The presented deep learning engine provides both the surgeon and the patient with a reference decision on whether or not to undergo rejuvenation treatment or eye double-fold surgery, based only on an eye photo of the patient. To this end, we trained a deep autoencoder on our in-house dataset, composed of pairs of images captured before and after surgery. The trained encoder learned, in an unsupervised manner, a rich set of features characterizing both unattractive and beautiful facial features. We connect the trained encoder to a fully connected layer to predict the perfection score of an eye photo of a patient, based on which a decision on whether or not to take the treatment is made.
Even though our preliminary results are promising, with 88.9% and 93.1% accuracy on rejuvenation treatment and eye double-fold surgery, respectively, there is still much room for improvement. Firstly, we should increase the size of our in-house dataset by encouraging more patients to participate in our program, so that we are able to build a deeper network. As more layers are added, richer learned features are obtained, which should improve the accuracy of our system. Secondly, we are going to extend our system to deal with more kinds of treatments, to diversify and provide the best services to our clients, rather than focusing only on the two treatments above.