I. INTRODUCTION
Recently, deep learning has been driving the development of image analysis across various data types. Deep learning holds the state-of-the-art (SOTA) models in almost every field of image analysis by finding and learning the features, shapes, and patterns of images. On the ImageNet [1] and CIFAR10 [2] classification benchmarks, models such as InceptionResnet [3] and Big Transfer (BiT) [4], which build on ResNet [5], have become SOTAs, and many researchers are actively working to achieve higher accuracy [18], [19], [20], [21], [22], [23], [24]. Based on these models, facial expression recognition (FER), face recognition, and face generation are being investigated. Most face datasets are captured in a laboratory or under controlled frontal conditions. In real life, however, there are far more pictures taken from other directions than exact frontal ones. One study [6] raised these problems and analyzed profile face data only. If we can generate a high-quality frontal face from a side-view image, it can be utilized in various applications.
The main models in the field of image generation are the Generative Adversarial Network (GAN) [8] and the Variational AutoEncoder (VAE) [9]. In particular, many GAN variants focus on producing high-quality images in which a style is transferred. Concretely, there are networks that generate images by extracting the characteristics of an image and synthesizing disentangled attributes, such as StarGAN [10], InterfaceGAN [7], CycleGAN [11], DiscoGAN [12], and StyleGAN [13].
Shen et al. extracted the latent space of a trained StyleGAN and proposed a methodology for various style edits [7]. As shown in Fig. 1, the further the pose is shifted from frontal to profile, the more certain people lose their identity.
Fig. 1. Pose generation example of InterfaceGAN [7].
In this paper, we propose a new method for generating the frontal face from images of other directions using only deep learning, without image editing techniques. A conditional generative adversarial network (cGAN) generates an image X corresponding to a fixed condition Y, i.e., from the posterior X|Y [14]. Following this idea, we generate a frontal face with Y as the style condition, but the style vector is obtained by our models instead of being a fixed value Y.
We propose four models, each of which extracts a different style of the face. The first, called StyleEncoder, has a structure similar to the discriminator and obtains a style vector from an image.
The second model is an advanced version of the first with an attention mechanism based on 1×1 convolution, which we call A-StyleEncoder. In addition, instead of training a model to find the style, a classification model trained on another dataset can be used to generate a frontal view; one such model is InceptionResnet trained on VGGFace2, and we compare the results of each model in the experiments. The A-StyleEncoder model captures coarse features of the face such as hair, beard, and outline. The pre-trained InceptionResnet model extracts details of the face such as the shape of the eyes, nose, and mouth. To combine these features, we merge the outputs of the two models, so that not only specific areas of the face but also the overall style can be considered. The resulting style vector is then fed to the generator, which produces the frontal face.
In the next section, we introduce other models and methodologies related to the suggested approach that define styles from images. In Section 3, we propose a method for training each network's parameters and present the four types of style extraction models. In Section 4, we compare the output images of each model and their PSNR against the original images. We also visualize the vectors from the StyleEncoder model to check whether the style was extracted properly, and specify the parametric structure of the generator, discriminator, and StyleEncoder models. Finally, we discuss the limitations of this research, future work, and its main implications in Section 5.
II. RELATED WORK
2.1. cGAN
The conditional GAN [14] conditions both the generator and the discriminator on a class label in order to generate an image of a given type that meets the desired condition. The objective adds the condition y to the existing GAN formulation as follows:
\(\begin{gathered} \min _{G} \max _{D} V(D, G)=\mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x \mid y)]+ \\ \mathbb{E}_{z \sim p_{z}(z)}[\log (1-D(G(z \mid y)))] \end{gathered}\) (1)
where D and G represent the discriminator and generator, respectively, x is a real image, and z is a noise input for generating a fake image. The conditional GAN overcomes the limitation of generating images from a random Gaussian sample alone. In this paper, an image-dependent vector produced by the style network replaces the class label y, and the generator is responsible for generating the corresponding frontal face with the same identity, conditioned on the output of the designed style-encoder model.
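For concreteness, the sketch below shows how the objective of Eq. (1) is commonly implemented in PyTorch. The function and network names are illustrative rather than the paper's actual code, and D and G are assumed to take the condition as a second argument.

```python
import torch

def cgan_losses(D, G, x_real, y, z):
    """Minimal sketch of the cGAN objective in Eq. (1).

    Assumes D(x, y) and G(z, y) take the condition y as a second
    argument; all names are illustrative.
    """
    x_fake = G(z, y)
    # Discriminator: maximize log D(x|y) + log(1 - D(G(z|y)|y))
    d_loss = -(torch.log(D(x_real, y) + 1e-8).mean()
               + torch.log(1.0 - D(x_fake.detach(), y) + 1e-8).mean())
    # Generator: minimize log(1 - D(G(z|y)|y))
    g_loss = torch.log(1.0 - D(x_fake, y) + 1e-8).mean()
    return d_loss, g_loss
```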
2.2. StyleGAN, InterfaceGAN
A Style-Based Generator Architecture for GANs (StyleGAN) by NVIDIA presents an advanced model that generates high-quality images [13]. StyleGAN generates the image gradually, starting from a very low resolution and progressing to a high resolution, modifying the central features corresponding to each resolution level separately. Resolutions up to \(8^2\) affect pose, general hair style, and face shape, while the remaining levels affect finer features. StyleGAN also employs an AdaIN (adaptive instance normalization) module, which modulates each channel with an information vector w. This mechanism produces state-of-the-art results at high resolution and allows a better understanding of GAN outputs.
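As a rough illustration of the AdaIN mechanism (a sketch, not StyleGAN's exact implementation), the module below normalizes each feature channel and then rescales and shifts it with per-channel parameters predicted from w by a learned affine layer:

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Sketch of adaptive instance normalization driven by a style vector w."""
    def __init__(self, channels, w_dim):
        super().__init__()
        self.affine = nn.Linear(w_dim, 2 * channels)  # predicts (scale, bias) per channel

    def forward(self, x, w):                      # x: (N, C, H, W), w: (N, w_dim)
        mean = x.mean(dim=(2, 3), keepdim=True)
        std = x.std(dim=(2, 3), keepdim=True) + 1e-8
        x = (x - mean) / std                      # instance-normalize each channel
        scale, bias = self.affine(w).chunk(2, dim=1)
        return x * (1 + scale[:, :, None, None]) + bias[:, :, None, None]
```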
InterfaceGAN proposes a novel approach that interprets the latent space of GANs for semantic face editing [7]. Specifically, it searches the latent space of a trained face synthesis model for semantic subspaces. InterfaceGAN can control several semantic attributes (pose, gender, glasses, etc.) of a trained StyleGAN model, but its pose manipulation loses the identity of the person, as shown in Fig. 1. In this paper, we introduce how to produce a frontal image without losing identity using a simple GAN structure.
2.3. Classification Model
As backbones, residual networks are widely adopted in SOTA models. InceptionResnet [3] is an Inception-style network that utilizes residual connections instead of filter concatenation. One variant, InceptionResnet-v1, is a hybrid Inception version with significantly improved recognition performance. We select InceptionResnet-v1 pre-trained on VGGFace2.
III. PROPOSED APPROACH
3.1. Formulation
We point out that the cGAN has a fundamental drawback: the class label y must be a fixed vector passed to the generator. If we want to produce an image described by a more complex style that cannot be expressed as a simple class, we cannot generate the desired image. A style-encoder, on the other hand, replaces the fixed label and represents a different style for every image. Therefore, the formulation for training the networks is as follows:
\(\theta_{G}^{*}, \theta_{\mathrm{D}}^{*}, \theta_{S}^{*}=\min _{\theta_{G}} \max _{\theta_{D}} f\left(\theta_{G}, \theta_{D}, \theta_{S}\right)\) (2)
where \(\theta_G\), \(\theta_D\), and \(\theta_S\) denote the parameters of the generator, the discriminator, and the style-encoder network, respectively, and \(\theta^*\) denotes the optimal parameters. Writing the training process as a function \(f\), the optimal parameters of G and D are computed by:
\(\theta_{G}^{*}=\underset{\theta_{G}}{\operatorname{argmin}} f\left(\theta_{G}\left(\theta_{S}\right), \theta_{D}^{*}\left(\theta_{G}\left(\theta_{S}\right)\right)\right),\) (3)
\(\theta_{D}^{*}=\underset{\theta_{D}}{\operatorname{argmax}} f\left(\theta_{G}^{*}\left(\theta_{S}\right), \theta_{D}\right)\) (4)
Finally, we define \(f\) on \(x \in \mathbb{R}^{C \times W \times H}\) (\(C\): channels; \(W, H\): image width and height), where \(x_{frontal}\) is the frontal face of \(x\) and \(x_{profile}\) is a side view. Therefore, we can express \(f\) as:
\(\begin{gathered} f(\theta_{G}, \theta_{D}, \theta_{S}) = \mathbb{E}_{x \sim p_{data}}\left[\log D\left(x_{frontal} \mid S(x_{profile}; \theta_{S}); \theta_{D}\right)\right] \\ + \mathbb{E}_{x \sim p_{data}}\left[\log \left(1-D\left(G\left(S(x_{profile}; \theta_{S}); \theta_{G}\right) \mid S(x_{profile}; \theta_{S}); \theta_{D}\right)\right)\right]. \end{gathered}\) (5)
With this formulation, we can obtain the frontal face image from multi-view images.
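A minimal training-step sketch of Eqs. (2)–(5) is given below. It assumes PyTorch and assumes, as described in Section 4.1, that the style encoder S is updated together with the generator, i.e., `opt_g` holds the parameters of both G and S; all names are illustrative.

```python
import torch

def train_step(G, D, S, x_profile, x_frontal, opt_g, opt_d):
    """One adversarial update for Eq. (5); names and setup are illustrative."""
    # The style vector S(x_profile; theta_S) replaces the fixed cGAN label y
    style = S(x_profile)

    # Discriminator step: ascend E[log D(x_frontal | s)] + E[log(1 - D(G(s) | s))]
    x_fake = G(style).detach()
    d_loss = -(torch.log(D(x_frontal, style.detach()) + 1e-8).mean()
               + torch.log(1.0 - D(x_fake, style.detach()) + 1e-8).mean())
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator (+ style encoder) step: descend E[log(1 - D(G(s) | s))]
    style = S(x_profile)                  # recompute so gradients reach theta_S
    g_loss = torch.log(1.0 - D(G(style), style) + 1e-8).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```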
3.2. Overall Network
The overall structure of the designed model is shown in Fig. 2. Given a side face as input, the style network extracts a style vector. The extracted style vector is used in both the generator and the discriminator. The generator produces the frontal image from the style vector. The discriminator receives the real or predicted frontal face together with the profile style vector as inputs and learns whether the frontal face of the person is real or fake.
Fig. 2. Overall structure of the model: \(\widehat{x}_{front}\) is the predicted frontal face and \(x_{front}\) is the real frontal face.
3.3. Style Network Architecture
To provide the generator with various style vectors, we design four types of models that extract styles.
3.3.1. StyleEncoder (SE)
The StyleEncoder model has an architecture similar to the discriminator in that it encodes an image. The image features are extracted by a simple five-layer convolutional neural network (CNN). Each layer uses LeakyReLU as the activation function and applies batch normalization (BN). The final size of the style vector is 1×1×512. Without the attention module in Fig. 3(a), this is the plain style-encoder model.
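A minimal PyTorch sketch of this encoder is shown below; the channel widths and the final 4×4 projection are assumptions, since only the layer count, the BN/LeakyReLU blocks, and the 1×1×512 output size are stated above.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    """Conv -> BatchNorm -> LeakyReLU, halving spatial resolution."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.2, inplace=True),
    )

class StyleEncoder(nn.Module):
    """Sketch of SE: five conv blocks mapping 3x128x128 to a 1x1x512 style vector."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 32),     # 128 -> 64
            conv_block(32, 64),    # 64  -> 32
            conv_block(64, 128),   # 32  -> 16
            conv_block(128, 256),  # 16  -> 8
            conv_block(256, 512),  # 8   -> 4
        )
        self.to_style = nn.Conv2d(512, 512, kernel_size=4)  # 4 -> 1 (assumed projection)

    def forward(self, x):
        return self.to_style(self.features(x))  # (N, 512, 1, 1)
```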
Fig. 3. Primary style network architectures. Merged version of A-StyleEncoder and InceptionResnet (A-SE+IR).
3.3.2. A-StyleEncoder (A-SE)
A-StyleEncoder is an advanced version of the style-encoder that adds an attention module to enhance features. The structure is shown in Fig. 3(a): the feature at the middle stage of the style-encoder passes through a 1×1 convolution and a sigmoid function and is then multiplied with itself. The output \(x_l\) of the middle layer is given as:
\(\text{LeakyReLU}(\text{BatchNorm}(\operatorname{Conv}(x)))=\mathcal{F}(x),\) (6)
\(x_{l}=\mathcal{F}(x_{l-1}) \otimes \sigma\left(C\left(\mathcal{F}(x_{l-1})\right)\right),\) (7)
where ⊗ denotes element-wise multiplication, C(·) is a pointwise convolution with a 1×1 filter, and \(l\) indexes the output layer of the attention module.
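In code, the attention block of Eqs. (6)–(7) amounts to a simple gating operation; a minimal sketch (illustrative names) is:

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Sketch of the A-SE attention block in Eqs. (6)-(7)."""
    def __init__(self, channels):
        super().__init__()
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)  # C(.)

    def forward(self, f):
        # f = F(x_{l-1}); gate it with sigma(C(f)) and multiply element-wise
        return f * torch.sigmoid(self.pointwise(f))
```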
3.3.3. InceptionResnet (IR)
InceptionResnet [3] performs strongly in classification and is widely used as a backbone network. We adopt InceptionResnet-v1 pre-trained on the VGGFace2 dataset, which recognizes the images in our dataset with 100 percent accuracy. As with the other style-encoder models, the face style is extracted from the input image as a 1×1×512 vector.
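One readily available implementation of such an encoder is the InceptionResnetV1 model from the facenet-pytorch package with VGGFace2 weights, which outputs a 512-dimensional embedding. Using this particular package is our assumption for illustration; the paper only states that the model is pre-trained on VGGFace2.

```python
import torch
from facenet_pytorch import InceptionResnetV1  # assumed implementation choice

# VGGFace2-pretrained InceptionResnet-v1 in eval mode (BN/dropout frozen)
ir_encoder = InceptionResnetV1(pretrained='vggface2').eval()

with torch.no_grad():
    x = torch.randn(1, 3, 128, 128)          # a cropped face image batch
    ir_style = ir_encoder(x)                 # (1, 512) identity embedding
    ir_style = ir_style.view(1, 512, 1, 1)   # reshape to 1x1x512 to match the SE output
```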
3.3.4. A-StyleEncoder + InceptionResnet (A-SE + IR)
Finally, we concatenate the outputs of A-StyleEncoder and InceptionResnet to make use of the complementary style each captures. A-StyleEncoder captures the style around the face (hair, beard, face shape), while InceptionResnet captures the features used to recognize people, such as the eyes, nose, and mouth. Thus, to cover both types, we concatenate the features as in Eq. (8):
\(style = S_{A\text{-}StyleEncoder}(x) \oplus S_{InceptionResnet}(x).\) (8)
By concatenating these outputs as shown in Fig. 3(b), it is feasible to generate the frontal face from a profile face.
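A short composition of the two encoders following Eq. (8) might look as follows (illustrative names; in practice each encoder may expect a differently sized crop):

```python
import torch

def combined_style(a_se, ir_encoder, x):
    """Eq. (8): concatenate the A-SE and IR styles into one 1x1x1024 vector."""
    s_ase = a_se(x)                                   # (N, 512, 1, 1)
    s_ir = ir_encoder(x).view(x.size(0), 512, 1, 1)   # (N, 512, 1, 1)
    return torch.cat([s_ase, s_ir], dim=1)            # (N, 1024, 1, 1)
```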
IV. EXPERIMENTS
4.1. Dataset and Training Details.
We used the FEI Face dataset [15] with 11 directions (1 frontal and 10 side views) for each of 200 individuals. We preprocess the images with Multi-task Cascaded Convolutional Networks (MTCNN) [16] to detect and crop the face. The size of the final cropped image is 3×128×128 (C×W×H). The inputs are the ten profile images rotated in 18-degree steps from -90 to 90 degrees. The generator and discriminator were trained alternately, and the style model was trained together with the generator. We trained for 100 epochs, which took 20 hours on a GeForce GTX 1080 Ti, using the Adam optimizer with a learning rate of 0.0002.
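A sketch of this preprocessing step using the MTCNN implementation in facenet-pytorch is given below; the package choice and the file name are assumptions for illustration, as the paper only cites MTCNN [16] for detection and cropping.

```python
from facenet_pytorch import MTCNN   # assumed implementation of MTCNN [16]
from PIL import Image

# Detect, align, and crop the face to 3x128x128 as used in Section 4.1
mtcnn = MTCNN(image_size=128, margin=0)

img = Image.open('fei_subject_01_pose_05.jpg')   # hypothetical file name
face = mtcnn(img)                                # tensor of shape (3, 128, 128), or None
if face is not None:
    face = face.unsqueeze(0)                     # add a batch dimension for the encoders
```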
4.2. Feature Visualization.
Before the features enter the generator and discriminator, we visualize the concatenated output features of A-StyleEncoder and InceptionResnet. The size of the total output feature is 1×1×1024. We use principal component analysis (PCA) to project it into two dimensions. The output of each model is aggregated for visualization as follows:
\(PCA\left(S_{A\text{-}StyleEncoder}(x) \oplus \alpha \times S_{InceptionResnet}(x)\right),\) (9)
where α is set to 2. The result of the two-dimensional PCA projection is shown in Fig. 4. People with similar styles are grouped together, which indicates that the 'A-SE + IR' representation is effective.
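A sketch of the projection in Eq. (9), assuming scikit-learn and matplotlib, could look as follows (array and label names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def visualize_styles(ase_feats, ir_feats, labels, alpha=2.0):
    """Project concatenated (A-SE, alpha * IR) styles to 2-D as in Eq. (9).

    ase_feats, ir_feats: arrays of shape (num_images, 512); labels: person IDs.
    """
    merged = np.concatenate([ase_feats, alpha * ir_feats], axis=1)  # (N, 1024)
    points = PCA(n_components=2).fit_transform(merged)
    plt.scatter(points[:, 0], points[:, 1], c=labels, cmap='tab20', s=10)
    plt.title('A-SE + IR style vectors (PCA)')
    plt.show()
```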
Fig. 4. Visualization of style output of ‘A-SE + IR’ with all people of dataset.
4.3. Detailed Network Architecture.
The specific architectures of the generator and discriminator are shown in Table 1 and Table 2. Both models follow the DCGAN architecture [17]. The generator produces images from a style vector, and the discriminator concatenates the input image with the style vector to decide whether the image is real or fake.
Table 1. Generator Architecture.
Table 2. Discriminator Architecture.
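As a rough sketch only, a DCGAN-style generator mapping a 1×1×1024 style vector to a 3×128×128 image could be structured as below; the exact channel widths and layer counts are assumptions, and the authoritative layout is the one in Table 1.

```python
import torch
import torch.nn as nn

def up_block(c_in, c_out):
    """ConvTranspose -> BatchNorm -> ReLU, doubling spatial resolution."""
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class Generator(nn.Module):
    """DCGAN-style generator sketch: 1x1x1024 style vector -> 3x128x128 image."""
    def __init__(self, style_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(style_dim, 512, kernel_size=4),  # 1 -> 4
            nn.BatchNorm2d(512), nn.ReLU(inplace=True),
            up_block(512, 256),   # 4  -> 8
            up_block(256, 128),   # 8  -> 16
            up_block(128, 64),    # 16 -> 32
            up_block(64, 32),     # 32 -> 64
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),  # 64 -> 128
            nn.Tanh(),
        )

    def forward(self, style):
        return self.net(style)   # (N, 3, 128, 128)
```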
4.4. Performance Analysis.
We compare the results of generating a frontal view from each person's profile across the models of Section 3.3 (Style Network Architecture) and InterfaceGAN with StyleGAN pretrained on the Flickr-Faces-HQ dataset (FFHQ). As shown in Fig. 5, the merged version (A-SE + IR) of A-StyleEncoder and InceptionResnet outperforms the other style networks. As mentioned above, A-StyleEncoder (A-SE) extracts the hair type and face shape (the region around the face), while InceptionResnet (IR) selects features such as the eyes, nose, and mouth better than the other models. Results of the 'A-SE + IR' model for additional subjects are shown in Fig. 6. Hair styles and facial features are reproduced more faithfully than with the other generator models. Hence, by considering the characteristics of A-StyleEncoder and InceptionResnet at the same time, we obtain the style of the whole face. Finally, we compare the peak signal-to-noise ratio (PSNR) of the outputs in Table 3; the proposed 'A-SE + IR' model obtains the highest PSNR value.
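For reference, the PSNR values in Table 3 can be computed with the standard formula \(\mathrm{PSNR} = 10\log_{10}(255^{2}/\mathrm{MSE})\) for 8-bit images; a minimal NumPy sketch is:

```python
import numpy as np

def psnr(reference, generated, max_val=255.0):
    """Peak signal-to-noise ratio between the real and generated frontal face."""
    mse = np.mean((reference.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')
    return 10.0 * np.log10((max_val ** 2) / mse)
```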
Fig. 5. Results of all models mentioned in Style Network Architecture.
Fig. 6. Results of generating a frontal face from multi-view images by our best model.
V. CONCLUSION
We have investigated several style extraction models and proposed a style extraction model called 'A-SE + IR', which concatenates the outputs of the attention style-encoder and InceptionResnet to condition the generator and discriminator for generating the frontal face from a side view. We also developed a frontal face generation module that extracts complex features by applying a conditional generator. This model not only extracts styles around the face, such as a person's hair style, but also reproduces the facial features well. We verified the possibility of generating a frontal face of reliable quality from side-view images.
References
[1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255, 2009.
[2] CIFAR10 dataset, University of Toronto, "https://www.cs.toronto.edu/kriz/cifar.html".
[3] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," arXiv preprint arXiv:1602.07261, 2016.
[4] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby, "Big Transfer (BiT): General visual representation learning," arXiv preprint arXiv:1912.11370, vol. 6, 2019.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
[6] Maja Pantic and Ioannis Patras, "Dynamics of facial expression: recognition of facial actions and their temporal segments from face profile image sequences," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 36, no. 2, pp. 433-449, 2006. https://doi.org/10.1109/TSMCB.2005.859075
[7] Yujun Shen, Ceyuan Yang, Xiaoou Tang, and Bolei Zhou, "InterFaceGAN: Interpreting the disentangled face representation learned by GANs," arXiv preprint arXiv:2005.09635, 2020.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Proceedings of Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.
[9] Diederik P. Kingma and Max Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[10] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo, "StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789-8797, 2018.
[11] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2223-2232, 2017.
[12] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim, "Learning to discover cross-domain relations with generative adversarial networks," arXiv preprint arXiv:1703.05192, 2017.
[13] Tero Karras, Samuli Laine, and Timo Aila, "A style-based generator architecture for generative adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401-4410, 2019.
[14] Mehdi Mirza and Simon Osindero, "Conditional generative adversarial nets," arXiv preprint arXiv:1411.1784, 2014.
[15] The FEI face database, FEI, "http://fei.edu.br/cet/facedatabase.html".
[16] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499-1503, 2016. https://doi.org/10.1109/LSP.2016.2603342
[17] Alec Radford, Luke Metz, and Soumith Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.
[18] Rohit Srivastava, Ravi Tomar, Ashutosh Sharma, Gaurav Dhiman, Naveen Chilamkurti, and Byung-Gyu Kim, "Real-Time Multimodal Biometric Authentication of Human Using Face Feature Analysis," Computers, Materials & Continua, vol. 49, no. 1, pp. 1-19, 2021. https://doi.org/10.32604/cmc.2021.015466
[19] Dami Jeong, Byung-Gyu Kim, and Suh-Yeon Dong, "Deep Joint Spatiotemporal Network (DJSTN) for Efficient Facial Expression Recognition," Sensors, vol. 20, p. 1963, 2020. https://doi.org/10.3390/s20071936
[20] Ji-Hae Kim, Gwang-Soo Hong, Byung-Gyu Kim, and Debi P. Dogra, "deepGesture: Deep learning-based gesture recognition scheme using motion sensors," Displays, vol. 55, pp. 34-45, 2018. https://doi.org/10.1016/j.displa.2018.08.001
[21] Ji-Hae Kim, Byung-Gyu Kim, Partha Pratim Roy, and Da-Mi Jeong, "Efficient Facial Expression Recognition Algorithm Based on Hierarchical Deep Neural Network Structure," IEEE Access, vol. 7, pp. 41273-41285, 2019. https://doi.org/10.1109/access.2019.2907327
[22] Dong-hyeon Kim, Dong-seok Lee, and Soon-kak Kwon, "Fall Situation Recognition by Body Centerline Detection using Deep Learning," Journal of Multimedia Information System, vol. 7, no. 4, pp. 257-262, 2020. https://doi.org/10.33851/JMIS.2020.7.4.257
[23] Woon-Ha Yeo, Young-Jin Heo, Young-Ju Choi, and Byung-Gyu Kim, "Place Classification Algorithm Based on Semantic Segmented Objects," Applied Sciences, vol. 10, p. 9069, Dec. 2020. https://doi.org/10.3390/app10249069
[24] S. Mukherjee, S. Ghosh, S. Ghosh, P. Kumar, and P. P. Roy, "Predicting Video-frames Using Encoder-convlstm Combination," in Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2027-2031, 2019. https://doi.org/10.1109/ICASSP.2019.8682158