Monocular visual scene analysis: saliency detection and 3D face reconstruction using GAN

  • Xiaoxu Cai

Student thesis: Doctoral Thesis


Visual scene analysis imitates the way humans perceive the outside world, which is essential for achieving computer intelligence. This thesis narrows the scope of visual scene analysis to two fundamental tasks: detecting and reconstructing the object of interest in a scene. For a general scene consisting of multiple objects, it is natural to single out the most salient object first. For a human-centred scene, reconstructing the 3D geometry of the human face, which plays a central role in social communication and is highly deformable, becomes one of the first priorities. Based on these two insights, the thesis studies saliency detection in a general scene and depth-to-3D face reconstruction in a human-centred scene. It explores in depth how the generative adversarial network (GAN), originally proposed for image generation, can be adapted to solve these problems.
For saliency detection, the thesis proposes a novel perceptual loss-guided GAN called PerGAN. PerGAN applies a multi-scale discriminator and is trained with a perceptual loss that measures misdetection errors of the generated saliency map at the semantic feature level rather than at the usual pixel level. This makes better use of features across different image resolutions and of features that are semantically meaningful. The proposed method has been validated on benchmark datasets and achieves saliency detection accuracy competitive with the state of the art.
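The difference between a pixel-level loss and a feature-level loss can be illustrated with a minimal sketch. It is not PerGAN's actual loss: the abstract does not specify the feature network, so `toy_features` (a small average-pooling pyramid) is a hypothetical stand-in for the feature maps of a pretrained network, and all function names are assumptions made for illustration.

```python
# Toy comparison of a pixel-level loss and a feature-level ("perceptual") loss.
# toy_features is a hypothetical stand-in for a pretrained network's feature
# maps; it is NOT the feature extractor used in the thesis.

def avg_pool2(img):
    """2x2 average pooling on a 2D list with even height and width."""
    h, w = len(img), len(img[0])
    return [
        [(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
         for c in range(0, w, 2)]
        for r in range(0, h, 2)
    ]

def toy_features(img, levels=2):
    """Return progressively coarser maps (an assumed feature pyramid)."""
    feats = []
    current = img
    for _ in range(levels):
        current = avg_pool2(current)
        feats.append(current)
    return feats

def mse(a, b):
    """Mean squared error between two 2D lists of equal shape."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)

def pixel_loss(pred, target):
    """Error measured directly on saliency-map pixels."""
    return mse(pred, target)

def perceptual_loss(pred, target, levels=2):
    """Error measured on feature maps, summed over all pyramid levels."""
    pred_feats = toy_features(pred, levels)
    target_feats = toy_features(target, levels)
    return sum(mse(fp, ft) for fp, ft in zip(pred_feats, target_feats))

# A 4x4 ground-truth saliency map and a prediction with one wrong pixel.
target = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]]
pred = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 0]]

print("pixel loss:", pixel_loss(pred, target))
print("perceptual loss:", perceptual_loss(pred, target))
```

In a real setting the pyramid would be replaced by activations from a trained network, so that the loss penalizes differences in learned, semantically meaningful features rather than in pooled intensities.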
For 3D face reconstruction from a depth image, the thesis first proposes using a GAN to bridge the facial voxel grid and the depth data. An attention mechanism is incorporated into the GAN to steer learning towards the intermediate features that are more relevant to predicting facial voxels. The resulting attention-guided GAN, or AGGAN for short, is trained and evaluated on synthesized depth images. Compared with previous methods that rely on a costly optimization-based 3D reconstruction process, the learning-based AGGAN is more efficient and more robust to depth images with noise and large facial poses. Moreover, the use of synthetic data for training shows strong potential for overcoming the shortage of depth images with 3D facial labels. Building on these results, the thesis continues to train the 3D face reconstruction network on synthetic data while incorporating unlabelled real depth images into the training procedure to obtain a domain-adaptive reconstruction model. It employs a GAN to bridge the domain gap between synthetic and real depth images and learns a common feature embedding that is informative for both domains. The resulting reconstruction network shows promising generalization to real-world depth images. Extensive experiments on mainstream real datasets demonstrate that the proposed domain-adaptive 3D face reconstruction method is competitive with the state of the art.
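The adversarial domain-alignment idea described above can be sketched with two standard GAN-style losses. This is a generic illustration, not the thesis's formulation: the labelling convention (synthetic = 1, real = 0), the non-saturating encoder loss, and all function names are assumptions, and the inputs are simply the probabilities a hypothetical domain discriminator assigns to feature embeddings.

```python
import math

def bce(prob, label):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def discriminator_loss(probs_synthetic, probs_real):
    """Domain discriminator: label synthetic features 1 and real features 0.

    Each input is a list of discriminator outputs, i.e. the predicted
    probability that a feature embedding comes from the synthetic domain.
    """
    losses = [bce(p, 1) for p in probs_synthetic]
    losses += [bce(p, 0) for p in probs_real]
    return sum(losses) / len(losses)

def encoder_alignment_loss(probs_real):
    """Encoder: low loss when real-domain features are mistaken for synthetic.

    Minimizing this pushes real and synthetic embeddings towards a common
    feature space that the discriminator cannot separate.
    """
    return sum(bce(p, 1) for p in probs_real) / len(probs_real)

# A discriminator that separates the domains well has a low loss ...
print(discriminator_loss([0.9, 0.8], [0.1, 0.2]))
# ... while the encoder's loss is then high, which drives it to align features.
print(encoder_alignment_loss([0.1, 0.2]))
```

The two losses pull in opposite directions: training alternates between them until the discriminator can no longer tell which domain an embedding came from, at which point a reconstruction network trained on labelled synthetic data can be applied to real depth images.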
By developing these algorithmic solutions to visual saliency detection and depth-to-3D face reconstruction, the thesis also gains first-hand experience in adapting GANs to visual scene analysis tasks that differ considerably from the familiar image generation task. The adaptations of the GAN in this thesis range from binary saliency map generation and facial voxel prediction to domain alignment. This experience should help extend GANs to a broader range of application scenarios, not limited to visual scene analysis.
Date of Award: Jan 2021
Original language: English
Supervisor: Hui Yu
