Learning Periorbital Soft Tissue Motion

by

Anurag Ranjan

B. Tech., National Institute of Technology Karnataka, 2013

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University Of British Columbia
(Vancouver)

September 2015

© Anurag Ranjan, 2015

Abstract

Human observers tend to pay a lot of attention to the eyes and the surrounding soft tissues. These periorbital soft tissues are associated with subtle and fast motions that convey emotions during facial expressions. Modeling the complex movements of these soft tissues is essential for capturing and reproducing realism in facial animations.

In this work, we present a data driven model that can efficiently learn and reproduce the complex motion of the periorbital soft tissues. We develop a system to capture the motion of the eye region using a high frame rate monocular camera. We estimate the high resolution texture of the surrounding eye regions using a Bayesian framework. Our learned model performs well in reproducing various animations of the eyes. We further improve realism by introducing methods to model facial wrinkles.

Preface

The methods described in Chapter 3, Chapter 5 and Chapter 6 formed part of the following poster, and will be part of a subsequent full paper:

Debanga Raj Neog, Anurag Ranjan, João L. Cardoso, and Dinesh K. Pai. “Gaze driven animation of eyes.” In Proceedings of the 14th ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 198-198. ACM, 2015.

The work in Chapter 3 and Chapter 6 was done jointly with Debanga Raj Neog and will eventually form a part of his doctoral thesis. The rendering framework was developed by João L. Cardoso.

Except as noted above, all algorithms, experiments and analysis in this thesis were developed by me, with guidance from my supervisor, Dinesh Pai.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
Dedication
1 Introduction
2 Background and Related Work
  2.1 Motion Estimation
    2.1.1 Optical Flow
    2.1.2 Scene Flow
  2.2 Realistic Performance Capture
    2.2.1 Facial Performance Capture
    2.2.2 Dynamic Capture of the Eyes
  2.3 Texture Estimation
    2.3.1 Multi-view Stereo Methods
    2.3.2 Bayesian Methods
  2.4 Learning Skin Motion
  2.5 Wrinkles
3 Motion Estimation
  3.1 Scene Flow
    3.1.1 Definitions
    3.1.2 Scene Flow Model
    3.1.3 Numerical Solution
    3.1.4 Experiments and Results
  3.2 Skin Flow
    3.2.1 Skin Tracking Model
    3.2.2 Experiments and Results
4 Texture Estimation
  4.1 Overview
  4.2 Theory
    4.2.1 Definitions
    4.2.2 Generative Model
    4.2.3 Bayesian Estimation
    4.2.4 Optimization
  4.3 Numerical Solution
    4.3.1 Update for Noise Variance, Σ
    4.3.2 Update for Texture Image, J_0
    4.3.3 Update for Optical Flow Fields, w_i
  4.4 Results
5 Learning Skin Motion
  5.1 Overview
  5.2 Eyelid Shape Model
    5.2.1 Definitions
    5.2.2 Training Model
  5.3 Skin Motion Model
    5.3.1 Definitions
    5.3.2 Training Model
  5.4 Results
    5.4.1 Eyelid Shape Model
    5.4.2 Skin Motion Model
    5.4.3 Comparison with Other Methods
    5.4.4 Model Transfer
    5.4.5 Interactive Real Time Implementation
6 Wrinkles
  6.1 Strain based Appearance Model
    6.1.1 Definitions
    6.1.2 Tracking
    6.1.3 Learning Strain to Appearance Map
    6.1.4 Experiments and Results
  6.2 Shape from Shading Model
    6.2.1 Geometric Model
    6.2.2 Implementation
    6.2.3 Results
7 Conclusion and Future Work
Bibliography

List of Tables

Table 3.1 Comparison of RMS errors for Scene Flow.
Table 5.1 Performance comparison of different learning algorithms on skin motion model.

List of Figures

Figure 3.1 Frames of a stereo setup
Figure 3.2 Frames of General Sphere dataset
Figure 3.3 Results of Scene Flow estimation with Ground truths
Figure 3.4 Overview of the spaces related to skin tracking.
Figure 3.5 The capture setup used in our experiments.
Figure 3.6 Checkerboard snapshots for Calibration and Reconstruction
Figure 3.7 Tracking results for monocular videos. Images from videos and corresponding 3D reconstructed meshes.
Figure 4.1 Frames of Eyelid in motion showing displacement of the Eyelashes
Figure 4.2 Transformation mapping Texture space and Image frames.
Figure 4.3 Color of texels in texture space tracked across frames.
Figure 4.4 Texel distribution of a tracked point across frames.
Figure 4.5 Eyelashes from the Video Sequence
Figure 4.6 Reconstructed Eyelashes
Figure 4.7 Skin texture from the Video Sequence
Figure 4.8 Reconstructed Skin texture
Figure 4.9 Reconstructed Skin texture of the Eyelid
Figure 4.10 Reconstructed Skin texture of the Full Face
Figure 4.11 Reconstructed Skin under the eyelash
Figure 5.1 Orbicularis oculi and levator palpebrae control the eyelids
Figure 5.2 Gaze parameterized skin motion
Figure 5.3 Model of the neural network for training Eyelid shape.
Figure 5.4 Model of the neural network for training Skin Motion.
Figure 5.5 Reconstruction of the Eyelid
Figure 5.6 Reconstruction of the Skin
Figure 5.7 Model transfer: Model trained on one subject is used to deform the mesh of another subject.
Figure 5.8 Interactive WebGL implementation in real-time.
Figure 6.1 Facial expression and skin deformations: Forehead wrinkles, frown, blink and Crow’s feet.
Figure 6.2 Body Mesh projected as Skin Atlas, and Tracking Mesh
Figure 6.3 Tracking of Eyelid across frames of a video sequence
Figure 6.4 Forehead wrinkles projected to Skin Atlas as an appearance texture.
Figure 6.5 Normal map of steady state and activated state for forehead wrinkles
Figure 6.6 Rendered Upper Head animation with Steady state pose and Activated pose
Figure 6.7 Wrinkle marking, Skin groups forming wrinkles, and wrinkle generation.
Figure 6.8 Wrinkle modeling in a forceful eye closure example.

Glossary

AAM Active Appearance Models
ANN Artificial Neural Network
AWGN Additive White Gaussian Noise
CG Conjugate Gradient
DCT Discrete Cosine Transform
DFT Discrete Fourier Transform
FEM Finite Element Methods
LDOF Large Displacement Optical Flow
MAP Maximum a Posteriori
MCML Maximally Collapsing Metric Learning
ML Maximum Likelihood
MLR Multivariate Linear Regression
MLS Moving Least Squares
NCA Neighborhood Component Analysis
NN Neural Network
OFC Optical Flow Constraint
PCA Principal Component Analysis
PPCA Probabilistic PCA
RBF Radial Basis Functions
SAG Stochastic Average Gradient

Acknowledgments

I am grateful to my research supervisor Prof. Dinesh K. Pai for supporting me throughout my graduate program. I am thankful to Debanga Raj Neog for his useful collaboration during the projects that became a part of this thesis. I extend my thanks to João L. Cardoso for helping with the rendering. I acknowledge MITACS and NSERC for funding my graduate program.

I will miss the espressos at Bean which always kept me awake, and their wonderful, welcoming people.

Finally, I express my deepest gratitude to Radhe Uncle for his love and guidance throughout my life. I am very thankful to my parents and family for their kindest support.

Dedication

to Mom

Chapter 1

Introduction

Eyes have always been important subjects of human perception and facial expressions. In 1960, Yarbus showed that human observers spend a surprisingly large fraction of time fixated on the eyes in a picture [82]. The region around the eye forms the most important feature of Viola and Jones’s face detector [74]. In Google’s Inceptionism (footnote 1), we see that the neural network for classification of humans or animals generally looks for the eye region.
As a result, the dreamy images that correspond to the features learned by the classifiers contain a lot of eyes.

Footnote 1: http://googleresearch.blogspot.ca/2015/06/inceptionism-going-deeper-into-neural.html

But what is it about eyes that conveys this important information to observers? To discuss this, we need to introduce some terms. The colloquial term “eye” is not sufficiently precise. It includes the globe, the approximately spherical optical apparatus of the eye that includes the colorful iris. The globe sits in a bony socket in the skull called the orbit. The term “eye” also usually includes the upper and lower eyelids, eyelashes and periorbital soft tissues that surround the orbit, including the margins of the eyebrows. When we refer to the eye we mean all of these tissues, and we will use the more specific term where appropriate.

The eyes convey subtle information about a person’s mental state (e.g., attention, intention, emotion) and physical state (e.g., age, health, fatigue). Thus, creating realistic computer generated animations of eyes is a very important problem in computer graphics. Realistic facial performance capture has been widely researched [10, 29], but only a handful of methods pay attention to details around the eye region. Recently, there has been an emergence of methods [4, 5, 14, 25, 46] that focus on reconstruction of details of the eyes in facial animations. However, these methods require a complex setup for detailed capture and reconstruction of the eye region. Our work is motivated by learning important aspects of the motion around the eyes using a simple video capture setup.

In order to learn about the motion of the eyes, we need a robust framework to track the detailed movements of the soft tissues around the eyes. The motion of the eye region is very subtle and fast. The blink of an eye takes around 100 to 400 milliseconds [56]. The eye saccades in humans can reach up to 900° per second [21].
This results in subtle motion of soft tissues surrounding the eyes, which generally goes unnoticed. Tracking such detailed and fast motion of the eye thus requires a robust tracking method.

In addition, the motion can also help in estimating the detailed texture of the eye regions using superresolution methods. The various regions of the eye, such as the globe, eyelashes and skin texture, may occlude each other as a result of the motion of the eye region. Therefore, a framework is required to deal with these regions of occlusion to robustly recover textures of the eye region. A large number of tracking methods use the Optical Flow framework [32] for computing the dense motion field across the images, and recent developments have made it more robust to occlusions and discontinuities.

The subtle motions of the eye are very complex, and it would be very useful to simplify them for understanding. A simplified model is essential for generating these complex motions using a small number of parameters. Most of the facial performance capture methods use Blend Shapes, which cannot capture all the detailed and non-linear motion that is usually produced around the eye region. Hence, a non-linear model is required to capture the detailed motion of the eye region, which can be controlled using a small number of parameters such as Gaze and other affect parameters.

The realism in facial animation is highly influenced by the quality of the different wrinkles that can be produced. The simple wrinkles on the forehead that result from expressions like eyebrow raising are easier to capture. However, the wrinkles around the eye region are not very sharp and distinguishable. There have been a number of methods for modeling wrinkles using a shape from shading approach.

In this thesis, we design a robust method for dense tracking of the region around the eyes. We recover the texture of the eye region using an optical flow based superresolution method.
We design a neural network framework to learn and control the motion of the eyes using a small number of parameters. Finally, we use a shape from shading approach to simulate wrinkles and add realism.

Our main contributions are as follows:

1. We implement a robust scene flow method [70] and improve its baseline performance using a stochastic optimization technique [57]. We develop a skin flow method in a monocular setup to track the movements around the eye using “reduced coordinates” and estimate the dense 3D motion of the eye region.

2. We develop a robust method to estimate the texture of skin around the eye region and eyelashes in the presence of occlusion. Our framework relies on optical flow based superresolution methods and deals with occlusions by introducing material mask functions. We recover high resolution textures using Bayesian estimation.

3. We parametrize the complex motion of the eye in terms of the Gaze of the eye. We design two neural network models based on Radial Basis Functions (RBF) to learn the motion of the eye region. The first network learns the shape of the eyelid from the Gaze parameters. The second network parametrizes the motion of the eye in terms of eyelid shape.

4. We develop two approaches to deal with wrinkles that are produced as a result of facial expressions. The first model generates Appearance Textures based on the strain of the skin that can simulate wrinkles. The second method, based on a shape from shading approach, parametrizes wrinkles in terms of geometric shape based on the observed intensities of their pixels.

Chapter 2

Background and Related Work

The goal of this thesis is to design a framework for learning the motion of the eye region. This objective is addressed by developing methods for motion estimation and texture recovery.
Using the data from motion estimation, we train neural network models to produce realistic motion around the eye.

2.1 Motion Estimation

The work on motion estimation has followed from methods for estimation of optical flow [8, 11, 12, 32, 41]. Optical flow methods estimate dense 2D motion between frames of an image sequence. Subsequently, methods for computing 3D motion using multiple cameras emerged [34, 70, 73, 77]. The estimation of dense 3D motion in a scene using multiple cameras is referred to as Scene Flow.

2.1.1 Optical Flow

Tracking motion in video sequences, known as Optical Flow, is a well studied problem in computer vision. It is defined as the estimation of the motion, given by $u(x,y)$ and $v(x,y)$ in the $x$ and $y$ directions respectively, of all the pixels between two image frames of the sequence $I(x,y,t) \in \mathbb{R}$. There have been various developments to obtain state of the art performance in determining optical flow. Horn and Schunck’s seminal work [32] on optical flow described it as a constrained optimization problem, introducing the Optical Flow Constraint (OFC)

$$I_x u + I_y v + I_t = 0, \tag{2.1}$$

where $I_x$, $I_y$ and $I_t$ are the partial derivatives of the image, and $u$, $v$ are the flow fields between images $I(x,y,t_1)$ and $I(x,y,t_2)$. However, a single OFC is insufficient to estimate both $u$ and $v$, hence Horn and Schunck introduced spatial regularization constraints that enforce flow smoothness across pixels. The spatial regularization constraints are given by

$$\nabla^2 u = 0 \quad \text{and} \quad \nabla^2 v = 0, \tag{2.2}$$

where $\nabla$ is the gradient operator, so that $\nabla^2$ denotes the Laplacian. This was followed by several approaches based on variational methods [60, 84, 85], including Brox et al.’s work [12] that incorporates the theory of warping. The variational methods provide a way of tracking precise displacement of scene objects to a sub-pixel level. However, these methods use spatial regularization over image pixels that leads to errors over edge discontinuities.
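As a minimal sketch of how the OFC (2.1) and a smoothness prior combine into an iterative solver, the following Python code applies the classic Horn and Schunck style update to precomputed derivative images. The function names, the weight alpha, the iteration count, the 4-neighbour average with replicated borders, and the toy input are assumptions made for illustration; they are not taken from this thesis.

```python
def neighbour_mean(f):
    """Average of the 4 neighbours of each pixel, with replicated borders."""
    h, w = len(f), len(f[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            up = f[max(y - 1, 0)][x]
            down = f[min(y + 1, h - 1)][x]
            left = f[y][max(x - 1, 0)]
            right = f[y][min(x + 1, w - 1)]
            out[y][x] = (up + down + left + right) / 4.0
    return out


def horn_schunck(Ix, Iy, It, alpha=1.0, iters=100):
    """Estimate flow (u, v) from derivative images Ix, Iy, It.

    alpha is an assumed smoothness weight: larger values favour
    smoother flow fields over a small OFC residual.
    """
    h, w = len(Ix), len(Ix[0])
    u = [[0.0] * w for _ in range(h)]
    v = [[0.0] * w for _ in range(h)]
    for _ in range(iters):
        u_bar, v_bar = neighbour_mean(u), neighbour_mean(v)
        for y in range(h):
            for x in range(w):
                # OFC residual evaluated at the locally averaged flow.
                r = Ix[y][x] * u_bar[y][x] + Iy[y][x] * v_bar[y][x] + It[y][x]
                d = alpha ** 2 + Ix[y][x] ** 2 + Iy[y][x] ** 2
                u[y][x] = u_bar[y][x] - Ix[y][x] * r / d
                v[y][x] = v_bar[y][x] - Iy[y][x] * r / d
    return u, v


# Toy input: a horizontal intensity ramp (Ix = 2, Iy = 0) shifted by one
# pixel to the right between frames, so that It = -2 everywhere.
size = 4
Ix = [[2.0] * size for _ in range(size)]
Iy = [[0.0] * size for _ in range(size)]
It = [[-2.0] * size for _ in range(size)]
u, v = horn_schunck(Ix, Iy, It)
```

On this toy input the OFC alone already determines the flow along the image gradient, and the iteration converges towards u = 1, v = 0. On real images the neighbour averaging is what propagates flow into regions where the OFC is ambiguous, and it is also the source of the smoothing across edges discussed in this section.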
Although the regularizer is very effective in providing a unique solution to the optical flow problem, it over-smooths the resulting flow fields across edge discontinuities. A framework to deal with the discontinuities was developed by Black and Anandan [8] using graduated non-convexity in a variational approach. The emergence of new variational methods [60, 85] has provided better ways of dealing with the regularization, improving the estimation of optical flow at edge discontinuities.

We have also seen the emergence of fast methods which are based on tracking sparse points, as opposed to dense optical flow methods. Such methods [37, 49, 76] track features across multiple image frames, following the approach of the classic KLT tracker [41, 58, 66]. The KLT tracker uses a coarse-to-fine approach for estimation of optical flow using image pyramids, making it more robust. Several optical flow methods have since followed a coarse-to-fine approach to get more robust estimation.

Deformations in the upper head region surrounding the eye as a result of skin movements are very complex. The tracking of these deformations is affected by occlusion due to buckling and folds. Some methods were specifically targeted towards robust tracking of facial movements. The model based facial tracking introduced by Pighin et al. [48] was successful only in tracking coarse points on the face. This followed other approaches in active-mesh based methods [18, 55, 63, 72, 76], which model the optical flow problem as a physical mass-spring system. Some work on estimation of deformation in various surfaces has been done by [55] and [49]. These works perform well only on linear deformation models and are based on sparse optical flow tracking.

The Active Appearance Models (AAM) introduced by Xiao et al. [80] were a more robust version of face tracking that learned a model by manually annotating positions of facial features in the images. Other AAM based methods [38] also became popular for facial tracking.
As manually annotating all the pixels of the face to generate training data is extremely tedious, these methods could never be used for computing dense optical flow in facial deformations. Most of the work in robust facial tracking is based on sparse trackers like KLT or uses learning approaches like AAM. As such, these methods do not perform very well in capturing the detailed and complex motion of skin, especially around the eyes.

A major baseline improvement resulted from combining feature based tracking with a dense variational framework to compute large and fast motions. Brox and Malik’s Large Displacement Optical Flow (LDOF) [11] performed significantly better than previous methods, especially on large and fast motions. A fast version of this method, which uses parallelization on the GPU, has been implemented by [62]. Improvements to the KLT tracker by Kalal et al. [37] also performed better by detecting tracking failures using a Median Flow tracker. Another baseline improvement was made by Sun et al. [60], who provided various implementation strategies including the use of a weighted median filter. The work by Zimmer et al. [84, 85] provided a theoretical interpretation of improvements in variational methods in terms of diffusion images. This provided a much needed understanding of the optical flow problem and the approaches that guarantee a better solution. A detailed survey of current practices in estimation of optical flow is provided by Sun et al. [61] and Baker et al. [1].

2.1.2 Scene Flow

Scene flow is defined as the motion of 3D points in a scene. As such, it requires depth information and cannot be computed by monocular setups. Thus, scene flow setups include binocular or multi-view stereo capture systems. Unlike structure from motion methods that estimate the 3D representation of static scenes, scene flow estimates the 3D motion of moving scenes. Scene flow estimation was first introduced by Vedula et al. in [73].
The proposed solution was based on solving an overdetermined linear system of equations.

Most of the algorithms for computing scene flow treat structure and motion of the scene separately. The approach often follows sequential computation of shape and motion parameters. This can be seen in Vedula et al.’s seminal work, followed by Wedel et al.’s [77]. However, the joint estimation of shape and motion parameters in a scene flow framework has yielded considerably better results [34, 70]. These methods follow the variational optical flow setup with an additional epipolar constraint to estimate the 3D scene flow.

Multi-camera scene flow has also seen improvements from the use of temporal information [35]. Some approaches have also used a second order variational framework in a joint estimation of scene flow to compute the Lagrangian stress tensor of a deformable body [30]. The new methods have shown that joint estimation of shape and motion components results in a significant increase of performance in computing scene flow.

Our approach follows from Valgaerts et al.’s work [70] that describes the joint estimation of scene flow along with camera intrinsics. The work shows a good improvement over previous methods without using any external measurements for camera calibration. We also look into various optimization methods for solving the variational setup. Our method uses Schmidt et al.’s Stochastic Average Gradient (SAG) [57] for improving the numerical solution obtained from the variational methods.

2.2 Realistic Performance Capture

In the direction of achieving realism in animations, most of the research is based on realistic facial animation [10, 26, 29]. We have seen a handful of methods which focus on the details of the eye [4, 5, 46].

2.2.1 Facial Performance Capture

Detailed facial textures have been estimated using the light stage [47], an array of multiple polarized lights and cameras.
Some systems also include an array of multiple cameras in controlled lighting [10] for a high resolution facial scan.

There has also been significant progress in achieving a high quality performance capture of human faces using cheap measurement setups. The work in [36] uses a hand-held monocular video input to create dynamic 3D avatars and their textures. Using structure from motion techniques, a texture is estimated by taking multiple images of the face from different viewpoints. These images are then stitched together using Poisson blending to get the final texture. While the resolution of these textures is quite low, they can be used in the animation pipeline for creating dynamic avatars.

A monocular camera has been used to reconstruct dynamic facial geometry in [26], without paying any attention to textures. The work of [71] uses a binocular facial performance capture system to model the geometry, while the textures are captured using projective texturing. While both of the above methods have been impressive in capturing facial geometry, they usually reconstruct low resolution textures using primitive algorithms. There has not been enough contribution in reconstructing high resolution textures from a simple and inexpensive setup.

2.2.2 Dynamic Capture of the Eyes

Recently, the computer graphics community has started looking at detailed capture and reconstruction of subtle details of the eye [4, 5, 46]. Capturing the geometric structure of the globe of the eye and its texture has been discussed in [4]. While the authors have a detailed reconstruction system for the shape of the eyelids using multiple cameras, their texture estimation algorithm uses a simple Poisson blending. The detailed spatio-temporal reconstruction of eyelids has been done by Bermano et al. [5]. However, their method is limited to reconstructing blink sequences.
A similar Poisson blending approach has been used in our work [46], which develops a model of periorbital soft tissue deformation based on the gaze of the eyes. These works pay detailed attention to the eye and the region of skin around it, which is important in capturing various subtle movements. A lot of these movements are a result of simple expressions like ‘eye blink’, ‘look around’ and ‘Crow’s Feet’ sequences.

A detailed survey of eye and gaze animation is given in [54]. It discusses various biologically motivated models to simulate eyeball movements such as saccades, smooth pursuits and the vestibulo-ocular reflex. The animation pipelines are mainly concerned with low level characters. There has not been any contribution to capture highly detailed texture and motion of the eyes.

2.3 Texture Estimation

In computer graphics, textures are an important aspect of rendering appealing animations in movies and games. In recent years, computer graphics research has made significant contributions in introducing better realism in computer generated imagery [10, 25]. The developments in generating realistic performance in character animation have also led to improved capture methods.

2.3.1 Multi-view Stereo Methods

Most of the works in realistic performance capture [4, 10, 25, 46] follow multi-view stereo techniques, which capture images of the object using multiple cameras and stitch them together to form the complete texture. These methods are very sophisticated and require a precise setup, including setting up a light stage [47], camera arrays and measuring the physics of skin [29]. Although they are effective in capturing detailed texture, these setups are very expensive and require a lot of time to construct.

2.3.2 Bayesian Methods

In recent years, people have started looking into reconstructing textures using superresolution methods [28, 68]. The work of [28] applies superresolution methods in a variational framework.
Their work was mainly focused on capturing high quality textures of static 3D objects using multi-view stereo, and did not account for dynamic elastic objects such as real human faces.

A Bayesian approach using superresolution to recover textures from multiple videos is discussed in [68]. The method combines the best ideas of superresolution and multi-view stereo in a unified framework. The approach is generalized for reconstructing the 3D texture of static objects under Lambertian conditions. The setup requires calibration of multiple cameras to achieve good performance.

Superresolution

Various methods of estimating a superresolution image from multiple low resolution images exist in the computer vision literature. A detailed survey of superresolution methods is presented in [65]. The early superresolution methods were based on combining frequency domain coefficients [51, 67]. The work of [67] uses a Discrete Fourier Transform (DFT), whereas [51] uses a Discrete Cosine Transform (DCT) to approach the superresolution problem. The superresolution problem has also been approached using interpolation methods [9, 69]. A Moving Least Squares (MLS) interpolation has been used in [9].

In recent years, we have seen more use of Bayesian methods in estimation of superresolution [24, 53]. They are mostly based on Maximum Likelihood (ML) estimates [24] or Maximum a Posteriori (MAP) estimates [64]. There has also been some research in exploring the models of image priors that can be used in a Bayesian approach. The work of [53] uses a Gaussian Markov Random Field as the image prior. The use of optical flow from several video frames to reconstruct a superresolution image using ML estimation has been discussed in [24]. There have also been developments in machine learning methods [75, 81] that learn to predict high resolution patches based on several low resolution patches.
The work of [33] uses high resolution features from a single image to reconstruct low resolution parts of the same image.

2.4 Learning Skin Motion

The animation of character meshes is a well studied problem in computer graphics. In the context of realistic facial performances, the two most widely used approaches are physical models and blend shapes. The physical models [39, 59] rely on accurately modeling the physical characteristics, such as elasticity and strain, of a deformable body. In contrast, blend shapes [10, 25] rely on data driven methods. They use a set of pre-processed meshes and obtain animations by blending them linearly or non-linearly.

The physical models can reproduce all possible poses in an animation. However, constructing a physical model requires a lot of detail in estimating the physical properties of the system. Blend shapes, in contrast, are easier to construct from data driven methods. The performance of a blend shape model depends on the number of blend shapes that are used for animation. As such, covering all possible expressions requires a large number of blend shapes, which leads to storage constraints.

In order to minimize the storage restrictions of blend shapes, and recover all the expressions as in physical models, we use machine learning approaches to model our data driven approach. We use Artificial Neural Network (ANN) models to learn the parameters from our dataset. As a result, we learn a non-linear model that produces a large number of expressions without any storage restrictions.

2.5 Wrinkles

Detecting and modeling wrinkles is a widely researched topic in facial rejuvenation, computer vision, and computer graphics. The characteristics of wrinkles and buckles formed in various materials depend on their material properties and on the underlying substrate.
Physical models of wrinkles have been described in [15, 27]. This work characterizes the amplitude and frequency of wrinkles by the elastic properties of the material, using an experimentally motivated mathematical model. There has been significant research in wrinkle modeling of clothes [50, 52], including the use of a compressive strain model [16].

Tracking facial wrinkles in videos to generate a useful mathematical representation has been a focus of research in graphics and vision. Automatic detection and quantification of facial wrinkles has been discussed in [17], which uses Hong et al.'s algorithm [31] for wrinkle extraction and enhancement. This work, however, does not assess the complex wrinkles formed around the eye. Bando et al. use Bézier curves [2] for modeling wrinkles on human skin. This method employs dynamic modulation of fine-scale and large-scale wrinkles according to the deformation of skin for dynamic animation. Bickel et al. proposed a pose-space deformation technique [7] that learns the dependence of wrinkle formation on skin strain based on a small set of prior poses. The work uses scattered data interpolation with learned RBFs [40, 79] for the transfer of facial details such as wrinkles.

Various Finite Element Methods (FEM) [22, 44] attempt to model wrinkles as volumetric layers, where the interaction among facial tissues determines the shape and frequency of the wrinkles. A Markov Point Process has been used to model wrinkles in [3]. The wrinkles are modeled as stochastic arrangements of line segments, which localize the wrinkles in a Bayesian framework.

Medium-scale facial wrinkles and buckles are modeled using Bézier curves in [6], which improves the appearance of forehead wrinkles. However, the method requires painting the forehead to make it Lambertian for wrinkle modeling. Some recent research also includes blending normal maps to produce wrinkles during facial animation [19].
This technique requires keeping only a few reference poses and corresponding normal maps, making it viable for real-time rendering.

Chapter 3
Motion Estimation

In this chapter, we discuss the methods to estimate the complex motion of the skin in the upper head region. In Section 3.1, we describe the scene flow algorithm to compute the dense 3D motion in images obtained from a binocular camera setup. We show improvements to baseline methods using stochastic optimization techniques. In Section 3.2, we briefly describe the skin flow technique, which tracks the skin of the upper head region in a monocular setup using a high frame rate camera. We develop reduced coordinate spaces to interpret the 3D sliding skin motion using monocular videos.

3.1 Scene Flow

Scene flow is the estimation of the 3D flow field using a binocular camera setup. We follow the work of [70] to model the scene flow as an optimization problem. We summarize their work in this section using similar notation. The model uses consecutive images from two cameras and formulates the problem in a four-frame setup, as shown in Figure 3.1. We estimate the flow fields between images of consecutive frames as well as the flow fields between each of the stereo pairs. We infer the 3D motion of points in the scene from these estimated flow fields.

Figure 3.1: Frames of a stereo setup, reproduced from [70]

3.1.1 Definitions

Our setup includes two cameras, labeled as the left and right cameras. The images from two consecutive frames are used, at time instants $t$ and $t+1$. The two consecutive frames of the left camera are denoted by $g_{1L}$ and $g_{2L}$. Similarly, the consecutive frames of the right camera are denoted by $g_{1R}$ and $g_{2R}$. This makes a four-frame setup for scene flow estimation across the frames $(g_{1L}, g_{2L}, g_{1R}, g_{2R})$. We define the position vector $\mathbf{x} = (x, y)^T$ that describes the coordinates of the image plane. We denote the fundamental matrix of the stereo setup by $F$. The optical flow of the left camera is given by $\mathbf{w}_f = (u_f, v_f)^T$.
We represent the stereo flow between $g_{1L}$ and $g_{1R}$ by $\mathbf{w}_{st} = (u_{st}, v_{st})^T$. There is also a difference flow $\mathbf{w}_d = (u_d, v_d)^T$, which measures the change in stereo flow between the two cameras from $t$ to $t+1$. It can also be seen as a change in optical flow in the right camera images. Here, $u$ and $v$ are the flow components along the $x$ and $y$ directions in the image. By estimating these six parameters, $\{\mathbf{w}_f, \mathbf{w}_{st}, \mathbf{w}_d\}$, we can determine the three dimensional flow of a scene.

3.1.2 Scene Flow Model

Scene flow estimation is modeled as minimizing the energy functional given by

\[ E = \int_{\Omega} (E_D + E_E + E_S)\, d\mathbf{x}, \tag{3.1} \]

where $E_D$ is the data term that enforces the constraint that features present in a scene remain the same across the four frames. $E_E$ refers to the epipolar constraint, which enforces that image pixels of stereo pairs lie along the epipolar lines. $E_S$ is the smoothness constraint that regularizes the scene flow computation. The integral is over all the image pixels $\mathbf{x}$ in the domain of the image plane, $\Omega$. In our work, we use an efficient optimizer, SAG [57], to estimate the numerical solution.

Data Term

The data term $E_D$ models four constraints between the input frames, given by

\[ E_{D1} = \Psi\left(|g_{2L}(\mathbf{x}+\mathbf{w}_f) - g_{1L}(\mathbf{x})|^2\right), \tag{3.2} \]
\[ E_{D2} = \Psi\left(|g_{2R}(\mathbf{x}+\mathbf{w}_f+\mathbf{w}_{st}+\mathbf{w}_d) - g_{1R}(\mathbf{x}+\mathbf{w}_{st})|^2\right), \tag{3.3} \]
\[ E_{D3} = \Psi\left(|g_{1R}(\mathbf{x}+\mathbf{w}_{st}) - g_{1L}(\mathbf{x})|^2\right), \tag{3.4} \]
\[ E_{D4} = \Psi\left(|g_{2R}(\mathbf{x}+\mathbf{w}_f+\mathbf{w}_{st}+\mathbf{w}_d) - g_{2L}(\mathbf{x}+\mathbf{w}_f)|^2\right), \tag{3.5} \]

where $E_{D1}$ and $E_{D2}$ define the optical flow constraint between consecutive images of the left and right cameras respectively. $E_{D3}$ and $E_{D4}$ model the stereo flow constraint between the stereo pairs at times $t$ and $t+1$ respectively. We use a robust convex penalty function $\Psi(s^2) = \sqrt{s^2 + \varepsilon^2}$ and set $\varepsilon = 0.001$. The sub-quadratic penalization function $\Psi(s^2)$ was introduced by Black and Anandan in [8] and can be seen as an L1 approximation. It provides robustness against outliers in the data terms caused by noise and occlusion. We also assume gradient constancy and include the RGB channels of the images in our data term.
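To make the effect of the robust penalty concrete, the following minimal sketch compares $\Psi(s^2) = \sqrt{s^2 + \varepsilon^2}$ with a plain quadratic penalty. Only the formula and $\varepsilon = 0.001$ come from the text; the residual values, the function names and the comparison itself are hypothetical and are meant purely as an illustration.

```python
import math

EPS = 0.001  # value of epsilon stated in the text


def psi(s_squared, eps=EPS):
    """Robust penalty Psi(s^2) = sqrt(s^2 + eps^2), a smooth approximation of |s|."""
    return math.sqrt(s_squared + eps * eps)


def quadratic(s_squared):
    """Plain squared-error penalty, shown only for comparison."""
    return s_squared


# Hypothetical brightness residuals: four inliers and one outlier (assumed values).
residuals = [0.01, -0.02, 0.015, -0.005, 5.0]

robust_cost = sum(psi(r * r) for r in residuals)
quadratic_cost = sum(quadratic(r * r) for r in residuals)

# Share of the total cost contributed by the outlier under each penalty.
outlier_share_robust = psi(5.0 ** 2) / robust_cost
outlier_share_quadratic = quadratic(5.0 ** 2) / quadratic_cost
```

Under the quadratic penalty the single outlier accounts for almost the entire cost, whereas under $\Psi$ its contribution grows only linearly with the residual, so the inliers keep relatively more influence.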
Hence, our data term in Equation 3.2 becomes

\[ E_{D1} = \Psi\Big(\sum_{i \in \{r,g,b\}} \big(|g^i_{2L}(\mathbf{x}+\mathbf{w}_f) - g^i_{1L}(\mathbf{x})|^2 + \gamma\,|\nabla g^i_{2L}(\mathbf{x}+\mathbf{w}_f) - \nabla g^i_{1L}(\mathbf{x})|^2\big)\Big), \tag{3.6} \]

where $\gamma \ge 0$ is a constant that denotes the influence of the gradient constancy assumption, $i \in \{r,g,b\}$ denotes the RGB color channels, and $\nabla$ is the spatial gradient operator. We modify the other data terms accordingly to include the RGB channels and gradient constancy. We finally combine the data terms and get

\[ E_D = E_{D1} + E_{D2} + E_{D3} + E_{D4}. \tag{3.7} \]

Epipolar Term

The epipolar energy term models the geometric relationship between the stereo image pairs and applies the constraint along the epipolar lines of the two cameras. The fundamental matrix $F$ models the geometric transform between the stereo cameras and defines the epipolar constraint. The epipolar terms are given by

\[ E_{E1} = \Psi\Big(\big((\mathbf{x}+\mathbf{w}_{st})_h^T F\,(\mathbf{x})_h\big)^2\Big), \tag{3.8} \]
\[ E_{E2} = \Psi\Big(\tfrac{1}{2}\big((\mathbf{x}+\mathbf{w}_f+\mathbf{w}_{st}+\mathbf{w}_d)_h^T F\,(\mathbf{x}+\mathbf{w}_a)_h\big)^2 + \tfrac{1}{2}\big((\mathbf{x}+\mathbf{w}_a+\mathbf{w}_{st}+\mathbf{w}_d)_h^T F\,(\mathbf{x}+\mathbf{w}_f)_h\big)^2\Big) + \mu\big(|\mathbf{w}_f-\mathbf{w}_a|^2\big), \tag{3.9} \]

where $(\mathbf{x})_h = (x, y, 1)^T$ denotes $\mathbf{x}$ in homogeneous coordinates. The epipolar constraints force corresponding pixels to lie along the same epipolar line. $E_{E1}$ is linear in $\mathbf{w}_{st}$, whereas $E_{E2}$ is linear in $\mathbf{w}_{st}$ and $\mathbf{w}_d$ but quadratic in $\mathbf{w}_f$. Hence, as in [70], we introduce an auxiliary variable $\mathbf{w}_a$ to make $E_{E2}$ linear in $\mathbf{w}_f$, $\mathbf{w}_{st}$ and $\mathbf{w}_d$. $\mu$ is the coupling term between $\mathbf{w}_f$ and $\mathbf{w}_a$. Finally, we combine the epipolar terms to get

\[ E_E = \beta_1 E_{E1} + \beta_2 E_{E2}, \tag{3.10} \]

where $\beta_1$ and $\beta_2$ are constants.

Smoothness Term

In order to avoid flow discontinuities, regularization terms that encourage spatial smoothness are added to the energy functional. The smoothness regularization terms are given by

\[ E_{S1} = \Psi\left(|\nabla \mathbf{w}_f|^2\right), \tag{3.11} \]
\[ E_{S2} = \Psi\left(|\nabla \mathbf{w}_{st}|^2\right), \tag{3.12} \]
\[ E_{S3} = \Psi\left(|\nabla \mathbf{w}_d|^2\right), \tag{3.13} \]
\[ E_S = \alpha_1 E_{S1} + \alpha_2 E_{S2} + \alpha_3 E_{S3}, \tag{3.14} \]

where $\alpha_1, \alpha_2, \alpha_3$ are constants.

3.1.3 Numerical Solution

In order to obtain the solution, we discretize the energy functional (Equation 3.1). We use a coarse-to-fine approach using image pyramids to solve for incremental flows. Let $\mathbf{d} = (du_f, dv_f, du_{st}, dv_{st}, du_d, dv_d, 1)^T$.
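Then, the data term from Equation 3.2 can be factorized in terms of $\mathbf{d}$ as $E_{D1} = \Psi(\mathbf{d}^T \hat{J}_1 \mathbf{d})$, and a data tensor $\hat{J}_1$ is obtained as shown in [70]. Similarly, the data tensors $\hat{J}_2$, $\hat{J}_3$ and $\hat{J}_4$ can be obtained by factorizing the other data terms.

In the same way, we can obtain epipolar tensors by defining

\[ \mathbf{d}_1 = (du_{st}, dv_{st}, 1)^T, \tag{3.15} \]
\[ \mathbf{d}_2 = (du_f + du_{st} + du_d,\; dv_f + dv_{st} + dv_d,\; 1)^T, \tag{3.16} \]
\[ \mathbf{d}_3 = (du_a, dv_a, 1)^T, \tag{3.17} \]
\[ \mathbf{d}_4 = (du_a + du_{st} + du_d,\; dv_a + dv_{st} + dv_d,\; 1)^T, \tag{3.18} \]
\[ \mathbf{d}_5 = (du_f, dv_f, 1)^T. \tag{3.19} \]

The epipolar tensors $\{\hat{E}_i \mid i \in \{1,2,3,4,5\}\}$ can be factorized from the epipolar terms of Section 3.1.2 as

\[ E_{E1} = \Psi\left(\mathbf{d}_1^T \hat{E}_1 \mathbf{d}_1\right), \tag{3.20} \]
\[ E_{E2} = \Psi\left(\tfrac{1}{4}\mathbf{d}_2^T \hat{E}_2 \mathbf{d}_2 + \tfrac{1}{4}\mathbf{d}_3^T \hat{E}_3 \mathbf{d}_3 + \tfrac{1}{4}\mathbf{d}_4^T \hat{E}_4 \mathbf{d}_4 + \tfrac{1}{4}\mathbf{d}_5^T \hat{E}_5 \mathbf{d}_5\right) + \mu\big(|\mathbf{w}_f + d\mathbf{w}_f - \mathbf{w}_a - d\mathbf{w}_a|^2\big). \tag{3.21} \]

Thus, the energy function can be represented in terms of incremental flows as

\[
\begin{aligned}
E(d\mathbf{w}_f, d\mathbf{w}_{st}, d\mathbf{w}_d, d\mathbf{w}_a) = \int_{\Omega} \Big( & \Psi(\mathbf{d}^T \hat{J}_1 \mathbf{d}) + \Psi(\mathbf{d}^T \hat{J}_2 \mathbf{d}) + \Psi(\mathbf{d}^T \hat{J}_3 \mathbf{d}) + \Psi(\mathbf{d}^T \hat{J}_4 \mathbf{d}) \\
& + \beta_1 \Psi\left(\mathbf{d}_1^T \hat{E}_1 \mathbf{d}_1\right) \\
& + \beta_2 \Psi\left(\tfrac{1}{4}\mathbf{d}_2^T \hat{E}_2 \mathbf{d}_2 + \tfrac{1}{4}\mathbf{d}_3^T \hat{E}_3 \mathbf{d}_3 + \tfrac{1}{4}\mathbf{d}_4^T \hat{E}_4 \mathbf{d}_4 + \tfrac{1}{4}\mathbf{d}_5^T \hat{E}_5 \mathbf{d}_5\right) \\
& + \alpha_1 \Psi\left(|\nabla(\mathbf{w}_f + d\mathbf{w}_f)|^2\right) + \alpha_2 \Psi\left(|\nabla(\mathbf{w}_{st} + d\mathbf{w}_{st})|^2\right) + \alpha_3 \Psi\left(|\nabla(\mathbf{w}_d + d\mathbf{w}_d)|^2\right) \\
& + \beta_2\, \mu\big(|\mathbf{w}_f + d\mathbf{w}_f - \mathbf{w}_a - d\mathbf{w}_a|^2\big) \Big)\, d\mathbf{x}.
\end{aligned} \tag{3.22}
\]

In order to solve for the stepwise flow increments, we write the Euler-Lagrange forms [20] of Equation 3.22 and factor out the incremental flow terms as

\[ \partial_u E = A_u \mathbf{u} - \mathbf{b}_u, \tag{3.23} \]
\[ \partial_v E = A_v \mathbf{v} - \mathbf{b}_v, \tag{3.24} \]

where $\mathbf{u} = (du_f, du_{st}, du_d, du_a)$ and $\mathbf{v} = (dv_f, dv_{st}, dv_d, dv_a)$. We now combine the Euler-Lagrange forms in a matrix representation as

\[ \begin{bmatrix} \partial_u E \\ \partial_v E \end{bmatrix} = \begin{bmatrix} A_u & 0 \\ 0 & A_v \end{bmatrix} \begin{bmatrix} \mathbf{u} \\ \mathbf{v} \end{bmatrix} - \begin{bmatrix} \mathbf{b}_u \\ \mathbf{b}_v \end{bmatrix}. \tag{3.25} \]

The right hand side can be represented as $A\,d\mathbf{w} - \mathbf{b}$, where $d\mathbf{w}$ is an $8n$-dimensional vector. $d\mathbf{w}$ represents the incremental flows $(\mathbf{u}^T, \mathbf{v}^T)^T$ of all the $n$ pixels in the image as a single column vector in lexicographic order. The linear system is solved iteratively by logistic regression using SAG [57], that is, by minimizing

\[ f(d\mathbf{w}) = \sum_{i=1}^{8n} \log\left(1 + \exp(-b_i\, d\mathbf{w}^T \mathbf{a}_i)\right) + \frac{\lambda}{2}\, \|d\mathbf{w}\|^2, \tag{3.26} \]

where $\mathbf{a}_i$ are the rows of the matrix $A$ and $b_i$ are the entries of the vector $\mathbf{b}$.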
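The objective of Equation 3.26 and a stochastic average gradient iteration for it can be sketched as follows. This is a generic, minimal version of SAG written for illustration, not the implementation of [57] or the one used in this work. The toy data, the labels $b_i \in \{-1, +1\}$, the step size $1/L$ and the number of epochs are all assumptions; only the form of the objective and $\lambda = 0.1$ are taken from the text.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def objective(w, A, b, lam):
    """f(w) = sum_i log(1 + exp(-b_i * w.a_i)) + (lam / 2) * ||w||^2, as in Equation 3.26."""
    total = 0.0
    for a_i, b_i in zip(A, b):
        margin = b_i * sum(x * y for x, y in zip(a_i, w))
        # Stable evaluation of log(1 + exp(-margin)).
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total + 0.5 * lam * sum(x * x for x in w)


def sag(A, b, lam, n_epochs=200, seed=0):
    """Stochastic average gradient on the objective above (labels assumed in {-1, +1})."""
    rng = random.Random(seed)
    n, dim = len(A), len(A[0])
    w = [0.0] * dim
    memory = [[0.0] * dim for _ in range(n)]  # last gradient seen for each loss term
    grad_sum = [0.0] * dim                    # running sum of the stored gradients
    # Assumed step size 1/L, with L a Lipschitz bound for one term of (1/n) * f.
    lipschitz = 0.25 * max(sum(x * x for x in a) for a in A) + lam / n
    step = 1.0 / lipschitz
    for _ in range(n_epochs * n):
        i = rng.randrange(n)
        margin = b[i] * sum(x * y for x, y in zip(A[i], w))
        coeff = -b[i] * sigmoid(-margin)      # derivative of log(1 + exp(-margin))
        new_grad = [coeff * x for x in A[i]]
        for k in range(dim):
            grad_sum[k] += new_grad[k] - memory[i][k]
        memory[i] = new_grad
        # Step along the averaged stored gradients plus the gradient of the L2 term.
        w = [w_k - step * (grad_sum[k] + lam * w_k) / n for k, w_k in enumerate(w)]
    return w


# Hypothetical toy problem with six rows a_i and labels b_i.
A = [[1.0, 2.0], [2.0, 1.0], [1.5, 1.5], [-1.0, -2.0], [-2.0, -1.0], [-1.5, -1.0]]
b = [1, 1, 1, -1, -1, -1]
lam = 0.1  # regularization weight reported in the text

w_hat = sag(A, b, lam)
```

SAG stores one gradient per term and updates only the sampled one at each step, which keeps the cost of an iteration independent of the number of terms. That property is what makes it attractive when the sum runs over all $8n$ entries of a large image.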
We apply an L2-regularization on the flow increments as shown in Equation 3.26.

Figure 3.2: Frames of General Sphere dataset from [70]: Sequence from the left camera (left) and corresponding sequence from the right camera (right).

3.1.4 Experiments and Results

We obtain the stereo image sequence $g_{1L}, g_{2L}, g_{1R}, g_{2R}$ from the dataset provided by Valgaerts et al. [70]. It consists of a synthetic sequence of a rotating textured sphere, as shown in Figure 3.2. The fundamental matrix $F$ is also provided in the dataset. In a real scenario, the sequence can be obtained from a stereo camera setup and the fundamental matrix can be obtained using camera calibration techniques.

We initialize the flow fields $(u_f, v_f, u_{st}, v_{st}, u_d, v_d)^T$ using Brox and Malik's LDOF [11]. We then follow a coarse-to-fine approach using image pyramids with a downsampling factor of $\eta = 0.9$. At each level of the pyramid, we compute the data and epipolar tensors by factoring out the incremental flows from the corresponding terms (Section 3.1.3). We then rearrange the terms in a linear system $A\,d\mathbf{w} - \mathbf{b}$ and solve it using SAG [57] with $\lambda = 0.1$. The framework is described in algorithm 1.

We compare our scene flow algorithm with the pre-existing algorithms and show the results in Table 3.1. We see that our scene flow performs better than pre-existing methods with the introduction of the stochastic optimizer and the L2 regularization term for flow increments. The qualitative results of scene flow estimation are shown in Figure 3.3.

Figure 3.3: Results of Scene Flow estimation (top) with Ground truths (bottom). From left to right: forward flow ($\mathbf{w}_f$), stereo flow ($\mathbf{w}_{st}$) and difference flow ($\mathbf{w}_d$). The color coding is shown on top-left. (Regions under occlusion are manually removed)

Table 3.1: Comparison of RMS errors for Scene Flow. N: Normalization, SR: Separate Regularization, JR: Joint Regularization.

| Method | RMS error $(u_f, v_f, u_d, v_d)$ | RMS error $(u_f, v_f)$ | RMS error $(u_{st}, v_{st})$ |
|---|---|---|---|
| Our method initialized with [11] | 0.49 | 0.54 | 1.60 |
| Valgaerts et al. [70] (N + SR) | 0.61 | 0.59 | 1.61 |
| Valgaerts et al. [70] (N + JR) | 0.63 | 0.59 | 1.86 |
| Valgaerts et al. [70] (JR) | 0.67 | 0.64 | 2.08 |
| Wedel et al. [77] with Ground Truth | 2.40 | 0.65 | - |
| Wedel et al. [77] (87%) | 2.45 | 0.66 | 2.9 |
| Huguet and Devernay [34] | 2.51 | 0.69 | 3.8 |
| Wedel et al. [77] (100%) | 2.55 | 0.77 | 10.9 |

Algorithm 1: Scene Flow Estimation
Data: Sequence: $g_{1L}, g_{2L}, g_{1R}, g_{2R}$; Fundamental Matrix, $F$
Result: Scene Flow: $(u_f, v_f, u_{st}, v_{st}, u_d, v_d)^T$
    Initialize $(u_f, v_f)$ from $g_{1L}, g_{2L}$, using [11];
    Initialize $(u_{st}, v_{st})$ from $g_{1L}, g_{1R}$, using [11];
    Initialize $(u_a, v_a) \leftarrow (u_f, v_f)$;
    Compute $(\tilde{u}_f, \tilde{v}_f)$ from $g_{1R}, g_{2R}$, using [11];
    Compute $(\hat{u}_f, \hat{v}_f)$ by warping $(\tilde{u}_f, \tilde{v}_f)$ using $(u_{st}, v_{st})$;
    Initialize $(u_d, v_d) \leftarrow (\hat{u}_f, \hat{v}_f)$;
    Compute pyramid images with $\eta = 0.9$;
    for each level of the pyramid do
        Fetch the downsampled pyramid frames $g^i_{1L}, g^i_{2L}, g^i_{1R}, g^i_{2R}$;
        Resample flows $\mathbf{w} := (u_f, v_f, u_{st}, v_{st}, u_d, v_d, u_a, v_a)^T$;
        Compute Data Tensors $\hat{J}_1, \hat{J}_2, \hat{J}_3, \hat{J}_4$;
        Compute Epipolar Tensors $\hat{E}_1, \hat{E}_2, \hat{E}_3, \hat{E}_4, \hat{E}_5$;
        Initialize flow increments $d\mathbf{w} \leftarrow 0$;
        Arrange all tensors in the matrix form $A\,d\mathbf{w} - \mathbf{b}$;
        while $\|A\,d\mathbf{w} - \mathbf{b}\| > tol$ do
            Compute $d\mathbf{w}$ by solving Equation 3.26 using [57];
            Update $A(\mathbf{w}) \leftarrow A(\mathbf{w} + d\mathbf{w})$;
        end
        Update Flows, $\mathbf{w} \leftarrow \mathbf{w} + d\mathbf{w}$;
    end

3.2 Skin Flow

We now develop a system to compute the 3D flow of the skin using a high frame rate monocular camera and a pre-computed 3D mesh. We accomplish this by representing the skin using reduced spaces. To represent the motion of skin on the face, and its measurement by video cameras, we use the reduced coordinate representation of skin introduced by [39].
In Section 3.2.1, we summarize our method; the details are described in [45].

3.2.1 Skin Tracking Model

We represent skin by a 2D map, denoted as the skin atlas in Figure 3.4. The skin slides on a fixed 3D "body" corresponding to the shape of the head around the eyes. The skin atlas is mapped to 3D skin in skin space using the parameterization $\pi$. The skin can then undergo a rigid transformation $\phi_0$ to reach an initial state $\mathbf{x}$ at $t = 0$ in physical space. The skin undergoes deformations $\phi_t$ with time to reach $\mathbf{x}_t$. The projection of these states can be observed by tracking the corresponding points $\mathbf{u}_t$ using a single camera. $P$ denotes the projection transform.

Figure 3.4: Overview of the spaces related to skin tracking.

The parameterization $\pi$ is obtained using Faceshift [78], and $\phi_0$ is estimated using the Orthogonal Procrustes algorithm. The projected points $\mathbf{u}_t$ are tracked using LDOF [11]. The states $\mathbf{x}_t$ are then inferred using the projection $P$. The flow corrections are applied by observing the skin atlas using the estimated values of $\pi$ and $\phi_0$. Flows are corrected such that a point tracked across the video remains the same as the corresponding skin point in the skin atlas.

3.2.2 Experiments and Results

To measure motion around the eye region, we used a single Grasshopper3¹ camera that can capture up to 120 fps with an image resolution of 1960×1200 pixels. The setup is shown in Figure 3.5. The actor sits on a chair and faces the camera with the head rested on a chin rest. The scene is lit by a DC powered LED source² to overcome the flickering caused by aliasing effects of an AC light source in high frame rate capture. We use polarizing filters with the camera to reduce specularity.

¹ Point Grey Research, Vancouver, Canada
² https://www.superbrightleds.com

Camera Calibration

We perform camera calibration using Matlab's Computer Vision Toolbox. The camera used for capturing video data is calibrated using a standard checkerboard grid.
We use a 7×9 grid of black and white squares, as shown in Figure 3.6, and take snapshots of it at different viewing angles. The width of each square is 7.25 mm. The checkerboard is placed in the same position as the face of the subject during video capture. After obtaining the checkerboard patterns at different orientations, we reconstruct the position of the checkerboard planes in 3D (Figure 3.6) and compute the projection matrix $P$ using [83].

Mesh Reconstructions

Our skin tracking algorithm requires a subject specific 3D mesh and the parametrization $\pi$. To acquire these, we use FaceShift [78] technology with a Kinect RGB/D camera. This process takes less than 15 minutes per subject.

Figure 3.5: The capture setup used in our experiments.

Figure 3.6: Checkerboard snapshots for Calibration (left) and Reconstruction (right)

Figure 3.7: Tracking results for monocular videos. From top to bottom: Images from videos for three examples, corresponding 3D reconstructed mesh, and 3D meshes with reconstructed texture.

We show the reconstructions obtained with our skin tracking approach in Figure 3.7. The 3D mesh is reconstructed robustly using monocular skin tracking. The tracking of the skin around the eyes and eyelids in the reconstruction can be compared with the original video. Thus, we obtain a realistic reconstruction of skin movement in the 3D meshes.

Chapter 4
Texture Estimation

In this chapter, we discuss the framework for recovering the skin texture around the eye and the texture of the eyelashes. We start by formulating our problem as a superresolution estimation problem based on optical flow. We follow an approach of texture estimation in a Bayesian framework. We then introduce material mask functions to recover eyelashes or skin selectively under mutual occlusion. We show the recovered textures and the improvements.

4.1 Overview

Developments in dynamic performance capture of faces have led to their high quality and realistic rendering [25].
However, there has not been significant development in capturing the details of the eyes. Eyes have extremely fast and subtle motions, such as blinks and saccades, which are difficult to capture. This also affects the tracking of fast motions, the structure and the texture of the eyes. Capturing the texture of the details of the eye, such as the upper and lower eyelids and the eyelashes, is affected not only by the eye's fast motion but also by occlusion in natural expressions.

This chapter is motivated by the detailed capture of the texture around the eye in natural expressions. We use the motion of the eyes across multiple frames of a high frame rate camera to capture the detailed texture. Using optical flow methods, we track the details of the eyes across various frames and reconstruct the texture using a Bayesian approach. We introduce material mask functions in order to address the occlusions that arise as a result of eyelash and skin motion. We show that, using an optical flow based superresolution approach on video data of natural expressions, we can reconstruct a high resolution texture of the eyes.

4.2 Theory

We formulate the estimation of the texture of the eyes using an optical flow based superresolution approach motivated by [24]. Our formulation uses Bayesian estimation in a generative imaging model. We consider the estimation of the texture of the eyelashes and of the skin when each occludes the other. Under the small and fast motions of the eye, parts of the skin are often occluded by the eyelashes and are visible only for a short duration. As such, both the eyelashes and the eyelid skin are difficult to track, as their corresponding pixel features change dynamically. Moreover, videos of the eyes at a high frame rate contain a lot of noise, which we model as Gaussian random noise.
In our framework, we jointly estimate the texture, using optical flow, and the noise present in the input data.

In Figure 4.1, we see the motion of the eyelashes in two frames that are 13 frames apart in a 200 fps video sequence. A closer look at such a video sequence highlights the horizontal motion of the skin underneath, as a result of forces from the orbicularis muscle of the eye.

Figure 4.1: The first and last frames of the eyelashes in motion. As the eyelid moves up, the lower lid skin has a subtle sideways motion revealing occluded skin texture.

4.2.1 Definitions

We consider a set of $2n+1$ images, $[I_{-n}(\mathbf{x}) \ldots I_0(\mathbf{x}) \ldots I_n(\mathbf{x})]$, obtained from our video sequence. Each image obtained using a high frame rate camera has a resolution of $l \times l$. Each value of $\mathbf{x}$ represents a pixel in the image. In computer graphics, an object is associated with its structure, the mesh geometry, and the texture associated with it. We would like to construct a high resolution texture image $J_0(\mathbf{x})$ with a resolution of $h \times h$, which corresponds to the reference image $I_0(\mathbf{x})$. Each value of $\mathbf{x}$ in $J_0$ represents a texel in the texture image. Our approach uses the neighboring images $I_i(\mathbf{x})$ in the video sequence to obtain the high resolution texture image.

We also define a transformation $T_i$ that maps a point $\mathbf{x}$ in the texture space of $J_0(\mathbf{x})$ to a corresponding point $T_i(\mathbf{x})$ in $I_i(\mathbf{x})$. The transformation $T_i$ can be decomposed into an optical flow field $\mathbf{w}_i(\mathbf{x})$ and a barycentric mapping $D(\mathbf{x})$ from the texture image to the video image. Therefore, a texel in the texture space $J_0(\mathbf{x})$ is warped using the flow $\mathbf{w}_i(\mathbf{x})$ and then mapped to the image $I_i(\mathbf{x})$ using the barycentric transform $D(\mathbf{x})$. Note that the flow fields $\mathbf{w}_i(\mathbf{x})$ are defined on the texture space corresponding to the high resolution texture image $J_0(\mathbf{x})$. The mapping can hence be written as

\[ T_i(\mathbf{x}) = D(\mathbf{x} + \mathbf{w}_i(\mathbf{x})). \]

In a similar manner, we can define the inverse transformation $T_i^{-1}$ that maps a point $\mathbf{x}$ in $I_i(\mathbf{x})$ to a corresponding point $T_i^{-1}(\mathbf{x})$ in the texture image $J_0(\mathbf{x})$.
It is given by

\[ T_i^{-1}(\mathbf{x}) = D^{-1}(\mathbf{x}) - \mathbf{w}_i(T_i^{-1}(\mathbf{x})). \]

Figure 4.2: The transform $T_i$ maps texture space (top) and the image frames (bottom). The transform $T_i$ can be represented as a combination of the barycentric transform $D$ and the flow fields $\mathbf{w}_i$.

4.2.2 Generative Model

We are now in a position to describe our imaging model. In order to generate the video image $I_i(\mathbf{x})$, we warp the texture image $J_0(\mathbf{x})$ by the flow field $\mathbf{w}_i(\mathbf{x})$ and then project it on to image space using the barycentric transform $D$. This process can be represented by a single transformation $T_i(\mathbf{x})$, as described in Section 4.2.1. We apply an optical blur using a Gaussian filter $g$ after the transformation $T_i(\mathbf{x})$. We also model the random noise present in the system using an Additive White Gaussian Noise (AWGN) term, $\varepsilon$. The generative model is formulated as

\[ I_i(\mathbf{x}) = g * J_0(T_i^{-1}(\mathbf{x})) + \varepsilon, \tag{4.1} \]

where $\varepsilon \sim N(0, \Sigma)$, and $g$ simulates a Gaussian optical blur using the convolution operator, $*$.

4.2.3 Bayesian Estimation

The generative model produces images $I_i(\mathbf{x})$ from the texture $J_0(\mathbf{x})$ depending on the flow field $\mathbf{w}_i$ and the noise parameter $\Sigma$ that models its variance. Thus, we formulate our problem as the estimation of $\theta = \{J_0, \mathbf{w}_i, \Sigma\}$ that maximizes the posterior,

\[ p(\theta \mid I) \propto p(I \mid \theta)\, p(\theta). \tag{4.2} \]

In order to obtain the conditional data likelihood term, we make i.i.d. assumptions on the image noise for all pixels in the image. The conditional data likelihood term $p(I \mid \theta)$ then becomes the product of the probabilities of individual observed pixels over all the frames,

\[ p(I \mid \theta) = \prod_{i=-n}^{n} \prod_{\mathbf{x}} p(I_i(T_i(\mathbf{x})) \mid \theta), \tag{4.3} \]

where $\mathbf{x} \in [1,h] \times [1,h]$ are the texels in the high resolution texture image. The conditional probability of an observed texel value can be modeled by a normal distribution given the previous estimates of the texture, $J_0$, and the noise variance, $\Sigma$, as

\[ p(I \mid \theta) = \prod_{i=-n}^{n} \prod_{\mathbf{x}} \left[ \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} \mathbf{m}_i^T \Sigma^{-1} \mathbf{m}_i\right) \right], \tag{4.4} \]

where $\mathbf{m}_i = I_i(T_i(\mathbf{x})) - g * J_0(\mathbf{x})$, and $d$ is the dimensionality of $\mathbf{m}_i$, which is equal to the number of channels in the image frame.
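As a small numerical illustration of the per-texel factor in Equation 4.4, the sketch below evaluates its negative logarithm for one texel. It assumes a diagonal covariance $\Sigma$ for simplicity, which is an assumption made here and not a statement from the text. The RGB values, the variances and the function name are hypothetical.

```python
import math


def neg_log_likelihood(residual, variances):
    """-log N(residual; 0, diag(variances)) for one texel with d channels."""
    d = len(residual)
    log_det = sum(math.log(v) for v in variances)  # log|Sigma| for a diagonal Sigma
    mahalanobis = sum(r * r / v for r, v in zip(residual, variances))
    return 0.5 * d * math.log(2.0 * math.pi) + 0.5 * log_det + 0.5 * mahalanobis


# Hypothetical RGB values: observed pixel I_i(T_i(x)) and blurred texture g * J_0(x).
observed = [0.52, 0.40, 0.31]
predicted = [0.50, 0.41, 0.30]
m_i = [o - p for o, p in zip(observed, predicted)]

sigma_diag = [0.01, 0.01, 0.01]  # assumed per-channel noise variances
nll_small = neg_log_likelihood(m_i, sigma_diag)
nll_large = neg_log_likelihood([10.0 * r for r in m_i], sigma_diag)
```

A larger residual gives a larger negative log-likelihood, so maximizing the likelihood pulls the blurred texture estimate toward the observed pixel values.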
$\mathbf{m}_i$ models the error between the observed pixel values in the image $I_i$ and the previous estimate of the high resolution texture $J_0$ under our generative model.

Now, we turn to the formulation of our prior term. Several different prior terms for Bayesian estimation have been proposed before. When no data is available, these prior terms generally model smoothness over texels to regularize the optimization. We factorize the prior term into texture dependent and flow dependent parts, motivated by [24]. The prior term can be written as

\[ p(\theta) \propto p(J_0) \prod_{i=-n}^{n} p(\mathbf{w}_i), \tag{4.5} \]

where $p(J_0)$ is the texture dependent term and $p(\mathbf{w}_i)$ is the flow dependent term over individual frames. The flow dependent prior is modeled as an exponential distribution parametrized by the width $\lambda_f$. This prior applies a smoothness constraint over the optical flow field $\mathbf{w}_i(\mathbf{x})$ and is given by

\[ p(\mathbf{w}_i) \propto \exp\left(-\frac{1}{\lambda_f} \sum_{\mathbf{x}} \|\nabla \mathbf{w}_i(\mathbf{x})\|^2\right). \tag{4.6} \]

This prior is generally used when computing the flow fields on a coarse-to-fine grid, which involves optimization over multiple layers [1]. We introduce a second prior for the optical flow field that can be modeled by precomputing flow fields using [62]. The model uses an exponential distribution that expresses the deviation of the new flow fields $\mathbf{w}_i$ from the precomputed flow fields $\mathbf{w}_i^*$. Here, $\mathbf{w}_i^*$ is the optical flow between $J_0(T_0(\mathbf{x}))$ and $J_0(T_i(\mathbf{x}))$. This can be written as

\[ p(\mathbf{w}_i) \propto \exp\left(-\frac{1}{\lambda_f} \sum_{\mathbf{x}} \|\mathbf{w}_i(\mathbf{x}) - \mathbf{w}_i^*(\mathbf{x})\|^2\right). \tag{4.7} \]

Similarly, we can introduce a prior on the high resolution texture $J_0$ that imposes a smoothness constraint. The prior can be modeled as an exponential distribution with a width $\lambda_j$,

\[ p(J_0) \propto \exp\left(-\frac{1}{\lambda_j} \sum_{\mathbf{x}} \|\nabla J_0(\mathbf{x})\|^2\right). \tag{4.8} \]

However, when observations from the image frames are available, we would like to model our prior based on the available data. To obtain this prior, we use the average of all the image frames, transformed on to texture image coordinates, which is given by

\[ J_0^*(\mathbf{x}) = \frac{1}{2n+1} \sum_{i} I_i(T_i(\mathbf{x})). \tag{4.9} \]

We express the prior on the texture as an exponential distribution of the error over $J_0^*(\mathbf{x})$.
The prior over $J_0$ becomes

\[ p(J_0) \propto \exp\left(-\frac{1}{\lambda_j} \sum_{\mathbf{x}} \|J_0(\mathbf{x}) - J_0^*(\mathbf{x})\|^2\right). \tag{4.10} \]

However, the above prior results in artifacts in the computed texture image. The artifacts are generally due to discontinuities in the texture. Hence, we combine the best features of Equation 4.8 and Equation 4.10 to obtain a prior that removes the discontinuities by introducing a smoothness term. We obtain the prior as

\[ p(J_0) \propto \exp\left(-\frac{1}{\lambda_j} \sum_{\mathbf{x}} \nabla \|J_0(\mathbf{x}) - J_0^*(\mathbf{x})\|^2\right). \tag{4.11} \]

After obtaining the likelihood and prior terms of the Bayesian estimation, we proceed to compute the posterior and optimize it to obtain the texture, the flow fields and the noise, $\theta = \{J_0, \mathbf{w}_i, \Sigma\}$, in our system.

4.2.4 Optimization

After modeling the posterior by setting up the likelihood (Equation 4.4) and prior (Equation 4.5) terms, we aim to maximize the posterior to estimate $\theta = \{J_0, \mathbf{w}_i, \Sigma\}$. We simplify our formulation by taking the negative logarithm of the posterior (Equation 4.2), which can be minimized in an optimization framework. From Equation 4.2, we have

\[ \hat{\theta} = \arg\min_{\theta} \left(-\log\{p(I \mid \theta)\, p(\theta)\}\right), \tag{4.12} \]
\[ \phantom{\hat{\theta}} = \arg\min_{\theta} \left(-\log p(I \mid \theta) - \log p(\theta)\right). \tag{4.13} \]

We now evaluate the negative logarithm of the likelihood and the prior terms separately. From Equation 4.4, we evaluate the likelihood term as

\[ -\log p(I \mid \theta) = -\log \prod_{i=-n}^{n} \prod_{\mathbf{x}} \left[ \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} \mathbf{m}_i^T \Sigma^{-1} \mathbf{m}_i\right) \right], \tag{4.14} \]
\[ = -\sum_{i=-n}^{n} \sum_{\mathbf{x}} \left[ \log \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} - \left(\frac{1}{2} \mathbf{m}_i^T \Sigma^{-1} \mathbf{m}_i\right) \right], \tag{4.15} \]
\[ = \sum_{i=-n}^{n} \sum_{\mathbf{x}} \log\left\{(2\pi)^{d/2} |\Sigma|^{1/2}\right\} + \sum_{i=-n}^{n} \sum_{\mathbf{x}} \frac{1}{2} \mathbf{m}_i^T \Sigma^{-1} \mathbf{m}_i, \tag{4.16} \]

where $\mathbf{m}_i = I_i(T_i(\mathbf{x})) - g * J_0(\mathbf{x})$.

Similarly, we evaluate the negative logarithm of the prior terms. From Equation 4.5, using the priors of Equation 4.6 and Equation 4.8, we have

\[ -\log p(\theta) = \frac{1}{\lambda_j} \sum_{\mathbf{x}} \|\nabla J_0(\mathbf{x})\|^2 + \frac{1}{\lambda_f} \sum_{i=-n}^{n} \sum_{\mathbf{x}} \|\nabla \mathbf{w}_i(\mathbf{x})\|^2. \tag{4.17} \]

The above prior term introduces only a smoothness constraint on the flow fields $\mathbf{w}_i$ and the high resolution texture $J_0$. We modify the prior terms using the priors of Equation 4.7 and Equation 4.11 to take into account the observations made from the input data as well.
Hence, we recompute the prior terms as

\[ -\log p(\theta) = \frac{1}{\lambda_j} \sum_{\mathbf{x}} \nabla \|J_0(\mathbf{x}) - J_0^*(\mathbf{x})\|^2 + \frac{1}{\lambda_f} \sum_{i=-n}^{n} \sum_{\mathbf{x}} \|\mathbf{w}_i(\mathbf{x}) - \mathbf{w}_i^*(\mathbf{x})\|^2. \tag{4.18} \]

Finally, we express the negative logarithm of the posterior term as an energy function. We seek to minimize this energy function to obtain the estimates of $\theta = \{J_0, \mathbf{w}_i, \Sigma\}$. From Equation 4.14 and Equation 4.18, we have

\[
\begin{aligned}
E(\theta) = {} & \sum_{i=-n}^{n} \sum_{\mathbf{x}} \log\left\{(2\pi)^{d/2} |\Sigma|^{1/2}\right\} + \sum_{i=-n}^{n} \sum_{\mathbf{x}} \frac{1}{2} \mathbf{m}_i^T \Sigma^{-1} \mathbf{m}_i \\
& + \frac{1}{\lambda_j} \sum_{\mathbf{x}} \nabla \|J_0(\mathbf{x}) - J_0^*(\mathbf{x})\|^2 + \frac{1}{\lambda_f} \sum_{i=-n}^{n} \sum_{\mathbf{x}} \|\mathbf{w}_i(\mathbf{x}) - \mathbf{w}_i^*(\mathbf{x})\|^2.
\end{aligned} \tag{4.19}
\]

Material Mask Functions

The optical flow fields $\mathbf{w}_i(\mathbf{x})$ suffer from errors in our framework for several reasons. The eye movements are extremely fast and are difficult to track. There is constant occlusion between the eyelashes and the eyelid skin. Moreover, the eyelashes are difficult to resolve due to their size and resolution constraints. As such, we propose a material mask function $\nu(\mathbf{x})$ over each of the texels in the high resolution texture.

In order to formulate our material mask function appropriately, we look at the distribution of color for a particular texel tracked across several frames. This is given by $\{I_i(T_i(\mathbf{x})) \mid i \in [-n, n]\}$ for a constant $\mathbf{x}$. The distribution usually forms a cluster containing one or two modes. A single mode results when the texel does not undergo any occlusion. Two modes result when a texel undergoes mutual occlusion between the eyelash and the eyelid skin. One mode corresponds to the color values of the eyelash, and the other mode corresponds to the eyelid skin under the eyelash. Since the two occlude each other, the tracker picks up both while tracking across multiple frames.

Figure 4.3: Color of texels in texture space tracked across frames: Some texels under occlusion are likely to have two color values corresponding to the eyelash and the skin texture in a horizontal line. Histograms of some horizontal scan lines are shown in Figure 4.4.

The tracked pixels across frames are shown in Figure 4.3.
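The text does not specify how the one or two modes of a texel's color distribution are located. One simple possibility, shown here purely as an assumed illustration, is a two-cluster k-means on the intensities of a texel tracked across frames. The intensity values are hypothetical, and treating the darker cluster as the eyelash is also an assumption.

```python
def two_means(values, n_iters=20):
    """Split 1-D intensities into a dark and a bright cluster (Lloyd's algorithm, k = 2)."""
    low, high = min(values), max(values)
    dark, bright = list(values), []
    for _ in range(n_iters):
        dark = [v for v in values if abs(v - low) <= abs(v - high)]
        bright = [v for v in values if abs(v - low) > abs(v - high)]
        if not bright:  # all samples identical: a single mode
            break
        low, high = sum(dark) / len(dark), sum(bright) / len(bright)
    return (low, len(dark)), (high, len(bright))


# Hypothetical intensities of one texel tracked over 13 frames:
# four frames show the (dark) eyelash, nine show the (brighter) skin.
samples = [0.21, 0.19, 0.22, 0.20,
           0.70, 0.72, 0.69, 0.71, 0.68, 0.73, 0.70, 0.69, 0.71]

(mu_e, count_e), (mu_s, count_s) = two_means(samples)
```

For an unoccluded texel with identical samples, the bright cluster comes back empty with a count of zero, which corresponds to the single-mode case described above.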
The corresponding texels under occlusion are likely to have two color values, corresponding to the eyelash and the skin texture. The distribution for a few tracked pixels corresponding to texels $\mathbf{x}$ can be seen in Figure 4.4. The modes of the distribution correspond to the intensity values of the eyelash and the skin, given by $\mu_e$ and $\mu_s$ respectively. These modes differ between texels in the high resolution texture and can be represented as $\mu_e(\mathbf{x})$ and $\mu_s(\mathbf{x})$.

Figure 4.4: Distribution of texel color for a tracked point across frames. The texels under occlusion (left) show two modes corresponding to the eyelash color and the skin color. The texels not under occlusion (right) show a single mode.

Hence, in order to observe the eyelashes in our texture image, we formulate the material mask function for each of the frames $i \in [-n, n]$ as

\[ \nu_i(\mathbf{x}) = \exp\left(-\frac{1}{\lambda_\nu} \big(I_i(T_i(\mathbf{x})) - \mu_e(\mathbf{x})\big)^2\right), \tag{4.20} \]

where $\lambda_\nu$ is a parameter that controls the penalty for deviating from $\mu_e$.

Similarly, for observing the skin in the high resolution texture that is occluded by the eyelash, we construct the material mask function centered around $\mu_s(\mathbf{x})$, given by

\[ \nu_i(\mathbf{x}) = \exp\left(-\frac{1}{\lambda_\nu} \big(I_i(T_i(\mathbf{x})) - \mu_s(\mathbf{x})\big)^2\right). \tag{4.21} \]

The material mask terms are normalized such that $\sum_i \nu_i(\mathbf{x}) = 1$ for all $\mathbf{x}$. Finally, we incorporate the material mask term in the energy function (Equation 4.19) that we seek to minimize as

\[
\begin{aligned}
E(\theta) = {} & \sum_{i=-n}^{n} \sum_{\mathbf{x}} \log\left\{(2\pi)^{d/2} |\Sigma|^{1/2}\right\} + \sum_{i=-n}^{n} \sum_{\mathbf{x}} \frac{1}{2} \nu_i(\mathbf{x})\, \mathbf{m}_i^T \Sigma^{-1} \mathbf{m}_i \\
& + \frac{1}{\lambda_j} \sum_{\mathbf{x}} \nabla \|J_0(\mathbf{x}) - J_0^*(\mathbf{x})\|^2 + \frac{1}{\lambda_f} \sum_{i=-n}^{n} \sum_{\mathbf{x}} \|\mathbf{w}_i(\mathbf{x}) - \mathbf{w}_i^*(\mathbf{x})\|^2.
\end{aligned} \tag{4.22}
\]

Several optimization methods can be explored for obtaining this solution. To do so, we first evaluate the gradient of our function.

4.3 Numerical Solution

We proceed to compute a numerical solution of the energy function (Equation 4.22) to get estimates of $\theta = \{J_0, \mathbf{w}_i, \Sigma\}$. We derive the updates for each of $\{J_0, \mathbf{w}_i, \Sigma\}$, which are computed in an iterative manner.
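The material mask weights of Equations 4.20 and 4.21, including the normalization over frames, can be sketched for a single texel as follows. The sample intensities and the assumption of an 8-bit intensity range are hypothetical. The value $\lambda_\nu = 10$ is the one reported later in the results, and the fallback to uniform weights is an added safeguard that the text does not describe.

```python
import math


def material_mask(samples, mode, lam_nu):
    """Weights nu_i = exp(-(I_i - mode)^2 / lam_nu), normalized to sum to one over frames."""
    raw = [math.exp(-((s - mode) ** 2) / lam_nu) for s in samples]
    total = sum(raw)
    if total == 0.0:  # every frame is far from the mode: fall back to uniform weights
        return [1.0 / len(samples)] * len(samples)
    return [r / total for r in raw]


# Hypothetical 8-bit intensities of one texel over five frames:
# frames 0 and 2 show the eyelash, frames 1, 3 and 4 show the skin.
samples = [50.0, 180.0, 52.0, 178.0, 181.0]
mu_e, mu_s = 51.0, 180.0  # assumed eyelash and skin modes
lam_nu = 10.0

nu_eyelash = material_mask(samples, mu_e, lam_nu)  # Equation 4.20
nu_skin = material_mask(samples, mu_s, lam_nu)     # Equation 4.21
```

With the eyelash mode selected, almost all of the weight falls on the frames in which the eyelash is visible, so the skin observations contribute very little to the data term for that texel, and vice versa.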
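We also account for the material mask function $\nu_i(\mathbf{x})$ while deriving the updates for each of the parameters.

4.3.1 Update for Noise Variance, Σ

To estimate the update on the noise variance $\Sigma$ in the energy function (Equation 4.22), we set the derivative of $E(\theta)$ with respect to $\Sigma^{-1}$ to zero, similar to [24]. To simplify the notation, we write $\mathbf{m}_i(\mathbf{x}), \nu_i(\mathbf{x})$ as $\mathbf{m}_i, \nu_i$ respectively. We have

\[ \frac{\partial E(\theta)}{\partial \Sigma^{-1}} = \frac{1}{2} \frac{\partial}{\partial \Sigma^{-1}} \sum_i \sum_{\mathbf{x}} \left[ \nu_i \mathbf{m}_i^T \Sigma^{-1} \mathbf{m}_i + \log\left((2\pi)^d |\Sigma|\right) \right], \tag{4.23} \]
\[ 0 = \frac{\partial}{\partial \Sigma^{-1}} \sum_i \sum_{\mathbf{x}} \left[ \mathrm{tr}\left(\Sigma^{-1} \nu_i \mathbf{m}_i \mathbf{m}_i^T\right) - \log|\Sigma^{-1}| + \log\left((2\pi)^d\right) \right], \tag{4.24} \]
\[ 0 = \sum_i \sum_{\mathbf{x}} \left[ 2\left(\nu_i \mathbf{m}_i \mathbf{m}_i^T - \Sigma\right) - \mathrm{diag}\left(\nu_i \mathbf{m}_i \mathbf{m}_i^T - \Sigma\right) \right], \tag{4.25} \]
\[ 0 = \sum_i \sum_{\mathbf{x}} \left(\nu_i \mathbf{m}_i \mathbf{m}_i^T - \Sigma\right), \tag{4.26} \]
\[ \Sigma = \frac{\sum_i \sum_{\mathbf{x}} \nu_i \mathbf{m}_i \mathbf{m}_i^T}{\sum_i \sum_{\mathbf{x}} \nu_i}. \tag{4.27} \]

4.3.2 Update for Texture Image, J0

In order to derive an update for the high resolution texture $J_0$, we introduce some vector notation. Let $\mathbf{J}_0$ be the set of all $J_0(\mathbf{x})$ in the texture, represented by a column vector. Let $\mathbf{I}_i$ be the set of all $I_i(T_i(\mathbf{x}))$, represented by a column vector. Let $g * J_0$, stacked over all $\mathbf{x}$, be written as $G\mathbf{J}_0$, where $G$ is a transform representing the convolution. Let $S$ be a block diagonal matrix of $\Sigma^{-1}$. Let $V_i$ be all the material mask function terms for a particular frame $i$, represented by a diagonal matrix. We note that $\sum_i V_i = \mathbf{1}$, the identity matrix, since $\nu_i$ is normalized. Let $Q$ be a matrix that computes a discrete approximation of the spatial derivatives, $\nabla$. Then, taking the derivative of the energy function (Equation 4.22), we have

\[ \frac{\partial E(\theta)}{\partial J_0} = \frac{\partial}{\partial J_0} \left[ \sum_{i=-n}^{n} \sum_{\mathbf{x}} \frac{1}{2} \nu_i \mathbf{m}_i^T \Sigma^{-1} \mathbf{m}_i + \frac{1}{\lambda_j} \sum_{\mathbf{x}} \nabla \|J_0(\mathbf{x}) - J_0^*(\mathbf{x})\|^2 \right]. \tag{4.28} \]

We rewrite the above equation in vector form using the notation introduced above. Thus, taking the derivative with respect to $\mathbf{J}_0$, we have

\[ \frac{\partial E(\theta)}{\partial \mathbf{J}_0} = \frac{\partial}{\partial \mathbf{J}_0} \left[ \frac{1}{2} \sum_i V_i (\mathbf{I}_i - G\mathbf{J}_0)^T S\, (\mathbf{I}_i - G\mathbf{J}_0) + \frac{1}{\lambda_j} (\mathbf{J}_0 - \mathbf{J}^*)^T Q\, (\mathbf{J}_0 - \mathbf{J}^*) \right], \tag{4.29} \]
\[ 0 = -\frac{1}{2} \sum_i V_i G S \mathbf{I}_i + \frac{1}{2} \Big(\sum_i V_i G S G\Big) \mathbf{J}_0 + \frac{2}{\lambda_j} Q (\mathbf{J}_0 - \mathbf{J}^*). \tag{4.30} \]

Hence, setting $\sum_i V_i = \mathbf{1}$ in the terms on the right hand side, we estimate the value of $\mathbf{J}_0$ by solving the following linear system:

\[ \frac{1}{2} \sum_i V_i G S \mathbf{I}_i + \frac{2}{\lambda_j} Q \mathbf{J}^* = \left( \frac{1}{2} G S G + \frac{2}{\lambda_j} Q \right) \mathbf{J}_0. \tag{4.31} \]

The above equation can be solved for $\mathbf{J}_0$ using direct methods. However, the matrices become as large as $10^6 \times 10^6$ even for small texture images of 1 megapixel.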
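For a system of this size, forming or factorizing the matrix explicitly is impractical, so a solver that needs only matrix-vector products is a natural fit. The sketch below shows a minimal matrix-free conjugate gradient iteration. The operator `apply_system` is a hypothetical stand-in (identity plus a weighted 1-D graph Laplacian) chosen only because it is symmetric positive definite. It is not the operator $\frac{1}{2}GSG + \frac{2}{\lambda_j}Q$ of Equation 4.31, and the right hand side is made up.

```python
def conjugate_gradient(apply_A, rhs, n_iters=200, tol=1e-20):
    """Solve A x = rhs for a symmetric positive definite A, given only the map x -> A x."""
    x = [0.0] * len(rhs)
    r = list(rhs)              # residual rhs - A x for the initial guess x = 0
    p = list(r)                # first search direction
    rs_old = sum(v * v for v in r)
    for _ in range(n_iters):
        if rs_old < tol:       # squared residual norm small enough: stop
            break
        Ap = apply_A(p)
        alpha = rs_old / sum(a * b for a, b in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def apply_system(v, weight=2.0):
    """Hypothetical SPD operator: identity plus a weighted 1-D graph Laplacian."""
    n = len(v)
    out = []
    for k in range(n):
        lap = 0.0
        if k > 0:
            lap += v[k] - v[k - 1]
        if k < n - 1:
            lap += v[k] - v[k + 1]
        out.append(v[k] + weight * lap)
    return out


rhs = [1.0, 2.0, 3.0, 2.0, 1.0]  # made-up right hand side
solution = conjugate_gradient(apply_system, rhs)
residual = [a - b for a, b in zip(apply_system(solution), rhs)]
```

Because the solver touches the matrix only through `apply_A`, the convolution and derivative operators could in principle be applied as image filters without ever storing a $10^6 \times 10^6$ matrix.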
Hence, we use an iterative approach and solve it using the Conjugate Gradient (CG) method.

4.3.3 Update for Optical Flow Fields, wi

We now derive the updates for the optical flow fields. Let $\mathbf{W}_i$ be the set of all $w_i(x)$, represented by a column vector. Then, from Equation 4.19, we have

$$E(\mathbf{W}_i) = \frac{1}{2} V_i (\mathbf{I}_i - G\mathbf{J}_0)^T S (\mathbf{I}_i - G\mathbf{J}_0) + \frac{1}{\lambda_f}(\mathbf{W}_i - \mathbf{W}_i^*)^T (\mathbf{W}_i - \mathbf{W}_i^*). \tag{4.32}$$

The above energy function computes the optical flow field between $\mathbf{I}_i$ and $G\mathbf{J}_0$ with a regularization over $\mathbf{W}_i$. It can be solved using Euler-Lagrange equations as described in [62].

We iterate over the updates described above until convergence is reached. The overall framework is described in Algorithm 2.

Algorithm 2: Texture Estimation
  Input: $I_i$, $i \in [-n, n]$, niters
  Output: $\Sigma$, $J_0$, $w_i$, $i \in [-n, n]$
  Initialize $D$ using the 3D mesh obtained from Faceshift [78];
  Compute $w_i^*(x)$ using [62];
  Compute $J_0^*(x)$ using Eq. (4.9);
  $w_i(x) \leftarrow w_i^*(x)$;
  $J_0(x) \leftarrow J_0^*(x)$;
  for $k \leftarrow 1$ to niters do
    Compute $\nu_i(x)$ using §4.2.4;
    Compute $\Sigma$ using §4.3.1;
    Update $J_0$ using §4.3.2;
    Update $w_i$ using §4.3.3;
  end

4.4 Results

We used the images of a closing blink sequence of the eye recorded with a high frame rate camera at 200 fps. We used image sequences of the eyes at a resolution of 500×500 pixels and obtained high resolution texture maps of 1000×1000 pixels. The first and last frames of the sequence are shown in Figure 4.1. The parameters were set to $\lambda_j = 1000$, niters $= 10$, $\lambda_\nu = 10$. For the experiments, we selected the mode of the material mask function with the higher frequency, according to the following condition,

$$\mathrm{Mode}(x) = \begin{cases} \mu_s(x), & \text{if } f(\mu_s(x)) > f(\mu_e(x)) \\ \mu_e(x), & \text{otherwise,} \end{cases} \tag{4.33}$$

where $f(\mu_s(x))$ and $f(\mu_e(x))$ are the frequencies of pixels with color values $\mu_s(x)$ and $\mu_e(x)$ respectively.

The eyelash from the video sequence is shown in Figure 4.5. The reconstruction of the eyelash texture is shown in Figure 4.6. Each iteration of Algorithm 2 took about 60 seconds to run using MATLAB on a 2.67 GHz Intel Core i5 processor. We ran 10 iterations, for about 600 seconds in total.
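The mode selection rule of Equation 4.33 can be sketched per texel as below. The thesis does not spell out how the frequency $f$ is computed, so this sketch assumes that $f(\mu)$ counts the samples lying within a tolerance `tol` of $\mu$; the function name `select_mode`, the tolerance and the sample values are hypothetical.

```python
def select_mode(samples, mu_s, mu_e, tol):
    """Pick the dominant mode for one texel, following Equation 4.33.

    samples: intensities observed for this texel across the frames
    mu_s:    skin mode mu_s(x)
    mu_e:    eyelash mode mu_e(x)
    tol:     assumed half-width used to count samples belonging to a mode
    """
    # Assumption: f(mu) is the number of samples within tol of the mode.
    f_s = sum(1 for v in samples if abs(v - mu_s) <= tol)
    f_e = sum(1 for v in samples if abs(v - mu_e) <= tol)
    # Skin wins only if it is strictly more frequent; otherwise eyelash.
    return mu_s if f_s > f_e else mu_e

# Hypothetical texel: three eyelash-like samples and two skin-like samples.
samples = [0.20, 0.22, 0.75, 0.21, 0.78]
chosen = select_mode(samples, mu_s=0.76, mu_e=0.21, tol=0.05)
```

In this hypothetical case the eyelash mode is chosen, because three of the five samples fall near `mu_e`. Forcing `mu_s` regardless of the counts corresponds to the skin-under-eyelash reconstruction described in this section.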
Our algorithm successfully recovered the detailed eyelash texture from low resolution videos of blinking eyes. We used 13 frames of a blink sequence for this reconstruction. We also reconstructed the skin texture from the videos. The reconstruction is shown in Figure 4.8, and the corresponding image from the video is shown in Figure 4.7.

We recovered the texture of the full face shown in Figure 4.10. We used 13 frames of the video sequence containing the open eyes, captured at a resolution of 750×750 with a frame rate of 200 fps. We selected the material mask functions according to Equation 4.33. The recovered full face texture in Figure 4.10 has a resolution of 1500×1500.

We also recovered the skin under the eyelashes by using 13 frames of a sequence where the skin is occluded by the eyelashes in all the frames. The first and last frames of the sequence are shown in Figure 4.1. All the frames used are of a closed eye, where the skin under the eyelash is densely occluded. The reconstruction was done by selecting the mode of the material mask function $\mu_s(x)$ corresponding to skin, irrespective of the condition in Equation 4.33. The reconstruction is shown in Figure 4.11.

Figure 4.5: Eyelashes from the video sequence.

Figure 4.6: Reconstructed eyelashes with 13 frames of the closing blink sequence, by masking using Equation 4.33.

Figure 4.7: Skin texture from the video sequence.

Figure 4.8: Reconstructed skin texture with 13 frames of the closing blink sequence, by masking using Equation 4.33.

Figure 4.9: Reconstructed skin texture of the eyelid: the blood vessels and wrinkles are more prominent in the reconstruction (bottom) as compared to the original image (top).

Figure 4.10: Reconstructed skin texture of the full face using a 13 frame sequence with eyes open. Masking using Equation 4.33.

Figure 4.11: Reconstructed skin under the eyelash (left) by using frames from a sequence under occlusion (Figure 4.1), by masking the mode corresponding to $\mu_e(x)$. One frame from the sequence of similar looking frames (right).

Chapter 5
Learning Skin Motion

In this chapter, we discuss the generation of skin motion of the upper head region from a small number of parameters. In Chapter 3, we developed a framework for effective estimation of the skin motion. We will show that these measurements can be used to create a model for generating the motion of the skin around the eyes. We use artificial neural networks for our learning model.

5.1 Overview

The soft tissues around the eye move primarily due to the activation of the orbicularis oculi and levator palpebrae muscles that control the eyelids. Therefore, the motion of the eyelids and the skin around the eyes can be parameterized in terms of the activation of the orbicularis and levator muscles. This parameterization requires the activations of the different muscles during natural expressions. The muscle activations can be measured by recording their stress or strain characteristics during natural expressions. However, there is no non-invasive way of measuring these muscle activations while the subject performs natural expressions.

Gaze of the eye is defined as the direction in which the eye is looking. We learn the skin motion of the upper head region parameterized by the gaze of the eye, which can be measured easily using tracking methods [45]. We learn two models to parametrize the skin motion. The first model learns the shape of the eyelid margin (which we will refer to as "eyelid" for short) from gaze parameters, since eyelid shape is highly dependent on gaze parameters. The second model learns skin motion depending on eyelid shape. This makes sense since the soft tissues around the eye move primarily due to the activation of the orbicularis oculi and levator palpebrae muscles that control the eyelids.

Figure 5.1: Orbicularis oculi and levator palpebrae control the eyelids (left). Extraocular muscles control gaze (right). Reproduced from [42].
We also learn two models to make the implementation more robust. Learning a skin motion model directly from gaze parameters results in overfitting and does not perform well for wide ranges of gaze parameters. Our model performs well for a wide range of gaze parameters with a very small reconstruction error.

As described above, we factor the generative model into two models which are learned separately. The schematic of our implementation is shown in Figure 5.2. The eyelid model predicts the shape of the eyelid from the gaze parameters. Finally, the skin motion is learned as a function of eyelid shape.

Figure 5.2: Gaze parameterized skin motion: skin motion is estimated from eyelid shape, which is recovered using our lid model from gaze parameters. Other affect parameters can also be included in the model.

5.2 Eyelid Shape Model

We construct the eyelid shape model using neural networks. In the rest of the section, we formally define the eyelid shape and its control parameters. We then construct the model and validate it after training.

5.2.1 Definitions

In order to construct a generative eyelid shape model based on the gaze parameters, we start by defining gaze. Gaze is the direction in which the eyes are looking, modeled as the optical axis or as the visual axis. Estimating such a model of gaze is very well studied and widely used, and is available in all off-the-shelf eye trackers. Since the gaze represents a ray, it has two degrees of freedom; in the eye tracking literature this is referred to as 2D Gaze. However, it has been known since the time of Helmholtz that the globe also rotates within the orbit about the visual axis, a phenomenon known as ocular torsion. In other words, it is necessary to estimate the full 3D rotation of the globe; this is referred to as 3D Gaze. Torsion can be high, particularly during head tilt and in extreme gaze, and therefore we also estimate torsion to accurately register the globe between frames.
Thus, we can represent the 3D rotation of the eye as $\{\theta, \phi, \psi\}$, corresponding to the $x$, $y$ and $z$ axes of the globe.

However, the shape of the eyelids is also influenced by the aperture of the eye. The aperture of the eye can be defined as the measure of the eye opening. It can be estimated by tracking the eyelid positions. We define the aperture of each eye by a 2D vector given by $\{a_U, a_D\}$, where $a_U$ and $a_D$ are defined as the positions of the centers of the upper and lower eyelid respectively.

Thus, we define the gaze vector for an eye as a 5-dimensional vector which is a combination of 3D globe rotations and 2D apertures, given by

$$\mathbf{g} = (\theta, \phi, \psi, a_U, a_D). \tag{5.1}$$

We define the eyelid shape for the left eye as a combination of the shapes of its upper and lower lids, $\mathbf{e}_L = [\mathbf{e}_{LU}^T, \mathbf{e}_{LD}^T]^T$. Similarly, the eyelid shape for the right eye can be defined as $\mathbf{e}_R = [\mathbf{e}_{RU}^T, \mathbf{e}_{RD}^T]^T$. Each of the eyelid shapes $\{\mathbf{e}_{LU}, \mathbf{e}_{LD}, \mathbf{e}_{RU}, \mathbf{e}_{RD}\}$ is modeled as a spline curve, stored as a $d \times 2$ matrix representing the vertices in the mesh corresponding to that eyelid. In our model $d$ varies from 17 to 22 depending on the eyelid being modeled.

The five eyelid model parameters are the most basic ones that were able to reproduce the motions captured during measurement. Our system can be easily extended to include other parameters that characterize affect, mouth state, etc., as long as one captures data under those conditions.

5.2.2 Training Model

In order to train the model, we estimate the rotations of the globe $(\theta, \phi, \psi)$ during a video sequence using [46]. We also estimate the eyelid shapes $\{\mathbf{e}_L, \mathbf{e}_R\}$ using [46]. From the eyelid shape, we estimate the aperture $(a_U, a_D)$ for each eye by tracking the center of each of the eyelids. For each eyelid shape $\{\mathbf{e}_{LU}, \mathbf{e}_{LD}, \mathbf{e}_{RU}, \mathbf{e}_{RD}\}$ and the corresponding gaze parameters of the globe $\mathbf{g}$, we train a neural network using $n = 670$ data points. Our neural network framework implements 2 hidden layers using MATLAB's Neural Network Toolbox.
The first hidden layer of neurons is modeled using radial basis functions,

$$y_i = b^r_i \exp\left(-\|\mathbf{w}^r_i \cdot \mathbf{x}\|^2\right), \tag{5.2}$$

where $\mathbf{w}^r_i$ is the weight vector applied to the input vector $\mathbf{x}$, and $b^r_i$ is the bias. The second layer of neurons is modeled using pure linear functions given by

$$y_i = \mathbf{w}^l_i \cdot \mathbf{x} + b^l_i, \tag{5.3}$$

with weight vectors $\mathbf{w}^l_i$ and biases $b^l_i$. The linear layer is used as a weighting function for the outputs from the RBF layer of the neural network. Our learning model is shown in Figure 5.3. The model uses 25 neurons in each of the hidden layers for training. The weights $\mathbf{w}^r_i$ and $\mathbf{w}^l_i$ of the model are learned by minimizing the least-squares loss between the predicted and measured outputs using back-propagation. The loss function is minimized using stochastic gradient descent.

Figure 5.3: Model of the neural network for training eyelid shape.

Our framework uses four of the neural network models described by Equation 5.2 and Equation 5.3. Each neural network learns one eyelid shape from $\{\mathbf{e}_{LU}, \mathbf{e}_{LD}, \mathbf{e}_{RU}, \mathbf{e}_{RD}\}$ from the gaze parameter $\mathbf{g}$ of the corresponding eye. Our neural network is capable of capturing the high non-linearity of the skin motion, as shown in the results.

5.3 Skin Motion Model

The skin motion model can be learned as a function of the eyelid shape. In the rest of the section, we describe the representation of skin motion and the learning model. We use a neural network model similar to the one discussed in Section 5.2 to parametrize the skin motion.

Figure 5.4: Model of the neural network for training skin motion.

5.3.1 Definitions

In order to construct a generative model of the skin motion based on the eyelid shape, we combine the eyelid shape vectors for both left and right eyes as $[\mathbf{e}_L^T, \mathbf{e}_R^T]^T = [\mathbf{e}_{LU}^T, \mathbf{e}_{RU}^T, \mathbf{e}_{LD}^T, \mathbf{e}_{RD}^T]^T$. Each control point that belongs to the eyelid shape $[\mathbf{e}_L^T, \mathbf{e}_R^T]^T$ forms a row in an $l \times 2$ matrix, where $l$ is the total number of control points corresponding to all the eyelid shape vectors in $[\mathbf{e}_L^T, \mathbf{e}_R^T]^T$.
Each control point of the eyelid is 2-dimensional, representing its position $(x, y)$.

We define the skin mesh $S = (V, E)$ as a collection of vertices and edges. The vertices $V$ in $S$ form an $m \times 2$ matrix, with each row denoting the $(x, y)$ position of a particular vertex.

5.3.2 Training Model

In order to train the skin motion model, we estimate skin motion as a flow of the subject mesh $S$ over our video sequence, as described in Section 3.2. We obtain the lid shapes, represented by splines, using the tracking methods in [45]. We train two neural network models as described in Section 5.2 to handle the $x$ and $y$ components independently. To predict the $x$ components of the skin vertices, $V_x$, we drive our neural network using the $x$ components of the eyelid shape, $[\mathbf{e}_{L,x}^T, \mathbf{e}_{R,x}^T]^T$. A similar neural network is learned for predicting the $y$ components of the skin motion, $V_y$. The network is shown in Figure 5.4.

Figure 5.5: Reconstruction of the eyelid. Ground truth (red) and overlaid reconstruction (blue). Left eyelid (a), right eyelid (b) and the corresponding magnified portion of the left eyelid (c).

Both neural network models use the same structure as described in Section 5.2, with two hidden layers. The first hidden layer contains radial basis functions, followed by a second layer of pure linear functions. The two models are trained using $n = 670$ data points. The number of vertices in a subject mesh is large, spanning $m = 1662$ dimensions. In order to deal with this high dimensionality, we perform dimensionality reduction using Principal Component Analysis (PCA) on the high-dimensional $\{V_x, V_y\}$ and reduce them to 15 dimensions, which speeds up training.

5.4 Results

Once the generative model is learned, movements of the eyes are efficiently synthesized from a specification of gaze and a small number of additional parameters, such as the aperture of the eyelid opening.
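The synthesis path of Sections 5.2 and 5.3 (gaze to eyelid shape, eyelid shape to PCA coefficients, PCA coefficients to skin vertices) can be sketched as below. This is a simplified illustration under several assumptions: it uses one eyelid network and one skin network instead of the four eyelid networks and the separate $x$ and $y$ skin networks described above, it reads $\mathbf{w}^r_i \cdot \mathbf{x}$ in Equation 5.2 as a dot product, and all function names, parameter names, dimensions and weights are hypothetical toy values rather than learned ones.

```python
import math

def rbf_layer(x, weights, biases):
    # Equation 5.2 as written: y_i = b_i * exp(-(w_i . x)^2),
    # assuming w_i . x denotes a dot product.
    out = []
    for row, b in zip(weights, biases):
        s = sum(w * v for w, v in zip(row, x))
        out.append(b * math.exp(-(s * s)))
    return out

def linear_layer(x, weights, biases):
    # Equation 5.3: y_i = w_i . x + b_i
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def network(x, params):
    # RBF hidden layer followed by a pure linear layer.
    hidden = rbf_layer(x, params["w_r"], params["b_r"])
    return linear_layer(hidden, params["w_l"], params["b_l"])

def pca_reconstruct(coeffs, mean, components):
    # vertices = mean + sum_k coeffs[k] * components[k]
    out = list(mean)
    for c, comp in zip(coeffs, components):
        for j, v in enumerate(comp):
            out[j] += c * v
    return out

def synthesize(gaze, lid_params, skin_params, mean, components):
    eyelid = network(gaze, lid_params)      # Section 5.2: gaze -> eyelid shape
    coeffs = network(eyelid, skin_params)   # Section 5.3: eyelid -> PCA coefficients
    return pca_reconstruct(coeffs, mean, components)

# Toy example: 5-D gaze, 3 eyelid values, 2 PCA coefficients, 4 vertex coordinates.
lid = {"w_r": [[0.1] * 5, [-0.2] * 5], "b_r": [1.0, 0.5],
       "w_l": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], "b_l": [0.0, 0.0, 0.1]}
skin = {"w_r": [[0.3, 0.0, 0.1], [0.0, 0.2, 0.0]], "b_r": [1.0, 1.0],
        "w_l": [[1.0, 0.0], [0.0, 1.0]], "b_l": [0.0, 0.0]}
mean = [0.0, 0.0, 0.0, 0.0]
components = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
vertices = synthesize([0.1, 0.0, 0.0, 0.4, 0.6], lid, skin, mean, components)
```

Chaining the two networks mirrors the factorization into an eyelid model and a skin model; the PCA back-projection restores the full vertex vector from the 15-dimensional representation used during training (2-dimensional in this toy example).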
We show that the error rate for the reconstruction of eyelid shape and skin mesh on a validation dataset is very small. Therefore, our model can easily generate realistic motion of the eyelids and the skin around the eyes. We also compare our approach with other learning methods and achieve better error rates.

5.4.1 Eyelid Shape Model

During the testing and implementation of the learned eyelid shape model, we provide gaze parameters to our model to obtain the corresponding shape of the eyelids. On a cross validation dataset of 158 points picked uniformly at random from the training dataset, we obtain the error in the ∞-norm, $\|e\|_\infty = 0.04372$ mm. This represents the reconstruction error of the eyelid control point which suffers the maximum error on a real-sized face mesh. The reconstruction of the eyelid is shown in Figure 5.5.

5.4.2 Skin Motion Model

The implementation of the skin motion model consists of providing the eyelid shapes obtained in Section 5.2 to estimate skin motions using this learned model. On a validation dataset of 158 points picked uniformly at random from the training dataset, we obtain the error in the ∞-norm, $\|e\|_\infty = 0.265$ mm. This represents the reconstruction error of the vertex $V$ in $S$ which suffers the maximum error on a real-sized face mesh. The reconstruction of the mesh is shown in Figure 5.6.

Figure 5.6: Reconstruction of the skin (top). Ground truth (red) and overlaid reconstruction (blue). Magnified version of the left eye (bottom).

5.4.3 Comparison with Other Methods

We tested a variety of more sophisticated dimensionality reduction methods, including Probabilistic PCA (PPCA), Neighborhood Component Analysis (NCA), Maximally Collapsing Metric Learning (MCML) and others. These were evaluated with our Neural Network (NN) model and a Multivariate Linear Regression (MLR) model for their performance, as shown in Table 5.1. Due to the nature of our high dimensional output data (face mesh) and low-dimensional input data, we found that our Neural Network model performed best with PCA on our skin model. We used 670 data points to model a 1662 dimensional output mesh; more sophisticated methods could perform better on much larger input data, but capturing the data then becomes expensive.

  Models        ||e||∞ (mm)    Training Time (s)
  PPCA + MLR    6.187          240.14
  NCA + MLR     4.279          61.62
  MCML + MLR    5.288          778.54
  PCA + MLR     1.156          19.55
  PCA + NN      0.265          8.72

Table 5.1: Performance comparison of different learning algorithms on the skin motion model.

5.4.4 Model Transfer

Once trained on a single user, the generative model can be used to produce skin movements in other characters based on different gaze or affect parameters. The output of the skin motion model is the location of the projected skin mesh, $S$, on the image plane. From this output and a given reference mesh, we can compute a mapping $\Psi$ from the reference mesh to the deformed mesh. This mapping remains constant across subjects in our framework, and can be stored. Now, all that is needed is a new mesh registered to the model mesh using software such as Faceshift. Given the new 3D reference mesh, we can use the same mapping $\Psi$ to generate the 3D deformed mesh configuration for any gaze input parameters. In Figure 5.7, we show an example of facial expression transfer.

Figure 5.7: Model transfer: a model trained on one subject (left) is used to deform the mesh of another subject from one pose to another (center). Deformed mesh overlaid on the second subject (right).

Figure 5.8: Interactive WebGL implementation in real-time.

5.4.5 Interactive Real Time Implementation

For interactive applications we use the PCA+MLR model of Section 5.3. Although its error is worse than that of the neural network model, it is faster to evaluate and to implement in the WebGL framework.
Since all operations are linear, they can be converted to matrix multiplications and premultiplied, resulting in $p \times q$ matrices, one for each skin atlas coordinate axis, which correlate $q$ gaze inputs with $p$ learned weights. A WebGL application from [46] running our model is shown in Figure 5.8.

Chapter 6
Wrinkles

In this chapter, we discuss different methods to simulate wrinkles that can be incorporated while rendering to achieve realism. We discuss two methods for creating normal maps that can be blended while rendering to simulate wrinkles. In Section 6.1, we describe an appearance model based on the strain of the skin, which is rendered using normal maps. In Section 6.2, we follow the shape from shading approach and parametrize wrinkles as geometric models in terms of their depth and width.

6.1 Strain based Appearance Model

In computer graphics, facial features like wrinkles and buckles are used to enhance the realism of a rendered human face in facial animations. Some of the facial wrinkles are shown in Figure 6.1. It is therefore crucial to establish an appearance model for these features, which can be used to generate them. We develop an appearance model based on the changing strains in the skin that can be used for generating wrinkles.

Figure 6.1: Facial expression and skin deformations: (left to right) forehead wrinkles, frown, blink and crow's feet.

6.1.1 Definitions

Following the notation from Section 3.2.1, we define the skin atlas in 2D as a parameterization, using the map $\pi$, of the 3D skin mesh in skin space. The skin mesh and the parameterization are obtained using Faceshift [78]. The skin mesh can undergo deformations given by $\phi_t$, as shown in Figure 3.4. The deformation $\phi_0$ at $t_0$ is computed using the Orthogonal Procrustes algorithm [45]. The projections of these deformed meshes are observed by our monocular setup at time $t$.

Figure 6.2: Skin mesh (left), skin atlas (middle), and tracking mesh (right).
The projection matrix $P$ is obtained by camera calibration as described in Section 3.2.2.

Given a mesh deformation, a strain measure may be computed in various ways. We use the strain measure from [39] to compute directional strains $(\epsilon_u, \epsilon_v)$ in the orthogonal directions $(u, v)$ of the skin atlas, as shown in Figure 6.2.

In our monocular setup, we track the point $\mathbf{u}_t$ across time in the video frames, as defined in Section 3.2.1. Each point $\mathbf{u}_t$ can be mapped back to the skin atlas while tracking, given the estimates of $\pi$, $P$ and $\phi_0$. We represent the pixel intensity of a mapped point from image $I(\mathbf{u}_t)$ in the skin atlas as $J(\mathbf{u}_t)$. We define the Appearance Texture, $J$, as an image that discretizes the 2D skin atlas.

The skin mesh projected on the image plane using the estimates of $\pi$, $P$ and $\phi_0$ is cropped to contain the wrinkles. This cropped mesh is referred to as the Tracking Mesh. The skin mesh, skin atlas and the tracking mesh are shown in Figure 6.2.

6.1.2 Tracking

We record a color video of a subject performing different facial expressions using a Basler Pilot series camera¹ with a frame size of 646×486 pixels. The frames are captured at 200 Hz with a shutter duration of 5 ms. We project the skin mesh on the image plane and crop it to obtain the Tracking Mesh.

We track the vertices of the tracking mesh across the video frames using the Kanade-Lucas-Tomasi (KLT) algorithm [41, 58, 66], as shown in Figure 6.3. We also detect tracking failures using [37]. The poorly tracked points are corrected using their neighboring points: we use bi-cubic scattered data interpolation of the neighboring points to estimate the motion of poorly tracked points.

Figure 6.3: Tracking of the eyelid across frames of a video sequence.

Facial expressions result in skin deformations causing wrinkles. Skin deformation in a facial expression is a result of strains produced in the skin. We estimate the strain produced in the skin using [39] by tracking the mesh across the video sequence.
We then map the points in the tracked image to the appearance texture given by $J$. The points in the appearance texture correspond to the same points on the skin in the video sequence, irrespective of facial deformation. The appearance texture of the forehead corresponding to the eyebrow raising expression is shown in Figure 6.4 for two different time instants.

The motion of the skin and its features (wrinkles, texture changes, etc.) is difficult to localize in the original video. However, the wrinkles occur at the same position across all the frames in the appearance texture, as shown in Figure 6.4. The intensity of points in this appearance texture is given by $J$.

¹ Basler Camera, http://www.baslerweb.com/Cameras-42173.html

Figure 6.4: Forehead wrinkles projected to the skin atlas as an appearance texture.

6.1.3 Learning Strain to Appearance Map

We now describe a model to generate the appearance textures based on the strains produced in the skin. Our model learns the weights of an RBF to generate appearance textures $J(\mathbf{u}_t)$ based on the skin strains $\epsilon_u, \epsilon_v$.

Training

We train our model using the intensity values in the appearance texture, $J(\mathbf{u}_t)$, for given estimates of the skin strains $\epsilon_u, \epsilon_v$ obtained in Section 6.1.2. We use radial basis functions for our model as described in [40, 79].

For each face triangle $F_i$ of the Tracking Mesh ($i = 1, 2, \ldots, N$, where $N$ is the number of face triangles), we obtain the strain along the orthogonal directions corresponding to the axes of the skin atlas, $u$ and $v$, as shown in Figure 6.2. We first compute the strain of the $i$th triangle w.r.t. its rest configuration, and then find its strain components, $\epsilon_u$ and $\epsilon_v$, along the orthogonal directions at time $t$. These strains are computed for each triangle in all frames using [39]. The strain components $\epsilon_u$ and $\epsilon_v$ of the $N$ face triangles together define a $2N$ dimensional feature space.

For a particular facial expression, strains are produced in $F_i$.
We learn a function $\omega : \mathbb{R}^{2N} \rightarrow \mathbb{R}^M$ that maps these strains $\boldsymbol{\epsilon}_t = [\epsilon_u^{(1)}, \epsilon_v^{(1)}, \ldots, \epsilon_u^{(N)}, \epsilon_v^{(N)}]$ to a target appearance texture $J(\mathbf{u}_t)$ containing $M$ skin points. We learn this mapping using scattered data interpolation that employs linear RBF kernels, $\Phi$. We model each point in the appearance texture $J(\mathbf{u}_t)$ as a set of linear radial basis functions, $\Phi(r) = r$ [40, 79]. Our model is given by

$$J(\mathbf{u}_t) = \sum_{k=1}^{n} w^u_k\, \Phi\big(\|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_k\|\big), \tag{6.1}$$

where $n$ is the total number of frames in the sequence of appearance textures. We determine the weights $w^u_k$, $k = 1, \ldots, n$, using our appearance textures and the corresponding strains for each $\mathbf{u}_t \in [1, M]$. These weights represent our model and can be used to compute the appearance texture from the skin strain.

Blending of Normal Maps

The appearance textures computed from strains cannot be used directly for rendering. Therefore, we use radiometry to estimate the normal maps of the expression wrinkles and buckles. The normal maps are obtained by illuminating the subject with a light source from different directions, following [43]. The normal maps corresponding to the neutral state and the activated state are obtained (Figure 6.5).

The normal maps obtained using radiometry are blended together using weights that correspond to the appearance texture obtained from the strain appearance learning model in Section 6.1.3. The appearance texture provides the blending weights for the normal maps during a facial expression. Let $J_T$ be the appearance texture at time $T$ for an activated state, and $J_0$ be the appearance texture for a neutral state. Let $N_T$ and $N_0$ be the normal maps at the activated and neutral states respectively. Then, a normal map $N_t$ at any time $t$ of a facial expression can be computed from the appearance texture $J_t$ as

$$N_t = \frac{J_T - J_t}{J_T - J_0} N_0 + \frac{J_t - J_0}{J_T - J_0} N_T. \tag{6.2}$$

6.1.4 Experiments and Results

We used a 2.67 GHz Intel Core i5 processor for video processing. The tracking and generation of appearance textures was done using MATLAB, which gave us a processing speed of 3-4 frames per second.
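The blending rule of Equation 6.2 can be sketched for a single texel as below. This is a minimal illustration: the function name `blend_normal`, the `eps` threshold and the sample values are hypothetical, and the fallback to the neutral normal where $J_T = J_0$ is an added assumption to avoid division by zero. The result is not renormalized, since Equation 6.2 does not include that step.

```python
def blend_normal(j_t, j_0, j_T, n_0, n_T, eps=1e-8):
    """Blend neutral and activated normals for one texel (Equation 6.2).

    j_t, j_0, j_T: appearance texture values at time t, neutral, activated
    n_0, n_T:      normals (3-tuples) in the neutral and activated states
    """
    span = j_T - j_0
    if abs(span) < eps:
        # Assumption: where the appearance does not change between the two
        # states, keep the neutral normal.
        return tuple(n_0)
    a = (j_t - j_0) / span  # weight of the activated normal
    # (J_T - J_t) / (J_T - J_0) equals 1 - a, the weight of the neutral normal.
    return tuple((1.0 - a) * c0 + a * cT for c0, cT in zip(n_0, n_T))

# Hypothetical texel a quarter of the way from neutral to activated.
n_t = blend_normal(0.25, 0.0, 1.0, (0.0, 0.0, 1.0), (0.0, 1.0, 0.0))
```

At `j_t == j_0` the sketch returns the neutral normal and at `j_t == j_T` the activated normal, so the appearance texture acts as a per-texel interpolation weight between the two captured normal maps.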
The radial basis weights were computed using MATLAB, which took about 15 seconds to learn the model from the captured data.

Figure 6.5: Normal map of the neutral state (left) and the activated state (right) for forehead wrinkles.

For obtaining the normal maps, we captured the facial expression of the subject in two states: the neutral state and the activated state. In the neutral state, the subject has a neutral expression, and no wrinkles or buckles are formed. In the activated state, the subject performs a specific expression resulting in wrinkles or buckles on the skin. Each of the two states is captured using four different scene illuminations: the subject is illuminated from the top, bottom, left and right. These four images captured with different scene illuminations are used to compute the normal maps using the xNormal² software.

The rendering was done using Maya, which used the normal maps and the strain appearance model. We estimated skin strains based on the face deformations during animation. This strain estimate was subsequently used for computing the appearance texture from the learning model described in Section 6.1.3. The normal maps are then obtained by blending, controlled by the appearance textures. The normal maps in the neutral and activated states are blended using Equation 6.2 to generate wrinkles in a facial expression. We show the result in Figure 6.6 for an eyebrow raising expression.

² http://www.xnormal.net/

Figure 6.6: Rendered upper head animation. Steady state pose (left) and activated pose (right). (Snapshots from the video.)

6.2 Shape from Shading Model

In this approach, we describe the use of shape from shading to model wrinkles around the eye, following Bickel et al. [6]. Bickel et al. use the intensity changes in a video capture during wrinkle formation to estimate the geometric properties of the wrinkles. They use a Lambertian setup by painting specific wrinkles on the subject using a non-reflective paint.
In our approach, we use polarized filters to remove specularity and achieve Lambertian conditions.

6.2.1 Geometric Model

We parametrize each of the observed wrinkle lines using a spline curve, $p$. We then model the cross-section of the wrinkles as an exponential function along the splines, similar to [6]:

$$W(p, d, w) = d \left(\frac{p}{w} - 1\right) \exp(-p/w). \tag{6.3}$$

Each wrinkle line is modeled separately as a spline curve parametrized by $p$. The depth $d$ and width $w$ of the wrinkles are estimated using the illumination of the point $p$ with respect to the ambient illumination.

6.2.2 Implementation

We manually mark the position of wrinkles in the video frames corresponding to extreme poses, such as crow's feet. We need to do this only once per subject. This process can also be made automatic by detecting wrinkle lines using methods such as edge detection [13] or vessel/ridge like structure detection techniques [23]. However, these techniques are not as robust for detecting wrinkles in the eye region. Hence, we manually mark the splines of the wrinkles corresponding to parameter $p$, as shown in Figure 6.7.

After setting the spline parameter $p$ for the wrinkle line, we estimate the geometric parameters $d$ and $w$ using the shape from shading approach. The parameters $(d, w)$ are obtained by estimating the illumination during wrinkle formation with respect to the ambient illumination, as shown in [6]. These parameters result in a depth map corresponding to the wrinkles for each of the poses.

We then obtain a normal map from the depth map for a particular pose from the estimated values of $d$ and $w$. We transfer the normal maps to the skin atlas (Section 6.1.1) and store them. The normal maps are then used to render wrinkles and folds around the eye.

We separate the eye region into six vertex groups (see Figure 6.7) based on anatomical location and the motion of the facial muscles around the eye.
For example, the upper and lower eyelid regions are separated into two groups, as the opening of the upper and lower eyelids is independently controlled by different muscles. We then define several key poses, which are the skin poses corresponding to a few extreme expressions, such as forceful eye closure. We compute the normal maps $N_i$, $i \in 1, \ldots, n$, corresponding to $n$ gaze vectors $\mathbf{g}_i$, $i \in 1, \ldots, n$, during key poses and store them to train a wrinkle model.

As in Chapter 5, for each skin group $G$, we train a linear wrinkle model $W_G : \mathbf{g} \rightarrow \Gamma$ using gaze vectors $\mathbf{g}_i$ and the corresponding normal map blending weights $\Gamma_i$. The blending weights of the normal maps are set to the average grayscale intensity in the image of the key poses along the wrinkle lines in that group, followed by a normalization across all the key poses. We can now use the wrinkle model $W_G$ to generate blending weights for a novel gaze.

Figure 6.7: Top: wrinkle markings (with shape shown as intensity). Bottom: three skin groups in the left eye region (left), and wrinkle generation in the eyes (right).

6.2.3 Results

The strain appearance model in Section 6.1 cannot produce a realistic effect when modeling complex wrinkles around the eye. This is because computing a normal map for complex expressions around the eye using our basic system is prone to errors. Hence, we use the shape from shading approach in this section to parametrize the geometric characteristics of these wrinkles. The reconstructions using blending weights learned from this wrinkle model are shown in Figure 6.8.

Figure 6.8: Wrinkle modeling: without wrinkles (left) and with wrinkles (right) in a forceful eye closure example.

Chapter 7
Conclusion and Future Work

Eyes have always been one of the most difficult elements to capture and reconstruct in computer graphics. The complex motion of lid folding and the fast skin motion due to saccades and blink sequences require better algorithms for capturing and modeling them.
Traditional motion trackers fail due to the complex motions of the periorbital soft tissues. We improve the methods for capturing the motion of the periorbital soft tissues in Chapter 3 and also improve upon the existing scene flow techniques for motion estimation.

The textures of the soft tissues around the eye generally have a low resolution in natural videos, as they are a small part of the whole face. Moreover, the textures of the periorbital tissues are often under occlusion due to eyelashes. The eyelash occlusions are corrected mainly using inpainting. In addition, the eyelashes are fine structures and it is therefore very difficult to extract their shape and texture. High resolution textures form an important part of computer generated imagery. We successfully recover the textures of the eyelashes using an optical flow based superresolution technique in Chapter 4. We also generate a complete high resolution skin texture for the periorbital tissues that can be directly used for animation.

The complex motions of the periorbital tissues have always been difficult to reproduce. Most facial animations lack good control of the periorbital soft tissues. We model the complex motions of the periorbital tissues using simple gaze parameters in Chapter 5. Hence, realistic motions of the soft tissues can easily be produced and tuned using gaze and affect parameters.

A large part of the realism around the eye comes from wrinkles that are formed during various expressions. We develop techniques inspired by shape from shading to produce realistic expression wrinkles around the eye in Chapter 6.

In this work, we improved the methods for capturing the motion and texture of the periorbital soft tissues. We developed a controller that can produce realistic eye movements using a small number of parameters.
To add realism, we modeled wrinkles in the soft tissues.

Future Work

Tracking. The estimation of scene flow can be improved by introducing additional constraints with respect to the mesh structure and elasticity. This could enable a detailed capture of facial expressions with a finer mesh. Furthermore, the scene flow method is computationally very expensive. A good area of exploration would be making the computations faster by sampling sparse points, or by learning priors.

Texture Recovery. The texture estimation algorithm is highly dependent on flow estimation. Thus, increasing the speed of flow estimation would boost the performance of our method. The penalty functions introduced in Chapter 4 are based on the color values of eyelashes and skin. They can be made more efficient by learning a prior on the features. Furthermore, the high resolution textures can also be useful for extracting a physical model of the eyelashes in terms of a small number of parameters.

Learning Skin Motion. The ANN models can be trained on larger datasets to capture all the poses that can be reproduced. Other affect parameters can be added to create expressions that involve the entire face. The model can be easily implemented for real-time interfaces, and can be ported to mobile devices and the web.

Wrinkles. Wrinkles are one of the major elements of realism in facial performance capture systems. There have been only a handful of methods for reproducing and modeling wrinkles in a monocular setup. Our model could be extended to reproduce the finer, complex wrinkles that are difficult to model.
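The linear wrinkle model W_G : g → Γ of Section 6.2 can be illustrated with a minimal sketch. This is not the implementation used in the thesis. It assumes the blending weights have already been normalized across key poses, fits an affine map by ordinary least squares (the bias term is an assumption), and blends the stored normal maps as a weighted sum that is renormalized to unit length (also an assumption). The function names fit_wrinkle_model, predict_weights and blend_normal_maps are hypothetical.

```python
import numpy as np


def fit_wrinkle_model(gazes, weights):
    """Least-squares fit of an affine map from gaze vectors to blending weights.

    gazes:   (n, d) array, one gaze vector g_i per key pose.
    weights: (n, k) array, one row of blending weights Gamma_i per key pose
             (assumed to be normalized across key poses already).
    Returns a (d + 1, k) matrix whose last row is the bias (an assumption).
    """
    X = np.hstack([gazes, np.ones((gazes.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, weights, rcond=None)
    return W


def predict_weights(W, gaze):
    """Predict blending weights for a novel gaze, clipped to [0, 1]."""
    x = np.append(gaze, 1.0)
    return np.clip(x @ W, 0.0, 1.0)


def blend_normal_maps(normal_maps, gamma, eps=1e-8):
    """Blend key-pose normal maps with weights gamma and renormalize.

    normal_maps: (k, H, W, 3) array of per-pixel normals, one map per key pose.
    gamma:       (k,) array of blending weights.
    Returns an (H, W, 3) array of unit-length normals.
    """
    blended = np.tensordot(gamma, normal_maps, axes=1)  # sum over the k maps
    norm = np.linalg.norm(blended, axis=-1, keepdims=True)
    return blended / np.maximum(norm, eps)
```

In this sketch one model would be fitted per skin group G, so that the upper and lower eyelid regions receive independent blending weights for the same gaze.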
Learning periorbital soft tissue motion Ranjan, Anurag 2015
Item Metadata
Title | Learning periorbital soft tissue motion |
Creator | Ranjan, Anurag |
Publisher | University of British Columbia |
Date Issued | 2015 |
Description | Human observers tend to pay a lot of attention to the eyes and the surrounding soft tissues. These periorbital soft tissues are associated with subtle and fast motions that convey emotions during facial expressions. Modeling the complex movements of these soft tissues is essential for capturing and reproducing realism in facial animations. In this work, we present a data driven model that can efficiently learn and reproduce the complex motion of the periorbital soft tissues. We develop a system to capture the motion of the eye region using a high frame rate monocular camera. We estimate the high resolution texture of the surrounding eye regions using a Bayesian framework. Our learned model performs well in reproducing various animations of the eyes. We further improve realism by introducing methods to model facial wrinkles. |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Date Available | 2016-03-31 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial 2.5 Canada |
DOI | 10.14288/1.0166703 |
URI | http://hdl.handle.net/2429/54784 |
Degree | Master of Science - MSc |
Program | Computer Science |
Affiliation | Science, Faculty of; Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2015-11 |
Campus | UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc/2.5/ca/ |
AggregatedSourceRepository | DSpace |