UBC Theses and Dissertations


Measurement and animation of the eye region of the human face in reduced coordinates Neog, Debanga Raj 2018

Measurement and Animation of the Eye Region of the Human Face in Reduced Coordinates

by

Debanga Raj Neog

B. Tech., Indian Institute of Technology Guwahati, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

June 2018

© Debanga Raj Neog, 2018

Committee

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled: “Measurement and Animation of the Eye Region of the Human Face in Reduced Coordinates” submitted by Debanga Raj Neog in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science.

Examining Committee:
Co-supervisor: Dinesh K. Pai, Computer Science
Co-supervisor: Robert J. Woodham, Computer Science
Supervisory Committee Member: Jim Little, Computer Science
University Examiner: Alan Mackworth, Computer Science
University Examiner: Sidney Fels, Electrical and Computer Engineering

Abstract

The goal of this dissertation is to develop methods to measure, model, and animate facial tissues of the region around the eyes, referred to as the eye region. First, we measure the subtle movements of the soft tissues of the eye region using a monocular RGB-D camera setup, and second, we model and animate these movements using parameterized motion models. The muscles and skin of the eye region are very thin and sheetlike. By representing these tissues as thin elastic sheets in reduced coordinates, we have shown how we can measure and animate these tissues efficiently.

To measure tissue movements, we optically track both eye and skin motions using monocular video sequences. The key idea here is to use a reduced coordinates framework to model thin sheet-like facial skin of the eye region. This framework implicitly constrains skin to conform to the shape of the underlying object when it slides.
The skin configuration can then be efficiently reconstructed in 3D by tracking two dimensional skin features in video. This reduced coordinates model allows interactive real-time animation of the eye region in WebGL enabled devices using a small number of animation parameters, including gaze. Additionally, we have shown that the same reduced coordinates framework can also be used for physics-based simulation of the facial tissue movements and to produce tissue deformations that occur in facial expressions. We validated our skin measurement and animation algorithms using skin movement sequences with known skin motions, and we can recover skin sliding motions with low reconstruction errors.

We also propose an image-based algorithm that corrects accumulated inaccuracy of standard 3D anatomy registration systems that occurs during motion capture, anatomy transfer, image generation, and animation. After correction, we can overlay the anatomy on input video with low misalignment errors for augmented reality applications, such as anatomy mirroring. Our results show that the proposed image-based corrective registration can effectively reduce these inaccuracies.

Lay Summary

The subtle movements of the facial tissues around the eyes convey a lot of information, especially in facial expressions. We refer to this region of the face as the eye region. Facial muscles and skin of the eye region are particularly interesting as they are thin and sheetlike. This allows us to model these tissues using a reduced coordinates framework which can be used to efficiently measure, simulate, and model the tissue movements. We measure these movements using a monocular RGB-D camera setup, simulate their motions using physics-based simulation, and generate interactive facial animation using machine learning models trained on measured data. Furthermore, we propose an image-based algorithm that corrects inaccuracies of 3D anatomy registration systems.
Our results show that our proposed methods can be used to generate accurate, realistic, and interactive facial and anatomy animation.

Preface

The works comprising this thesis are taken from papers that have been published, submitted, or are planned to be submitted in the near future in collaboration with other people.

Chapter 3 is from (Neog, D. R., Ranjan, A., & Pai, D. K. (2017, May). Seeing Skin in Reduced Coordinates. In Automatic Face Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on (pp. 484-489). IEEE.). The text is from the paper, which was written by the present author and edited by Dr. Pai. The present author was the primary author who worked on the majority of the data processing, analysis, and implementation of the model. Anurag Ranjan helped in data collection and analysis. Dr. Pai provided overall guidance of the project and participated in writing. This work was supported in part by grants from NSERC, Canada Foundation for Innovation, MITACS, and the Canada Research Chairs Program. We thank Vital Mechanics Research for providing the hand model and software.

The work described in Chapter 4 is from a manuscript currently under preparation. The present author worked as the first author and the main writer of the paper. Anurag Ranjan helped in generating the figures of the results. Dr. Pai participated in the early development of the idea and provided discussions on the data collection and analysis.

Real-time facial animation work, described in Chapter 5, is from (Neog, D. R., Cardoso, J. L., Ranjan, A., & Pai, D. K. (2016, July). Interactive gaze driven animation of the eye region. In Proceedings of the 21st International Conference on Web3D Technology (pp. 51-59). ACM.). The present author was the primary author of the work and performed data collection, modeling, and analysis. Joao Cardoso helped in web application development. Anurag Ranjan helped in experiments with computation models and in data analysis. Dr.
Pai led the overall project and provided guidance in writing the text. The authors would like to thank Paul Debevec and the WikiHuman Project for providing the Digital Emily model used in our examples. This work was funded in part by grants from NSERC, Peter Wall Institute for Advanced Studies, Canada Foundation for Innovation, and the Canada Research Chairs Program.

Chapter 6 is from the report submitted as part of the present author’s Research Proficiency Evaluation Examination, held to evaluate the proficiency required for a doctoral degree. The present author worked as the main researcher and writer of the report. Dr. Pai provided guidance in writing the text. The work was later used in our papers (Neog, D. R., Cardoso, J. L., Ranjan, A., & Pai, D. K. (2016, July). Interactive gaze driven animation of the eye region. In Proceedings of the 21st International Conference on Web3D Technology (pp. 51-59). ACM.) and (Neog, D. R., Ranjan, A., & Pai, D. K. (2017, May). Seeing Skin in Reduced Coordinates. In Automatic Face Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on (pp. 484-489). IEEE.).

Chapter 7 is based on collaborative work between the University of British Columbia and INRIA (Rhône-Alpes, Grenoble, France) under the ‘Mitacs Globalink Research Award’ program, and the work was published in (Bauer, A., Neog, D. R., Dicko, A. H., Pai, D. K., Faure, F., Palombi, O., & Troccaz, J. (2017). Anatomical Augmented Reality with 3D Commodity Tracking and Image-space Alignment. Computers & Graphics.). The present author was the second author of the journal paper, and improved the work of (Bauer, A., Dicko, A. H., Faure, F., Palombi, O., & Troccaz, J. (2016, October). Anatomical mirroring: real-time user-specific anatomy in motion using a commodity depth camera. In Proceedings of the 9th International Conference on Motion in Games (pp. 113-122). ACM.) by proposing a new method for anatomy registration correction.
The discussion included in the chapter was contributed mostly by the present author, who is the primary author of that part. Armelle Bauer helped in data collection and provided suggestions in system integration. Dr. Faure, Dr. Troccaz, and Dr. Pai provided overall guidance of the project and participated in writing. This work was supported in part by grants from the LabEx PERSYVAL-LAB (ANR-11-LABX-0025), NSERC, Canada Foundation for Innovation, MITACS, and the Canada Research Chairs Program.

Table of Contents

Committee
Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
1.1 Measurement of Eye and Facial Skin Movements
1.1.1 Motivation
1.1.2 Contributions
1.2 Simulation of Facial Tissues
1.2.1 Motivation
1.2.2 Contributions
1.3 Motion Modeling and Interactive Animation
1.3.1 Motivation
1.3.2 Contributions
1.4 Anatomical Augmented Reality with Image-space Alignment
1.4.1 Motivation
1.4.2 Contributions
1.5 Thesis Outline
2 Anatomy of the Eye Region
2.1 Facial Tissue Structure
2.1.1 Facial Skin
2.1.2 Facial Muscles
2.1.3 Retaining Ligaments
2.1.4 Orbital Bones
3 Measurement of Skin Movements
3.1 Background
3.2 Representing Skin Motion in Reduced Coordinates
3.3 Reduced Coordinate Tracking
3.3.1 Inputs
3.3.2 Initialization
3.3.3 Flow Computation and Correction
3.3.4 Generating Body Map
3.3.5 3D Reconstruction
3.4 Results
3.4.1 Face Tracking
3.4.2 Hand Tracking
3.4.3 Limitations
3.5 Conclusion
4 Simulation of Facial Tissues of the Eye Region
4.1 Background
4.2 Constitutive Models of Facial Tissues
4.2.1 Muscle Model
4.2.2 Skin Model
4.2.3 Subcutaneous Tissue
4.3 Overview of Simulation Framework
4.3.1 Kinematics
4.4 Dynamics
4.4.1 Deformation Gradients
4.4.2 Compute Elastic Force
4.4.3 Integration
4.4.4 Tissue Attachments
4.5 Control
4.5.1 Quasistatic Formulation
4.5.2 Pulse Step Controller
4.6 Results
4.7 Muscle Activation to Pose Space Mapping
4.7.1 Optimization Results
4.8 Conclusion
5 Interactive Animation of the Eye Region
5.1 Background
5.2 Skin Movement in Reduced Coordinates
5.3 Gaze Parameterized Model of Skin Movement
5.3.1 Factors Affecting Skin Movement in the Eye Region
5.3.2 Generative Model of Skin Movement
5.4 Transferring Animations
5.5 Client Applications
5.6 Results
5.7 Conclusion
6 Measurement of Eye Movements
6.1 Background
6.2 Equipment
6.3 Gaze Estimation in Head-fixed Coordinates
6.3.1 Eye Feature Detection in Image
6.3.2 Pupil Feature Tracking
6.3.3 Imaging Geometry
6.3.4 Ocular Torsion Estimation
6.3.5 Eye Blink Detection
6.4 Gaze Estimation in World Coordinates
6.4.1 Helmet slippage compensation
6.5 Results
6.6 Conclusion
7 Anatomical Augmented Reality with Image-space Alignment
7.1 Background
7.2 Methods
7.2.1 Input Data From 3D Registration
7.2.2 Image-based Correction
7.2.3 Occlusion Estimation and Layering
7.2.4 Evaluation
7.3 Conclusion
8 Conclusion and Future Work
8.1 Future Work
Bibliography
A Supporting Materials
A.1 Calibration parameters estimation

List of Tables

Table 3.1 Skin Reconstruction Error in Hand Tracking Experiment
Table 5.1 FACS AU used in our experiments (left) and expressions that can be produced using these AUs (right).
Table 5.2 Overview of the model size, memory usage, and performance of the animation.
The experiments are run in a Chrome web browser on a desktop with an Intel Core i7 processor and an NVIDIA GeForce GTX 780 graphics card at 60 fps.
Table 7.1 Image-based corrective registration results: anatomy intersection coefficient before and after corrections

List of Figures

Figure 1.1 Thesis overview.
Figure 1.2 Overview of our skin measurement and reconstruction pipeline. Our skin tracking and reconstruction method uses a monocular camera and a depth sensor to recover skin sliding motion on the surface of a deforming body.
Figure 1.3 Overview of our skin simulation pipeline.
Figure 1.4 We describe a complete pipeline to model skin deformation around the eyes as a function of gaze and expression parameters. The face is rendered with our WebGL application in real-time and runs in most modern browsers (a) and mobile devices (b). An example of automatic facial animation generation while watching a hockey video is shown (c-d). The character starts watching the game with a neutral expression (c) and gets concerned when a goal is scored (d). Eye movements were generated automatically using salient points computed from the hockey video.
Figure 1.5 Overview of our image-space anatomy misalignment correction.
Figure 2.1 Soft tissue layers in face (Figure from [65]).
Figure 2.2 Anatomy of Asian (left) and Caucasian (right) eyelids (Figure from [33]).
Figure 2.3 Soft tissues and spaces of the orbital region (Figure from [110]): (A) orbicularis oculi, (B) levator palpebrae superioris muscle.
Figure 2.4 Orbital bone (Figure from [110]).
Figure 3.1 Overview of spaces related to skin tracking.
Figure 3.2 We use a monocular capture setup to capture subjects.
Figure 3.3 We use a body (a) and skin tracked in an image sequence (b: top). With this information, along with the input image, we can reconstruct 3D skin (b: bottom). The whole sequence assumes a fixed body. See the video for the complete sequence.
Figure 3.4 Skin motion reconstruction in a blink sequence. Top row shows tracking points (red mesh) on input images. Bottom row shows 3D skin reconstructions.
Figure 3.5 Hand Tracking: Reconstruction of 3D skin for three frames in a hand movement sequence. The sliding motion of skin is produced by the flexion of the small finger. In the top row, the input body and image sequence are shown. In the bottom row, we show the 3D reconstruction of skin along with a zoomed-in version to illustrate skin sliding. The red arrow in the last frame shows an approximate direction of skin sliding. See supplementary video to visualize the motion.
Figure 3.6 Skin Reconstruction in a hand grasping gesture.
Figure 3.7 Skin tracking error in image coordinates for hand tracking experiment. In (a) the errors are measured as ∞-norm between the tracked points and corresponding baseline skin motion data, in pixels. The body mesh is expected to produce high error (red) as it does not include skin sliding. In (b) we show that our algorithm produces low RMSE with drift correction (blue: with drift correction, black: without drift correction).
Figure 3.8 Reconstruction error (in cm) of body sequence (top) and reconstructed skin (middle) from baseline data is shown in heatmaps on a rest pose. In the bottom row, the errors (in ∞-norm) are plotted for all 25 frames in the hand tracking experiment.
Figure 3.9 Comparison of reconstructed 3D skin (blue) and baseline data (red) in hand tracking experiment.
Figure 4.1 Stress-strain behavior in our muscle model.
Figure 4.2 Stress-strain behavior in our skin model.
Figure 4.3 Illustration of tissue layers in face. Skull, muscle, and skin layers are shown separated for illustration purpose only and should not be confused with any volumetric simulation.
Figure 4.4 Fat compartments in face [79].
Figure 4.5 Layered spaces (Jacobians are shown in green).
Figure 4.6 Overall scheme.
Figure 4.7 Pulse-step activation profile.
Figure 4.8 Simulation of facial muscle units for different muscle activation levels.
Figure 4.9 Stress-strain relation of simulated muscle.
Figure 4.10 (a) Learning forward model of a muscle fiber and (b) output of step and pulse-step controllers.
Figure 4.11 Synthetic example of skin simulation: (a) undeformed and (b) deformed under muscle action.
Figure 4.12 Frowning: (a) neutral and (b) frontalis muscle activated.
Figure 4.13 Eyebrow raising: (a) neutral and (b) corrugator muscle activated.
Figure 4.14 Activation estimation for frown pose. (a) Target (red) and simulated pose after optimization (blue) in 2D skin atlas, (b) 3D reconstruction of target (red) and simulated skin (blue), (c) activation profiles of individual muscles, and (d) value of objective function with iterations.
Figure 4.15 Activation estimation for eyebrow raise pose. (a) Target (red) and simulated pose after optimization (blue) in 2D skin atlas, (b) 3D reconstruction of target (red) and simulated skin (blue), (c) activation profiles of individual muscles, and (d) value of objective function with iterations.
Figure 5.1 Overview of the spaces used for modeling skin movement.
Figure 5.2 Body is the union of “skull” and globes.
Figure 5.3 Gaze parametrized skin movement: We predict eyelid aperture from gaze using an aperture model, and we use gaze and aperture together to predict skin deformation in the eye region. If desired, expression parameters can also be included to produce facial expressions.
Figure 5.4 An overview of target transfer. (a) The target character mesh (red) is registered non-rigidly on the capture subject mesh (blue) shown in the top row. Image coordinates of the target mesh are computed from the image coordinates of the model output using barycentric mapping computed during registration. (b) The model trained on one subject can be used to generate animation of a character mesh of any topology uploaded by a user.
Figure 5.5 Flow chart of how inputs are handled in our applications by the animation model.
Figure 5.6 We can generate facial expressions interactively. From top left clockwise: normal, frown, eye closure, and surprise expressions are shown.
Figure 5.7 Facial animation automatically generated from a hockey video. The character starts watching the game with a neutral expression (a), our stochastic model generates natural blinks (b), gets anxious anticipating a goal (c), and excited when a goal is scored (d). The salient points are computed from the video and used as input to our system.
Figure 5.8 Skin movement driven by gaze during static scene observation. The red circle on the left represents the image point the subject is looking at.
Figure 5.9 Skin deformation in eye closing during a blink. The characteristic medial motion of the lower eyelid during blink is generated by the model (shown using red arrow).
Figure 5.10 Comparison with no skin deformations (top), and with deformation (bottom) using our model with different gaze positions.
Figure 6.1 Illustrative diagram of our experimental setup.
Figure 6.2 Rotation of globe in the head-fixed coordinates.
Figure 6.3 Block diagram of proposed gaze estimation framework.
Figure 6.4 Subject wearing Chronos eye tracking device in a Vicon motion capture system experimental setup (left), and a sample image from the video captured using Chronos eye tracking device (right).
Figure 6.5 Illustration: Hough transform for circles [47].
Figure 6.6 Pupil center finding algorithm.
Figure 6.7 Pupil occlusion detection algorithm.
Figure 6.8 Pupil horizontal width profile, showing center of mass based pupil center (in red) and updated pupil center based on our algorithm (in green).
Figure 6.9 The central projection of the eye onto the image plane [66].
Figure 6.10 Iris detection and annulus selection for ocular torsion estimation.
Figure 6.11 Ocular torsion and eye velocity in eye movements showing optokinetic nystagmus.
Figure 6.12 Blink profile and eyelid velocity profile at an instant of eye blink.
Figure 6.13 The coordinate systems used in our project.
Figure 6.14 Head-fixed coordinate system with origin at the globe center. Ta, Ta, Ta are markers on the helmet of the eye tracker. M1, M2, and M3 are markers used to compensate for the helmet slippage.
Figure 6.15 Helmet slippage in an eye tracking experiment with extreme head movement.
Figure 6.16 Distribution of point of regard (in green) while the subject fixates at marker shown in red.
Figure 6.17 Standard deviation of error in point of regard in binocular gaze estimation (in mm) along X, Y, and Z directions of the world coordinates.
Figure 7.1 We propose an image-based correction (step 1), and an occlusion estimation and layering technique (step 2). In the first step, we correct anatomy regions separately. In the second step, we combine them in correct order to generate the final overlay image.
Figure 7.2 Overview of our corrective registration system.
Figure 7.3 Feature estimation for left arm. Left: anatomy features (S) (blue) are estimated using anatomy landmarks (red) and sub-landmarks (green). Right: depth features (D) are estimated using Kinect depth landmarks (red) and sub-landmarks (green). Here we are showing one sub-division of landmarks.
Figure 7.4 Feature estimation: When the preliminary 3D registration is superimposed on the input image (a), we can observe misalignment of the anatomy (b). We estimate anatomy features (d) from the intermediate anatomy images (c). In the bottom row, we show estimation (f) and segmentation (g) of depth contours from the depth map (e), and estimated depth features (h).
Figure 7.5 Landmark correction: Our skeleton correction algorithm corrects initial Kinect 2D skeleton in image space (b) to produce more consistent configurations (c).
Figure 7.6 Cage generation and warping of the right upper limb. Yellow points are anatomy landmarks; white ones are updated landmarks. The red zones represent anatomy cages.
Figure 7.7 Image-based corrective registration: Misalignments in the anatomy regions observed in (a) are corrected by our image-based corrective algorithm to produce (b).
Figure 7.8 Occlusion handling: Our image-based corrective algorithm corrects misalignments in the rendering (a) of initial 3D registration by warping anatomy regions in image space and in separate layers. Rendering them without knowing their relative distances from camera creates occlusions (b). Our occlusion handling algorithm can recover these relative distances and render these regions in correct order (c).
Figure 7.9 The anatomy alignment coefficients for anatomy regions are shown before and after image-based correction for the squat sequence.

Acknowledgments

Thanks to my Mom and Dad: for always believing in my dreams. Although you are thousands of miles away while I am writing this acknowledgement, your warm love always remains with me as a constant source of inspiration.

I would like to express my sincere gratitude to my supervisor Prof. Dinesh Pai for his continuous support and guidance throughout my studies and research at UBC. I would also like to thank the members of my supervisory committee: Prof. Robert J. Woodham and Prof. Jim Little. Prof. Woodham has provided me with valuable advice regarding my research and career path. Prof. Little’s invaluable suggestions during my committee meetings have helped me in deciding my research direction. Also, special thanks to Prof. François Faure, for his encouragement and advice during my research internship at Inria Grenoble Rhône-Alpes, France.

Thanks to all current and past members of Sensorimotor Systems Laboratory (SSL), without you my experience during doctoral studies would not have been so amazing! Thanks to the countless discussions and brainstorming sessions, which have certainly made me a better researcher. Special thanks to Anurag Ranjan, for our collaborations for two years. I fondly remember those discussions on face tracking and scene flow that used to last several hours.
I remember those sleepless nights before SIGGRAPH submissions (“a must have experience”) that I shared with most of you. In general, I am grateful to the following members of SSL: Sang-Hoon Yeo, Martin Lesmana, Shinjiro Sueda, David Levin, Ye Fan, Duo Li, Prashant Sachdeva, Joao Cardoso, Jolande Fooken, and Cole Shin for their unfailing support and assistance.

A very special gratitude goes out to Prof. Pai and the funding sources for funding my research and conference attendance.

And finally, last but by no means least, thanks to my dear friends Shailendra and Geetansh for weekend squash sessions, pub nights, and entrepreneurial endeavours; thanks for being there in both exciting and challenging times. Thanks to the members of the meetup group “Grenoble Conversational English” that I created during my stay in Grenoble, France, for making my stay in France memorable. Thanks to all the members of the Vancouver Assamese community: you helped me make Vancouver a home away from home.

Thanks for all your encouragement! This thesis is dedicated to you all!!

Chapter 1

Introduction

Whether or not the Bard actually said “The eyes are the window to your soul,” the importance of eyes is well recognized, even by the general public. Alfred Yarbus’s influential work in the 1950s and 1960s quantified this intuition; using an eye tracker, he noted that observers spend a surprisingly large fraction of time fixated on the eyes in a picture. The eyes of others are important to humans because they convey subtle information about a person’s mental state (e.g., attention, intention, emotion) and physical state (e.g., age, health, fatigue). Consequently, eyes are carefully scrutinized by humans and other social animals. Creating realistic computer generated animations of eyes is, therefore, a very important problem in computer graphics.

But what is it about eyes that conveys this important information to observers? To discuss this, we need to introduce some terms.
The colloquial term “eye” is not sufficiently precise. It includes the globe, the approximately spherical optical apparatus of the eye that includes the colorful iris. The globe sits in a bony socket in the skull called the orbit. The term “eye” also usually includes the upper and lower eyelids and periorbital soft tissues that surround the orbit, including the margins of the eyebrows. When we refer to the eye we mean all of these tissues, and we will use the more specific term where appropriate.

The most obvious property of eyes is gaze, that is, what the eyes are looking at. This is entirely determined by the position and orientation of each globe relative to the scene. Almost all previous work in animation of eyes has been on animating gaze, with some recent attention paid to the appearance of the globe and iris, and the kinematics of blinks. For instance, an excellent recent survey of eye modeling in animation [85] does not even mention the soft tissues or wrinkles surrounding the eyes.

Gaze is clearly important, but the soft tissues of the eye also convey a lot of information. We refer to this region as the eye region in this thesis. For instance, the well-known “Reading the Mind in the Eyes” test developed by autism researcher Baron-Cohen [7] used still images of people’s eyes, without any context about what they were looking at, and hence little gaze information (footnote 1). Yet, normal observers are able to read emotion and other attributes from these images alone. Facial expressions are one of the key components of effective communication. Emotions such as anger, happiness, or sadness are accompanied by characteristic facial tissue deformations. For example, we widen our eyes when we are surprised or droop the eyelids when we are fatigued.
Similarly, when we observe a scene or a painting, our eyelids produce a wide range of deformations which are highly correlated with gaze. Animating the soft tissues around the eyes is extremely important for computer graphics, but has largely been ignored so far.

Footnote 1: You can try it yourself at http://well.blogs.nytimes.com/2013/10/03/well-quiz-the-mind-behind-the-eyes/

What do we set out to achieve?

We are interested in measurement and modeling of subtle movements of the soft tissues of the eye region, also known as periorbital soft tissues. With this goal, we propose several methods for measurement, simulation, and real-time motion synthesis of these tissues.

The main goals of this thesis are summarized below:

• Measurement. Our first goal is to measure 3D eye movements and deformations of skin around the eyes using a simple monocular camera setup. The thin sheet-like structure of the facial muscles and tissues around the eyes allows us to represent them as thin elastic sheets in a reduced coordinate system. Furthermore, we want to use our measurement method to measure deformations of other thin surfaces sliding over rigid or semi-rigid objects.
• Simulation. Second, we aim to simulate the motions of the periorbital muscle and skin using a layered physics-based simulation method. This system should efficiently and robustly simulate sliding deformations of facial tissues around the eyes. We also want to investigate how feasible it is to estimate muscle activations for different facial expressions, using the data collected from video-based skin deformation measurement.
• Animation. We want to train a machine learning motion model of skin that predicts skin configuration around the eyes when gaze and other expression parameters are given as inputs. Our model should predict skin configuration in real-time and allow interactive facial animations.
We will utilize our efficient reduced coordinate representation of skin and use a simple but fast motion model to achieve this goal.
• Anatomical Augmented Reality. We also want to propose an image-based method to correct the anatomy misalignment that accumulates in a 3D anatomy registration system during system steps such as motion tracking and anatomy transfer. Our method should estimate these corrections based on 2D features estimated using images captured by an RGB-D camera and images generated by the 3D registration system.

Figure 1.1: Thesis overview.

1.1 Measurement of Eye and Facial Skin Movements

1.1.1 Motivation

Apart from serving as sensory organs, the eyes also play an important role in conveying our emotions during conversation, which is why they are considered one of the most integral parts of facial expression detection algorithms in computer vision. Robust non-intrusive eye detection and tracking is therefore crucial for the analysis of eye movement, and this motivated us to work toward the development of a robust eye detection and tracking system. In the scientific community, the importance of eye movements is implicitly acknowledged as the method through which we gather the information necessary to identify the properties of the visual world. A review of eye tracking methodologies spanning three disciplines (neuroscience, psychology, and computer science), with brief remarks about industrial engineering and marketing, is reported in [29].

Figure 1.2: Overview of our skin measurement and reconstruction pipeline. Our skin tracking and reconstruction method uses a monocular camera and a depth sensor to recover skin sliding motion on the surface of a deforming body.

Technology for reconstructing and tracking 3D shapes is now widely available, especially due to the availability of inexpensive sensors such as Microsoft’s Kinect and Intel’s RealSense cameras.
In conjunction with template-based motion tracking [15, 17, 98, 100], one can generate a sequence of 3D meshes that represent the shape of the body. However, these mesh animations do not accurately capture the motion of skin, since skin can slide over the body without changing the body’s shape. For example, the skin on the face and hands can stretch, wrinkle, and slide during natural movements without being detected by a depth sensor. Similar to 3D gaze, the skin movements around the eyes also convey a lot of important information about mental state, and therefore, capturing the detailed motion of skin is crucial to understanding facial expressions. This motivated us to propose an algorithm that can track and reconstruct 3D sliding motions of skin on the surface of a deforming body, such as the human face and hands.

1.1.2 Contributions

• Robust 3D Gaze Tracking. Although our goal is to measure and model subtle motions of periorbital facial tissues, we also want to explore to what extent gaze and the motions of these tissues are correlated. Therefore, we propose a system to track 3D eye movement, also known as 3D gaze, using a monocular RGB camera setup. Our gaze tracker can also automatically detect eye blinks.
• Efficient Facial Skin Tracking. We propose a new method for capturing the sliding motion of skin over a body using the color video information that is usually available in addition to depth. The key observation is that the skin and muscles of the face, especially around the eyes, and on the back of the hand, are very thin and sheet-like [59]. In such regions, skin can be well approximated as a thin sheet sliding on the surface of an underlying rigid or articulated rigid body structure, which we call the body. Our proposed reduced coordinate representation of skin allows recovery of skin sliding motion in the eye region using a monocular image sequence and implicitly constrains it to always slide on the surface of the body.
• Tracking Other Skin-like Materials.
Our framework can also be used to reconstruct skin sliding motions on any other deforming body, as long as skin shares the shape of the body when it slides. For example, we can model skin deformation on hands when we perform hand gestures.

1.2 Simulation of Facial Tissues

Figure 1.3: Overview of our skin simulation pipeline.

1.2.1 Motivation

We can successfully measure skin motions using a reduced coordinate framework. This motivated us to find out whether the same framework could also be used to simulate the motions of thin tissues, such as facial muscles and skin in the forehead and around the eyes. One of the primary challenges of volumetric tissue simulation is dealing with collisions between the tissue and the underlying rigid or semi-rigid body. Having multiple layers of tissue complicates these problems further. We want to exploit the sheetlike structure of the facial muscles and simulate them relatively inexpensively in a physics-based simulation framework, using a 2D reduced coordinate system. We also need a controller for the system of facial muscles. Pulse-step controllers are commonly used to control eye movements [82]. We therefore decided to use a pulse-step controller for our simulation framework, based on the evidence from the experiments of [34, 82].

1.2.2 Contributions

• Efficient Reduced Coordinate Simulation. We propose a quasistatic physics-based simulation system for facial tissues around the eyes. By exploiting the thin sheetlike structure of the facial muscles and skin in that region, we use a reduced coordinate representation to model and simulate these tissues. With this framework we inherit the benefits that we previously obtained in skin movement measurement, such as faster deformation computation and avoidance of collision issues between tissues.
• Layered Simulation of Tissues. We modeled facial muscles and skin as layers, and simulated them as Eulerian and Lagrangian simulations, respectively [35, 59].
This layered simulation also allows modeling of tissue attachments and coupling.
• A Simple Controller for Facial Tissues. Many experiments on eye and facial muscles have shown that they move under burst-tonic innervations [34, 82]. We therefore decided to use a pulse-step controller designed specifically to control our reduced coordinate layered facial tissue simulation system. Using this controller we can produce the dynamic behaviour of facial tissue movements.

1.3 Motion Modeling and Interactive Animation

Figure 1.4: We describe a complete pipeline to model skin deformation around the eyes as a function of gaze and expression parameters. The face is rendered in real-time with our WebGL application, which runs in most modern browsers (a) and on mobile devices (b). An example of automatic facial animation generation while watching a hockey video is shown (c-d). The character starts watching the game with a neutral expression (c) and gets concerned when a goal is scored (d). Eye movements were generated automatically using salient points computed from the hockey video.

1.3.1 Motivation

With the widespread use of WebGL and the hardware commonly found in small devices, it is now possible to render high quality human characters over the web. Both eye diffraction effects [105] and skin subsurface scattering properties [2] can be rendered at interactive rates. We make use of these advances to generate facial animations using our model. With our measured 3D gaze and skin movement data, we are interested in building a system for real-time animation of all the soft tissues of the eye, driven by gaze and a small number of additional parameters. These animation parameters can be obtained using traditional keyframed animation curves, measured from an actor’s performance using off-the-shelf eye tracking methods, or estimated from the scene observed by the character, using behavioral models of human vision.
To our knowledge, our work is the first to specifically address modeling the dynamic motion of eyes along with details of the surrounding skin deformation. Our work is complementary to the important previous work in gaze animation and on modeling the geometry and appearance of the globe and eyelid.

1.3.2 Contributions

• Parametrized Motion Model. We propose a novel parametrized facial motion model that requires minimal space and computation to simulate complex tissue movements in the eye region. This motion model describes the motion of skin in the reduced coordinate framework already used in our skin measurement and simulation frameworks. The parameters of these generative models are learned using the data obtained during skin motion measurement.
• Model Transfer. The skin motion model trained on a particular actor’s facial expressions can be transferred to other characters. This lets us transfer expressions unique to the actor to other animated characters.
• Realtime Animation. Once the generative model is learned, movements of the eyes are efficiently synthesized from a specification of gaze and optional parameters such as brow raising. These parameters can be obtained from any source, such as keyframe animation or an actor’s performance. Motion synthesis is extremely fast, with speed only constrained by rendering time. A client application can download our motion models from a server and render eye animation interactively using WebGL, a cross platform specification supported by modern web browsers.

1.4 Anatomical Augmented Reality with Image-space Alignment

1.4.1 Motivation

Recently, there has been an increasing interest in building a mirror-like augmented reality (AR) system to display the internal anatomy of a user superimposed on video [9, 14, 63]. These kinds of applications are becoming widely known as “anatomical mirrors”. The work most relevant to our research is [9], whose anatomy mirroring application runs interactively in real-time.
However, the system often produces erroneous results for two main reasons: (1) the system solves an overly constrained optimization problem during 3D anatomy registration, and (2) the system has many stages, such as motion tracking and anatomy transfer, where errors accumulate. These errors show up as anatomy misalignments when anatomy is superimposed on video. Our method efficiently corrects these misalignments in image-space by avoiding full three dimensional corrections, and it can be easily coupled with an existing real-time 3D anatomy registration system, such as [9], without significant reduction in frame rate.

Figure 1.5: Overview of our image-space anatomy misalignment correction.

1.4.2 Contributions

• Image-space Anatomy Correction. We propose an image-based method that corrects the anatomy registration errors of a 3D registration system, such as [9]. We estimate a transformation using the output data of the system and warp the anatomy regions in image-space to correct misalignments.
• Occlusion Handling. We also propose an occlusion handling and layering algorithm that determines how to combine different warped anatomy regions in layers to generate a final augmented reality video.

1.5 Thesis Outline

This thesis investigates measurement, modeling, and animation of subtle facial tissue movements for different gaze and facial expressions. To our knowledge, our work is the first attempt to automatically generate subtle skin deformations of the eye region based on gaze. Figure 1.1 provides an overview of the methods proposed in this thesis.

In Chapter 2, we provide the reader with background on the structure of the facial tissues of the eye region. We also discuss our motivation behind representing facial muscles and skin of the eye region as thin sheets using a reduced coordinate framework.

In Chapter 3, we describe a novel method to track facial skin or thin skin-like surfaces using a reduced coordinate representation of skin.
This allows us to measure very subtle motions of facial skin with very few computations. This method is very general and can be used with any state-of-the-art optical flow technique.

In Chapter 4, to generate the motions of facial muscles, we first model facial tissues of the eye region using physics-based models, and then simulate them using quasistatic simulations. Again, we utilize a reduced coordinate representation of skin to efficiently simulate muscle and skin of the eye region as thin elastic layers. We made an attempt to estimate tissue parameters using an optimization framework that uses measurements from Chapter 3 and a physics-based model of facial tissues. This approach was not successful for several reasons, which we discuss in Chapter 4. Therefore, we decided to move in a new direction, where we model facial skin motions using a machine learning approach.

To understand and model facial expressions of the eye region, along with measurement of facial tissue movements, we also need to track eye movements in three dimensions, also known as 3D gaze. In Chapter 6, we propose a video-based 3D eye tracking method that is robust to occlusions caused by eye blinks. This method is an extension of the seminal work by [66, 83]. We have used our 3D gaze tracker extensively in the studies presented in this thesis.

In Chapter 5, we propose a method to model and interactively animate facial skin motions of the eye region. Our model can automatically generate subtle facial animations based on gaze and additional parameters. These parametrizations allow us to generate different facial expressions, including generation of expressive wrinkles. Our models are lightweight, and models trained on one user can be used to transfer user-specific expressions to other subjects.

Finally, in Chapter 7, we propose an anatomical augmented reality system that can efficiently correct anatomy misalignments in image-space in a template-based 3D anatomy registration system.
The topic of investigation in this chapter is not directly related to the theme of the other chapters, but this has potential applications in augmented reality, such as anatomical mirroring. Our method can handle occlusions among anatomy regions, which were not taken care of in previous work [14, 63].

The key aspect of our work is to represent skin as a thin elastic sheet using a reduced coordinate framework. This representation of skin has been extensively used in all of our measurement, modeling and animation modules. This framework allows us to both measure and model skin motions using a simple monocular video-based setup. This simple yet efficient representation of skin has helped us to create a web application for facial animation, which we call EyeMove, that runs interactively on laptops and mobile devices.

Chapter 2
Anatomy of the Eye Region

The soft tissues around the eyes, which we call the eye region, have a very important role in generating facial expressions. In this chapter, we present a brief introduction to the anatomy of the facial tissues that will help readers to understand the complexity of the eye region. Our proposed measurement and animation system is an attempt to capture subtle and complex motions of these tissues of the eye region.

2.1 Facial Tissue Structure

As described in [65], the human face can be divided into five concentric layers (see Figure 2.1): skin, subcutaneous layer, musculo-aponeurotic layer, retaining ligaments and spaces, and fixed periosteum and deep fascia. Although these five layers can be found throughout the face, the complexity of their arrangement and anatomy varies across the face. For example, the anatomy of the lower face is usually more complex in comparison to that of the scalp.
Here, we will limit our discussion to the skin, muscles, ligaments, and bones of the eye region.

Figure 2.1: Soft tissue layers in face (Figure from [65]).

2.1.1 Facial Skin

Skin, as an outer covering of vertebrates, plays several important roles, including protecting internal body structures from pathogens and the external environment, sensing temperature, pressure or vibration as a sensory organ, and maintaining thermoregulation. Skin is a stratified structure consisting of three layers: epidermis, dermis, and a subcutaneous fat layer. The epidermis, with a thickness between 0.008 and 0.013 mm, is the outermost layer of skin and can be further divided into four strata. The outermost of these is called the stratum corneum. While the stratum corneum consists of dead keratinized cells, the other strata contain living cells. The layer below the epidermis is known as the dermis; it supports the epidermis and usually has a thickness between 0.3 and 4 mm. It consists of a dense layer of fibrous proteins: collagen, elastin, and reticulin. The dermis also contains ground substances that include polysaccharides, water, enzymes, etc. The innermost layer of skin is the subcutaneous fatty layer, also known as the hypodermis, which protects skin against impact forces by absorbing energy. The experiment performed in [39] found that at low strain skin is very weak, which is attributed to the elastin fibers in the dermis, while at larger strains the effect of collagen fibers dominates and the skin shows highly elastic behavior. The dermis and the living cells in the epidermis also make the skin viscoelastic. The facial tissue simulations implemented in Chapter 4 model skin as an orthotropic hyperelastic material and do not include the viscoelastic property of skin.

2.1.2 Facial Muscles

The eyelids play important roles in collecting and distributing tears over our eyes, protecting the eyes from injury and glare, and removing airborne particles from the front of the eye using the eyelashes.
The aesthetic appeal of a well shaped eyelid also increases its importance in the field of reconstructive surgery. The separation between the upper and lower eyelids is known as the palpebral fissure. The vertical and horizontal distances of the fissure are usually 8-10 mm and 30-31 mm, respectively, in an adult. Another important measurement is the angle between the upper and lower eyelids, which is approximately 60 degrees medially and laterally. The two major muscles responsible for voluntary and involuntary eyelid movements are the orbicularis oculi muscle (OOM) and the levator palpebrae superioris muscle (LPSM) (see Figure 2.3).

Orbicularis oculi muscle. The orbicularis oculi muscle is a striated muscle lying just below the skin, and it mostly forms the shape and structure of the eyelid. This muscle can be divided into two parts: orbital and palpebral. The orbital portion originates from the frontal process of the maxillary bone, the orbital process of the frontal bone, and the common medial canthal tendons, forming a continuous ellipse around the orbital rim. The palpebral region is further divided into preseptal and pretarsal regions based on the regions underneath them. The preseptal part lies over the orbital septum, a protective membrane anterior to the globe that extends from the orbital rim to around 3-5 mm above the tarsal plate. The pretarsal part overlies the tarsal plate, a thick and elongated connective tissue that gives the eyelid its shape and structural rigidity. Unlike the orbital region, the palpebral region sweeps only in two half ellipses along the circumference of the orbital rim, both fixed medially and laterally in the canthal tendons. The deep heads of the pretarsal muscle fuse medially to form Horner’s muscle, which contributes to the lacrimal pump mechanism and tightens the eyelids against the globe.

Levator palpebrae superioris. The levator palpebrae superioris muscle and the Müller muscle are the two major eyelid retractors of the upper eyelid.
The origin of the LPSM is the lesser sphenoid wing, and it passes above the superior rectus muscle to finally form the levator aponeurosis near Whitnall’s ligament. Whitnall’s ligament is a condensation seen near the superior orbital rim, and attaches to the lateral and medial walls and soft tissues of the orbit. Whitnall’s ligament provides mechanical support to some important anatomical structures in the upper face region. As the levator aponeurosis passes into the eyelid from Whitnall’s ligament, it forms a “horn” like structure medially and laterally. The levator aponeurosis continues downward to insert into the tarsal plate, usually 3-4 mm above the upper eyelid margin [110]. The role of the capsulopalpebral fascia in the lower eyelid is analogous to that of the levator aponeurosis in the upper eyelid. The fascia originates as the capsulopalpebral head from attachments to the terminal muscle fibers of the inferior rectus muscle, and then divides to fuse into the inferior oblique muscle. Anteriorly, the capsulopalpebral head divisions join to form the Lockwood suspensory ligament. The inferior tarsal muscle in the lower eyelid is analogous to the Müller muscle [74].

Figure 2.2: Anatomy of Asian (left) and Caucasian (right) eyelids (Figure from [33]).

Occidental vs. Oriental Eyelids. The most characteristic difference between occidental and oriental eyelids is the position of the eyelid crease, which is predominantly determined by the insertion of projections of the levator aponeurosis into the dermis. In the Caucasian or occidental upper lid, the orbital septum inserts into the levator aponeurosis 2 to 5 mm above the superior tarsal border, while in the Asian or oriental upper lid it inserts onto the levator aponeurosis more inferiorly. The levator palpebrae superioris muscle and dermis fuse superior to the tarsus in occidental eyelids, while in oriental eyelids this fusion happens inferior to the upper tarsal border.
Other unique features of the Asian eyelids include vertically narrow palpebral fissures, lash ptosis, a full appearance to the upper lids and sulci, and commonly epicanthal folds [99].

2.1.3 Retaining Ligaments

As we discussed before, the retaining ligaments hold the facial muscles in their proper positions by attaching them to the skull. They also connect the dermis to the underlying periosteum. In facial rejuvenation research, studying the anatomy and role of the retaining ligaments is very important because they are the major structures that hold the superficial facial structure. With age, however, these ligaments weaken and the facial structures start to sag. The two major retaining ligaments in the face are the orbicularis retaining ligament and the zygomatic retaining ligament. The orbicularis retaining ligament, which helps to hold the OOM to the skull, is a bilaminar membrane that connects the fascia on the underside of the orbicularis oculi to the periosteum of the orbital rim, and which merges laterally with the lateral orbital thickening [94]. The zygomatic ligament of the face is an osteocutaneous ligament that originates from the periosteum of the zygoma and/or the anterior and lateral border of the zygomatic arch and inserts into the superficial muscular aponeurotic system, which is connected to the dermis of the cheek. The zygomatic ligament restrains the facial skin against gravitational changes and delineates the anterior border of the cheeks [84].

Figure 2.3: Soft tissues and spaces of the orbital region (Figure from [110]): (a) orbicularis oculi, (b) levator palpebrae superioris muscle.

2.1.4 Orbital Bones

The orbit is an approximately pear-shaped cavity (with an approximate volume of 30 cc) in the upper head, within which numerous important structures are closely packed. As shown in Figure 2.4, there are seven bones in the orbit: ethmoid bone, palatine bone, frontal bone, sphenoid bone, lacrimal bone, zygomatic bone, and maxillary bone.
The orbital walls can be divided into roof, floor, medial wall, and lateral wall. The muscles, nerves, and vessels in the orbit are padded with orbital fat. Most of the tissues in this region are associated with visual functions. However, the orbital bone also provides attachment sites to the orbicularis oculi muscle. Apart from the broad opening in the anterior, there are a series of foramina, canals, and fissures to communicate with the extraorbital compartments [110].

Figure 2.4: Orbital bone (Figure from [110]).

Chapter 3
Measurement of Skin Movements

3.1 Background

Technology for reconstructing and tracking 3D shapes is now widely available, especially due to the availability of inexpensive sensors such as Microsoft’s Kinect and Intel’s RealSense cameras. In conjunction with template-based motion tracking [15, 17, 98, 100], one can generate a sequence of 3D meshes that represent the shape of the body. However, these mesh animations do not accurately capture the motion of skin, since skin can slide over the body without changing the body’s shape. For example, the skin on the face and hands can stretch, wrinkle, and slide during natural movements without being detected by a depth sensor. Therefore, in this chapter, we propose a new method for capturing the sliding motion of skin over a body using the color video information that is usually available in addition to depth.

Recovering 3D shape from monocular capture is an ill-posed problem, and several constraints are imposed to limit the range of solutions. Two widely known techniques are non-rigid structure from motion [19, 27] and shape from template [8, 42]. Non-rigid structure from motion (NRSfM) techniques [19, 28, 45] are used to recover non-rigid structures from a sequence of images that captures the motion of the object. NRSfM offers a model-free formulation, but it usually requires correspondences in a long image sequence.
On the other hand, shape from template techniques use the constraints imposed by isometric or conformal deformations to reconstruct 3D shape [8, 42, 64, 87]. Our work is related to shape from template techniques such as the work of Garrido et al. [42].

Facial expressions are one of the key components of effective communication. Emotions such as anger, happiness, or sadness are accompanied by characteristic facial tissue deformations. Eye movements are particularly important for non-verbal communication; in addition to changes in gaze, the configuration of the skin around the eyes conveys significant information about our mental and physical states that can be recognized even from single images [7]. For example, we widen our eyes when we are surprised or droop the eyelids when we are fatigued. Similarly, when we observe a scene or a painting, our eyelids produce a wide range of deformations which are highly correlated with gaze.

Surprisingly, there is very little research that focuses on tracking and reconstructing skin in the eye region. Recent work by Bermano et al. [12] tracked eyelid motions and skin deformation around the eyes using a complex multiple camera setup, but reconstructed only simple motions such as blinking. On the other hand, in some other work eyelid margins are tracked but detailed skin deformations around the eyes are ignored [26, 67, 75]. Garrido et al. [42] estimated the 3D structure of the face and refined the shape using photometric and optical flow constraints using a monocular setup, but ignored the fine reconstruction of eyelid motions. Currently, to our knowledge, no method exists to track and reconstruct skin sliding deformation around the eyes using a simple monocular camera setup.

The key observation is that the skin and muscles of the face, especially around the eyes, and on the back of the hand, are very thin and sheet-like [59].
In such regions, skin can be well approximated as a thin sheet sliding on the surface of an underlying rigid or articulated rigid body structure, which we call the body. This approximation allows us to represent skin in a low dimensional space and implicitly constrain it to always slide on the surface of the body. This is the core idea of our proposed skin movement measurement method, which allows efficient reconstruction of subtle skin sliding motions. Such motions are small but highly noticeable, especially in the face. Capturing such skin movements from human subjects can enable the construction of data driven models of the face [69].

Our main contribution is the use of the reduced coordinate representation of skin to track and reconstruct skin sliding during facial expressions and hand gestures. This representation makes our system efficient and robust. Our method complements existing face and gesture tracking techniques by recovering characteristic skin sliding motions from a sequence of images. The reduced coordinate representation automatically constrains skin to slide on the tracked surface. Furthermore, it is easy to use, with minimal setup. It can utilize widely available RGB-D cameras and can use any optical flow technique. Our algorithm can correct two types of errors: first, tracking drift generated by the optical flow technique, and second, 3D reconstruction error due to error in the mappings of the reduced coordinate representation to 3D.

3.2 Representing Skin Motion in Reduced Coordinates

To represent the sliding motion of skin over a deforming surface and its measurement by a video camera, we use the reduced coordinate representation of skin introduced by Li et al. [59]; however, we discretize the skin instead of the body, i.e., we use a Lagrangian discretization (see Figure 3.1). Our reduced coordinate system has several benefits that make it efficient and robust.
First, by representing three-dimensional skin in a two-dimensional space we can efficiently compute skin configuration. Second, by constraining the synthesized skin movement to always slide tangentially on the underlying body, our skin reconstruction is robust against bulging and shrinking and other interpolation artifacts.

Figure 3.1: Overview of spaces related to skin tracking.

Since skin is a thin structure that slides on an underlying body, we need to represent the skin and body separately. We will assume that the character mesh to be animated is discretized into a triangular mesh in a reference pose in 3D. The skin and body meshes are aligned in the reference pose (top row, Figure 3.1). The skin is parameterized by a map $\pi$, using an atlas of rectangular coordinate charts.

Following the notation of Li et al. [59], skin points are denoted $X$ in 3D and $u$ in 2D chart coordinates. By a small but common abuse of notation, we will use the same symbols to denote the set of points corresponding to vertices of the skin mesh, represented as stacked vectors. They are related as:

$$X = \pi(u). \tag{3.1}$$

Each chart may be associated with a texture image $I$. Such meshes, parameterization, and textures can all be obtained from standard RGB-D scanning and mesh registration techniques (in our examples we used Faceshift [17]).

The skin slides on the body, which has the same shape and topology as $X$, and produces $\tilde{X}$ in an intermediate physical space. The sliding motion of the skin is represented by the transformation $\phi_s^{(t)}$ at time $t$. We can obtain a skin point in the intermediate physical space at time $t$, $\tilde{X}_t$, as follows:

$$\tilde{X}_t = \phi_s^{(t)}(X). \tag{3.2}$$

As the characters move in 3D, we assume they are imaged with some RGB-D system which produces a body mesh $\bar{x}_t$ at frame $t$. It is usually produced by some variant of the ICP algorithm with point cloud data. However, the key point is that these meshes capture the body shape but not the sliding of skin over the body. The true skin mesh, $x_t$, is no longer aligned with the body mesh $\bar{x}_t$.
Our goal is to reconstruct the true skin mesh $x_t$.

To this end, the skin is imaged by a camera, with associated projection matrix $P$, to produce a color image $I_t$. The image coordinate of the skin vertex $x_t$ is denoted $u_t$. Therefore, we can write:

$$u_t = P(x_t). \tag{3.3}$$

Note that since we have depth information from the body mesh $\bar{x}_t$, $P$ is essentially a 3D projective transformation (and not a projection) and therefore invertible. This inverse is called the body map $M_t$, and it maps a skin point in the camera coordinates onto the body surface.

Finally, note that once we know the locations $u$ of mesh vertices in an image, we can define a function $f$ in the visible regions of skin, from the atlas to the image, by interpolating the values between vertices. This function can be used to warp or transfer pixels from the texture atlas images $I$ to the camera image.

The underlying body on which skin slides can also change its shape with time. We represent the body in physical space as $\bar{x}$. At time $t$, the body point is denoted by $\bar{x}_t$. The skin in physical space is denoted by $x$. At time $t = 0$ the body and skin share the same shape. The body in the intermediate physical space is fixed and is the same as the body at $t = 0$, i.e. $\bar{x}_0$. The change in shape of the body is represented by the transformation $\phi_b^{(t)}$ at time $t$. Therefore a body point at $t$ can be represented as follows:

$$\bar{x}_t = \phi_b^{(t)}(\bar{x}_0). \tag{3.4}$$

Please note that we assume body meshes $\bar{x}$ are obtained from standard video-based mesh registration techniques such as Faceshift [17]. From the intermediate physical space to $x$, we compute the map $\phi_b$. Thus, we obtain $x$ from a skin point $X$ in the skin space as:

$$x_t = \phi_b^{(t)} \circ \phi_s^{(t)}(X). \tag{3.5}$$

With all the above definitions we can define a composite mapping $f$,

$$u_t = P\left(\phi_b^{(t)} \circ \phi_s^{(t)}(\pi(u))\right) = f^{(t)}(u). \tag{3.6}$$

We can generate texture at $t$ by texture mapping the image at $t$ using $f^{-1}$.

3.3 Reduced Coordinate Tracking

Our reduced coordinate skin tracking and reconstruction algorithm is summarized in Algorithm 1.
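As a rough illustration of the flow update and advection steps of Algorithm 1 (Eqs. 3.7 and 3.8), the following Python sketch advects 2D tracked points with a predictive flow and an optional corrective flow. It is not the implementation used in this work, which relies on dense optical flow in Matlab: the function `track_points` and the callables `predictive_flow` and `corrective_flow` are hypothetical stand-ins that return per-point displacements, and the toy flows (a constant bias and an idealized correction) are assumptions chosen only to make drift visible.

```python
def track_points(u0, num_frames, predictive_flow, corrective_flow=None):
    """Advect 2D points over frames and return the point set at every frame.

    predictive_flow(t, u)       -> per-point w_p (stand-in for flow I_{t-1} -> I_t)
    corrective_flow(t, u_star)  -> per-point w_c (stand-in for flow I_t -> I*_t)
    Both callables are hypothetical placeholders for an optical flow method.
    """
    u = [tuple(p) for p in u0]
    history = [u]
    for t in range(1, num_frames + 1):
        w_p = predictive_flow(t, u)
        # Predicted locations u*_t = u_{t-1} + w_p
        u_star = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(u, w_p)]
        if corrective_flow is None:
            w_c = [(0.0, 0.0)] * len(u)
        else:
            w_c = corrective_flow(t, u_star)
        # Eq. 3.7: w = w_p + w_c
        w = [(px + cx, py + cy) for (px, py), (cx, cy) in zip(w_p, w_c)]
        # Eq. 3.8: u_t = u_{t-1} + w
        u = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(u, w)]
        history.append(u)
    return history


# Toy setup (assumed, not measured data): points translate by 1 pixel per
# frame in x, while the predictive flow overestimates the motion by 0.1 pixel.
U0 = [(0.0, 0.0), (5.0, 2.0)]


def true_positions(t):
    return [(x + 1.0 * t, y) for x, y in U0]


def biased_flow(t, u):
    return [(1.1, 0.0) for _ in u]


def ideal_correction(t, u_star):
    # Idealized corrective flow: residual between true and predicted locations.
    return [(tx - x, ty - y) for (tx, ty), (x, y) in zip(true_positions(t), u_star)]


drifting = track_points(U0, 10, biased_flow)
corrected = track_points(U0, 10, biased_flow, ideal_correction)
```

In this toy run the uncorrected track accumulates the bias frame after frame, while the corrected track stays on the true positions. With real images the corrective flow is itself an estimate, so the residual would be reduced rather than removed.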
Here we describe different components of this algorithm in greater detail.

3.3.1 Inputs

Our system requires a sequence of 3D meshes registered to the motion of an object. We call this sequence a ‘body sequence’ (or $\bar{x}_t$, $t = 1 : T$). The body sequence can be obtained by any mesh tracking technique (e.g. [98, 100]). For our face tracking example (Section 3.4.1), we used Faceshift [17] to generate the body sequence. Any calibrated monocular camera can be used to produce the image sequence.

Algorithm 1: Skin Tracking in Reduced Coordinates
Input: Reference mesh $X$, with a set of reference texture images $I$; projection matrix $P$; for $t = 1 : T$, a sequence of body meshes $\bar{x}_t$ and camera images $I_t$
Output: Sequence of skin configurations, $x_t$
    Initialization: Set $u_0$ and $I_0$, using $X$ and $I$
    for $t = 1$ to $T$ do
        Generate: body map $M_t$
        Compute: dense optical flow $w_p$ from $I_{t-1}$ to $I_t$
        Using correspondences between $u_0$ and $u_t^*$, where $u_t^* = u_{t-1} + w_p$, warp $I$ with $f$ to generate $I_t^*$
        Compute: corrective dense optical flow $w_c$ from $I_t$ to $I_t^*$ to remove drift in flow
        Update flow: $w = w_p + w_c$    // Eq. 3.7
        Advect: $u_t = u_{t-1} + w$    // Eq. 3.8
        Optimize: to improve 3D reconstruction    // Eq. 3.9
            $x_t = \arg\min_x \left( \|Px - u_t\|^2 + \lambda \|x - M_t(u_t)\|^2 \right)$
    end
    return $x_t$

3.3.2 Initialization

To bootstrap the flow computations, we need $u_0$ and $I_0$. We first estimate the initial body mesh $\bar{x}_0$. In many cases, it is the same as the reference mesh $X$; if not, a registration step is performed to align the two. Then we project $\bar{x}_0$ using $P$ to obtain $u_0$ in image coordinates. Since the vertex coordinates for each mesh triangle are known in both image coordinates and in the texture images $I$, we can synthesize $I_0$ by warping each triangle’s pixels from $I$ to $I_0$.

3.3.3 Flow Computation and Correction

We use a dense tracker proposed by Brox and Malik [20] to estimate the flow $w_p$ between $I_{t-1}$ and $I_t$. We advect the tracked point locations $u_{t-1}$ as: $u_t^* = u_{t-1} + w_p$. The error in optical flow produces drift in the location of the tracked points with time.
For longer sequences, the error quickly grows over frames. Therefore, to correct this drift we warp the texture images $I$ to the frame $I_t$ using barycentric interpolation based on the locations of the tracked points (i.e., the function $f$ in Figure 3.1) to produce $I_t^*$. We compute a new dense flow $w_c$ from $I_t$ to $I_t^*$. Then we correct the flow to obtain the final flow $w$ as follows:

$$w = w_p + w_c. \tag{3.7}$$

Using $w$ we get the final locations of the tracked points by advecting $u_{t-1}$:

$$u_t = u_{t-1} + w. \tag{3.8}$$

Our algorithm can accommodate both feature-based and dense optical flow techniques. We used Large Displacement Optical Flow (LDOF) in our examples. It is a computationally expensive dense optical flow technique, but produces very low reconstruction errors. On the other hand, we also experimented with the fast but less accurate Kanade-Lucas-Tomasi (KLT) feature tracker. The results are documented in Section 3.4.

3.3.4 Generating Body Map

As we discussed earlier, we use a body map $M_t$ to reconstruct 3D skin in the physical space. To generate this map at $t$, we project the body mesh $\bar{x}_t$ on the image $I_t$ using $P$ (cf. Eq. 3.3) to obtain its image coordinates. $M_t$ maps these image coordinates to the 3D mesh points $\bar{x}_t$, and it is an inverse of the projective transformation $P$. We use Matlab’s implementation of natural neighbor interpolation [91] to generate $M_t$ using these points. Using this mapping, we can estimate the 3D location corresponding to other query points $u$. Instead of using natural neighbor interpolation we can also use linear interpolation techniques, which are faster but result in higher interpolation errors.

3.3.5 3D Reconstruction

Now, using $M_t$ we could, in theory, reconstruct the 3D skin position as $x_t = M_t(u_t)$. However, this did not give the best results. Recall that the real skin mesh is not aligned with the body mesh. So the reconstructed skin point may not lie exactly on the body surface.
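The naive reconstruction through the body map can be sketched for a single triangle as follows. This Python example assumes the faster linear interpolation alternative mentioned in the discussion of the body map, not the natural neighbor interpolation used in our experiments; the helper names `barycentric` and `body_map` and the sample coordinates are hypothetical.

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates of the 2D point p in the triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    w0 = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    w1 = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return w0, w1, 1.0 - w0 - w1


def body_map(u, tri_2d, tri_3d):
    """Linearly interpolate the 3D body position at the image point u.

    tri_2d: image coordinates of the projected body vertices of one triangle
    tri_3d: the corresponding 3D body vertices
    """
    weights = barycentric(u, *tri_2d)
    return tuple(sum(w * v[k] for w, v in zip(weights, tri_3d)) for k in range(3))


# Hypothetical triangle: projected body vertices and their 3D positions.
tri_2d = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
tri_3d = [(0.0, 0.0, 5.0), (1.0, 0.0, 6.0), (0.0, 1.0, 7.0)]

# A tracked skin point in image coordinates, mapped onto the body surface.
x_naive = body_map((0.25, 0.25), tri_2d, tri_3d)
```

A full body map would first locate the triangle that contains each query point; that search is omitted here to keep the sketch short.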
To correct this, we also reproject reconstructed skin points onto the image and try to keep their locations in image coordinates close to those of the corresponding tracked skin points. This is implemented as an optimization that weights the two terms:

$$x_t = \arg\min_x \left( \|Px - u_t\|^2 + \lambda \|x - M_t(u_t)\|^2 \right). \tag{3.9}$$

The first term corresponds to minimizing the reprojection error, while the second term keeps the 3D reconstructed point close to the approximated body surface. We solve the optimization problem using a nonlinear Quasi-Newton solver in Matlab. For a given tracking system, we estimated $\lambda$ by cross validating across a few sets of examples. This value of $\lambda$ is subsequently used for other data obtained from the same tracking system, and we obtained similar high quality results. For both face tracking and hand tracking examples we estimated $\lambda = 0.1$.

Figure 3.2: We use a monocular capture setup to capture subjects.

3.4 Results

We tested our method on two examples: first, we track and reconstruct characteristic skin sliding around the eyes, and second, we reconstruct sliding motions of skin on the hand in synthetic hand gesture sequences. In the second example we validate our method using baseline data of skin sliding. The code is written in Matlab 2015b (MathWorks, Inc.) and C++ and run on a desktop with an Intel Core i7 processor and 64GB of RAM. See our supplementary video on our project webpage¹.

3.4.1 Face Tracking

The face tracking example shows how using a simple monocular camera setup we can recover detailed motions of skin around the eyes. Here we briefly describe the experimental setup. The setup is shown in Figure 3.3(a). We used a single Grasshopper3² camera, which can capture up to 120 fps with an image resolution of 1960×1200 pixels.
The actor sits on a chair and faces the camera with the head rested on a chin rest. The scene was lit by a DC powered LED light source³ to overcome the flickering due to aliasing effects of an AC light source on a high frame rate capture. We used polarizing filters with the cameras to reduce specularity. We calibrated the camera using Matlab’s Computer Vision System Toolbox. We used Faceshift [17] technology with a Kinect RGB-D camera to obtain the body mesh. This process takes 15 minutes or less per actor.

¹ http://www.cs.ubc.ca/research/seeingskininreducedcoordinates
² FLIR, Richmond, BC, Canada
³ https://www.superbrightleds.com

Figure 3.3: Using a body (a) and skin tracked in an image sequence (b: top), along with the input image, we can reconstruct 3D skin (b: bottom). The whole sequence assumes a fixed body. See the video for the complete sequence.

For faces, a single chart and texture image is sufficient, and matches the common practice. In our algorithm we use dense flow to track skin features in the image, but tracking the eyelid margins is challenging because of occlusions that occur due to eyelashes and eyelid folds. Therefore, we tracked eyelid margins separately using an artificial neural network (ANN). We use a feed forward network (using Matlab fitnet) with 5 neurons in the hidden layer. To generate the features, we crop the eyelid region from the input RGB images and reduce it to 110 dimensions using PCA. The output of the model is the locations of 20 control points (manually annotated) that represent the shapes of the eyelid margins. For training we used 98 frames from a video of 2335 frames. Eyelid margins are manually annotated. We cross-validated the model output with the manually annotated data set, and obtained an error (RMSE) of 1.2 pixels per eyelid marker on average per image frame.

The results of skin reconstruction around the eyes for a sequence where the subject looks around are shown in Figure 3.3(c).
In this example, the eyelid margin produces complex shapes and eyelid skin slides over the skull and globe surface. This makes it a perfect example to demonstrate the applicability of our reduced coordinate representation of skin. Our result shows that recovering the characteristic deformation of eyelids can greatly enhance expressiveness, as humans are very sensitive to even subtle skin motions in the eye region. We approximate the skull and globe as one rigid structure on which eyelid skin slides. Therefore, we combined the face reconstructed by Faceshift with a globe model to generate a body mesh that approximates the surface on which skin slides. See Figure 3.3(b). We also show skin reconstruction of a blink sequence in Figure 3.4. In this example our reconstruction can also recover the medial motion of lower eyelid skin, which is normally observed in human blinks.

Figure 3.4: Skin motion reconstruction in a blink sequence. Top row shows tracking points (red mesh) on input images. Bottom row shows 3D skin reconstructions.

Figure 3.5: Hand Tracking: Reconstruction of 3D skin for three frames in a hand movement sequence. The sliding motion of skin is produced by the flexion of the small finger. In the top row, the input body and image sequence is shown. In the bottom row, we show the 3D reconstruction of skin along with a zoomed in version to illustrate skin sliding. The red arrow in the last frame shows an approximate direction of skin sliding. See supplementary video to visualize the motion.

3.4.2 Hand Tracking

To validate our algorithm with baseline data for which the true skin movement was known, we used a hand tracking example with two artist generated animations of body motion. In the first animation the little finger is flexed, and in the second animation all the fingers are flexed to produce a hand grasping gesture. To generate baseline data we simulated skin sliding on the body using the skin simulation software Vital Skin⁴.

Figure 3.6: Skin Reconstruction in a hand grasping gesture.
The skin was then rendered using Autodesk’s Maya software to generate an image sequence (1k × 1k resolution). The original animation and rendered image sequences are used as input to our algorithm.

⁴ http://www.vitalmechanics.com/

Figure 3.7: Skin tracking error in image coordinates for hand tracking experiment. In (a) the errors are measured as ∞-norm between the tracked points and corresponding baseline skin motion data, in pixels. The body mesh is expected to produce high error (red) as it does not include skin sliding. In (b) we show that our algorithm produces low RMSE with drift correction (blue: with drift correction, black: without drift correction).

Figure 3.8: Reconstruction error (in cm) of body sequence (top) and reconstructed skin (middle) from baseline data is shown in heatmaps on a rest pose. In the bottom row, the errors (in ∞-norm) are plotted for all 25 frames in the hand tracking experiment.

We tracked and reconstructed 3D skin using our algorithm and compared against the baseline skin movement to evaluate reconstruction error. The results of tracking are shown in Figure 3.7. As expected, the projection of the body meshes shows large error from the baseline skin motion data as skin sliding motions are missing, whereas the error of our motion tracking remains low. In Figure 3.7(b), the root mean squared error (RMSE) is plotted against the frame number, and the result of our tracking without motion refinement (shown in black) is also included. Without motion refinement the error gradually increases due to drift in optical flow, but the refinement step reduces the drift (in blue), giving a 52.57% reduction in error in the last frame of the hand tracking sequence. Note that the accuracy of tracking would vary depending on the tracking algorithm used.
For example, in the last frame of the finger flexing example, the dense LDOF algorithm performs better, with an RMSE tracking error of 0.44 ± 0.12 pixels, than the feature-based KLT algorithm, with an RMSE of 1.01 ± 0.29 pixels.

Figure 3.9: Comparison of reconstructed 3D skin (blue) and baseline data (red) in hand tracking experiment.

As discussed in Section 3.3, we reconstruct 3D skin meshes sliding on the hand in physical space. The reconstructed skin meshes are very similar to the baseline skin motion data as we can see quantitatively in Figure 3.8. The heatmaps depict the errors of each mesh vertex measured as a distance (in cm) from the baseline mesh for three frames (with interpolation), while the plot shows the ∞-norm of the overall mesh vertices for all the frames. A qualitative comparison is shown in Figure 3.9. In our unoptimized Matlab implementation, on average, it takes 185s to compute the flow (with motion refinement) between two images (of size 1k square), 0.0078s to generate the body map, and 7.54s to correct the 3D skin reconstruction. We list the reconstruction errors in mm in Table 3.1. The table includes results of both the finger flexing and hand grasping sequences. See Figure 3.6. In this case the hand grasping example has slightly higher error as the motion is more extreme.

Table 3.1: Skin Reconstruction Error in Hand Tracking Experiment

Type            Error (mm)   Frame 5        Frame 15       Frame 25
Finger Flexing  ∞-norm       1.02           1.56           1.41
                RMSE         0.3 ± 0.16     0.33 ± 0.19    0.41 ± 0.2
Hand Grasping   ∞-norm       1.29           1.54           2.63
                RMSE         0.51 ± 0.25    0.79 ± 0.33    0.81 ± 0.42

Finally in Figure 3.5, we show our reconstructed skin with texturing. The motion of the tattoo in the bottom row of Figure 3.5 emphasizes the sliding of skin. Here skin can be thought of as the texture sliding on the body surface. To see the complete reconstruction sequence, please refer to our supplementary video.

3.4.3 Limitations

Our system has a few limitations.
It requires the tracked skin features to be visible in the entire image sequence to make sure that the skin points are not lost during tracking due to occlusions. Another situation where reconstruction error is high is when tracked points reach the border of the visible skin region. There are two reasons for this: first, at the border, body maps could be generated by extrapolation which is often unreliable due to noise, and second, when the surface normal is not well aligned with the camera axis, 3D reconstruction is very sensitive to small errors in tracking. Another limitation is that our reduced representation of skin cannot model out-of-plane deformations of skin, such as wrinkles. But fortunately, in our pipeline, these effects can be easily added by using normal or displacement maps.

3.5 Conclusion

We proposed an efficient skin tracking and reconstruction algorithm that uses a reduced coordinate representation of skin. Using this representation, our method efficiently recovers 3D skin sliding motion by tracking 2D skin features in an image sequence. Most of the current face and gesture tracking and reconstruction methods ignore skin sliding, but our examples show that by recovering skin sliding we can add realism to an existing animation. Although we only showed applications of our method in facial expression and hand gesture examples, our method is very general and can be easily applied to other deforming objects with sliding surfaces.

Chapter 4

Simulation of Facial Tissues of the Eye Region

4.1 Background

In the previous chapter, we discussed a video-based skin tracking algorithm that can be used to measure subtle skin movements in the eye region. In this chapter, we will investigate how physics-based simulation of facial tissues of the eye region can utilize these measurements.
First, we will describe the constitutive material models that we use for facial tissues, and second, we will discuss a layered physics-based simulation framework that uses these models to generate facial tissue movements. It is desired that the simulated facial tissue configurations are consistent with the measurements obtained using our facial skin tracking method.

The work by Lee et al. [58] is one of the early attempts to create a physics-based model of layered facial tissues, and to use a muscle actuation model in skin dynamics to generate facial animation. Their algorithm can automatically insert facial muscles in head models at anatomically correct positions, without going through the painful process of manual registration. Ng-Thow-Hing et al. [71] proposed a data fitting pipeline to model muscles as B-Spline solids by extracting contours from medical images. In [70, 89], a physics-based character animation and simulation system called Dynamic Animation and Control Environment (DANCE) is proposed that can be used to test biomechanical models of muscles. This system focuses on developing physics-based controllers for articulated figures. A general architecture for the skeletal control of interdependent articulations performing multiple concurrent reaching tasks is proposed in [32]. They developed a procedural tool for animators that captures the complexity of these tasks. In [101], a more sophisticated muscle model is used, and facial movements are simulated using a quasi-incompressible, transversely isotropic, and hyperelastic skeletal muscle model using the finite volume method. The accuracy of the simulated facial expressions depends on how well we can estimate the muscle activations needed to generate target muscle movements. Using a quasistatic formulation, Sifakis et al. [93] automatically computed muscle activations by tracking sparse motion capture markers. They simulated anatomically accurate facial musculature using the finite element method.
A fast and automatic morphing technique to generate subject-specific simulatable flesh and muscle models is presented in [25], which can be used to generate a wide range of expressions. In [48], a dynamic model of jaw and hyoid is proposed to model the dynamics of chewing using Hill-type actuator-based muscle models. In the recent work by Sánchez et al. [88], masseter muscle models are acquired by a dissection and digitization procedure on cadaveric specimens to understand their structural details and their contribution to the force transmission necessary for mastication. Using deformable registration these templates can also be used to generate subject specific masseter models.

Muscles attach to the bones by tendons and ligaments, and modeling of these connective tissues is not well explored. A notable work is [96], where tendons in the forearm and hand are modeled using an elastic primitive known as “strands” and are simulated using a biomechanical simulator which produces plausible hand motions. Sachdeva et al. [86] introduced an Eulerian-on-Lagrangian discretization of tendon strands with a new selective quasistatic formulation, eliminating unnecessary degrees of freedom in hand and biomechanical system simulations. They also introduced control methods for these systems using a general learning-based approach and using data extracted from a simulator.

In this work, we attempted to estimate material parameters of the facial tissues by minimizing the error between our simulation output and the positions of the tracked facial features, but we achieved only limited success. Later in this chapter we discuss the reasons behind it. This steered our research toward a data-driven approach for modelling facial skin movements, which we will discuss in detail in the next chapter.

4.2 Constitutive Models of Facial Tissues

4.2.1 Muscle Model

We model muscle as a hyperelastic and anisotropic material.
The stress at a material point of muscle depends on both the direction of the muscle fibers and the deformation gradient computed from material strain. We assume that the muscle behavior is insensitive to strain rate and that the material has reached a “preconditioned” state after repeated loadings, resulting in a minimal amount of hysteresis. For more details please refer to [39].

We represent the constitutive model of muscle using the strain-energy function $W$, as follows:

$$W = \begin{cases} W_{hyp}(E, e, \nu, a, \alpha) & \text{if } \varepsilon > \varepsilon_0, \\ 0 & \text{otherwise}, \end{cases} \tag{4.1}$$

where $W_{hyp}$ is a scalar energy function representing a hyperelastic material model (e.g. StVK material), $E$ is the Green strain in muscle, $e$ is Young’s modulus, $\nu$ is Poisson’s coefficient, $\alpha$ is the activation level in the muscle, and $a$ is the muscle fiber direction. In our model, as shown in Figure 4.1, we shift the passive muscle stress-strain profile along the muscle fiber direction by an amount proportional to the activation level $\alpha$. Other sophisticated muscle models, such as the Hill-type muscle model, could also be used, but we should note that the traditional Hill-type muscle model entirely lacks history dependence and fails to account for force enhancement [73].

There are two major types of muscles found in the face: linear and sphincter. This classification is based on the direction of muscle fibers. Linear muscles, as the name suggests, have muscle fibers along a particular linear direction, $a$, while in sphincter muscles the fibers are arranged in a circular fashion. When activated, linear muscles produce force along the fiber directions, while sphincter muscle fibers contract tangentially, producing a net force along the radial direction and closing the orifice in the muscle. Since both types of muscles are sheet-like, our 2D Eulerian framework, which we will describe later, is suitable for their simulation.
The simulations of both kinds of muscles are shown in the results section of this chapter.

The palpebral part of the orbicularis oculi muscle that controls the closing of the palpebral fissure is modeled such that when activated it produces a net downward force to close the eyelids, and it can also be controlled by an external force applied by the levator aponeurosis. We model the eyelid with a stiffer material to represent the tarsal plate. We consider the preseptal and orbital parts of the orbicularis as parts of a single sphincter and model it in a piecewise linear fashion as in [58].

Figure 4.1: Stress-strain behavior in our muscle model.

Figure 4.2: Stress-strain behavior in our skin model.

4.2.2 Skin Model

We propose a constitutive model of skin that shows “biphasic” skin behavior, a characteristic property in animal skin, and produces buckles during compression. We use the Fung elastic model for isotropic preconditioned skin [39]. In this model skin behaves linearly for small strains but becomes very stiff for large strains. To model the compressive elastic force in skin, we followed a model similar to the post buckling response of materials during compression as in [23], which also considers bending forces.

Starting with this empirical model, we can represent the strain-energy of skin, $W_s$, in the following form:

$$W_s = f(\alpha, E) + c \exp(\alpha, E), \tag{4.2}$$

where $W_s$ is the sum of a piecewise linear function $f$ and an exponential function, $\alpha$ has different values for different strain regions as shown in Figure 4.2, and $c = 0$ during compressive strain.

From Figure 4.2, we can observe that at different levels of stretch, skin material can unfold characteristic fine wrinkles or buckles. When we compress skin, it first shows fine scale wrinkles and, as stress crosses a critical force, skin starts buckling. Cerda and Mahadevan [22] modeled the elastic response of a thin film resting on the top of an elastic foundation of any geometry.
In this model, the wrinkle periodicity ($\lambda$) depends on the thin film and its foundation, while the wrinkle amplitude ($A$) depends on compressive strain. Our model considers the buckle lines as properties of the skin material, and also models the buckle amplitude based on the compressive strain in the elastic foundation, which is muscle in our case.

Anisotropy of muscle and skin. The directional behavior of muscle and skin material can be specified by an animator during modeling. When tissues are modeled as a triangulated mesh, we can compute parametric weights for a material point inside a triangle as a translation-independent weighted sum of the three vertices of the triangle as in [106], where the user specifies fiber directions at the vertices of the triangle.

Figure 4.3: Illustration of tissue layers in face. Skull, muscle, and skin layers are shown separated for illustration purpose only and should not be confused with any volumetric simulation.

4.2.3 Subcutaneous Tissue

We have not included incompressibility of muscle in the model, unlike in the constitutive volumetric model used in [101]. It is not very straightforward to include volume conservation in a 2D simplex. To deal with that, we introduce a layer of material between muscle and skin whose volume is dependent on the deformation of muscle. Additionally, this layer can qualitatively capture the behavior of a fat layer. In Figure 4.4, the fat compartments in the face are shown. We follow an approach similar to [10], where the height of the skin was predicted based on deformation of the underlying mesh assuming local volume conservation in the fat compartments in the upper head region. This assumption is valid if there is no muscle bulging. But even this can be handled as in [10], by applying varying weights to the skin height update equations.
The updated skin thickness in the $i$th fat compartment region, $h_i$, can be computed as:

$$h_i = w_i \cdot \frac{\hat{A}_i}{A_i} \cdot \hat{h}_i, \tag{4.3}$$

where $\hat{h}_i$ is the rest state skin thickness, $\hat{A}_i$ and $A_i$ are the areas of the mesh corresponding to the $i$th fat compartment in the rest and deformed configurations respectively, and $w_i$ is the scaling factor for the $i$th muscle compartment to deal with muscle bulging.

4.3 Overview of Simulation Framework

The pipeline of our simulation scheme in layered spaces is shown in Figure 4.6. We use both Eulerian and Lagrangian discretization and we will only describe them briefly here; for more details please refer to [76]. First, we start with a high resolution mesh of the skull, on which we will simulate muscle, skin, and subcutaneous facial tissues. In the pre-processing stage we perform mesh coarsening to obtain a decimated mesh. We also model the muscles and their material properties in this stage. The skin and muscle properties can be estimated from experiments, such as in [39], painted by an animation artist, or estimated from captured video data. After this stage, we perform a Lagrangian simulation of muscles, followed by an Eulerian simulation of skin. In our first simulation, which we call $\xi_1$, we simulate muscle movements given an activation level, and then in our second simulation, $\xi_2$, we compute the elastic force on skin from the muscle deformation. Finally, we compute the Lagrangian high resolution skin mesh from the coarse mesh using a mapping computed in $\xi_2$.

Figure 4.4: Fat compartments in face [79].

Figure 4.5: Layered spaces (Jacobians are shown in green).

4.3.1 Kinematics

Our simulation framework consists of a Lagrangian simulation of muscle followed by an Eulerian simulation of skin. The piecewise linear function $\Phi$ represents muscle motion, i.e., deformation of muscle from its material space $X$. We use $\Psi$ to represent the skin motion relative to muscle.
Instead of using skin as a texture [59], we compute Lagrangian skin positions from the Eulerian simulation output in $\xi_2$. As shown in Figure 4.6, this allows us to use any high resolution texture map of a dense mesh, and the mapping $\eta$ allows us to compute a high resolution Lagrangian skin atlas. Finally, the piecewise linear function $\pi_0$ maps the high resolution atlas to a Lagrangian dense skin mesh as our final output.

Figure 4.6: Overall scheme.

4.4 Dynamics

4.4.1 Deformation Gradients

We use piecewise linear mappings and construct constant Jacobian matrices for each triangle in the simulation grid, as in [59, 101]. We compute the reduced deformation gradient of muscle $F$ and the reduced deformation gradient of skin $F_s$ using the following equations:

$$F = D_x R_X^{-1}, \qquad F_s = D_{x_s} R_{X_s}^{-1}, \tag{4.4}$$

where $D_x$ for each face is constructed by assembling the edge vectors $d_i = x_i - x_0$, $i = 1, 2$, as columns, with vertex positions $x_i$, $i = 0, 1, 2$, in $x$. $R_X^{-1}$ and $R_{X_s}^{-1}$ are obtained from the thin QR decompositions of $D_X$ and $D_{X_s}$ respectively. To understand the necessity of computing the reduced deformation gradient, please refer to [59].

Now, using the reduced deformation gradients we can compute the reduced Green strains $E$ and $E_s$ of muscle and skin respectively as follows:

$$E = \frac{1}{2}\left(a^T F^T F a - I\right), \qquad E_s = \frac{1}{2}\left(F_s^T F_s - I\right), \tag{4.5}$$

where $a$ is the direction of muscle fibers.

4.4.2 Compute Elastic Force

We first formulate the constitutive equations $W$ and $W_s$, as in Section 4.2, and add a coupling energy term, $W_{ms}$, to get a total energy term, $W_T$, as:

$$W_T = W + W_s + W_{ms}. \tag{4.6}$$

We then compute the contribution of triangle $j$ to the total force at vertex $i$ as follows:

$$f_{ij} = \left(\frac{\partial x}{\partial u}\right)^T \left(-\frac{\partial W_{Tj}}{\partial x}\right), \qquad f^s_{ij} = \left(\frac{\partial X}{\partial U}\right)^T \left(-\frac{\partial W_{Tj}}{\partial X}\right), \tag{4.7}$$

where $f_{ij}$ is the force on a muscle vertex due to muscle activation or passive forces transformed to the skull atlas, and $f^s_{ij}$ is the force on a skin vertex.
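To make Eqs. 4.4 and 4.5 concrete, the following Python sketch computes the reduced deformation gradient and the isotropic Green strain $\frac{1}{2}(F^T F - I)$ for a single triangle. It is only a sketch under stated assumptions: the thin QR factor is obtained here with Gram-Schmidt orthogonalization, the helper names (`edge_columns`, `thin_qr_r`, `reduced_deformation_gradient`, `green_strain`) are hypothetical, and the triangle coordinates are made-up test values.

```python
import math


def sub(p, q):
    return [pi - qi for pi, qi in zip(p, q)]


def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))


def edge_columns(v0, v1, v2):
    """Columns d1 = v1 - v0 and d2 = v2 - v0 of the 3x2 edge matrix D."""
    return sub(v1, v0), sub(v2, v0)


def thin_qr_r(d1, d2):
    """Entries (r11, r12, r22) of the 2x2 factor R in the thin QR of [d1 d2]."""
    r11 = math.sqrt(dot(d1, d1))
    q1 = [c / r11 for c in d1]
    r12 = dot(q1, d2)
    rest = [c - r12 * q for c, q in zip(d2, q1)]
    r22 = math.sqrt(dot(rest, rest))
    return r11, r12, r22


def reduced_deformation_gradient(rest_tri, deformed_tri):
    """Columns of the 3x2 matrix F = D_x R_X^{-1} (Eq. 4.4)."""
    D1, D2 = edge_columns(*rest_tri)
    d1, d2 = edge_columns(*deformed_tri)
    r11, r12, r22 = thin_qr_r(D1, D2)
    # R^{-1} = [[1/r11, -r12/(r11*r22)], [0, 1/r22]]
    f1 = [c / r11 for c in d1]
    f2 = [(-r12 / (r11 * r22)) * a + b / r22 for a, b in zip(d1, d2)]
    return f1, f2


def green_strain(f1, f2):
    """2x2 Green strain 0.5 * (F^T F - I), the isotropic form in Eq. 4.5."""
    return [
        [0.5 * (dot(f1, f1) - 1.0), 0.5 * dot(f1, f2)],
        [0.5 * dot(f2, f1), 0.5 * (dot(f2, f2) - 1.0)],
    ]


rest = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, 1.0, 0.0)]
# Rigid rotation of the triangle out of its plane: no strain is expected.
rotated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, 0.0, 1.0)]
# Uniform stretch by a factor of two.
stretched = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 2.0, 0.0)]

E_rotated = green_strain(*reduced_deformation_gradient(rest, rotated))
E_stretched = green_strain(*reduced_deformation_gradient(rest, stretched))
```

Because $F$ is built from edge vectors and the triangular factor of the rest shape only, a rigid motion of the triangle in 3D leaves the $2 \times 2$ strain at zero, while a uniform stretch by a factor $s$ gives $\frac{1}{2}(s^2 - 1)$ on the diagonal.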
The final force on a vertex is computed by summing the contributions from its incident faces.

4.4.3 Integration

At each time step we solve for the constrained velocities of each layer using a widely used linear implicit method [6], forming a linear constrained dynamical Karush-Kuhn-Tucker (KKT) system [18] given as:

\[ \begin{pmatrix} M^{*} & G^{T} \\ G & 0 \end{pmatrix} \begin{pmatrix} \tilde{v}^{(k+1)} \\ \lambda \end{pmatrix} = \begin{pmatrix} f^{*} \\ g \end{pmatrix}, \tag{4.8} \]

where $\lambda$ is the vector of Lagrange multipliers, $M^{*} = M + h^{2}\,\partial f/\partial u$ is the inertia matrix, $f^{*} = M v^{(k)} + h f^{(k)}$ is the elastic force, and $G$ is the constraint matrix that represents the constraints between tissue layers. Additionally, in the Eulerian simulation $\xi_2$, we advect the velocity field of skin using the semi-Lagrangian method used in [36, 95].

In $\xi_1$ and $\xi_2$, the kinetic energies of muscle and skin in the physical space, $T_m$ and $T_s$, can be computed as:

\[ T_m = \frac{1}{2} \int_{A_u} \rho\, v^{T} \Gamma_1^{T} \Gamma_1 v \, dA_u, \qquad T_s = \frac{1}{2} \int_{A_U} \rho\, v^{T} \Gamma_2^{T} \Gamma_2 v \, dA_U. \tag{4.9} \]

This gives us the inertia contribution of triangle $j$, which can be computed as $M_j = m_j \Gamma_{1j}^{T} \Gamma_{1j}$ for muscle, and similarly $M_j = m_j \Gamma_{2j}^{T} \Gamma_{2j}$ for skin.

4.4.4 Tissue Attachments

Facial tissues have many attachments to other biological tissues in the region, and these attachments facilitate complex facial expressions. Many of them can be modeled as constraints among different tissue layers. While muscles originate from or insert into the skull (e.g., orbicularis oculi inserts into the frontal bone of the skull), skin to muscle attachments (e.g., the attachment of the frontalis muscle to skin in the eyebrow region) are also common in several regions of the upper head [68].
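The block structure of the KKT system in Eq. 4.8 can be illustrated with a small dense solve. The sketch below assembles the matrix from $M^{*}$ and $G$ and solves for the velocities and Lagrange multipliers. The helper names are hypothetical, and dense Gaussian elimination is used only to keep the example self-contained; it is not a claim about the solver used in this work.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def solve_kkt(M_star, G, f_star, g):
    """Solve [[M*, G^T], [G, 0]] [v; lam] = [f*; g] and return (v, lam)."""
    n, m = len(M_star), len(G)
    K = [[0.0] * (n + m) for _ in range(n + m)]
    for i in range(n):
        for j in range(n):
            K[i][j] = M_star[i][j]
    for r in range(m):
        for j in range(n):
            K[n + r][j] = G[r][j]   # constraint block G
            K[j][n + r] = G[r][j]   # transposed block G^T
    sol = solve_linear(K, list(f_star) + list(g))
    return sol[:n], sol[n:]

# Toy example: two scalar velocities tied together by v0 - v1 = 0.
v, lam = solve_kkt([[2.0, 0.0], [0.0, 2.0]], [[1.0, -1.0]], [2.0, 0.0], [0.0])
```

In this toy case the unconstrained velocities would be 1 and 0; the constraint row forces both to 0.5, and the multiplier reports the constraint force needed to hold the two degrees of freedom together.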
To incorporate the fixed attachments among facial tissue layers, we implement velocity-level hard constraints in our simulation settings and solve the KKT systems of the constrained dynamical systems as in [59], in $\xi_1$ and $\xi_2$ respectively.

4.5 Control

We now discuss our motivation for using a quasistatic formulation of our simulation framework to estimate the muscle activations that generate facial expressions.

4.5.1 Quasistatic Formulation

Given a set of muscle activations, a quasistatic framework can generate a simulated pose without the need for a full dynamic simulation. We use two approaches to compute muscle activations to simulate facial expressions. First, we use the Facial Action Coding System (FACS), which codes anatomically possible facial expressions using Action Units (AUs), each of which defines the contraction or relaxation of one or more facial muscles [31]. In the upper head, the major facial expressions include eyebrow raising and lowering, winking, blinking, and frowning. From the FACS database, we collect the muscular basis of those actions and then generate facial expressions by controlling these muscles. Unfortunately, this database does not include the complex facial motions that may involve a variety of combinations of facial muscles. Second, we can use a video-based approach in which the muscle activations are estimated by matching a target pose from video (defined in the next section) to the output of the quasistatic simulation. In this chapter we will mainly focus on the second approach, and we start with a description of the quasistatic formulation.

To construct a mapping from the activation space to the tissue configuration space, we can proceed in two ways: first, for a given muscle activation, we can allow our dynamic simulation to run until the tissue configuration attains a steady state, or second, we can adopt a quasistatic formulation to estimate the configuration of muscles and skin.
The quasistatic assumption is valid because at any point of time the external forces on tissues (or forces due to muscle activation) are balanced by the internal material resistance. This is faster than the first approach, provided we have a stable numerical scheme. Here we note that the quasistatic formulation will only be used for the purpose of constructing the above mapping, and not for any dynamic simulation.

The constitutive model of muscle described in Section 4.2 suggests that when a muscle is activated with a constant activation, it will finally reach a steady state configuration in which the stress due to muscle activation is balanced by the external elastic forces of the tendons or other muscles attached to it [92]. Therefore, the steady state configuration $x_{eq}$ at muscle activation $\alpha$ can be implicitly defined by the equilibrium condition $f(x_{eq}, \alpha) = 0$. In practice, we solve for the equilibrium configuration $u_{eq}$ in the physical atlas $u$, and then perform a barycentric mapping to compute the physical mesh configuration $x_{eq}$. For a given muscle activation, we compute the forces $f_u$ and their Jacobian $\partial f_u/\partial u$ symbolically using Maple (https://www.maplesoft.com/). Using a similar approach, we can compute the steady state skin configuration in the muscle atlas and then transform it to $u_{eq}$ in the physical atlas. We solve the system with a Newton-Raphson iterative solver. At each step we compute the updated atlas configuration using the Taylor approximation $f_u(u_k + \delta u) \approx f_u(u_k) + \left.\partial f_u/\partial u\right|_{u_k} \delta u$:

\[ \text{muscle:} \quad u^{(i+1)} = u^{(i)} - \left(\frac{\partial f_u}{\partial u}\right)^{-1} f_u \tag{4.10} \]

\[ \text{skin:} \quad u^{(i+1)} = u^{(i)} - \left(\frac{\partial f_u}{\partial u}\right)^{-1} f_u, \tag{4.11} \]

where Eq. 4.11 is evaluated with the configuration and forces of the skin layer. To stabilize the scheme, we add an identity matrix scaled by a small scalar $\lambda$ to the Jacobian matrix, as in the Levenberg–Marquardt algorithm, to prevent the Jacobian from becoming near-singular.

4.5.2 Pulse Step Controller

Burst-tonic innervation has been found in many of the eye and facial muscles [34, 82]. Evinger et al.
[34] recorded eyelid movement data in normal human subjects by monitoring lid positions using magnetic search coils. Their data show that the amplitude-maximum velocity relationship can uniquely and reliably characterize basic eyelid movements such as lid saccades and blinks. This kind of pulse-step controller is commonly used to characterize eye movements [82]. We use an equilibrium controller, such as a pulse-step controller, to control the movement of facial tissues.

Given the steady state activation $\alpha^{*}$ of a target key pose $P^{*}$, we can reach pose $P^{*}$ by using an activation profile as follows:

\[ \alpha = \alpha^{*} + (\alpha_{max} - \alpha^{*})\, u(t), \tag{4.12} \]

where $\alpha_{max}$ is the maximum feasible control signal, and the function $u(t)$ selects between the step and the pulse-step controller:

\[ \text{step:} \quad u(t) = 0, \quad t \geq 0 \tag{4.13} \]

\[ \text{pulse-step:} \quad u(t) = \begin{cases} 1, & 0 \leq t < t^{*} \\ 0, & t \geq t^{*}. \end{cases} \tag{4.14} \]

In the case of the pulse-step controller, we first apply a pulse control until the target skin configuration is reached at $t^{*}$, and then we apply a step control. The activation profiles are shown in Figure 4.7.

Figure 4.7: Pulse-step activation profile.

4.6 Results

We implemented our system in Matlab and C++, and the simulations were performed on a 2.66 GHz Intel Core i5 computer with 8 GB of RAM with timesteps of 1 ms. Rendering is done using Maya Mental Ray, using a Sub Surface Scattering (SSS) material.

We modeled the frontalis muscle and the orbicularis oculi muscle as linear and sphincter types respectively. The frontalis muscle raises the eyebrows while the orbicularis oculi muscle helps to close the eyes. We can produce muscle movements by activating these muscles. The results are shown in Figure 4.8.
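The step and pulse-step activation profiles of Eqs. 4.12 to 4.14 can be written as one small function. This is a hedged sketch: the function name, the argument names, and the convention that `t_star=None` selects the step controller are assumptions made for illustration.

```python
def activation(t, alpha_target, alpha_max, t_star=None):
    """Activation at time t following Eq. 4.12.

    alpha_target is the steady state activation of the target pose and
    alpha_max the maximum feasible control signal. With t_star=None the
    step controller of Eq. 4.13 is used; otherwise the pulse-step
    controller of Eq. 4.14 holds alpha_max until t_star.
    """
    if t < 0:
        raise ValueError("t must be non-negative")
    u = 1.0 if (t_star is not None and t < t_star) else 0.0
    return alpha_target + (alpha_max - alpha_target) * u

# Sample the two profiles on a 1 ms grid for 100 ms (hypothetical values).
times = [k * 0.001 for k in range(100)]
step_profile = [activation(t, 0.25, 1.0) for t in times]
pulse_profile = [activation(t, 0.25, 1.0, t_star=0.05) for t in times]
```

The step profile stays at the target activation throughout, whereas the pulse-step profile saturates at the maximum activation until `t_star` and then drops to the same target value.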
The stress-strain relationship of the muscles, activated at different activation levels, is shown in Figure 4.9.

Figure 4.8: Simulation of facial muscle units for different muscle activation levels.

The forward model of a muscle fiber, i.e., the model that predicts the steady state muscle configuration for a given muscle activation, is learned by calculating muscle configurations at different activation levels under steady state conditions (Figure 4.10(a)). The response of a muscle fiber to step and pulse-step control inputs is shown in Figure 4.10(b). From the results we see that step controllers take a long time to reach the target configuration, while pulse-step controllers can produce rapid movements.

Figure 4.9: Stress strain relation of simulated muscle.

Figure 4.10: (a) Learning forward model of a muscle fiber and (b) output of step and pulse-step controllers.

For the simulation of the face, the low resolution simulation mesh has 524 vertices (965 faces) and the high resolution mesh has 7517 vertices (14905 faces).

In the eyebrow raising example, the frontalis muscle is activated (Figure 4.12). To produce a frowning expression, the corrugator muscle is activated (Figure 4.13). The movements of skin in a synthetic example are shown in Figure 4.11.

Figure 4.11: Synthetic example of skin simulation: (a) undeformed and (b) deformed under muscle action.

Figure 4.12: Eyebrow raising: (a) neutral and (b) frontalis muscle activated.

Figure 4.13: Frowning: (a) neutral and (b) corrugator muscle activated.

4.7 Muscle Activation to Pose Space Mapping

Our goal now is to compute muscle activations that can simulate a target facial expression, such as frowning or brow raising. Let us consider the case where we are simulating a frowning expression. We propose to learn the muscle activations that produce this expression from a video sequence in which a subject starts with a neutral facial expression and then frowns.
We obtain a 3D face mesh corresponding to the neutral facial expression of the subject (obtained using face registration techniques, such as Faceshift) and project it on the image plane using the camera projection matrix estimated during calibration of the camera. We track the projected mesh in the video sequence, as described in the previous chapter, to compute the final configuration in the frowning state. We call this the target pose; it consists of a list of 2D vertex coordinates in the image plane with a predefined triangulation. We call the image plane the pose space. Our simulations generate skin configurations in the 2D skin atlas (see Figure 4.5). We can reconstruct the 3D configuration of the simulated skin from the skin atlas (see Chapter 3 for details) and project the 3D coordinates of the skin vertices on the image plane, using the same camera projection matrix as before, to obtain the simulated pose.

We then use the target pose in an optimization scheme, described next, to estimate the muscle activations that simulate a frowning expression using our simulation pipeline. This optimization minimizes the error (in the L2 sense) between the target pose and the simulated poses generated by our quasistatic simulation for different sets of activations.

Using our quasistatic formulation, we can establish a mapping $\mathcal{M} : \mathcal{A} \rightarrow \mathcal{P}$ from the muscle activation space $\mathcal{A}$ to the pose space $\mathcal{P}$. We denote a simulated pose by $P$ and an activation by $\alpha$, where $\alpha$ is obtained by arranging the activations of $M$ muscles in a column, $\alpha = [\alpha_1, \ldots, \alpha_M]^T$.

Consider a set of sample poses $\{P^{(1)}, P^{(2)}, \ldots, P^{(N)}\} \subset \mathcal{P}$ obtained from our quasistatic formulation at a given set of muscle activations $\{\alpha^{(1)}, \alpha^{(2)}, \ldots, \alpha^{(N)}\} \subset \mathcal{A}$. The set of sample poses also includes the neutral pose. These sample poses are a subset of the poses from the Facial Action Coding System (FACS) or any other user defined expressive poses. Our video based face tracking framework allows us to obtain a pose $P_t$ of the subject in the pose space at any time $t$.
Let us consider $P_t$ to be the target pose; we want to estimate the activations that will produce this pose using our simulation.

We use Downhill Simplex optimization [78] to minimize an objective function $f(w) = \|P_t - P\|_2$, where $P$ is a simulated pose. We represent $P$ as a weighted sum of the sample poses, $P = \sum_{i=1}^{N} w_i P^{(i)}$, and we optimize to find the optimal weights $w^{*}$:

\[ w^{*} = \arg\min_{w} f(w). \tag{4.15} \]

The weights are constrained to lie in $[0, 1]$. The optimal activation $\alpha^{*}$ can now be computed as $\alpha^{*} = \sum_{i=1}^{N} w_i^{*} \alpha^{(i)}$.

4.7.1 Optimization Results

To evaluate our optimization scheme we selected three target poses corresponding to frowning, eyebrow raising, and squinting. We registered a 3D face template to the depth data of a subject obtained using a commodity RGB-D camera and projected the face mesh on the video to obtain the target poses. We tried two different optimization schemes: fmincon (minimum of a constrained nonlinear multivariable function) and fminsearch (minimum of a multivariable function using a derivative-free method).

The results show that the optimizations fail to converge and find activation parameters; see Figures 4.14 and 4.15. In the frowning example shown in Figure 4.14, the activation of the corrugator muscle is correctly estimated as the highest among all the muscles, but the error did not decrease considerably. In the eyebrow raising example shown in Figure 4.15, the procerus muscle is estimated as highly activated, even though this muscle contributes very little to raising the eyebrows. Additionally, the error did not reduce even after a large number of iterations. The simulation framework could not find a set of muscle activations that generates a muscle configuration close to the target for the following reasons:

1. The proposed tissue models may not faithfully model the extremely non-linear skin motions, such as those in the wrinkled regions. Including wrinkles in the constitutive model itself is an open problem.
2. 
Any mismatch between the 3D muscle model and the real muscle shape, even a small one, produces large errors when the 3D model is projected on video for optimization. It is extremely difficult to obtain accurate muscle registration, as muscles are not visible to RGB-D cameras, and using medical-grade scanning to get accurate muscle shapes for each subject is impracticable.

Furthermore, in its current state, the proposed two-layered simulation is itself computationally expensive, which makes interactive applications challenging.

Figure 4.14: Activation estimation for frown pose. (a) Target (red) and simulated pose after optimization (blue) in 2D skin atlas, (b) 3D reconstruction of target (red) and simulated skin (blue), (c) activation profiles of individual muscles, and (d) value of objective function with iterations.

Figure 4.15: Activation estimation for eyebrow raise pose. (a) Target (red) and simulated pose after optimization (blue) in 2D skin atlas, (b) 3D reconstruction of target (red) and simulated skin (blue), (c) activation profiles of individual muscles, and (d) value of objective function with iterations.

4.8 Conclusion

In this chapter, we discussed how a reduced coordinate representation of the muscle and skin of the eye region lets us simulate muscle movements, which in turn generate the skin movements. We proposed constitutive models for muscle and skin, and simulated them efficiently as thin layers. We also attempted to estimate muscle activation from video, but our optimization did not converge for the reasons discussed in the previous section.
In future, we want to move toward interactive facial animation, but our proposed layered simulation is currently computationally expensive and interactive simulation is not possible. This steered our research toward a data-driven approach to model facial skin movements, which we will discuss in detail in the next chapter.

Chapter 5

Interactive Animation of the Eye Region

5.1 Background

In this chapter, we investigate an interactive method to animate the subtle skin motions of the eye region. The eye region is one of the most important parts of the face to convey expression, and therefore it is worthwhile to design a model specifically for this region. Other parts of the face may be modeled by more traditional methods. Here, we propose a method to model and animate skin movements of the eye region as a function of gaze and a few additional parameters.

Most of the research attention on eyes has been on tracking and modeling gaze, especially the gaze behavior of a character in a virtual environment [85]. Eye tracking is now a relatively mature field, and a wide variety of off-the-shelf eye trackers of varying quality are available. In industry, geometric procedural methods are used to model the eye region (e.g., [77]). Similarly, considerable research in the past decade has been focused on facial simulation and performance capture. Physically based deformable models for facial modeling and reconstruction include the seminal work of Terzopoulos and Waters [102]. Most of the recent work has been on data driven methods. Some state-of-the-art methods focus on obtaining realism based on multi-view stereo [11, 13, 40, 44, 109]; this data can be used to drive blendshapes [41]. Some of the work is based on binocular [104] and monocular [43, 90] videos. Recent work by Li et al. [61] described a system for real-time and calibration-free performance-driven animation.
A system for performance-based character animation using commodity RGB-D cameras is presented in [107]; it can be used to generate facial expressions of an avatar in real-time.

With the widespread use of WebGL and the hardware commonly found in small devices, it is now possible to render high quality human characters over the web. Both eye diffraction effects [105] and skin subsurface scattering properties [2] can be rendered at interactive rates. We make use of these advances to generate facial animations using our model. Our system provides novel parametrized facial motion models that require minimal space and computation to simulate complex tissue movements in the eye region.

5.2 Skin Movement in Reduced Coordinates

Our proposed skin motion model is particularly suitable for modeling the subtle skin motions of the eye region. Our model has two motivations: first, since there are no articulating bones in the eye region, the skin slides on the skull almost everywhere (except at the margins where it slides over the globes; these margins are discussed below). Therefore, we would like to efficiently model this skin sliding. Second, we would like the model to be easily learned using videos of real human subjects. We have developed such a system for tracking and learning skin movement using a single video camera, as discussed in Chapter 3. The details are not important here, but it is useful to keep this motivation in mind to understand our choices.

Figure 5.1: Overview of the spaces used for modeling skin movement (skin space, physical space, skin atlas, and camera coordinates).

We have already introduced the reduced coordinate representation of skin in Chapter 3. Here, we describe this representation again in the context of skin motion modeling. This representation constrains the synthesized skin movement to always slide tangentially on the face, even after arbitrary interpolation between different skin poses.
This avoids cartoonish bulging and shrinking and other interpolation artifacts. We will see in Section 5.5 that deformation perpendicular to the face can be achieved where needed, for example in the movement of the eyelid. This representation also reduces the data size in most computations.

Skin is represented by its 3D shape in a reference space called skin space; this space is typically called “modeling space” in graphics and “material space” in solid mechanics. The key idea is that since skin is a thin structure, we can also represent it using a 2D parameterization $\pi$, using an atlas of coordinate charts. See Figure 5.1. In our case, a single chart, denoted skin atlas, is sufficient; it can be thought of as the skin's texture space.

Skin is discretized as a mesh $S = (V, E)$, a graph of $n_v$ vertices $V$ and $n_e$ edges $E$. In contrast to [59], this is a Lagrangian mesh, i.e., a point associated with a vertex is fixed in skin space. Since most face models use a single chart to provide texture coordinates, these coordinates form a convenient parameterization $\pi$. For instance, that is the case for face models obtained from FaceShift [107], which we use for the examples shown in this thesis.

In a slight departure from the notation of [59], a skin material point corresponding to vertex $i$ is denoted $X_i$ in 3D and $u_i$ in 2D coordinates. We denote the corresponding stacked arrays of points corresponding to all vertices of the mesh as $X$ in 3D skin space and $u$ in the 2D skin atlas.

The skin moves on a fixed 3D body corresponding to the shape of the head around the eyes. Instead of modeling the body as an arbitrary deformable object as in [59], we account for the specific structure of the hard parts of the eye region. We model the body as the union of two rigid parts: the skull (a closed mesh corresponding to the anatomical skull with the eye sockets closed by a smooth surface) and the mobile globe, which is not spherical and has a prominent cornea that displaces the eyelid skin. See Figure 5.2.
This allows us to efficiently parameterize changes in the shape of the body using the rotation angles of the globe. It is useful for modeling gaze-driven motion synthesis and is described in detail in Section 5.5.

Figure 5.2: Body is the union of “skull” and globes.

The skin and body move in physical space, which is the familiar space in which we can observe the movements of the face, for instance, with a camera. For modeling, we assume there is a head-fixed camera with projective transformation matrix $P$ that projects a 3D point corresponding to vertex $i$ (denoted $x_i$) into 2D camera coordinates $u_i$, plus a depth value $d_i$. This modeling camera could be a real camera used to acquire video for learning the model (as outlined in Chapter 3), or it could be a virtual camera, simply used as a convenient way to parameterize the skin in physical space. We note that $P$ is invertible, since $P$ is a full projective transformation and not a projection. We denote the stacked arrays of points corresponding to all vertices of the mesh as $x$ in 3D physical space and $u$ in the 2D camera coordinates.

5.3 Gaze Parameterized Model of Skin Movement

During movements of the eyes, the skin in the eye region slides over the body. It is this sliding we are looking to model. Following the standard notation in solid mechanics, the motion of the skin from 3D skin space to 3D physical space is denoted $\phi$ (see Figure 5.1). Therefore, we can write $x = \phi(X)$. However, directly modeling $\phi$ is not desirable, as it does not take into account the constraint that skin can only slide over the body, and not move arbitrarily in 3D. Instead, the key to our reduced coordinate model is that we represent skin movement in modeling camera coordinates, i.e., we model the 2D transformation

\[ u = P(\phi(\pi(u))) \overset{\mathrm{def}}{=} f_g(u), \]

where the argument $u$ on the right-hand side is a point in the skin atlas and the $u$ on the left-hand side is its image in camera coordinates. Our goal is to directly model the function $f_g$ as a function of input parameters, $g$, such as gaze and other factors that affect skin movement around the eyes.
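The remark that $P$ is invertible because it keeps a depth value alongside the 2D camera coordinates can be made concrete with a small sketch. The pinhole form below, with a single focal length and an image center, is an assumption chosen for brevity; the text only requires that $P$ be a full projective transformation, and the function names are hypothetical.

```python
def project(point, focal=1.0, center=(0.0, 0.0)):
    """Map a 3D point (x, y, z) to 2D camera coordinates plus depth.

    Assumes a simple pinhole model with the camera looking along +z.
    Returns ((u, v), d) where d is the depth kept alongside (u, v).
    """
    x, y, z = point
    if z <= 0.0:
        raise ValueError("point must lie in front of the camera")
    u = focal * x / z + center[0]
    v = focal * y / z + center[1]
    return (u, v), z

def unproject(uv, depth, focal=1.0, center=(0.0, 0.0)):
    """Invert project(): recover the 3D point from (u, v) and depth."""
    u, v = uv
    x = (u - center[0]) * depth / focal
    y = (v - center[1]) * depth / focal
    return (x, y, depth)

# Round trip for a hypothetical skin vertex and camera.
uv, d = project((0.5, -0.25, 2.0), focal=800.0, center=(320.0, 240.0))
recovered = unproject(uv, d, focal=800.0, center=(320.0, 240.0))
```

Without the depth value the mapping would discard one coordinate and could not be undone; with it, the 3D position is recovered, which is what allows the skin to be parameterized in camera coordinates and later lifted back to 3D.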
This representation has two advantages: first, it enforces the sliding constraint, and second, it makes it easier to learn a model of skin movements using video data.

5.3.1 Factors Affecting Skin Movement in the Eye Region

We now examine the different input variables $g$ that determine skin movement in the eye region. The most important and dynamic cause is eye movements that change gaze, i.e., that change what the character is looking at. However, other parameters, like eyelid aperture and expressions, also affect the skin; we combine these into the “generalized” gaze vector $g$.

Gaze. Humans make an average of three rapid gaze shifts (called saccades) every second, slow movements called “smooth pursuit” to track small moving targets, and other slow movements, called the vestibulo-ocular reflex and the opto-kinetic reflex, to stabilize the image on the retina. Eye movements change the orientation of the globe relative to the head. It is not commonly appreciated that the eyes have a full 3 degrees of freedom, even though pointing the optical axis at a target only requires 2 degrees of freedom; rotations about the optical axis, called “torsion,” have been known and modeled at least since the 19th century. Any parameterization of 3D rotations could be used. We use Fick coordinates [38], which are widely used in the eye movement literature to describe the 3D rotation of the eye, since they factor the torsion in a convenient form. These are a sequence of rotations: first horizontal ($g_1$), then vertical ($g_2$), and finally torsion ($g_3$). See Figure 5.3.

Eyelid Aperture. Eyelid movements are affected by both gaze and other factors. When our gaze shifts, the eyelids, especially the upper eyelids, move to avoid occluding vision. We also move our eyelids to blink, and when expressing mental states such as arousal, surprise, fatigue, and skepticism. The upper and lower eyelids move in subtly different ways. Therefore, we use two additional input parameters to define aperture.
One is the displacement of the midpoint of the upper eyelid above a reference horizontal plane with respect to the head ($g_4$); the plane is chosen to correspond to the position of the eyelid when closed. The other input is the displacement of the midpoint of the lower eyelid below this plane ($g_5$).

Expressions. The skin in the eye region is also affected by facial expressions, such as surprise, anger, and squint. We can optionally extend the input parameters $g$ to include additional parameters to control complex facial expressions. Expressions may be constructed using Action Units (AUs), defined by the Facial Action Coding System (FACS) [30]. In our implementation, AUs are used in a similar way as blend shapes; they may be learned from “sample poses” that a subject is asked to perform, or could also be specified by an artist. See Table 5.1 (left). The strength of the $i$th AU used in the model contributes an additional input parameter, $g_{i+10} \in [0, 1]$. Note that we defined five parameters per eye (3 gaze and 2 aperture), which together contribute the first 10 inputs.

Sample poses                  Expression generation
AU    FACS name               Expression    Sample pose AUs
1     Inner brow raiser       Surprise      1, 2, 5
2     Outer brow raiser       Anger         4, 5
4     Brow lowerer            Fear          1, 2, 4, 5
5     Upper lid raiser        Sadness       1, 4
43    Eyes closed             Squint        44
44    Squint                  Blink         5, 43

Table 5.1: FACS AU used in our experiments (left) and expressions that can be produced using these AUs (right).

5.3.2 Generative Model of Skin Movement

Following these observations, we factor the generative model into three parts: the baseline aperture model, the eyelid shape model, and the skin motion model. Each of these models can be constructed or learned separately.
A schematic diagram of the implementation is shown in Figure 5.3.

Figure 5.3: Gaze parametrized skin movement: We predict eyelid aperture from gaze using an aperture model, and we use gaze and aperture together to predict skin deformation in the eye region. If desired, expression parameters can also be included to produce facial expressions.

In the following, we will assume $g$ is an $n_i \times 1$ column matrix, where $n_i$ is the total number of possible inputs. Submatrices are extracted using Matlab-style indexing, e.g., $g_{1:3}$ is the submatrix comprised of rows 1 to 3, and $g_{[4,5]}$ is a submatrix with just the fourth and fifth elements. To achieve real-time performance at low computational cost, and also to exploit GPU computation using WebGL, we choose linear models for this application. We have explored using non-linear neural network models, but these are more expensive to evaluate.

Eyelid Aperture Model. As discussed above, the aperture depends on gaze, but it is modulated by other voluntary actions such as blinking. We use a linear baseline aperture model, $A$, for predicting the aperture due to gaze; since the torsion angle of the globe does not have a significant effect on aperture, we only use the first two components of gaze. The baseline aperture is then scaled by the eye closing factor $c \geq 0$ to simulate blinks, arousal, etc. The resulting model for the left eye is:

\[ g_{[4,5]} = c\, A\, g_{[1,2]}. \tag{5.1} \]

Eyelid Shape Model. We observed that the deformation of skin in the eye region is well correlated with the shape of the eyelid margin. This makes biomechanical sense, since the soft tissues around the eye move primarily due to the activation of the muscles surrounding the eyelids, namely the orbicularis oculi and levator palpebrae muscles.
We define the eyelid shape for each eyelid as a piecewise cubic spline curve. We found that using between 17 and 22 control points for each spline faithfully captures the shape of the eyelids. The eyelid shape depends on both gaze and aperture. The general form of the model is

\[ l = L\, g, \tag{5.2} \]

where $l$ is the column matrix of coordinates of all control points for the eyelid.

Skin Motion Model. The skin of the eye region is modeled at high resolution (using about a thousand vertices in our examples). It is deformed based on the movement of the eyelids. We use a linear model for this, since the skin movement can then be efficiently evaluated using vertex shaders on the GPU, and also learned quickly. Note that the skin motion depends on all four eyelids; the stacked vector of coordinates of all four eyelids is denoted $l$. The resulting model is

\[ u = u_0 + M\, l, \tag{5.3} \]

where $u_0$ is the default rest position of the skin.

5.4 Transferring Animations

The skin motion model of Section 5.3.2 is constructed for a specific subject. The skin model may not include some parts, such as the inner eyelid margins, which are difficult to see and track in video. Here we discuss how the information in the generative model can be transferred to other target characters and to untracked parts of the eye region.

Target Transfer. Given a new target character mesh with a topology different from that of the captured subject mesh (the 3D face mesh of the subject for whom the model was constructed), we have to map the model output $u$ to new image coordinates $\tilde{u}$ representing the motion of the new mesh in image coordinates. The map is computed as follows: we first use a non-rigid ICP method [60] to register the target mesh to the captured subject mesh in 3D. The resulting mesh is called the registered mesh. The vertices of the registered mesh are then snapped to the nearest faces of the captured subject mesh.
We compute the barycentric weights of the registered mesh vertices with respect to the captured subject mesh, and construct a sparse matrix $B$ of barycentric coordinates that can transform $u$ to $\tilde{u}$ as $\tilde{u} = B u$ (see Figure 5.4).

Movement Extrapolation. Some vertices of the target mesh, particularly those of the inner eyelid margin, are not included in the skin movement model computed in Eq. 5.3. This is because such vertices cannot be tracked in video and it is difficult to build a data-driven model for them. Instead, the skin coordinates of those vertices are computed as a weighted sum of nearby tracked vertices. The normalized weights are proportional to the inverse distance to the neighboring points (we used 10) in the starting frame.

5.5 Client Applications

We implemented two different client applications using WebGL and JavaScript. Both synthesize animations of the eye region in real time using our model. We used the Digital Emily data provided by the WikiHuman project [44, 108]. The applications start by downloading all the required mesh, texture, and model data. They run offline and perform all computations without communicating with a server. Blink input is handled by a stochastic blink model [103]. Wrinkles are synthesized using an additional wrinkle map texture. The two applications differ only in their model input sources: one is a fully interactive user controlled application, while the other plays a cut-scene defined using keyframed animation curves (which are also displayed on screen).

Figure 5.4: An overview of target transfer. (a) The target character mesh (red) is registered non-rigidly on the capture subject mesh (blue) shown in the top row. Image coordinates of the target mesh are computed from the image coordinates of the model output using barycentric mapping computed during registration.
(b) The model trained on one subject can be used to generate animation of a character mesh of any topology uploaded by a user.

We now discuss some important details of these applications.

Optimized Motion Model. Since all our operations are linear, we can premultiply the eyelid shape models $L$ and the skin motion model $M$ to construct one large matrix. However, using principal component analysis (PCA), we found that the results are well approximated using only 4 principal components. The PCA is precomputed on the server, and can be computed at model construction time. This results in significantly smaller matrices, $U$ (tall and skinny) and $W$ (small), such that

\[ M\, l \approx U\, W\, g. \tag{5.4} \]

Only the smaller matrices $U$ and $W$ need to be downloaded from the server.

Reconstructing 3D Geometry. Recall that we represent the 3D coordinates $x$ of the original skull mesh using the 2D skin coordinates $u$ corresponding to the neutral pose. To determine the 3D vertex positions $x$ from $u$, we pre-render a depth texture of $x$ in 2D skin space using natural neighbor interpolation on the vertex values. This texture, which we name the skull map, can be sampled to obtain $x$ from $u$. We also render a skull normal map using the same procedure to sample normals given $u$.

GPU Computations. Our framework is well suited for real-time synthesis using GPUs. While the aperture model and the PCA-reduced motion model $W$ are applied on the CPU, the PCA reconstruction using $U$ is performed on the GPU, per vertex, as a matrix-vector multiplication in the vertex shader.
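The linear chain of Eqs. 5.2 to 5.4 can be illustrated with toy matrices. In the sketch below, the matrix values are invented and chosen so that the product $M L$ has rank one, which makes the factorization into $U$ and $W$ exact; in the thesis the factors come from PCA and the relation is only approximate. All function names are hypothetical.

```python
def matvec(A, x):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def matmul(A, B):
    """Matrix-matrix product for nested lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def skin_positions(u0, M, L, g):
    """u = u0 + M (L g): eyelid shape first (Eq. 5.2), then skin motion (Eq. 5.3)."""
    l = matvec(L, g)
    return [a + b for a, b in zip(u0, matvec(M, l))]

def skin_positions_reduced(u0, U, W, g):
    """u = u0 + U (W g): W g is small, U is applied per vertex (Eq. 5.4)."""
    coeffs = matvec(W, g)
    return [a + b for a, b in zip(u0, matvec(U, coeffs))]

# Toy data: 2 inputs, 2 eyelid coordinates, 3 skin coordinates.
L = [[1, 1], [0, 2]]
M = [[1, 2], [2, 4], [3, 6]]
U = [[1], [2], [3]]   # tall and skinny
W = [[1, 5]]          # small
u0 = [10, 20, 30]
g = [2, 1]

full = skin_positions(u0, M, L, g)
premultiplied = [a + b for a, b in zip(u0, matvec(matmul(M, L), g))]
reduced = skin_positions_reduced(u0, U, W, g)
```

All three evaluations agree on this example. The reduced form is the one that matches the split described in the text: the small product `W g` can be computed once per frame, and each vertex then needs only its own row of `U`.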
Expression wrinkles are generated by blending the wrinkle bump maps using the expression parameters as weights in the fragment shader.

Our model predicts motion of the skin in 2D modeling camera coordinates. The vertices are then lifted to their final 3D positions by sampling the skull map. However, since the globe is not spherical and has a prominent cornea, rotations of the globe should produce subsurface deformations of the skin. To address this challenge, we first represent the shape of the globe in polar coordinates in the globe's reference frame, and precompute a depth map $D(g_1, g_2)$ that represents the globe shape. For each frame, vertices of the eyelid are radially displaced from the cornea surface at the current globe orientation, using the depth map and the skin thickness.

Figure 5.5: Flow chart of how inputs are handled in our applications by the animation model.

Additionally, some of the extrapolated vertices (such as those in the canthus and eyelid margins) should not slide directly on the skull or globe geometry. Hence we handle these vertices differently. For each extrapolated vertex, we precompute the radial distance $o$ between the original vertex position of the target mesh and the globe. We then offset the skin vertex at its current position on the skull by $o$. This allows us to reconstruct the eyelid margin thickness and canthus depressions. The skin normal, especially for an extrapolated vertex in the eyelid margin, is computed by parallel transport from the normal at its original position to its current position.

Figure 5.6: We can generate facial expressions interactively. From top left clockwise: normal, frown, eye closure, and surprise expressions are shown.

WebGL Considerations. One of the main limitations of current WebGL enabled mobile devices is that there is no support for multi-pass rendering.
This feature is extremely important for less traditional rendering methods such as deferred shading, post-processing effects or, more importantly for our case, subsurface scattering methods.

To achieve realistic real-time skin rendering it is standard to apply an approximation of the subsurface scattering properties of the skin. We implement the method proposed by Jimenez et al. [55]. It requires 3 rendering passes: the first projects the geometry to generate screen space diffuse, specular and depth maps of the skin. This is followed by two screen space passes that perform a Gaussian blur on the diffuse illumination map, taking into account the depth map. The last of these passes also sums the diffuse and specular illuminations to achieve the final result.

In our implementation, while we still only perform 3 passes, we need to perform two geometric passes (instead of one): the first rendering pass generates the diffuse illumination and depth maps (depth is stored in the alpha channel). The second one is a screen space Gaussian blur. The third, although it performs the screen space blur, also computes specular illumination and sums it to the diffuse for the final result. Hence, our optimized Separable Subsurface Scattering (SSSS) system for WebGL requires our vertex shader program to run twice per frame. This is taken into account in Table 5.2.

5.6 Results

The applications run in any modern browser, at 60 fps on a desktop with an Intel Core i7 processor and an NVIDIA GeForce GTX 780 graphics card, at 24 fps on an ultraportable laptop with an Intel Core i7 processor and integrated Intel HD 5500 Graphics, and at 6 fps on a Nexus 5 Android phone with an Adreno 330 GPU.

Animation type        Download (MB)   Memory usage (MB)   GPU memory (MB)   Runtime per frame (ms)
Static                3.7             240                 390               0.5450 ± 0.1553
Animated Emily        5.3             370                 417               0.6717 ± 0.1564
Animation overhead    1.6             130                 27                0.1267 ± 0.0011

Table 5.2: Overview of the model size, memory usage, and performance of the animation.
The experiments are run in a Chrome web browser on a desktop with an Intel Core i7 processor and an NVIDIA GeForce GTX 780 graphics card at 60 fps.

The majority of the workload is for rendering; the model itself is very inexpensive (see Table 5.2).

Readers can try the WebGL application on our project webpage (http://www.cs.ubc.ca/research/eyemoveweb3d16/). A video is also available on the project webpage that shows a clip captured live from the application running in a browser. In the video, we also included head movements of a user tracked using FaceShift and a statistical blink model [103] for realism. Different expressions, such as surprise and anger, are added at key moments. Please note the changes in wrinkles and eyes during these expressions.

Saliency Map Controlled Movement. When we observe object motion in real life or in a video, our eyes produce characteristic saccades. We computed saliency maps, a representation of visual attention, using the method proposed in [52]. Points in the image that are most salient are used as gaze targets to produce skin movements around the eyes. In Figure 5.7 and also in the accompanying video, we show an example of skin movement controlled by gaze, using salient points detected in a video of a hockey game.

Figure 5.7: Facial animation automatically generated from a hockey video. The character starts watching the game with a neutral expression (a), blinks naturally under our stochastic model (b), becomes anxious anticipating a goal (c), and gets excited when a goal is scored (d). The salient points are computed from the video and used as input to our system.

Static Scene Observation. Our generative gaze model can be controlled by gaze data obtained from any eye tracking system.
We used gaze data of a subject observing a painting to drive our system. This produces very realistic movements of the eyelid and skin around the eyes, as can be seen in Figure 5.8.

Figure 5.8: Skin movement driven by gaze during static scene observation. The red circle on the left represents the image point the subject is looking at.

Eyelid Deformation during Blink. We can generate skin deformation in a blink sequence using a statistical blink model that controls aperture based on the experimental data reported in [103]. A blink sequence is shown in Figure 5.9.

Figure 5.9: Skin deformation in eye closing during a blink. The characteristic medial motion of the lower eyelid during blink is generated by the model (shown using red arrow).

Vestibulo-ocular Reflex. Our model can also generate eye region movement in novel scenarios in which the head moves. For example, when a character fixates on an object and rotates its head, the eyes counter-rotate. This phenomenon is known as the 'vestibulo-ocular reflex'. Our model predicts the skin deformation on the eyelid and around the eyes during such movements. Please watch the video on our project page for details.

Figure 5.10: Comparison with no skin deformations (top), and with deformation (bottom) using our model with different gaze positions.

5.7 Conclusion

Despite an enormous amount of previous work on modeling eyes for computer animation, the motion of soft tissues around the eyes when a character's eyes move has largely been neglected. Our work is the first to specifically address the problem of soft tissue movements in the entire region of the eye. The model is also inexpensive in terms of run times and memory, as seen in Table 5.2. Our model currently has a few limitations that we hope to address in the near future. These include the linearity of the generative model required for inexpensive simulation in the browser using WebGL, and the lack of eyelashes.
Nevertheless, the system is able to generate realistic eye movements using a variety of input stimuli.

Chapter 6

Measurement of Eye Movements

6.1 Background

Analysis of eye movements can provide deep insight into behavior during object interception. Apart from serving as sensory organs, the eyes can be controlled voluntarily, so they can also be used to generate desired outputs for various applications. This has led to the development of gaze controlled human machine interfaces. The eyes also have an important role in conveying our emotions during any conversation, for which they are considered one of the most integral parts of facial expression detection algorithms in computer vision. Furthermore, for generating more realistic eye movements in animated characters, researchers in computer graphics are also analyzing eye movement data for creating a generative model. Robust non-intrusive eye detection and tracking is therefore crucial for the analysis of eye movements, and this has motivated us to work toward the development of a robust eye detection and tracking system. In the scientific community, the importance of eye movements is explicitly acknowledged as the method through which we gather the information necessary to identify the properties of the visual world [49].

In computer vision, eye tracking has been investigated as a tool for different applications for many years. The first eye tracking camera to understand gaze was presented in [62]. A review of eye tracking methodologies spanning three disciplines (neuroscience, psychology, and computer science), with brief remarks about industrial engineering and marketing, is reported in [29]. Jacob and Karn [53] discussed the applications of eye movements mainly in the following major areas: analyzing interfaces, measuring usability, and gaining insights into human performance. Another promising area is the use of eye tracking techniques to support interface and product designs.
Human computer interaction has become an increasingly important part of our daily life. A user's gaze can provide a convenient, natural and high-bandwidth source of input. A non-intrusive eye tracking method based on iris center tracking was presented in [57] for human computer interaction. In [4] active gaze tracking is performed for human-robot interactions. Another interesting practical application of eye tracking is driving safety. It is widely accepted that deficiencies in visual attention are responsible for a large proportion of road traffic accidents [51]. Eye movement recording and analysis provide important techniques to understand the nature of the driving task and are important for developing driver training strategies and accident countermeasures. The torsional movement of the eyes can also provide useful information in analyzing a visually guided task [24, 50].

The two key components of research on eye detection and tracking are eye localization in an image and gaze estimation. Before discussing these components of our 3D gaze estimation framework further, we define a few terms. To provide readers with a clear understanding, Figure 6.1 illustrates these terms in the context of our system setup, which consists of a motion capture system and an eye tracker. More details on the equipment can be found in Section 6.2.

Figure 6.1: Illustrative diagram of our experimental setup.

• Videooculography. Eye tracking performed using video is popularly known as videooculography. In our project, video is captured using a head mounted eye tracker.
• Gaze. Gaze is the direction in which the eyes are looking. Gaze direction is either modeled as the optical axis or as the visual axis. The optical axis (a.k.a. line of gaze (LoG)) is the line connecting the pupil center, cornea center, and the globe center. The line connecting the fovea and the center of the cornea is known as the visual axis (a.k.a. the line of sight (LoS)).
The gaze can be defined either locally in the head-fixed coordinates, or globally in the world coordinates. The gaze in the world can also be defined as gaze in 3D.
• Monocular Gaze. For a single eye, gaze is defined in terms of a point of regard, which is the intersection of the visual axis and a scene plane.
• Binocular Gaze. If both eyes are considered, the point where the visual axes of the two eyes meet can be defined as the point of regard of binocular gaze. The binocular gaze is illustrated in Figure 6.1, where the two visual axes, $V_1$ and $V_2$, meet at an infrared marker $M$. Thus $M$ is the point of regard of binocular gaze in this example.
• Ocular Torsion. Ocular torsion is defined as the rotation of the globe around the visual axis. In Figure 6.2, the rotation of the globe around the visual axis is shown. A more detailed mathematical description of eye rotation can be found in Section 6.3.3.

Figure 6.2: Rotation of globe in the head-fixed coordinates.

Although there are many gaze estimation techniques available in the literature, some of the inherent problems such as eye blinking and eyelash interference are still not completely solved. The goal of our project is to develop a robust videooculography based 3D eye gaze estimation technique to support the computer vision, neuroscience, and computer graphics communities in their research related to eye detection and tracking. Existing videooculography methods have not attempted to estimate binocular gaze under unrestrained head movements in a way that is robust to interference produced by eye blinks. To solve this problem, our investigation is broadly divided into two phases:

• Gaze in Head-fixed Coordinates. Estimating gaze in the head-fixed coordinates with the highest accuracy by using binocular information of the eyes. Here, we detect and track suitable eye features for robust pupil position estimation and blink detection.
• Gaze in World Coordinates.
Estimating the position and orientation of the globe in the world coordinates. As shown in the block diagram of the proposed framework in Figure 6.3, we combined both eye tracking data and motion capture data to estimate gaze in the world coordinate space.

In our 3D binocular gaze estimation framework (see Figure 6.3), we collect video from a head mounted eye tracker, track the head and the eye tracker itself using a motion capture system, use computer vision algorithms to track eye features in the video and remove eye blinks, and finally use a projective camera to globe model (see Section 6.3.3) to estimate the globe configuration in the world coordinates. By estimating the meeting point of the visual axes of both eyes, the gaze in the world coordinates, which we also call gaze in 3D, can be estimated as shown in Figure 6.1.

Figure 6.3: Block diagram of proposed gaze estimation framework.

6.2 Equipment

The eye feature detection and 3D gaze estimation algorithms of our project are evaluated on a video database collected from a binocular eye tracking device (C-ETD, Chronos Vision, Berlin) with IR light sources and CMOS cameras in the near IR range [24]. The C-ETD consists of a head mounted unit and a system computer. Estimating the configuration of the globe in a situation such as an object interception task is complicated, so we used a motion capture system (Vicon Motion Capture System, Vicon, Los Angeles) to localize IR markers placed on the eye tracker and head in the world coordinates.
In Figure 6.4, the subject is wearing the C-ETD, and body markers are simultaneously being tracked using the Vicon motion capture system.

Figure 6.4: Subject wearing Chronos eye tracking device in a Vicon motion capture system experimental setup (left), and a sample image from the video captured using Chronos eye tracking device (right).

6.3 Gaze Estimation in Head-fixed Coordinates

In this section, the major methods used to estimate gaze in head-fixed coordinates are discussed.

6.3.1 Eye Feature Detection in Image

Eye detection methods always start with a model of the eye, either explicit or implicit. It is essential to identify an eye model which is sufficiently expressive to take account of a large variability in appearance and dynamics. Depending on whether the model is rigid or deformable, the taxonomy of eye detection consists of mainly two kinds of methods: shape-based methods and appearance-based methods [49].

Eye Feature Estimation

We first need to select good and relevant features to track. The features most commonly used for tracking purposes are the pupil, iris, and eyelids. We also need to remove some artifacts such as eyelashes. In the image of the eye, the pupil is the most prominent feature to track due to its characteristic intensity profile. The pupil is a dark elliptical part of the eye surrounded by the white sclera. Much of the literature has focused mainly on detecting the pupil and tracking it to obtain information about the eye movement.

In our method the Hough transform is used to localize the pupil in the image obtained from an eye tracking device. If observed carefully, one can notice that the pupil is actually neither a perfect circle nor an ellipse. So, we cannot directly use the Hough transform for circles for pupil detection. Furthermore, the elliptical Hough transform is computationally expensive. We therefore used the Hough transform only to obtain an initial guess of the pupil position, or in the case of pupil occlusion.
We will further discuss our technique after briefly introducing readers to the Hough transform.

Hough Transform. The Hough transform has been used widely in the field of image processing and computer vision for the purpose of feature extraction. It was first proposed to detect lines in an image. Various recent works have extended its applicability to detect arbitrary shapes, of which circles and ellipses are the most common [5].

Figure 6.5: Illustration: Hough transform for circles [47].

Line Detection. The simplest shape detection that can be performed using the Hough transform is to detect straight lines. A straight line can be expressed using the linear equation $y = mx + c$. Consider a point $(x, y)$ in the image space. Given the point $(x, y)$ and considering the parameter pair $(m, c)$ to be unknown, the point $(x, y)$ can be represented as a straight line $m = (-1/x)\,c + (y/x)$ in the parameter space. Now if we consider several points in the image space that lie on a straight line, the lines in the parameter space corresponding to those points will intersect at a single point in the parameter space, say $(m_0, c_0)$. If the parameter space is divided into bins, then we can count the number of votes contributed by each point in the image space. For perfectly collinear points in the image space, if we accumulate the votes in the parameter space, the bin for $(m_0, c_0)$ should get the maximum number of votes.

Unfortunately, one of the main problems in using the Cartesian coordinate system is that vertical lines lead to unbounded values in the parameter space. Therefore, for computational reasons, polar coordinates are used for the parameter space, where the pair of parameters is $r$ and $\theta$. The parameter $r$ represents the distance between the line and the origin, while $\theta$ is the angle of the vector from the origin to the closest point on the line (both in image space). Eq. 6.1 is the equation of the line in image space.
The point $(x_0, y_0)$ in the image space is mapped to the parameter space as given in Eq. 6.2. Eq. 6.2 corresponds to a sinusoidal curve in the $(r, \theta)$ plane. Therefore, the points in the image space which lie on a straight line produce sinusoids which cross at the parameters of that line.

\[ y = \left(-\frac{\cos\theta}{\sin\theta}\right) x + \left(\frac{r}{\sin\theta}\right). \tag{6.1} \]

\[ r(\theta) = x_0 \cos\theta + y_0 \sin\theta. \tag{6.2} \]

Circle Detection. The Hough transform for circle detection in an image follows a similar approach to the line detection technique. It can be used to determine the parameters of a circle when a number of points that fall on the perimeter are known. A circle with radius $R$ and center $(a, b)$ can be described with these parametric equations:

\[ x = a + R\cos(\theta), \tag{6.3} \]

\[ y = b + R\sin(\theta). \tag{6.4} \]

When the angle $\theta$ sweeps through the full 360 degree range, the points trace the perimeter of a circle. If the circles in an image are of known radius $R$, then the search can be reduced to 2D. The objective is to find the $(a, b)$ coordinates of the centers. The locus of the $(a, b)$ points in the parameter space falls on a circle of radius $R$ centered at $(x, y)$. The true center point will be common to all parameter circles, and can be found with a Hough accumulation array. Figure 6.5 shows an illustrative diagram of circle detection using the Hough transform.

Pupil Center Detection Technique. Before applying the Hough transform, we need to detect the pupil edges present in the image. To detect the pupil edges for all orientations we generate an edge image $I_{edge}$ using the Canny edge detector. The Hough transform for circles is then applied to $I_{edge}$ to detect an approximate location of the pupil. We need to provide a range of pupil radii to the Hough transform algorithm, which is estimated by a manual calibration from a reference image. Figure 6.6 shows the method findPupilCenter used for the pupil center estimation. This method also calls the method findPupilOcclusion described next.
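The voting scheme for circles of known radius (Eqs. 6.3 and 6.4) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation behind findPupilCenter: the function name `hough_circle_center`, the synthetic sub-pixel edge samples, and the angular sampling are assumptions made for the example.

```python
import numpy as np

def hough_circle_center(edge_points, radius, shape, n_angles=360):
    """Vote for circle centers (a, b) given edge samples (x, y) and a known radius."""
    acc = np.zeros(shape, dtype=np.int32)            # accumulator indexed as acc[b, a]
    theta = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    for x, y in edge_points:
        # Each edge sample votes along a circle of radius R centered on itself,
        # which is the locus of all centers (a, b) consistent with that sample.
        a = np.rint(x - radius * np.cos(theta)).astype(int)
        b = np.rint(y - radius * np.sin(theta)).astype(int)
        ok = (a >= 0) & (a < shape[1]) & (b >= 0) & (b < shape[0])
        np.add.at(acc, (b[ok], a[ok]), 1)
    b0, a0 = np.unravel_index(np.argmax(acc), acc.shape)
    return int(a0), int(b0)

# Synthetic edge samples on a circle of radius 20 centered at (a, b) = (60, 45).
phi = np.linspace(0.0, 2.0 * np.pi, 90, endpoint=False)
pts = np.stack([60 + 20 * np.cos(phi), 45 + 20 * np.sin(phi)], axis=1)
center = hough_circle_center(pts, radius=20, shape=(100, 120))
```

When only a range of radii is known, as in the calibration described above, the same voting would be repeated for each candidate radius and the strongest peak kept.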
In the pupil center finding algorithm, as shown in Figure 6.6, the input arguments are the input image $I$ from an eye tracker, the radius $R_{HT}$ of the detected Hough circle, and the center of the Hough circle $C = (x_{HT}, y_{HT})$; the function returns the position $P_c$ of the pupil center.

Figure 6.6: Pupil center finding algorithm.

As discussed previously, we need to consider pupil occlusion, as the pupil can be occluded by the eyelids mainly in two situations: first, at the onset of or just after a blink, and second, when we look extremely far up or down. Another occasion where the pupil can get occluded is during pupil dilation in a dark environment. Although we are not interested in pupil occlusion during an eye blink, the center of mass technique for pupil center estimation can lead to a very inaccurate result when the pupil is occluded during extreme eye movements. Based on our observations, we can safely assume that occlusion is caused by the upper and lower eyelids. We define a pupil horizontal width profile, $d(r)$, of the estimated pupil region as a criterion for determining pupil occlusion. The pupil occlusion detection algorithm and pupil horizontal width profile are described in Figure 6.7.

Figure 6.7: Pupil occlusion detection algorithm.

The symmetry metric $\kappa$ mentioned in Figure 6.7 is defined as follows:

\[ \kappa = \frac{1}{100} \sum_{i=1}^{1.5 R_{HT}} d(\mu - 1.5 R_{HT} + i) \cdot d(\mu + i), \tag{6.5} \]

where

\[ \mu = \frac{\sum_{i=1}^{\mathrm{length}(d(r))} d(i) \cdot i}{\sum_{i=1}^{\mathrm{length}(d(r))} d(i)}. \tag{6.6} \]

The parameter $\kappa_o$ used in the pupil occlusion detection algorithm is experimentally determined to be approximately 600.

6.3.2 Pupil Feature Tracking

In our project we have applied Kalman filter based tracking in the image domain. Here we briefly describe the Kalman filtering technique for our approach.

The motion of the pupil at any time instant can be characterized by its position and velocity. Let $(x_t, y_t)$ represent the pupil pixel position at time $t$, and $(u_t, v_t)$ represent its velocity at time $t$ along the $x$ and $y$ directions.
The state vector at time $t$ can therefore be represented as $\mathbf{x}_t = (x_t\ y_t\ u_t\ v_t)^T$. The dynamic model of the system can be represented as:

\[ \mathbf{x}_{t+1} = \phi\,\mathbf{x}_t + \mathbf{w}_t, \tag{6.7} \]

where $\mathbf{w}_t$ represents system perturbation. Assuming the fast feature extractor estimator is $\mathbf{z}_t = (\hat{x}_t, \hat{y}_t)$, giving the pupil position at time $t$, the observation model can now be represented as:

\[ \mathbf{z}_t = H\,\mathbf{x}_t + \mathbf{v}_t, \tag{6.8} \]

where $\mathbf{v}_t$ represents the measurement uncertainty. Specifically, the position in the current frame at time $t$ is estimated by some pupil localization technique in the neighborhood of the predicted position (based on the system dynamic model). Given the dynamic model and the observation model, as well as some initial conditions, the state vector $\mathbf{x}_{t+1}$, along with its covariance matrix $\Sigma_{t+1}$, can be updated using the system model (for prediction) and the observation model (for updating). Once we obtain the updated location of the pupil center, this feature can be projected to the head-fixed coordinates to obtain horizontal and vertical eye movements, which is discussed in detail in the next section.

Figure 6.8: Pupil horizontal width profile, showing center of mass based pupil center (in red) and updated pupil center based on our algorithm (in green).

6.3.3 Imaging Geometry

In eye detection, it is essential to identify a model for the eye which is sufficiently expressive to take account of the large variability in appearance and dynamics, while also sufficiently constrained to be computationally efficient [49]. In our model the eye is assumed to be a perfect sphere exhibiting ideal ball and socket behavior as in [66]. Based on that model, the eye movements are pure rotations around the center of the globe without any translational component. Furthermore, the iris is considered to be a planar section at a distance $r_p$ from the center of the globe.
The central projection of the eye onto the image plane is shown in Figure 6.9.

Figure 6.9: The central projection of the eye onto the image plane [66].

To define the imaging geometry we will use the mathematical notation used in [66]. Following this paper, matrices are represented by uppercase characters (e.g. $R$), points in 3D space by underscored uppercase letters (e.g. $P$) and unsigned scalar quantities by italicized lowercase letters. To derive the mathematical formula for the projection of the eye onto the image plane, we define an orthogonal head-fixed, right-handed coordinate system $\{h_1, h_2, h_3\}$ with the origin at the center of the eye as shown in Figure 6.9. The $h_2$-$h_3$ plane is parallel to the coronal plane, with the $h_2$ axis parallel to the interaural axis of the subject. We define a camera-fixed coordinate system $\{c_1, c_2, c_3\}$, where the $c_2$ and $c_3$ axes lie within the image plane, and $c_1$ corresponds to the "line of sight" of the camera. The plane of the camera lens is located on the $c_1$ axis at a distance $f$ from the image plane and a distance $d$ from the center of the eye. When the camera is focused on distant objects, $f$ is equal to the focal length of the lens.

Now, Eq. 6.9 gives the relation between a point in the head-fixed frame, $P = (p_1, p_2, p_3)$, and the corresponding coordinates with respect to the camera-fixed system, $P' = (p'_1, p'_2, p'_3)$:

\[ P' = {}^{c}_{h}R \cdot P + T, \tag{6.9} \]

where ${}^{c}_{h}R$ and $T$ are the rotation and the translation, respectively, of the head-fixed coordinates with respect to the camera-fixed coordinates.

We also define a globe-fixed coordinate system $\{e_1, e_2, e_3\}$ that rotates along with the globe. The reference globe position is defined as the orientation of the globe when the visual axis of the eye (i.e. the direction of $e_1$) coincides with $h_1$. If the rotation from the head-fixed coordinate system to the globe-fixed coordinate system is given by ${}^{h}_{e}R$, we can express the rotation by the following equation:

\[ e_i = {}^{h}_{e}R \cdot h_i, \quad i = 1, 2, 3. \tag{6.10} \]

The 3D rotation of the globe from the reference globe configuration (as defined before) to the current globe configuration can be decomposed into three consecutive rotations about three well defined axes. The sequence of rotations is important because multiplication of rotation matrices in more than two dimensions is not commutative.

Fick coordinates are what we intuitively use to describe the 3D rotation of the eye [38]. The sequence of rotations, first a horizontal rotation ($R_3(\theta)$ about $e_3$), then a vertical rotation ($R_2(\phi)$ about $e_2$), and finally a torsional rotation ($R_1(\psi)$ about $e_1$), was first used by Fick, and the angles $\theta$, $\phi$, and $\psi$ for this sequence are referred to as Fick angles. This kind of rotation about globe-fixed axes is known as passive rotation or rotation of coordinate axes. To simplify the mathematics, this rotation can also be expressed as an active rotation, or rotation of the object about head-fixed coordinates, but the sequence of rotations must then be in the reverse order of the sequence used in the passive rotation.

Therefore the rotation matrix for the rotation of the head-fixed coordinates, from the reference orientation to the current globe configuration, is given by:

\[ {}^{e}R^{h}_{\theta\phi\psi} = R_{Fick} = R_3(\theta) \cdot R_2(\phi) \cdot R_1(\psi) \tag{6.11} \]

\[ = \begin{pmatrix} \cos\theta\cos\phi & \cos\theta\sin\phi\sin\psi - \sin\theta\cos\psi & \cos\theta\sin\phi\cos\psi + \sin\theta\sin\psi \\ \sin\theta\cos\phi & \sin\theta\sin\phi\sin\psi + \cos\theta\cos\psi & \sin\theta\sin\phi\cos\psi - \cos\theta\sin\psi \\ -\sin\phi & \cos\phi\sin\psi & \cos\phi\cos\psi \end{pmatrix}. \tag{6.12} \]

If we do not consider ocular torsion, only $\theta$ and $\phi$ are taken into account. Setting $\psi = 0$ in Eq. 6.12 then simplifies the rotation matrix to Eq. 6.13:

\[ \left. {}^{e}R^{h}_{\theta\phi\psi} \right|_{\psi=0} = R_{\theta\phi} = \begin{pmatrix} \cos\theta\cos\phi & -\sin\theta & \cos\theta\sin\phi \\ \sin\theta\cos\phi & \cos\theta & \sin\theta\sin\phi \\ -\sin\phi & 0 & \cos\phi \end{pmatrix}. \tag{6.13} \]

Using Eq. 6.13 and $P_0$, we can update Eq. 6.9 to form Eq. 6.14:

\[ P' = {}^{h}R_{c} \cdot R_{\theta\phi} \cdot P_0 + T. \tag{6.14} \]

The perspective projection of $P'$ onto the image plane, $P''$, is given by Eq. 6.15:

\[ P'' = (0\ \ x\ \ y)^T = f \begin{pmatrix} 0 \\ p'_2 / (p'_1 + f) \\ p'_3 / (p'_1 + f) \end{pmatrix}. \tag{6.15} \]

Eq.
6.15 can be replaced by the simpler orthographic projection if the distance $d$ between the lens plane and the center of the eye is much larger than the radius of the eye $r_p$, or for small eye movements ($\theta$ and $\phi$ up to ±10°). Then Eq. 6.15 can be updated as Eq. 6.16:

\[ P'' = (0\ \ x\ \ y)^T = r_p \begin{pmatrix} 0 \\ R_{21} \\ R_{31} \end{pmatrix} + \begin{pmatrix} 0 \\ x_0 \\ y_0 \end{pmatrix}, \tag{6.16} \]

where $(x_0, y_0)$ is the projection of the center of the eye onto the image plane, and $R_{ij}$ is the matrix element in the $i$th row and $j$th column of the rotation matrix $R$, which is given by Eq. 6.17:

\[ R = {}^{h}R_{c} \cdot R_{\theta\phi}. \tag{6.17} \]

An inverse of this imaging geometry can also be formulated to determine the eye position in head-fixed coordinates given the pupil center location in the image. The details of this model can be found in [66].

6.3.4 Ocular Torsion Estimation

In Section 6.3.3, we discussed how to obtain the horizontal and vertical eye movements, $\theta$ and $\phi$, using the eye model proposed by [66]. Using this information we can also obtain the location of the iris boundary in the head-fixed coordinate space. The radius of the iris can be obtained by performing a manual calibration. Using the head to camera transformation, ${}^{c}E_{h}$, computed during calibration, we can project the iris boundary from head-fixed coordinates to the image coordinates for any biologically plausible globe orientation. Here we assume that the point of gaze is at a finite but relatively large distance, and therefore the optical axis and visual axis can be assumed to be parallel to each other [46].

The measurement of ocular torsion can be reduced to a one-dimensional signal processing task by forming an iral signature $\tau(\mu)$ from the pixel intensity values of the iris along a circular sampling path (arc/annulus) centered at the pupil center, where $\mu$ is the polar angle of a point on this circular sampling path. Iral signatures from each video frame are computed by sampling along an arc (which can have a finite width), which is projected from the head-fixed coordinates to the image coordinates.
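Because an iral signature sampled over a full annulus is a periodic one-dimensional signal, the shift between two signatures can be recovered by circular cross-correlation computed with the FFT, which is the operation formalized in Eqs. 6.18 and 6.19. The sketch below illustrates this on a synthetic signature; the function name `estimate_torsion`, the 1 degree sampling step, the full 360 degree sampling path, and the random test signal are assumptions made for the example, not details taken from the thesis.

```python
import numpy as np

def estimate_torsion(tau, tau_ref, step_deg=1.0):
    """Return the angular shift (degrees) of tau relative to tau_ref."""
    # Circular cross-correlation via the FFT: F^-1( F(tau) * conj(F(tau_ref)) ).
    xcorr = np.fft.ifft(np.fft.fft(tau) * np.conj(np.fft.fft(tau_ref))).real
    lag = int(np.argmax(xcorr))
    n = len(tau)
    if lag > n // 2:             # map lags to the signed range (-n/2, n/2]
        lag -= n
    return lag * step_deg

rng = np.random.default_rng(1)
tau_ref = rng.normal(size=360)           # stand-in reference signature, one sample per degree
tau = np.roll(tau_ref, 7)                # same pattern shifted by 7 samples
psi = estimate_torsion(tau, tau_ref)
psi_neg = estimate_torsion(np.roll(tau_ref, -5), tau_ref)
```

A real signature would be sampled along a partial arc and normalized first, and sub-sample accuracy would need interpolation around the correlation peak.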
The ocular torsion estimation is always relative to a reference eye position. We take the image of the eye looking to the front as a reference image, and the iral signature computed along circular arcs in that image is considered as the iral reference signature $\tau_o(\mu)$. Each of the per-frame signatures is then cross-correlated with the iral reference signature $\tau_o(\mu)$, as in Eq. 6.18:

\[ \mathrm{xcorr}(\tau(\mu), \tau_o(\mu)) = F^{-1}\big( F(\tau(\mu)) \cdot F(\tau_o(\mu))^{*} \big), \tag{6.18} \]

where $F(\tau(\mu))$ denotes the Fourier transform of $\tau(\mu)$. The angular displacement corresponding to the peak in the cross-correlation function gives the value of the ocular torsion relative to the reference globe orientation:

\[ \psi = \arg_{\mu} \Big[ \max_{\mu \in \Theta} \big( \mathrm{xcorr}(\tau(\mu), \tau_o(\mu)) \big) \Big], \tag{6.19} \]

where $\Theta$ is the range of $\mu$.

As in [114], we also used an elastic pupil-iris model to dynamically determine the updated radial location of the sampling arc.

Figure 6.10: Iris detection and annulus selection for ocular torsion estimation.

Figure 6.11: Ocular torsion and eye velocity in eye movements showing optokinetic nystagmus.

6.3.5 Eye Blink Detection

In our project, the variance projection function (VPF) is used, as in [113], to estimate the location of the eyelids. The VPF was originally used in [37] to guide the detection of eye position and shape for applications in human face recognition.

In the preprocessing step, a 2D Gaussian filter is applied over the grayscale image of the eye region to remove eyelashes. Then the input image is processed using a directional Sobel edge detection filter to detect the horizontal edges, assuming that the eyelids are almost horizontal. On the filtered image, the VPF is estimated to detect the location of the eyelids in the image.

The VPF can be computed in both the horizontal and vertical directions. In our case, we have used the horizontal VPF on image $I_e$, which is an edge filtered version of the original input image from the eye tracker.
The VPF of I_e at a particular row of the image corresponding to the vertical coordinate y is given by:

\[ \mathrm{VPF}(y) = \frac{1}{x_2 - x_1} \sum_{x_i = x_1}^{x_2} \left[ I_e(x_i, y) - H_m(y) \right]^{2}, \tag{6.20} \]

where

\[ H_m(y) = \frac{1}{x_2 - x_1} \sum_{x_i = x_1}^{x_2} I_e(x_i, y). \tag{6.21} \]

We have attempted to solve the following problems in relation to eye blink estimation: (1) how to decide the state of the eye, whether it is open or blinking, and (2) how to detect and locate the eyelids.

In the plot of the horizontal VPF for a single eye, two distinct peaks can be observed that correspond to the upper and lower eyelids. The distance between the two peaks gives the eyelid distance profile (EDP) or blink profile [103]. If the EDP is less than a particular threshold, we identify the occurrence of an eye blink.

To localize the eyelids, a cubic spline is used to interpolate the available feature points of the eyelids, viz. the inner and outer eye corners and an edge point. The positions of the inner and outer corners in an eye image are estimated manually in a reference image, assuming that they do not move significantly during the eye movements. The edge point to be used is obtained as follows: the vertical coordinate of the edge point is computed from the location of the peaks of the VPF of I_e. If the eye is not in a blinking state, the horizontal location of the pupil gives the horizontal coordinate of the edge point. If the eye is in a blinking state, the eyelid should appear almost flat, so the position at the half width of the image can be safely taken as the horizontal coordinate of the edge point.

Figure 6.12: Blink profile and eyelid velocity profile at an instant of eye blink.

6.4 Gaze Estimation in World Coordinates

In Section 6.3.3, we described the method to obtain θ and φ, i.e. the horizontal and vertical eye movements respectively, w.r.t. the head-fixed coordinates with origin at the globe center. Subsequently, Section 6.3.4 provides a method to compute ocular torsion to get ψ (as defined in Section 6.3.3).
Using this information, we estimate gaze in the globe coordinates. In this section, we will discuss how to extend this method to estimate gaze in the world, by transforming the head-fixed coordinates to the world coordinates (or ground coordinates). Figure 6.13 shows all the coordinate systems used in our discussion. Some of them are already defined, and the rest will be defined in the following sections.

Figure 6.13: The coordinate systems used in our project.

As in [83], head pose is defined in terms of a ground-based (i.e., motionless with respect to the laboratory) coordinate system [g1, g2, g3] as shown in Figure 6.13. To efficiently compute the head pose, one must measure the positions of three points on the head, which must not be collinear. Let us denote these points by Ta, Tb, and Tc; they define a plane parallel to the frontal plane h2-h3. Since the head is assumed to be a rigid body, the positions of these points completely determine the head pose.

Figure 6.14: Head-fixed coordinate system with origin at the globe center. Ta, Tb, and Tc are markers on the helmet of the eye tracker. M1, M2, and M3 are markers used to compensate for the helmet slippage.

Figure 6.15: Helmet slippage in an eye tracking experiment with extreme head movement.

It is important to estimate the position of the center of the globe, which is computed using the head's anthropometric characteristics and the locations of the markers Ta, Tb, and Tc. The globe center is the origin of the eye-in-head coordinate system [h1, h2, h3]. The head orientation is defined as the orientation of the vector h1 with respect to the world coordinate system [g1, g2, g3], and is computed as

\[ h_1 = \frac{(T_c - T_b) \times [(T_c - T_a) \times (T_b - T_a)]}{\left| (T_c - T_b) \times [(T_c - T_a) \times (T_b - T_a)] \right|}. \tag{6.22} \]

Now, h2 and h3 are defined as follows:

\[ h_2 = \frac{T_c - T_b}{|T_c - T_b|}, \qquad h_3 = h_1 \times h_2. \tag{6.23} \]

Therefore, [h1, h2, h3] and the location of the center of the globe provide us with the transformation $^{h}E_{g}$, i.e. from world to the head-fixed coordinates centered at the globe center.
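The construction of the head-fixed axes in Eqs. 6.22 and 6.23 can be sketched in a few lines. This is only an illustrative sketch: the function name `head_basis` and the small vector helpers are hypothetical, marker positions are assumed to be 3D points in world coordinates, and the translation to the globe center is not included.

```python
import math


def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def _normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def head_basis(Ta, Tb, Tc):
    """Return (h1, h2, h3) from three non-collinear head markers."""
    # Normal of the plane spanned by the three markers.
    plane_normal = _cross(_sub(Tc, Ta), _sub(Tb, Ta))
    # Eq. 6.22: h1 is perpendicular to both (Tc - Tb) and the plane normal.
    h1 = _normalize(_cross(_sub(Tc, Tb), plane_normal))
    # Eq. 6.23: h2 points from Tb to Tc; h3 completes the frame.
    h2 = _normalize(_sub(Tc, Tb))
    h3 = _cross(h1, h2)
    return h1, h2, h3
```

By construction h1 is perpendicular to h2, so h3 is already a unit vector and the three axes form an orthonormal frame whenever the markers are not collinear.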
If P_h is the position vector of the point of regard in head-fixed coordinates centered at the globe center, then the position in the world coordinates P_g can be computed as

\[ P_g = {}^{h}E_{g} \cdot P_h. \tag{6.24} \]

By using the method just explained, we can estimate the gaze in 3D for a single eye. Since our eye tracker captures video of both eyes simultaneously, by combining the gaze from both eyes we can find the point of regard as the intersection of the visual axes of the two eyes. This allows us to analyze vergence, in addition to other eye movements such as saccades or smooth pursuit.

6.4.1 Helmet slippage compensation

The accuracy of videooculography-based techniques that use head-mounted eye trackers to determine the gaze from video depends on how firmly the helmet of the eye tracker is attached to the head. There may be slippage of the helmet of the eye tracker on the head. To determine that slippage we need to put some additional markers on suitable positions on the head, and can also use characteristic features in the eye image as intrinsic markers. We attach a few motion capture markers securely to bony landmarks on the head. These markers, M1, M2, and M3, are shown in Figure 6.14. Based on that information, we recalibrate the rotation $^{h}R_{c}$ and translation $^{h}T_{c}$ of the head-fixed coordinate system [h1, h2, h3] (with origin at the center of the globe) with respect to the camera-fixed coordinate system [c1, c2, c3].

In our experimental setup, we put markers (1) on bony landmarks on the head to get the transformation $^{h}E_{g}$, and (2) on the eye tracker to obtain $^{et}E_{g}$ using a standard motion capture system, where [g1, g2, g3] and [et1, et2, et3] represent the world and eye tracker coordinate systems respectively.

From the calibration at time t = 0, we can estimate $^{h}E_{c}$, which along with the previous transformations gives $^{c}E_{et}$ at time t = 0. Considering the eye tracker to be a rigid body, $^{c}E_{et}$ in fact remains constant for all time. Now, given $^{g}E_{h}$, $^{g}E_{et}$, and $^{c}E_{et}$ at any time t, we can compute $^{c}E_{h}$.
The rotation associated with the eye tracker helmet slippage can be obtained as follows:

\[ {}^{c}R_{h} = {}^{c}R_{et} \cdot {}^{g}R_{et}^{T} \cdot {}^{g}R_{h}. \tag{6.25} \]

6.5 Results

We performed several experiments to validate our proposed 3D gaze estimation technique. Our experimental setup consists of an 8-camera Vicon MX motion capture system (Vicon, Los Angeles) to track IR markers in the world coordinates, and a head-mounted C-ETD eye tracker (Chronos Vision, Berlin) to capture the head-unrestrained eye movements. Both systems record at 100 Hz. We implemented the algorithm in MATLAB (version 7.10.0, R2010a, MathWorks Inc.), and the software ran on a 2.67 GHz Intel Core i5 computer.

Figure 6.16: Distribution of point of regard (in green) while the subject fixates at the marker shown in red.

To validate our technique, the author, acting as subject, wore the C-ETD eye tracker, and put IR markers on the eye tracker and at different locations on the subject's head as described in Section 6.4. During calibration, as described in Appendix A, the subject visually followed the markers of the calibration bar while keeping their head fixed. By collecting the video using the eye tracker, and the world coordinates of the IR markers using the motion capture system, we perform the calibration to estimate the parameters mentioned in Appendix A. In regular experiment sessions, the subject had to fixate at different markers placed randomly in the motion capture environment. In Figure 6.16, we show the distribution of points of regard estimated from one of our experiments, when the subject fixates at a stationary marker. The standard deviation of the error distribution of the points of regard in three orthogonal directions in world space is shown in Figure 6.17 for the subject performing fixation at three different marker locations.
The sources of noise in our result, as we identified, are mainly: (1) an inaccurate noise model in image space, and (2) approximations in the geometric eye model.

Figure 6.17: Standard deviation of error in point of regard in binocular gaze estimation (in mm) along the X, Y, and Z directions of the world coordinates.

As shown in Figure 6.15, we also estimated the helmet slippage in a few of our eye tracking experiments. The helmet slippage is significant when the head moves to extreme orientations or moves fast.

6.6 Conclusion

The results of our experiment show that our videooculography-based technique can be successfully used for 3D gaze estimation. The major contribution of our proposed framework is combining eye blink detection with videooculography-based eye tracking. We also proposed a novel pupil occlusion detection method based on the pupil horizontal width profile, and update the pupil center accordingly (see Figure 6.8).

There are still a lot of possibilities to explore. One of those is tracking eye features in head-fixed coordinates instead of the image plane to incorporate the dynamics of the eye for more accurate tracking. Again, a particle filter-based approach will be more practical to use than Kalman filter tracking based on a Gaussian noise model.

Our 3D gaze estimation framework can easily be extended to non-IR video cameras and stereo camera systems. The main benefits of using the C-ETD in our experiments are: (1) it is a wearable device that keeps the camera-to-head transformation easy to compute, and (2) it captures more detailed close-up eye images. In future we can use a high-resolution, head-mounted video capturing device to capture colored eye features such as iris texture and skin textures as well. A more detailed iris texture can significantly improve the estimation of ocular torsion.
The application of our gaze estimation technique using a monocular camera setup is shown in Chapters 3 and 5.

Another issue that we would like to investigate further with our experimental setup is the pupil dynamics under different environmental lighting, and with the subject looking at different scenes on the screen. Furthermore, in Chapters 3 and 5, we extended our eyelid tracking algorithm to track the eyelid shape and its texture features.

Chapter 7
Anatomical Augmented Reality with Image-space Alignment

7.1 Background

Template-based 3D human anatomy registration systems register anatomy templates to user-specific data, usually captured using commodity RGB-D cameras. Using these systems we can also animate user-specific anatomy based on the user's motion. When we visualize registered anatomy by superimposing it on input video, it is often misaligned with the user's body. For applications such as video-based anatomy mirroring [9], it is important for the application to run in real-time and with proper alignment of the user's anatomy.

Some of the recent work has focused on registering human anatomy or reshaping the human body in images or videos using image-based approaches [72, 80, 112]. The work by [80] proposed a semi-automatic process that uses manually annotated features from multiple images to generate user-specific 3D anatomy. But this method is only for a static pose and is computationally expensive. Zhou et al. [112] use content-aware image warping to perform editing of 2D images of the human body at an interactive rate. But this requires user assistance, works only for limited occlusion, and the results have no temporal consistency. A similar work [72] improved on this by including background consistency. A body morphotype manipulation method for video footage was proposed in [54]. They edited videos with real-time manipulation and managed to do that at an interactive rate of 20 ms per frame.
But this work can only handle limited occlusions and cannot guarantee that the reconstructed model poses are correct. With a similar goal of morphotype manipulation, [81] proposed a real-time system using real-time body tracking that could alter the RGB video stream for human body reshaping. Unfortunately, this system produced unrealistic deformations and background artifacts, and handles occlusion only to a limited extent.

To provide an augmented reality experience by superimposing human anatomy on video, several techniques have been recently proposed [9, 14, 63]. These applications are becoming widely known as 'anatomical mirrors'. In [14], CT scans of the human abdomen are superimposed on video, but statically. A similar work [63] did not consider deformations of the anatomy regions when the user moves. In [1] they proposed anatomical mirroring using Kinect tracking but without any user-specific anatomy. The work most relevant to our research is [9], where they superimposed user-specific anatomy on input video to provide an augmented reality experience. But when the registered anatomy is projected and superimposed on the video we observe a lot of misalignments. Our work corrects these efficiently by using an image-based correction approach.

In this chapter, we will discuss our image-based correction framework that can correct alignment errors in anatomy-superimposed videos using data collected from 3D anatomy registration systems, such as [9]. Our method efficiently computes these corrections in image-space and avoids performing full three-dimensional corrections. It can be easily coupled with the real-time 3D anatomy registration system proposed by [9] without significantly reducing the frame rate. We also propose an occlusion handling technique that can estimate relative occlusions among different anatomy regions.
We qualitatively and quantitatively show that we can correct misalignment of anatomy regions very efficiently at an interactive rate, improving the initial registration.

7.2 Methods

Bauer et al. [9] proposed a system that can efficiently register user-specific anatomy using an RGB-D camera and can provide interactive visualization of its motion. In this chapter, we will refer to this system as the 3D registration system.

The feedback obtained from the user study in [9] suggests that a supplementary process is needed to improve the quality of the overlay. Anatomy misalignments are particularly visible when anatomy is superimposed onto the user's color map (augmented reality visualization). For example, anatomical limbs can sometimes get superimposed outside the user's limbs seen in the image, as shown in Figure 7.1 (e.g., arms and hands).

Image-based Correction. In this chapter, we will describe a method to solve the anatomy misalignment problem efficiently in the image domain. Our image-based corrective registration reduces the errors that accumulate during the previous system steps of [9], such as motion capture, anatomy transfer, image generation, and animation.

Occlusion Handling. Initially, we perform these image-based corrections separately for different anatomy regions (see Figure 7.1). To generate the final overlay image we need to determine the order in which these corrected images should be overlaid. Therefore, we propose an image-based occlusion handling technique that automatically determines how to overlay these images relative to each other in the final augmented reality rendering.

Figure 7.1: We propose an image-based correction (step 1), and an occlusion estimation and layering technique (step 2). In the first step, we correct anatomy regions separately.
In the second step, we combine them in the correct order to generate the final overlay image.

7.2.1 Input Data From 3D Registration

By overlaying registered 3D anatomy on the 2D color maps we generate anatomy images (Figure 7.1, top row) corresponding to a set of predefined 3D anatomy regions. Each anatomy region is a collection of bones, organs, and muscles specific to a particular region of the registered 3D anatomical model. For example, in Figure 7.1, we show anatomy images corresponding to the five anatomy regions that we use in our examples. Unfortunately, after superimposing them onto the corresponding color maps we can clearly observe that they are misaligned (see Figure 7.1, bottom-left). Using our image-based correction, we warp these anatomy images so as to correct these misalignments. In [9], they defined an enhanced body tracking system composed of 25 joints according to the Kinect SDK body tracking skeleton. By mapping their enhanced body tracking joints to image-space, we generate anatomy landmarks. Similarly, by mapping the Kinect SDK body tracking skeleton we generate Kinect landmarks. Our image-based correction method uses these two sets of landmarks and the Kinect depth map to correct anatomy misalignments.

Figure 7.2: Overview of our corrective registration system.

Since we correct these misalignments in a reduced dimension (in image-space), small projective errors occur, but the corrections can be performed very efficiently. The two main stages of our system are: (a) image-based correction, and (b) occlusion estimation and layering. The overview of this pipeline can be seen in Figure 7.2.

7.2.2 Image-based Correction

The image-based correction stage can be further divided into three key steps: (a) feature estimation, (b) landmark correction, and (c) updated anatomy image generation.
We will describe these steps below.

Feature Estimation

We estimate two types of image features: first, we find a set of anatomy features S in the anatomy images, and second, a set of depth features D in the Kinect depth map. Let us consider that we have N (= 5 in our examples) anatomy regions. Subsets of the Kinect and anatomy landmarks are assigned to each of the anatomy regions based on the contribution of the 3D joints corresponding to these landmarks in producing tissue movements in that region. We also describe below how we can estimate N depth contours corresponding to the anatomy regions and estimate depth features from these contours.

To estimate depth contour points, we first detect edges corresponding to the depth discontinuities in the Kinect depth map using the Canny edge detector [21], and then we compute the external contour using the contour detection algorithm proposed in [97]. For each contour point we find the closest Kinect landmark. Conversely, for each Kinect landmark we obtain a set of depth contour points. For an anatomy region, we estimate the depth contour by taking the union of all depth contour points corresponding to the Kinect landmarks of that region. Similarly, we estimate the anatomy contour of an anatomy region by estimating the contour around the rendered anatomy region (in the corresponding anatomy image).

Figure 7.3: Feature estimation for the left arm. Left: anatomy features (S) (blue) are estimated using anatomy landmarks (red) and sub-landmarks (green). Right: depth features (D) are estimated using Kinect depth landmarks (red) and sub-landmarks (green). Here we are showing one sub-division of landmarks.

Anatomy Feature Estimation. We denote the anatomy features for the ith anatomy region as S_i, which are estimated in the corresponding anatomy image. For each anatomy landmark of the anatomy region we estimate a normal vector whose direction is the average of the normals to the lines connecting the landmark to the adjacent anatomy landmarks.
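The averaged normal at a landmark can be sketched as below. This is a sketch under assumptions that are not stated in the text: landmarks are 2D image points ordered along a limb, each landmark has exactly two neighbors, and both segment normals are taken on the same (left) side so that they can be averaged. The name `landmark_normal` and its helpers are hypothetical.

```python
import math


def _unit(v):
    norm = math.hypot(v[0], v[1])
    return (v[0] / norm, v[1] / norm)


def _left_normal(a, b):
    """Unit normal of the segment a->b, rotated 90 degrees counter-clockwise."""
    d = _unit((b[0] - a[0], b[1] - a[1]))
    return (-d[1], d[0])


def landmark_normal(prev_pt, pt, next_pt):
    """Average of the normals of the two segments meeting at pt."""
    n1 = _left_normal(prev_pt, pt)
    n2 = _left_normal(pt, next_pt)
    # The sum is zero only if the chain folds back on itself at pt;
    # that degenerate case is not handled in this sketch.
    return _unit((n1[0] + n2[0], n1[1] + n2[1]))
```

For a landmark at the end of a chain, which has a single neighbor, the normal of that one segment would be used directly.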
This vector intersects the anatomy contour at two points and we add these points to S_i. If desired, the lines can be further sub-divided to generate sub-landmarks, increasing the number of features; in our implementation we subdivided the lines 6 times. Adding more features increases the robustness by reducing the contribution of outliers (see Figure 7.3).

Depth Feature Estimation. We denote the depth features of the ith anatomy region as D_i. Similar to the estimation of anatomy features, we can estimate depth features by finding intersections of the normal vectors from Kinect landmarks with the depth contour of the anatomy region (see Figure 7.4). Depth contours are mostly fragmented and not closed because the transition between anatomy regions generates depth discontinuities. Since we have missing points in the contour, sometimes normal vectors do not intersect with depth contours. In that case, we do not add any depth features for that Kinect landmark. At the same time, we drop the anatomy features of the corresponding anatomy landmark to ensure one-to-one correspondences between anatomy and depth features. Depth maps are often noisy because of noise in the raw data of the Kinect depth sensor, which causes erroneous depth feature estimation. We apply a Kalman filter temporally to the depth feature locations to remove the noise.

Landmark Correction

Kinect landmarks provide an estimate of the skeleton in the 2D image space, which we will call the Kinect 2D skeleton. Similarly, we will call the corresponding skeleton formed by the anatomy landmarks the anatomy 2D skeleton. But if we warp our anatomy regions using a map learned from anatomy landmarks to Kinect landmarks, it does not ensure that the mapped regions will entirely reside within the depth contours.
To maintain smoothness in shape at the boundary of the warped region, we seek a reliable warp map, which we call T.

We use a thin plate spline-based interpolation (see [16]) represented with radial basis functions (RBF) to approximate the mapping T from anatomy features to depth features. This mapping is composed of an affine transformation and a non-affine deformation. We use a regularizer parameter λ to control the influence of the non-affine deformation part, which refers to a physical analogy of bending.

Figure 7.4: Feature estimation: when the preliminary 3D registration is superimposed on the input image (a), we can observe misalignment of the anatomy (b). We estimate anatomy features (d) from the intermediate anatomy images (c). In the bottom row, we show estimation (f) and segmentation (g) of depth contours from the depth map (e), and estimated depth features (h).

For an anatomy region with M feature points, the parameters of the RBF function are estimated by solving Eq. 7.1:

\[ D_i = A S_i + \lambda \sum_{i=1}^{M} \sum_{j=1}^{M} w_j \, R(\lVert s_i - d_j \rVert), \tag{7.1} \]

where D_i and S_i are M×2 matrices containing the locations of the depth and anatomy features respectively, s_i and d_i are the locations of the ith anatomy and depth features, and A is an M×M affine transformation matrix. w represents the weights of the non-affine bending transformation, and R(r) = r^2 log(r) is an RBF function. Additionally, we include the Kinect landmarks and anatomy landmarks in the matrices D and S respectively.

We can rewrite Eq. 7.1 in matrix form as:

\[ D_i = \begin{bmatrix} K(R) & P^{T}(S_i) \\ P(S_i) & 0 \end{bmatrix} \begin{bmatrix} W \\ A \end{bmatrix}, \tag{7.2} \]

where P contains the homogenized S_i, K contains the values of the RBF functions, and W is a vector with the non-affine weights. We can further simplify Eq. 7.2 as D_i = M_i(S_i) X_i, where M_i and X_i represent the first and second matrix on the right-hand side of Eq. 7.2. Now, by combining the equations of all the body parts in one global equation, we can write:

\[ \begin{bmatrix} D_1 \\ \vdots \\ D_N \end{bmatrix} = \begin{bmatrix} M_1(S_1) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & M_N(S_N) \end{bmatrix} \begin{bmatrix} X_1 \\ \vdots \\ X_N \end{bmatrix}. \tag{7.3} \]

We can rewrite Eq.
7.3 as:

\[ D = \tilde{M}(S) \, \tilde{X}. \tag{7.4} \]

In our current implementation N = 5, and the sizes of M̃ and D are 240×240 and 240×2 respectively. Finally, we can also write Eq. 7.3 as D = T(M̃(S), X̃), where T maps the anatomy features in S to the depth features in D, and X̃ includes the parameters of the mapping.

We solve Eq. 7.4 to estimate the mapping parameters X̃ of T by formulating it as a linear least squares system for given S and D, which include anatomy and depth features that have one-to-one correspondences. In our regularized optimization framework, we constrain the anatomy landmarks with soft constraints to map to the Kinect landmarks. T can be used to warp anatomy regions such that they remain enclosed within the depth contour while maintaining a smooth boundary shape. Note that T is composed of N separate mappings corresponding to the anatomy regions. By remapping the anatomy landmarks using T, we also obtain a better estimate of the original Kinect 2D skeleton formed by the Kinect landmarks, which we call the updated 2D skeleton. Furthermore, to ensure the connectivity of landmarks across different anatomy regions, we set the location of shared landmarks to the average of their estimates for the different anatomy regions. Figure 7.5 shows the resulting landmark corrections.

Figure 7.5: Landmark correction: our skeleton correction algorithm corrects the initial Kinect 2D skeleton in image space (b) to produce more consistent configurations (c).

Note that depth contours are noisy if the user wears loose clothes, which in turn makes the depth features noisy. Therefore, we prefer to maintain a smooth shape of the mapped anatomy region instead of mapping anatomy features exactly to the depth features. By picking a suitable λ in Eq. 7.1 we can control the smoothness. In our current implementation we chose λ = 0.01.

Updated Anatomy Image Generation

As explained previously, each anatomy region corresponds to an anatomy image; therefore, to warp an anatomy region, we simply need to warp the corresponding anatomy image.
We decided to separately warp each of these images based on the transformation from the anatomy 2D skeleton to the updated 2D skeleton. To obtain the final anatomy-corrected rendered image, we combine these warped anatomy images into a single composite image. For each pixel of that composite image, we should render the anatomy region closest to the camera (i.e., with the smallest depth value). To estimate the closest anatomy region, we propose a novel occlusion estimation and layering algorithm.

Image Warping. We generate bounding boxes, which we call anatomy cages, around the bones of the updated 2D skeleton. Now, our goal is to deform these anatomy cages based on a deformation field that we will describe next. Figure 7.6 shows warping of the right upper limb with cages.

Using T as a deformation field is not a good choice for two reasons. First, T does not include the additional deformations we introduced in the post-processing step (where we modify the locations of the mapped anatomy landmarks). Second, T is reliable within the regions enclosed by the anatomy landmarks, but if we use these landmarks to generate the cages, we will end up having a piecewise linear approximation of the boundary of the anatomy in the image space. Therefore, we chose to use wider cages to get a smoother anatomy boundary after warping. This puts the cage points outside the regions enclosed by the landmarks, and so we cannot use T reliably. Therefore, instead of T, we use dual-quaternion skinning [56] driven by the skeleton to perform image warping from the anatomy landmarks to the updated 2D skeleton landmarks. Using this, we estimate deformed anatomy cages corresponding to each of the anatomy cages.

Figure 7.6: Cage generation and warping of the right upper limb. Yellow points are anatomy landmarks; white ones are updated landmarks. The red zones represent anatomy cages.
To warp the anatomy images, we triangulate the anatomy cages, and then we estimate the affine transformation of each individual element of the cages from the original anatomy cage to the deformed anatomy cage configuration. Using bilinear interpolation we then warp the anatomy image pixels inside the triangles. We call the new warped images warped anatomy images. We define one for each anatomy region. Figure 7.7 shows warping results for the complete set of anatomy images.

Figure 7.7: Image-based corrective registration: misalignments in the anatomy regions observed in (a) are corrected by our image-based corrective algorithm to produce (b).

7.2.3 Occlusion Estimation and Layering

As mentioned before, the main challenge in generating a final composite image of the warped anatomy images is to figure out which anatomy region is closest to the camera for a given view. If images are naively combined as layers, occlusions such as in Figure 7.8 (b) can occur. In this case, the anatomy region corresponding to the torso is occluding the hands, which is not what we expect. Our method described below tackles this problem.

We first generate synthetic depth images for the anatomy regions based on the Kinect 2D skeleton. For each anatomy region, we know the 3D configuration of the corresponding Kinect joints in the Kinect 2D skeleton. We model cylindrical primitives around the bones. The radius is set equal to the maximum cross-section radius of the corresponding anatomy region. Using the projection matrix of the camera, for each anatomy region, we render a depth map. We call this image the anatomy depth image.

The size and resolution of the warped anatomy images and the composite image are the same. For each warped anatomy image, we categorize the pixels into two types: valid when pixels belong to the anatomy, and invalid when they do not.
In the composite image domain we loop through all the pixels: for each pixel, we check if at that location any of the warped anatomy images contains a valid pixel. If not, we set that pixel to black. If yes, we check which of the warped anatomy images contain valid pixels. Out of all those warped anatomy images we pick the one that is closest to the camera. The distance from the camera is determined based on the anatomy depth images. We then update the pixel of the composite image with the color of that closest warped anatomy image. In Figure 7.8 (c) we can see how our algorithm corrected the occlusion problem of (b).

Figure 7.8: Occlusion handling: our image-based corrective algorithm corrects misalignments in the rendering (a) of the initial 3D registration by warping anatomy regions in image space and in separate layers. Rendering them without knowing their relative distances from the camera creates occlusions (b). Our occlusion handling algorithm can recover these relative distances and render these regions in the correct order (c).

7.2.4 Evaluation

We have shown qualitative results of our landmark correction in Figure 7.5, where we used non-linear thin plate spline-based interpolation to model the deformation. Figure 7.7 shows how we improve anatomy registration by applying dual-quaternion skinning based on the deformations produced by landmark correction. The results of our occlusion handling algorithm are shown in Figure 7.8. We quantitatively analyze the results of our image-based corrective registration using an anatomy intersection coefficient η. If n_f is the total number of anatomy pixels in the final composite image and n_k is the number of those anatomy pixels that also belong to the user according to the Kinect depth map, we can define η as η = n_k / n_f. In Table 7.1, results are shown for two video sequences, and they clearly indicate the improvement after applying our image-based corrective registration algorithm.
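The per-pixel layering rule and the anatomy intersection coefficient η described above can be sketched as follows. This is a simplified illustration, not the routines used in this work: images are plain nested lists, an invalid pixel is represented by `None`, and the names `composite_layers` and `intersection_coefficient` are hypothetical.

```python
def composite_layers(warped_images, depth_images, background=(0, 0, 0)):
    """Keep, at each pixel, the valid color of the layer closest to the camera.

    warped_images: list of 2D grids holding an RGB tuple or None (invalid).
    depth_images: list of 2D grids of depths, one per layer, same size.
    """
    height = len(warped_images[0])
    width = len(warped_images[0][0])
    # Pixels with no valid layer stay black.
    composite = [[background for _ in range(width)] for _ in range(height)]
    for y in range(height):
        for x in range(width):
            best_depth = None
            for image, depth in zip(warped_images, depth_images):
                color = image[y][x]
                if color is None:
                    continue
                if best_depth is None or depth[y][x] < best_depth:
                    best_depth = depth[y][x]
                    composite[y][x] = color
    return composite


def intersection_coefficient(anatomy_mask, user_mask):
    """eta = n_k / n_f for two boolean masks of the same size."""
    n_f = 0  # anatomy pixels in the composite image
    n_k = 0  # anatomy pixels that also belong to the user
    for anatomy_row, user_row in zip(anatomy_mask, user_mask):
        for is_anatomy, is_user in zip(anatomy_row, user_row):
            if is_anatomy:
                n_f += 1
                if is_user:
                    n_k += 1
    # eta is undefined when the composite contains no anatomy pixels.
    return n_k / n_f if n_f > 0 else None
```

Here `user_mask` stands for the set of pixels attributed to the user in the Kinect depth map; how that mask is extracted is outside the scope of this sketch.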
Our unoptimized routines take around 75 ms per frame to perform this correction.

Video            Before           After
Squat motion     0.939 ± 0.184    0.994 ± 0.026
Hand crossing    0.961 ± 0.159    0.988 ± 0.089

Table 7.1: Image-based corrective registration results: anatomy intersection coefficient before and after corrections.

Figure 7.9 shows the temporal profile of η for individual anatomy regions in the squat motion video. Anatomy regions are represented with different colors. As we can see, η consistently remains close to 1 after our corrections. Thus, comparing the results before and after image-based correction, we observe a significant reduction in the alignment errors.

Furthermore, we can make our corrective registration faster by generating warped anatomy images only when the η value of an anatomy region is below a certain threshold (which means the region is not well aligned). For example, in the 81 frames of the squat motion video we originally estimate 405 warped anatomy images (i.e., 81 video frames × 5 anatomy regions). If we set the threshold of η to 0.9, we reduce this number to 137, which is a 66.2% reduction.

Limitations. In the 3D registration system, the errors in orientation of the anatomy regions produce wrong color maps of the anatomy regions. Since we use these color maps as the color or texture of the anatomy regions, we cannot correct orientation errors in the image-based correction step.

Figure 7.9: The anatomy alignment coefficients for anatomy regions are shown before and after image-based correction for the squat sequence.

Furthermore, the image-based correction uses color maps rendered by the 3D registration system. In our current implementation, we save these color maps to disk and read them later for image-based corrections. These computationally expensive file operations prevent real-time image-based misalignment correction. Furthermore, with the current latency, the combined system does not satisfy "motion fluidity and delay" (criterion C04 as mentioned in [9]).
Towards the goal of a combined real-time system, in future implementations, we will read the color maps from memory.

7.3 Conclusion

We proposed an image-based corrective registration method to correct the errors that build up during the steps of a 3D anatomy registration system: motion capture, anatomy transfer, image generation, and animation. Currently, our combined pipeline is not real-time due to expensive file read and write operations. In the future, we plan to read color maps from memory instead, and build a combined real-time system. Another limitation of the current image-based corrective registration is that we cannot correct the errors in orientation of the anatomy regions relative to the bones. In the future, to solve this problem, we can use an image-based hybrid solution, such as [54, 81, 112], which use a 3D morphable model to fit to some features in 2D images. In our case, we can model our 3D anatomical reference model as a morphable model and then fit it based on 2D joint locations of the updated Kinect 2D skeleton. Then, we can re-render the anatomy regions from the camera view to generate updated anatomy images. This should be able to recover the color of anatomy regions that were occluded due to orientation errors in the 3D registration system. After that we can follow our usual skinning and occlusion handling routines to generate final results.

Chapter 8
Conclusion and Future Work

In this thesis, we have presented our studies to measure, model, and animate subtle skin motions in the region around the eyes, which we call the eye region. Toward that goal, we built a system with several components that utilize the anatomical structure of soft tissues of the eye region for efficient and robust computation of soft tissue movements.
We use a reduced coordinate framework to model thin sheetlike structures of skin and the muscles of the eye region for three purposes: (1) simulation of the soft tissue movements, (2) measurement of gaze and subtle skin movements when skin slides over underlying tissues, and (3) modeling and animation of skin motions for real-time facial animation. We also studied different methods to correct misalignment in 3D anatomy registration systems, and proposed an efficient image-based corrective registration method to correct these errors. In this chapter, we summarize our findings and discuss directions for future work.

Gaze tracking is an important component of face tracking and analysis. In this dissertation, we reviewed gaze tracking technologies and discussed how to do gaze tracking robustly. After our investigation, we proposed a monocular gaze tracking method that can be used to track 3D gaze and detect eye blinks. We implemented our algorithm for both head-mounted camera devices and studio monocular camera setups. We also corrected helmet slippage using a set of infrared motion capture markers. Our gaze tracking method proved to be very useful in the work discussed in Chapter 5, where we tracked 3D gaze in videos captured using a monocular high speed camera to analyze how 3D gaze correlates with skin motion measurements.

Looking at the face, we observed that the soft tissues of the eye region produce very quick and subtle movements. Furthermore, the skin and muscles of this region are thin and sheetlike, and therefore we model them using a reduced coordinate framework to perform efficient simulation of tissue movements. These soft tissues can also be modeled in a layered structure, and their motions can be simulated in a common 2D reduced coordinate system.
Inspired by eye movement studies, we have also shown that by using a simple pulse-step controller we can produce eyelid and facial muscle movements that closely follow experimentally collected data.

With a physics-based simulation model of the face, we attempted to estimate tissue parameters in the model by matching simulation results with experimental data, but with limited success. We therefore pivoted toward a measurement-based approach for facial animation. To analyze the motions of skin when we generate facial expressions, we again used our reduced coordinate framework to model skin as a sliding surface moving over underlying tissues. This lets us track skin in reduced image-space using standard optical flow techniques in a video captured using a high speed monocular camera, and we can reconstruct 3D skin movements. Therefore, just by tracking features in 2D image-space we can recover subtle 3D motions of the skin of the eye region.

Using 3D gaze and skin movement data collected by our system, we proposed a simple linear regression-based skin motion model that can predict the 3D skin configuration for a given gaze. Thanks to the efficiency of our reduced coordinate framework and lightweight skin motion model, all calculations can be performed interactively. To demonstrate the results and potential applications in facial animation, we developed a real-time interactive web-based application. Our application runs across mobile devices and laptops, and generates photo-realistic facial animations in real-time.

Finally, we have proposed an image-space anatomy registration correction algorithm that corrects anatomy misalignments that occur when the output from a 3D anatomy registration system is overlaid on video for applications such as anatomy mirroring. These misalignments occur due to errors in several stages of a 3D registration system, e.g., motion capture, anatomy transfer, image transfer, and animation.
We also proposed an occlusion handling and layering algorithm that automatically decides how to warp and overlay different anatomy regions in image-space. Our results show that we can effectively correct these misalignments automatically at an interactive rate.

Overall, this thesis discusses novel methods to measure, simulate, and animate subtle facial motions around the eye region. The results show that using these methods we can build systems to track gaze and skin motions efficiently, and train machine learning models with these measurements to produce interactive facial animations reconstructing subtle but important skin motions in the eye region. These systems can be widely used in the fields of facial animation, facial expression analysis, and physics-based simulation, and can provide further understanding of facial tissue movements in human communication.

8.1 Future Work

Our work focuses on measurement and animation of soft tissue movements around the eyes for two main reasons: (1) the eye region plays a significant role in expressing our emotions, and (2) the sheetlike structure of muscle and tissues in this region can be used as a priori knowledge to make computations for soft tissue measurement and simulation more efficient and robust. There are plenty of ways to extend our work so that we can measure and animate the whole face. This will require some modifications to our existing techniques, along with some additional technologies that can be borrowed from recent research in this field.

Facial Measurements. To modify our system for measurement of facial skin movements beyond the eye region, we have to look carefully at some of the challenges. Our skin movement measurement method assumes that skin slides on an underlying body. In our example for the eye region, we assume that the underlying bodies on which skin is sliding are the facial muscles, which are also sheetlike and conform to the shape of the skull.
But for other regions of the face, this is not necessarily true. For example, the muscles in the cheek region are interconnected as a complex system and do not lie on the surface of the skull. Therefore, skin does not slide over a well-defined body. In this case, we have to model the underlying body based on the overall surface generated by the muscle system. We can rely on some global facial features that we can track from video to estimate the 3D shape of the underlying body. We can relate this to our hand tracking example in Chapter 3, where we can obtain the underlying shape of the hand using a standard registration technique. In the case of the face, we can use techniques such as [17] to estimate the shape of the face. Then, using our reduced coordinate representation of skin, we can measure skin sliding movements over the global face shape.

Facial Simulation. We have proposed a physics-based simulation framework for facial animation of the eye region, but we were not successful in estimating the activation parameters of the tissue model from video data. The problem was that we could not perform an accurate subject-specific facial anatomy registration. In the future, a method that automatically generates anatomical face simulation models, such as [25], using 3D registration techniques could be useful. Furthermore, in the future, we could use a classifier trained on data collected by simulating our physics-based system for different muscle activations, and learn a mapping from tissue configurations to muscle activations. This mapping will be highly nonlinear due to the complex muscle behaviors, and therefore we can use artificial neural network-based classifiers.

Facial Animation. For facial animation, we have used a simple linear regression-based model to animate skin of the eye region based on gaze. Use of a linear model allows very efficient GPU implementation for real-time animation.
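As a rough illustration of such a linear gaze-to-skin model, the sketch below fits one affine map per skin coordinate by ordinary least squares and evaluates it for a new gaze. It is not the model trained in this thesis: the two-angle gaze feature with a bias term, the flat skin coordinate vector, the absence of regularization, and all function names are assumptions made for the example.

```python
from typing import List, Sequence


def solve_linear_system(a: Sequence[Sequence[float]],
                        b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: gaze samples are degenerate")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_skin_model(gazes: Sequence[Sequence[float]],
                          skins: Sequence[Sequence[float]]) -> List[List[float]]:
    """Least-squares fit; weights[k] = [w_theta, w_phi, bias] for coordinate k."""
    feats = [[g[0], g[1], 1.0] for g in gazes]
    dim = 3
    # Normal equations (A^T A) w = A^T y, shared by all skin coordinates.
    ata = [[sum(f[i] * f[j] for f in feats) for j in range(dim)]
           for i in range(dim)]
    weights = []
    for k in range(len(skins[0])):
        aty = [sum(f[i] * s[k] for f, s in zip(feats, skins))
               for i in range(dim)]
        weights.append(solve_linear_system(ata, aty))
    return weights


def predict_skin(weights: Sequence[Sequence[float]],
                 gaze: Sequence[float]) -> List[float]:
    """Evaluate the affine model for one gaze direction."""
    feat = (gaze[0], gaze[1], 1.0)
    return [sum(w_i * f_i for w_i, f_i in zip(w, feat)) for w in weights]
```

Prediction is a single small matrix-vector product per frame, which is what makes a model of this form cheap to evaluate at interactive rates.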
But skin motions do not always change linearly in response to gaze, as indicated by the large errors shown by our model in some extreme gaze directions. To reduce these errors we can use a non-linear artificial neural network model, but such a model requires a lot of training data, is computationally expensive, and is often not real-time. In the future, by collecting a large number of data samples, and using recent developments in GPU-accelerated real-time deep learning research [3, 111], we could model non-linearity in skin movements and predict these motions in real-time.

Image-based Anatomy Correction. In the current implementation of our image-based anatomy correction system, we save the color maps obtained from a 3D registration system, such as that of Bauer et al. [9], to disk and read them later for our corrections. Therefore, our pipeline is not real-time due to expensive file read and write operations. To speed up our system, in the future, we plan to read color maps from memory instead. Furthermore, our current system cannot correct errors in the 3D orientation of the anatomy registered by the 3D registration system. In the future, we can use image-based hybrid solutions, such as [54, 81, 112], and fit our 3D anatomy template to some features in video prior to generating anatomy images, for better misalignment corrections.

Bibliography

[1] Anatomie spiegel. http://anatomiespiegel.de/. Online; accessed 11 June 2018. → pages 122
[2] AlteredQualia. WebGL skin. https://www.chromeexperiments.com/experiment/webgl-skin, Sep. 2011. Online; accessed 11 June 2018. → pages 9, 72
[3] A. Angelova, A. Krizhevsky, V. Vanhoucke, A. S. Ogale, and D. Ferguson. Real-time pedestrian detection with deep network cascades. BMVC, pages 32–1, 2015. → pages 144
[4] R. Atienza and A. Zelinsky. Active gaze tracking for human-robot interaction. Proceedings of the 4th IEEE International Conference on Multimodal Interfaces, pages 261–, 2002. → pages 92
[5] D. H. Ballard.
Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition, 13(2):111–122, Jan. 1981. → pages 98
[6] D. Baraff and A. Witkin. Large steps in cloth simulation. Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pages 43–54, 1998. → pages 56
[7] Baron-Cohen et al. The “reading the mind in the eyes” test revised version: A study with normal adults, and adults with asperger syndrome or high-functioning autism. Jour. of child psychology and psychiatry, 42(2):241–251, 2001. → pages 2, 25
[8] A. Bartoli et al. On template-based reconstruction from a single view: Analytical solutions and proofs of well-posedness for developable, isometric and conformal surfaces. Computer Vision and Pattern Recognition, IEEE Conference on, pages 2026–2033, 2012. → pages 25
[9] A. Bauer, A.-H. Dicko, F. Faure, O. Palombi, and J. Troccaz. Anatomical mirroring: real-time user-specific anatomy in motion using a commodity depth camera. Proceedings of the 9th International Conference on Motion in Games, pages 113–122, 2016. → pages 11, 12, 121, 122, 123, 125, 138, 145
[10] T. Beeler and D. Bradley. Rigid stabilization of facial expressions. ACM Transactions on Graphics (TOG), 33(4):44, 2014. → pages 51
[11] T. Beeler, F. Hahn, D. Bradley, B. Bickel, P. Beardsley, C. Gotsman, R. Sumner, and M. Gross. High-quality passive facial performance capture using anchor frames. ACM Trans. Graph., 30(4):75:1–75:10, July 2011. → pages 72
[12] A. Bermano, T. Beeler, Y. Kozlov, D. Bradley, B. Bickel, and M. Gross. Detailed spatio-temporal reconstruction of eyelids. ACM Trans. on Graphics (TOG), 34(4):44, 2015. → pages 25
[13] B. Bickel, M. Botsch, R. Angst, W. Matusik, M. Otaduy, H. Pfister, and M. Gross. Multi-scale capture of facial geometry and motion. ACM Transactions on Graphics (TOG), 26(3):33, 2007. → pages 72
[14] T. Blum, V. Kleeberger, C. Bichlmeier, and N. Navab. mirracle: An augmented reality magic mirror system for anatomy education. Virtual Reality Short Papers and Posters (VRW), 2012 IEEE, pages 115–116, 2012. → pages 11, 14, 122
[15] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. Computer Vision – ECCV 2016, Oct. 2016. → pages 5, 24
[16] F. L. Bookstein. Principal warps: Thin-plate splines and the decomposition of deformations. IEEE Transactions on pattern analysis and machine intelligence, 11(6):567–585, 1989. → pages 128
[17] S. Bouaziz, Y. Wang, and M. Pauly. Online modeling for realtime facial animation. ACM Trans. on Graphics (TOG), 32(4):40, 2013. → pages 5, 24, 28, 30, 35, 144
[18] S. P. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004. → pages 56
[19] C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3d shape from image streams. Computer Vision and Pattern Recognition, IEEE Conference on, 2:690–696, 2000. → pages 25
[20] T. Brox and J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 33(3):500–513, 2011. → pages 31
[21] J. Canny. A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence, (6):679–698, 1986. → pages 126
[22] E. Cerda and L. Mahadevan. Geometry and physics of wrinkling. Physical review letters, 90(7):074302, 2003. → pages 49
[23] K.-J. Choi and H.-S. Ko. Stable but responsive cloth. ACM Transactions on Graphics (TOG), 21(3):604–611, 2002. → pages 49
[24] Chronos. http://www.chronos-vision.de/downloads/CV Product C-ETD.pdf, 2004. Online; accessed 11 June 2018. → pages 92, 96
[25] M. Cong, M. Bao, K. S. Bhat, R. Fedkiw, et al. Fully automatic generation of anatomical face simulation models. Proceedings of the 14th ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 175–183, 2015. → pages 45, 144
[26] T. F. Cootes, G. J. Edwards, C. J. Taylor, et al. Active appearance models. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(6):681–685, 2001. → pages 25
[27] Y. Dai, H. Li, and M. He. A simple prior-free method for non-rigid structure-from-motion factorization. International Journal of Computer Vision, 107(2):101–122, 2014. → pages 25
[28] A. Del Bue. A factorization approach to structure from motion with shape priors. Computer Vision and Pattern Recognition, IEEE Conference on, pages 1–8, 2008. → pages 25
[29] A. T. Duchowski. A breadth-first survey of eye-tracking applications. Behav Res Methods Instrum Comput, 34(4):455–470, Nov. 2002. → pages 5, 92
[30] P. Ekman and W. V. Friesen. Facial action coding system. 1977. → pages 77
[31] P. Ekman and E. L. Rosenberg. What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS). Oxford University Press, 1997. → pages 57
[32] G. ElKoura and K. Singh. Handrix: animating the human hand. Proceedings of the 2003 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 110–119, 2003. → pages 45
[33] EntoKey. https://entokey.com/eyebrows-eyelids-and-face-structure-and-function/. Online; accessed 11 June 2018. → pages xvi, 19
[34] C. Evinger, K. A. Manning, and P. A. Sibony. Eyelid movements. mechanisms and normal data. Investigative ophthalmology & visual science, 32(2):387–400, 1991. → pages 7, 8, 59
[35] Y. Fan, J. Litven, D. I. Levin, and D. K. Pai. Eulerian-on-lagrangian simulation. ACM Transactions on Graphics (TOG), 32(3):22, 2013. → pages 8
[36] B. E. Feldman, J. F. O’Brien, B. M. Klingner, and T. G. Goktekin. Fluids in deforming meshes. ACM SIGGRAPH/Eurographics symposium on Computer Animation, pages 255–259, 2005. → pages 56
[37] G.-C. Feng and P. C. Yuen. Variance projection function and its application to eye detection for human face recognition. Pattern Recognition Letters, 19(9):899–906, 1998. → pages 111
[38] A. Fick. Die bewegung des menschlichen augapfels. Z. Rationelle Med., (4):101–128, 1854. → pages 77, 106
[39] Y. Fung and R. Skalak. Biomechanics: Mechanical properties of living tissues. Journal of Biomechanical Engineering, 103:231, 1981. → pages 17, 47, 49, 52
[40] Y. Furukawa and J. Ponce. Dense 3d motion capture from synchronized video streams. Image and Geometry Processing for 3-D Cinematography, pages 193–211, 2010. → pages 72
[41] G. Fyffe, A. Jones, O. Alexander, R. Ichikari, P. Graham, K. Nagano, J. Busch, and P. Debevec. Driving high-resolution facial blendshapes with video performance capture. ACM SIGGRAPH 2013 Talks, page 33, 2013. → pages 72
[42] P. Garrido, L. Valgaerts, C. Wu, and C. Theobalt. Reconstructing detailed dynamic face geometry from monocular video. ACM Trans. Graph., 32(6):158–1, 2013. → pages 25
[43] P. Garrido, L. Valgaerts, C. Wu, and C. Theobalt. Reconstructing detailed dynamic face geometry from monocular video. ACM Trans. Graph., 32(6):158, 2013. → pages 72
[44] A. Ghosh, G. Fyffe, B. Tunwattanapong, J. Busch, X. Yu, and P. Debevec. Multiview face capture using polarized spherical gradient illumination. ACM Transactions on Graphics (TOG), 30(6):129, 2011. → pages 72, 81
[45] P. F. Gotardo and A. M. Martinez. Computing smooth time trajectories for camera and deformable shape in structure from motion with occlusion. IEEE Trans. on Pattern Analysis and Machine Intelligence, 33(10):2051–2065, 2011. → pages 25
[46] E. Guestrin and M. Eizenman. Listing’s and donders’ laws and the estimation of the point-of-gaze. Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications, pages 199–202, 2010. → pages 109
[47] A. Habib and D. Kelley. Automatic relative orientation of large scale imagery over urban areas using modified iterated hough transform. ISPRS journal of photogrammetry and remote sensing, 56(1):29–41, 2001. → pages xx, 98
[48] A. G. Hannam, I. Stavness, J. E. Lloyd, and S. Fels. A dynamic model of jaw and hyoid biomechanics during chewing. Journal of Biomechanics, 41(5):1069–1076, 2008. → pages 45
[49] D. W. Hansen and Q. Ji. In the eye of the beholder: A survey of models for eyes and gaze. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32:478–500, 2010. → pages 92, 97, 104
[50] M. Hatamian and D. J. Anderson. Design considerations for a real-time ocular counterroll instrument. IEEE Transactions on Biomedical Engineering, BME-30:278–288, 1983. → pages 92
[51] J. Huang and H. Wechsler. Visual search of dynamic scenes: Event types and the role of experience in viewing driving situations. G. Underwood (Ed.), Eye Guidance in Reading and Scene Perception, pages 369–394, 1998. → pages 92
[52] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–1259, 1998. → pages 87
[53] R. Jacob and K. Karn. Eye tracking in human-computer interaction and usability research: Ready to deliver the promises. Mind, 2(3):4, 2003. → pages 92
[54] A. Jain, T. Thormählen, H.-P. Seidel, and C. Theobalt. Moviereshape: Tracking and reshaping of humans in videos. ACM Transactions on Graphics (TOG), 29(6):148, 2010. → pages 122, 139, 145
[55] J. Jimenez, K. Zsolnai, A. Jarabo, C. Freude, T. Auzinger, X.-C. Wu, J. von der Pahlen, M. Wimmer, and D. Gutierrez. Separable subsurface scattering. Computer Graphics Forum, 2015. → pages 86
[56] L. Kavan, S. Collins, J. Žára, and C. O’Sullivan. Geometric skinning with approximate dual quaternion blending. ACM Transactions on Graphics (TOG), 27(4):105, 2008. → pages 133
[57] K. Kim and R. Ramakrishna. Vision-based eye-gaze tracking for human computer interface. Systems, Man, and Cybernetics, 1999. IEEE SMC’99 Conference Proceedings. 1999 IEEE International Conference on, 2:324–329, 1999. → pages 92
[58] Y. Lee, D. Terzopoulos, and K. Waters. Realistic modeling for facial animation. Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, pages 55–62, 1995. → pages 44, 49
[59] D. Li, S. Sueda, D. R. Neog, and D. K. Pai. Thin skin elastodynamics. ACM Transactions on Graphics (TOG), 32(4):49, 2013. → pages 6, 8, 26, 27, 28, 53, 54, 55, 57, 74
[60] H. Li, R. W. Sumner, and M. Pauly. Global correspondence optimization for non-rigid registration of depth scans. Computer graphics forum, 27(5):1421–1430, 2008. → pages 80
[61] H. Li, J. Yu, Y. Ye, and C. Bregler. Realtime facial animation with on-the-fly correctives. ACM Transactions on Graphics (Proceedings SIGGRAPH 2013), 32(4), July 2013. → pages 72
[62] N. H. Mackworth and E. L. Thomas. Head-mounted eye-marker camera. JOSA, 52(6):713–716, 1962. → pages 92
[63] X. Maitre. Newscientist: Digital mirror reveals what lies under your skin. https://www.newscientist.com/article/mg22229653-800-digital-mirror-reveals-what-lies-under-your-skin/, 2014. Online; accessed 11 June 2018. → pages 11, 14, 122
[64] A. Malti, R. Hartley, A. Bartoli, and J.-H. Kim. Monocular template-based 3d reconstruction of extensible surfaces with local linear elasticity. Computer Vision and Pattern Recognition, IEEE Conference on, pages 1522–1529, 2013. → pages 25
[65] B. C. Mendelson and S. R. Jacobson. Surgical anatomy of the midcheek: facial layers, spaces, and the midcheek segments. Clinics in plastic surgery, 35(3):395–404, 2008. → pages xvi, 15, 16
[66] S. Moore, T. Haslwanter, I. Curthoys, and S. Smith. A geometric basis for measurement of three-dimensional eye position using image processing. Vision research, 36(3):445–459, 1996. → pages xx, 13, 104, 105, 108, 109, 157
[67] T. Moriyama, T. Kanade, J. Xiao, and J. F. Cohn. Meticulously detailed eye region model and its application to analysis of facial images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 28(5):738–752, 2006. → pages 25
[68] P. Neligan, E. Rodriguez, and J. Losee. Plastic Surgery: Craniofacial, head and neck surgery, pediatric plastic surgery. Volume three. Elsevier Saunders, 2013. → pages 57
[69] D. R. Neog, J. L. Cardoso, A. Ranjan, and D. K. Pai. Interactive gaze driven animation of the eye region. Proceedings of the 21st International Conference on Web3D Technology, pages 51–59, 2016. → pages 26
[70] V. Ng-Thow-Hing. Anatomically-based models for physical and geometrical reconstruction of animals. PhD Thesis, 2001. → pages 45
[71] V. Ng-Thow-Hing and E. Fiume. Application-specific muscle representations. Graphics Interface, 2:107–16, 2002. → pages 45
[72] A. Nidhi, H. Kumar, J. S. Dhaliwal, P. Kalra, and P. Chaudhuri. Improved interactive reshaping of humans in images. WSCG 2013: Communication Papers Proceedings: 21st International Conference in Central Europe on Computer Graphics. → pages 121, 122
[73] K. C. Nishikawa, J. A. Monroy, T. E. Uyeno, S. H. Yeo, D. K. Pai, and S. L. Lindstedt. Is titin a winding filament? a new twist on muscle contraction. Proceedings of the Royal Society of London B: Biological Sciences, 279(1730):981–990, 2012. → pages 47
[74] A. A. O. OPHTHALMOLOGY. Guide, presenter’s and monograph, ophthalmology. OPHTHALMOLOGY, 1:256, 1993. → pages 19
[75] J. Orozco, O. Rudovic, J. Gonzàlez, and M. Pantic. Hierarchical on-line appearance-based tracking for 3d head pose, eyebrows, lips, eyelids and irises. Image and vision computing, 31(4):322–340, 2013. → pages 25
[76] D. K. Pai, D. I. Levin, and Y. Fan. Eulerian solids for soft tissue and more. ACM SIGGRAPH 2014 Courses, page 22, 2014. → pages 51
[77] D. Pinskiy and E. Miller. Realistic eye motion using procedural geometric methods. SIGGRAPH 2009: Talks, page 75, 2009. → pages 71
[78] M. J. Powell. On search directions for minimization algorithms. Mathematical Programming, 4(1):193–201, 1973. → pages 66
[79] P. Prendergast. Minimally invasive face and neck lift using silhouette coned sutures. Miniinvasive Face and Body Lifts - Closed Suture Lifts or Barbed Thread Lifts, 2013. → pages xvii, 52
[80] C. K. Quah, A. Gagalowicz, R. Roussel, and H. S. Seah. 3d modeling of humans with skeletons from uncalibrated wide baseline views. International Conference on Computer Analysis of Images and Patterns, pages 379–389, 2005. → pages 121
[81] M. Richter, K. Varanasi, N. Hasler, and C. Theobalt. Real-time reshaping of humans. 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), 2012 Second International Conference on, pages 340–347, 2012. → pages 122, 139, 145
[82] D. Robinson. The mechanics of human saccadic eye movement. The Journal of physiology, 174(2):245–264, 1964. → pages 7, 8, 59
[83] R. Ronsse, O. White, and P. Lefevre. Computation of gaze orientation under unrestrained head movements. Journal Of Neuroscience Methods, 159(1):158–169, 2007. → pages 13, 113
[84] P. Rossell-Perry. The zygomatic ligament of the face: a critical review. OA Anatomy, 1:3, Feb, 2013. → pages 20
[85] K. Ruhland, S. Andrist, J. Badler, C. Peters, N. Badler, M. Gleicher, B. Mutlu, and R. McDonnell. Look me in the eyes: A survey of eye and gaze animation for virtual agents and artificial systems. Eurographics 2014-State of the Art Reports, pages 69–91, 2014. → pages 2, 71
[86] P. Sachdeva, S. Sueda, S. Bradley, M. Fain, and D. K. Pai. Biomechanical simulation and control of hands and tendinous systems. ACM Transactions on Graphics (TOG), 34(4):42, 2015. → pages 46
[87] M. Salzmann, F. Moreno-Noguer, V. Lepetit, and P. Fua. Closed-form solution to non-rigid 3d surface registration. European conference on computer vision, pages 581–594, 2008. → pages 25
[88] C. A. Sánchez, Z. Li, A. G. Hannam, P. Abolmaesumi, A. Agur, and S. Fels. Constructing detailed subject-specific models of the human masseter. Imaging for Patient-Customized Simulations and Systems for Point-of-Care Ultrasound, pages 52–60, 2017. → pages 45
[89] A. Shapiro, P. Faloutsos, and V. Ng-Thow-Hing. Dynamic animation and control environment. Proceedings of graphics interface 2005, pages 61–70, 2005. → pages 45
[90] F. Shi, H.-T. Wu, X. Tong, and J. Chai. Automatic acquisition of high-fidelity facial performances using monocular videos. ACM Trans. Graph., 33(6):222:1–222:13, Nov. 2014. → pages 72
[91] R. Sibson et al. A brief description of natural neighbour interpolation. Interpreting multivariate data, 21:21–36, 1981. → pages 32
[92] E. Sifakis, I. Neverov, and R. Fedkiw. Automatic determination of facial muscle activations from sparse motion capture marker data. ACM Trans. Graph., 24(3):417–425, July 2005. → pages 58
[93] E. Sifakis, I. Neverov, and R. Fedkiw. Automatic determination of facial muscle activations from sparse motion capture marker data. 24(3):417–425, 2005. → pages 45
[94] J. Skillman, T. Hardy, N. Kirkpatrick, N. Joshi, and M. Kelly. Use of the orbicularis retaining ligament in lower eyelid reconstruction. Journal of Plastic, Reconstructive & Aesthetic Surgery, 62(7):896–900, 2009. → pages 20
[95] J. Stam. Stable fluids. Proc. SIGGRAPH 1999, pages 121–128, 1999. → pages 56
[96] S. Sueda, A. Kaufman, and D. K. Pai. Musculotendon simulation for hand animation. ACM Transactions on Graphics (TOG), 27(3):83, 2008. → pages 46
[97] S. Suzuki et al. Topological structural analysis of digitized binary images by border following. Computer vision, graphics, and image processing, 30(1):32–46, 1985. → pages 126
[98] A. Tagliasacchi, M. Schröder, A. Tkach, S. Bouaziz, M. Botsch, and M. Pauly. Robust articulated-icp for real-time hand tracking. Computer Graphics Forum, 34(5):101–114, 2015. → pages 5, 24, 30
[99] W. Tasman and E. A. Jaeger. Duane’s Ophthalmology. ARVO, 2007. → pages 20
[100] J. Taylor et al. Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences. ACM Trans. on Graphics (TOG), 35(4):143, 2016. → pages 5, 24, 30
[101] J. Teran, S. Blemker, V. Hing, and R. Fedkiw. Finite volume methods for the simulation of skeletal muscle. Proceedings of the 2003 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 68–74, 2003. → pages 45, 51, 54
[102] D. Terzopoulos and K. Waters. Physically-based facial modelling, analysis, and animation. The Journal of Visualization and Computer Animation, 1(2):73–80, Dec. 1990. → pages 72
[103] L. C. Trutoiu, E. J. Carter, I. Matthews, and J. K. Hodgins. Modeling and animating eye blinks. ACM Transactions on Applied Perception (TAP), 8(3):17, 2011. → pages 81, 87, 88, 111
[104] L. Valgaerts, C. Wu, A. Bruhn, H.-P. Seidel, and C. Theobalt. Lightweight binocular facial performance capture under uncontrolled lighting. ACM Trans. Graph., 31(6):187, 2012. → pages 72
[105] A. Vill. Eye texture raytracer. https://www.chromeexperiments.com/experiment/eye-texture-raytracer, Feb. 2014. Online; accessed 11 June 2018. → pages 9, 72
[106] P. Volino, N. Magnenat-Thalmann, F. Faure, et al. A simple approach to nonlinear tensile stiffness for accurate cloth simulation. ACM Transactions on Graphics, 28(4), 2009. → pages 50
[107] T. Weise, S. Bouaziz, H. Li, and M. Pauly. Realtime performance-based facial animation. ACM Transactions on Graphics (Proceedings SIGGRAPH 2011), 30(4), July 2011. → pages 72, 74
[108] WikiHuman. http://gl.ict.usc.edu/Research/DigitalEmily2/. Online; accessed 11 June 2018. → pages 81
[109] C. Wu, B. Wilburn, Y. Matsushita, and C. Theobalt. High-quality shape from multi-view stereo and shading under general illumination. Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 969–976, 2011. → pages 72
[110] M. Yanoff, J. Duker, and J. Augsburger. Ophthalmology. Mosby Elsevier, 2009. → pages xvi, 18, 21, 22, 23
[111] B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang. Real-time action recognition with enhanced motion vector cnns. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2718–2726, 2016. → pages 144
[112] S. Zhou, H. Fu, L. Liu, D. Cohen-Or, and X. Han. Parametric reshaping of human bodies in images. ACM Transactions on Graphics (TOG), 29(4):126, 2010. → pages 121, 122, 139, 145
[113] Z. Zhou and X. Geng. Projection functions for eye detection. Pattern recognition, 37(5):1049–1056, 2004. → pages 111
[114] D. Zhu, S. Moore, and T. Raphan.
Robust and real-time torsional eye position calculation using a template-matching technique. Computer methods and programs in biomedicine, 74(3):201–209, 2004. → pages 110

Appendix A
Supporting Materials

A.1 Calibration parameters estimation

The eye model used in this project has several parameters that need to be computed using a calibration procedure, as in [66]. The calibration procedure described in [66] determines six parameters: the Fick angles $\theta_{off}$, $\phi_{off}$, and $\psi_{off}$ corresponding to the rotation matrix $R_{off}$ (or ${}^{h}R_{c}$, the camera-to-head transformation), the radius of the eye at the pupil center $r_p$, and the projection of the center of the eye on the image plane $T'' = (0, x_0, y_0)$. To determine these parameters, five calibration points are used: purely vertical locations at $\pm\phi_{cal}$, purely horizontal locations at $\pm\theta_{cal}$, and the location with the eye looking to the front in the reference position. For our experiments we put the markers on a calibration bar, placed at an appropriate distance from the eye positions of the subject.

Let us assume the corresponding coordinates in the image are:

$(x_{+\phi_{cal}}, y_{+\phi_{cal}})$, $(x_{-\phi_{cal}}, y_{-\phi_{cal}})$, $(x_{+\theta_{cal}}, y_{+\theta_{cal}})$, $(x_{-\theta_{cal}}, y_{-\theta_{cal}})$, $(x_r, y_r)$.

We consider $\theta_{cal} = \phi_{cal}$. The equations used to estimate the parameters are:

$$\psi_{off} = -\operatorname{atan}\left(\frac{y_{+\theta_{cal}} - y_{-\theta_{cal}}}{y_{+\phi_{cal}} - y_{-\phi_{cal}}}\right) \tag{A.1}$$

$$r_p^2 = \left(\frac{y_{+\phi_{cal}} + y_{-\phi_{cal}} - 2 \cdot y_r}{2 \cdot (1 - \cos(\phi_{cal}))}\right)^2 + \left(\frac{y_{+\phi_{cal}} - y_{-\phi_{cal}}}{2 \cdot \cos(\psi_{off}) \cdot \sin(\phi_{cal})}\right)^2 \tag{A.2}$$

$$\phi_{off} = \operatorname{asin}\left(\frac{y_{+\phi_{cal}} + y_{-\phi_{cal}} - 2 \cdot y_r}{2 \cdot r_p \cdot (1 - \cos(\phi_{cal}))}\right) \tag{A.3}$$

$$\theta_{off} = \operatorname{asin}\left(\frac{x_{+\phi_{cal}} + x_{-\phi_{cal}} - 2 \cdot x_r}{2 \cdot r_p \cdot \cos(\phi_{off}) \cdot (\cos(\phi_{cal}) - 1)}\right) \tag{A.4}$$

$$T'' = (0, x_0, y_0) = (0, x_r, y_r) - r_p \cdot (0, \cos(\phi_{off}) \cdot \sin(\theta_{off}), -\sin(\phi_{off})) \tag{A.5}$$
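Equations (A.1) to (A.5) can be evaluated directly once the five calibration points have been located in the image. The sketch below is a minimal Python transcription of those equations, not the calibration code used in this work: the function name, the argument order, the dictionary of results, and the assumption that all angles are given in radians are choices made for the example.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]  # (x, y) image coordinates of a calibration point


def estimate_calibration(p_phi_pos: Point, p_phi_neg: Point,
                         p_theta_pos: Point, p_theta_neg: Point,
                         p_ref: Point, cal_angle: float) -> Dict[str, float]:
    """Evaluate (A.1)-(A.5) with theta_cal = phi_cal = cal_angle (radians)."""
    x_pp, y_pp = p_phi_pos      # eye at +phi_cal
    x_pn, y_pn = p_phi_neg      # eye at -phi_cal
    _, y_tp = p_theta_pos       # eye at +theta_cal (only y is used)
    _, y_tn = p_theta_neg       # eye at -theta_cal (only y is used)
    x_r, y_r = p_ref            # eye in the reference position

    # (A.1) torsional offset
    psi_off = -math.atan((y_tp - y_tn) / (y_pp - y_pn))

    # (A.2) radius of the eye at the pupil center
    term_a = (y_pp + y_pn - 2.0 * y_r) / (2.0 * (1.0 - math.cos(cal_angle)))
    term_b = (y_pp - y_pn) / (2.0 * math.cos(psi_off) * math.sin(cal_angle))
    r_p = math.sqrt(term_a ** 2 + term_b ** 2)

    # (A.3) vertical offset
    phi_off = math.asin((y_pp + y_pn - 2.0 * y_r)
                        / (2.0 * r_p * (1.0 - math.cos(cal_angle))))

    # (A.4) horizontal offset
    theta_off = math.asin((x_pp + x_pn - 2.0 * x_r)
                          / (2.0 * r_p * math.cos(phi_off)
                             * (math.cos(cal_angle) - 1.0)))

    # (A.5) projection of the eye center on the image plane
    x_0 = x_r - r_p * math.cos(phi_off) * math.sin(theta_off)
    y_0 = y_r + r_p * math.sin(phi_off)

    return {"psi_off": psi_off, "r_p": r_p, "phi_off": phi_off,
            "theta_off": theta_off, "x_0": x_0, "y_0": y_0}
```

For a perfectly symmetric configuration (vertical points equally spaced above and below the reference point, horizontal points at the same height as the reference point), all three offset angles evaluate to zero and $(x_0, y_0)$ coincides with $(x_r, y_r)$, which is a quick sanity check on measured data. The sketch does not guard against noisy inputs that push the `asin` arguments outside [-1, 1].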

