Understanding the Sources of Error for 3D Human Pose Estimation from Monocular Images and Videos

by

Mir Rayat Imtiaz Hossain

Bachelor of Science, Islamic University of Technology, 2013

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)

December 2017

© Mir Rayat Imtiaz Hossain, 2017

Abstract

With the success of deep learning in the field of computer vision, most state-of-the-art approaches to estimating 3D human pose from images or videos rely on training a network end-to-end to regress 3D joint locations or heatmaps from an RGB image. Although most of these approaches provide good results, the major sources of error are often difficult to understand. The errors may come either from incorrect 2D pose estimation or from the incorrect mapping of features in 2D to 3D. In this work, we aim to understand the sources of error in estimating 3D pose from images and videos. To that end, we have built three different systems. The first takes the 2D joint locations of every frame individually as input and predicts 3D joint positions. To our surprise, we found that with a simple feed-forward fully connected network with residual connections, the ground truth 2D joint locations can be mapped to 3D space at a remarkably low error rate, outperforming the best reported result by almost 30% on Human 3.6M, the largest publicly available dataset of motion capture data. Furthermore, training this network on the outputs of an off-the-shelf 2D pose detector gives us state-of-the-art results when compared with a vast array of systems trained end-to-end. To validate the efficacy of this network, we also trained an end-to-end system that takes an image as input and regresses 3D pose directly. We found that it is harder to train the network end-to-end than to decouple the task. To examine whether temporal information over a sequence improves results, we built a sequence-to-sequence network that takes a sequence of 2D poses as input and predicts a sequence of 3D poses as output. We found that the temporal information improves the results of our first system. We argue that a large portion of the error of 3D pose estimation systems results from error in 2D pose estimation.

Lay Summary

Estimating human pose in 3D from images and videos has multiple applications in computer vision, robotics and graphics, such as human action or activity recognition, sports analysis, animation and augmented reality. A major challenge for this task is the lack of training data, because collecting 3D motion capture data is expensive and requires a sophisticated laboratory setup. The task is also very challenging because of the inherent ambiguity of mapping a scene from 2D to 3D. Recent methods for 3D pose estimation tend to leverage deep networks and have produced good results. However, the major sources of error for this task are not well understood. In this thesis, we examine the possible sources of error and design three different networks for this purpose. Two of our networks have produced state-of-the-art results for the 3D pose estimation task.

Preface

This thesis is submitted in partial fulfillment of the requirements for a Master of Science degree in Computer Science. The entire work presented here is original work done by the author, Mir Rayat Imtiaz Hossain, performed under the supervision of Professor James J. Little.
A version of this work has been accepted to be published as:

• J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3d human pose estimation. In IEEE International Conference on Computer Vision (ICCV), October 2017 [71]

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
  1.1 Problem Definition
    1.1.1 Scope
    1.1.2 Data
  1.2 Method Outline
  1.3 Thesis Organization
2 Related Work
  2.1 Representation of 3D pose
  2.2 Approaches to 3D Pose estimation
    2.2.1 3D Pose estimation by extracting features from single image
    2.2.2 Using features to look up in a database of exemplar 3D poses
    2.2.3 Deep network trained end-to-end
    2.2.4 3D Pose Estimation from 2D pose
    2.2.5 Exploiting temporal information
    2.2.6 Exploiting multiple views
    2.2.7 Exploiting depth information
  2.3 2D pose estimation techniques
  2.4 Deep Networks
    2.4.1 Biological motivation
    2.4.2 History of Neural Networks
    2.4.3 Convolutional Neural Networks
    2.4.4 Recurrent Neural Networks
3 3D pose from 2D pose
  3.1 Loss Function
  3.2 Network design
    3.2.1 Mapping 2D pose to 3D
    3.2.2 Fully connected layers with ReLU activation
    3.2.3 Residual or shortcut connections
    3.2.4 Regularization with batch normalization, dropout and max-norm constraint
  3.3 Data Preprocessing
    3.3.1 Camera coordinate frame
    3.3.2 2D detections
    3.3.3 Training details
  3.4 Experimental evaluation
    3.4.1 Quantitative results
    3.4.2 Qualitative results
    3.4.3 Discussion of results
4 End-to-end model
  4.1 Stacked hourglass module
  4.2 Pre-training stacked-hourglass model
  4.3 Training end-to-end
    4.3.1 Loss Function
    4.3.2 Data Preprocessing
    4.3.3 Training Details
  4.4 Experimental evaluation
    4.4.1 Quantitative results
    4.4.2 Qualitative results
    4.4.3 Discussion of results
5 Exploiting temporal information
  5.1 Network design
    5.1.1 Sequence-to-sequence network with residual connections
    5.1.2 Layer Normalization
    5.1.3 Recurrent Dropout
    5.1.4 Temporal smoothness constraint
    5.1.5 Loss function
  5.2 Data Preprocessing
    5.2.1 Training details
  5.3 Experimental evaluation
    5.3.1 Quantitative results
    5.3.2 Qualitative results
    5.3.3 Discussion of results
6 Conclusion and future work
  6.1 Future directions
  6.2 Conclusion
Bibliography

List of Tables

Table 3.1  Results showing errors action-wise on Human3.6M [51] under Protocol #1 (no rigid alignment or similarity transform applied in post-processing). SH indicates that we trained and tested our model with the detections of the Stacked Hourglass [80] model pre-trained on the MPII dataset [5] as input, and FT indicates that the model was fine-tuned on Human3.6M. GT detections denotes that the ground truth 2D locations were used. SA indicates that a model was trained for each action, and MA indicates that a single model was trained for all actions.

Table 3.2  Results showing errors action-wise on the Human3.6M [51] dataset under protocol #2 (rigid alignment in post-processing). The 14j annotation indicates that the body model considers 14 body joints, while 17j means 17 body joints. (SA) indicates a per-action model while (MA) indicates a single model used for all actions. FT indicates that the stacked-hourglass model has been fine-tuned on the Human3.6M dataset. The results of the methods are obtained from the original papers, except for (*), which were obtained from [16].

Table 3.3  Results on the HumanEva [105] dataset, and comparison with previous methods.

Table 3.4  Performance of our system on the Human3.6M [51] dataset under protocol #2 under different levels of additive Gaussian noise and noise from 2D pose estimators. (Top) Training using ground truth 2D pose and testing on ground truth 2D plus different levels of additive Gaussian noise. (Bottom) Training on ground truth 2D pose and testing on the noisy outputs of a 2D pose estimator. Note that the size of the cropped region around the person is 440×440.

Table 3.5  Ablative and hyperparameter sensitivity analysis.

Table 4.1  Results showing Mean Per Joint Error over all actions on the Human3.6M [51] dataset under protocol #1 (left column) and #2 (right column) respectively. SH indicates 2D pose detections obtained from the stacked-hourglass module [80] trained on the MPII [5] dataset, and FT indicates that the model was fine-tuned on the Human3.6M dataset [51]. The results of the methods are obtained from the original papers, except for (*), which were obtained from [16].

Table 5.1  Results showing errors action-wise on Human3.6M [51] under Protocol #1 (no rigid alignment or similarity transform applied in post-processing). Note that our results reported here are for sequences of length 5. SH indicates that we trained and tested our model with the detections of the Stacked Hourglass [80] model pre-trained on the MPII dataset [5] as input, and FT indicates that the stacked-hourglass model was fine-tuned on Human3.6M. SA indicates that a model was trained for each action, and MA indicates that a single model was trained for all actions. The bold-faced numbers mean the best result while underlined numbers represent the second best.

Table 5.2  Results showing errors action-wise on the Human3.6M [51] dataset under protocol #2 (rigid alignment in post-processing). Note that the results reported here are for sequences of length 5. The 14j annotation indicates that the body model considers 14 body joints, while 17j means 17 body joints. (SA) indicates a per-action model while (MA) indicates a single model used for all actions. FT indicates that the stacked-hourglass model has been fine-tuned on the Human3.6M dataset. The bold-faced numbers mean the best result while underlined numbers represent the second best. The results of the methods are obtained from the original papers, except for (*), which were obtained from [16].

Table 5.3  Performance of our system trained with the ground truth 2D pose of the Human3.6M [51] dataset and tested under different levels of additive Gaussian noise (Top) and on 2D pose predictions from the stacked-hourglass [80] pose detector (Bottom), under protocol #2. The size of the cropped region around the person is 440×440.

Table 5.4  Ablative and hyperparameter sensitivity analysis.

List of Figures

Figure 1.1  A two step approach to 3D human pose estimation. a) A frame from the input video. b) The input frame with the 2D pose estimate superimposed. c) 3D pose estimate corresponding to the input frame. Although it is hard to obtain training data that maps the input frame to 3D pose, we can decompose the challenge into two tasks.

Figure 1.2  Example of 3D pose estimation. The 2D pose is overlaid on the image of the person. The corresponding 3D pose is shown at the bottom.

Figure 1.3  (a) 2D position of joints, (b) Different 3D pose interpretations of the same 2D pose. Blue points represent the ground truth 3D pose while the black points indicate other possible 3D interpretations. All these 3D poses project to exactly the same 2D pose.

Figure 1.4  An example of data in the Human 3.6M dataset, from left to right: RGB image, person silhouette, time-of-flight (depth) data, 3D pose data (shown using a synthetic graphics model), body surface scan. Source: [51].

Figure 1.5  Sample images from the Human 3.6M dataset, showing different subjects, poses and viewing angles. Source: [51].

Figure 1.6  Block diagram of our first system. The building block of our network, which we call the Residual Block, is composed of a linear layer followed by batch normalization, ReLU activation and a dropout layer, repeated twice and wrapped in a residual connection. The Residual Block can be repeated any number of times. Our best network uses two such residual blocks. The input to our system is an array of 2D joint positions, and the output is a series of joint positions in 3D.

Figure 1.7  Our second model simply stacks our first model on top of the stacked-hourglass [80] 2D pose estimator. The stacked-hourglass network is first pre-trained for 2D pose estimation using images from the Human3.6M dataset [51]. The heatmap of the final hourglass is passed as an input to our residual block and the entire network is trained end-to-end.

Figure 1.8  Our final network. It is a sequence-to-sequence network [113] with residual connections on the decoder side. The encoder encodes the information of a sequence of 2D poses of length t in its final hidden state. The final hidden state of the encoder is used to initialize the hidden state of the decoder. The <START> symbol tells the decoder to start predicting 3D pose from the last hidden state of the encoder. Note that the input sequence is reversed, as suggested by Sutskever et al. [113]. The decoder essentially learns to predict the 3D pose at time t given the 3D pose at time t−1. The residual connections help the decoder learn the perturbation from the previous time step.

Figure 2.1  (Left) A sample skeleton model with 17 joints, each of them labeled. (Right) A kinematic tree showing the kinematic relationship between the joints. The downward arrows indicate a parent-child relationship between two joints.

Figure 2.2  A fully connected neural network consisting of an input layer, one hidden layer and an output layer. The connections between the neurons are shown with arrows. Each connection has a particular weight which is learned over time from training data using backpropagation. Each neuron also has an activation function which defines a threshold for the neuron to fire.

Figure 2.3  A convolutional layer having a depth column of 5, i.e. 5 neurons are connected to the same spatial region, and a filter size or receptive field size of 5×5.

Figure 2.4  A 2×2 max-pooling layer with a stride of 2.

Figure 2.5  An RNN unrolled into a full network.

Figure 2.6  (Left) Diagram of a simple RNN unit. (Right) Diagram showing an LSTM block.

Figure 3.1  Example of output on the test set of the Human3.6M dataset. (Left) 2D pose, (Middle) 3D ground truth pose in red and blue, (Right) our 3D pose estimations in green and purple.

Figure 3.2  Qualitative results on the MPII [5] test set. Observed image, followed by 2D pose detection using Stacked Hourglass [80] and (in green) our 3D pose estimation. The bottom 3 examples show typical failure cases, where the 2D detector has failed either totally (left) or marginally (right). In the middle column of the last row, the 2D detector does a good job of estimating the 2D pose, but the person is facing upside-down. The Human3.6M dataset does not provide any corresponding poses which are oriented upside-down. However, our network still seems to predict a meaningful pose, although the orientation is reversed vertically.

Figure 4.1  Example of output on the test images of the Human3.6M dataset. (Left) Image, (Middle) 3D ground truth pose in red and blue, (Right) our 3D pose estimations in green and purple.

Figure 5.1  Mean Per Joint Error (MPJE) in mm of our network for different sequence lengths. SH Pre-trained indicates that 2D poses are estimated using the stacked-hourglass model pre-trained on MPII [5], while SH FT indicates that the detections were obtained with the stacked-hourglass model fine-tuned by us on the Human3.6M dataset.

Figure 5.2  Qualitative result of Subject 11, action sitting down, for the Human3.6M dataset [51]. (Left) Image with 2D pose, (Middle) 3D ground truth pose in red and blue, (Right) 3D pose estimations in green and purple.

Figure 5.3  Qualitative result of Subject 9, action phoning, for the Human3.6M dataset [51]. (Left) Image with 2D pose, (Middle) 3D ground truth pose in red and blue, (Right) 3D pose estimations in green and purple.

Figure 5.4  Qualitative result of Subject 11, action taking photo, for the Human3.6M dataset [51]. (Left) Images with 2D pose detections, (Middle) 3D ground truth pose in red and blue, (Right) 3D pose estimations in green and purple.

Figure 5.5  Qualitative results on YouTube videos. (Left) Images with 2D pose detections, (Right) our 3D pose estimation.

Figure 5.6  Qualitative results on YouTube videos. (Left) Images with 2D pose detections, (Right) our 3D pose estimation.

Figure 5.7  Qualitative results on YouTube videos. (Left) Images with 2D pose detections, (Right) our 3D pose estimation.

Acknowledgments

I would like to extend my heartfelt gratitude to a number of people and organizations for providing me continuous academic, financial and mental support during my Master's.

First I would like to thank my supervisor, Prof. James J. Little. Not only is he a great academician and mentor, he is one of the nicest and most generous people that I have come across in my life. He has always encouraged me to explore new ideas and problems and appreciated my efforts throughout my program. He provided me vital feedback, advice and insights whenever I got stuck with any problem. He kept on giving me moral support and motivation to work hard and get the best out of my thesis. Thank you Prof. Jim Little, I will always be grateful for all that I have learned from you. I would also like to thank Prof. Leonid Sigal for taking his time out and agreeing to be the second reader of my thesis.

Next I would like to thank Julieta Martinez, my lab-mate, who has always helped me with great ideas. I had the privilege of collaborating with her for the first part of my thesis and have learned a lot from her.
I would also like to thank Javier Romero at the MPI Institute for agreeing to collaborate with us on the first part of my thesis. I would like to extend my gratitude to my former lab mate Ankur Gupta, for initially motivating me with the problem of 3D pose, and to my other lab mates Jimmy, Moumita and Lili for being such great colleagues and for being so nice, kind and helpful.

I would also like to thank the Department of Computer Science of the University of British Columbia (UBC) for giving me the honor and opportunity to be a part of their prestigious alumni and for financially supporting me as a teaching assistant. I thoroughly enjoyed my experience here. I have had the privilege to learn and gain knowledge from some amazing instructors during my coursework. I would like to extend my gratitude to all my course instructors. This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). A big thanks to them as well.

Last but not least, I will forever be grateful and indebted to my parents, my younger sister and my lovely wife-to-be, Farozaan, for their unconditional love and support. They have always given me confidence, motivation and hope in times of despair. Especially, I would like to thank my mom for her countless sacrifices and for all her prayers.

Chapter 1: Introduction

Most existing representations of humans are two dimensional, e.g. video, images or paintings. However, all the objects that we see in front of us are three dimensional. What we essentially see with our eyes are images of these objects projected onto our retinas. The phenomenon of projecting 3D space onto a 2D plane is known as perspective projection. It is from the sense of perspective that humans estimate the depth of things in front of them, thereby knowing which objects are closer to them than others. This makes humans very adept at understanding complex spatial arrangements of objects in a scene, even in the presence of depth ambiguities. Therefore, such two dimensional representations have played a crucial role in conveying facts, ideas and feelings to other people. In many computer vision and robotics applications, such as virtual and augmented reality and autonomous driving, the ability to perform spatial reasoning about objects in a scene is crucial. Poor understanding of spatial arrangement and depth can seriously limit the performance of computer vision algorithms. In this thesis, we concentrate on a particular instance of depth and spatial understanding: estimating 3D human pose from images and videos.

Estimating human pose in 3D from 2D representations is a challenging and active research area in the computer vision and graphics community. An understanding of human posture and limb articulation is important for higher level computer vision tasks such as human action or activity recognition, sports analysis, and augmented and virtual reality. A 2D representation of human pose can be used for these tasks. However, 2D poses are inherently ambiguous, because an arbitrary camera viewpoint can make totally different poses look similar (see Figure 1.3). Moreover, 2D human poses can often be confusing because of occlusion of one body part by another. A 3D representation of human pose is free from such ambiguities and hence can improve performance on higher level tasks. Moreover, the 3D pose can be very useful in computer animation, where the articulated pose of a person in 3D can be used to accurately model human posture and movement. But one of the biggest challenges is the lack of abundant data for the task of 3D pose, particularly for images in the wild. Collecting data for 3D pose estimation is expensive and requires a complex laboratory setup.

Over the years, a number of different techniques have been used to address the problem of 3D pose estimation from images and videos. In order to go from an image to a 3D pose, an algorithm has to be invariant to a number of factors, including background scenes, lighting, clothing shape and texture, skin color and image imperfections, among others. Before the advent of deep networks, most approaches tended to use hand-engineered features, such as silhouettes [1], shape context [77], SIFT [68] descriptors [15] or edge direction histograms [103], to learn a model that can estimate 3D poses from images. Most of these features have the desired invariance properties. Another stream of work predicts 3D poses by querying a database of exemplars [22, 41, 53, 78, 128]. Some work tries to predict 3D poses given the 2D poses from an image by estimating the camera parameters of a weak perspective projection equation; the 3D human pose is represented as a sparse combination of a set of basis poses which is learned separately [2, 92, 134]. Another group of work tries to exploit temporal consistency over multiple frames [4, 65, 81, 117, 134].

With the recent success of deep learning in the area of computer vision, many systems have tried to exploit the powerful discriminative ability of deep networks to directly estimate 3D poses from RGB images by training the architecture end-to-end [63, 65, 73, 74, 81, 85, 87, 112, 116, 118, 133]. Some other systems have argued that 3D reasoning from monocular images can be achieved by training on synthetic data [96, 120]. Most computer vision systems based on deep networks currently outperform the traditional approaches on tasks like object classification and localization [44, 57, 93, 114] and 2D pose estimation [19, 45, 80, 123]. However, deep learning methods require a huge amount of data to perform well. Unlike object classification or 2D pose estimation, which have abundant data, there is a lack of ground truth 3D human pose data for images in the wild. This makes the task of inferring 3D poses directly from images very challenging. Although some end-to-end systems for 3D pose estimation have remarkably good results compared to the older techniques, the primary sources of error in such systems are not well studied or understood. It is not clear whether the error comes from erroneous 2D human pose detection, due to occlusion by the person's own body or other objects, motion blur or other imaging artifacts, or from the incorrect mapping of the features from a 2D representation to 3D pose. Therefore, in this work, we analyze the possible sources of error in 3D pose estimation by decoupling the 3D pose estimation task into the well studied problems of 2D pose estimation [80, 123] and 3D pose estimation from 2D joint detections, focusing on the latter. Through decoupling we can exploit any of the existing 2D pose estimation systems, which already provide invariance to factors like background scenes, lighting, clothing shape and texture, and skin color. We can also train a deep network based model for 2D-to-3D pose mapping with large databases of 3D motion capture (mo-cap) data captured in controlled environments in research labs. The idea of decoupling the task of 3D pose estimation is illustrated in Figure 1.1.

Figure 1.1: A two step approach to 3D human pose estimation. a) A frame from the input video. b) The input frame with the 2D pose estimate superimposed. c) 3D pose estimate corresponding to the input frame. Although it is hard to obtain training data that maps the input frame to 3D pose, we can decompose the challenge into two tasks.

To validate the efficacy of decoupling the task of 3D pose estimation, and thereby analyze the sources of error, we have designed three different network systems. The first of these systems is based on a simple fully connected feed-forward network with residual connections (Figure 1.6). The input to the system is the normalized 2D joint positions of a single frame. The task of this network is to backproject joint locations in 2D to 3D in the camera coordinate frame. To our surprise, such a simple network architecture backprojects the ground truth 2D positions to 3D with an error rate that improves on the state-of-the-art by almost 30% on Human 3.6M, the largest publicly available dataset of motion capture (mo-cap) data recorded in a controlled lab environment. When trained on the noisy output of a recent 2D pose detector [80], our system also outperforms the state-of-the-art for 3D pose estimation, a large number of which are trained end-to-end to predict 3D pose directly from the raw pixels of an image. Our second system is an end-to-end network that takes an RGB image as input and regresses 3D pose. The system adds our fully connected network on top of the 2D pose detection network by Newell et al. [80], which they named stacked-hourglass. The 2D pose detection network by Newell et al. outputs a probability heatmap for each joint, indicating the probability that the joint is present at a particular location in the 2D image. The stacked hourglass network is first pre-trained for the 2D pose estimation task. The 2D heatmaps from the stacked-hourglass network are then fed into the fully connected network and the whole model is trained end-to-end. However, the performance of this system is worse than that of the decoupled system, suggesting that it is more difficult to train such a system end-to-end. Our final system attempts to exploit the temporal information over a sequence of images. We wanted to examine whether adding temporal information during training helps to improve the result obtained from our first network. This system is also based on the idea of decoupling. It is a sequence-to-sequence network [113], which reads through a sequence of 2D poses and then predicts a sequence of 3D poses. The sequence-to-sequence network we developed also has residual connections on the decoder side. Since we deal with a sequence of frames together in the sequence-to-sequence network, it is also easy to impose a temporal smoothness constraint during training. We found that incorporating temporal information improves the error by about 17.5% over our initial system.

There are several contributions of this work. The first is the design and analysis of a simple network that performs better than the state-of-the-art, is fast (a forward pass takes around 3 ms on a batch of size 64, allowing us to process as many as 300 fps in batch mode) and is robust to noise. The primary reason for the improvement in performance is a collection of simple ideas, such as estimating the 3D joint locations in the camera coordinate frame, using residual connections and using batch normalization. Secondly, we have shown empirically that lifting 2D poses to 3D, although still far from being solved, is a much easier task than previously thought, particularly when compared against systems which predict 3D pose from an image directly. This is evidenced by the fact that our simplest network significantly outperforms previous systems on 3D pose estimation when we use noise-free ground truth 2D poses from the Human 3.6M dataset or when we fine-tune the 2D pose detector on Human 3.6M. Finally, we also showed that the results can be improved even further by using temporal information and adding a temporal smoothness constraint during the training phase, through a simple sequence-to-sequence network with residual connections.

From the findings mentioned above, we suggest that the major issue inhibiting the performance of recent 3D pose estimation systems, particularly the ones trained end-to-end from raw images, is the lack of proper visual parsing of articulated human bodies in 2D RGB images. Therefore, as a future research direction, we suggest putting more focus on obtaining better accuracy in estimating the 2D articulated pose of humans from images. In what follows, we define the problem, discuss the challenges of the task and the limitations of our systems, describe our data, and give a brief outline of our networks.

Figure 1.2: Example of 3D pose estimation. The 2D pose is overlaid on the image of the person. The corresponding 3D pose is shown at the bottom.

1.1 Problem Definition

The problem we address in this work is estimating 3D human pose from monocular images or sequences of images. More formally, given an image or a sequence of images, a 2-dimensional representation of a human being, 3D pose estimation is the task of producing a 3-dimensional stick figure that matches the spatial positions of certain keypoints or joints of the depicted person (see Figure 1.2). In this work, we concentrate in particular on lifting 2D poses detected by an off-the-shelf 2D pose detector into 3D poses. Here are some desired properties of the solution:

• anthropomorphic correctness of the recovered pose
• recovered 3D keypoints must be accurate in 3D Cartesian geometry
• ability to deal with arbitrary viewpoints of the camera
• accurate recovery of the pose in the correct viewpoint without any similarity transform
• robustness to noisy poses from an off-the-shelf 2D pose detector

Figure 1.3: (a) 2D position of joints, (b) Different 3D pose interpretations of the same 2D pose. Blue points represent the ground truth 3D pose while the black points indicate other possible 3D interpretations. All these 3D poses project to exactly the same 2D pose.

The task of 3D human pose estimation is inherently difficult, because any 3D object can be projected onto a 2D plane in an infinite number of ways, depending on the position of the camera and its intrinsic parameters.
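To make this ambiguity concrete, the following is a minimal sketch in Python/NumPy (not from the thesis; the intrinsic matrix and point coordinates are made up) showing that two different 3D joint positions can land on exactly the same pixel:

```python
# Minimal NumPy sketch (not from the thesis): two different 3D points in the camera
# frame that project to exactly the same pixel under a pinhole camera with made-up
# intrinsics. Perspective division removes the depth coordinate.
import numpy as np

K = np.array([[1000.0,    0.0, 320.0],
              [   0.0, 1000.0, 240.0],
              [   0.0,    0.0,   1.0]])   # hypothetical camera intrinsic matrix

def project(X):
    """Project a 3D point in the camera coordinate frame to pixel coordinates."""
    x = K @ X
    return x[:2] / x[2]

P1 = np.array([0.2, 0.1, 2.0])   # a joint two metres from the camera
P2 = 1.5 * P1                    # the same joint pushed 50% further along its viewing ray

print(project(P1), project(P2))  # prints the same pixel twice: [420. 290.] [420. 290.]
```

Every point along the viewing ray through the camera centre produces the same projection, which is exactly the ambiguity illustrated in Figure 1.3.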
Therefore, back-projecting the 2D points of any object to their 3D representation is a difficult task, since the problem is ill-defined and the mapping is not one to one (see Figure 1.3). Obtaining 3D pose datasets is also difficult because, unlike 2D pose datasets, where users can manually label the keypoints or key joints with mouse clicks, 3D pose datasets require a complex laboratory setup with motion capture sensors and cameras. Not only does this make data collection expensive, it is also difficult to replicate a similar setup outdoors or for images in the wild. Hence, there is a lack of 3D pose datasets for images in the wild. Additionally, the task of visually understanding human bodies is itself difficult because of visual ambiguities like foreshortening or occlusion of certain body parts of a person by other body parts or by objects. Even for humans, the task of reliably estimating the 3D pose is very challenging. Marinoiu et al. [70] carried out an experiment to investigate how people perceive the 3D pose in the image space and how they map this perception to 3D space. They found that humans do not significantly outperform existing computer vision approaches at reconstructing the pose in 3D space given an image captured in a laboratory setup. All these factors combined make the problem of 3D pose estimation from images very challenging.

1.1.1 Scope

We have limited the scope of our problem by making some assumptions. One of these limitations is that our system works for a single person only. If there are multiple people, it can deal with the 2D poses of at most one person. Another key assumption is that the image must contain the full body of the person. Currently, our system cannot deal with images in which only half of a person's body is visible. Another limitation of our approach is that the detected 3D pose is not invariant to scale. Therefore the skeleton size may vary based on the size of the person in the image. Below we discuss some other assumptions that we made.

3D pose relative to the root

Our system predicts the 3D location of each of the keypoints with respect to the root node, which in the case of the Human 3.6M dataset is the hip. By doing so, we are more concerned with how far apart the joints are distributed around the hip. Our focus is thus to retrieve the human pose as anthropomorphically correctly as possible, ensuring that the joints do not extend beyond their usual limits; that is, to predict the correct structure of the human pose in 3D. By predicting the 3D pose relative to the root node, we are not able to locate the absolute global position of the person in 3D.

3D pose in camera coordinate frame

Instead of predicting the 3D poses in a global coordinate space, we predict the 3D pose in the camera coordinate frame. It is very difficult for an algorithm to infer 3D joint positions in a particular global coordinate space, because a rigid body transformation of such a space does not result in any change in the input data. Therefore, the mapping from 2D joint locations, which depend on the camera viewpoint, to 3D is no longer unique in such cases. Hence, to make the prediction more consistent across different camera viewpoints, we predict the 3D joint locations in the camera coordinate frame. It also makes the learning process easier and prevents overfitting to a particular global frame.

1.1.2 Data

For quantitative analysis of our systems we used the Human 3.6M dataset [20, 51] and the HumanEva dataset [105]. However, we used the HumanEva dataset for the first system only, because the dataset is quite old and small compared to Human3.6M. Moreover, the same subjects show up in the train and test sets. But HumanEva has largely been used by the community to benchmark previous work over the last decade. For qualitative results we used the MPII dataset [5], which is a standard dataset for 2D pose estimation and does not have ground truth 3D poses.

Human3.6M [20, 51] is, to the best of our knowledge, currently the largest publicly available dataset for human 3D pose estimation. The dataset consists of 3.6 million images, featuring 7 professional actors performing 15 everyday activities such as walking, eating, sitting and making a phone call. The dataset provides 2D and 3D joint locations for each corresponding image. Each action is captured using 4 different calibrated high resolution cameras. It also has 10 different motion capture cameras and 1 time-of-flight sensor to accurately capture the motion of the actors in 3D. In addition to the 2D and 3D pose ground truth, the dataset also provides ground truth bounding boxes, the camera parameters, the body proportions of all the actors and high resolution body scans or meshes of each actor. Figure 1.4 shows an example of the data in the Human 3.6M dataset, while Figure 1.5 shows sample images from Human 3.6M, indicating the variation of the images in terms of subject, action and viewpoint.

Figure 1.4: An example of data in the Human 3.6M dataset, from left to right: RGB image, person silhouette, time-of-flight (depth) data, 3D pose data (shown using a synthetic graphics model), body surface scan. Source: [51].

Figure 1.5: Sample images from the Human 3.6M dataset, showing different subjects, poses and viewing angles. Source: [51].

On the other hand, MPII is a state-of-the-art benchmark dataset for the evaluation of 2D human pose estimation. The dataset consists of 25K images collected from YouTube videos. It contains over 40K people with annotated body joint locations.

1.2 Method Outline

In this work, we aim to analyze the sources of error in the task of 3D pose estimation. We would like to determine whether the major source of error is poor visual understanding of human pose or improper mapping from a 2-dimensional representation to 3D. We do so by developing three different systems, two of which decouple the task of 3D pose estimation and thereby predict 3D pose from 2D joint locations, while the other is trained end-to-end. Below we give a brief outline of each of the three systems.

3D Pose from 2D joint locations of a single image

Our first system takes the 2D joint locations from a single frame as input. The input to the system is simply the xy-pixel locations of a set of joints or keypoints, and the output is the 3D location of the joints in mm space with respect to a root joint, in the camera coordinate frame.

Figure 1.6: Block diagram of our first system. The building block of our network, which we call the Residual Block, is composed of a linear layer, followed by batch normalization, ReLU activation and a dropout layer, repeated twice and wrapped in a residual connection. The Residual Block can be repeated any number of times. Our best network uses two such residual blocks. The input to our system is an array of 2D joint positions, and the output is a series of joint positions in 3D.
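As a concrete reference for the block in Figure 1.6, the following is a minimal sketch in PyTorch (an assumption, since the thesis does not name a framework; the joint count of 16 is illustrative, while the width of 1024, the dropout rate of 0.5 and the two stacked blocks follow the figure):

```python
# A minimal PyTorch sketch of the block in Figure 1.6. Illustrative only: the thesis does
# not name a framework, and the joint count of 16 is an assumption.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Linear -> BatchNorm -> ReLU -> Dropout, repeated twice, wrapped in a skip connection."""
    def __init__(self, width=1024, p_drop=0.5):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(width, width), nn.BatchNorm1d(width), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(width, width), nn.BatchNorm1d(width), nn.ReLU(), nn.Dropout(p_drop),
        )

    def forward(self, x):
        return x + self.layers(x)   # residual (shortcut) connection

class Lifter2Dto3D(nn.Module):
    """Maps a flattened array of 2D joint positions to 3D joint positions."""
    def __init__(self, n_joints=16, width=1024, n_blocks=2):
        super().__init__()
        self.inp = nn.Linear(2 * n_joints, width)
        self.blocks = nn.Sequential(*[ResidualBlock(width) for _ in range(n_blocks)])
        self.out = nn.Linear(width, 3 * n_joints)

    def forward(self, pose2d):      # pose2d: (batch, 2 * n_joints)
        return self.out(self.blocks(self.inp(pose2d)))

pose3d = Lifter2Dto3D()(torch.randn(64, 32))   # -> (64, 48), i.e. 16 joints in 3D
```

The depth of the network is controlled simply by how many residual blocks are stacked between the input and output linear layers.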
Since we are dealing with a low dimensional, highly abstracted form of data, our choice of network is a multilayered fully connected network with residual or shortcut connections [44], dropout [110] and batch normalization layers [50]. Rectified Linear Units (ReLU) [79] are used as the activation function of the network. A block diagram of our first system is shown in Figure 1.6. The 2D poses given as input may come from the ground truth or may be the output of any off-the-shelf 2D pose detector.

3D Pose from a single image directly

Our second network aims to test the effectiveness of decoupling. This time we train our network end-to-end to predict 3D pose directly from a single RGB image. Our network is built upon the 2D pose detection network called stacked hourglass by Newell et al. [80]. The stacked hourglass network is a collection of hourglass networks, each of which is a fully convolutional network. We overlaid the Residual Block of our first network on top of the network by Newell et al. [80]. The hourglass part of the network is first trained for the 2D pose estimation task using images from the Human3.6M dataset [20, 51]. We performed intermediate 2D pose supervision at the end of each hourglass while training the network for 2D pose. The output of each hourglass is a heatmap for each joint. The heatmap from the last hourglass is then fed as input to our residual block to output the 3D pose. We use the weights of the pre-trained network to initialize the hourglass part and fine-tune the entire network end-to-end to predict 3D pose. During the fine-tuning step, the network is supervised both by the 3D pose and by the 2D joint heatmaps. We use the intuition of transfer learning [83] here: we try to reuse the knowledge learned by the 2D pose detector for the task of 3D pose estimation. However, empirically we found that it is difficult to train such a system end-to-end. Figure 1.7 shows our second system.

Figure 1.7: Our second model simply stacks our first model on top of the stacked-hourglass [80] 2D pose estimator. The stacked-hourglass network is first pre-trained for 2D pose estimation using images from the Human3.6M dataset [51]. The heatmap of the final hourglass is passed as an input to our residual block and the entire network is trained end-to-end.

3D Pose from 2D joint locations of a sequence of images

Our third network tries to determine whether adding temporal information from a sequence of 2D poses gives better results than the first network. For this purpose, we designed a sequence-to-sequence network [113] with Long Short-Term Memory (LSTM) blocks [48] as the building block. We also added layer normalization [6] and recurrent dropout [101] to our LSTMs. Sequence-to-sequence networks are extensively used in tasks like neural machine translation, i.e. translating a sentence from one language to another, and are hence useful for tasks where the input is a sequence of data of one type and the output is a sequence of data of a different type. Each sequence-to-sequence network has an encoder and a decoder. In our case, the encoder reads a sequence of 2D poses and encodes it into a fixed-size vector, while the decoder reads the encoded vector and predicts a sequence of 3D poses. Our network has residual connections on the decoder side. The encoder effectively encodes the sequence of 2D pose information into a fixed-size high dimensional vector, while the decoder essentially learns the perturbation of the pose from the previous frame.
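The following is a minimal PyTorch sketch of this encoder-decoder structure (again an assumption rather than the thesis implementation: a single plain LSTM layer, no layer normalization or recurrent dropout, and a hypothetical joint count), with the decoder output added to the previous 3D pose:

```python
# A minimal PyTorch sketch of the encoder-decoder idea in Figure 1.8. Illustrative only.
# The <START> token is a zero vector, the input sequence is reversed, and the decoder
# adds its prediction to the previous 3D pose (the residual connection).
import torch
import torch.nn as nn

class Seq2SeqLifter(nn.Module):
    def __init__(self, n_joints=16, hidden=1024):
        super().__init__()
        self.encoder = nn.LSTM(2 * n_joints, hidden, batch_first=True)
        self.decoder = nn.LSTMCell(3 * n_joints, hidden)
        self.readout = nn.Linear(hidden, 3 * n_joints)

    def forward(self, pose2d_seq):                        # (batch, T, 2 * n_joints)
        # Encode the reversed 2D sequence; keep only the final hidden state.
        _, (h, c) = self.encoder(torch.flip(pose2d_seq, dims=[1]))
        h, c = h[0], c[0]
        y = pose2d_seq.new_zeros(pose2d_seq.size(0), self.readout.out_features)  # <START>
        outputs = []
        for _ in range(pose2d_seq.size(1)):
            h, c = self.decoder(y, (h, c))
            y = y + self.readout(h)    # predict only the change from the previous 3D pose
            outputs.append(y)
        return torch.stack(outputs, dim=1)                # (batch, T, 3 * n_joints)

pose3d_seq = Seq2SeqLifter()(torch.randn(8, 5, 32))       # sequences of length 5
```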
The residual connections on the decoder side make it easier for the decoder to predict 3D poses, since it only has to predict the change in the 3D pose from the previous frame. We also imposed temporal constraints during training to ensure smoother output. Figure 1.8 shows our final network in detail.

Figure 1.8: Our final network. It is a sequence-to-sequence network [113] with residual connections on the decoder side. The encoder encodes the information of a sequence of 2D poses of length t in its final hidden state. The final hidden state of the encoder is used to initialize the hidden state of the decoder. The <START> symbol tells the decoder to start predicting 3D pose from the last hidden state of the encoder. Note that the input sequence is reversed, as suggested by Sutskever et al. [113]. The decoder essentially learns to predict the 3D pose at time t given the 3D pose at time t−1. The residual connections help the decoder learn the perturbation from the previous time step.

1.3 Thesis Organization

We have organized this thesis as follows. First, in Chapter 2, we review the related work and the literature: we discuss and summarize the different approaches and techniques tried over the years for solving the problem of 3D pose estimation, review some of the 2D pose estimation systems, and provide a general idea of the different types of deep networks that we have used in our systems, particularly Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). In Chapter 3, we describe our first network, which takes 2D joint locations as input and gives 3D pose as output; the network is a fully connected network with residual connections. We discuss the network architecture in detail, how we trained the system, and the experiments carried out with this network to demonstrate its effectiveness, along with the results. In Chapter 4, we discuss our second network, which was trained end-to-end from RGB images, together with the results obtained with it. Then, in Chapter 5, we describe our final network, which is our most important contribution. It is a sequence-to-sequence network with residual connections at the decoder that takes a sequence of 2D poses and predicts a sequence of 3D poses. We also include the results obtained from different experiments with this network, which show the effectiveness of using temporal information and of decoupling the task of 3D pose estimation. Finally, in Chapter 6, we highlight our main contributions and discuss possible future directions.

Chapter 2: Related Work

The problem which we are addressing is 3D human pose estimation from RGB images or sequences of images that are in 2D. The problem of perceiving depth from a two dimensional representation has been a subject of avid interest to scientists, mathematicians and artists since the Renaissance, when Brunelleschi used the mathematical concepts of linear perspective to elicit a sense of depth in his paintings of Florentine buildings.

Centuries later, a similar knowledge of perspective has been exploited in computer vision to infer quantities such as lengths, areas and distance ratios in arbitrary scenes [135]. Others have tried to use visual cues like shading [131] or texture [66] to estimate depth from an image. Recently, there has been a trend of using deep learning [30, 67, 88, 99] to estimate depth from an image.

However, one of the initial methods for depth estimation, by Roberts [94], addressed the problem in a different manner. Instead of using the knowledge of perspective or any image features, he exploited the known 3D structures of objects in a scene. Decades later, Bülthoff et al. [17] found that top-down knowledge of a familiar 3D structure is also used by humans when they perceive a human body abstracted into a set of sparse points projected onto a 2D plane. They found that the expectation about the known 3D structure of an object overrides the true stereoscopic information. This idea of being able to reason about or understand 3D human posture from a minimal representation, such as the projection of a sparse set of points on the human body onto a 2D plane, has inspired the problem of estimating 3D pose from 2D joint locations.

We have divided the related work into four different sections. In the first section we discuss different representations of 3D pose. In the second section we look into different methods for 3D pose estimation. In the third section we briefly discuss some of the 2D pose estimation techniques. Finally, we review different types of deep network architectures.

2.1 Representation of 3D pose

There are both model-based and model-free representations of 3D pose. The human body is a very complex system with highly flexible and articulated body parts. Marinoiu et al. [69] carried out experiments to investigate how people perceive human pose in the 3D space of photos and how their perception actually corresponds to the 3D space. They found that even for humans it is difficult to reliably estimate the location of joints in real 3D space given an image or a video. Hence, it is difficult to model the human body. Despite this, researchers have attempted to model the articulated 3D pose in ways that provide some prior on human body structure for an algorithm estimating 3D pose.

The most common model to represent 3D human pose is a skeleton or stick figure. The skeleton is defined by a kinematic tree over a set of joints. The kinematic tree consists of the initial location of the root joint, offsets of each joint from its parent, and rotational parameters for each joint that represent the relative rotation of the joint with respect to its parent [8, 16, 20, 51, 84, 133]. One major advantage of this model is that the resulting poses are forced to have human-like structure. Moreover, it is much easier to impose anthropometric and kinematic constraints like joint angle limits, bone lengths and limb length proportions [125]. Although most joints have 3 degrees of freedom, certain joints, such as the knee, have 1 degree of freedom due to their constrained mobility. Hence it is possible to reduce the overall dimensionality of the rotational parameters that need to be estimated.
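To illustrate how such a kinematic-tree model turns rotational parameters into joint positions, the following is a minimal forward-kinematics sketch in Python/NumPy (a made-up four-joint chain with a single rotation axis per joint; none of the numbers come from the thesis):

```python
# Minimal NumPy sketch (not from the thesis) of forward kinematics on a kinematic tree:
# fixed bone offsets plus per-joint rotations are accumulated down the tree to give 3D
# joint positions. A real skeleton has more joints and up to three degrees of freedom each.
import numpy as np

parent = [-1, 0, 1, 2]                       # hip (root) -> right hip -> knee -> ankle
offset = np.array([[0.00,  0.00, 0.0],       # bone vector of each joint in its parent's frame (m)
                   [0.13,  0.00, 0.0],
                   [0.00, -0.45, 0.0],
                   [0.00, -0.45, 0.0]])

def rot_x(a):                                # rotation about the x axis
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def forward_kinematics(angles, root_position):
    pos, rot = [root_position], [rot_x(angles[0])]
    for j in range(1, len(parent)):
        p = parent[j]
        pos.append(pos[p] + rot[p] @ offset[j])   # place the joint relative to its parent
        rot.append(rot[p] @ rot_x(angles[j]))     # compose the rotation down the chain
    return np.stack(pos)

print(forward_kinematics(np.array([0.0, 0.0, 0.5, 0.3]), np.zeros(3)))
```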
Figure 2.1shows an example skeleton with 17 joints labeled and the corresponding kinematictree for the skeleton.Another approach of modeling the 3D human pose involves learning an over-16HeadNoseNeckLeft ShoulderLeft ElbowLeft WristHip (Root Node)SpineLeft HipLeft KneeLeft AnkleRight HipRight KneeRight AnkleRight ShoulderRight ElbowRight WristHip (Root)Spine Left Hip Right HipLeft Knee Right KneeLeft Ankle Right AnkleNeckNose Left Shoulder Right ShoulderHeadLeft Elbow Right ElbowLeft Wrist Right WristFigure 2.1: (Left) A sample skeleton model with 17 joints each of them la-beled. (Right) A Kinematic tree showing the kinematic relationship be-tween the joints. The arrow downward indicates parent-child relation-ship between two joints.complete dictionary of basis poses by using dimensionality reduction techniquessuch as PCA or non-negative matrix factorization. This approach of modeling 3Dpose was introduced by Ramakrishna et al. [92]. The 3D pose is then computed as asparse linear combination of this over-complete dictionary [2, 132, 134]. However,a major issue about this model is that they can potentially lead to an invalid 3Dpose, because of the lack of any anthropometric constraints. Also, there are anumber of ways the basis poses can be combined to obtain a particular 3D pose.However, most of the recent work based on deep-learning techniques [51, 63,73, 74, 87, 116, 117] typically uses model-free representation of 3D human pose.The 3D human pose is represented as 3D locations of each joint relative to the rootnode or to its parent. Although most of the methods [63, 73, 74, 87, 116, 117] us-ing a model-free representation regress the 3D joint location directly, Pavlakos etal. [87] predicts volumetric heatmap for each joint, which gives the likelihood17of the presence of a joint in a particular 3D spatial location. On the other hand,Mehta et al. [74] predicts x,y,z location maps for each joint which gives the proba-bility of joint being at a particular x,y, or z coordinate individually. The advantageof model free representation is its simplicity and lower dimensionality comparedto model based approaches, which tend to make it work better for deep networksetting. However, because of the lack of a priori knowledge of human body struc-ture, this can lead to an invalid 3D pose, may even fail to predict human structureat all.2.2 Approaches to 3D Pose estimationThere are several streams of work for estimating 3D human pose given an image.The first of these involves extracting features from the image and learning a func-tion to map the features into 3D pose [1, 14, 15, 56, 77, 82, 107, 117]. Anotherstream of work involves using deep networks to predict 3D pose from an imagedirectly by training the network end-to-end. [63, 65, 73, 81, 85, 87, 96, 112, 116,118–120, 133]. Some work uses the 2D human pose from image and learns toback-project these 2D joint locations into 3D [2, 16, 62, 76, 90, 92, 122, 132, 134].The 2D joint locations may either be ground truth or detected from an image us-ing any 2D human pose detector. Some approaches have tried to formulate thetask of 3D pose estimation as a retrieval or similarity search problem. These tech-niques use different image features or 2D pose to lookup into a large databaseof exemplar 3D pose descriptor [22, 41, 53, 78, 103, 128]. Others have tried topredict 3D pose from a sequence of images trying to exploit the temporal infor-mation from the sequence [4, 29, 74, 117, 134]. 
Additionally, some techniquesleverage multiple views from different cameras to estimate the 3D pose, therebymaking the task much easier [3, 10, 18, 31, 86, 106]. Finally, there are a numberof approaches which uses depth images provided by RGB-D camera for 3D poseestimation [7, 102, 104, 124, 129]. The image from RGB-D camera has an extradepth channel giving depth of different objects in the image, along with the RGBchannels. With the added depth information, these methods can estimate 3D posewith a high accuracy in real time. However, the downside of RGB-D cameras isthat they have limited range and do not work well in outdoor settings.18Below we will discuss the different streams of addressing the problem of 3Dhuman pose estimation, mentioned above, in detail.2.2.1 3D Pose estimation by extracting features from single imageMost of the earlier methods of 3D pose estimation from monocular images aimed atextracting discriminative features from images. A good feature for 3D Pose estima-tion should be invariant to lighting, texture, background scenes, human skin coloretc. Agarwal and Triggs [1] encoded image silhouette shapes in a histogram-of-shape-contexts descriptor [11, 12] and used it to recover 3D pose using non-linearregression. Although silhouettes are invariant to texture and lighting, it requiresvery good segmentation of the human in the image. Mori and Malik [77] usedshape context [12] which represents a shape using a set of sample points fromthe contours of an object. They created a database of a number of exemplar 2Dviews of human body, with joints labeled, under different camera configurationand viewpoint. They used shape context matching technique [12] to match a testimage with the exemplar images and used the 2D joint locations from the exem-plar and the test shape to estimate 3D pose using the method of Taylor [115]. Boet al. [15] built an algorithm that makes learning conditional Bayesian Mixture ofExperts models [109] faster and more scalable that can handle one order magni-tude more data and is one order magnitude faster. They combined forward featureselection and bound optimization contrary to backward feature selection used inoriginal work and compared the performance of SIFT [68], histogram of shapecontexts [12] and multi-scale hyper-feature encodings [54]. Similarly, image fea-tures like Histogram-of-gradients (HOG) [25, 68] and HMAX [28] were used byBo et al. [14] to create a Twin Gaussian Process model and use Gaussian ProcessRegression to estimate the 3D Pose. Ning et al. [82] designed an image descriptorof their own called the Appearance and Position Context (APC) descriptor. Theylearned visual bag of words using unsupervised clustering and then jointly learneda distance metric for each visual word and Bayesian mixture of experts model usinglabeled image-to-pose pairs, which is then used to regress 3D pose. Simo-Serra etal. [107] proposed a method to jointly infer 2D pose and 3D pose using a Bayesianmodel which combines generative latent variables constraining the space of all pos-19sible 3D pose with 2D location of joints using HOG-based discriminative model.Kostrikov et al. 
[56] swept along each plane through 3D volume of potential 3Djoint locations and used a regression forest to predict the relative 3D position ofjoint given the hypothesized depth and then use mixture of 3D pictorial structuremodels (PSM) [34] to infer 3D pose in global coordinate space.The major drawback of these methods is that their accuracy is bounded by thediscriminative properties of the features and robustness to different factors. Mostoften, these features are not discriminative enough to give accurate estimation ofdepth. Since the advent of deep networks, feature-based techniques have lost theirpopularity because deep networks can learn sophisticated features which produceexcellent results.2.2.2 Using features to look up in a database of exemplar 3D posesSeveral methods have used the features extracted from the images to find the near-est neighbour pose from a large database of exemplar 3D poses. Shakhnarovich etal. [103] used a shape context feature vector to represent general contour shapesand use the features to learn a set of hashing functions which can be used effi-ciently look up and find the nearest-neighbor pose from a database of 3D poses.The shape context feature vector from an image is also used by Mori and Ma-lik [78] in conjunction with a kinematic chain-based deformation model to matcha stored 2D view of human body with labelled 2D pose. Once they obtain 2D pose,they use Taylor’s method [115] to estimate 3D pose. Jiang [53] also used Taylor’salgorithm [115] to generate all possible 3D pose given the 2D pose of an imagethereby forming a hypothesis pose. They used a kd-tree to find approximate near-est neighbour of these hypothesis pose from a large database of exemplar poses.Gupta et al. [41] create a large database of fixed length 2D tractories called v-trajectories using orthographic projection of unlabelled motion capture data. Theyextract dense trajectories feature from vidoes and match the video trajectories tov-trajectories using Non-linear Circular Temporary Encoding to retrieve appropri-ate motion capture data. Gupta et al. [42] extended their method [41] to retrievea portion of longer mocap sequence and temporally align them with features re-trieved from a short sequence using Dynamic Time Warping(DTW) [89]. Yasin et20al. [128] use two separate training sources. The first source is a large databaseof motion capture data which is projected onto a normalized 2D pose space usingvirtual cameras, while the second source is images with labeled 2D poses whichare used to learn pictorial structure model (PSM) [33] for 2D pose estimation. Thepredicted 2D pose from PSM [33] is used to retrieve the nearest normalized 2Dpose using kd-tree search and the final 3D pose is estimated my minimizing thereprojection error. On the other hand, Chen and Ramanan [22] used a CNN to es-timate the 2D pose from an image and then use the predicted 2D pose to match alibrary of 3D pose to estimate the depth.A major drawback of exemplar-based 3D pose estimation is that the time re-quired to match the correct 3D pose from a large database is quite high. This pro-hibits any real time implementation. Moreover, the performance of these methodslargely depends on the range of poses available in the database. It is also diffi-cult to align the retrieved 3D pose with the actual orientation of the person in theimage [42].2.2.3 Deep network trained end-to-endAs mentioned before, deep networks have become extremely popular in many com-puter vision tasks. 
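Before turning to these end-to-end models, the retrieval step shared by the exemplar-based methods of Section 2.2.2 can be made concrete. The sketch below is only illustrative: the descriptors, the toy exemplar library, and the scipy-based index are placeholder assumptions, not the pipeline of any cited method.

```python
import numpy as np
from scipy.spatial import cKDTree

# Build an index over exemplar poses. In the methods above the query key would
# be a shape-context or 2D-pose descriptor; here a flattened 2D pose stands in.
rng = np.random.default_rng(0)
exemplar_3d = rng.normal(size=(10000, 17, 3))        # toy mocap library of 3D poses
exemplar_2d = exemplar_3d[:, :, :2]                   # their (orthographic) 2D projections
tree = cKDTree(exemplar_2d.reshape(len(exemplar_2d), -1))

def retrieve_3d(query_2d, k=1):
    """Return the 3D exemplar(s) whose stored 2D descriptor is closest to the query."""
    _, idx = tree.query(query_2d.reshape(1, -1), k=k)
    return exemplar_3d[np.atleast_1d(idx).ravel()]

query = exemplar_2d[123] + 0.01 * rng.normal(size=(17, 2))  # a noisy observation
print(retrieve_3d(query).shape)                              # (1, 17, 3)
```

Even with such an index, searching a large exemplar library for every query remains costly, which is one reason learned regression models have largely displaced these methods.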
However, these models require large amount of data to succeed.It is difficult and expensive to collect motion capture data. There is still a lack ofdataset of 3D poses for people in the wild since 3D data acquisition requires spe-cial motion camera with markers and complex laboratory setup. However, since theintroduction of Human3.6M dataset [20, 51], which contains 3.6 million high res-olution images with annotated 2D and 3D data, there are a number of methods thatemploy deep networks being trained end-to-end to predict 3D pose from images.One of the earliest approaches to use deep networks was by Li et al [63]. They pro-posed a convolutional neural network (CNN) [57, 60, 61] that jointly learns to re-gresses 3D human pose and detect body parts in 2D given a monocular image. Thenetwork was initially pre-trained for body parts detection and then jointly trainedfor both tasks. Similar to [63], Park et al. [85] designed a CNN which is jointlytrained for both 3D pose regression and 2D pose estimation. They treated the 2Dpose estimation task as a classification problem for each joint where they divide the21image into n×n grids. Each grid is considered as a class for each joint. They clas-sified each joint as belonging to any of the n2 classes. Tekin et al. [116] first traineda de-noising auto-encoder [121] to learn a high-dimensional latent encoding of 3Dpose. Then they trained a CNN to map the image into latent representation learnedby the auto-encoder. Then they stacked the decoding layers of auto-encoder ontop of the CNN to regress 3D pose and fine-tuned the entire network end-to-end.Tekin et al. followed up their earlier work in [118], where they fuse latent featureslearned from images and their corresponding 2D joint heatmaps. Their networklearns when two fuse the features from the two sources. Mehta et al. [73] usedtransfer learning to transfer the knowledge learned from 2D pose estimation taskfor in-the-wild images to estimate 3D pose. They do so by first training Resnet-101 [44] for 2D pose estimation task and then used the learned weight of up tolevel 5 of ResNet-101 to build a network that outputs 3D joint locations and asan auxiliary task predict 2D heatmaps for each joint. This idea of exploiting 2Dpose ground truth information on in-the-wild images was also adopted by Sun etal. [112]. They modified Resnet-50 [44], pre-trained on ImageNet [57], to predict3D joint locations from both images with and without 3D ground truth. When the3D ground truth is missing, the depth coordinate is set to zero. Zhou et al. [133]designed a CNN which predicts the motion parameters of the kinematic tree of hu-man skeleton and then added a kinematic layer on top of it to convert the motionparameters and skeleton information into 3D joint locations. The loss is defined onthe joint location and since kinematic layer is differentiable, they could train thenetwork end-to-end. Varol et al. [120] argued that a CNN which is trained to pre-dict 3D human pose from synthetic images can effectively and accurately predict3D pose from real images. Likewise, Rogez and Schmid [96] developed a syn-thesis engine which generates synthetic images given real image and use them toaugment the database with more data. Then a CNN is trained on both real and syn-thetic data. Pavlakos et al. [87] also develops an end-to-end CNN based model topredict 3D pose. They extended the popular 2D pose detector by Newell et al. [80]called stacked-hourglass to predict volumetric heatmaps for each joint instead ofpredicting 2D heatmaps. 
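A volumetric heatmap still has to be converted back into joint coordinates. The snippet below is a generic illustration of one common read-out, a normalized expectation or "soft-argmax"; the function name, the coordinate grids and the toy volume are assumptions made for the example, not the actual procedure of [87].

```python
import numpy as np

def joint_from_volumetric_heatmap(heatmap, grid_x, grid_y, grid_z):
    """Recover one 3D joint location from a volumetric heatmap.

    heatmap: array of shape (D, H, W) with non-negative scores for one joint.
    grid_*:  1D arrays giving the coordinate of each voxel centre along
             z (depth, D), y (rows, H) and x (columns, W).
    Returns the expected (x, y, z) location under the normalized heatmap,
    which, unlike a hard argmax, is differentiable.
    """
    p = heatmap / (heatmap.sum() + 1e-9)          # normalize to a distribution
    z = (p.sum(axis=(1, 2)) * grid_z).sum()       # marginal expectation over depth
    y = (p.sum(axis=(0, 2)) * grid_y).sum()       # marginal expectation over rows
    x = (p.sum(axis=(0, 1)) * grid_x).sum()       # marginal expectation over columns
    return np.array([x, y, z])

# Toy usage: a 16x64x64 volume peaked at a single voxel.
D, H, W = 16, 64, 64
vol = np.zeros((D, H, W)); vol[4, 10, 50] = 1.0
xs = np.linspace(-1, 1, W); ys = np.linspace(-1, 1, H); zs = np.linspace(-1, 1, D)
print(joint_from_volumetric_heatmap(vol, xs, ys, zs))
```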
Their method used to be the state-of-the-art before being surpassed by our first and third networks. Tome et al. [119] also used a similar idea of extending a 2D pose estimator to reason in 3D. They extended the Convolutional Pose Machine (CPM) by Wei et al. [123], which iteratively refines 2D poses from the knowledge of the image and the estimate from the previous iteration. Tome et al. [119] modified this architecture by introducing a probabilistic 3D pose layer which lifts the predicted 2D heatmaps to a 3D pose and projects them back into the image plane to generate a set of projected pose heatmaps. The projected 2D heatmaps and the predicted 2D pose heatmaps are then fused together in a fusion layer and passed on to the next stage. The fused heatmap from the final stage is lifted into a 3D pose using the probabilistic 3D pose model, and the entire system is trained end-to-end. Nie et al. [81] separately encoded the ground truth 2D pose and image patches surrounding the joint locations into a skeleton LSTM and a patch LSTM. Both networks have a kinematic tree structure which is broadcast throughout the whole skeleton. They predict the depth by integrating the outputs from the skeleton LSTM and the patch LSTM into another LSTM which predicts the depth of each joint. Lin et al. [65] predict 3D pose from an image directly and refine it in multiple stages using LSTMs [48]. Each stage has a 2D pose module which learns a two-dimensional pose-aware feature map that encodes information about the human body pose. This feature map is passed to a feature adaptation module which gives a high-dimensional common embedding space for 2D and 3D pose. The adapted feature is concatenated with the hidden states of the LSTM and the 3D pose detection from the previous stage, and is passed as input to the LSTM of the current stage to predict the 3D pose in the current refinement stage.

Although most of these systems trained end-to-end from images generate good results for 3D pose, it is not clear whether the error stems from the visual features learned by the network or from the mapping of the 2D pose or features in 2D into 3D pose.

2.2.4 3D Pose Estimation from 2D pose

The task of inferring 3D joint locations from their 2D projections can be traced back to the classic work of Lee and Chen [62]. They showed that, given the bone lengths, the problem boils down to a binary decision tree where each split corresponds to two possible states of a joint with respect to its parent. A common approach to estimating 3D joint locations given the 2D pose is to separate the camera pose variability from the intrinsic deformation of the human body, the latter of which is modeled by learning an overcomplete dictionary of basis 3D poses from a large database of 3D human poses [2, 16, 92, 122, 132, 134]. A valid 3D pose is defined by a sparse linear combination of the bases and by transforming the points using a transformation matrix representing the camera extrinsic parameters:

S = \sum_{i=1}^{k} c_i B_i    (2.1)

Here S ∈ R^{3×p} is a set of 3D locations of p joints, B_i ∈ R^{3×p} is a basis pose and c_i is its corresponding coefficient. There are k bases in total. These approaches model the 3D-to-2D projection as a weak perspective projection, the equation of which is given below:

W = R S + T \mathbf{1}^T    (2.2)

where S ∈ R^{3×p} is the set of 3D locations of the p joints, as given by Eq. 2.1, W ∈ R^{2×p} denotes the 2D pose of the p joints, and R ∈ R^{2×3} and T ∈ R^2 are the camera rotation and translation parameters, respectively.
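As a concrete illustration of Equations 2.1 and 2.2, the sketch below builds a pose from a dictionary and projects it with a weak-perspective camera. The function names and the random toy dictionary are assumptions made for the example; in the cited methods the bases come from PCA over a mocap database.

```python
import numpy as np

def pose_from_basis(c, bases):
    """Eq. 2.1: the 3D pose S (3 x p) as a linear combination of k basis poses."""
    return np.tensordot(c, bases, axes=1)         # sum_i c_i * B_i

def weak_perspective_project(S, R, T):
    """Eq. 2.2: project a 3D pose to 2D with a weak-perspective camera.

    S: (3, p) 3D joint locations, R: (2, 3) rotation/scaling rows, T: (2,) translation.
    Returns W of shape (2, p).
    """
    return R @ S + T[:, None]                     # T is broadcast to every joint

# Toy usage with a random dictionary of k = 10 bases over p = 17 joints.
rng = np.random.default_rng(0)
bases = rng.normal(size=(10, 3, 17))              # B_1 ... B_k
c = rng.normal(size=10)                           # coefficients
R = np.array([[1.0, 0.0, 0.0],                    # orthographic example camera
              [0.0, 1.0, 0.0]])
T = np.array([0.5, -0.2])
W = weak_perspective_project(pose_from_basis(c, bases), R, T)
print(W.shape)                                    # (2, 17)
```

In practice R, T and the coefficients c are all unknown, and recovering them from an observed W is exactly the reprojection-error minimization described next.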
The coefficients of the bases and the camera extrinsic parameters are estimated by minimizing the reprojection error, which is given by the following loss function:

\underset{R, T, c}{\arg\min} \; \Big\| W - R \sum_{i=1}^{k} c_i B_i - T \mathbf{1}^T \Big\|_F^2    (2.3)

Here W ∈ R^{2×p} is the ground truth 2D locations for the p joints, and the rest of the symbols are the same as defined in Eq. 2.1 and 2.2.

Ramakrishna et al. [92] were the first to propose the idea of representing the 3D human pose as a sparse linear combination of bases and estimating the camera parameters and the coefficients of the bases by minimizing the reprojection error function. They obtained the basis poses using PCA on a database of exemplar 3D poses. Wang et al. [122] followed the same vein as [92], but instead of minimizing the L2-norm of the reprojection error, they minimized the L1-norm and imposed limb length constraints on the output pose. Akhter and Black [2] imposed a joint-angle limit constraint for certain joints after estimating the sparse coefficients and the camera extrinsics. Since rotation matrices are restricted to the set SO(3), the resulting objective function is non-convex. Zhou et al. [132] proposed a method to relax certain conditions to approximate convexity for the optimization of the rotation matrix. The method was extended by the same authors [134], where they imposed a temporal smoothness constraint during optimization. They also designed a CNN to predict 2D heatmaps for each joint, giving the likelihood of the presence of the joint at that location. When the ground truth 2D pose is not available, they used the Expectation-Maximization (EM) algorithm [26] to estimate the 3D pose from the detected heatmaps.

Bogo et al. [16] used the 2D joint heatmaps from a CNN-based 2D pose detector to predict both the 3D pose and the 3D shape of the human body. Their body model is defined as a function parameterized by coefficients of a shape prior, pose parameters defined by a kinematic tree model (see Section 2.1) and translation parameters. They minimize five different error terms: a joint-based error defined by the reprojection error under weak perspective projection, three pose priors and a shape prior. Radwan et al. [90] applied a self-occlusion reasoning step over an off-the-shelf 2D pose detector to remove noise in the 2D pose estimates. Then they projected an arbitrary 3D model onto the 2D joints and applied geometric and kinematic constraints to remove ambiguity. They then generated synthetic views using the pose distributions and applied a structure-from-motion step to estimate the appropriate depth. On the other hand, Moreno-Noguer [76] first computed an N×N distance matrix, called a Euclidean Distance Matrix (EDM), from the detected 2D pose, where N is the number of joints. They then designed a CNN-based network to estimate the Euclidean Distance Matrix for the 3D pose, and converted the predicted EDM into 3D joint locations using a Multidimensional Scaling (MDS) approach [13].

Our first and third models are inspired by the idea of decoupling the task of 3D pose estimation into 2D pose estimation using an off-the-shelf 2D pose estimator and then learning a model to map the 2D pose into 3D. We aim to analyze whether the error for 3D pose estimation stems from noisy pose detections or from lifting 2D features to 3D. We observed empirically that decoupling makes the task of 3D pose estimation much easier than training a deep network end-to-end.
We also observed that the task of lifting 2D poses into 3D can be done with very high accuracy given the ground truth 2D pose by using a simple deep network model. We believe it is difficult for a network trained end-to-end to perform well in this case, because it needs to learn to extract image features which are invariant to lighting, texture, background scenes, human skin color, etc., and at the same time lift those features in 2D space to 3D. Moreover, the lack of in-the-wild datasets for 3D pose may be another factor which makes training the networks end-to-end difficult, because of the lack of variation in the scenes.

2.2.5 Exploiting temporal information

Estimating 3D pose per frame may cause jitter because the error in pose estimation for each frame is independent of the others. A natural extension is to estimate the 3D pose over a sequence of images or a monocular video such that the poses look temporally coherent and smooth, i.e. the error is distributed smoothly over the sequence. A number of methods have tried to exploit the temporal information available over a sequence of images to achieve temporal smoothness.

Andriluka et al. [4] exploited temporal information using tracking-by-detection. They first estimated 2D poses for each frame individually. Then they associated the poses across frames using a tracking-by-detection method. The robust estimates of 2D pose over a short sequence were used to recover the 3D pose. Tekin et al. [117] exploited the motion information by first using a CNN to align successive bounding boxes such that the person always remains in the center of the bounding box. Then they concatenated the aligned images and extracted 3D HOG (histogram of gradients) features densely over the spatio-temporal volume, from which they regress the 3D pose of the central frame. They tried different techniques for regressing 3D pose and found a deep network to work best. Du et al. [29] used a height-map, estimated from the RGB image and camera calibration, together with the RGB image to regress 2D joint locations using a dual-stream CNN. From a sequence of 2D joints, they estimated the 3D pose by minimizing the reprojection error and by imposing pose-conditioned joint velocity and temporal coherence constraints during optimization. Mehta et al. [74] implemented a real-time system for 3D pose estimation which exploits temporal information from the previous frame to achieve temporal smoothness. Given an image, the bounding box at time t is estimated by tracking the bounding box and 2D joint locations of the previous frame, which is passed to a CNN to estimate 2D heatmaps and 3D location maps x, y, z for each joint. They combine the 2D and 3D pose predictions of the current frame with those of the previous frame and apply temporal filtering and smoothing to obtain the 3D pose of the current frame.

In our third model we exploit the temporal information present in a sequence of frames and would like to examine whether applying temporal constraints can improve the performance of our previous network. For monocular videos, it is intuitive to exploit the temporal information of previous frames as it can provide many important cues, e.g. a part that is occluded in one frame may be visible in the next frame, or, in our case, the 2D pose estimation of a particular frame may be more erroneous than in other frames.
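A minimal illustration of this kind of temporal smoothing, assuming per-frame 3D estimates are already available, is given below. It is a toy exponential filter, not the mechanism used by any of the methods above or by our own model.

```python
import numpy as np

def smooth_pose_sequence(poses, alpha=0.8):
    """Exponentially smooth a sequence of per-frame 3D pose estimates.

    poses: array of shape (T, J, 3) with per-frame predictions for J joints.
    alpha: weight on the running estimate; larger values give smoother but
           more sluggish output.
    """
    smoothed = np.empty_like(poses)
    smoothed[0] = poses[0]
    for t in range(1, len(poses)):
        # Blend the previous smoothed pose with the new (possibly noisy) estimate.
        smoothed[t] = alpha * smoothed[t - 1] + (1.0 - alpha) * poses[t]
    return smoothed

# Toy usage: a static pose corrupted by independent per-frame noise (jitter).
rng = np.random.default_rng(1)
truth = np.ones((50, 17, 3))
noisy = truth + 0.05 * rng.normal(size=truth.shape)
print(f"mean error raw: {np.abs(noisy - truth).mean():.4f}, "
      f"smoothed: {np.abs(smooth_pose_sequence(noisy) - truth).mean():.4f}")
```

Our third model learns this kind of temporal consistency from data rather than applying a fixed filter.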
We expect that the temporal information will distributethe error in pose estimation smoothly over the sequence reducing jitter and overallimprovement in results.2.2.6 Exploiting multiple viewsAs discussed previously, acquiring motion capture data requires a complex labora-tory setup and is expensive. It requires markers, multiple motion capture cameraand multiple high resolution RGB cameras. The motivation of using multiple viewsof different cameras for 3D pose estimation is to make the data acquisition processcheaper so that it does not require motion capture cameras or markers to be placedon subject’s body and that the data can be acquired even in the outdoors. The ad-ditional views should intuitively make the task of 3D pose estimation easier sincecertain body parts in one view may be self-occluded in one view but visible clearlyin another view.A number of works have proposed using multiple cameras to estimate 3D pose.Sigal et al. [106] modeled human body as a collection of loosely-connected bodyparts in an undirected graphical model where the nodes represent body parts andedges represent a kinematic relationship between them. They imposed kinematicand penetration constraints using statistical models learned from motion capturedata and use Particle Message Passing (PAMPAS) [52], a type of particle filterthat can be applied over a graph containing loops, to infer 3D pose and motionfrom multi-view images with a set of calibrated camera. Amin et al. [3] extendspictorial structures model for 2D pose estimation to a multi-view model which per-forms joint reasoning over 2D poses from multiple view to estimate the 3D pose.27The same idea of using a multi-view pictorial structure for 3D pose estimation wasused by Burenius et al. [18]. They additionally imposed view, skeleton, joint angleand intersection constraints. 3D multi-view pictorial structures was also used byBelagiannis et al. [10]. The used geometric constraints of triangulation of bodyjoints from multiple views to estimate the 3D pose. On the other hand, Elhayek etal. [31] used a CNN-based network to estimate unary potentials for each joint ofa kinematic tree model of skeleton which are used to extract pose constraints byprobabilistically sampling from a pose posterior model. They combined the sam-pled constraints with an appearance-based similarity term and to track the articu-lated joint angles from multiple views. Pavlakos et al. [86] used the CNN-basedstacked-hourglass model for 2D pose estimation to estimate 2D pose from multipleviews and combined them using 3D pictorial structure model to obtain a volumetricheatmap of 3D joint uncertainties.2.2.7 Exploiting depth informationWith the availability of RGB-D cameras like Microsoft Kinect, a number of sys-tems tried to exploit the additional depth information along with the RGB image.Wei et al. [124] formulated the 3D pose estimation problem as a registration prob-lem in Maximum A Posteriori (MAP) estsimation framework. They integrated thedepth data, person silhouette, full-body geometry, temporal pose prior and occlu-sion reasoning in a unified MAP estimation framework and combine 3D trackingwith 3D pose estimation. Baak et al. [7] combined local optimization and globalretrieval methods to build a robust 3D pose estimator. They used a variant of Djjk-stra’s algorithm to extract pose features from depth channel and later fused the lo-cal and global pose estimates using sparse Hausdoff distance. Shotton et al. 
[104]modeled 3D pose estimation problem as a per pixel classification problem whichclassifies the pixels as belonging to a specific body part. They used depth compar-ison features from depth image and used random forest classifier to classify eachpixel and generated a confidence-scored 3D proposal for different body joints byreprojecting the classification results and finding local modes. Ye and Yang [129]embedded articulated deformation model with exponential-map parameters into aGausian Mixture model for the task of 3D pose estimation. They also developed a28shape adaptation algorithm using the same probabilistic model used for pose esti-mation. Shafaei and Little [102] used multiple views from multiple depth cameras.They applied image segmentation to depth images and used curriculum learningto train their pose estimation system on synthetic data. The 3D joint locations arerecovered by combining information from multiple views in real time. Although,depth information from depth cameras can give us valuable cue for 3D pose esti-mation, one major drawback of depth cameras is that it works poorly in outdoorsettings.2.3 2D pose estimation techniquesSince this work concentrates on analyzing the effectiveness of decouplng the taskof 3D pose estimation into first estimating 2D pose from an image and then liftingthe 2D pose into 3D, we will discuss some of the techniques for 2D pose estimation.The task of 2D pose estimation is defined as localizing a number joints or key-points in an image.One of the most popular 2D pose estimation technique before the advent ofdeep network-based estimators was by Yang and Ramanan [127]. They describedthe articulated human pose as a flexible mixture of non-oriented pictorial structureand augmented classic spring models with the co-occurrence constraints so thatthey can capture the contextual co-occurrence and spatial relationship betweendifferent parts. Such constraints help to impose notions of local rigidity. Theyembedded the co-occurrence contraints and spatial relationship between differentparts into a tree relational graph and optimize the entire model using dynamic pro-gramming. Following the success of deep networks in computer vision, many ap-proaches decided to leverage the deep learning techniques to estimate the 2D pose.Wei et al. [123] developed a CNN-based 2D pose estimation framework, calledConvolutional Pose Machine (CPM), which predicts 2D belief maps for each joint,giving the likelihood of the presence of that joint at a particular spatial location,and refines the belief over multiple stages. Each stage of pose estimation techniquetakes the image and the belief map from previous stage as input and generates arefined belief map. Cao et al. [19] used a similar CNN architecture as the CPM,refining 2D pose estimation in multiple stages. However, they extended it for pose29estimation of multiple people. They defined a non-parametric representation calledPart Affinity Fields(PAFs) to associate body parts with the individuals present inthe image. Each stage of the frame work has two branches, one branch predictsPAFs and the other branch predicts part confident maps, both of which are passedto next stage of the framework for refinement. Once the part locations are learned,the parts belonging to a particular individual are associated by using Hungarianmethod [58, 59], which is a bipartite graph matching algorithm. Newell et al. 
[80]came up with a fully convolutional network for 2D pose estimation which com-putes features at different scales and consolidate the features to capture the spatialrelationships of different joints in human body. In an hourglass module, bottom-up and top-down processing of the features takes place through successive stepsof pooling and up-sampling to predict a 2D heatmap for each joint. They namedtheir method stacked-hourglass because they stacked multiple hourglass modulesend-to-end. The perform intermediate 2D pose supervision at the end of each hour-glass. This repeated bottom-up and top-down inference helps to refine the 2D poseheatmaps in the final hourglass. He et al. [45] extended Faster R-CNN network [93]by Ren et al. which is used for finding region proposals or to localize objects inan image. They added a branch for predicting segmentation mask for an object inconjunction with object classification and bounding box regression. Their methodcan also be used for pose estimation of multiple people by training K differentmasks for each of K key-points where each mask is treated as a one-hot binarymask where only one pixel is labeled as a foreground.2.4 Deep NetworksAll of our three methods are based on deep networks. In our first model, we usea fully connected feed forward neural network with residual connections. Oursecond model overlays our first network over a Convolutional Neural Network for2D pose estimation. Finally our third network is a sequence-to-sequence networkwhere the building blocks are Long Short Term Memory Units (LSTMs). We willreview each of this networks briefly in this section.302.4.1 Biological motivationArtificial Neural Networks (ANNs) were originally inspired from biological neuralconnectivity in human brain. Analogous to the neurons and the interconnection ofneurons in the brain, an ANN is composed of a number of connected units calledartificial neurons. In a biological nervous system, neurons communicate with eachother by propagating electrical impulses through connections called synapses. Bi-ological neurons tend to have a threshold value such that if the magnitude of allthe impulses from different neurons exceed the threshold, the neuron would prop-agate the signal forward or else will not send the signal at all. This phenomenonis typically known as activating a neuron. The signal may get amplified or at-tenuated when it is passed through synapses from one neuron to another. Similarto the biological neural connections, artificial neurons have weighted connectionswith other neurons which may amplify or dampen the strength of the signal as itis being passed through the connection. The signal received by an artificial neuronis therefore a linear combination of different signals propagated from the neuronsconnected to it. Each neuron has an activation function which determines whetherthe neuron receiving the signal would fire or not. This adds non-linearity to theotherwise linear transformations. Typically the artificial neurons are arranged inmultiple layers: an input layer, several hidden layers, and an output layer. Neu-rons belonging to a particular layer cannot be connected to a neuron in the samelayer. 
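The computation just described, a weighted sum of incoming signals passed through an activation at each layer, can be written in a few lines. The toy sketch below is purely illustrative (random weights, arbitrary layer sizes, and the same activation used at the output for simplicity); it is not the network used in this thesis.

```python
import numpy as np

def relu(a):
    # Activation function: passes positive signals, suppresses negative ones.
    return np.maximum(0.0, a)

def forward(x, layers):
    """Forward pass through a small fully connected network.

    x:      (d,) input vector.
    layers: list of (W, b) pairs; each layer computes relu(W @ x + b),
            i.e. a weighted combination of the incoming signals followed
            by the activation threshold described above.
    """
    for W, b in layers:
        x = relu(W @ x + b)
    return x

# Toy usage: 4 inputs -> 8 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 4)), np.zeros(8)),
          (rng.normal(size=(2, 8)), np.zeros(2))]
print(forward(rng.normal(size=4), layers))
```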
Figure 2.2 shows an example of a fully connected Artificial Neural Network.The motivation of building such a network of artificial neurons was to mimic thefunctionality of human brain and how humans use their brain to solve a problem.Although artificial neural networks were initially developed keeping human brainin mind, over time, due to practical reasons, the researches had to deviate frombiological motivation such as using backpropagation during training the network.2.4.2 History of Neural NetworksThe idea of building a computational model with artificial neurons mimicking thebehavior of human neurons using mathematics and threshold logic was first pro-posed by McCulloch and Pitts [72] back in 1943. However the technologicallimitation did not allow them to progress much further. Farley and Clark [32]31Figure 2.2: A Fully Connected Neural Network consisting of an Input Layer,one hidden layer and an output layer. The connections between eachneuron is shown with an arrow. Each connection has a particular weightwhich is learned over time from training data using backpropagation.Each neuron also has an activation function which defines a thresholdfor the neuron to fire.and Rochester et al. [95] were the first research groups to perform computationalsimulations of neural networks. In 1958, Roseblatt [97] came up with the singlelayer Perceptron algorithm, a supervised algorithm for binary classification whichhad a hidden layer or association layer to map a given input to a random outputunit. In 1969, Minsky and Papert [75] discovered two crucial issues regardingneural networks. First, the basic perceptrons were not able to handle exclusive-or(XOR) circuit and second, the lack of processing power at that time for comput-ing large neural networks. This slowed down the research in neural networks forsome time until the computational powers were large enough to handle neural net-work processing. However, the discovery of backpropagation algorithm by PaulWerbos [126], the XOR issue was solved, thus speeding up the training of multi-layered neural networks. This rekindled the interest in research on neural networks,although progress was still slow. As parallel distributed processing became popularin the mid ’80s, Rumelhart and McClelland [98] described using parallel process-32ing to simulate neural networks. Throughout the ’80s and ’90s, simpler methodslike Support Vector Machines (SVM), linear classifiers, random-forests dominatedthe machine learning paradigm overshadowing the popularity of neural networks.The vanishing gradient problem was a major issue for training multi-layered neu-ral networks, when gradients tend to shrink to zero as the error is backpropagatedover multiple layers. Schmidhuber [100] proposed a work-around for the vanishinggradient problem. He proposed to pre-train each layer at a time by unsupervisedlearning and then fine-tune the entire network end-to-end through backpropaga-tion. On the other hand, Behnke [9] came up with an algorithm called RProp orresilient backpropagation. It only considers the sign of the gradient during back-propagation. In 2005, Steinkrau et al. [111] were the first group of researchersto implement a two layered fully connected network on GPU. Shortly after them,Chellapilla et al. [21] showed that GPUs can also be used to accelerate the train-ing of CNNs. 
However, when NVIDIA released the general purpose GPUs andCUDA programming language platform in 2007, it enabled programmers to writeprograms in standard programming languages like C or python and execute anyarbitrary codes on the GPUs. This was the major breakthrough for neural net-works as it opened the floodgates for large number of researchers to train reallydeep multi-layered networks on the GPUs without worrying about the vanishinggradient problem. In 2009, Raina et al. [91] used the CUDA platform to show thatthe Deep Belief Networks (DBN) [47] can be trained 70 times faster on GPUs overmulti-core CPUs. Similarly, Ciresan et al. [24] showed that multi-layered feed for-ward networks can be trained efficiently and extremely fast on the GPUs by usingsimple backpropagation with a low error rate. However, it was the Imagenet classi-fication by Krizhevsky et al. [57] which popularized the use of deep networks in thefield of Computer Vision. Their network became known as AlexNet, named afterAlex Krizhevsky. They achieved an error percentage of 16% and it was after thispaper the classification error rate for Imagenet Competition decreased dramaticallyto merely 2% now. The Imagenet project is a large database of images designed forvisual object recognition and localization task in 2009 by Deng et al. [27] and since2010 an annual competition called the ImageNet Large Scale Visual RecognitionChallenge (ILSVRC) is arranged.332.4.3 Convolutional Neural NetworksA Convolutional Neural Network (CNN) is a type of deep and feed forward neuralnetwork which are typically used for visual analysis of images for tasks like ob-ject classification, object localization, segmentation. The hidden layers of a CNNcan be composed of convolutional layers, pooling layers or fully connected layers.CNNs are suitable to be applied on images for computer vision tasks because thelearned weights of the convolutional layers act as convolution masks which peoplewould have hand-engineered otherwise for processing the image. Hence CNNs re-quire little pre-processing of input data. The fully connected network is not idealfor learning features from images because if each pixel in an image or each neuronin a volumetric input is fully connected to the neurons in the hidden layers, it wouldresult in a very large number of parameters which may cause several problems likethe vanishing gradient problem or overfitting to training data.The concept of convolutional layers stems from the work of Hubel and Wieselin 1968 [49], who showed how the neurons in the visual cortexes of monkeysrespond individually to small regions in their field of view. The portion of area ofthe visual field that triggers a particular neuron in the visual cortex is known asreceptive field. Hubel and Wiesel [49] found that when the eyes are still, visualcells within an small patch of the retina share similar and overlapping receptivefields. They found that the monkey brain has two types of visual cells:• simple cells: Sensitive to edges of different orientations.• complex cells: Have larger receptive fields and is responsible for understand-ing contextual information.Simply put, the learned weights of each convolutional layer performs a convo-lution operation on the input volume and outputs another volume. A conolutionallayer arranges its neurons in a volume. Each neuron of a convolutional layer isconnected to a local spatial region of the input volume instead of being fully con-nected to the neurons of the previous layer. 
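The sliding local computation this describes can be sketched directly. The naive single-channel loop below, with a made-up kernel, is meant only to illustrate the receptive field and weight sharing, not an efficient implementation.

```python
import numpy as np

def conv2d_single_channel(image, kernel):
    """Naive 'valid' 2D convolution (really cross-correlation, as in CNNs).

    Each output value is computed from a small local patch of the input -
    the receptive field - multiplied elementwise by the shared kernel
    weights and summed, with the window sliding over the image at stride 1.
    """
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]   # local receptive field
            out[i, j] = np.sum(patch * kernel)  # shared weights at every position
    return out

# Toy usage: a 5x5 image and a 3x3 edge-like kernel.
img = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0, -1.0]] * 3)
print(conv2d_single_channel(img, kernel))
```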
However, the neurons are fully connected along the depth of the volume. The spatial area of the region connected to a neuron is known as its receptive field, analogous to the receptive fields of biological visual cells.

Figure 2.3: A convolutional layer having a depth column of 5, i.e. 5 neurons are connected to the same spatial region, and a filter size or receptive field size of 5×5.

There are several hyper-parameters of a convolutional layer. We list them below:

• Filter size, or size of the receptive field for a neuron.
• Number of filters, or neurons connected to the same spatial region of the input, called the depth column.
• Stride by which we want to slide the filter, hence controlling the distance between the depth columns.

Even if we connect each neuron to a local spatial region, the number of learnable parameters is still considerably high. Hence, to reduce the number of parameters and make convolutional layers act like image convolution, the neurons within the same depth slice are made to share a single set of weights. Hence, a single forward pass means convolving the input with the learned weights for each slice. Figure 2.3 shows an example of a convolutional layer of depth 5, with filter size 5×5, applied on a 32×32 image.

Another key ingredient of a CNN is the pooling layer, which is periodically inserted after convolutional layers. Pooling layers combine the outputs of a group of neighboring neurons from the previous layer into a single neuron in the next layer, hence performing spatial downsampling. Pooling layers help to reduce the number of parameters and overfitting. Most commonly, pooling is done by max-pooling, i.e. by taking the maximum value from a cluster of neighboring neurons from the previous layer. Besides max-pooling, average pooling is also common. Figure 2.4 shows an example of a max-pooling layer.

Figure 2.4: A 2×2 max-pooling layer with a stride of 2.

One of the earliest and pioneering deep convolutional networks, LeNet-5, was designed by LeCun et al. [60] for handwritten digit recognition. LeNet-5 had two convolutional and two pooling layers followed by three fully connected layers for classifying handwritten digits from a 32×32 image. However, as discussed before, the real breakthrough for convolutional neural networks came after NVIDIA opened their CUDA platform, which allowed GPU implementations of neural networks, and with the release of AlexNet [57] by Krizhevsky et al. the popularity of convolutional neural networks in computer vision burgeoned. Aided by GPU implementations, convolutional neural networks got deeper and deeper, giving unprecedented performance particularly in the area of object recognition and localization. While AlexNet had a depth of only 8 layers, other popular networks like the VGG ConvNet [108] by Simonyan and Zisserman, released in 2014, had 19 layers, GoogleNet/InceptionNet [114], released in 2015 by Szegedy et al., had 100 layers, and ResNet [44], released in 2016 by He et al., has 152 layers. The inception module introduced by Szegedy et al. in their GoogleNet [114] allowed them to design a deeper and wider network. The module performs four different operations on the input in parallel and concatenates the output features.
Each of the branches performs a 1×1 convolution; this is followed by a 3×3 convolution in the second branch and a 5×5 convolution in the third, while a 3×3 max-pooling precedes the 1×1 convolution in the fourth branch. Because the computation is reduced by the 1×1 convolutions before the more expensive 3×3 and 5×5 convolutions are performed, they could design a deeper network without a significant increase in the number of parameters. On the other hand, He et al. [44], in their ResNet, stacked multiple bottleneck blocks with a residual or shortcut connection between each block. Within each bottleneck block, they stacked three convolutional layers of size 1×1, 3×3 and 1×1 successively. The 1×1 convolutional layers are used for altering the number of depth columns. If we consider H(x) to be the desired mapping of input data x for a particular block, the block now has to fit a mapping F(x) = H(x) − x instead of H(x). The authors hypothesized that it is easier to optimize a network with this residual mapping than one without it. The residual connections in our first and third networks are motivated by this work.

2.4.4 Recurrent Neural Networks

A Recurrent Neural Network (RNN) is a deep neural network with loops allowing it to store long-term information. The looping structure allows them to exploit previous computations to compute present information, thereby making them suitable for sequential data. Figure 2.5 shows an RNN being unrolled into a full network. From the figure, we can observe that RNNs are essentially copies of the same network with each unit connected to the next. Messages are passed from each layer to the next, forming a long chain-like network. In the figure, x_t denotes the input at time step t, s_t denotes the hidden state at time t, calculated by s_t = f(U x_t + W s_{t-1}), where f is a non-linear function (typically a ReLU [79] or hyperbolic tangent), and o_t is the output at time step t. U, V and W are weights or parameters which are shared across the network and are learnt during training. Although theoretically RNNs were designed to handle long-term dependencies among data, in practice they can only deal with recent information because of the vanishing gradient problem.

Figure 2.5: An RNN unrolled into a full network.

Long Short-Term Memory (LSTM)

The most commonly used LSTM structure in the current literature was proposed by Graves and Schmidhuber [40]. They incorporated changes made by Gers et al. [36] and Gers and Schmidhuber [35] into the original LSTM architecture and proposed full error backpropagation training. We will refer to the architecture proposed by Graves and Schmidhuber [40] as the vanilla LSTM.

Each memory block of the recurrent hidden layer of the vanilla LSTM contains memory cells with self connections capable of storing the temporal state of the network. These cells are regulated by special multiplicative units called gates. There are three types of gates: an input gate, an output gate and a forget gate. Gates are typically sigmoid functions which regulate how much information should be let through. An input gate controls the flow of input activations into the cell, and an output gate regulates the output flow of cell activations into the rest of the network. The original LSTM by Hochreiter and Schmidhuber [48] did not contain forget gates and could not process continuous input streams. To address this issue, Gers et al. [36] introduced the forget gate.
The forget gate scales the internal state of the cell, thereby allowing each cell to reset or forget its memory. Forget gates give LSTMs the flexibility of deciding when to drop information and how long to store it. A further modification was proposed by Gers and Schmidhuber [35], who argued that regulation of the gates was necessary to learn precise timings. Hence, they proposed to include peephole connections from the internal cells to the gates of the same cell, and omitted the output activation function. The vanilla LSTM includes all these modifications and introduced full backpropagation through time (BPTT) training for LSTM networks. In the original LSTM, backpropagation was truncated after one timestep, because the authors felt that long-term dependencies would be dealt with by the memory blocks, and not by the flow of the backpropagated error gradient. The vanilla LSTM simplifies the training and implementation of LSTMs by performing full error backpropagation. Figure 2.6 shows the difference between an LSTM node and a simple RNN node. We can observe the three gates (input, output and forget), all controlled by sigmoid functions, a block input, a cell known as the Constant Error Carousel which continuously feeds the error back to each of the gates until they become trained to cut off the value, an output activation function and peephole connections. Like recurrent networks, the output of a block is connected back to the block input and all of the gates.

The vector formulas for the forward pass are given below:

z_t = g(W_z x_t + R_z y_{t-1} + b_z)                      (block input)
i_t = σ(W_i x_t + R_i y_{t-1} + p_i ⊙ c_{t-1} + b_i)      (input gate)
f_t = σ(W_f x_t + R_f y_{t-1} + p_f ⊙ c_{t-1} + b_f)      (forget gate)
c_t = i_t ⊙ z_t + f_t ⊙ c_{t-1}                           (cell state)
o_t = σ(W_o x_t + R_o y_{t-1} + p_o ⊙ c_t + b_o)          (output gate)
y_t = o_t ⊙ h(c_t)                                        (block output)

Figure 2.6: (Left) Diagram of a simple RNN unit. (Right) Diagram showing an LSTM block.

In the equations, x_t and y_t denote the input and block output vectors, respectively, at time t. The W are rectangular input weight matrices; there are four different sets of W, one for each gate and one for the block input. The R are square matrices for the recurrent weights; similar to W, there are four different sets of R. The vectors p are peephole weight vectors and b are bias vectors. The functions σ, g and h are non-linear activation functions: sigmoid functions are used for the gates and hyperbolic tangent functions are used for the block input and output. ⊙ indicates element-wise multiplication between two vectors.

In 2014, Cho et al. [23] proposed a simplified version of the LSTM called the Gated Recurrent Unit (GRU) for the task of phrase-based Statistical Machine Translation (SMT). Their architecture consisted of two RNNs: one for encoding a variable-length source sequence into a fixed-length vector and the other for decoding it back into a variable-length target sequence. Their simplified architecture did not have any peephole connections or output activation functions. They combined the forget gate and the input gate into an update gate.
They also combined the cell state and the hidden state. Their output gate is called a reset gate, which applies a sigmoid function over the recurrent connections to the block input. The reset gate specifies whether the current hidden state should ignore the previous hidden state or not. If it is set to zero, the hidden state is updated from the current input block only. The update gate controls how much information from the previous hidden state will be carried over to the current state when the reset gate is closed.

Like other deep networks, a complex LSTM model may result in overfitting. Applying regularization efficiently to RNNs proved to be a challenging task until Zaremba et al. [130] showed how dropout can be used in LSTMs to reduce overfitting. The authors applied dropout to the non-recurrent connections of multi-layer RNNs so that it corrupts the information carried by the units, resulting in more robust intermediate computations. At the same time, the architecture allows the units to remember information that occurred many time steps back.

Sequence-to-sequence Network

Sutskever et al. [113] came up with the idea of the sequence-to-sequence network for translating English sentences into French. Sequence-to-sequence networks are convenient for tasks where the input and output have different sequence lengths, e.g. machine translation. Sutskever et al. used LSTM units to read the input sequence and encode it into a fixed-dimensional vector representation. A second set of LSTM units was then used to decode the vector into the output sequence. The decoder LSTM units maximize the conditional probability of the output sequence given the input sequence. They found an improvement in their results when the order of the words in the input sequence was reversed. Our final model is inspired by the sequence-to-sequence network and the machine translation task. In our case, the input is a sequence of 2D poses and the output is a sequence of 3D poses of the same length as the input.

Chapter 3
3D pose from 2D pose

Our first model aims to analyze the effectiveness of breaking up the task of 3D pose estimation into two parts: i) obtaining the 2D pose using an off-the-shelf 2D pose estimator, and ii) learning a mapping from 2D pose to 3D pose. As mentioned before, this helps us find out whether it is more difficult to estimate 3D pose directly from an image in an end-to-end framework than to estimate it from 2D poses.

For the purpose of 3D pose estimation from the 2D joint locations of an image, we have designed a simple multi-layered fully-connected network. Since our network only takes the 2D coordinates of joint locations as input, its input is much smaller in dimension than an image. Hence, we can afford to use multiple layers of fully-connected neurons. We have used a residual or shortcut connection after every two fully connected layers, as inspired by He et al. [44], who used shortcut connections to build a deep convolutional network. Additionally, we used dropout [110] and batch normalization [50] layers after each hidden layer and used Rectified Linear Units (ReLU) [79] as the activation function.

3.1 Loss Function

Our goal is to estimate the body joint locations in 3D space given the joint locations in 2D. In other words, the input to our system is a set of 2D joint locations x ∈ R^{2n} and our output is a set of 3D joint locations y ∈ R^{3n}. Our network learns a mapping f(x) → y; x ∈ R^{2n}, y ∈ R^{3n}. We use the Mean Squared Error (MSE) of the 3D joint locations over a set of N poses as our loss function, given by

\mathcal{L}(f(x), y) = \min_{f(x)} \frac{1}{N} \sum_{i=1}^{N} \big\| f(x_i) - y_i \big\|_2^2    (3.1)
Here f(x_i) is the predicted 3D pose for the i-th 2D pose and y_i is the corresponding ground truth 3D pose. The input 2D pose x_i may be obtained from the joint detections of a 2D pose detector or from the ground truth. We have experimented both with the ground truth 2D joint locations and with the detections from a 2D pose detector, the stacked-hourglass network by Newell et al. [80], which predicts the 2D locations of 16 joints in an image, namely: central hip, spine, neck, head, and both left and right joints for hip, knee, ankle, shoulder, elbow and wrist. We map the 2D locations of these 16 joints into 3D using our deep network. The Human3.6M skeleton has 17 joints, the nose being the extra one; we had to drop the nose joint because the stacked-hourglass network does not predict it. We predict the 3D joint locations with respect to the root node, the central hip, which is common in the literature. However, instead of predicting the 3D pose in an arbitrary global coordinate space, we predict it in the coordinate space of the camera, i.e. how the camera is looking at the 3D pose.

3.2 Network design

Figure 1.6 shows a diagram with the basic building blocks of our architecture. The key component of our network is the residual block depicted in the diagram. First we project the input into a higher dimension using a fully connected linear layer. Then, after applying dropout [110] and batch normalization [50], we pass it to our residual block. Each unit of our residual block consists of two fully connected layers with dropout and batch normalization layers in between. There is a shortcut or residual connection from the input of the residual block to the output of the block. In most of our experiments, we have used two units of residual blocks. Finally, we project the output from the second residual block down to a 48-dimensional vector, which corresponds to the 3D locations of 16 joints with respect to the root node, which is always set at (0, 0, 0). Overall, our network has 6 fully connected linear layers and approximately 4-5 million trainable parameters.

Our model benefits from recent improvements in the optimization of deep networks, courtesy of the deep convolutional networks submitted to the Imagenet Challenge [27, 57]. The contributions applied by those authors in the context of deep networks also help our fully connected model to better generalize on our 2D-to-3D pose mapping task. Below we discuss the contribution of each module in our network and elaborate on our design choices.

3.2.1 Mapping 2D pose to 3D

We chose to use 2D and 3D locations of joints as inputs and outputs, instead of inferring 3D pose from images directly by training the model end-to-end as many of the recent techniques did [63, 65, 73, 81, 85, 87, 96, 112, 116, 119, 120, 133], because we wanted to validate the efficiency of dividing the 3D pose estimation task. Some decoupled approaches [134] have used 2D probability distributions or 2D joint heatmaps from 2D pose estimators as inputs. However, the 2D joint locations have a much smaller dimensionality than the heatmaps, which enabled us to store the entire Human3.6M dataset on the GPU while training the network, massively reducing the training time. Because our network can be trained very fast (approximately 5 ms per batch of 64), we can experiment with network design and training hyper-parameters. As we have mentioned in Chapter 2, different models of 3D pose estimation have represented the 3D pose output in different ways, e.g.
3D probabilities or volumetric heatmaps of joints [87], 3D motion parameters [133] or coefficients of basis poses [2, 16, 92, 132, 134]. However, our network predicts the 3D joint locations with respect to the root node, which is a simple and model-free representation of 3D pose. This simplifies the task to estimating the offsets of each joint from the root joint instead of having to predict the absolute coordinates of each joint, because it is more difficult to find meaningful spatial relationships between the joints if absolute coordinates are predicted.

3.2.2 Fully connected layers with ReLU activation

Since our input consists of 2D joint locations, it is low-dimensional compared to images or 2D joint heatmaps, and hence there is no need for convolutional layers. Therefore we can use fully connected linear layers, which are computationally less expensive than applying convolutions. We use the Rectified Linear Unit (ReLU) [79] as the activation function because it has been found effective in decreasing the possibility of vanishing gradient problems. The ReLU [79] is defined as y = max(0, a), where a = Wx + b. Hence the gradient of the ReLU is

dy/da = 0 if a < 0,   1 if a > 0,   undefined if a = 0.

Hence, even if the value of a is very high, the gradient is 1. Therefore, ReLU gradients do not shrink, which reduces the chance of vanishing gradients. The constant gradient of ReLUs also makes the learning process faster. Another advantage of using ReLU units is the sparsity of gradients when a < 0. However, since the gradient of the ReLU is undefined at 0, a small value ε can be added to a when a = 0.

3.2.3 Residual or shortcut connections

The idea of residual or shortcut connections was proposed by He et al. [44]. Residual connections allowed them to build a convolutional neural network which is 152 layers deep. They hypothesized that it is easier for the network to learn the residual mapping than one without residual connections. One reason for this could be that the block of the network connected by a residual connection only needs to learn the amount of change from the input to obtain the desired mapping, instead of having to learn the mapping directly. We also found residual connections to be highly effective in generalizing to new data and reducing test time. We added shortcut connections after every two fully connected layers. In our case, the connections have helped us to reduce the error by approximately 10%.

3.2.4 Regularization with batch normalization, dropout and max-norm constraint

Batch normalization was proposed by Ioffe and Szegedy [50] in 2015. One major issue of deep networks is that the distribution of features at each hidden layer changes many times during training as the parameters of the previous layers change
The advantages of batch normalization are many. It makes conver-gence of the optimization function quicker by allowing larger learning rate makingthe overall training faster. It also makes the network less sensitive to parameterinitialization and since it normalizes the inputs to the activation function it reducesthe vanishing gradient problem. Additionally it regularizes the network because ofthe noise in population statistics estimation, hence gives better generalization.Dropout proposed by Srivastava et al. [110] is another method for regularizingdeep networks. Dropout works by randomly dropping out or ignoring individualneurons at every layer with a probability of p (keeps a neuron with probability 1-p)during training time by removing all the incoming and outgoing connections fromthe dropped neuron, resulting in a reduced network.We have also found batch normalization and dropout to be effective particularlywhen we trained our network with noisy 2D pose estimates from the detectors.Without batch normalization our network does not generalize well for noisy 2Dpose estimates. Adding both batch normalization and dropout help our network togeneralize better for test data decreasing the overall test error of our network witha minute increase in training time.In addition to batch norm and dropout we also added a constraint on the weightsof each layer of the network so that their maximum norm is always less than orequal to 1. We observed that it makes our model robust to noise and improvesgeneralization.463.3 Data PreprocessingWe normalized both our 2D pose inputs and 3D pose ground truth by subtractingthe mean and dividing it by the standard deviation. Since we predict the 3D jointlocations relative to the root node and do not predict the global position of the rootnode, we zero-center the 3D poses around the hip-joint, the root node. This is inline with the standard protocol of Human3.6M and the previous work.3.3.1 Camera coordinate frameA key factor of our system is predicting the 3D pose in the camera coordinate frameinstead of an arbitrary global frame. Intuitively it is difficult for any model to learnthe mapping from a 2D pose at a particular view to any arbitrary coordinate spacesince it captures no information of the view and any random amount rotation ortranslation to the arbitrary space would yield in no change in the input. Predicting3D pose in a fixed global frame causes the multiple views of the same 2D pose mapto the same output. This reduces variance in the training data, making it harder forthe network to learn the mapping and causes overfitting. A direct consequence ofpredicting in arbitrary global coordinate frame is the failure to capture the globalorientation of the person leading to higher errors in all the joints. There are anumber of works that have predicted 3D pose in camera coordinate frame [29, 64,87, 117, 133, 134].By predicting 3D pose in the same camera frame as 2D pose, we get a greatervariability of training data per camera view. Therefore to make our network predict3D poses in camera space, we rotate and translate the 3D ground truth, in globalcoordinate frame, by applying inverse transform of the camera based on its extrin-sic parameters. It should be noted that we do not use any ground truth cameraparameters at test time. The network learns itself to correctly map the 2D pose ina particular view to its corresponding 3D space in the same view.3.3.2 2D detectionsWe used the the state-of-the-art 2D pose estimator called stacked hourglass networkby Newell et al. 
3.3.2 2D detections

We used the state-of-the-art 2D pose estimator, the stacked hourglass network by Newell et al. [80], trained on the MPII [5] dataset, to obtain 2D pose detections. The MPII dataset is a standard dataset for the task of 2D pose estimation, containing over 25K images of a wide variety of scenes with more than 40K people. To obtain the detections, we first used the bounding box ground truth provided with the Human3.6M dataset to estimate the center of the person in the image, which is in line with previous work [51, 64, 76, 85, 117]. We cropped a region of 440×440 pixels around the estimated center and passed it to the stacked-hourglass pose estimator, which resizes the cropped image to 256×256 pixels before processing it.

We found that the average error between the detected and ground truth 2D poses for the Human3.6M dataset is approximately 15 pixels, which is slightly higher than the 10 pixel error reported by Moreno-Noguer [76], who used CPM [123] for 2D pose estimation. However, we chose the stacked-hourglass model [80] over the CPM model because we found it to be approximately 10 times faster than CPM at estimating pose from an image. Moreover, the stacked-hourglass model reported a lower error on the MPII dataset, which contains many in-the-wild images; hence we felt it would generalize better to in-the-wild images than CPM [123].

To find out whether more accurate 2D pose estimation reduces the error of our model, we also fine-tuned the stacked-hourglass model pre-trained on the MPII dataset. For fine-tuning we used all the default hyper-parameters of the pre-trained model except for the mini-batch size, which was reduced from 6 to 3 due to memory limitations on the GPU, and fine-tuned it for 40,000 iterations.

3.3.3 Training details

We trained our network for 200 epochs, where each epoch makes a pass over the 2D poses of the entire Human3.6M dataset. We used the Adam [55] optimizer. We started our training with a learning rate of 0.001 and applied exponential decay on the learning rate as training progressed. We used a mini-batch size of 64. We initialized the weights of our network using Kaiming initialization [43]. Our code has been implemented in Tensorflow. A single pass over a mini-batch including back-propagation takes around 5ms and a forward pass takes only 2ms on an NVIDIA Titan X GPU. Therefore, when combined with any real-time 2D pose estimator, our network can predict the 3D pose from an image in real time. A single training epoch, which makes a pass over the entire Human3.6M

Protocol #1 Direct. Discuss Eating Greet Phone Photo Pose Purch.
Sitting SitingD Smoke Wait WalkD Walk WalkT AvgLinKDE [51] (SA) 132.7 183.6 132.3 164.4 162.1 205.9 150.6 171.3 151.6 243.0 162.1 170.7 177.1 96.6 127.9 162.1Li et al [64] (MA) – 136.9 96.9 124.7 – 168.7 – – – – – – 132.2 70.0 – –Tekin et al [117] (SA) 102.4 147.2 88.8 125.3 118.0 182.7 112.4 129.2 138.9 224.9 118.4 138.8 126.3 55.1 65.8 125.0Zhou et al [134] (MA) 87.4 109.3 87.1 103.2 116.2 143.3 106.9 99.8 124.5 199.2 107.4 118.1 114.2 79.4 97.7 113.0Tekin et al [116] (SA) – 129.1 91.4 121.7 – 162.2 – – – – – – 130.5 65.8 – –Ghezelghieh et al [37] (SA) 80.3 80.4 78.1 89.7 – – – – – – – – – 95.1 82.2 –Du et al [29] (SA) 85.1 112.7 104.9 122.1 139.1 135.9 105.9 166.2 117.5 226.9 120.0 117.7 137.4 99.3 106.5 126.5Park et al [85] (SA) 100.3 116.2 90.0 116.5 115.3 149.5 117.6 106.9 137.2 190.8 105.8 125.1 131.9 62.6 96.2 117.3Zhou et al [133] (MA) 91.8 102.4 96.7 98.8 113.4 125.2 90.0 93.8 132.2 159.0 107.0 94.4 126.0 79.0 99.0 107.3Nie et al [81] (MA) 90.1 88.2 85.7 95.6 103.9 103.0 92.4 90.4 117.9 136.4 98.5 94.4 90.6 86.0 89.5 97.5Rogez et al [73] (MA) – – – – – – – – – – – – – – – 88.1Mehta et al [73] (MA) 57.5 68.6 59.6 67.3 78.1 82.4 56.9 69.1 100.0 117.5 69.4 68.0 76.5 55.2 61.4 72.9Mehta et al [74] (MA) 62.6 78.1 63.4 72.5 88.3 93.8 63.1 74.8 106.6 138.7 78.8 73.9 82.0 55.8 59.6 80.5Lin et al [65] (MA) 58.0 68.2 63.3 65.8 75.3 93.1 61.2 65.7 98.7 127.7 70.4 68.2 72.9 50.6 57.7 73.1Tome et al [119] (MA) 65.0 73.5 76.8 86.4 86.3 110.7 68.9 74.8 110.2 173.9 84.9 85.8 86.3 71.4 73.1 88.4Pavlakos et al [87] (MA) 67.4 71.9 66.7 69.1 72.0 77.0 65.0 68.3 83.7 96.5 71.7 65.8 74.9 59.1 63.2 71.9Tekin et al [118] 54.2 61.4 60.2 61.2 79.4 78.3 63.1 81.6 70.1 107.3 69.3 70.3 74.3 51.8 63.2 69.7Ours (SH detections) (SA) 61.6 73.4 63.3 58.3 91.8 93.6 66.3 62.0 91.7 109.4 75.7 86.5 67.2 51.2 52.3 73.6Ours (SH detections) (MA) 53.3 60.8 62.9 62.7 86.4 82.4 57.8 58.7 81.9 99.8 69.1 63.9 67.1 50.9 54.8 67.5Ours (SH detections FT) (MA) 51.8 56.2 58.1 59.0 69.5 78.4 55.2 58.1 74.0 94.6 62.3 59.1 65.1 49.5 52.4 62.9Ours (GT detections) (MA) 37.7 44.4 40.3 42.1 48.2 54.9 44.4 42.1 54.6 58.0 45.1 46.4 47.6 36.4 40.4 45.5Table 3.1: Results showing errors action-wise on Human3.6M [51] underProtocol #1 (no rigid alignment or similarity transform applied in post-processing). SH indicates that we trained and tested our model withthe detections of Stacked Hourglass [80] model pre-trained on MPIIdataset [5] as input, and FT indicates that the the model was fine-tuned onHuman3.6M. GT detections denotes that the ground truth 2D locationswere used. SA indicates that a model was trained for each action, andMA indicates that a single model was trained for all actions.dataset, takes only about 2 minutes which allowed us to experiment with differenthyper-parameters and variants of our architecture.3.4 Experimental evaluationDatasets and protocols We perform quantitative evaluation on two benchmarkdatasets for 3D pose estimation: Human3.6M [51] and HumanEva [105]. Forqualitative results we use the MPII dataset [5] which is a benchmark dataset for2D pose estimation and does not have any ground truth for 3D pose.As we have discussed in Section 1.1.2, Human3.6M is, to the best of our knowl-edge, the largest publicly available datasets for human 3d pose estimation. Hu-manEva, on the other hand, is another dataset for 3D pose estimation which is49Protocol #2 Direct. Discuss Eating Greet Phone Photo Pose Purch. 
Sitting SitingD Smoke Wait WalkD Walk WalkT AvgAkhter & Black [2]* (MA) 14j 199.2 177.6 161.8 197.8 176.2 186.5 195.4 167.3 160.7 173.7 177.8 181.9 176.2 198.6 192.7 181.1Ramakrishna et al [92]* (MA) 14j 137.4 149.3 141.6 154.3 157.7 158.9 141.8 158.1 168.6 175.6 160.4 161.7 150.0 174.8 150.2 157.3Zhou et al [134]* (MA) 14j 99.7 95.8 87.9 116.8 108.3 107.3 93.5 95.3 109.1 137.5 106.0 102.2 106.5 110.4 115.2 106.7Bogo et al [16] (MA) 14j 62.0 60.2 67.8 76.5 92.1 77.0 73.0 75.3 100.3 137.3 83.4 77.3 86.8 79.7 87.7 82.3Rogez et al [73] (MA) – – – – – – – – – – – – – – – 87.3Nie et al [81] (MA) 62.8 69.2 79.6 78.8 80.8 86.9 72.5 73.9 96.1 106.9 88.0 70.7 76.5 71.9 76.5 79.5Mehta et al [73] (MA) 14j – – – – – – – – – – – – – – – 54.6Tekin et al [118] (MA) 17j – – – – – – – – – – – – – – – 50.1Moreno-Noguer [76] (MA) 14j 66.1 61.7 84.5 73.7 65.2 67.2 60.9 67.3 103.5 74.6 92.6 69.6 71.5 78.0 73.2 74.0Pavlakos et al [87] (MA) 17j – – – – – – – – – – – – – – – 51.9Ours (SH detections) (SA) 17j 50.1 59.5 51.3 56.9 68.5 67.5 51.0 47.2 68.5 85.6 61.2 67.0 55.1 41.1 45.5 58.5Ours (SH detections) (MA) 17j 42.2 48.0 49.8 50.8 61.7 60.7 44.2 43.6 64.3 76.5 55.8 49.1 53.6 40.8 46.4 52.5Ours (SH detections FT) (MA) 17j 39.5 43.2 46.4 47.0 51.0 56.0 41.4 40.6 56.5 69.4 49.2 45.0 49.5 38.0 43.1 47.7Ours (SH detections) (SA) 14j 44.8 52.0 44.4 50.5 61.7 59.4 45.1 41.9 66.3 77.6 54.0 58.8 49.0 35.9 40.7 52.1Table 3.2: Results showing errors action-wise on Human3.6M [51] datasetunder protocol #2 (rigid alignment in post-processing). The 14j anno-tation indicates that the body model considers 14 body joints while 17jmeans considers 17 body joints. (SA) annotation indicates per-actionmodel while (MA) indicates single model used for all actions. FT in-dicates that the stacked-hourglass model has been fine-tuned on Hu-man3.6M dataset. The results of the methods are obtained from the orig-inal papers, except for (*), which were obtained from [16].comparatively much smaller and older than the Human3.6M dataset but have beenused as a benchmark by many previous work.On Human3.6M we follow the standard protocol which has been used overthe years. The protocol involves using subjects 1, 5, 6, 7, and 8 for training, andsubjects 9 and 11 for evaluation. Our error metric is average error per joint inmillimeters between the estimated and the ground truth 3D pose relative to the rootnode (central hip joint). We refer to this as protocol #1. However, in some previouswork(e.g., [16, 76]), the predicted 3D pose is aligned to the ground truth 3D poseunder a rigid body similarity transform. This is typically done by using Procrustesanalysis [39]. This post-processing is referred to as protocol #2.Several methods which used Human3.6M dataset performed an action specifictraining and testing. However, recent deep network based methods train a singlemodel for all the actions. We observed that training a single model gives betterresults than action specific models.However, for HumanEva action specific models are trained in the literature andthe error is always computed after similarity transform. 
Hence we also used this protocol.

3.4.1 Quantitative results

Evaluation on estimated 2D pose

Conceptually, we modeled our 3D pose framework as a decoupled architecture which divides the 3D pose estimation task into two parts: detecting the 2D pose using a 2D pose estimator, and estimating the 3D pose from the detected 2D joint locations. As mentioned before, we obtained 2D pose estimates on the Human3.6M dataset using a stacked-hourglass model [80] trained on the MPII dataset [5].

Our results under protocol #1 on the Human3.6M dataset are shown in Table 3.1. As seen from the table, when we use the predictions from the stacked-hourglass model [80] trained only on the MPII dataset, our framework outperforms all the recently released methods. Our network outperforms Pavlakos et al. [87], who trained an end-to-end model from images by extending the stacked-hourglass 2D pose estimator to predict volumetric heatmaps, by 4.4 mm. Our network also marginally beats the method recently proposed by Tekin et al. [118], by 2.2 mm.

Intuitively, since our method takes 2D joint location estimates as input and regresses the 3D pose from them, the accuracy largely depends on the accuracy of the 2D estimates. To validate this, we fine-tuned the stacked-hourglass network, pre-trained on the MPII dataset, on the Human3.6M dataset. As hypothesized, when trained using the predictions from the fine-tuned network, our method outperforms our nearest competitors Pavlakos et al. [87] by 9.0 mm and Tekin et al. [118] by 6.8 mm. The margins more than double when we use the fine-tuned predictions, suggesting the superiority of our network compared to the state of the art.

We report our results on Human3.6M under protocol #2, which uses a similarity transform with the ground truth, in Table 3.2. Although under protocol #2 our method is very narrowly beaten by both Pavlakos et al. [87] and Tekin et al. [118] (by 0.6 mm and 2.4 mm respectively) when we use the detections from the stacked-hourglass model trained on the MPII dataset, it beats both state-of-the-art methods (by 4.2 mm and 2.4 mm) when the detections from the fine-tuned model are used.

Finally, we report the results on the HumanEva dataset in Table 3.3. On this dataset, we obtained the best result in 4 out of 6 cases, and achieved the lowest average error over all subjects for the actions Jogging and Walking. However, compared to Human3.6M, HumanEva is a much smaller and older dataset, and the same subjects are present in both the training and test sets, so visual methods would have a stronger bias on this dataset. These results are therefore not as significant as those obtained on the Human3.6M dataset.

Walking Jogging
S1 S2 S3 S1 S2 S3 Avg
Radwan et al [90] 75.1 99.8 93.8 79.2 89.8 99.4 89.5
Wang et al [122] 71.9 75.7 85.3 62.6 77.7 54.4 71.3
Simo-Serra et al [107] 65.1 48.6 73.5 74.2 46.6 32.2 56.7
Bo et al [14] 46.4 30.3 64.9 64.5 48.0 38.2 48.7
Kostrikov et al [56] 44.0 30.9 41.7 57.2 35.0 33.3 40.3
Yasin et al [128] 35.8 32.4 41.6 46.6 41.4 35.4 38.9
Moreno-Noguer [76] 19.7 13.0 24.9 39.7 20.0 21.0 26.9
Pavlakos et al [87] 22.1 21.9 29.0 29.8 23.6 26.0 25.5
Ours (SH detections) 19.7 17.4 46.8 26.9 18.2 18.6 24.6

Table 3.3: Results on the HumanEva [105] dataset, and comparison with previous methods.
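For reference, the two error measures used in the tables above can be written compactly. The sketch below is an illustrative NumPy version of protocol #1 (mean per-joint position error on root-centered poses) and of the rigid alignment step of protocol #2 (Procrustes analysis); it is a simplified re-implementation for exposition, not the evaluation code of this thesis.

import numpy as np

def mpjpe(pred, gt):
    """Protocol #1: mean Euclidean distance per joint (mm); poses are already root-centered."""
    return np.mean(np.linalg.norm(pred - gt, axis=-1))

def procrustes_align(pred, gt):
    """Protocol #2: rigidly align pred to gt with a similarity transform before scoring."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the covariance matrix, with a reflection guard.
    U, S, Vt = np.linalg.svd(P.T @ G)
    if np.linalg.det(U @ Vt) < 0:      # avoid an improper rotation (reflection)
        Vt[-1] *= -1
        S[-1] *= -1
    R = U @ Vt
    scale = S.sum() / (P ** 2).sum()   # optimal isotropic scale
    return scale * P @ R + mu_g

# pred, gt: (17, 3) arrays in millimetres
pred = np.random.randn(17, 3)
gt = np.random.randn(17, 3)
err_protocol1 = mpjpe(pred, gt)
err_protocol2 = mpjpe(procrustes_align(pred, gt), gt)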
A lower bound on the error of 2D-to-3D regression

To validate our hypothesis that the major source of error for 3D pose estimation is the error in estimating 2D poses, we train our model on Human3.6M with the ground truth 2D poses. We show the results under protocol #1 in Table 3.1, where a single model is trained for all actions. Unsurprisingly, the network trained with the ground truth results in a significantly lower error (approximately 17 mm lower) than our model trained on the detections of the pre-trained stacked-hourglass.

Under protocol #2, our network trained on the ground truth achieves an error of 37.10 mm, almost 30% better than the case when our network is trained on the estimated 2D pose. This validates our hypothesis empirically: for deep networks it is easier to learn the mapping from 2D joint locations to 3D joint locations, and the more accurate the 2D pose estimates are, the better the accuracy of the 3D pose.

Even though we evaluate each frame separately and do not use any temporal post-processing, we observed that the predictions produced from the ground truth 2D poses are smooth. A video demonstrating this and other qualitative results can be found at https://youtu.be/Hmi3Pd9x1BE.

Robustness to detector noise

To further analyze the robustness of our approach to noisy inputs, we carried out experiments where our model is trained on the ground truth 2D poses and tested on ground truth 2D poses randomly corrupted by different levels of additive Gaussian noise. We used protocol #2 to compare against the work by Moreno-Noguer [76] because they used the same protocol. The results are reported in Table 3.4. We outperform the work by Moreno-Noguer [76] by a huge margin for all levels of noise. Even in the case when no Gaussian noise is added, our method betters the result of Moreno-Noguer by a staggering 43%.

We have also reported the case when our network is trained with ground truth 2D poses but tested with noisy detections from the 2D pose detectors CPM [123] and stacked-hourglass [80]. As can be observed from the table, our method also performs reasonably well in this case, thereby demonstrating the robustness of our model.

DMR [76] Ours ∆
GT/GT 62.17 37.10 25.07
GT/GT + N(0,5) 67.11 46.65 20.46
GT/GT + N(0,10) 79.12 52.84 26.28
GT/GT + N(0,15) 96.08 59.97 36.11
GT/GT + N(0,20) 115.55 70.24 45.31
GT/CPM [123] 76.47 – –
GT/SH [80] – 60.52 –

Table 3.4: Performance of our system on the Human3.6M [51] dataset under protocol #2, under different levels of additive Gaussian noise and under noise from 2D pose estimators. (Top) Training using ground truth 2D pose and testing on ground truth 2D pose plus different levels of additive Gaussian noise. (Bottom) Training on ground truth 2D pose and testing on the noisy outputs of a 2D pose estimator. Note that the size of the cropped region around the person is 440×440.
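The noise test in Table 3.4 simply perturbs each ground truth 2D joint coordinate before it is fed to the network. A hedged sketch of that corruption step follows; σ is in pixels of the 440×440 crop, and the function name and the commented usage (model.predict, normalize, gt_pose_2d) are stand-ins rather than names from the released code.

import numpy as np

def corrupt_2d_pose(pose_2d, sigma, rng=None):
    """Add i.i.d. Gaussian noise N(0, sigma^2) to every 2D joint coordinate.

    pose_2d: (J, 2) array of ground truth joint locations in pixels.
    sigma:   standard deviation of the noise in pixels (5, 10, 15 or 20 in Table 3.4).
    """
    rng = np.random.default_rng() if rng is None else rng
    return pose_2d + rng.normal(0.0, sigma, size=pose_2d.shape)

# Evaluate a trained 2D-to-3D model on progressively noisier inputs:
# for sigma in (0, 5, 10, 15, 20):
#     noisy = corrupt_2d_pose(gt_pose_2d, sigma)
#     pred_3d = model.predict(normalize(noisy))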
Ablative and hyperparameter analysis

To demonstrate the usefulness of the different components and design choices of our network, we perform an ablative analysis. We perform the ablative analysis under protocol #1, where the input 2D poses come from the model trained on the MPII dataset only, and we train a single model for all the actions. The results are shown in Table 3.5. As we can see from the table, when we remove only batch normalization, the network generalizes poorly and this leads to an increase in error of 21 mm. Removing both batch normalization and dropout leads to an increase of 8.5 mm, while removing the residual connections costs about 8.3 mm. The biggest impact is made by predicting the 3D pose in the camera coordinate frame: when the 3D pose is predicted in an arbitrary global frame instead, the average error rises to over 100 mm, a significant increase of about 33 mm.

We also evaluated how our network performs at different depths. Using a single residual block results in a performance loss of about 7 mm. The network starts to saturate when we use more than 2 residual blocks, mostly because of the large number of parameters due to the full connections.

Although not reported in the table, we observed empirically that decreasing the size of the hidden layers from 1024 to 512 leads to an increase in error, while increasing the size of the hidden layers to 2048 units did not noticeably improve the results despite slowing down training.

error (mm) ∆
Ours 67.5 –
w/o batch norm 88.5 21.0
w/o dropout 71.4 3.9
w/o batch norm w/o dropout 76.0 8.5
w/o residual connections 75.8 8.3
w/o camera coordinates 101.1 33.6
1 block 74.2 6.7
2 blocks (Ours) 67.5 –
4 blocks 69.3 1.8
8 blocks 69.7 2.4

Table 3.5: Ablative and hyperparameter sensitivity analysis.

3.4.2 Qualitative results

We show some qualitative results on Human3.6M under protocol #1, using the 2D poses from the stacked-hourglass model pre-trained on the MPII dataset, in Figure 3.1. We also show some results on in-the-wild images from the MPII dataset in Figure 3.2.

We can observe certain shortcomings of our approach in Figure 3.2. We can see from the figure that our system cannot recover from a faulty 2D pose estimate, particularly when the 2D pose detector fails completely to generate any meaningful pose. Another limitation is that our model cannot handle poses with unconventional orientations that are not present in the Human3.6M dataset, e.g. a diver diving into a pool. In this case, the person is upside down. Even though our model could capture the pose to some extent, it failed to capture the real orientation of the person.

3.4.3 Discussion of results

If we analyze Table 3.1, we observe a general trend of higher errors in certain action classes like taking a photo, talking on the phone, sitting and sitting down. Most previous work had a hard time dealing with these actions. We attribute the cause of the higher error to severe self-occlusion of body parts in these actions; e.g. in certain phone sequences, one of the hands is hardly visible. The same can be said for actions like sitting and sitting down, where the actors sometimes sit in a way that aligns the legs with the viewpoint of the camera, resulting in one leg being blocked from view, as well as foreshortening.

We have demonstrated empirically that a model based on an architecture as simple as fully connected layers is good enough to achieve a remarkably low error on 3D pose estimation given the 2D poses.
In fact, using the state-of-the-art 2D pose estimator, the stacked-hourglass network [80], we have bettered the best results reported to date.

Figure 3.1: Example of output on the test set of the Human3.6M dataset. (Left) 2D pose, (Middle) 3D ground truth pose in red and blue, (Right) our 3D pose estimates in green and purple.

This result supports our hypothesis that the task of mapping 2D poses to 3D is easier than previously thought, and that it is the error in estimating the human pose in 2D that is the major factor limiting the accuracy of the 3D pose estimation task. This hypothesis is in contrast with the standard deep learning mantra applied to 3D pose estimation, which focuses on training deep networks end-to-end to predict 3D pose directly from images. Pavlakos et al. [87], who had the previous best results by training their network end-to-end, hypothesized that regressing 3D points directly is more difficult than predicting a volumetric heatmap. They also showed in their paper that using a decoupled network, i.e. using the heatmaps as input to the 3D pose estimation system without the image features, decreased the performance of their network despite being trained end-to-end. Our network shows that their hypothesis about regressing 3D points directly being more difficult is not correct. However, we do agree with them that image features may provide valuable contextual information and that 2D heatmaps alone are, for some reason, not good enough to estimate 3D pose effectively, which we will show in the next chapter.

Figure 3.2: Qualitative results on the MPII [5] test set. Observed image, followed by the 2D pose detection from the Stacked Hourglass [80] and (in green) our 3D pose estimate. The bottom 3 examples show typical failure cases, where the 2D detector has failed either totally (left) or marginally (right). In the middle column of the last row, the 2D detector does a good job of estimating the 2D pose, but the person is facing upside-down. The Human3.6M dataset does not provide any corresponding poses which are oriented upside-down. However, our network still seems to predict a meaningful pose, although the orientation is reversed vertically.

Despite this, our network has shown that something as simple as the 2D joint locations alone can be discriminative enough to estimate 3D pose with a remarkably low error rate, using a very simple, totally decoupled network not trained end-to-end. Our network is simple, fast and lightweight and can be trained very easily to obtain state-of-the-art results.

Moreno-Noguer [76] justified the use of a distance matrix as a representation of the human body with the claim that invariant, human-designed features should boost the accuracy of the system. However, we found that using a much simpler representation, the 2D pose, a well trained system can outperform networks that learn from such hand-designed features.

To summarize the findings from our first architecture, our accuracy in 3D pose estimation from ground truth 2D poses suggests that, although 2D pose estimation is considered to be a nearly solved problem, it is one of the root causes of error in the 3D pose estimation task.
Our work also suggests that learning an invariant feature representation of human pose from images by training a network end-to-end may not be as critical as thought, or has not been exploited to its full potential.

Chapter 4

End-to-end model

With our first model, we showed empirically that 2D pose information alone can be discriminative enough to regress 3D joint locations with high accuracy, and that more accurate estimation of the 2D joint locations improves overall performance. To further bolster our argument, we design a network which regresses 3D poses directly from the RGB image.

Our model is inspired by the stacked-hourglass network [80], which predicts 2D joint heatmaps, and by the work of Pavlakos et al. [87], which extends the stacked-hourglass network to predict a 3D volumetric heatmap for each joint. However, instead of predicting the volumetric heatmaps, we want to regress the 3D points directly. For this purpose we overlay our first model on top of the stacked-hourglass network. The joint heatmaps from the last hourglass are vectorized and projected onto a 1024-dimensional vector, which is then passed to residual blocks like those of the first network. The output of the residual block is then projected down to predict the 3D joint locations relative to the root joint. Figure 1.7 shows the architecture of our second network.

However, we found that it is difficult to train such a network end-to-end, and the error in estimating 3D pose is considerably higher. The heatmap for a joint gives the probability or likelihood of the joint being at a particular spatial location. The 2D joint locations are found from a heatmap by applying a 2D argmax to find the spatial index of the maximum value. However, the argmax function is not differentiable. Hence we cannot extract the 2D joint locations from the output of the stacked hourglass, pass them to our residual block, and train the whole network end-to-end. This may suggest that the heatmaps are not as discriminative as 2D joint locations, or that the mapping from heatmaps to 3D joint locations is more difficult than the mapping from 2D joints to 3D, and therefore leads to a higher error.

4.1 Stacked hourglass module

The hourglass module was proposed by Newell et al. [80] for the task of 2D pose estimation. The motivation behind the hourglass structure was to gather the discriminative features and cues needed for understanding human pose at multiple scales. Each hourglass module is composed of several residual modules, which are the same as the bottleneck residual blocks proposed by He et al. [44], discussed in detail in the related work, Section 2.4.3. Each hourglass performs a series of convolutions and max-pooling operations to process features at multiple scales, the lowest resolution being 4×4. The network branches off into two parts before each pooling layer: more convolutions are applied to the pre-pooled features on one branch, while the other branch applies max-pooling to bring the scale down. Once the lowest resolution is reached by successive pooling operations, the features are sequentially up-sampled and combined, in a top-down manner, with the features which branched off and were not max-pooled at the same scale. The up-sampling is done using nearest-neighbour up-sampling and the features are combined by element-wise addition. The output of the hourglass module is a 2D heatmap for each joint, which gives the likelihood of that joint being at a particular spatial location.

Newell et al.
[80] stacked multiple hourglass modules together to build a com-plete 2D pose estimation framework. The heatmaps of joints from an hourglassare projected to a larger number channels using 1× 1 convolution and are addedwith the intermediate features of the hourglass and with the feature output of theprevious hourglass. The resulting output is passed onto the following hourglassas input. The repeated bottom-up and top-down inference over the whole networkhelps later hourglasses to refine the outputs of previous hourglasses. They appliedintermediate supervision at the end of each hourglass to ensure each hourglass pre-dicts accurate estimates of heatmaps thereby allowing later hourglasses to refine60previous estimates. The loss function used by the authors is the Mean SquaredError (MSE) between predicted heatmaps and ground truth heatmaps.4.2 Pre-training stacked-hourglass modelEmpirically, we found that the network does not converge easily when the en-tire model is trained end-to-end with random weight initialization. Therefore, wedecided to pre-train the stacked-hourglass part of our network for 2D pose estima-tion only. We stacked four hourglass modules for the task. Each hourglass hasfour residual modules. We trained the stacked-hourglass module from scratch onthe images of Human3.6M dataset. Following the standard protocol of the Hu-man3.6M dataset, we only used the images of subjects 1,5,6,7,8 for training thenetwork.For this task, we cropped the input image using the bounding box annotationsprovided in the dataset. We first estimated the center of the bounding box fromthe given information and then cropped a 440× 440 region around the estimatedcenter to the network. We performed a random color augmentation in each channelof the image separately during training by multiplying with a scalar value chosenfor each channel from a uniform distribution between 0.6 and 1.4, followed by aclipping to ensure that the resulting intensity values like in the range of 0− 255.Following He et al. [44], we zero center each image by subtracting each channelby the mean values computed from the Imagenet dataset [27, 57]. To generate theground truth 2D heatmaps for each joint from the 2D joint locations, we applieda 2D Gaussian filter, having a zero mean and a standard deviation of 0.75 pixels,over the location of the joint. We applied a Mean Squared Error (MSE) betweenthe predicted and ground truth heat maps over all the poses as the loss function.We applied intermediate supervision to the output of intermediate hourglasses assuggested by Newell et al. [80]. However, we could only stack four hourglassesdue to limitation in memory and time. We used RMSprop optimizer [46] used byNewell et al. [80] for optimizing the network with a learning rate 2.5e− 4 andapplied exponential decay for the learning rate. It took us about a day to train thenetwork on a single NVIDIA Titan X GPU.614.3 Training end-to-endAfter pre-training the hourglass part of the network, we combined our first networkon top of this pre-trained network to train the model end-to-end. Over here, weused the intuition of transfer learning [83] where we expect that the knowledge ofhuman pose acquired by the 2D pose estimation part of the network can help inobtaining better 3D pose estimation.4.3.1 Loss FunctionThe goal of our model is to estimate the 3D joint locations from images directly bytraining the entire network end-to-end. Therefore the input to our system is now anRGB image. 
Let us denote the RGB image as I_{n×n×3}, where n×n is the resolution of the image and 3 is the number of channels. The output of the stacked-hourglass part of the network is a set of 16 heatmaps, one for each of the 16 joints. Each heatmap has resolution 64×64. Let us denote the estimated heatmaps from the last hourglass by H(I)_{64×64×16} and the ground truth heatmaps by G(I)_{64×64×16}. The final output of our system is the estimate of the 3D joint locations, which we denote by \hat{y}; the ground truth is denoted by y.

The loss function for our network is the weighted sum of the Mean Squared Error (MSE) of the 3D joint locations and the MSE of the heatmaps of all the joints over a set of N poses. It is given by

\mathcal{L}(\hat{y}, H(I), y, G(I)) = \min_{\hat{y}, H(I)} \frac{1}{N} \sum_{i=1}^{N} \Big[ \alpha \, \|\hat{y}_i - y_i\|_2^2 + \beta \sum_{j=1}^{16} \sum_{a=1}^{64} \sum_{b=1}^{64} \big\|H(I_{(a,b)})_{j,i} - G(I_{(a,b)})_{j,i}\big\|_2^2 \Big].   (4.1)

In the equation, α and β are hyperparameters controlling the importance of the penalty terms.

As mentioned before, the stacked-hourglass module predicts the heatmaps for 16 joints, namely the central hip, spine, neck, head, and both the left and right hip, knee, ankle, shoulder, elbow and wrist. However, in 3D we predict 17 joints, the extra joint being the nose, simply following the same output format as our first model. We predict the 3D locations of the joints relative to the root node, the central hip, and the ground truth 3D poses are transformed into camera coordinate space.

4.3.2 Data Preprocessing

For training end-to-end, we normalized the 3D ground truth poses by subtracting the mean and dividing by the standard deviation. As in our first model, and following the standard protocol of the Human3.6M dataset, we zero-center the 3D joint locations relative to the root node, since we do not predict the global position of the root. Like the previous model, we predict the 3D pose in the camera coordinate space and hence transformed the ground truth 3D poses into the camera space using the extrinsic camera parameters. The input images are preprocessed in the same way as during the pre-training step in Section 4.2. The ground truth heatmaps are obtained in a similar manner.

4.3.3 Training Details

For end-to-end training we initialized the stacked-hourglass part of our network with the weights learned during pre-training. The rest of the network is initialized using Kaiming initialization [43].

Because a single training pass of a convolutional neural network is expensive, we pick every 20th frame from each training video and randomly sample 50K images from these during a single epoch. We trained our network for 100 epochs. To optimize our network end-to-end we used the Adam [55] optimizer. We started our training with a learning rate of 1e−5 and applied exponential decay on the learning rate as training progressed. We used a mini-batch size of 3 images due to memory limitations on the GPU, and implemented our code in Tensorflow. A single pass over a mini-batch including back-propagation takes around 230ms and a forward pass takes approximately 75ms on an NVIDIA Titan X GPU.

4.4 Experimental evaluation

Datasets and protocols  For our second model we perform quantitative evaluation on the Human3.6M [51] dataset only. We have not chosen HumanEva because, compared to Human3.6M, it is much smaller and the same subjects appear in the training and test sets. Besides, we have already shown the effectiveness of going from 2D pose to 3D by reporting results on both datasets.
Therefore, we felt that it is more important to perform better on Human3.6M, on which most of the recent approaches have been evaluated.

For the second network, we follow the protocols discussed in Section 3.4. However, because the forward pass of our second network takes longer due to its more expensive convolutions, we evaluate every 64th frame of all the actions of subjects 9 and 11. This is a standard protocol used by methods which have trained a CNN end-to-end to estimate 3D pose directly from images [73, 87, 119]. As mentioned before, in protocol #1 the error is the average error per joint in millimeters between the estimated and the ground truth 3D pose relative to the root node (the central hip joint), while in protocol #2 the estimated pose is aligned with the ground truth pose using a similarity transform method such as Procrustes analysis [39]. For this experiment, we trained a single model for all the actions.

4.4.1 Quantitative results

The quantitative results for our second model on the Human3.6M dataset under both protocols, protocol #1 and protocol #2, are shown in Table 4.1. For this experiment, we only report the mean per joint error averaged over all the actions. As can be seen from the table, our end-to-end method performs quite poorly compared to the state-of-the-art methods under both protocols. In fact, our model is second worst of all the methods that report error under protocol #1, and under protocol #2 it is only better than two other methods [2, 92]. This indicates that it is more difficult to train a 3D pose estimator model end-to-end. In particular, it seems that mapping the 2D heatmaps of the joints directly to 3D joint locations is more difficult than mapping from 2D joint locations. We discuss the results more elaborately in Subsection 4.4.3.

Methods Protocol #1 Protocol #2
LinKDE [51] 162.1 –
Akhter & Black [2]* – 181.1
Ramakrishna et al [92]* – 157.3
Bogo et al [16] – 82.3
Moreno-Noguer [76] – 74.0
Tekin et al [117] 125.0 –
Zhou et al [134]* 113.0 106.7
Du et al [29] 126.5 –
Park et al [85] 117.3 –
Zhou et al [133] 107.3 –
Pavlakos et al [87] 71.9 51.9
Our first model (SH detections) 67.5 52.1
Our first model (SH detections FT) 62.9 47.7
Our end-to-end model 144.7 112.2

Table 4.1: Results showing Mean Per Joint Error over all actions on the Human3.6M [51] dataset under protocol #1 (left column) and protocol #2 (right column) respectively. SH indicates 2D pose detections obtained from the stacked-hourglass module [80] trained on the MPII [5] dataset, and FT indicates that the model was fine-tuned on the Human3.6M dataset [51]. The results of the methods are obtained from the original papers, except for (*), which were obtained from [16].

4.4.2 Qualitative results

We show some qualitative results for our second model in Figure 4.1. We can observe from the results that our end-to-end network had a hard time predicting terminal joints such as the ankles and wrists, and the limbs in general. Although it does generally well for walking or standing images, there is a large error for sitting poses.

4.4.3 Discussion of results

Based on the quantitative and qualitative results, we can see that training an end-to-end model has proved to be more difficult, as suggested by the higher average error under both protocols. One reason for the worse results compared to our first model may be that the 2D heatmaps for each joint are not as discriminative as the 2D joint locations for 3D pose estimation, or that the mapping from heatmaps to 3D joint locations is harder. While we could use a 2D argmax function to find the spatial location of maximum likelihood for each joint, this would have prevented us from designing an end-to-end network because the argmax function is not differentiable.
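To make the heatmap-to-coordinate step concrete, the sketch below shows the usual (non-differentiable) way of reading 2D joint locations out of a stack of heatmaps; it is an illustrative NumPy fragment, not part of the network described here, precisely because this argmax cannot be back-propagated through.

import numpy as np

def heatmaps_to_joints(heatmaps):
    """Extract the (x, y) coordinates of the maximum of each joint heatmap.

    heatmaps: (H, W, J) array of per-joint likelihood maps (64 x 64 x 16 here).
    Returns a (J, 2) array of integer joint locations in heatmap coordinates.
    """
    h, w, num_joints = heatmaps.shape
    flat_idx = heatmaps.reshape(h * w, num_joints).argmax(axis=0)  # argmax per joint
    ys, xs = np.unravel_index(flat_idx, (h, w))                    # row, column indices
    return np.stack([xs, ys], axis=1)

# Example with random heatmaps standing in for the hourglass output.
joints_2d = heatmaps_to_joints(np.random.rand(64, 64, 16))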
Therefore, we can argue that it is much easier and simpler for any model to learn a mapping from 2D joint locations to 3D, and, combining this with the results from the last chapter, we can hypothesize that even though most 2D pose estimators give excellent performance, it is the error in estimating the 2D joint locations that gets carried forward when mapping to 3D pose.

One interesting experiment which can be done in the future is, instead of using the heatmaps of the joints as input to the second part of our network, to map the intermediate features learned by the stacked-hourglass module to 3D pose. This would also make the network differentiable end-to-end. However, one limiting factor in this case is that the intermediate features have a resolution of 64×64 and have more than 256 channels; e.g. the intermediate features of our last hourglass have 512 channels. A possible solution may be to reduce the number of channels using a 1×1 convolution, which would reduce the number of connections to the residual block (see Chapter 3).

Figure 4.1: Example of output on the test images of the Human3.6M dataset. (Left) Image, (Middle) 3D ground truth pose in red and blue, (Right) our 3D pose estimates in green and purple.

Chapter 5

Exploiting temporal information

From our previous experiments, we demonstrated that the 2D positions of the joints, despite being low dimensional, provide sufficient information about human pose, and that a simple deep network architecture can efficiently map 2D joint locations into 3D space with high accuracy. We also showed that designing and training a model end-to-end to predict 3D poses directly from images is more difficult and computationally expensive. In our third model, we analyze the effectiveness of incorporating temporal information over a sequence of 2D poses to estimate a sequence of 3D poses.

To exploit the temporal information across a sequence of 2D poses, we designed a sequence-to-sequence network [113] using Long Short-Term Memory (LSTM) units [48] with layer normalization [6] and recurrent dropout [101, 130] for regularization. Additionally, there is a shortcut or residual connection from the input of each unit to the output of that unit on the decoder. Moreover, making predictions on a sequence of frames instead of a single frame allows us to impose temporal smoothness constraints over the joints during training. Figure 1.8 shows the diagram of our final model.

5.1 Network design

Our motivation for using a sequence-to-sequence network comes from its application to the task of Neural Machine Translation (NMT) [113], in which the trained model translates a sentence in one language into a sentence in another language, e.g. English to French. Our task is analogous to the language translation task, in that we transform one form of input, a sequence of 2D joint locations, into a different form of output, a sequence of 3D joint locations. In a language translation model, the input and output sentences can have different lengths. However, our case is simpler than NMT because the input and the output have the same sequence length.

5.1.1 Sequence-to-sequence network with residual connections

As shown in Figure 1.8, our network is a sequence-to-sequence network consisting of an encoder and a decoder component.
Note that the decoder side of the network has shortcut connections [44] connecting the input of each LSTM unit to the prediction of that unit. The encoder side of our network encodes the 2D pose information over a sequence of frames into a fixed-size, high-dimensional vector. The encoding also captures the temporal consistency information over the input sequence.

The initial state of the decoder is initialized with the last state of the encoder LSTM, and a 〈START〉 token, which in our case is a vector of ones, is passed as input to the first time step of the decoder LSTM to start decoding. Suppose the input sequence has a length of t. Once the 〈START〉 token is passed as input to the decoder, it predicts the 3D pose of the first frame, y0, which in turn is passed as input to the next LSTM unit of the decoder, which then predicts the 3D pose for the next frame, y1. In other words, given a 3D pose estimate yt at time step t, each LSTM unit predicts the 3D pose for the next time step, yt+1. Note that the order of the input sequence is reversed, i.e. the 2D pose at time t is passed at the first time step, as recommended by Sutskever et al. [113], who empirically found that it is easier for the decoder of a sequence-to-sequence network to predict the output sequence in the reverse order of the encoder's input sequence.

The residual connections effectively make the decoder learn the amount of change in the 3D position of each joint from the previous frame. This makes it easier for the network to make its output predictions, because it only needs to estimate the perturbation from the 3D pose of the previous frame instead of estimating the absolute 3D pose of a particular frame directly. This observation is in line with the hypothesis of He et al. [44]. To regularize our network, we applied layer normalization to each LSTM unit [6]. We also applied recurrent dropout [101, 130] with a dropout probability of p.
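A minimal sketch of the encoder-decoder recurrence just described is given below. It is written in plain Python/NumPy rather than as the Tensorflow graph of this thesis, and a vanilla tanh recurrence stands in for the layer-normalized LSTM cell; all names, shapes and the random initialization are illustrative assumptions.

import numpy as np

def rnn_step(x, h, Wx, Wh, b):
    """One step of a toy recurrent cell (stand-in for the layer-normalized LSTM)."""
    return np.tanh(x @ Wx + h @ Wh + b)

def seq2seq_decode(x2d_seq, params, out_dim=51):
    """Encode a sequence of 2D poses and decode 3D poses with residual connections.

    x2d_seq: (T, 32) sequence of flattened 2D poses (16 joints x 2).
    Returns a (T, out_dim) sequence of flattened 3D poses (17 joints x 3 = 51).
    """
    Wx_e, Wh_e, b_e, Wx_d, Wh_d, b_d, W_out, b_out = params
    h = np.zeros(Wh_e.shape[0])
    # Encoder: consume the input sequence in reverse order (Sutskever et al.).
    for x in x2d_seq[::-1]:
        h = rnn_step(x, h, Wx_e, Wh_e, b_e)
    # Decoder: start from a <START> token (a vector of ones) and feed each
    # prediction back in; the shortcut adds the unit's input to its output,
    # so the cell only has to model the change from the previous frame.
    y_prev = np.ones(out_dim)
    outputs = []
    for _ in range(len(x2d_seq)):
        h = rnn_step(y_prev, h, Wx_d, Wh_d, b_d)
        y_prev = y_prev + (h @ W_out + b_out)   # residual / shortcut connection
        outputs.append(y_prev)
    return np.stack(outputs)

# Example with random weights: hidden size 64, sequence length 5.
H = 64
rng = np.random.default_rng(0)
params = (rng.normal(size=(32, H)), rng.normal(size=(H, H)), np.zeros(H),
          rng.normal(size=(51, H)), rng.normal(size=(H, H)), np.zeros(H),
          rng.normal(size=(H, 51)), np.zeros(51))
pred_3d_seq = seq2seq_decode(rng.normal(size=(5, 32)), params)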
5.1.2 Layer Normalization

Although batch normalization [50] has been found to be very effective at regularizing deep networks and reducing their training time, its applicability to Recurrent Neural Networks is less straightforward. Unlike deep networks of fixed depth, the summed input to each recurrent neuron varies with the length of the sequence. An RNN would therefore have to store separate normalization statistics for each time step, rather than maintaining batch statistics for each hidden layer as in ordinary feed-forward networks. Moreover, batch normalization is ineffective for online learning tasks, or when the model is so large that it forces a small batch size.

Therefore, to regularize RNNs effectively and to speed up training, Ba et al. [6] proposed layer normalization. Layer normalization estimates the normalization statistics (mean and standard deviation) from the summed inputs to the recurrent neurons of a hidden layer on a single training case, instead of trying to estimate a population mean and variance as batch normalization does.

In any feed-forward neural network, the input to a hidden neuron is a weighted linear combination of the outputs of the neurons of the previous hidden layer, to which a non-linear function like a ReLU or sigmoid is applied. In layer normalization, the normalization statistics over all the hidden units in the same layer are computed by

\mu^k = \frac{1}{H} \sum_{i=1}^{H} a_i^k, \qquad \sigma^k = \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(a_i^k - \mu^k\right)^2}.   (5.1)

In these equations, µ^k represents the mean and σ^k the standard deviation over all the recurrent neurons in hidden layer k, and a_i^k represents the input to hidden unit i in layer k, which is a linear combination of the outputs of the hidden units of the previous layer. The total number of hidden units in a layer is denoted by H. All the hidden units in the same layer share the same normalization terms, as in batch normalization, but instead of estimating a population mean and variance over a training mini-batch, different training examples lead to different normalization terms. Additionally, there is no constraint on the training mini-batch size in layer normalization; it can be used even with a batch size of 1, and it performs the same computation during training and test time.
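A small sketch of Eq. (5.1) applied to one vector of summed inputs, followed by the usual learned gain and bias of layer normalization; this is an illustrative NumPy version rather than the Tensorflow implementation used in the model.

import numpy as np

def layer_norm(a, gain, bias, eps=1e-5):
    """Normalize the summed inputs of one layer for a single training case (Eq. 5.1).

    a:    (H,) summed inputs to the H hidden units of the layer.
    gain: (H,) learned scale applied after normalization.
    bias: (H,) learned shift applied after normalization.
    """
    mu = a.mean()                                   # mean over hidden units, not over the batch
    sigma = np.sqrt(((a - mu) ** 2).mean() + eps)   # standard deviation over hidden units
    return gain * (a - mu) / sigma + bias

# Example: 1024 hidden units, identity gain and zero bias.
a = np.random.randn(1024)
normalized = layer_norm(a, np.ones(1024), np.zeros(1024))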
5.1.3 Recurrent Dropout

Although dropout is very popular for regularizing deep networks, it does not work well for RNNs. Zaremba et al. [130] proposed a technique for effectively applying dropout to LSTM cells so that overfitting can be reduced. They proposed applying dropout, with a certain probability p, only to the non-recurrent connections of the network, while always keeping the recurrent connections intact. The dropout operation thus adds noise to the information propagating through the LSTM units, making their intermediate computations more robust, while leaving the recurrent connections untouched ensures that each LSTM unit can still remember events that occurred many time steps in the past. By using this recurrent dropout technique, LSTMs can therefore be regularized effectively without sacrificing their ability to memorize, in contrast to the vanilla dropout operation, which also drops recurrent connections and thereby inhibits LSTMs from memorizing information over long time spans.

5.1.4 Temporal smoothness constraint

One issue with making 3D pose predictions from the 2D poses of each frame individually is that the error in one frame is independent of the others. The lack of aggregated error information over a sequence of frames tends to cause temporally jittery predictions. In fact, a high error in estimating the 3D pose of one frame can make the predictions appear inconsistent over time.

Since our final network makes predictions over a sequence of 2D poses, we can easily apply a temporal smoothness constraint to ensure that the 3D joint locations of successive frames do not differ by too much. We apply this constraint by adding the L2 norm of the first-order derivative of the 3D joint locations with respect to time to our loss function during training.

However, from our empirical observations with the first network, we found that certain joints, e.g. the wrists, ankles and elbows, are difficult to estimate accurately. In fact, compared to the rest, these joints contribute most to the overall mean error. To address this issue, we partitioned the joints into three disjoint sets, torso head, limb mid and limb terminal, based on the magnitude of the error in estimating them. We observed that the joints connected to the torso and head, e.g. the hips, shoulders and neck, are always predicted with high accuracy, since these body parts tend to be more rigid than the limbs; these joints are therefore put in the set torso head. On the other hand, the joints of the limbs are always more difficult to predict due to their high range of motion, and in our observation the terminal joints of the limbs, i.e. the wrists and ankles, are harder to predict accurately than the knees and elbows. Therefore, we put the knees and the elbows in the set limb mid and the terminal joints in the set limb terminal. To reduce jitter, we multiply the derivatives of each set of joints by a different scalar value, with the highest weight assigned to the derivatives of the terminal joints, followed by the mid-limb joints and then the torso and head joints. This ensures that the derivatives of the terminal joints are penalized more than the derivatives of the torso joints.

5.1.5 Loss function

The loss function of our network is the sum of two terms: the Mean Squared Error (MSE) of N different sequences of 3D joint locations, and the mean of the L2 norm of the first-order derivative, with respect to time, of the N sequences of 3D joint locations, where the joints are divided into the three disjoint sets described in the last subsection.

The MSE over N sequences, each of T time steps, of 3D joint locations is given by

L(\hat{Y}, Y) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \|\hat{Y}_{i,t} - Y_{i,t}\|_2^2.   (5.2)

Here, \hat{Y} denotes the estimated 3D joint locations while Y denotes the 3D ground truth.

The mean of the L2 norm of the first-order derivative of the N sequences of 3D joint locations, each of length T, with respect to time is given by

\|\nabla_t \hat{Y}\|_2^2 = \frac{1}{N(T-1)} \sum_{i=1}^{N} \sum_{t=2}^{T} \Big\{ \eta \, \|\hat{Y}^{TH}_{i,t} - \hat{Y}^{TH}_{i,t-1}\|_2^2 + \rho \, \|\hat{Y}^{LM}_{i,t} - \hat{Y}^{LM}_{i,t-1}\|_2^2 + \tau \, \|\hat{Y}^{LT}_{i,t} - \hat{Y}^{LT}_{i,t-1}\|_2^2 \Big\}.   (5.3)

As mentioned in the last subsection, the joints are divided into three disjoint sets based on how error-prone they are. In the above equation, \hat{Y}^{TH}, \hat{Y}^{LM} and \hat{Y}^{LT} denote the predicted 3D locations of the joints belonging to the sets torso head, limb mid and limb terminal respectively. The scalars η, ρ and τ are hyper-parameters controlling the weight given to the derivatives of the 3D locations of each of the three sets of joints; a higher weight is assigned to the sets of joints which are generally predicted with higher error.

The overall loss function for our network is given as

\mathcal{L} = \min_{\hat{Y}} \; \alpha \, L(\hat{Y}, Y) + \beta \, \|\nabla_t \hat{Y}\|_2^2.   (5.4)

Here α and β are scalar hyper-parameters regulating the importance of each of the two terms in the loss function.
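A hedged NumPy sketch of the combined objective in Eqs. (5.2)-(5.4) is given below, computing the joint-group-weighted temporal term over a batch of predicted sequences. The index lists for the three joint groups are placeholders that would have to match the 17-joint ordering of the real data, and the default weights follow the values reported later in Section 5.2.1.

import numpy as np

# Placeholder joint-index groups for a 17-joint skeleton (illustrative only).
TORSO_HEAD    = [0, 1, 2, 3, 4, 5, 6]
LIMB_MID      = [7, 8, 9, 10]              # knees and elbows
LIMB_TERMINAL = [11, 12, 13, 14, 15, 16]   # wrists and ankles, etc.

def pose_loss(pred, gt, alpha=1.0, beta=5.0, eta=1.0, rho=2.5, tau=4.0):
    """Weighted sum of the 3D MSE (Eq. 5.2) and the temporal smoothness term (Eq. 5.3).

    pred, gt: (N, T, 17, 3) arrays of predicted and ground truth joint positions.
    """
    mse = np.mean(np.sum((pred - gt) ** 2, axis=(-2, -1)))       # Eq. (5.2)

    diff = pred[:, 1:] - pred[:, :-1]                            # frame-to-frame change
    sq = np.sum(diff ** 2, axis=-1)                              # (N, T-1, 17)
    smooth = (eta * sq[..., TORSO_HEAD].sum(-1)
              + rho * sq[..., LIMB_MID].sum(-1)
              + tau * sq[..., LIMB_TERMINAL].sum(-1)).mean()     # Eq. (5.3)

    return alpha * mse + beta * smooth                           # Eq. (5.4)

# Example: 32 sequences of length 5.
pred = np.random.randn(32, 5, 17, 3)
gt = np.random.randn(32, 5, 17, 3)
loss = pose_loss(pred, gt)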
5.2 Data Preprocessing

For our sequence-to-sequence network, we normalized the 3D ground truth poses, the noisy 2D pose estimates from the stacked-hourglass network [80] and the 2D ground truth by subtracting the mean and dividing by the standard deviation, in the same manner as in Chapter 3. Just like our previous two models, we do not predict the 3D location of the root joint, i.e. the central hip joint, and hence we zero-center the 3D joint locations relative to the global position of the root node. The 3D poses are predicted in the camera coordinate frame, and the ground truth 3D poses are transformed into the camera coordinate frame using the ground truth parameters of the camera.

Like our first model, we obtain 2D joint locations both from the stacked-hourglass model pre-trained on the MPII dataset [5] and from the model we fine-tuned on the Human3.6M dataset for our first network. The detections in both cases were obtained in the manner described in Chapter 3.

To generate the training sequences, we used a sliding window of length T. The window is slid by one frame at a time to generate the input and output sequences of 2D and 3D poses, each of length T. The training sequences therefore overlap, which gives us more data to train on, always an advantage for deep learning systems. To generate the test sequences, the sliding window is slid in a non-overlapping manner, i.e. the stride of the sliding window equals its size.

5.2.1 Training details

We trained our final network for 100 epochs, where each epoch makes a complete pass over the entire Human3.6M dataset, just like our first network. Because we use 2D joint locations as input, which are low dimensional, we can store the entire Human3.6M dataset in GPU memory. We used the Adam [55] optimizer for training the network, with a learning rate of 1e−5 decayed exponentially per iteration. The weights of the LSTM units are initialized with the Xavier uniform initializer [38]. We used a mini-batch size of 32, i.e. 32 sequences. For most of our experiments we used a sequence length of 5, because it allows fast training with high accuracy. We experimented with different sequence lengths and found lengths of 4, 5 and 6 to generally give better results, which we will discuss in detail in the results section. Our code is implemented in Tensorflow, just like the previous two models. We empirically set the hyper-parameter values α and β of our loss function to 1 and 5 respectively. Similarly, the three hyper-parameters of the temporal consistency constraint, η, ρ and τ, are set to 1, 2.5 and 4 respectively. A single training step, including both the forward pass and back-propagation, for a sequence of length 5 takes only about 34 ms, while a forward pass takes only about 16 ms on an NVIDIA Titan X GPU. Therefore, on average, our network takes only about 3.2 ms to predict the 3D pose of a frame, which is only slightly slower than our first network, which predicts 3D pose at 2 ms per frame, but with a higher accuracy. Our final network is thus simple and fast to train, which allowed us to experiment with different hyper-parameters and components of the architecture.

5.3 Experimental evaluation

Datasets and protocols  For our final model, we perform quantitative evaluation on the Human3.6M [51] dataset, as for our second model. For qualitative evaluation, we used some videos from Youtube and from the Human3.6M dataset.

For our final experiment we follow the standard protocol of the Human3.6M dataset described in Chapter 3. As described in previous chapters, protocol #1 requires using subjects 1, 5, 6, 7, and 8 for training and subjects 9 and 11 for testing, and the error is evaluated on the predicted 3D pose without any transformation. In protocol #2, the predicted pose is rigidly aligned to the ground truth pose using a similarity transform. As with our previous two models, the error metric is the average error per joint in millimeters between the estimated and the ground truth 3D pose relative to the root node. We trained a single model for all the actions.

5.3.1 Quantitative results

Evaluation on estimated 2D pose

As shown by our previous two models, mapping 2D joint locations to 3D is an easier task for deep network models than directly predicting 3D pose from images. Our final network takes the idea of decoupling the 3D pose estimation task even further: we want to see the effect of exploiting temporal information by using a sequence of 2D joint locations to predict a sequence of 3D joint locations. As mentioned in Chapter 3, we obtain two sets of 2D pose detections, one from the stacked-hourglass model pre-trained on the MPII [5] dataset and one from the model we fine-tuned on the Human3.6M dataset [51]. We use a sequence length of 5 to evaluate our final model.

The results on the Human3.6M dataset [51] under protocol #1 are shown in Table 5.1. As we can see, our final model achieves state-of-the-art results under protocol #1.
Compared to our first network, our final network achieves significantlybetter performance for both noisy 2D estimates and ground truth 2D pose. For the2D estimates from stacked-hourglass model pre-trained on MPII dataset, our finalnetwork has an error of approximately 12 mm less than that of the first model. As75Protocol #1 Direct. Discuss Eating Greet Phone Photo Pose Purch. Sitting SitingD Smoke Wait WalkD Walk WalkT AvgLinKDE [51] (SA) 132.7 183.6 132.3 164.4 162.1 205.9 150.6 171.3 151.6 243.0 162.1 170.7 177.1 96.6 127.9 162.1Li et al [64] (MA) – 136.9 96.9 124.7 – 168.7 – – – – – – 132.2 70.0 – –Tekin et al [117] (SA) 102.4 147.2 88.8 125.3 118.0 182.7 112.4 129.2 138.9 224.9 118.4 138.8 126.3 55.1 65.8 125.0Zhou et al [134] (MA) 87.4 109.3 87.1 103.2 116.2 143.3 106.9 99.8 124.5 199.2 107.4 118.1 114.2 79.4 97.7 113.0Tekin et al [116] (SA) – 129.1 91.4 121.7 – 162.2 – – – – – – 130.5 65.8 – –Ghezelghieh et al [37] (SA) 80.3 80.4 78.1 89.7 – – – – – – – – – 95.1 82.2 –Du et al [29] (SA) 85.1 112.7 104.9 122.1 139.1 135.9 105.9 166.2 117.5 226.9 120.0 117.7 137.4 99.3 106.5 126.5Park et al [85] (SA) 100.3 116.2 90.0 116.5 115.3 149.5 117.6 106.9 137.2 190.8 105.8 125.1 131.9 62.6 96.2 117.3Zhou et al [133] (MA) 91.8 102.4 96.7 98.8 113.4 125.2 90.0 93.8 132.2 159.0 107.0 94.4 126.0 79.0 99.0 107.3Nie et al [81] (MA) 90.1 88.2 85.7 95.6 103.9 103.0 92.4 90.4 117.9 136.4 98.5 94.4 90.6 86.0 89.5 97.5Rogez et al [73] (MA) – – – – – – – – – – – – – – – 88.1Mehta et al [73] (MA) 57.5 68.6 59.6 67.3 78.1 82.4 56.9 69.1 100.0 117.5 69.4 68.0 76.5 55.2 61.4 72.9Mehta et al [74] (MA) 62.6 78.1 63.4 72.5 88.3 93.8 63.1 74.8 106.6 138.7 78.8 73.9 82.0 55.8 59.6 80.5Lin et al [65] (MA) 58.0 68.2 63.3 65.8 75.3 93.1 61.2 65.7 98.7 127.7 70.4 68.2 72.9 50.6 57.7 73.1Tome et al [119] (MA) 65.0 73.5 76.8 86.4 86.3 110.7 68.9 74.8 110.2 173.9 84.9 85.8 86.3 71.4 73.1 88.4Tekin et al [118] 54.2 61.4 60.2 61.2 79.4 78.3 63.1 81.6 70.1 107.3 69.3 70.3 74.3 51.8 63.2 69.7Pavlakos et al [87] (MA) 67.4 71.9 66.7 69.1 72.0 77.0 65.0 68.3 83.7 96.5 71.7 65.8 74.9 59.1 63.2 71.9Our first model (SH detections) (MA) 53.3 60.8 62.9 62.7 86.4 82.4 57.8 58.7 81.9 99.8 69.1 63.9 67.1 50.9 54.8 67.5Our first model (SH detections FT) (MA) 51.8 56.2 58.1 59.0 69.5 78.4 55.2 58.1 74.0 94.6 62.3 59.1 65.1 49.5 52.4 62.9Our seq-2-seq model (SH detections) (MA) 45.2 51.0 55.5 51.3 75.3 62.6 48.4 47.4 67.6 75.4 61.0 52.1 53.6 43.9 45.6 55.7Our seq-2-seq model (SH detections FT) (MA) 44.2 46.7 52.3 49.3 59.9 59.4 47.5 46.2 59.9 65.6 55.8 50.4 52.3 43.5 45.1 51.9Our first model (GT detections) (MA) 37.7 44.4 40.3 42.1 48.2 54.9 44.4 42.1 54.6 58.0 45.1 46.4 47.6 36.4 40.4 45.5Our seq-2-seq model (GT detections) (MA) 35.2 40.8 37.2 37.4 43.2 44.0 38.9 35.6 42.3 44.6 39.7 39.7 40.2 32.8 35.5 39.2Table 5.1: Results showing errors action-wise on Human3.6M [51] underProtocol #1 (no rigid alignment or similarity transform applied in post-processing). Note that our results reported here are for sequence of length5. SH indicates that we trained and tested our model with the detectionsof Stacked Hourglass [80] model pre-trained on MPII dataset [5] as input,and FT indicates that the the stacked-hourglass model was fine-tuned onHuman3.6M. 
SA indicates that a model was trained for each action, andMA indicates that a single model was trained for all actions.The bold-faced numbers mean the best result while underlined numbers representthe second best.stated before in Chapter 3, a better 2D pose estimate improves the performanceof the network. When trained on detections of a fine-tuned 2D pose detector, theerror of our final network decreased by approximately 4 mm. As can be seen fromTable 5.1, the error of our network for fine-tuned 2D detections is 51.9 mm whichis 11 mm lower than the error of our first model which had an error of 62.9 mm onfine-tuned detections. Our sequence-to-sequence model beats the previous state-of-the-art by Pavlakos et al. [87] by 20 mm (almost 28% better) on protocol #1.The results for protocol #2, which aligns the predictions to the ground truthusing a similarity transform before computing error, is reported in Table 5.2. Ourmethod improves the results of our first model by 8.1 mm and 5.7 mm for detec-76Protocol #2 Direct. Discuss Eating Greet Phone Photo Pose Purch. Sitting SitingD Smoke Wait WalkD Walk WalkT AvgAkhter & Black [2]* (MA) 14j 199.2 177.6 161.8 197.8 176.2 186.5 195.4 167.3 160.7 173.7 177.8 181.9 176.2 198.6 192.7 181.1Ramakrishna et al [92]* (MA) 14j 137.4 149.3 141.6 154.3 157.7 158.9 141.8 158.1 168.6 175.6 160.4 161.7 150.0 174.8 150.2 157.3Zhou et al [134]* (MA) 14j 99.7 95.8 87.9 116.8 108.3 107.3 93.5 95.3 109.1 137.5 106.0 102.2 106.5 110.4 115.2 106.7Rogez et al [73] (MA) – – – – – – – – – – – – – – – 87.3Nie et al [81] (MA) 62.8 69.2 79.6 78.8 80.8 86.9 72.5 73.9 96.1 106.9 88.0 70.7 76.5 71.9 76.5 79.5Mehta et al [73] (MA) 14j – – – – – – – – – – – – – – – 54.6Bogo et al [16] (MA) 14j 62.0 60.2 67.8 76.5 92.1 77.0 73.0 75.3 100.3 137.3 83.4 77.3 86.8 79.7 87.7 82.3Moreno-Noguer [76] (MA) 14j 66.1 61.7 84.5 73.7 65.2 67.2 60.9 67.3 103.5 74.6 92.6 69.6 71.5 78.0 73.2 74.0Tekin et al [118] (MA) 17j – – – – – – – – – – – – – – – 50.1Pavlakos et al [87] (MA) 17j – – – – – – – – – – – – – – – 51.9Our first model (SH detections) (MA) 17j 42.2 48.0 49.8 50.8 61.7 60.7 44.2 43.6 64.3 76.5 55.8 49.1 53.6 40.8 46.4 52.5Our first model (SH detections FT) (MA) 17j 39.5 43.2 46.4 47.0 51.0 56.0 41.4 40.6 56.5 69.4 49.2 45.0 49.5 38.0 43.1 47.7Our seq-2-seq model (SH detections) (MA) 37.7 41.2 45.5 42.4 54.9 48.9 38.1 37.2 54.1 57.7 49.2 40.9 44.7 35.0 38.9 44.4Our seq-2-seq model (SH detections FT) (MA) 36.9 37.9 42.8 40.3 46.8 46.7 37.7 36.5 48.9 52.6 45.6 39.6 43.5 35.2 38.5 42.0Table 5.2: Results showing errors action-wise on Human3.6M [51] datasetunder protocol #2 (rigid alignment in post-processing). Note that theresults reported here are for sequence of length 5. The 14j annotationindicates that the body model considers 14 body joints while 17j meansconsiders 17 body joints. (SA) annotation indicates per-action modelwhile (MA) indicates single model used for all actions. FT indicates thatthe stacked-hourglass model has been fine-tuned on Human3.6M dataset.The bold-faced numbers mean the best result while underlined numbersrepresent the second best. The results of the methods are obtained fromthe original papers, except for (*), which were obtained from [16].tions from pre-trained and fine-tuned models respectively. This beats the previousstate-of-the-art by Tekin et al. [118] by 5.7 mm for detections from out-of-the-boxhourglass model and by 8.1 mm for fine-tuned model. 
Like protocol #1, our model also achieves the best result under protocol #2.

From the above tables, we observe that exploiting temporal information across multiple frames is indeed useful. It significantly improves the overall accuracy of the estimated 3D joint locations, particularly on actions like phoning and sitting down, on which most of the previous approaches have performed poorly due to heavy occlusion. For the detections from the fine-tuned stacked hourglass, our network achieves the lowest error on every action class of the Human3.6M dataset. Note that we used the same detections from the fine-tuned stacked hourglass in our first network.

Evaluation on 2D ground truth

As suggested by the results from our first model, the more accurate the 2D joint locations are, the better the estimates for 3D pose. We carried out the same experiment for our final network to show that the lower bound on the 3D pose error can be decreased even further by exploiting temporal information across the sequences. We used sequences of 2D ground truth poses of length 5 as input to train our network. The results of experimenting with ground truth 2D joint locations under protocol #1 are reported in Table 5.1. As seen from the table, our sequence-to-sequence model improves the lower bound error of our first network by almost 6.3 mm.

The results for protocol #2 are reported in Table 5.3, where we show the robustness of our network, which is trained using the ground truth 2D pose and tested with different levels of Gaussian noise. We can see that even under protocol #2 our final network outperforms our first network when there is no noise in the 2D joint locations.

From the results mentioned above, we can hypothesize that temporal consistency information over a sequence of poses is a valuable cue for the task of estimating 3D pose. Even on noise-free ground truth data, the temporal information improves the overall performance.

Performance on different sequence lengths

The results reported so far have been for input and output sequences of length 5 only. We carried out experiments to see how our network performs for different sequence lengths ranging from 2 to 10. The results are shown in Figure 5.1. We carried out this experiment for 2D detections from both the out-of-the-box stacked hourglass and the fine-tuned one. As can be seen, the performance of our network in both cases remains stable for sequences of varying lengths. The best results were obtained for lengths 4, 5 and 6. However, we chose sequence length 5 for our experiments as a compromise between training time and accuracy.

Figure 5.1: Mean Per Joint Error (MPJE) in mm of our network for different sequence lengths. SH Pre-trained indicates that 2D poses are estimated using the stacked-hourglass model pre-trained on MPII [5], while SH FT indicates that the detections were obtained from the stacked-hourglass model fine-tuned by us on the Human3.6M dataset.

Robustness to noise

Like our first model, to test the tolerance of our final model to noise in the input 2D joint locations, we carried out experiments where we train our model on ground truth 2D pose data and evaluate its performance on inputs corrupted by different levels of Gaussian noise (the corruption procedure is sketched below). As mentioned in Chapter 3, we use protocol #2 for this comparison, which rigidly aligns the output with the ground truth. Table 5.3 shows how our final model compares against the model by Moreno-Noguer [76] and our first network.
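The corruption used in this experiment is simple; a minimal sketch is shown below. The function name and the evaluation loop are illustrative only (they are not taken from our code base); the noise is zero-mean Gaussian with the standard deviation given in pixels on the 440×440 crop.

```python
import numpy as np

def corrupt_2d_poses(poses_2d, sigma):
    """poses_2d: (num_frames, num_joints, 2) ground truth 2D joints in pixels."""
    noise = np.random.normal(loc=0.0, scale=sigma, size=poses_2d.shape)
    return poses_2d + noise

# Illustrative evaluation loop: the model is trained on clean 2D ground truth and
# only the test inputs are perturbed.
# for sigma in (5, 10, 15, 20):
#     noisy_inputs = corrupt_2d_poses(test_poses_2d, sigma)
#     evaluate_protocol2(model, noisy_inputs, test_poses_3d)  # hypothetical helper
```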
Both of our networks are significantly more robust to noise than Moreno-Noguer's model [76]. Comparing our two networks, we find a similar level of tolerance to noise. Our sequence-to-sequence network trained on ground truth 2D pose fares better when the level of input noise is low (standard deviation below 10), whereas our first model proves to be marginally more robust at higher noise levels. We have also evaluated the case when our network was tested with noisy detections from the stacked-hourglass model [80] not fine-tuned on Human3.6M data. The out-of-the-box stacked-hourglass network has an error of 15 pixels on average per joint. Similar to our observation for higher levels of Gaussian noise, our sequence-to-sequence network slightly under-performs compared to our first model. Note that the size of the cropped region around the person is 440×440. One reason why our sequence-to-sequence network, trained on ground truth data, is more sensitive to higher levels of noise may be that, because of the temporal smoothness constraint, the errors from individual frames get distributed over the entire sequence to maintain smoothness, whereas for our first model the errors are independent in each frame.

                     DMR [76]   Our first model   Ours (seq-2-seq)
GT/GT                62.17      37.10             31.67
GT/GT + N(0,5)       67.11      46.65             37.46
GT/GT + N(0,10)      79.12      52.84             49.41
GT/GT + N(0,15)      96.08      59.97             61.80
GT/GT + N(0,20)      115.55     70.24             73.65
GT/SH [80]           –          60.52             62.43

Table 5.3: Performance of our system trained with ground truth 2D pose of the Human3.6M [51] dataset and tested under different levels of additive Gaussian noise (top) and on 2D pose predictions from the stacked-hourglass [80] pose detector (bottom) under protocol #2. The size of the cropped region around the person is 440×440.

Ablative analysis

To show the effectiveness of the different components of our network, we perform an ablative analysis. We follow protocol #1 for the ablative analysis and trained a single model for all the actions. The errors reported here are for 2D pose predictions from the fine-tuned stacked-hourglass network [80]. The results are reported in Table 5.4.

                                          error (mm)   ∆
Ours                                      51.9         –
w/o temporal consistency constraint       52.7         0.8
w/o recurrent dropout                     58.3         6.4
w/o layer normalized LSTM                 61.1         9.2
w/o layer norm and recurrent dropout      59.5         7.6
w/o residual connections                  102.4        50.5

Table 5.4: Ablative and hyperparameter sensitivity analysis.

From the table, we observe that the biggest improvement comes from the residual connections on the decoder side, which agrees with the hypothesis of He et al. [44]. Removing the residual connections increases the error by 50.5 mm, which is a huge margin. When we train our network without layer normalization on the LSTM units, the error increases by 9.2 mm, and when no recurrent dropout is used, the error rises by 6.4 mm. If neither layer normalization nor recurrent dropout is used, the results get worse by 7.6 mm. Although the temporal consistency constraint may seem to have less impact (only 0.8 mm) on the performance of our network, it ensures that the predictions over a sequence are smooth and temporally consistent, which is apparent from our qualitative results discussed in the next section.
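Before turning to those qualitative results, the sketch below makes the temporal consistency term from Table 5.4 concrete: a per-frame 3D joint loss plus a penalty on the difference between consecutive predictions. The exact form and weighting used in our implementation may differ; this is an illustration of the idea rather than our training code.

```python
import numpy as np

def sequence_training_loss(pred_seq, gt_seq, smooth_weight=1.0):
    """pred_seq, gt_seq: (seq_len, num_joints, 3) arrays of 3D poses."""
    # Per-frame supervision: squared error on every joint of every frame.
    joint_loss = np.mean(np.sum((pred_seq - gt_seq) ** 2, axis=-1))
    # Temporal consistency: penalize large changes between consecutive predictions.
    diffs = pred_seq[1:] - pred_seq[:-1]
    smooth_loss = np.mean(np.sum(diffs ** 2, axis=-1))
    return joint_loss + smooth_weight * smooth_loss
```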
5.3.2 Qualitative results

We provide some qualitative results on Human3.6M sequences and some YouTube videos. The 2D poses were detected using the fine-tuned stacked-hourglass model. We show qualitative results on Human3.6M under protocol #1 in Figure 5.2, Figure 5.3 and Figure 5.4. The results for YouTube videos are shown in Figure 5.5, Figure 5.6 and Figure 5.7. To generate these results we used our network trained on fine-tuned stacked-hourglass predictions with a sequence length of 5.

We can see that for the Human3.6M sequences, our network predicts smooth and temporally consistent 3D poses in challenging actions like sitting down, phoning and taking photo, on which most methods performed worse than on other actions. In particular, in Figure 5.2 we can see that the 2D detection for the second frame is noisy, yet our network manages to estimate a temporally consistent 3D pose based on the information from the previous frame.

The real advantage of using the temporal smoothness constraint during training is apparent in the results of our network on YouTube video sequences. As can be seen in Figure 5.5, for the 3rd and 5th frames the 2D pose detector totally breaks and estimates unrealistic 2D poses, yet our network was able to recover a meaningful and consistent 3D pose by exploiting the temporal information. Also, in Figure 5.7, the 2D pose estimator generates very noisy poses from which our network successfully predicts temporally coherent 3D poses.

5.3.3 Discussion of results

Both the quantitative and qualitative results for our sequence-to-sequence network show the effectiveness of exploiting temporal information over multiple frames to estimate 3D poses which are temporally smooth. Our network achieved the best accuracy on all of the 15 actions, which is a remarkable feat. In particular, most of the previous work struggled with actions which have a high degree of occlusion, like taking photo, talking on the phone, sitting and sitting down. Our network has significantly better results for these actions; e.g., for the sitting down action our error is lower by an impressive 29 mm, while for the rest of the complicated actions the improvement ranges between 6 and 19 mm.

We have seen that our network is reasonably robust to noisy 2D poses. Although the contribution of the temporal smoothness constraint is not apparent in the ablative analysis in Table 5.4, its effectiveness is highlighted in the qualitative results, particularly on the challenging YouTube videos, where we observe that, even when the 2D pose estimator breaks and generates faulty predictions, our network can recover a meaningful 3D pose.

Our final network effectively demonstrates the power of using temporal information, and we achieved it using a simple sequence-to-sequence network which can be trained efficiently in a reasonably short time. Our network also makes predictions at 3 ms per frame on average, which suggests that, provided the 2D pose detector runs in real time, our network can be applied in real-time scenarios.

Figure 5.2: Qualitative result of Subject 11, action sitting down, for the Human3.6M dataset [51]. (Left) Image with 2D pose, (Middle) 3D ground truth pose in red and blue, (Right) 3D pose estimates in green and purple.

Figure 5.3: Qualitative result of Subject 9, action phoning, for the Human3.6M dataset [51]. (Left) Image with 2D pose, (Middle) 3D ground truth pose in red and blue, (Right) 3D pose estimates in green and purple.

Figure 5.4: Qualitative result of Subject 11, action taking photo, for the Human3.6M dataset [51]. (Left) Images with 2D pose detections, (Middle) 3D ground truth pose in red and blue, (Right) 3D pose estimates in green and purple.

Figure 5.5: Qualitative results on YouTube videos. (Left) Images with 2D pose detections, (Right) our 3D pose estimation.

Figure 5.6: Qualitative results on YouTube videos.
(Left) Images with 2D pose detections, (Right) our 3D pose estimation.

Figure 5.7: Qualitative results on YouTube videos. (Left) Images with 2D pose detections, (Right) our 3D pose estimation.

Chapter 6: Conclusion and future work

In this work, we analyzed the sources of error for the task of 3D pose estimation. We designed three different deep-network-based models to address the task in three different manners. One key issue for 3D pose estimation is that the major source of error for the task is still not well understood. The difficulty of predicting 3D joints from images can arise from any of these reasons:

• error in predicting 2D joint locations from an image;
• difficulty in learning image features that can be reliably mapped to 3D joint locations;
• difficulty in mapping a 2D pose representation to 3D pose.

Our first network decouples the task of estimating 3D pose from an image into two parts: i) estimating 2D joint locations and ii) transforming the 2D pose to 3D. In this experiment we wanted to verify how accurately 2D poses can be translated to 3D. Empirically, we found that a simple network composed of a set of fully connected linear layers with residual connections predicted 3D pose from ground truth 2D pose with remarkably high accuracy, almost 30% better than the state-of-the-art (a minimal sketch of such a lifting block is given below). When trained with noisy 2D detections from a pre-trained 2D pose detector, the error increased, but still gave us a better result than the state-of-the-art model by Tekin et al. [118]. When the same network was trained with the 2D pose detector fine-tuned on the Human3.6M [51] dataset, the results improved by more than 7%. From these results we hypothesize that the task of mapping 2D joint locations to 3D is easier than mapping to 3D joint locations directly from the image.

To further test our hypothesis, we trained a second model end-to-end that predicts 3D poses from an image directly by stacking our first model on top of a 2D pose estimator, such that the 2D joint heatmaps are fed as input to the second part. We found that it is much more difficult to train this network, because the prediction error was significantly higher compared to our first network. This further supported our hypothesis that, although state-of-the-art 2D pose estimators are very accurate, the noise in their detections is the primary cause of error in 3D pose estimation. Through these experiments, we also found that 2D joint locations, despite being low in dimension, are a better feature for learning 3D pose than 2D joint heatmaps.

The results obtained contradict the hypotheses proposed by recent methods for 3D pose estimation to justify their complex systems trained end-to-end to predict 3D pose from images. For example, Pavlakos et al. [87] claimed that it is more difficult to regress 3D joint locations directly than to predict volumetric heatmaps of joints, whereas our first network showed that 3D joint locations can be predicted with high accuracy from something as simple as the 2D coordinates of joints. Although image features can provide useful cues, we would like to argue that finding invariant features and complex representations of 3D pose, which has been the focus of a majority of recent approaches, may either not be that important or has not yet been utilized to its full potential.
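To make the simplicity of this 2D-to-3D mapping concrete, the following is a minimal sketch of a lifting block of the kind used by our first network: fully connected layers with batch normalization, ReLU and dropout, wrapped in a residual connection. The layer width, dropout rate, number of blocks and joint count shown here are illustrative rather than the exact hyperparameters of our model, and the Keras code is a re-implementation sketch rather than our original training script.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_JOINTS = 16  # illustrative; the exact joint set depends on the evaluation protocol

def residual_block(x, width=1024, dropout=0.5):
    y = layers.Dense(width)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Dropout(dropout)(y)
    y = layers.Dense(width)(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Dropout(dropout)(y)
    return layers.Add()([x, y])                     # residual (shortcut) connection

inputs = layers.Input(shape=(NUM_JOINTS * 2,))      # flattened 2D joint coordinates
x = layers.Dense(1024)(inputs)                      # project into the working width
x = residual_block(x)
x = residual_block(x)
outputs = layers.Dense(NUM_JOINTS * 3)(x)           # flattened root-relative 3D joints
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")         # simple squared-error regression
```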
In our third and final model, we wanted to examine the effect of exploiting temporal consistency information over a sequence. For this purpose, we designed a sequence-to-sequence network with shortcut connections on the decoder side, which connect the input of the decoder to its output. Given a sequence of 2D joint locations, our network predicts a sequence of 3D poses. We also imposed a temporal consistency constraint on the network during training. Our final network significantly outperformed our first network (which happened to be the state-of-the-art among all methods) for both noisy and ground truth 2D pose, thereby proving the effectiveness of exploiting temporal information from multiple frames. Also, qualitatively, our sequence-to-sequence network predicts more temporally coherent poses under noisy 2D inputs, reducing the jitter that occurred when 3D poses were estimated on each frame separately.

Next we will discuss some of the future research directions not addressed by our work, followed by a summary of our work and contributions.

6.1 Future directions

One area not addressed by our systems, and by most recent work, is the absolute location of the person in the 3D world. To find this, the homography information and camera parameters must be known, which is possible if the 3D pose estimation system is deployed on smartphones or tablets, or if we know the camera used to capture the image. However, to find 3D pose from arbitrary images or videos, one approach can be to estimate the extrinsic parameters of the camera along with the 3D pose. Finding the absolute location of the root joint is critical for multi-person 3D pose estimation, which is an interesting research path to explore.

For our end-to-end network, we used 2D joint heatmaps as features for the 3D pose estimator. However, the latent features learned from the images by the 2D pose estimator can provide valuable and discriminative information. We did not train any end-to-end model from such deep features. One future research direction could be to take the intermediate features learned by the deep convolutional layers of either a 2D pose estimator like the stacked hourglass [80] or a network trained on ImageNet like ResNet-101 [44], perform a 1×1 convolution to reduce the number of channels, and estimate 3D joint locations from these deep features. Most state-of-the-art 2D pose estimators estimate the locations of joints by applying a 2D argmax operation over the heatmaps. However, since the argmax function is non-differentiable, we cannot put it in an end-to-end deep learning pipeline. A way around this could be to estimate an expected gradient for the argmax operation using multiple samples of the 2D heatmaps, similar to the policy gradients commonly used in reinforcement learning to train deep reinforcement learning networks.

Moreno-Noguer [76] showed that a distance matrix can be a good representation of the structure of the human body. One interesting direction could be to combine the distance matrix of the joints with the 2D joint locations as input.

Since deep networks depend on large amounts of data, we can simulate 2D detectors by projecting 3D motion capture data through multiple virtual cameras and adding some noise to augment the training data. One limitation of all our networks is that they cannot estimate 3D poses when a person is in an unusual orientation, e.g. upside down while diving or doing a front flip, because of the absence of such poses in 3D pose estimation datasets. Augmenting the training data with such poses could help alleviate this problem.
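As a rough illustration of the virtual-camera augmentation mentioned above, the sketch below projects 3D motion capture joints through a virtual pinhole camera and perturbs the result to imitate an imperfect 2D detector. The camera pose, focal length, principal point and noise level are placeholders, not values from our experiments.

```python
import numpy as np

def project_to_virtual_camera(joints_3d, R, t, f=1000.0, c=(500.0, 500.0), sigma=3.0):
    """joints_3d: (num_joints, 3) world coordinates; R: (3, 3) rotation; t: (3,) translation.

    Assumes all joints end up in front of the camera (positive depth)."""
    cam = joints_3d @ R.T + t                               # world -> camera coordinates
    uv = f * cam[:, :2] / cam[:, 2:3]                       # pinhole projection
    uv = uv + np.asarray(c)                                 # shift by the principal point
    uv = uv + np.random.normal(scale=sigma, size=uv.shape)  # simulated detector noise
    return uv
```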
For a sequence of images, it will be interesting to see whether the translation and change of orientation of the root joint over a sequence can be predicted by exploiting temporal information. It may also be interesting to see whether deep features from networks like ResNet-101 [44] or Mask R-CNN [45] can be combined with LSTM units to learn the temporal coherence between the 3D poses.

6.2 Conclusion

To summarize our work, we designed two simple, yet sophisticated and robust networks, both of which can be trained very quickly to estimate 3D poses from noisy 2D joint locations. We hypothesized that a majority of the error in 3D pose estimation comes from the error in the 2D pose detections, and that training a network end-to-end to predict 3D pose from images directly is more difficult and computationally expensive. Finally, we showed that temporal coherence information over a sequence can be exploited efficiently to improve the accuracy of 3D pose estimation and to produce estimates which are temporally smooth. Both of our networks also generalize well to arbitrary and noisy inputs, as evidenced by their performance on the MPII dataset and YouTube videos.

Bibliography

[1] A. Agarwal and B. Triggs. 3D human pose from silhouettes by relevance vector regression. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2004. → pages 2, 18, 19

[2] I. Akhter and M. J. Black. Pose-conditioned joint angle limits for 3d human pose reconstruction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1446–1455, 2015. → pages 2, 17, 18, 24, 44, 50, 64, 65, 77

[3] S. Amin, M. Andriluka, M. Rohrbach, and B. Schiele. Multi-view pictorial structures for 3d human pose estimation. In British Machine Vision Conference (BMVC), 2013. → pages 18, 27

[4] M. Andriluka, S. Roth, and B. Schiele. Monocular 3d pose estimation and tracking by detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 623–630. IEEE, 2010. → pages 2, 18, 26

[5] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. → pages viii, ix, xiii, xiv, 9, 47, 49, 51, 57, 65, 73, 75, 76, 79

[6] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016. → pages 13, 68, 70

[7] A. Baak, M. Müller, G. Bharaj, H.-P. Seidel, and C. Theobalt. A data-driven approach for real-time full body pose reconstruction from a depth camera. In Consumer Depth Cameras for Computer Vision, pages 71–98. Springer, 2013. → pages 18, 28

[8] C. Barron and I. A. Kakadiaris. Estimating anthropometry and pose from a single uncalibrated image. Computer Vision and Image Understanding (CVIU), 81(3):269–284, 2001. URL http://dx.doi.org/10.1006/cviu.2000.0888. → pages 16

[9] S. Behnke. Hierarchical neural networks for image interpretation, volume 2766. Springer Science & Business Media, 2003. → pages 33

[10] V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic. 3d pictorial structures for multiple human pose estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1669–1676, 2014. → pages 18, 28

[11] S. Belongie, J. Malik, and J. Puzicha. Shape context: A new descriptor for shape matching and object recognition. In Advances in neural information processing systems, pages 831–837, 2001. → pages 19

[12] S. Belongie, J. Malik, and J.
Puzicha. Shape matching and objectrecognition using shape contexts. IEEE transactions on pattern analysisand machine intelligence, 24(4):509–522, 2002. → pages 19[13] P. Biswas, T.-C. Liang, K.-C. Toh, Y. Ye, and T.-C. Wang. Semidefiniteprogramming approaches for sensor network localization with noisydistance measurements. IEEE transactions on automation science andengineering, 3(4):360–371, 2006. → pages 25[14] L. Bo and C. Sminchisescu. Twin Gaussian processes for structuredprediction. International Journal of Computer Vision (IJCV), 87(1-2),2010. → pages 18, 19, 53[15] L. F. Bo, C. Sminchisescu, A. Kanaujia, and D. N. Metaxas. Fastalgorithms for large scale conditional 3D prediction. In The IEEEConference on Computer Vision and Pattern Recognition (CVPR), pages1–8, 2008. URL http://dx.doi.org/10.1109/CVPR.2008.4587578. → pages2, 18, 19[16] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black.Keep it smpl: Automatic estimation of 3d human pose and shape from asingle image. In European Conference on Computer Vision (ECCV), pages561–578. Springer, 2016. → pages viii, ix, x, 16, 18, 24, 25, 44, 50, 65, 77[17] I. Bu¨lthoff, H. Bu¨lthoff, and P. Sinha. Top-down influences on stereoscopicdepth-perception. Nature Neuroscience, 1(3):254–257, 1998. → pages 1594[18] M. Burenius, J. Sullivan, and S. Carlsson. 3d pictorial structures formultiple view articulated pose estimation. In The IEEE Conference onComputer Vision and Pattern Recognition (CVPR), pages 3618–3625,2013. → pages 18, 28[19] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d poseestimation using part affinity fields. arXiv preprint arXiv:1611.08050,2016. → pages 2, 29[20] C. S. Catalin Ionescu, Fuxin Li. Latent structured models for human poseestimation. In IEEE International Conference on Computer Vision (ICCV),2011. → pages 9, 12, 16, 21[21] K. Chellapilla, S. Puri, and P. Simard. High performance convolutionalneural networks for document processing. In Tenth International Workshopon Frontiers in Handwriting Recognition. Suvisoft, 2006. → pages 33[22] C.-H. Chen and D. Ramanan. 3d human pose estimation= 2d poseestimation+ matching. arXiv preprint arXiv:1612.06524, 2016. → pages 2,18, 21[23] K. Cho, B. V. Merrie¨nboer, C. Gulcehre, D. B. F. Bougares, H. Schwenk,and T. Bengio. Learning phrase representations using RNNencoder-decoder for statistical machine translation. In Conference onEmpirical Methods in Natural Language Processing (EMNLP 2014), 2014.→ pages 40[24] D. C. Cires¸an, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep,big, simple neural nets for handwritten digit recognition. Neuralcomputation, 22(12):3207–3220, 2010. → pages 33[25] N. Dalal and B. Triggs. Histograms of oriented gradients for humandetection. In The IEEE Conference on Computer Vision and PatternRecognition (CVPR), volume 1, pages 886–893. IEEE, 2005. → pages 19[26] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood fromincomplete data via the em algorithm. Journal of the royal statisticalsociety. Series B (methodological), pages 1–38, 1977. → pages 25[27] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: Alarge-scale hierarchical image database. In The IEEE Conference onComputer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE,2009. → pages 33, 44, 6195[28] J. Deutscher, A. Blake, and I. Reid. Articulated body motion capture byannealed particle filtering. In The IEEE Conference on Computer Visionand Pattern Recognition (CVPR), volume 2, pages 126–133. 
IEEE, 2000.→ pages 19[29] Y. Du, Y. Wong, Y. Liu, F. Han, Y. Gui, Z. Wang, M. Kankanhalli, andW. Geng. Marker-less 3d human motion capture with monocular imagesequence and height-maps. In European Conference on Computer Vision(ECCV), pages 20–36. Springer, 2016. → pages 18, 26, 47, 49, 65, 76[30] D. Eigen and R. Fergus. Predicting depth, surface normals and semanticlabels with a common multi-scale convolutional architecture. In IEEEInternational Conference on Computer Vision (ICCV), 2015. → pages 15[31] A. Elhayek, E. de Aguiar, A. Jain, J. Tompson, L. Pishchulin,M. Andriluka, C. Bregler, B. Schiele, and C. Theobalt. Efficientconvnet-based marker-less motion capture in general scenes with a lownumber of cameras. In Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition, pages 3810–3818, 2015. → pages 18, 28[32] B. Farley and W. Clark. Simulation of self-organizing systems by digitalcomputer. Transactions of the IRE Professional Group on InformationTheory, 4(4):76–84, 1954. → pages 31[33] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for objectrecognition. International journal of computer vision (IJCV), 61(1):55–79,2005. → pages 21[34] M. A. Fischler and R. A. Elschlager. The representation and matching ofpictorial structures. IEEE Transactions on computers, 100(1):67–92, 1973.→ pages 20[35] F. A. Gers and J. Schmidhuber. Recurrent nets that time and count. InProceedings of the IEEE-INNS-ENNS International Joint Conference onNeural Networks (IJCNN), volume 3, pages 189–194, 2000. → pages 38,39[36] F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continualprediction with lstm. In Ninth International Conference on ArtificialNeural Networks (ICANN).(Conf. Publ. No. 470), volume 2, pages850–855. IET, 1999. → pages 38, 3996[37] M. F. Ghezelghieh, R. Kasturi, and S. Sarkar. Learning camera viewpointusing cnn to improve 3d body pose estimation. In 3D Vision (3DV), 2016Fourth International Conference on, pages 685–693. IEEE, 2016. → pages49, 76[38] X. Glorot and Y. Bengio. Understanding the difficulty of training deepfeedforward neural networks. In Proceedings of the ThirteenthInternational Conference on Artificial Intelligence and Statistics, pages249–256, 2010. → pages 74[39] C. Goodall. Procrustes methods in the statistical analysis of shape. Journalof the Royal Statistical Society. Series B (Methodological), pages 285–339,1991. → pages 50, 64[40] A. Graves and J. Schmidhuber. Framewise phenome classification withbidirectional lstm and other neural network architectures. Neural Networks,18(5-6):602–610, 2005. → pages 38[41] A. Gupta, J. Martinez, J. J. Little, and R. J. Woodham. 3D Pose fromMotion for Cross-view Action Recognition via Non-linear CirculantTemporal Encoding. In The IEEE Conference on Computer Vision andPattern Recognition (CVPR), 2014. → pages 2, 18, 20[42] A. Gupta, J. He, J. Martinez, J. J. Little, and R. J. Woodham. Efficientvideo-based retrieval of human motion with flexible alignment. In IEEEWinter Conference on Applications of Computer Vision (WACV), pages1–9. IEEE, 2016. → pages 20, 21[43] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers:Surpassing human-level performance on imagenet classification. In IEEEInternational Conference on Computer Vision (ICCV), pages 1026–1034,2015. → pages 48, 63[44] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for imagerecognition. In The IEEE Conference on Computer Vision and PatternRecognition (CVPR), pages 770–778, 2016. 
→ pages 2, 11, 22, 37, 42, 45,60, 61, 69, 81, 91, 92[45] K. He, G. Gkioxari, P. Dolla´r, and R. Girshick. Mask r-cnn. arXiv preprintarXiv:1703.06870, 2017. → pages 2, 30, 92[46] G. Hinton, N. Srivastava, and K. Swersky. Lecture 6.5-rmsprop: Divide thegradient by a running average of its recent magnitude., 2012. → pages 6197[47] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm fordeep belief nets. Neural computation, 18(7):1527–1554, 2006. → pages 33[48] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neuralcomputation, 9(8):1735–1780, 1997. → pages 13, 23, 39, 68[49] D. H. Hubel and T. N. Wiesel. Receptive fields and functional architectureof monkey striate cortex. The Journal of physiology, 195(1):215–243,1968. → pages 34[50] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep networktraining by reducing internal covariate shift. In International Conferenceon Machine Learning (ICML), 2015. → pages 11, 42, 43, 45, 70[51] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Largescale datasets and predictive methods for 3d human sensing in naturalenvironments. IEEE Transactions on Pattern Analysis and MachineIntelligence, 36(7):1325–1339, jul 2014. → pages viii, ix, x, xi, xii, xiv, 9,10, 12, 16, 17, 21, 48, 49, 50, 54, 63, 65, 75, 76, 77, 80, 83, 84, 85, 89[52] M. Isard. Pampas: Real-valued graphical models for computer vision. InThe IEEE Conference on Computer Vision and Pattern Recognition(CVPR), volume 1, pages I–I. IEEE, 2003. → pages 27[53] H. Jiang. 3d human pose reconstruction using millions of exemplars. InInternational Conference on Pattern Recognition (ICPR), pages1674–1677. IEEE, 2010. → pages 2, 18, 20[54] A. Kanaujia, C. Sminchisescu, and D. Metaxas. Semi-supervisedhierarchical models for 3d human pose reconstruction. In The IEEEConference on Computer Vision and Pattern Recognition (CVPR), pages1–8. IEEE, 2007. → pages 19[55] D. Kingma and J. Ba. Adam: A method for stochastic optimization. InInternational Conference on Learning Representations (ICLR), 2015. →pages 48, 63, 74[56] I. Kostrikov and J. Gall. Depth sweep regression forests for estimating 3dhuman pose from images. In British Machine Vision Conference (BMVC),2014. → pages 18, 20, 53[57] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification withdeep convolutional neural networks. In Advances in neural information98processing systems (NIPS), pages 1097–1105, 2012. → pages 2, 21, 22, 33,37, 44, 61[58] H. W. Kuhn. The hungarian method for the assignment problem. NavalResearch Logistics (NRL), 2(1-2):83–97, 1955. → pages 30[59] H. W. Kuhn. Variants of the hungarian method for assignment problems.Naval Research Logistics (NRL), 3(4):253–258, 1956. → pages 30[60] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learningapplied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. → pages 21, 36[61] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic objectrecognition with invariance to pose and lighting. In The IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR), volume 2, pagesII–104. IEEE, 2004. → pages 21[62] H. J. Lee and Z. Chen. Determination of 3D human body postures from asingle view. Computer Vision, Graphics and Image Processing, 30:148–168, 1985. → pages 18, 23[63] S. Li and A. B. Chan. 3d human pose estimation from monocular imageswith deep convolutional neural network. In Asian Conference on ComputerVision (ACCV), pages 332–347. Springer, 2014. → pages 2, 17, 18, 21, 44[64] S. 
Li, W. Zhang, and A. B. Chan. Maximum-margin structured learningwith deep networks for 3d human pose estimation. In IEEE InternationalConference of Computer Vision (ICCV), 2015. → pages 47, 48, 49, 76[65] M. Lin, L. Lin, X. Liang, K. Wang, and H. Chen. Recurrent 3d posesequence machines. In The IEEE Conference on Computer Vision andPattern Recognition (CVPR), 2017. → pages 2, 18, 23, 44, 49, 76[66] T. Lindeberg and J. Garding. Shape from texture from a multi-scaleperspective. In IEEE International Conference on Computer Vision (ICCV),1993. URL http://dx.doi.org/10.1109/ICCV.1993.378146. → pages 15[67] F. Liu, C. Shen, and G. Lin. Deep convolutional neural fields for depthestimation from a single image. In The IEEE Conference on ComputerVision and Pattern Recognition (CVPR), pages 5162–5170. IEEEComputer Society, 2015. ISBN 978-1-4673-6964-0. URLhttp://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=7293313.→ pages 1599[68] D. G. Lowe. Distinctive image features from scale-invariant keypoints.International journal of computer vision, 60(2):91–110, 2004. → pages 2,19[69] E. Marinoiu, D. Papava, and C. Sminchisescu. Pictorial human spaces:How well do humans perceive a 3d articulated pose? In The IEEEConference on Computer Vision and Pattern Recognition (CVPR), pages1289–1296, 2013. → pages 16[70] E. Marinoiu, D. Papava, and C. Sminchisescu. Pictorial human spaces: Acomputational study on the human perception of 3d articulated poses.International Journal of Computer Vision (IJCV), 119(2):194–215, 2016.→ pages 7[71] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effectivebaseline for 3d human pose estimation. In IEEE International Conferenceon Computer Vision (ICCV), 2017. → pages iv[72] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent innervous activity. The bulletin of mathematical biophysics, 5(4):115–133,1943. → pages 31[73] D. Mehta, H. Rhodin, D. Casas, O. Sotnychenko, W. Xu, and C. Theobalt.Monocular 3d human pose estimation using transfer learning and improvedcnn supervision. arXiv preprint arXiv:1611.09813, 2016. → pages 2, 17,18, 22, 44, 49, 50, 64, 76, 77[74] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel,W. Xu, D. Casas, and C. Theobalt. Vnect: Real-time 3d human poseestimation with a single rgb camera. arXiv preprint arXiv:1705.01583,2017. → pages 2, 17, 18, 26, 49, 76[75] M. Minsky and S. A. Papert. Perceptrons: An introduction tocomputational geometry. MIT press, 1969. → pages 32[76] F. Moreno-Noguer. 3d human pose estimation from a single image viadistance matrix regression. In The IEEE Conference on Computer Visionand Pattern Recognition (CVPR), 2017. → pages 18, 25, 48, 50, 52, 53, 54,57, 65, 77, 79, 80, 91[77] G. Mori and J. Malik. Recovering 3D human body configurations usingshape contexts. IEEE Transactions on Pattern Analysis and MachineIntelligence, 28(7):1052–1062, July 2006. URLhttp://dx.doi.org/10.1109/TPAMI.2006.149. → pages 2, 18, 19100[78] G. Mori and J. Malik. Recovering 3d human body configurations usingshape contexts. IEEE Transactions on Pattern Analysis and MachineIntelligence, 28(7):1052–1062, 2006. → pages 2, 18, 20[79] V. Nair and G. E. Hinton. Rectified linear units improve restrictedBoltzmann machines. In International Conference on Machine Learning(ICML), pages 807–814, 2010. → pages 11, 38, 42, 45[80] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for humanpose estimation. In European Conference on Computer Vision (ECCV),2016. 
→ pages viii, ix, x, xii, xiii, 2, 3, 4, 11, 12, 22, 30, 43, 47, 48, 49, 51,53, 54, 55, 57, 59, 60, 61, 65, 73, 76, 80, 91[81] B. X. Nie, P. Wei, and S.-C. Zhu. Monocular 3d human pose estimation bypredicting depth on joints. 2017. → pages 2, 18, 23, 44, 49, 50, 76, 77[82] H. Ning, W. Xu, Y. Gong, and T. Huang. Discriminative learning of visualwords for 3d human pose estimation. In The IEEE Conference onComputer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008.→ pages 18, 19[83] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactionson knowledge and data engineering, 22(10):1345–1359, 2010. → pages12, 62[84] V. Parameswaran and R. Chellappa. View independent human body poseestimation from a single perspective image. In The IEEE Conference onComputer Vision and Pattern Recognition (CVPR), 2004. URLhttp://doi.ieeecomputersociety.org/10.1109/CVPR.2004.264. → pages 16[85] S. Park, J. Hwang, and N. Kwak. 3d human pose estimation usingconvolutional neural networks with 2d pose information. In ComputerVision–ECCV 2016 Workshops, pages 156–169. Springer, 2016. → pages2, 18, 21, 44, 48, 49, 65, 76[86] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Harvestingmultiple views for marker-less 3d human pose annotations. In The IEEEConference on Computer Vision and Pattern Recognition (CVPR), 2017.→ pages 18, 28[87] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-finevolumetric prediction for single-image 3D human pose. In The IEEEConference on Computer Vision and Pattern Recognition (CVPR), 2017.→ pages 2, 17, 18, 22, 44, 47, 49, 50, 51, 53, 56, 59, 64, 65, 76, 77, 90101[88] A. Popa, M. Zanfir, and C. Sminchisescu. Deep Multitask Architecture forIntegrated 2D and 3D Human Sensing. In CVPR, 2017. → pages 15[89] L. R. Rabiner and B.-H. Juang. Fundamentals of speech recognition. 1993.→ pages 20[90] I. Radwan, A. Dhall, and R. Goecke. Monocular image 3d human poseestimation under self-occlusion. In IEEE International Conference onComputer Vision (ICCV), 2013. → pages 18, 25, 53[91] R. Raina, A. Madhavan, and A. Y. Ng. Large-scale deep unsupervisedlearning using graphics processors. In International Conference onMachine Learning (ICML), pages 873–880. ACM, 2009. → pages 33[92] V. Ramakrishna, T. Kanade, and Y. Sheikh. Reconstructing 3d human posefrom 2d image landmarks. Computer Vision–ECCV 2012, pages 573–586,2012. → pages 2, 17, 18, 24, 44, 50, 64, 65, 77[93] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-timeobject detection with region proposal networks. In Advances in neuralinformation processing systems, pages 91–99, 2015. → pages 2, 30[94] L. G. Roberts. Machine perception of three-dimensional solids. TR 315,Lincoln Lab, MIT, Lexington, MA, May 1963. → pages 15[95] N. Rochester, J. Holland, L. Haibt, and W. Duda. Tests on a cell assemblytheory of the action of the brain, using a large digital computer. IRETransactions on information Theory, 2(3):80–93, 1956. → pages 32[96] G. Rogez and C. Schmid. Mocap-guided data augmentation for 3D poseestimation in the wild. In NIPS, 2016. URL http://papers.nips.cc/book/advances-in-neural-information-processing-systems-29-2016. → pages 2,18, 22, 44[97] F. Rosenblatt. The perceptron: A probabilistic model for informationstorage and organization in the brain. Psychological review, 65(6):386,1958. → pages 32[98] D. Rumelhart, J. McClelland, and S. D. P. R. G. University of California.Parallel Distributed Processing: Foundations. A Bradford book. MITPress, 1986. ISBN 9780262680530. 
URLhttps://books.google.ca/books?id=eFPqqMBK-p8C. → pages 32102[99] A. Saxena, M. Sun, and A. Y. Ng. Learning 3-D scene structure from asingle still image. In IEEE International Conference on Computer Vision(ICCV), 2007. → pages 15[100] J. Schmidhuber. Learning complex, extended sequences using the principleof history compression. Neural Computation, 4(2):234–242, 1992. →pages 33[101] S. Semeniuta, A. Severyn, and E. Barth. Recurrent dropout withoutmemory loss. arXiv preprint arXiv:1603.05118, 2016. → pages 13, 68, 70[102] A. Shafaei and J. J. Little. Real-time human motion capture with multipledepth cameras. In Computer and Robot Vision (CRV), 2016 13thConference on, pages 24–31. IEEE, 2016. → pages 18, 29[103] G. Shakhnarovich, P. A. Viola, and T. J. Darrell. Fast pose estimation withparameter-sensitive hashing. In IEEE International Conference onComputer Vision (ICCV), 2003. URLhttp://dx.doi.org/10.1109/ICCV.2003.1238424. → pages 2, 18, 20[104] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake,M. Cook, and R. Moore. Real-time human pose recognition in parts fromsingle depth images. Communications of the ACM, 56(1):116–124, 2013.→ pages 18, 28[105] L. Sigal, A. O. Balan, and M. J. Black. Humaneva: Synchronized videoand motion capture dataset and baseline algorithm for evaluation ofarticulated human motion. International journal of computer vision (IJCV),87(1):4–27, 2010. → pages viii, 9, 49, 53[106] L. Sigal, M. Isard, H. Haussecker, and M. J. Black. Loose-limbed people:Estimating 3d human pose and motion using non-parametric beliefpropagation. International journal of computer vision (IJCV), 98(1):15–48,2012. → pages 18, 27[107] E. Simo-Serra, A. Quattoni, C. Torras, and F. Moreno-Noguer. A jointmodel for 2d and 3d pose estimation from a single image. In The IEEEConference on Computer Vision and Pattern Recognition (CVPR), 2013.→ pages 18, 19, 53[108] K. Simonyan and A. Zisserman. Very deep convolutional networks forlarge-scale image recognition. CoRR, abs/1409.1556, 2014. → pages 37103[109] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Discriminativedensity propagation for 3d human motion estimation. In The IEEEConference on Computer Vision and Pattern Recognition (CVPR),volume 1, pages 390–397. IEEE, 2005. → pages 19[110] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, andR. Salakhutdinov. Dropout: a simple way to prevent neural networks fromoverfitting. Journal of Machine Learning Research (JMLR), 15(1), 2014.→ pages 11, 42, 43, 46[111] D. Steinkraus, I. Buck, and P. Simard. Using gpus for machine learningalgorithms. In International Conference on Document Analysis andRecognition, pages 1115–1120. IEEE. → pages 33[112] X. Sun, J. Shang, S. Liang, and Y. Wei. Compositional human poseregression. arXiv preprint arXiv:1704.00159, 2017. → pages 2, 18, 22, 44[113] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning withneural networks. In Advances in neural information processing systems(NIPS), pages 3104–3112, 2014. → pages xii, 4, 12, 14, 41, 68, 69[114] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In TheIEEE Conference on Computer Vision and Pattern Recognition (CVPR),pages 1–9, 2015. → pages 2, 37[115] C. J. Taylor. Reconstruction of articulated objects from pointcorrespondences in a single uncalibrated image. In The IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR), volume 1, pages677–684. IEEE, 2000. → pages 19, 20[116] B. 
Tekin, I. Katircioglu, M. Salzmann, V. Lepetit, and P. Fua. Structuredprediction of 3d human pose with deep neural networks. In British MachineVision Conference (BMVC), 2016. → pages 2, 17, 18, 22, 44, 49, 76[117] B. Tekin, A. Rozantsev, V. Lepetit, and P. Fua. Direct prediction of 3d bodyposes from motion compensated sequences. In The IEEE Conference onComputer Vision and Pattern Recognition (CVPR), pages 991–1000, 2016.→ pages 2, 17, 18, 26, 47, 48, 49, 65, 76[118] B. Tekin, P. Marquez Neila, M. Salzmann, and P. Fua. Learning to fuse 2dand 3d image cues for monocular body pose estimation. In IEEEInternational Conference on Computer Vision (ICCV), numberEPFL-CONF-230311, 2017. → pages 2, 18, 22, 49, 50, 51, 76, 77, 89104[119] D. Tome, C. Russell, and L. Agapito. Lifting from the deep: Convolutional3d pose estimation from a single image. arXiv preprint arXiv:1701.00295,2017. → pages 22, 23, 44, 49, 64, 76[120] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, andC. Schmid. Learning from synthetic humans. In The IEEE Conference onComputer Vision and Pattern Recognition (CVPR), 2017. → pages 2, 18,22, 44[121] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol.Stacked denoising autoencoders: Learning useful representations in a deepnetwork with a local denoising criterion. Journal of Machine LearningResearch, 11(Dec):3371–3408, 2010. → pages 22[122] C. Wang, Y. Wang, Z. Lin, A. L. Yuille, and W. Gao. Robust estimation of3d human poses from a single image. In The IEEE Conference onComputer Vision and Pattern Recognition (CVPR), 2014. → pages 18, 24,53[123] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional posemachines. In The IEEE Conference on Computer Vision and PatternRecognition (CVPR), 2016. → pages 2, 3, 23, 29, 48, 53, 54[124] X. Wei, P. Zhang, and J. Chai. Accurate realtime full-body motion captureusing a single depth camera. ACM Transactions on Graphics (TOG), 31(6):188, 2012. → pages 18, 28[125] X. K. Wei and J. Chai. Modeling 3D human poses from uncalibratedmonocular images. In IEEE International Conference on Computer Vision(ICCV), 2009. URLhttp://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=5453389.→ pages 16[126] P. Werbos. Beyond Regression: New Tools for Prediction and Analysis inthe Behavioral Sciences. Harvard University, 1975. URLhttps://books.google.ca/books?id=z81XmgEACAAJ. → pages 32[127] Y. Yang and D. Ramanan. Articulated pose estimation with flexiblemixtures-of-parts. In The IEEE Conference on Computer Vision andPattern Recognition (CVPR), pages 1385–1392. IEEE, 2011. → pages 29[128] H. Yasin, U. Iqbal, B. Kruger, A. Weber, and J. Gall. A dual-sourceapproach for 3d pose estimation from a single image. In The IEEE105Conference on Computer Vision and Pattern Recognition (CVPR), pages4948–4956, 2016. → pages 2, 18, 21, 53[129] M. Ye and R. Yang. Real-time simultaneous pose and shape estimation forarticulated objects using a single depth camera. In The IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR), pages 2345–2352,2014. → pages 18, 28[130] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural networkregularization. arXiv preprint arXiv:1409.2329, 2014. → pages 41, 68, 70,71[131] R. Zhang, P.-S. Tsai, J. E. Cryer, and M. Shah. Shape from shading: Asurvey. IEEE Transactions on Pattern and Machine Intelligence (TPAMI),21(8):690–706, 1999. URL http://dx.doi.org/10.1109/34.784284;http://doi.ieeecomputersociety.org/10.1109/34.784284. → pages 15[132] X. Zhou, S. Leonardos, X. Hu, and K. 
Daniilidis. 3d shape estimation from2d landmarks: A convex relaxation approach. In The IEEE Conference onComputer Vision and Pattern Recognition (CVPR), pages 4447–4455,2015. → pages 17, 18, 24, 25, 44[133] X. Zhou, X. Sun, W. Zhang, S. Liang, and Y. Wei. Deep kinematic poseregression. In Computer Vision–ECCV 2016 Workshops, pages 186–201.Springer, 2016. → pages 2, 16, 18, 22, 44, 47, 49, 65, 76[134] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis.Sparseness meets deepness: 3d human pose estimation from monocularvideo. In The IEEE Conference on Computer Vision and PatternRecognition (CVPR), pages 4966–4975, 2016. → pages 2, 17, 18, 24, 25,44, 47, 49, 50, 65, 76, 77[135] A. Zisserman, I. D. Reid, and A. Criminisi. Single view metrology. InIEEE International Conference on Computer Vision (ICCV), 1999. URLhttp://dx.doi.org/10.1109/ICCV.1999.791253. → pages 15106
