UBC Theses and Dissertations

Using unlabeled 3D motion examples for human activity understanding Gupta, Ankur 2016

Using Unlabeled 3D Motion Examples for Human Activity Understanding

by

Ankur Gupta

B. Tech., The Indian Institute of Technology Kanpur, 2005
M. Sc., The University of British Columbia, 2010

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University Of British Columbia
(Vancouver)

July 2016

© Ankur Gupta, 2016

Abstract

We demonstrate how a large collection of unlabeled motion examples can help us in understanding human activities in a video. Recognizing human activity in monocular videos is a central problem in computer vision with wide-ranging applications in robotics, sports analysis, and healthcare. Obtaining annotated data to learn from videos in a supervised manner is tedious, time-consuming, and not scalable to a large number of human actions. To address these issues, we propose an unsupervised, data-driven approach that only relies on 3d motion examples in the form of human motion capture sequences.

The first part of the thesis deals with adding view-invariance to the standard action recognition task, i.e., identifying the class of activity given a short video sequence. We learn a view-invariant representation of human motion from 3d examples by generating synthetic features. We demonstrate the effectiveness of our method on a standard dataset with results competitive to the state of the art. Next, we focus on the problem of 3d pose estimation in realistic videos. We present a non-parametric approach that does not rely on a motion model built for a specific action. Thus, our method can deal with video sequences featuring multiple actions. We test our 3d pose estimation pipeline on a challenging professional basketball sequence.

Preface

This dissertation is based on the research work conducted in collaboration with multiple researchers at the Laboratory for Computational Intelligence at UBC.

A version of Chapter 3 has appeared in these two publications:

• A.
Gupta*, J. Martinez*, J. Little and R. Woodham. 3D Pose from Motion for Cross-view Action Recognition via Non-linear Circulant Temporal Encoding. In CVPR, 2014. (*Indicates equal contribution)
• A. Gupta, A. Shafaei, J. Little and R. Woodham. Unlabelled 3D Motion Examples Improve Cross-View Action Recognition. In BMVC, 2014.

The author identified the problem, formulated the solution, and designed the experiments for both of these publications. For the CVPR paper, implementation was done by J. Martinez and the author. J. Martinez also contributed to problem formulation and drafting the manuscript. For the BMVC paper, A. Shafaei helped with the implementation, ran the experiments, and provided feedback on the mathematical formulation.

A part of Chapter 4 has appeared in the following publication:

• A. Gupta, J. He, J. Martinez, J. Little and R. Woodham. Efficient video-based retrieval of human motion with flexible alignment. In WACV, 2016.

The author contributed by identifying the challenge, formulating the solution, and implementing feature generation. J. He implemented the flexible matching (DTW-based) methods. The design and the data collection for the Video-based 3d motion Retrieval benchmark (V3DR) used in this paper were primarily done by J. Martinez and the author.

J. Little and R. Woodham contributed ideas and provided feedback at all the stages of the project. They also edited all the above manuscripts.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
  1.1 Motivation
    1.1.1 Tackling dataset bias in action recognition
    1.1.2 Identifying and localizing human activities
    1.1.3 Utilizing unstructured data
  1.2 Contributions
    1.2.1 Cross-view action recognition via feature synthesis
    1.2.2 Video-based mocap retrieval
    1.2.3 3d pose estimation in sports videos
  1.3 Outline
2 Background and Related Work
  2.1 Cross-view action recognition
    2.1.1 Reasoning in 3d space
    2.1.2 Statistical approaches
  2.2 Retrieval of human motion
    2.2.1 Descriptors for pose and motion retrieval
    2.2.2 Exemplar-based 3d pose estimation
    2.2.3 Comparing and aligning temporal sequences
  2.3 3d human pose estimation in videos
    2.3.1 Statistical priors to human motion
    2.3.2 Motion synthesis
    2.3.3 Direct regression to 3d pose
  2.4 Summary
3 Unsupervised Cross-View Action Recognition
  3.1 Overview
  3.2 Dense Trajectories from Mocap (MDT)
    3.2.1 Generating multiple projections
    3.2.2 Hidden point removal
    3.2.3 Trajectory generation and postprocessing
  3.3 Learning from synthetic data
    3.3.1 Generating correspondences
    3.3.2 Learning codeword transformations
  3.4 Synthesizing cross-view action descriptors
  3.5 Experiments
    3.5.1 Dataset and evaluation
    3.5.2 Dense trajectories for action classification
    3.5.3 Mocap feature synthesis
    3.5.4 Mocap retrieval-based augmentation
    3.5.5 Results
  3.6 Discussion
    3.6.1 Limitations and the future work
4 Video-Based Mocap Retrieval and Alignment
  4.1 Overview
  4.2 Video and mocap representation
    4.2.1 Trajectory-based motion feature
    4.2.2 Relational pose feature
  4.3 Retrieval and alignment
    4.3.1 Retrieval with inflexible alignment
    4.3.2 Flexible alignment
  4.4 Experiments
    4.4.1 V3DR: Video-Based 3D Motion Retrieval Benchmark
    4.4.2 Results
  4.5 Discussion
    4.5.1 Limitations and future directions
5 Localized Motion Trellis
  5.1 Overview
  5.2 Localized Motion Trellis
    5.2.1 Unary terms: accounting for image evidence
    5.2.2 Binary term: ensuring contextual output
    5.2.3 Search and 3d pose estimation
  5.3 Experiments
    5.3.1 Data and evaluation
    5.3.2 Implementation details
    5.3.3 Baselines
    5.3.4 Results
  5.4 Discussion
    5.4.1 Limitations and future work
6 Discussion and Future Work
  6.1 Contributions and Impact
  6.2 Future Directions
    6.2.1 Robust evaluation of 3d pose estimation
    6.2.2 Features for video-based mocap retrieval
    6.2.3 Fine tuning pose estimation in videos
Bibliography

List of Tables

Table 3.1 Comparison of the overall performance of our approach and the state of the art on the IXMAS dataset. We show the accuracy averaged over all the camera pairs. We highlight the best value with boldface and underline the second best value. The results for the leave-one-subject-out evaluation mode, as well as the usual mode (common in the literature) are shown. Rahmani and Mian [RM15] build on the initial version of our method [GSLW14], but learn a non-linear transformation between mocap trajectory features from different viewpoints to a canonical view.

Table 4.1 Per-class and overall mean Average Precision (mAP) on the detection modality of the video-based mocap retrieval benchmark. We show an average performance using both IXMAS and YouTube queries. # ex. is the number of files in the database containing the given action. Chance corresponds to the expected performance of uniformly random retrieval. We highlight the best value in each category with boldface and underline the second best value. Again, we observe that using pose + motion (P+T) significantly improves the retrieval performance over motion-based features (T). Also, the flexible alignment techniques (SS-DTW and SLN-DTW) perform the best on most action classes in comparison to cross-correlation (CC).

List of Figures

Figure 1.1 A comparison of different tasks for localizing human activity in videos vis-à-vis the requirement of training data. We roughly measure the annotation cost as the number of clicks required per frame. Detection and tracking enable locating a person in the 2d space and have the least cost. A more complicated task of 2d pose estimation can be used for action recognition [JGZ+13] and video retrieval [EMJZF12], but the annotations are expensive. The 3d pose estimation task provides an even more detailed description of the activity, but we cannot obtain the annotations using a single view.

Figure 1.2 The cross-view action recognition problem. We show the kicking action as seen from two widely different viewpoints. Note that the projection of appearance as well as motion varies across cameras. If we train our model using the examples from Camera 1, it is not going to generalize well to Camera 2 videos. Therefore, we need an effective way to transfer knowledge across views.

Figure 1.3 Video-based mocap retrieval and alignment. We illustrate a flexible alignment between a short query video and a longer motion capture (mocap) sequence.
The video matches only a part of the mocap, so the end-points of the two sequences do not align. The connections show the frame level correspondences. Note that the flexible alignment (many-to-one or one-to-many matches) ensures that motion at different speeds can be matched correctly.

Figure 1.4 3d pose estimation from a professional basketball sequence taken from a broadcast video. This is a challenging problem due to depth ambiguity, artifacts such as motion blur, and the complexity of player movements. We show two frames of a sequence with the corresponding 3d pose obtained by our system.

Figure 2.1 Commonly used evaluation strategies for cross-view action recognition. Each circle denotes a video, and solid circles represent labeled examples. The data from two different views is arranged in Source and Target columns. The boxes represent different action types. a) In correspondence mode, a fixed fraction of examples is known to be the same action seen from two views. The lines connecting the circles denote these correspondences. b) In semi-supervised mode, a small percentage of test (or target view) examples is annotated with a class label. c) The unsupervised mode is the most challenging case, where no annotations connect the source and the target examples. In this work, we demonstrate our results on the unsupervised modality only.

Figure 3.1 Mocap Dense Trajectories (MDT) generation pipeline. (a) Mocap sequences have 3d body joint locations over time. (b) We approximate human body shape using tapered cylinders to obtain a “tin-man” model. (c) Sampled points on the surface of cylinders are projected under orthography for a fixed number of views and (d) cleaned up using hidden point removal.
(e) Connecting these points over a fixed time horizon gives us the synthetic version of Dense Trajectories (DT) features [WKSL11].

Figure 3.2 (a) A visual comparison between synthesized MDT (left) and optical flow based DTs [WKSL11] (right). We exploit the similarity in their shape to learn a feature mapping between different views using MDT, and use this mapping to transform the DT based features commonly used for action recognition. (b) To generate view correspondence at the local feature level, the human body is represented using cylindrical primitives driven by mocap data. We project the 3D path of the points on the surface to multiple views and learn how the features based on these idealized trajectories transform under viewpoint changes.

Figure 3.3 All camera views from the IXMAS dataset. We show the kicking action synchronously captured from the 5 camera angles. Note that the appearance changes significantly across views. Camera 4 is especially challenging because of its extreme elevation angle.

Figure 3.4 Accuracy for each camera pair. We highlight the best results in boldface and underline the second best value. Our mocap feature synthesis approach performs the best on most train-test view pairs.

Figure 3.5 Per-class classification accuracy of our method on the IXMAS benchmark. These are the same results as Figure 3.4, rearranged to show the per-class accuracy averaged over all train-test camera pairs. The mocap feature synthesis improves accuracy over the no augmentation baseline on every category except kick.

Figure 3.6 Confusion matrices before and after applying our mocap feature synthesis and retrieval-based augmentation approaches. We show average accuracy over all cameras as train-test pairs. We note that our data augmentation helps resolve confusion for many action categories.

Figure 3.7 Variation of the average classification accuracy (on IXMAS) with the number of divisions of azimuthal and elevation angles for our mocap feature synthesis approach. We note that a small number of divisions in angles does not provide the coverage needed for all the test viewpoints. However, the gain in accuracy saturates as we keep increasing the number of synthesized views.

Figure 4.1 Representing image and mocap sequences for retrieval. The mocap data and videos featuring human motion have related but complementary information. We bridge the gap between the two modalities by estimating the 2d joint locations from each video frame, and projecting 3d mocap joints for a viewpoint (right). We also generate optical flow based DT [WKSL11] from video and synthetic trajectories from mocap (see Section 3.2) to describe motion patterns in both the sequences (left). Hence, we have a pose as well as a motion-based representation for both the modalities that are comparable to each other.

Figure 4.2 Corresponding joints between the 2d pose estimate in an image and the CMU-mocap skeleton. We only use a subset of mocap joints. This mapping is similar to the one used by Zhou and De la Torre [ZDlT14] for matching DT to a motion model learned from mocap.

Figure 4.3 A toy example to illustrate various Dynamic Time Warping (DTW)-based flexible alignment algorithms. The grey boxes in the cumulative cost matrices C represent the chosen warp path for the same distance matrix D shown on the left.
Each algorithm initializes some cells of C, fills the rest according to a recursion formula, and then chooses a final score dists(v,z) for the alignment. The set of indices {(i−1, j), (i−1, j−1), (i, j−1)} is abbreviated as (i, j)∗. In the case of SLN-DTW, P(i, j) is the length of the chosen normalized warp path for the subproblem up to D(i, j).

Figure 4.4 A comparison between exact normalization (as defined in Equation 4.9), using dynamic programming over a 3d cumulative cost array, and approximate normalization used in SLN-DTW [MGB09]. The corresponding distance matrix (D) is shown behind each warp path, with darker shades representing smaller distances. We show results using (a-b) two different video queries matched against a mocap sequence and (c) two sequences of length 100 containing random, L2-normalized 256-dimensional vectors. We notice that the approximate local normalization works fairly well in all three cases, while being several times faster. Note that we have kept the end-point constraint here for simplicity.

Figure 4.5 Example queries from YouTube videos provided in the V3DR dataset. Notice the realistic clothing and backgrounds that are not typical in videos collected in a laboratory setting.

Figure 4.6 An example of mocap annotations provided in the V3DR benchmark. At the top, we show a few frames of a mocap sequence. The corresponding action labels are shown below. Note that the annotations are not temporally exclusive. As shown above, one frame can have multiple labels. We use these annotations to evaluate video to mocap alignment.

Figure 4.7 Recall for different feature types. For the same descriptor length, relational pose features significantly improve recall over trajectory-based motion features. And, a concatenation of pose and motion features performs comparably or better than the individual features alone.
The improvements in recall using Motion + Pose in the case of realistic data (YouTube) indicate that when the 2d pose estimation is not reliable the motion information can be more useful.

Figure 4.8 We compare the performance of the cross-correlation (CC)-based retrieval with Circulant Temporal Encoding (CTE) for different values of regularization parameter (λ). We plot recall, averaged over all queries, for different values of N (number of retrieved examples). Note that the recall improves for increasing λ, but CC performs better on most N for both the query sets. This observation indicates that the regularization provided by CTE does not help the retrieval performance in our case.

Figure 4.9 Recall on the localization modality of the video-based mocap retrieval benchmark averaged over all YouTube and IXMAS queries. The black dotted line depicts the ideal recall curve, and the magenta dotted line shows recall for randomly retrieved examples. These two curves act as the upper and the lower bound on the performance for each class. All matching techniques use motion + pose features except Gupta et al. [GMLW14]. Note that flexible matching techniques (SS-DTW and SLN-DTW) perform better on most classes.

Figure 4.10 The confusion matrix for average recall (at N = 100) over all queries using SLN-DTW + smooth for retrieval with pose + motion features. Each row shows the recall@N for the queries from the action category, and columns depict the target category used to calculate recall. Therefore, the diagonal corresponds to the curves shown in Figure 4.9. Note the confusion of sit down and get up with pick up. This is due to the visual similarity of these actions. We also note that the categories with isolated body movement, e.g., kick, and throw overhead, are much harder to retrieve reliably.
Also, the category turn is challenging to distinguish, possibly because of the subtle change in the pose during the action compared to a significant change in case of sit down, get up, and pick up.

Figure 4.11 The confusion matrix for average recall (at N = 100) over all queries using SS-DTW for retrieval with pose + motion features. We observe a trend very similar to that shown in Figure 4.10.

Figure 4.12 A few representative alignments for YouTube videos (best viewed in color). The query frames and corresponding frames of the retrieved mocap sequences are shown (right limbs are marked in red). For each retrieval algorithm, we display the top ranked true-positive. Top: Walking sequences are relatively easy to match. All the algorithms perform well on this example. Middle: In this pick up sequence, the flexible matching algorithms can capture the bend down and get up movements. However, CC only aligns with the final get up movement. Bottom: Again, in this kick sequence, we get a better alignment using the flexible matching techniques.

Figure 4.13 Some of the typical error cases for video-based mocap retrieval. We use SLN-DTW with Pose + Motion features for both the examples. We show the top three aligned matches along with the top-ranked true positive inside the green box. Top: In this case the throw overhead action is best matched to a dance move where the person has their arm lifted, similar to the query video. Bottom: The query comes from a turn sequence. Here the top ranked sequences are again dancing and walking. In both cases, we do find the appropriate matches, but they are poorly ranked.

Figure 5.1 Localized Motion Trellis (LMT). (a) The input is a monocular video sequence.
(b) We use each overlapping subsequence to (c) search a large collection of mocap files for similar motion sequences. (d) The retrieved 3d snippets are connected in time to form a trellis graph. (e) The minimization of energy over this graph produces a smooth 3d output that best explains the image evidence. Being a model-free method, the LMT can estimate 3d motion in sequences with multiple activities, which overcomes one of the major limitations of current approaches.

Figure 5.2 Some of the common challenges with broadcast team sports videos. a) Even in case of a high-definition (HD) video, the player height in pixels is often less than 150 pixels in a wide-angle shot. b) There are motion blur artifacts due to the camera motion required to follow the game. c) Also, severe occlusions are common in team sports.

Figure 5.3 The LMT implementation details. (a) We assume that player bounding box is given at each frame. The red dot is our estimate of the player location based on the current bounding box. The blue line shows the estimated path connecting these locations over time. We transform this path to world coordinates using the homography. (b) We also use the homography at each frame to estimate the camera viewpoint. The left figure shows a square around a player projected to the image using the given homography (solid-magenta) and a local affine approximation to homography (cyan-dotted). Since the camera is located far from the court, the approximation is reasonably accurate in this case. We use the approximate affine transformation to obtain the elevation and the azimuthal angle of the camera under the orthographic projection (right).

Figure 5.4 Percentage of Correct Parts (left-right sensitive) for different variations of the LMT compared against Oracle-pose. Top match only uses the matching error score.
Path minimizes only the path error to find the best 3d sequence for each video snippet. Unary adds 2d pose error along with path and matching error. Full model uses a weighted sum of both unary and binary terms to evaluate the final LMT path. Oracle-pose has access to the ground truth 2d pose in addition to using the same parameters as the full model. We note that adding binary terms (in the full model) leads to a consistent gain in Percentage of Correct Parts (PCP). It demonstrates the importance of pairwise relations in resolving ambiguities.

Figure 5.5 Percentage of Correct Parts (PCP) for the LMT output compared to 2d pose estimation methods Flexible Mixture-of-Parts (FMP) [YR13] and nFMP [PR11]. Since FMP and nFMP do not distinguish between left and the right body parts, we do the same for the LMT projections to make the comparison fair. Again, the oracle-pose uses the same parameters as Ours but has access to ground truth 2d pose.

Figure 5.6 A typical 3d pose sequence generated using Localized Motion Trellis. The top row shows a cropped image sequence from the NBA data. The subsequent rows display the output of different methods, including our full model (using both unary and binary terms). The 3d output corresponding to the full model is presented in the last row. Note that we have rotated the axis in the final row to emphasize the 3d nature of the output. The arrow on the middle frame shows the viewing direction of the camera. Also, we mark the right limbs in red. In this sequence, the player walks, then turns to his right and starts running. Our full model smoothly handles the transition from one activity to the next and accurately captures the walk cycles, i.e., left and right legs are correctly aligned even though this information is not available from FMP.
The unary output is affected by the errors in the nFMP pose estimate (see the fourth frame from the left and the last frame), while the full model can correct for these errors.

Figure 5.7 Another example of the 3d pose sequence generated using Localized Motion Trellis. In this sequence, the player runs towards the camera, then turns left. After waiting for a few seconds, he turns around and runs again. Note that the full model is correctly able to capture the walk cycle (see the last three frames), while the player’s direction is inconsistent in the unary output (see the sixth frame from the left, and third to the last frame).

Figure 5.8 An error case for our full model. In this example, the player runs, turns around, and keeps going in the same direction. He then stops for a few frames, and then turns left. Our 3d output from the full model faces in the wrong direction when the player turns and keeps facing in the wrong direction when the player is standing. This example is one of the cases when the full model may be failing due to over-smoothing. In contrast, the unary output is able to capture the change in direction.

Figure 5.9 Another error case emphasizing a common limitation of all the LMT variants. In this example, the player is performing fast and complicated movements. Even though the LMT can capture the overall jump motion (frames 2-6), the output does not capture any other movements. There can be multiple reasons for this failure such as a) the absence of an appropriate exemplar in the database that can reasonably approximate this motion; or, b) a failure in retrieval.

Figure 5.10 The effect of the LMT parameters on the average PCP (over the wrist, elbow, knee, and foot joints) for different thresholds. a) We plot accuracy as the function of query length.
We reconstruct the 3d pose using 50 matches for each video subsequence. Since we are interested in measuring the effect of query length on the quality of retrieved mocap sequences, we only use 2d pose error to find the best path in the LMT, and no interpolation is done to generate the final 3d output. Based on this result, we choose k = 35. b) Average PCP as a function of overlap between consecutive queries. In this case, we fix the query length to 35 frames and calculate PCP using the top 500 matches for each video subsequence. To avoid the effect of other terms, we use only transition error to find the best path.

Glossary

CTE   Circulant Temporal Encoding
DT    Dense Trajectories
DTW   Dynamic Time Warping
FMP   The Flexible Mixture-of-Parts model for 2d pose estimation [YR13]
LMT   The Localized Motion Trellis proposed in this thesis
MDT   Mocap Dense Trajectories
MG    Motion Graph
PCP   Percentage of Correct Parts, a measure to evaluate 2d pose estimation
V3DR  Video-based 3d motion Retrieval benchmark

Acknowledgments

A Ph.D. is often considered a lonely pursuit. However, my graduate school experience couldn’t be any further from this stereotype. This thesis would not have been possible without the help and active involvement of many individuals, to whom I am greatly indebted.

I am grateful to my supervisors Prof. Jim Little and Prof. Bob Woodham for their constant support and encouragement. Bob inspired us to ask ourselves challenging questions. Jim often provided new and insightful perspectives on the problems we discussed. Our meetings were intellectually stimulating and fun. My advisory committee member Prof. Michiel van de Panne provided invaluable guidance throughout the process. Many thanks to Prof. Rick Wildes, the external examiner of the thesis, for his excellent feedback during the exam and for writing a detailed report that helped me improve the final manuscript. I also thank Prof. Jane Wang and Prof.
Ian Mitchell for agreeing to be university examiners and carefully going through the manuscript.

Research can be rewarding but often it is a gruelling process. Thankfully, it gets easier working alongside talented individuals who are also passionate about the subject. The contributions of my collaborators are instrumental to the research presented in the following pages. Julieta Martinez and Alireza Shafaei worked hard on these problems with me and provided the much-needed feedback at various stages of the project. I am also grateful to summer research interns Umang Gupta and John He for their help. A special thanks to David Matheson who was the first to suggest the idea of using synthesized features to describe human motion.

I believe that a culture of curiosity and continuous learning is essential for research. I thank Jim for fostering this culture in the lab. Thanks to my labmates for creating a great learning environment and making the lab a fun place to work. My roommates during this period, Nishant Chandgotia and Krishna Teja, ensured a similar stimulating environment at home. They taught me a large chunk of whatever little I know about topics outside computer vision, from mathematics to classical music to philosophy and religion. The circle of friends that I had at UBC is as priceless as the education and the training.

My gratitude for my family is difficult to express in words. I thank my parents and my brother for their unconditional love, patience, and encouragement. My wife Rashmi has been a great support throughout my Ph.D. journey, which can often be very demanding for a spouse; I thank her for navigating these challenges with grace and humor.

Chapter 1

Introduction

The motion of the body accounts for a large part of human expression. Our movements help communication and reflect intentions. To function safely and efficiently alongside a person, robots (or AI systems) must have the ability to understand the user’s activity.
Therefore, building algorithms that can analyze human motion is crucial. Also, a notable percentage of video data “in the wild” (on YouTube, TV, and movies) features humans. Indexing and searching through this massive data would require paying particular attention to human movements. Due to its numerous applications, automated analysis of human motion is an important problem in robotics as well as computer vision.

Obtaining appropriate training examples is a key practical challenge in solving a majority of computer vision tasks. Using supervised machine learning on large human-annotated datasets has led to impressive performance on many fundamental vision problems such as object recognition and detection [KSH12, GDDM14]. However, adapting these methods to more complicated problems such as complex activity recognition in videos is challenging, partly due to lack of annotated data. It becomes more tedious and expensive to obtain these annotations as the task gets more complicated. Also, the labels can be subjective or ambiguous. In response, instead of relying on labeled examples, we propose to use a large set of unlabeled 3d human motion examples, in the form of motion capture (mocap) data. We focus on three related problems in human activity understanding: a) cross-view action recognition, b) video-based retrieval of human motion, and c) 3d pose estimation in video.

In the computer vision literature, action recognition refers to the task of classifying a short video clip into one of the predefined classes. These class labels denote specific activities in the video such as getting-out-of-the-car, playing-football, sitting-down, or getting-up. Cross-view action recognition adds viewpoint invariance to action recognition, i.e., it allows actions to be recognized under a diverse set of views. Human activities often have a hierarchical spatiotemporal structure. Hence, assigning a single label to the whole video is not very descriptive.
Many techniques decompose activities into smaller primitives [PR14, LZRZS15]. We refer to this simultaneous decomposition and recognition of action as complex activity recognition. Video-based mocap retrieval is a new task that we introduce in this thesis. We define it as retrieving mocap snippets given a short video as a query. Mocap retrieval can be a handy tool for animators to search through a large database of mocap files. We also show its applications to human activity understanding. 3d pose estimation is the problem of localizing the 3d joint locations of a person given a video. 3d pose is a detailed, well-localized description of human motion in space and time. Apart from being useful on its own (see Section 1.1.2), 3d pose can also serve as an intermediate feature for complex activity recognition [RF03]. Also note that we use activity understanding as an umbrella term that can include all the above tasks.

What do we set out to achieve?

We are interested in the detailed description of the human activity in monocular videos captured offline, such as YouTube clips or broadcast sports videos. The theme uniting all our contributions is that our approach only uses a large set of human motion examples as training data and does not require human labeling effort. Here we summarize the main goals of this thesis:

• We are interested in building a view-invariant model of human actions without using labeled or synchronized video examples. For quantitative evaluations, we use a standard benchmark for cross-view action recognition called the INRIA XMAS dataset [WRB06], where we compare our method with recently proposed unsupervised approaches such as [RM15].

• Our second goal is to search through a large number of mocap files with short video queries. We also wish to align the retrieved mocap snippets with the video to establish a one-to-one correspondence between 2d and 3d motion (see Figure 1.3).
Since there is no standardized way of measuring success on this task, we assess video-based mocap retrieval on our own benchmark, the Video-based 3d motion Retrieval benchmark (V3DR).

• Finally, we wish to obtain the 3d pose of basketball players using broadcast video as input (see Figure 1.4). We would like to solve this problem in a data-driven way, allowing us the flexibility of being activity independent (more details in Section 1.2.3). We evaluate our approach on a professional basketball sequence used previously to test player tracking and identity recognition [LTLM13].

1.1 Motivation

1.1.1 Tackling dataset bias in action recognition

Changes in imaging conditions lead to visual domain shifts in videos and images (i.e., shift in the distribution of features) [SKFD10]. One of the factors affecting action recognition performance is the viewpoint of the camera used to capture the video. Recent methods using unconstrained videos for action recognition rely on the internet or movies to obtain training data [KJG+11], but these sources may suffer from a bias in viewpoint. For instance, most YouTube videos are shot with people holding the camera at a similar angle and height (see Figure 4.5 for a few examples). However, many test scenarios in action recognition may involve cameras mounted on a wall or a robot, with different viewing angles than the typical internet video.

Therefore, we need techniques that can generalize to a wider range of viewing angles not featured in the training data. Cross-view action recognition is crucial to deal with dataset bias in action recognition.

1.1.2 Identifying and localizing human activities

Action recognition only provides a single label to describe a video. Also, it is often tested on video clips rather than unconstrained videos. Recent efforts deal with this limitation by using untrimmed videos [GIJ+15] or by predicting a bounding box around the activity (i.e., action spotting [DSCW10]).
In this thesis, we focus on 3d pose retrieval and estimation because it localizes human activity in space and time, which can potentially provide us the information needed to support other applications such as:

NUIs and healthcare. Natural User Interfaces or NUIs (systems using human motion and gestures to control devices) are already making headway in gaming and entertainment [SGF+13]. Effective human activity understanding can also have a positive impact on other aspects of life, including the general safety and well-being of people. For instance, a fall-detection system [NTTS06] for the elderly or a patient monitoring system can assist human caregivers, as well as help reduce health care costs.

Sports and fitness. Similarly, personal training systems with an ability to provide feedback on posture can be very beneficial. Again, such a system would need to discriminate between subtle differences in human pose and motion. Analyzing sports videos for assessing player performance, refereeing, or commentating is another interesting application [SZ14b].

Robotics. Self-driving cars serve as an excellent use case for human activity understanding. A fully autonomous vehicle not only needs to observe the motion of pedestrians but also to anticipate their future actions, e.g., a child playing near the road may run after the ball onto the road. Similarly, the use of robots in a human environment, such as homes and offices, will not be practical unless the machines can take human motivations and safety into account. Another interesting application is training robots via learning by demonstration [AN04].
We would like robots to acquire skills by watching a person demonstrating the task live or in a video.

[Figure 1.1 diagram: three tasks (detection and tracking, 2d pose estimation, 3d pose estimation) compared by clicks per frame and by applications, ordered by increasing cost of training data.]

Figure 1.1: A comparison of different tasks for localizing human activity in videos vis-à-vis the requirement of training data. We roughly measure the annotation cost as the number of clicks required per frame. Detection and tracking enable locating a person in the 2d space and have the least cost. A more complicated task of 2d pose estimation can be used for action recognition [JGZ+13] and video retrieval [EMJZF12], but the annotations are expensive. The 3d pose estimation task provides an even more detailed description of the activity, but we cannot obtain the annotations using a single view.

1.1.3 Utilizing unstructured data

State-of-the-art activity understanding methods rely on human-annotated data and often use one label per sequence (e.g., [KTS+14]). If we wish to localize actions in time, we need per-frame annotations of actions, which are much harder to acquire. Similarly, localizing activities in space is challenging because it requires detailed labels in each frame of the video [ZZD13]. Human time and effort required for obtaining labeled training data become heavier as we learn more details (see Figure 1.1).
Also, annotating training data with suitable labels is often ambiguous and hard to scale to new labels.

Motivated by these challenges, we explore the use of unstructured motion examples (in the form of mocap files with no labels) to address some of the challenging problems in human activity understanding.

Figure 1.2: The cross-view action recognition problem. We show the kicking action as seen from two widely different viewpoints. Note that the projection of appearance as well as motion varies across cameras. If we train our model using the examples from Camera 1, it is not going to generalize well to Camera 2 videos. Therefore, we need an effective way to transfer knowledge across views.

1.2 Contributions

1.2.1 Cross-view action recognition via feature synthesis

Our first contribution is a method for adding viewpoint invariance to action recognition without using any additional labeled examples or multi-view videos. Shape [DT05, GBS+07, YS05] and optical flow-based [DTS06, WKSL11] features that are commonly used to describe actions in videos are not viewpoint invariant by design. Therefore, the effectiveness of a method depends heavily on the availability of training data from diverse views. When the training view (or source view) is different from the test view (or target view), we need a strategy to either transfer knowledge across views or devise a view-invariant description. These methods to achieve view-invariance are often referred to as cross-view action recognition techniques (see Figure 1.2).

Many cross-view action recognition approaches transform action descriptors to a view-invariant space where observations from the source and the target view are comparable [LZ12, LCS12, MT13, ZJ13, HW13]. However, learning an invariant space requires supervision in the form of correspondence or partially labeled examples in another view (see Figure 2.1 (a-b)).
A few methods deal with the unsupervised case [LZ12, ZWX+13, LCS12], but all of them assume the availability of multi-view videos that can be hard to acquire in a general case.

We present a scheme to deal with the completely unsupervised case, i.e., we have no access to the target view examples at the training time. This is a likely scenario when it is not possible to predict the test view in advance. Instead of looking for a view-invariant space, we learn a function to transform descriptors from one view to another using unlabeled mocap examples. For action recognition, we describe each video with a bag of words (BoW) of Dense Trajectories (DT) features [WKSL11]. DT are optical flow-based features commonly used for action recognition in unconstrained videos. Given multiple feature mappings (learned transformations) and training videos from a single view, we can “hallucinate” action descriptors from different viewpoints and use them as additional training examples for classification. We augment the training data with synthesized descriptors to make our model resilient to viewpoint changes.

In summary:

• To learn correspondence between features across different views, we synthetically generate motion features using mocap examples as seen from multiple viewpoints.

• We show that, due to the similarity between synthesized features and real trajectories [WKSL11], we can learn the view transformations on mocap and apply them to videos (Figure 3.2 (a)).

• Our relatively simple feature transformation and data augmentation technique improves the action classification accuracy over the baseline with no augmentation. Our approach is also competitive with a non-linear feature mapping method [RM15] proposed recently.

1.2.2 Video-based mocap retrieval

Next, we focus on finding an efficient method for alignment and distance computation between human motion sequences across two modalities: video and motion capture. This problem typically occurs when searching a large database of mocap files.
Often mocap examples contain an actor performing multiple activities (e.g., the actor may get up, walk, run and kick in a martial arts sequence). However, we want to search for a single action as depicted in a short video query. In a nutshell, we wish to a) retrieve mocap files with similar actions and b) align the video query to the relevant portion of the retrieved mocap sequence.

Figure 1.3: Video-based mocap retrieval and alignment. We illustrate a flexible alignment between a short query video and a longer motion capture (mocap) sequence. The video matches only a part of the mocap, so the end-points of the two sequences do not align. The connections show the frame level correspondences. Note that the flexible alignment (many-to-one or one-to-many matches) ensures that motion at different speeds can be matched correctly.

A search through existing mocap files can save the effort of collecting the data again. Also, we can use similar, aligned mocap sequences and blend them to generate new animations satisfying higher-level goals and constraints [KG04]. This retrieval-with-alignment task can also be a crucial step in various vision applications such as 3d human pose estimation [RSH+05] and cross-view action recognition [GMLW14].

We can rely on rigid matching (i.e., one-to-one frame correspondence) to align sequences efficiently [RDC+13], but we hypothesize that, due to the style and the speed variations in human motion, flexible temporal alignment is a better approach that can improve retrieval quality. Therefore, for aligning video and mocap, we use flexible alignment methods (based on Dynamic Time Warping (DTW)) and show their effectiveness as compared to rigid matching.

Another challenge lies in generating a video and mocap representation that is comparable across the two modalities. Traditional methods based on silhouette are not practical for realistic videos.
We also propose a new similarity measure relying on the 2d pose and motion information (see Figure 4.1 for a summary).

Finally, we notice that there is no standard benchmark for video-based mocap retrieval. Therefore, we propose V3DR. V3DR allows us to evaluate the task quantitatively. The benchmark uses action labels as a way to assess correct alignment between video and mocap. As a part of the benchmark, we provide per-frame action labels for 4.5 hours of mocap data. We also provide a set of realistic video queries taken from YouTube.

In summary:

• We introduce the task of video-based mocap retrieval with a standard benchmark and a method for quantitative evaluation. The benchmark is made publicly available.

• We propose a new motion and 2d pose based feature descriptor that makes the two modalities comparable for retrieval. Our features are effective on realistic videos as demonstrated by our experiments.

• We also benchmark different alignment techniques such as Circulant Temporal Encoding (CTE) [RDC+13] and DTW-based methods. These experiments show that flexible matching helps improve the retrieval quality.

1.2.3 3d pose estimation in sports videos

In the final part of the thesis, we address the problem of 3d pose estimation in unconstrained videos. Given a video sequence, the task is to estimate 3d body articulations of all the subjects in each frame. The task has many applications in complex activity recognition, sports video analysis, scene reconstruction, computer games, and natural user interfaces. We focus on estimating 3d pose from professional basketball videos (see Figure 1.4).

Given the complexity of articulated human motion in 3d and the natural depth ambiguity in monocular videos, pose tracking algorithms often rely on human motion or pose models learned from mocap data [ARS10]. A large body of previous research has focused on creating models that can generalize well with small amounts of data [Fle11].
However, these models are often specific to an action. Also, learning these action-specific models requires manually labeling mocap sequences. Therefore, we design a flexible method that can exploit a large set of unlabeled 3d motion examples. We do not need to know the action featured in the video in advance and our method scales well with respect to the size of the data. We use approximately 4.5 hours of the CMU mocap dataset [cmu] for our method, and we do not require any labels associated with these mocap files.

Figure 1.4: 3d pose estimation from a professional basketball sequence taken from a broadcast video. This is a challenging problem due to depth ambiguity, artifacts such as motion blur, and the complexity of player movements. We show two frames of a sequence with the corresponding 3d pose obtained by our system.

Motion synthesis, a related problem often studied in computer graphics, aims to generate human motion sequences that satisfy some constraints, while relieving animators from the time-consuming task of manually editing joint locations. Specifically, the objective is to generate a smooth 3d pose sequence without specifying the joints per frame. This approach is especially useful in interactive animation settings where requirements for a particular motion cannot be anticipated in advance. In this work, we let a video sequence replace the animator, and exploit a motion synthesis algorithm to generate a 3d pose sequence that best explains the image evidence. A canonical motion synthesis approach is the Motion Graph (MG) [KGP02]. The MG models the space of 3d human motion as a graph, where each walk on the graph is a possible 3d pose sequence. MGs have been used for 3d pose estimation using videos [AF02, RSH+05]. However, there are two main challenges in making this approach practical and scalable, which we address in this thesis.

The first problem lies in constructing the graph.
We can build the graph offline with short mocap sequences as nodes and transitions that allow for smooth motion between the nodes as edges. However, this method does not scale well, and the graph becomes unwieldy as we add more mocap examples. We address this challenge by constructing the graph on-the-fly for each video sequence. The second challenge is using noisy evidence from the video to search the graph. Instead of using a general directed graph, we construct a trellis graph that can be searched efficiently. As a trade-off, we need a shortlist of mocap sequences suitable for the input video that we obtain using video-based mocap retrieval described in the last section. Since we also align the nodes in our trellis graph to the video sequence in time, we call our approach the Localized Motion Trellis (LMT).

In summary:

• We present a novel 3d pose estimation method based on motion synthesis, called the LMT. By using video-based mocap retrieval and employing a simpler graph structure, we make the LMT scalable (in data) and easier to search.

• Most 3d pose estimation methods have been demonstrated only in very constrained scenarios with a single activity. The LMT allows us to estimate pose without knowing or estimating an action label. Also, transitions between activities are naturally handled in this model. Therefore, we can demonstrate our method on a challenging sports video sequence.

1.3 Outline

The thesis is organized into 6 chapters. Chapter 2 describes the relevant related work. We discuss our novel cross-view action recognition approach in Chapter 3. We also introduce the Mocap Dense Trajectories (MDT) feature to learn from synthetic examples. Chapter 4 introduces and formalizes the new task of video-based mocap retrieval. We also present a challenging benchmark, V3DR, to test the quality of the retrieved mocap examples. Subsequently, we employ ideas from the previous chapters to propose the LMT, a non-parametric approach to 3d pose estimation, in Chapter 5.
We evaluate the effectiveness of the LMT on a professional basketball sequence. Finally, we conclude and discuss future work in Chapter 6.

Chapter 2

Background and Related Work

Recognizing human activity and estimating articulated pose from monocular videos are two crucial challenges in computer vision because of their wide-ranging applications. The task, in the case of action recognition [VNK15], is to classify a video into one of the predefined action classes, e.g., sports actions such as discus throw, pole vault or typical activities such as pick-up, sit-down, get-out-of-the-car [WKSL11]. Action recognition has numerous applications including video indexing and vision-based security systems. In this thesis, we target a very specific challenge of adding view-invariance to the action recognition task. Pose estimation and tracking deal with localizing body joints of a person in 2d or 3d space. A human pose sequence provides a detailed description of movements that is useful for natural user interfaces (for video games and computing) [SGF+13], human-robot interaction [KS16] as well as recognizing complex activities [RF03]. We use video-based mocap retrieval as a subproblem for 3d pose estimation, but it is also useful for other applications. For computer games and character animation, mocap retrieval can be used to pick out the relevant sequence from a large collection of mocap files. Thus, we can save the cost of collecting the data again. Since there is an extensive literature on all of these problems, we limit ourselves to the most relevant methods. For further details see the review by Moeslund et al. [MHKS11].

2.1 Cross-view action recognition

Due to the highly articulated nature of the human body, the appearance of pose and motion changes considerably when seen from a different viewpoint (see Figure 1.2).
Cross-view action recognition is defined as the task of recognizing actions where training and test videos come from cameras with different points of view, called the source and the target view respectively.

As we mentioned in Chapter 1, features commonly used for action recognition are not viewpoint invariant. Also, when classifying actions, the performance drops significantly when no supervision is available in the target view (see Figure 2 in [LZ12]). Therefore, to make action recognition robust to changes in viewpoint, we need a strategy to either devise a feature descriptor that is invariant to viewpoint changes or transfer knowledge across views by establishing a relationship between viewpoints and action descriptors.

2.1.1 Reasoning in 3d space

For action recognition, one of the obvious ways to achieve view-invariance is 3d modeling of the human motion. However, the challenge lies in solving some of the intermediate problems such as 2d or 3d pose estimation. Ramanan and Forsyth [RF03] track 2d body parts and match them to mocap data annotated with action labels. They also recover 3d pose and camera viewpoint in the process. Another set of methods build a voxel-based 3d representation of human motion using multi-view video data [WBR07, YKS08]. Although these methods do not require explicit 2d part detectors, they still need synchronized and calibrated multi-view videos to construct the model. Such data is challenging to acquire in an unstructured environment.

2.1.2 Statistical approaches

Many recent approaches to cross-view action recognition begin with a view-dependent feature description of the activity. Rather than reasoning about geometry, these methods directly transform the view-dependent descriptor spaces to align them. Farhadi et al. [FT08] use the frame-level correspondence between descriptors from synchronized source and target views to learn the transformation between features across views. The training requires synchronized multi-view videos. Liu et al.
[LS11] build separate dictionaries for features in source and target views, and look for correspondence between words using bipartite matching to form bilingual words. These bilingual words act as a mapping from individual dictionaries to a common one. This approach is more flexible as it requires video-level correspondences as opposed to frame-level correspondences between different views, which relaxes the constraint of videos being precisely synchronized. Zheng et al. [ZJ13] simultaneously learn a common dictionary between views and individual view-specific dictionaries. They describe each video as a combination of these two factors. All the above methods use supervision in the form of correspondence or partial labels in the target view (see Figure 2.1 (a-b)).

Inspired by unsupervised domain adaptation [GSSG12], some of the recent approaches [LZ12, ZWX+13] can work given only unlabeled examples in the target view. In Chapter 3 we further relax the assumption by using unlabeled mocap data instead. Our method does not require the matching step [LS11] or any target view examples (labeled or unlabeled).

2.2 Retrieval of human motion

Mocap data serves as a compact description of human motion. It is commonly used in video games, computer animation, and special effects. Efficient mocap retrieval can help animators find relevant clips to reuse in an animation. Various input types such as hand-drawing [CLAL12, CYI+12], movements of a wooden puppet [FGDJ08, NNSH11], Kinect depth map [KCT+13], and mocap [MRC05] itself have been used to query mocap datasets. Retrieving similar motion capture sequences given a short video as a query can also be useful for 3d pose estimation (Chapter 5) and cross-view action recognition (Chapter 3).
Although it is widely applicable, video remains a largely unexplored and challenging input modality for mocap retrieval.

2.2.1 Descriptors for pose and motion retrieval

The first challenge in video-based mocap retrieval is to establish similarity between a mocap and a video frame. The question arises: what is a good representation of human pose in these two modalities for this task?

[Figure 2.1 panels: (a) Correspondence mode, (b) Semi-supervised mode, (c) Unsupervised mode; each panel shows Source and Target columns with action boxes Act01 and Act02.]

Figure 2.1: Commonly used evaluation strategies for cross-view action recognition. Each circle denotes a video, and solid circles represent labeled examples. The data from two different views is arranged in Source and Target columns. The boxes represent different action types. a) In correspondence mode, a fixed fraction of examples is known to be the same action seen from two views. The lines connecting the circles denote these correspondences. b) In semi-supervised mode, a small percentage of test (or target view) examples is annotated with a class label. c) The unsupervised mode is the most challenging case, where no annotations connect the source and the target examples. In this work, we demonstrate our results on the unsupervised modality only.

Efficient retrieval of similar 3d motion examples given a short mocap query (mocap-to-mocap matching) is a closely related problem. Müller et al. [MRC05] describe each mocap sequence using binary geometrical features based on relative 3d arrangements of body parts. Their features are designed to capture the similarity in the activity space rather than the exact numerical similarity in pose. Kovar and Gleicher [KG04] use a numerical similarity measure (L2 distance between the aligned point cloud of 3d joints), but add query expansion to extend the search to logically similar motions. These techniques are effective at 3d-to-3d retrieval.

2d pose can be used for video retrieval.
Given an image featuring a person as a query, the task is to find the frames with a similar pose from a video database. To describe pose in each video frame, Eichner et al. [EMJZF12] first run a 2d pose detector on each frame to obtain a heatmap for body joint locations and orientations. A set of statistics computed on these heatmaps acts as the descriptor for the pose to calculate similarity between the image and the video frame. Jammalamadaka et al. [JZJ15] propose an extension based on deep-poselets [BYF14]. Their method can also be extended to other query types such as Kinect depth maps [JZE+12].

Silhouettes are commonly used to match human pose in images to 3d examples. For an image, we can use background subtraction to obtain a silhouette. In the case of mocap, a silhouette can be easily obtained using a 3d shape model driven by the mocap sequence [SVD03, RSH+05]. For instance, Ren et al. [RSH+05] search for mocap examples given an image using Haar-like features based on silhouettes from multiple synchronized views. However, in the case of realistic images, it is difficult to generate a clean silhouette of the person. To deal with this limitation, we use an off-the-shelf 2d pose estimation [YR13] technique to establish a correspondence between a video and a mocap frame (Section 4.2). We also use MDT (introduced in Chapter 3) to describe human motion in a mocap sequence. The features based on these trajectories are comparable to DT features [WKSL11] in videos. We demonstrate that MDT are complementary to the pose-based features on this task.

2.2.2 Exemplar-based 3d pose estimation

Video-based mocap retrieval can also be used for 3d pose estimation (more details in Section 2.3). The example-based approaches for 3d pose estimation in videos often retrieve 3d pose for each frame from a database and then smooth the output over time [BMB+11, YKW14]. Another possible way to add temporal consistency and restrict search is using a Motion Graph (MG) [RSH+05].
We hypothesize that matching short videos instead of single frames can help add higher-order temporal constraints to matching. Also, among the methods mentioned above, [BMB+11] uses a depth image and [RSH+05] works with multi-camera input. Since we are only using monocular RGB videos, a sequence has much richer information than a single image to allow for a more robust match in our case. Therefore, we focus on matching short video sequences to mocap examples.

2.2.3 Comparing and aligning temporal sequences

To evaluate a match between a mocap and a video, we require alignment of these two temporal sequences. This alignment can be flexible (elastic matching) or inflexible in time. The rigid alignment works well for tasks such as copy detection (in videos) or event recognition [RDC+13]. However, due to the style and speed variations in human motion, flexibility in time is crucial to improving matching quality (Chapter 4). We also wish to align the video frames to their corresponding mocap frames based on pose and motion similarity, which requires temporal flexibility in the general case.

Dynamic Time Warping (DTW) is a popular algorithm used to align temporal sequences, and to cluster time-series data. It uses dynamic programming to find the minimum-cost alignment of two sequences subject to constraints suitable for time-series data, i.e., monotonicity and local continuity (described in Section 4.3.2). FastDTW provides an efficient approximation to DTW by solving the problem iteratively at multiple scales, achieving a complexity of O(n) [SC04] in the length of the sequence, as opposed to O(n^2) for the original DTW. One of the limiting assumptions in DTW and its variants is that the first and the last frames of the two sequences must be aligned (the end-point constraint). The constraint is not suitable for comparing sequences of different length.
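As a rough illustration of the dynamic program behind classic DTW and of where the end-point constraint enters, the following Python sketch computes the minimum cumulative alignment cost of two sequences. It is not the implementation evaluated in this thesis: the function name `dtw_cost`, the scalar per-frame values, and the absolute-difference local cost are assumptions chosen for brevity, whereas real pose or motion descriptors would need a vector distance.

```python
import math


def dtw_cost(query, reference, dist=lambda a, b: abs(a - b)):
    """Minimum cumulative cost of aligning two sequences with classic DTW.

    Illustrative sketch only: scalar frames and an absolute-difference
    local cost are assumptions, not the features used in the thesis.
    """
    n, m = len(query), len(reference)
    # acc[i][j]: best cost of aligning query[:i] with reference[:j]
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0  # end-point constraint: both sequences start together
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Monotonic, locally continuous steps: diagonal, down, right.
            step = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = dist(query[i - 1], reference[j - 1]) + step
    return acc[n][m]  # end-point constraint: both sequences end together
```

With this sketch, `dtw_cost([1, 2, 3], [1, 1, 2, 2, 3, 3])` is zero because the warping absorbs the change in speed, while `dtw_cost([5, 6], [0, 5, 6, 0])` is penalized even though the query appears exactly inside the reference, since the first and last frames are forced to match.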
Some recent methods have attempted to relax the end-point constraint partially by fixing one of the ends but letting the other float [SVBC08, SVBC09]. DTW-S [YCN+11] matches sequences of different sizes and allows flexibility at both ends by assuming balanced alignment, i.e., the warping uses an equal number of frames from both sequences. This assumption is violated when we match actions performed at different speeds. Subsequence DTW (SS-DTW) [Mül07] is an efficient method for relaxing the end-point constraint; however, it introduces a bias for choosing a shorter database subsequence for a given query (more details in Section 4.3.2). Normalizing the score with a measure of path length [AF13, MGB09] can remove the bias. We experiment with different versions of these normalizations in Chapter 4.

An alternate approach to sequence alignment is to treat the warping as a discrete version of a monotonic function. For instance, GCTW [ZdlT15] poses alignment as an optimization over a continuous space. GCTW can also incorporate floating end-points; however, due to a non-convex objective function, GCTW requires effective initialization to avoid local minima.

2.3 3d human pose estimation in videos

The final part of this thesis deals with estimating the articulated 3d pose of the players in broadcast team sports videos, where only one view is available at a time. Estimating pose in a monocular video is particularly challenging due to foreshortening and occlusion, which make the solution inherently ambiguous.

2.3.1 Statistical priors to human motion

To resolve the ambiguity, a prior model of human pose or motion can be used. Under a Bayesian formulation, we can write the problem of tracking 3d pose in a video as

$$\arg\max_{e} \, p(e \mid D) = \arg\max_{e} \, \prod_{i} p(I_i \mid e_i) \times p(e_1, e_2, \ldots, e_N) \tag{2.1}$$

where $e = (e_1, e_2, \ldots, e_N)$ is the pose at all the frames in the range $[1, N]$, and $D$ is the set of observations $(I_1, I_2, \ldots, I_N)$.
The first term $p(I_i \mid e_i)$ is the likelihood of an observation $I_i$ given pose $e_i$, and the second term is the prior probability of a pose sequence, $p(e)$. Under the first-order Markov assumption, the prior can be simplified as

$$p(e_1, e_2, \ldots, e_N) = p(e_1) \prod_{i=2}^{N} p(e_i \mid e_{i-1}) \tag{2.2}$$

where $p(e_i \mid e_{i-1})$ can be modelled as a normal distribution to encourage a smooth output sequence [SHG+11].

In general, it is challenging to build an empirical probability distribution $p(e)$ because of the high dimensionality of the pose data. Therefore, the prior term is often modeled as a latent low-dimensional space of activity-specific human poses [Fle11]; in particular, non-linear regression with Gaussian processes has proven successful using small amounts of training data [LM07, UFHF05, WFH08]. Recently, bootstrapping 3d pose with 2d pose detectors has become a popular approach [ARS10, EAJ+15]. Andriluka et al. [ARS10] associate 2d body part detections over time, using tracking-by-detection, and then use an hGPLVM [LM07] (hierarchical Gaussian process latent variable model) to generate stable 3d output. Similarly, Simo-Serra et al. [SSQTMN13] apply kinematic constraints to obtain 3d pose from noisy 2d estimates. The main drawback of these statistical priors is that they are specific to an action and hence are not suitable for complex activities involving multiple actions.

In contrast, our approach to 3d pose estimation (Chapter 5) is designed to benefit from large collections of unlabeled mocap data. Although action recognition can be used as a means to pick the appropriate 3d motion model [YGG12, YKC13], the common assumptions are: (a) a set of actions is known a priori and (b) there is a pre-trained model for each action.

2.3.2 Motion synthesis

In addition to generating the pose, we are also interested in reconstructing realistic human motion. Thus, we can alternatively view this problem as 3d motion synthesis driven by the video input.
In many graphics applications, motion synthesis aims to generate a sequence of 3d poses that satisfy user-specified constraints while relieving animators from the time-consuming task of manually editing individual joint locations. The objective is to generate the sequences by specifying only high-level goals (e.g., get the character from location a to location b). This method is especially useful in interactive animations, where the requirements for a particular motion cannot be fully anticipated in advance.

Approaches to motion synthesis can be divided into three main categories: physics-based, statistical, and example-based. Physics-based methods simulate the dynamics of the body and the physical world with the goal of making virtual characters learn control strategies to perform a variety of tasks [GvdPvdS13]. These methods have been previously used for 3d pose estimation in 2d videos by adding visual evidence in the control loop [VSHJ12, WC10, BF08]. However, they suffer from a high computational cost and are not able to produce motion that looks natural to the human eye. Statistical methods search for a low-dimensional representation of human motion to build a generative model. These methods can be used to produce character animations, given user-defined constraints [LWS02], and have also been used for 3d pose estimation in monocular video [ARS10, LTSY09]. While these models can be learned from a small number of 3d examples, they tend to be action-specific and are unable to capture complex variations in motion. Example-based methods are widely popular in interactive graphics [PP10] due to their simplicity and their ability to generate natural motion. In these methods, motion examples are spliced, interpolated and concatenated to synthesize new character animations.
Our approach to 3d pose synthesis is inspired by Motion Graphs [KGP02], a popular approach to 3d exemplar-based synthesis.

Motion Graphs

Motion Graphs (MGs) exploit large collections of mocap data by discovering good transitions between different sequences. For any pair of motion sequences, a similarity matrix is computed, where high scores correspond to points of smooth transitions. A graph is then built, such that nodes represent mocap frames and edges denote suitable transitions. Motion can be generated by simply walking the MG, interpolating the joint angles to generate natural-looking 3d sequences.

Generating motion using an MG is very simple: one only has to walk the graph. However, a random walk of the graph is likely to be of little use. Rather, the challenge is to find walks in the graph that satisfy some constraints. The expressiveness of an MG is determined by the number and variety of sequences used in its construction: a large number of examples can create a huge MG that is able to generate a variety of outputs; however, searching for a path that satisfies desired constraints in a large MG quickly becomes infeasible.

2.3.3 Direct regression to 3d pose

In contrast to tracking assisted with learned priors or motion synthesis, regression-based techniques such as [AT06, UD08] learn a regression function from 2d observations to 3d pose. Recently, Tekin et al. [TSW+15] use HOG3d features to regress directly from image sequences to 3d pose using kernel ridge regression. The motion as well as the pose is encoded in the space-time volume, and using data from multiple action categories helps the accuracy. This observation suggests that the regression is not activity-specific. However, regression can only be applied to short video sequences, as the dimensionality of the regression input and output grows linearly with the number of frames. Another limitation of such methods is the requirement of synchronized videos with the 3d pose for training.
Such data is expensive to acquire in realistic scenarios such as sports and outdoor environments.

2.4 Summary

We have reviewed the most relevant approaches to human-motion analysis and synthesis. For cross-view action recognition, the present state-of-the-art methods are based on domain adaptation (i.e., learning a transformation for action descriptors from one view to the next). Thus, these methods do not depend on the underlying feature representation. However, the requirement of multi-view video examples in realistic settings is challenging. Therefore, we look for a completely unsupervised technique, requiring no synchronized views or specialized data collection. Similarly, in the case of pose estimation, the most successful techniques using monocular videos (tracking-by-detection approaches) learn their pose or motion prior with carefully curated data, while the direct regression-based methods need synchronized video and mocap. Since we are interested in avoiding any labeling effort and using unstructured examples, our approach to 3d pose estimation is based on retrieving mocap subsequences from a large dataset using a short video as a query and effectively combining the output using exemplar-based motion synthesis.

Chapter 3

Unsupervised Cross-View Action Recognition

We present a novel approach to recognize human actions in videos from a new viewpoint, i.e., a camera angle not seen in training examples. To build a view-invariant model of actions, we establish a link between the video descriptor and the point of view of the camera. Many successful approaches to the problem use labeled examples to link descriptors across views [FT08, ZJ13, LS11]. Instead of relying on labeled examples, we learn the mapping between views via synthetic feature generation using a large corpus of unlabeled motion capture (mocap) sequences.

We also present a method to generate video-like motion features from mocap without any photo-realistic rendering.
These synthetic features are analogous to Dense Trajectories (DT) [WKSL11] features in videos, often used for action recognition. We refer to our synthesized version of DT as Mocap Dense Trajectories (MDT). Once we generate synthetic features for a variety of mocap sequences, a linear function can be learned to map these features from one view to the next. Because of the similarity between the synthesized and the real trajectories (see Figure 3.2(a)), the mapping learned on synthetic features can be applied to the real training data to generate additional multi-view training examples. These synthetic descriptors along with the original training data are then used to train the action classifier. As shown by our experiments, this simple scheme of generating synthetic training examples leads to significant improvements in the cross-view action recognition accuracy. Since we do not require any multi-view labels, we refer to the approach as unsupervised.

3.1 Overview

A view-invariant representation of human motion is crucial for effective action recognition. However, as noted in Chapter 1, widely popular shape and optical flow-based features [DT05, DTS06, GBS+07, YS05], which are used to describe actions in videos, are not specifically designed to be viewpoint invariant. Consequently, the effectiveness of a method depends on the availability of training data from diverse views. However, this may not be the case for many data sources, e.g., internet videos often have a very limited set of viewpoints (see Figure 4.5 for a few samples). If our test videos do not follow the same limited distribution of viewpoints, we need to come up with strategies to transfer knowledge across views.

Our approach uses mocap examples for knowledge transfer. To utilize mocap sequences, we define MDT as the orthographic projection of 3d trajectories. We obtain 3d trajectories by following points on a human model driven by a mocap sequence (details in the next section).
In this chapter, first, we describe the process to generate MDT (Section 3.2). Second, we use corresponding trajectories synthesized from multiple views to learn the transformation of motion features due to a change in viewpoint (Section 3.3). We further utilize these transformation functions to add view-invariance to action recognition. Finally, we present evaluations on the INRIA XMAS (IXMAS) [WRB06] dataset, a standard benchmark for the problem (Section 3.5).

3.2 Dense Trajectories from Mocap (MDT)

Our goal is to generate mocap features equivalent to the DT [WKSL11] features. However, mocap sequences only provide a series of 3d joint locations over time. To generate a surface representing the body, we approximate body parts by tapered cylinders with bones as axes, and put a dense grid of points on the surface of each cylinder (see Figure 3.1 (b)). We uniformly sample points to get approximately 1500 points for the whole body model.

[Figure 3.1 panels: (a) Mocap sequence; (b) Tapered cylinder representation; (c) Orthographic projections; (d) Hidden point removal; (e) Mocap generated trajectories]

Figure 3.1: Mocap dense trajectory (MDT) generation pipeline. (a) Mocap sequences have 3d body joint locations over time. (b) We approximate human body shape using tapered cylinders to obtain a "tin-man" model. (c) Sampled points on the surface of cylinders are projected under orthography for a fixed number of views and (d) cleaned up using hidden point removal. (e) Connecting these points over a fixed time horizon gives us the synthetic version of dense trajectory (DT) features [WKSL11].

3.2.1 Generating multiple projections

We project the points of the model surface under orthography for a fixed number of views. For observing human motion at a distance, orthographic projection offers a reasonable approximation to perspective projection. With orthographic projection, there are only two parameters to vary: the azimuthal angle and the elevation angle.
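A roll-free orthographic camera parameterized by these two angles can be sketched as below. The coordinate conventions (z as the world "up" axis, the second angle measured from the upward vertical) and the function name `orthographic_project` are assumptions made for this illustration and are not taken from the thesis.

```python
import numpy as np


def orthographic_project(points, azimuth, polar):
    """Project (N, 3) world points to (N, 2) image coordinates.

    Assumed conventions: z is the world "up" axis, `azimuth` rotates the
    camera about z, and `polar` is the angle between the upward vertical
    and the direction from the subject towards the camera.
    """
    points = np.asarray(points, dtype=float)
    # Unit vector pointing from the subject towards the camera.
    to_cam = np.array([
        np.sin(polar) * np.cos(azimuth),
        np.sin(polar) * np.sin(azimuth),
        np.cos(polar),
    ])
    # Image x-axis: horizontal and perpendicular to the viewing direction,
    # which keeps the camera free of roll.
    right = np.array([-np.sin(azimuth), np.cos(azimuth), 0.0])
    # Image y-axis completes the right-handed frame (right, up, to_cam).
    up = np.cross(to_cam, right)
    # Orthography simply discards the depth coordinate along `to_cam`.
    return np.stack([points @ right, points @ up], axis=1)
```

Calling the function once per (azimuth, polar) combination yields one 2d point set per view; depth along the viewing direction is not returned here, although a visibility test would need it.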
We choose the azimuthal angle $\phi \in \Phi = \{0, \pi/3, 2\pi/3, \pi, 4\pi/3, 5\pi/3\}$ and the elevation angle $\theta \in \Theta = \{\pi/6, \pi/3, \pi/2\}$, measured from the vertical pointing upwards. By discretizing the angle space, we get 18 different projections per frame. Although the equal division along azimuthal and elevation angles does not uniformly sample the viewing sphere, it is a simple and adequate mechanism for choosing camera viewpoints. Since we assume that a camera looking up is unlikely, we do not include elevation angles greater than $\pi/2$. We also assume that there is no camera roll (see Figure 3.1 (c)).

3.2.2 Hidden point removal

MDT account for self-occlusions by removing points that should not be visible from a given viewpoint. We use a freely available off-the-shelf implementation of the method by Katz et al. [KTB07]. Hidden point removal gives us a set of filtered points for each projection (see Figure 3.1 (d)).

3.2.3 Trajectory generation and postprocessing

For a given viewpoint, we connect the filtered 2d points over a fixed time horizon τ to obtain synthetic trajectories. Therefore, only points that are visible within the τ-frame window are included in trajectories. To make synthetic trajectories comparable to video dense trajectories, we make sure that the frame rate for mocap is the same as for the videos used in the experiments. We use τ = 15 (covering half a second in a 30 fps sequence), which has been found to work well in the past [WKSL11]. Again following Wang et al. [WKSL11], we remove trajectories smaller than a threshold, as described in their paper. Figure 3.1 provides an overview of the MDT generation pipeline.

A trajectory descriptor can be generated for a physical trajectory by concatenating the velocities at each frame and normalizing the result by the total length of the trajectory [WKSL11].
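The descriptor computation can be sketched in a few lines. The input layout is an assumption made for this example: the trajectory is given as τ + 1 consecutive image positions, so that it yields τ frame-to-frame displacements; `trajectory_descriptor` is a hypothetical name.

```python
import numpy as np


def trajectory_descriptor(track):
    """Describe a 2d trajectory by its length-normalized displacements.

    `track` has shape (tau + 1, 2): the image position of one tracked
    point in consecutive frames (assumed layout). The result has
    2 * tau entries.
    """
    track = np.asarray(track, dtype=float)
    velocities = np.diff(track, axis=0)                 # (tau, 2) displacements
    length = np.linalg.norm(velocities, axis=1).sum()   # total path length
    if length == 0.0:
        # A static point carries no motion information.
        return np.zeros(velocities.size)
    return (velocities / length).ravel()
```

Dividing by the total path length makes the descriptor insensitive to the overall scale of the motion in the image, so only the shape of the trajectory is encoded.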
Thus, each τ-frame-long trajectory can be described using a 2τ-dimensional vector.

3.3 Learning from synthetic data

Mocap data allows us to observe how the appearance of the same 3d trajectory, generated using human movements, transforms from one view to the next by looking at the descriptors from two corresponding trajectories. We begin by generating mocap trajectories as seen from a pair of viewpoints (as shown in Figure 3.2 (b)) for a large number of mocap sequences. The mocap sequences are taken from the CMU-mocap dataset [cmu]. We use a visual vocabulary for trajectory descriptors and quantize descriptors to their closest codewords. This hard vector encoding simplifies learning the dependency between the codewords in the two views. We model the transformation of features due to viewpoint change as a linear function of codewords. We make the following two assumptions while learning the transformation function:

• We assume that the feature transformation is independent of the activity, and generate only one function per view change. Although this assumption simplifies the learning process, it may not be a good approximation in all scenarios. For instance, an individual trajectory may have a very similar shape for a walk and a turn sequence from one view, but it may look very different from another viewpoint.

• We also assume that each trajectory transforms independently, i.e., we do not model the effect of the transformation of a trajectory on the neighboring trajectories. The independence assumption lets us transform each codeword separately. Although it works well in practice (as shown by our results), the assumption clearly does not hold true for human motion.

3.3.1 Generating correspondences

We assign a unique ID to each point on the 3d surface of the human model (described in Section 3.2). Then, given two viewpoints, we can get a feature pair that originates from the same point on the surface (see Figure 3.2(b)).
When an exact match is not found due to self-occlusion, a small neighborhood on the model surface is considered equivalent, i.e., points in the neighboring region are assumed to have the same ID.

3.3.2 Learning codeword transformations

For each view pair, we refer to the initial viewpoint as the source view, and the changed viewpoint as the target view. We quantize the MDT features using a fixed codebook $\mathcal{C}$ of size $n = 2000$. Given a source elevation angle $\theta$ and a relative change in viewpoint given by $\Delta = (\delta\theta, \delta\phi)$, we define the training set $D^{\Delta}_{\theta} = \{(f_i, g_i)\}_{1}^{m}$ to be the set of $m$ pairs $(f, g) \in \mathcal{C} \times \mathcal{C}$, where $f_i$ and $g_i$ are the codewords for two corresponding MDT features.

[Figure 3.2 panels: (a), (b)]

Figure 3.2: (a) A visual comparison between synthesized MDT (left) and optical flow based DTs [WKSL11] (right). We exploit the similarity in their shape to learn a feature mapping between different views using MDT, and use this mapping to transform the DT based features commonly used for action recognition. (b) To generate view correspondence at the local feature level, the human body is represented using cylindrical primitives driven by mocap data. We project the 3D path of the points on the surface to multiple views and learn how the features based on these idealized trajectories transform under viewpoint changes.

Given the training data $D^{\Delta}_{\theta}$, the relationship between codewords $f_i$ and $g_i$ can be modeled within a probabilistic framework. Since we assume each feature transforms independently, we can learn a joint probability mass function $p(F, G)$ which captures the probability of the codeword pairs $(f_i, g_i)$. We train the model using maximum likelihood estimation and calculate the empirical probability by counting the co-occurrences of $(f_i, g_i)$ in $D^{\Delta}_{\theta}$ followed by normalization.
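The counting estimate can be sketched as follows, assuming codewords are represented by integer indices. The sketch normalizes each row of the count matrix, which yields the distribution over target codewords for a given source codeword directly; leaving the rows of unseen source codewords at zero is an assumption of this example, as is the name `conditional_from_pairs`.

```python
import numpy as np


def conditional_from_pairs(pairs, n_codewords):
    """Estimate p(G | F) from corresponding (source, target) codeword pairs.

    Row i of the returned matrix is the empirical distribution over target
    codewords given source codeword i. Rows of source codewords that never
    occur stay all-zero (an assumption of this sketch).
    """
    counts = np.zeros((n_codewords, n_codewords))
    for f, g in pairs:
        counts[f, g] += 1.0  # co-occurrence count of the pair (f, g)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Divide each row by its total, skipping rows with no observations.
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)
```

Dividing the whole count matrix by the number of pairs instead would give the joint estimate of $p(F, G)$.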
The conditional probability distribution of $G$, given an observation of codeword $f_i$ in the source domain, can be written as

$$p(G \mid F = f_i) = \frac{p(F = f_i, G)}{p(F = f_i)} = \frac{p(F = f_i, G)}{\sum_{c \in \mathcal{C}} p(F = f_i, G = c)} \tag{3.1}$$

After observing an instance of codeword $f_i$ in the source view, $p(G \mid F = f_i)$ allows us to infer the possible outcomes in the target view. In the next section, we use this probability distribution to map source Bag of Words (BoW) descriptors to the target view.

3.4 Synthesizing cross-view action descriptors

Given a training descriptor, we use the mapping between codewords to "hallucinate" action descriptors as seen from different viewpoint changes from the initial viewpoint and use them as additional examples for training, thus adding view-invariance to our model. We assume that, due to the similarity between mocap and dense trajectories, we can learn the view transformations on one and apply them to the other. Figure 3.2 (a) shows the visual similarity between trajectories generated for a video and synchronized mocap trajectories.

Given a BoW descriptor of an action, we wish to synthesize a corresponding new descriptor as seen from the viewpoint $\Delta = (\delta\theta, \delta\phi)$ away from the original view. Let $x = [x_1, \ldots, x_n]^T$ be the BoW descriptor in the source view, and $y = [y_1, \ldots, y_n]^T$ be the descriptor we want to estimate. As seen in the last section, we have a probabilistic mapping between codewords across views. Using this mapping, we return an average descriptor by taking the expectation:

$$\bar{y} = [E[y_1], \ldots, E[y_n]]^T \tag{3.2}$$

$$E[y_j] = \sum_{i=1}^{n} x_i \cdot p(G = f_j \mid F = f_i) \tag{3.3}$$

By organizing $p(G \mid F)$ in the form of a matrix (say $N$) where the $i$-th row is the categorical distribution $p(G \mid F = f_i)$, we can rewrite the above formulation as a linear transformation

$$\bar{y} = N^T x \tag{3.4}$$

where $N$ is the transition matrix corresponding to a transition $\{\theta, \Delta\}$.

We generate these additional multi-view examples for each training sequence and append them to our training data for cross-view action recognition.

[Figure 3.3 panels: Camera 0, Camera 1, Camera 2, Camera 3, Camera 4]

Figure 3.3: All camera views from the IXMAS dataset. We show the kicking action synchronously captured from the 5 camera angles. Note that the appearance changes significantly across views. Camera 4 is especially challenging because of its extreme elevation angle.

3.5 Experiments

The goal of our experiments is to evaluate the effectiveness of the synthetic multi-view descriptors on unsupervised cross-view action recognition. We hypothesize that including these descriptors in the training data will lead to view-invariance in action classification.

3.5.1 Dataset and evaluation

We choose the INRIA IXMAS dataset [WRB06] for our experiments. The dataset has 11 action categories (check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick, and pick up), each performed 3 times by 10 subjects. The dataset contains videos of subjects performing these activities, synchronously captured using 5 cameras (Figure 3.3).

Previously, the IXMAS dataset has been used for cross-view action recognition in different evaluation modes [ZWX+13] (described in Section 2.1). Our approach is well-suited for the unsupervised mode, where no labeled examples in the test view are available. The unsupervised mode tests the scenario in which the entire training data comes from an unknown viewpoint (called the source view), and the test view (or the target view) is not known in advance.
This mode is the most challenging in terms of classification accuracy reported in the literature so far, since the target view is novel. To evaluate action recognition, we pick videos from one camera view as the training set and use another camera view as the test set. Five cameras give us $2 \times \binom{5}{2} = 20$ train-test pairs. We report classification accuracy for each camera pair separately in Figure 3.4.

As mentioned earlier, IXMAS data is synchronously captured. To avoid contaminating our training data with test examples, we use a leave-one-subject-out strategy for evaluation, i.e., we exclude the videos featuring the test subject from the training data. This strategy prevents including the test example, seen from the training view, in the training set. However, a more prevalent mode of evaluation is where all the examples from a source view are used for training. We use this mode to compare with the state of the art (Table 3.1).

3.5.2 Dense trajectories for action classification

We describe each video in our dataset as a BoW of DT. We compute DT using the code provided by Wang et al., sampling every other pixel, and let each trajectory be 15 frames long as in [WKSL11]. We cluster the trajectories from the training view into 2000 k-means clusters to obtain a codebook $\mathcal{C}$, and generate a BoW descriptor for each video. For classification, we train a non-linear SVM with a $\chi^2$ kernel. The classification based only on the BoW descriptors from the source view videos, without any knowledge transfer across views, gives us the no augmentation baseline.

3.5.3 Mocap feature synthesis

For learning feature correspondences using the mocap data, we use the CMU Motion Capture Database [cmu]. This dataset includes over 2600 mocap sequences of human subjects performing a variety of actions. Though the CMU dataset contains some action labels for each file, we do not use these annotations.

To learn the mapping between codewords, we take a random subsample of the mocap data, keeping 10% of the frames.
We generate MDT from multiple viewpoints and quantize them using the same codebook $\mathcal{C}$. For this experiment, we quantized the elevation angle $\theta$ to $\{\pi/6, \pi/3, \pi/2\}$ radians and the azimuthal angle $\phi$ to 6 equally spaced angles in $[0, 2\pi)$. Thus, for each source elevation $\theta$, we have 17 possible viewpoint transitions (excluding the transition to itself). Given a training example from IXMAS, we generate one synthesized descriptor per transition as described in Section 3.4. Since we do not know the source elevation $\theta$ for the training set, we consider transitions from all the possible elevation angles. This gives us 51 synthesized descriptors per training example.

Following [CG13], we augment our original training data using these new descriptors. In such augmentation schemes, the synthesized data is often weighted less compared to the original data. We empirically set the weight of the augmented data to 0.01, while the original examples have weight 1. In our case, this importance is controlled by the slack penalty of the SVM. This way, we account for (a) the imbalance in the number of examples in original and augmented data, and (b) the fact that the augmented data might contain errors. Again, we train a non-linear SVM with a $\chi^2$ kernel on the real and augmented data.

3.5.4 Mocap retrieval-based augmentation

We add another baseline using video-based mocap retrieval (Chapter 4). Instead of generating synthetic descriptors for each view change, we directly search for the best-matching mocap sequence in a mocap database. We generate descriptors corresponding to different projections (18 different projections described above) of the retrieved mocap example, and add them to our training set. This is an alternative method for adding view-invariance to our model.

For retrieval we use the shortest 2000 sequences from the CMU mocap.
We also use a concatenation of pose and trajectory features (P+T) to describe each video and mocap frame, along with SLNDTW (+smooth) for matching the sequences. Each of these steps is described in detail in Section 4.2 and Section 4.3. Our retrieval method also returns the aligned mocap frames for each video query. We only use MDT corresponding to these aligned frames for data augmentation. The synthetic examples are weighted the same as above for SVM training.

3.5.5 Results

Figure 3.4 shows the classification accuracy per camera pair. We use the leave-one-subject-out scheme for evaluation. Our mocap feature synthesis approach performs the best on most of the camera pairs. We also note that our method consistently outperforms the baselines with camera 4 as the target view. Camera 4 is an especially difficult test set because its elevation is significantly different than the other camera views (see Figure 3.3). We show per-class classification accuracy in Figure 3.5. Again, mocap feature synthesis gives the best classification accuracy on most classes. Also, it outperforms the no augmentation baseline on all categories except kick. We also compare the confusion matrices for no augmentation and feature synthesis approaches in Figure 3.6.

Avg. accuracy (%) per train camera - test camera pair

Train-Test   Mocap feat. synthesis   Retrieval-based aug.   w/o aug.
0-1          82.1                    87.3                   87.9
0-2          75.2                    60.6                   59.4
0-3          80.6                    77.3                   77.3
0-4          48.8                    40.3                   29.7
1-0          85.8                    87.0                   84.2
1-2          67.9                    66.4                   60.0
1-3          81.2                    73.3                   70.6
1-4          39.1                    30.3                   17.6
2-0          74.8                    67.3                   61.5
2-1          69.7                    77.0                   75.5
2-3          75.8                    60.9                   60.9
2-4          60.9                    58.2                   47.6
3-0          82.7                    81.8                   73.0
3-1          75.2                    75.5                   69.7
3-2          63.3                    54.5                   60.0
3-4          33.3                    31.8                   27.9
4-0          53.3                    43.3                   38.5
4-1          41.5                    41.5                   33.6
4-2          53.6                    58.2                   53.6
4-3          42.4                    40.3                   40.3

Figure 3.4: Accuracy for each camera pair. We highlight the best results in boldface and underline the second best value. Our mocap feature synthesis approach performs the best on most train-test view pairs.

[Figure 3.5: bar chart of average classification accuracy (%) per action class for No aug., Retrieval-based aug., and Mocap feat. synthesis]

Figure 3.5: Per-class classification accuracy of our method on the IXMAS benchmark. These are the same results as Figure 3.4, rearranged to show the per-class accuracy averaged over all train-test camera pairs. The mocap feature synthesis improves accuracy over the no augmentation baseline on every category except kick.

[Figure 3.6 panels: Mocap feature synthesis; No data augmentation; Retrieval-based augmentation (confusion matrices over the 11 action classes)]

Figure 3.6: Confusion matrices before and after applying our mocap feature synthesis and retrieval-based augmentation approaches. We show average accuracy over all cameras as train-test pairs. We note that our data augmentation helps resolve confusion for many action categories.

Finally, we test the sensitivity of our method to the number of viewpoints.

[Figure 3.7: average accuracy (%) for 3, 4, 6, and 12 divisions in azimuthal angle and 1, 2, 3, and 6 divisions in elevation angle]

Figure 3.7: Variation of the average classification accuracy (on IXMAS) with the number of divisions of azimuthal and elevation angles for our mocap feature synthesis approach.
We note that a small number of divisions in angles does not provide the coverage needed for all the test viewpoints. However, the gain in accuracy saturates as we keep increasing the number of synthesized views.

As described in Section 3.5.3, we quantize the viewpoint space into 3 divisions in elevation and 6 along the azimuthal angle. We vary the number of divisions and observe the average accuracy (Figure 3.7). We observe that for the IXMAS benchmark, 2 divisions in elevation and 4 along the azimuthal angle are enough to cover the viewpoints. Note that the rest of the parameters are kept the same for this experiment, including the weights of synthetic examples for learning the classifier. Since the number of augmented descriptors changes with the chosen number of divisions, we also experimented with fixing the weight for 3 divisions in elevation and 6 in azimuthal, while adjusting the other weights inversely proportional to the total number of views. However, this gave a very similar trend to Figure 3.7.

Comparison with the state of the art

We also compare with the previous work of Li et al. [LCS12] and the recent work by Rahmani and Mian [RM15] (Table 3.1). We use all the examples from the source view (instead of the leave-one-subject-out scheme) for a fair comparison with these methods. The compared results are taken directly from the cited papers.

Average classification accuracy (%)

Method                         Leave one sub. out   Orig. evaluation
No augmentation                56.4                 61.1
Retrieval-based aug.           60.6                 66.1
Mocap feat. synthesis (ours)   64.4                 70.3
Hankelets [LCS12]              -                    56.4
Rahmani & Mian [RM15]          -                    72.5

Table 3.1: Comparison of the overall performance of our approach and the state of the art on the IXMAS dataset. We show the accuracy averaged over all the camera pairs. We highlight the best value with boldface and underline the second best value. The results for the leave-one-subject-out evaluation mode, as well as the usual mode (common in the literature) are shown. Rahmani and Mian [RM15] build on the initial version of our method [GSLW14], but learn a non-linear transformation between mocap trajectory features from different viewpoints to a canonical view.

We note that our no augmentation baseline outperforms [LCS12]. Also, our mocap feature synthesis approach is competitive with the state of the art [RM15].

3.6 Discussion

We have demonstrated a novel method for using unlabeled motion capture sequences as prior knowledge of human activities and their view-dependent descriptions. To this end, we have introduced the MDT, a synthetic, idealized, and viewpoint-aware motion feature, generated using motion capture data. The MDT can also be seen as a bridge between 2d motion in videos and human movements in mocap examples.

We have used MDT to add view-invariance to action recognition without using multi-view video data or any additional data labeling effort. Since mocap data lives in 3d, we were able to use the MDT generated from different views of the same mocap sequence to learn the transformation of features from one view to the next. The learned transformation was then used to synthesize new training examples. Finally, we have also shown that synthetic feature generation is an effective technique for unsupervised cross-view action recognition.

3.6.1 Limitations and the future work

Although our method is competitive with the state of the art, it has multiple assumptions in the formulation that can be relaxed for further improvements.

As mentioned in Section 3.3, we assume that the transformation of features with the change in viewpoint is independent of the activity. Though the simplification allows us to learn one transformation matrix per view change, it may not be a good approximation in some cases. Intuitively, the feature transformation depends on the activity of the person.
This assumption may be one of the reasons for our method's poor performance on the action category kick. Similarly, features from different parts of the body may not follow the same geometrical transformation. One way to deal with these limitations is to have a bank of transformations per change in viewpoint. To generate synthetic examples, we should then be able to choose one transformation on the fly based on the overall action descriptor.

We note that the improvement due to mocap feature synthesis is not consistent across all action categories (Figure 3.5), and its performance is not symmetric across view pairs (e.g., 1-0 vs. 0-1 in Figure 3.4). This discrepancy may be due to occlusion of the critical body parts (involved in the action) from a particular camera angle and relatively large overall occlusion of the body at extreme camera angles. A more rigorous understanding of the dependency of our results on action categories and viewpoints requires further exploration.

Additionally, we decided to use MDT because they can be generated efficiently from mocap. However, the underlying model for generating motion features in MDT is very simplistic. A more realistic human model and rendering may give better features for learning view-invariance.

Chapter 4

Video-Based Mocap Retrieval and Alignment

In this chapter, we present a technique to retrieve motion capture (mocap) files efficiently, using a short video query. Our retrieval approach generates a list of aligned mocap snippets, ranked by their similarity to the video. We define the similarity based on the human motion depicted in these sequences.

The above task requires us to establish a frame-level similarity metric between video and mocap. To this end, we explore a set of features that are comparable across these two modalities. Given the similarity measure, we can use different temporal alignment and retrieval techniques.
The first method we explore is based on cross-correlation, which can be computed efficiently in the Fourier domain and is thus well-suited for our application. However, matching similar but stylistically different actions in mocap and video may require temporal flexibility to align them. Therefore, we also experiment with different flexible alignment methods.

We thoroughly evaluate the effect of the two stages, feature extraction and alignment, on the retrieval accuracy. Since there is no publicly available dataset for such an evaluation, we propose a new benchmark: Video-based 3d motion Retrieval or V3DR. Our benchmark consists of realistic video queries as well as frame-level annotations for a large mocap database to measure the performance of video-based mocap retrieval and alignment.

4.1 Overview

Retrieving similar mocap sequences from a large database can be useful for character animation, 3d pose tracking [RSH+05], and cross-view action recognition [GMLW14]. In this thesis, we use retrieval for 3d human pose estimation in complex activities such as team sports, with only monocular video as input. We discuss this application in Chapter 5.

To connect human motion in video and mocap, we look for a common representation between the two. Many previous methods have used features based on silhouettes or edge-maps [RSH+05]. A silhouette can be easily extracted from a video with a static camera and a known background, while generating the same from a mocap sequence requires creating a mesh for the body, rigging it to the skeleton, and rendering each frame using a virtual camera. Although these simple features are useful when dealing with videos in laboratory settings, they are hard to obtain reliably for realistic videos with complex, unknown backgrounds. To address this problem, the likelihood of a 3d pose given an image can also be obtained using discriminatively-trained 2d pose detectors (as shown in [SSQTMN13, ARS10]).
Since these detectors (e.g., Flexible Mixture-of-Parts (FMP) [YR13]) are trained on realistic images, they tend to be more robust. We too partly base our similarity measure on the 2d pose estimate for each frame, as described in the next section.

The next stage of our retrieval and alignment algorithm involves temporal matching between the two sequences. The efficiency of the matching algorithm is crucial for scaling it to large datasets. Therefore, all the methods described in this chapter are linear-time (in the length of the sequence) algorithms. The final consideration is the viewpoint. We match the same mocap sequence as seen from multiple points of view. Thus, viewpoint estimation becomes a part of the retrieval process. We describe each of these ideas in detail in the following sections.

4.2 Video and mocap representation

[Figure 4.1 panel labels: mocap sequence; image sequence; 2d pose estimation; 2d projection; dense trajectories; mocap dense trajectories.]

Figure 4.1: Representing image and mocap sequences for retrieval. The mocap data and videos featuring human motion have related but complementary information. We bridge the gap between the two modalities by estimating the 2d joint locations from each video frame, and projecting 3d mocap joints for a viewpoint (right). We also generate optical flow based DT [WKSL11] from video and synthetic trajectories from mocap (see Section 3.2) to describe motion patterns in both the sequences (left). Hence, we have a pose as well as a motion-based representation for both the modalities that are comparable to each other.

A video featuring a person and a mocap sequence are complementary representations for human motion analysis. While videos have rich appearance information with a large variation in shape, clothing, and background, mocap sequences have accurate 3d pose. Hence, it is a challenge to compare them. We need features that can be computed efficiently for both video and mocap.
Additionally, they should be able to discriminate among various pose and motion patterns arising from human activities. In this work, we rely on features that have been shown to be effective for recognizing actions in videos (see Figure 4.1).

[Figure 4.2 panel labels: CMU-mocap joints; 2d pose joints. The joints are numbered 1 to 14 in both panels.]

Figure 4.2: Corresponding joints between the 2d pose estimate in an image and the CMU-mocap skeleton. We only use a subset of mocap joints. This mapping is similar to the one used by Zhou and De la Torre [ZDlT14] for matching DT to a motion model learned from mocap.

4.2.1 Trajectory-based motion feature

In the last chapter, we used Mocap Dense Trajectories (MDT) features as an idealized version of Dense Trajectories (DT) [WKSL11] features for cross-view action recognition. Here we use them to compare motion patterns between mocap and videos for retrieval. We synthesize trajectories for each file in the mocap database, as seen from 18 different viewpoints: elevation angle $\theta = \{\pi/4, 3\pi/8, \pi/2\}$ and azimuthal angle $\phi = \{0, \pi/3, 2\pi/3, \pi, 4\pi/3, 5\pi/3\}$. Each frame (per view) contains a different number of trajectories (each $\tau$ frames long). Each trajectory can be described by a $2\tau$-long feature vector by only keeping horizontal and vertical displacements over consecutive frames, normalized by the total length of the trajectory (as described in [WKSL11]). For each frame, we aggregate the trajectories terminating at the frame using Fisher Vector (FV) encoding [PSM10], because it has been shown to describe human actions better than Bag of Words (BoW) [OVS13]. Finally, following good practice [PSM10], we take the signed square root and L2 normalize the Fisher Vector to obtain one motion descriptor per mocap frame (in this case containing the information from the last $\tau$ frames).
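The per-trajectory descriptor and the final normalization step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in this thesis: the function names are hypothetical, a trajectory is assumed to be stored as $\tau + 1$ tracked points (so that it yields $\tau$ displacements), and the Fisher Vector encoding itself is omitted.

```python
import numpy as np


def trajectory_descriptor(points):
    """Displacement descriptor of a single tracked point.

    points: (tau + 1, 2) array of (x, y) positions; the extra point is an
            assumption of this sketch so that tau displacements result.
    Returns a vector of length 2 * tau.
    """
    points = np.asarray(points, dtype=float)
    disp = np.diff(points, axis=0)               # frame-to-frame (dx, dy)
    total = np.linalg.norm(disp, axis=1).sum()   # total length of the trajectory
    if total == 0.0:                             # static point: nothing to normalize
        return np.zeros(disp.size)
    return (disp / total).ravel()


def power_l2_normalize(fv):
    """Signed square root followed by L2 normalization of an encoded vector."""
    fv = np.asarray(fv, dtype=float)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv


track = np.array([[0, 0], [3, 4], [3, 4], [6, 8]])   # tau = 3 displacements
print(trajectory_descriptor(track))
print(power_l2_normalize([4.0, -9.0, 0.0]))
```

Dividing by the total path length makes the descriptor independent of the scale at which the motion is observed, which is what allows trajectories from rendered mocap and from video to be compared.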
We similarly obtain a corresponding descriptor for each video frame using DT features. Finally, we run PCA to reduce the dimensionality of both video and mocap frame descriptors to 128 dimensions.

One of the limitations of aggregating features is that we lose the information about the relative location of each feature. Also, trajectories do not have any semantic labels associated with the body parts. Therefore, we add further details using pose-based features.

4.2.2 Relational pose feature

We also use 2d relational pose features [MRC05] to describe each video and mocap frame. Relational features capture relative distances and orientations between all the body joints. These features are shift-invariant and robust to noise in the pose estimation [MRC05, YGG12]. Following [JGZ+13], we extract the feature for a frame by computing all pairwise distances and all pairwise orientations between joints, giving rise to a $\binom{15}{2} = 105$ dimensional vector each. All inner angles for all combinations of 3 joints are also concatenated ($3 \times \binom{15}{3} = 1365$ values) to obtain a 1575 dimensional vector. As suggested by Jhuang et al. [JGZ+13], we add motion information by appending the temporal differences of these features over the preceding few frames. We whiten and L2 normalize this feature vector along each dimension. Finally, we run PCA on each feature type separately and keep the same number of dimensions for each type to obtain a 128-dimensional vector.

To estimate 2d pose in each video frame, we use Flexible Mixture-of-Parts (FMP) [YR13]. For mocap sequences, a comparable representation can be obtained by projecting the corresponding 3d joint locations along multiple viewpoints. We use an orthographic projection and the same viewing angles as described above in Section 4.2.1. Since mocap usually contains more joints than most 2d pose estimates, we pick a subset of joints from mocap. Figure 4.2 shows the correspondence between the 2d pose in an image and the mocap joints.
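The dimension counts of the relational pose feature can be checked with a short sketch. It assumes 15 joints (the number implied by the binomial coefficients above) and makes two illustrative choices that are not taken from [JGZ+13]: the orientation of a joint pair is the angle of the connecting segment, and the three inner angles of each joint triple are the angles of the triangle they span. The function name is hypothetical, and the temporal differences, whitening, and PCA steps are left out.

```python
import itertools

import numpy as np


def relational_pose_feature(joints):
    """Static relational feature for one frame.

    joints: (J, 2) array of 2d joint locations (J = 15 gives 1575 values).
    Returns [pairwise distances, pairwise orientations, inner angles].
    """
    joints = np.asarray(joints, dtype=float)
    idx = range(len(joints))

    # Pairwise distances and orientations: C(J, 2) values each.
    diff = np.array([joints[b] - joints[a] for a, b in itertools.combinations(idx, 2)])
    dists = np.linalg.norm(diff, axis=1)
    orients = np.arctan2(diff[:, 1], diff[:, 0])

    # Three inner angles per joint triple: 3 * C(J, 3) values.
    angles = []
    for a, b, c in itertools.combinations(idx, 3):
        for apex, p, q in ((a, b, c), (b, a, c), (c, a, b)):
            u, w = joints[p] - joints[apex], joints[q] - joints[apex]
            denom = np.linalg.norm(u) * np.linalg.norm(w)
            cos = np.dot(u, w) / denom if denom > 0 else 1.0
            angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))

    return np.concatenate([dists, orients, np.array(angles)])


rng = np.random.default_rng(0)
pose = rng.random((15, 2))                    # stand-in for one 2d pose estimate
feature = relational_pose_feature(pose)
print(feature.shape)
```

Because every entry depends only on differences between joint positions, translating the whole pose leaves the feature unchanged, which is the shift-invariance mentioned above.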
Note that, in contrast to the 3d model, the 2d pose estimate does not distinguish between the right and the left body parts. However, the projection of the 3d joints can be made consistent by switching left and right labels of body parts when the 3d model faces away from the camera. We do this by checking the relative positions of the projected left and right shoulder joints.

We have made the code to generate relational pose features from the mocap sequences publicly available.

4.3 Retrieval and alignment

Given frame-level descriptors for video and mocap, our next task is to search a mocap database using a video as the query. For retrieval, we rank all the files based on their similarity with the video. In addition, for each mocap file we also find the portion of the mocap that best matches the query video. We refer to this problem as the alignment task. We explain how to measure the performance on these tasks in Section 4.4.1. First, we describe a few existing retrieval techniques and discuss their applicability to our problem.

Notation and the measure of frame similarity

We concatenate the pose and motion features (described in the last section) for each frame into a single vector of $d$ dimensions. Let $v$ be an $n$-frame video descriptor. We can obtain its matrix representation as $v \in \mathbb{R}^{d \times n} = [v^{1\top}, \ldots, v^{d\top}]^\top$ (in column notation) $= [v_1, \ldots, v_n]$ (in row notation). Here $v_i$ is the descriptor of frame $i$, and $v^i$ collects the $i$-th descriptor dimension over all frames. Similarly, we construct a database of the mocap descriptors using all the mocap sequences available in the dataset. Let $z_i \in \mathbb{R}^d$ be a mocap frame descriptor. We concatenate these for an $m$-frame long mocap file as $z = [z^{1\top}, \ldots, z^{d\top}]^\top = [z_1, \ldots, z_m] \in \mathbb{R}^{d \times m}$.

We use the dot-product, i.e., $v_i \cdot z_j$, as our measure of similarity between a video frame and a mocap frame. In the case of DTW-based methods (described in Section 4.3.2), we need a distance measure instead of a similarity. We use $(1 - v_i \cdot z_j)$ as our measure of distance.
Note that both the descriptors $v_i$ and $z_j$ are L2 normalized.

4.3.1 Retrieval with inflexible alignment

Assuming that the motions in two sequences are performed at the same speed, and the dot-product is a good measure of similarity, we can write the overall similarity between a video and a mocap sequence as cross-correlation

\[ s_\delta = \sum_{i=-\infty}^{\infty} v_{i-\delta} \cdot z_i \tag{4.1} \]

where $\delta$ represents the shift needed to align the two sequences. This similarity for all possible shifts can be written in the form of a 1-d cross-correlation along each dimension

\[ s(v, z) = \sum_{i=1}^{d} v^{i} \star z^{i} \tag{4.2} \]

where $\star$ is the cross-correlation operator. This expression can be very efficiently computed by taking the signals to the Fourier domain. Let $\mathcal{F}^{-1}(\cdot)$ be the inverse Fourier transform function and $V_i$, $Z_i$ be the discrete Fourier transforms of $v^i$ and $z^i$ respectively. Then

\[ s(v, z) = \mathcal{F}^{-1}\left( \sum_{i=1}^{d} V_i^{*} \odot Z_i \right) \tag{4.3} \]

where $\odot$ is the element-wise multiplication operator and $*$ denotes the complex conjugate. Finally, we take a max over $s(v, z)$ to obtain the matching score (used to rank the mocap sequence) as well as the best alignment between the mocap and the video.

Circulant Temporal Encoding (CTE)

CTE [RDC+13] adds a filtering stage to the computation of cross-correlation to ensure that the peaks in $s(v, z)$ are salient and well-localized in case of a good match. CTE has been shown to be effective for video retrieval and copy detection. Similar to cross-correlation, CTE can be computed very efficiently in the Fourier domain as

\[ s(v, z) = \mathcal{F}^{-1}\left( \sum_{i=1}^{d} \frac{V_i^{*} \odot Z_i}{V_i^{*} \odot V_i + \lambda} \right) \tag{4.4} \]

where the parameter $\lambda$ adjusts the regularization controlling the "peakiness" of the response. It also stabilizes the solution by making sure that the denominator in Equation 4.4 does not collapse to zero.

The original implementation of CTE removes high frequencies after DFT to compress the representation. CTE also uses Product Quantization (PQ) [JDS11] to reduce the memory requirement and speed up the similarity computation.
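Equations 4.3 and 4.4 can be illustrated with NumPy's real FFT routines. The following is a minimal sketch under stated assumptions rather than the implementation evaluated here: the function names are hypothetical, descriptors are stored as $d \times n$ and $d \times m$ arrays with one column per frame, both signals are zero-padded to length $n + m - 1$ so that circular shifts do not wrap around, and neither frequency pruning nor PQ is applied.

```python
import numpy as np


def alignment_scores(v, z, lam=None):
    """Similarity of query v (d x n) and database z (d x m) for every shift.

    Returns an array s of length L = n + m - 1 in which s[delta] is the score
    for a shift delta >= 0 and s[L + delta] is the score for delta < 0.
    lam=None gives plain cross-correlation (Equation 4.3); a positive lam
    gives the regularized CTE response (Equation 4.4).
    """
    n, m = v.shape[1], z.shape[1]
    L = n + m - 1                        # zero-pad so that shifts do not wrap around
    V = np.fft.rfft(v, n=L, axis=1)      # one DFT per descriptor dimension
    Z = np.fft.rfft(z, n=L, axis=1)
    terms = np.conj(V) * Z
    if lam is not None:
        terms = terms / (np.abs(V) ** 2 + lam)
    return np.fft.irfft(terms.sum(axis=0), n=L)


def brute_force_score(v, z, delta):
    """Direct evaluation of Equation 4.1 for one non-negative shift."""
    n, m = v.shape[1], z.shape[1]
    return sum(v[:, i - delta] @ z[:, i] for i in range(delta, min(delta + n, m)))


rng = np.random.default_rng(0)
z = rng.standard_normal((4, 12))
z /= np.linalg.norm(z, axis=0)           # L2-normalize every frame descriptor
v = z[:, 3:8].copy()                     # query = database frames 3..7
s = alignment_scores(v, z)
best_shift = int(np.argmax(s[:z.shape[1]]))
print(best_shift, s[best_shift])
```

In this toy setup the query is an exact copy of five database frames, so the score at the planted shift equals the number of query frames. Multiplying the CTE response by a very large $\lambda$ recovers the cross-correlation scores, since the denominator is then dominated by $\lambda$.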
In our case, we do not use frequency pruning or PQ to keep the CTE matching comparable to other techniques using a complete, uncompressed representation. For more CTE implementation details see [RDC+13].

4.3.2 Flexible alignment

The methods described above assume similar speeds of action in the query and the database. To allow for some temporal flexibility in alignment we also try flexible matching techniques based on Dynamic Time Warping (DTW).

Again let $v$ and $z$ be the query and the database sequence respectively. The function $\mathrm{dist}(i, j)$ returns the distance between the two frames $v_i$ and $z_j$ of the respective sequences. A warp can be described using a pair of functions $\phi(\cdot) = \{\phi_v(\cdot), \phi_z(\cdot)\}$ that maps aligned frame indices to the original $v$, $z$ frame indices. The distance between the warped sequences can be defined as

\[ d(v, z) = \sum_{i=1}^{T_\phi} \mathrm{dist}(\phi_v(i), \phi_z(i)) \tag{4.5} \]

where $T_\phi$ is the warped length after alignment. A warp path is the sequence of indices $(\phi_v(i), \phi_z(i))$, for $i$ from 1 through $T_\phi$.

Dynamic Time Warping (DTW)

In the original DTW formulation, the warp path is constrained such that we do not skip any frames and only move forwards in time during alignment, i.e., $(\phi_v(i+1), \phi_z(i+1)) - (\phi_v(i), \phi_z(i))$ equals either $(1, 0)$, $(1, 1)$, or $(0, 1)$ $\forall i$. This property enforces monotonicity and continuity of DTW's warp paths, which are needed for a well-behaved alignment of time-series (for details see [RJ93]). Another important consideration in DTW is the end-point constraint that can be written as

\[ \phi_v(1) = 1, \quad \phi_v(T_\phi) = n; \qquad \phi_z(1) = 1, \quad \phi_z(T_\phi) = m, \tag{4.6} \]

and forces the algorithm to use the full length of both sequences.

Given these constraints, DTW solves the minimization problem

\[ \phi^{*} = \arg\min_{\phi} \, d(v, z) \tag{4.7} \]

using dynamic programming to obtain the optimal warp path $\phi^{*}$. The distance function $\mathrm{dist}(\cdot)$ is used to fill a matrix $D \in \mathbb{R}^{n \times m}$, where each element $D(i, j)$ is the distance between $v_i$ and $z_j$ (see Figure 4.3 for an example $D$ matrix).
DTW aggregates the cost of matching subproblems in a cumulative cost matrix $C$ such that $C(n, m)$ returns the score of the optimal alignment between the original sequences. Figure 4.3 also shows the initialization and the update rules for filling matrix $C$. Note that, unlike CTE, the time and memory requirements of DTW scale quadratically in the sequence length ($O(n^2)$, assuming $n = m$). Even though DTW returns the lowest-cost warp path, because of the aforementioned end-point constraint it cannot localize a query within a larger database file. Therefore, in its original form, DTW cannot be applied to our task of retrieval and alignment.

Subsequence DTW

Müller [Mül07] presents a modified version of DTW, called Subsequence DTW (SS-DTW), that removes the end-point constraint. Instead of finding the lowest-cost alignment using both the sequences, we can compute

\[ \arg\min_{k, l, \phi} \, d(v, z_{k:l}), \tag{4.8} \]

where $z_{k:l} = [z_k, \ldots, z_l]$ is a subsequence of $z$, i.e., $m \geq l \geq k \geq 1$. The solution to this problem can also be obtained using a dynamic program, as shown in Figure 4.3. Note the changes, compared to DTW, in the initialization of the cumulative cost matrix and the calculation of the final score. Conceptually, this can be thought of as placing two "special frames", one at each end of the query sequence $v$, that can align with any extra frames in $z$ for a cost of 0.

Normalization by the warp path length

Although SS-DTW loosens the end-point constraint, it is biased towards matching shorter database subsequences. The increased warp path length due to choosing a longer subsequence (when multiple query frames match to one database frame) severely penalizes the SS-DTW objective. Since it is just as likely for a motion occurring in a database sequence to be slower than a motion occurring in a query sequence as it is the other way around, this asymmetry in flexibility is problematic. To get rid of this bias we consider a new objective that normalizes the distance by the warp path length

\[ \arg\min_{\phi} \sum_{i=1}^{T_\phi} \frac{1}{T_\phi} \, \mathrm{dist}(\phi_v(i), \phi_z(i)). \tag{4.9} \]

Note that $T_\phi$, the length of the warp path, is dependent on $\phi$. This problem is known as the normalized edit distance problem [MV93], which can again be solved using dynamic programming. However, in this case the dynamic program needs another dimension for path length, leading to a run time of $O(n^3)$ in the sequence length, which is not suitable in a retrieval setting.

[Figure 4.3 content recovered from the extracted text.
Distance matrix (D), rows: query frames of v (1 to 3), columns: database frames of z (1 to 6):
8 7 8 9 7 8
4 3 2 1 5 6
4 5 3 4 1 3
Cumulative cost matrix (C), DTW:
8 15 23 32 39 47
12 11 13 14 19 25
16 16 14 17 15 18
Cumulative cost matrix (C), SS-DTW:
8 7 8 9 7 8
12 10 9 9 12 13
16 15 12 13 10 13
Cumulative cost matrix (C), local normalization (SLN-DTW): first row 8 7 8 9 7 8 (remaining rows missing from the extracted text).
Initialization: DTW $C(1, 1) = D(1, 1)$; SS-DTW and SLN-DTW $C(1, j) = D(1, j)$.
Recursion formula: DTW and SS-DTW $C(i, j) = D(i, j) + \min\{C(i, j)^*\}$; SLN-DTW $C(i, j) = \min\left\{ \frac{D(i, j) + C(i, j)^* \, P(i, j)^*}{1 + P(i, j)^*} \right\}$.
Alignment score: DTW $C(T_x, T_y)$; SS-DTW and SLN-DTW $\min_j \{C(T_x, j)\}$.]

Figure 4.3: A toy example to illustrate various DTW-based flexible alignment algorithms. The grey boxes in the cumulative cost matrices $C$ represent the chosen warp path for the same distance matrix $D$ shown on the left. Each algorithm initializes some cells of $C$, fills the rest according to a recursion formula, and then chooses a final score dists(v,z) for the alignment. The set of indices $\{(i-1, j), (i-1, j-1), (i, j-1)\}$ is abbreviated as $(i, j)^*$. In the case of SLN-DTW, $P(i, j)$ is the length of the chosen normalized warp path for the subproblem up to $D(i, j)$.

Since this expensive normalization is not practical, we use a local normalization approach (SLN-DTW) [MGB09]. This $O(n^2)$ approximation ($O(n)$ when built on top of FastDTW [SC04]) works well in practice. Rather than building a three-dimensional cumulative cost array (over query frames, database frames, and the warp path length), SLN-DTW just keeps track of the path length (obtained greedily) so far in a separate matrix $P$. See Figure 4.3 for the update rule. Note that the initialization depicted here differs slightly from the original SLN-DTW initialization.
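The three recursions of Figure 4.3 can be written down compactly. The sketch below is illustrative only and is not the code used for the experiments: the function names are hypothetical, the distance matrix is the toy example from Figure 4.3, the path length $P$ is assumed to start at 1 in the first row, and ties between equally good predecessors are broken in favour of the shorter path. The optional `weights` argument corresponds to the slope weights of the SLN-DTW (+smooth) variant and is left at 1 here.

```python
import numpy as np

# Toy distance matrix D from Figure 4.3: 3 query frames (rows), 6 database frames (columns).
D_TOY = np.array([[8, 7, 8, 9, 7, 8],
                  [4, 3, 2, 1, 5, 6],
                  [4, 5, 3, 4, 1, 3]], dtype=float)


def predecessors(i, j):
    """Valid cells of (i, j)* = {(i-1, j), (i-1, j-1), (i, j-1)}, tagged with their position."""
    cells = [(i - 1, j), (i - 1, j - 1), (i, j - 1)]
    return [(k, a, b) for k, (a, b) in enumerate(cells) if a >= 0 and b >= 0]


def cumulative_cost(D, subsequence=False):
    """DTW (subsequence=False) or SS-DTW (subsequence=True); returns (C, score)."""
    n, m = D.shape
    C = np.empty((n, m))
    # DTW: C(1,1) = D(1,1), so the first row is a running sum.  SS-DTW: C(1,j) = D(1,j).
    C[0] = D[0] if subsequence else np.cumsum(D[0])
    for i in range(1, n):
        for j in range(m):
            C[i, j] = D[i, j] + min(C[a, b] for _, a, b in predecessors(i, j))
    return C, (C[-1].min() if subsequence else C[-1, -1])


def sln_dtw(D, weights=(1.0, 1.0, 1.0)):
    """Locally normalized subsequence DTW; returns (C, P, score).

    C holds the running average cost of the greedily chosen path and P its length.
    weights = (w10, w11, w01) are the slope weights; all ones disables smoothing.
    """
    n, m = D.shape
    C, P = np.empty((n, m)), np.ones((n, m))
    C[0] = D[0]
    for i in range(1, n):
        for j in range(m):
            options = [((D[i, j] + weights[k] * C[a, b] * P[a, b]) / (1.0 + P[a, b]),
                        P[a, b] + 1.0)
                       for k, a, b in predecessors(i, j)]
            C[i, j], P[i, j] = min(options)   # ties go to the shorter path
    return C, P, C[-1].min()


C_dtw, score_dtw = cumulative_cost(D_TOY)
C_ss, score_ss = cumulative_cost(D_TOY, subsequence=True)
C_sln, P_sln, score_sln = sln_dtw(D_TOY)
print(score_dtw, score_ss, score_sln)
```

On the toy matrix, the first two calls reproduce the DTW and SS-DTW cumulative cost matrices of Figure 4.3, with alignment scores 18 and 10 respectively. With unit weights, every entry of the SLN-DTW matrix is an average of the distances along one path, so it always lies between the smallest and the largest entry of $D$.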
Although this algorithm does not return an optimal path, as defined by Equation 4.9, it often finds a warp path very close to the best normalized path on our data (see Figure 4.4).

[Figure 4.4 panels (a), (b), and (c), each overlaying the Normalized DTW and SLN-DTW warp paths on a distance matrix.]

Figure 4.4: A comparison between exact normalization (as defined in Equation 4.9), using dynamic programming over a 3d cumulative cost array, and approximate normalization used in SLN-DTW [MGB09]. The corresponding distance matrix ($D$) is shown behind each warp path, with darker shades representing smaller distances. We show results using (a-b) 2 different video queries matched against a mocap sequence and (c) two sequences of length 100 containing random, L2-normalized 256-dimensional vectors. We notice that the approximate local normalization works fairly well in all three cases, while being several times faster. Note that we have kept the end-point constraint here for simplicity.

Smoothing warp paths

Since the original DTW objective is the sum of distances between warped frames of the two sequences, the warp path is regularized. Adding an extra frame in the warp path adds a positive value to the overall cost. But once we normalize the cost by path length, the objective has no preference for shorter path lengths. As a result, normalization leads to "jagged" warp paths. However, a smooth warp path is more desirable because it better describes the distortions due to variations in speed. To achieve smoothness, we use simple slope weighting [RJ93], which applies local constraints by associating small multiplicative costs to horizontal and vertical movements in the warp path. Thus, the final update rule becomes

\[ C(i, j) = \min\left\{ \frac{D(i, j) + w^{*} \, C(i, j)^{*} \, P(i, j)^{*}}{1 + P(i, j)^{*}} \right\} \]

where $w^{*}$ are weights $\{w_{10}, w_{11}, w_{01}\}$ for indices $(i, j)^{*} \equiv \{(i-1, j), (i-1, j-1), (i, j-1)\}$ respectively. We choose $w_{01} = w_{10} = 1.15$ and $w_{11} = 1$ for all our experiments. We refer to the resulting technique as SLN-DTW (+smooth).

4.4 Experiments

To quantitatively evaluate the retrieval performance of the different approaches described above, we propose a new benchmark called V3DR. Since it is hard to obtain ground truth 3d pose for the query videos, and since action category is known to be a good proxy for 3d motion [YGG12, YKC13], we define similarity between a video query and the retrieved sequence using their action labels. In other words, we assume that each action is defined by a series of poses over time.

4.4.1 V3DR: Video-Based 3D Motion Retrieval Benchmark

The benchmark contains a series of video queries, a database of 3d motion sequences (mocap), ground truth annotations and two protocols for evaluation.

We select two sets of queries: videos captured in a controlled environment, as well as a more challenging set of videos downloaded from YouTube. To factor out occlusion and camera motion we choose videos where the full body is visible and there is little or no camera motion. The videos are typically short snippets of 1 to 4 seconds in length. Figure 4.5 shows some of the frames from example video sequences.

Figure 4.5: Example queries from YouTube videos provided in the V3DR dataset. Notice the realistic clothing and backgrounds that are not typical in videos collected in a laboratory setting.

Another set of video queries comes from the IXMAS dataset [WRB06]. IXMAS consists of short video sequences where 10 actors perform simple actions in a laboratory setting. This dataset is typically used to evaluate cross-view action recognition, and contains videos captured from 5 distinct camera views (see Figure 3.3). While most of the YouTube videos have similar camera elevations, these video snippets from IXMAS add variation in viewpoint. The queries are picked randomly from cameras 0-3, excluding camera 4 due to its low elevation angle (measured with respect to the positive z-axis).

Both sets contain the following 8 classes: sit down, get up, turn, walk, punch, kick, pick up and throw overhead.
We choose these classes because they are commonly featured in the mocap dataset that we use as our database. Each group of queries has 20 videos per action, with a single person performing the action. This amounts to a total of 160 queries per set. The dataset also provides manually annotated bounding boxes around the person in each video.

Mocap Database

We use the CMU-mocap dataset [cmu], the largest publicly available mocap database that we are aware of. From the 2549 sequences, we kept the 2000 shortest, which results in around 4.5 hours of 3d human motion. To make mocap and video data comparable, we sub-sample each mocap sequence to 24 fps.

Ground truth

We annotated the mocap sequences using the same action labels as the queries. Since a mocap sequence may contain more than one action performed in sequence (e.g., walk, turn, sit-down), we annotate each file per frame. These annotations are not necessarily temporally exclusive (e.g., a person might walk and turn at the same time, and so some frames may have both of these labels). 976 sequences were annotated with at least one class, leaving 1024 sequences not having any of the above action classes. Figure 4.6 illustrates an example annotation from V3DR.

[Figure 4.6: mocap frames 25 to 65 (every fourth frame), with the action labels Walk, Pick-up, Turn, and Walk shown below them.]

Figure 4.6: An example of mocap annotations provided in the V3DR benchmark. At the top, we show a few frames of a mocap sequence. The corresponding action labels are shown below. Note that the annotations are not temporally exclusive. As shown above, one frame can have multiple labels. We use these annotations to evaluate video to mocap alignment.

Evaluation metrics

Given a query, we retrieve the top N mocap files based on their similarity to the video. We choose two different metrics to evaluate performance on V3DR, which correspond to different use cases of video-based mocap retrieval. The first evaluation protocol is called detection modality.
In this modality, a retrieved example is counted as a true positive if it contains the action featured in the video.

In the second protocol, localization modality, retrieval is expected to produce both a ranking and a frame number that localizes the action in each 3d sequence in time. A retrieved 3d sequence is counted as a true positive if it contains the queried action and its localization is correct. To evaluate localization, we compare the query label against the ground truth mocap annotation at the frame number returned by the algorithm.

We evaluate our detection modality using mean Average Precision (mAP), which is defined as the mean of the APs over all queries, and serves as a single number to evaluate performance per class. We evaluate our localization modality using recall@N curves [JDS11]. For each query, we plot the number of true positives in the top N retrieved sequences over the total number of positive examples in the database. This results in a monotonically non-decreasing curve for increasing N. We show the average of these curves over all queries.

[Figure 4.7: two plots of recall against N (logarithmic axis from $10^0$ to $10^3$); curves: Motion + Pose, Motion-only, Pose-only.]

Figure 4.7: Recall for different feature types. For the same descriptor length, relational pose features significantly improve recall over trajectory-based motion features. In addition, a concatenation of pose and motion features performs comparably or better than the individual features alone. The improvements in recall using Motion + Pose in the case of realistic data (YouTube) indicate that when the 2d pose estimation is not reliable the motion information can be more useful.

4.4.2 Results

The goal of the evaluation is to study the effect of features as well as matching techniques on the retrieval performance. We also show qualitative results to illustrate strengths and weaknesses of different retrieval algorithms.
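Before turning to the individual results, the two metrics of Section 4.4.1 can be made concrete with a small sketch. It is not the evaluation code of the benchmark: the function names are hypothetical, each ranked result is assumed to be already marked as a true positive or not under the chosen protocol, and Average Precision follows the common definition (precision averaged over the ranks of the true positives, divided by the number of positives in the database), which is an assumption since the definition is not spelled out above.

```python
import numpy as np


def recall_at_n(is_positive, n_positive):
    """Recall after each of the top-N results, for N = 1..len(is_positive).

    is_positive: ranked list of booleans, True where the retrieved file counts
                 as a true positive under the chosen protocol.
    n_positive:  number of positive examples in the whole database.
    """
    return np.cumsum(np.asarray(is_positive, dtype=bool)) / n_positive


def average_precision(is_positive, n_positive):
    """Precision averaged over the ranks at which true positives occur."""
    hits = np.asarray(is_positive, dtype=bool)
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision[hits].sum() / n_positive)


def mean_average_precision(ranked_lists, n_positives):
    """Mean of the per-query APs."""
    return float(np.mean([average_precision(r, p)
                          for r, p in zip(ranked_lists, n_positives)]))


ranking = [True, False, True, False]      # toy ranked list for one query
print(recall_at_n(ranking, n_positive=2))
print(average_precision(ranking, n_positive=2))
```

Averaging the arrays returned by `recall_at_n` over all queries gives curves of the kind plotted in the recall figures of this section.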
Note that, to report alignment in the case of cross-correlation (CC) and CTE, we return the mocap frame matching the central video frame of the query; for SS-DTW and SLN-DTW we use the mocap frame at the middle of the warp path.

Effect of features

First, we study the performance of the pose and motion features described in Section 4.2 (see Figure 4.7). We use the same matching algorithm (cross-correlation or CC) for all the feature types and keep the total number of dimensions of the descriptor fixed to 256. We use the localization modality and plot recall for different values of N (number of retrieved examples). We observe the same trend in the detection modality, as shown in Table 4.1 (see columns corresponding to CC). Since the concatenation of motion and pose descriptors performs better than the individual features, we use Motion + Pose descriptors for all the following experiments.

[Figure 4.8: two plots of recall against N (axis ticks at 10, 100, 500, 1000); curves: CTE with $\lambda = 0.001$, CTE with $\lambda = 0.1$, CTE with $\lambda = 10$, and CC.]

Figure 4.8: We compare the performance of the cross-correlation (CC)-based retrieval with CTE for different values of the regularization parameter ($\lambda$). We plot recall, averaged over all queries, for different values of N (number of retrieved examples). Note that the recall improves for increasing $\lambda$, but CC performs better on most N for both the query sets. This observation indicates that the regularization provided by CTE does not help the retrieval performance in our case.

Retrieval and matching performance

We compare the performance of the two inflexible alignment techniques, cross-correlation (CC) and Circulant Temporal Encoding (CTE), in Figure 4.8. For CTE, we try different values of the regularization parameter $\lambda$. We observe that though CTE has proved to be effective for copy-detection in videos, CC works better for our retrieval task. Higher values of $\lambda$ dominate the denominator in Equation 4.4, which weakens the effect of the filtering stage, so that the response eventually approaches a scaled version of CC.
Henceforth we use CC as our inflexible retrieval technique and do not include CTE in the results.

Next we present a detailed analysis of our retrieval results for the flexible alignment methods, SS-DTW and SLN-DTW. Figure 4.9 shows the results on the localization modality, and Table 4.1 displays mAP values for the detection modality. In both cases, we average the results over IXMAS and YouTube. We add Gupta et al. [GMLW14] as a baseline to compare our techniques with the first published method on this task. Gupta et al. use only motion features (aggregated using BoW instead of Fisher Vector), and retrieve mocap using CTE. This baseline serves to show the cumulative improvements based on the techniques suggested in this chapter. A significant gap between the ideal performance (the black dotted line) and our best methods points at the difficulty of the mocap retrieval task. We note that SLN-DTW (+smooth) outperforms other matching approaches overall. However, the results are not consistent across different action classes. This result points to the complementary nature of these methods and requires further exploration.

Finally, we present confusion matrices to show commonly confused classes for the two best performing retrieval approaches, SLN-DTW (+smooth) and SS-DTW, in Figures 4.10 and 4.11. These matrices are different from the confusion matrix commonly used to evaluate multi-class classification. Here, instead of classification accuracy or error, we show the recall at N = 100 with different classes as retrieval targets. Therefore, each row does not sum to 1. We note that it is challenging to distinguish examples of actions such as turn, punch, and throw. Also, get up and sit down are commonly confused with pick up.

Qualitative results

We demonstrate the quality of our recovered alignments in Figures 4.12 and 4.13. A few successful top matches for the considered methods (CC, SS-DTW, and SLN-DTW) are shown in Figure 4.12.
We note that the flexible methods lead to betteralignment in the illustrated examples. The error cases are shown in Figure 4.13. Weobserve that although actions involving the full body or a significant change in poseare relatively easy to match, activities such as throw remain hard to disambiguate.4.5 DiscussionIn this chapter, we have presented an efficient method for alignment and distancecomputation between a short video query and a mocap sequence. We also for-malize the problem of video-based mocap retrieval by introducing a challengingbenchmark, V3DR, and by proposing metrics for quantitative evaluation on the54100 101 102 10300. downRecallN100 101 102 10300. upRecallN100 101 102 10300. 101 102 10300. 101 102 10300. 101 102 10300. 101 102 10300. upRecallN100 101 102 10300. overheadRecallN  SS−DTW SLN−DTW + smooth CC Gupta et al.Figure 4.9: Recall on the localization modality of the video-based mocap re-trieval benchmark averaged over all YouTube and IXMAS queries. Theblack dotted line depicts the ideal recall curve, and the magenta dottedline shows recall for randomly retrieved examples. These two curvesact as the upper and the lower bound on the performance for each class.All matching techniques use motion + pose features except Gupta etal. [GMLW14]. Note that, flexible matching techniques (SS-DTW andSLN-DTW) perform better on most classes.550.2970.0070.0080.0010.0120.0130.1540.0050.0080.3200.0080.0040.0070.0110.0250.0150.0100.0080.0800.0240.0070.0220.0110.0160.0390.0270.0450.1010.0070.0320.0220.0150.0190.0270.0540.0060.2150.0980.0130.0420.0170.0200.0210.0070.0290.0730.0240.0290.1630.1680.0030.0020.0180.0360.4850.0570.0010.0040.0240.0030.0720.0410.0060.065sit downget upturnwalkpunchkickpick upthrow ohsit downget upturnwalkpunchkickpick upthrow ohFigure 4.10: The confusion matrix for average recall (at N = 100) over allqueries using SLN-DTW + smooth for retrieval with pose + motionfeatures. 
Each row shows the recall@N for the queries from the action category, and columns depict the target category used to calculate recall. Therefore, the diagonal corresponds to the curves shown in Figure 4.9. Note the confusion of sit down and get up with pick up. This is due to the visual similarity of these actions. We also note that the categories with isolated body movement, e.g., kick and throw overhead, are much harder to retrieve reliably. Also, the category turn is challenging to distinguish, possibly because of the subtle change in the pose during the action compared to a significant change in the case of sit down, get up, and pick up.

#  Action     # ex.  Chance  [GMLW14]  CC (T)  CC (P)  CC (P+T)  SS (P+T)  SLN + smooth (P+T)
1  Sit down      30   0.015     0.038   0.043   0.172     0.162     0.169               0.155
2  Get up        56   0.028     0.076   0.076   0.231     0.225     0.203               0.219
3  Turn         194   0.098     0.148   0.161   0.229     0.222     0.195               0.230
4  Walk         739   0.373     0.598   0.624   0.596     0.666     0.585               0.669
5  Punch         13   0.007     0.032   0.027   0.033     0.054     0.070               0.059
6  Kick          23   0.012     0.017   0.023   0.015     0.022     0.042               0.027
7  Pick up       76   0.038     0.147   0.177   0.335     0.352     0.392               0.438
8  Throw oh.     17   0.009     0.015   0.019   0.027     0.023     0.035               0.027
   mAP                0.072     0.134   0.144   0.205     0.216     0.211               0.228

Table 4.1: Per-class and overall mean Average Precision (mAP) on the detection modality of the video-based mocap retrieval benchmark. We show an average performance using both IXMAS and YouTube queries. # ex. is the number of files in the database containing the given action. Chance corresponds to the expected performance of uniformly random retrieval. We highlight the best value in each category with boldface and underline the second best value. Again, we observe that using pose + motion (P+T) significantly improves the retrieval performance over motion-based features (T). Also, the flexible alignment techniques (SS-DTW and SLN-DTW) perform the best on most action classes in comparison to cross-correlation (CC).

task. V3DR provides frame-level action annotations for around 400 000 frames of the CMU-mocap dataset.
Video queries with variety in viewpoints, realistic clothing, and backgrounds are also provided as a part of the benchmark. We hope that V3DR will encourage further research in video-based mocap retrieval.

We have also shown that DT and relational pose features, previously used for action recognition, are also effective for human motion retrieval. This finding allows us to use DT along with any state-of-the-art 2d pose detector trained on images (rather than videos) to retrieve similar mocap sequences using videos. Our approach does not depend on any additional training data such as synchronized video and mocap examples. This feature is important because acquiring such data in a realistic setting remains a challenge.

[Figure 4.11 matrix: 8 × 8 grid of recall@100 values over the classes sit down, get up, turn, walk, punch, kick, pick up, throw oh. Diagonal entries in class order: 0.352, 0.313, 0.071, 0.074, 0.227, 0.112, 0.461, 0.079.]

Figure 4.11: The confusion matrix for average recall (at N = 100) over all queries using SS-DTW for retrieval with pose + motion features. We observe a trend very similar to that shown in Figure 4.10.

4.5.1 Limitations and future directions

As seen in the qualitative results, we can obtain the matched viewpoint for the retrieved mocap snippet. One of the main limitations of V3DR is that it does not yet measure the accuracy of a viewpoint prediction.
In the future, we would like to extend V3DR to include viewpoint annotations, to allow evaluation of the matched viewpoint.

Also, the mocap retrieval for action categories such as punch and throw is challenging and often results in poor alignments for the top retrieved examples. Since our features are completely hand-crafted and are not trained to discriminate between these categories, they fail to adequately distinguish the isolated body part movements and subtle changes in pose. Therefore, our approach leads to poor retrieval in some cases. To improve feature representation for retrieval, we

[Figure 4.12 images: three query sequences with the aligned mocap frames for each method. Rank of the displayed match: top example CC 1, SS-DTW 1, SLN-DTW + smooth 1; middle example CC 3, SS-DTW 1, SLN-DTW + smooth 1; bottom example CC 8, SS-DTW 1, SLN-DTW + smooth 1.]

Figure 4.12: A few representative alignments for YouTube videos (best viewed in color). The query frames and corresponding frames of the retrieved mocap sequences are shown (right limbs are marked in red). For each retrieval algorithm, we display the top ranked true-positive. Top: Walking sequences are relatively easy to match. All the algorithms perform well on this example. Middle: In this pick up sequence, the flexible matching algorithms can capture the bend down and get up movements. However, CC only aligns with the final get up movement. Bottom: Again, in this kick sequence, we get a better alignment using the flexible matching techniques.

[Figure 4.13 images: two query sequences, each with the matches ranked 1, 2, and 3 and the top-ranked true positive (rank 151 in the top example, rank 31 in the bottom example).]

Figure 4.13: Some of the typical error cases for video-based mocap retrieval. We use SLN-DTW with Pose + Motion features for both the examples. We show the top three aligned matches along with the top-ranked true positive inside the green box. Top: In this case the throw overhead action is best matched to a dance move where the person has their arm lifted, similar to the query video. Bottom: The query comes from a turn sequence. Here the top ranked sequences are again dancing and walking.
In both cases, we do find the appropriate matches, but they are poorly ranked.

can either perform metric learning on our hand-crafted features (similar to Ren et al. [RSH+05]) or reuse deep learning-based descriptors trained on a similar task such as action recognition [KTS+14]. Both of these directions can potentially improve features and lead to a significant gain in the retrieval performance.

Another related problem is that we discard the uncertainty in the 2d pose estimation to evaluate our pose descriptor. We can potentially use the heat-map of different body joint locations, instead of the final MAP estimate that we currently use. Incorporating uncertainty of observation in a retrieval pipeline is another interesting direction to explore.

Finally, our evaluation relies on frame-level action annotations of mocap sequences. Since actions depend on the context and the interaction with other objects, these labels can be ambiguous, e.g., for a mocap sequence, the activity of cleaning a window and waving can be easily confused and mislabeled. A 3d-to-3d matching score between mocap sequences (such as [MRC05]) can be a more objective measure of similarity than discrete labels. Also, we can use a similarity measure between mocap sequences to mine examples with similar motion, and speed up the labeling process.

Chapter 5
Localized Motion Trellis

In this chapter we introduce the Localized Motion Trellis (LMT), a novel non-parametric approach to 3d pose estimation from monocular RGB video. Our focus is on generating realistic 3d pose and motion without knowing the human activity label at each time instance in the video. Given a video sequence featuring a person as input, we first retrieve snippets with a similar motion from a large collection of mocap files. We efficiently combine these short motion exemplars into a continuous 3d pose sequence that best explains the image evidence.
We demonstrate our approach by estimating articulated 3d motion of players in a challenging sequence from a broadcast sports video. Since we only use real motion examples, the resulting poses are anthropomorphic, and the overall motion is realistic.

The LMT distinguishes itself from the state-of-the-art methods for 3d pose estimation from monocular video in that it does not require action-specific pose or motion priors, and uses exemplar-based motion synthesis as a model to estimate human pose. As a result, video or mocap used for training does not need any action labels. This property is desirable because in the case of complex activities such as sports, annotators need to have deep domain knowledge, making it cumbersome and expensive to get labeled examples. Additionally, since no labels are needed, we can use a much larger set of examples. For instance, we use 2000 files from the CMU mocap dataset (approximately 4.5 hours of mocap) for our 3d pose estimation pipeline.

5.1 Overview

In the last chapter we introduced a method for retrieving mocap snippets given a short monocular video as a query, which can also be thought of as example-based 3d pose estimation from video. However, for longer sequences (more than a few seconds), the same approach cannot be applied. As the length of the video increases, it becomes harder to find a similar sequence in mocap. One way to deal with this problem is to divide the video into overlapping chunks and retrieve examples for each short clip. We can later combine the retrieved mocap sequences using interpolation. Analogous to Tracklets (e.g., as used in [ZLN08]) for object tracking, the top retrieved examples can serve as a mid-level representation of the full motion over the video. However, the best match for each video snippet may not always emerge as the top match due to a) noise in the features used for retrieval, and b) the inherent ambiguity in recovering 3d motion from a 2d image sequence. We address this problem using context.
Apart from the spatial structure of the body (often represented as a tree), there is a high temporal structure to human motion. In this chapter we show that the LMT can be used to model longer term temporal context (e.g., a person running does not stop suddenly) and spatial continuity over time (e.g., the body joints move in smooth trajectories) to help us recover a consistent and realistic output sequence.

Given a video sequence, we divide it into overlapping subsequences. Using each of these subsequences as a query, we search for similar examples in a mocap database (as described in Section 4.3). The LMT allows us to model continuity between neighboring mocaps to build a time-forwards graph. We then construct a minimum energy path through this graph using dynamic programming to generate a 3d pose sequence similar to that seen in the input video (see Figure 5.1).

5.2 Localized Motion Trellis

First, we divide the input video sequence into k-frame chunks with p overlapping frames between neighboring chunks. For each of these subsequences, we calculate a combination of pose and motion features (as described in Section 4.2) for retrieval. We set cross-correlation (CC) as our matching algorithm for retrieval (Section 4.3.1) because of its simplicity and competitive performance with respect to DTW-based methods.

[Figure 5.1 images: panels (a) to (e).]

Figure 5.1: Localized Motion Trellis (LMT). (a) The input is a monocular video sequence. (b) We use each overlapping subsequence to (c) search a large collection of mocap files for similar motion sequences. (d) The retrieved 3d snippets are connected in time to form a trellis graph. (e) The minimization of energy over this graph produces a smooth 3d output that best explains the image evidence. Being a model-free method, the LMT can estimate 3d motion in sequences with multiple activities, which overcomes one of the major limitations of current approaches.

In the last chapter, for each query we only matched one example per mocap file.
This restriction is not required in the case of the LMT. The matches can come from a single file as long as they are not the same. Therefore, for the LMT we keep our retrieval algorithm the same, except that we allow retrieving multiple non-overlapping matches per file.

We build a trellis graph with the top-N retrieved 3d snippets as nodes and connect all neighboring nodes with directed edges (forward in time). Since each node is temporally aligned to the video input, we call this graph the Localized Motion Trellis or LMT. The weights in an LMT consist of both unary and binary terms; the former are meant to enforce a high similarity with the image evidence and the latter encourage smooth transitions over time. The resulting graph is similar to a Motion Graph (MG) [KGP02] typically used in Graphics to synthesize human motion. Before describing different energy terms in the LMT, we highlight some of its key differences from MGs:

• The MG and the LMT both have short mocap snippets as nodes, and the connection between nodes describes the transition from one node to the next. However, in the case of MGs, the graph is generated before the user input is observed, and, to limit the number of connections, a threshold is used on the node similarity to decide if an edge should be added. In the case of the LMT, the graph is generated on the fly based on the input. The retrieval step selects the nodes to include in the graph, and no threshold is required to limit connectivity.

• Another significant difference lies in the search strategy. The search in an MG finds a path in the graph that maximizes the objective defined by the user. However, the search can become cumbersome when the graph is large. Therefore, often an MG is constructed using small carefully chosen examples. In contrast, we restrict our method to a time-forwards graph, with directed edges and no loops, which can be searched efficiently.
Also, retrieval acts as a filter to limit the number of nodes, allowing us to use a larger mocap database to construct an LMT.

• Another subtle but crucial point is transitions. For an LMT, the transitions are constrained because each retrieved snippet corresponds to a user input frame by frame. In the case of an MG, since the graph is generated a priori, the allowed transitions in the graph may mismatch with the user input. To allow for some flexibility Ren et al. [RSH+05] insert a few nodes and extra edges to construct an augmented motion graph. However, any addition of nodes and edges leads to increased complexity of the search. Also, for MGs the generated graph needs to be pruned to avoid a few undesirable cases where the search cannot progress, e.g., nodes with no outward edges.

5.2.1 Unary terms: accounting for image evidence

To evaluate the likelihood of a retrieved mocap snippet given the video subsequence, we investigate three unary terms. We calculate these terms for each node in the graph.

Matching error

We define the matching score as the normalized sum of the per-frame similarity of the video and mocap sequences, where an aggregation of features represents each frame. These features are the same as the ones used for retrieval (described in Section 4.3). Formally the matching error can be written as

    E_m = K - \frac{1}{p} \sum_{i=1}^{p} \langle v_i, z_i \rangle,    (5.1)

where v_i and z_i represent the aggregated features of the aligned video and 3d motion sequences over p frames. K is a positive constant to convert the matching score into an error measure while keeping the error positive.

2d pose error

To evaluate the quality of the predicted pose, we calculate the distance between the joints of the 3d sequence projected onto the corresponding image and the output of a 2d pose detector (the FMP [YR13]) on the image. Let f_i be all the 2d joint locations returned by FMP over p frames and g_i be the location of the corresponding joint i in a 3d sequence.
The 2d pose similarity error is given by

    E_{2d} = \frac{1}{m} \sum_{i=1}^{m} w_i \cdot \| f_i - p_{\theta,T}(g_i) \|_2^2,    (5.2)

where m = 14 × p, with 14 being the number of common joints between the model of FMP and the CMU-mocap dataset, and the weights w_i assign importance to each joint, accounting for the fact that some 2d joint predictions are more reliable than others. The function p_{\theta,T}(·) projects the 3d sequences to the image plane, as seen from viewpoint θ (note that θ is known from retrieval) under orthographic projection. T is the translation that best aligns the projection of the mocap frame to the FMP output by aligning the centroid of the hip and the shoulder joints. Since we already use hips and shoulders to align skeletons, we do not use them for error computation, i.e., we set w_i = 0 for the corresponding joints. We also exclude head and neck joints as they move very little about the torso. All other joints are weighted equally except the wrist, which has the relative weight 0.5. We make this choice based on the observation (see Figure 5.4) that wrists are often poorly localized using 2d pose detectors, as compared to other joints.

Path error

We borrow this score from the original Motion Graph (MG) formulation [KGP02]. In MGs, the motion is synthesized by indicating a path on the ground, and the algorithm walks the MG minimizing the difference between the hip projection of the character and the path specified by the user. In our case, the path on the ground can be approximated using a tight bounding box around the person in each frame and the homography of the scene, assuming a flat ground (more details in Section 5.3.2).
Formally, the path error is given by the average squared distance between the points on the player path and the projection of the 3d model center onto the ground. We align the centroids of the two sets of points before calculating the error:

    E_p = \frac{1}{p} \sum_{i=1}^{p} \| t_i - m_i \|_2^2,    (5.3)

where t_i is the point on the player path and m_i is the corresponding projection of the 3d model center on the ground plane at frame i.

5.2.2 Binary term: ensuring contextual output

The binary term of the LMT encourages a smooth 3d output in space and time. Unlike the unary terms, the binary term defines errors on the overlapping frames of retrieved 3d snippets from neighbouring video queries.

Transition error

This term is identical to the transition error used in MGs [KGP02]. Let x_i and x'_i be the 3d location of the joint i in the overlapping frames of two neighboring nodes. The transition error is given by summing the error over all m joints over the overlapping frames

    E_t = \frac{1}{m} \sum_{i=1}^{m} \omega_i \cdot \| x_i - q_t(x'_i) \|_2^2,    (5.4)

where q_t(·) is the translation function corresponding to the best alignment of the two 3d snippets, parametrized by a translation vector t. Since we use the same subset of joints as in the 2d pose error, m = 14 × p. ω_i is the weight indicating the importance of each joint. We set equal weights for all joints except the elbow and the wrist, with relative weights 0.5 and 0.25 respectively, to avoid over-penalizing the transition error due to the localization error for these joints. Also, we set ω_i = 0 for the head because with the LMT we hope only to capture overall coarse body movement instead of subtle details such as the movement of the head or toes.

5.2.3 Search and 3d pose estimation

The final expression for the overall unary energy is given by

    E_u = w_m E_m + w_{2d} E_{2d} + w_p E_p,    (5.5)

which acts as the node potential. w_{2d}, w_p, and w_m are the weights for 2d pose error, path error, and matching error respectively. Similarly, we weigh the binary term with w_t.
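As an illustration of how a trellis with weighted node energies E_u and edge energies w_t E_t can be searched, the following Python sketch runs a Viterbi-style dynamic program over a toy trellis. It is only a minimal sketch under stated assumptions: the function name min_energy_path, the list-of-arrays layout, and the toy numbers are hypothetical and are not taken from the thesis implementation.

```python
import numpy as np


def min_energy_path(unary, binary, w_t=1.0):
    """Minimum-energy path through a time-forward trellis (hypothetical helper).

    unary:  list of T sequences; unary[s][i] is the weighted node energy E_u
            of candidate snippet i for video chunk s.
    binary: list of T-1 matrices; binary[s][i][j] is the transition error E_t
            between candidate i of chunk s and candidate j of chunk s+1.
    w_t:    weight of the binary term.
    Returns (indices of the chosen candidate per chunk, total energy).
    """
    cost = np.asarray(unary[0], dtype=float)
    backpointers = []
    for s in range(1, len(unary)):
        # total[i, j]: best energy of a path ending at i in chunk s-1,
        # plus the weighted transition from i to j.
        total = cost[:, None] + w_t * np.asarray(binary[s - 1], dtype=float)
        backpointers.append(np.argmin(total, axis=0))
        cost = total.min(axis=0) + np.asarray(unary[s], dtype=float)

    # Backtrack from the cheapest final node.
    path = [int(np.argmin(cost))]
    for bp in reversed(backpointers):
        path.append(int(bp[path[-1]]))
    path.reverse()
    return path, float(cost.min())


# Toy trellis: 3 chunks, 2 candidates each (made-up numbers).
toy_unary = [[1.0, 5.0], [4.0, 1.0], [2.0, 2.0]]
toy_binary = [[[0.0, 10.0], [10.0, 0.0]], [[0.0, 10.0], [10.0, 0.0]]]
smooth_path, smooth_energy = min_energy_path(toy_unary, toy_binary, w_t=1.0)
greedy_path, greedy_energy = min_energy_path(toy_unary, toy_binary, w_t=0.0)
```

In this toy case the transition cost keeps the path on the first candidate throughout, whereas with w_t = 0 the search simply picks the cheapest node per chunk. The cost of the search grows linearly with the number of chunks and quadratically with the number of candidates per chunk.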
The binary term is the potential for the LMT edges and is computed for all pairs of neighbouring nodes. Subsequently, we find a path through this graph with the smallest sum of these weighted energies. Since the LMT uses a trellis graph, we can efficiently calculate the minimum energy path using dynamic programming.

The chosen path contains a sequence of 3d mocap snippets. Note that retrieval also returns an azimuthal orientation for each of these snippets. We bring all the snippets into a single global reference frame and stitch them together into one 3d pose sequence. First, we align the returned 3d snippets using translation. To align two neighbouring sequences we ensure that the means of the joint locations in the overlapping frames coincide. Once all the sequences are aligned, we interpolate body joint rotations (using the quaternion space) in overlapping regions to smooth the transition from one sequence to the next.

5.3 Experiments

We demonstrate the task of synthesizing players' articulated 3d movements in a monocular sports video sequence. We choose the domain of team sports because players often perform multiple activities (e.g., running, walking, turning, shooting) one after the other, a scenario that has not been well-explored in the previous literature. Also, it is challenging to annotate these video sequences with either 3d pose or action labels per frame. Thus, relying only on the unlabeled data, our approach is particularly well-suited for such an application.

5.3.1 Data and evaluation

We choose a 500 frame sequence from a professional basketball game, taken from a broadcast video. We have chosen the sequence such that there are no scene transitions (i.e., it is one continuous shot). Also, since it is a wide angle shot, all the players are visible in the frame throughout the sequence.

Since it is not possible to reliably annotate 3d pose on a monocular video, we annotate each frame with the 2d pose for evaluation.
Instead of evaluating 3d pose error, we project the 3d pose onto the image and compare it to the 2d ground truth. To obtain 2d pose ground truth, we manually label 14 key body joints for all the players in each video frame.

We show the quantitative results for the pose error using the normalized Percentage of Correct Parts (PCP) as defined by Sapp et al. [ST13], where the distances are normalized such that the torso is 100 pixels tall. There is one key difference in our evaluation to make it more reliable: our 2d pose annotations distinguish between the left and the right body part. Unlike other 2d pose evaluations, our localization for each joint counts as correct only if it falls within the distance threshold of the body part and matches the correct side of the body.

[Figure 5.2 images: panels (a), (b), and (c); panel (a) marks a player height of 132 px.]

Figure 5.2: Some of the common challenges with broadcast team sports videos. a) Even in the case of a high-definition (HD) video, the player height in pixels is often less than 150 pixels in a wide-angle shot. b) There are motion blur artifacts due to the camera motion required to follow the game. c) Also, severe occlusions are common in team sports.

Since our approach focuses on estimating 3d articulated motion and not on localization, we choose the best translation of our predicted 3d pose before computing PCP. We use the mean of the left and the right hip of the ground truth 2d pose to place the projection of the estimated 3d pose onto the video.

Assumptions

We make some simplifying assumptions. First, we assume that the players have been successfully tracked, and that a tight bounding box is provided for each player per frame. In our experiments, we use bounding boxes obtained using manual labeling.
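The evaluation protocol described in Section 5.3.1 (hip-based placement, torso normalization to 100 pixels, and a per-joint distance threshold) can be sketched in Python as follows. This is a minimal sketch, not the evaluation code used for the thesis: the function names, the (frames, joints, 2) array layout, the hip indices, and the toy skeleton are all assumptions made for illustration. Left-right sensitivity follows from storing the left and right joints in separate columns, so a prediction on the wrong side is compared against the wrong-side ground truth and counted as an error.

```python
import numpy as np


def align_to_gt_hips(pred, gt, hips=(0, 1)):
    """Translate each predicted frame so its hip midpoint matches the ground truth.

    pred, gt: float arrays of shape (frames, joints, 2).
    hips:     indices of the left and right hip joints (assumed layout).
    """
    idx = list(hips)
    offset = gt[:, idx, :].mean(axis=1) - pred[:, idx, :].mean(axis=1)
    return pred + offset[:, None, :]


def pcp_curve(pred, gt, torso_len, thresholds, hips=(0, 1)):
    """Fraction of joints within each threshold after torso normalization.

    torso_len: array of shape (frames,), torso height in pixels per frame.
    Distances are rescaled so that the torso is 100 pixels tall.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    scale = 100.0 / np.asarray(torso_len, dtype=float)
    aligned = align_to_gt_hips(pred, gt, hips)
    dist = np.linalg.norm(aligned - gt, axis=-1) * scale[:, None]
    return np.array([(dist <= t).mean() for t in thresholds])


# Toy example: one frame, joints ordered [left hip, right hip, left knee, right knee].
toy_gt = np.array([[[0.0, 0.0], [10.0, 0.0], [0.0, 50.0], [10.0, 50.0]]])
toy_pred = toy_gt + np.array([5.0, 5.0])   # global shift, removed by the alignment
toy_pred[0, 2, 1] += 50.0                  # left knee is additionally 50 px off
toy_pcp = pcp_curve(toy_pred, toy_gt, torso_len=[200.0], thresholds=[20.0, 30.0])
```

With a 200 pixel torso, the 50 pixel knee error becomes 25 normalized pixels, so three of the four toy joints are correct at a threshold of 20 and all four at a threshold of 30.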
We also assume that we have a homography of the court at each frame, so we can stabilize the video, accounting for camera motion; this minimizes noise in the optical-flow trajectories that we use for retrieval, and is also necessary to compute the path error (Section 5.2.1). The homography is also manually labeled using a simple GUI to mark correspondences between the court (using the standard NBA dimensions) and each video frame. In the domain of sports, we could alternatively obtain the homography using automatic camera rectification methods [GLW11, GZA12, CC15]. However, the assumption helps us to separate the problem of camera auto-calibration from the rest of the pipeline.

Even with these assumptions, the broadcast videos are challenging for 3d pose estimation. There is significant camera motion in the form of pan, tilt, and zoom. Also, the lossy compression of the video leads to many video compression artifacts. We summarize some of the main challenges in dealing with these videos in Figure 5.2.

5.3.2 Implementation details

Here we describe some of the implementation details that are crucial for reproducing the results presented in this chapter.

Obtaining path on the ground

We use the player path in the world coordinates to estimate path error (as described in Section 5.2.1). Since the player bounding box and the frame homography are given, we can obtain an approximate path on the ground. Given a tight bounding box with the top-left corner (x_o, y_o), width w and height h, we set the player's location in the image as (x_o + w/2, y_o + 0.95 × h) (see Figure 5.3(a)). We project this point into the field map using the homography to get a player path in the world

[Figure 5.3 images: panels (a) and (b).]

Figure 5.3: The LMT implementation details. (a) We assume that the player bounding box is given at each frame. The red dot is our estimate of the player location based on the current bounding box. The blue line shows the estimated path connecting these locations over time. We transform this path to world coordinates using the homography.
(b) We also use the homography at each frame to estimate the camera viewpoint. The left figure shows a square around a player projected to the image using the given homography (solid-magenta) and a local affine approximation to the homography (cyan-dotted). Since the camera is located far from the court, the approximation is reasonably accurate in this case. We use the approximate affine transformation to obtain the elevation and the azimuthal angle of the camera under the orthographic projection (right).

coordinates.

Estimating camera viewpoint

Again, using the homography of the frame we obtain an approximate viewpoint of the camera for each player location. Since we assume an orthographic projection, we only need to estimate two parameters: elevation and azimuthal angle. We use this viewpoint to calculate the 2d pose error (see Section 5.2.1). Also, the estimated elevation angle helps us reduce the search for viewpoints during retrieval. In the case of video-based retrieval (Chapter 4), we match each video to 3 different projections of mocap sequences along the elevation angle. Here, we can set the elevation angle to a fixed value. Finally, we also use the estimated elevation and azimuthal angles to synthesize the final 3d pose sequence (described in Section 5.2.3).

To obtain a viewpoint estimate in the video, we locally approximate the homography around a player's location using an affine transformation. First, we consider a fixed-size square around the player in the world coordinates and project it to the image using the homography. Next, we estimate an affine transformation using the four corners of the square in world coordinates and their projections in the image. Given the affine transformation, the elevation and azimuthal angles can be easily calculated (see Figure 5.3(b)). Note that, since the viewpoint changes for different positions on the court, we need to calculate this for each player location separately.

Parameter tuning

There are a total of 10 players in the video.
We randomly pick 2 players, one from each team in the game, as our validation set. We use their video sequences to tune the LMT weight parameters {w_{2d}, w_p, w_m, w_t}. We do a few rounds of grid search to find parameters that return the best PCP curve on the validation set. We note that the performance is not sensitive to small changes in the above weights. All the test results are averaged over the remaining 8 players. We fix the number of top retrieved examples N to 500.

We also measure the sensitivity of our results to the LMT parameters: the temporal window size used for retrieval (k) and the number of overlapping frames between two neighboring windows (p). For more details see Section 5.3.4.

5.3.3 Baselines

We compare our method against the Flexible Mixture-of-Parts (FMP) [YR13] and the n-best maximal decoders of Park and Ramanan (nFMP) [PR11], which produce smooth outputs over FMP using temporal consistency. We set the number of n-best solutions to 50, and tune the α parameter of their objective function (see Equation 7 in [PR11]) on our data. We run both FMP and nFMP in the bounding boxes for each person, and not on the whole frame. As mentioned earlier, we use the ground truth body center to place the projection of our 3d model onto the image. For a fair comparison, FMP and nFMP algorithms should also be aware of the ground truth body center that the LMT is using. However, the primary focus of this result is not to show that the LMT is better at localizing 2d joints but to give a sense of the quality of the LMT output alongside other 2d methods. The LMT's strength lies in generating natural looking 3d pose sequences, which we illustrate in our qualitative results.

We also set up an oracle called Oracle-pose to give us an upper bound on the performance. This method uses the same approach as the LMT but assumes that the ground truth 2d pose is given to the algorithm. The assumption helps us to generate noise-free relational pose features for retrieval. The rest of the pipeline and parameters are kept the same.
We use the oracle as an upper bound on accuracy.

5.3.4 Results

For quantitative evaluations, we plot Percentage of Correct Parts (PCP) for different thresholds, i.e., the fraction of predicted joint locations that fall within the threshold distance from the ground truth. For qualitative results, we choose a few sequences and plot the predicted 3d pose as well as its projection on the image.

Different LMT modes

First, we show the PCP results for different variations of our LMT implementation. Figure 5.4 shows the 2d joint localization accuracy for 4 of the joints (knee, foot, elbow, and wrist). We obtain the PCP results for the parameter configuration {w_{2d}, w_p, w_m, w_t} = {1, 0.25, 1.2, 1.2}. Also, we set the LMT parameters k = 35 and p = 10. The same configuration is used for all the experiments.

We make the following observations from these results:

• First, we observe that results are consistent across these joints. The consistent gain in accuracy in the full model (unary + binary terms) shows that the LMT is effective in utilizing the contextual information present in the pairwise relationships to resolve ambiguities.

• The comparison with the oracle-pose shows the potential gain in accuracy that we could achieve by improving 2d pose estimation.

• The performance of path-only is consistently below all other variations tried.

[Figure 5.4 plots: four panels (elbow, wrist, knee, foot) showing PCP against threshold from 0 to 60. Legend: Oracle-pose, Full (unary + binary), Unary, Path only, Top match only.]

Figure 5.4: Percentage of Correct Parts (left-right sensitive) for different variations of the LMT compared against Oracle-pose. Top match only uses the matching error score. Path only minimizes only the path error to find the best 3d sequence for each video snippet.
Unary adds 2d pose error along with path and matching error. Full model uses a weighted sum of both unary and binary terms to evaluate the final LMT path. Oracle-pose has access to the ground truth 2d pose in addition to using the same parameters as the full model. We note that adding binary terms (in the full model) leads to a consistent gain in PCP. It demonstrates the importance of pairwise relations in resolving ambiguities.

[Figure 5.5 plots: four panels (elbow, wrist, knee, foot) showing PCP against threshold from 0 to 60. Legend: Oracle-pose, Ours, NFMP, FMP.]

Figure 5.5: Percentage of Correct Parts (PCP) for the LMT output compared to 2d pose estimation methods FMP [YR13] and nFMP [PR11]. Since FMP and nFMP do not distinguish between the left and the right body parts, we do the same for the LMT projections to make the comparison fair. Again, the oracle-pose uses the same parameters as Ours but has access to ground truth 2d pose.

This result shows that the path of the person on the ground, although used as the user input for synthesis in the case of MGs, is not effective in video-based pose estimation. We must incorporate the pose information.

We show the qualitative results for pose estimation in Figures 5.6, 5.7, 5.8, and 5.9. The main observations from the qualitative results are:

• As shown in Figures 5.6 and 5.7, the LMT gracefully deals with transitions between different activities such as walk, turn, and run. We are also able to capture the walk cycle in most cases, i.e., the correct leg is in front in a walk or run sequence. Additionally, the person orientation is matched well. Overall, the LMT output gives us rich information about the player movement and direction in 3d.

• Our full model returns a smoother result, while the unary-only output is easily affected by the local errors in the 2d pose estimation (as shown in Figure 5.6).
The temporal context provided by the binary term in the full model also helps in resolving the left-right ambiguity (Figure 5.7).
• Sometimes, the full model may lead to over-smoothing. Figure 5.8 shows one such example.
• Finally, even though the LMT works well for matching coarse motion and getting the direction of the movement right, the method often fails to recover fast, complex movements in the video, as shown in Figure 5.9.

2d pose estimation performance

We also compare the LMT PCP results with the FMP and nFMP baselines in Figure 5.5. We use the same four joints as in the previous experiment. However, to make the PCP results comparable, we give up the left-right sensitivity of our evaluation, e.g., the elbow is considered a single joint rather than two different joints (the left and the right elbow) for the purpose of evaluation. This change is necessary because FMP and nFMP do not distinguish between the left and the right body parts.

We observe that the LMT output is comparable to or better than the nFMP results for higher thresholds, but it does not perform well when high precision is required in localization. Again, this indicates that while the LMT is good at capturing the overall coarse movements of a person in 3d, it does not localize the individual body joints very well. We can observe the same pattern in our qualitative results. The reason for the larger error at low thresholds is that the LMT formulation does not focus on localizing individual joints. We always match the full 3d model from one of the exemplars, which may match the overall motion but may not correspond well to the locations of the body parts. Also, the LMT uses a fixed-size 3d model to generate the pose sequence for all the players. Since the players have some variation in their body shape and size, this assumption adds an error that we do not compensate for in the current approach.

LMT parameters

Next, we study the sensitivity of the final performance to different LMT parameters.
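To make the two parameters in this study concrete, here is a minimal sketch of how a video could be cut into overlapping query snippets. It is an illustration under stated assumptions rather than the thesis implementation: the function name `snippet_windows` is hypothetical, the reading of p as the overlap between consecutive queries is an assumption (the text only confirms that k is the query length), and dropping trailing frames that do not fill a whole snippet is a simplification.

```python
def snippet_windows(num_frames, length, overlap):
    """Return (start, end) frame ranges, end exclusive, for overlapping queries.

    length:  frames per query snippet (assumed to correspond to k).
    overlap: frames shared by two consecutive snippets (assumed to be p).
    """
    if length <= 0 or not 0 <= overlap < length:
        raise ValueError("need length > 0 and 0 <= overlap < length")
    stride = length - overlap  # how far each new query advances
    windows = []
    start = 0
    # Simplification: trailing frames that do not fill a full snippet are dropped.
    while start + length <= num_frames:
        windows.append((start, start + length))
        start += stride
    return windows


if __name__ == "__main__":
    # With the configuration quoted in the text (35 and 10) on a 100-frame clip.
    print(snippet_windows(100, 35, 10))
```

A larger overlap gives more queries per video and more shared frames on which consecutive matches can be compared, at the cost of more retrieval work.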
The two important parameters related to the construction of the LMT are the length of each snippet used to query the database and the number of overlapping frames between two consecutive queries. Figure 5.10 summarizes the results.

5.4 Discussion

To the best of our knowledge, the LMT is the first scalable approach to 3d human pose estimation from realistic monocular video input. Motion retrieval, along with the simple time-constrained trellis graph architecture, keeps our method efficient and widely applicable. We make no assumptions about the motion examples used or the availability of class labels. Even though exact body joint locations are hard to estimate, the LMT can provide information about the activity, the walk cycle, and the heading direction of the person in the video.

5.4.1 Limitations and future work

One of the main limitations of the LMT is its inability to generalize outside the available exemplars. For instance, we do not provide any mechanism to customize the shape or size of the puppet for the tracked person in the video. The LMT is also not able to spatially localize the estimated 3d pose sequence in the provided video, and hence we need to rely on the availability of the body center. Even though we only use the exemplars and perform no further optimization to improve the joint localization, the LMT output can also act as an initialization for further fine-tuning for better 3d pose estimation. To begin with, we can optimize for person size and the location of the output. Also, to make pose-level adjustments, we should explore a combination of non-parametric and parametric approaches such as GPDM [WFH08] or parametric motion graphs [HG07].

Also, the LMT is inherently limited by the accuracy of the retrieval pipeline, which in turn depends on the 2d pose estimation accuracy. The large gap between the performance of the oracle and our best result shows that improving 2d pose estimation can lead to better 3d pose output.
We can make multiple changes to improve 2d pose estimation. First, we do not exploit the continuity of the appearance, i.e., we detect 2d pose in each frame independently. We can build an appearance model using the initial tracking results to improve the 2d pose tracking for subsequent iterations. Second, as mentioned in the last chapter, we can incorporate the uncertainty of pose prediction in the retrieval pipeline rather than using the MAP estimate for the 2d pose.

Furthermore, we have tested the LMT in a relatively restricted setting of a sports environment with a known geometry. Extending this work to videos in the wild would also be an interesting future direction.

[Figure 5.6 image rows: Image Seq., nFMP, unary, full, full (3d).]

Figure 5.6: A typical 3d pose sequence generated using the Localized Motion Trellis. The top row shows a cropped image sequence from the NBA data. The subsequent rows display the output of different methods, including our full model (using both unary and binary terms). The 3d output corresponding to the full model is presented in the last row. Note that we have rotated the axis in the final row to emphasize the 3d nature of the output. The arrow on the middle frame shows the viewing direction of the camera. Also, we mark the right limbs in red. In this sequence, the player walks, then turns to his right and starts running. Our full model smoothly handles the transition from one activity to the next and accurately captures the walk cycles, i.e., the left and right legs are correctly aligned even though this information is not available from FMP. The unary output is affected by the errors in the nFMP pose estimate (see the fourth frame from the left and the last frame), while the full model can correct for these errors.

[Figure 5.7 image rows: Image Seq., nFMP, unary, full, full (3d).]

Figure 5.7: Another example of the 3d pose sequence generated using the Localized Motion Trellis. In this sequence, the player runs towards the camera, then turns left. After waiting for a few seconds, he turns around and runs again. Note that the full model is correctly able to capture the walk cycle (see the last three frames), while the player's direction is inconsistent in the unary output (see the sixth frame from the left and the third-to-last frame).
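The PCP measure used for the quantitative results in Section 5.3.4 can be sketched as follows. This is a minimal illustration, assuming NumPy arrays of per-frame 2d joint locations; the function names are hypothetical. The left-right insensitive variant shows one plausible reading of the evaluation described for Figure 5.5 (scoring a prediction against whichever ground-truth side is closer), which is an assumption rather than the procedure confirmed by the text. Any normalization of the thresholds is left to the caller.

```python
import numpy as np


def pcp_curve(pred, gt, thresholds):
    """Fraction of predictions within each threshold distance of the ground truth.

    pred, gt:   arrays of shape (num_frames, 2) holding one joint's 2d location.
    thresholds: iterable of distances in the same units as the coordinates.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=-1)  # per-frame Euclidean error
    # Assumption: a prediction exactly at the threshold counts as correct.
    return np.array([np.mean(dist <= t) for t in thresholds])


def pcp_curve_lr_insensitive(pred, gt_left, gt_right, thresholds):
    """Assumed left-right insensitive variant: in each frame, compare the
    prediction with the closer of the two ground-truth sides."""
    pred = np.asarray(pred, dtype=float)
    d_left = np.linalg.norm(pred - np.asarray(gt_left, dtype=float), axis=-1)
    d_right = np.linalg.norm(pred - np.asarray(gt_right, dtype=float), axis=-1)
    dist = np.minimum(d_left, d_right)
    return np.array([np.mean(dist <= t) for t in thresholds])
```

Under this reading, a prediction that lands on the wrong side of the body is penalized by the sensitive measure but not by the insensitive one, which is why the two kinds of curves are not directly comparable.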
[Figure 5.8 image rows: Image Seq., nFMP, unary, full, full (3d).]

Figure 5.8: An error case for our full model. In this example, the player runs, turns around, and keeps going in the same direction. He then stops for a few frames, and then turns left. Our 3d output from the full model faces in the wrong direction when the player turns and keeps facing in the wrong direction when the player is standing. This example is one of the cases where the full model may be failing due to over-smoothing. In contrast, the unary output is able to capture the change in direction.

[Figure 5.9 image rows: Image Seq., nFMP, unary, full, full (3d).]

Figure 5.9: Another error case emphasizing a common limitation of all the LMT variants. In this example, the player is performing fast and complicated movements. Even though the LMT can capture the overall jump motion (frames 2-6), the output does not capture any other movements. There can be multiple reasons for this failure, such as a) the absence of an appropriate exemplar in the database that can reasonably approximate this motion; or b) a failure in retrieval.

[Figure 5.10 plots: (a) PCP versus query length (5 to 55 frames) and (b) PCP versus query overlap (5 to 15 frames), each with curves for Norm. Th. 30, Norm. Th. 40, and Norm. Th. 50.]

Figure 5.10: The effect of the LMT parameters on the average PCP (over the wrist, elbow, knee, and foot joints) for different thresholds. a) We plot accuracy as a function of query length. We reconstruct the 3d pose using 50 matches for each video subsequence. Since we are interested in measuring the effect of query length on the quality of the retrieved mocap sequences, we only use the 2d pose error to find the best path in the LMT, and no interpolation is done to generate the final 3d output. Based on this result, we choose k = 35. b) Average PCP as a function of the overlap between consecutive queries. In this case, we fix the query length to 35 frames and calculate PCP using the top 500 matches for each video subsequence. To avoid the effect of other terms, we use only the transition error to find the best path.

Chapter 6
Discussion and Future Work

Understanding actions, emotions, and intentions of people around us is an important part of human communication.
This non-verbal exchange is essential to cooperation, teamwork, and relationships. Intelligent machines also need the ability to understand non-verbal cues if they are to help us and work alongside us. Keeping this motivation in mind, we have focussed on the problem of automated human activity understanding in videos.

In Chapter 1 we argued that many crucial applications require a detailed description of human activities under realistic imaging conditions. The three specific problems that we targeted in this thesis are cross-view action recognition, video-based mocap retrieval, and 3d pose estimation. Cross-view action recognition helps in overcoming the viewpoint bias in action recognition, mocap retrieval allows us to efficiently search through a large database of mocap files, and 3d pose estimation in videos can provide a well-localized (in space and time) description of human activities. Next, we identified one of the main obstacles in building such systems: the human effort required in labelling data for supervised learning. The rest of the thesis concentrated on various techniques for overcoming this challenge.

In the following sections, we discuss the potential impact of our contributions, identify the main limitations of the presented approaches, and speculate on possible future directions to address these challenges.

6.1 Contributions and Impact

View-invariant action recognition:

• Training data collected from the internet has its biases. For instance, the perspective of a security camera, an autonomous car, or a home robot may not match well with the videos downloaded from YouTube. Therefore, a view-invariant representation of human action is important.
• In Chapter 3, we have demonstrated a novel method to add view-invariance to action recognition without using any multi-view data or human annotations.
Our approach has shown a significant improvement over the baseline on a standard cross-view action recognition benchmark and has remained competitive with the state of the art.
• We have also presented a method to generate motion features from mocap sequences without photo-realistic rendering. These features are analogous and comparable to the popular dense trajectory features [WKSL11].
• We expect that these contributions will further encourage research towards learning from synthetic sources of data such as CAD models, computer games, and physics-based simulations of the world.

Video-based mocap retrieval:

• Retrieval of mocap examples given visual input has been used for 3d pose estimation. Chapter 4 has shown that instead of retrieving individual frames (as demonstrated in [RSH+05]), we can retrieve a sequence using a short video clip as a query. In addition, we have established a new task of video-based mocap retrieval and demonstrated its applications to monocular 3d pose estimation (in Chapter 5) and cross-view action recognition (in Chapter 3).
• We have also presented a set of features for retrieval that is comparable across video and mocap (Section 4.2). Due to the temporal aggregation of information in videos, even noisy pose estimates in individual frames are effective in retrieving mocap sequences.
• Moreover, we have provided a new benchmark for evaluating video-based mocap retrieval to allow standardized comparisons in the future. Our frame-level annotations of mocap can also be useful for other vision tasks.

3d pose estimation in videos:

• Pose has been shown to be a good feature for activity recognition [JGZ+13, CLS15]. There are many advantages to predicting pose per frame instead of a single action label for the whole video. For instance, given a long video, we can segment the corresponding pose sequence to recover a set of action labels along with the temporal localization [MBS09].
Although annotating pose is tedious, annotating fine-grained action labels requires greater domain knowledge, e.g., people who do not watch basketball may not be familiar with the different kinds of shots in the game.
• Chapter 5 has presented a non-parametric (example-based) method for 3d pose estimation that is scalable to a large number of exemplars and realistic monocular videos.
• A large body of past research in 3d pose estimation has focussed on parametric methods and used constrained environments for testing. We hope that this thesis will help shift the focus towards example-based methods and unconstrained videos.

6.2 Future Directions

6.2.1 Robust evaluation of 3d pose estimation

As we noted in the introduction, it is challenging to obtain ground truth for 3d pose in videos. Synchronized video and motion capture can be used to get precise 3d pose, but it is currently restricted to indoor settings only. Moreover, the appearance of such a video is not natural because of the special suit and reflective markers for tracking. Markerless mocap systems allow for natural clothing but require a large number of carefully calibrated cameras.

Recently, Elhayek et al. [EAJ+15] used a small number of calibrated cameras to estimate 3d pose in an outdoor setting with natural lighting conditions, and showed impressive results. Using inertial sensors can add further robustness to such vision-based methods. However, for all the methods above, we still require an interface to clean up the data and correct mistakes using human input. We can construct a pipeline for obtaining 3d pose ground truth in a realistic setting using multiple cameras, inertial sensors, and a manual clean-up step.

Another challenge lies in measuring the similarity between the predicted and the ground truth pose sequences. Obtaining an accurate ground truth estimate can help us in getting a reliable numerical similarity measure.
However, a numerical similarity measure such as the L2 distance between the two pose vectors may not imply closeness in the semantic space of activities [MRC05]. The problem of finding a measure of similarity that agrees with human intuition is an interesting area for further exploration.

6.2.2 Features for video-based mocap retrieval

The features that we propose for video-based mocap retrieval are effective, as demonstrated in Chapter 3 and Chapter 5. However, there is a huge gap between our best results and the upper bound on the recall@N performance (Figure 4.9). One of the potential bottlenecks in the performance can be the features used for comparing a mocap frame and a video frame. Our features are hand-designed, as opposed to the learned features commonly used for action recognition and pose estimation. Although a retrieval method does not have access to labeled examples for learning, the features learned on similar tasks can be reused for retrieval. Here are some concrete suggestions for improvements:

• We use Flexible Mixture-of-Parts (FMP) for 2d pose estimation. Our results (in Figure 5.4) suggest that an increase in 2d pose estimation accuracy can improve 3d pose estimation. As a first step, we can replace FMP with a ConvNet-based 2d pose detector (such as [TGJ+15]) with better performance on standard benchmarks.
• Our motion descriptor for retrieval is also an aggregation of hand-designed Dense Trajectories (DT) features. These features can be replaced with the ConvNet-based optical flow features used for action recognition [SZ14a]. One of the ways we can adapt our method to ConvNet-based features is by generating synthetic optical flow using different projections of the mocap sequence (similar to [ZRSB13]).
• Additionally, we use relational pose features based on the 2d pose detection output and calculate similarity using the dot product.
It is possible to transform the feature space such that it becomes discriminative with respect to 3d pose. We can use metric learning [SJ03] for this purpose. It is possible to learn a distance metric, without human supervision, by considering short mocap sequences and their different 2d projections. Since it is easier to establish similarity in 3d pose space, we can use the 3d similarity to supervise metric learning for 2d relational features.

6.2.3 Fine tuning pose estimation in videos

The Localized Motion Trellis (LMT) constructs the 3d pose output by interpolating multiple mocap snippets chosen from a database of mocap examples. One of the main limitations of such example-based methods is their inability to generalize beyond the examples. Therefore, the output has a limited range in the set of poses, skeleton shapes, and motion patterns. We can make our approach more accurate by adding a fine-tuning step at the end of the pipeline. Here are some of the potential future directions:

• One of the most challenging and fundamental problems in human motion analysis is to formulate a flexible, activity-independent generative model of human motion. To simplify the problem, we can instead begin with a bank of generative models to represent the wide variety of human activities. Similar to [BRMS09, YGG12, YKC13], we can use the output of the LMT to choose an appropriate model for each snippet and tune the model parameters to explain the image evidence.
• Many 3d pose estimation methods assume that the 3d model of the person being tracked is known in advance. At the other extreme, we use the same model provided by the mocap sequence for each subject. The mismatch in the size and shape of the model of the person introduces an error in the 3d pose estimate at each frame. The fine-tuning step can also customize the model to each subject. We can use a parametric model of human shape and pose [ZB15] rather than using a fixed-size puppet.
Again, we can start with a generic shape, with the pose sequence returned by the LMT as our initialization, and optimize for shape over the whole sequence.
• Finally, one of the biggest limitations of our approach is that it is not probabilistic. For instance, we do not model or incorporate the uncertainty in the 2d pose estimate. Some pose or specific joint estimates can be less reliable than others due to self-occlusion or motion blur. Some of the past approaches deal with this problem by optimizing for 3d pose and updating the underlying 2d pose estimate in an alternating fashion [SSQTMN13, ZZL+16]. Incorporating these ideas into our pipeline is another exciting challenge.

Bibliography

[AF02] Okan Arikan and D. A. Forsyth. Interactive motion generation from examples. TOG, 21(3), 2002. → pages 11
[AF13] Xavier Anguera and Miquel Ferrarons. Memory efficient subsequence DTW for query-by-example spoken term detection. In Multimedia and Expo (ICME), 2013. → pages 18
[AN04] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004. → pages 4
[ARS10] Mykhaylo Andriluka, Stefan Roth, and Bernt Schiele. Monocular 3D pose estimation and tracking by detection. In CVPR, 2010. → pages 9, 19, 21, 39
[AT06] A. Agarwal and B. Triggs. Recovering 3d human pose from monocular images. TPAMI, 28(1), 2006. → pages 21
[BF08] Marcus A Brubaker and David J Fleet. The kneed walker for human pose tracking. In CVPR, 2008. → pages 20
[BMB+11] Andreas Baak, Meinard Muller, Gaurav Bharaj, Hans-Peter Seidel, and Christian Theobalt. A data-driven approach for real-time full body pose reconstruction from a depth camera. In ICCV, 2011. → pages 17
[BRMS09] Andreas Baak, Bodo Rosenhahn, M Muller, and H-P Seidel. Stabilizing motion tracking using retrieved motion priors. In ICCV, 2009. → pages 89
[BYF14] Lubomir Bourdev, Fei Yang, and Rob Fergus. Deep poselets for human detection. arXiv preprint arXiv:1407.0717, 2014. → pages 17
[CC15] Jianhui Chen and Peter Carr.
Mimicking human camera operators. In WACV, 2015. → pages 71
[CG13] Chao-Yeh Chen and Kristen Grauman. Watching unlabeled video helps learn new human actions from very few labeled snapshots. In CVPR, 2013. → pages 32
[CLAL12] Min-Wen Chao, Chao-Hung Lin, J. Assa, and Tong-Yee Lee. Human motion retrieval from hand-drawn sketch. Visualization and Computer Graphics, 18(5), 2012. → pages 15
[CLS15] Guilhem Chéron, Ivan Laptev, and Cordelia Schmid. P-CNN: Pose-based CNN Features for Action Recognition. In ICCV, 2015. → pages 87
[cmu] Carnegie Mellon University Motion Capture Database. → pages 10, 26, 31, 50
[CYI+12] Myung Geol Choi, Kyungyong Yang, Takeo Igarashi, Jun Mitani, and Jehee Lee. Retrieval and visualization of human motion data via stick figures. In Computer Graphics Forum, volume 31, 2012. → pages 15
[DSCW10] Konstantinos G Derpanis, Mikhail Sizintsev, Kevin Cannons, and Richard P Wildes. Efficient action spotting based on a spacetime oriented structure representation. In CVPR, 2010. → pages 4
[DT05] Navneet Dalal and Bill Triggs. Histograms of Oriented Gradients for Human Detection. In CVPR, 2005. → pages 6, 24
[DTS06] Navneet Dalal, Bill Triggs, and Cordelia Schmid. Human Detection Using Oriented Histograms of Flow and Appearance. In ECCV, 2006. → pages 6, 24
[EAJ+15] A. Elhayek, E. Aguiar, A. Jain, J. Tompson, L. Pishchulin, M. Andriluka, C. Bregler, B. Schiele, and C. Theobalt. Efficient convnet-based marker-less motion capture in general scenes with a low number of cameras. In CVPR, 2015. → pages 19, 88
[EMJZF12] Marcin Eichner, Manuel Marin-Jimenez, Andrew Zisserman, and Vittorio Ferrari. 2d articulated human pose estimation and retrieval in (almost) unconstrained still images. IJCV, 99(2), 2012. → pages ix, 5, 16
[FGDJ08] Tien-Chieng Feng, P. Gunawardane, J. Davis, and B. Jiang. Motion capture data retrieval using an artist’s doll. In ICPR, 2008. → pages 15
[Fle11] David J Fleet. Motion models for people tracking. In Thomas B
In Thomas BMoeslund, Adrian Hilton, Volker Kru¨ger, and Leonid Sigal,editors, Visual Analysis of Humans. 2011. → pages 9, 19[FT08] Ali Farhadi and MK Tabrizi. Learning to Recognize Activitiesfrom the wrong viewpoint. In ECCV, 2008. → pages 14, 23[GBS+07] Lena Gorelick, Moshe Blank, Eli Shechtman, Michal Irani, andRonen Basri. Actions as Space-Time Shapes. TPAMI, 29(12),2007. → pages 6, 24[GDDM14] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik.Rich feature hierarchies for accurate object detection and semanticsegmentation. In CVPR, 2014. → pages 1[GIJ+15] A. Gorban, H. Idrees, Y.-G. Jiang, A. Roshan Zamir, I. Laptev,M. Shah, and R. Sukthankar. THUMOS challenge: Actionrecognition with a large number of classes., 2015. → pages 4[GLW11] Ankur Gupta, James J. Little, and Robert J. Woodham. Using lineand ellipse features for rectification of broadcast hockey video. InCRV, 2011. → pages 71[GMLW14] Ankur Gupta, Julieta Martinez, James J. Little, and Robert J.Woodham. 3D Pose from Motion for Cross-view ActionRecognition via Non-linear Circulant Temporal Encoding. InCVPR, 2014. → pages xiv, 8, 39, 54, 55, 57[GSLW14] Ankur Gupta, Alireza Shafaei, James J. Little, and Robert J.Woodham. Unlabelled 3D Motion Examples Improve Cross-ViewAction Recognition. In BMVC, 2014. → pages viii, 36[GSSG12] Boqing Gong, Yuan Shi, Fei Sha, and K. Grauman. GeodesicFlow Kernel for Unsupervised Domain Adaptation. In CVPR,2012. → pages 15[GvdPvdS13] Thomas Geijtenbeek, Michiel van de Panne, and A. Frank van derStappen. Flexible muscle-based locomotion for bipedal creatures.TOG, 32(6), 2013. → pages 2093[GZA12] B Ghanem, T Zhang, and N Ahuja. Robust video registrationapplied to field-sports video analysis. In ICASSP, 2012. → pages71[HG07] Rachel Heck and Michael Gleicher. Parametric motion graphs. InSymposium on Interactive 3D graphics and games, 2007. →pages 79[HW13] De-An Huang and Yu-Chiang Frank Wang. 
Coupled Dictionary and Feature Space Learning with Applications to Cross-Domain Image Synthesis and Recognition. In ICCV, 2013. → pages 6
[JDS11] Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. TPAMI, 33(1), 2011. → pages 44, 51
[JGZ+13] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J Black. Towards understanding action recognition. In ICCV, 2013. → pages ix, 5, 42, 87
[JZE+12] N. Jammalamadaka, A. Zisserman, M. Eichner, V. Ferrari, and C. V. Jawahar. Video retrieval by mimicking poses. In ACM International Conference on Multimedia Retrieval, 2012. → pages 17
[JZJ15] N. Jammalamadaka, A. Zisserman, and C. V. Jawahar. Human pose search using deep poselets. In International Conference on Automatic Face and Gesture Recognition, 2015. → pages 17
[KCT+13] Mubbasir Kapadia, I-kao Chiang, Tiju Thomas, Norman I. Badler, and Joseph T. Kider, Jr. Efficient motion retrieval in large motion databases. In ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, 2013. → pages 15
[KG04] Lucas Kovar and Michael Gleicher. Automated extraction and parameterization of motions in large data sets. TOG, 23(3), 2004. → pages 8, 16
[KGP02] Lucas Kovar, Michael Gleicher, and Frédéric Pighin. Motion graphs. TOG, 21(3), 2002. → pages 10, 21, 65, 67, 68
[KJG+11] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011. → pages 3
[KS16] Hema S Koppula and Ashutosh Saxena. Anticipating human activities using object affordances for reactive robotic response. PAMI, 38(1), 2016. → pages 13
[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012. → pages 1
[KTB07] Sagi Katz, Ayellet Tal, and Ronen Basri. Direct visibility of point sets. TOG, 26(3), July 2007. → pages 26
[KTS+14] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei.
Large-scale video classification with convolutional neural networks. In CVPR, 2014. → pages 5, 61
[LCS12] Binlong Li, O.I. Camps, and M. Sznaier. Cross-view activity recognition using hankelets. In CVPR, 2012. → pages 6, 35, 36
[LM07] Neil D Lawrence and Andrew J Moore. Hierarchical Gaussian process latent variable models. In ICML, 2007. → pages 19, 20
[LS11] Jingen Liu and Mubarak Shah. Cross-view action recognition via view knowledge transfer. In CVPR, 2011. → pages 15, 23
[LTLM13] Wei-Lwun Lu, J-A Ting, James J Little, and Kevin P Murphy. Learning to track and identify players from broadcast sports videos. TPAMI, 35(7), 2013. → pages 3
[LTSY09] Rui Li, Tai-Peng Tian, Stan Sclaroff, and Ming-Hsuan Yang. 3d human motion tracking with a coordinated mixture of factor analyzers. IJCV, 87(1-2), 2009. → pages 21
[LWS02] Yan Li, Tianshu Wang, and Heung-Yeung Shum. Motion texture: a two-level statistical model for character motion synthesis. TOG, 21(3), 2002. → pages 20
[LZ12] Ruonan Li and Todd Zickler. Discriminative Virtual Views for Cross-View Action Recognition. In CVPR, 2012. → pages 6, 14, 15
[LZRZS15] Tian Lan, Yuke Zhu, Amir Roshan Zamir, and Silvio Savarese. Action recognition by hierarchical mid-level action elements. In ICCV, 2015. → pages 2
[MBS09] Meinard Müller, Andreas Baak, and Hans-Peter Seidel. Efficient and robust annotation of motion capture data. In Eurographics Symposium on Computer Animation, 2009. → pages 87
[MGB09] Armando Muscariello, Guillaume Gravier, and Frédéric Bimbot. Variability tolerant audio motif discovery. In Advances in Multimedia Modeling, pages 275–286. Springer, 2009. → pages xiii, 18, 47, 48
[MHKS11] Thomas B Moeslund, Adrian Hilton, Volker Krüger, and Leonid Sigal. Visual analysis of humans. Springer, 2011. → pages 13
[MRC05] Meinard Müller, Tido Röder, and Michael Clausen. Efficient content-based retrieval of motion capture data. TOG, 24(3), 2005. → pages 15, 16, 42, 61, 88
[MT13] Behrooz Mahasseni and Sinisa Todorovic.
Latent Multitask Learning for View-Invariant Action Recognition. In ICCV, 2013. → pages 6
[Mül07] Meinard Müller. Information retrieval for music and motion, volume 2. Springer, 2007. → pages 18, 46
[MV93] Andres Marzal and Enrique Vidal. Computation of normalized edit distance and applications. PAMI, 15(9), 1993. → pages 47
[NNSH11] Naoki Numaguchi, Atsushi Nakazawa, Takaaki Shiratori, and Jessica K Hodgins. A puppet interface for retrieval of motion capture data. In Eurographics Symposium on Computer Animation, 2011. → pages 15
[NTTS06] MN Nyan, Francis EH Tay, AWY Tan, and KHW Seah. Distinguishing fall activities from normal activities by angular rate characteristics and high-speed camera characterization. Medical Engineering & Physics, 28(8), 2006. → pages 4
[OVS13] Dan Oneata, Jakob Verbeek, and Cordelia Schmid. Action and event recognition with Fisher vectors on a compact feature set. In ICCV, 2013. → pages 41
[PP10] Tomislav Pejsa and Igor S Pandzic. State of the art in example-based motion synthesis for virtual characters in interactive applications. In Computer Graphics Forum, volume 29, 2010. → pages 21
[PR11] Dennis Park and Deva Ramanan. N-best maximal decoders for part models. In ICCV, 2011. → pages xvii, 73, 76
[PR14] Hamed Pirsiavash and Deva Ramanan. Parsing videos of actions with segmental grammars. In CVPR, 2014. → pages 2
[PSM10] Florent Perronnin, Jorge Sánchez, and Thomas Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, 2010. → pages 41
[RDC+13] Jérôme Revaud, Matthijs Douze, Schmid Cordelia, Hervé Jégou, et al. Event Retrieval in Large Video Collections with Circulant Temporal Encoding. In CVPR, 2013. → pages 8, 9, 18, 44
[RF03] Deva Ramanan and David A Forsyth. Automatic Annotation of Everyday Movements. In NIPS, 2003. → pages 2, 13, 14
[RJ93] Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of speech recognition. 1993. → pages 45, 48
[RM15] Hossein Rahmani and Ajmal Mian.
Learning a non-linear knowledge transfer model for cross-view action recognition. In CVPR, 2015. → pages viii, 2, 7, 35, 36
[RSH+05] Liu Ren, Gregory Shakhnarovich, Jessica K Hodgins, Hanspeter Pfister, and Paul Viola. Learning silhouette features for control of human motion. TOG, 24(4), 2005. → pages 8, 11, 17, 39, 61, 66, 86
[SC04] Stan Salvador and Philip Chan. Fastdtw: Toward accurate dynamic time warping in linear time and space. In KDD workshop on mining temporal and sequential data, 2004. → pages 18, 47
[SGF+13] Jamie Shotton, Ross Girshick, Andrew Fitzgibbon, Toby Sharp, Mat Cook, Mark Finocchio, Richard Moore, Pushmeet Kohli, Antonio Criminisi, Alex Kipman, and Andrew Blake. Efficient human pose estimation from single depth images. TPAMI, 35(12), 2013. → pages 4, 13
[SHG+11] Carsten Stoll, Nils Hasler, Juergen Gall, Hans-Peter Seidel, and Christian Theobalt. Fast articulated motion tracking using a sums of Gaussians body model. In ICCV, 2011. → pages 19
[SJ03] Matthew Schultz and Thorsten Joachims. Learning a distance metric from relative comparisons. In NIPS, 2003. → pages 89
[SKFD10] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In ECCV. 2010. → pages 3
[SSQTMN13] Edgar Simo-Serra, Ariadna Quattoni, Carme Torras, and Francesc Moreno-Noguer. A joint model for 2d and 3d pose estimation from a single image. In CVPR, 2013. → pages 20, 39, 90
[ST13] Ben Sapp and Ben Taskar. Modec: Multimodal decomposable models for human pose estimation. In CVPR, 2013. → pages 69
[SVBC08] Adam A Smith, Aaron Vollrath, Christopher A Bradfield, and Mark Craven. Similarity queries for temporal toxicogenomic expression profiles. PLoS Computational Biology, 4(7), 2008. → pages 18
[SVBC09] Adam A Smith, Aaron Vollrath, Christopher A Bradfield, and Mark Craven. Clustered alignments of gene-expression time series data. Bioinformatics, 25(12):i119–i1127, 2009. → pages 18
[SVD03] Gregory Shakhnarovich, Paul Viola, and Trevor Darrell.
Fast pose estimation with parameter-sensitive hashing. In CVPR, 2003. → pages 17
[SZ14a] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014. → pages 89
[SZ14b] Khurram Soomro and Amir R Zamir. Action recognition in realistic sports videos. In Computer Vision in Sports. 2014. → pages 4
[TGJ+15] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christopher Bregler. Efficient object localization using convolutional networks. In CVPR, 2015. → pages 88
[TSW+15] B. Tekin, X. Sun, X. Wang, V. Lepetit, and P. Fua. Predicting People’s 3D Poses from Short Sequences. ArXiv e-prints, April 2015. → pages 21
[UD08] R. Urtasun and T. Darrell. Sparse probabilistic regression for activity-independent human pose inference. In CVPR, 2008. → pages 21
[UFHF05] Raquel Urtasun, David J Fleet, Aaron Hertzmann, and Pascal Fua. Priors for people tracking from small training sets. In ICCV, 2005. → pages 19
[VNK15] Michalis Vrigkas, Christophoros Nikou, and Ioannis A Kakadiaris. A review of human activity recognition methods. Frontiers in Robotics and AI, 2, 2015. → pages 13
[VSHJ12] Marek Vondrak, Leonid Sigal, Jessica Hodgins, and Odest Jenkins. Video-based 3D Motion Capture through Biped Control. TOG, 31(4), 2012. → pages 20
[WBR07] Daniel Weinland, Edmond Boyer, and Remi Ronfard. Action Recognition from Arbitrary Views Using 3D Exemplars. In ICCV, 2007. → pages 14
[WC10] Xiaolin Wei and Jinxiang Chai. Videomocap: modeling physically realistic human motion from monocular video sequences. TOG, 29(4), 2010. → pages 20
[WFH08] Jack M Wang, David J Fleet, and Aaron Hertzmann. Gaussian process dynamical models for human motion. TPAMI, 30(2), 2008. → pages 19, 79
[WKSL11] Heng Wang, Alexander Klaser, Cordelia Schmid, and Cheng-Lin Liu. Action recognition by dense trajectories. In CVPR, 2011.
→pages xi, xii, 6, 7, 13, 17, 23, 24, 25, 26, 28, 31, 40, 41, 86[WRB06] Daniel Weinland, Remi Ronfard, and Edmond Boyer. Freeviewpoint action recognition using motion history volumes.CVIU, 104(2), 2006. → pages 2, 24, 30, 50[YCN+11] Yuan Yuan, Yi-Ping P Chen, Shengyu Ni, Augix G Xu, Lin Tang,Martin Vingron, Mehmet Somel, and Philipp Khaitovich.Development and application of a modified dynamic time warpingalgorithm (DTW-S) to analyses of primate brain expression timeseries. BMC bioinformatics, 12(1):347, 2011. → pages 1899[YGG12] Angela Yao, Juergen Gall, and Luc Gool. Coupled ActionRecognition and Pose Estimation from Multiple Views. IJCV,100(1), 2012. → pages 20, 42, 49, 89[YKC13] Tsz-Ho Yu, Tae-Kyun Kim, and Roberto Cipolla. Unconstrainedmonocular 3d human pose estimation by action detection andcross-modality regression forest. In CVPR, 2013. → pages 20, 49,89[YKS08] Pingkun Yan, Saad M. Khan, and Mubarak Shah. Learning 4DAction Feature Models for Arbitrary View Action Recognition. InCVPR, 2008. → pages 14[YKW14] Hashim Yasin, Bjo¨rn Kru¨ger, and Andreas Weber. Motiontracking, retrieval and 3d reconstruction from video. IJMUE, 9(2),2014. → pages 17[YR13] Yi Yang and Deva Ramanan. Articulated human detection withflexible mixtures of parts. TPAMI, 35(12), 2013. → pages xvii,xx, 17, 39, 42, 66, 73, 76[YS05] Alper Yilmaz and Mubarak Shah. Actions sketch: A novel actionrepresentation. In CVPR, 2005. → pages 6, 24[ZB15] Silvia Zuffi and Michael J Black. The stitched puppet: A graphicalmodel of 3d human shape and pose. In CVPR, 2015. → pages 90[ZDlT14] Feng Zhou and Fernando De la Torre. Spatio-temporal matchingfor human detection in video. In ECCV. 2014. → pages xiii, 41[ZdlT15] F. Zhou and F. de la Torre. Generalized canonical time warping.TPAMI, PP(99), 2015. → pages 18[ZJ13] Jingjing Zheng and Zhuolin Jiang. Learning View-InvariantSparse Representations for Cross-View Action Recognition. InICCV, 2013. → pages 6, 15, 23[ZLN08] Li Zhang, Yuan Li, and Ramakant Nevatia. 
Global dataassociation for multi-object tracking using network flows. InCVPR, 2008. → pages 63[ZRSB13] Silvia Zuffi, Javier Romero, Cordelia Schmid, and Michael JBlack. Estimating human pose with flowing puppets. In ICCV,2013. → pages 89100[ZWX+13] Zhong Zhang, Chunheng Wang, Baihua Xiao, Wen Zhou, ShuangLiu, and Cunzhao Shi. Cross-View Action Recognition via aContinuous Virtual Path. In CVPR, 2013. → pages 6, 15, 30[ZZD13] Weiyu Zhang, Menglong Zhu, and Konstantinos Derpanis. Fromactemes to action: A strongly-supervised representation fordetailed action understanding. In ICCV, 2013. → pages 5[ZZL+16] Xiaowei Zhou, Menglong Zhu, Spyridon Leonardos, KostaDerpanis, and Kostas Daniilidis. Sparseness meets deepness: 3dhuman pose estimation from monocular video. In CVPR, 2016. →pages 90101




Related Items