Understanding the Sources of Error for 3D Human Pose Estimation from Monocular Images and Videos

by

Mir Rayat Imtiaz Hossain

Bachelor of Science, Islamic University of Technology, 2013

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)

December 2017

© Mir Rayat Imtiaz Hossain, 2017

Abstract

With the success of deep learning in the field of computer vision, most state-of-the-art approaches to estimating 3D human pose from images or videos rely on training a network end-to-end to regress 3D joint locations or heatmaps from an RGB image. Although most of these approaches provide good results, the major sources of error are often difficult to understand. The errors may come either from incorrect 2D pose estimation or from the incorrect mapping of features in 2D to 3D. In this work, we aim to understand the sources of error in estimating 3D pose from images and videos. To that end, we have built three different systems. The first takes the 2D joint locations of every frame individually as input and predicts 3D joint positions. To our surprise, we found that with a simple feed-forward fully connected network with residual connections, the ground truth 2D joint locations can be mapped to 3D space at a remarkably low error rate, outperforming the best reported result by almost 30% on Human 3.6M, the largest publicly available dataset of motion capture data. Furthermore, training this network on the outputs of an off-the-shelf 2D pose detector gives us state-of-the-art results when compared with a vast array of systems trained end-to-end. To validate the efficacy of this network, we also trained an end-to-end system that takes an image as input and regresses 3D pose directly. We found that it is harder to train the network end-to-end than to decouple the task. To examine whether temporal information over a sequence improves results, we built a sequence-to-sequence network that takes a sequence of 2D poses as input and predicts a sequence of 3D poses as output. We found that the temporal information improves the results of our first system. We argue that a large portion of the error of 3D pose estimation systems results from error in 2D pose estimation.

Lay Summary

Estimating human pose in 3D from images and videos has multiple applications in computer vision, robotics and graphics, such as human action or activity recognition, sports analysis, animation and augmented reality. A major challenge for this task is the lack of training data, because collecting 3D motion capture data is expensive and requires a sophisticated laboratory setup. The task is also very challenging because of the inherent ambiguity of mapping a scene from 2D to 3D. Recent methods for 3D pose estimation tend to leverage deep networks and have produced good results. However, the major sources of error for this task are not well understood. In this thesis, we examine the possible sources of error and design three different networks for this purpose. Two of our networks have produced state-of-the-art results for the 3D pose estimation task.

Preface

This thesis is submitted in partial fulfillment of the requirements for a Master of Science degree in Computer Science. The entire work presented here is original work done by the author, Mir Rayat Imtiaz Hossain, performed under the supervision of Professor James J. Little.
A version of this work has been accepted to be published as:

• J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3d human pose estimation. In IEEE International Conference on Computer Vision (ICCV), October 2017 [71]

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
  1.1 Problem Definition
    1.1.1 Scope
    1.1.2 Data
  1.2 Method Outline
  1.3 Thesis Organization
2 Related Work
  2.1 Representation of 3D pose
  2.2 Approaches to 3D Pose estimation
    2.2.1 3D Pose estimation by extracting features from single image
    2.2.2 Using features to look up in a database of exemplar 3D poses
    2.2.3 Deep network trained end-to-end
    2.2.4 3D Pose Estimation from 2D pose
    2.2.5 Exploiting temporal information
    2.2.6 Exploiting multiple views
    2.2.7 Exploiting depth information
  2.3 2D pose estimation techniques
  2.4 Deep Networks
    2.4.1 Biological motivation
    2.4.2 History of Neural Networks
    2.4.3 Convolutional Neural Networks
    2.4.4 Recurrent Neural Networks
3 3D pose from 2D pose
  3.1 Loss Function
  3.2 Network design
    3.2.1 Mapping 2D pose to 3D
    3.2.2 Fully connected layers with ReLU activation
    3.2.3 Residual or shortcut connections
    3.2.4 Regularization with batch normalization, dropout and max-norm constraint
  3.3 Data Preprocessing
    3.3.1 Camera coordinate frame
    3.3.2 2D detections
    3.3.3 Training details
  3.4 Experimental evaluation
    3.4.1 Quantitative results
    3.4.2 Qualitative results
    3.4.3 Discussion of results
4 End-to-end model
  4.1 Stacked hourglass module
  4.2 Pre-training stacked-hourglass model
  4.3 Training end-to-end
    4.3.1 Loss Function
    4.3.2 Data Preprocessing
    4.3.3 Training Details
  4.4 Experimental evaluation
    4.4.1 Quantitative results
    4.4.2 Qualitative results
    4.4.3 Discussion of results
5 Exploiting temporal information
  5.1 Network design
    5.1.1 Sequence-to-sequence network with residual connections
    5.1.2 Layer Normalization
    5.1.3 Recurrent Dropout
    5.1.4 Temporal smoothness constraint
    5.1.5 Loss function
  5.2 Data Preprocessing
    5.2.1 Training details
  5.3 Experimental evaluation
    5.3.1 Quantitative results
    5.3.2 Qualitative results
    5.3.3 Discussion of results
6 Conclusion and future work
  6.1 Future directions
  6.2 Conclusion
Bibliography

List of Tables

Table 3.1  Results showing errors action-wise on Human3.6M [51] under Protocol #1 (no rigid alignment or similarity transform applied in post-processing). SH indicates that we trained and tested our model with the detections of the Stacked Hourglass [80] model pre-trained on the MPII dataset [5] as input, and FT indicates that the model was fine-tuned on Human3.6M. GT detections denotes that the ground truth 2D locations were used. SA indicates that a model was trained for each action, and MA indicates that a single model was trained for all actions.

Table 3.2  Results showing errors action-wise on the Human3.6M [51] dataset under protocol #2 (rigid alignment in post-processing). The 14j annotation indicates that the body model considers 14 body joints, while 17j means 17 body joints. (SA) indicates a per-action model while (MA) indicates a single model used for all actions. FT indicates that the stacked-hourglass model has been fine-tuned on the Human3.6M dataset. The results of the methods are obtained from the original papers, except for (*), which were obtained from [16].

Table 3.3  Results on the HumanEva [105] dataset, and comparison with previous methods.

Table 3.4  Performance of our system on the Human3.6M [51] dataset under protocol #2 under different levels of additive Gaussian noise and noise from 2D pose estimators. (Top) Training using ground truth 2D pose and testing on ground truth 2D plus different levels of additive Gaussian noise. (Bottom) Training on ground truth 2D pose and testing on the noisy outputs of a 2D pose estimator. Note that the size of the cropped region around the person is 440×440.

Table 3.5  Ablative and hyperparameter sensitivity analysis.

Table 4.1  Results showing Mean Per Joint Error over all actions on the Human3.6M [51] dataset under protocol #1 (left column) and #2 (right column) respectively. SH indicates 2D pose detections obtained from the stacked-hourglass module [80] trained on the MPII [5] dataset, and FT indicates that the model was fine-tuned on the Human3.6M dataset [51]. The results of the methods are obtained from the original papers, except for (*), which were obtained from [16].

Table 5.1  Results showing errors action-wise on Human3.6M [51] under Protocol #1 (no rigid alignment or similarity transform applied in post-processing). Note that our results reported here are for sequences of length 5. SH indicates that we trained and tested our model with the detections of the Stacked Hourglass [80] model pre-trained on the MPII dataset [5] as input, and FT indicates that the stacked-hourglass model was fine-tuned on Human3.6M. SA indicates that a model was trained for each action, and MA indicates that a single model was trained for all actions. The bold-faced numbers mean the best result while underlined numbers represent the second best.

Table 5.2  Results showing errors action-wise on the Human3.6M [51] dataset under protocol #2 (rigid alignment in post-processing). Note that the results reported here are for sequences of length 5. The 14j annotation indicates that the body model considers 14 body joints, while 17j means 17 body joints. (SA) indicates a per-action model while (MA) indicates a single model used for all actions. FT indicates that the stacked-hourglass model has been fine-tuned on the Human3.6M dataset. The bold-faced numbers mean the best result while underlined numbers represent the second best. The results of the methods are obtained from the original papers, except for (*), which were obtained from [16].

Table 5.3  Performance of our system trained with the ground truth 2D pose of the Human3.6M [51] dataset and tested under different levels of additive Gaussian noise (Top) and on 2D pose predictions from the stacked-hourglass [80] pose detector (Bottom), under protocol #2. The size of the cropped region around the person is 440×440.

Table 5.4  Ablative and hyperparameter sensitivity analysis.

List of Figures

Figure 1.1  A two step approach to 3D human pose estimation. a) A frame from the input video. b) The input frame with the 2D pose estimate superimposed. c) 3D pose estimate corresponding to the input frame. Although it is hard to obtain training data that maps the input frame to 3D pose, we can decompose the challenge into two tasks.

Figure 1.2  Example of 3D pose estimation. The 2D pose is overlaid on the image of the person. The corresponding 3D pose is shown at the bottom.

Figure 1.3  (a) 2D position of joints, (b) Different 3D pose interpretations of the same 2D pose. Blue points represent the ground truth 3D pose while the black points indicate other possible 3D interpretations. All these 3D poses project to exactly the same 2D pose.

Figure 1.4  An example of data in the Human 3.6M dataset, from left to right: RGB image, person silhouette, time-of-flight (depth) data, 3D pose data (shown using a synthetic graphics model), body surface scan. Source: [51].

Figure 1.5  Sample images from the Human 3.6M dataset, showing different subjects, poses and viewing angles. Source: [51].

Figure 1.6  Block diagram of our first system. The building block of our network, which we call the Residual Block, is composed of a linear layer followed by batch normalization, ReLU activation and a dropout layer, repeated twice and wrapped in a residual connection. The Residual Block can be repeated any number of times. Our best network uses two such residual blocks. The input to our system is an array of 2D joint positions, and the output is a series of joint positions in 3D.

Figure 1.7  Our second model simply stacks our first model on top of the stacked-hourglass [80] 2D pose estimator. The stacked-hourglass network is first pre-trained for 2D pose estimation using images from the Human3.6M dataset [51]. The heatmap of the final hourglass is passed as an input to our residual block and the entire network is trained end-to-end.

Figure 1.8  Our final network. It is a sequence-to-sequence network [113] with residual connections on the decoder side. The encoder encodes the information of a sequence of 2D poses of length t in its final hidden state. The final hidden state of the encoder is used to initialize the hidden state of the decoder. The <START> symbol tells the decoder to start predicting 3D pose from the last hidden state of the encoder. Note that the input sequence is reversed, as suggested by Sutskever et al. [113]. The decoder essentially learns to predict the 3D pose at time t given the 3D pose at time t−1. The residual connections help the decoder learn the perturbation from the previous time step.

Figure 2.1  (Left) A sample skeleton model with 17 joints, each of them labeled. (Right) A kinematic tree showing the kinematic relationship between the joints. The downward arrows indicate a parent-child relationship between two joints.

Figure 2.2  A fully connected neural network consisting of an input layer, one hidden layer and an output layer. The connections between the neurons are shown with arrows. Each connection has a particular weight which is learned over time from training data using backpropagation. Each neuron also has an activation function which defines a threshold for the neuron to fire.

Figure 2.3  A convolutional layer having a depth column of 5, i.e. 5 neurons are connected to the same spatial region, and a filter size or receptive field size of 5×5.

Figure 2.4  A 2×2 max-pooling layer with a stride of 2.

Figure 2.5  An RNN unrolled into a full network.

Figure 2.6  (Left) Diagram of a simple RNN unit. (Right) Diagram showing an LSTM block.

Figure 3.1  Example of output on the test set of the Human3.6M dataset. (Left) 2D pose, (Middle) 3D ground truth pose in red and blue, (Right) our 3D pose estimations in green and purple.

Figure 3.2  Qualitative results on the MPII [5] test set. Observed image, followed by 2D pose detection using Stacked Hourglass [80] and (in green) our 3D pose estimation. The bottom 3 examples show typical failure cases, where the 2D detector has failed either totally (left) or marginally (right). In the middle column of the last row, the 2D detector does a good job of estimating the 2D pose, but the person is facing upside-down. The Human3.6M dataset does not provide any corresponding poses which are oriented upside-down. However, our network still seems to predict a meaningful pose, although the orientation is reversed vertically.

Figure 4.1  Example of output on the test images of the Human3.6M dataset. (Left) Image, (Middle) 3D ground truth pose in red and blue, (Right) our 3D pose estimations in green and purple.

Figure 5.1  Mean Per Joint Error (MPJE) in mm of our network for different sequence lengths. SH Pre-trained indicates that 2D poses are estimated using the stacked-hourglass model pre-trained on MPII [5], while SH FT indicates that the detections were obtained with the stacked-hourglass model fine-tuned by us on the Human3.6M dataset.

Figure 5.2  Qualitative result of Subject 11, action sitting down, for the Human3.6M dataset [51]. (Left) Image with 2D pose, (Middle) 3D ground truth pose in red and blue, (Right) 3D pose estimations in green and purple.

Figure 5.3  Qualitative result of Subject 9, action phoning, for the Human3.6M dataset [51]. (Left) Image with 2D pose, (Middle) 3D ground truth pose in red and blue, (Right) 3D pose estimations in green and purple.

Figure 5.4  Qualitative result of Subject 11, action taking photo, for the Human3.6M dataset [51]. (Left) Images with 2D pose detections, (Middle) 3D ground truth pose in red and blue, (Right) 3D pose estimations in green and purple.

Figure 5.5  Qualitative results on YouTube videos. (Left) Images with 2D pose detections, (Right) our 3D pose estimation.

Figure 5.6  Qualitative results on YouTube videos. (Left) Images with 2D pose detections, (Right) our 3D pose estimation.

Figure 5.7  Qualitative results on YouTube videos. (Left) Images with 2D pose detections, (Right) our 3D pose estimation.

Acknowledgments

I would like to extend my heartfelt gratitude to a number of people and organizations for providing me continuous academic, financial and mental support during my Master's.

First I would like to thank my supervisor, Prof. James J. Little. Not only is he a great academician and mentor, he is one of the nicest and most generous people that I have come across in my life. He has always encouraged me to explore new ideas and problems and appreciated my efforts throughout my program. He provided me vital feedback, advice and insights whenever I got stuck with any problem. He kept on giving me moral support and motivation to work hard and get the best out of my thesis. Thank you Prof. Jim Little, I will always be grateful for all that I have learned from you. I would also like to thank Prof. Leonid Sigal for taking his time out and agreeing to be the second reader of my thesis.

Next I would like to thank Julieta Martinez, my lab-mate, who has always helped me with great ideas. I had the privilege of collaborating with her for the first part of my thesis and have learned a lot from her.
I would also like to thank Javier Romero at the MPI Institute for agreeing to collaborate with us on the first part of my thesis. I would like to extend my gratitude to my former lab mate Ankur Gupta, for initially motivating me with the problem of 3D pose, and to my other lab mates Jimmy, Moumita and Lili for being such great colleagues and for being so nice, kind and helpful.

I would also like to thank the Department of Computer Science of the University of British Columbia (UBC) for giving me the honor and opportunity to be a part of their prestigious alumni and for financially supporting me as a teaching assistant. I thoroughly enjoyed my experience here. I have had the privilege to learn and gain knowledge from some amazing instructors during my coursework. I would like to extend my gratitude to all my course instructors. This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). A big thanks to them as well.

Last but not least, I will forever be grateful and indebted to my parents, my younger sister and my lovely wife-to-be, Farozaan, for their unconditional love and support. They have always given me confidence, motivation and hope in times of despair. Especially, I would like to thank my mom for her countless sacrifices and for all her prayers.

Chapter 1: Introduction

Most existing representations of humans are two dimensional, e.g. video, images or paintings. However, all the objects that we see in front of us are three dimensional. What we essentially see with our eyes are images of these objects projected onto our retinas. The phenomenon of projecting 3D space onto a 2D plane is known as perspective projection. It is from the sense of perspective that humans estimate the depth of things in front of them, thereby knowing which objects are closer to them than others. This makes humans very adept at understanding complex spatial arrangements of objects in a scene, even in the presence of depth ambiguities. Therefore, such two dimensional representations have played a crucial role in conveying facts, ideas and feelings to other people. In many computer vision and robotics applications, such as virtual and augmented reality and autonomous driving, the ability to perform spatial reasoning about objects in a scene is crucial. Poor understanding of spatial arrangement and depth can seriously limit the performance of computer vision algorithms. In this thesis, we concentrate on a particular instance of depth and spatial understanding: estimating 3D human pose from images and videos.

Estimating human pose in 3D from 2D representations is a challenging and active research area in the computer vision and graphics community. An understanding of human posture and limb articulation is important for higher level computer vision tasks such as human action or activity recognition, sports analysis, and augmented and virtual reality. A 2D representation of human pose can be used for these tasks. However, 2D poses are inherently ambiguous, because an arbitrary camera viewpoint can make totally different poses look similar (see Figure 1.3). Moreover, 2D human poses can often be confusing because of occlusion of one body part by another. A 3D representation of human pose is free from such ambiguities and hence can improve performance on higher level tasks. Moreover, the 3D pose can be very useful in computer animation, where the articulated pose of a person in 3D can be used to accurately model human posture and movement. But one of the biggest challenges is the lack of abundant data for the task of 3D pose, particularly for images in the wild. Collecting data for 3D pose estimation is expensive and requires a complex laboratory setup.

Over the years, a number of different techniques have been used to address the problem of 3D pose estimation from images and videos. In order to go from an image to a 3D pose, an algorithm has to be invariant to a number of factors, including background scenes, lighting, clothing shape and texture, skin color and image imperfections, among others. Before the advent of deep networks, most approaches tended to use hand-engineered features, such as silhouettes [1], shape context [77], SIFT [68] descriptors [15] or edge direction histograms [103], to learn a model that can estimate 3D poses from images. Most of these features have the desired invariance properties. Another stream of work predicts 3D poses by querying a database of exemplars [22, 41, 53, 78, 128]. Some work tries to predict 3D poses given the 2D poses from an image by estimating the camera parameters of a weak perspective projection equation; the 3D human pose is represented as a sparse combination of a set of basis poses which is learned separately [2, 92, 134]. Another group of work tries to exploit temporal consistency over multiple frames [4, 65, 81, 117, 134].

With the recent success of deep learning in the area of computer vision, many systems have tried to exploit the powerful discriminative ability of deep networks to directly estimate 3D poses from RGB images by training the architecture end-to-end [63, 65, 73, 74, 81, 85, 87, 112, 116, 118, 133]. Some other systems have argued that 3D reasoning from monocular images can be achieved by training on synthetic data [96, 120]. Most computer vision systems based on deep networks currently outperform the traditional approaches on tasks like object classification and localization [44, 57, 93, 114] and 2D pose estimation [19, 45, 80, 123]. However, deep learning methods require a huge amount of data to perform well. Unlike object classification or 2D pose estimation, which have abundant data, there is a lack of ground truth 3D human pose data for images in the wild. This makes the task of inferring 3D poses directly from images very challenging. Although some end-to-end systems for 3D pose estimation have remarkably good results compared to the older techniques, the primary sources of error in such systems are not well studied or understood. It is not clear whether the error comes from erroneous 2D human pose detection, due to occlusion by the person's own body or other objects, motion blur or other imaging artifacts, or from the incorrect mapping of the features from a 2D representation to 3D pose. Therefore, in this work, we analyze the possible sources of error in 3D pose estimation by decoupling the 3D pose estimation task into the well studied problems of 2D pose estimation [80, 123] and 3D pose estimation from 2D joint detections, focusing on the latter. Through decoupling we can exploit any of the existing 2D pose estimation systems, which already provide invariance to factors like background scenes, lighting, clothing shape and texture, and skin color. We can also train a deep network based model for 2D-to-3D pose mapping with large databases of 3D motion capture (mo-cap) data captured in controlled environments in research labs. The idea of decoupling the task of 3D pose estimation is illustrated in Figure 1.1.

Figure 1.1: A two step approach to 3D human pose estimation. a) A frame from the input video. b) The input frame with the 2D pose estimate superimposed. c) 3D pose estimate corresponding to the input frame. Although it is hard to obtain training data that maps the input frame to 3D pose, we can decompose the challenge into two tasks.

To validate the efficacy of decoupling the task of 3D pose estimation, and thereby analyze the sources of error, we have designed three different network systems. The first of these systems is based on a simple fully connected feed-forward network with residual connections (Figure 1.6). The input to the system is the normalized 2D joint positions of a single frame. The task of this network is to backproject joint locations in 2D to 3D in the camera coordinate frame. To our surprise, such a simple network architecture backprojects the ground truth 2D positions to 3D with an error rate that improves on the state-of-the-art by almost 30% on Human 3.6M, the largest publicly available dataset of motion capture (mo-cap) data recorded in a controlled lab environment. When trained on the noisy output of a recent 2D pose detector [80], our system also outperforms the state-of-the-art for 3D pose estimation, a large number of which are trained end-to-end to predict 3D pose directly from the raw pixels of an image. Our second system is an end-to-end network that takes an RGB image as input and regresses 3D pose. The system adds our fully connected network on top of the 2D pose detection network by Newell et al. [80], which they named stacked-hourglass. The 2D pose detection network by Newell et al. outputs a probability heatmap for each joint, indicating the probability that the joint is present at a particular location in the 2D image. The stacked hourglass network is first pre-trained for the 2D pose estimation task. The 2D heatmaps from the stacked-hourglass network are then fed into the fully connected network and the whole model is trained end-to-end. However, the performance of this system is worse than that of the decoupled system, suggesting that it is more difficult to train such a system end-to-end. Our final system attempts to exploit the temporal information over a sequence of images. We wanted to examine whether adding temporal information during training helps to improve the result obtained from our first network. This system is also based on the idea of decoupling. It is a sequence-to-sequence network [113], which reads through a sequence of 2D poses and then predicts a sequence of 3D poses. The sequence-to-sequence network we developed also has residual connections on the decoder side. Since we deal with a sequence of frames together in the sequence-to-sequence network, it is also easy to impose a temporal smoothness constraint during training. We found that incorporating temporal information improves the error by about 17.5% over our initial system.

There are several contributions of this work. The first is the design and analysis of a simple network that performs better than the state-of-the-art, is fast (a forward pass takes around 3 ms on a batch of size 64, allowing us to process as many as 300 fps in batch mode) and is robust to noise. The primary reason for the improvement in performance is a collection of simple ideas, such as estimating the 3D joint locations in the camera coordinate frame, using residual connections and using batch normalization. Secondly, we have shown empirically that lifting 2D poses to 3D, although still far from being solved, is a much easier task than previously thought, particularly when compared against systems which predict 3D pose from an image directly. This is evidenced by the fact that our simplest network significantly outperforms previous systems on 3D pose estimation when we use noise-free ground truth 2D poses from the Human 3.6M dataset or when we fine-tune the 2D pose detector on Human 3.6M. Finally, we also showed that the results can be improved even further by using temporal information and adding a temporal smoothness constraint during the training phase, through a simple sequence-to-sequence network with residual connections.

From the findings mentioned above, we suggest that the major issue inhibiting the performance of recent 3D pose estimation systems, particularly the ones trained end-to-end from raw images, is the lack of proper visual parsing of articulated human bodies in 2D RGB images. Therefore, as a future research direction, we suggest putting more focus on obtaining better accuracy in estimating the 2D articulated pose of humans from images. In what follows, we define the problem, discuss the challenges of the task and the limitations of our systems, describe our data, and give a brief outline of our networks.

Figure 1.2: Example of 3D pose estimation. The 2D pose is overlaid on the image of the person. The corresponding 3D pose is shown at the bottom.

1.1 Problem Definition

The problem we address in this work is estimating 3D human pose from monocular images or sequences of images. More formally, given an image or a sequence of images, a 2-dimensional representation of a human being, 3D pose estimation is the task of producing a 3-dimensional stick figure that matches the spatial positions of certain keypoints or joints of the depicted person (see Figure 1.2). In this work, we concentrate in particular on lifting 2D poses detected by an off-the-shelf 2D pose detector into 3D poses. Here are some desired properties of the solution:

• anthropomorphic correctness of the recovered pose
• recovered 3D keypoints must be accurate in 3D Cartesian geometry
• ability to deal with arbitrary viewpoints of the camera
• accurate recovery of the pose in the correct viewpoint without any similarity transform
• robustness to noisy poses from an off-the-shelf 2D pose detector

Figure 1.3: (a) 2D position of joints, (b) Different 3D pose interpretations of the same 2D pose. Blue points represent the ground truth 3D pose while the black points indicate other possible 3D interpretations. All these 3D poses project to exactly the same 2D pose.

The task of 3D human pose estimation is inherently difficult, because any 3D object can be projected onto a 2D plane in an infinite number of ways, depending on the position of the camera and its intrinsic parameters.
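To make this ambiguity concrete, the following is a minimal sketch in Python/NumPy (not from the thesis; the intrinsic matrix and point coordinates are made up) showing that two different 3D joint positions can land on exactly the same pixel:

```python
# Minimal NumPy sketch (not from the thesis): two different 3D points in the camera
# frame that project to exactly the same pixel under a pinhole camera with made-up
# intrinsics. Perspective division removes the depth coordinate.
import numpy as np

K = np.array([[1000.0,    0.0, 320.0],
              [   0.0, 1000.0, 240.0],
              [   0.0,    0.0,   1.0]])   # hypothetical camera intrinsic matrix

def project(X):
    """Project a 3D point in the camera coordinate frame to pixel coordinates."""
    x = K @ X
    return x[:2] / x[2]

P1 = np.array([0.2, 0.1, 2.0])   # a joint two metres from the camera
P2 = 1.5 * P1                    # the same joint pushed 50% further along its viewing ray

print(project(P1), project(P2))  # prints the same pixel twice: [420. 290.] [420. 290.]
```

Every point along the viewing ray through the camera centre produces the same projection, which is exactly the ambiguity illustrated in Figure 1.3.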
Therefore, back-projecting the 2D points of any object to their 3D representation is a difficult task, since the problem is ill-defined and the mapping is not one to one (see Figure 1.3). Obtaining 3D pose datasets is also difficult because, unlike 2D pose datasets, where users can manually label the keypoints or key joints with mouse clicks, 3D pose datasets require a complex laboratory setup with motion capture sensors and cameras. Not only does this make data collection expensive, it is also difficult to replicate a similar setup outdoors or for images in the wild. Hence, there is a lack of 3D pose datasets for images in the wild. Additionally, the task of visually understanding human bodies is itself difficult because of visual ambiguities like foreshortening or occlusion of certain body parts of a person by other body parts or by objects. Even for humans, the task of reliably estimating the 3D pose is very challenging. Marinoiu et al. [70] carried out an experiment to investigate how people perceive the 3D pose in the image space and how they map this perception to 3D space. They found that humans do not significantly outperform existing computer vision approaches at reconstructing the pose in 3D space given an image captured in a laboratory setup. All these factors combined make the problem of 3D pose estimation from images very challenging.

1.1.1 Scope

We have limited the scope of our problem by making some assumptions. One of these limitations is that our system works for a single person only. If there are multiple people, it can deal with the 2D poses of at most one person. Another key assumption is that the image must contain the full body of the person. Currently, our system cannot deal with images in which only half of a person's body is visible. Another limitation of our approach is that the detected 3D pose is not invariant to scale. Therefore the skeleton size may vary based on the size of the person in the image. Below we discuss some other assumptions that we made.

3D pose relative to the root

Our system predicts the 3D location of each of the keypoints with respect to the root node, which in the case of the Human 3.6M dataset is the hip. By doing so, we are more concerned with how far apart the joints are distributed around the hip. Our focus is thus to retrieve the human pose as anthropomorphically correctly as possible, ensuring that the joints do not extend beyond their usual limits; that is, to predict the correct structure of the human pose in 3D. By predicting the 3D pose relative to the root node, we are not able to locate the absolute global position of the person in 3D.

3D pose in camera coordinate frame

Instead of predicting the 3D poses in a global coordinate space, we predict the 3D pose in the camera coordinate frame. It is very difficult for an algorithm to infer 3D joint positions in a particular global coordinate space, because a rigid body transformation of such a space does not result in any change in the input data. Therefore, the mapping from 2D joint locations, which depend on the camera viewpoint, to 3D is no longer unique in such cases. Hence, to make the prediction more consistent across different camera viewpoints, we predict the 3D joint locations in the camera coordinate frame. It also makes the learning process easier and prevents overfitting to a particular global frame.

1.1.2 Data

For quantitative analysis of our systems we used the Human 3.6M dataset [20, 51] and the HumanEva dataset [105]. However, we used the HumanEva dataset for the first system only, because the dataset is quite old and small compared to Human3.6M. Moreover, the same subjects show up in the train and test sets. But HumanEva has largely been used by the community to benchmark previous work over the last decade. For qualitative results we used the MPII dataset [5], which is a standard dataset for 2D pose estimation and does not have ground truth 3D poses.

Human3.6M [20, 51] is, to the best of our knowledge, currently the largest publicly available dataset for human 3D pose estimation. The dataset consists of 3.6 million images, featuring 7 professional actors performing 15 everyday activities such as walking, eating, sitting and making a phone call. The dataset provides 2D and 3D joint locations for each corresponding image. Each action is captured using 4 different calibrated high resolution cameras. It also has 10 different motion capture cameras and 1 time-of-flight sensor to accurately capture the motion of the actors in 3D. In addition to the 2D and 3D pose ground truth, the dataset also provides ground truth bounding boxes, the camera parameters, the body proportions of all the actors and high resolution body scans or meshes of each actor. Figure 1.4 shows an example of the data in the Human 3.6M dataset, while Figure 1.5 shows sample images from Human 3.6M, indicating the variation of the images in terms of subject, action and viewpoint.

Figure 1.4: An example of data in the Human 3.6M dataset, from left to right: RGB image, person silhouette, time-of-flight (depth) data, 3D pose data (shown using a synthetic graphics model), body surface scan. Source: [51].

Figure 1.5: Sample images from the Human 3.6M dataset, showing different subjects, poses and viewing angles. Source: [51].

On the other hand, MPII is a state-of-the-art benchmark dataset for the evaluation of 2D human pose estimation. The dataset consists of 25K images collected from YouTube videos. It contains over 40K people with annotated body joint locations.

1.2 Method Outline

In this work, we aim to analyze the sources of error in the task of 3D pose estimation. We would like to determine whether the major source of error is poor visual understanding of human pose or improper mapping from a 2-dimensional representation to 3D. We do so by developing three different systems, two of which decouple the task of 3D pose estimation and thereby predict 3D pose from 2D joint locations, while the other is trained end-to-end. Below we give a brief outline of each of the three systems.

3D Pose from 2D joint locations of a single image

Our first system takes the 2D joint locations from a single frame as input. The input to the system is simply the xy-pixel locations of a set of joints or keypoints, and the output is the 3D location of the joints in mm space with respect to a root joint, in the camera coordinate frame.

Figure 1.6: Block diagram of our first system. The building block of our network, which we call the Residual Block, is composed of a linear layer, followed by batch normalization, ReLU activation and a dropout layer, repeated twice and wrapped in a residual connection. The Residual Block can be repeated any number of times. Our best network uses two such residual blocks. The input to our system is an array of 2D joint positions, and the output is a series of joint positions in 3D.
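As a concrete reference for the block in Figure 1.6, the following is a minimal sketch in PyTorch (an assumption, since the thesis does not name a framework; the joint count of 16 is illustrative, while the width of 1024, the dropout rate of 0.5 and the two stacked blocks follow the figure):

```python
# A minimal PyTorch sketch of the block in Figure 1.6. Illustrative only: the thesis does
# not name a framework, and the joint count of 16 is an assumption.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Linear -> BatchNorm -> ReLU -> Dropout, repeated twice, wrapped in a skip connection."""
    def __init__(self, width=1024, p_drop=0.5):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(width, width), nn.BatchNorm1d(width), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(width, width), nn.BatchNorm1d(width), nn.ReLU(), nn.Dropout(p_drop),
        )

    def forward(self, x):
        return x + self.layers(x)   # residual (shortcut) connection

class Lifter2Dto3D(nn.Module):
    """Maps a flattened array of 2D joint positions to 3D joint positions."""
    def __init__(self, n_joints=16, width=1024, n_blocks=2):
        super().__init__()
        self.inp = nn.Linear(2 * n_joints, width)
        self.blocks = nn.Sequential(*[ResidualBlock(width) for _ in range(n_blocks)])
        self.out = nn.Linear(width, 3 * n_joints)

    def forward(self, pose2d):      # pose2d: (batch, 2 * n_joints)
        return self.out(self.blocks(self.inp(pose2d)))

pose3d = Lifter2Dto3D()(torch.randn(64, 32))   # -> (64, 48), i.e. 16 joints in 3D
```

The depth of the network is controlled simply by how many residual blocks are stacked between the input and output linear layers.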
Since we are dealing with a low dimensional, highly abstracted form of data, our choice of network is a multilayered fully connected network with residual or shortcut connections [44], dropout [110] and batch normalization layers [50]. Rectified Linear Units (ReLU) [79] are used as the activation function of the network. A block diagram of our first system is shown in Figure 1.6. The 2D poses given as input may come from the ground truth or may be the output of any off-the-shelf 2D pose detector.

3D Pose from a single image directly

Our second network aims to test the effectiveness of decoupling. This time we train our network end-to-end to predict 3D pose directly from a single RGB image. Our network is built upon the 2D pose detection network called stacked hourglass by Newell et al. [80]. The stacked hourglass network is a collection of hourglass networks, each of which is a fully convolutional network. We overlaid the Residual Block of our first network on top of the network by Newell et al. [80]. The hourglass part of the network is first trained for the 2D pose estimation task using images from the Human3.6M dataset [20, 51]. We performed intermediate 2D pose supervision at the end of each hourglass while training the network for 2D pose. The output of each hourglass is a heatmap for each joint. The heatmap from the last hourglass is then fed as input to our residual block to output the 3D pose. We use the weights of the pre-trained network to initialize the hourglass part and fine-tune the entire network end-to-end to predict 3D pose. During the fine-tuning step, the network is supervised both by the 3D pose and by the 2D joint heatmaps. We use the intuition of transfer learning [83] here: we try to reuse the knowledge learned by the 2D pose detector for the task of 3D pose estimation. However, empirically we found that it is difficult to train such a system end-to-end. Figure 1.7 shows our second system.

Figure 1.7: Our second model simply stacks our first model on top of the stacked-hourglass [80] 2D pose estimator. The stacked-hourglass network is first pre-trained for 2D pose estimation using images from the Human3.6M dataset [51]. The heatmap of the final hourglass is passed as an input to our residual block and the entire network is trained end-to-end.

3D Pose from 2D joint locations of a sequence of images

Our third network tries to determine whether adding temporal information from a sequence of 2D poses gives better results than the first network. For this purpose, we designed a sequence-to-sequence network [113] with Long Short-Term Memory (LSTM) blocks [48] as the building block. We also added layer normalization [6] and recurrent dropout [101] to our LSTMs. Sequence-to-sequence networks are extensively used in tasks like neural machine translation, i.e. translating a sentence from one language to another, and are hence useful for tasks where the input is a sequence of data of one type and the output is a sequence of data of a different type. Each sequence-to-sequence network has an encoder and a decoder. In our case, the encoder reads a sequence of 2D poses and encodes it into a fixed-size vector, while the decoder reads the encoded vector and predicts a sequence of 3D poses. Our network has residual connections on the decoder side. The encoder effectively encodes the sequence of 2D pose information into a fixed-size high dimensional vector, while the decoder essentially learns the perturbation of the pose from the previous frame.
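The following is a minimal PyTorch sketch of this encoder-decoder structure (again an assumption rather than the thesis implementation: a single plain LSTM layer, no layer normalization or recurrent dropout, and a hypothetical joint count), with the decoder output added to the previous 3D pose:

```python
# A minimal PyTorch sketch of the encoder-decoder idea in Figure 1.8. Illustrative only.
# The <START> token is a zero vector, the input sequence is reversed, and the decoder
# adds its prediction to the previous 3D pose (the residual connection).
import torch
import torch.nn as nn

class Seq2SeqLifter(nn.Module):
    def __init__(self, n_joints=16, hidden=1024):
        super().__init__()
        self.encoder = nn.LSTM(2 * n_joints, hidden, batch_first=True)
        self.decoder = nn.LSTMCell(3 * n_joints, hidden)
        self.readout = nn.Linear(hidden, 3 * n_joints)

    def forward(self, pose2d_seq):                        # (batch, T, 2 * n_joints)
        # Encode the reversed 2D sequence; keep only the final hidden state.
        _, (h, c) = self.encoder(torch.flip(pose2d_seq, dims=[1]))
        h, c = h[0], c[0]
        y = pose2d_seq.new_zeros(pose2d_seq.size(0), self.readout.out_features)  # <START>
        outputs = []
        for _ in range(pose2d_seq.size(1)):
            h, c = self.decoder(y, (h, c))
            y = y + self.readout(h)    # predict only the change from the previous 3D pose
            outputs.append(y)
        return torch.stack(outputs, dim=1)                # (batch, T, 3 * n_joints)

pose3d_seq = Seq2SeqLifter()(torch.randn(8, 5, 32))       # sequences of length 5
```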
The residual connections on the decoder side make it easier for the decoder to predict 3D poses, since it only has to predict the change in the 3D pose from the previous frame. We also imposed temporal constraints during training to ensure smoother output. Figure 1.8 shows our final network in detail.

Figure 1.8: Our final network. It is a sequence-to-sequence network [113] with residual connections on the decoder side. The encoder encodes the information of a sequence of 2D poses of length t in its final hidden state. The final hidden state of the encoder is used to initialize the hidden state of the decoder. The <START> symbol tells the decoder to start predicting 3D pose from the last hidden state of the encoder. Note that the input sequence is reversed, as suggested by Sutskever et al. [113]. The decoder essentially learns to predict the 3D pose at time t given the 3D pose at time t−1. The residual connections help the decoder learn the perturbation from the previous time step.

1.3 Thesis Organization

We have organized this thesis as follows. First, in Chapter 2, we review the related work and the literature: we discuss and summarize the different approaches and techniques tried over the years for solving the problem of 3D pose estimation, review some of the 2D pose estimation systems, and provide a general idea of the different types of deep networks that we have used in our systems, particularly Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). In Chapter 3, we describe our first network, which takes 2D joint locations as input and gives 3D pose as output; the network is a fully connected network with residual connections. We discuss the network architecture in detail, how we trained the system, and the experiments carried out with this network to demonstrate its effectiveness, along with the results. In Chapter 4, we discuss our second network, which was trained end-to-end from RGB images, together with the results obtained with it. Then, in Chapter 5, we describe our final network, which is our most important contribution. It is a sequence-to-sequence network with residual connections at the decoder that takes a sequence of 2D poses and predicts a sequence of 3D poses. We also include the results obtained from different experiments with this network, which show the effectiveness of using temporal information and of decoupling the task of 3D pose estimation. Finally, in Chapter 6, we highlight our main contributions and discuss possible future directions.

Chapter 2: Related Work

The problem which we are addressing is 3D human pose estimation from RGB images or sequences of images that are in 2D. The problem of perceiving depth from a two dimensional representation has been a subject of avid interest to scientists, mathematicians and artists since the Renaissance, when Brunelleschi used the mathematical concepts of linear perspective to elicit a sense of depth in his paintings of Florentine buildings.

Centuries later, a similar knowledge of perspective has been exploited in computer vision to infer quantities such as lengths, areas and distance ratios in arbitrary scenes [135]. Others have tried to use visual cues like shading [131] or texture [66] to estimate depth from an image. Recently, there has been a trend of using deep learning [30, 67, 88, 99] to estimate depth from an image.

However, one of the initial methods for depth estimation, by Roberts [94], addressed the problem in a different manner. Instead of using the knowledge of perspective or any image features, he exploited the known 3D structures of objects in a scene. Decades later, Bülthoff et al. [17] found that top-down knowledge of a familiar 3D structure is also used by humans when they perceive a human body abstracted into a set of sparse points projected onto a 2D plane. They found that the expectation about the known 3D structure of an object overrides the true stereoscopic information. This idea of being able to reason about or understand 3D human posture from a minimal representation, such as the projection of a sparse set of points on the human body onto a 2D plane, has inspired the problem of estimating 3D pose from 2D joint locations.

We have divided the related work into four different sections. In the first section we discuss different representations of 3D pose. In the second section we look into different methods for 3D pose estimation. In the third section we briefly discuss some of the 2D pose estimation techniques. Finally, we review different types of deep network architectures.

2.1 Representation of 3D pose

There are both model-based and model-free representations of 3D pose. The human body is a very complex system with highly flexible and articulated body parts. Marinoiu et al. [69] carried out experiments to investigate how people perceive human pose in the 3D space of photos and how their perception actually corresponds to the 3D space. They found that even for humans it is difficult to reliably estimate the location of joints in real 3D space given an image or a video. Hence, it is difficult to model the human body. Despite this, researchers have attempted to model the articulated 3D pose in ways that provide some prior on human body structure for an algorithm estimating 3D pose.

The most common model to represent 3D human pose is a skeleton or stick figure. The skeleton is defined by a kinematic tree over a set of joints. The kinematic tree consists of the initial location of the root joint, offsets of each joint from its parent, and rotational parameters for each joint that represent the relative rotation of the joint with respect to its parent [8, 16, 20, 51, 84, 133]. One major advantage of this model is that the resulting poses are forced to have human-like structure. Moreover, it is much easier to impose anthropometric and kinematic constraints like joint angle limits, bone lengths and limb length proportions [125]. Although most joints have 3 degrees of freedom, certain joints, such as the knee, have 1 degree of freedom due to their constrained mobility. Hence it is possible to reduce the overall dimensionality of the rotational parameters that need to be estimated.
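To illustrate how such a kinematic-tree model turns rotational parameters into joint positions, the following is a minimal forward-kinematics sketch in Python/NumPy (a made-up four-joint chain with a single rotation axis per joint; none of the numbers come from the thesis):

```python
# Minimal NumPy sketch (not from the thesis) of forward kinematics on a kinematic tree:
# fixed bone offsets plus per-joint rotations are accumulated down the tree to give 3D
# joint positions. A real skeleton has more joints and up to three degrees of freedom each.
import numpy as np

parent = [-1, 0, 1, 2]                       # hip (root) -> right hip -> knee -> ankle
offset = np.array([[0.00,  0.00, 0.0],       # bone vector of each joint in its parent's frame (m)
                   [0.13,  0.00, 0.0],
                   [0.00, -0.45, 0.0],
                   [0.00, -0.45, 0.0]])

def rot_x(a):                                # rotation about the x axis
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def forward_kinematics(angles, root_position):
    pos, rot = [root_position], [rot_x(angles[0])]
    for j in range(1, len(parent)):
        p = parent[j]
        pos.append(pos[p] + rot[p] @ offset[j])   # place the joint relative to its parent
        rot.append(rot[p] @ rot_x(angles[j]))     # compose the rotation down the chain
    return np.stack(pos)

print(forward_kinematics(np.array([0.0, 0.0, 0.5, 0.3]), np.zeros(3)))
```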
Figure 2.1shows an example skeleton with 17 joints labeled and the corresponding kinematictree for the skeleton.Another approach of modeling the 3D human pose involves learning an over-16HeadNoseNeckLeft ShoulderLeft ElbowLeft WristHip (Root Node)SpineLeft HipLeft KneeLeft AnkleRight HipRight KneeRight AnkleRight ShoulderRight ElbowRight WristHip (Root)Spine Left Hip Right HipLeft Knee Right KneeLeft Ankle Right AnkleNeckNose Left Shoulder Right ShoulderHeadLeft Elbow Right ElbowLeft Wrist Right WristFigure 2.1: (Left) A sample skeleton model with 17 joints each of them la-beled. (Right) A Kinematic tree showing the kinematic relationship be-tween the joints. The arrow downward indicates parent-child relation-ship between two joints.complete dictionary of basis poses by using dimensionality reduction techniquessuch as PCA or non-negative matrix factorization. This approach of modeling 3Dpose was introduced by Ramakrishna et al. [92]. The 3D pose is then computed as asparse linear combination of this over-complete dictionary [2, 132, 134]. However,a major issue about this model is that they can potentially lead to an invalid 3Dpose, because of the lack of any anthropometric constraints. Also, there are anumber of ways the basis poses can be combined to obtain a particular 3D pose.However, most of the recent work based on deep-learning techniques [51, 63,73, 74, 87, 116, 117] typically uses model-free representation of 3D human pose.The 3D human pose is represented as 3D locations of each joint relative to the rootnode or to its parent. Although most of the methods [63, 73, 74, 87, 116, 117] us-ing a model-free representation regress the 3D joint location directly, Pavlakos etal. [87] predicts volumetric heatmap for each joint, which gives the likelihood17of the presence of a joint in a particular 3D spatial location. On the other hand,Mehta et al. [74] predicts x,y,z location maps for each joint which gives the proba-bility of joint being at a particular x,y, or z coordinate individually. The advantageof model free representation is its simplicity and lower dimensionality comparedto model based approaches, which tend to make it work better for deep networksetting. However, because of the lack of a priori knowledge of human body struc-ture, this can lead to an invalid 3D pose, may even fail to predict human structureat all.2.2 Approaches to 3D Pose estimationThere are several streams of work for estimating 3D human pose given an image.The first of these involves extracting features from the image and learning a func-tion to map the features into 3D pose [1, 14, 15, 56, 77, 82, 107, 117]. Anotherstream of work involves using deep networks to predict 3D pose from an imagedirectly by training the network end-to-end. [63, 65, 73, 81, 85, 87, 96, 112, 116,118–120, 133]. Some work uses the 2D human pose from image and learns toback-project these 2D joint locations into 3D [2, 16, 62, 76, 90, 92, 122, 132, 134].The 2D joint locations may either be ground truth or detected from an image us-ing any 2D human pose detector. Some approaches have tried to formulate thetask of 3D pose estimation as a retrieval or similarity search problem. These tech-niques use different image features or 2D pose to lookup into a large databaseof exemplar 3D pose descriptor [22, 41, 53, 78, 103, 128]. Others have tried topredict 3D pose from a sequence of images trying to exploit the temporal infor-mation from the sequence [4, 29, 74, 117, 134]. 
Additionally, some techniquesleverage multiple views from different cameras to estimate the 3D pose, therebymaking the task much easier [3, 10, 18, 31, 86, 106]. Finally, there are a numberof approaches which uses depth images provided by RGB-D camera for 3D poseestimation [7, 102, 104, 124, 129]. The image from RGB-D camera has an extradepth channel giving depth of different objects in the image, along with the RGBchannels. With the added depth information, these methods can estimate 3D posewith a high accuracy in real time. However, the downside of RGB-D cameras isthat they have limited range and do not work well in outdoor settings.18Below we will discuss the different streams of addressing the problem of 3Dhuman pose estimation, mentioned above, in detail.2.2.1 3D Pose estimation by extracting features from single imageMost of the earlier methods of 3D pose estimation from monocular images aimed atextracting discriminative features from images. A good feature for 3D Pose estima-tion should be invariant to lighting, texture, background scenes, human skin coloretc. Agarwal and Triggs [1] encoded image silhouette shapes in a histogram-of-shape-contexts descriptor [11, 12] and used it to recover 3D pose using non-linearregression. Although silhouettes are invariant to texture and lighting, it requiresvery good segmentation of the human in the image. Mori and Malik [77] usedshape context [12] which represents a shape using a set of sample points fromthe contours of an object. They created a database of a number of exemplar 2Dviews of human body, with joints labeled, under different camera configurationand viewpoint. They used shape context matching technique [12] to match a testimage with the exemplar images and used the 2D joint locations from the exem-plar and the test shape to estimate 3D pose using the method of Taylor [115]. Boet al. [15] built an algorithm that makes learning conditional Bayesian Mixture ofExperts models [109] faster and more scalable that can handle one order magni-tude more data and is one order magnitude faster. They combined forward featureselection and bound optimization contrary to backward feature selection used inoriginal work and compared the performance of SIFT [68], histogram of shapecontexts [12] and multi-scale hyper-feature encodings [54]. Similarly, image fea-tures like Histogram-of-gradients (HOG) [25, 68] and HMAX [28] were used byBo et al. [14] to create a Twin Gaussian Process model and use Gaussian ProcessRegression to estimate the 3D Pose. Ning et al. [82] designed an image descriptorof their own called the Appearance and Position Context (APC) descriptor. Theylearned visual bag of words using unsupervised clustering and then jointly learneda distance metric for each visual word and Bayesian mixture of experts model usinglabeled image-to-pose pairs, which is then used to regress 3D pose. Simo-Serra etal. [107] proposed a method to jointly infer 2D pose and 3D pose using a Bayesianmodel which combines generative latent variables constraining the space of all pos-19sible 3D pose with 2D location of joints using HOG-based discriminative model.Kostrikov et al. 
[56] swept along each plane through 3D volume of potential 3Djoint locations and used a regression forest to predict the relative 3D position ofjoint given the hypothesized depth and then use mixture of 3D pictorial structuremodels (PSM) [34] to infer 3D pose in global coordinate space.The major drawback of these methods is that their accuracy is bounded by thediscriminative properties of the features and robustness to different factors. Mostoften, these features are not discriminative enough to give accurate estimation ofdepth. Since the advent of deep networks, feature-based techniques have lost theirpopularity because deep networks can learn sophisticated features which produceexcellent results.2.2.2 Using features to look up in a database of exemplar 3D posesSeveral methods have used the features extracted from the images to find the near-est neighbour pose from a large database of exemplar 3D poses. Shakhnarovich etal. [103] used a shape context feature vector to represent general contour shapesand use the features to learn a set of hashing functions which can be used effi-ciently look up and find the nearest-neighbor pose from a database of 3D poses.The shape context feature vector from an image is also used by Mori and Ma-lik [78] in conjunction with a kinematic chain-based deformation model to matcha stored 2D view of human body with labelled 2D pose. Once they obtain 2D pose,they use Taylor’s method [115] to estimate 3D pose. Jiang [53] also used Taylor’salgorithm [115] to generate all possible 3D pose given the 2D pose of an imagethereby forming a hypothesis pose. They used a kd-tree to find approximate near-est neighbour of these hypothesis pose from a large database of exemplar poses.Gupta et al. [41] create a large database of fixed length 2D tractories called v-trajectories using orthographic projection of unlabelled motion capture data. Theyextract dense trajectories feature from vidoes and match the video trajectories tov-trajectories using Non-linear Circular Temporary Encoding to retrieve appropri-ate motion capture data. Gupta et al. [42] extended their method [41] to retrievea portion of longer mocap sequence and temporally align them with features re-trieved from a short sequence using Dynamic Time Warping(DTW) [89]. Yasin et20al. [128] use two separate training sources. The first source is a large databaseof motion capture data which is projected onto a normalized 2D pose space usingvirtual cameras, while the second source is images with labeled 2D poses whichare used to learn pictorial structure model (PSM) [33] for 2D pose estimation. Thepredicted 2D pose from PSM [33] is used to retrieve the nearest normalized 2Dpose using kd-tree search and the final 3D pose is estimated my minimizing thereprojection error. On the other hand, Chen and Ramanan [22] used a CNN to es-timate the 2D pose from an image and then use the predicted 2D pose to match alibrary of 3D pose to estimate the depth.A major drawback of exemplar-based 3D pose estimation is that the time re-quired to match the correct 3D pose from a large database is quite high. This pro-hibits any real time implementation. Moreover, the performance of these methodslargely depends on the range of poses available in the database. It is also diffi-cult to align the retrieved 3D pose with the actual orientation of the person in theimage [42].2.2.3 Deep network trained end-to-endAs mentioned before, deep networks have become extremely popular in many com-puter vision tasks. 
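Before turning to these end-to-end models, the retrieval step shared by the exemplar-based methods of Section 2.2.2 can be made concrete. The sketch below is only illustrative: the descriptors, the toy exemplar library, and the scipy-based index are placeholder assumptions, not the pipeline of any cited method.

```python
import numpy as np
from scipy.spatial import cKDTree

# Build an index over exemplar poses. In the methods above the query key would
# be a shape-context or 2D-pose descriptor; here a flattened 2D pose stands in.
rng = np.random.default_rng(0)
exemplar_3d = rng.normal(size=(10000, 17, 3))        # toy mocap library of 3D poses
exemplar_2d = exemplar_3d[:, :, :2]                   # their (orthographic) 2D projections
tree = cKDTree(exemplar_2d.reshape(len(exemplar_2d), -1))

def retrieve_3d(query_2d, k=1):
    """Return the 3D exemplar(s) whose stored 2D descriptor is closest to the query."""
    _, idx = tree.query(query_2d.reshape(1, -1), k=k)
    return exemplar_3d[np.atleast_1d(idx).ravel()]

query = exemplar_2d[123] + 0.01 * rng.normal(size=(17, 2))  # a noisy observation
print(retrieve_3d(query).shape)                              # (1, 17, 3)
```

Even with such an index, searching a large exemplar library for every query remains costly, which is one reason learned regression models have largely displaced these methods.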
However, these models require large amount of data to succeed.It is difficult and expensive to collect motion capture data. There is still a lack ofdataset of 3D poses for people in the wild since 3D data acquisition requires spe-cial motion camera with markers and complex laboratory setup. However, since theintroduction of Human3.6M dataset [20, 51], which contains 3.6 million high res-olution images with annotated 2D and 3D data, there are a number of methods thatemploy deep networks being trained end-to-end to predict 3D pose from images.One of the earliest approaches to use deep networks was by Li et al [63]. They pro-posed a convolutional neural network (CNN) [57, 60, 61] that jointly learns to re-gresses 3D human pose and detect body parts in 2D given a monocular image. Thenetwork was initially pre-trained for body parts detection and then jointly trainedfor both tasks. Similar to [63], Park et al. [85] designed a CNN which is jointlytrained for both 3D pose regression and 2D pose estimation. They treated the 2Dpose estimation task as a classification problem for each joint where they divide the21image into n×n grids. Each grid is considered as a class for each joint. They clas-sified each joint as belonging to any of the n2 classes. Tekin et al. [116] first traineda de-noising auto-encoder [121] to learn a high-dimensional latent encoding of 3Dpose. Then they trained a CNN to map the image into latent representation learnedby the auto-encoder. Then they stacked the decoding layers of auto-encoder ontop of the CNN to regress 3D pose and fine-tuned the entire network end-to-end.Tekin et al. followed up their earlier work in [118], where they fuse latent featureslearned from images and their corresponding 2D joint heatmaps. Their networklearns when two fuse the features from the two sources. Mehta et al. [73] usedtransfer learning to transfer the knowledge learned from 2D pose estimation taskfor in-the-wild images to estimate 3D pose. They do so by first training Resnet-101 [44] for 2D pose estimation task and then used the learned weight of up tolevel 5 of ResNet-101 to build a network that outputs 3D joint locations and asan auxiliary task predict 2D heatmaps for each joint. This idea of exploiting 2Dpose ground truth information on in-the-wild images was also adopted by Sun etal. [112]. They modified Resnet-50 [44], pre-trained on ImageNet [57], to predict3D joint locations from both images with and without 3D ground truth. When the3D ground truth is missing, the depth coordinate is set to zero. Zhou et al. [133]designed a CNN which predicts the motion parameters of the kinematic tree of hu-man skeleton and then added a kinematic layer on top of it to convert the motionparameters and skeleton information into 3D joint locations. The loss is defined onthe joint location and since kinematic layer is differentiable, they could train thenetwork end-to-end. Varol et al. [120] argued that a CNN which is trained to pre-dict 3D human pose from synthetic images can effectively and accurately predict3D pose from real images. Likewise, Rogez and Schmid [96] developed a syn-thesis engine which generates synthetic images given real image and use them toaugment the database with more data. Then a CNN is trained on both real and syn-thetic data. Pavlakos et al. [87] also develops an end-to-end CNN based model topredict 3D pose. They extended the popular 2D pose detector by Newell et al. [80]called stacked-hourglass to predict volumetric heatmaps for each joint instead ofpredicting 2D heatmaps. 
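A volumetric heatmap still has to be converted back into joint coordinates. The snippet below is a generic illustration of one common read-out, a normalized expectation or "soft-argmax"; the function name, the coordinate grids and the toy volume are assumptions made for the example, not the actual procedure of [87].

```python
import numpy as np

def joint_from_volumetric_heatmap(heatmap, grid_x, grid_y, grid_z):
    """Recover one 3D joint location from a volumetric heatmap.

    heatmap: array of shape (D, H, W) with non-negative scores for one joint.
    grid_*:  1D arrays giving the coordinate of each voxel centre along
             z (depth, D), y (rows, H) and x (columns, W).
    Returns the expected (x, y, z) location under the normalized heatmap,
    which, unlike a hard argmax, is differentiable.
    """
    p = heatmap / (heatmap.sum() + 1e-9)          # normalize to a distribution
    z = (p.sum(axis=(1, 2)) * grid_z).sum()       # marginal expectation over depth
    y = (p.sum(axis=(0, 2)) * grid_y).sum()       # marginal expectation over rows
    x = (p.sum(axis=(0, 1)) * grid_x).sum()       # marginal expectation over columns
    return np.array([x, y, z])

# Toy usage: a 16x64x64 volume peaked at a single voxel.
D, H, W = 16, 64, 64
vol = np.zeros((D, H, W)); vol[4, 10, 50] = 1.0
xs = np.linspace(-1, 1, W); ys = np.linspace(-1, 1, H); zs = np.linspace(-1, 1, D)
print(joint_from_volumetric_heatmap(vol, xs, ys, zs))
```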
Their method used to be the state-of-the-art before being surpassed by our first and third networks. Tome et al. [119] also used a similar idea of extending a 2D pose estimator to reason in 3D. They extended the Convolutional Pose Machine (CPM) by Wei et al. [123], which iteratively refines 2D poses from the knowledge of the image and the estimate from the previous iteration. Tome et al. [119] modified this architecture by introducing a probabilistic 3D pose layer which lifts the predicted 2D heatmaps to a 3D pose and projects them back into the image plane to generate a set of projected pose heatmaps. The projected 2D heatmaps and the predicted 2D pose heatmaps are then fused together in a fusion layer and passed on to the next stage. The fused heatmap from the final stage is lifted into a 3D pose using the probabilistic 3D pose model, and the entire system is trained end-to-end. Nie et al. [81] separately encoded the ground truth 2D pose and image patches surrounding the joint locations into a skeleton LSTM and a patch LSTM. Both networks have a kinematic tree structure which is broadcast throughout the whole skeleton. They predict the depth by integrating the outputs from the skeleton LSTM and the patch LSTM into another LSTM which predicts the depth of each joint. Lin et al. [65] predict 3D pose from an image directly and refine it in multiple stages using LSTMs [48]. Each stage has a 2D pose module which learns a two-dimensional pose-aware feature map that encodes information about the human body pose. This feature map is passed to a feature adaptation module which gives a high-dimensional common embedding space for 2D and 3D pose. The adapted feature is concatenated with the hidden states of the LSTM and the 3D pose detection from the previous stage, and is passed as input to the LSTM of the current stage to predict the 3D pose in the current refinement stage.

Although most of these systems trained end-to-end from images generate good results for 3D pose, it is not clear whether the error stems from the visual features learned by the network or from the mapping of the 2D pose or features in 2D into 3D pose.

2.2.4 3D Pose Estimation from 2D pose

The task of inferring 3D joint locations from their 2D projections can be traced back to the classic work of Lee and Chen [62]. They showed that, given the bone lengths, the problem boils down to a binary decision tree where each split corresponds to two possible states of a joint with respect to its parent. A common approach to estimating 3D joint locations given the 2D pose is to separate the camera pose variability from the intrinsic deformation of the human body, the latter of which is modeled by learning an overcomplete dictionary of basis 3D poses from a large database of 3D human poses [2, 16, 92, 122, 132, 134]. A valid 3D pose is defined by a sparse linear combination of the bases and by transforming the points using a transformation matrix representing the camera extrinsic parameters:

S = \sum_{i=1}^{k} c_i B_i    (2.1)

Here S ∈ R^{3×p} is a set of 3D locations of p joints, B_i ∈ R^{3×p} is a basis pose and c_i is its corresponding coefficient. There are k bases in total. These approaches model the 3D-to-2D projection as a weak perspective projection, the equation of which is given below:

W = R S + T \mathbf{1}^T    (2.2)

where S ∈ R^{3×p} is the set of 3D locations of the p joints, as given by Eq. 2.1, W ∈ R^{2×p} denotes the 2D pose of the p joints, and R ∈ R^{2×3} and T ∈ R^2 are the camera rotation and translation parameters, respectively.
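As a concrete illustration of Equations 2.1 and 2.2, the sketch below builds a pose from a dictionary and projects it with a weak-perspective camera. The function names and the random toy dictionary are assumptions made for the example; in the cited methods the bases come from PCA over a mocap database.

```python
import numpy as np

def pose_from_basis(c, bases):
    """Eq. 2.1: the 3D pose S (3 x p) as a linear combination of k basis poses."""
    return np.tensordot(c, bases, axes=1)         # sum_i c_i * B_i

def weak_perspective_project(S, R, T):
    """Eq. 2.2: project a 3D pose to 2D with a weak-perspective camera.

    S: (3, p) 3D joint locations, R: (2, 3) rotation/scaling rows, T: (2,) translation.
    Returns W of shape (2, p).
    """
    return R @ S + T[:, None]                     # T is broadcast to every joint

# Toy usage with a random dictionary of k = 10 bases over p = 17 joints.
rng = np.random.default_rng(0)
bases = rng.normal(size=(10, 3, 17))              # B_1 ... B_k
c = rng.normal(size=10)                           # coefficients
R = np.array([[1.0, 0.0, 0.0],                    # orthographic example camera
              [0.0, 1.0, 0.0]])
T = np.array([0.5, -0.2])
W = weak_perspective_project(pose_from_basis(c, bases), R, T)
print(W.shape)                                    # (2, 17)
```

In practice R, T and the coefficients c are all unknown, and recovering them from an observed W is exactly the reprojection-error minimization described next.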
The coefficients of the bases and the camera extrinsic parameters are estimated by minimizing the reprojection error, which is given by the following loss function:

\underset{R, T, c}{\arg\min} \; \Big\| W - R \sum_{i=1}^{k} c_i B_i - T \mathbf{1}^T \Big\|_F^2    (2.3)

Here W ∈ R^{2×p} is the ground truth 2D locations for the p joints, and the rest of the symbols are the same as defined in Eq. 2.1 and 2.2.

Ramakrishna et al. [92] were the first to propose the idea of representing the 3D human pose as a sparse linear combination of bases and estimating the camera parameters and the coefficients of the bases by minimizing the reprojection error function. They obtained the basis poses using PCA on a database of exemplar 3D poses. Wang et al. [122] followed the same vein as [92], but instead of minimizing the L2-norm of the reprojection error, they minimized the L1-norm and imposed limb length constraints on the output pose. Akhter and Black [2] imposed a joint-angle limit constraint for certain joints after estimating the sparse coefficients and the camera extrinsics. Since rotation matrices are restricted to the set SO(3), the resulting objective function is non-convex. Zhou et al. [132] proposed a method to relax certain conditions to approximate convexity for the optimization of the rotation matrix. The method was extended by the same authors [134], where they imposed a temporal smoothness constraint during optimization. They also designed a CNN to predict 2D heatmaps for each joint, giving the likelihood of the presence of the joint at that location. When the ground truth 2D pose is not available, they used the Expectation-Maximization (EM) algorithm [26] to estimate the 3D pose from the detected heatmaps.

Bogo et al. [16] used the 2D joint heatmaps from a CNN-based 2D pose detector to predict both the 3D pose and the 3D shape of the human body. Their body model is defined as a function parameterized by coefficients of a shape prior, pose parameters defined by a kinematic tree model (see Section 2.1) and translation parameters. They minimize five different error terms: a joint-based error defined by the reprojection error under weak perspective projection, three pose priors and a shape prior. Radwan et al. [90] applied a self-occlusion reasoning step over an off-the-shelf 2D pose detector to remove noise in the 2D pose estimates. Then they projected an arbitrary 3D model onto the 2D joints and applied geometric and kinematic constraints to remove ambiguity. They then generated synthetic views using the pose distributions and applied a structure-from-motion step to estimate the appropriate depth. On the other hand, Moreno-Noguer [76] first computed an N×N distance matrix, called a Euclidean Distance Matrix (EDM), from the detected 2D pose, where N is the number of joints. They then designed a CNN-based network to estimate the Euclidean Distance Matrix for the 3D pose, and converted the predicted EDM into 3D joint locations using a Multidimensional Scaling (MDS) approach [13].

Our first and third models are inspired by the idea of decoupling the task of 3D pose estimation into 2D pose estimation using an off-the-shelf 2D pose estimator and then learning a model to map the 2D pose into 3D. We aim to analyze whether the error for 3D pose estimation stems from noisy pose detections or from lifting 2D features to 3D. We observed empirically that decoupling makes the task of 3D pose estimation much easier than training a deep network end-to-end.
We also observed that the task of lifting 2D poses into 3D can be done with very high accuracy given the ground truth 2D pose by using a simple deep network model. We believe it is difficult for a network trained end-to-end to perform well in this case, because it needs to learn to extract image features which are invariant to lighting, texture, background scenes, human skin color, etc., and at the same time lift those features in 2D space to 3D. Moreover, the lack of in-the-wild datasets for 3D pose may be another factor which makes training the networks end-to-end difficult, because of the lack of variation in the scenes.

2.2.5 Exploiting temporal information

Estimating 3D pose per frame may cause jitter because the error in pose estimation for each frame is independent of the others. A natural extension is to estimate the 3D pose over a sequence of images or a monocular video such that the poses look temporally coherent and smooth, i.e. the error is distributed smoothly over the sequence. A number of methods have tried to exploit the temporal information available over a sequence of images to achieve temporal smoothness.

Andriluka et al. [4] exploited temporal information using tracking-by-detection. They first estimated 2D poses for each frame individually. Then they associated the poses across frames using a tracking-by-detection method. The robust estimates of 2D pose over a short sequence were used to recover the 3D pose. Tekin et al. [117] exploited the motion information by first using a CNN to align successive bounding boxes such that the person always remains in the center of the bounding box. Then they concatenated the aligned images and extracted 3D HOG (histogram of gradients) features densely over the spatio-temporal volume, from which they regress the 3D pose of the central frame. They tried different techniques for regressing 3D pose and found a deep network to work best. Du et al. [29] used a height-map, estimated from the RGB image and camera calibration, together with the RGB image to regress 2D joint locations using a dual-stream CNN. From a sequence of 2D joints, they estimated the 3D pose by minimizing the reprojection error and by imposing pose-conditioned joint velocity and temporal coherence constraints during optimization. Mehta et al. [74] implemented a real-time system for 3D pose estimation which exploits temporal information from the previous frame to achieve temporal smoothness. Given an image, the bounding box at time t is estimated by tracking the bounding box and 2D joint locations of the previous frame, which is passed to a CNN to estimate 2D heatmaps and 3D location maps x, y, z for each joint. They combine the 2D and 3D pose predictions of the current frame with those of the previous frame and apply temporal filtering and smoothing to obtain the 3D pose of the current frame.

In our third model we exploit the temporal information present in a sequence of frames and would like to examine whether applying temporal constraints can improve the performance of our previous network. For monocular videos, it is intuitive to exploit the temporal information of previous frames as it can provide many important cues, e.g. a part that is occluded in one frame may be visible in the next frame, or, in our case, the 2D pose estimation of a particular frame may be more erroneous than in other frames.
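A minimal illustration of this kind of temporal smoothing, assuming per-frame 3D estimates are already available, is given below. It is a toy exponential filter, not the mechanism used by any of the methods above or by our own model.

```python
import numpy as np

def smooth_pose_sequence(poses, alpha=0.8):
    """Exponentially smooth a sequence of per-frame 3D pose estimates.

    poses: array of shape (T, J, 3) with per-frame predictions for J joints.
    alpha: weight on the running estimate; larger values give smoother but
           more sluggish output.
    """
    smoothed = np.empty_like(poses)
    smoothed[0] = poses[0]
    for t in range(1, len(poses)):
        # Blend the previous smoothed pose with the new (possibly noisy) estimate.
        smoothed[t] = alpha * smoothed[t - 1] + (1.0 - alpha) * poses[t]
    return smoothed

# Toy usage: a static pose corrupted by independent per-frame noise (jitter).
rng = np.random.default_rng(1)
truth = np.ones((50, 17, 3))
noisy = truth + 0.05 * rng.normal(size=truth.shape)
print(f"mean error raw: {np.abs(noisy - truth).mean():.4f}, "
      f"smoothed: {np.abs(smooth_pose_sequence(noisy) - truth).mean():.4f}")
```

Our third model learns this kind of temporal consistency from data rather than applying a fixed filter.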
We expect that the temporal information will distributethe error in pose estimation smoothly over the sequence reducing jitter and overallimprovement in results.2.2.6 Exploiting multiple viewsAs discussed previously, acquiring motion capture data requires a complex labora-tory setup and is expensive. It requires markers, multiple motion capture cameraand multiple high resolution RGB cameras. The motivation of using multiple viewsof different cameras for 3D pose estimation is to make the data acquisition processcheaper so that it does not require motion capture cameras or markers to be placedon subject’s body and that the data can be acquired even in the outdoors. The ad-ditional views should intuitively make the task of 3D pose estimation easier sincecertain body parts in one view may be self-occluded in one view but visible clearlyin another view.A number of works have proposed using multiple cameras to estimate 3D pose.Sigal et al. [106] modeled human body as a collection of loosely-connected bodyparts in an undirected graphical model where the nodes represent body parts andedges represent a kinematic relationship between them. They imposed kinematicand penetration constraints using statistical models learned from motion capturedata and use Particle Message Passing (PAMPAS) [52], a type of particle filterthat can be applied over a graph containing loops, to infer 3D pose and motionfrom multi-view images with a set of calibrated camera. Amin et al. [3] extendspictorial structures model for 2D pose estimation to a multi-view model which per-forms joint reasoning over 2D poses from multiple view to estimate the 3D pose.27The same idea of using a multi-view pictorial structure for 3D pose estimation wasused by Burenius et al. [18]. They additionally imposed view, skeleton, joint angleand intersection constraints. 3D multi-view pictorial structures was also used byBelagiannis et al. [10]. The used geometric constraints of triangulation of bodyjoints from multiple views to estimate the 3D pose. On the other hand, Elhayek etal. [31] used a CNN-based network to estimate unary potentials for each joint ofa kinematic tree model of skeleton which are used to extract pose constraints byprobabilistically sampling from a pose posterior model. They combined the sam-pled constraints with an appearance-based similarity term and to track the articu-lated joint angles from multiple views. Pavlakos et al. [86] used the CNN-basedstacked-hourglass model for 2D pose estimation to estimate 2D pose from multipleviews and combined them using 3D pictorial structure model to obtain a volumetricheatmap of 3D joint uncertainties.2.2.7 Exploiting depth informationWith the availability of RGB-D cameras like Microsoft Kinect, a number of sys-tems tried to exploit the additional depth information along with the RGB image.Wei et al. [124] formulated the 3D pose estimation problem as a registration prob-lem in Maximum A Posteriori (MAP) estsimation framework. They integrated thedepth data, person silhouette, full-body geometry, temporal pose prior and occlu-sion reasoning in a unified MAP estimation framework and combine 3D trackingwith 3D pose estimation. Baak et al. [7] combined local optimization and globalretrieval methods to build a robust 3D pose estimator. They used a variant of Djjk-stra’s algorithm to extract pose features from depth channel and later fused the lo-cal and global pose estimates using sparse Hausdoff distance. Shotton et al. 
[104]modeled 3D pose estimation problem as a per pixel classification problem whichclassifies the pixels as belonging to a specific body part. They used depth compar-ison features from depth image and used random forest classifier to classify eachpixel and generated a confidence-scored 3D proposal for different body joints byreprojecting the classification results and finding local modes. Ye and Yang [129]embedded articulated deformation model with exponential-map parameters into aGausian Mixture model for the task of 3D pose estimation. They also developed a28shape adaptation algorithm using the same probabilistic model used for pose esti-mation. Shafaei and Little [102] used multiple views from multiple depth cameras.They applied image segmentation to depth images and used curriculum learningto train their pose estimation system on synthetic data. The 3D joint locations arerecovered by combining information from multiple views in real time. Although,depth information from depth cameras can give us valuable cue for 3D pose esti-mation, one major drawback of depth cameras is that it works poorly in outdoorsettings.2.3 2D pose estimation techniquesSince this work concentrates on analyzing the effectiveness of decouplng the taskof 3D pose estimation into first estimating 2D pose from an image and then liftingthe 2D pose into 3D, we will discuss some of the techniques for 2D pose estimation.The task of 2D pose estimation is defined as localizing a number joints or key-points in an image.One of the most popular 2D pose estimation technique before the advent ofdeep network-based estimators was by Yang and Ramanan [127]. They describedthe articulated human pose as a flexible mixture of non-oriented pictorial structureand augmented classic spring models with the co-occurrence constraints so thatthey can capture the contextual co-occurrence and spatial relationship betweendifferent parts. Such constraints help to impose notions of local rigidity. Theyembedded the co-occurrence contraints and spatial relationship between differentparts into a tree relational graph and optimize the entire model using dynamic pro-gramming. Following the success of deep networks in computer vision, many ap-proaches decided to leverage the deep learning techniques to estimate the 2D pose.Wei et al. [123] developed a CNN-based 2D pose estimation framework, calledConvolutional Pose Machine (CPM), which predicts 2D belief maps for each joint,giving the likelihood of the presence of that joint at a particular spatial location,and refines the belief over multiple stages. Each stage of pose estimation techniquetakes the image and the belief map from previous stage as input and generates arefined belief map. Cao et al. [19] used a similar CNN architecture as the CPM,refining 2D pose estimation in multiple stages. However, they extended it for pose29estimation of multiple people. They defined a non-parametric representation calledPart Affinity Fields(PAFs) to associate body parts with the individuals present inthe image. Each stage of the frame work has two branches, one branch predictsPAFs and the other branch predicts part confident maps, both of which are passedto next stage of the framework for refinement. Once the part locations are learned,the parts belonging to a particular individual are associated by using Hungarianmethod [58, 59], which is a bipartite graph matching algorithm. Newell et al. 
[80]came up with a fully convolutional network for 2D pose estimation which com-putes features at different scales and consolidate the features to capture the spatialrelationships of different joints in human body. In an hourglass module, bottom-up and top-down processing of the features takes place through successive stepsof pooling and up-sampling to predict a 2D heatmap for each joint. They namedtheir method stacked-hourglass because they stacked multiple hourglass modulesend-to-end. The perform intermediate 2D pose supervision at the end of each hour-glass. This repeated bottom-up and top-down inference helps to refine the 2D poseheatmaps in the final hourglass. He et al. [45] extended Faster R-CNN network [93]by Ren et al. which is used for finding region proposals or to localize objects inan image. They added a branch for predicting segmentation mask for an object inconjunction with object classification and bounding box regression. Their methodcan also be used for pose estimation of multiple people by training K differentmasks for each of K key-points where each mask is treated as a one-hot binarymask where only one pixel is labeled as a foreground.2.4 Deep NetworksAll of our three methods are based on deep networks. In our first model, we usea fully connected feed forward neural network with residual connections. Oursecond model overlays our first network over a Convolutional Neural Network for2D pose estimation. Finally our third network is a sequence-to-sequence networkwhere the building blocks are Long Short Term Memory Units (LSTMs). We willreview each of this networks briefly in this section.302.4.1 Biological motivationArtificial Neural Networks (ANNs) were originally inspired from biological neuralconnectivity in human brain. Analogous to the neurons and the interconnection ofneurons in the brain, an ANN is composed of a number of connected units calledartificial neurons. In a biological nervous system, neurons communicate with eachother by propagating electrical impulses through connections called synapses. Bi-ological neurons tend to have a threshold value such that if the magnitude of allthe impulses from different neurons exceed the threshold, the neuron would prop-agate the signal forward or else will not send the signal at all. This phenomenonis typically known as activating a neuron. The signal may get amplified or at-tenuated when it is passed through synapses from one neuron to another. Similarto the biological neural connections, artificial neurons have weighted connectionswith other neurons which may amplify or dampen the strength of the signal as itis being passed through the connection. The signal received by an artificial neuronis therefore a linear combination of different signals propagated from the neuronsconnected to it. Each neuron has an activation function which determines whetherthe neuron receiving the signal would fire or not. This adds non-linearity to theotherwise linear transformations. Typically the artificial neurons are arranged inmultiple layers: an input layer, several hidden layers, and an output layer. Neu-rons belonging to a particular layer cannot be connected to a neuron in the samelayer. 
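The computation just described, a weighted sum of incoming signals passed through an activation at each layer, can be written in a few lines. The toy sketch below is purely illustrative (random weights, arbitrary layer sizes, and the same activation used at the output for simplicity); it is not the network used in this thesis.

```python
import numpy as np

def relu(a):
    # Activation function: passes positive signals, suppresses negative ones.
    return np.maximum(0.0, a)

def forward(x, layers):
    """Forward pass through a small fully connected network.

    x:      (d,) input vector.
    layers: list of (W, b) pairs; each layer computes relu(W @ x + b),
            i.e. a weighted combination of the incoming signals followed
            by the activation threshold described above.
    """
    for W, b in layers:
        x = relu(W @ x + b)
    return x

# Toy usage: 4 inputs -> 8 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 4)), np.zeros(8)),
          (rng.normal(size=(2, 8)), np.zeros(2))]
print(forward(rng.normal(size=4), layers))
```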
Figure 2.2 shows an example of a fully connected Artificial Neural Network.The motivation of building such a network of artificial neurons was to mimic thefunctionality of human brain and how humans use their brain to solve a problem.Although artificial neural networks were initially developed keeping human brainin mind, over time, due to practical reasons, the researches had to deviate frombiological motivation such as using backpropagation during training the network.2.4.2 History of Neural NetworksThe idea of building a computational model with artificial neurons mimicking thebehavior of human neurons using mathematics and threshold logic was first pro-posed by McCulloch and Pitts [72] back in 1943. However the technologicallimitation did not allow them to progress much further. Farley and Clark [32]31Figure 2.2: A Fully Connected Neural Network consisting of an Input Layer,one hidden layer and an output layer. The connections between eachneuron is shown with an arrow. Each connection has a particular weightwhich is learned over time from training data using backpropagation.Each neuron also has an activation function which defines a thresholdfor the neuron to fire.and Rochester et al. [95] were the first research groups to perform computationalsimulations of neural networks. In 1958, Roseblatt [97] came up with the singlelayer Perceptron algorithm, a supervised algorithm for binary classification whichhad a hidden layer or association layer to map a given input to a random outputunit. In 1969, Minsky and Papert [75] discovered two crucial issues regardingneural networks. First, the basic perceptrons were not able to handle exclusive-or(XOR) circuit and second, the lack of processing power at that time for comput-ing large neural networks. This slowed down the research in neural networks forsome time until the computational powers were large enough to handle neural net-work processing. However, the discovery of backpropagation algorithm by PaulWerbos [126], the XOR issue was solved, thus speeding up the training of multi-layered neural networks. This rekindled the interest in research on neural networks,although progress was still slow. As parallel distributed processing became popularin the mid ’80s, Rumelhart and McClelland [98] described using parallel process-32ing to simulate neural networks. Throughout the ’80s and ’90s, simpler methodslike Support Vector Machines (SVM), linear classifiers, random-forests dominatedthe machine learning paradigm overshadowing the popularity of neural networks.The vanishing gradient problem was a major issue for training multi-layered neu-ral networks, when gradients tend to shrink to zero as the error is backpropagatedover multiple layers. Schmidhuber [100] proposed a work-around for the vanishinggradient problem. He proposed to pre-train each layer at a time by unsupervisedlearning and then fine-tune the entire network end-to-end through backpropaga-tion. On the other hand, Behnke [9] came up with an algorithm called RProp orresilient backpropagation. It only considers the sign of the gradient during back-propagation. In 2005, Steinkrau et al. [111] were the first group of researchersto implement a two layered fully connected network on GPU. Shortly after them,Chellapilla et al. [21] showed that GPUs can also be used to accelerate the train-ing of CNNs. 
However, when NVIDIA released the general purpose GPUs andCUDA programming language platform in 2007, it enabled programmers to writeprograms in standard programming languages like C or python and execute anyarbitrary codes on the GPUs. This was the major breakthrough for neural net-works as it opened the floodgates for large number of researchers to train reallydeep multi-layered networks on the GPUs without worrying about the vanishinggradient problem. In 2009, Raina et al. [91] used the CUDA platform to show thatthe Deep Belief Networks (DBN) [47] can be trained 70 times faster on GPUs overmulti-core CPUs. Similarly, Ciresan et al. [24] showed that multi-layered feed for-ward networks can be trained efficiently and extremely fast on the GPUs by usingsimple backpropagation with a low error rate. However, it was the Imagenet classi-fication by Krizhevsky et al. [57] which popularized the use of deep networks in thefield of Computer Vision. Their network became known as AlexNet, named afterAlex Krizhevsky. They achieved an error percentage of 16% and it was after thispaper the classification error rate for Imagenet Competition decreased dramaticallyto merely 2% now. The Imagenet project is a large database of images designed forvisual object recognition and localization task in 2009 by Deng et al. [27] and since2010 an annual competition called the ImageNet Large Scale Visual RecognitionChallenge (ILSVRC) is arranged.332.4.3 Convolutional Neural NetworksA Convolutional Neural Network (CNN) is a type of deep and feed forward neuralnetwork which are typically used for visual analysis of images for tasks like ob-ject classification, object localization, segmentation. The hidden layers of a CNNcan be composed of convolutional layers, pooling layers or fully connected layers.CNNs are suitable to be applied on images for computer vision tasks because thelearned weights of the convolutional layers act as convolution masks which peoplewould have hand-engineered otherwise for processing the image. Hence CNNs re-quire little pre-processing of input data. The fully connected network is not idealfor learning features from images because if each pixel in an image or each neuronin a volumetric input is fully connected to the neurons in the hidden layers, it wouldresult in a very large number of parameters which may cause several problems likethe vanishing gradient problem or overfitting to training data.The concept of convolutional layers stems from the work of Hubel and Wieselin 1968 [49], who showed how the neurons in the visual cortexes of monkeysrespond individually to small regions in their field of view. The portion of area ofthe visual field that triggers a particular neuron in the visual cortex is known asreceptive field. Hubel and Wiesel [49] found that when the eyes are still, visualcells within an small patch of the retina share similar and overlapping receptivefields. They found that the monkey brain has two types of visual cells:• simple cells: Sensitive to edges of different orientations.• complex cells: Have larger receptive fields and is responsible for understand-ing contextual information.Simply put, the learned weights of each convolutional layer performs a convo-lution operation on the input volume and outputs another volume. A conolutionallayer arranges its neurons in a volume. Each neuron of a convolutional layer isconnected to a local spatial region of the input volume instead of being fully con-nected to the neurons of the previous layer. 
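The sliding local computation this describes can be sketched directly. The naive single-channel loop below, with a made-up kernel, is meant only to illustrate the receptive field and weight sharing, not an efficient implementation.

```python
import numpy as np

def conv2d_single_channel(image, kernel):
    """Naive 'valid' 2D convolution (really cross-correlation, as in CNNs).

    Each output value is computed from a small local patch of the input -
    the receptive field - multiplied elementwise by the shared kernel
    weights and summed, with the window sliding over the image at stride 1.
    """
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]   # local receptive field
            out[i, j] = np.sum(patch * kernel)  # shared weights at every position
    return out

# Toy usage: a 5x5 image and a 3x3 edge-like kernel.
img = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0, -1.0]] * 3)
print(conv2d_single_channel(img, kernel))
```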
However, the neurons are fully connected along the depth of the volume. The spatial area of the region connected to a neuron is known as its receptive field, analogous to the receptive fields of biological visual cells.

Figure 2.3: A convolutional layer having a depth column of 5, i.e. 5 neurons are connected to the same spatial region, and a filter size or receptive field size of 5×5.

There are several hyper-parameters of a convolutional layer. We list them below:

• Filter size, or size of the receptive field for a neuron.
• Number of filters, or neurons connected to the same spatial region of the input, called the depth column.
• Stride by which we want to slide the filter, hence controlling the distance between the depth columns.

Even if we connect each neuron to a local spatial region, the number of learnable parameters is still considerably high. Hence, to reduce the number of parameters and make convolutional layers act like image convolution, the neurons within the same depth slice are made to share a single set of weights. Hence, a single forward pass means convolving the input with the learned weights for each slice. Figure 2.3 shows an example of a convolutional layer of depth 5, with filter size 5×5, applied on a 32×32 image.

Another key ingredient of a CNN is the pooling layer, which is periodically inserted after convolutional layers. Pooling layers combine the outputs of a group of neighboring neurons from the previous layer into a single neuron in the next layer, hence performing spatial downsampling. Pooling layers help to reduce the number of parameters and overfitting. Most commonly, pooling is done by max-pooling, i.e. by taking the maximum value from a cluster of neighboring neurons from the previous layer. Besides max-pooling, average pooling is also common. Figure 2.4 shows an example of a max-pooling layer.

Figure 2.4: A 2×2 max-pooling layer with a stride of 2.

One of the earliest and pioneering deep convolutional networks, LeNet-5, was designed by LeCun et al. [60] for handwritten digit recognition. LeNet-5 had two convolutional and two pooling layers followed by three fully connected layers for classifying handwritten digits from a 32×32 image. However, as discussed before, the real breakthrough for convolutional neural networks came after NVIDIA opened their CUDA platform, which allowed GPU implementations of neural networks, and with the release of AlexNet [57] by Krizhevsky et al. the popularity of convolutional neural networks in computer vision burgeoned. Aided by GPU implementations, convolutional neural networks got deeper and deeper, giving unprecedented performance particularly in the area of object recognition and localization. While AlexNet had a depth of only 8 layers, other popular networks like the VGG ConvNet [108] by Simonyan and Zisserman, released in 2014, had 19 layers, GoogleNet/InceptionNet [114], released in 2015 by Szegedy et al., had 100 layers, and ResNet [44], released in 2016 by He et al., has 152 layers. The inception module introduced by Szegedy et al. in their GoogleNet [114] allowed them to design a deeper and wider network. The module performs four different operations on the input in parallel and concatenates the output features.
Each of the branches performs a 1×1 convolution; this is followed by a 3×3 convolution in the second branch and a 5×5 convolution in the third, while a 3×3 max-pooling precedes the 1×1 convolution in the fourth branch. Because the computation is reduced by the 1×1 convolutions before the more expensive 3×3 and 5×5 convolutions are performed, they could design a deeper network without a significant increase in the number of parameters. On the other hand, He et al. [44], in their ResNet, stacked multiple bottleneck blocks with a residual or shortcut connection between each block. Within each bottleneck block, they stacked three convolutional layers of size 1×1, 3×3 and 1×1 successively. The 1×1 convolutional layers are used for altering the number of depth columns. If we consider H(x) to be the desired mapping of input data x for a particular block, the block now has to fit a mapping F(x) = H(x) − x instead of H(x). The authors hypothesized that it is easier to optimize a network with this residual mapping than one without it. The residual connections in our first and third networks are motivated by this work.

2.4.4 Recurrent Neural Networks

A Recurrent Neural Network (RNN) is a deep neural network with loops allowing it to store long-term information. The looping structure allows them to exploit previous computations to compute present information, thereby making them suitable for sequential data. Figure 2.5 shows an RNN being unrolled into a full network. From the figure, we can observe that RNNs are essentially copies of the same network with each unit connected to the next. Messages are passed from each layer to the next, forming a long chain-like network. In the figure, x_t denotes the input at time step t, s_t denotes the hidden state at time t, calculated by s_t = f(U x_t + W s_{t-1}), where f is a non-linear function (typically a ReLU [79] or hyperbolic tangent), and o_t is the output at time step t. U, V and W are weights or parameters which are shared across the network and are learnt during training. Although theoretically RNNs were designed to handle long-term dependencies among data, in practice they can only deal with recent information because of the vanishing gradient problem.

Figure 2.5: An RNN unrolled into a full network.

Long Short-Term Memory (LSTM)

The most commonly used LSTM structure in the current literature was proposed by Graves and Schmidhuber [40]. They incorporated changes made by Gers et al. [36] and Gers and Schmidhuber [35] into the original LSTM architecture and proposed full error backpropagation training. We will refer to the architecture proposed by Graves and Schmidhuber [40] as the vanilla LSTM.

Each memory block of the recurrent hidden layer of the vanilla LSTM contains memory cells with self connections capable of storing the temporal state of the network. These cells are regulated by special multiplicative units called gates. There are three types of gates: an input gate, an output gate and a forget gate. Gates are typically sigmoid functions which regulate how much information should be let through. An input gate controls the flow of input activations into the cell, and an output gate regulates the output flow of cell activations into the rest of the network. The original LSTM by Hochreiter and Schmidhuber [48] did not contain forget gates and could not process continuous input streams. To address this issue, Gers et al. [36] introduced the forget gate.
The forget gate scales the internal state of the cell, thereby allowing each cell to reset or forget its memory. Forget gates give LSTMs the flexibility of deciding when to drop information and how long to store it. A further modification was proposed by Gers and Schmidhuber [35], who argued that regulation of the gates was necessary to learn precise timings. Hence, they proposed to include peephole connections from the internal cells to the gates of the same cell, and omitted the output activation function. The vanilla LSTM includes all these modifications and introduced full backpropagation through time (BPTT) training for LSTM networks. In the original LSTM, backpropagation was truncated after one timestep, because the authors felt that long-term dependencies would be dealt with by the memory blocks, and not by the flow of the backpropagated error gradient. The vanilla LSTM simplifies the training and implementation of LSTMs by performing full error backpropagation. Figure 2.6 shows the difference between an LSTM node and a simple RNN node. We can observe the three gates (input, output and forget), all controlled by sigmoid functions, a block input, a cell known as the Constant Error Carousel which continuously feeds the error back to each of the gates until they become trained to cut off the value, an output activation function and peephole connections. Like recurrent networks, the output of a block is connected back to the block input and all of the gates.

The vector formulas for the forward pass are given below:

z_t = g(W_z x_t + R_z y_{t-1} + b_z)                      (block input)
i_t = σ(W_i x_t + R_i y_{t-1} + p_i ⊙ c_{t-1} + b_i)      (input gate)
f_t = σ(W_f x_t + R_f y_{t-1} + p_f ⊙ c_{t-1} + b_f)      (forget gate)
c_t = i_t ⊙ z_t + f_t ⊙ c_{t-1}                           (cell state)
o_t = σ(W_o x_t + R_o y_{t-1} + p_o ⊙ c_t + b_o)          (output gate)
y_t = o_t ⊙ h(c_t)                                        (block output)

Figure 2.6: (Left) Diagram of a simple RNN unit. (Right) Diagram showing an LSTM block.

In the equations, x_t and y_t denote the input and block output vectors, respectively, at time t. The W are rectangular input weight matrices; there are four different sets of W, one for each gate and one for the block input. The R are square matrices for the recurrent weights; similar to W, there are four different sets of R. The vectors p are peephole weight vectors and b are bias vectors. The functions σ, g and h are non-linear activation functions: sigmoid functions are used for the gates and hyperbolic tangent functions are used for the block input and output. ⊙ indicates element-wise multiplication between two vectors.

In 2014, Cho et al. [23] proposed a simplified version of the LSTM called the Gated Recurrent Unit (GRU) for the task of phrase-based Statistical Machine Translation (SMT). Their architecture consisted of two RNNs: one for encoding a variable-length source sequence into a fixed-length vector and the other for decoding it back into a variable-length target sequence. Their simplified architecture did not have any peephole connections or output activation functions. They combined the forget gate and the input gate into an update gate.
They also combined the cell state and the hidden state. Their output gate is called a reset gate, which applies a sigmoid function over the recurrent connections to the block input. The reset gate specifies whether the current hidden state should ignore the previous hidden state or not. If it is set to zero, the hidden state is updated from the current input block only. The update gate controls how much information from the previous hidden state will be carried over to the current state when the reset gate is closed.

Like other deep networks, a complex LSTM model may result in overfitting. Applying regularization efficiently to RNNs proved to be a challenging task until Zaremba et al. [130] showed how dropout can be used in LSTMs to reduce overfitting. The authors applied dropout to the non-recurrent connections of multi-layer RNNs so that it corrupts the information carried by the units, resulting in more robust intermediate computations. At the same time, the architecture allows the units to remember information that occurred many time steps back.

Sequence-to-sequence Network

Sutskever et al. [113] came up with the idea of the sequence-to-sequence network for translating English sentences into French. Sequence-to-sequence networks are convenient for tasks where the input and output have different sequence lengths, e.g. machine translation. Sutskever et al. used LSTM units to read the input sequence and encode it into a fixed-dimensional vector representation. A second set of LSTM units was then used to decode the vector into the output sequence. The decoder LSTM units maximize the conditional probability of the output sequence given the input sequence. They found an improvement in their results when the order of the words in the input sequence was reversed. Our final model is inspired by the sequence-to-sequence network and the machine translation task. In our case, the input is a sequence of 2D poses and the output is a sequence of 3D poses of the same length as the input.

Chapter 3
3D pose from 2D pose

Our first model aims to analyze the effectiveness of breaking up the task of 3D pose estimation into two parts: i) obtaining the 2D pose using an off-the-shelf 2D pose estimator, and ii) learning a mapping from 2D pose to 3D pose. As mentioned before, this helps us find out whether it is more difficult to estimate 3D pose directly from an image in an end-to-end framework than to estimate it from 2D poses.

For the purpose of 3D pose estimation from the 2D joint locations of an image, we have designed a simple multi-layered fully-connected network. Since our network only takes the 2D coordinates of joint locations as input, its input is much smaller in dimension than an image. Hence, we can afford to use multiple layers of fully-connected neurons. We have used a residual or shortcut connection after every two fully connected layers, as inspired by He et al. [44], who used shortcut connections to build a deep convolutional network. Additionally, we used dropout [110] and batch normalization [50] layers after each hidden layer and used Rectified Linear Units (ReLU) [79] as the activation function.

3.1 Loss Function

Our goal is to estimate the body joint locations in 3D space given the joint locations in 2D. In other words, the input to our system is a set of 2D joint locations x ∈ R^{2n} and our output is a set of 3D joint locations y ∈ R^{3n}. Our network learns a mapping f(x) → y; x ∈ R^{2n}, y ∈ R^{3n}. We use the Mean Squared Error (MSE) of the 3D joint locations over a set of N poses as our loss function, given by

\mathcal{L}(f(x), y) = \min_{f(x)} \frac{1}{N} \sum_{i=1}^{N} \big\| f(x_i) - y_i \big\|_2^2    (3.1)
Here f(x_i) is the predicted 3D pose for the i-th 2D pose and y_i is the corresponding ground truth 3D pose. The input 2D pose x_i may be obtained from the joint detections of a 2D pose detector or from the ground truth. We have experimented both with the ground truth 2D joint locations and with the detections from a 2D pose detector, the stacked-hourglass network by Newell et al. [80], which predicts the 2D locations of 16 joints in an image, namely: central hip, spine, neck, head, and both left and right joints for hip, knee, ankle, shoulder, elbow and wrist. We map the 2D locations of these 16 joints into 3D using our deep network. The Human3.6M skeleton has 17 joints, the nose being the extra one; we had to drop the nose joint because the stacked-hourglass network does not predict it. We predict the 3D joint locations with respect to the root node, the central hip, which is common in the literature. However, instead of predicting the 3D pose in an arbitrary global coordinate space, we predict it in the coordinate space of the camera, i.e. how the camera is looking at the 3D pose.

3.2 Network design

Figure 1.6 shows a diagram with the basic building blocks of our architecture. The key component of our network is the residual block depicted in the diagram. First we project the input into a higher dimension using a fully connected linear layer. Then, after applying dropout [110] and batch normalization [50], we pass it to our residual block. Each unit of our residual block consists of two fully connected layers with dropout and batch normalization layers in between. There is a shortcut or residual connection from the input of the residual block to the output of the block. In most of our experiments, we have used two units of residual blocks. Finally, we project the output from the second residual block down to a 48-dimensional vector, which corresponds to the 3D locations of 16 joints with respect to the root node, which is always set at (0, 0, 0). Overall, our network has 6 fully connected linear layers and approximately 4-5 million trainable parameters.

Our model benefits from recent improvements in the optimization of deep networks, courtesy of the deep convolutional networks submitted to the Imagenet Challenge [27, 57]. The contributions applied by those authors in the context of deep networks also help our fully connected model to better generalize on our 2D-to-3D pose mapping task. Below we discuss the contribution of each module in our network and elaborate on our design choices.

3.2.1 Mapping 2D pose to 3D

We chose to use 2D and 3D locations of joints as inputs and outputs, instead of inferring 3D pose from images directly by training the model end-to-end as many of the recent techniques did [63, 65, 73, 81, 85, 87, 96, 112, 116, 119, 120, 133], because we wanted to validate the efficiency of dividing the 3D pose estimation task. Some decoupled approaches [134] have used 2D probability distributions or 2D joint heatmaps from 2D pose estimators as inputs. However, the 2D joint locations have a much smaller dimensionality than the heatmaps, which enabled us to store the entire Human3.6M dataset on the GPU while training the network, massively reducing the training time. Because our network can be trained very fast (approximately 5 ms per batch of 64), we can experiment with network design and training hyper-parameters. As we have mentioned in Chapter 2, different models of 3D pose estimation have represented the 3D pose output in different ways, e.g.
3D probabilities or volumetric heatmaps of joints [87], 3D motion parameters [133] or coefficients of basis poses [2, 16, 92, 132, 134]. However, our network predicts the 3D joint locations with respect to the root node, which is a simple and model-free representation of 3D pose. This simplifies the task to estimating the offsets of each joint from the root joint instead of having to predict the absolute coordinates of each joint, because it is more difficult to find meaningful spatial relationships between the joints if absolute coordinates are predicted.

3.2.2 Fully connected layers with ReLU activation

Since our input consists of 2D joint locations, it is low-dimensional compared to images or 2D joint heatmaps, and hence there is no need for convolutional layers. Therefore we can use fully connected linear layers, which are computationally less expensive than applying convolutions. We use the Rectified Linear Unit (ReLU) [79] as the activation function because it has been found effective in decreasing the possibility of vanishing gradient problems. The ReLU [79] is defined as y = max(0, a), where a = Wx + b. Hence the gradient of the ReLU is

dy/da = 0 if a < 0,   1 if a > 0,   undefined if a = 0.

Hence, even if the value of a is very high, the gradient is 1. Therefore, ReLU gradients do not shrink, which reduces the chance of vanishing gradients. The constant gradient of ReLUs also makes the learning process faster. Another advantage of using ReLU units is the sparsity of gradients when a < 0. However, since the gradient of the ReLU is undefined at 0, a small value ε can be added to a when a = 0.

3.2.3 Residual or shortcut connections

The idea of residual or shortcut connections was proposed by He et al. [44]. Residual connections allowed them to build a convolutional neural network which is 152 layers deep. They hypothesized that it is easier for the network to learn the residual mapping than one without residual connections. One reason for this could be that the block of the network connected by a residual connection only needs to learn the amount of change from the input to obtain the desired mapping, instead of having to learn the mapping directly. We also found residual connections to be highly effective in generalizing to new data and reducing test time. We added shortcut connections after every two fully connected layers. In our case, the connections have helped us to reduce the error by approximately 10%.

3.2.4 Regularization with batch normalization, dropout and max-norm constraint

Batch normalization was proposed by Ioffe and Szegedy [50] in 2015. One major issue of deep networks is that the distribution of features at each hidden layer changes many times during training as the parameters of the previous layers change
The advantages of batch normalization are many. It makes conver-gence of the optimization function quicker by allowing larger learning rate makingthe overall training faster. It also makes the network less sensitive to parameterinitialization and since it normalizes the inputs to the activation function it reducesthe vanishing gradient problem. Additionally it regularizes the network because ofthe noise in population statistics estimation, hence gives better generalization.Dropout proposed by Srivastava et al. [110] is another method for regularizingdeep networks. Dropout works by randomly dropping out or ignoring individualneurons at every layer with a probability of p (keeps a neuron with probability 1-p)during training time by removing all the incoming and outgoing connections fromthe dropped neuron, resulting in a reduced network.We have also found batch normalization and dropout to be effective particularlywhen we trained our network with noisy 2D pose estimates from the detectors.Without batch normalization our network does not generalize well for noisy 2Dpose estimates. Adding both batch normalization and dropout help our network togeneralize better for test data decreasing the overall test error of our network witha minute increase in training time.In addition to batch norm and dropout we also added a constraint on the weightsof each layer of the network so that their maximum norm is always less than orequal to 1. We observed that it makes our model robust to noise and improvesgeneralization.463.3 Data PreprocessingWe normalized both our 2D pose inputs and 3D pose ground truth by subtractingthe mean and dividing it by the standard deviation. Since we predict the 3D jointlocations relative to the root node and do not predict the global position of the rootnode, we zero-center the 3D poses around the hip-joint, the root node. This is inline with the standard protocol of Human3.6M and the previous work.3.3.1 Camera coordinate frameA key factor of our system is predicting the 3D pose in the camera coordinate frameinstead of an arbitrary global frame. Intuitively it is difficult for any model to learnthe mapping from a 2D pose at a particular view to any arbitrary coordinate spacesince it captures no information of the view and any random amount rotation ortranslation to the arbitrary space would yield in no change in the input. Predicting3D pose in a fixed global frame causes the multiple views of the same 2D pose mapto the same output. This reduces variance in the training data, making it harder forthe network to learn the mapping and causes overfitting. A direct consequence ofpredicting in arbitrary global coordinate frame is the failure to capture the globalorientation of the person leading to higher errors in all the joints. There are anumber of works that have predicted 3D pose in camera coordinate frame [29, 64,87, 117, 133, 134].By predicting 3D pose in the same camera frame as 2D pose, we get a greatervariability of training data per camera view. Therefore to make our network predict3D poses in camera space, we rotate and translate the 3D ground truth, in globalcoordinate frame, by applying inverse transform of the camera based on its extrin-sic parameters. It should be noted that we do not use any ground truth cameraparameters at test time. The network learns itself to correctly map the 2D pose ina particular view to its corresponding 3D space in the same view.3.3.2 2D detectionsWe used the the state-of-the-art 2D pose estimator called stacked hourglass networkby Newell et al. 
3.3.2 2D detections

We used the state-of-the-art 2D pose estimator, the stacked hourglass network by Newell et al. [80], trained on the MPII [5] dataset, to obtain 2D pose detections. The MPII dataset is a standard dataset for the task of 2D pose estimation, containing over 25K images of a wide variety of scenes with more than 40K people. To obtain the detections, we first used the bounding box ground truth provided with the Human3.6M dataset to estimate the center of the person in the image, which is in line with previous work [51, 64, 76, 85, 117]. We cropped a region of 440×440 pixels around the estimated center and passed it to the stacked-hourglass pose estimator, which resizes the cropped image to 256×256 pixels before processing it.

We found that the average error between the detected and ground truth 2D poses for the Human3.6M dataset is approximately 15 pixels, which is slightly higher than the 10 pixel error reported by Moreno-Noguer [76], who used CPM [123] for 2D pose estimation. However, we chose the stacked-hourglass model [80] over the CPM model because we found it to be approximately 10 times faster than CPM at estimating pose from an image. Moreover, the stacked-hourglass model reported a lower error on the MPII dataset, which contains many in-the-wild images; hence we felt it would generalize better to in-the-wild images than CPM [123].

To find out whether more accurate 2D pose estimation reduces the error of our model, we also fine-tuned the stacked-hourglass model pre-trained on the MPII dataset. For fine-tuning we used all the default hyper-parameters of the pre-trained model except for the mini-batch size, which was reduced from 6 to 3 due to memory limitations on the GPU, and fine-tuned it for 40,000 iterations.

3.3.3 Training details

We trained our network for 200 epochs, where each epoch makes a pass over the 2D poses of the entire Human3.6M dataset. We used the Adam [55] optimizer. We started our training with a learning rate of 0.001 and applied exponential decay on the learning rate as training progressed. We used a mini-batch size of 64. We initialized the weights of our network using Kaiming initialization [43]. Our code has been implemented in Tensorflow. A single pass over a mini-batch including back-propagation takes around 5ms and a forward pass takes only 2ms on an NVIDIA Titan X GPU. Therefore, when combined with any real-time 2D pose estimator, our network can predict the 3D pose from an image in real time. A single training epoch, which makes a pass over the entire Human3.6M

Protocol #1 Direct. Discuss Eating Greet Phone Photo Pose Purch.
Sitting SitingD Smoke Wait WalkD Walk WalkT AvgLinKDE [51] (SA) 132.7 183.6 132.3 164.4 162.1 205.9 150.6 171.3 151.6 243.0 162.1 170.7 177.1 96.6 127.9 162.1Li et al [64] (MA) – 136.9 96.9 124.7 – 168.7 – – – – – – 132.2 70.0 – –Tekin et al [117] (SA) 102.4 147.2 88.8 125.3 118.0 182.7 112.4 129.2 138.9 224.9 118.4 138.8 126.3 55.1 65.8 125.0Zhou et al [134] (MA) 87.4 109.3 87.1 103.2 116.2 143.3 106.9 99.8 124.5 199.2 107.4 118.1 114.2 79.4 97.7 113.0Tekin et al [116] (SA) – 129.1 91.4 121.7 – 162.2 – – – – – – 130.5 65.8 – –Ghezelghieh et al [37] (SA) 80.3 80.4 78.1 89.7 – – – – – – – – – 95.1 82.2 –Du et al [29] (SA) 85.1 112.7 104.9 122.1 139.1 135.9 105.9 166.2 117.5 226.9 120.0 117.7 137.4 99.3 106.5 126.5Park et al [85] (SA) 100.3 116.2 90.0 116.5 115.3 149.5 117.6 106.9 137.2 190.8 105.8 125.1 131.9 62.6 96.2 117.3Zhou et al [133] (MA) 91.8 102.4 96.7 98.8 113.4 125.2 90.0 93.8 132.2 159.0 107.0 94.4 126.0 79.0 99.0 107.3Nie et al [81] (MA) 90.1 88.2 85.7 95.6 103.9 103.0 92.4 90.4 117.9 136.4 98.5 94.4 90.6 86.0 89.5 97.5Rogez et al [73] (MA) – – – – – – – – – – – – – – – 88.1Mehta et al [73] (MA) 57.5 68.6 59.6 67.3 78.1 82.4 56.9 69.1 100.0 117.5 69.4 68.0 76.5 55.2 61.4 72.9Mehta et al [74] (MA) 62.6 78.1 63.4 72.5 88.3 93.8 63.1 74.8 106.6 138.7 78.8 73.9 82.0 55.8 59.6 80.5Lin et al [65] (MA) 58.0 68.2 63.3 65.8 75.3 93.1 61.2 65.7 98.7 127.7 70.4 68.2 72.9 50.6 57.7 73.1Tome et al [119] (MA) 65.0 73.5 76.8 86.4 86.3 110.7 68.9 74.8 110.2 173.9 84.9 85.8 86.3 71.4 73.1 88.4Pavlakos et al [87] (MA) 67.4 71.9 66.7 69.1 72.0 77.0 65.0 68.3 83.7 96.5 71.7 65.8 74.9 59.1 63.2 71.9Tekin et al [118] 54.2 61.4 60.2 61.2 79.4 78.3 63.1 81.6 70.1 107.3 69.3 70.3 74.3 51.8 63.2 69.7Ours (SH detections) (SA) 61.6 73.4 63.3 58.3 91.8 93.6 66.3 62.0 91.7 109.4 75.7 86.5 67.2 51.2 52.3 73.6Ours (SH detections) (MA) 53.3 60.8 62.9 62.7 86.4 82.4 57.8 58.7 81.9 99.8 69.1 63.9 67.1 50.9 54.8 67.5Ours (SH detections FT) (MA) 51.8 56.2 58.1 59.0 69.5 78.4 55.2 58.1 74.0 94.6 62.3 59.1 65.1 49.5 52.4 62.9Ours (GT detections) (MA) 37.7 44.4 40.3 42.1 48.2 54.9 44.4 42.1 54.6 58.0 45.1 46.4 47.6 36.4 40.4 45.5Table 3.1: Results showing errors action-wise on Human3.6M [51] underProtocol #1 (no rigid alignment or similarity transform applied in post-processing). SH indicates that we trained and tested our model withthe detections of Stacked Hourglass [80] model pre-trained on MPIIdataset [5] as input, and FT indicates that the the model was fine-tuned onHuman3.6M. GT detections denotes that the ground truth 2D locationswere used. SA indicates that a model was trained for each action, andMA indicates that a single model was trained for all actions.dataset, takes only about 2 minutes which allowed us to experiment with differenthyper-parameters and variants of our architecture.3.4 Experimental evaluationDatasets and protocols We perform quantitative evaluation on two benchmarkdatasets for 3D pose estimation: Human3.6M [51] and HumanEva [105]. Forqualitative results we use the MPII dataset [5] which is a benchmark dataset for2D pose estimation and does not have any ground truth for 3D pose.As we have discussed in Section 1.1.2, Human3.6M is, to the best of our knowl-edge, the largest publicly available datasets for human 3d pose estimation. Hu-manEva, on the other hand, is another dataset for 3D pose estimation which is49Protocol #2 Direct. Discuss Eating Greet Phone Photo Pose Purch. 
Sitting SitingD Smoke Wait WalkD Walk WalkT AvgAkhter & Black [2]* (MA) 14j 199.2 177.6 161.8 197.8 176.2 186.5 195.4 167.3 160.7 173.7 177.8 181.9 176.2 198.6 192.7 181.1Ramakrishna et al [92]* (MA) 14j 137.4 149.3 141.6 154.3 157.7 158.9 141.8 158.1 168.6 175.6 160.4 161.7 150.0 174.8 150.2 157.3Zhou et al [134]* (MA) 14j 99.7 95.8 87.9 116.8 108.3 107.3 93.5 95.3 109.1 137.5 106.0 102.2 106.5 110.4 115.2 106.7Bogo et al [16] (MA) 14j 62.0 60.2 67.8 76.5 92.1 77.0 73.0 75.3 100.3 137.3 83.4 77.3 86.8 79.7 87.7 82.3Rogez et al [73] (MA) – – – – – – – – – – – – – – – 87.3Nie et al [81] (MA) 62.8 69.2 79.6 78.8 80.8 86.9 72.5 73.9 96.1 106.9 88.0 70.7 76.5 71.9 76.5 79.5Mehta et al [73] (MA) 14j – – – – – – – – – – – – – – – 54.6Tekin et al [118] (MA) 17j – – – – – – – – – – – – – – – 50.1Moreno-Noguer [76] (MA) 14j 66.1 61.7 84.5 73.7 65.2 67.2 60.9 67.3 103.5 74.6 92.6 69.6 71.5 78.0 73.2 74.0Pavlakos et al [87] (MA) 17j – – – – – – – – – – – – – – – 51.9Ours (SH detections) (SA) 17j 50.1 59.5 51.3 56.9 68.5 67.5 51.0 47.2 68.5 85.6 61.2 67.0 55.1 41.1 45.5 58.5Ours (SH detections) (MA) 17j 42.2 48.0 49.8 50.8 61.7 60.7 44.2 43.6 64.3 76.5 55.8 49.1 53.6 40.8 46.4 52.5Ours (SH detections FT) (MA) 17j 39.5 43.2 46.4 47.0 51.0 56.0 41.4 40.6 56.5 69.4 49.2 45.0 49.5 38.0 43.1 47.7Ours (SH detections) (SA) 14j 44.8 52.0 44.4 50.5 61.7 59.4 45.1 41.9 66.3 77.6 54.0 58.8 49.0 35.9 40.7 52.1Table 3.2: Results showing errors action-wise on Human3.6M [51] datasetunder protocol #2 (rigid alignment in post-processing). The 14j anno-tation indicates that the body model considers 14 body joints while 17jmeans considers 17 body joints. (SA) annotation indicates per-actionmodel while (MA) indicates single model used for all actions. FT in-dicates that the stacked-hourglass model has been fine-tuned on Hu-man3.6M dataset. The results of the methods are obtained from the orig-inal papers, except for (*), which were obtained from [16].comparatively much smaller and older than the Human3.6M dataset but have beenused as a benchmark by many previous work.On Human3.6M we follow the standard protocol which has been used overthe years. The protocol involves using subjects 1, 5, 6, 7, and 8 for training, andsubjects 9 and 11 for evaluation. Our error metric is average error per joint inmillimeters between the estimated and the ground truth 3D pose relative to the rootnode (central hip joint). We refer to this as protocol #1. However, in some previouswork(e.g., [16, 76]), the predicted 3D pose is aligned to the ground truth 3D poseunder a rigid body similarity transform. This is typically done by using Procrustesanalysis [39]. This post-processing is referred to as protocol #2.Several methods which used Human3.6M dataset performed an action specifictraining and testing. However, recent deep network based methods train a singlemodel for all the actions. We observed that training a single model gives betterresults than action specific models.However, for HumanEva action specific models are trained in the literature andthe error is always computed after similarity transform. 
Hence we also used this protocol.

3.4.1 Quantitative results

Evaluation on estimated 2D pose

Conceptually, we modeled our 3D pose framework as a decoupled architecture which divides the 3D pose estimation task into two parts: detecting the 2D pose using a 2D pose estimator, and estimating the 3D pose from the detected 2D joint locations. As mentioned before, we obtained 2D pose estimates on the Human3.6M dataset using a stacked-hourglass model [80] trained on the MPII dataset [5].

Our results under protocol #1 on the Human3.6M dataset are shown in Table 3.1. As seen from the table, when we use the predictions from the stacked-hourglass model [80] trained only on the MPII dataset, our framework outperforms all the recently released methods. Our network outperforms Pavlakos et al. [87], who trained an end-to-end model from images by extending the stacked-hourglass 2D pose estimator to predict volumetric heatmaps, by 4.4 mm. Our network also marginally beats the method recently proposed by Tekin et al. [118], by 2.2 mm.

Intuitively, since our method takes 2D joint location estimates as input and regresses the 3D pose from them, the accuracy largely depends on the accuracy of the 2D estimates. To validate this, we fine-tuned the stacked-hourglass network, pre-trained on the MPII dataset, on the Human3.6M dataset. As hypothesized, when trained using the predictions from the fine-tuned network, our method outperforms our nearest competitors Pavlakos et al. [87] by 9.0 mm and Tekin et al. [118] by 6.8 mm. The margins more than double when we use the fine-tuned predictions, suggesting the superiority of our network compared to the state of the art.

We report our results on Human3.6M under protocol #2, which uses a similarity transform with the ground truth, in Table 3.2. Although under protocol #2 our method is very narrowly beaten by both Pavlakos et al. [87] and Tekin et al. [118] (by 0.6 mm and 2.4 mm respectively) when we use the detections from the stacked-hourglass model trained on the MPII dataset, it beats both state-of-the-art methods (by 4.2 mm and 2.4 mm) when the detections from the fine-tuned model are used.

Finally, we report the results on the HumanEva dataset in Table 3.3. On this dataset, we obtained the best result in 4 out of 6 cases, and achieved the lowest average error over all subjects for the actions Jogging and Walking. However, compared to Human3.6M, HumanEva is a much smaller and older dataset, and the same subjects are present in both the training and test sets, so visual methods would have a stronger bias on this dataset. These results are therefore not as significant as those obtained on the Human3.6M dataset.

Walking Jogging
S1 S2 S3 S1 S2 S3 Avg
Radwan et al [90] 75.1 99.8 93.8 79.2 89.8 99.4 89.5
Wang et al [122] 71.9 75.7 85.3 62.6 77.7 54.4 71.3
Simo-Serra et al [107] 65.1 48.6 73.5 74.2 46.6 32.2 56.7
Bo et al [14] 46.4 30.3 64.9 64.5 48.0 38.2 48.7
Kostrikov et al [56] 44.0 30.9 41.7 57.2 35.0 33.3 40.3
Yasin et al [128] 35.8 32.4 41.6 46.6 41.4 35.4 38.9
Moreno-Noguer [76] 19.7 13.0 24.9 39.7 20.0 21.0 26.9
Pavlakos et al [87] 22.1 21.9 29.0 29.8 23.6 26.0 25.5
Ours (SH detections) 19.7 17.4 46.8 26.9 18.2 18.6 24.6

Table 3.3: Results on the HumanEva [105] dataset, and comparison with previous methods.
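For reference, the two error measures used in the tables above can be written compactly. The sketch below is an illustrative NumPy version of protocol #1 (mean per-joint position error on root-centered poses) and of the rigid alignment step of protocol #2 (Procrustes analysis); it is a simplified re-implementation for exposition, not the evaluation code of this thesis.

import numpy as np

def mpjpe(pred, gt):
    """Protocol #1: mean Euclidean distance per joint (mm); poses are already root-centered."""
    return np.mean(np.linalg.norm(pred - gt, axis=-1))

def procrustes_align(pred, gt):
    """Protocol #2: rigidly align pred to gt with a similarity transform before scoring."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the covariance matrix, with a reflection guard.
    U, S, Vt = np.linalg.svd(P.T @ G)
    if np.linalg.det(U @ Vt) < 0:      # avoid an improper rotation (reflection)
        Vt[-1] *= -1
        S[-1] *= -1
    R = U @ Vt
    scale = S.sum() / (P ** 2).sum()   # optimal isotropic scale
    return scale * P @ R + mu_g

# pred, gt: (17, 3) arrays in millimetres
pred = np.random.randn(17, 3)
gt = np.random.randn(17, 3)
err_protocol1 = mpjpe(pred, gt)
err_protocol2 = mpjpe(procrustes_align(pred, gt), gt)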
A lower bound on the error of 2D-to-3D regression

To validate our hypothesis that the major source of error for 3D pose estimation is the error in estimating 2D poses, we train our model on Human3.6M with the ground truth 2D poses. We show the results under protocol #1 in Table 3.1, where a single model is trained for all actions. Unsurprisingly, the network trained with the ground truth results in a significantly lower error (approximately 17 mm lower) than our model trained on the detections of the pre-trained stacked-hourglass.

Under protocol #2, our network trained on the ground truth achieves an error of 37.10 mm, almost 30% better than the case when our network is trained on the estimated 2D pose. This validates our hypothesis empirically: for deep networks it is easier to learn the mapping from 2D joint locations to 3D joint locations, and the more accurate the 2D pose estimates are, the better the accuracy of the 3D pose.

Even though we evaluate each frame separately and do not use any temporal post-processing, we observed that the predictions produced from the ground truth 2D poses are smooth. A video demonstrating this and other qualitative results can be found at https://youtu.be/Hmi3Pd9x1BE.

Robustness to detector noise

To further analyze the robustness of our approach to noisy inputs, we carried out experiments where our model is trained on the ground truth 2D poses and tested on ground truth 2D poses randomly corrupted by different levels of additive Gaussian noise. We used protocol #2 to compare against the work by Moreno-Noguer [76] because they used the same protocol. The results are reported in Table 3.4. We outperform the work by Moreno-Noguer [76] by a huge margin for all levels of noise. Even in the case when no Gaussian noise is added, our method betters the result of Moreno-Noguer by a staggering 43%.

We have also reported the case when our network is trained with ground truth 2D poses but tested with noisy detections from the 2D pose detectors CPM [123] and stacked-hourglass [80]. As can be observed from the table, our method also performs reasonably well in this case, thereby demonstrating the robustness of our model.

DMR [76] Ours ∆
GT/GT 62.17 37.10 25.07
GT/GT + N(0,5) 67.11 46.65 20.46
GT/GT + N(0,10) 79.12 52.84 26.28
GT/GT + N(0,15) 96.08 59.97 36.11
GT/GT + N(0,20) 115.55 70.24 45.31
GT/CPM [123] 76.47 – –
GT/SH [80] – 60.52 –

Table 3.4: Performance of our system on the Human3.6M [51] dataset under protocol #2, under different levels of additive Gaussian noise and under noise from 2D pose estimators. (Top) Training using ground truth 2D pose and testing on ground truth 2D pose plus different levels of additive Gaussian noise. (Bottom) Training on ground truth 2D pose and testing on the noisy outputs of a 2D pose estimator. Note that the size of the cropped region around the person is 440×440.
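The noise test in Table 3.4 simply perturbs each ground truth 2D joint coordinate before it is fed to the network. A hedged sketch of that corruption step follows; σ is in pixels of the 440×440 crop, and the function name and the commented usage (model.predict, normalize, gt_pose_2d) are stand-ins rather than names from the released code.

import numpy as np

def corrupt_2d_pose(pose_2d, sigma, rng=None):
    """Add i.i.d. Gaussian noise N(0, sigma^2) to every 2D joint coordinate.

    pose_2d: (J, 2) array of ground truth joint locations in pixels.
    sigma:   standard deviation of the noise in pixels (5, 10, 15 or 20 in Table 3.4).
    """
    rng = np.random.default_rng() if rng is None else rng
    return pose_2d + rng.normal(0.0, sigma, size=pose_2d.shape)

# Evaluate a trained 2D-to-3D model on progressively noisier inputs:
# for sigma in (0, 5, 10, 15, 20):
#     noisy = corrupt_2d_pose(gt_pose_2d, sigma)
#     pred_3d = model.predict(normalize(noisy))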
Ablative and hyperparameter analysis

To demonstrate the usefulness of the different components and design choices of our network, we perform an ablative analysis. We perform the ablative analysis under protocol #1, where the input 2D poses come from the model trained on the MPII dataset only, and we train a single model for all the actions. The results are shown in Table 3.5. As we can see from the table, when we remove only batch normalization, the network generalizes poorly and this leads to an increase in error of 21 mm. Removing both batch normalization and dropout leads to an increase of 8.5 mm, while removing the residual connections costs about 8.3 mm. The biggest impact is made by predicting the 3D pose in the camera coordinate frame: when the 3D pose is predicted in an arbitrary global frame instead, the average error rises to over 100 mm, a significant increase of about 33 mm.

We also evaluated how our network performs at different depths. Using a single residual block results in a performance loss of about 7 mm. The network starts to saturate when we use more than 2 residual blocks, mostly because of the large number of parameters due to the full connections.

Although not reported in the table, we observed empirically that decreasing the size of the hidden layers from 1024 to 512 leads to an increase in error, while increasing the size of the hidden layers to 2048 units did not noticeably improve the results despite slowing down training.

error (mm) ∆
Ours 67.5 –
w/o batch norm 88.5 21.0
w/o dropout 71.4 3.9
w/o batch norm w/o dropout 76.0 8.5
w/o residual connections 75.8 8.3
w/o camera coordinates 101.1 33.6
1 block 74.2 6.7
2 blocks (Ours) 67.5 –
4 blocks 69.3 1.8
8 blocks 69.7 2.4

Table 3.5: Ablative and hyperparameter sensitivity analysis.

3.4.2 Qualitative results

We show some qualitative results on Human3.6M under protocol #1, using the 2D poses from the stacked-hourglass model pre-trained on the MPII dataset, in Figure 3.1. We also show some results on in-the-wild images from the MPII dataset in Figure 3.2.

We can observe certain shortcomings of our approach in Figure 3.2. We can see from the figure that our system cannot recover from a faulty 2D pose estimate, particularly when the 2D pose detector fails completely to generate any meaningful pose. Another limitation is that our model cannot handle poses with unconventional orientations that are not present in the Human3.6M dataset, e.g. a diver diving into a pool. In this case, the person is upside down. Even though our model could capture the pose to some extent, it failed to capture the real orientation of the person.

3.4.3 Discussion of results

If we analyze Table 3.1, we observe a general trend of higher errors in certain action classes like taking a photo, talking on the phone, sitting and sitting down. Most previous work had a hard time dealing with these actions. We attribute the cause of the higher error to severe self-occlusion of body parts in these actions; e.g. in certain phone sequences, one of the hands is hardly visible. The same can be said for actions like sitting and sitting down, where the actors sometimes sit in a way that aligns the legs with the viewpoint of the camera, resulting in one leg being blocked from view, as well as foreshortening.

We have demonstrated empirically that a model based on an architecture as simple as fully connected layers is good enough to achieve a remarkably low error on 3D pose estimation given the 2D poses.
In fact, using the state-of-the-art 2D pose estimator, the stacked-hourglass network [80], we have bettered the best results reported to date.

Figure 3.1: Example of output on the test set of the Human3.6M dataset. (Left) 2D pose, (Middle) 3D ground truth pose in red and blue, (Right) our 3D pose estimates in green and purple.

This result supports our hypothesis that the task of mapping 2D poses to 3D is easier than previously thought, and that it is the error in estimating the human pose in 2D that is the major factor limiting the accuracy of the 3D pose estimation task. This hypothesis is in contrast with the standard deep learning mantra applied to 3D pose estimation, which focuses on training deep networks end-to-end to predict 3D pose directly from images. Pavlakos et al. [87], who had the previous best results by training their network end-to-end, hypothesized that regressing 3D points directly is more difficult than predicting a volumetric heatmap. They also showed in their paper that using a decoupled network, i.e. using the heatmaps as input to the 3D pose estimation system without the image features, decreased the performance of their network despite being trained end-to-end. Our network shows that their hypothesis about regressing 3D points directly being more difficult is not correct. However, we do agree with them that image features may provide valuable contextual information and that 2D heatmaps alone are, for some reason, not good enough to estimate 3D pose effectively, which we will show in the next chapter.

Figure 3.2: Qualitative results on the MPII [5] test set. Observed image, followed by the 2D pose detection from the Stacked Hourglass [80] and (in green) our 3D pose estimate. The bottom 3 examples show typical failure cases, where the 2D detector has failed either totally (left) or marginally (right). In the middle column of the last row, the 2D detector does a good job of estimating the 2D pose, but the person is facing upside-down. The Human3.6M dataset does not provide any corresponding poses which are oriented upside-down. However, our network still seems to predict a meaningful pose, although the orientation is reversed vertically.

Despite this, our network has shown that something as simple as the 2D joint locations alone can be discriminative enough to estimate 3D pose with a remarkably low error rate, using a very simple, totally decoupled network not trained end-to-end. Our network is simple, fast and lightweight and can be trained very easily to obtain state-of-the-art results.

Moreno-Noguer [76] justified the use of a distance matrix as a representation of the human body with the claim that invariant, human-designed features should boost the accuracy of the system. However, we found that using a much simpler representation, the 2D pose, a well trained system can outperform networks that learn from such hand-designed features.

To summarize the findings from our first architecture, our accuracy in 3D pose estimation from ground truth 2D poses suggests that, although 2D pose estimation is considered to be a nearly solved problem, it is one of the root causes of error in the 3D pose estimation task.
Our work also suggests that learning an invariant feature representation of human pose from images by training a network end-to-end may not be as critical as thought, or has not been exploited to its full potential.

Chapter 4

End-to-end model

With our first model, we showed empirically that 2D pose information alone can be discriminative enough to regress 3D joint locations with high accuracy, and that more accurate estimation of the 2D joint locations improves overall performance. To further bolster our argument, we design a network which regresses 3D poses directly from the RGB image.

Our model is inspired by the stacked-hourglass network [80], which predicts 2D joint heatmaps, and by the work of Pavlakos et al. [87], which extends the stacked-hourglass network to predict a 3D volumetric heatmap for each joint. However, instead of predicting the volumetric heatmaps, we want to regress the 3D points directly. For this purpose we overlay our first model on top of the stacked-hourglass network. The joint heatmaps from the last hourglass are vectorized and projected onto a 1024-dimensional vector, which is then passed to residual blocks like those of the first network. The output of the residual block is then projected down to predict the 3D joint locations relative to the root joint. Figure 1.7 shows the architecture of our second network.

However, we found that it is difficult to train such a network end-to-end, and the error in estimating 3D pose is considerably higher. The heatmap for a joint gives the probability or likelihood of the joint being at a particular spatial location. The 2D joint locations are found from a heatmap by applying a 2D argmax to find the spatial index of the maximum value. However, the argmax function is not differentiable. Hence we cannot extract the 2D joint locations from the output of the stacked hourglass, pass them to our residual block, and train the whole network end-to-end. This may suggest that the heatmaps are not as discriminative as 2D joint locations, or that the mapping from heatmaps to 3D joint locations is more difficult than the mapping from 2D joints to 3D, and therefore leads to a higher error.

4.1 Stacked hourglass module

The hourglass module was proposed by Newell et al. [80] for the task of 2D pose estimation. The motivation behind the hourglass structure was to gather the discriminative features and cues needed for understanding human pose at multiple scales. Each hourglass module is composed of several residual modules, which are the same as the bottleneck residual blocks proposed by He et al. [44], discussed in detail in the related work, Section 2.4.3. Each hourglass performs a series of convolutions and max-pooling operations to process features at multiple scales, the lowest resolution being 4×4. The network branches off into two parts before each pooling layer: more convolutions are applied to the pre-pooled features on one branch, while the other branch applies max-pooling to bring the scale down. Once the lowest resolution is reached by successive pooling operations, the features are sequentially up-sampled and combined, in a top-down manner, with the features which branched off and were not max-pooled at the same scale. The up-sampling is done using nearest-neighbour up-sampling and the features are combined by element-wise addition. The output of the hourglass module is a 2D heatmap for each joint, which gives the likelihood of that joint being at a particular spatial location.

Newell et al.
[80] stacked multiple hourglass modules together to build a com-plete 2D pose estimation framework. The heatmaps of joints from an hourglassare projected to a larger number channels using 1× 1 convolution and are addedwith the intermediate features of the hourglass and with the feature output of theprevious hourglass. The resulting output is passed onto the following hourglassas input. The repeated bottom-up and top-down inference over the whole networkhelps later hourglasses to refine the outputs of previous hourglasses. They appliedintermediate supervision at the end of each hourglass to ensure each hourglass pre-dicts accurate estimates of heatmaps thereby allowing later hourglasses to refine60previous estimates. The loss function used by the authors is the Mean SquaredError (MSE) between predicted heatmaps and ground truth heatmaps.4.2 Pre-training stacked-hourglass modelEmpirically, we found that the network does not converge easily when the en-tire model is trained end-to-end with random weight initialization. Therefore, wedecided to pre-train the stacked-hourglass part of our network for 2D pose estima-tion only. We stacked four hourglass modules for the task. Each hourglass hasfour residual modules. We trained the stacked-hourglass module from scratch onthe images of Human3.6M dataset. Following the standard protocol of the Hu-man3.6M dataset, we only used the images of subjects 1,5,6,7,8 for training thenetwork.For this task, we cropped the input image using the bounding box annotationsprovided in the dataset. We first estimated the center of the bounding box fromthe given information and then cropped a 440× 440 region around the estimatedcenter to the network. We performed a random color augmentation in each channelof the image separately during training by multiplying with a scalar value chosenfor each channel from a uniform distribution between 0.6 and 1.4, followed by aclipping to ensure that the resulting intensity values like in the range of 0− 255.Following He et al. [44], we zero center each image by subtracting each channelby the mean values computed from the Imagenet dataset [27, 57]. To generate theground truth 2D heatmaps for each joint from the 2D joint locations, we applieda 2D Gaussian filter, having a zero mean and a standard deviation of 0.75 pixels,over the location of the joint. We applied a Mean Squared Error (MSE) betweenthe predicted and ground truth heat maps over all the poses as the loss function.We applied intermediate supervision to the output of intermediate hourglasses assuggested by Newell et al. [80]. However, we could only stack four hourglassesdue to limitation in memory and time. We used RMSprop optimizer [46] used byNewell et al. [80] for optimizing the network with a learning rate 2.5e− 4 andapplied exponential decay for the learning rate. It took us about a day to train thenetwork on a single NVIDIA Titan X GPU.614.3 Training end-to-endAfter pre-training the hourglass part of the network, we combined our first networkon top of this pre-trained network to train the model end-to-end. Over here, weused the intuition of transfer learning [83] where we expect that the knowledge ofhuman pose acquired by the 2D pose estimation part of the network can help inobtaining better 3D pose estimation.4.3.1 Loss FunctionThe goal of our model is to estimate the 3D joint locations from images directly bytraining the entire network end-to-end. Therefore the input to our system is now anRGB image. 
Let us denote the RGB image as I_{n×n×3}, where n×n is the resolution of the image and 3 is the number of channels. The output of the stacked-hourglass part of the network is a set of 16 heatmaps, one for each of the 16 joints. Each heatmap has resolution 64×64. Let us denote the estimated heatmaps from the last hourglass by H(I)_{64×64×16} and the ground truth heatmaps by G(I)_{64×64×16}. The final output of our system is the estimate of the 3D joint locations, which we denote by \hat{y}; the ground truth is denoted by y.

The loss function for our network is the weighted sum of the Mean Squared Error (MSE) of the 3D joint locations and the MSE of the heatmaps of all the joints over a set of N poses. It is given by

\mathcal{L}(\hat{y}, H(I), y, G(I)) = \min_{\hat{y}, H(I)} \frac{1}{N} \sum_{i=1}^{N} \Big[ \alpha \, \|\hat{y}_i - y_i\|_2^2 + \beta \sum_{j=1}^{16} \sum_{a=1}^{64} \sum_{b=1}^{64} \big\|H(I_{(a,b)})_{j,i} - G(I_{(a,b)})_{j,i}\big\|_2^2 \Big].   (4.1)

In the equation, α and β are hyperparameters controlling the importance of the penalty terms.

As mentioned before, the stacked-hourglass module predicts the heatmaps for 16 joints, namely the central hip, spine, neck, head, and both the left and right hip, knee, ankle, shoulder, elbow and wrist. However, in 3D we predict 17 joints, the extra joint being the nose, simply following the same output format as our first model. We predict the 3D locations of the joints relative to the root node, the central hip, and the ground truth 3D poses are transformed into camera coordinate space.

4.3.2 Data Preprocessing

For training end-to-end, we normalized the 3D ground truth poses by subtracting the mean and dividing by the standard deviation. As in our first model, and following the standard protocol of the Human3.6M dataset, we zero-center the 3D joint locations relative to the root node, since we do not predict the global position of the root. Like the previous model, we predict the 3D pose in the camera coordinate space and hence transformed the ground truth 3D poses into the camera space using the extrinsic camera parameters. The input images are preprocessed in the same way as during the pre-training step in Section 4.2. The ground truth heatmaps are obtained in a similar manner.

4.3.3 Training Details

For end-to-end training we initialized the stacked-hourglass part of our network with the weights learned during pre-training. The rest of the network is initialized using Kaiming initialization [43].

Because a single training pass of a convolutional neural network is expensive, we pick every 20th frame from each training video and randomly sample 50K images from these during a single epoch. We trained our network for 100 epochs. To optimize our network end-to-end we used the Adam [55] optimizer. We started our training with a learning rate of 1e−5 and applied exponential decay on the learning rate as training progressed. We used a mini-batch size of 3 images due to memory limitations on the GPU, and implemented our code in Tensorflow. A single pass over a mini-batch including back-propagation takes around 230ms and a forward pass takes approximately 75ms on an NVIDIA Titan X GPU.

4.4 Experimental evaluation

Datasets and protocols  For our second model we perform quantitative evaluation on the Human3.6M [51] dataset only. We have not chosen HumanEva because, compared to Human3.6M, it is much smaller and the same subjects appear in the training and test sets. Besides, we have already shown the effectiveness of going from 2D pose to 3D by reporting results on both datasets.
Therefore, we felt that it is more important to perform better on Human3.6M, on which most of the recent approaches have been evaluated.

For the second network, we follow the protocols discussed in Section 3.4. However, because the forward pass of our second network takes longer due to its more expensive convolutions, we evaluate every 64th frame of all the actions of subjects 9 and 11. This is a standard protocol used by methods which have trained a CNN end-to-end to estimate 3D pose directly from images [73, 87, 119]. As mentioned before, in protocol #1 the error is the average error per joint in millimeters between the estimated and the ground truth 3D pose relative to the root node (the central hip joint), while in protocol #2 the estimated pose is aligned with the ground truth pose using a similarity transform method such as Procrustes analysis [39]. For this experiment, we trained a single model for all the actions.

4.4.1 Quantitative results

The quantitative results for our second model on the Human3.6M dataset under both protocols, protocol #1 and protocol #2, are shown in Table 4.1. For this experiment, we only report the mean per joint error averaged over all the actions. As can be seen from the table, our end-to-end method performs quite poorly compared to the state-of-the-art methods under both protocols. In fact, our model is second worst of all the methods that report error under protocol #1, and under protocol #2 it is only better than two other methods [2, 92]. This indicates that it is more difficult to train a 3D pose estimator model end-to-end. In particular, it seems that mapping the 2D heatmaps of the joints directly to 3D joint locations is more difficult than mapping from 2D joint locations. We discuss the results more elaborately in Subsection 4.4.3.

Methods Protocol #1 Protocol #2
LinKDE [51] 162.1 –
Akhter & Black [2]* – 181.1
Ramakrishna et al [92]* – 157.3
Bogo et al [16] – 82.3
Moreno-Noguer [76] – 74.0
Tekin et al [117] 125.0 –
Zhou et al [134]* 113.0 106.7
Du et al [29] 126.5 –
Park et al [85] 117.3 –
Zhou et al [133] 107.3 –
Pavlakos et al [87] 71.9 51.9
Our first model (SH detections) 67.5 52.1
Our first model (SH detections FT) 62.9 47.7
Our end-to-end model 144.7 112.2

Table 4.1: Results showing Mean Per Joint Error over all actions on the Human3.6M [51] dataset under protocol #1 (left column) and protocol #2 (right column) respectively. SH indicates 2D pose detections obtained from the stacked-hourglass module [80] trained on the MPII [5] dataset, and FT indicates that the model was fine-tuned on the Human3.6M dataset [51]. The results of the methods are obtained from the original papers, except for (*), which were obtained from [16].

4.4.2 Qualitative results

We show some qualitative results for our second model in Figure 4.1. We can observe from the results that our end-to-end network had a hard time predicting terminal joints such as the ankles and wrists, and the limbs in general. Although it does generally well for walking or standing images, there is a large error for sitting poses.

4.4.3 Discussion of results

Based on the quantitative and qualitative results, we can see that training an end-to-end model has proved to be more difficult, as suggested by the higher average error under both protocols. One reason for the worse results compared to our first model may be that the 2D heatmaps for each joint are not as discriminative as the 2D joint locations for 3D pose estimation, or that the mapping from heatmaps to 3D joint locations is harder. While we could use a 2D argmax function to find the spatial location of maximum likelihood for each joint, this would have prevented us from designing an end-to-end network because the argmax function is not differentiable.
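To make the heatmap-to-coordinate step concrete, the sketch below shows the usual (non-differentiable) way of reading 2D joint locations out of a stack of heatmaps; it is an illustrative NumPy fragment, not part of the network described here, precisely because this argmax cannot be back-propagated through.

import numpy as np

def heatmaps_to_joints(heatmaps):
    """Extract the (x, y) coordinates of the maximum of each joint heatmap.

    heatmaps: (H, W, J) array of per-joint likelihood maps (64 x 64 x 16 here).
    Returns a (J, 2) array of integer joint locations in heatmap coordinates.
    """
    h, w, num_joints = heatmaps.shape
    flat_idx = heatmaps.reshape(h * w, num_joints).argmax(axis=0)  # argmax per joint
    ys, xs = np.unravel_index(flat_idx, (h, w))                    # row, column indices
    return np.stack([xs, ys], axis=1)

# Example with random heatmaps standing in for the hourglass output.
joints_2d = heatmaps_to_joints(np.random.rand(64, 64, 16))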
Therefore, we can argue that it is much easier and simpler for any model to learn a mapping from 2D joint locations to 3D, and, combining this with the results from the last chapter, we can hypothesize that even though most 2D pose estimators give excellent performance, it is the error in estimating the 2D joint locations that gets carried forward when mapping to 3D pose.

One interesting experiment which can be done in the future is, instead of using the heatmaps of the joints as input to the second part of our network, to map the intermediate features learned by the stacked-hourglass module to 3D pose. This would also make the network differentiable end-to-end. However, one limiting factor in this case is that the intermediate features have a resolution of 64×64 and have more than 256 channels; e.g. the intermediate features of our last hourglass have 512 channels. A possible solution may be to reduce the number of channels using a 1×1 convolution, which would reduce the number of connections to the residual block (see Chapter 3).

Figure 4.1: Example of output on the test images of the Human3.6M dataset. (Left) Image, (Middle) 3D ground truth pose in red and blue, (Right) our 3D pose estimates in green and purple.

Chapter 5

Exploiting temporal information

From our previous experiments, we demonstrated that the 2D positions of the joints, despite being low dimensional, provide sufficient information about human pose, and that a simple deep network architecture can efficiently map 2D joint locations into 3D space with high accuracy. We also showed that designing and training a model end-to-end to predict 3D poses directly from images is more difficult and computationally expensive. In our third model, we analyze the effectiveness of incorporating temporal information over a sequence of 2D poses to estimate a sequence of 3D poses.

To exploit the temporal information across a sequence of 2D poses, we designed a sequence-to-sequence network [113] using Long Short-Term Memory (LSTM) units [48] with layer normalization [6] and recurrent dropout [101, 130] for regularization. Additionally, there is a shortcut or residual connection from the input of each unit to the output of that unit on the decoder. Moreover, making predictions on a sequence of frames instead of a single frame allows us to impose temporal smoothness constraints over the joints during training. Figure 1.8 shows the diagram of our final model.

5.1 Network design

Our motivation for using a sequence-to-sequence network comes from its application to the task of Neural Machine Translation (NMT) [113], in which the trained model translates a sentence in one language into a sentence in another language, e.g. English to French. Our task is analogous to the language translation task, in that we transform one form of input, a sequence of 2D joint locations, into a different form of output, a sequence of 3D joint locations. In a language translation model, the input and output sentences can have different lengths. However, our case is simpler than NMT because the input and the output have the same sequence length.

5.1.1 Sequence-to-sequence network with residual connections

As shown in Figure 1.8, our network is a sequence-to-sequence network consisting of an encoder and a decoder component.
Note that the decoder side of the network has shortcut connections [44] connecting the input of each LSTM unit to the prediction of that unit. The encoder side of our network encodes the 2D pose information over a sequence of frames into a fixed-size, high-dimensional vector. The encoding also captures the temporal consistency information over the input sequence.

The initial state of the decoder is initialized with the last state of the encoder LSTM, and a 〈START〉 token, which in our case is a vector of ones, is passed as input to the first time step of the decoder LSTM to start decoding. Suppose the input sequence has a length of t. Once the 〈START〉 token is passed as input to the decoder, it predicts the 3D pose of the first frame, y0, which in turn is passed as input to the next LSTM unit of the decoder, which then predicts the 3D pose for the next frame, y1. In other words, given a 3D pose estimate yt at time step t, each LSTM unit predicts the 3D pose for the next time step, yt+1. Note that the order of the input sequence is reversed, i.e. the 2D pose at time t is passed at the first time step, as recommended by Sutskever et al. [113], who empirically found that it is easier for the decoder of a sequence-to-sequence network to predict the output sequence in the reverse order of the encoder's input sequence.

The residual connections effectively make the decoder learn the amount of change in the 3D position of each joint from the previous frame. This makes it easier for the network to make its output predictions, because it only needs to estimate the perturbation from the 3D pose of the previous frame instead of estimating the absolute 3D pose of a particular frame directly. This observation is in line with the hypothesis of He et al. [44]. To regularize our network, we applied layer normalization to each LSTM unit [6]. We also applied recurrent dropout [101, 130] with a dropout probability of p.
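A minimal sketch of the encoder-decoder recurrence just described is given below. It is written in plain Python/NumPy rather than as the Tensorflow graph of this thesis, and a vanilla tanh recurrence stands in for the layer-normalized LSTM cell; all names, shapes and the random initialization are illustrative assumptions.

import numpy as np

def rnn_step(x, h, Wx, Wh, b):
    """One step of a toy recurrent cell (stand-in for the layer-normalized LSTM)."""
    return np.tanh(x @ Wx + h @ Wh + b)

def seq2seq_decode(x2d_seq, params, out_dim=51):
    """Encode a sequence of 2D poses and decode 3D poses with residual connections.

    x2d_seq: (T, 32) sequence of flattened 2D poses (16 joints x 2).
    Returns a (T, out_dim) sequence of flattened 3D poses (17 joints x 3 = 51).
    """
    Wx_e, Wh_e, b_e, Wx_d, Wh_d, b_d, W_out, b_out = params
    h = np.zeros(Wh_e.shape[0])
    # Encoder: consume the input sequence in reverse order (Sutskever et al.).
    for x in x2d_seq[::-1]:
        h = rnn_step(x, h, Wx_e, Wh_e, b_e)
    # Decoder: start from a <START> token (a vector of ones) and feed each
    # prediction back in; the shortcut adds the unit's input to its output,
    # so the cell only has to model the change from the previous frame.
    y_prev = np.ones(out_dim)
    outputs = []
    for _ in range(len(x2d_seq)):
        h = rnn_step(y_prev, h, Wx_d, Wh_d, b_d)
        y_prev = y_prev + (h @ W_out + b_out)   # residual / shortcut connection
        outputs.append(y_prev)
    return np.stack(outputs)

# Example with random weights: hidden size 64, sequence length 5.
H = 64
rng = np.random.default_rng(0)
params = (rng.normal(size=(32, H)), rng.normal(size=(H, H)), np.zeros(H),
          rng.normal(size=(51, H)), rng.normal(size=(H, H)), np.zeros(H),
          rng.normal(size=(H, 51)), np.zeros(51))
pred_3d_seq = seq2seq_decode(rng.normal(size=(5, 32)), params)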
5.1.2 Layer Normalization

Although batch normalization [50] has been found to be very effective at regularizing deep networks and reducing their training time, its applicability to Recurrent Neural Networks is less straightforward. Unlike deep networks of fixed depth, the summed input to each recurrent neuron varies with the length of the sequence. An RNN would therefore have to store separate normalization statistics for each time step, rather than maintaining batch statistics for each hidden layer as in ordinary feed-forward networks. Moreover, batch normalization is ineffective for online learning tasks, or when the model is so large that it forces a small batch size.

Therefore, to regularize RNNs effectively and to speed up training, Ba et al. [6] proposed layer normalization. Layer normalization estimates the normalization statistics (mean and standard deviation) from the summed inputs to the recurrent neurons of a hidden layer on a single training case, instead of trying to estimate a population mean and variance as batch normalization does.

In any feed-forward neural network, the input to a hidden neuron is a weighted linear combination of the outputs of the neurons of the previous hidden layer, to which a non-linear function like a ReLU or sigmoid is applied. In layer normalization, the normalization statistics over all the hidden units in the same layer are computed by

\mu^k = \frac{1}{H} \sum_{i=1}^{H} a_i^k, \qquad \sigma^k = \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(a_i^k - \mu^k\right)^2}.   (5.1)

In these equations, µ^k represents the mean and σ^k the standard deviation over all the recurrent neurons in hidden layer k, and a_i^k represents the input to hidden unit i in layer k, which is a linear combination of the outputs of the hidden units of the previous layer. The total number of hidden units in a layer is denoted by H. All the hidden units in the same layer share the same normalization terms, as in batch normalization, but instead of estimating a population mean and variance over a training mini-batch, different training examples lead to different normalization terms. Additionally, there is no constraint on the training mini-batch size in layer normalization; it can be used even with a batch size of 1, and it performs the same computation during training and test time.
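A small sketch of Eq. (5.1) applied to one vector of summed inputs, followed by the usual learned gain and bias of layer normalization; this is an illustrative NumPy version rather than the Tensorflow implementation used in the model.

import numpy as np

def layer_norm(a, gain, bias, eps=1e-5):
    """Normalize the summed inputs of one layer for a single training case (Eq. 5.1).

    a:    (H,) summed inputs to the H hidden units of the layer.
    gain: (H,) learned scale applied after normalization.
    bias: (H,) learned shift applied after normalization.
    """
    mu = a.mean()                                   # mean over hidden units, not over the batch
    sigma = np.sqrt(((a - mu) ** 2).mean() + eps)   # standard deviation over hidden units
    return gain * (a - mu) / sigma + bias

# Example: 1024 hidden units, identity gain and zero bias.
a = np.random.randn(1024)
normalized = layer_norm(a, np.ones(1024), np.zeros(1024))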
5.1.3 Recurrent Dropout

Although dropout is very popular for regularizing deep networks, it does not work well for RNNs. Zaremba et al. [130] proposed a technique for effectively applying dropout to LSTM cells so that overfitting can be reduced. They proposed applying dropout, with a certain probability p, only to the non-recurrent connections of the network, while always keeping the recurrent connections intact. The dropout operation thus adds noise to the information propagating through the LSTM units, making their intermediate computations more robust, while leaving the recurrent connections untouched ensures that each LSTM unit can still remember events that occurred many time steps in the past. By using this recurrent dropout technique, LSTMs can therefore be regularized effectively without sacrificing their ability to memorize, in contrast to the vanilla dropout operation, which also drops recurrent connections and thereby inhibits LSTMs from memorizing information over long time spans.

5.1.4 Temporal smoothness constraint

One issue with making 3D pose predictions from the 2D poses of each frame individually is that the error in one frame is independent of the others. The lack of aggregated error information over a sequence of frames tends to cause temporally jittery predictions. In fact, a high error in estimating the 3D pose of one frame can make the predictions appear inconsistent over time.

Since our final network makes predictions over a sequence of 2D poses, we can easily apply a temporal smoothness constraint to ensure that the 3D joint locations of successive frames do not differ by too much. We apply this constraint by adding the L2 norm of the first-order derivative of the 3D joint locations with respect to time to our loss function during training.

However, from our empirical observations with the first network, we found that certain joints, e.g. the wrists, ankles and elbows, are difficult to estimate accurately. In fact, compared to the rest, these joints contribute most to the overall mean error. To address this issue, we partitioned the joints into three disjoint sets, torso head, limb mid and limb terminal, based on the magnitude of the error in estimating them. We observed that the joints connected to the torso and head, e.g. the hips, shoulders and neck, are always predicted with high accuracy, since these body parts tend to be more rigid than the limbs; these joints are therefore put in the set torso head. On the other hand, the joints of the limbs are always more difficult to predict due to their high range of motion, and in our observation the terminal joints of the limbs, i.e. the wrists and ankles, are harder to predict accurately than the knees and elbows. Therefore, we put the knees and the elbows in the set limb mid and the terminal joints in the set limb terminal. To reduce jitter, we multiply the derivatives of each set of joints by a different scalar value, with the highest weight assigned to the derivatives of the terminal joints, followed by the mid-limb joints and then the torso and head joints. This ensures that the derivatives of the terminal joints are penalized more than the derivatives of the torso joints.

5.1.5 Loss function

The loss function of our network is the sum of two terms: the Mean Squared Error (MSE) of N different sequences of 3D joint locations, and the mean of the L2 norm of the first-order derivative, with respect to time, of the N sequences of 3D joint locations, where the joints are divided into the three disjoint sets described in the last subsection.

The MSE over N sequences, each of T time steps, of 3D joint locations is given by

L(\hat{Y}, Y) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \|\hat{Y}_{i,t} - Y_{i,t}\|_2^2.   (5.2)

Here, \hat{Y} denotes the estimated 3D joint locations while Y denotes the 3D ground truth.

The mean of the L2 norm of the first-order derivative of the N sequences of 3D joint locations, each of length T, with respect to time is given by

\|\nabla_t \hat{Y}\|_2^2 = \frac{1}{N(T-1)} \sum_{i=1}^{N} \sum_{t=2}^{T} \Big\{ \eta \, \|\hat{Y}^{TH}_{i,t} - \hat{Y}^{TH}_{i,t-1}\|_2^2 + \rho \, \|\hat{Y}^{LM}_{i,t} - \hat{Y}^{LM}_{i,t-1}\|_2^2 + \tau \, \|\hat{Y}^{LT}_{i,t} - \hat{Y}^{LT}_{i,t-1}\|_2^2 \Big\}.   (5.3)

As mentioned in the last subsection, the joints are divided into three disjoint sets based on how error-prone they are. In the above equation, \hat{Y}^{TH}, \hat{Y}^{LM} and \hat{Y}^{LT} denote the predicted 3D locations of the joints belonging to the sets torso head, limb mid and limb terminal respectively. The scalars η, ρ and τ are hyper-parameters controlling the weight given to the derivatives of the 3D locations of each of the three sets of joints; a higher weight is assigned to the sets of joints which are generally predicted with higher error.

The overall loss function for our network is given as

\mathcal{L} = \min_{\hat{Y}} \; \alpha \, L(\hat{Y}, Y) + \beta \, \|\nabla_t \hat{Y}\|_2^2.   (5.4)

Here α and β are scalar hyper-parameters regulating the importance of each of the two terms in the loss function.
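A hedged NumPy sketch of the combined objective in Eqs. (5.2)-(5.4) is given below, computing the joint-group-weighted temporal term over a batch of predicted sequences. The index lists for the three joint groups are placeholders that would have to match the 17-joint ordering of the real data, and the default weights follow the values reported later in Section 5.2.1.

import numpy as np

# Placeholder joint-index groups for a 17-joint skeleton (illustrative only).
TORSO_HEAD    = [0, 1, 2, 3, 4, 5, 6]
LIMB_MID      = [7, 8, 9, 10]              # knees and elbows
LIMB_TERMINAL = [11, 12, 13, 14, 15, 16]   # wrists and ankles, etc.

def pose_loss(pred, gt, alpha=1.0, beta=5.0, eta=1.0, rho=2.5, tau=4.0):
    """Weighted sum of the 3D MSE (Eq. 5.2) and the temporal smoothness term (Eq. 5.3).

    pred, gt: (N, T, 17, 3) arrays of predicted and ground truth joint positions.
    """
    mse = np.mean(np.sum((pred - gt) ** 2, axis=(-2, -1)))       # Eq. (5.2)

    diff = pred[:, 1:] - pred[:, :-1]                            # frame-to-frame change
    sq = np.sum(diff ** 2, axis=-1)                              # (N, T-1, 17)
    smooth = (eta * sq[..., TORSO_HEAD].sum(-1)
              + rho * sq[..., LIMB_MID].sum(-1)
              + tau * sq[..., LIMB_TERMINAL].sum(-1)).mean()     # Eq. (5.3)

    return alpha * mse + beta * smooth                           # Eq. (5.4)

# Example: 32 sequences of length 5.
pred = np.random.randn(32, 5, 17, 3)
gt = np.random.randn(32, 5, 17, 3)
loss = pose_loss(pred, gt)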
5.2 Data Preprocessing

For our sequence-to-sequence network, we normalized the 3D ground truth poses, the noisy 2D pose estimates from the stacked-hourglass network [80] and the 2D ground truth by subtracting the mean and dividing by the standard deviation, in the same manner as in Chapter 3. Just like our previous two models, we do not predict the 3D location of the root joint, i.e. the central hip joint, and hence we zero-center the 3D joint locations relative to the global position of the root node. The 3D poses are predicted in the camera coordinate frame, and the ground truth 3D poses are transformed into the camera coordinate frame using the ground truth parameters of the camera.

Like our first model, we obtain 2D joint locations both from the stacked-hourglass model pre-trained on the MPII dataset [5] and from the model we fine-tuned on the Human3.6M dataset for our first network. The detections in both cases were obtained in the manner described in Chapter 3.

To generate the training sequences, we used a sliding window of length T. The window is slid by one frame at a time to generate the input and output sequences of 2D and 3D poses, each of length T. The training sequences therefore overlap, which gives us more data to train on, always an advantage for deep learning systems. To generate the test sequences, the sliding window is slid in a non-overlapping manner, i.e. the stride of the sliding window equals its size.

5.2.1 Training details

We trained our final network for 100 epochs, where each epoch makes a complete pass over the entire Human3.6M dataset, just like our first network. Because we use 2D joint locations as input, which are low dimensional, we can store the entire Human3.6M dataset in GPU memory. We used the Adam [55] optimizer for training the network, with a learning rate of 1e−5 decayed exponentially per iteration. The weights of the LSTM units are initialized with the Xavier uniform initializer [38]. We used a mini-batch size of 32, i.e. 32 sequences. For most of our experiments we used a sequence length of 5, because it allows fast training with high accuracy. We experimented with different sequence lengths and found lengths of 4, 5 and 6 to generally give better results, which we will discuss in detail in the results section. Our code is implemented in Tensorflow, just like the previous two models. We empirically set the hyper-parameter values α and β of our loss function to 1 and 5 respectively. Similarly, the three hyper-parameters of the temporal consistency constraint, η, ρ and τ, are set to 1, 2.5 and 4 respectively. A single training step, including both the forward pass and back-propagation, for a sequence of length 5 takes only about 34 ms, while a forward pass takes only about 16 ms on an NVIDIA Titan X GPU. Therefore, on average, our network takes only about 3.2 ms to predict the 3D pose of a frame, which is only slightly slower than our first network, which predicts 3D pose at 2 ms per frame, but with a higher accuracy. Our final network is thus simple and fast to train, which allowed us to experiment with different hyper-parameters and components of the architecture.

5.3 Experimental evaluation

Datasets and protocols  For our final model, we perform quantitative evaluation on the Human3.6M [51] dataset, as for our second model. For qualitative evaluation, we used some videos from Youtube and from the Human3.6M dataset.

For our final experiment we follow the standard protocol of the Human3.6M dataset described in Chapter 3. As described in previous chapters, protocol #1 requires using subjects 1, 5, 6, 7, and 8 for training and subjects 9 and 11 for testing, and the error is evaluated on the predicted 3D pose without any transformation. In protocol #2, the predicted pose is rigidly aligned to the ground truth pose using a similarity transform. As with our previous two models, the error metric is the average error per joint in millimeters between the estimated and the ground truth 3D pose relative to the root node. We trained a single model for all the actions.

5.3.1 Quantitative results

Evaluation on estimated 2D pose

As shown by our previous two models, mapping 2D joint locations to 3D is an easier task for deep network models than directly predicting 3D pose from images. Our final network takes the idea of decoupling the 3D pose estimation task even further: we want to see the effect of exploiting temporal information by using a sequence of 2D joint locations to predict a sequence of 3D joint locations. As mentioned in Chapter 3, we obtain two sets of 2D pose detections, one from the stacked-hourglass model pre-trained on the MPII [5] dataset and one from the model we fine-tuned on the Human3.6M dataset [51]. We use a sequence length of 5 to evaluate our final model.

The results on the Human3.6M dataset [51] under protocol #1 are shown in Table 5.1. As we can see, our final model achieves state-of-the-art results under protocol #1.
Compared to our first network, our final network achieves significantlybetter performance for both noisy 2D estimates and ground truth 2D pose. For the2D estimates from stacked-hourglass model pre-trained on MPII dataset, our finalnetwork has an error of approximately 12 mm less than that of the first model. As75Protocol #1 Direct. Discuss Eating Greet Phone Photo Pose Purch. Sitting SitingD Smoke Wait WalkD Walk WalkT AvgLinKDE [51] (SA) 132.7 183.6 132.3 164.4 162.1 205.9 150.6 171.3 151.6 243.0 162.1 170.7 177.1 96.6 127.9 162.1Li et al [64] (MA) – 136.9 96.9 124.7 – 168.7 – – – – – – 132.2 70.0 – –Tekin et al [117] (SA) 102.4 147.2 88.8 125.3 118.0 182.7 112.4 129.2 138.9 224.9 118.4 138.8 126.3 55.1 65.8 125.0Zhou et al [134] (MA) 87.4 109.3 87.1 103.2 116.2 143.3 106.9 99.8 124.5 199.2 107.4 118.1 114.2 79.4 97.7 113.0Tekin et al [116] (SA) – 129.1 91.4 121.7 – 162.2 – – – – – – 130.5 65.8 – –Ghezelghieh et al [37] (SA) 80.3 80.4 78.1 89.7 – – – – – – – – – 95.1 82.2 –Du et al [29] (SA) 85.1 112.7 104.9 122.1 139.1 135.9 105.9 166.2 117.5 226.9 120.0 117.7 137.4 99.3 106.5 126.5Park et al [85] (SA) 100.3 116.2 90.0 116.5 115.3 149.5 117.6 106.9 137.2 190.8 105.8 125.1 131.9 62.6 96.2 117.3Zhou et al [133] (MA) 91.8 102.4 96.7 98.8 113.4 125.2 90.0 93.8 132.2 159.0 107.0 94.4 126.0 79.0 99.0 107.3Nie et al [81] (MA) 90.1 88.2 85.7 95.6 103.9 103.0 92.4 90.4 117.9 136.4 98.5 94.4 90.6 86.0 89.5 97.5Rogez et al [73] (MA) – – – – – – – – – – – – – – – 88.1Mehta et al [73] (MA) 57.5 68.6 59.6 67.3 78.1 82.4 56.9 69.1 100.0 117.5 69.4 68.0 76.5 55.2 61.4 72.9Mehta et al [74] (MA) 62.6 78.1 63.4 72.5 88.3 93.8 63.1 74.8 106.6 138.7 78.8 73.9 82.0 55.8 59.6 80.5Lin et al [65] (MA) 58.0 68.2 63.3 65.8 75.3 93.1 61.2 65.7 98.7 127.7 70.4 68.2 72.9 50.6 57.7 73.1Tome et al [119] (MA) 65.0 73.5 76.8 86.4 86.3 110.7 68.9 74.8 110.2 173.9 84.9 85.8 86.3 71.4 73.1 88.4Tekin et al [118] 54.2 61.4 60.2 61.2 79.4 78.3 63.1 81.6 70.1 107.3 69.3 70.3 74.3 51.8 63.2 69.7Pavlakos et al [87] (MA) 67.4 71.9 66.7 69.1 72.0 77.0 65.0 68.3 83.7 96.5 71.7 65.8 74.9 59.1 63.2 71.9Our first model (SH detections) (MA) 53.3 60.8 62.9 62.7 86.4 82.4 57.8 58.7 81.9 99.8 69.1 63.9 67.1 50.9 54.8 67.5Our first model (SH detections FT) (MA) 51.8 56.2 58.1 59.0 69.5 78.4 55.2 58.1 74.0 94.6 62.3 59.1 65.1 49.5 52.4 62.9Our seq-2-seq model (SH detections) (MA) 45.2 51.0 55.5 51.3 75.3 62.6 48.4 47.4 67.6 75.4 61.0 52.1 53.6 43.9 45.6 55.7Our seq-2-seq model (SH detections FT) (MA) 44.2 46.7 52.3 49.3 59.9 59.4 47.5 46.2 59.9 65.6 55.8 50.4 52.3 43.5 45.1 51.9Our first model (GT detections) (MA) 37.7 44.4 40.3 42.1 48.2 54.9 44.4 42.1 54.6 58.0 45.1 46.4 47.6 36.4 40.4 45.5Our seq-2-seq model (GT detections) (MA) 35.2 40.8 37.2 37.4 43.2 44.0 38.9 35.6 42.3 44.6 39.7 39.7 40.2 32.8 35.5 39.2Table 5.1: Results showing errors action-wise on Human3.6M [51] underProtocol #1 (no rigid alignment or similarity transform applied in post-processing). Note that our results reported here are for sequence of length5. SH indicates that we trained and tested our model with the detectionsof Stacked Hourglass [80] model pre-trained on MPII dataset [5] as input,and FT indicates that the the stacked-hourglass model was fine-tuned onHuman3.6M. 
SA indicates that a model was trained for each action, andMA indicates that a single model was trained for all actions.The bold-faced numbers mean the best result while underlined numbers representthe second best.stated before in Chapter 3, a better 2D pose estimate improves the performanceof the network. When trained on detections of a fine-tuned 2D pose detector, theerror of our final network decreased by approximately 4 mm. As can be seen fromTable 5.1, the error of our network for fine-tuned 2D detections is 51.9 mm whichis 11 mm lower than the error of our first model which had an error of 62.9 mm onfine-tuned detections. Our sequence-to-sequence model beats the previous state-of-the-art by Pavlakos et al. [87] by 20 mm (almost 28% better) on protocol #1.The results for protocol #2, which aligns the predictions to the ground truthusing a similarity transform before computing error, is reported in Table 5.2. Ourmethod improves the results of our first model by 8.1 mm and 5.7 mm for detec-76Protocol #2 Direct. Discuss Eating Greet Phone Photo Pose Purch. Sitting SitingD Smoke Wait WalkD Walk WalkT AvgAkhter & Black [2]* (MA) 14j 199.2 177.6 161.8 197.8 176.2 186.5 195.4 167.3 160.7 173.7 177.8 181.9 176.2 198.6 192.7 181.1Ramakrishna et al [92]* (MA) 14j 137.4 149.3 141.6 154.3 157.7 158.9 141.8 158.1 168.6 175.6 160.4 161.7 150.0 174.8 150.2 157.3Zhou et al [134]* (MA) 14j 99.7 95.8 87.9 116.8 108.3 107.3 93.5 95.3 109.1 137.5 106.0 102.2 106.5 110.4 115.2 106.7Rogez et al [73] (MA) – – – – – – – – – – – – – – – 87.3Nie et al [81] (MA) 62.8 69.2 79.6 78.8 80.8 86.9 72.5 73.9 96.1 106.9 88.0 70.7 76.5 71.9 76.5 79.5Mehta et al [73] (MA) 14j – – – – – – – – – – – – – – – 54.6Bogo et al [16] (MA) 14j 62.0 60.2 67.8 76.5 92.1 77.0 73.0 75.3 100.3 137.3 83.4 77.3 86.8 79.7 87.7 82.3Moreno-Noguer [76] (MA) 14j 66.1 61.7 84.5 73.7 65.2 67.2 60.9 67.3 103.5 74.6 92.6 69.6 71.5 78.0 73.2 74.0Tekin et al [118] (MA) 17j – – – – – – – – – – – – – – – 50.1Pavlakos et al [87] (MA) 17j – – – – – – – – – – – – – – – 51.9Our first model (SH detections) (MA) 17j 42.2 48.0 49.8 50.8 61.7 60.7 44.2 43.6 64.3 76.5 55.8 49.1 53.6 40.8 46.4 52.5Our first model (SH detections FT) (MA) 17j 39.5 43.2 46.4 47.0 51.0 56.0 41.4 40.6 56.5 69.4 49.2 45.0 49.5 38.0 43.1 47.7Our seq-2-seq model (SH detections) (MA) 37.7 41.2 45.5 42.4 54.9 48.9 38.1 37.2 54.1 57.7 49.2 40.9 44.7 35.0 38.9 44.4Our seq-2-seq model (SH detections FT) (MA) 36.9 37.9 42.8 40.3 46.8 46.7 37.7 36.5 48.9 52.6 45.6 39.6 43.5 35.2 38.5 42.0Table 5.2: Results showing errors action-wise on Human3.6M [51] datasetunder protocol #2 (rigid alignment in post-processing). Note that theresults reported here are for sequence of length 5. The 14j annotationindicates that the body model considers 14 body joints while 17j meansconsiders 17 body joints. (SA) annotation indicates per-action modelwhile (MA) indicates single model used for all actions. FT indicates thatthe stacked-hourglass model has been fine-tuned on Human3.6M dataset.The bold-faced numbers mean the best result while underlined numbersrepresent the second best. The results of the methods are obtained fromthe original papers, except for (*), which were obtained from [16].tions from pre-trained and fine-tuned models respectively. This beats the previousstate-of-the-art by Tekin et al. [118] by 5.7 mm for detections from out-of-the-boxhourglass model and by 8.1 mm for fine-tuned model. 
Like protocol #1, our model also achieves the best result under protocol #2.

From the above tables, we observe that exploiting temporal information across multiple frames is indeed useful. It significantly improves the overall accuracy of the estimated 3D joint locations, particularly on actions like phoning and sitting down, on which most of the previous approaches have performed poorly due to heavy occlusion. For the detections from the fine-tuned stacked hourglass, our network achieves the lowest error on every action class of the Human3.6M dataset. Note that we used the same detections from the fine-tuned stacked hourglass in our first network.

Evaluation on 2D ground truth

As suggested by the results from our first model, the more accurate the 2D joint locations are, the better the estimates for 3D pose. We carried out the same experiment for our final network to show that the lower bound on the 3D pose error can be decreased even further by exploiting temporal information across the sequences. We used sequences of 2D ground truth poses of length 5 as input to train our network. The results of experimenting with ground truth 2D joint locations under protocol #1 are reported in Table 5.1. As seen from the table, our sequence-to-sequence model improves the lower bound error of our first network by almost 6.3 mm.

The results for protocol #2 are reported in Table 5.3, where we show the robustness of our network, which is trained using the ground truth 2D pose and tested with different levels of Gaussian noise. We can see that even under protocol #2 our final network outperforms our first network when there is no noise in the 2D joint locations.

From the results mentioned above, we can hypothesize that temporal consistency information over a sequence of poses is a valuable cue for the task of estimating 3D pose. Even on noise-free ground truth data, the temporal information improves the overall performance.

Performance on different sequence lengths

The results reported so far have been for input and output sequences of length 5 only. We carried out experiments to see how our network performs for different sequence lengths ranging from 2 to 10. The results are shown in Figure 5.1. We carried out this experiment for 2D detections from both the out-of-the-box stacked hourglass and the fine-tuned one. As can be seen, the performance of our network in both cases remains stable for sequences of varying lengths. The best results were obtained for lengths 4, 5 and 6. However, we chose sequence length 5 for our experiments as a compromise between training time and accuracy.

Figure 5.1: Mean Per Joint Error (MPJE) in mm of our network for different sequence lengths. SH Pre-trained indicates that 2D poses are estimated using the stacked-hourglass model pre-trained on MPII [5], while SH FT indicates that the detections were obtained from the stacked-hourglass model fine-tuned by us on the Human3.6M dataset.

Robustness to noise

Like our first model, to test the tolerance of our final model to noise in the input 2D joint locations, we carried out experiments where we train our model on ground truth 2D pose data and evaluate its performance on inputs corrupted by different levels of Gaussian noise (the corruption procedure is sketched below). As mentioned in Chapter 3, we use protocol #2 for this comparison, which rigidly aligns the output with the ground truth. Table 5.3 shows how our final model compares against the model by Moreno-Noguer [76] and our first network.
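The corruption used in this experiment is simple; a minimal sketch is shown below. The function name and the evaluation loop are illustrative only (they are not taken from our code base); the noise is zero-mean Gaussian with the standard deviation given in pixels on the 440×440 crop.

```python
import numpy as np

def corrupt_2d_poses(poses_2d, sigma):
    """poses_2d: (num_frames, num_joints, 2) ground truth 2D joints in pixels."""
    noise = np.random.normal(loc=0.0, scale=sigma, size=poses_2d.shape)
    return poses_2d + noise

# Illustrative evaluation loop: the model is trained on clean 2D ground truth and
# only the test inputs are perturbed.
# for sigma in (5, 10, 15, 20):
#     noisy_inputs = corrupt_2d_poses(test_poses_2d, sigma)
#     evaluate_protocol2(model, noisy_inputs, test_poses_3d)  # hypothetical helper
```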
Both of our networks are significantly more robust to noise than Moreno-Noguer's model [76]. Comparing our two networks, we find a similar level of tolerance to noise. Our sequence-to-sequence network trained on ground truth 2D pose fares better when the level of input noise is low (standard deviation below 10), whereas our first model proves to be marginally more robust at higher noise levels. We have also evaluated the case when our network was tested with noisy detections from the stacked-hourglass model [80] not fine-tuned on Human3.6M data. The out-of-the-box stacked-hourglass network has an error of 15 pixels on average per joint. Similar to our observation for higher levels of Gaussian noise, our sequence-to-sequence network slightly under-performs compared to our first model. Note that the size of the cropped region around the person is 440×440. One reason why our sequence-to-sequence network, trained on ground truth data, is more sensitive to higher levels of noise may be that, because of the temporal smoothness constraint, the errors from individual frames get distributed over the entire sequence to maintain smoothness, whereas for our first model the errors are independent in each frame.

                     DMR [76]   Our first model   Ours (seq-2-seq)
GT/GT                62.17      37.10             31.67
GT/GT + N(0,5)       67.11      46.65             37.46
GT/GT + N(0,10)      79.12      52.84             49.41
GT/GT + N(0,15)      96.08      59.97             61.80
GT/GT + N(0,20)      115.55     70.24             73.65
GT/SH [80]           –          60.52             62.43

Table 5.3: Performance of our system trained with ground truth 2D pose of the Human3.6M [51] dataset and tested under different levels of additive Gaussian noise (top) and on 2D pose predictions from the stacked-hourglass [80] pose detector (bottom) under protocol #2. The size of the cropped region around the person is 440×440.

Ablative analysis

To show the effectiveness of the different components of our network, we perform an ablative analysis. We follow protocol #1 for the ablative analysis and trained a single model for all the actions. The errors reported here are for 2D pose predictions from the fine-tuned stacked-hourglass network [80]. The results are reported in Table 5.4.

                                          error (mm)   ∆
Ours                                      51.9         –
w/o temporal consistency constraint       52.7         0.8
w/o recurrent dropout                     58.3         6.4
w/o layer normalized LSTM                 61.1         9.2
w/o layer norm and recurrent dropout      59.5         7.6
w/o residual connections                  102.4        50.5

Table 5.4: Ablative and hyperparameter sensitivity analysis.

From the table, we observe that the biggest improvement comes from the residual connections on the decoder side, which agrees with the hypothesis of He et al. [44]. Removing the residual connections increases the error by 50.5 mm, which is a huge margin. When we train our network without layer normalization on the LSTM units, the error increases by 9.2 mm, and when no recurrent dropout is used, the error rises by 6.4 mm. If neither layer normalization nor recurrent dropout is used, the results get worse by 7.6 mm. Although the temporal consistency constraint may seem to have less impact (only 0.8 mm) on the performance of our network, it ensures that the predictions over a sequence are smooth and temporally consistent, which is apparent from our qualitative results discussed in the next section.
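Before turning to those qualitative results, the sketch below makes the temporal consistency term from Table 5.4 concrete: a per-frame 3D joint loss plus a penalty on the difference between consecutive predictions. The exact form and weighting used in our implementation may differ; this is an illustration of the idea rather than our training code.

```python
import numpy as np

def sequence_training_loss(pred_seq, gt_seq, smooth_weight=1.0):
    """pred_seq, gt_seq: (seq_len, num_joints, 3) arrays of 3D poses."""
    # Per-frame supervision: squared error on every joint of every frame.
    joint_loss = np.mean(np.sum((pred_seq - gt_seq) ** 2, axis=-1))
    # Temporal consistency: penalize large changes between consecutive predictions.
    diffs = pred_seq[1:] - pred_seq[:-1]
    smooth_loss = np.mean(np.sum(diffs ** 2, axis=-1))
    return joint_loss + smooth_weight * smooth_loss
```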
5.3.2 Qualitative results

We provide some qualitative results on Human3.6M sequences and some YouTube videos. The 2D poses were detected using the fine-tuned stacked-hourglass model. We show qualitative results on Human3.6M under protocol #1 in Figure 5.2, Figure 5.3 and Figure 5.4. The results for YouTube videos are shown in Figure 5.5, Figure 5.6 and Figure 5.7. To generate these results we used our network trained on fine-tuned stacked-hourglass predictions with a sequence length of 5.

We can see that for the Human3.6M sequences, our network predicts smooth and temporally consistent 3D poses in challenging actions like sitting down, phoning and taking photo, on which most methods performed worse than on other actions. In particular, in Figure 5.2 we can see that the 2D detection for the second frame is noisy, yet our network manages to estimate a temporally consistent 3D pose based on the information from the previous frame.

The real advantage of using the temporal smoothness constraint during training is apparent in the results of our network on YouTube video sequences. As can be seen in Figure 5.5, for the 3rd and 5th frames the 2D pose detector totally breaks and estimates unrealistic 2D poses, yet our network was able to recover a meaningful and consistent 3D pose by exploiting the temporal information. Also, in Figure 5.7, the 2D pose estimator generates very noisy poses from which our network successfully predicts temporally coherent 3D poses.

5.3.3 Discussion of results

Both the quantitative and qualitative results for our sequence-to-sequence network show the effectiveness of exploiting temporal information over multiple frames to estimate 3D poses which are temporally smooth. Our network achieved the best accuracy on all of the 15 actions, which is a remarkable feat. In particular, most of the previous work struggled with actions which have a high degree of occlusion, like taking photo, talking on the phone, sitting and sitting down. Our network has significantly better results for these actions; e.g., for the sitting down action our error is lower by an impressive 29 mm, while for the rest of the complicated actions the improvement ranges between 6 and 19 mm.

We have seen that our network is reasonably robust to noisy 2D poses. Although the contribution of the temporal smoothness constraint is not apparent in the ablative analysis in Table 5.4, its effectiveness is highlighted in the qualitative results, particularly on the challenging YouTube videos, where we observe that, even when the 2D pose estimator breaks and generates faulty predictions, our network can recover a meaningful 3D pose.

Our final network effectively demonstrates the power of using temporal information, and we achieved it using a simple sequence-to-sequence network which can be trained efficiently in a reasonably short time. Our network also makes predictions at 3 ms per frame on average, which suggests that, provided the 2D pose detector runs in real time, our network can be applied in real-time scenarios.

Figure 5.2: Qualitative result of Subject 11, action sitting down, for the Human3.6M dataset [51]. (Left) Image with 2D pose, (Middle) 3D ground truth pose in red and blue, (Right) 3D pose estimates in green and purple.

Figure 5.3: Qualitative result of Subject 9, action phoning, for the Human3.6M dataset [51]. (Left) Image with 2D pose, (Middle) 3D ground truth pose in red and blue, (Right) 3D pose estimates in green and purple.

Figure 5.4: Qualitative result of Subject 11, action taking photo, for the Human3.6M dataset [51]. (Left) Images with 2D pose detections, (Middle) 3D ground truth pose in red and blue, (Right) 3D pose estimates in green and purple.

Figure 5.5: Qualitative results on YouTube videos. (Left) Images with 2D pose detections, (Right) our 3D pose estimation.

Figure 5.6: Qualitative results on YouTube videos.
(Left) Images with 2D pose detections, (Right) our 3D pose estimation.

Figure 5.7: Qualitative results on YouTube videos. (Left) Images with 2D pose detections, (Right) our 3D pose estimation.

Chapter 6: Conclusion and future work

In this work, we analyzed the sources of error for the task of 3D pose estimation. We designed three different deep-network-based models to address the task in three different manners. One key issue for 3D pose estimation is that the major source of error for the task is still not well understood. The difficulty of predicting 3D joints from images can arise from any of these reasons:

• error in predicting 2D joint locations from an image;
• difficulty in learning image features that can be reliably mapped to 3D joint locations;
• difficulty in mapping a 2D pose representation to 3D pose.

Our first network decouples the task of estimating 3D pose from an image into two parts: i) estimating 2D joint locations and ii) transforming the 2D pose to 3D. In this experiment we wanted to verify how accurately 2D poses can be translated to 3D. Empirically, we found that a simple network composed of a set of fully connected linear layers with residual connections predicted 3D pose from ground truth 2D pose with remarkably high accuracy, almost 30% better than the state-of-the-art (a minimal sketch of such a lifting block is given below). When trained with noisy 2D detections from a pre-trained 2D pose detector, the error increased, but still gave us a better result than the state-of-the-art model by Tekin et al. [118]. When the same network was trained with the 2D pose detector fine-tuned on the Human3.6M [51] dataset, the results improved by more than 7%. From these results we hypothesize that the task of mapping 2D joint locations to 3D is easier than mapping to 3D joint locations directly from the image.

To further test our hypothesis, we trained a second model end-to-end that predicts 3D poses from an image directly by stacking our first model on top of a 2D pose estimator, such that the 2D joint heatmaps are fed as input to the second part. We found that it is much more difficult to train this network, because the prediction error was significantly higher compared to our first network. This further supported our hypothesis that, although state-of-the-art 2D pose estimators are very accurate, the noise in their detections is the primary cause of error in 3D pose estimation. Through these experiments, we also found that 2D joint locations, despite being low in dimension, are a better feature for learning 3D pose than 2D joint heatmaps.

The results obtained contradict the hypotheses proposed by recent methods for 3D pose estimation to justify their complex systems trained end-to-end to predict 3D pose from images. For example, Pavlakos et al. [87] claimed that it is more difficult to regress 3D joint locations directly than to predict volumetric heatmaps of joints, whereas our first network showed that 3D joint locations can be predicted with high accuracy from something as simple as the 2D coordinates of joints. Although image features can provide useful cues, we would like to argue that finding invariant features and complex representations of 3D pose, which has been the focus of a majority of recent approaches, may either not be that important or has not yet been utilized to its full potential.
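To make the simplicity of this 2D-to-3D mapping concrete, the following is a minimal sketch of a lifting block of the kind used by our first network: fully connected layers with batch normalization, ReLU and dropout, wrapped in a residual connection. The layer width, dropout rate, number of blocks and joint count shown here are illustrative rather than the exact hyperparameters of our model, and the Keras code is a re-implementation sketch rather than our original training script.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_JOINTS = 16  # illustrative; the exact joint set depends on the evaluation protocol

def residual_block(x, width=1024, dropout=0.5):
    y = layers.Dense(width)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Dropout(dropout)(y)
    y = layers.Dense(width)(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Dropout(dropout)(y)
    return layers.Add()([x, y])                     # residual (shortcut) connection

inputs = layers.Input(shape=(NUM_JOINTS * 2,))      # flattened 2D joint coordinates
x = layers.Dense(1024)(inputs)                      # project into the working width
x = residual_block(x)
x = residual_block(x)
outputs = layers.Dense(NUM_JOINTS * 3)(x)           # flattened root-relative 3D joints
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")         # simple squared-error regression
```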
In our third and final model, we wanted to examine the effect of exploiting temporal consistency information over a sequence. For this purpose, we designed a sequence-to-sequence network with shortcut connections on the decoder side, which connect the input of the decoder to its output. Given a sequence of 2D joint locations, our network predicts a sequence of 3D poses. We also imposed a temporal consistency constraint on the network during training. Our final network significantly outperformed our first network (which happened to be the state-of-the-art among all methods) for both noisy and ground truth 2D pose, thereby proving the effectiveness of exploiting temporal information from multiple frames. Also, qualitatively, our sequence-to-sequence network predicts more temporally coherent poses under noisy 2D inputs, reducing the jitter that occurred when 3D poses were estimated on each frame separately.

Next we will discuss some of the future research directions not addressed by our work, followed by a summary of our work and contributions.

6.1 Future directions

One area not addressed by our systems, and by most recent work, is the absolute location of the person in the 3D world. To find this, the homography information and camera parameters must be known, which is possible if the 3D pose estimation system is deployed on smartphones or tablets, or if we know the camera used to capture the image. However, to find 3D pose from arbitrary images or videos, one approach can be to estimate the extrinsic parameters of the camera along with the 3D pose. Finding the absolute location of the root joint is critical for multi-person 3D pose estimation, which is an interesting research path to explore.

For our end-to-end network, we used 2D joint heatmaps as features for the 3D pose estimator. However, the latent features learned from the images by the 2D pose estimator can provide valuable and discriminative information. We did not train any end-to-end model from such deep features. One future research direction could be to take the intermediate features learned by the deep convolutional layers of either a 2D pose estimator like the stacked hourglass [80] or a network trained on ImageNet like ResNet-101 [44], perform a 1×1 convolution to reduce the number of channels, and estimate 3D joint locations from these deep features. Most state-of-the-art 2D pose estimators estimate the locations of joints by applying a 2D argmax operation over the heatmaps. However, since the argmax function is non-differentiable, we cannot put it in an end-to-end deep learning pipeline. A way around this could be to estimate an expected gradient for the argmax operation using multiple samples of the 2D heatmaps, similar to the policy gradients commonly used in reinforcement learning to train deep reinforcement learning networks.

Moreno-Noguer [76] showed that a distance matrix can be a good representation of the structure of the human body. One interesting direction could be to combine the distance matrix of the joints with the 2D joint locations as input.

Since deep networks depend on large amounts of data, we can simulate 2D detectors by projecting 3D motion capture data through multiple virtual cameras and adding some noise to augment the training data. One limitation of all our networks is that they cannot estimate 3D poses when a person is in an unusual orientation, e.g. upside down while diving or doing a front flip, because of the absence of such poses in 3D pose estimation datasets. Augmenting the training data with such poses could help alleviate this problem.
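As a rough illustration of the virtual-camera augmentation mentioned above, the sketch below projects 3D motion capture joints through a virtual pinhole camera and perturbs the result to imitate an imperfect 2D detector. The camera pose, focal length, principal point and noise level are placeholders, not values from our experiments.

```python
import numpy as np

def project_to_virtual_camera(joints_3d, R, t, f=1000.0, c=(500.0, 500.0), sigma=3.0):
    """joints_3d: (num_joints, 3) world coordinates; R: (3, 3) rotation; t: (3,) translation.

    Assumes all joints end up in front of the camera (positive depth)."""
    cam = joints_3d @ R.T + t                               # world -> camera coordinates
    uv = f * cam[:, :2] / cam[:, 2:3]                       # pinhole projection
    uv = uv + np.asarray(c)                                 # shift by the principal point
    uv = uv + np.random.normal(scale=sigma, size=uv.shape)  # simulated detector noise
    return uv
```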
For a sequence of images, it will be interesting to see whether the translation and change of orientation of the root joint over a sequence can be predicted by exploiting temporal information. It may also be interesting to see whether deep features from networks like ResNet-101 [44] or Mask R-CNN [45] can be combined with LSTM units to learn the temporal coherence between the 3D poses.

6.2 Conclusion

To summarize our work, we designed two simple, yet sophisticated and robust networks, both of which can be trained very quickly to estimate 3D poses from noisy 2D joint locations. We hypothesized that a majority of the error in 3D pose estimation comes from the error in the 2D pose detections, and that training a network end-to-end to predict 3D pose from images directly is more difficult and computationally expensive. Finally, we showed that temporal coherence information over a sequence can be exploited efficiently to improve the accuracy of 3D pose estimation and to produce estimates which are temporally smooth. Both of our networks also generalize well to arbitrary and noisy inputs, as evidenced by their performance on the MPII dataset and YouTube videos.

Bibliography

[1] A. Agarwal and B. Triggs. 3D human pose from silhouettes by relevance vector regression. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2004. → pages 2, 18, 19

[2] I. Akhter and M. J. Black. Pose-conditioned joint angle limits for 3d human pose reconstruction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1446–1455, 2015. → pages 2, 17, 18, 24, 44, 50, 64, 65, 77

[3] S. Amin, M. Andriluka, M. Rohrbach, and B. Schiele. Multi-view pictorial structures for 3d human pose estimation. In British Machine Vision Conference (BMVC), 2013. → pages 18, 27

[4] M. Andriluka, S. Roth, and B. Schiele. Monocular 3d pose estimation and tracking by detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 623–630. IEEE, 2010. → pages 2, 18, 26

[5] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. → pages viii, ix, xiii, xiv, 9, 47, 49, 51, 57, 65, 73, 75, 76, 79

[6] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016. → pages 13, 68, 70

[7] A. Baak, M. Müller, G. Bharaj, H.-P. Seidel, and C. Theobalt. A data-driven approach for real-time full body pose reconstruction from a depth camera. In Consumer Depth Cameras for Computer Vision, pages 71–98. Springer, 2013. → pages 18, 28

[8] C. Barron and I. A. Kakadiaris. Estimating anthropometry and pose from a single uncalibrated image. Computer Vision and Image Understanding (CVIU), 81(3):269–284, 2001. URL http://dx.doi.org/10.1006/cviu.2000.0888. → pages 16

[9] S. Behnke. Hierarchical neural networks for image interpretation, volume 2766. Springer Science & Business Media, 2003. → pages 33

[10] V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic. 3d pictorial structures for multiple human pose estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1669–1676, 2014. → pages 18, 28

[11] S. Belongie, J. Malik, and J. Puzicha. Shape context: A new descriptor for shape matching and object recognition. In Advances in neural information processing systems, pages 831–837, 2001. → pages 19

[12] S. Belongie, J. Malik, and J.
Puzicha. Shape matching and objectrecognition using shape contexts. IEEE transactions on pattern analysisand machine intelligence, 24(4):509–522, 2002. → pages 19[13] P. Biswas, T.-C. Liang, K.-C. Toh, Y. Ye, and T.-C. Wang. Semidefiniteprogramming approaches for sensor network localization with noisydistance measurements. IEEE transactions on automation science andengineering, 3(4):360–371, 2006. → pages 25[14] L. Bo and C. Sminchisescu. Twin Gaussian processes for structuredprediction. International Journal of Computer Vision (IJCV), 87(1-2),2010. → pages 18, 19, 53[15] L. F. Bo, C. Sminchisescu, A. Kanaujia, and D. N. Metaxas. Fastalgorithms for large scale conditional 3D prediction. In The IEEEConference on Computer Vision and Pattern Recognition (CVPR), pages1–8, 2008. URL http://dx.doi.org/10.1109/CVPR.2008.4587578. → pages2, 18, 19[16] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black.Keep it smpl: Automatic estimation of 3d human pose and shape from asingle image. In European Conference on Computer Vision (ECCV), pages561–578. Springer, 2016. → pages viii, ix, x, 16, 18, 24, 25, 44, 50, 65, 77[17] I. Bu¨lthoff, H. Bu¨lthoff, and P. Sinha. Top-down influences on stereoscopicdepth-perception. Nature Neuroscience, 1(3):254–257, 1998. → pages 1594[18] M. Burenius, J. Sullivan, and S. Carlsson. 3d pictorial structures formultiple view articulated pose estimation. In The IEEE Conference onComputer Vision and Pattern Recognition (CVPR), pages 3618–3625,2013. → pages 18, 28[19] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d poseestimation using part affinity fields. arXiv preprint arXiv:1611.08050,2016. → pages 2, 29[20] C. S. Catalin Ionescu, Fuxin Li. Latent structured models for human poseestimation. In IEEE International Conference on Computer Vision (ICCV),2011. → pages 9, 12, 16, 21[21] K. Chellapilla, S. Puri, and P. Simard. High performance convolutionalneural networks for document processing. In Tenth International Workshopon Frontiers in Handwriting Recognition. Suvisoft, 2006. → pages 33[22] C.-H. Chen and D. Ramanan. 3d human pose estimation= 2d poseestimation+ matching. arXiv preprint arXiv:1612.06524, 2016. → pages 2,18, 21[23] K. Cho, B. V. Merrie¨nboer, C. Gulcehre, D. B. F. Bougares, H. Schwenk,and T. Bengio. Learning phrase representations using RNNencoder-decoder for statistical machine translation. In Conference onEmpirical Methods in Natural Language Processing (EMNLP 2014), 2014.→ pages 40[24] D. C. Cires¸an, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep,big, simple neural nets for handwritten digit recognition. Neuralcomputation, 22(12):3207–3220, 2010. → pages 33[25] N. Dalal and B. Triggs. Histograms of oriented gradients for humandetection. In The IEEE Conference on Computer Vision and PatternRecognition (CVPR), volume 1, pages 886–893. IEEE, 2005. → pages 19[26] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood fromincomplete data via the em algorithm. Journal of the royal statisticalsociety. Series B (methodological), pages 1–38, 1977. → pages 25[27] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: Alarge-scale hierarchical image database. In The IEEE Conference onComputer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE,2009. → pages 33, 44, 6195[28] J. Deutscher, A. Blake, and I. Reid. Articulated body motion capture byannealed particle filtering. In The IEEE Conference on Computer Visionand Pattern Recognition (CVPR), volume 2, pages 126–133. 
IEEE, 2000.→ pages 19[29] Y. Du, Y. Wong, Y. Liu, F. Han, Y. Gui, Z. Wang, M. Kankanhalli, andW. Geng. Marker-less 3d human motion capture with monocular imagesequence and height-maps. In European Conference on Computer Vision(ECCV), pages 20–36. Springer, 2016. → pages 18, 26, 47, 49, 65, 76[30] D. Eigen and R. Fergus. Predicting depth, surface normals and semanticlabels with a common multi-scale convolutional architecture. In IEEEInternational Conference on Computer Vision (ICCV), 2015. → pages 15[31] A. Elhayek, E. de Aguiar, A. Jain, J. Tompson, L. Pishchulin,M. Andriluka, C. Bregler, B. Schiele, and C. Theobalt. Efficientconvnet-based marker-less motion capture in general scenes with a lownumber of cameras. In Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition, pages 3810–3818, 2015. → pages 18, 28[32] B. Farley and W. Clark. Simulation of self-organizing systems by digitalcomputer. Transactions of the IRE Professional Group on InformationTheory, 4(4):76–84, 1954. → pages 31[33] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for objectrecognition. International journal of computer vision (IJCV), 61(1):55–79,2005. → pages 21[34] M. A. Fischler and R. A. Elschlager. The representation and matching ofpictorial structures. IEEE Transactions on computers, 100(1):67–92, 1973.→ pages 20[35] F. A. Gers and J. Schmidhuber. Recurrent nets that time and count. InProceedings of the IEEE-INNS-ENNS International Joint Conference onNeural Networks (IJCNN), volume 3, pages 189–194, 2000. → pages 38,39[36] F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continualprediction with lstm. In Ninth International Conference on ArtificialNeural Networks (ICANN).(Conf. Publ. No. 470), volume 2, pages850–855. IET, 1999. → pages 38, 3996[37] M. F. Ghezelghieh, R. Kasturi, and S. Sarkar. Learning camera viewpointusing cnn to improve 3d body pose estimation. In 3D Vision (3DV), 2016Fourth International Conference on, pages 685–693. IEEE, 2016. → pages49, 76[38] X. Glorot and Y. Bengio. Understanding the difficulty of training deepfeedforward neural networks. In Proceedings of the ThirteenthInternational Conference on Artificial Intelligence and Statistics, pages249–256, 2010. → pages 74[39] C. Goodall. Procrustes methods in the statistical analysis of shape. Journalof the Royal Statistical Society. Series B (Methodological), pages 285–339,1991. → pages 50, 64[40] A. Graves and J. Schmidhuber. Framewise phenome classification withbidirectional lstm and other neural network architectures. Neural Networks,18(5-6):602–610, 2005. → pages 38[41] A. Gupta, J. Martinez, J. J. Little, and R. J. Woodham. 3D Pose fromMotion for Cross-view Action Recognition via Non-linear CirculantTemporal Encoding. In The IEEE Conference on Computer Vision andPattern Recognition (CVPR), 2014. → pages 2, 18, 20[42] A. Gupta, J. He, J. Martinez, J. J. Little, and R. J. Woodham. Efficientvideo-based retrieval of human motion with flexible alignment. In IEEEWinter Conference on Applications of Computer Vision (WACV), pages1–9. IEEE, 2016. → pages 20, 21[43] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers:Surpassing human-level performance on imagenet classification. In IEEEInternational Conference on Computer Vision (ICCV), pages 1026–1034,2015. → pages 48, 63[44] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for imagerecognition. In The IEEE Conference on Computer Vision and PatternRecognition (CVPR), pages 770–778, 2016. 
→ pages 2, 11, 22, 37, 42, 45,60, 61, 69, 81, 91, 92[45] K. He, G. Gkioxari, P. Dolla´r, and R. Girshick. Mask r-cnn. arXiv preprintarXiv:1703.06870, 2017. → pages 2, 30, 92[46] G. Hinton, N. Srivastava, and K. Swersky. Lecture 6.5-rmsprop: Divide thegradient by a running average of its recent magnitude., 2012. → pages 6197[47] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm fordeep belief nets. Neural computation, 18(7):1527–1554, 2006. → pages 33[48] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neuralcomputation, 9(8):1735–1780, 1997. → pages 13, 23, 39, 68[49] D. H. Hubel and T. N. Wiesel. Receptive fields and functional architectureof monkey striate cortex. The Journal of physiology, 195(1):215–243,1968. → pages 34[50] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep networktraining by reducing internal covariate shift. In International Conferenceon Machine Learning (ICML), 2015. → pages 11, 42, 43, 45, 70[51] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Largescale datasets and predictive methods for 3d human sensing in naturalenvironments. IEEE Transactions on Pattern Analysis and MachineIntelligence, 36(7):1325–1339, jul 2014. → pages viii, ix, x, xi, xii, xiv, 9,10, 12, 16, 17, 21, 48, 49, 50, 54, 63, 65, 75, 76, 77, 80, 83, 84, 85, 89[52] M. Isard. Pampas: Real-valued graphical models for computer vision. InThe IEEE Conference on Computer Vision and Pattern Recognition(CVPR), volume 1, pages I–I. IEEE, 2003. → pages 27[53] H. Jiang. 3d human pose reconstruction using millions of exemplars. InInternational Conference on Pattern Recognition (ICPR), pages1674–1677. IEEE, 2010. → pages 2, 18, 20[54] A. Kanaujia, C. Sminchisescu, and D. Metaxas. Semi-supervisedhierarchical models for 3d human pose reconstruction. In The IEEEConference on Computer Vision and Pattern Recognition (CVPR), pages1–8. IEEE, 2007. → pages 19[55] D. Kingma and J. Ba. Adam: A method for stochastic optimization. InInternational Conference on Learning Representations (ICLR), 2015. →pages 48, 63, 74[56] I. Kostrikov and J. Gall. Depth sweep regression forests for estimating 3dhuman pose from images. In British Machine Vision Conference (BMVC),2014. → pages 18, 20, 53[57] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification withdeep convolutional neural networks. In Advances in neural information98processing systems (NIPS), pages 1097–1105, 2012. → pages 2, 21, 22, 33,37, 44, 61[58] H. W. Kuhn. The hungarian method for the assignment problem. NavalResearch Logistics (NRL), 2(1-2):83–97, 1955. → pages 30[59] H. W. Kuhn. Variants of the hungarian method for assignment problems.Naval Research Logistics (NRL), 3(4):253–258, 1956. → pages 30[60] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learningapplied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. → pages 21, 36[61] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic objectrecognition with invariance to pose and lighting. In The IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR), volume 2, pagesII–104. IEEE, 2004. → pages 21[62] H. J. Lee and Z. Chen. Determination of 3D human body postures from asingle view. Computer Vision, Graphics and Image Processing, 30:148–168, 1985. → pages 18, 23[63] S. Li and A. B. Chan. 3d human pose estimation from monocular imageswith deep convolutional neural network. In Asian Conference on ComputerVision (ACCV), pages 332–347. Springer, 2014. → pages 2, 17, 18, 21, 44[64] S. 
Li, W. Zhang, and A. B. Chan. Maximum-margin structured learningwith deep networks for 3d human pose estimation. In IEEE InternationalConference of Computer Vision (ICCV), 2015. → pages 47, 48, 49, 76[65] M. Lin, L. Lin, X. Liang, K. Wang, and H. Chen. Recurrent 3d posesequence machines. In The IEEE Conference on Computer Vision andPattern Recognition (CVPR), 2017. → pages 2, 18, 23, 44, 49, 76[66] T. Lindeberg and J. Garding. Shape from texture from a multi-scaleperspective. In IEEE International Conference on Computer Vision (ICCV),1993. URL http://dx.doi.org/10.1109/ICCV.1993.378146. → pages 15[67] F. Liu, C. Shen, and G. Lin. Deep convolutional neural fields for depthestimation from a single image. In The IEEE Conference on ComputerVision and Pattern Recognition (CVPR), pages 5162–5170. IEEEComputer Society, 2015. ISBN 978-1-4673-6964-0. URLhttp://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=7293313.→ pages 1599[68] D. G. Lowe. Distinctive image features from scale-invariant keypoints.International journal of computer vision, 60(2):91–110, 2004. → pages 2,19[69] E. Marinoiu, D. Papava, and C. Sminchisescu. Pictorial human spaces:How well do humans perceive a 3d articulated pose? In The IEEEConference on Computer Vision and Pattern Recognition (CVPR), pages1289–1296, 2013. → pages 16[70] E. Marinoiu, D. Papava, and C. Sminchisescu. Pictorial human spaces: Acomputational study on the human perception of 3d articulated poses.International Journal of Computer Vision (IJCV), 119(2):194–215, 2016.→ pages 7[71] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effectivebaseline for 3d human pose estimation. In IEEE International Conferenceon Computer Vision (ICCV), 2017. → pages iv[72] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent innervous activity. The bulletin of mathematical biophysics, 5(4):115–133,1943. → pages 31[73] D. Mehta, H. Rhodin, D. Casas, O. Sotnychenko, W. Xu, and C. Theobalt.Monocular 3d human pose estimation using transfer learning and improvedcnn supervision. arXiv preprint arXiv:1611.09813, 2016. → pages 2, 17,18, 22, 44, 49, 50, 64, 76, 77[74] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel,W. Xu, D. Casas, and C. Theobalt. Vnect: Real-time 3d human poseestimation with a single rgb camera. arXiv preprint arXiv:1705.01583,2017. → pages 2, 17, 18, 26, 49, 76[75] M. Minsky and S. A. Papert. Perceptrons: An introduction tocomputational geometry. MIT press, 1969. → pages 32[76] F. Moreno-Noguer. 3d human pose estimation from a single image viadistance matrix regression. In The IEEE Conference on Computer Visionand Pattern Recognition (CVPR), 2017. → pages 18, 25, 48, 50, 52, 53, 54,57, 65, 77, 79, 80, 91[77] G. Mori and J. Malik. Recovering 3D human body configurations usingshape contexts. IEEE Transactions on Pattern Analysis and MachineIntelligence, 28(7):1052–1062, July 2006. URLhttp://dx.doi.org/10.1109/TPAMI.2006.149. → pages 2, 18, 19100[78] G. Mori and J. Malik. Recovering 3d human body configurations usingshape contexts. IEEE Transactions on Pattern Analysis and MachineIntelligence, 28(7):1052–1062, 2006. → pages 2, 18, 20[79] V. Nair and G. E. Hinton. Rectified linear units improve restrictedBoltzmann machines. In International Conference on Machine Learning(ICML), pages 807–814, 2010. → pages 11, 38, 42, 45[80] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for humanpose estimation. In European Conference on Computer Vision (ECCV),2016. 
→ pages viii, ix, x, xii, xiii, 2, 3, 4, 11, 12, 22, 30, 43, 47, 48, 49, 51,53, 54, 55, 57, 59, 60, 61, 65, 73, 76, 80, 91[81] B. X. Nie, P. Wei, and S.-C. Zhu. Monocular 3d human pose estimation bypredicting depth on joints. 2017. → pages 2, 18, 23, 44, 49, 50, 76, 77[82] H. Ning, W. Xu, Y. Gong, and T. Huang. Discriminative learning of visualwords for 3d human pose estimation. In The IEEE Conference onComputer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008.→ pages 18, 19[83] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactionson knowledge and data engineering, 22(10):1345–1359, 2010. → pages12, 62[84] V. Parameswaran and R. Chellappa. View independent human body poseestimation from a single perspective image. In The IEEE Conference onComputer Vision and Pattern Recognition (CVPR), 2004. URLhttp://doi.ieeecomputersociety.org/10.1109/CVPR.2004.264. → pages 16[85] S. Park, J. Hwang, and N. Kwak. 3d human pose estimation usingconvolutional neural networks with 2d pose information. In ComputerVision–ECCV 2016 Workshops, pages 156–169. Springer, 2016. → pages2, 18, 21, 44, 48, 49, 65, 76[86] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Harvestingmultiple views for marker-less 3d human pose annotations. In The IEEEConference on Computer Vision and Pattern Recognition (CVPR), 2017.→ pages 18, 28[87] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-finevolumetric prediction for single-image 3D human pose. In The IEEEConference on Computer Vision and Pattern Recognition (CVPR), 2017.→ pages 2, 17, 18, 22, 44, 47, 49, 50, 51, 53, 56, 59, 64, 65, 76, 77, 90101[88] A. Popa, M. Zanfir, and C. Sminchisescu. Deep Multitask Architecture forIntegrated 2D and 3D Human Sensing. In CVPR, 2017. → pages 15[89] L. R. Rabiner and B.-H. Juang. Fundamentals of speech recognition. 1993.→ pages 20[90] I. Radwan, A. Dhall, and R. Goecke. Monocular image 3d human poseestimation under self-occlusion. In IEEE International Conference onComputer Vision (ICCV), 2013. → pages 18, 25, 53[91] R. Raina, A. Madhavan, and A. Y. Ng. Large-scale deep unsupervisedlearning using graphics processors. In International Conference onMachine Learning (ICML), pages 873–880. ACM, 2009. → pages 33[92] V. Ramakrishna, T. Kanade, and Y. Sheikh. Reconstructing 3d human posefrom 2d image landmarks. Computer Vision–ECCV 2012, pages 573–586,2012. → pages 2, 17, 18, 24, 44, 50, 64, 65, 77[93] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-timeobject detection with region proposal networks. In Advances in neuralinformation processing systems, pages 91–99, 2015. → pages 2, 30[94] L. G. Roberts. Machine perception of three-dimensional solids. TR 315,Lincoln Lab, MIT, Lexington, MA, May 1963. → pages 15[95] N. Rochester, J. Holland, L. Haibt, and W. Duda. Tests on a cell assemblytheory of the action of the brain, using a large digital computer. IRETransactions on information Theory, 2(3):80–93, 1956. → pages 32[96] G. Rogez and C. Schmid. Mocap-guided data augmentation for 3D poseestimation in the wild. In NIPS, 2016. URL http://papers.nips.cc/book/advances-in-neural-information-processing-systems-29-2016. → pages 2,18, 22, 44[97] F. Rosenblatt. The perceptron: A probabilistic model for informationstorage and organization in the brain. Psychological review, 65(6):386,1958. → pages 32[98] D. Rumelhart, J. McClelland, and S. D. P. R. G. University of California.Parallel Distributed Processing: Foundations. A Bradford book. MITPress, 1986. ISBN 9780262680530. 
URLhttps://books.google.ca/books?id=eFPqqMBK-p8C. → pages 32102[99] A. Saxena, M. Sun, and A. Y. Ng. Learning 3-D scene structure from asingle still image. In IEEE International Conference on Computer Vision(ICCV), 2007. → pages 15[100] J. Schmidhuber. Learning complex, extended sequences using the principleof history compression. Neural Computation, 4(2):234–242, 1992. →pages 33[101] S. Semeniuta, A. Severyn, and E. Barth. Recurrent dropout withoutmemory loss. arXiv preprint arXiv:1603.05118, 2016. → pages 13, 68, 70[102] A. Shafaei and J. J. Little. Real-time human motion capture with multipledepth cameras. In Computer and Robot Vision (CRV), 2016 13thConference on, pages 24–31. IEEE, 2016. → pages 18, 29[103] G. Shakhnarovich, P. A. Viola, and T. J. Darrell. Fast pose estimation withparameter-sensitive hashing. In IEEE International Conference onComputer Vision (ICCV), 2003. URLhttp://dx.doi.org/10.1109/ICCV.2003.1238424. → pages 2, 18, 20[104] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake,M. Cook, and R. Moore. Real-time human pose recognition in parts fromsingle depth images. Communications of the ACM, 56(1):116–124, 2013.→ pages 18, 28[105] L. Sigal, A. O. Balan, and M. J. Black. Humaneva: Synchronized videoand motion capture dataset and baseline algorithm for evaluation ofarticulated human motion. International journal of computer vision (IJCV),87(1):4–27, 2010. → pages viii, 9, 49, 53[106] L. Sigal, M. Isard, H. Haussecker, and M. J. Black. Loose-limbed people:Estimating 3d human pose and motion using non-parametric beliefpropagation. International journal of computer vision (IJCV), 98(1):15–48,2012. → pages 18, 27[107] E. Simo-Serra, A. Quattoni, C. Torras, and F. Moreno-Noguer. A jointmodel for 2d and 3d pose estimation from a single image. In The IEEEConference on Computer Vision and Pattern Recognition (CVPR), 2013.→ pages 18, 19, 53[108] K. Simonyan and A. Zisserman. Very deep convolutional networks forlarge-scale image recognition. CoRR, abs/1409.1556, 2014. → pages 37103[109] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Discriminativedensity propagation for 3d human motion estimation. In The IEEEConference on Computer Vision and Pattern Recognition (CVPR),volume 1, pages 390–397. IEEE, 2005. → pages 19[110] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, andR. Salakhutdinov. Dropout: a simple way to prevent neural networks fromoverfitting. Journal of Machine Learning Research (JMLR), 15(1), 2014.→ pages 11, 42, 43, 46[111] D. Steinkraus, I. Buck, and P. Simard. Using gpus for machine learningalgorithms. In International Conference on Document Analysis andRecognition, pages 1115–1120. IEEE. → pages 33[112] X. Sun, J. Shang, S. Liang, and Y. Wei. Compositional human poseregression. arXiv preprint arXiv:1704.00159, 2017. → pages 2, 18, 22, 44[113] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning withneural networks. In Advances in neural information processing systems(NIPS), pages 3104–3112, 2014. → pages xii, 4, 12, 14, 41, 68, 69[114] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In TheIEEE Conference on Computer Vision and Pattern Recognition (CVPR),pages 1–9, 2015. → pages 2, 37[115] C. J. Taylor. Reconstruction of articulated objects from pointcorrespondences in a single uncalibrated image. In The IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR), volume 1, pages677–684. IEEE, 2000. → pages 19, 20[116] B. 
Tekin, I. Katircioglu, M. Salzmann, V. Lepetit, and P. Fua. Structuredprediction of 3d human pose with deep neural networks. In British MachineVision Conference (BMVC), 2016. → pages 2, 17, 18, 22, 44, 49, 76[117] B. Tekin, A. Rozantsev, V. Lepetit, and P. Fua. Direct prediction of 3d bodyposes from motion compensated sequences. In The IEEE Conference onComputer Vision and Pattern Recognition (CVPR), pages 991–1000, 2016.→ pages 2, 17, 18, 26, 47, 48, 49, 65, 76[118] B. Tekin, P. Marquez Neila, M. Salzmann, and P. Fua. Learning to fuse 2dand 3d image cues for monocular body pose estimation. In IEEEInternational Conference on Computer Vision (ICCV), numberEPFL-CONF-230311, 2017. → pages 2, 18, 22, 49, 50, 51, 76, 77, 89104[119] D. Tome, C. Russell, and L. Agapito. Lifting from the deep: Convolutional3d pose estimation from a single image. arXiv preprint arXiv:1701.00295,2017. → pages 22, 23, 44, 49, 64, 76[120] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, andC. Schmid. Learning from synthetic humans. In The IEEE Conference onComputer Vision and Pattern Recognition (CVPR), 2017. → pages 2, 18,22, 44[121] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol.Stacked denoising autoencoders: Learning useful representations in a deepnetwork with a local denoising criterion. Journal of Machine LearningResearch, 11(Dec):3371–3408, 2010. → pages 22[122] C. Wang, Y. Wang, Z. Lin, A. L. Yuille, and W. Gao. Robust estimation of3d human poses from a single image. In The IEEE Conference onComputer Vision and Pattern Recognition (CVPR), 2014. → pages 18, 24,53[123] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional posemachines. In The IEEE Conference on Computer Vision and PatternRecognition (CVPR), 2016. → pages 2, 3, 23, 29, 48, 53, 54[124] X. Wei, P. Zhang, and J. Chai. Accurate realtime full-body motion captureusing a single depth camera. ACM Transactions on Graphics (TOG), 31(6):188, 2012. → pages 18, 28[125] X. K. Wei and J. Chai. Modeling 3D human poses from uncalibratedmonocular images. In IEEE International Conference on Computer Vision(ICCV), 2009. URLhttp://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=5453389.→ pages 16[126] P. Werbos. Beyond Regression: New Tools for Prediction and Analysis inthe Behavioral Sciences. Harvard University, 1975. URLhttps://books.google.ca/books?id=z81XmgEACAAJ. → pages 32[127] Y. Yang and D. Ramanan. Articulated pose estimation with flexiblemixtures-of-parts. In The IEEE Conference on Computer Vision andPattern Recognition (CVPR), pages 1385–1392. IEEE, 2011. → pages 29[128] H. Yasin, U. Iqbal, B. Kruger, A. Weber, and J. Gall. A dual-sourceapproach for 3d pose estimation from a single image. In The IEEE105Conference on Computer Vision and Pattern Recognition (CVPR), pages4948–4956, 2016. → pages 2, 18, 21, 53[129] M. Ye and R. Yang. Real-time simultaneous pose and shape estimation forarticulated objects using a single depth camera. In The IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR), pages 2345–2352,2014. → pages 18, 28[130] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural networkregularization. arXiv preprint arXiv:1409.2329, 2014. → pages 41, 68, 70,71[131] R. Zhang, P.-S. Tsai, J. E. Cryer, and M. Shah. Shape from shading: Asurvey. IEEE Transactions on Pattern and Machine Intelligence (TPAMI),21(8):690–706, 1999. URL http://dx.doi.org/10.1109/34.784284;http://doi.ieeecomputersociety.org/10.1109/34.784284. → pages 15[132] X. Zhou, S. Leonardos, X. Hu, and K. 
Daniilidis. 3d shape estimation from2d landmarks: A convex relaxation approach. In The IEEE Conference onComputer Vision and Pattern Recognition (CVPR), pages 4447–4455,2015. → pages 17, 18, 24, 25, 44[133] X. Zhou, X. Sun, W. Zhang, S. Liang, and Y. Wei. Deep kinematic poseregression. In Computer Vision–ECCV 2016 Workshops, pages 186–201.Springer, 2016. → pages 2, 16, 18, 22, 44, 47, 49, 65, 76[134] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis.Sparseness meets deepness: 3d human pose estimation from monocularvideo. In The IEEE Conference on Computer Vision and PatternRecognition (CVPR), pages 4966–4975, 2016. → pages 2, 17, 18, 24, 25,44, 47, 49, 50, 65, 76, 77[135] A. Zisserman, I. D. Reid, and A. Criminisi. Single view metrology. InIEEE International Conference on Computer Vision (ICCV), 1999. URLhttp://dx.doi.org/10.1109/ICCV.1999.791253. → pages 15106
