Off-the-shelf Multiple Person Multiple View 3D Pose Estimation.
If this repository is useful to you, please cite:
@inproceedings{tanke2019iterative,
  title={Iterative Greedy Matching for 3D Human Pose Tracking from Multiple Views},
  author={Tanke, Julian and Gall, Juergen},
  booktitle={German Conference on Pattern Recognition},
  year={2019}
}
In this work we propose an approach for estimating 3D human poses of multiple people from a set of calibrated cameras. Estimating 3D human poses from multiple views has several compelling properties: human poses are estimated within a global coordinate space, and multiple cameras provide an extended field of view which helps to resolve ambiguities, occlusions, and motion blur. Our approach builds upon a real-time 2D multi-person pose estimation system and greedily solves the association problem between multiple views. We utilize bipartite matching to track multiple people over multiple frames. This proves to be especially efficient, as problems associated with greedy matching, such as occlusion, can be easily resolved in 3D. Our approach achieves state-of-the-art results on popular benchmarks and may serve as a baseline for future work.
This project requires nvidia-docker and drivers that support CUDA 10.
Clone this repository with its submodules as follows:
git clone --recursive https://github.com/jutanke/mv3dpose.git
Your dataset must reside in a pre-defined folder structure:
- dataset
  - dataset.json
  - cameras
    - camera00
      - frame00xxxxxxx.json
    - camera01
      - frame00xxxxxxx.json
    - ...
    - camera_n
      - frame00xxxxxxx.json
  - videos
    - camera00
      - frame00xxxxxxx.png
    - camera01
      - frame00xxxxxxx.png
    - ...
    - camera_n
      - frame00xxxxxxx.png
The per-frame file names follow the schema "frame%09d.{png/json}", i.e. "frame" followed by a zero-padded nine-digit frame number.
The camera JSON files follow one of two structures: either a simple camera with only the projection matrix plus width and height:
{
    "P" : [ 3 x 4 ],  /* projection matrix */
    "w" : int(width),
    "h" : int(height)
}
or a more complex camera with distortion coefficients, based on the OpenCV camera model:
{
    "K" : [ 3 x 3 ],        /* intrinsic parameters */
    "rvec": [ 1 x 3 ],      /* rotation vector */
    "tvec": [ 1 x 3 ],      /* translation vector */
    "distCoef": [ 1 x 5 ],  /* distortion coefficients */
    "w" : int(width),
    "h" : int(height)
}
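To sanity-check a calibration, a 3D point in the global coordinate space should project to plausible pixel coordinates. Below is a minimal sketch using OpenCV's projectPoints with the camera file above; the file path and the test point are assumptions you should adapt:

```python
import json
import numpy as np
import cv2

# load one OpenCV-style camera file (path is an assumption)
with open("dataset/cameras/camera00/frame000000000.json") as f:
    cam = json.load(f)

K = np.array(cam["K"], dtype=np.float64)            # 3x3 intrinsic matrix
rvec = np.array(cam["rvec"], dtype=np.float64)      # rotation vector (Rodrigues)
tvec = np.array(cam["tvec"], dtype=np.float64)      # translation vector
dist = np.array(cam["distCoef"], dtype=np.float64)  # distortion coefficients

# project a single 3D point, given in the unit of the calibration
point3d = np.array([[0.0, 0.0, 3000.0]])            # hypothetical test point
pixels, _ = cv2.projectPoints(point3d, rvec, tvec, K, dist)
print(pixels.reshape(-1, 2))                        # (u, v) pixel coordinates
```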
The system expects one camera file per view at each point in time. If your dataset uses fixed cameras, simply repeat the calibration for every frame, e.g. with the sketch below.
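A minimal sketch for that replication step, assuming a single fixed calibration file per camera (all paths and the frame count are placeholders):

```python
import json
import os

def replicate_camera(calib_file, out_dir, n_frames):
    """Write one fixed calibration to disk once per frame."""
    with open(calib_file) as f:
        calib = json.load(f)
    os.makedirs(out_dir, exist_ok=True)
    for t in range(n_frames):
        with open(os.path.join(out_dir, "frame%09d.json" % t), "w") as f:
            json.dump(calib, f)

# hypothetical input file, output folder, and frame count
replicate_camera("calib/camera00.json", "dataset/cameras/camera00", n_frames=1000)
```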
The dataset.json file contains general information for the model:
{
    "n_cameras": int(#cameras),  /* number of cameras */
    "scale_to_mm": 1             /* scales the calibration to mm */
}
The parameter scale_to_mm is needed because the system operates in millimeters while the calibration might use other units. For example, if the calibration is in meters, scale_to_mm must be set to 1000.
The following optional parameters can also be set in dataset.json (see the example below):
- valid_frames: list the frame numbers explicitly if they do not start at 0 and/or are not continuous
- epi_threshold: epipolar line distance threshold in pixels
- max_distance_between_tracks: maximum distance in [mm] between tracks for them to be associated
- min_track_length: drop any track that is shorter than min_track_length frames
- last_seen_delay: number of frames a lost track may skip and still be re-connected
- smoothing_sigma: sigma value for the Gaussian smoothing of tracks
- smoothing_interpolation_range: how far interpolated fill-ins may reach
- do_smoothing: whether smoothing is applied at all (default: True)
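Putting it together, a complete dataset.json could look as follows (all values are illustrative, not recommended defaults):

```json
{
    "n_cameras": 4,
    "scale_to_mm": 1000,
    "valid_frames": [100, 101, 102, 103, 104],
    "epi_threshold": 40,
    "max_distance_between_tracks": 200,
    "min_track_length": 10,
    "last_seen_delay": 5,
    "smoothing_sigma": 2,
    "smoothing_interpolation_range": 50,
    "do_smoothing": true
}
```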
Run the pipeline by passing your dataset folder to the launch script:
./mvpose.sh /path/to/your/dataset
The resulting tracks are written to the tracks3d folder inside your dataset directory; each track represents a single person. The files are organised as follows:
{
    "J": int(#joints),             /* number of joints */
    "frames": [ int, ... ],        /* ordered list of frames in which this track is present */
    "poses": [ n_frames x J x 3 ]  /* 3D poses; each joint is a 3D location or null if missing */
}
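To consume the results, here is a short sketch that loads every track and counts the visible joints per frame (the *.json file pattern inside tracks3d is an assumption):

```python
import json
from glob import glob

for path in sorted(glob("dataset/tracks3d/*.json")):
    with open(path) as f:
        track = json.load(f)
    for frame, pose in zip(track["frames"], track["poses"]):
        # pose is a J x 3 list; missing joints are stored as null (None in Python)
        visible = sum(1 for joint in pose if joint is not None)
        print("%s, frame %d: %d/%d joints" % (path, frame, visible, track["J"]))
```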