Tools and Resources
Open Access
CC BY 4.0
Peer-reviewed
Cite as: eLife 2019;8:e47994 · doi.org/10.7554/eLife.47994

DeepPoseKit, a software toolkit for fast and robust animal pose estimation using deep learning

Jacob M Graving1,2 *, Daniel Chae3, Hemal Naik1,4, Liang Li1,2, Benjamin Koger1, Blair R Costelloe1, Iain D Couzin1,2
  1. Max Planck Institute of Animal Behavior, Konstanz, Germany
  2. Department of Biology, University of Konstanz, Konstanz, Germany
  3. Princeton University, Princeton, United States
  4. Technische Universität München, Munich, Germany

* For correspondence: jgraving@gmail.com


Abstract

Quantitative behavioral measurements are important for answering questions across scientific disciplines—from neuroscience to ecology. State-of-the-art deep-learning methods offer major advances in data quality and detail by allowing researchers to automatically estimate locations of an animal’s body parts directly from images or videos. However, currently available animal pose-estimation methods have limitations in speed and robustness. Here, we introduce a new easy-to-use software toolkit, DeepPoseKit, that addresses these problems using an efficient multi-scale deep-learning model, called Stacked DenseNet, and a fast GPU-based peak-detection algorithm for estimating keypoint locations with subpixel precision. These advances improve processing speed >2× with no loss in accuracy compared to currently available methods. We demonstrate the versatility of our methods with multiple challenging animal pose-estimation tasks in laboratory and field settings — including groups of interacting individuals. Our work reduces barriers to using advanced tools for measuring behavior and has broad applicability across the behavioral sciences.

eLife digest

Researchers studying behavior increasingly rely on video recordings of animals — single individuals or groups of them — to measure how the body moves. Reliably tracking the positions of limbs, head, and other body parts in video is hard, however: animals vary in shape, lighting changes, individuals become partially occluded, and species-specific tools are scarce.

Graving and colleagues built DeepPoseKit, a software package that makes it substantially easier and faster for researchers to track an animal's body parts in video. By combining a new neural-network design with a fast GPU-based algorithm for locating keypoints, the toolkit processes video more than twice as fast as previous tools without losing accuracy. The result puts state-of-the-art pose estimation within reach of biologists who are not experts in deep learning.

Introduction

Understanding the behavior of an animal — what it does, when, and why — is a central goal across behavioral disciplines, from neuroscience and developmental biology to ecology and conservation. High-resolution video has made it routine to collect long behavioral recordings, but extracting reliable kinematic measurements from raw video remains a bottleneck. Recent deep-learning approaches have shown that user-defined keypoints can be tracked through video without physical markers, opening up experimental designs that span from tethered fruit flies in the laboratory to free-ranging zebras in the field.

Existing tools — most prominently LEAP (Pereira et al., 2019) and DeepLabCut (Mathis et al., 2018) — have proven powerful but trade off speed against robustness. Here we present DeepPoseKit, a Python toolkit that addresses both limitations through an efficient multi-scale model architecture, a GPU-based subpixel peak detector, and a compact, easy-to-use API.

Results

DeepPoseKit introduces Stacked DenseNet, a multi-scale architecture that combines DenseNet-style feature reuse (Huang et al., 2017) with the stacked, multi-resolution supervision of stacked hourglass networks (Newell et al., 2016). We pair this backbone with a subpixel peak detector that retrieves keypoint locations from confidence maps in a single pass on the GPU, avoiding the costly CPU post-processing that dominated runtime in earlier tools.
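To make the idea concrete, the sketch below illustrates subpixel peak detection on a stack of confidence maps: an integer argmax followed by a local center-of-mass refinement. This is a minimal NumPy illustration of the principle only; DeepPoseKit's actual detector runs on the GPU inside the TensorFlow graph.

    import numpy as np

    def subpixel_peaks(confmaps):
        # confmaps: (n_keypoints, height, width) array of confidence maps.
        # Returns an (n_keypoints, 2) array of (x, y) coordinates.
        n, h, w = confmaps.shape
        coords = np.zeros((n, 2))
        for k in range(n):
            cm = confmaps[k]
            y, x = np.unravel_index(np.argmax(cm), cm.shape)
            # Refine the integer peak with a center of mass over a
            # 3x3 window, giving subpixel precision.
            y0, y1 = max(y - 1, 0), min(y + 2, h)
            x0, x1 = max(x - 1, 0), min(x + 2, w)
            window = cm[y0:y1, x0:x1]
            ys, xs = np.mgrid[y0:y1, x0:x1]
            total = window.sum()
            if total > 0:
                coords[k] = [(xs * window).sum() / total,
                             (ys * window).sum() / total]
            else:
                coords[k] = [float(x), float(y)]
        return coords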

Across four evaluation datasets — covering tethered Drosophila, freely moving flies, locusts, and overhead recordings of grouping zebras — Stacked DenseNet matches or exceeds the keypoint accuracy of LEAP and DeepLabCut while running 2× to 5× faster on the same GPU. Figure 3 summarizes the speed/accuracy trade-off; Figure 4 details the datasets.

Figures

Figure 1
Figure 1. An illustration of the workflow for DeepPoseKit.

Users iteratively annotate keypoints on raw frames, train a Stacked DenseNet model on the resulting dataset, and apply the trained network to unseen videos. The interactive annotation tool, training loop and prediction utilities are all part of the open-source toolkit.
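As a rough sketch of how these stages map onto code (class names follow the toolkit's documented pattern, but argument names and file formats here are illustrative; see docs.deepposekit.org for the authoritative API):

    from deepposekit.io import DataGenerator, TrainingGenerator
    from deepposekit.models import StackedDenseNet

    # 1. Load an annotation set produced with the interactive annotation tool.
    data_generator = DataGenerator("annotations.h5")

    # 2. Wrap it in a training generator, which handles augmentation and
    #    confidence-map generation, and train a Stacked DenseNet model.
    train_generator = TrainingGenerator(data_generator)
    model = StackedDenseNet(train_generator)
    model.fit(batch_size=16, epochs=200)

    # 3. Apply the trained model to unseen frames to get keypoint predictions.
    predictions = model.predict(new_frames)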

Figure 2
Figure 2. An illustration of the model training process.

Annotated images and ground-truth keypoint locations are converted into Gaussian confidence maps and used to train the Stacked DenseNet backbone. Multi-scale supervision is applied at the output of each stack.
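For intuition, a confidence map for a keypoint at (x, y) is simply a 2D Gaussian centered on the annotated location; a minimal NumPy sketch (the sigma value and map size here are illustrative, not the toolkit's defaults):

    import numpy as np

    def confidence_map(x, y, height, width, sigma=5.0):
        # Render a 2D Gaussian peaked at the annotated keypoint (x, y);
        # sigma controls how sharply the target peaks around the annotation.
        ys, xs = np.mgrid[0:height, 0:width]
        return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))

    # One map per keypoint; a full training target stacks these along an axis.
    keypoints = [(96, 50), (40, 120)]
    target = np.stack([confidence_map(kx, ky, 192, 192) for kx, ky in keypoints])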

Figure 3
Figure 3. DeepPoseKit is fast, accurate, and easy to use.

Across four evaluation datasets, Stacked DenseNet matches or exceeds the accuracy of LEAP and DeepLabCut while running 2× to 5× faster on the same GPU.

Figure 4
Figure 4. Datasets used for evaluation.

Datasets span tethered Drosophila, freely moving flies, locusts, and overhead recordings of zebras and other grouping species in the wild.

Discussion

An efficient backbone, a fast peak detector, and a small Python API together let researchers move from raw video to keypoint trajectories without first becoming experts in deep-learning infrastructure. We expect this to broaden the use of pose estimation across the behavioral sciences and to make it practical to study behavior in groups and natural environments at high temporal resolution.

DeepPoseKit complements existing tools rather than replacing them: alternative backbones such as Stacked Hourglass remain available within the library, and trained models can be exported and consumed by downstream tracking software such as idtracker.ai.

Materials and methods

The toolkit is implemented in Python on top of TensorFlow and Keras, with the GPU peak detector written as a custom TensorFlow op. All datasets, code, and trained models used in this study are publicly available under permissive licenses. Source code is hosted at github.com/jgraving/deepposekit, and full documentation, including step-by-step tutorials, is available at docs.deepposekit.org.
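As a sketch of how a trained network is reused downstream (the load_model helper and the shape of the output follow the toolkit's documented pattern; treat the details as illustrative):

    from deepposekit.models import load_model

    model = load_model("best_model.h5")        # a trained Stacked DenseNet
    predictions = model.predict(video_frames)  # per frame and keypoint: x, y,
                                               # and a confidence score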

References

  1. Mathis A, Mamidanna P, Cury KM, Abe T, Murthy VN, Mathis MW, Bethge M (2018). DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nature Neuroscience 21:1281–1289.
  2. Pereira TD, Aldarondo DE, Willmore L, Kislin M, Wang SS-H, Murthy M, Shaevitz JW (2019). Fast animal pose estimation using deep neural networks. Nature Methods 16:117–125.
  3. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017). Densely connected convolutional networks. CVPR.
  4. Newell A, Yang K, Deng J (2016). Stacked hourglass networks for human pose estimation. ECCV.
  5. Insafutdinov E, Pishchulin L, Andres B, Andriluka M, Schiele B (2016). DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. ECCV.
  6. Cao Z, Simon T, Wei S-E, Sheikh Y (2017). Realtime multi-person 2D pose estimation using part affinity fields. CVPR.
  7. He K, Zhang X, Ren S, Sun J (2016). Deep residual learning for image recognition. CVPR.
  8. Romero-Ferrero F, Bergomi MG, Hinz RC, Heras FJH, de Polavieja GG (2019). idtracker.ai: tracking all individuals in small or large collectives of unmarked animals. Nature Methods 16:179–182.
  9. See the canonical eLife article for the complete reference list (115 entries).

Article and author information

Received
October 1, 2019
Accepted & Published
December 6, 2019
License
CC BY 4.0

Funded by the Max Planck Society, the Deutsche Forschungsgemeinschaft, the Office of Naval Research and the Struktur- und Innovationsfonds für die Forschung of the State of Baden-Württemberg. See the canonical article for the full acknowledgements and ethics statements.