October 12, 2023
Have you ever wondered how movies and video games create realistic animations of human characters? How do they capture the subtle movements of the face, hands, and body? Well, one of the techniques they use is called expressive human pose and shape estimation (EHPS), which is a fancy way of saying that they use artificial intelligence to estimate the 3D shape and pose of a person from a single photo or video. Sounds cool, right?
But there is a catch. EHPS is not an easy task, and it requires a lot of data to train the AI models. Different scenarios, such as indoor or outdoor, sunny or cloudy, standing or sitting, can affect how well the models perform. And not all data sets are created equal. Some may have more variety, more quality, or more annotations than others. So how can we find the best data sets to train the most general and robust models for EHPS?
That is the question that a team of researchers from Nanyang Technological University, SenseTime Research, Shanghai AI Laboratory, The University of Tokyo and the International Digital Economy Academy (IDEA) tried to answer in their latest paper. They conducted a comprehensive benchmark of 32 data sets for EHPS, covering different aspects such as capture environment, pose distribution, body visibility, and camera viewpoints. They evaluated four state-of-the-art models on these data sets and found that there are significant gaps and inconsistencies between them. They also discovered some interesting insights, such as:
Data sets do not need to be enormous to be useful: around 100K instances is already enough, and size beyond that brings diminishing returns.
If outdoor data collection is not feasible, diverse indoor scenes are a good alternative.
Synthetic data sets are becoming surprisingly effective, despite having noticeable domain gaps.
If SMPL-X annotations are not available, pseudo-SMPL-X labels are helpful.
SMPL-X is a parametric human model that represents the body, hands, and face in 3D using a compact set of pose, shape, and expression parameters. Based on these findings, the team proposed SMPLer-X, a generalist foundation model for EHPS that is trained on a broad mix of data sets and achieves remarkably balanced results across scenarios. SMPLer-X handles challenging poses, occlusions, and clothing variations, and even supports 4D (time-varying) human motion capture from monocular video.
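To make "parametric" concrete: SMPL-X describes a whole person with a few hundred numbers, grouped into joint rotations, shape coefficients, and facial expression coefficients. The sketch below is illustrative only (it is not the official smplx library API); the joint counts follow the publicly documented SMPL-X layout of 55 joints, 10 shape betas, and 10 expression coefficients.

```python
# Illustrative sketch of the SMPL-X parameter groups.
# NOT the official smplx library API; counts follow the published model layout.
from dataclasses import dataclass

@dataclass
class SMPLXParams:
    # Each joint is an axis-angle rotation: 3 values per joint.
    global_orient: int = 1   # root (pelvis) orientation
    body_joints: int = 21    # torso and limbs
    jaw_joints: int = 1
    eye_joints: int = 2
    hand_joints: int = 30    # 15 articulated joints per hand
    shape_betas: int = 10    # body shape coefficients
    expression: int = 10     # facial expression coefficients

    def num_pose_params(self) -> int:
        joints = (self.global_orient + self.body_joints +
                  self.jaw_joints + self.eye_joints + self.hand_joints)
        return joints * 3    # 55 joints x 3 = 165 rotation values

    def total_params(self) -> int:
        return self.num_pose_params() + self.shape_betas + self.expression

p = SMPLXParams()
print(p.num_pose_params())  # 165
print(p.total_params())     # 185
```

An EHPS model like SMPLer-X regresses exactly these numbers from a single image, which is why a shared parametric model makes results comparable across very different data sets.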
This work demonstrates the power of scaling up and carefully selecting training data for EHPS, and provides useful guidance for future data collection and model development. It also opens up new possibilities for applications in animation, gaming, fashion, and beyond. Imagine being able to create your own 3D avatar from a single selfie or video clip. How cool would that be?
Project page with video examples: https://caizhongang.github.io/projects/SMPLer-X/