3D Pose and Shape Estimation from a Camera System

Figueiredo, Lino Manuel Baptista

http://hdl.handle.net/10400.22/23871

Use this identifier to reference this record.

Name:	Description:	Size:	Format:
DM_AndreOliveira_2023_MEEC.pdf		65.63 MB	Adobe PDF

Abstract(s)

This work presents a solution for estimating the 3D joint positions of multiple people in in-the-wild scenes, together with their body shape and global trajectory, from a single RGB video recorded with a static or moving camera. In contrast to complex multi-view systems, the solution prioritizes simplicity and adaptability across applications. Given the difficulty of this setting, the system is built from several frameworks, each optimized for its own task. The author extends the pipeline of a conventional pose and shape estimator with robust human tracking and with inference based on temporal coherence, which can handle complete occlusions over long time intervals. The people in the scene are detected and consistently identified throughout the video by a Multiple Person Tracking (MPT) module (i.e., Deep OC-SORT with YOLOv8x and a re-identification (Re-ID) model). This information is fed into the Human Pose and Shape (HPS) estimator (i.e., HybrIK with an HRNet-W48 backbone), which combines a volumetric representation of the joints with the feature-extraction ability of deep convolutional neural networks (DCNNs) to generate a sequence that defines each person's body motion in the camera coordinate system (i.e., root translations, root rotations, body pose, and shape parameters). In addition, the locally defined body motion is completed through an iterative process by a generative motion optimizer, which is built on a Transformer-based architecture and exploits the temporal relationships in the visible detections.
For the set of parameters describing each person's body motion, the corresponding global trajectories are obtained and related to one another through a process based on local positional variation (position in the plane and orientation) and an iterative optimization of the camera parameters that is consistent with the video evidence, e.g., 2D keypoints. Results on the 3DPW dataset show that the proposed approach outperforms previous methods in motion reconstruction, with a PA-MPJPE of 68.2 mm under occlusion and 46.4 mm for visible poses.
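
The reported metric, PA-MPJPE, is the mean per-joint position error measured after the predicted joints have been aligned to the ground truth with a similarity transform (Procrustes alignment). The following is a minimal sketch of that computation for one frame, assuming NumPy and joint arrays of shape (J, 3) in millimetres. The function name `pa_mpjpe` and its arguments are illustrative and are not taken from the thesis or its code.

```python
import numpy as np


def pa_mpjpe(pred, gt):
    """Procrustes-aligned mean per-joint position error for one frame.

    pred, gt: arrays of shape (J, 3) holding 3D joint positions in the
    same unit (e.g. millimetres). The function name and signature are
    illustrative assumptions, not the thesis implementation.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)

    # Remove translation by centering both joint sets on their means.
    mu_pred = pred.mean(axis=0)
    mu_gt = gt.mean(axis=0)
    p = pred - mu_pred
    g = gt - mu_gt

    # Optimal rotation from the SVD of the cross-covariance (Kabsch).
    H = p.T @ g
    U, S, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T

    # Optimal isotropic scale for the rotated prediction.
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()

    # Apply the similarity transform and average the per-joint distances.
    aligned = scale * p @ R.T + mu_gt
    return np.linalg.norm(aligned - gt, axis=1).mean()
```

Because the alignment removes global rotation, translation, and scale, the metric reflects only the error in the articulated pose. A dataset-level score such as those quoted above would typically be the average of this value over all evaluated frames.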

Keywords

3D human pose and shape estimation; Camera parameters optimization; Deep learning; Global pose estimation; Multi-person motion reconstruction; Multi-person tracking; Occlusion-aware pose estimation; Transformer-based