Browse by author "DATSENKO, DARYNA"
- Deep learning for monocular visual odometry: From sequential pose regression to self-attention learning. Publication. DATSENKO, DARYNA; Dias, André Miguel Pinheiro. Monocular visual odometry (VO) estimates the position and orientation of a moving system using images from a single camera. It is widely used in robotics, autonomous driving, and UAVs. Compared to stereo or LiDAR systems, monocular VO avoids extra hardware, but it faces challenges such as scale ambiguity, sensitivity to lighting changes, and poor generalization to new environments. Deep learning has recently become a promising approach, as it allows networks to learn motion and geometry directly from images. This thesis studies deep learning methods for monocular VO. First, a simple CNN–LSTM baseline inspired by DeepVO is evaluated. This model works well on KITTI (Absolute Trajectory Error (ATE): 37.14 m; scale recovery: 0.998) and trains relatively fast, but it fails to converge on more dynamic or indoor datasets such as TartanAir and EuRoC MAV, showing the limitations of learning pose from images alone. To improve performance, the model is gradually extended with self-attention and an auxiliary depth prediction branch, forming a multi-task framework that jointly learns pose and depth. This adds geometric constraints that reduce scale drift and improve trajectory consistency. The training strategy combines synthetic pretraining on TartanAir, using perfect depth supervision, with fine-tuning on EuRoC MAV using pseudo-depth maps. Experiments show significant improvements: on EuRoC V102, the multi-task model achieves an ATE of 0.825 m over a 42.53 m estimated path, closely matching the ground-truth path length (40.12 m), with a scale recovery of 1.059. These results outperform classical methods such as ORB-SLAM3 and approach those of state-of-the-art learning-based methods.
The two main contributions of this work are: first, proposing and testing a framework that gradually moves from simple CNN–LSTM pose regression to a multi-task model with depth and self-attention; second, analyzing the benefits and limitations of this approach. The results show that depth supervision, even if not perfect, stabilizes motion estimation and improves consistency, pointing to promising directions for learning-based pose estimation in complex environments.
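The abstract reports two trajectory metrics, ATE and scale recovery, without defining them. The following is a minimal sketch under stated assumptions, not the thesis's evaluation code: ATE is taken as the root-mean-square position error between time-synchronized, already aligned trajectories, and scale recovery is taken as the ratio of estimated to ground-truth path length. Both definitions, and the helper names `path_length`, `ate_rmse`, and `scale_ratio`, are illustrative assumptions; the thesis may use a different alignment or scale definition.

```python
import math

def path_length(traj):
    """Sum of Euclidean distances between consecutive 3D positions."""
    return sum(math.dist(a, b) for a, b in zip(traj, traj[1:]))

def ate_rmse(est, gt):
    """RMSE of per-frame position error.

    Assumes both trajectories are time-synchronized and already expressed
    in the same frame (no alignment step is performed here).
    """
    if len(est) != len(gt) or not est:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [math.dist(e, g) ** 2 for e, g in zip(est, gt)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def scale_ratio(est, gt):
    """Estimated path length divided by ground-truth path length (assumed definition)."""
    return path_length(est) / path_length(gt)

# Toy data (hypothetical): a straight-line ground truth and an estimate
# that overshoots every step by 10 percent.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
est = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (2.2, 0.0, 0.0), (3.3, 0.0, 0.0)]

print("ATE (RMSE):", ate_rmse(est, gt))
print("scale ratio:", scale_ratio(est, gt))
```

Under this path-length definition, the EuRoC V102 figures quoted in the abstract give 42.53 / 40.12 ≈ 1.060, which is close to the reported scale recovery of 1.059 but not identical. The thesis may therefore compute the scale from an alignment step or from per-frame translation norms.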
