Browsing by Author "CARDOSO, RUI DAVID FREITAS"
- Image Captioning under Extreme Occlusion Settings
  Publication. CARDOSO, RUI DAVID FREITAS; Viana, Paula Maria Marques Moura Gomes; Vilaça, Luís Miguel Salgado Nunes
  Image captioning is a research area in Artificial Intelligence (AI) that aims to generate coherent and contextually accurate textual descriptions of images. Practical applications include image retrieval, video summarization, and enhancing human–computer interaction in areas like robotics and virtual reality. Vision-Language Models (VLMs) are suited to this multimodal task and often rely on pretrained vision encoders such as Contrastive Language-Image Pre-training (CLIP). However, CLIP underperforms when faced with occluded objects, where crucial visual cues are missing. In this work, we investigate whether a lightweight unified multimodal decoder that does not use pretrained data can outperform CLIP-based baselines under the same settings. Given an input image, we learn a model that generates a textual caption with just a few selected patches of the image as context. The baseline experiment replaces CLIP's embeddings with flattened patches in the text sequence, and subsequent experiments iteratively extend this setup to probe different aspects of the methodology. Specifically, we ask: (i) does inserting patch embeddings both before and after the text sequence improve alignment between modalities? (ii) can replacing a single occluded CLIP embedding with multiple patch tokens under the same occlusion conditions enhance semantic recovery? (iii) do convolutionally preprocessed patches yield more informative visual representations? (iv) does adding two-dimensional positional encoding improve spatial awareness? (v) how sensitive is caption quality to the specific set of randomly sampled patches? (vi) can additional regularization to align patch embeddings further strengthen visual grounding? Most of our results show consistent gains over the baseline, narrowing the gap to using CLIP embeddings. Nonetheless, the unified decoder lags behind CLIP on standard captioning metrics (BLEU@4, METEOR, CIDEr, SPICE), suggesting either the need for substantially larger models and datasets, or that architectures with uni-modal encoders, e.g. image-specific encoders, remain better suited for robust captioning under extreme partial occlusion.
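
  The baseline idea in the abstract, flattened image patches standing in for CLIP embeddings as the visual prefix of a single decoder, can be illustrated with a minimal PyTorch sketch. This is not the publication's implementation: the class name PatchPrefixCaptioner, the 16x16 patch size, the number of sampled patches, and all layer sizes are illustrative assumptions.

  import torch
  import torch.nn as nn


  class PatchPrefixCaptioner(nn.Module):
      """Toy baseline: flattened image patches as the visual prefix of a text decoder."""

      def __init__(self, vocab_size=30522, d_model=512, patch_size=16,
                   n_patches=8, n_layers=4, n_heads=8):
          super().__init__()
          self.n_patches = n_patches
          self.patch_size = patch_size
          # Each flattened RGB patch (3 * 16 * 16 values) is projected into the token space.
          self.patch_proj = nn.Linear(3 * patch_size * patch_size, d_model)
          self.token_emb = nn.Embedding(vocab_size, d_model)
          layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
          # A causal transformer stack serves as the unified multimodal decoder.
          self.decoder = nn.TransformerEncoder(layer, n_layers)
          self.lm_head = nn.Linear(d_model, vocab_size)

      def sample_patches(self, images):
          # images: (B, 3, H, W) -> randomly select n_patches non-overlapping patches.
          B, C, _, _ = images.shape
          p = self.patch_size
          patches = images.unfold(2, p, p).unfold(3, p, p)        # (B, C, H/p, W/p, p, p)
          patches = patches.contiguous().view(B, C, -1, p, p)     # (B, C, N, p, p)
          idx = torch.randperm(patches.shape[2], device=images.device)[: self.n_patches]
          sampled = patches[:, :, idx]                            # (B, C, n_patches, p, p)
          return sampled.permute(0, 2, 1, 3, 4).reshape(B, self.n_patches, -1)

      def forward(self, images, caption_ids):
          # Visual prefix: projected flattened patches stand in for CLIP embeddings.
          vis = self.patch_proj(self.sample_patches(images))       # (B, P, d)
          txt = self.token_emb(caption_ids)                        # (B, T, d)
          seq = torch.cat([vis, txt], dim=1)                       # (B, P+T, d)
          mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1)).to(seq.device)
          hidden = self.decoder(seq, mask=mask)
          # Next-token logits are read off the text positions only.
          return self.lm_head(hidden[:, self.n_patches:])

  In training, caption_ids would carry the shifted ground-truth caption tokens and the returned logits would be scored with cross-entropy against the next-token targets; extreme partial occlusion is approximated here simply by keeping n_patches small relative to the full image.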
