| Name: | Description: | Size: | Format: |
|---|---|---|---|
|  |  | 1.38 MB | Adobe PDF |
Authors
Advisor(s)
Abstract(s)
Time series forecasting models that rely solely on numerical data often overlook exogenous factors and structural patterns that can be more effectively captured through visual representations. This thesis proposes a multimodal framework that integrates numerical time series sequences and visual representations (plots/images) to improve forecasting accuracy, robustness, and interpretability. Methodologically, the approach employs an FT-Transformer for the temporal component and a TIMM-based convolutional network for the visual component, combined through a hybrid mid-level fusion strategy. The pipeline includes normalization and standardization of plots, consistent generation of temporal windows, Bayesian hyperparameter optimization with Optuna, and reproducible evaluation protocols.

The framework is evaluated on the hourly subset of the M4 dataset across multiple forecasting horizons (1–48 steps), reporting both aggregated and stratified Normalized Root Mean Squared Error (NRMSE) metrics (1–12, 13–24, 25–36, 37–48). Results show that the multimodal model consistently outperforms the unimodal variants (numerical-only and visual-only), achieving NRMSE improvements of up to 7.0% over the best baseline, while ablation studies highlight the specific contributions of the visual branch and the fusion mechanism.

This research contributes: (i) an efficient and reproducible multimodal forecasting framework; (ii) a transparent experimental protocol for numerical–visual fusion; and (iii) practical guidelines on plot normalization, window generation, and model tuning. Limitations, such as sensitivity to plot style and to temporal synchronization, are discussed, along with future directions that include the integration of contextual text and intervention-aware forecasting, aiming at more adaptive, real-world prediction systems.
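As a concrete illustration of the "consistent generation of temporal windows" mentioned in the pipeline, a minimal sliding-window sketch is shown below. The 96-step input length is an assumption chosen for illustration; only the 48-step horizon follows the M4 hourly setting described above.

```python
# Sketch of sliding-window generation for an hourly series (illustrative;
# the 96-step window length is an assumption, the 48-step horizon follows
# the M4 hourly setting described in the abstract).
import numpy as np


def make_windows(series, window_len=96, horizon=48, stride=1):
    """Split a 1-D series into (input window, target horizon) pairs."""
    X, Y = [], []
    for start in range(0, len(series) - window_len - horizon + 1, stride):
        X.append(series[start:start + window_len])
        Y.append(series[start + window_len:start + window_len + horizon])
    return np.asarray(X), np.asarray(Y)


series = np.sin(np.arange(1000) * 2 * np.pi / 24)  # toy hourly signal
X, Y = make_windows(series)
print(X.shape, Y.shape)  # (857, 96) (857, 48)
```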
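The hybrid mid-level fusion of the numeric and visual branches can be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis implementation: it assumes PyTorch and the `timm` package, uses a plain `nn.TransformerEncoder` as a stand-in for the FT-Transformer, and fuses the two intermediate embeddings by simple concatenation before a small forecasting head.

```python
# Minimal sketch of numeric-visual mid-level fusion (illustrative only;
# the backbone, dimensions and fusion head are assumptions, not the
# thesis implementation).
import torch
import torch.nn as nn
import timm  # pretrained CNN backbones for the visual branch


class MidFusionForecaster(nn.Module):
    def __init__(self, horizon=48, d_model=64, backbone="resnet18"):
        super().__init__()
        # Numeric branch: embed each scalar step, then a Transformer encoder
        # (a simple stand-in for the FT-Transformer used in the thesis).
        self.step_embed = nn.Linear(1, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Visual branch: a timm CNN returning a pooled feature vector.
        self.cnn = timm.create_model(backbone, pretrained=False, num_classes=0)
        vis_dim = self.cnn.num_features
        # Mid-level fusion: concatenate the two embeddings, then forecast.
        self.head = nn.Sequential(
            nn.Linear(d_model + vis_dim, 128), nn.ReLU(), nn.Linear(128, horizon)
        )

    def forward(self, window, plot_image):
        # window: (batch, window_len), plot_image: (batch, 3, H, W)
        z_num = self.encoder(self.step_embed(window.unsqueeze(-1))).mean(dim=1)
        z_vis = self.cnn(plot_image)
        return self.head(torch.cat([z_num, z_vis], dim=-1))


# Usage with dummy tensors (shapes only):
model = MidFusionForecaster()
y_hat = model(torch.randn(2, 96), torch.randn(2, 3, 224, 224))
print(y_hat.shape)  # torch.Size([2, 48])
```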
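The Bayesian hyperparameter optimization step with Optuna typically follows the pattern below. The search space and the `validate()` routine are placeholders introduced for illustration; only the Optuna calls themselves reflect the library's actual API.

```python
# Sketch of Bayesian hyperparameter search with Optuna (the search space and
# the validate() routine are placeholders; only the Optuna API calls are real).
import optuna


def validate(d_model, lr, dropout):
    # Placeholder: train the fusion model with these settings and return the
    # validation NRMSE. A toy function stands in for the real training loop.
    return (d_model / 256 - 0.3) ** 2 + (lr - 1e-3) ** 2 + dropout * 0.01


def objective(trial):
    d_model = trial.suggest_categorical("d_model", [32, 64, 128, 256])
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    return validate(d_model, lr, dropout)


# TPE sampler = the Bayesian optimization strategy; minimize validation NRMSE.
study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```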
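For reference, the NRMSE reported above is the RMSE normalized by a scale of the target series; a common formulation (the choice of normalizer, here the mean of the actuals, is an assumption, since the abstract does not specify it) is:

$$\mathrm{NRMSE} = \frac{\sqrt{\frac{1}{H}\sum_{h=1}^{H}\bigl(y_{t+h}-\hat{y}_{t+h}\bigr)^{2}}}{\frac{1}{H}\sum_{h=1}^{H} y_{t+h}}$$

where $H$ is the forecast horizon (up to 48 hourly steps); the stratified variant restricts the sums to the buckets $h \in \{1\text{–}12,\ 13\text{–}24,\ 25\text{–}36,\ 37\text{–}48\}$.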
Description
Keywords
Time series; Multimodal learning; Numerical–visual fusion; Forecasting (NRMSE)
