Publication

A multi-omics and primer database for virus identification: Focus on HIV, Ebola, and SARS-CoV-2

File: POSTER_Vitor Júlio Sá2.pdf (420.89 KB, Adobe PDF)


Abstract(s)

Highly infectious viruses such as HIV, Ebola, and SARS-CoV-2 pose ongoing challenges to global health. Optimizing rapid detection tests, including PCR, and identifying new therapeutic targets therefore remain priorities. Genomic and proteomic databases such as the HIV Oligonucleotide Database (HIVoligoDB), EbolaID, and CoV2ID have made this knowledge easier to accumulate and access through comprehensive, user-friendly, open-access platforms. This study aims to update, expand, and integrate these databases into a single resource, and to analyze informative genomic regions in order to improve viral detection methods and treatment strategies.
Complete genomic sequence variants for each virus were compiled using Geneious Prime and NCBI Virus, then aligned with MAFFT (multiple sequence alignment) on the Galaxy platform. Two approaches were tested for extracting primers and probes from research articles: Large Language Models (LLMs), specifically NotebookLM and DonutAI/OpenLLaMa-7b, and a classic method that combines the Python package PyMuPDF4LLM for PDF data extraction with regular expressions (RegEx) for oligonucleotide identification. In preliminary testing, DonutAI/OpenLLaMa-7b had the lowest accuracy and failed to correctly identify any primers. NotebookLM achieved an accuracy of 39%, while the PyMuPDF4LLM + RegEx method attained the highest accuracy at 71%, successfully identifying 85 out of 121 primers in the test batch of articles. Because of its higher accuracy and faster execution, the PyMuPDF4LLM + RegEx approach was selected for further refinement. Compared with previous RegEx-based techniques, this methodology removes the need for PDF preprocessing and captures more of the relevant information with fewer non-relevant matches.
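The PyMuPDF4LLM + RegEx idea can be illustrated with a minimal sketch. It is not the authors' pipeline: the pattern, the 15 to 40 nucleotide length window, the helper names (`pdf_to_markdown`, `find_oligos`, `recall`) and the scoring rule are all assumptions made for illustration. The only library call used is `pymupdf4llm.to_markdown`, which converts a PDF to Markdown text.

```python
import re

# Assumption: oligonucleotides appear as unbroken uppercase runs of IUPAC
# nucleotide codes, 15 to 40 characters long. Sequences split by spaces,
# hyphens, or line breaks would need extra normalization first.
IUPAC = "ACGTURYKMSWBDHVN"
OLIGO_RE = re.compile(rf"(?<![A-Za-z])[{IUPAC}]{{15,40}}(?![A-Za-z])")


def pdf_to_markdown(pdf_path):
    """Extract the text of a PDF as Markdown using PyMuPDF4LLM."""
    import pymupdf4llm  # third-party package, imported only when needed

    return pymupdf4llm.to_markdown(pdf_path)


def find_oligos(text):
    """Return candidate oligonucleotides in order of first appearance."""
    found = []
    for match in OLIGO_RE.finditer(text):
        seq = match.group(0)
        if seq not in found:  # drop repeated mentions of the same sequence
            found.append(seq)
    return found


def recall(identified, expected):
    """Fraction of the expected primers that were identified.

    Hypothetical scoring rule; the metric used in the study is not specified.
    """
    expected_set = set(expected)
    return len(expected_set & set(identified)) / len(expected_set)


if __name__ == "__main__":
    # Illustrative sentence, not taken from any article in the test batch.
    sample = (
        "Forward primer 5'-GACCCCAAAATCAGCGAAAT-3' and reverse primer "
        "5'-TCTGGTTACTGCCAGTTGAATCTG-3' were used."
    )
    print(find_oligos(sample))
```

For a real article, the text passed to `find_oligos` would come from `pdf_to_markdown("article.pdf")`, where the file name is a placeholder.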
Future steps include cross-validating the extracted primers against the reference genome, both to discard primers intended for other viruses and to locate the binding regions of the identified oligonucleotides. In addition, parameters such as percentage of identical sites and pairwise identity will be calculated to select the best primer pairs for PCR. Further structural analysis of the collected sequences will form the foundation for 3D modelling and molecular dynamics simulations.
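The planned validation steps can be sketched as follows, under stated assumptions: primers are matched exactly (no mismatches or degenerate bases), the reference is a plain string, and the identity measures follow one common convention (columns where every sequence has the same non-gap residue count as identical, and gap-versus-gap positions are skipped in pairwise comparisons). Tools differ on these details, so the function names and definitions here are illustrative rather than the study's own.

```python
from itertools import combinations


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (unambiguous bases only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_binding_sites(primer, reference):
    """Return (strand, start) for every exact match of the primer.

    Positions are 0-based on the forward strand of the reference.
    """
    reference = reference.upper()
    hits = []
    for strand, query in (("+", primer.upper()), ("-", reverse_complement(primer))):
        start = reference.find(query)
        while start != -1:
            hits.append((strand, start))
            start = reference.find(query, start + 1)
    return hits


def percent_identical_sites(aligned):
    """Percentage of alignment columns with one shared non-gap residue."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    identical = sum(
        1 for col in zip(*aligned) if len(set(col)) == 1 and col[0] != "-"
    )
    return 100.0 * identical / length


def percent_pairwise_identity(aligned):
    """Percentage of identical residue pairs over all sequence pairs.

    Gap-versus-gap positions are not compared; gap-versus-residue
    positions count as mismatches.
    """
    matches = compared = 0
    for a, b in combinations(aligned, 2):
        for x, y in zip(a, b):
            if x == "-" and y == "-":
                continue
            compared += 1
            if x == y:
                matches += 1
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy data for illustration only.
    reference = "TTGACCAGTCAAGGCTA"
    print(find_binding_sites("GACCAG", reference))
    print(find_binding_sites("AGCCTT", reference))
    alignment = ["ACGT-A", "ACGTTA", "ACCTTA"]
    print(percent_identical_sites(alignment))
    print(percent_pairwise_identity(alignment))
```

A primer with no hit on either strand of the target reference would be a candidate for removal as belonging to another virus, although a production workflow would likely tolerate a small number of mismatches.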


Citation

Lima, A., Carneiro, J., Sousa, S. F., Sá, V. J., & Pratas, D. (2024). A Multi-Omics and Primer Database for Virus Identification: Focus on HIV, Ebola, and SARS-CoV-2. https://doi.org/10.13140/RG.2.2.31826.47041
