Publication

A multi-omics and primer database for virus identification: Focus on HIV, Ebola, and SARS-CoV-2

dc.contributor.author: Lima, A. S.
dc.contributor.author: Carneiro, J.
dc.contributor.author: Sousa, S.
dc.contributor.author: Sá, Vítor Júlio
dc.contributor.author: Pratas, D.
dc.contributor.author: Sá, Vítor J.
dc.date.accessioned: 2025-11-06T10:41:30Z
dc.date.available: 2025-11-06T10:41:30Z
dc.date.issued: 2024-12-19
dc.description.abstract: Highly infectious viruses such as HIV, Ebola, and SARS-CoV-2 have presented ongoing challenges to global health. Consequently, the optimization of rapid detection tests, including PCR, and the identification of new therapeutic targets remain of paramount importance. The development of genomic and proteomic databases like the HIV Oligonucleotide Database (HIVoligoDB), EbolaID, and CoV2ID has facilitated the accumulation and accessibility of knowledge through comprehensive, user-friendly, open-access platforms. This study aims to update, expand, and integrate these databases into a single resource, while conducting thorough analyses of informative genomic regions with the goal of enhancing viral detection methods and treatment strategies. Complete genomic sequence variants for each virus were compiled using Geneious Prime and NCBI Virus, followed by multiple sequence alignment via MAFFT within the Galaxy platform. The extraction of primers and probes from research articles was attempted using two approaches: Large Language Models (LLMs), specifically NotebookLM and DonutAI/OpenLLaMa-7b, and a classic method combining the Python package PyMuPDF4LLM for PDF data extraction with regular expressions (RegEx) for oligonucleotide identification. Preliminary testing revealed that DonutAI/OpenLLaMa-7b had the lowest accuracy, failing to correctly identify any primers. NotebookLM achieved an accuracy of 39%, while the PyMuPDF4LLM + RegEx method attained the highest accuracy at 71%, successfully identifying 85 out of 121 primers in the test batch of articles. Due to its superior performance and execution speed, the PyMuPDF4LLM + RegEx approach was selected for further refinement. This methodology improves upon previous RegEx-based techniques by eliminating the need for PDF preprocessing and refining the capture of relevant information while minimizing non-relevant captures.
Future steps include cross-validation of the extracted primers against the reference genome to eliminate primers intended for other viruses and to accurately identify the binding regions of the identified oligonucleotides. Additionally, parameters such as percentage of identical sites and pairwise identity will be calculated to determine the optimal primer pairs for PCR optimization. Further structural analysis of the collected sequences will form the foundation for 3D modelling and molecular dynamics simulations. [por]
dc.identifier.citation: Lima, A., Carneiro, J., Sousa, S. F., Sá, V. J., & Pratas, D. (2024). A Multi-Omics and Primer Database for Virus Identification: Focus on HIV, Ebola, and SARS-CoV-2. https://doi.org/10.13140/RG.2.2.31826.47041
dc.identifier.doi: https://doi.org/10.13140/RG.2.2.31826.47041
dc.identifier.uri: http://hdl.handle.net/10400.22/30748
dc.language.iso: eng
dc.peerreviewed: n/a
dc.relation: UIDB/04423/2020, UIDP/04423/2020
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.title: A multi-omics and primer database for virus identification: Focus on HIV, Ebola, and SARS-CoV-2 [por]
dc.type: conference poster
dspace.entity.type: Publication
oaire.citation.conferenceDate: 2024
oaire.citation.conferencePlace: Vila do Conde
oaire.citation.title: XX ENBE Annual Meeting of the Portuguese Association for Evolutionary Biology
oaire.version: http://purl.org/coar/version/c_b1a7d7d4d402bcce
person.familyName:
person.givenName: Vítor J.
person.identifier: C-8775-2009
person.identifier.ciencia-id: 211C-CF41-90E7
person.identifier.orcid: 0000-0002-4982-4444
person.identifier.rid: C-8775-2009
person.identifier.scopus-author-id: 49864475600
relation.isAuthorOfPublication: 89f27ca4-b34c-4813-9c23-2d7300d1743d
relation.isAuthorOfPublication.latestForDiscovery: 89f27ca4-b34c-4813-9c23-2d7300d1743d
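
The abstract's classic extraction route (PyMuPDF4LLM for PDF text, then RegEx for oligonucleotides) might look roughly like the sketch below. This is not the authors' code: the pattern, the 15 to 40 nucleotide length bounds, and the helper names `pdf_to_text`, `extract_oligos`, and `accuracy` are assumptions made for illustration. The only third-party call used is `pymupdf4llm.to_markdown`, imported lazily so the regex part runs without the package installed.

```python
import re

# Assumed alphabet: IUPAC nucleotide codes, including degenerate bases.
IUPAC = "ACGTURYSWKMBDHVN"

# Hypothetical pattern: an uppercase run of 15 to 40 nucleotide codes that is
# not embedded in a longer word. The length bounds are illustrative only.
OLIGO_RE = re.compile(rf"(?<![A-Za-z])[{IUPAC}]{{15,40}}(?![A-Za-z])")


def pdf_to_text(path):
    """Convert a PDF to Markdown text with PyMuPDF4LLM (no preprocessing)."""
    import pymupdf4llm  # imported lazily; only needed when reading real PDFs

    return pymupdf4llm.to_markdown(path)


def extract_oligos(text):
    """Return candidate oligonucleotides in order of first appearance."""
    matches = (m.group(0) for m in OLIGO_RE.finditer(text))
    return list(dict.fromkeys(matches))  # de-duplicate, keep order


def accuracy(found, expected):
    """Fraction of the expected primers that were recovered."""
    expected_set = set(expected)
    if not expected_set:
        raise ValueError("expected must contain at least one primer")
    return len(set(found) & expected_set) / len(expected_set)


if __name__ == "__main__":
    # Illustrative text, not taken from any article in the test batch.
    sample = (
        "Forward primer 5'-GACCCCAAAATCAGCGAAAT-3' and probe "
        "ACCCCGCATTACGTTTGGTGGACC were used; see MATERIALS."
    )
    print(extract_oligos(sample))
```

A pattern this permissive will also capture sequences that belong to other viruses or are not primers at all, which is why the abstract lists cross-validation against the reference genome as a next step.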

Files

Original bundle
Name:
POSTER_Vitor Júlio Sá2.pdf
Size:
420.89 KB
Format:
Adobe Portable Document Format
License bundle
Name:
license.txt
Size:
4.03 KB
Format:
Item-specific license agreed upon to submission
Description: