Publication

Improving word embeddings in Portuguese: increasing accuracy while reducing the size of the corpus

dc.contributor.author: Pinto, José Pedro
dc.contributor.author: Viana, Paula
dc.contributor.author: Teixeira, Inês
dc.contributor.author: Andrade, Maria
dc.date.accessioned: 2023-01-19T12:03:19Z
dc.date.available: 2023-01-19T12:03:19Z
dc.date.issued: 2022-07-18
dc.description.abstract: The subjectiveness of multimedia content description has a strong negative impact on tag-based information retrieval. In our work, we propose enhancing available descriptions by adding semantically related tags. To cope with this objective, we use a word embedding technique based on the Word2Vec neural network parameterized and trained using a new dataset built from online newspapers. A large number of news stories was scraped and pre-processed to build a new dataset. Our target language is Portuguese, one of the most spoken languages worldwide. The results achieved significantly outperform similar existing solutions developed in the scope of different languages, including Portuguese. Contributions include also an online application and API available for external use. Although the presented work has been designed to enhance multimedia content annotation, it can be used in several other application areas. [pt_PT]
dc.description.sponsorship: This work is financed by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia, within project LA/P/0063/2020. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. [pt_PT]
dc.description.version: info:eu-repo/semantics/publishedVersion [pt_PT]
dc.identifier.doi: 10.7717/peerj-cs.964 [pt_PT]
dc.identifier.uri: http://hdl.handle.net/10400.22/21674
dc.language.iso: eng [pt_PT]
dc.peerreviewed: yes [pt_PT]
dc.publisher: PeerJ [pt_PT]
dc.relation: LA/P/0063/2020 [pt_PT]
dc.relation.publisherversion: https://peerj.com/articles/cs-964/ [pt_PT]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/ [pt_PT]
dc.subject: Natural language processing [pt_PT]
dc.subject: Machine learning [pt_PT]
dc.subject: Multimedia systems [pt_PT]
dc.subject: Context awareness [pt_PT]
dc.subject: Word2Vec [pt_PT]
dc.title: Improving word embeddings in Portuguese: increasing accuracy while reducing the size of the corpus [pt_PT]
dc.type: journal article
dspace.entity.type: Publication
oaire.citation.startPage: e964 [pt_PT]
oaire.citation.title: PeerJ Computer Science [pt_PT]
oaire.citation.volume: 8 [pt_PT]
person.familyName: Viana
person.givenName: Paula
person.identifier: 936138
person.identifier.ciencia-id: EA17-B097-BD2E
person.identifier.orcid: 0000-0001-8447-2360
person.identifier.scopus-author-id: 7003678537
rcaap.rights: openAccess [pt_PT]
rcaap.type: article [pt_PT]
relation.isAuthorOfPublication: 17ac1586-7589-4027-a541-3aea351fd6ae
relation.isAuthorOfPublication.latestForDiscovery: 17ac1586-7589-4027-a541-3aea351fd6ae

Files

Original bundle
Name: ART_DEE_PMV_peerj-cs-964_2022.pdf
Size: 2.23 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission
Description: