Browse by author "Campos, Maria Teresa Pinto da Silva de Almeida"
- Automating feature selection in binary classification datasets: a metadata-driven approach using machine learning algorithms and large language models
  Publication. Campos, Maria Teresa Pinto da Silva de Almeida; Rodrigues, Maria de Fátima Coutinho
  Identifying a representative subset of features for building a classification model from a given dataset remains a significant challenge in machine learning. Manually selecting and experimenting with different feature selection algorithms is both time-consuming and resource-intensive. Because feature selection is a well-established process, there is significant evidence that automating it will improve efficiency and reduce the need for manual intervention. This research proposes an automated process for choosing the best feature selection algorithm for binary classification datasets. The process evaluates four feature selection algorithms (Forward Feature Selection, Lasso Regularization, Decision Trees, and Feature Shuffling) across diverse binary classification datasets. The effectiveness of each algorithm is assessed using classification models, with the mean ROC score serving as the evaluation metric. The results are compiled into a metadata repository that stores each dataset's metadata characteristics together with its optimal feature selection algorithm. The repository is then embedded into vector representations, which allows feature selection algorithms to be queried and recommended for a new dataset based on the similarity of its metadata to that of previously analyzed datasets. Finally, a Large Language Model (LLM) turns the vector database's query response for the best-matching feature selection algorithm into a clear, context-aware recommendation for the user.
  By automating the feature selection process and incorporating LLM-generated responses, the project significantly reduces manual effort while recommending the best feature selection algorithm for a given binary classification dataset. The process's performance is evaluated using Leave-One-Out Cross-Validation across 72 binary classification datasets, with the top-1 and top-3 hit rates used as metrics to assess the accuracy of the algorithm recommendations. The evaluation results demonstrate the effectiveness of the proposed process in automating feature selection, thereby saving time and computational resources.
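The leave-one-out evaluation with top-k hit rates described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the embedding model, the vector database, or the similarity measure, so the sketch assumes that each dataset's metadata is already a plain numeric vector and that similarity is cosine similarity. The function names `recommend` and `loo_hit_rate`, and the toy data in the usage lines, are hypothetical.

```python
import numpy as np


def recommend(query, metadata, best_algorithms, k):
    """Return up to k distinct algorithm names, taken from the stored
    datasets in order of decreasing cosine similarity to `query`.

    Assumption: cosine similarity over raw metadata vectors stands in
    for the embedding and vector-database lookup, which the abstract
    does not detail.
    """
    metadata = np.asarray(metadata, dtype=float)
    query = np.asarray(query, dtype=float)
    norms = np.linalg.norm(metadata, axis=1) * np.linalg.norm(query)
    # Guard against zero-length vectors to avoid division by zero.
    sims = metadata @ query / np.where(norms == 0, 1.0, norms)
    recommendations = []
    for idx in np.argsort(-sims, kind="stable"):
        algo = best_algorithms[idx]
        if algo not in recommendations:
            recommendations.append(algo)
        if len(recommendations) == k:
            break
    return recommendations


def loo_hit_rate(metadata, best_algorithms, k):
    """Leave-one-out top-k hit rate: hold out each dataset in turn,
    recommend from the remaining ones, and count a hit when the held-out
    dataset's known best algorithm is among the k recommendations."""
    metadata = np.asarray(metadata, dtype=float)
    n = len(best_algorithms)
    hits = 0
    for i in range(n):
        keep = np.arange(n) != i
        rest_algos = [a for j, a in enumerate(best_algorithms) if j != i]
        recs = recommend(metadata[i], metadata[keep], rest_algos, k)
        hits += best_algorithms[i] in recs
    return hits / n


# Toy, made-up metadata vectors and labels (not data from the thesis).
toy_metadata = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.6, 0.4]]
toy_best = ["lasso", "lasso", "tree", "tree", "tree"]
print(loo_hit_rate(toy_metadata, toy_best, k=1))
print(loo_hit_rate(toy_metadata, toy_best, k=3))
```

A top-3 hit rate can never be lower than the top-1 hit rate on the same data, since the top-3 list always contains the top-1 recommendation. The gap between the two indicates how often the correct algorithm is a near miss rather than the first suggestion.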
