| Name: | Description: | Size: | Format: |
|---|---|---|---|
| | | 1.32 MB | Adobe PDF |
Abstract(s)
Industrial Internet of Things (IoT) platforms generate large volumes of heterogeneous data that exceed the capabilities of conventional data pipelines, especially in enterprise environments such as Bosch. Addressing these demands requires not only scalable infrastructure but also ergonomic tools that support the development of reproducible machine learning (ML) workflows. This dissertation presents SparklyAI, a PySpark library that formalizes the Cross Industry Standard Process for Data Mining (CRISP-DM) lifecycle for distributed ML, bridging the gap between industrial data engineering practice and modern Machine Learning Operations (MLOps) principles. SparklyAI provides a coherent framework for end-to-end pipelines, covering data ingestion, cleansing, sampling, feature engineering, modeling, and evaluation through a procedural Application Programming Interface (API) with schema-aware and type-safe constructs. Validation on real Bosch workflows demonstrated markedly improved consistency, maintainability, and reproducibility, as well as a sharp reduction in implementation errors and in onboarding time for new users. Quantitatively, the Engineering Time Proxy (ETP) decreased from 220.8 to 90.4 minutes (–59.1%), confirming substantial efficiency gains in practical deployments. In summary, SparklyAI provides a structured and efficient foundation for industrial-scale ML workflows, aligning day-to-day practice with the CRISP-DM methodology while retaining Spark's scalability.
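To illustrate the kind of staged, procedural pipeline the abstract describes, the following is a minimal sketch of a CRISP-DM-style workflow in plain Python. All function and variable names here are hypothetical and do not reproduce SparklyAI's actual API; in practice each stage would operate on a PySpark DataFrame rather than a list of dicts.

```python
# Hypothetical sketch of a CRISP-DM-style procedural pipeline.
# Names are illustrative only; SparklyAI's real API is not shown here.

def ingest(rows):
    # Data collection: in PySpark this might be spark.read.parquet(...)
    return list(rows)

def clean(rows):
    # Data preparation: drop records with missing sensor readings
    return [r for r in rows if r.get("value") is not None]

def engineer_features(rows):
    # Feature engineering: add a derived boolean feature
    return [{**r, "is_high": r["value"] > 50} for r in rows]

def evaluate(rows):
    # Evaluation: report the share of "high" readings
    high = sum(r["is_high"] for r in rows)
    return high / len(rows) if rows else 0.0

raw = [{"id": 1, "value": 70}, {"id": 2, "value": None}, {"id": 3, "value": 30}]
result = evaluate(engineer_features(clean(ingest(raw))))
print(result)  # 0.5
```

Chaining named stage functions in a fixed order is one way to make each CRISP-DM phase an explicit, testable unit, which is the kind of consistency and reproducibility benefit the abstract attributes to the library.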
Keywords
Data Processing; Distributed Machine Learning; PySpark; Big Data; Industrial Data
