Name | Description | Size | Format |
---|---|---|---|
DM_RuiSilva_MEI_2012 | | 1.85 MB | Adobe PDF |
Authors
Abstract(s)
Organizations increasingly need to be prepared for a constantly changing world, in which information from various business areas must be gathered to support decisions that affect the organization's performance in its competitive environment. To this end, organizations use Data Warehousing Systems (DWS) as data repositories, which gather and integrate data through an Extraction, Transformation and Loading (ETL) process. ETL processes are highly complex, mainly because they have to access a set of source systems, often heterogeneous, in order to extract data and perform transformation and cleaning tasks according to business rules, which requires great computational power. As a DWS grows, its ETL process has more and more data to process. However, it is desirable that the data processing time remain within its window of opportunity, without compromising the system, regardless of the volume of data to be processed. Task parallelization makes it possible to reduce the data processing time, since some independent tasks can be executed by different machines at the same time. The main concept of Grid environments is the reuse and harnessing of resources, making it possible to benefit from distributed processing power to reduce the impact of data growth. Thus, a Grid environment can be used to schedule an ETL process, reducing the impact of data growth, since Grid environments make it possible to take advantage of the available distributed resources.
Description
Keywords
Data Warehouse; ETL; Grid Environments; Parallel Processing; Scheduling