We are confronted with an explosion in the generation and storage of
digital information: the collective store of digital data is
expanding at an estimated rate of 30% per year. As a result, data collections
have grown to sizes of multiple terabytes and will soon reach petabytes.
These enormous repositories of digital information require the ability to analyze
and understand large-scale data in a reasonable time in order to extract
useful knowledge. Many data mining algorithms have been proposed for
extracting knowledge from medium-sized data sets: association rule search,
classification, clustering, segmentation. The present challenge arises
from the connection between Data Grids and Computational Grids, i.e. the
meeting of computing-intensive and data-intensive applications, the
conjunction of large scale and complexity.
With the WODKA (Webservices oriented Datamining in Knowledge Architecture) infrastructure,
we provide components and services that enable platform-independent access to, sharing of, and application of potentially
distributed complex data mining workflows and resources, including database and information systems and hardware resources. It supports resource discovery and will supply context-aware recommendations for the
dynamic composition of data mining operations and workflows. The underlying agent-based layer of the SOAJA infrastructure provides the means to orchestrate very large, heterogeneous and dynamic hardware and
software resources across multiple platforms.
We have previously proposed applying high performance computing (i.e. parallel and distributed computing)
to data mining tasks. In the DisDaMin project (Distributed Data Mining), we introduced solutions for some data mining problems by providing new distributed algorithms suited to a Grid environment (see the PhD thesis of Dr Valérie Fiolet). The
SOAJA infrastructure provides a middleware platform for the Grid that serves as a base for deploying the DisDaMin algorithms. We propose to distribute the knowledge discovery process by parallelizing data mining tasks. In particular, we intend to give an effective solution for association rule search in very large databases. The exponential complexity of association rule search forces the adaptation of existing algorithms. Specific parallelization techniques make it possible to obtain a speedup coming both from parallel execution and
from the reduction of time complexity. The DisDaMin project uses an intelligent data distribution for the association rule problem. Computing fragments according to this intelligent distribution decreases the
global processing complexity of the problem. The WODKA approach aims to offer additional support for seamless data distribution by deploying a workflow enactment environment able to support the execution of data workflows
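
To make the idea of mining association rules over data fragments more concrete, the following is a minimal sketch of a classical partition-based scheme for frequent itemset search. It is not the DisDaMin algorithm and does not reproduce its intelligent data distribution: the fragments are simply given as input, and the names local_frequent_itemsets, global_frequent_itemsets and max_size are hypothetical, introduced only for illustration. The sketch relies on one property: an itemset that is frequent over the whole database must be frequent in at least one fragment, so each fragment can be mined independently (for example on a separate Grid node) and only the union of the local results needs a global count.

```python
from collections import Counter
from itertools import combinations


def local_frequent_itemsets(fragment, min_support, max_size):
    """Return the itemsets (as sorted tuples) that are frequent in one fragment.

    Brute-force enumeration, bounded by max_size to keep the
    exponential candidate space under control (illustrative only).
    """
    counts = Counter()
    for transaction in fragment:
        items = sorted(set(transaction))
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    threshold = min_support * len(fragment)
    return {itemset for itemset, n in counts.items() if n >= threshold}


def global_frequent_itemsets(fragments, min_support, max_size=3):
    """Two-phase partition scheme over a list of fragments.

    Phase 1: mine each fragment independently (parallelizable step).
    Phase 2: count the merged candidates over the whole data set.
    """
    # Phase 1: union of locally frequent itemsets forms the candidate set.
    candidates = set()
    for fragment in fragments:
        candidates |= local_frequent_itemsets(fragment, min_support, max_size)

    # Phase 2: one pass over all fragments to get global support counts.
    total = sum(len(fragment) for fragment in fragments)
    counts = Counter()
    for fragment in fragments:
        for transaction in fragment:
            items = set(transaction)
            for candidate in candidates:
                if items.issuperset(candidate):
                    counts[candidate] += 1

    return {c: n for c, n in counts.items() if n >= min_support * total}


# Hypothetical toy data: two fragments of two transactions each.
fragments = [
    [["a", "b"], ["a", "b", "c"]],
    [["a", "c"], ["b", "c"]],
]
result = global_frequent_itemsets(fragments, min_support=0.5)
```

In this sketch only the candidates that survive phase 1 are counted globally, which is where the reduction in work comes from; how much is saved depends entirely on how the data is split into fragments, which is the question the intelligent distribution addresses.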