WODKA project

Nowadays, we are confronted with an explosion in the generation and storage of digital information: the collective store of digital data is expanding at an estimated rate of 30% per year. As a result, data collections have grown to multiple terabytes and will soon reach petabytes. Exploiting these enormous repositories requires the ability to analyze and understand large-scale data in a reasonable time in order to extract useful knowledge. Many data mining algorithms have been proposed for extracting knowledge from medium-sized data sets: association rule search, classification, clustering, segmentation. The present challenge arises from the connection between Data Grids and Computational Grids, i.e. the meeting of computing-intensive and data-intensive applications, the conjunction of large scale and complexity.

With the WODKA (Webservices oriented Datamining in Knowledge Architecture) infrastructure, we provide components and services that enable platform-independent access to, and sharing and application of, potentially distributed complex data mining workflows and resources, including database and information systems and hardware resources. It supports resource discovery and will supply context-aware recommendations for the dynamic composition of data mining operations and workflows. The underlying agent-based layer of the SOAJA infrastructure provides the means to orchestrate very large, heterogeneous and dynamic hardware and software resources across multiple platforms.

Choreography of the DisDaMin project in BPMN form

We previously proposed applying high performance computing (i.e. parallel and distributed computing) to data mining tasks. In the DisDaMin (Distributed Data Mining) project, we introduced solutions to some data mining problems by providing new distributed algorithms suited to a Grid environment (see the PhD thesis of Dr Valérie Fiolet). The SOAJA infrastructure provides a middleware platform for the Grid that serves as the base for deploying the DisDaMin algorithms. We propose to distribute the knowledge discovery process by parallelizing data mining tasks. In particular, we intend to give an effective solution for association rule search in very large databases. The exponential complexity of association rule search forces the adaptation of existing algorithms. Specific parallelization techniques yield speedup both from parallel execution and from a reduction in time complexity. The DisDaMin project uses an intelligent data distribution for the association rule problem: computing on fragments built according to this distribution decreases the overall processing complexity of the problem. The WODKA approach aims to offer additional support for seamless data distribution by deploying a workflow enactment environment able to support the execution of data workflows.
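To make the idea of mining association rules over data fragments more concrete, here is a minimal Python sketch of a generic partition-based search for frequent itemsets, the first step of association rule search. It is not the DisDaMin algorithm and does not reproduce its intelligent data distribution: the fragments are simply given as input, a thread pool stands in for Grid nodes, and the names `local_frequent_itemsets`, `count_candidates`, `frequent_itemsets` and the `max_size` parameter are assumptions made for illustration. It relies on a standard property: an itemset that is frequent over the whole database must be frequent in at least one fragment, so the union of locally frequent itemsets is a complete set of global candidates.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def local_frequent_itemsets(fragment, min_support, max_size=3):
    """Itemsets (up to max_size items) frequent within one fragment.

    Brute-force enumeration, for illustration only.
    """
    counts = Counter()
    for transaction in fragment:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    threshold = min_support * len(fragment)
    return {itemset for itemset, n in counts.items() if n >= threshold}


def count_candidates(fragment, candidates):
    """Count how many transactions of one fragment contain each candidate."""
    counts = Counter()
    for transaction in fragment:
        items = set(transaction)
        for itemset in candidates:
            if items.issuperset(itemset):
                counts[itemset] += 1
    return counts


def frequent_itemsets(fragments, min_support, max_size=3):
    """Two-phase search; threads stand in for Grid nodes (an assumption)."""
    with ThreadPoolExecutor() as pool:
        # Phase 1: each fragment proposes its locally frequent itemsets.
        local = pool.map(
            lambda f: local_frequent_itemsets(f, min_support, max_size),
            fragments,
        )
        candidates = set().union(*local)
        # Phase 2: each fragment counts every candidate; counts are summed.
        partial = pool.map(lambda f: count_candidates(f, candidates), fragments)
        total = sum(partial, Counter())
    n_transactions = sum(len(f) for f in fragments)
    threshold = min_support * n_transactions
    return {itemset: n for itemset, n in total.items() if n >= threshold}
```

Each fragment is processed independently in both phases, and only candidate sets and counts are exchanged between them. How well this works depends on how the data are fragmented: the fewer spurious local candidates the fragments produce, the less counting work is left for the second phase.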