KDID'06 logo

Berlin photo (by Land Berlin/Thie)

The 5th International Workshop on Knowledge Discovery in Inductive Databases (KDID'06)

Accepted Papers

Printer friendly version: open.

Mining Bi-sets in Numerical Data

by Jérémy Besson, Céline Robardet, Luc De Raedt and Jean-François Boulicaut

Thanks to an important research effort the last few years, inductive queries on set patterns and complete solvers which can evaluate them on large 0/1 data sets have been proved extremely useful. However, for many application domains, the raw data is numerical (matrices of real numbers whose dimensions denote objects and properties). Therefore, using efficient 0/1 mining techniques needs for tedious Boolean property encoding phases. This is, e.g., the case, when considering microarray data mining and its impact for knowledge discovery in molecular biology. We consider the possibility to mine directly numerical data to extract collections of relevant bi-sets, i.e., couples of associated sets of objects and attributes which satisfy some user-defined constraints. Not only we propose a new pattern domain but also we introduce a complete solver for computing the so-called numerical bi-sets. Preliminary experimental validation is given.

Weighted and Probabilistic Instances of the Soft Constraint Based Pattern Mining Paradigm

by Stefano Bistarelli and Francesco Bonchi

The paradigm of pattern discovery based on constraints has been recognized as a core technique in inductive querying: constraints provide to the user a tool to drive the discovery process towards potentially interesting patterns, with the positive side effect of achieving a more efficient computation. However, due to the lack of research on methodological issues, the constraint-based pattern mining framework still suffers from many problems which limit its practical relevance. In our previous work, we analyzed such limitations and showed how they flow out from the same source: the fact that in the classical constraint-based mining, a constraint is a rigid boolean function which returns either true or false. To overcome such limitations we introduced the new paradigm of pattern discovery based on Soft Constraints, and instantiated our idea to the fuzzy soft constraints. In this paper we extend the framework to deal with probabilistic and weighted soft constraints: we provide theoretical basis and detailed experimental analysis. Finally we discuss of how deal with top-k queries.

On Interactive Pattern Mining from Relational Databases

by Francesco Bonchi, Fosca Giannotti, Claudio Lucchese, Salvatore Orlando, Raffaele Perego and Roberto Trasarti

In this paper we introduce a constraint based querying system devised with the aim of supporting the intrinsically exploratory (i.e., human-guided, interactive, iterative) nature of pattern discovery. Following the Inductive Database vision, our framework provides users with an expressive constraint based query language which allows the discovery process to be effectively driven toward potentially interesting patterns. Such constraints are also exploited to reduce the cost of pattern mining computation. We implemented a comprehensive mining system that can access real world relational databases from which extract data. After a preprocessing step, users mining queries are answered by an efficient pattern mining engine which entails several data and search space reduction techniques. Finally, results are presented to the user, and then stored in the database. New user-defined constraints can be easily added to the system in order to target the particular application considered.

Analysis of Time Series Data with Predictive Clustering Trees

by Sašo Džeroski, Ivica Slavkov, Valentin Gjorgjioski and Jan Struyf

Predictive clustering is a general framework that unifies clustering and prediction. This paper investigates how to apply this framework to cluster time series data. The resulting system, Clus-TS, constructs predictive clustering trees (PCTs) that partition a given set of time series into homogeneous clusters. In addition, PCTs provide a symbolic description of the clusters. The paper considers several distance metrics to measure cluster homogeneity (both quantitative and qualitative). We evaluate Clus-TS on time series data from microarray experiments. Each data set records the change over time in the expression level of yeast genes in response to a change in environmental conditions. Our evaluation shows that Clus-TS is able to identify interesting clusters of genes with similar responses. Clus-TS is part of a larger project where the goal is to investigate how global models can be combined with inductive databases.

Integrating Decision Tree Learning into Inductive Databases

by Élisa Fromont and Hendrik Blockeel

In inductive databases, there is no conceptual difference between data and the models describing the data: both can be stored and queried using some query language. The approach that adheres most strictly to this philosophy is probably the one proposed by the ADReM group from Antwerp: in that approach, models are stored in relational tables and queried using standard SQL. The approach has been described in detail for association rule discovery. In this work we study how decision tree induction could be integrated in that approach. We propose a representation format for decision trees similar to the one proposed earlier for association rules, and queryable using standard SQL; and we present a prototype system in which part of the needed functionality is implemented. In the process, we identify a number of important differences between discovery of global models (such as decision trees) and local models (such as association rules), which force us to re-evaluate the motivation for the approach.

An Integrated Multi-task Inductive Database and Decision Support System VINLEN: An Initial Implementation and First Results

by Kenneth A. Kaufman, Ryszard S. Michalski, Jaroslaw Pietrzykowski and Janusz Wojtusiak

A brief review of the current research on VINLEN multitask inductive database and decision support system is presented. VINLEN integrates a wide range of knowledge generation operators that given input data and/or knowledge create new knowledge. The central operator of VINLEN is a natural induction module that generates hypotheses from data in the form of attributional rules. Such rules are easy to understand and interpret because they directly correspond to equivalent natural language descriptions.  This operator is illustrated by an application to discovering relationships between lifestyles and diseases in men. The conclusion outlines plans for future research.

Frequent Pattern Mining and Knowledge Indexing Based on Zero-suppressed BDDs

by Shin-ichi Minato and Hiroki Arimura

Frequent pattern mining is one of the fundamental techniques for knowledge discovery and data mining. In the last decade, a number of efficient algorithms for frequent pattern mining have been presented, but most of them focused on just enumerating the patterns which satisfy the given conditions, and it was a different matter how to store and index the result of patterns for efficient inductive analysis. In this paper, we propose a fast algorithm of extracting all/maximal frequent patterns from transaction databases and simultaneously indexing the result of huge patterns using Zero-suppressed BDDs (ZBDDs). Our method is fast as competitive as the existing state-of-the-art algorithms, and not only enumerating/listing the patterns but also indexing the output data compactly on main memory. After mining, the result of patterns can efficiently be analyzed by using algebraic operations. The data structures of BDDs have already been used in VLSI logic design systems successively, but our method will be the first practical work of applying the BDD-based techniques for data mining area.

Quantitative Episode Trees

by Mirco Nanni and Christophe Rigotti

Among the family of the local patterns, episodes are commonly used when mining a single or multiple sequences of discrete events. An episode reflects a qualitative relation is-followed-by over event types, and the refinement of episodes to incorporate quantitative temporal information is still an on going research, with many application opportunities. In this paper, focusing on serial episodes, we design such a refinement called quantitative episodes and give a corresponding extraction algorithm. The three most salient features of these quantitative episodes are: (1) their ability to characterize main groups of homogeneous behaviors among the occurrences, according to the duration of the is-followed-by steps, and providing quantitative bounds of these durations organized in a tree structure; (2) the possibility to extract them in a complete way; and (3) to perform such extractions at the cost of a limited overhead with respect to the extraction of standard episodes.

IQL: A Proposal for an Inductive Query Language

by Siegfried Nijssen and Luc De Raedt

The inductive query language IQL is introduced. It is intended as a general, descriptive, declarative, extendable and implementable language for inductive querying that supports the mining of both local and global patterns, reasoning about inductive queries and query processing using logic, as well as the flexible incorporation of new primitives and solvers. IQL is an extension of the tuple relational calculus with functions, a typing system and various primitives for data mining. We hope that it will be useful as an overall specification language for integrating data mining systems and principles.

Mining Correct Properties in Incomplete Databases

by François Rioult and Bruno Crémilleux

Missing values issue in databases is an important problem because missing values bias the information provided by the usual data mining methods. In this paper, we are searching for mining patterns satisfying correct properties in presence of missing values (it means that these patterns must satisfy the properties in the corresponding complete database). We focus on k-free patterns. Thanks to a new definition of this property suitable for incomplete data and compatible with the usual one, we certify that the extracted k-free patterns in an incomplete database also satisfy this property in the corresponding complete database. Moreover, this approach enables to provide an anti-monotone criterion with respect to the pattern inclusion and thus design an efficient level-wise algorithm which extracts correct k-free patterns in presence of missing values.

Efficient Mining under Flexible Constraints through Several Datasets

by Arnaud Soulet, Jiří Kléma and Bruno Crémilleux

Mining patterns under many kinds of constraints is a key point to successfully get new knowledge. In this paper, we propose an efficient new algorithm Music-dfs which soundly and completely mines patterns with various constraints from large data and takes into account external data represented by several heterogeneous datasets. Constraints are freely built of a large set of primitives and enable to link the information scattered in various knowledge sources. Efficiency is achieved thanks to a new closure operator providing an interval pruning strategy applied during the depth-first search of a pattern space. A genomic case study shows both the effectiveness of our approach and the added-value of background knowledge such as free texts or gene ontologies in discovery of meaningful patterns.



Photo by Land Berlin/Thie. Last update: December 5, 2006.