|
The 5th International Workshop on Knowledge Discovery in Inductive Databases (KDID'06)
Accepted Papers
Printer friendly version: open.
Mining Bi-sets in Numerical Data
by Jérémy Besson, Céline Robardet, Luc De Raedt and Jean-François Boulicaut
Thanks to an important research effort the last few years,
inductive queries on set patterns and complete solvers which can
evaluate them on large 0/1 data sets have been proved extremely
useful. However, for many application domains, the raw data is
numerical (matrices of real numbers whose dimensions denote
objects and properties). Therefore, using efficient 0/1 mining
techniques needs for tedious Boolean property encoding phases.
This is, e.g., the case, when considering microarray data mining
and its impact for knowledge discovery in molecular biology. We
consider the possibility to mine directly numerical data to
extract collections of relevant bi-sets, i.e., couples of
associated sets of objects and attributes which satisfy some
user-defined constraints. Not only we propose a new pattern domain
but also we introduce a complete solver for computing the
so-called numerical bi-sets. Preliminary experimental validation
is given.
Weighted and Probabilistic Instances of the Soft Constraint Based Pattern Mining Paradigm
by Stefano Bistarelli and Francesco Bonchi
The paradigm of pattern discovery based on constraints has been recognized
as a core technique in inductive querying: constraints provide to the user
a tool to drive the discovery process towards potentially interesting patterns, with
the positive side effect of achieving a more efficient computation. However, due to
the lack of research on methodological issues, the constraint-based pattern mining
framework still suffers from many problems which limit its practical relevance.
In our previous work, we analyzed such limitations and showed how they
flow out from the same source: the fact that in the classical constraint-based mining,
a constraint is a rigid boolean function which returns either true or false. To
overcome such limitations we introduced the new paradigm of pattern discovery
based on Soft Constraints, and instantiated our idea to the fuzzy soft constraints.
In this paper we extend the framework to deal with probabilistic and weighted
soft constraints: we provide theoretical basis and detailed experimental analysis.
Finally we discuss of how deal with top-k queries.
On Interactive Pattern Mining from Relational Databases
by Francesco Bonchi, Fosca Giannotti, Claudio Lucchese, Salvatore Orlando, Raffaele Perego and Roberto Trasarti
In this paper we introduce a constraint based querying system
devised with the aim of supporting the intrinsically exploratory (i.e.,
human-guided, interactive, iterative) nature of pattern discovery. Following
the Inductive Database vision, our framework provides users with an
expressive constraint based query language which allows the discovery
process to be effectively driven toward potentially interesting patterns.
Such constraints are also exploited to reduce the cost of pattern mining
computation. We implemented a comprehensive mining system that can
access real world relational databases from which extract data. After a
preprocessing step, users mining queries are answered by an efficient pattern
mining engine which entails several data and search space reduction
techniques. Finally, results are presented to the user, and then stored in
the database. New user-defined constraints can be easily added to the
system in order to target the particular application considered.
Analysis of Time Series Data with Predictive Clustering Trees
by Sašo Džeroski, Ivica Slavkov, Valentin Gjorgjioski and Jan Struyf
Predictive clustering is a general framework that unifies clustering and prediction. This
paper investigates how to apply this framework to cluster time series data. The resulting
system, Clus-TS, constructs predictive clustering trees (PCTs) that partition a given set
of time series into homogeneous clusters. In addition, PCTs provide a symbolic
description of the clusters. The paper considers several distance metrics to measure
cluster homogeneity (both quantitative and qualitative). We evaluate Clus-TS on time
series data from microarray experiments. Each data set records the change over time in
the expression level of yeast genes in response to a change in environmental conditions.
Our evaluation shows that Clus-TS is able to identify interesting clusters of genes with
similar responses. Clus-TS is part of a larger project where the goal is to investigate
how global models can be combined with inductive databases.
Integrating Decision Tree Learning into Inductive Databases
by Élisa Fromont and Hendrik Blockeel
In inductive databases, there is no conceptual difference between data and the models
describing the data: both can be stored and queried using some query language. The
approach that adheres most strictly to this philosophy is probably the one proposed
by the ADReM group from Antwerp: in that approach, models are stored in relational
tables and queried using standard SQL. The approach has been described in detail for
association rule discovery. In this work we study how decision tree induction could
be integrated in that approach. We propose a representation format for decision trees
similar to the one proposed earlier for association rules, and queryable using
standard SQL; and we present a prototype system in which part of the needed
functionality is implemented. In the process, we identify a number of important
differences between discovery of global models (such as decision trees) and local
models (such as association rules), which force us to re-evaluate the motivation for
the approach.
An Integrated Multi-task Inductive Database and Decision Support System VINLEN: An Initial Implementation and First Results
by Kenneth A. Kaufman, Ryszard S. Michalski, Jaroslaw Pietrzykowski and Janusz Wojtusiak
A brief review of the current research on VINLEN multitask inductive database and
decision support system is presented. VINLEN integrates a wide range of knowledge
generation operators that given input data and/or knowledge create new knowledge. The
central operator of VINLEN is a natural induction module that generates hypotheses
from data in the form of attributional rules. Such rules are easy to understand and
interpret because they directly correspond to equivalent natural language
descriptions. This operator is illustrated by an application to discovering
relationships between lifestyles and diseases in men. The conclusion outlines plans
for future research.
Frequent Pattern Mining and Knowledge Indexing Based on Zero-suppressed BDDs
by Shin-ichi Minato and Hiroki Arimura
Frequent pattern mining is one of the fundamental techniques for
knowledge discovery and data mining. In the last decade, a
number of efficient algorithms for frequent pattern mining have
been presented, but most of them focused on just enumerating the
patterns which satisfy the given conditions, and it was a
different matter how to store and index the result of patterns
for efficient inductive analysis. In this paper, we propose a
fast algorithm of extracting all/maximal frequent patterns from
transaction databases and simultaneously indexing the result of
huge patterns using Zero-suppressed BDDs (ZBDDs). Our method is
fast as competitive as the existing state-of-the-art algorithms,
and not only enumerating/listing the patterns but also indexing
the output data compactly on main memory. After mining, the
result of patterns can efficiently be analyzed by using
algebraic operations. The data structures of BDDs have already
been used in VLSI logic design systems successively, but our
method will be the first practical work of applying the
BDD-based techniques for data mining area.
Quantitative Episode Trees
by Mirco Nanni and Christophe Rigotti
Among the family of the local patterns, episodes are commonly used when mining a single or
multiple sequences of discrete events. An episode reflects a qualitative relation
is-followed-by over event types, and the refinement of episodes to incorporate quantitative
temporal information is still an on going research, with many application opportunities. In
this paper, focusing on serial episodes, we design such a refinement called quantitative
episodes and give a corresponding extraction algorithm. The three most salient features of
these quantitative episodes are: (1) their ability to characterize main groups of homogeneous
behaviors among the occurrences, according to the duration of the is-followed-by steps,
and providing quantitative bounds of these durations organized in a tree structure; (2) the
possibility to extract them in a complete way; and (3) to perform such extractions at the
cost of a limited overhead with respect to the extraction of standard episodes.
IQL: A Proposal for an Inductive Query Language
by Siegfried Nijssen and Luc De Raedt
The inductive query language IQL is introduced. It is intended as a general,
descriptive, declarative, extendable and implementable language for inductive
querying that supports the mining of both local and global patterns,
reasoning about inductive queries and query processing using logic, as well
as the flexible incorporation of new primitives and solvers. IQL is an
extension of the tuple relational calculus with functions, a typing system
and various primitives for data mining. We hope that it will be useful as an
overall specification language for integrating data mining systems and
principles.
Mining Correct Properties in Incomplete Databases
by François Rioult and Bruno Crémilleux
Missing values issue in databases is an important problem because missing
values bias the information provided by the usual data mining methods. In this
paper, we are searching for mining patterns satisfying correct properties in presence
of missing values (it means that these patterns must satisfy the properties
in the corresponding complete database). We focus on k-free patterns. Thanks to a
new definition of this property suitable for incomplete data and compatible with the
usual one, we certify that the extracted k-free patterns in an incomplete database
also satisfy this property in the corresponding complete database. Moreover, this
approach enables to provide an anti-monotone criterion with respect to the pattern
inclusion and thus design an efficient level-wise algorithm which extracts correct
k-free patterns in presence of missing values.
Efficient Mining under Flexible Constraints through Several Datasets
by Arnaud Soulet, Jiří Kléma and Bruno Crémilleux
Mining patterns under many kinds of constraints is a key point to
successfully get new knowledge. In this paper, we propose an
efficient new algorithm Music-dfs
which soundly and completely mines
patterns with various constraints from large data and takes into
account external data represented by several heterogeneous datasets.
Constraints are freely built of a large set of primitives and enable
to link the information scattered in various knowledge sources.
Efficiency is achieved thanks to a new closure operator providing an
interval pruning strategy applied during the depth-first search of a
pattern space. A genomic case study shows both the effectiveness of
our approach and the added-value of background knowledge such as
free texts or gene ontologies in discovery of meaningful patterns.
|