CPSM
Presentation
The goal of constraint-based sequence mining is to find sequences of symbols that are included in a large number of input sequences and that satisfy some constraints specified by the user. Many constraints have been proposed in the literature, but a general framework is still missing. We investigate the use of constraint programming as general framework for this task. We first identify four categories of constraints that are applicable to sequence mining. We then propose two constraint programming formulations. The first formulation introduces a new global constraint called exists-embedding. This formulation is the most efficient but does not support one type of constraint. To support such constraints, we develop a second formulation that is more general but incurs more overhead. Both formulations can use the projected database technique used in specialised algorithms. Experiments demonstrate the flexibility towards constraint-based settings and compare the approach to existing methods.More details in: Constraint-based sequence mining using constraint programming
- Conference: International Conference on Integration of Artificial Intelligence (AI) and Operations Research (OR) techniques in Constraint Programming (CPAIOR), 2015
- Authors: Benjamin Negrevergne, Tias Guns
- Keywords: sequential pattern mining, sequence mining, episode mining, constrained pattern mining, constraint programming, declarative programming
- Paper: PDF
Software
The software consists of the framework and a number of example datasets and sets of constraints.- cpsm-0.5.tar.gz
- License: LGPL
- Contact person: Benjamin.Negrevergne@cs.kuleuven.be, Tias.Guns@cs.kuleuven.be
Installation
Depends on Gecode. Instructions for installing and running Gecode are in README file.
Usage
General usage: ./cpsm <dataset> <#minfreq>
For example, executing ./cpsm test.dat 0.5 should produce the following output.
$ ./cpsm test.dat 0.5 Absolute minsup threshold: 2 Pattern : 1 Pattern : 1 1 Pattern : 2 Pattern : 2 1 Pattern : 3 Search statistics: PROPS 69 SOLS 5 FAIL 0 NODE 8 DEPTH 2 Number of solutions: 5
More details in the README file.
Datasets
Datasets JMLR, FIFA, iPRG and User as used available here
Examples
Note: to run these examples, you will need the JMLR dataset. More info and more datasets are available here.
-
All frequent sequence in jmlr.dat with a minimum frequency threshold = 0.4 (i.e. 40%)
./cpsm jmlr.dat 0.4
-
20%-frequent sequence of size >= 3 and <= 5 in test.dat with a minimum frequency threshold = 0.2 (i.e. 20%)
./cpsm --min-size=3 --max-size=5 jmlr.dat 0.2
-
10%-frequent sequence with a maxixum gap of 2 and a minimum size of 2
./cpsm_emb --shrink-data=N --max-gap=2 --min-size=2 jmlr.dat 0.1
(Note the cpsm_emb executable file required for --max-gap and other type 3 constraints).