Datasets
Format: The datasets are in an annotated transaction format with labels. Every line is one transaction: a space-separated list of item identifiers (numbered from 0), followed by a final value of 1 or 0 that represents the class label.
The meaning of every item and of the class label is given in the header of the file: lines of the form @<nr>: ... describe item number <nr>, and @class: ... describes the two classes. To parse the files correctly, ignore all lines starting with @ or %, as well as empty lines. (The format combines the FIMI format with annotations in the style of the ARFF format.)
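As a minimal sketch of the parsing rules above, the following Python function reads such a file into item lists and class labels. The function name `parse_transactions` and the sample lines are illustrative assumptions and are not taken from the dataset files.

```python
from typing import Iterable, List, Tuple


def parse_transactions(lines: Iterable[str]) -> Tuple[List[List[int]], List[int]]:
    """Split annotated transaction lines into item lists and class labels."""
    transactions: List[List[int]] = []
    labels: List[int] = []
    for raw in lines:
        line = raw.strip()
        # Header/annotation lines (@...), comments (%...) and empty lines are skipped.
        if not line or line[0] in "@%":
            continue
        # The last value on the line is the class label, the rest are item identifiers.
        *items, label = (int(token) for token in line.split())
        transactions.append(items)
        labels.append(label)
    return transactions, labels


# Hypothetical sample content, made up for illustration only.
sample = [
    "@relation demo",
    "@0: hair=yes",
    "@class: positive / negative",
    "% a comment line",
    "",
    "0 2 5 1",
    "1 3 0",
]
transactions, labels = parse_transactions(sample)
```

For a real file, the same function can be applied to an open file handle, for example `parse_transactions(open("zoo-1.txt"))`.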
Sources: The original datasets were collected from the UCI machine learning repository. More datasets can be found in the FIMI repository, but they are not annotated.
Preprocessing: The preprocessing steps applied are recorded in the @relation tag of every file.
- Attributes with more than 10% missing values were removed, as were any remaining examples with missing values; in addition, the unique ID attribute of zoo-1 and splice-1 was removed,
- Numerical attributes were binarized using unsupervised discretisation with 7 binary split points (8 bins) and equal-frequency binning,
- Nominal attributes were transformed using one item for every value,
- Multi-class problems were made binary by selecting the largest class.
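The equal-frequency step can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard library's `statistics.quantiles` to obtain 7 split points, whereas the tool and the exact quantile convention used for these datasets are not stated here. How the resulting bins or split points are mapped to item identifiers is likewise not specified, so the sketch stops at the bin index. The names `equal_frequency_cuts` and `bin_index` are hypothetical.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles
from typing import List


def equal_frequency_cuts(values: List[float], bins: int = 8) -> List[float]:
    """Return bins - 1 split points dividing the values into roughly equal-sized groups."""
    return quantiles(values, n=bins)


def bin_index(value: float, cuts: List[float]) -> int:
    """Return the index of the bin (0 .. len(cuts)) that a value falls into."""
    return bisect_right(cuts, value)


# Made-up numerical attribute with 80 distinct values.
values = [float(v) for v in range(1, 81)]
cuts = equal_frequency_cuts(values)           # 7 split points
bins = [bin_index(v, cuts) for v in values]   # bin index 0..7 per value
bin_sizes = Counter(bins)                     # roughly equal counts per bin
```

With heavily repeated values, equal-frequency bins can end up uneven or share a split point, so real attributes may yield fewer than 8 distinct bins.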
Properties: Different datasets have different properties and will behave differently. A key property to watch is density (the relative number of 1's in the binary format): traditional itemset mining focussed on very large and sparse datasets (see the FIMI competitions). In constraint-based mining, dense datasets are considered harder to mine because of the large number of candidates. For discriminative itemset mining, class labels are given; the percentage of positive transactions is indicated below for each dataset.
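As a rough illustration of these two properties, density can be computed as the number of 1's divided by the size of the binary matrix (transactions times items), and the positive rate as the share of transactions labelled 1. The sketch below assumes that the class label is not counted as an item, which is not stated explicitly here. The function names and the toy data are made up.

```python
from typing import List


def density(transactions: List[List[int]], n_items: int) -> float:
    """Fraction of 1's in the binary transactions-by-items matrix."""
    ones = sum(len(set(t)) for t in transactions)
    return ones / (len(transactions) * n_items)


def positive_fraction(labels: List[int]) -> float:
    """Fraction of transactions whose class label is 1."""
    return sum(labels) / len(labels)


# Toy data: 3 transactions over 3 items, labels 1/0/1.
toy_transactions = [[0, 1], [1], [0, 1, 2]]
toy_labels = [1, 0, 1]
toy_density = density(toy_transactions, n_items=3)   # 6 ones in a 3 x 3 matrix
toy_positives = positive_fraction(toy_labels)        # 2 of 3 transactions
```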
The number of itemsets (standard, as well as the closed and maximal condensed representations) is also given, for verification of correctness and as a guideline for usage. LCM ver. 4 was used to find them.
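To make the three counts concrete, here is a small brute-force sketch that counts frequent (standard), closed and maximal itemsets on a toy dataset. It is not LCM and will not scale to the datasets listed here. It takes the minimum support as an absolute count and does not count the empty itemset, which may differ from the convention behind the reported numbers. All names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple


def frequent_itemsets(transactions: List[List[int]], min_support: int) -> Dict[FrozenSet[int], int]:
    """Enumerate all non-empty itemsets with support >= min_support (depth-first)."""
    # Vertical layout: item -> set of ids of the transactions containing it.
    tidsets: Dict[int, Set[int]] = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidsets.setdefault(item, set()).add(tid)
    items = sorted(i for i, tids in tidsets.items() if len(tids) >= min_support)
    result: Dict[FrozenSet[int], int] = {}

    def extend(prefix: FrozenSet[int], prefix_tids: Optional[Set[int]], start: int) -> None:
        for pos in range(start, len(items)):
            item = items[pos]
            tids = tidsets[item] if prefix_tids is None else prefix_tids & tidsets[item]
            if len(tids) >= min_support:
                itemset = prefix | {item}
                result[itemset] = len(tids)
                extend(itemset, tids, pos + 1)

    extend(frozenset(), None, 0)
    return result


def count_patterns(transactions: List[List[int]], min_support: int) -> Tuple[int, int, int]:
    """Return the number of (frequent, closed, maximal) itemsets."""
    freq = frequent_itemsets(transactions, min_support)
    closed = maximal = 0
    for itemset, support in freq.items():
        # Checking supersets that are one item larger is enough, since support
        # can only stay equal or drop when an item is added.
        supersets = [s for s in freq if len(s) == len(itemset) + 1 and itemset < s]
        if not supersets:
            maximal += 1  # no frequent superset
        if all(freq[s] < support for s in supersets):
            closed += 1   # no superset with the same support
    return len(freq), closed, maximal


# Toy data: {0}, {1} and {0, 1} are frequent at support 2; {1} is not closed.
counts = count_patterns([[0, 1], [0, 1], [0, 2]], min_support=2)
```

On the toy data the frequent itemsets are {0}, {1} and {0, 1}; of these, {0} and {0, 1} are closed and only {0, 1} is maximal.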
| Original data | Transactions | Items | Density | Positives | 10% frequency | Standard patterns | Closed patterns | Maximal patterns | Download dataset |
|---|---|---|---|---|---|---|---|---|---|
| UCI - zoo-1 | 101 | 36 | 44% | 41% | 10 | 151 807 | 3 292 | 230 | zoo-1.txt |
| UCI - vote | 435 | 48 | 33% | 61% | 44 | 49 098 | 35 771 | 2 636 | vote.txt |
| UCI - tic-tac-toe | 958 | 27 | 33% | 65% | 96 | 192 | 192 | 165 | tic-tac-toe.txt |
| UCI - splice-1 | 3190 | 287 | 21% | 52% | 319 | 1 606 | 1 606 | 988 | splice-1.txt |
| UCI - soybean | 630 | 50 | 32% | 15% | 63 | 27 636 | 2 908 | 331 | soybean.txt |
| UCI - primary-tumor | 336 | 31 | 48% | 24% | 34 | 50 041 | 31 025 | 2 043 | primary-tumor.txt |
| UCI - mushroom | 8124 | 119 | 18% | 52% | 812 | 155 734 | 3 287 | 453 | mushroom.txt |
| UCI - lymph | 148 | 68 | 40% | 55% | 15 | 9 967 402 | 46 802 | 5 191 | lymph.txt |
| UCI - kr-vs-kp | 3196 | 73 | 49% | 52% | 320 | 59 000 000+ in 30 min. | 59 000 000+ in 30 min. | 1 984 963 | kr-vs-kp.txt |
| UCI - hypothyroid | 3247 | 88 | 49% | 91% | 325 | 56 000 000+ in 30 min. | 56 000 000+ in 30 min. | 2 925 833 | hypothyroid.txt |
| UCI - hepatitis | 137 | 68 | 50% | 81% | 14 | 270 000 000+ in 30 min. | 1 827 264 | 189 205 | hepatitis.txt |
| UCI - heart-cleveland | 296 | 95 | 47% | 54% | 30 | 250 000 000+ in 30 min. | 12 774 456 | 1 647 364 | heart-cleveland.txt |
| UCI - german-credit | 1000 | 112 | 34% | 70% | 100 | 40 486 976 | 2 080 153 | 232 107 | german-credit.txt |
| UCI - australian-credit | 653 | 125 | 41% | 55% | 65 | 165 000 000+ in 30 min. | 24 208 803 | 2 580 684 | australian-credit.txt |
| UCI - audiology | 216 | 148 | 45% | 26% | 22 | 167 000 000+ in 30 min. | 167 000 000+ in 30 min. | 12 500 755 | audiology.txt |
| UCI - anneal | 812 | 93 | 45% | 77% | 81 | 147 000 000+ in 30 min. | 1 224 754 | 15 977 | anneal.txt |