HMC Software and Datasets

This page contains supporting materials for the paper C. Vens, J. Struyf, L. Schietgat, S. Džeroski, H. Blockeel, "Decision trees for hierarchical multi-label classification", Machine Learning 73(2):185-214, 2008.

Download pre-print in PDF format.

Software

The Clus-HMC algorithm is implemented in the Clus system.

Datasets

The datasets used in our experimental comparison come from the field of functional genomics. Amanda Clare kindly provided us with the datasets. The original versions can be found here. We kept the input features but added new class labels. In a first version, we took annotations from the MIPS Functional Catalogue (FunCat); in a second version, we took Gene Ontology (GO) terms. The table below shows the details.

	FunCat	Gene Ontology
Scheme version	2.1 (2007/01/09)	1.2 (2007/04/11)
Yeast annotations	2007/03/16	2007/04/07
Total classes	1362	22960
Average number of classes per dataset	492 (6 levels)	3997 (14 levels)
Average number of labels per example	8.8 (3.2 most specific)	35.0 (5.0 most specific)

The datasets are stored in Weka's ARFF format and are ready to be used with Clus. For each dataset, there are three ARFF files: train, valid, and test. In our article, the valid file was used to tune the f-test stopping criterion. The final model, constructed on the union of train and valid, was evaluated on test.
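For readers who want to build the union of train and valid outside Clus, the following is a minimal Python sketch, not part of the Clus distribution. It assumes both files declare exactly the same attributes and that all rows follow a single @data line; the function names split_arff and merge_arff are hypothetical.

```python
def split_arff(text):
    """Split ARFF text into (header_lines, data_lines) at the @data marker."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().lower().startswith("@data"):
            # Header includes the @data line itself; blank data lines are dropped.
            return lines[: i + 1], [row for row in lines[i + 1:] if row.strip()]
    raise ValueError("no @data section found")


def attribute_lines(header):
    """Return the @attribute declarations of a header, whitespace-stripped."""
    return [h.strip() for h in header if h.strip().lower().startswith("@attribute")]


def merge_arff(train_text, valid_text):
    """Return ARFF text with the train header followed by the rows of both files."""
    train_header, train_rows = split_arff(train_text)
    valid_header, valid_rows = split_arff(valid_text)
    # Assumption: the two files must describe identical attributes to be merged.
    if attribute_lines(train_header) != attribute_lines(valid_header):
        raise ValueError("attribute declarations differ between the two files")
    return "\n".join(train_header + train_rows + valid_rows) + "\n"
```

In use, one would read the train and valid files as text, pass both strings to merge_arff, and write the result to a new file. The rows are copied verbatim, so class labels are left untouched.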

FunCat annotated datasets

Gene Ontology annotated datasets

Several of these datasets suffer from non-unique feature representations, making the learning task more difficult. More information about this issue can be found in Pliakos et al., Representational power of gene features for function prediction, 2015.

Extra materials

Example settings files to be used with Clus (to run Clus in HMC mode):
- FunCat annotated datasets
- GO annotated datasets
- The settings file for the GO annotated datasets needs an additional file, evalclasses.txt, which excludes the three root GO terms from the performance evaluation (they never occur as labels in the datasets).

Optimal HMC ftest values for all datasets (with the above settings files, Clus automatically selects the f-value that maximizes the area under the average PR curve).

Scripts to run the SC and HSC settings (run_sc.pl, run_hsc.pl), and scripts to create precision-recall curves based on a fixed set of thresholds (prcurves.pl, computepr.pl, ipol_pr.pl). These are also included in the "data/church_FUN" directory that is distributed with the Clus software.
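The idea of a precision-recall curve over a fixed set of thresholds can be illustrated with a short Python sketch. This is not a port of the Perl scripts above. It assumes that predictions are per-class probabilities, that a class is predicted when its probability is at least the threshold, and that counts are pooled over all classes before computing precision and recall. The function name pr_points and the convention for empty denominators are hypothetical choices.

```python
def pr_points(scores, labels, thresholds):
    """Compute pooled (precision, recall) at each threshold.

    scores[i][j] is the predicted probability that example i has class j;
    labels[i][j] is 1 if example i truly has class j, else 0.
    Returns a list of (threshold, precision, recall) tuples.
    """
    points = []
    for t in thresholds:
        tp = fp = fn = 0
        for score_row, label_row in zip(scores, labels):
            for s, y in zip(score_row, label_row):
                predicted = s >= t  # assumption: inclusive threshold
                if predicted and y:
                    tp += 1
                elif predicted:
                    fp += 1
                elif y:
                    fn += 1
        # Assumed conventions when nothing is predicted or nothing is positive.
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((t, precision, recall))
    return points
```

Lowering the threshold predicts more classes, which tends to raise recall and lower precision. Turning the resulting points into an area requires an interpolation step, which this sketch leaves out.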

Questions?

Please direct questions about Clus-HMC/HSC/SC to Celine Vens, Jan Struyf, and Leander Schietgat.