HMC Software and Datasets

This page contains supporting materials for the paper C. Vens, J. Struyf, L. Schietgat, S. Džeroski, H. Blockeel, "Decision trees for hierarchical multi-label classification", Machine Learning 73(2):185-214, 2008.

Software

The Clus-HMC algorithm is implemented in the Clus system.

Datasets

The datasets used in our experimental comparison come from the field of functional genomics. Amanda Clare kindly provided us with the datasets; the original versions can be found here. We kept the input features but added new class labels. In the first version, we took annotations from the MIPS Functional Catalogue (FunCat); in the second version, we took Gene Ontology terms. The table below shows the details.

                                    FunCat                     Gene Ontology
Scheme version                      2.1 (2007/01/09)           1.2 (2007/04/11)
Yeast annotations                   2007/03/16                 2007/04/07
Total classes                       1362                       22960
Avg. no. of classes per dataset     492 (6 levels)             3997 (14 levels)
Avg. no. of labels per example      8.8 (3.2 most specific)    35.0 (5.0 most specific)
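
The gap between the two figures in the last row comes from the hierarchy: an example labeled with a class also carries all of that class's ancestors, so a few most specific labels expand into a larger label set. The following is a minimal sketch for a tree-shaped hierarchy such as FunCat, assuming classes are written as slash-separated paths (e.g. "01/01/03"). The function name and the sample codes are hypothetical, and the sketch does not apply as-is to the Gene Ontology, where a term can have several parents.

```python
def expand_with_ancestors(most_specific):
    """Return all labels implied by the given most specific labels.

    Assumes a tree-shaped hierarchy in which each class is written as a
    slash-separated path, so every proper prefix of a path is an ancestor.
    """
    labels = set()
    for code in most_specific:
        parts = code.split("/")
        # Add the class itself and every ancestor on the path to the root.
        for depth in range(1, len(parts) + 1):
            labels.add("/".join(parts[:depth]))
    return labels


# Hypothetical example with two most specific labels.
example = ["01/01/03", "02/10"]
print(sorted(expand_with_ancestors(example)))
# prints ['01', '01/01', '01/01/03', '02', '02/10']
```

Here two most specific labels yield five labels in total, which is the kind of difference the table reports on average.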

The datasets are stored in Weka's ARFF format and are ready to be used with Clus. For each dataset, there are three ARFF files: train, valid, and test. In our article, the valid file was used to tune the F-test stopping criterion. The final model, built on the union of train and valid, was evaluated on test.
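
To reproduce the last step, the train and valid files have to be combined into one training set. The sketch below shows one way to do this in Python on ARFF text, assuming both files declare the same attributes in the same order. The function name and the tiny in-memory examples are hypothetical; this is not part of Clus, and reading and writing the actual files is left out.

```python
def merge_arff(train_text, valid_text):
    """Concatenate the data rows of two ARFF documents with matching headers."""

    def split(text):
        lines = text.splitlines()
        for i, line in enumerate(lines):
            if line.strip().lower() == "@data":
                header = lines[: i + 1]
                # Keep data rows only; skip blank lines and % comments.
                rows = [
                    row for row in lines[i + 1:]
                    if row.strip() and not row.lstrip().startswith("%")
                ]
                return header, rows
        raise ValueError("no @data section found")

    def attributes(header):
        return [h.strip() for h in header if h.strip().lower().startswith("@attribute")]

    train_header, train_rows = split(train_text)
    valid_header, valid_rows = split(valid_text)
    if attributes(train_header) != attributes(valid_header):
        raise ValueError("attribute declarations differ")
    # Reuse the train header and append the valid rows after the train rows.
    return "\n".join(train_header + train_rows + valid_rows) + "\n"


# Hypothetical miniature files, for illustration only.
train = "@relation demo\n@attribute x numeric\n@attribute y numeric\n@data\n0.5,1.0\n"
valid = "@relation demo\n@attribute x numeric\n@attribute y numeric\n@data\n0.7,2.0\n"
merged = merge_arff(train, valid)
print(merged)
```

The merged text keeps a single header and a single @data section, followed by the rows of train and then those of valid.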

FunCat annotated datasets          Gene Ontology annotated datasets
Several of these datasets suffer from non-unique feature representations, making the learning task more difficult. More information about this issue can be found in Pliakos et al., "Representational power of gene features for function prediction", 2015.

Extra materials

Questions?

Please direct questions about Clus-HMC/HSC/SC to Celine Vens, Jan Struyf, and Leander Schietgat.