HMC Software and Datasets

This page contains supporting materials for the paper L. Schietgat, C. Vens, J. Struyf, H. Blockeel, D. Kocev, S. Džeroski, "Predicting gene function using hierarchical multi-label decision tree ensembles", BMC Bioinformatics 2010, 11:2.

Software

The Clus-HMC and Clus-HMC-Ens algorithms are implemented in the Clus system.

Datasets

The datasets used in our experimental comparison come from the field of functional genomics and were kindly provided by Amanda Clare, Zafer Barutcuoglu, Tim Hughes and Fritz Roth. The original versions of D1-D18 can be found here. Datasets D1-D18 originate from the organisms S. cerevisiae and A. thaliana and have annotations from the MIPS Functional Catalogue (FunCat) and the molecular function branch of the Gene Ontology. Dataset D19 originates from the organism M. musculus and has annotations from the three branches of the Gene Ontology. The original data can be found here.

The datasets are stored in Weka's ARFF format and are ready to be used with Clus. Each dataset comes as three ARFF files: train, valid, and test. In our article, the valid file was used to tune the F-test stopping criterion. The final model, constructed on the union of train and valid, was then evaluated on test.
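If the union of train and valid is needed as a single file, it could be built with a small script along the following lines. This is only a sketch and not part of Clus: the function names split_arff and merge_arff are made up for illustration, and it assumes that both files declare identical attributes, that each instance occupies one line, and that the data section starts at an @data line.

```python
def split_arff(text):
    """Split ARFF text into (header_lines, data_rows) at the @data marker."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().lower().startswith("@data"):
            header = lines[: i + 1]
            # Keep only non-empty rows that are not ARFF comments (%).
            rows = [
                row for row in lines[i + 1:]
                if row.strip() and not row.lstrip().startswith("%")
            ]
            return header, rows
    raise ValueError("no @data section found")


def attribute_lines(header):
    """Return the @attribute declarations of a header, whitespace-trimmed."""
    return [
        line.strip() for line in header
        if line.strip().lower().startswith("@attribute")
    ]


def merge_arff(train_text, valid_text):
    """Return ARFF text: the train header, then train rows, then valid rows."""
    train_header, train_rows = split_arff(train_text)
    valid_header, valid_rows = split_arff(valid_text)
    # Assumption: the two files must declare exactly the same attributes.
    # The @relation names are allowed to differ.
    if attribute_lines(train_header) != attribute_lines(valid_header):
        raise ValueError("attribute declarations differ between the files")
    return "\n".join(train_header + train_rows + valid_rows) + "\n"
```

The function works on strings, so the caller reads the train and valid files, passes their contents to merge_arff, and writes the result to a new file. The attribute check is deliberately strict, so that a mismatch between the two headers is reported rather than silently merged.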

S. cerevisiae datasets

FunCat annotated datasets          Gene Ontology annotated datasets

A. thaliana datasets

FunCat annotated datasets          Gene Ontology annotated datasets

M. musculus dataset

Parameter settings for Clus-HMC(-Ens)

Data files for figures in the paper

  • Pooled AUPRC comparison (Fig. 3): csv file
  • Average AUPRC comparison (Fig. 7): csv file
  • Average precision at C4.5H/M's recall (Fig. 8): csv file
  • AUROC comparison (Fig. 12): csv file

Questions?

Please direct questions about Clus-HMC(-Ens) to Leander Schietgat, Celine Vens, and Jan Struyf.