Prof. dr. ir. Yves Moreau
Department of Electrical Engineering, K.U.Leuven
Candidate Gene Prioritization by Genomic Data Fusion
Despite significant advances in omics techniques, the identification of genes causing rare genetic diseases and the understanding of the molecular networks underlying those diseases remain difficult. Gene prioritization attempts to integrate multiple, heterogeneous data sources to identify the candidate genes most likely to be associated with, or causative of, a disorder. Such strategies are useful both to support clinical genetic diagnosis and to speed up biological discovery. Genomic data fusion algorithms are rapidly maturing statistical and machine learning techniques that integrate complex, heterogeneous information (such as sequence similarity, interaction networks, expression data, annotation, or biomedical literature) towards prioritization, clustering, or prediction. Text mining is a particularly powerful methodology underlying genomic data fusion. We present a number of gene prioritization strategies, focusing on kernel methods and network analysis. We illustrate these approaches with several applications in genetic diagnosis and disease gene discovery. We also go beyond learning methods as such by addressing how such strategies can be embedded into the daily practice of geneticists, mostly through collaborative knowledge bases that integrate tightly with prioritization and network analysis methods.
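As a rough illustration of what fusing heterogeneous data sources for prioritization can look like, the sketch below averages per-source ranks of candidate genes. It is a deliberately simple stand-in, not the kernel or network methods presented in the talk; the gene names, source names, scores, and the functions `rank_positions` and `fuse_rankings` are all hypothetical.

```python
from statistics import mean

def rank_positions(scores):
    """Map each gene to its rank (1 = best) given higher-is-better scores."""
    ordered = sorted(scores, key=lambda g: (-scores[g], g))
    return {gene: i + 1 for i, gene in enumerate(ordered)}

def fuse_rankings(sources):
    """Average each gene's per-source rank; a lower fused rank means higher priority."""
    per_source = [rank_positions(s) for s in sources.values()]
    # Only genes scored by every source are fused (an assumption of this sketch).
    genes = set.intersection(*(set(r) for r in per_source))
    fused = {g: mean(r[g] for r in per_source) for g in genes}
    return sorted(fused, key=lambda g: (fused[g], g))

# Hypothetical similarity-to-known-disease-genes scores from three data sources.
sources = {
    "expression": {"GENE_A": 0.9, "GENE_B": 0.4, "GENE_C": 0.1},
    "literature": {"GENE_A": 0.7, "GENE_B": 0.8, "GENE_C": 0.2},
    "interaction": {"GENE_A": 0.6, "GENE_B": 0.5, "GENE_C": 0.3},
}
priority = fuse_rankings(sources)
print(priority)
```

Working on ranks rather than raw scores sidesteps the fact that different sources produce scores on incomparable scales, which is one reason rank-based fusion is a common baseline.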
Prof. dr. Ross D. King
Aberystwyth University, Wales, UK
Automating Biology using Robot Scientists
A Robot Scientist is a physically implemented robotic system that applies techniques from artificial intelligence to execute cycles of automated scientific experimentation. A Robot Scientist can automatically execute cycles of: hypothesis formation, selection of efficient experiments to discriminate between hypotheses, execution of experiments using laboratory automation equipment, and analysis of results. We have developed the Robot Scientist “Adam” to investigate yeast (Saccharomyces cerevisiae) functional genomics. Adam has autonomously identified genes encoding locally “orphan” enzymes in yeast. This is the first time a machine has discovered novel scientific knowledge. To describe Adam's research we have developed an ontology and logical language. Use of these produced a formal argument involving over 10,000 different research units that relates Adam's 6.6 million biomass measurements to its conclusions. We are now developing the Robot Scientist “Eve” to automate drug screening and QSAR development.
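The cycle described above (form hypotheses, select a discriminating experiment, run it, analyse the result) can be sketched as a small loop. This is a toy under stated assumptions, not Adam's actual software: the knockout scenario, the gene names, and the functions `choose_experiment` and `run_cycle` are hypothetical, and the selection rule simply minimises the worst-case number of surviving hypotheses.

```python
def choose_experiment(hypotheses, experiments, predict):
    """Pick the experiment that leaves the fewest hypotheses in the worst case."""
    def worst_case_remaining(exp):
        groups = {}
        for h in hypotheses:
            groups.setdefault(predict(h, exp), []).append(h)
        return max(len(g) for g in groups.values())
    return min(experiments, key=worst_case_remaining)

def run_cycle(hypotheses, experiments, predict, run_experiment):
    """Repeat select/execute/analyse until one hypothesis (or no experiment) is left."""
    hypotheses = list(hypotheses)
    remaining = list(experiments)
    log = []
    while len(hypotheses) > 1 and remaining:
        exp = choose_experiment(hypotheses, remaining, predict)
        remaining.remove(exp)
        outcome = run_experiment(exp)  # stands in for the laboratory automation
        # Keep only hypotheses whose prediction matches the observation.
        hypotheses = [h for h in hypotheses if predict(h, exp) == outcome]
        log.append((exp, outcome))
    return hypotheses, log

# Toy setting: which of four genes encodes the enzyme? Knocking out the
# right gene is assumed to cause a growth defect.
GENES = ["gene1", "gene2", "gene3", "gene4"]
TRUE_GENE = "gene3"  # unknown to the loop; used only to simulate the lab

def predict(hypothesis, knockout):
    return "defect" if hypothesis == knockout else "normal"

def run_experiment(knockout):
    return predict(TRUE_GENE, knockout)

surviving, log = run_cycle(GENES, GENES, predict, run_experiment)
print(surviving, log)
```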
Prof. dr. Andrea Passerini
Università degli Studi di Trento, Italy
Frankenstein Junior: a Relational Learning Approach toward Protein Engineering
Protein engineering is the process of developing novel proteins with useful functions. Rational design aims at exploiting available knowledge to suggest promising mutations, which are later verified by site-directed mutagenesis. Machine learning techniques have been extensively employed for predicting characteristics of proteins (e.g. stability) from sequence information. A naive approach to protein engineering consists of trying all possible mutations of a certain sequence and evaluating each resulting mutant with these predictors. However, this approach is computationally infeasible when multiple mutations have to be jointly evaluated.
We propose a simple relational learning approach to protein engineering. First, we learn a set of relational rules from mutation data; then we use them to generate a set of candidate mutations that are most likely to improve protein function, e.g. by conferring resistance to a certain inhibitor or improving activity on a specific substrate. Encouraging preliminary results were obtained in predicting HIV drug resistance mutations. We will discuss the potential and the limitations of the approach and suggest some directions for future research.
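To make the infeasibility of exhaustive enumeration concrete, the following sketch counts the mutants that differ from a wild-type sequence at exactly k positions, namely C(n, k) · 19^k for a 20-letter amino acid alphabet. The sequence length of 100 is a hypothetical example, not a figure from the talk.

```python
from math import comb

def count_mutants(length, k, alphabet=20):
    """Number of sequences differing from the wild type at exactly k positions."""
    # Choose which k positions mutate, then one of the other residues for each.
    return comb(length, k) * (alphabet - 1) ** k

# Growth of the search space for a hypothetical 100-residue protein.
for k in range(1, 5):
    print(k, count_mutants(100, k))
```

Each additional simultaneous mutation multiplies the number of candidates by several orders of magnitude, which is why candidate generation has to be guided rather than exhaustive.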
Prof. dr. Manfred Jaeger
Aalborg Universitet, Denmark
Factorial Clustering with an Application to Plant Distribution Data
We propose a latent variable approach to multiple clustering of categorical data. We use logistic regression models for the conditional distribution of the observable features given the latent cluster variables. This model supports an interpretation of the different clusterings as distinct, independent factors that shape the distribution of the observed features. We apply the model to the analysis of plant distribution data, where multiple clusterings are of interest for identifying the major underlying factors that determine the vegetation in a geographical region.
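One possible reading of this model, sketched under simplifying assumptions: every latent cluster variable and every observed feature is binary, the latent variables are a priori independent, and each feature has its own logistic regression on the latent variables. The weights, biases, priors, and function names below are hypothetical, and the talk's actual parameterisation may differ.

```python
import math
from itertools import product

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def feature_probs(z, weights, biases):
    """P(x_j = 1 | z): one logistic regression per observed feature."""
    return [sigmoid(b + sum(w_i * z_i for w_i, z_i in zip(w, z)))
            for w, b in zip(weights, biases)]

def marginal_likelihood(x, weights, biases, priors):
    """P(x), summing over all joint settings of the independent latent variables."""
    total = 0.0
    for z in product([0, 1], repeat=len(priors)):
        p_z = math.prod(p if z_i else 1.0 - p for p, z_i in zip(priors, z))
        p_x = math.prod(q if x_j else 1.0 - q
                        for q, x_j in zip(feature_probs(z, weights, biases), x))
        total += p_z * p_x
    return total

# Hypothetical parameters: 3 observed features (e.g. presence of three plant
# species at a site) and 2 latent clusterings.
weights = [[2.0, -1.0], [0.5, 1.5], [-1.0, 0.0]]
biases = [0.0, -0.5, 0.3]
priors = [0.3, 0.6]
print(marginal_likelihood((1, 0, 1), weights, biases, priors))
```

Because each latent variable enters the regressions with its own weight, each can be read as a separate factor influencing the features, which is the interpretation the abstract describes.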
Public PhD defense of Kurt De Grave
Department of Computer Science, K.U.Leuven
Predictive Quantitative Structure-Activity Relationship Models and their use for the Efficient Screening of Molecules
We explore two avenues where machine learning can help drug discovery: predictive models of in vivo or in vitro effects of molecules (known as Quantitative Structure-Activity Relationship or QSAR models), and the selection of efficient experiments based on such models. In the first part, we present methods to improve the predictive power of graph-kernel-based molecule classifiers. The bias of existing graph kernels can be improved by augmenting atom-bond graphs with functional groups. This novel representation allows a machine learning algorithm to use both high-level functional and low-level atomic information, without any change to the kernel or learning algorithm. In internal validation tests, we observe consistently higher AUROCs for all tested kernels. We also introduce a novel, efficient graph kernel called the Neighborhood Subgraph Pairwise Distance Kernel. The feature space of this kernel is the space of pairs of topological balls together with the distance between them. Using this kernel, a standard support vector machine outperforms existing methods in the prediction of all investigated target properties: mutagenicity, in vivo toxicity, antiviral activity, and cancer suppression. In the second part, we tackle the problem of efficient experimentation in drug discovery using optimization assisted by a learned surrogate model, and we evaluate different experiment selection strategies. The algorithm is extended to accommodate drug discovery needs, such as the selection of many parallel experiments. The algorithm is integrated in an automated drug discovery platform, the robot scientist Eve. It is also applied to the optimization of the design of nanofiltration membranes.
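The second part, surrogate-assisted optimization with batches of parallel experiments, can be illustrated with a toy loop. This is a minimal sketch and not the algorithm of the thesis or of Eve: the surrogate is a nearest-neighbour lookup on a one-dimensional descriptor, the selection strategy is purely greedy, and `assay`, the candidate set, and all function names are hypothetical.

```python
def predict(x, tested):
    """Toy surrogate: activity of the nearest already-tested compound."""
    nearest = min(tested, key=lambda t: abs(t - x))
    return tested[nearest]

def select_batch(candidates, tested, batch_size):
    """Greedy strategy: untested candidates with the highest predicted activity."""
    untested = [c for c in candidates if c not in tested]
    ranked = sorted(untested, key=lambda c: (-predict(c, tested), c))
    return ranked[:batch_size]

def optimize(candidates, assay, initial, batch_size, rounds):
    """Alternate between refitting the surrogate and testing a batch."""
    tested = {x: assay(x) for x in initial}
    for _ in range(rounds):
        batch = select_batch(candidates, tested, batch_size)
        if not batch:
            break
        for x in batch:  # conceptually, these assays run in parallel
            tested[x] = assay(x)
    best = max(tested, key=tested.get)
    return best, tested

def assay(x):
    """Hypothetical assay: activity peaks at descriptor value 13."""
    return -(x - 13) ** 2

candidates = list(range(21))
best, tested = optimize(candidates, assay, initial=[0, 10, 20],
                        batch_size=3, rounds=3)
print(best, len(tested))
```

A purely greedy rule like this one can stall in a local optimum; strategies that also reward uncertainty or diversity within a batch are the kind of alternative that an evaluation of selection strategies would compare.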
The candidate gives a 40-minute presentation in Dutch, with English slides, followed by an examination and a deliberation by the jury.
The reception, lunch and coffee break take place in the thermodynamics museum adjacent to the auditorium.
Download the Ph.D. thesis (PDF, 6.5 MB).
There will be complimentary paperbacks available at the defense.
You can also order a paperback from Amazon.com.