DTAI


OT: Probabilistic structured models: learning from large-scale hybrid domains

Period: 10-2011 → 09-2015
Subgroup: ml
Type: project

People

  • Jesse Davis
  • Jan Ramon

Probabilistic logical models have been shown to be successful for modeling biological and medical data. These models are particularly effective for two reasons. First, they can capture the uncertainty and correlations present in the data, such as those between genome and disease, observations and diagnosis, or cell conditions and expression levels. Second, they can represent complex, inter-related, structured data, such as patient clinical histories, molecular structures and protein-protein interaction information.

However, most current models and implementations have two significant limitations. First, many domains contain both discrete (e.g., genetic profile, prescribed drugs) and continuous (e.g., patient temperature, blood pressure, heart rate) data, yet current systems offer almost no support for continuous variables. Second, current methods do not scale well in two ways. One, current implementations depend on loading all data into memory, which is unrealistic given that the model should account for the ever-growing amount of data. Two, current algorithms can only model interactions among a small number, usually fewer than ten, of variables and/or relations, whereas biological and medical phenomena can involve interactions among many variables.

This proposal's goal is to develop learning and inference algorithms that address these issues. For learning, we are given a model structure and data, and the goal is to learn the local models (e.g., conditional probability tables, parameters). Large quantities of data complicate learning. To address this issue, we will pursue two lines of research: developing sampling techniques for relational data that derive accurate estimates from a fraction of the data, and devising intelligent data access schemes that minimize disk I/O operations. Both modeling continuous variables and scaling up will increase the computational cost of inference. We will investigate novel algorithms that limit the cost of inference by exploiting model regularities (e.g., repeated computations) and domain knowledge (e.g., provided by biologists). These steps will be performed in an integrated manner. For example, the learning algorithms could attach a higher cost to model details that greatly increase the cost of inference.
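
As a rough illustration of learning local models from a fraction of the data, the sketch below estimates a conditional probability table from a fixed-size uniform sample drawn in a single pass, so the full data set never has to be held in memory. It is a deliberately simplified, flat (non-relational) stand-in and not the project's method: reservoir sampling and add-alpha smoothing are swapped in as generic techniques, and the names `reservoir_sample`, `estimate_cpt`, and `synthetic_rows`, as well as the synthetic drug/response data, are hypothetical.

```python
import random
from collections import Counter, defaultdict


def reservoir_sample(rows, k, seed=0):
    """Keep a uniform random sample of k rows from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            # Row i replaces a stored row with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


def estimate_cpt(rows, child_values, alpha=1.0):
    """Estimate P(child | parent) from (parent, child) pairs with add-alpha smoothing."""
    counts = defaultdict(Counter)
    for parent, child in rows:
        counts[parent][child] += 1
    cpt = {}
    for parent, child_counts in counts.items():
        total = sum(child_counts.values()) + alpha * len(child_values)
        cpt[parent] = {v: (child_counts[v] + alpha) / total for v in child_values}
    return cpt


def synthetic_rows(n, seed=1):
    """Hypothetical data stream: (drug_prescribed, patient_responded) pairs."""
    rng = random.Random(seed)
    for _ in range(n):
        drug = rng.random() < 0.3
        p_response = 0.8 if drug else 0.2  # assumed ground truth for the demo
        yield drug, rng.random() < p_response


# Estimate the table from 5,000 sampled rows instead of all 100,000.
sample = reservoir_sample(synthetic_rows(100_000), k=5_000)
cpt = estimate_cpt(sample, child_values=(True, False))
print(cpt[True][True], cpt[False][True])
```

With this setup the two printed estimates should land near the assumed true values of 0.8 and 0.2. Relational data is harder than this flat case because sampled objects are linked to one another, which is the difficulty the proposed sampling techniques target.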

This proposal offers benefits on two fronts. For machine learning, it will result in new learning and inference algorithms that are more applicable to real-world domains. From an application perspective, these techniques have the potential to facilitate novel scientific discoveries.
