Machine Learning for Health

Our lab’s research agenda in this area is to develop artificial intelligence, machine learning, and data mining tools that can help clinicians and health care researchers analyze, interpret, and exploit the burgeoning collection of data and knowledge in order to significantly impact health care.

With the widespread adoption of electronic health/medical records (EHRs/EMRs) and continuing improvements in data collection (e.g., wearable sensors, imminent achievement of the $1000-genome), electronic health data is growing at a staggering rate. An emerging area of research tries to use this data, in combination with domain knowledge, to build models (e.g., predictive models) that help with tasks such as clinical decision support.

Detailed topics: 

Learn predictive models from complex data and knowledge

We are interested in building models that combine existing knowledge with collected data. Specifically, our expertise lies in integrating multiple sources of data and knowledge, reasoning at the symbolic level, coping with uncertainty, and learning models from relational databases that capture uncertainty.

Reason and make inferences about data

The goal of learning is to construct a model that can be used to make predictions about the future or inferences about the data. Here, our expertise lies in how to do this efficiently.

Discover insights that are comprehensible to domain experts

Our research focuses on learning models, such as sets of if-then rules, that researchers without a technical background in computer science can readily interpret. This can lead to advances such as novel discoveries or the formation of new research hypotheses for a given application.

Use cases: 
  • Learning from Electronic Health Records
    The analysis of electronic medical records (EMR) data poses significant technical challenges for learning and reasoning. EMRs are relational databases that store a wealth of information about a patient’s clinical history: disease diagnoses, procedures, prescriptions, lab results, etc. Using EMRs it is possible to build models to address important medical problems such as predicting which patients are most at risk for having an adverse response to a certain drug. Successfully analyzing EMRs requires accounting for their relational schemas (i.e., the database contains separate relational tables for diagnoses, prescriptions, labs, etc.), longitudinal nature (e.g., time of diagnosis may be important), and the fact that different patients may have dramatically different numbers of entries in any given table, such as diagnoses or vitals. Furthermore, it is important to model the uncertain, non-deterministic relationships between patients’ clinical histories and current and future predictions about their health status. Our group employs techniques from statistical relational learning (SRL) to address these challenges. SRL offers three benefits for analyzing medical data. First, it can capture important relationships, such as the time between two events occurring or the interactions between two individuals, that occur in the data. Second, it models the inherent uncertainty in the underlying data. Third, it can naturally make use of existing domain knowledge during the learning and mining process. Along with SRL contributions, we develop new algorithmic ideas making our approaches scalable and appropriate for big databases. We have developed a suite of algorithms for analyzing data that focus on automatically discovering statistical, structural regularities (e.g., rules or probabilistic models) from data and have successfully applied them to the following problems.
  • Diagnosing Breast Cancer from Structured Mammography Reports
    Labeling an abnormality as benign or malignant from a structured mammography report is a challenging task for both radiologists and machines. To tackle this problem, we have developed an algorithm that automatically constructs probabilistic first-order logical rules from the data that lead to two important results. First, presently most of the women identified for a possible malignancy on a mammogram are called back unnecessarily, with concomitant stress, procedure (additional imaging and/or biopsy) and expense. Our research, which achieves superior performance compared to both previous machine learning approaches and radiologists, has demonstrated the potential to dramatically reduce this fraction without reducing the number of cancers correctly diagnosed. Second, the (probabilistic) first-order logical rules are easy for domain experts to understand. In our work on mammography, a radiologist collaborator reviewed several learned rules and was particularly intrigued by the following rule:

    Abnormality A in mammogram M for Patient P may be malignant if:
    A has BI-RADS category 5, and
    A has a mass present, and
    A has a mass with high density, and
    P has a prior history of breast cancer, and
    P has another abnormality on same mammogram (B), and
    B has no pleomorphic microcalcifications, and
    B has no punctate calcifications.

    This rule suggested a hitherto unknown relationship between malignancy and high density masses. In general, mass density was not previously thought to be a highly predictive feature.
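
To make the logical structure of the rule above concrete, the following is a minimal Python sketch of its body as a predicate. The `Abnormality` record, its field names, and the `rule_fires` function are hypothetical and chosen only for illustration; they are not the schema or the learner used in this work. The sketch checks only whether the rule's conditions hold and does not reproduce the probability attached to the learned rule.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Abnormality:
    """Hypothetical record for one abnormality in a structured report."""
    abnormality_id: str
    mammogram_id: str
    birads: int
    mass_present: bool
    mass_density: str  # assumed encoding, e.g. "high", "equal", "low", "none"
    pleomorphic_microcalcifications: bool
    punctate_calcifications: bool


def rule_fires(a: Abnormality,
               prior_breast_cancer: bool,
               abnormalities: Iterable[Abnormality]) -> bool:
    """Return True if every condition in the body of the quoted rule holds for `a`."""
    # Conditions on abnormality A itself.
    if not (a.birads == 5 and a.mass_present and a.mass_density == "high"):
        return False
    # Condition on patient P.
    if not prior_breast_cancer:
        return False
    # Relational condition: some other abnormality B on the same mammogram
    # with neither pleomorphic microcalcifications nor punctate calcifications.
    return any(
        b.abnormality_id != a.abnormality_id
        and b.mammogram_id == a.mammogram_id
        and not b.pleomorphic_microcalcifications
        and not b.punctate_calcifications
        for b in abnormalities
    )
```

The last condition is what makes the rule relational: it quantifies over a second abnormality on the same mammogram, which a flat, one-row-per-abnormality feature table cannot express directly.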

  • Predicting Adverse Drug Events from Electronic Medical Records
    Consider the task of learning a model that, based on a patient’s clinical history, can predict at prescription time whether a patient may be susceptible to an adverse reaction (i.e., side effect) of a medication. A patient’s clinical history records information about specific prescribed medications (e.g., name, dosage, duration) or specific disease diagnoses. It does not explicitly mention important connections between different medications or diagnoses, such as which other medications could have been prescribed to treat an illness. This latent information may be necessary to build accurate models. We addressed this problem by developing an algorithm that, while learning a model, automatically discovers clusters of diseases or medicines that are informative for the specific prediction task. We evaluated our algorithm on three real-world tasks where the goal is to use electronic medical records to predict whether a patient will have an adverse reaction to a medication. We found that our approach is more accurate than performing no clustering, pre-clustering, and using expert-constructed medical heterarchies. Furthermore, our algorithm uncovered latent structure that a doctor with expertise in our tasks of interest deemed (a) to capture important known relationships and (b) to suggest possible connections that deserve further investigation.
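
As a rough illustration of why clusters of medications can help, the Python sketch below turns a patient's prescription list into one boolean feature per cluster ("was any medication in this cluster prescribed?"). The function name `cluster_features`, the cluster names, and the placeholder drug identifiers are all assumptions made for this example. The clusters here are supplied by hand, whereas the algorithm described above discovers them while learning the model; the sketch only shows how a cluster might be used once it is known.

```python
from typing import Dict, FrozenSet, Iterable


def cluster_features(patient_drugs: Iterable[str],
                     clusters: Dict[str, FrozenSet[str]]) -> Dict[str, bool]:
    """Map each cluster name to whether the patient was prescribed any member of it."""
    taken = set(patient_drugs)
    # A cluster feature is True when the prescription set and the cluster overlap.
    return {name: bool(taken & members) for name, members in clusters.items()}


# Placeholder identifiers, not real medications or real clusters.
example_clusters = {
    "cluster_1": frozenset({"drug_a", "drug_b"}),
    "cluster_2": frozenset({"drug_c"}),
}
example_features = cluster_features(["drug_b", "drug_x"], example_clusters)
```

Because the feature is defined over a set of medications rather than a single one, two patients who received different but related medications can share the same feature value, which is the kind of latent connection the raw clinical history does not state explicitly.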
Selected publications: