|
|
Hypothesis-finding in Systems Biology
URL: http://research.nii.ac.jp/il/fj

1. Introduction

The main purpose of this research is to develop a framework for discovering unknown patterns, laws, and information in biological databases using logic-based Artificial Intelligence. Given a new observation, we would like to form a hypothesis that is consistent with the existing knowledge. If such a hypothesis, together with the background theory, logically derives the observation, we can consider it a possible explanation. In this research, we will clarify the principle of hypothesis formation and apply it to the discovery of scientific knowledge. Specifically, we aim at finding new and hidden rules in systems biology, and at explaining causal relationships from genotype to phenotype using generic models in biology. In particular, we try to solve the challenging problem of identifying master reactions in metabolic pathways, which are involved in physiological states in the growth of Escherichia coli and Saccharomyces cerevisiae.

"Hypothesis-finding in Systems Biology" is one of the main subgoals of the project entitled "Knowledge-based Discovery in Systems Biology". This project is supported by both the Japan Science and Technology Agency (JST) and the Centre National de la Recherche Scientifique (CNRS, France) under the 2007-2009 Strategic Japanese-French Cooperative Program on Information and Communications Technology including Computer Science. The project brings together researchers from Computer Science, Biology, Artificial Intelligence, and Mathematics. The project members in Japan are: Katsumi Inoue (NII), Taisuke Sato (Tokyo Institute of Technology), Koji Iwanuma (University of Yamanashi), Hidetomo Nabeshima (University of Yamanashi), Yoshitaka Kameya (Tokyo Institute of Technology), Chiaki Sakama (Wakayama University), Asao Fujiyama (NII/National Institute of Genomics), Reiko Tanaka (RIKEN), Yoshitaka Yamamoto (SOKENDAI), and Takehide Soh (SOKENDAI).
The French members are: Andrei Doncescu (LAAS), Louise Trave-Massuyes (LAAS), Gérard Montseny (LAAS), Jean-Louis Uribelarea (LBB), Gérard Goma (LBB), Luis Farinas del Cerro (IRIT), Jacques Demongeot (TIMC-IMAG), Pierre Siegel (University of Provence), Céline Casenave (LAAS), and Emmanuel Montseny (LAAS). Oliver Ray (University of Bristol) also participates in the project. Some of the Japanese members are also supported in part by a 2008-2011 JSPS Grant-in-Aid for Scientific Research (A) (No. 20240016).

2. Goals of Biological Systems Analysis

In systems biology, the behavior of a biological system is predicted from the viewpoint of a complex system that involves time and non-linearity. To this end, we set the following three tasks to be solved in this project:
Two application domains are set in this project. One is systems analysis of metabolic pathways, and the other is diagnosis and therapy prediction for breast cancer. This article mainly describes the former domain; the latter is briefly explained in the last section.

A metabolic pathway is a coherent sequence of enzymatic reactions, which are interconnected via substrates. The study of metabolic pathways is becoming increasingly important for exploiting an integrated, systemic approach to optimizing a desired cellular property or phenotype. Due to interactions at the genomic and metabolic levels, integration of genomics data with genetic, metabolic, and regulatory models is essential for a systematic design of artificial biological systems. Computational tools for precise descriptions of natural pathways would then improve the performance of biochemical products. Among the many approaches to describing metabolic pathways, the logic model is the most precise representation of pathway knowledge. An explanation of metabolic mechanisms is provided by a computational system based on inductive logic programming (ILP) and statistical relational learning (SRL). Given the background theory for the network structure of a pathway and a set of observations, the most probable hypotheses are constructed to explain the behavior of the metabolic system.

The final goal of this project is to develop an in silico model of E. coli and S. cerevisiae, enabling prediction of the responses of metabolic systems to exogenous and endogenous perturbations. Main steps to achieve the proposed goal are as follows:
3. Analytical Models and Quantitative Analysis

Analytical models associated with microorganisms are studied at the Laboratory of Biotechnology and Bioprocess (LBB). The metabolism is organized as a complex network of interconnected reactions, where the global behavior is the result of the individual properties of enzymes and metabolites and the global properties of the network organization. The approach is based on kinetic methods that also account for dynamic behavior. A dynamic model with a Michaelis-Menten formalism has been chosen as a representation of a non-linear allosteric regulation system. Responses of intracellular metabolites to a pulse of glucose are measured continuously by employing automatic stopped-flow and manual fast sampling techniques on the scale of milliseconds. With this approach, metabolite tendencies are determined and are modeled by a logical representation that conforms to the input of hypothesis-finding systems. From the experimental viewpoint, this step will benefit from the experience of LBB, which has been one of the precursors in the field of quantitative analysis of microbial metabolism.

Quantitative analysis of metabolic and signaling pathways is pursued at LAAS. In order to reduce the number of possible dynamic models in a top-down analysis, a logical description consisting of intentional notions and their relationships should be developed. These logical notions and relationships are then used to describe control in bacteria and yeast at a level of abstraction. Abstraction into higher-order concepts such as flux, regulon, and operon pathways is then possible, so that cellular steady states can be characterized. The new model will be handled by the diffusive representation developed at LAAS. We hope to obtain a complete representation of generic microorganisms and to generate new data for metabolites that have not been measured.
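To make the kinetic step of this section more concrete, the following minimal sketch simulates a single Michaelis-Menten reaction and turns the numerical result into qualitative tendency facts of the kind a hypothesis-finding system could take as input. It is an illustration only, not the LBB or LAAS model: the parameter values, the forward-Euler scheme, the function names, and the fact syntax `concentration(..., up/down)` are all assumptions made for this example.

```python
def michaelis_menten_rate(s, vmax, km):
    """Reaction rate v = vmax * s / (km + s) for substrate concentration s."""
    return vmax * s / (km + s)


def simulate(s0, vmax, km, dt, steps):
    """Forward-Euler integration of dS/dt = -v and dP/dt = +v."""
    s, p = s0, 0.0
    trajectory = [(s, p)]
    for _ in range(steps):
        v = michaelis_menten_rate(s, vmax, km)
        consumed = min(v * dt, s)  # never consume more substrate than is left
        s -= consumed
        p += consumed
        trajectory.append((s, p))
    return trajectory


def tendency(before, after, eps=1e-9):
    """Map a numerical change to a qualitative label."""
    if after - before > eps:
        return "up"
    if before - after > eps:
        return "down"
    return "steady"


# Hypothetical parameter values, chosen only for illustration.
traj = simulate(s0=10.0, vmax=1.0, km=0.5, dt=0.01, steps=100)
(s_start, p_start), (s_end, p_end) = traj[0], traj[-1]

# Qualitative facts in an assumed, Prolog-like surface syntax.
facts = [
    f"concentration(substrate, {tendency(s_start, s_end)})",
    f"concentration(product, {tendency(p_start, p_end)})",
]
print(facts)
```

Discretizing measurements into a small set of labels in this manner is one simple way to bridge numerical time series and a logical representation; a real pipeline would have to handle measurement noise and choose thresholds with care.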
Through this analytical and quantitative work, the French team provides experimental data and biological models that are sufficient for the Japanese team to extract the rules governing microorganisms.

4. Hypothesis-finding by Abduction and Induction

Discovery and definition of the rules that govern microorganisms are carried out by the Japanese team. The data sets obtained from experiments are used in the ILP framework.

In this project, both abduction and induction are used to infer hypotheses in ILP, and both are characterized in an ILP system called CF-induction [1]. When we have current background knowledge B and a new observation E is obtained to update B, E should be assimilated into our knowledge in such a way that E changes the current theory B into an augmented theory such that $B \wedge H \models E$. In this case, H is called a hypothesis, which is either abductive or inductive. CF-induction is the only existing system that is sound and complete for finding inductive hypotheses from full clausal theories, and thus can be used for inducing not only definite clauses but also non-Horn (indefinite) clauses and integrity constraints. In the implementation of CF-induction, SOLAR [2] has been used to realize efficient and complete inverse entailment. SOLAR is a sophisticated implementation of a first-order clausal consequence-finding system based on a connection tableau format with the skip operation. In the latest implementation of SOLAR, various state-of-the-art pruning techniques have been implemented to avoid redundant deductions, and a practical search strategy has been introduced for finding important consequences in limited time. SOLAR itself can also be used to compute abduction from full clausal theories [3].

As an initial achievement of the project, a new method for estimating the states of enzyme reactions in metabolic pathways is presented in [5]. The proposed model logically represents causal relations between concentration changes of metabolites and enzyme activities.
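The entailment condition for a hypothesis H (background theory plus H derives the observation E) can be illustrated with a toy propositional example. The sketch below is not CF-induction or SOLAR: it is a brute-force search over subsets of abducible atoms for a small definite-clause theory, and every rule, atom name, and function name in it is a hypothetical choice made for illustration.

```python
from itertools import combinations

# Toy background theory B: each rule is (head, body), read as "body implies head".
RULES = [
    ("up(pyruvate)", {"active(pyruvate_kinase)", "up(pep)"}),
    ("up(pyruvate)", {"active(ldh)", "up(lactate)"}),
    ("up(acetyl_coa)", {"active(pdh)", "up(pyruvate)"}),
]
FACTS = {"up(pep)"}  # facts already known in B

# Atoms that a hypothesis is allowed to assume.
ABDUCIBLES = ["active(pyruvate_kinase)", "active(ldh)", "up(lactate)", "active(pdh)"]


def closure(facts, rules):
    """Forward chaining: every atom derivable from the facts with the rules."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in derived and body <= derived:
                derived.add(head)
                changed = True
    return derived


def abduce(observation, facts, rules, abducibles):
    """Subset-minimal sets H of abducibles such that B together with H derives the observation."""
    explanations = []
    for size in range(len(abducibles) + 1):
        for candidate in combinations(abducibles, size):
            h = set(candidate)
            if any(e <= h for e in explanations):
                continue  # h contains an explanation that was already found
            if observation in closure(facts | h, rules):
                explanations.append(h)
    return explanations


explanations = abduce("up(acetyl_coa)", FACTS, RULES, ABDUCIBLES)
for h in explanations:
    print(sorted(h))
```

Because the toy theory contains only definite clauses, every candidate H is trivially consistent with B. With full clausal theories, as handled by the systems cited in the text, a consistency check would also be required, and exhaustive subset enumeration would not scale.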
When we observe concentration changes of metabolites, we can infer which enzyme reactions are accelerated under the model presented in [5]. This computation can be realized using CF-induction, which has the unique feature of integrating inductive and abductive inferences. While both inductive and abductive inferences are used to find hypotheses that account for given observations, they differ in how they are used in applications. Abduction is applied to find specific explanations (causes) of observations using the current background theory. Induction, on the other hand, is applied to find general rules that hold universally in the domain but are missing from the background theory. In our problem setting, an explanation obtained by abduction corresponds to an estimation of each reaction state. If a background theory is complete with respect to the regulation mechanism of enzyme activities, then the possible states of reactions can be computed using abduction alone. However, since background theories are incomplete in general, it is necessary to use induction to find the missing rules that represent unknown control mechanisms. Therefore, realizing both abductive and inductive inferences in CF-induction is a crucial advantage for analyzing metabolic pathways.

Finally, hypothesis-finding in systems biology is considered for incomplete metabolic pathways. Abduction by SOLAR is applied to graph completion and to the prediction of inhibitors in metabolic pathways. Induction by CF-induction is applied to hypothesis generation in a partial pathway including pyruvate. In any biological application, hypothesis selection is indispensable because the hypothesis space is huge.

5. Statistical Relational Learning

It is possible to perform hypothesis selection in both statistical and non-statistical ways.
Non-statistical hypothesis selection adopts preferred abduction and preferred induction, where the most preferred hypotheses are chosen with respect to a given preference ordering. Statistical hypothesis selection, on the other hand, relies on a probabilistic model specifying a distribution over hypotheses. The most likely hypothesis is selected by computing the probabilities of the hypotheses. One thing we have to consider in the case of probabilistic models is how to incorporate rich biological background knowledge into a model, since currently popular probabilistic models such as log-linear models are feature-based and their description is, logically speaking, at a propositional level. SRL, often called Probabilistic Logic Learning (PLL), is a formalism that describes probabilistic models at the first-order level using relations, functions, logical variables, and so on. It has been pursued for several years and has now established itself as a subfield of machine learning.

We expect that PRISM (PRogramming In Statistical Modeling) [4], an SRL language developed at Tokyo Institute of Technology, is applicable to statistical hypothesis selection. Simply put, PRISM is a probabilistic extension of Prolog that comes with a formal probabilistic semantics called the distribution semantics. It is Turing complete. In addition, it efficiently computes probabilities by dynamic programming and can learn probabilities by the EM algorithm generalized for logic programs. It covers most probabilistic models, such as Bayesian networks and PCFGs, with a single computing/learning algorithm as long as they are discrete and generatively describable. The problem with PRISM, however, is that it is designed for definite clauses, and hence may not be applicable to the selection of hypotheses consisting of non-definite clauses generated by CF-induction. Currently, we are developing another software tool for probability computation/learning based on (Z)BDDs to cope with this problem.
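The idea of statistical hypothesis selection described above can be sketched in a few lines. The example below is not PRISM and does not implement the distribution semantics or EM learning: it simply assumes that each assumed atom has an independent, already known probability and scores a hypothesis by the product of the probabilities of its atoms. The atom names, the probability values, and the function names are hypothetical.

```python
from math import prod

# Hypothetical, independent probabilities for each atom a hypothesis may assume.
PRIOR = {
    "active(pyruvate_kinase)": 0.6,
    "active(pdh)": 0.5,
    "active(ldh)": 0.3,
    "up(lactate)": 0.2,
}

# Hypothetical candidate hypotheses, each a set of assumed atoms.
CANDIDATES = [
    {"active(pyruvate_kinase)", "active(pdh)"},
    {"active(ldh)", "up(lactate)", "active(pdh)"},
]


def hypothesis_probability(hypothesis, prior):
    """Probability of a conjunction of atoms under the independence assumption."""
    return prod(prior[atom] for atom in hypothesis)


def most_likely(hypotheses, prior):
    """Return the candidate with the highest probability, together with that probability."""
    best = max(hypotheses, key=lambda h: hypothesis_probability(h, prior))
    return best, hypothesis_probability(best, prior)


best, score = most_likely(CANDIDATES, PRIOR)
print(sorted(best), round(score, 3))
```

In a real setting the probabilities would be learned from data rather than fixed by hand, and the independence assumption would usually be replaced by a richer first-order probabilistic model.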
We hope that in the very near future we can apply PRISM and/or the new software, combined with CF-induction, to metabolic pathway modeling in order to predict the toxic effects of compounds and to generate the most probable and easily understandable explanations for the user.

6. Prediction of Physiological States

Prediction of physiological states and experimental validation are carried out at LBB. The produced logic programs allow us to define an experimental protocol for metabolites used in the production of other metabolites. Based on experimental data, an in silico model of the central metabolic pathways of industrial microorganisms with predictive capacity is proposed.

The developed tools are validated in concrete experimental situations: ethanol production by S. cerevisiae and genetic modification in the case of E. coli.

7. Diagnosis and Therapy Prediction for Breast Cancer

Biologists and doctors have little knowledge about the pertinence of the cancer descriptors, and such knowledge is mainly empirical and subjective. Moreover, each piece of knowledge generally deals with no more than three parameters, while there are generally at least 10 parameters, such as gender, age, genetic risk, genetic testing, personal and family histories of breast cancer, and parameters like BRCA1, BRCA2, ATM, CHEK2, p53, and PTEN. A lot of information is possibly redundant, unusable, or even erroneous. It is therefore important to develop theoretical tools for analyzing these parameters in order to confirm or revise the expert's knowledge and then to add or remove descriptors from learning. A machine learning or data mining tool must be able to handle numerical and symbolic breast cancer data for grading the cancer's sub-types according to medical diagnosis. Specifically, a consistent set of knowledge should be extracted for selecting the most appropriate treatment, based on the existing information and new data provided by micro-array technology.
In data mining processes, models and patterns must be extracted with acceptable computational resources. This reduction of the data space through the construction of a learning model can also be achieved by defining logical predicates and using ILP. In the first step, theories constructed by an ILP tool will be evaluated by a domain expert, and then we will go further by using the learned theories to solve new problem instances.

References

[1] K. Inoue, Induction as consequence finding, Machine Learning, 55(2):109-135, 2004.
[2] H. Nabeshima, K. Iwanuma and K. Inoue, SOLAR: a consequence finding system for advanced reasoning, Proceedings of the 11th International Conference on Automated Reasoning with Analytic Tableaux and Related Methods, Lecture Notes in Artificial Intelligence, LNAI, 2796, Springer, pp.257-263, 2003.
[3] O. Ray and K. Inoue, A consequence finding approach for full clausal abduction, Proceedings of the 10th International Conference on Discovery Science, Lecture Notes in Artificial Intelligence, LNAI, 4755, Springer, pp.173-184, 2007.
[4] T. Sato and Y. Kameya, New advances in logic-based probabilistic modeling by PRISM, Probabilistic Inductive Logic Programming, Lecture Notes in Artificial Intelligence, LNAI, 4911, Springer, pp.118-155, 2008.
[5] Y. Yamamoto, K. Inoue and A. Doncescu, Estimation of possible reaction states in metabolic pathways using inductive logic programming, Proceedings of the 22nd International Conference on Advanced Information Networking and Applications, IEEE Computer Society, pp.808-813, 2008.
|