PhD Thesis COMPUTATIONAL FRAMEWORKS FOR WIRING HUMAN AND YEAST PROTEOMES


Abstract

One of the major challenges of modern systems biology is to decipher how the information for life processes is encoded in the protein networks of a complex organism. Interactions among proteins are an important basis for the biological complexity of higher organisms, and in recent years there have been several large-scale efforts to map protein interactions in model organisms. In this thesis, I address the problem of how to wire proteomes, a question central to systems biology. In human (H. sapiens), the difficulties in generating protein interaction data have stimulated the development of sequence-based prediction frameworks that exploit interologs and the co-evolutionary information of interacting partners. Following this trend, pipelines were developed to transfer interaction data from model organisms to human (the HomoMINT interactome) and to look for correlations between the distance matrices representing the trees of the ortholog groups to which the human reference proteins under analysis belong. A third method, based on identifying residues that display statistically significant patterns of co-evolution as measured by a mutual information metric, was also tested. In yeast (S. cerevisiae), the questions related to the wiring problem remain unanswered in spite of the abundance of protein interaction data from high-throughput experiments. Unfortunately, these large-scale studies show embarrassing discrepancies in their results and coverage.
The recent completion of a comprehensive literature curation effort has made available an interesting new reference set and stimulated the building of a simple logistic regression model of the wiring of the yeast proteome. The model is based on predictors of functional relationships defined for the pairs of interacting proteins in the reference set: the probability of sharing a path on the Gene Ontology trees, the degree of correlated evolution, the degree of co-expression, and the degree of co-abundance. Moreover, the value distributions of the analysed genomic features differ from the distributions of the same predictors computed in previously defined null models (i.e. artificial protein networks). The model was evaluated by standard criteria and ROC curve analysis. The complete frameworks were implemented in a suite of R and Perl programs.
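
As an illustration of the modelling step described above, the following is a minimal sketch of a logistic regression on four pairwise predictors, scored by the area under the ROC curve. It is not the thesis implementation, which was written in R and Perl. The synthetic data, the function names (make_toy_pairs, fit_logistic, predict_proba, roc_auc), the learning rate and the number of epochs are all assumptions made for the example; the four feature columns merely stand in for the Gene Ontology, correlated evolution, co-expression and co-abundance predictors.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, x):
    """Probability that a protein pair interacts; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit the weights by batch gradient descent on the mean log-loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xij in enumerate(xi):
                grad[j + 1] += err * xij
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_toy_pairs(n=200, seed=0):
    """Synthetic stand-in data (an assumption, not thesis data).

    Each row holds four hypothetical predictor values for one protein pair;
    label 1 mimics a reference-set pair, label 0 a null-model pair.
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        mean = 0.7 if label else 0.3
        X.append([rng.gauss(mean, 0.15) for _ in range(4)])
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_pairs()
    w = fit_logistic(X, y)
    scores = [predict_proba(w, x) for x in X]
    print("AUC on the toy data:", roc_auc(scores, y))
```

The AUC here is computed on the same toy pairs used for fitting, so it only shows the mechanics; a real evaluation would score held-out pairs.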



Paper Details

Authors

M. Persico

Language

English