SFU Computational Biology Seminar 2011-12

2011-2012 Computational Biology Seminar

We are happy to announce the return of the Computational Biology Seminar at SFU. We plan to meet roughly once a month to discuss interesting research topics in computational biology. This seminar is sponsored by BCID (Bioinformatics for Combating Infectious Diseases), CORDS (Centre for Operations Research and Decision Sciences) and IRMACS. Unless otherwise noted, talks will be at 1:00 on Thursdays in the IRMACS Theatre, ASB 10900. Please contact Cenk Sahinalp or Tamon Stephen if you would like to speak.

Date Speaker Title and Abstract

July 10th

*Tuesday, 10 a.m.*
Bojan Losic

Mount Sinai Hospital

CNV Detection in Whole Genome Sequencing

Abstract:
It is well established that copy number variations (CNVs) are a major source of genomic variability between any two individuals. At present, the genomic resolution of CNVs that can be detected using high resolution microarray technologies is on the order of a kilo-base. Advances in next-generation sequencing technologies, however, enable us to identify, in principle, structural variations at the single-nucleotide level.

In this talk I will describe a new method we have developed to detect breakpoints of CNVs from next-generation sequencing data, which leverages wavelets to carry out an inherently multi-scale search for CNVs over all genomic scales. This is joint work with the Muthuswamy group at OICR.
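As a rough illustration of what a multi-scale, wavelet-style breakpoint search can look like, the sketch below computes an unnormalized Haar-like difference (mean read depth to the right of a position minus mean read depth to the left) at several window sizes and reports the position with the largest jump. This is not the speaker's method; the function names, the choice of scales, and the toy depth signal are all assumptions made for illustration.

```python
def haar_detail(signal, scale):
    """Haar-like detail at each position: mean of the next `scale` values
    minus the mean of the previous `scale` values (unnormalized)."""
    out = {}
    for i in range(scale, len(signal) - scale + 1):
        left = sum(signal[i - scale:i]) / scale
        right = sum(signal[i:i + scale]) / scale
        out[i] = right - left
    return out


def strongest_breakpoint(signal, scales=(2, 4, 8)):
    """Return (position, detail, scale) with the largest absolute detail
    across all scales; ties keep the first one found."""
    best = None
    for s in scales:
        for pos, d in haar_detail(signal, s).items():
            if best is None or abs(d) > abs(best[1]):
                best = (pos, d, s)
    return best


# Hypothetical read-depth signal: a duplicated segment spanning positions 20..29.
depth = [30] * 20 + [60] * 10 + [30] * 20
best = strongest_breakpoint(depth)
```

Small scales localize sharp breakpoints, while larger scales average over more positions and are less sensitive to noise; a real analysis would combine evidence across scales rather than take a single maximum.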

May 22nd

*Tuesday*
Haixu Tang

Informatics and Computing

Indiana University

Privacy-Preserving Sharing and Analysis of Human Genomic Data

Abstract:
Given rapid cost reduction, genome sequencing may soon become a routine tool for clinical diagnosis and therapy selection. However, the analytical demand is hard to meet because computational and personnel resources for storing and analyzing sequencing data are expensive. Furthermore, there are barriers related to the complicated procedures researchers must follow to get access to sensitive human genomic data, which are designed to protect the privacy of human subjects. A critical issue is the need for techniques that offer practical protection: data are expected to be conveniently used by biomedical researchers and healthcare practitioners, but need to be protected to the highest possible level, making it impractical to re-identify human subjects from the data. These techniques are also expected to help outsource the intensive computation involved in data analyses to low-cost public/commercial computing systems, without endangering the privacy of the parties that donate the data. In this talk, I will discuss our recent work on developing these techniques. I will first present our practical approaches to quantifying the potential privacy risks in human genome data, using genome-wide association data as an example, and the techniques to mitigate these threats. In the second part of my talk, I will present the computational technique that allows a human genome center to leverage low-cost public resources for the analysis of human sequencing data. Based on thorough privacy analysis, I will show that this technique can outsource most human genome computing to the public server, while completely preserving the privacy of the participants of the genome studies.

This is joint work with Professor Xiaofeng Wang at Indiana University, and several graduate students.

April 5th
Bonnie Kirkpatrick

Computer Science

University of British Columbia

Non-Identifiable Pedigrees and a Bayesian Solution

Abstract:
Some methods aim to correct or test for relationships or to reconstruct the pedigree, or family tree. We show that these methods cannot resolve ties for correct relationships due to non-identifiability of the pedigree likelihood, which is the probability of inheriting the data under the pedigree model. This means that no likelihood-based method can produce a correct pedigree inference with high probability. This lack of reliability is critical both for health and forensics applications.

Pedigree inference methods use a structured machine learning approach where the objective is to find the pedigree graph that maximizes the likelihood. Known pedigrees are useful for both association and linkage analysis which aim to find the regions of the genome that are associated with the presence and absence of a particular disease. This means that errors in pedigree prediction have dramatic effects on downstream analysis.

We present the first discussion of multiple typed individuals in non-isomorphic pedigrees where the likelihoods are non-identifiable. Additionally, deeper understanding of the general discrete structures driving these non-identifiability examples has been provided, as well as results to guide algorithms that wish to examine only identifiable pedigrees. This paper introduces a general criterion for establishing whether a pair of pedigrees is non-identifiable and two easy-to-compute criteria guaranteeing identifiability. Finally, we suggest a method for dealing with non-identifiable likelihoods: use Bayes' rule to obtain the posterior from the likelihood and prior. We propose a prior guaranteeing that the posterior distinguishes all pairs of pedigrees.
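The role of the prior can be seen in a toy calculation. The sketch below applies Bayes' rule to two hypothetical pedigrees whose likelihoods are identical, so the likelihood alone cannot separate them, while an unequal prior yields distinct posteriors. The pedigree labels and all numeric values are invented for illustration and do not come from the talk.

```python
from fractions import Fraction


def posterior(likelihoods, priors):
    """Bayes' rule: posterior is proportional to likelihood times prior."""
    unnorm = {k: likelihoods[k] * priors[k] for k in likelihoods}
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}


# Hypothetical non-identifiable pair: both pedigrees give the data the same likelihood.
likelihoods = {"pedigree_A": Fraction(1, 8), "pedigree_B": Fraction(1, 8)}
# Hypothetical prior that prefers pedigree_A.
priors = {"pedigree_A": Fraction(2, 3), "pedigree_B": Fraction(1, 3)}

post = posterior(likelihoods, priors)
```

Because the likelihoods cancel in the normalization, the posterior here simply equals the prior: when the data cannot distinguish two pedigrees, any separation in the posterior comes entirely from the prior.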

February 2nd
Wyeth Wasserman

Centre for Molecular Medicine and Therapeutics

Child and Family Research Institute

University of British Columbia

MeSH Over-representation Profiles (MeSHOPs): Generation and applications of quantitative annotation vectors for biological entities

Abstract:
A key challenge in computational biology is the transfer of literature-based knowledge into formats suitable for large-scale computation. Challenges range from the implementation of controlled vocabularies and ontologies to expert curation of knowledge repositories. While domain-focused efforts offer the greatest resolution, annotation providing broad coverage across biomedical research could facilitate the creation of algorithms of wide utility. In this lecture I will introduce MeSH Overrepresentation Profiles, which build on decades of curator-assigned keywords attached to biomedical abstracts in the Medline database. Given a set of articles sharing reference to a common entity, one can generate an annotation vector describing the curator keyword assignments over-represented within the set. An entity may be any object or concept for which a group of articles can be defined, such as a disease, gene, drug or author. Metrics are explored for the quantitative comparison of any two MeSHOPs, and the methods are extended for the prediction of entity-entity relationships. The computational challenges of comparing vast numbers of MeSHOPs are discussed.

The research to be described was performed with Warren Cheung and B.F. Francis Ouellette.
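One common way to score whether a keyword is over-represented in a set of articles is a hypergeometric tail probability; the sketch below uses it to build a small annotation vector. The abstract does not say which statistic MeSHOPs actually use, so this choice, the function names, and the counts are assumptions for illustration only.

```python
from math import comb


def hypergeom_sf(k, n_total, n_with_term, n_set):
    """P(X >= k): probability that at least k of the n_set articles carry the
    term, when n_with_term of n_total articles carry it overall."""
    upper = min(n_with_term, n_set)
    hits = sum(
        comb(n_with_term, i) * comb(n_total - n_with_term, n_set - i)
        for i in range(k, upper + 1)
    )
    return hits / comb(n_total, n_set)


def overrepresentation_profile(set_counts, background_counts, n_set, n_total):
    """Map each term to its over-representation p-value for one entity."""
    return {
        term: hypergeom_sf(k, n_total, background_counts[term], n_set)
        for term, k in set_counts.items()
    }


# Hypothetical counts: 1000 articles overall, 20 of them mention the entity.
background = {"Neoplasms": 50, "Humans": 800}
entity_set = {"Neoplasms": 5, "Humans": 16}
profile = overrepresentation_profile(entity_set, background, n_set=20, n_total=1000)
```

A term that is rare overall but frequent in the entity's articles receives a small p-value, whereas a ubiquitous term does not; the resulting term-to-score mapping is one possible form of the quantitative annotation vector described above.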

December 1st
Paul Pavlidis

Centre for High-Throughput Biology (CHiBi)

University of British Columbia

The challenge of interpreting gene lists in the face of multifunctionality

Abstract:
In functional genomics, prior knowledge of gene function is used in numerous contexts, both in 'manual' analyses and in computational approaches. This includes "guilt-by-association" approaches to predicting function and "set enrichment" analyses used to interpret gene lists. I will describe recent investigations into the role gene multifunctionality plays in such analyses, concluding that many analyses are dominated by its impact. I will first define multifunctionality and explain its general importance. I will then briefly review published and unpublished work on the correlation between network node degree and multifunctionality. Finally, I will describe unpublished findings on the impact of highly multifunctional genes on "enrichment" analyses.

October 6th
Nicolas Lartillot

Department of Biochemistry,

University of Montreal

Reconstructing rates, traits and dates. Towards integrated models of macroevolution.

Abstract:
Estimating divergence times, understanding molecular evolutionary mechanisms, or testing macroevolutionary hypotheses about patterns of diversification and morphological evolution are usually considered as separate research questions, addressed by distinct, although overlapping, scientific communities. Yet many connections deserve to be made between these various topics in evolutionary sciences. As a first step towards integrated macroevolutionary modeling, we recently introduced a method for jointly estimating divergence times, substitution rates, life-history traits, and correlations between them, along phylogenies. The framework is a fusion between classical codon models, relaxed molecular clocks, and comparative independent-contrast methods, and works by conditioning a probabilistic model simultaneously on a codon sequence alignment and on a matrix of quantitative characters such as morphological data or life-history traits.

As an application of the method, we will present a reconstruction of body size, substitution rate, dN/dS and GC content at the scale of placental mammals. Our analysis confirms classical predictions of the nearly neutral theory concerning generation time and population size, while supporting the hypothesis of small and fast-reproducing ancestors with large populations for most mammalian orders, and suggesting the existence of systematic trends in body-size evolution.
Reference: Lartillot N and Poujol R, 2011. A phylogenetic model for investigating correlated evolution of substitution rates and continuous phenotypic characters. Molecular Biology and Evolution, 28:729.
Software Download
Work in collaboration with Frédéric Delsuc, Raphael Poujol and Nicole Uwimana.

Archives of the 2009-10 and the 2010-11 Computational Biology Seminars.

Last modified September 29th, 2011.
