Student Perspectives: Unravelling Ancestry – When Genes Don’t Follow the Family Tree

A post by Daniella Montgomery, PhD student on the Compass programme.

Introduction

In my project, I am working with my two supervisors, Dan Lawson from the School of Mathematics and Sion Bayliss from the School of Veterinary Science, to investigate how genomic data and their inferred ancestry trees can be analysed to detect problematic lineages of bacterial pathogens.

An ancestry tree describes how genetic data is passed down through generations. By understanding the evolution of bacteria, we can develop strategies to alert us when dangerous pathogens evolve. Bacteria typically have only one parent, and if this were the whole story, their evolution could be described as a tree. However, bacteria also frequently evolve through horizontal gene transfer, where genetic data is exchanged between lineages with different ancestries, as seen in Figure 1. This disrupts the traditional parent-to-offspring tree, so the ancestry must instead be represented by a more complex graph.

Figure 1: An example of a phylogeny with horizontal gene transfer shown by the red dashed line and the resulting recombined lineage shown as a full red line breaking the structure of this tree.

In this case, each location on the genome may be described by a different tree obtained by following the correct parent at that location, i.e. the “left” or “right” parent of the red individual in Figure 1. These trees can be called “local ancestries”.

Simulating Ancestry with Msprime

The Python package msprime allows us to simulate genetic ancestry using the coalescent method. The coalescent is a backwards-in-time stochastic process that follows the lineages of n individuals sampled at random from a population, as seen in Figure 2. As we go back in time, each lineage picks its parent at random from the previous generation. When two lineages pick the same parent, they coalesce into a single lineage. This process is repeated until a common ancestor of the whole sample is reached.
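The backwards-in-time process described above can be sketched in a few lines of plain Python. This is only a minimal discrete-generation illustration, not how msprime itself is implemented; the function name `simulate_coalescent` and the values of `sample_size` and `population_size` are assumptions chosen for the example.

```python
import random


def simulate_coalescent(sample_size, population_size, seed=None):
    """Trace sampled lineages backwards until a single common ancestor remains.

    Returns a list of (generation, lineages) pairs, one per generation in
    which at least one coalescence happened. Each lineage is the frozenset
    of sample labels that descend from it.
    """
    rng = random.Random(seed)
    lineages = [frozenset([i]) for i in range(sample_size)]
    generation = 0
    events = []
    while len(lineages) > 1:
        generation += 1
        parents = {}
        for lineage in lineages:
            # Each lineage independently picks a parent in the previous generation.
            parent = rng.randrange(population_size)
            # Lineages that pick the same parent merge (coalesce).
            parents[parent] = parents.get(parent, frozenset()) | lineage
        merged = list(parents.values())
        if len(merged) < len(lineages):
            events.append((generation, merged))
        lineages = merged
    return events


# Hypothetical example: 3 sampled lineages in a population of 10.
for generation, lineages in simulate_coalescent(3, 10, seed=1):
    print(generation, [sorted(lineage) for lineage in lineages])
```

The generations recorded in `events` play the role of the coalescence times, and the merged sets record which lineages coalesce with which, so together they determine the tree.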

Figure 2: A depiction of the coalescent method, taken from [1], for a population of 10 individuals and a sample size of 10. By keeping track of the times between coalescence events (T(3) and T(2)) and which lineages coalesce with which, we have a full picture of the phylogenetic tree.

The Impact of Gene Conversion

In this experiment, I am investigating how population structure manifests in genetic data and how this is affected by varying gene conversion rates. Gene conversion is a type of horizontal gene transfer where a donor genome replaces a sequence of DNA in a homologous acceptor genome. Our simulation has one population that splits into two populations with some gene conversion within the populations, as seen in Figure 3. From this, we can obtain local pedigrees across the genome for several sample genomes. Each local pedigree has a complex history, but gene conversion allows each gene to have a different random history.

Figure 3: A conceptual picture of the true population structure and the local pedigree of the sampled population obtained from simulation with nodes coloured by population. Blue represents the ancestral population and red and green represent the two descendant populations, A and B. The leaf nodes are labelled for comparison with future analysis.

Analyzing the Data

One common way to visualise complex histories is through Principal Component Analysis (PCA), in which an eigenvalue decomposition of the data places similar genomes close together in a much lower-dimensional space. This dimensionality reduction also allows us to visualise certain population structure characteristics [2]. For example, in all of our 2D PCA graphs in Figure 4, we can see a clear split between population A and population B.
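As a rough illustration of the eigenvalue decomposition behind PCA, the sketch below uses NumPy on a toy genotype matrix. The toy data (two groups of genomes with different allele frequencies) is an assumption standing in for simulated genomes, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for simulated genotypes (an assumption for illustration):
# 10 genomes x 1000 biallelic sites, where the two populations have
# different allele frequencies at each site.
n_per_pop, n_sites = 5, 1000
freq_a = rng.uniform(0.1, 0.9, n_sites)
freq_b = rng.uniform(0.1, 0.9, n_sites)
genotypes = np.vstack([
    rng.random((n_per_pop, n_sites)) < freq_a,  # population A
    rng.random((n_per_pop, n_sites)) < freq_b,  # population B
]).astype(float)

# Centre each site, then eigendecompose the genome-by-genome covariance matrix.
centred = genotypes - genotypes.mean(axis=0)
covariance = centred @ centred.T / n_sites
eigenvalues, eigenvectors = np.linalg.eigh(covariance)

# eigh returns eigenvalues in ascending order, so reverse to put PC1 first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Coordinates of each genome on the first two principal components.
pcs = eigenvectors[:, :2] * np.sqrt(np.clip(eigenvalues[:2], 0, None))
print(pcs)
```

Plotting the two columns of `pcs` against each other gives a 2D PCA plot of the kind shown in Figure 4, with the first component separating the two groups.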

However, there is a limit to how interpretable these PCs are. We use the dendrogram from hierarchical clustering to help sort our data such that more similar data is kept together. Then we create a covariance plot of how similar the principal components of each genome are to each other. This plot is a rudimentary method to help us visualize the population structure of the simulation’s resulting lineages seen in Figure 5a. The population structure is clear, but there is still structure given by the random pedigree shared by all individuals.
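The sorting step can be sketched with SciPy's hierarchical clustering. The PC coordinates here are toy values generated for the example, and the choice of average linkage is an assumption rather than the method used for the figures.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage

rng = np.random.default_rng(1)

# Toy stand-in for per-genome principal-component coordinates (an assumption):
# 10 genomes with 4 PCs each, from two groups listed in an interleaved order.
labels = np.array([0, 1] * 5)
pcs = rng.normal(size=(10, 4)) + 5.0 * labels[:, None]

# Average-linkage clustering; the dendrogram's leaf order places similar
# genomes next to each other.
dendrogram_tree = linkage(pcs, method="average")
order = leaves_list(dendrogram_tree)

# Genome-by-genome covariance of the centred PC coordinates,
# with rows and columns reordered by the dendrogram.
centred = pcs - pcs.mean(axis=0)
covariance = centred @ centred.T / pcs.shape[1]
sorted_covariance = covariance[np.ix_(order, order)]

print(labels[order])
```

Displaying `sorted_covariance` as an image, for example with a heatmap, shows the two groups as blocks along the diagonal, which is the kind of plot described above.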

Figure 4: The principal component analysis plots with colours showing the true populations for a gene conversion rate of 1e-6.

Figure 5: Covariance matrices for increasing gene conversion rates (reading left to right, top to bottom) of 1e-6, 1e-5 and 1e-4, showing a breakdown of the sub-population structure.

In Figures 5a to 5c, we can see that as the gene conversion rate increases, the covariance matrix reflects any single random history less and less, and instead “averages out” into the population structure. This visualizes how the dependence on one particular history breaks down as gene conversion makes the genomes within each population more similar to each other.

If you would like to know more about this topic, please contact me at ic23897@bristol.ac.uk.

[1] Rosenberg, N., Nordborg, M. Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nat Rev Genet 3, 380–390 (2002). (https://doi.org/10.1038/nrg795)

[2] McVean, G. A genealogical interpretation of principal components analysis. PLoS Genetics 5, e1000686 (2009). (https://doi.org/10.1371/journal.pgen.1000686)