Sunday, 6 April 2014

Medical genomics – an introduction


The proliferation of genomics technology within the last decade has left relatively few areas of medicine untouched. Much of this is built upon the common disease, common variant hypothesis (Lander 1996). This suggests that, in contrast to uncommon Mendelian conditions caused by rare, single-gene defects, many of our more common diseases arise in part from a predisposition conferred by a combination of relatively common genetic variants. The majority of such genetic variation occurs via single-nucleotide polymorphisms (SNPs); individually, any SNP associated with a common, complex disease is expected to contribute only a small additional risk. Whilst the summative effect of multiple such variants may produce more significant risk profiles, of greater importance is the potential for such associations to enhance our understanding of the underlying pathogenic mechanisms. By investigating how a variant contributes to increased risk of a disease, novel mechanisms and potential therapeutic interventions may be discovered.

In this month’s Lancet Neurology, the genetics of stroke is reviewed (Falcone 2014). Whilst providing an excellent example of the wealth of information that can be obtained from modern genomics, it also highlights how the modern clinician increasingly needs an appreciation of how this information is acquired. At least a superficial understanding of the techniques involved is required, and these are reviewed in this article.



SNPs, linkage disequilibrium and imputation – reducing what we need to look for

It is estimated that the human genome contains around 38 million variants, over 90% of these being SNPs (Levinson 2009). The combinations of such variants found within any one person are not, however, random – certain variants tend to be associated with each other. The explanation for this requires a brief discussion of homologous recombination.

During meiosis, homologous recombination allows the exchange of genetic information between homologous chromosomes. It therefore enhances variation, ultimately promoting evolution via natural selection. Such exchanges essentially shuffle the pack of genetic information, producing an element of randomisation in terms of which genetic variations are passed on to one’s offspring.

The nature of homologous recombination, however, means that the chance of any two variants being passed on together is not always equal. Recombination occurs at specific “recombination hotspots” on the chromosomes. If two variants are separated by some distance, with said hotspots located between them, it becomes likely that recombination will shuffle the pack and prevent them being passed on to the same offspring. If, on the other hand, two genetic variants are located close together on the chromosome with no recombination hotspots between them, they are likely to be passed on together through many generations. Such non-random association of two genetic markers is known as linkage disequilibrium (LD) (Falcone 2014). In general, two genetic variants that are within 10,000 base pairs of each other will demonstrate LD; the closer they are, the stronger this will be. This property can also be extended beyond two loci – LD blocks represent areas of DNA with potentially multiple variants, all of which tend to segregate together. Such blocks are also often referred to as haplotypes – combinations of alleles at adjacent loci that tend to be inherited together.
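The strength of LD between two sites is usually summarised with the standard measures D, D' and r². The following is a minimal sketch of those textbook formulas, assuming two biallelic loci; the function name and the example frequencies are illustrative and do not come from the studies cited here.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D_prime, r_squared) for two biallelic loci.

    p_ab : frequency of haplotypes carrying allele A and allele B together
    p_a  : frequency of allele A at the first locus
    p_b  : frequency of allele B at the second locus
    """
    # D is the excess of the AB haplotype over what independence would predict.
    d = p_ab - p_a * p_b

    # D' rescales D by the largest magnitude it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the alleles at the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r_squared


# Two variants that always travel together (illustrative frequencies).
print(ld_stats(p_ab=0.2, p_a=0.2, p_b=0.2))
# Two variants that are inherited independently.
print(ld_stats(p_ab=0.06, p_a=0.2, p_b=0.3))
```

An r² close to 1 means that knowing the allele at one site almost determines the allele at the other, which is the property the tagging approach relies on.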

Linkage disequilibrium is a useful property – it allows us to simplify the huge amount of variation, as demonstrated by the HapMap project (HapMap n.d.). As stated above, it is thought that the genome contains around 38 million variants, most being SNPs – there will be either a major allele or a minor allele at each of these sites. The frequency of a SNP is represented by its minor allele frequency (MAF): >5% denotes a common SNP, 0.5-5% a low-frequency SNP and <0.5% a rare SNP. The HapMap project, by densely genotyping a reference cohort, estimated that there are around 10 million SNPs with a MAF >1%, and that many of these SNPs are contained within the same haplotypes due to being in LD. They therefore set out to identify single SNPs that could be used to identify whole haplotypes; such “tagged” loci would greatly reduce the number of variants that needed to be analysed because, if one was shown to be present, it could be assumed that the others were there also. This process is known as imputation – using reference sources from densely genotyped cohorts to create haplotypes, so that when a variant contained in such a haplotype is detected in a study, it can be assumed that all of the other related variants are also present.
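The MAF bands quoted above can be expressed as a short sketch. The function names and the allele counts are hypothetical, and how a value falling exactly on a boundary is treated is an assumption made here.

```python
def minor_allele_frequency(allele_count_1, allele_count_2):
    """MAF of a biallelic SNP from the counts of its two alleles in a cohort."""
    total = allele_count_1 + allele_count_2
    if total == 0:
        raise ValueError("no alleles counted")
    return min(allele_count_1, allele_count_2) / total


def classify_maf(maf):
    """Label a SNP using the bands given in the text.

    >5% common, 0.5-5% low frequency, <0.5% rare. Exactly 5% is treated
    as low frequency and exactly 0.5% as low frequency (an assumption).
    """
    if not 0 <= maf <= 0.5:
        raise ValueError("a minor allele frequency must lie between 0 and 0.5")
    if maf > 0.05:
        return "common"
    if maf >= 0.005:
        return "low frequency"
    return "rare"


# Hypothetical cohort of 500 people (1,000 chromosomes) at one SNP.
maf = minor_allele_frequency(880, 120)
print(maf, classify_maf(maf))
```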

As a result of the HapMap project, 500,000 tagged SNPs were identified that, via imputation, would allow one to assess for the presence of 80% of all common (>5% MAF) SNPs in European populations. Subsequent research, particularly the 1000 genomes project (Consortium 2010), however, has significantly improved this:

By using next generation sequencing, the 1000 genomes project produced low-coverage whole genome sequencing of 2,500 individuals. Coverage is the average number of times each position in an individual genome is sequenced; because the genome is broken up into random fragments before sequencing, if one were to sequence only a length of DNA equivalent to the whole genome, some areas would be sequenced more than once, and some areas not at all. It is accepted that 28x coverage is sufficient to ensure the whole genome is sequenced in sufficient detail; due to restrictions on resources, a low-coverage 4x approach was used – they have calculated that this is nonetheless sufficient to discover 95% of variants present at a 1% frequency – far more than was available from the HapMap. This equates to a dataset containing around 20 million SNPs.
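Why 1x coverage leaves gaps can be illustrated with a simple model. The sketch below assumes that reads land uniformly at random, so that the depth at any one position follows a Poisson distribution with mean equal to the coverage. This idealisation is an assumption introduced here and is not how the 1000 genomes project derived its figures.

```python
import math


def prob_depth_at_least(mean_coverage, min_depth):
    """Probability that a given position is sequenced at least `min_depth` times.

    Assumes depth ~ Poisson(mean_coverage), i.e. reads placed uniformly at random.
    """
    # Sum the Poisson probabilities of depths 0 .. min_depth - 1, then take the complement.
    below = sum(
        math.exp(-mean_coverage) * mean_coverage ** k / math.factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - below


for coverage in (1, 4, 28):
    missed = 1.0 - prob_depth_at_least(coverage, 1)
    print(f"{coverage}x: fraction of positions never sequenced = {missed:.3g}")
```

Under this model, sequencing one genome's worth of DNA leaves roughly a third of positions unread, which is why single-sample variant calling needs much deeper coverage, whereas a low-coverage design relies on pooling information across many individuals.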

Chip arrays – increasing what we can look for

The development of SNP microarrays in parallel to the above advances has been essential to the proliferation of genomics knowledge. A key point to appreciate, however, is that such arrays can only assess the presence or absence of known SNPs. The reason for this becomes clear with a superficial understanding of the technology – in order to investigate a SNP, a probe consisting of a single stranded complementary sequence must be generated.

Samples of DNA to be assayed must be fragmented into short, single-stranded segments. These are then applied to arrays containing the thousands of different complementary probes. The two different array technologies then detect the presence or absence of a SNP in slightly different ways (LaFramboise 2009):
1. Affymetrix assays – 25 base pair probes are used, with the SNP base of interest in the centre. The complementary sequence from the sample DNA will bind to these probes. If the major allele is present, it will bind perfectly; if the minor allele is present, it will still bind, but less efficiently. The efficiency of the binding, via a process of controls and replicates, produces a signal that mathematical algorithms can use to predict the presence or absence of the SNP.
2. Illumina assays – these instead use 50 base pair probes, containing the sequence immediately adjacent to the SNP base. When the sample DNA binds to the probe, the SNP site will just overlap at the end. The addition of a single additional base to the probe will therefore complete the SNP site. The additional bases used are fluorescently labelled, hence the signal emitted by the additional single base will identify the complementary base present at the SNP site.

It is the ability of the SNP chip assays to do this in parallel for many SNPs that makes them so useful for genomics research. As projects such as the HapMap and 1000 genomes project have reduced the number of known SNPs required to assess one’s genetic variation, SNP chip assays have increased our ability to investigate such SNPs; once the two advances met in the middle, the potential for genome-wide association studies was created.

Genome-wide association studies

In order to investigate the contribution of a previously unknown genetic variant in a common disease, one would previously have had to use a hypothesis driven approach. This typically involved an assessment of the known pathogenesis of the disease, drawing up a list of biologically plausible potential candidates (Patnala 2013). The resulting candidate gene analysis, although responsible for many advances, is substantially limited by the fact that it cannot identify associated variants in areas of the genome not already suspected to contribute to the disease. This is the major advantage of genome-wide association (GWA) studies (Manolio 2010). They are hypothesis-free – no prior assumptions of expected associations are required. Initially using data from the HapMap project and the technological advances of SNP chip assays, whole genomes could be assessed for the presence of between 0.5 and 2.5 million tagged SNP loci. This is most commonly used in a case-control format, looking for SNPs that contribute significant risk or protection in the setting of a specific disease.
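In a case-control GWA study, the basic comparison at each SNP is whether the allele is more frequent in cases than in controls. The sketch below uses a plain chi-square test on allele counts as a stand-in; published studies typically use regression models with additional covariates, so this is an illustrative simplification, and the counts are invented.

```python
import math


def allelic_chi_square(case_minor, case_major, control_minor, control_major):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Returns (chi_square_statistic, p_value).
    """
    table = [[case_minor, case_major], [control_minor, control_major]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented counts: 500 cases and 500 controls, i.e. 1,000 alleles per group.
chi2, p = allelic_chi_square(300, 700, 200, 800)
print(f"chi-square = {chi2:.2f}, p = {p:.2e}")
```

Because the same test is repeated at hundreds of thousands of SNPs, a much stricter significance threshold than 0.05 is needed; a commonly used convention, not stated in this article, is p < 5 × 10⁻⁸.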

With the addition of the 1000 genomes project dataset, imputation allows the 2.5 million SNPs potentially detected on a chip assay to identify associations, via their haplotypes, in up to 5-15 million SNPs (Falcone 2014). It is important to emphasise, however, that the outcome of a GWA study is still to associate loci, not individual SNPs, with a disease state. At said loci, through LD, haplotypes and subsequent imputation, many different variants may be identified as potential candidates for producing the single association detected at that locus. A significant amount of further basic research is often required to clarify which of the associated SNPs is actually responsible – this involves analysis of the likely impact of each SNP and the subsequent biological plausibility of its association.
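The idea behind imputation can be shown with a toy lookup. The sketch below assumes a tiny, entirely hypothetical reference panel of haplotypes and infers the untyped alleles that most often accompany a typed tag allele. Real imputation software uses statistical models over phased genotypes, so this only illustrates the principle.

```python
from collections import Counter

# Hypothetical reference panel: each string is one haplotype across three
# neighbouring SNP sites. Site 0 is the tag SNP that is typed on the chip.
REFERENCE_HAPLOTYPES = ["ACT", "ACT", "ACT", "GTC", "GTC", "GCT"]


def impute_untyped(tag_allele, reference=REFERENCE_HAPLOTYPES, tag_index=0):
    """Guess the alleles at untyped sites from the allele seen at the tag SNP.

    Returns {site_index: (most_likely_allele, proportion_of_matching_haplotypes)},
    or None if no reference haplotype carries the tag allele.
    """
    matches = [hap for hap in reference if hap[tag_index] == tag_allele]
    if not matches:
        return None

    imputed = {}
    for site in range(len(matches[0])):
        if site == tag_index:
            continue
        counts = Counter(hap[site] for hap in matches)
        allele, count = counts.most_common(1)[0]
        imputed[site] = (allele, count / len(matches))
    return imputed


print(impute_untyped("A"))  # every "A" haplotype in the panel agrees at the other sites
print(impute_untyped("G"))  # the "G" haplotypes disagree, so the guess is less certain
```

The second call shows why an association is assigned to a locus rather than to a single SNP: several variants on the same haplotype are statistically hard to tell apart.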

Statistics of GWA studies

Power
A prominent feature of GWA studies is their large sample size. As with other studies, sample size is all a matter of power – what chance has your study got of detecting a significant difference if it is there? In order to increase this chance, you must increase the power, which requires a larger sample.

In GWA studies, power is determined by a number of factors, but principally the MAF of the SNPs you are looking for and the relative risk they confer on the carrier (genotypic relative risk) (Sawcer 2012) (Levinson 2009). As previously mentioned, the imputation made possible by the 1000 genomes project allows GWA studies to detect SNPs with a MAF >1%. Each of these individual SNPs is unlikely to cause more than a small increase in relative risk. In the early GWA studies, smaller sample sizes meant that the studies were only powered to detect SNPs with genotypic relative risks (GRR) >2. Many more recent larger studies are able to detect SNPs with a GRR in the region of 1.1-1.4; in order to create the power to do this, however, they require sample sizes of 8,000-20,000 cases, and similar numbers of controls.
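The dependence of sample size on MAF and GRR can be sketched with a standard two-proportion normal approximation. Several assumptions are introduced here that are not in the cited papers: a multiplicative per-allele risk model, a rare disease (so the control allele frequency approximates the population MAF), equal numbers of cases and controls, a significance threshold of 5 × 10⁻⁸ and 80% power. Dedicated power calculators should be preferred for real study design.

```python
import math
from statistics import NormalDist


def case_allele_frequency(maf, grr):
    """Risk-allele frequency expected in cases under a multiplicative model."""
    return maf * grr / (1 + maf * (grr - 1))


def required_cases(maf, grr, alpha=5e-8, power=0.8):
    """Approximate number of cases (with as many controls) needed for an allelic test."""
    p_control = maf
    p_case = case_allele_frequency(maf, grr)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_beta = NormalDist().inv_cdf(power)

    p_bar = (p_case + p_control) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_case * (1 - p_case) + p_control * (1 - p_control))
    ) ** 2
    alleles_per_group = numerator / (p_case - p_control) ** 2
    # Each person contributes two alleles.
    return math.ceil(alleles_per_group / 2)


for grr in (2.0, 1.4, 1.2, 1.1):
    print(f"MAF 20%, GRR {grr}: about {required_cases(0.20, grr)} cases")
```

Under these assumptions a GRR of 2 needs only a few hundred cases, whereas a GRR of 1.2 needs several thousand, which is broadly in keeping with the sample sizes quoted above.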


Odds ratios and their interpretation

GWA studies generally present their results in terms of odds ratios (OR) for the identified risk loci. This is due to the regression equations used. For any given SNP identified, the odds of that SNP in the cases with disease equate to the ratio between the number carrying the SNP and the number not carrying it; the same applies to the controls. The OR is then generated from the odds in the cases divided by the odds in the controls.
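The calculation just described is short enough to write out directly. The counts in this sketch are invented for illustration.

```python
def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Odds of carrying the SNP in cases divided by the odds in controls."""
    odds_in_cases = case_carriers / case_noncarriers
    odds_in_controls = control_carriers / control_noncarriers
    return odds_in_cases / odds_in_controls


# Invented example: 300 of 1,000 cases and 200 of 1,000 controls carry the SNP.
print(round(odds_ratio(300, 700, 200, 800), 2))
```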

When interpreting the resulting OR for each SNP locus, one must bear in mind the above – it only compares those with the risk allele to those without it (Sawcer 2012). For example, when interpreting the presence of the 19q13 risk allele (corresponding to the APOE4 allele) that generates an OR of 2.20 in cases of intracerebral haemorrhage vs. controls (Falcone 2014), it is easy to assume that possession of this allele more than doubles your risk of developing intracerebral haemorrhage compared to background incidence. This is incorrect – the OR describes the odds in people with the allele relative to those without it. As the general background incidence of intracerebral haemorrhage is created by people both with the allele and without it (the MAF of APOE4 is a substantial 12%), the effect of its possession as compared to the background will be smaller than the reported OR.
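The size of this effect can be estimated with a short sketch. Two simplifications are assumed here: the OR is treated as an approximation of the relative risk (reasonable when the disease is rare), and the quoted 12% is treated as the fraction of people who carry the allele.

```python
def risk_vs_population(relative_risk, carrier_fraction):
    """Risk of a carrier relative to the population average.

    The population average mixes carriers (risk = relative_risk x baseline)
    and non-carriers (risk = baseline), weighted by how common each group is.
    """
    population_average = carrier_fraction * relative_risk + (1 - carrier_fraction)
    return relative_risk / population_average


# OR of 2.20 and 12% frequency, as quoted in the text.
print(round(risk_vs_population(2.20, 0.12), 2))
```

Under these assumptions a carrier's risk is a little under twice the population average rather than 2.2 times it, and the gap widens as the allele becomes more common.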

Such nuances of interpretation may seem insignificant, but they become all the more significant when studies start reporting the combined OR of multiple alleles. For example, in a recent GWA study of multiple sclerosis, the combined OR of the 57 SNPs and 4 human leukocyte antigen alleles was >2,800 (International Multiple Sclerosis Genetics Consortium (IMSGC) and Wellcome Trust Case Control Consortium 2011). However, as each of these alleles has a MAF >1% (often significantly more so), it is very uncommon for an individual to possess none of the risk alleles. Such a person without any alleles would sit towards the very left of the bell-shaped distribution of a population’s risk of developing multiple sclerosis; a person possessing all such alleles would sit towards the very right with a risk 2,800 times greater than the former person, but it would not be 2,800 times greater than the general population’s risk.
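The same reasoning can be extended to many alleles. The sketch below uses purely hypothetical values (61 independent alleles, each with the same OR and carrier fraction), which are not the figures from the multiple sclerosis study, and again treats each OR as an approximate relative risk that combines multiplicatively.

```python
import math

# Hypothetical illustration only: (odds ratio, fraction of people carrying the allele).
HYPOTHETICAL_LOCI = [(1.14, 0.30)] * 61


def combined_or(loci):
    """Risk of someone with every allele relative to someone with none."""
    return math.prod(odds for odds, _ in loci)


def average_person_vs_no_alleles(loci):
    """Risk of the population average relative to someone with none of the alleles."""
    return math.prod(1 - fraction + fraction * odds for odds, fraction in loci)


def all_alleles_vs_average(loci):
    """Risk of someone with every allele relative to the population average."""
    return combined_or(loci) / average_person_vs_no_alleles(loci)


def fraction_with_no_alleles(loci):
    """Proportion of the population expected to carry none of the alleles."""
    return math.prod(1 - fraction for _, fraction in loci)


print(f"all alleles vs none:    {combined_or(HYPOTHETICAL_LOCI):.0f}")
print(f"all alleles vs average: {all_alleles_vs_average(HYPOTHETICAL_LOCI):.0f}")
print(f"fraction with none:     {fraction_with_no_alleles(HYPOTHETICAL_LOCI):.1e}")
```

With these invented inputs, the comparison group of people carrying no risk alleles is vanishingly small, and the risk relative to the population average is far lower than the headline combined OR.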

Conclusions

Figure 2 demonstrates the extent of the recent proliferation of GWA studies. This article has superficially discussed the advances in genetic technology required for this to occur – from the use of LD to group SNPs into more easily accessible haplotypes, to the chip arrays that allow huge numbers of such haplotypes to be assessed for associations with a particular disease state. A cautious note has also been aired about over-interpretation – it must be remembered that the association of a locus only informs us of the associated haplotype, with further work required to elucidate the exact causal SNP; and that the potentially very high ORs produced when the risks of multiple loci are combined are not relative to the general population, but only to those who carry none of the risk alleles involved.

Just as GWA studies have become commonplace in most fields of medical research, however, the accelerating pace of genomics technology has already pushed forward to the next step. Whereas previously studies would only sequence the DNA of participants around loci of known association (“targeted re-sequencing”) in order to gain more information about the potential causal SNPs, primary sequencing studies are now becoming more common (Jiang 2013). Rather than relying on detecting the association of known SNPs in recognised haplotypes, these studies generate vastly greater volumes of information as complete DNA sequences for each participant. Some produce exome sequences, others full genome sequences. Such techniques facilitate the detection of rare and even de novo variants, and prior knowledge of the associated SNP is not required; as their use becomes more widespread, we can be assured that the exponential growth of information relating to the association of genetic variants to disease states will continue.



Bibliography

1. The 1000 Genomes Project Consortium. “A map of human genome variation from population-scale sequencing.” Nature 467 (October 2010).
2. Falcone. “Current concepts and clinical applications of stroke genetics.” The Lancet Neurology 13, no. 4 (April 2014): 405-418.
3. HapMap. http://hapmap.ncbi.nlm.nih.gov/abouthapmap.html.en.
4. National Human Genome Research Institute. https://www.genome.gov/26525384.
5. International Multiple Sclerosis Genetics Consortium (IMSGC) and Wellcome Trust Case Control Consortium 2. “Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis.” Nature 476 (2011): 214-219.
6. Jiang. “Detection of Clinically Relevant Genetic Variants in Autism Spectrum Disorder by Whole-Genome Sequencing.” The American Journal of Human Genetics 93 (August 2013): 249-263.
7. Keogh. “Exome sequencing: how to understand it.” Pract Neurol 13 (2013): 399-407.
8. LaFramboise. “Single nucleotide polymorphism arrays: a decade of biological, computational and technological advances.” Nucleic Acids Research 37, no. 13 (2009): 4181-4193.
9. Lander. “The new genomics: global views of biology.” Science 274 (25 October 1996): 536-9.
10. Levinson. “Genomewide association studies: History, rationale and prospects for psychiatric disorders.” Am J Psychiatry 166 (May 2009): 540-556.
11. Manolio. “Genomewide Association Studies and Assessment of the Risk of Disease.” New England Journal of Medicine 363 (2010): 166-76.
12. Patnala. “Candidate gene association studies: a comprehensive guide to useful in silico tools.” BMC Genetics 14, no. 39 (2013).
13. Sawcer. “Risk in Complex Genetics: ‘All Models Are Wrong but Some Are Useful’.” Ann Neurol 72 (2012): 502-509.

