Medical genomics – an introduction
The proliferation of genomics technology within the last decade has left relatively few areas of medicine untouched. Much of this is built upon the common disease, common variant hypothesis.
In this month’s Lancet Neurology, the genetics of stroke is reviewed (Falcone 2014). Whilst
providing an excellent example of the wealth of information that can be
obtained from modern genomics, it also highlights how increasingly, the modern
clinician must have an appreciation for how this information is acquired. At
least a superficial understanding of the techniques involved is required, as
shall be reviewed in this article.
SNPs, linkage disequilibrium and imputation – reducing what we need to look for
It is estimated that the human genome contains around 38
million variants, over 90% of these being single nucleotide polymorphisms (SNPs).
During meiosis, homologous recombination allows the exchange
of genetic information between homologous chromosomes. It therefore enhances
variation, ultimately promoting evolution via natural selection. Such exchanges
essentially shuffle the pack of genetic information, producing an element of
randomisation in terms of which genetic variations are passed on to one’s
offspring.
The nature of homologous recombination, however, means that
the chance of any two variants being passed on together is not always equal.
Recombination occurs at specific “recombination hotspots” on the chromosomes.
If two variants are separated by some distance, with said hotspots located
between them, it becomes likely that recombination will shuffle the pack and
prevent them being passed on to the same offspring. If, on the other hand, two
genetic variants are located closely on the chromosome with no recombination
hotspots between them, they are likely to be passed on through many generations
together. Such non-random distribution of two genetic markers is known as
linkage disequilibrium (LD) (Falcone 2014). In general,
two genetic variants that are within 10,000 base pairs of each other will
demonstrate LD; the closer they are, the stronger this will be. This property
can also be extended beyond two loci – LD blocks represent areas of DNA with
potentially multiple variants, all of which tend to segregate together. Such
blocks are also often referred to as a haplotype – a combination of alleles at
adjacent loci that tend to be inherited together.
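The strength of LD between two variants can be put into numbers. The sketch below uses the standard measures D, D' and r², which are not named in this article but are the usual way of quantifying how far two alleles depart from independent inheritance. The function name and the example frequencies are hypothetical, chosen purely for illustration.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D_prime, r_squared) for two biallelic loci.

    p_ab : frequency of haplotypes carrying both allele A and allele B
    p_a  : frequency of allele A at the first locus
    p_b  : frequency of allele B at the second locus
    """
    # D is the gap between the observed haplotype frequency and the
    # frequency expected if the two alleles were inherited independently.
    d = p_ab - p_a * p_b

    # D' rescales D by the largest value it could take for these allele
    # frequencies, so that it lies between -1 and 1.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0

    # r squared is the squared correlation between the two alleles.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r_squared


# Hypothetical example 1: the two alleles always travel together.
complete = ld_stats(p_ab=0.3, p_a=0.3, p_b=0.3)

# Hypothetical example 2: the two alleles are inherited independently.
independent = ld_stats(p_ab=0.09, p_a=0.3, p_b=0.3)

print(complete)
print(independent)
```

In the first example r² comes out at 1 (knowing one allele tells you the other), and in the second at 0 (knowing one allele tells you nothing about the other).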
Linkage disequilibrium is a useful property – it allows us
to simplify the huge amount of variation, as demonstrated by the HapMap project (HapMap n.d.). As stated
above, it is thought that the genome contains around 38 million variants, most
being SNPs – there will be either a major allele or a minor allele at each of
these sites. How common these SNPs are is represented by the minor allele
frequency (MAF): a MAF above 5% denotes a common SNP, 0.5-5% a low frequency SNP, and below 0.5% a
rare SNP. The HapMap project, by densely genotyping a reference cohort, estimated
that there are around 10 million SNPs with a MAF >1%, and that many of these
SNPs are contained within the same haplotypes due to being in LD. They
therefore set out to identify single SNPs that could be used to identify whole
haplotypes; such “tagged” loci would greatly reduce the number of variants that
needed to be analysed, as if one was shown to be present, it could be assumed
that the others are there also. This process is known as imputation – using
reference sources from densely genotyped cohorts to create haplotypes, so that
when a variant contained in such a haplotype is detected in a study, it can be
assumed that all of the other related variants are also present.
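The MAF bands described above can be sketched in a few lines. The function names and the genotype counts are hypothetical; the thresholds are those given in the text, with a MAF of exactly 5% treated as low frequency because the common band is defined as strictly above 5%.

```python
def minor_allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """Minor allele frequency from genotype counts at one biallelic SNP."""
    # Each person carries two alleles, so there are 2N alleles in total.
    total_alleles = 2 * (n_hom_ref + n_het + n_hom_alt)
    alt_freq = (2 * n_hom_alt + n_het) / total_alleles
    # The minor allele is whichever of the two alleles is rarer.
    return min(alt_freq, 1 - alt_freq)


def classify_maf(maf):
    """Band a MAF using the thresholds quoted in the text."""
    if not 0 <= maf <= 0.5:
        raise ValueError("a minor allele frequency lies between 0 and 0.5")
    if maf > 0.05:
        return "common"
    if maf >= 0.005:
        return "low frequency"
    return "rare"


# Hypothetical cohort of 1,000 people genotyped at a single SNP.
maf = minor_allele_frequency(n_hom_ref=810, n_het=180, n_hom_alt=10)
print(maf, classify_maf(maf))
```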
As a result of the HapMap project, 500,000 tagged SNPs were
identified that, via imputation, would allow one to assess for the presence of
80% of all common (>5% MAF) SNPs in European populations. Subsequent
research, particularly the 1000 Genomes Project (Consortium 2010), however, has
significantly improved this:
By using next generation sequencing, the 1000 Genomes Project
produced low-coverage whole genome sequencing of 2,500 individuals. Coverage is
the average number of times each base of an individual genome is sequenced; because the
genome is broken up into random fragments before sequencing, if one were to
only sequence a length of DNA equivalent to the whole genome, some areas would
be sequenced more than once, and some areas not at all. It is accepted that 28x
coverage is sufficient to ensure the whole genome is sequenced in sufficient
detail; due to restrictions on resources, a low coverage 4x approach was used –
they have calculated that this is now sufficient to discover 95% of variants
present at a 1% frequency – far more than that available from the HapMap. This
equates to a dataset containing around 20 million SNPs.
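Why 1x coverage leaves gaps can be illustrated with a simple idealised model. The sketch below assumes reads land uniformly at random, so the depth at any one base follows a Poisson distribution whose mean is the coverage. This is a textbook simplification, not the calculation used by the 1000 Genomes Project, and the function name is hypothetical.

```python
import math


def prob_depth_at_least(mean_coverage, min_depth):
    """P(a given base is read at least min_depth times) under a Poisson model."""
    # Sum the Poisson probabilities of seeing 0, 1, ..., min_depth - 1 reads,
    # then take the complement.
    below = sum(
        math.exp(-mean_coverage) * mean_coverage ** k / math.factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - below


# Fraction of bases read at least once at different mean coverages.
for coverage in (1, 4, 28):
    print(coverage, round(prob_depth_at_least(coverage, 1), 3))
```

Under these assumptions roughly 63% of bases are read at least once at 1x and roughly 98% at 4x. A single read is rarely enough to call a variant confidently in one person, which is why higher coverage is preferred for individual genomes.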
Chip arrays – increasing what we can look for
The development of SNP microarrays in parallel to the above advances
has been essential to the proliferation of genomics knowledge. A key point to
appreciate, however, is that such arrays can only assess the presence or
absence of known SNPs. The reason for this becomes clear with a superficial understanding
of the technology – in order to investigate a SNP, a probe consisting of a
single stranded complementary sequence must be generated.
Samples of DNA to be assayed must be fragmented into short, single-stranded
segments. These are then applied to arrays containing the thousands of
different complementary probes. The two different array technologies then
detect the presence or absence of a SNP in slightly different ways (LaFramboise 2009):
1. Affymetrix assays – 25 base pair probes are
used, with the SNP base of interest in the centre. The complementary sequence
from the sample DNA will bind to these probes. If the major allele is present,
it will bind perfectly; if the minor allele is present, it will still bind, but
less efficiently. The efficiency of the binding, via a process of controls and
replicates, produces a signal that mathematical algorithms can use to predict
the presence or absence of the SNP.
2. Illumina assays – these instead use 50 base pair
probes, containing the sequence immediately adjacent to the SNP base. When the
sample DNA binds to the probe, the SNP site will just overlap at the end. The
addition of a single additional base to the probe will therefore complete the
SNP site. The additional bases used are fluorescently labelled, hence the
signal emitted by the additional single base will identify the complementary
base present at the SNP site.
It is the ability of the SNP chip assays to do this in
parallel for many SNPs that makes them so useful for genomics research. As
projects such as the HapMap and 1000 Genomes Project have reduced the number of
known SNPs required to assess one’s genetic variation, SNP chip assays have
increased our ability to investigate such SNPs; once the two advances met in
the middle, the potential for genome-wide association studies was created.
Genome-wide association studies
In order to investigate the contribution of a previously
unknown genetic variant in a common disease, one would previously have had to
use a hypothesis driven approach. This typically involved an assessment of the
known pathogenesis of the disease, drawing up a list of biologically plausible
potential candidates (Patnala 2013). The
resulting candidate gene analysis, although responsible for many advances, is
substantially limited by the fact that it cannot identify associated variants
in areas of the genome not already suspected to contribute to the disease. This
is the major advantage of genome-wide association (GWA) studies (Manolio 2010). They are
hypothesis-free – no prior assumptions of expected associations are required. Initially
using data from the HapMap project and the technological advances of SNP chip
assays, whole genomes could be assessed for the presence of between 0.5 and 2.5
million tagged SNP loci. This is most commonly used in a case-control format,
looking for SNPs that contribute significant risk or protection in the setting
of a specific disease.
With the addition of the 1000 Genomes Project dataset, imputation
allows the 2.5 million SNPs potentially detected on a chip assay to identify
associations, via their haplotypes, in between 5 and 15 million SNPs (Falcone 2014). It is
important to emphasise, however, that the outcome of a GWA study is still to
associate loci, not individual SNPs, with a disease state. At said loci,
through LD, haplotypes and subsequent imputation, many different variants may
be identified as potential candidates for producing the single association detected
at that locus. A significant amount of further basic research is often required
to clarify which of the associated SNPs is actually responsible – this involves
analysis of the likely impact of each SNP and the subsequent biological
plausibility of its association.
Statistics of GWA studies
Power
A prominent feature of GWA studies is their large sample
size. As with other studies, sample size is all a matter of power – what chance
has your study got of detecting a significant difference if it is there? In
order to increase this chance, you must increase the power, which requires a
larger sample.
In GWA studies, power is determined by a number of factors, but principally the MAF of the SNPs you are looking for and the relative risk they impart on the carrier (genotypic relative risk).
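How MAF, effect size and sample size interact can be seen in a rough power calculation. The sketch below is a normal approximation for a simple comparison of allele frequencies between cases and controls, not a method taken from this article. It assumes a per-allele odds ratio as a stand-in for the genotypic relative risk, and the conventional genome-wide significance threshold of 5 x 10^-8. All function names and input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def case_allele_freq(control_maf, odds_ratio):
    """Allele frequency expected in cases, given the frequency in controls."""
    odds = odds_ratio * control_maf / (1 - control_maf)
    return odds / (1 + odds)


def approx_power(n_cases, n_controls, control_maf, odds_ratio, alpha=5e-8):
    """Approximate power of a two-sided test comparing allele frequencies."""
    p0 = control_maf
    p1 = case_allele_freq(p0, odds_ratio)
    # Each participant contributes two alleles.
    alleles_cases, alleles_controls = 2 * n_cases, 2 * n_controls
    pooled = (p1 * alleles_cases + p0 * alleles_controls) / (
        alleles_cases + alleles_controls
    )
    se = sqrt(pooled * (1 - pooled) * (1 / alleles_cases + 1 / alleles_controls))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Probability that the test statistic clears the significance threshold.
    return NormalDist().cdf(abs(p1 - p0) / se - z_crit)


# Hypothetical scenarios: a modest effect (OR 1.2) at two sample sizes
# and at two allele frequencies.
for n, maf in [(2000, 0.20), (10000, 0.20), (10000, 0.01)]:
    print(n, maf, round(approx_power(n, n, maf, 1.2), 3))
```

Under these assumptions, power rises steeply with sample size for a common variant, whereas a rare variant with the same odds ratio remains very hard to detect even in the larger cohort.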
GWA studies generally present their results
in terms of odds ratios (OR) for the identified risk loci. This is due to the
regression equations used. For any given SNP identified, the odds of that SNP
in the cases with disease equates to the ratio between the number carrying the
SNP and the number not carrying it; the same applies to the controls. The OR is
then generated from the odds in the cases divided by the odds in the controls.
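The calculation just described can be written out directly from a two-by-two table of counts. The counts below are hypothetical and the function name is illustrative. Published GWA studies derive their ORs from regression models that may adjust for other factors, so this is only the unadjusted version of the same idea.

```python
def odds_ratio(cases_with, cases_without, controls_with, controls_without):
    """Unadjusted odds ratio from carrier counts in cases and controls."""
    # Odds of carrying the SNP among the cases.
    odds_cases = cases_with / cases_without
    # Odds of carrying the SNP among the controls.
    odds_controls = controls_with / controls_without
    return odds_cases / odds_controls


# Hypothetical study: 1,000 cases and 1,000 controls.
example = odds_ratio(
    cases_with=300, cases_without=700,
    controls_with=200, controls_without=800,
)
print(round(example, 2))  # about 1.71
```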
When interpreting the resulting OR for each SNP locus, one must
bear in mind the above – it only compares those with the risk allele to those
without it (Sawcer 2012). For example,
when interpreting the presence of the 19q13 risk allele (corresponding to the
APOE4 allele) that generates an OR of 2.20 in cases of intracerebral
haemorrhage vs. controls (Falcone 2014), it is easy
to assume that possession of this allele more than doubles your risk of
developing intracerebral haemorrhage compared to background incidence. This is
incorrect – the OR relates the odds in people with the
allele to the odds in those without it. As the general background incidence of
intracerebral haemorrhage is created by people both with the allele and without
it (the MAF of APOE4 is a substantial 12%), the effect of its possession as
compared to the background will be smaller than the reported OR.
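A rough calculation shows the size of this difference. The sketch below makes several simplifying assumptions that are not in the source: the OR of 2.20 is treated as an approximate relative risk (reasonable only for an uncommon disease), anyone carrying one or two copies of the allele is given the same risk, and carrier frequency is derived from the 12% allele frequency under Hardy-Weinberg equilibrium. The function name is hypothetical.

```python
def risk_relative_to_population(relative_risk, carrier_freq):
    """Risk in carriers divided by the average risk across the whole population."""
    # Population average risk, expressed in units of the non-carrier risk:
    # carriers contribute relative_risk, non-carriers contribute 1.
    population_average = carrier_freq * relative_risk + (1 - carrier_freq) * 1.0
    return relative_risk / population_average


allele_freq = 0.12                      # MAF quoted in the text
carrier_freq = 1 - (1 - allele_freq) ** 2  # at least one copy, assuming HWE
reported_or = 2.20                      # OR quoted in the text

vs_background = risk_relative_to_population(reported_or, carrier_freq)
print(round(carrier_freq, 4), round(vs_background, 2))
```

Under these assumptions around 23% of people carry the allele, and a carrier's risk is roughly 1.7 times the population average rather than 2.2 times.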
Such nuances of interpretation may seem insignificant, but
they become all the more significant when studies start reporting the combined
OR of multiple alleles. For example, in a recent GWA study of multiple
sclerosis, the combined OR of the 57 SNPs and 4 human leukocyte antigen alleles
was >2,800 (International Multiple Sclerosis Genetics
Consortium (IMSGC) and Wellcome Trust Case Control Consortium 2011). However, as
each of these alleles has a MAF >1% (often significantly more so), it is
very uncommon for an individual to possess none of the risk alleles. Such a
person without any alleles would sit towards the very left of the bell-shaped
distribution of a population’s risk of developing multiple sclerosis; a person
possessing all such alleles would sit towards the very right with a risk 2,800
times greater than the former person, but it would not be 2,800 times greater than the
general population’s risk.
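The same point can be illustrated with a toy model of combined risk. The sketch below assumes the loci act independently and multiplicatively, treats each OR as an approximate relative risk, and uses entirely hypothetical values (ten loci, each with an OR of 1.5 and a carrier frequency of 40%). It does not use data from the multiple sclerosis study.

```python
import math


def combined_odds_ratio(odds_ratios):
    """Risk of carrying every risk allele relative to carrying none."""
    return math.prod(odds_ratios)


def population_average_factor(odds_ratios, carrier_freqs):
    """Average population risk relative to a person carrying no risk alleles."""
    # At each locus a fraction c of people carry the allele (risk multiplied
    # by r) and the rest do not (risk multiplied by 1).
    return math.prod(1 - c + c * r for r, c in zip(odds_ratios, carrier_freqs))


# Hypothetical loci, chosen only to illustrate the arithmetic.
ors = [1.5] * 10
freqs = [0.4] * 10

all_vs_none = combined_odds_ratio(ors)
all_vs_average = all_vs_none / population_average_factor(ors, freqs)
print(round(all_vs_none, 1), round(all_vs_average, 1))  # about 57.7 and 9.3
```

In this toy example a person with every risk allele has nearly 58 times the risk of a person with none, but only about 9 times the risk of the average member of the population.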
Conclusions
Figure 2 demonstrates the extent of the recent proliferation
of GWA studies. This article has superficially discussed the advances in
genetic technology required for this to occur – from the use of LD to group
SNPs into more easily accessible haplotypes, to the chip arrays that allow huge
numbers of such haplotypes to be assessed for associations with a particular
disease state. A cautious note has also been aired about over-interpretation –
it must be remembered that the association of a locus only informs us of the
associated haplotype, with further work required to elucidate the exact causal
SNP; and that the potentially very high ORs produced when the risks of multiple
loci are combined are not relative to the general population, but only to those
who carry none of the loci involved.
Just as GWA studies have become commonplace in most fields
of medical research, however, the accelerating pace of genomics technology has
already pushed forward to the next step. Whereas previously studies would only
sequence the DNA of participants around loci of known association (“targeted
re-sequencing”) in order to gain more information about the potential causal
SNPs, primary sequencing studies are now becoming more common (Jiang 2013). Rather than
relying on detecting the association of known SNPs in recognised haplotypes,
these studies generate vastly greater volumes of information as complete DNA
sequences for each participant. Some produce exome sequences, others full genome
sequences. Such techniques facilitate the detection of rare and even de novo
variants, and prior knowledge of the associated SNP is not required; as their
use becomes more widespread, we can be assured that the exponential growth of
information relating to the association of genetic variants to disease states
will continue.
Bibliography
1. The 1000 Genomes Project Consortium. “A map of human genome variation from population-scale sequencing.” Nature 467 (October 2010).
2. Falcone. “Current concepts and clinical applications of stroke genetics.” The Lancet Neurology 13, no. 4 (April 2014): 405-418.
3. HapMap. http://hapmap.ncbi.nlm.nih.gov/abouthapmap.html.en.
4. National Human Genome Research Institute. https://www.genome.gov/26525384.
5. International Multiple Sclerosis Genetics Consortium (IMSGC) and Wellcome Trust Case Control Consortium 2. “Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis.” Nature 476 (2011): 214-219.
6. Jiang. “Detection of Clinically Relevant Genetic Variants in Autism Spectrum Disorder by Whole-Genome Sequencing.” The American Journal of Human Genetics 93 (August 2013): 249-263.
7. Keogh. “Exome sequencing: how to understand it.” Pract Neurol 13 (2013): 399-407.
8. LaFramboise. “Single nucleotide polymorphism arrays: a decade of biological, computational and technological advances.” Nucleic Acids Research 37, no. 13 (2009): 4181-4193.
9. Lander. “The new genomics: global views of biology.” Science 25, no. 274 (October 1996): 536-9.
10. Levinson. “Genomewide association studies: History, rationale and prospects for psychiatric disorders.” Am J Psychiatry 166 (May 2009): 540-556.
11. Manolio. “Genomewide Association Studies and Assessment of the Risk of Disease.” New England Journal of Medicine 363 (2010): 166-76.
12. Patnala. “Candidate gene association studies: a comprehensive guide to useful in silico tools.” BMC Genetics 14, no. 39 (2013).
13. Sawcer. “Risk in Complex Genetics: ‘All Models Are Wrong but Some Are Useful’.” Ann Neurol 72 (2012): 502-509.