A genetic association study tests whether a given sequence variant is associated with the phenotype
of a specific trait. Association studies can be done on a single SNP or on as many SNPs as you like
(each SNP tested individually under an additive model). One type of association study is the Genome
Wide Association Study (GWAS), which takes a positional approach. It works by scanning the genome
with hundreds of thousands of single nucleotide polymorphism (SNP) markers that capture as much of
the variation in the genome as possible. This differs from a candidate approach, which predesignates
the gene of interest. SNPs at particular loci in the genome can be linked to a wide range of
diseases/phenotypes, so a GWAS tests each SNP against the trait of interest to determine whether it
is associated with that trait. An associated SNP is not necessarily the causal variant; it may simply
be inherited together with it. A GWAS requires a large sample of individuals who are genotyped using
highly multiplexed genotyping technology such as the Illumina platform, which queries up to 1 million
SNPs per individual.

The traits we are looking at in these studies can be either binary or continuous. Binary traits
represent phenotypes that you either have or don’t have, for example diabetes. For binary traits we
compare allele frequency between cases (people with the disease) and controls who are matched as
well as possible in terms of age, sex etc. This is known as a case-control study. The summary statistic
is the odds ratio for the minor allele. This is the ratio of the odds of disease among the exposed
(carriers of the minor allele) to the odds among the unexposed. If it is equal to 1, the disease is
equally likely in both groups. If it is greater than 1 there is increased risk with the minor allele,
and if it is below 1 there is decreased risk. For binary phenotypes we use a logistic model to describe
the relationship between genotype and case-control status.
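
As a rough illustration of the odds ratio described above, the sketch below computes it from allele counts in cases and controls. It is written in Python purely as an example; the function name and the counts are made up for illustration and do not come from any real study.

```python
def allelic_odds_ratio(case_minor, case_major, control_minor, control_major):
    """Odds of carrying the minor allele in cases divided by the odds in controls."""
    odds_cases = case_minor / case_major
    odds_controls = control_minor / control_major
    return odds_cases / odds_controls


# Invented allele counts (two alleles per person), for illustration only.
or_example = allelic_odds_ratio(case_minor=300, case_major=700,
                                control_minor=200, control_major=800)

if or_example > 1:
    print(f"OR = {or_example:.2f}: minor allele associated with increased risk")
elif or_example < 1:
    print(f"OR = {or_example:.2f}: minor allele associated with decreased risk")
else:
    print("OR = 1: disease equally likely in both groups")
```

In practice the odds ratio would usually come from the logistic model, which also allows adjustment for covariates such as age and sex; the direct calculation here only shows what the number means.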

Continuous traits, also known as quantitative traits, differ from binary traits in that each allele
at the many genes that influence the trait has a small effect on the phenotype. This is known as
additive genetic variance, where the phenotype results from the cumulative effect of common and rare
causal alleles. Some examples of continuous traits are height, weight and, what we will be using for
our example, serum urate levels. Continuous traits such as these are usually under polygenic control,
meaning they tend to follow a normal distribution. They are however subject to substantial
environmental influence, which may shift this distribution to the left or right. Because of these
factors, when studying continuous traits we use an existing population-based cohort study. These are
studies that use genetic information to assess associations between genotype and traits/outcomes.
Because we are testing multiple hypotheses, these studies need a very large sample size. The
participants are also not selected based on a given disease, but rather to represent the general
population so that observations can be generalized.

Because the trait we want to test for is continuous, we need to measure the level of the phenotype
for each genotype at the SNP (i.e. homozygous for the major allele, heterozygous, homozygous for the
minor allele). To do this we use linear regression, from which we calculate beta. The beta value,
represented by the slope of the regression line, is the average change in the continuous measure per
copy of the minor allele and is the summary statistic of interest, just as the odds ratio is for
binary traits. An example of this can be seen on slide 15 of lecture 2. In this figure we can see that
the average phenotype differs between the SNP genotypes, and when a line is drawn through these
averages its slope is beta. If beta is positive (slope up), the marker associates with serum urate
(SUA) levels and the minor allele raises the phenotype, i.e. it is a risk allele. If beta is negative
(slope down), the marker also associates with SUA levels but the minor allele is protective. Beta
could also be 0, meaning there is no association between the two. The linear regression also gives a
p-value for the test. This is distinct from power, which is the chance of detecting an association
provided it is present. For GWAS the usual p-value threshold of 0.05 is too lenient because of the
large number of tests being done: in a study of 1 million SNPs with no true associations, about 5%
(50,000 SNPs) would still pass this threshold as false positives. Because of this we use the
Bonferroni-corrected threshold of 5 x 10^-8 (0.05 divided by 1 million tests), which keeps the chance
of a false positive low but does not by itself prove that the SNP causes the disease. We can calculate
these p-values using R to test the significance of association between every SNP in the genome and
serum urate levels. Using these we can create a Manhattan plot, which displays each SNP by genomic
position against the negative log (base 10) of its p-value so that significant SNPs stand out as peaks.
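
The regression and multiple-testing arithmetic above can be sketched as follows. The text uses R for this; the version below is in Python, and the function name, genotype coding (0, 1 or 2 copies of the minor allele) and phenotype values are all invented for illustration. The p-value uses a normal approximation to the test statistic, which is an assumption that is only reasonable for large samples, so the p-value for the six-person toy data should not be taken literally.

```python
import math


def additive_regression(genotypes, phenotypes):
    """Least-squares fit of phenotype on minor-allele count (0, 1 or 2).

    Returns (beta, p_value). The two-sided p-value uses a normal
    approximation, an assumption made here to avoid extra libraries.
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    beta = sxy / sxx                      # slope: change in phenotype per allele copy
    intercept = mean_y - beta * mean_g
    rss = sum((y - intercept - beta * g) ** 2 for g, y in zip(genotypes, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of the slope
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return beta, p_value


# Invented toy data: allele counts and serum urate values for six people.
genotypes = [0, 0, 1, 1, 2, 2]
urate = [0.30, 0.32, 0.35, 0.37, 0.40, 0.42]
beta, p_value = additive_regression(genotypes, urate)

# Multiple-testing arithmetic from the text.
n_tests = 1_000_000
expected_false_positives = 0.05 * n_tests   # null SNPs passing p < 0.05
bonferroni_threshold = 0.05 / n_tests       # genome-wide significance level
manhattan_height = -math.log10(p_value)     # y-axis value on a Manhattan plot

print(beta, p_value, bonferroni_threshold, manhattan_height)
```

A positive `beta` here corresponds to the "slope up" case, and a SNP would count as genome-wide significant only if `p_value` falls below `bonferroni_threshold`.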

 
About 80% of GWAS signals are not caused by a change in an amino acid or in the function of a
protein. They instead lie in regulatory regions and change gene expression. From this we can
hypothesise that the variants found in these regions influence gene expression in some way,
affecting the final phenotype. However, the biological interpretation of this is often unclear, so
we carry out an expression quantitative trait locus (eQTL) analysis at a particular locus to
determine the effect they are having. An eQTL analysis is much the same as the process we used
earlier when associating a genetic variant with a continuous outcome, except that the outcome is now
the expression level of a gene. As for the continuous trait, we calculate both beta and the p-value
using R, which we can then display on a Manhattan plot. The samples for this analysis need to be
genome-wide genotyped so that we can look at as many possible causal genes as we can. We also want
gene expression to be measured in as wide a range of tissues as possible. This is because we don’t
want to look at expression in one tissue only, but in the different tissues in which the gene is
likely to be expressed. We also want to include different developmental stages to see when the genes
are expressed. The most commonly used database for this at the moment is GTEx, which uses tissue
samples from people who have donated their bodies to science.
 
An example of an eQTL can be seen using serum urate. From the results of our Manhattan plot for the
GWAS we can do a locus zoom and see the local association results for a particular locus. In our
example we zoomed in on the locus around the SNP rs1967017, which can be seen in the figure on slide
8 of lecture 5. The graph plots the p-value of each SNP on the y-axis against its position on chr1.
We can see that the majority of the SNP peak signal is in the middle of the graph, just in front of
the PDZK1 gene. PDZK1 is involved in urate transport, so we can use it as a candidate gene for the
eQTL analysis to identify the causal gene at the locus. With this candidate gene they carried out an
eQTL analysis where, instead of testing the SNPs against serum urate levels, the SNPs are tested
against PDZK1 gene expression in the colon. They measured expression in the colon because it is an
important site of urate excretion. They then ran another linear regression on the eQTL data to find
the p-value of association with expression for each of the SNPs. From this they made another
Manhattan plot which, rather than comparing SNP genotype with phenotype (urate levels), compares
genotype with PDZK1 gene expression. This figure can also be found on slide 8 of lecture 5. Comparing
the two graphs we can see that the two signals overlap. This tells us two things. Firstly, the effect
of the genetic variants on the phenotype (serum urate) is likely mediated via regulation of
expression of PDZK1. Secondly, it tells us that PDZK1 is likely a causal gene at this locus, which
makes it a candidate for future studies.
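
The comparison of the two plots was done by eye on the lecture slides. As a very crude sketch of the same idea, the Python below checks whether the strongest GWAS signal and the strongest eQTL signal at a locus fall on the same SNP. The SNP labels and p-values are invented for illustration, the helper names are hypothetical, and this is not a formal colocalisation test.

```python
import math

# Invented p-values for four hypothetical SNPs at one locus.
gwas_p = {"snp_a": 1e-3, "snp_b": 1e-12, "snp_c": 1e-9, "snp_d": 0.2}   # SNP vs serum urate
eqtl_p = {"snp_a": 1e-2, "snp_b": 1e-8, "snp_c": 1e-6, "snp_d": 0.5}    # SNP vs gene expression


def lead_snp(p_values):
    """Return the SNP with the smallest p-value, i.e. the top of the peak."""
    return min(p_values, key=p_values.get)


def neg_log10(p_values):
    """Convert p-values to the heights plotted on a Manhattan / locus zoom plot."""
    return {snp: -math.log10(p) for snp, p in p_values.items()}


gwas_heights = neg_log10(gwas_p)
eqtl_heights = neg_log10(eqtl_p)
same_lead = lead_snp(gwas_p) == lead_snp(eqtl_p)

print("GWAS lead SNP:", lead_snp(gwas_p))
print("eQTL lead SNP:", lead_snp(eqtl_p))
print("Peaks share a lead SNP:", same_lead)
```

Sharing a lead SNP is consistent with the trait signal acting through expression of the gene, but a real analysis would compare the whole pattern of association across the locus rather than a single SNP.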
 
 
