
Hello, thank you for inviting me to this conference. It's too bad that we cannot do it in person, but it's quite ingenious that we managed to do it this way. What I would like to talk to you about today is work that I have been doing with a number of collaborators, whose pictures you see here: Matteo Sesia, Eugene Katsevich, Stephen Bates, and Emmanuel Candès. We have a paper, published earlier this year in Nature Communications, that contains most of the information that I'll be talking about.

The topic I'll be dealing with is gene mapping for complex traits. What is the typical setting? We have n observations on a phenotype Y and on a p-dimensional genotype X. Just to give a sense of the dimensions: p, the number of possible explanatory variables, that is, genetic variants, is large, on the order of a million; n is nowadays also not small, but typically smaller than p: 10,000 in some cases, 300,000 or so in large biobank datasets.

Now, the goal with these data is to pass them through some statistical method so that we can identify a set Ŝ that collects all the genetic variants that are important for the phenotype Y, that is, the variants that carry information on the possible value of the phenotype. Ŝ represents positions in the genome such that variation at those positions has an impact on the phenotype.

As we tackle this problem, we have to remind ourselves that there is a fair amount of dependence between the X's, and that we want to solve the problem in a way that guarantees replicability of the selections. Really selecting the right positions in the genome is very important for the type of medical applications one has in mind: we want to identify Ŝ because we want to develop drugs, so we need to know which biological pathways to target with those drugs, and we don't want to make too many mistakes in the selections. Another challenge is that we really want these selections to be meaningful across ethnic groups, so we want them to be robust.

The typical analysis for this type of data is done at the univariate level, and what I have here is maybe the simplest form of it, but the details don't really matter. We scan through every single genotyped variant, and we say that a variant X_j is null if its distribution is independent of that of the phenotype. The way we test this null hypothesis is with a linear model: we use the phenotype as the outcome variable, we fit a univariate linear model, and we test whether the coefficient of the variant X_j is equal to zero or not. We do this one SNP at a time, and to take into account the fact that we are looking at three hundred thousand or a million SNPs, we control the family-wise error rate at a level of approximately 0.05.
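(For concreteness, here is a minimal sketch of such a marginal scan; the data sizes and code are purely illustrative and not the software actually used in these studies.)

```python
import numpy as np
from scipy import stats

def univariate_scan(X, y, alpha=0.05):
    """Regress y on each SNP separately, test H0: slope = 0, and return
    two-sided p-values plus a Bonferroni threshold (FWER about alpha)."""
    n, p = X.shape
    pvals = np.empty(p)
    for j in range(p):
        slope, intercept, r, pval, se = stats.linregress(X[:, j], y)
        pvals[j] = pval
    return pvals, alpha / p

# toy example: 1,000 individuals, 500 SNPs coded 0/1/2
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(1000, 500)).astype(float)
y = 0.5 * X[:, 10] + rng.normal(size=1000)
pvals, threshold = univariate_scan(X, y)
print("SNPs passing the threshold:", np.where(pvals < threshold)[0])
```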

What one can notice is that this approach definitely doesn't use modern machine learning models, and this is perhaps a bit surprising to the community of statisticians. Historically, I think this is mainly due to the fact that it has been difficult to provide, with standard machine learning models, the guarantees that geneticists are interested in, in particular, remember, the replicability and the sort of guarantee of correct selections that they have in mind. Let me give you an example. This is really old data, from 2009, and it represents a standard way of displaying the results.

On the x-axis you have genomic position; on the y-axis you have minus log10 p-values for each of the tests corresponding to the different genomic positions. The numbers on the x-axis correspond to the different chromosomes, and you see these long plots of p-values. In this plot there is a threshold line at 8 that indicates what was considered significant in this particular study, and you can see that there are some locations in the genome that lead to very small p-values, so we have some discoveries there. The traits here are triglycerides, HDL, and LDL, and this is data from the NFBC cohort. So what are the limitations of this approach?

One limitation is that not all the signal is captured. This plot refers to the data that I showed you a minute ago, though with a larger collection of phenotypes. The entire bar represents the proportion of phenotypic variance that we could explain using genetic information and other covariates like sex and age. What you can see is that the portion of the phenotypic variance explained by the genetic information, which is the one highlighted in red, is fairly small; in fact, the other covariates explain much more. This is the general phenomenon that has been labeled missing heritability, heritability being the portion of the phenotypic variance that should be explainable by genetic information.

The solution of the field to this problem has been to increase the sample size. Undoubtedly, increasing the sample size increases our power to detect loci: the dataset I am showing you only had five thousand individuals, while nowadays we fairly often have datasets with fifty thousand or a hundred thousand individuals. The other type of solution the field has adopted, when the interest is not simply in identifying the important variants but in trying to come up with a prediction of what the phenotypic value may be, is to use models that include many SNPs instead of only the selected ones, sometimes in fact the entire collection of SNPs. The properties of these models are, however, rather unclear, and by now what the field is really concerned with is that models constructed in this fashion, without any selection, really do not seem to replicate across populations.

Another limitation is that the interpretation of the findings is difficult. The plot that you see here is much more contemporary than the ones I have shared with you so far: it refers to height, and the dataset used to construct it is the UK Biobank, a large collection that includes approximately 500,000 individuals with a very large number of genotyped SNPs. You can see that when we use height as a trait in this dataset we get very strong association signals: the minus log10 p-values on the y-axis reach values of 300, and there are lots of points that pass what would be the significance threshold, so very many SNPs seem to signal some association. What are the different colors? The different colors represent an interpretation that geneticists make of these different SNPs. What's happening is that there may be only one causal variant in each of these regions.


That is, there might be only one genetic variation that really influences height, but because there is strong linkage disequilibrium, a strong dependence between all the SNPs in a neighborhood of the genome, the association signal between the phenotype and the causal SNP spreads across all the neighboring SNPs. So when we do univariate tests like the one I have described, all of the SNPs in a neighborhood will show up as significant. In order to interpret how many real findings we have across the genome, geneticists first focus on loci; that is, they might say, well, in this area around megabase 66 there is something going on for height. Then they study the correlation between the different SNPs, trying to understand how many different groups of SNPs there are that might show some association, and the colors here correspond to these different identified groups; the name that is used currently is clumps. So you might say, oh, there is a clump of green SNPs near a clump of purple SNPs and a clump of orange SNPs, and they may all carry an independent signal.


Really, to understand what the independent signals are, after this sort of approximate counting procedure, geneticists resort, locus by locus, to what they call fine mapping, that is, multivariate models that are used to disambiguate the roles of the different variants.

So what I want to propose to you is an alternative approach, where we take the same data and we try to get an output that corresponds to what the geneticists want, but instead of using these univariate models we are going to really leverage modern machine learning. We can really go wild with the type of approach used to identify associations, and that's why I say black box rather than statistics: to convey the fact that we have access to a lot of different algorithms that we don't necessarily have to understand very well. For example, we could use the lasso, we could use random forests, we could use a neural net.

What we're going to output are discoveries that are distinct and coherent, so we're not going to run into the difficulty of interpreting how many loci are really associated with height, or how many variants are important in a given region: our discoveries are going to be distinct. They are going to be replicable, in the sense that we are going to do the best we can from the statistical point of view to increase their replicability by providing FDR control on our discoveries. And we are not going to do this looking at just one region of the genome; we are going to do it across the entire genome, so that our returned set of discoveries has this FDR control and this interpretability genome-wide and represents a comprehensive description of how the genome can affect the phenotype of interest.

There are two ingredients of this slightly different approach that I want to highlight. One is the fact that instead of doing marginal testing we are going to do what we can call conditional testing: we're going to declare that the SNP X_j is null if it is independent of the phenotype Y conditionally on all the rest of the loci in the genome. We want to avoid this duplication of signals across variants in the same region due to linkage disequilibrium, so we want to discover only the SNPs that have something to tell us about the phenotype in addition to the rest of the genome.


The other thing that we are doing differently from the standard approach is that instead of trying to control the family-wise error rate, we're going to control the false discovery rate, the expected value of the number of false selections over the number of features selected. This will translate into a power increase, and it's certainly a better error rate for situations like the one we are interested in, where our phenotypes are complex, that is, we expect them to be influenced by a very large number of SNPs.
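(To make these two ingredients concrete, this is one standard way to write them down; the notation here is added for the reader and is not taken from the slides.)

```latex
% Conditional null hypothesis for variant j, with X_{-j} all the other variants:
H_j : \; Y \perp\!\!\!\perp X_j \mid X_{-j}

% False discovery rate of a selected set \hat{S}:
\mathrm{FDR} = \mathbb{E}\!\left[ \frac{\#\{\, j \in \hat{S} : H_j \text{ is true} \,\}}{\max(\#\hat{S},\, 1)} \right]
```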

How are we going to do this? We are going to leverage a theory, not entirely new (it has been around for five years or so), called knockoffs. Let me just summarize for you at a high level what this approach is trying to do. What I have here is simulated data on 500 variables, and I have simulated an outcome for these 500 explanatory variables. The outcome is a function of the variables represented with the red dots; the variables represented with the pale blue color do not have any influence on the outcome.


Then I ran a lasso on this dataset, with a lambda parameter equal to three, and I obtained a feature importance statistic that is simply the absolute value of the lasso coefficient. What I plot here are the variables, in simple order from 1 to 500, and the values of their feature importance statistics. We can see that the lasso does a fairly good job, in the sense that many of the red variables, the variables that are important, have a high value of the feature importance statistic. However, some of the red variables have a small value of the feature importance statistic, and it's not clear where I should put a cut-off: what value of the absolute value of beta should I use to separate the variables that are important from those that are not? For example, if I were to decide to include in my model all the variables that have a coefficient different from 0, I would have a false discovery proportion of about 70%.
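(To reproduce the flavor of this experiment, here is a small sketch; the dimensions, signal strengths, and penalty value are made up, and scikit-learn's `alpha` is scaled differently from the lambda quoted in the talk.)

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, k = 400, 500, 50                      # k truly important variables
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[rng.choice(p, k, replace=False)] = rng.choice([-1, 1], k) * 0.5
y = X @ beta_true + rng.normal(size=n)

importance = np.abs(Lasso(alpha=0.1).fit(X, y).coef_)   # |lasso coefficient|

selected = np.where(importance > 0)[0]      # "keep every nonzero coefficient"
n_false = np.sum(beta_true[selected] == 0)
print(f"selected {len(selected)} variables, "
      f"empirical FDP = {n_false / max(len(selected), 1):.2f}")
```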

How do we approach this problem? We are going to leverage some dummy variables. Dummy variables are not a new idea in statistics: they have been proposed a number of times in this context, and the idea is that maybe you can augment your problem with a series of dummy variables that you know are not related to your phenotype, and then you monitor when these dummy variables are selected and use that as a measure of when you start making mistakes.

What I have listed here are some proposals for how to create these dummy variables, and the one we're going to use is a derivative of the last one. Without spending too much time on this, because I want to get to the analysis of genetic data, let me just give you an idea of what is possible. One construction would be to create dummy variables that are Gaussian and have the same mean, the same variance, and the same covariance as the original variables. This is the result of an experiment on my first simulated dataset, where I have augmented my set of 500 variables with 500 dummy variables generated in this fashion; the dummy variables correspond to these darker blue points, and what I plot here are the feature importance statistics. You can see that the lasso does a good job recognizing that these are dummy variables.


They have nothing to do with the phenotype, and indeed most of their feature importance statistics are equal to zero. However, you can also see that the distribution of these dark blue dummy variables is fundamentally different from the distribution of the pale blue ones. So if I'm going to use these dummy quantities to understand what happens to the blue ones, and when to stop, I'm not going to have a very informative set of controls, because these controls, these dummy variables, are very different from the null variables in my original dataset. Another way of constructing dummy variables, one that many of you will be familiar with, is that of permutation: you take your original data and you permute the phenotype. Again you create variables that are completely null, and again the lasso recognizes that they are completely null, but we get a distribution that is fundamentally different from that of the original null variables in our dataset. What the knockoff construction tells us is how to create dummy variables whose distribution is much closer to, in fact the same as, that of the null variables in our original dataset.


So what are these special knockoff dummies? Here we see their defining characteristics. We have the original set of variables and the knockoff dummy versions of those, and the property that has to hold is what we call pairwise exchangeability: for any set of variables, if you swap the originals with the dummies you get a vector that has the same distribution. In this case I'm swapping variables two and three, and you can see what happens: if I exchange the original variables 2 and 3 with their knockoff dummies, and place the originals where the knockoffs were, I get the same distribution. The other thing that has to be true is that these dummy variables are independent of the phenotype conditionally on the X's. With these properties the dummy variables are going to do well, because what you can show is that, when you construct feature importance statistics, a null variable and its knockoff have feature importance statistics with the same distribution; so if a variable is null this property holds, and the knockoffs provide a good negative control.
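(For reference, the two requirements on the knockoff copy are usually written as follows, where swap(S) exchanges the original and knockoff coordinates for the variables in S; this is the standard formulation from the knockoffs literature rather than a transcription of the slide.)

```latex
% Pairwise exchangeability: swapping any subset S of originals and knockoffs
% leaves the joint distribution unchanged
(X, \tilde{X})_{\mathrm{swap}(S)} \overset{d}{=} (X, \tilde{X})
\qquad \text{for every } S \subseteq \{1, \dots, p\}

% Knockoffs carry no information about Y beyond what is already in X
\tilde{X} \perp\!\!\!\perp Y \mid X
```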

Let me explain instead, a little bit, why permutations do not provide a good negative control. Say that we have X1, which is an important variable, and X2, which is a null variable; however, X1 and X2 are correlated. Now X2 can act as a proxy for X1, so we expect the correlation between X2 and Y to be different from 0. Let's say that we construct X1-permuted and X2-permuted by permuting the rows of the data matrix X. Then the correlation between X1 and X2 is preserved in the correlation of their permuted copies, while the correlation of X1-permuted with Y, and the correlation of X2-permuted with Y, is zero, because I have broken any relationship between the X's and Y. So I preserve the correlation between the variables, and these permuted variables are definitely null, but what is also happening is that the correlation of X2-permuted with X1 is 0. So X2-permuted does not capture the fact that X2 gets a lift from X1.


If I compare the performance of X2-permuted as a predictor with that of X2, I'm going to see that X2 does a much better job, because X2-permuted has forgotten that the correlation existing between X2 and X1 translates into a correlation between X2 and Y. Knockoffs, instead, are constructed so that this correlation is also preserved: we don't preserve only the correlations among the original variables in the corresponding knockoff copies, but also the correlations across, so X2-tilde, the knockoff of X2, has the same correlation with the original X1 as the original X2 had with X1. This negative control will mimic the fact that X2 can get a lift from X1.
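(A quick numerical illustration of this point, with made-up numbers rather than anything from the talk: the permuted copy of a correlated null variable looks far less predictive than the variable itself.)

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x1 = rng.normal(size=n)                      # important variable
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)     # null variable, but correlated with x1
y = x1 + rng.normal(size=n)                  # only x1 influences y

x2_perm = x2[rng.permutation(n)]             # permuted copy of x2

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print(f"corr(x2, y)       = {corr(x2, y):.2f}")        # about 0.57: x2 proxies x1
print(f"corr(x2_perm, x1) = {corr(x2_perm, x1):.2f}")   # about 0: the lift from x1 is gone
print(f"corr(x2_perm, y)  = {corr(x2_perm, y):.2f}")    # about 0: too easy a negative control
```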

So in order to use this knockoff procedure we need to do two things: we need to construct these knockoff variables, and then we are going to use them to construct importance statistics and come up with a criterion for how to filter, so as to arrive at our set of discoveries, an FDR-controlled set. I'm going to spend a bit of time on knockoff construction, because that's an area where we have done work specific to the case of genotypes.


I do have a set of slides on how to construct the knockoff filter, but we probably do not have time to go over them; I have included them just for the sake of completeness, in case some of the people who will look at these slides are not yet familiar with the knockoff construction.

So here is the general recipe for the construction of knockoffs, which we could call sequential conditionally independent pairs. Let's just look at its realization in the case of three variables. Suppose we have three variables X1, X2, X3 and we want to construct their knockoffs. The assumption we are going to make is that we know the joint distribution of these variables, so we're going to assume that we know the joint distribution of X1, X2, and X3. Given this knowledge, we can sample the knockoff for X1 from the conditional distribution of X1 given X2 and X3. We can also write down the joint law of X1, X2, X3 (the original X's) and the knockoff of X1, which is simply given by the original law multiplied by this conditional density.

We then have a law on four variables, and we are going to sample X2-tilde from the conditional distribution of X2 given X1 and X3 (the original variables) and the knockoff of X1. Once we have done this sampling, we can add this knockoff variable X2-tilde to our collection of variables; we know the joint law of all of these variables, and we're going to sample the knockoff of X3 from the conditional distribution of X3 given the original variables X1 and X2 and the knockoff variables X1-tilde and X2-tilde. Now, this algorithm is interesting because it tells you that variables with the exchangeable distribution that I described do exist, but it might not be very practical in general.

The good news is that it is very practical in the case of Markov chains. This is a schematic representation of a Markov chain: we assume we know the law of the Markov chain, and these are our observed variables. Let's see how we're going to construct knockoffs for them. The first knockoff has to be sampled from the conditional distribution of X1 given the rest; now, because the rest is a Markov chain, X2 is really the only variable we need to condition on, so you can get a sense that this will simplify the computations a little bit. When we go and sample X2-tilde, we have to condition on X1-tilde and on all the other variables except X2; but again, the only relevant variables are X1 and X3, because of the Markov property, and so on. So there is a reduction in the number of variables we have to condition on, and we can carry this through to sample all our knockoffs.

Remember that at each step we have to calculate a new joint distribution, so that we can evaluate the conditional distribution from which we have to sample the knockoff. Calculating this new joint distribution requires calculating a normalizing factor, and this might be, generally speaking, rather tricky; that's where the difficulty of the general recipe lies. But again, in the case of Markov chains we can do a recursive update of this normalizing constant, so that we don't incur a substantial computational cost unless the number of states is very large.
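(To see the sequential recipe in code, here is a brute-force illustration of the sequential conditionally independent pairs idea; it is not the efficient algorithm with the recursive normalization used for long chains, and the function names and toy chain below are added for illustration.)

```python
import numpy as np

def scip_knockoffs_bruteforce(joint_pmf, values, x, rng):
    """Sequential conditionally independent pairs, by brute force.
    joint_pmf: dict mapping each configuration (x_1, ..., x_p) to its probability
    values:    possible values each variable can take
    x:         observed vector (tuple of length p)
    The cost grows like len(values)**(2p); this only illustrates the recipe."""
    p = len(x)
    table = dict(joint_pmf)        # joint pmf of (X_1..X_p, Xt_1..Xt_j) as j grows
    x_tilde = []
    for j in range(p):
        def cond(a, config):
            # P(slot j of `config` equals a | all other slots of `config`)
            weights = []
            for v in values:
                c = list(config)
                c[j] = v
                weights.append(table[tuple(c)])
            c = list(config)
            c[j] = a
            return table[tuple(c)] / sum(weights)
        # sample the knockoff of X_j given the originals and the knockoffs so far
        seen = tuple(x) + tuple(x_tilde)
        probs = np.array([cond(a, seen) for a in values])
        x_tilde.append(values[rng.choice(len(values), p=probs)])
        # extend the joint law with the new knockoff coordinate
        table = {cfg + (a,): pr * cond(a, cfg)
                 for cfg, pr in table.items() for a in values}
    return tuple(x_tilde)

# toy Markov chain with 3 states at 3 positions: pi(x1) T(x2|x1) T(x3|x2)
K = 3
pi = np.array([0.5, 0.3, 0.2])
T = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])
joint = {(a, b, c): pi[a] * T[a, b] * T[b, c]
         for a in range(K) for b in range(K) for c in range(K)}
rng = np.random.default_rng(0)
print(scip_knockoffs_bruteforce(joint, list(range(K)), (0, 1, 2), rng))
```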


Now, why have I talked to you about sampling knockoffs from Markov models? Because genotypes are very often described starting from hidden Markov models. This is a fairly incomprehensible picture, but it represents one of the algorithms that have become very popular in the literature; these are, in a sense, older versions of these algorithms, and there are even more efficient ones that are used routinely in genetics for imputation and for phasing, that is, going from the identification of genotypes to reconstructing which alleles reside on the chromosome that each individual has inherited from the mother or from the father. Without going into details, just believe me that hidden Markov models are very commonly used to model genotypes, and what we discovered is that once we can generate knockoffs for Markov chains, it's also fairly easy to generate knockoffs for hidden Markov models. This is the typical structure of a hidden Markov model: these are our observations and this is the hidden Markov chain.

to generate knockoffs is that if we


start from these observations we impute

the hidden latent variables then we

create no calls for this hidden latent

variables using the algorithm for the

knockoff for the Markov chains and then

give him his knockoff of latent variable

we generate knock of variables using the

emission probability of the hidden

Markov model we are interested in when

we talk about genotypes we are really

talking about two hidden Markov models

one that describes the alleles at an

individual code from the founder and

another one that described individual

then the alleles individual got from the

mother and these two are put together so

we are going to do that in our data sets

Now let me spend just a short amount of time telling you how we are going to use these knockoffs; please refer to the original contributions in order to understand the properties of this method. I just want to recall them here because they are crucial to understanding some of the limitations and the next steps that we need to take in order to analyze genetic data effectively. Once we have these knockoffs, we are going to augment our matrix of variables X with its knockoff copy.

Then we are going to use any machine learning method that we think gives us the best way of measuring the importance of all of these variables for our phenotype Y, and we're going to get these importance variables Z for the original variables and for the knockoffs. These variables are then used to construct what we call knockoff scores: for each original variable j we're going to have one knockoff score W_j, which is a function of the importance variable for the original variable j and of the importance variable for the knockoff copy of variable j. This combination simply has to satisfy a symmetry condition; you can think of it, for example, as the difference between the importance statistic for variable j and the importance statistic for the knockoff of variable j. Now, what are these knockoff scores? Let's familiarize ourselves with them a little bit.


A large W_j says that the variable j appears important; a negative W_j says that the knockoff of that variable seems to be more important than the original variable. What is crucial is that for null variables the W_j are symmetrically distributed; in fact, conditionally on the absolute values of the W's, the signs of the W_j for null variables are i.i.d. coin flips, and we can leverage that to estimate the FDR. So what is our candidate selection set? We want to select variables that have a large W, because a large positive W says the variable appears more important than its knockoff, and important overall; this is our set of candidate selections. Then, exploiting the fact that for null variables the W's have a symmetric distribution, so that the sign of W is a coin flip conditionally on the absolute value of W, we can get an estimate of the null part of the selection, and here is how we can do that with simple calculations.

Ultimately, we estimate the FDP of every candidate selection using, in the denominator, the number of candidate selections, that is, the number of variables with W larger than t, where t is the threshold we have in mind, and, in the numerator, the number of variables that have a knockoff score smaller than minus t. In other words, we look in the opposite direction and count how many negative signs with that particular magnitude of W we have seen. If this is our candidate selection, which includes only the positive values, we see that there is one variable whose W is larger than t in absolute value but has a negative sign, and so that will be our estimate of how many false positives we have in this selection. So this gives us an estimate of the FDP.

And this estimate of the FDP works well enough that we can actually control the FDR if we use this criterion to make our selections.
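(This filtering step is short enough to show in full; a minimal sketch, using the '+1' offset of the knockoffs+ variant that common implementations use for a finite-sample FDR guarantee. The toy scores are made up.)

```python
import numpy as np

def knockoff_filter(W, q=0.1, offset=1):
    """Select variables from knockoff scores W at target FDR q.
    For each candidate threshold t, the FDP is estimated by
    (offset + #{W_j <= -t}) / max(1, #{W_j >= t}); the smallest t whose
    estimate is below q is used."""
    W = np.asarray(W)
    for t in np.sort(np.abs(W[W != 0])):          # candidate thresholds
        fdp_hat = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.where(W >= t)[0]            # indices of selected variables
    return np.array([], dtype=int)                # nothing passes the filter

# toy scores: the first five variables look genuinely important
rng = np.random.default_rng(2)
W = np.concatenate([rng.uniform(2, 5, 5),                             # signals
                    rng.choice([-1, 1], 95) * rng.uniform(0, 1, 95)])  # nulls
print(knockoff_filter(W, q=0.2))
```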

OK, so let's now go back to the genetic problem, and hopefully I have convinced you that we know how to construct knockoffs for these genetic data, because we can approximate the distribution of the genotypes with hidden Markov models; this is a well-accepted, and actually well-proven, technique. So now we can construct these knockoffs, run them, together with the original variables, through whatever machine learning method we are interested in, and use the knockoff filter, which I have just sketched, to make the selections.

So how well is this going to do? Remember that in doing this we are testing conditional hypotheses; the whole point of the knockoffs is that, differently from what permutations do, the knockoff for a variable preserves the correlations of that variable with all the other variables, and so they allow us to test these conditional hypotheses. We are going to select an X_j to be in our model if and only if X_j does not look independent of Y conditionally on all the other genotypes. This is very nice because it avoids the apparently large number of discoveries due to linkage disequilibrium: there is no repetition of the signal. However, there are challenges that come with it: when the dependence between variants is high, the power to detect any of them becomes low, and this is not a problem unique to the knockoffs; it's a problem that really exists for any type of model selection approach.


Let me show you how this problem manifests in the context of knockoffs. Say that we want to create a knockoff of a variable X_j to act as a negative control. Our X_j-tilde needs to be as correlated to any other X_k as X_j is. Let's represent correlation with angles: we have our original X_j and one other variable X_k, and we need to construct a knockoff variable X_j-tilde that has the same angle with X_k. One solution is to make X_j-tilde equal to X_j; that certainly satisfies the correlation constraint, but we realize immediately that this would be a solution that gives us zero power, because when we go and check whether X_j-tilde or X_j is more important for explaining the phenotype, we're going to find that they have exactly the same level of importance, and we're never going to select X_j. So the intelligent solution is to try to place X_j-tilde so that it has the same angle with X_k but is also as far as possible from X_j.


That's what Rina Foygel Barber and Candès did in their original paper, where they actually do this for a fixed design, and it's something that has to be kept in mind in any construction of knockoffs. Now, what happens when we have many correlated predictors? I'm trying to capture this by making my plots go from two variables to three; that's not many, but I think it allows us to capture what's going on. So now, again, we want to create a new variable X_j-tilde that is a knockoff for X_j. If the correlation is strong, we can use the same strategy as before, but you can see that X_j-tilde now ends up much closer to X_j: we have gone as far as possible from X_j, but we're still very close. So if our predictors are highly correlated, this strategy of trying to go as far as possible doesn't buy us much mileage. As I mentioned, this is a problem for any type of multivariate model: those of you who have worked with the lasso will probably be familiar with the fact that it sometimes selects one of the correlated predictors arbitrarily.


In Bayesian model selection, this results in posterior probabilities for each of the correlated predictors that are lower than they should be, because each of them can act as a substitute for the others. In genetics, fine mapping methods resolve this problem by refusing to select among highly correlated predictors and returning a set of possible variants, and we are going to use a similar approach. Instead of trying to make decisions at the single-variant level, we are going to define group hypotheses that correspond to sets of variables, and these are sets of adjacent variables, adjacent positions in the genome. Our group hypotheses are going to be of the type: a group is null if it is independent of the phenotype conditionally on all the variables in the other groups.
conditioning on all the variables in the

other group and you can extend the

exchange ability property for the single

variable to this group setting and let

me just graphically describe what this

means so we have this group of x1 x2

highly correlated variable we have to

create a knockoff copy of them and

instead of trying to keep the current

instead of creating a knock of copy of

x1 only which we know is going to force


us to put X 1 tilde over here very close

to the original exponent we're going to

create a knockoff copy for the entire

group of X 1 and X 2 and this entire

group will need to keep the correlation

with X 3 but can move together so our

knockoff copy of X 1 and X 2 can be

placed over here and you see that the

within correlation is maintained and the

correlation of this group with X 3 is

none

but we do not maintain the correlation

across the variables in this particular

group so this group no cause where the

exchange ability holds only for bad

across groups rather than within groups

So these group knockoffs, where exchangeability holds only across groups rather than within groups, allow us to move the correlated variables together and move them further away from the originals. Once we have these groups, of course, we also have to define a group feature importance, and you have many choices for that. In the following, to keep the data analysis that I'm going to show you simple, what we're going to do is sum up the absolute values of the beta coefficients estimated by the lasso across all the variables in the group.
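(Here is a minimal sketch of that group statistic, assuming the knockoff matrix has already been generated; the function and its parameters are illustrative, and in practice one would also randomize the order of original and knockoff columns so the fit cannot systematically favor one side.)

```python
import numpy as np
from sklearn.linear_model import Lasso

def group_knockoff_scores(X, X_tilde, y, groups, alpha=0.01):
    """Fit the lasso on [X, X_tilde], sum |beta| within each group for the
    originals and for the knockoffs, and return the difference as the
    group knockoff score W.  `groups` gives a group label per column of X;
    knockoff columns inherit the same labels."""
    p = X.shape[1]
    beta = np.abs(Lasso(alpha=alpha).fit(np.hstack([X, X_tilde]), y).coef_)
    labels = np.unique(groups)
    W = np.empty(len(labels))
    for i, g in enumerate(labels):
        cols = np.where(groups == g)[0]
        W[i] = beta[cols].sum() - beta[cols + p].sum()   # originals minus knockoffs
    return labels, W
```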

OK, so we are going to use this to analyze genetic data. The issue with genetic data is that the genotype variables corresponding to different locations across the genome are highly correlated locally. This is because of how we inherit our DNA from our parents: every time there is a meiosis, there is somewhat of a patchwork of the parental chromosomes to create a gamete, and this patchwork happens in such a way that portions of the genome that are close tend to be inherited together. This translates, at the level of genotypes, into correlations between the X_j's, correlations that are local. So what we're going to do is use the squared empirical correlation as a similarity measure between the SNPs, and then partition the collection of SNPs by adjacency-constrained hierarchical clustering: we do a hierarchical clustering, but for variables to be put in the same cluster they have to be adjacent to one another. Then we're going to cut the dendrogram at different heights, each leading to a different resolution of groups.
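(As an illustration of this grouping step, here is a small greedy version of adjacency-constrained clustering; it is not necessarily the exact clustering routine used in the actual analysis, and the cut level and toy correlation matrix are made up.)

```python
import numpy as np

def adjacency_constrained_clusters(R2, cut):
    """Greedy adjacency-constrained agglomerative clustering of SNPs.
    R2[i, j] is the squared correlation between SNPs i and j, with SNPs
    ordered along the chromosome.  Repeatedly merge the most similar pair
    of *adjacent* clusters (similarity = minimum R2 across the two clusters)
    until no adjacent pair exceeds `cut`.  Returns a group label per SNP."""
    p = R2.shape[0]
    clusters = [[j] for j in range(p)]            # start with one SNP per cluster
    def linkage(a, b):
        return min(R2[i, j] for i in a for j in b)
    while len(clusters) > 1:
        sims = [linkage(clusters[k], clusters[k + 1]) for k in range(len(clusters) - 1)]
        k = int(np.argmax(sims))
        if sims[k] < cut:                         # this plays the role of the dendrogram cut
            break
        clusters[k] = clusters[k] + clusters.pop(k + 1)
    labels = np.empty(p, dtype=int)
    for g, members in enumerate(clusters):
        labels[members] = g
    return labels

# toy example: six SNPs forming two tight LD blocks
R2 = np.array([[1.0, .9, .8, .1, .1, .1],
               [.9, 1.0, .85, .1, .1, .1],
               [.8, .85, 1.0, .1, .1, .1],
               [.1, .1, .1, 1.0, .9, .8],
               [.1, .1, .1, .9, 1.0, .9],
               [.1, .1, .1, .8, .9, 1.0]])
print(adjacency_constrained_clusters(R2, cut=0.5))   # -> [0 0 0 1 1 1]
```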

What you have here is actually a portion of one chromosome, and the blue and white shading marks the groups of SNPs corresponding to cutting the dendrogram at the level indicated by the broken line at the top. Here you have single SNPs, with the distance between them represented by the correlation. Depending on where we cut, we are going to have different resolutions; for each of these different resolutions we are going to construct knockoffs for the groups of variables, and then we are going to analyze the data using the lasso.

The examples I'm going to show you come from the UK Biobank, which I have already mentioned. In this dataset we are looking at about five or six hundred thousand SNPs, and we are only looking at a subset of the individuals, the unrelated British individuals, to avoid problems related to family relatedness and to different ethnic backgrounds. I have to say that the current version of these algorithms can actually deal both with family relationships and with individuals of different ethnic backgrounds, but I don't yet have results in good enough shape to share with you. Here is a sense of the different resolutions that we construct. We indicate them in terms of resolution: a hundred percent means that you have one SNP per block, so you are using the finest resolution you could get; this means that we have six hundred thousand blocks, and the average block width in megabases is zero, because each block is represented by one SNP only. Then you have decreasing levels of resolution; the coarsest resolution we use is 2 percent, which corresponds to about 12,000 blocks with an average block width of 0.2 megabases.

What you can see in this column is the average correlation between each knockoff and its original variable, and you can see exactly the phenomenon I described for you, the reason I said that instead of constructing knockoffs for single variables we are going to construct them for groups. When we look at single variables, each variable is correlated to its knockoff with an average R squared of 0.7, which is quite high; instead, when we look at the coarsest resolution, the correlation between each variable and its knockoff is lower, down to 0.2. This graph represents the same information, but instead of giving you one average value it gives you box plots; and this other graph simply shows that the correlations that we have to conserve in order to construct correct knockoffs are indeed conserved: if you look at the correlation between one variable and another, and between that variable and the knockoff of the other, they are very well conserved.

Here is a snapshot view of how this method works and the type of results it provides. This is again height, and we're looking at chromosome 12. At the very top you have the entire chromosome 12, and you have a representation with multiple rows corresponding to the different resolutions: at the bottom you have the coarsest resolution, with a box drawn in correspondence of every group of SNPs that has been selected as significant, and as you move up you have boxes corresponding to the higher resolutions, and so on. It's perhaps easier to look at this zoomed-in version of the central region; here you can count the different resolutions, which are indicated on the y-axis. You can see that we detect that there is something going on: this group of SNPs has something to tell me about height conditionally on all the other blocks across the genome.

Then, when I try to be a bit more specific and ask which among these SNPs is important, I can also reject more specific hypotheses, which say: this block of SNPs has something to tell me about height conditionally on all blocks of comparable size across the genome. However, I am not able to make an even more specific discovery that says exactly which SNPs in this block are important. In other positions, like this one, I actually can make very specific discoveries and say: this SNP here, and it is actually only one variable, can tell you something about height conditionally on every other single SNP in the genome. By contrast, this is the plot that I already showed you at the very beginning of this presentation, which is what you get from the analysis of this region with the standard univariate approach. You can see that, while there is an attempt to interpret these p-values with the different colors, the different colored blocks are not terribly interpretable, because if you actually look, they overlap; so we don't know what we're talking about, where the signal is, or how specific the signal can be.

To bring this point out more clearly, let's look at a simulation, where we know what the right answer is and we can compare how these two approaches fare. This is simulated data, and the phenotype truly depends on five different variants, indicated by the little black lines at the bottom. In the univariate analysis, when a SNP is really one of the SNPs that are important for the phenotype, instead of a black dot the SNP is represented with a green asterisk. The difference between one column and the next is the relative importance of these SNPs: in this column the heritability is small, so these SNPs explain a smaller portion of the phenotype, while in the other column the heritability is stronger, so these SNPs explain a bigger portion of the phenotype, or, if you want, the effect sizes are larger.


Now let's see what happens with the univariate models. The univariate models in both cases detect that there is something going on, because in both cases there are some SNPs that pass the significance threshold, so the geneticist would be alerted to the fact that there is something going on. Now, which of the two cases is better? When you have lower signal, you have a smaller number of SNPs that pass the threshold; this means that some of the true SNPs do not pass it, but most of those that do pass are actually important. When you have stronger signal, which is arguably what you would want, the picture is much more muddled: again you have a sense that something is going on, but you see that a larger number of gray dots pass the threshold, and in fact the importance of the true SNPs, as opposed to the non-causal ones, those that are associated just because of linkage disequilibrium, is very difficult to discern here. If you look at what happens with the knockoffs, you have a picture that is in some sense similar and in some sense profoundly different.


In both cases, at the low resolution, the methodology picks up that there is something going on; the difference between low signal and high signal is that in the low-signal case the method cannot resolve very much among the five SNPs, in fact it cannot resolve among them at all, and we just know, not very specifically, that in this portion of the genome there is something going on. When we increase the signal, the method can resolve the importance of the different SNPs up to a very high resolution; in fact, here you can distinguish every single one of the five SNPs independently. We have done a number of simulations, and since we are running out of time I don't want to bore you with them; you can look at the published paper to get a sense of them. The simulations were done trying to be realistic, so we have used real genotypes and we have simulated phenotypes under a variety of scenarios, and we have compared our methodology with others; this is a list of the methodologies that we have used.


What we see is that the methods we have proposed always control the false discovery proportion and have power that is comparable to, or better than, the other methods; but what we really shine at is that we can control the false discovery rate, which is not a trivial task, and we can localize precisely and informatively where the signal is. Before finishing, I want to show you some of the results on real data. Again we are in the UK Biobank, but here, instead of simulating phenotypes, we analyze true phenotypes, and I am presenting the results at low resolution, compared to the results of a univariate method that is state of the art and whose results have been post-processed so that they can be interpreted as independent loci in the genome, so that they are equivalent to these low-resolution blocks. What you have here is the number of findings, and you can see that our approach has a much larger number of findings than this competing one.
larger number of findings than this um

completing one a couple of slides on

computational resources clearly doing

this it's not something that happens

immediately but in truth the largest


amount of time it's spent 15 these hmm

models for our data and these agent and

models are actually fit on a regular

basis for genetics data for a variety of

other reasons so you can imagine that

this is cost that the analyst would bear

in any case geneticists treat these hmm

models to impute snips regularly to

phrase their data so we don't really

have to add a very serious computational

burden on top of what is already done so

one way is looking at that these

contexts this is fitting the hmm it's

where most of the time goes then you

generate knock-offs and then you have to

do the analysis now I said this is a

time that the geneticists would spend

anyway generate knockoff is what we mean

to do and this is some serious time

contribution but not unbearable you also

have to keep in mind that for every new

phenotype that you might want to analyze

on the same cohort you can reuse the

same knock offs so this final mapping

step can be really quite fast

Matteo Sesia has a very nice GitHub site where he keeps the code and some simulation analyses, with some markdown documents so that you can actually try out the method. Let me thank you for your attention, and thank you for putting up this conference in these strange times of pandemic. In the United States, where I find myself, on top of the pandemic we are also living through very sad moments because of racism, so let me conclude with an invitation for everybody to do what we can to combat racism. Thank you very much.
