
Hello, thank you for inviting me to this conference. It's too bad that we cannot do it in person, but it's quite ingenious that we managed to do it this way. What I would like to talk to you about today is work that I have been doing with a number of collaborators, whose pictures you see here: Matteo Sesia, Eugene Katsevich, Stephen Bates, and Emmanuel Candès. We have a paper, published earlier this year in Nature Communications, that contains most of the information that I'll be talking about.

The topic I'll be dealing with is gene mapping for complex traits. What is the typical setting? We have n observations on a phenotype Y and on a p-dimensional genotype X. Just to give a sense of the dimensions: p, the number of possible explanatory variables, that is, genetic variants, is large, on the order of a million; n is nowadays also not small, but typically smaller than p: 10,000 in some cases, 300,000 or so in large biobank datasets.

Now, the goal with these data is to pass them through some statistical method so that we can identify a set Ŝ that collects all the genetic variants that are important for the phenotype Y, that is, the variants that carry information on the possible value of the phenotype. Ŝ represents positions in the genome such that variation at those positions has an impact on the phenotype.

As we tackle this problem, we have to remind ourselves that there is a fair amount of dependence between the X's, and that we want to solve the problem in a way that guarantees replicability of the selections. Really selecting the right positions in the genome is very important for the type of medical applications one has in mind: we want to identify Ŝ because we want to develop drugs, so we need to know which biological pathways to target with those drugs, and we don't want to make too many mistakes in the selections. Another challenge is that we really want these selections to be meaningful across ethnic groups, so we want them to be robust.

The typical analysis for this type of data is done at the univariate level, and what I have here is maybe the simplest form of it, but the details don't really matter. We scan through every single genotyped variant, and we say that a variant X_j is null if its distribution is independent of that of the phenotype. The way we test this null hypothesis is with a linear model: we use the phenotype as the outcome variable, we fit a univariate linear model, and we test whether the coefficient of the variant X_j is equal to zero or not. We do this one SNP at a time, and to take into account the fact that we are looking at three hundred thousand or a million SNPs, we control the family-wise error rate at a level of approximately 0.05.
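(For concreteness, here is a minimal sketch of such a marginal scan; the data sizes and code are purely illustrative and not the software actually used in these studies.)

```python
import numpy as np
from scipy import stats

def univariate_scan(X, y, alpha=0.05):
    """Regress y on each SNP separately, test H0: slope = 0, and return
    two-sided p-values plus a Bonferroni threshold (FWER about alpha)."""
    n, p = X.shape
    pvals = np.empty(p)
    for j in range(p):
        slope, intercept, r, pval, se = stats.linregress(X[:, j], y)
        pvals[j] = pval
    return pvals, alpha / p

# toy example: 1,000 individuals, 500 SNPs coded 0/1/2
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(1000, 500)).astype(float)
y = 0.5 * X[:, 10] + rng.normal(size=1000)
pvals, threshold = univariate_scan(X, y)
print("SNPs passing the threshold:", np.where(pvals < threshold)[0])
```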

What one can notice is that this approach definitely doesn't use modern machine learning models, and this is perhaps a bit surprising to the community of statisticians. Historically, I think this is mainly due to the fact that it has been difficult to provide, with standard machine learning models, the guarantees that geneticists are interested in, in particular, remember, the replicability and the sort of guarantee of correct selections that they have in mind. Let me give you an example. This is really old data, from 2009, and it represents a standard way of displaying the results.

On the x-axis you have genomic position; on the y-axis you have minus log10 p-values for each of the tests corresponding to the different genomic positions. The numbers on the x-axis correspond to the different chromosomes, and you see these long plots of p-values. In this plot there is a threshold line at 8 that indicates what was considered significant in this particular study, and you can see that there are some locations in the genome that lead to very small p-values, so we have some discoveries there. The traits here are triglycerides, HDL, and LDL, and this is data from the NFBC cohort. So what are the limitations of this approach?

One limitation is that not all the signal is captured. This plot refers to the data that I showed you a minute ago, though with a larger collection of phenotypes. The entire bar represents the proportion of phenotypic variance that we could explain using genetic information and other covariates like sex and age. What you can see is that the portion of the phenotypic variance explained by the genetic information, which is the one highlighted in red, is fairly small; in fact, the other covariates explain much more. This is the general phenomenon that has been labeled missing heritability, heritability being the portion of the phenotypic variance that should be explainable by genetic information.

The solution of the field to this problem has been to increase the sample size. Undoubtedly, increasing the sample size increases our power to detect loci: the dataset I am showing you only had five thousand individuals, while nowadays we fairly often have datasets with fifty thousand or a hundred thousand individuals. The other type of solution the field has adopted, when the interest is not simply in identifying the important variants but in trying to come up with a prediction of what the phenotypic value may be, is to use models that include many SNPs instead of only the selected ones, sometimes in fact the entire collection of SNPs. The properties of these models are, however, rather unclear, and by now what the field is really concerned with is that models constructed in this fashion, without any selection, really do not seem to replicate across populations.

Another limitation is that the interpretation of the findings is difficult. The plot that you see here is much more contemporary than the ones I have shared with you so far: it refers to height, and the dataset used to construct it is the UK Biobank, a large collection that includes approximately 500,000 individuals with a very large number of genotyped SNPs. You can see that when we use height as a trait in this dataset we get very strong association signals: the minus log10 p-values on the y-axis reach values of 300, and there are lots of points that pass what would be the significance threshold, so very many SNPs seem to signal some association. What are the different colors? The different colors represent an interpretation that geneticists make of these different SNPs. What's happening is that there may be only one causal variant in each of these regions.


That is, there might be only one genetic variation that really influences height, but because there is strong linkage disequilibrium, a strong dependence between all the SNPs in a neighborhood of the genome, the association signal between the phenotype and the causal SNP spreads across all the neighboring SNPs. So when we do univariate tests like the one I have described, all of the SNPs in a neighborhood will show up as significant. In order to interpret how many real findings we have across the genome, geneticists first focus on loci; that is, they might say, well, in this area around megabase 66 there is something going on for height. Then they study the correlation between the different SNPs, trying to understand how many different groups of SNPs there are that might show some association, and the colors here correspond to these different identified groups; the name that is used currently is clumps. So you might say, oh, there is a clump of green SNPs near a clump of purple SNPs and a clump of orange SNPs, and they may all carry an independent signal.


Really, to understand what the independent signals are, after this sort of approximate counting procedure, geneticists resort, locus by locus, to what they call fine mapping, that is, multivariate models that are used to disambiguate the roles of the different variants.

So what I want to propose to you is an alternative approach, where we take the same data and we try to get an output that corresponds to what the geneticists want, but instead of using these univariate models we are going to really leverage modern machine learning. We can really go wild with the type of approach used to identify associations, and that's why I say black box rather than statistics: to convey the fact that we have access to a lot of different algorithms that we don't necessarily have to understand very well. For example, we could use the lasso, we could use random forests, we could use a neural net.

What we're going to output are discoveries that are distinct and coherent, so we're not going to run into the difficulty of interpreting how many loci are really associated with height, or how many variants are important in a given region: our discoveries are going to be distinct. They are going to be replicable, in the sense that we are going to do the best we can from the statistical point of view to increase their replicability by providing FDR control on our discoveries. And we are not going to do this looking at just one region of the genome; we are going to do it across the entire genome, so that our returned set of discoveries has this FDR control and this interpretability genome-wide and represents a comprehensive description of how the genome can affect the phenotype of interest.

There are two ingredients of this slightly different approach that I want to highlight. One is the fact that instead of doing marginal testing we are going to do what we can call conditional testing: we're going to declare that the SNP X_j is null if it is independent of the phenotype Y conditionally on all the rest of the loci in the genome. We want to avoid this duplication of signals across variants in the same region due to linkage disequilibrium, so we want to discover only the SNPs that have something to tell us about the phenotype in addition to the rest of the genome.


The other thing that we are doing differently from the standard approach is that instead of trying to control the family-wise error rate, we're going to control the false discovery rate, the expected value of the number of false selections over the number of features selected. This will translate into a power increase, and it's certainly a better error rate for situations like the one we are interested in, where our phenotypes are complex, that is, we expect them to be influenced by a very large number of SNPs.
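(To make these two ingredients concrete, this is one standard way to write them down; the notation here is added for the reader and is not taken from the slides.)

```latex
% Conditional null hypothesis for variant j, with X_{-j} all the other variants:
H_j : \; Y \perp\!\!\!\perp X_j \mid X_{-j}

% False discovery rate of a selected set \hat{S}:
\mathrm{FDR} = \mathbb{E}\!\left[ \frac{\#\{\, j \in \hat{S} : H_j \text{ is true} \,\}}{\max(\#\hat{S},\, 1)} \right]
```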

How are we going to do this? We are going to leverage a theory, not entirely new (it has been around for five years or so), called knockoffs. Let me just summarize for you at a high level what this approach is trying to do. What I have here is simulated data on 500 variables, and I have simulated an outcome for these 500 explanatory variables. The outcome is a function of the variables represented with the red dots; the variables represented with the pale blue color do not have any influence on the outcome.


Then I ran a lasso on this dataset, with a lambda parameter equal to three, and I obtained a feature importance statistic that is simply the absolute value of the lasso coefficient. What I plot here are the variables, in simple order from 1 to 500, and the values of their feature importance statistics. We can see that the lasso does a fairly good job, in the sense that many of the red variables, the variables that are important, have a high value of the feature importance statistic. However, some of the red variables have a small value of the feature importance statistic, and it's not clear where I should put a cut-off: what value of the absolute value of beta should I use to separate the variables that are important from those that are not? For example, if I were to decide to include in my model all the variables that have a coefficient different from 0, I would have a false discovery proportion of about 70%.
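(To reproduce the flavor of this experiment, here is a small sketch; the dimensions, signal strengths, and penalty value are made up, and scikit-learn's `alpha` is scaled differently from the lambda quoted in the talk.)

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, k = 400, 500, 50                      # k truly important variables
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[rng.choice(p, k, replace=False)] = rng.choice([-1, 1], k) * 0.5
y = X @ beta_true + rng.normal(size=n)

importance = np.abs(Lasso(alpha=0.1).fit(X, y).coef_)   # |lasso coefficient|

selected = np.where(importance > 0)[0]      # "keep every nonzero coefficient"
n_false = np.sum(beta_true[selected] == 0)
print(f"selected {len(selected)} variables, "
      f"empirical FDP = {n_false / max(len(selected), 1):.2f}")
```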

How do we approach this problem? We are going to leverage some dummy variables. Dummy variables are not a new idea in statistics: they have been proposed a number of times in this context, and the idea is that maybe you can augment your problem with a series of dummy variables that you know are not related to your phenotype, and then you monitor when these dummy variables are selected and use that as a measure of when you start making mistakes.

What I have listed here are some proposals for how to create these dummy variables, and the one we're going to use is a derivative of the last one. Without spending too much time on this, because I want to get to the analysis of genetic data, let me just give you an idea of what is possible. One construction would be to create dummy variables that are Gaussian and have the same mean, the same variance, and the same covariance as the original variables. This is the result of an experiment on my first simulated dataset, where I have augmented my set of 500 variables with 500 dummy variables generated in this fashion; the dummy variables correspond to these darker blue points, and what I plot here are the feature importance statistics. You can see that the lasso does a good job recognizing that these are dummy variables.


They have nothing to do with the phenotype, and indeed most of their feature importance statistics are equal to zero. However, you can also see that the distribution of these dark blue dummy variables is fundamentally different from the distribution of the pale blue ones. So if I'm going to use these dummy quantities to understand what happens to the blue ones, and when to stop, I'm not going to have a very informative set of controls, because these controls, these dummy variables, are very different from the null variables in my original dataset. Another way of constructing dummy variables, one that many of you will be familiar with, is that of permutation: you take your original data and you permute the phenotype. Again you create variables that are completely null, and again the lasso recognizes that they are completely null, but we get a distribution that is fundamentally different from that of the original null variables in our dataset. What the knockoff construction tells us is how to create dummy variables whose distribution is much closer to, in fact the same as, that of the null variables in our original dataset.


So what are these special knockoff dummies? Here we see their defining characteristics. We have the original set of variables and the knockoff dummy versions of those, and the property that has to hold is what we call pairwise exchangeability: for any set of variables, if you swap the originals with the dummies you get a vector that has the same distribution. In this case I'm swapping variables two and three, and you can see what happens: if I exchange the original variables 2 and 3 with their knockoff dummies, and place the originals where the knockoffs were, I get the same distribution. The other thing that has to be true is that these dummy variables are independent of the phenotype conditionally on the X's. With these properties the dummy variables are going to do well, because what you can show is that, when you construct feature importance statistics, a null variable and its knockoff have feature importance statistics with the same distribution; so if a variable is null this property holds, and the knockoffs provide a good negative control.
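(For reference, the two requirements on the knockoff copy are usually written as follows, where swap(S) exchanges the original and knockoff coordinates for the variables in S; this is the standard formulation from the knockoffs literature rather than a transcription of the slide.)

```latex
% Pairwise exchangeability: swapping any subset S of originals and knockoffs
% leaves the joint distribution unchanged
(X, \tilde{X})_{\mathrm{swap}(S)} \overset{d}{=} (X, \tilde{X})
\qquad \text{for every } S \subseteq \{1, \dots, p\}

% Knockoffs carry no information about Y beyond what is already in X
\tilde{X} \perp\!\!\!\perp Y \mid X
```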

Let me explain instead, a little bit, why permutations do not provide a good negative control. Say that we have X1, which is an important variable, and X2, which is a null variable; however, X1 and X2 are correlated. Now X2 can act as a proxy for X1, so we expect the correlation between X2 and Y to be different from 0. Let's say that we construct X1-permuted and X2-permuted by permuting the rows of the data matrix X. Then the correlation between X1 and X2 is preserved in the correlation of their permuted copies, while the correlation of X1-permuted with Y, and the correlation of X2-permuted with Y, is zero, because I have broken any relationship between the X's and Y. So I preserve the correlation between the variables, and these permuted variables are definitely null, but what is also happening is that the correlation of X2-permuted with X1 is 0. So X2-permuted does not capture the fact that X2 gets a lift from X1.


If I compare the performance of X2-permuted as a predictor with that of X2, I'm going to see that X2 does a much better job, because X2-permuted has forgotten that the correlation existing between X2 and X1 translates into a correlation between X2 and Y. Knockoffs, instead, are constructed so that this correlation is also preserved: we don't preserve only the correlations among the original variables in the corresponding knockoff copies, but also the correlations across, so X2-tilde, the knockoff of X2, has the same correlation with the original X1 as the original X2 had with X1. This negative control will mimic the fact that X2 can get a lift from X1.
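(A quick numerical illustration of this point, with made-up numbers rather than anything from the talk: the permuted copy of a correlated null variable looks far less predictive than the variable itself.)

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x1 = rng.normal(size=n)                      # important variable
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)     # null variable, but correlated with x1
y = x1 + rng.normal(size=n)                  # only x1 influences y

x2_perm = x2[rng.permutation(n)]             # permuted copy of x2

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print(f"corr(x2, y)       = {corr(x2, y):.2f}")        # about 0.57: x2 proxies x1
print(f"corr(x2_perm, x1) = {corr(x2_perm, x1):.2f}")   # about 0: the lift from x1 is gone
print(f"corr(x2_perm, y)  = {corr(x2_perm, y):.2f}")    # about 0: too easy a negative control
```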

So in order to use this knockoff procedure we need to do two things: we need to construct these knockoff variables, and then we are going to use them to construct importance statistics and come up with a criterion for how to filter, so as to arrive at our set of discoveries, an FDR-controlled set. I'm going to spend a bit of time on knockoff construction, because that's an area where we have done work specific to the case of genotypes.


I do have a set of slides on how to construct the knockoff filter, but we probably do not have time to go over them; I have included them just for the sake of completeness, in case some of the people who will look at these slides are not yet familiar with the knockoff construction.

So here is the general recipe for the construction of knockoffs, which we could call sequential conditionally independent pairs. Let's just look at its realization in the case of three variables. Suppose we have three variables X1, X2, X3 and we want to construct their knockoffs. The assumption we are going to make is that we know the joint distribution of these variables, so we're going to assume that we know the joint distribution of X1, X2, and X3. Given this knowledge, we can sample the knockoff for X1 from the conditional distribution of X1 given X2 and X3. We can also write down the joint law of X1, X2, X3 (the original X's) and the knockoff of X1, which is simply given by the original law multiplied by this conditional density.

We then have a law on four variables, and we are going to sample X2-tilde from the conditional distribution of X2 given X1 and X3 (the original variables) and the knockoff of X1. Once we have done this sampling, we can add this knockoff variable X2-tilde to our collection of variables; we know the joint law of all of these variables, and we're going to sample the knockoff of X3 from the conditional distribution of X3 given the original variables X1 and X2 and the knockoff variables X1-tilde and X2-tilde. Now, this algorithm is interesting because it tells you that variables with the exchangeable distribution that I described do exist, but it might not be very practical in general.

The good news is that it is very practical in the case of Markov chains. This is a schematic representation of a Markov chain: we assume we know the law of the Markov chain, and these are our observed variables. Let's see how we're going to construct knockoffs for them. The first knockoff has to be sampled from the conditional distribution of X1 given the rest; now, because the rest is a Markov chain, X2 is really the only variable we need to condition on, so you can get a sense that this will simplify the computations a little bit. When we go and sample X2-tilde, we have to condition on X1-tilde and on all the other variables except X2; but again, the only relevant variables are X1 and X3, because of the Markov property, and so on. So there is a reduction in the number of variables we have to condition on, and we can carry this through to sample all our knockoffs.

Remember that at each step we have to calculate a new joint distribution, so that we can evaluate the conditional distribution from which we have to sample the knockoff. Calculating this new joint distribution requires calculating a normalizing factor, and this might be, generally speaking, rather tricky; that's where the difficulty of the general recipe lies. But again, in the case of Markov chains we can do a recursive update of this normalizing constant, so that we don't incur a substantial computational cost unless the number of states is very large.
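(To see the sequential recipe in code, here is a brute-force illustration of the sequential conditionally independent pairs idea; it is not the efficient algorithm with the recursive normalization used for long chains, and the function names and toy chain below are added for illustration.)

```python
import numpy as np

def scip_knockoffs_bruteforce(joint_pmf, values, x, rng):
    """Sequential conditionally independent pairs, by brute force.
    joint_pmf: dict mapping each configuration (x_1, ..., x_p) to its probability
    values:    possible values each variable can take
    x:         observed vector (tuple of length p)
    The cost grows like len(values)**(2p); this only illustrates the recipe."""
    p = len(x)
    table = dict(joint_pmf)        # joint pmf of (X_1..X_p, Xt_1..Xt_j) as j grows
    x_tilde = []
    for j in range(p):
        def cond(a, config):
            # P(slot j of `config` equals a | all other slots of `config`)
            weights = []
            for v in values:
                c = list(config)
                c[j] = v
                weights.append(table[tuple(c)])
            c = list(config)
            c[j] = a
            return table[tuple(c)] / sum(weights)
        # sample the knockoff of X_j given the originals and the knockoffs so far
        seen = tuple(x) + tuple(x_tilde)
        probs = np.array([cond(a, seen) for a in values])
        x_tilde.append(values[rng.choice(len(values), p=probs)])
        # extend the joint law with the new knockoff coordinate
        table = {cfg + (a,): pr * cond(a, cfg)
                 for cfg, pr in table.items() for a in values}
    return tuple(x_tilde)

# toy Markov chain with 3 states at 3 positions: pi(x1) T(x2|x1) T(x3|x2)
K = 3
pi = np.array([0.5, 0.3, 0.2])
T = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])
joint = {(a, b, c): pi[a] * T[a, b] * T[b, c]
         for a in range(K) for b in range(K) for c in range(K)}
rng = np.random.default_rng(0)
print(scip_knockoffs_bruteforce(joint, list(range(K)), (0, 1, 2), rng))
```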


Now, why have I talked to you about sampling knockoffs from Markov models? Because genotypes are very often described starting from hidden Markov models. This is a fairly incomprehensible picture, but it represents one of the algorithms that have become very popular in the literature; these are, in a sense, older versions of these algorithms, and there are even more efficient ones that are used routinely in genetics for imputation and for phasing, that is, going from the identification of genotypes to reconstructing which alleles reside on the chromosome that each individual has inherited from the mother or from the father. Without going into details, just believe me that hidden Markov models are very commonly used to model genotypes, and what we discovered is that once we can generate knockoffs for Markov chains, it's also fairly easy to generate knockoffs for hidden Markov models. This is the typical structure of a hidden Markov model: these are our observations and this is the hidden Markov chain.

to generate knockoffs is that if we


start from these observations we impute

the hidden latent variables then we

create no calls for this hidden latent

variables using the algorithm for the

knockoff for the Markov chains and then

give him his knockoff of latent variable

we generate knock of variables using the

emission probability of the hidden

Markov model we are interested in when

we talk about genotypes we are really

talking about two hidden Markov models

one that describes the alleles at an

individual code from the founder and

another one that described individual

then the alleles individual got from the

mother and these two are put together so

we are going to do that in our data sets

Now let me spend just a short amount of time telling you how we are going to use these knockoffs; please refer to the original contributions in order to understand the properties of this method. I just want to recall them here because they are crucial to understanding some of the limitations and the next steps that we need to take in order to analyze genetic data effectively. Once we have these knockoffs, we are going to augment our matrix of variables X with its knockoff copy.

Then we are going to use any machine learning method that we think gives us the best way of measuring the importance of all of these variables for our phenotype Y, and we're going to get these importance variables Z for the original variables and for the knockoffs. These variables are then used to construct what we call knockoff scores: for each original variable j we're going to have one knockoff score W_j, which is a function of the importance variable for the original variable j and of the importance variable for the knockoff copy of variable j. This combination simply has to satisfy a symmetry condition; you can think of it, for example, as the difference between the importance statistic for variable j and the importance statistic for the knockoff of variable j. Now, what are these knockoff scores? Let's familiarize ourselves with them a little bit.


A large W_j says that the variable j appears important; a negative W_j says that the knockoff of that variable seems to be more important than the original variable. What is crucial is that for null variables the W_j are symmetrically distributed; in fact, conditionally on the absolute values of the W's, the signs of the W_j for null variables are i.i.d. coin flips, and we can leverage that to estimate the FDR. So what is our candidate selection set? We want to select variables that have a large W, because a large positive W says the variable appears more important than its knockoff, and important overall; this is our set of candidate selections. Then, exploiting the fact that for null variables the W's have a symmetric distribution, so that the sign of W is a coin flip conditionally on the absolute value of W, we can get an estimate of the null part of the selection, and here is how we can do that with simple calculations.

Ultimately, we estimate the FDP of every candidate selection using, in the denominator, the number of candidate selections, that is, the number of variables with W larger than t, where t is the threshold we have in mind, and, in the numerator, the number of variables that have a knockoff score smaller than minus t. In other words, we look in the opposite direction and count how many negative signs with that particular magnitude of W we have seen. If this is our candidate selection, which includes only the positive values, we see that there is one variable whose W is larger than t in absolute value but has a negative sign, and so that will be our estimate of how many false positives we have in this selection. So this gives us an estimate of the FDP.

And this estimate of the FDP works well enough that we can actually control the FDR if we use this criterion to make our selections.
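(This filtering step is short enough to show in full; a minimal sketch, using the '+1' offset of the knockoffs+ variant that common implementations use for a finite-sample FDR guarantee. The toy scores are made up.)

```python
import numpy as np

def knockoff_filter(W, q=0.1, offset=1):
    """Select variables from knockoff scores W at target FDR q.
    For each candidate threshold t, the FDP is estimated by
    (offset + #{W_j <= -t}) / max(1, #{W_j >= t}); the smallest t whose
    estimate is below q is used."""
    W = np.asarray(W)
    for t in np.sort(np.abs(W[W != 0])):          # candidate thresholds
        fdp_hat = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.where(W >= t)[0]            # indices of selected variables
    return np.array([], dtype=int)                # nothing passes the filter

# toy scores: the first five variables look genuinely important
rng = np.random.default_rng(2)
W = np.concatenate([rng.uniform(2, 5, 5),                             # signals
                    rng.choice([-1, 1], 95) * rng.uniform(0, 1, 95)])  # nulls
print(knockoff_filter(W, q=0.2))
```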

OK, so let's now go back to the genetic problem, and hopefully I have convinced you that we know how to construct knockoffs for these genetic data, because we can approximate the distribution of the genotypes with hidden Markov models; this is a well-accepted, and actually well-proven, technique. So now we can construct these knockoffs, run them, together with the original variables, through whatever machine learning method we are interested in, and use the knockoff filter, which I have just sketched, to make the selections.

So how well is this going to do? Remember that in doing this we are testing conditional hypotheses; the whole point of the knockoffs is that, differently from what permutations do, the knockoff for a variable preserves the correlations of that variable with all the other variables, and so they allow us to test these conditional hypotheses. We are going to select an X_j to be in our model if and only if X_j does not look independent of Y conditionally on all the other genotypes. This is very nice because it avoids the apparently large number of discoveries due to linkage disequilibrium: there is no repetition of the signal. However, there are challenges that come with it: when the dependence between variants is high, the power to detect any of them becomes low, and this is not a problem unique to the knockoffs; it's a problem that really exists for any type of model selection approach.


Let me show you how this problem manifests in the context of knockoffs. Say that we want to create a knockoff of a variable X_j to act as a negative control. Our X_j-tilde needs to be as correlated to any other X_k as X_j is. Let's represent correlation with angles: we have our original X_j and one other variable X_k, and we need to construct a knockoff variable X_j-tilde that has the same angle with X_k. One solution is to make X_j-tilde equal to X_j; that certainly satisfies the correlation constraint, but we realize immediately that this would be a solution that gives us zero power, because when we go and check whether X_j-tilde or X_j is more important for explaining the phenotype, we're going to find that they have exactly the same level of importance, and we're never going to select X_j. So the intelligent solution is to try to place X_j-tilde so that it has the same angle with X_k but is also as far as possible from X_j.


That's what Rina Foygel Barber and Candès did in their original paper, where they actually do this for a fixed design, and it's something that has to be kept in mind in any construction of knockoffs. Now, what happens when we have many correlated predictors? I'm trying to capture this by making my plots go from two variables to three; that's not many, but I think it allows us to capture what's going on. So now, again, we want to create a new variable X_j-tilde that is a knockoff for X_j. If the correlation is strong, we can use the same strategy as before, but you can see that X_j-tilde now ends up much closer to X_j: we have gone as far as possible from X_j, but we're still very close. So if our predictors are highly correlated, this strategy of trying to go as far as possible doesn't buy us much mileage. As I mentioned, this is a problem for any type of multivariate model: those of you who have worked with the lasso will probably be familiar with the fact that it sometimes selects one of the correlated predictors arbitrarily.


In Bayesian model selection, this results in posterior probabilities for each of the correlated predictors that are lower than they should be, because each of them can act as a substitute for the others. In genetics, fine mapping methods resolve this problem by refusing to select among highly correlated predictors and returning a set of possible variants, and we are going to use a similar approach. Instead of trying to make decisions at the single-variant level, we are going to define group hypotheses that correspond to sets of variables, and these are sets of adjacent variables, adjacent positions in the genome. Our group hypotheses are going to be of the type: a group is null if it is independent of the phenotype conditionally on all the variables in the other groups.
conditioning on all the variables in the

other group and you can extend the

exchange ability property for the single

variable to this group setting and let

me just graphically describe what this

means so we have this group of x1 x2

highly correlated variable we have to

create a knockoff copy of them and

instead of trying to keep the current

instead of creating a knock of copy of

x1 only which we know is going to force


us to put X 1 tilde over here very close

to the original exponent we're going to

create a knockoff copy for the entire

group of X 1 and X 2 and this entire

group will need to keep the correlation

with X 3 but can move together so our

knockoff copy of X 1 and X 2 can be

placed over here and you see that the

within correlation is maintained and the

correlation of this group with X 3 is

none

but we do not maintain the correlation

across the variables in this particular

group so this group no cause where the

exchange ability holds only for bad

across groups rather than within groups

So these group knockoffs, where exchangeability holds only across groups rather than within groups, allow us to move the correlated variables together and move them further away from the originals. Once we have these groups, of course, we also have to define a group feature importance, and you have many choices for that. In the following, to keep the data analysis that I'm going to show you simple, what we're going to do is sum up the absolute values of the beta coefficients estimated by the lasso across all the variables in the group.
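(Here is a minimal sketch of that group statistic, assuming the knockoff matrix has already been generated; the function and its parameters are illustrative, and in practice one would also randomize the order of original and knockoff columns so the fit cannot systematically favor one side.)

```python
import numpy as np
from sklearn.linear_model import Lasso

def group_knockoff_scores(X, X_tilde, y, groups, alpha=0.01):
    """Fit the lasso on [X, X_tilde], sum |beta| within each group for the
    originals and for the knockoffs, and return the difference as the
    group knockoff score W.  `groups` gives a group label per column of X;
    knockoff columns inherit the same labels."""
    p = X.shape[1]
    beta = np.abs(Lasso(alpha=alpha).fit(np.hstack([X, X_tilde]), y).coef_)
    labels = np.unique(groups)
    W = np.empty(len(labels))
    for i, g in enumerate(labels):
        cols = np.where(groups == g)[0]
        W[i] = beta[cols].sum() - beta[cols + p].sum()   # originals minus knockoffs
    return labels, W
```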

OK, so we are going to use this to analyze genetic data. The issue with genetic data is that the genotype variables corresponding to different locations across the genome are highly correlated locally. This is because of how we inherit our DNA from our parents: every time there is a meiosis, there is somewhat of a patchwork of the parental chromosomes to create a gamete, and this patchwork happens in such a way that portions of the genome that are close tend to be inherited together. This translates, at the level of genotypes, into correlations between the X_j's, correlations that are local. So what we're going to do is use the squared empirical correlation as a similarity measure between the SNPs, and then partition the collection of SNPs by adjacency-constrained hierarchical clustering: we do a hierarchical clustering, but for variables to be put in the same cluster they have to be adjacent to one another. Then we're going to cut the dendrogram at different heights, each leading to a different resolution of groups.
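(As an illustration of this grouping step, here is a small greedy version of adjacency-constrained clustering; it is not necessarily the exact clustering routine used in the actual analysis, and the cut level and toy correlation matrix are made up.)

```python
import numpy as np

def adjacency_constrained_clusters(R2, cut):
    """Greedy adjacency-constrained agglomerative clustering of SNPs.
    R2[i, j] is the squared correlation between SNPs i and j, with SNPs
    ordered along the chromosome.  Repeatedly merge the most similar pair
    of *adjacent* clusters (similarity = minimum R2 across the two clusters)
    until no adjacent pair exceeds `cut`.  Returns a group label per SNP."""
    p = R2.shape[0]
    clusters = [[j] for j in range(p)]            # start with one SNP per cluster
    def linkage(a, b):
        return min(R2[i, j] for i in a for j in b)
    while len(clusters) > 1:
        sims = [linkage(clusters[k], clusters[k + 1]) for k in range(len(clusters) - 1)]
        k = int(np.argmax(sims))
        if sims[k] < cut:                         # this plays the role of the dendrogram cut
            break
        clusters[k] = clusters[k] + clusters.pop(k + 1)
    labels = np.empty(p, dtype=int)
    for g, members in enumerate(clusters):
        labels[members] = g
    return labels

# toy example: six SNPs forming two tight LD blocks
R2 = np.array([[1.0, .9, .8, .1, .1, .1],
               [.9, 1.0, .85, .1, .1, .1],
               [.8, .85, 1.0, .1, .1, .1],
               [.1, .1, .1, 1.0, .9, .8],
               [.1, .1, .1, .9, 1.0, .9],
               [.1, .1, .1, .8, .9, 1.0]])
print(adjacency_constrained_clusters(R2, cut=0.5))   # -> [0 0 0 1 1 1]
```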

What you have here is actually a portion of one chromosome, and the blue and white shading marks the groups of SNPs corresponding to cutting the dendrogram at the level indicated by the broken line at the top. Here you have single SNPs, with the distance between them represented by the correlation. Depending on where we cut, we are going to have different resolutions; for each of these different resolutions we are going to construct knockoffs for the groups of variables, and then we are going to analyze the data using the lasso.

The examples I'm going to show you come from the UK Biobank, which I have already mentioned. In this dataset we are looking at about five or six hundred thousand SNPs, and we are only looking at a subset of the individuals, the unrelated British individuals, to avoid problems related to family relatedness and to different ethnic backgrounds. I have to say that the current version of these algorithms can actually deal both with family relationships and with individuals of different ethnic backgrounds, but I don't yet have results in good enough shape to share with you. Here is a sense of the different resolutions that we construct. We indicate them in terms of resolution: a hundred percent means that you have one SNP per block, so you are using the finest resolution you could get; this means that we have six hundred thousand blocks, and the average block width in megabases is zero, because each block is represented by one SNP only. Then you have decreasing levels of resolution; the coarsest resolution we use is 2 percent, which corresponds to about 12,000 blocks with an average block width of 0.2 megabases.

What you can see in this column is the average correlation between each knockoff and its original variable, and you can see exactly the phenomenon I described for you, the reason I said that instead of constructing knockoffs for single variables we are going to construct them for groups. When we look at single variables, each variable is correlated to its knockoff with an average R squared of 0.7, which is quite high; instead, when we look at the coarsest resolution, the correlation between each variable and its knockoff is lower, down to 0.2. This graph represents the same information, but instead of giving you one average value it gives you box plots; and this other graph simply shows that the correlations that we have to conserve in order to construct correct knockoffs are indeed conserved: if you look at the correlation between one variable and another, and between that variable and the knockoff of the other, they are very well conserved.

Here is a snapshot view of how this method works and the type of results it provides. This is again height, and we're looking at chromosome 12. At the very top you have the entire chromosome 12, and you have a representation with multiple rows corresponding to the different resolutions: at the bottom you have the coarsest resolution, with a box drawn in correspondence of every group of SNPs that has been selected as significant, and as you move up you have boxes corresponding to the higher resolutions, and so on. It's perhaps easier to look at this zoomed-in version of the central region; here you can count the different resolutions, which are indicated on the y-axis. You can see that we detect that there is something going on: this group of SNPs has something to tell me about height conditionally on all the other blocks across the genome.

Then, when I try to be a bit more specific and ask which among these SNPs is important, I can also reject more specific hypotheses, which say: this block of SNPs has something to tell me about height conditionally on all blocks of comparable size across the genome. However, I am not able to make an even more specific discovery that says exactly which SNPs in this block are important. In other positions, like this one, I actually can make very specific discoveries and say: this SNP here, and it is actually only one variable, can tell you something about height conditionally on every other single SNP in the genome. By contrast, this is the plot that I already showed you at the very beginning of this presentation, which is what you get from the analysis of this region with the standard univariate approach. You can see that, while there is an attempt to interpret these p-values with the different colors, the different colored blocks are not terribly interpretable, because if you actually look, they overlap; so we don't know what we're talking about, where the signal is, or how specific the signal can be.

To bring this point out more clearly, let's look at a simulation, where we know what the right answer is and we can compare how these two approaches fare. This is simulated data, and the phenotype truly depends on five different variants, indicated by the little black lines at the bottom. In the univariate analysis, when a SNP is really one of the SNPs that are important for the phenotype, instead of a black dot the SNP is represented with a green asterisk. The difference between one column and the next is the relative importance of these SNPs: in this column the heritability is small, so these SNPs explain a smaller portion of the phenotype, while in the other column the heritability is stronger, so these SNPs explain a bigger portion of the phenotype, or, if you want, the effect sizes are larger.


Now let's see what happens with the univariate models. The univariate models in both cases detect that there is something going on, because in both cases there are some SNPs that pass the significance threshold, so the geneticist would be alerted to the fact that there is something going on. Now, which of the two cases is better? When you have lower signal, you have a smaller number of SNPs that pass the threshold; this means that some of the true SNPs do not pass it, but most of those that do pass are actually important. When you have stronger signal, which is arguably what you would want, the picture is much more muddled: again you have a sense that something is going on, but you see that a larger number of gray dots pass the threshold, and in fact the importance of the true SNPs, as opposed to the non-causal ones, those that are associated just because of linkage disequilibrium, is very difficult to discern here. If you look at what happens with the knockoffs, you have a picture that is in some sense similar and in some sense profoundly different.


In both cases, at the low resolution, the methodology picks up that there is something going on; the difference between low signal and high signal is that in the low-signal case the method cannot resolve very much among the five SNPs, in fact it cannot resolve among them at all, and we just know, not very specifically, that in this portion of the genome there is something going on. When we increase the signal, the method can resolve the importance of the different SNPs up to a very high resolution; in fact, here you can distinguish every single one of the five SNPs independently. We have done a number of simulations, and since we are running out of time I don't want to bore you with them; you can look at the published paper to get a sense of them. The simulations were done trying to be realistic, so we have used real genotypes and we have simulated phenotypes under a variety of scenarios, and we have compared our methodology with others; this is a list of the methodologies that we have used.


What we see is that the methods we have proposed always control the false discovery proportion and have power that is comparable to, or better than, the other methods; but what we really shine at is that we can control the false discovery rate, which is not a trivial task, and we can localize precisely and informatively where the signal is. Before finishing, I want to show you some of the results on real data. Again we are in the UK Biobank, but here, instead of simulating phenotypes, we analyze true phenotypes, and I am presenting the results at low resolution, compared to the results of a univariate method that is state of the art and whose results have been post-processed so that they can be interpreted as independent loci in the genome, so that they are equivalent to these low-resolution blocks. What you have here is the number of findings, and you can see that our approach has a much larger number of findings than this competing one.
larger number of findings than this um

completing one a couple of slides on

computational resources clearly doing

this it's not something that happens

immediately but in truth the largest


amount of time it's spent 15 these hmm

models for our data and these agent and

models are actually fit on a regular

basis for genetics data for a variety of

other reasons so you can imagine that

this is cost that the analyst would bear

in any case geneticists treat these hmm

models to impute snips regularly to

phrase their data so we don't really

have to add a very serious computational

burden on top of what is already done so

one way is looking at that these

contexts this is fitting the hmm it's

where most of the time goes then you

generate knock-offs and then you have to

do the analysis now I said this is a

time that the geneticists would spend

anyway generate knockoff is what we mean

to do and this is some serious time

contribution but not unbearable you also

have to keep in mind that for every new

phenotype that you might want to analyze

on the same cohort you can reuse the

same knock offs so this final mapping

step can be really quite fast

Matteo Sesia has a very nice GitHub site where he keeps the code and some simulation analyses, with some markdown documents so that you can actually try out the method. Let me thank you for your attention, and thank you for putting up this conference in these strange times of pandemic. In the United States, where I find myself, on top of the pandemic we are also living through very sad moments because of racism, so let me conclude with an invitation for everybody to do what we can to combat racism. Thank you very much.
